Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat


Agenda: Common Processor Performance Metrics; Identifying and Analyzing Bottlenecks; Benchmarking and Workload Selection; Performance Counters; Gem5 Tutorial; Security Evaluation (next week).

Iron Law of Performance: Execution Time (ET) = (Instruction Count (IC) x Cycles Per Instruction (CPI)) / Clock Frequency. How to optimize -- Instruction Count? Clock Frequency? CPI?
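
As a quick worked example (not from the slides), with hypothetical numbers for a 3 GHz out-of-order core retiring 2 billion dynamic instructions at a CPI of 0.8:

# Hypothetical numbers, just to make the Iron Law concrete.
instruction_count = 2_000_000_000    # IC: dynamic instructions executed
cpi = 0.8                            # cycles per instruction
clock_hz = 3_000_000_000             # 3 GHz clock
execution_time = instruction_count * cpi / clock_hz   # ET = IC * CPI / f
print(f"ET = {execution_time:.3f} s")                 # ~0.533 s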

Performance Comparison. Speedup = Performance after Optimization / Performance before Optimization. Slowdown = Performance with Mitigation / Performance without Mitigation. What metric should we use for performance in the above equations?
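
For example, taking performance as 1/ET with hypothetical execution times (these numbers are illustrative only):

# Hypothetical execution times in seconds; performance is taken as 1/ET.
et_baseline, et_optimized, et_mitigated = 10.0, 8.0, 12.5
speedup = (1 / et_optimized) / (1 / et_baseline)    # = et_baseline / et_optimized
slowdown = (1 / et_mitigated) / (1 / et_baseline)   # performance with / without mitigation
print(f"speedup = {speedup:.2f}x, slowdown = {slowdown:.2f}x")   # 1.25x and 0.80x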

Identifying Bottlenecks CPI Stacks. Base CPI = CPI of the pipeline in the absence of any stalls (1 for the classic MIPS pipeline, much lower for typical OoO superscalars). Overall CPI = Base CPI + <stalls due to branch mispredictions> + <stalls due to cache misses> + <stalls due to resource contention> + ... Credit: Eyerman et al., "A Performance Counter Architecture for Computing Accurate CPI Components".

Identifying Bottlenecks More Metrics. Branch Prediction Unit: misprediction rate; misprediction penalty; cycles lost due to mispredictions (bar on the CPI stack). Caches and other cache-like structures (e.g., BTB): hit latency; hit rate; miss penalty; raw number of misses (3 Cs: compulsory, conflict, capacity); Average Access Time = Hit Latency + (Miss Rate x Miss Penalty); cycles lost due to cache misses (bar on the CPI stack).
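
A small worked example (hypothetical cache numbers, not from the slides) tying the average access time formula to the cache-miss bar of a CPI stack, using a simplified misses-per-instruction model:

# Hypothetical L1 data cache parameters.
hit_latency = 4        # cycles
miss_rate = 0.05       # misses per access
miss_penalty = 100     # cycles to service a miss
amat = hit_latency + miss_rate * miss_penalty          # average access time = 9 cycles

mem_refs_per_instr = 0.3   # loads/stores per instruction (assumed)
base_cpi = 0.5
cache_miss_cpi = mem_refs_per_instr * miss_rate * miss_penalty   # 1.5 CPI lost to misses
overall_cpi = base_cpi + cache_miss_cpi    # 2.0, ignoring branch/contention bars for brevity
print(amat, overall_cpi)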

Identifying Bottlenecks More Metrics. Resource Contention: queue-full scenarios (IQ, ROB, LSQ, reservation stations, etc.); functional unit contention (FU-busy events). Instruction Level Parallelism: fetch rate; issue rate; queue drain rates; commit rate?

Multiprogrammed Mixed Workloads. An SMT or a CMP processor typically co-executes multiple programs at a time, with each program running on a separate logical core. Co-executing programs affect each other's performance, some synergistically and some by contending for resources. Normalized Progress: NP(i) = ET_i (single-program mode) / ET_i (multi-program mode). System Throughput = Σ_{i=0..n} NP(i). Weighted Speedup = Σ_{i=0..n} IPC_i^MP / IPC_i^SP.
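
A minimal sketch of these metrics for a hypothetical two-program SMT mix (program names and IPC values are made up for illustration):

# Per-program IPCs: alone (single-program mode) vs. co-scheduled (multi-program mode).
ipc_sp = {"mcf": 0.9, "gcc": 1.6}
ipc_mp = {"mcf": 0.6, "gcc": 1.2}

# For a fixed instruction count, ET_SP / ET_MP equals IPC_MP / IPC_SP.
np_i = {p: ipc_mp[p] / ipc_sp[p] for p in ipc_sp}     # normalized progress per program
system_throughput = sum(np_i.values())                 # sum of normalized progress
weighted_speedup = sum(ipc_mp[p] / ipc_sp[p] for p in ipc_sp)   # same sum here; papers differ on normalizing by n
print(np_i, system_throughput, weighted_speedup)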

Benchmark Suites. SPEC CPU 2006: still widely used to measure the effects of processor/compiler optimizations; integer and FP benchmarks; written in C, C++, and Fortran; compression, codecs, compilers/interpreters, games, scientific applications, routing algorithms, event simulation, etc. SPEC CPU 2017: includes many new (AI) benchmarks. Other benchmarks of interest: PARSEC, MiBench (includes crypto), CloudSuite, NAS Parallel, Dhrystone, etc.

Common SPEC 2006 command lines:
bzip2 "input.combined 200"
gcc "scilab.i -o scilab.s"
gobmk "--quiet --mode gtp" -i "13x13.tst"
h264ref "-d foreman_ref_encoder_baseline.cfg"
hmmer "nph3.hmm swiss41"
lbm "3000 reference.dat 0 0 100_100_130_ldc.of"
libquantum "1397 8"
mcf "inp.in"
milc < su3imp.in
perlbench "-I./lib diffmail.pl 4 800 10 17 19 300"
sjeng "ref.txt"
sphinx3 "ctlfile. args.an4"

Steady-State CPI. Most applications have an initialization phase (including libc startup), after which actual program execution begins. After the pipeline and other structures (e.g., caches) are sufficiently warmed up, execution is said to have reached steady state, at which point the CPI can be accurately measured. To record the steady-state CPI, execution is typically fast-forwarded before actual measurements are taken. Fast-forward intervals vary across programs.

Simpoints: "Automatically Characterizing Large Scale Program Behavior", a seminal paper by Tim Sherwood (ASPLOS Influential Paper Award; Maurice Wilkes Award). Programs are typically made up of phases. Phase changes are marked by a rise or drop in IPC. Similar phases/program intervals typically fall into the same IPC cluster. The overall IPC of a program can often be computed as a weighted sum of the IPCs of specific representative phases.
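
A minimal sketch of that weighted-IPC calculation, using hypothetical phase IPCs and cluster weights (weights sum to 1):

# Hypothetical simpoint data: each representative phase has an IPC and a weight
# equal to the fraction of dynamic execution its cluster covers.
simpoints = [
    {"ipc": 1.8, "weight": 0.45},   # compute-heavy phase
    {"ipc": 0.7, "weight": 0.35},   # memory-bound phase
    {"ipc": 1.2, "weight": 0.20},   # mixed phase
]
overall_ipc = sum(p["ipc"] * p["weight"] for p in simpoints)
print(f"estimated whole-program IPC = {overall_ipc:.2f}")   # ~1.3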

Performance Counters. Only a few of these counters can be enabled at any given time. They offer high-level insight into performance bottlenecks. The LBR (not shown here) holds a record of the most recent branches.

Linux Perf Events. Found in the linux-tools-common package. Profiles programs by leveraging hardware performance counters. Common subcommands: top: identify hot functions; stat: count specific events (e.g., number of loads issued to the L1, number of misses); record: profile a program by sampling certain performance counters at the configured frequency; report: show a report of the sampled performance counters.
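
As a small sketch (not part of the slides), perf stat can also be driven from a script. The CSV layout produced by "perf stat -x," and the event names are assumed to match a recent perf version, and ./a.out is a placeholder for the program under test:

# Minimal sketch: run perf stat in CSV mode and compute IPC from the counts.
import subprocess

cmd = ["perf", "stat", "-x,", "-e", "cycles,instructions,cache-misses", "--", "./a.out"]
result = subprocess.run(cmd, capture_output=True, text=True)

counts = {}
for line in result.stderr.splitlines():      # perf stat prints counters to stderr
    fields = line.split(",")
    # Assumed CSV layout: field 0 = value, field 2 = event name.
    if len(fields) >= 3 and fields[0].strip().isdigit():
        counts[fields[2].strip()] = int(fields[0])

if "cycles" in counts and "instructions" in counts:
    print("IPC =", counts["instructions"] / counts["cycles"])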

Gem5 Architectural Simulator. Event-driven microarchitectural simulation. Useful for rapid prototyping. Provides multiple knobs for exploring large processor architecture design spaces. Deeper insights than regular performance counters; room for more sophisticated stats. Written in C++ and Python. May not give cycle-accurate results if you're making major modifications to the microarchitecture.

Gem5 modes. Full System (FS): simulates a full-fledged Linux kernel running a suite of applications; very slow. Syscall Emulation (SE): full detailed simulation of application code; traps and emulates system calls; faster than FS, but much slower than native execution; useful when user-mode execution dominates kernel-mode execution.

Gem5 CPU Models. Atomic CPU: functional simulation (no/limited timing information); useful for verifying correctness and/or profiling (e.g., generating simpoints). Minor CPU: in-order pipeline; stages: Fetch, Decode, Execute, LSQ (Mem), Commit (implicit). O3 CPU: out-of-order pipeline; stages: Fetch, Decode, Rename, IEW (Issue, Execute, Writeback), LSQ, Commit.

What can you configure with gem5? Design parameters and example design choices:
Execution Semantics: In-order, Out-of-order
Issue Width: 1, 2, 4
Branch Predictor: Local, Tournament, Gshare, LTAGE
Reorder Buffer Size: 64, 128 entries
Physical Register File (integer): 96, 160
Physical Register File (FP/SIMD): 64, 96
Integer ALUs: 1, 3, 6
Integer Multiply/Divide Units: 1, 2
Floating-point ALUs: 1, 2, 4
FP Multiply/Divide Units: 1, 2
SIMD Units: 1, 2, 4
Load/Store Queue: 16, 32 entries
Instruction Cache: 32KB 4-way, 64KB 4-way
Private Data Cache: 32KB 4-way, 64KB 8-way
Shared Last-Level (L2) Cache: 4-banked 4MB 4-way, 4-banked 8MB 8-way
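
A minimal sketch of what setting a few of these parameters looks like in a gem5 SE-mode configuration script. Parameter names follow gem5's O3CPU model (DerivO3CPU) and may differ across gem5 versions; the script only runs under gem5's Python interpreter, and caches, memory, and the workload are omitted:

# Sketch of a gem5 config fragment (assumed parameter names from O3CPU.py).
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="2GHz", voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

cpu = DerivO3CPU()
cpu.fetchWidth = 4            # front-end width
cpu.issueWidth = 4            # issue width design point
cpu.numROBEntries = 128       # reorder buffer size
cpu.numPhysIntRegs = 160      # integer physical register file
cpu.numPhysFloatRegs = 96     # FP/SIMD physical register file
cpu.LQEntries = 32            # load queue entries
cpu.SQEntries = 32            # store queue entries
system.cpu = cpu
# ... caches, memory controller, workload, m5.instantiate(), and m5.simulate() omitted.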

Demo: Running gem5; Checkpoint/Restore (to simulate only a region of interest / fast-forward execution); Debugging (--debug-flags=<>, --debug-help); Source Files of Interest.

Visualizing Spectre with gem5 http://www.lowepower.com/jason/visualizing-spectre-with-gem5.html

Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat