Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat


Agenda: Common Processor Performance Metrics; Identifying and Analyzing Bottlenecks; Benchmarking and Workload Selection; Performance Counters; Gem5 Tutorial; Security Evaluation (next week).

Iron Law of Performance: Execution Time (ET) = (Instruction Count (IC) x Cycles Per Instruction (CPI)) / Clock Frequency. How to optimize -- Instruction Count? Clock Frequency? CPI?
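
As a quick worked example (not from the slides), with hypothetical numbers for a 3 GHz out-of-order core retiring 2 billion dynamic instructions at a CPI of 0.8:

# Hypothetical numbers, just to make the Iron Law concrete.
instruction_count = 2_000_000_000    # IC: dynamic instructions executed
cpi = 0.8                            # cycles per instruction
clock_hz = 3_000_000_000             # 3 GHz clock
execution_time = instruction_count * cpi / clock_hz   # ET = IC * CPI / f
print(f"ET = {execution_time:.3f} s")                 # ~0.533 s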

Performance Comparison. Speedup = Performance after Optimization / Performance before Optimization. Slowdown = Performance with Mitigation / Performance without Mitigation. What metric should we use for performance in the above equations?
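
For example, taking performance as 1/ET with hypothetical execution times (these numbers are illustrative only):

# Hypothetical execution times in seconds; performance is taken as 1/ET.
et_baseline, et_optimized, et_mitigated = 10.0, 8.0, 12.5
speedup = (1 / et_optimized) / (1 / et_baseline)    # = et_baseline / et_optimized
slowdown = (1 / et_mitigated) / (1 / et_baseline)   # performance with / without mitigation
print(f"speedup = {speedup:.2f}x, slowdown = {slowdown:.2f}x")   # 1.25x and 0.80x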

Identifying Bottlenecks CPI Stacks. Base CPI = CPI of the pipeline in the absence of any stalls (1 for the classic MIPS pipeline, much lower for typical OoO superscalars). Overall CPI = Base CPI + <stalls due to branch mispredictions> + <stalls due to cache misses> + <stalls due to resource contention> + ... Credit: Eyerman et al., "A Performance Counter Architecture for Computing Accurate CPI Components".

Identifying Bottlenecks More Metrics. Branch Prediction Unit: misprediction rate; misprediction penalty; cycles lost due to mispredictions (bar on the CPI stack). Caches and other cache-like structures (e.g., BTB): hit latency; hit rate; miss penalty; raw number of misses (3 Cs: compulsory, conflict, capacity); Average Access Time = Hit Latency + (Miss Rate x Miss Penalty); cycles lost due to cache misses (bar on the CPI stack).
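
A small worked example (hypothetical cache numbers, not from the slides) tying the average access time formula to the cache-miss bar of a CPI stack, using a simplified misses-per-instruction model:

# Hypothetical L1 data cache parameters.
hit_latency = 4        # cycles
miss_rate = 0.05       # misses per access
miss_penalty = 100     # cycles to service a miss
amat = hit_latency + miss_rate * miss_penalty          # average access time = 9 cycles

mem_refs_per_instr = 0.3   # loads/stores per instruction (assumed)
base_cpi = 0.5
cache_miss_cpi = mem_refs_per_instr * miss_rate * miss_penalty   # 1.5 CPI lost to misses
overall_cpi = base_cpi + cache_miss_cpi    # 2.0, ignoring branch/contention bars for brevity
print(amat, overall_cpi)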

Identifying Bottlenecks More Metrics. Resource Contention: queue-full scenarios (IQ, ROB, LSQ, reservation stations, etc.); functional unit contention (FU-busy events). Instruction Level Parallelism: fetch rate; issue rate; queue drain rates; commit rate?

Multiprogrammed Mixed Workloads. An SMT or a CMP processor typically co-executes multiple programs at a time, with each program running on a separate logical core. Co-executing programs affect each other's performance, some synergistically and some by contending for resources. Normalized Progress: NP(i) = ET_i (single-program mode) / ET_i (multi-program mode). System Throughput = Σ_{i=0..n} NP(i). Weighted Speedup = Σ_{i=0..n} IPC_i^MP / IPC_i^SP.
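
A minimal sketch of these metrics for a hypothetical two-program SMT mix (program names and IPC values are made up for illustration):

# Per-program IPCs: alone (single-program mode) vs. co-scheduled (multi-program mode).
ipc_sp = {"mcf": 0.9, "gcc": 1.6}
ipc_mp = {"mcf": 0.6, "gcc": 1.2}

# For a fixed instruction count, ET_SP / ET_MP equals IPC_MP / IPC_SP.
np_i = {p: ipc_mp[p] / ipc_sp[p] for p in ipc_sp}     # normalized progress per program
system_throughput = sum(np_i.values())                 # sum of normalized progress
weighted_speedup = sum(ipc_mp[p] / ipc_sp[p] for p in ipc_sp)   # same sum here; papers differ on normalizing by n
print(np_i, system_throughput, weighted_speedup)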

Benchmark Suites. SPEC CPU 2006: still widely used to measure the effects of processor/compiler optimizations; integer and FP benchmarks; written in C, C++, and Fortran; compression, codecs, compilers/interpreters, games, scientific applications, routing algorithms, event simulation, etc. SPEC CPU 2017: includes many new (AI) benchmarks. Other benchmarks of interest: PARSEC, MiBench (includes crypto), CloudSuite, NAS Parallel, Dhrystone, etc.

Common SPEC 2006 command lines:
bzip2 "input.combined 200"
gcc "scilab.i -o scilab.s"
gobmk "--quiet --mode gtp" -i "13x13.tst"
h264ref "-d foreman_ref_encoder_baseline.cfg"
hmmer "nph3.hmm swiss41"
lbm "3000 reference.dat 0 0 100_100_130_ldc.of"
libquantum "1397 8"
mcf "inp.in"
milc < su3imp.in
perlbench "-I./lib diffmail.pl 4 800 10 17 19 300"
sjeng "ref.txt"
sphinx3 "ctlfile. args.an4"

Steady-State CPI. Most applications have an initialization phase (including libc startup), after which actual program execution begins. After the pipeline and other structures (e.g., caches) are sufficiently warmed up, execution is said to have reached steady state, at which point the CPI can be accurately measured. To record the steady-state CPI, execution is typically fast-forwarded before actual measurements are taken. Fast-forward intervals vary across programs.

Simpoints: "Automatically Characterizing Large Scale Program Behavior", a seminal paper by Tim Sherwood (ASPLOS Influential Paper Award; Maurice Wilkes Award). Programs are typically made up of phases. Phase changes are marked by a rise or drop in IPC. Similar phases/program intervals typically fall into the same IPC cluster. The overall IPC of a program can often be computed as a weighted sum of the IPCs of specific representative phases.
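
A minimal sketch of that weighted-IPC calculation, using hypothetical phase IPCs and cluster weights (weights sum to 1):

# Hypothetical simpoint data: each representative phase has an IPC and a weight
# equal to the fraction of dynamic execution its cluster covers.
simpoints = [
    {"ipc": 1.8, "weight": 0.45},   # compute-heavy phase
    {"ipc": 0.7, "weight": 0.35},   # memory-bound phase
    {"ipc": 1.2, "weight": 0.20},   # mixed phase
]
overall_ipc = sum(p["ipc"] * p["weight"] for p in simpoints)
print(f"estimated whole-program IPC = {overall_ipc:.2f}")   # ~1.3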

Performance Counters. Only a few of these counters can be enabled at any given time. They offer high-level insight into performance bottlenecks. The LBR (not shown here) holds a record of the most recent branches.

Linux Perf Events. Found in the linux-tools-common package. Profiles programs by leveraging hardware performance counters. Common subcommands: top: identify hot functions; stat: count specific events (e.g., number of loads issued to the L1, number of misses); record: profile a program by sampling certain performance counters at the configured frequency; report: show a report of the sampled performance counters.
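
As a small sketch (not part of the slides), perf stat can also be driven from a script. The CSV layout produced by "perf stat -x," and the event names are assumed to match a recent perf version, and ./a.out is a placeholder for the program under test:

# Minimal sketch: run perf stat in CSV mode and compute IPC from the counts.
import subprocess

cmd = ["perf", "stat", "-x,", "-e", "cycles,instructions,cache-misses", "--", "./a.out"]
result = subprocess.run(cmd, capture_output=True, text=True)

counts = {}
for line in result.stderr.splitlines():      # perf stat prints counters to stderr
    fields = line.split(",")
    # Assumed CSV layout: field 0 = value, field 2 = event name.
    if len(fields) >= 3 and fields[0].strip().isdigit():
        counts[fields[2].strip()] = int(fields[0])

if "cycles" in counts and "instructions" in counts:
    print("IPC =", counts["instructions"] / counts["cycles"])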

Gem5 Architectural Simulator. Event-driven microarchitectural simulation. Useful for rapid prototyping. Provides multiple knobs for exploring large processor architecture design spaces. Deeper insights than regular performance counters; room for more sophisticated stats. Written in C++ and Python. May not give cycle-accurate results if you're making major modifications to the microarchitecture.

Gem5 modes. Full System (FS): simulates a full-fledged Linux kernel running a suite of applications; very slow. Syscall Emulation (SE): full detailed simulation of application code; traps and emulates system calls; faster than FS, but much slower than native execution; useful when user-mode execution dominates kernel-mode execution.

Gem5 CPU Models. Atomic CPU: functional simulation (no/limited timing information); useful for verifying correctness and/or profiling (e.g., generating simpoints). Minor CPU: in-order pipeline; stages: Fetch, Decode, Execute, LSQ (Mem), Commit (implicit). O3 CPU: out-of-order pipeline; stages: Fetch, Decode, Rename, IEW (Issue, Execute, Writeback), LSQ, Commit.

What can you configure with gem5? Design parameters and example design choices:
Execution Semantics: In-order, Out-of-order
Issue Width: 1, 2, 4
Branch Predictor: Local, Tournament, Gshare, LTAGE
Reorder Buffer Size: 64, 128 entries
Physical Register File (integer): 96, 160
Physical Register File (FP/SIMD): 64, 96
Integer ALUs: 1, 3, 6
Integer Multiply/Divide Units: 1, 2
Floating-point ALUs: 1, 2, 4
FP Multiply/Divide Units: 1, 2
SIMD Units: 1, 2, 4
Load/Store Queue: 16, 32 entries
Instruction Cache: 32KB 4-way, 64KB 4-way
Private Data Cache: 32KB 4-way, 64KB 8-way
Shared Last-Level (L2) Cache: 4-banked 4MB 4-way, 4-banked 8MB 8-way
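
A minimal sketch of what setting a few of these parameters looks like in a gem5 SE-mode configuration script. Parameter names follow gem5's O3CPU model (DerivO3CPU) and may differ across gem5 versions; the script only runs under gem5's Python interpreter, and caches, memory, and the workload are omitted:

# Sketch of a gem5 config fragment (assumed parameter names from O3CPU.py).
import m5
from m5.objects import *

system = System()
system.clk_domain = SrcClockDomain(clock="2GHz", voltage_domain=VoltageDomain())
system.mem_mode = "timing"
system.mem_ranges = [AddrRange("512MB")]

cpu = DerivO3CPU()
cpu.fetchWidth = 4            # front-end width
cpu.issueWidth = 4            # issue width design point
cpu.numROBEntries = 128       # reorder buffer size
cpu.numPhysIntRegs = 160      # integer physical register file
cpu.numPhysFloatRegs = 96     # FP/SIMD physical register file
cpu.LQEntries = 32            # load queue entries
cpu.SQEntries = 32            # store queue entries
system.cpu = cpu
# ... caches, memory controller, workload, m5.instantiate(), and m5.simulate() omitted.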

Demo: Running gem5; Checkpoint/Restore (to simulate only a region of interest / fast-forward execution); Debugging (--debug-flags=<>, --debug-help); Source Files of Interest.

Visualizing Spectre with gem5 http://www.lowepower.com/jason/visualizing-spectre-with-gem5.html

Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat