Making Performance Understandable: Towards a Standard for Performance Counters on Manycore Architectures


Making Performance Understandable: Towards a Standard for Performance Counters on Manycore Architectures
Sarah Bird, Andrew Waterman, Kevin Klues, Sam Williams, Kaushik Datta, Rajesh Nishtala, Krste Asanovic, James Demmel, Dave Patterson
[Figure labels: Parallel Applications, Parallel Software, Parallel Hardware, IT industry (Silicon Valley), Users]

Par Lab Research Overview
Easy to write correct programs that run efficiently on manycore
[Figure: Par Lab research stack. Labels: Personal Health, Image Retrieval, Hearing, Music, Speech, Parallel Browser; Design Patterns/Motifs; Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Static Verification, Type Systems; Efficiency Languages, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers; OS Libraries & Services, Legacy OS, Hypervisor; Multicore/GPGPU, RAMP Manycore; Correctness: Directed Testing, Dynamic Checking, Debugging with Replay; Diagnosing Power/Performance]


Outline
  - Motivation
  - Current State of Performance Counters
  - Proposed Solution
      - Activity Counters
      - Framework
  - Motivating Applications
  - RAMP and Future Work

Parallel Programming
  - Parallel Programming is Challenging
      - Efficiency Programmers have struggled for years
      - Now we expect Productivity Programmers to write parallel code?
          - Correct
          - Power Efficient
          - Reasonable Performance
          - Quickly Written
  - What is the solution?
      - There isn't a quick fix, and programmers need help!

Programmer Tools
  - More insight into application behavior
      - Better Debugging Tools
      - Better Performance Analysis Tools
      - Tools which help make scheduling and resource decisions
  - Must be portable
  - Must work in real time
  - Claim: accurate, useful performance counters are more important to the business success of the multicore bet than clock rate, cache size, transactional memory support, ...


5 Problems with Current Systems
  - State of performance counters today is lousy
      - Intended for use by chip designers, not users
      - Low priority since they are intended for internal use
      - Opportunistic bottom-up measurements
  1. Inaccurate
  2. Non-functional
  3. Incomplete
  4. Overly Complex
  5. Inconsistent

Current Uses
  - Only use the simple counters
      - Rely on simple performance models like CPI
  - Get a graduate student to analyze the application, architecture, and perf. data
      - Doesn't scale
  - Use machine learning on all the counters
      - Complex, and unclear if it is useful, particularly if some of the counters are non-functional or inaccurate
  - Software solutions like PAPI
      - Can't overcome inconsistent, inaccurate, or incomplete counters


Proposal
  - Create a standard for counters on all future architectures
      - Puts pressure on chip designers to make the counters functional, accurate, and available
  - Proactive
      - Avoid the problems of PAPI
  - Allows the creation of portable software
      - Autotuners & other Performance Analysis tools
      - Dynamically adjusting applications (Music)
      - Operating System Schedulers

Our Approach
  - Measure Computation, Communication, and Energy for all components
[Figure: manycore block diagram. CPUs with private L1 caches connect through an L2 Interconnect to L2 Banks, which connect through a DRAM & I/O Interconnect to DRAM and I/O. Legend: Computation, Communication, Energy]

Computation
  - Efficient execution of each core is still important
      - Power/Energy
      - Overall system performance
  - Measure instructions retired
      - Used by Applications, Scheduler, Productivity Programmers
  - Measure instruction mix
      - Floating Point Ops, Loads, Stores, etc.
      - Used by Efficiency Programmers, Program Analysis tools

Communication
  - Network behavior can have a big impact on manycore performance
      - Access to DRAM and I/O
      - Communication between cores
  - Measure Traffic on Each Edge
      - Used by Applications, Scheduler, Productivity Programmers
  - Break traffic into types
      - Prefetch, Compulsory, Coherency, etc.
      - Used by Efficiency Programmers, Program Analysis tools

Energy Counters
  - Energy information can affect some non-obvious tradeoffs for applications
      - Client/Server Split
  - Counters to measure energy of all components
      - Everything in units of energy
      - Affects battery life
      - Works with DVFS
  - Shared Resources must attribute energy to apps
      - DRAM provides a power model
      - Memory controller uses the model and accesses
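The attribution of shared-resource energy to applications could look roughly like the sketch below. It is only an illustration of the idea on the slide: the per-operation energy values, the function name `attribute_dram_energy`, and the rule for splitting background energy are all assumptions, not part of the talk.

```python
# Hypothetical per-operation energy costs in nanojoules. In the scheme on the
# slide, these would come from the power model that the DRAM provides.
ENERGY_NJ = {"read": 15.0, "write": 17.0, "activate": 10.0}


def attribute_dram_energy(access_counts, background_nj=0.0):
    """Split DRAM energy among applications.

    access_counts maps app name -> {operation: count}, as a memory
    controller might tally per application. Dynamic energy is charged per
    access; background energy (refresh, standby) is split in proportion to
    each app's dynamic energy, which is one possible policy among many.
    """
    dynamic = {
        app: sum(ENERGY_NJ[op] * n for op, n in ops.items())
        for app, ops in access_counts.items()
    }
    total_dynamic = sum(dynamic.values())
    result = {}
    for app, nj in dynamic.items():
        share = nj / total_dynamic if total_dynamic else 1.0 / len(dynamic)
        result[app] = nj + share * background_nj
    return result


counts = {
    "music": {"read": 1000, "write": 200, "activate": 100},
    "browser": {"read": 3000, "write": 600, "activate": 300},
}
energy = attribute_dram_energy(counts, background_nj=8000.0)
print(energy)
```

Charging background energy in proportion to dynamic energy is a design choice made for the sketch; an equal split or a split by occupied capacity would be just as easy to express.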

Resources
  - This seems like a lot of hardware
  - Counters designed to be Fixed Function
  - As a result they can easily be made to be
      - Simple
      - Small
      - Low Power
  - It's worth it if we significantly improve performance

Counter Characteristics
  - Fixed Function
  - Small
  - Low Power
  - Wide (64 bits)
  - Accurate
  - Always On
  - Accessed in a Reasonable time
      - Latches to quickly record values
      - Buffers & DMA to save values to memory
[Figure: each event source (CPU, L2 Cache) feeds Events into Counters, which are read through OS Read Latches and App Read Latches]
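One way to see why wide counters matter for an always-on design is to work out how long a counter lasts before it wraps. The event rate below (one event per nanosecond) is an assumed figure chosen for illustration, not a number from the talk.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def seconds_until_wrap(bits, events_per_second):
    """Time until a free-running counter of the given width overflows."""
    return (2 ** bits) / events_per_second


rate = 1e9  # assumed: one counted event per nanosecond

wrap32_seconds = seconds_until_wrap(32, rate)
wrap64_years = seconds_until_wrap(64, rate) / SECONDS_PER_YEAR

print(f"32-bit counter wraps after about {wrap32_seconds:.2f} s")
print(f"64-bit counter wraps after about {wrap64_years:.0f} years")
```

At this assumed rate a 32-bit counter has to be sampled every few seconds to avoid losing counts, while a 64-bit counter can simply be left running.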

Counter Framework
  - Atomically Snapshot Set of Counters
  - Use a 100 MHz Global Realtime Clock (GRTC)
      - Helps solve DVFS (a time base that does not change with core frequency)
      - Triggers to the OS and User Level Latches
[Figure: the Global Realtime Clock sends Snapshot Signals to the Counters of each CPU and of the L2 Cache; each counter block has OS Read Latches and App Read Latches]
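A minimal sketch of how software might use two atomic snapshots together with the GRTC is shown below. The `Snapshot` type, the counter names, and `rates_per_second` are hypothetical; only the 100 MHz clock rate comes from the slide. Because elapsed time is taken from the GRTC rather than from core cycles, the computed rates stay comparable when DVFS changes the core frequency.

```python
from dataclasses import dataclass

GRTC_HZ = 100_000_000  # 100 MHz Global Realtime Clock, as on the slide


@dataclass(frozen=True)
class Snapshot:
    """Counter values latched together with the GRTC value (hypothetical)."""
    grtc_ticks: int
    counters: dict  # counter name -> latched value


def rates_per_second(before, after):
    """Convert two snapshots into per-second event rates."""
    elapsed_s = (after.grtc_ticks - before.grtc_ticks) / GRTC_HZ
    if elapsed_s <= 0:
        raise ValueError("snapshots must be taken in increasing GRTC order")
    return {
        name: (after.counters[name] - before.counters[name]) / elapsed_s
        for name in after.counters
    }


s0 = Snapshot(1_000_000, {"instructions_retired": 5_000_000, "l2_bytes": 64_000})
s1 = Snapshot(2_000_000, {"instructions_retired": 25_000_000, "l2_bytes": 704_000})
rates = rates_per_second(s0, s1)
print(rates)
```

The snapshots here are 1,000,000 GRTC ticks apart, which is 10 ms of real time regardless of what frequency the cores ran at in between.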

Counter Framework
  - Composable
  - Different levels of the system are interesting
      - Application -> Cores, Partitions
      - OS -> Partitions, System
[Figure: counted components (Core, L1I$, L1D$, L2$ Slice, Intercon., DRAM, I/O) next to a partition map: Large Compute-Bound Application, Real-Time Application, Persistent Storage & File System, Network Service, Identity, Firewall Virus Intrusion, Monitor And Adapt, Video & Window Drivers, HCI/Voice Rec]


Performance Counter Uses
  - Applications
      - Dynamic Execution Adjustment
  - Debugging Tools
  - Performance Analysis Tools
      - Autotuners
  - Operating System
      - Scheduler/Resource Allocation

The Music Application
  - Strong realtime requirements
      - Realtime clock
      - Packet timestamping and logging
  - Lots of I/O
      - Need good bandwidth
      - Traffic measurements
  - Novel I/O Devices
      - Ethernet AVB
      - Interface to get counters from devices
  - Extremely latency sensitive
      - I/O logging with timestamping using the global clock

Autotuning
  - Autotuning with Machine Learning
      - Standard counters to create a portable system
      - Compare the performance of an application on different architectures
  - Roofline Model to represent the performance of an application on an architecture
      - Autogenerate model using performance counters
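As a rough illustration of building a Roofline estimate from counters: arithmetic intensity is the ratio of floating-point operations to DRAM bytes, and the attainable rate is the smaller of peak compute and intensity times peak bandwidth. The counter values and machine peaks below are invented for the example, and the function names are hypothetical.

```python
def arithmetic_intensity(flop_count, dram_bytes):
    """Flops per byte of DRAM traffic, both taken from counters."""
    return flop_count / dram_bytes


def roofline_bound(intensity, peak_gflops, peak_gbytes_per_s):
    """Attainable GFLOP/s under the Roofline model."""
    return min(peak_gflops, intensity * peak_gbytes_per_s)


# Assumed counter readings for one kernel run (illustrative values only).
ai = arithmetic_intensity(flop_count=2_000_000_000, dram_bytes=8_000_000_000)

# Assumed machine peaks (illustrative values only).
bound = roofline_bound(ai, peak_gflops=75.0, peak_gbytes_per_s=20.0)

print(f"arithmetic intensity = {ai} flops/byte, bound = {bound} GFLOP/s")
```

With these numbers the kernel sits on the bandwidth-limited slope of the roofline, so reducing DRAM traffic would help more than adding compute.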

Space-Time Scheduling
  - Portable
      - Standard for all architectures
  - Track ALL resource usage and compute performance-bandwidth-energy curves on the fly
      - Computation, Communication and Energy
  - Profile resource usage in different application phases
      - Atomically snapshot a whole application at once
  - Energy constraints on applications
      - Energy Counter and Energy models for shared resources
[Figure: space-time partitioning diagram with axes Space and Time; partitions: Large Compute-Bound Application, Real-Time Application, Persistent Storage & File System, Network Service, Identity, Firewall Virus Intrusion, Monitor And Adapt, Video & Window Drivers, HCI/Voice Rec]


RAMP Gold and Performance Counters
  - Research Accelerator for Multiple Processors
      - Manycore emulation on FPGAs
  - Use RAMP to implement performance counters
      - Study application behavior using our activity counters
      - Do complete tracing of dependency behavior to extract useful information
      - Write new applications/tools that use the counters
      - Experiment with new counters
[Figure label: BEE3]

Future Work: Diagnostic Tools
  - Goals
      - Recreate Dependency Graphs and Calculate Slack
      - Logging of I/O latency for novel I/O devices
  - Implementation
      - Add simple non-intrusive hardware to record information
      - Software can recreate program information from logged data
      - Lots of compression can be done

Future Work: Starting Points
  - Dependency Graphs
      - Use something similar to the Shotgun Approach
      - Keep track of the last writer on each cache line
      - Allows communication arcs to be recorded on reads
  - I/O Information
      - Allow I/O devices to timestamp packets using the GRTC
      - System packets are also timestamped with the GRTC
      - All packets can be logged
      - Create a standard way for the network interface to access relevant performance counters on novel I/O devices
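The last-writer idea can be modelled in a few lines of software. The sketch below is a simulation written for illustration, not the proposed hardware and not the Shotgun mechanism itself; the class name, the 64-byte line size, and the access trace are assumptions.

```python
from collections import Counter

LINE_BYTES = 64  # assumed cache line size


class LastWriterTracker:
    """Record communication arcs by remembering each line's last writer."""

    def __init__(self):
        self.last_writer = {}  # cache line number -> core id
        self.arcs = Counter()  # (writer core, reader core) -> read count

    def write(self, core, addr):
        self.last_writer[addr // LINE_BYTES] = core

    def read(self, core, addr):
        writer = self.last_writer.get(addr // LINE_BYTES)
        # Only reads of a line last written by a different core form an arc.
        if writer is not None and writer != core:
            self.arcs[(writer, core)] += 1


tracker = LastWriterTracker()
tracker.write(0, 0x1000)
tracker.read(1, 0x1008)  # same line as 0x1000: arc from core 0 to core 1
tracker.read(0, 0x1000)  # core reads its own data: no arc
tracker.write(2, 0x2000)
tracker.read(1, 0x2004)  # arc from core 2 to core 1
print(dict(tracker.arcs))
```

The resulting arc counts are the raw material for a dependency graph: each key is a producer/consumer pair of cores.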

Conclusions
  - Must have a standard for performance counters
      - Always On
      - Accurate and Accessible
      - Same across all architectures
  - Big impact on future software systems
      - Aid programmer, autotuner, scheduler, OS in adapting system
  - Help turn data into useful information that can help the efficiency-level programmer improve the system
      - Why not getting 100% of memory bandwidth? Conflict misses?
  - Help turn data into useful information that can help the productivity-level programmer improve the app
      - Where am I spending my time in my program?
      - If I change it like this, impact on performance?

Acknowledgements
  - In addition to all the authors at the beginning of this talk, I would like to thank
      - Ras Bodik
      - David Wessel
      - The BeBOP group
Questions? slbird@eecs.berkeley.edu

Extra Slides

Solving the Parallel Problem
  - A lot of solutions have been proposed
      - New programming languages
      - Parallel Frameworks
      - Better Compilers
      - Speculative Execution/Transactional Memory
      - Better Hardware
  - There isn't a quick fix, and programmers need help!

Utilization Meters

  Communication              Computation
  -------------------------  ---------------------------
  Compulsory Misses          Floating Point Instructions
  Capacity Misses            Atomic Instructions
  Conflict Misses            Branches
  Coherency Misses           Integer Ops
  Prefetch Data              Load/Stores
  Write Allocation Data      SIMD/Vector
  Misc Data                  Misc Instructions
  -------------------------  ---------------------------
  Sum (All Traffic)          Sum (All Retired Instrs)
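Since each meter column sums to a total, a tool could report every category as a fraction of that total. A minimal sketch, with invented counter values and a hypothetical `utilization` helper:

```python
def utilization(meter):
    """Return each category's share of the column total."""
    total = sum(meter.values())
    if total == 0:
        return {name: 0.0 for name in meter}
    return {name: value / total for name, value in meter.items()}


# Illustrative traffic counts only; not measurements from the talk.
traffic = {"compulsory": 500, "capacity": 250, "conflict": 150, "coherency": 100}
shares = utilization(traffic)
print(shares)
```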

Communication
  - Measure Traffic on Each Edge
  - Compulsory Traffic
      - TLB/Page Tables maintain a reference bit per cache line
      - 4KB pages and 64B blocks => 64 new bits per PTE
  - Conflict Misses
      - Approximate using a tag victim cache
  - Coherency Traffic
      - Invalidations
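The classification on this slide can be sketched as a small simulation. Everything below is an illustrative software model under stated assumptions: a direct-mapped cache, a `seen` set standing in for the per-line reference bits in the page table, and a small LRU list of evicted tags standing in for the tag victim cache. Misses that match neither rule are lumped into "other" (capacity, coherency, and so on).

```python
from collections import OrderedDict

PAGE_BYTES = 4096
LINE_BYTES = 64
REF_BITS_PER_PTE = PAGE_BYTES // LINE_BYTES  # one reference bit per line


class MissClassifier:
    """Toy direct-mapped cache that labels each miss (hypothetical model)."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.victim_entries = victim_entries
        self.lines = {}                   # set index -> resident line number
        self.victim_tags = OrderedDict()  # recently evicted lines, oldest first
        self.seen = set()                 # stands in for PTE reference bits
        self.counts = {"hit": 0, "compulsory": 0, "conflict": 0, "other": 0}

    def access(self, addr):
        line = addr // LINE_BYTES
        index = line % self.num_sets
        if self.lines.get(index) == line:
            kind = "hit"
        elif line not in self.seen:
            kind = "compulsory"  # first reference to this line ever
        elif line in self.victim_tags:
            kind = "conflict"    # evicted recently; more associativity would hit
        else:
            kind = "other"
        self.counts[kind] += 1
        if kind != "hit":
            self.seen.add(line)
            self.victim_tags.pop(line, None)
            evicted = self.lines.get(index)
            if evicted is not None:
                self.victim_tags[evicted] = True
                if len(self.victim_tags) > self.victim_entries:
                    self.victim_tags.popitem(last=False)  # drop oldest tag
            self.lines[index] = line
        return kind


cache = MissClassifier(num_sets=4, victim_entries=2)
# Addresses 0 and 256 are different lines that map to the same set.
kinds = [cache.access(addr) for addr in (0, 256, 0, 256, 0)]
print(REF_BITS_PER_PTE, kinds)
```

The two addresses keep evicting each other, so after the first touch of each line every further miss is labelled a conflict miss, which is the pattern a tag victim cache is meant to expose.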