Detecting Memory-Boundedness with Hardware Performance Counters


Center for Information Services and High Performance Computing (ZIH). Detecting Memory-Boundedness with Hardware Performance Counters. ICPE, April 24th, 2017. Daniel Molka (daniel.molka@tu-dresden.de), Robert Schöne (robert.schoene@tu-dresden.de), Daniel Hackenberg (daniel.hackenberg@tu-dresden.de), Wolfgang E. Nagel (wolfgang.nagel@tu-dresden.de)

Outline: Motivation; Benchmarks for memory subsystem performance; Identification of meaningful hardware performance counters; Summary

Scaling of Parallel Applications. [Diagram: Core 0 through Core n share a last level cache and a memory controller; a point-to-point interconnect links the processor to RAM, I/O, and other processors.] Multiple levels of cache bridge the processor-DRAM performance gap. The last level cache (LLC) and the memory controller are usually shared between cores.

Scaling of Parallel Applications. Memory controllers are typically integrated in every processor, so the performance of a memory access depends on the distance to the data.

Potential Bottlenecks. Local memory hierarchy: limited data path widths and access latencies. Remote memory accesses: additional latency and limited interconnect bandwidth. Further concerns are the performance of on-chip transfers and remote cache accesses as well as the saturation of shared resources. We need to understand which hardware characteristics determine the application performance. This requires knowledge about the peak achievable performance of the individual components and about the component utilization at application runtime.


Common Memory Benchmarks. Local memory hierarchy: bandwidth is measured with STREAM, latency with Lmbench. Remote memory accesses: STREAM and Lmbench combined with numactl. On-chip transfers and remote cache accesses: not covered by the common tools, which are also not easily extendable. Saturation of shared resources: covered by STREAM as well.

ZIH Development: BenchIT. BenchIT includes measurements of core-to-core transfers. Example: Opteron 6176 memory latency. [Plot: access latency over data set size, annotated with regions for transfers within one processor, remote cache accesses (not covered by other benchmark suites), and main memory (NUMA).] Sophisticated data placement enables performance measurements for individual components in the memory hierarchy; state transitions of the coherence protocol are also considered.

Scaling of Shared Resources in Multi-core Processors. [Plots: last level cache bandwidth (up to 300 GB/s) and main memory bandwidth (up to 70 GB/s) over the number of cores (1 to 12) for Xeon E5-2680 v3, Xeon E5-2670, Xeon X5670, Opteron 6274, and Opteron 2435.] On some processors, the bandwidth of the last level cache scales linearly with the number of cores that access it concurrently. The DRAM bandwidth can typically be saturated without using all cores.

Saturation of Shared Resources. Because the last level cache and the memory controller are shared between cores, concurrent accesses from multiple cores can saturate them.


Hardware Performance Counters. Per-core counters record events that occur within the individual cores, e.g., pipeline stalls and misses in the local L1 and L2 caches. Uncore counters monitor the shared resources; their events cannot be attributed to a certain core. Both types are accessible via PAPI.

Properties of Hardware Performance Counters. Counters are not designed with performance analysis in mind: they are included for verification purposes, not guaranteed to work, and some events are poorly documented. Some countable events can have different origins; for example, execution stalls can be caused by long latency operations as well as by memory accesses. It is also unclear whether counters are good indicators for capacity utilization: are 10 million cache misses per second too much? Finally, the events are not stable between processor generations. A methodology to identify meaningful events is therefore needed.
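The "too much?" question can only be answered relative to a measured peak. If, for instance, each miss transfers one 64-byte cache line, 10 million misses per second amount to only 0.64 GB/s; the peak to compare against (60 GB/s below is a made-up figure for illustration) has to come from a micro-benchmark:

```c
/* Convert a raw event rate into a bandwidth estimate. */
double event_rate_to_gbs(double events_per_second, double bytes_per_event) {
    return events_per_second * bytes_per_event / 1e9;
}

/* Utilization as a fraction of a peak bandwidth measured beforehand
   with a micro-benchmark. */
double utilization(double events_per_second, double bytes_per_event,
                   double peak_gbs) {
    return event_rate_to_gbs(events_per_second, bytes_per_event) / peak_gbs;
}
```

With these illustrative numbers, 10 million line-sized misses per second correspond to roughly 1% of peak, i.e., not much at all.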

Identification of Meaningful Performance Counter Events. Component utilization: use micro-benchmarks to stress individual components, identify performance monitoring events that correlate with the component utilization, and determine the peak event ratios. Performance impact: use an extended latency benchmark to determine which stall counters best represent the delays caused by memory accesses, and search for events that represent stalls caused by limited bandwidth.

Measuring Component Utilization. Required are performance counter events that generate one event for a certain amount of transferred data (e.g., for every load and store, or per cache line) and that clearly separate the levels in the memory hierarchy.

Measuring Component Utilization. Example: events that count L3 accesses (per cache line). Good counters are available for all levels in the memory hierarchy, with two exceptions: L1 accesses are only counted per load resp. store instruction (which can have different widths), and writes to DRAM are only counted per package.

Estimating Performance Impact of Memory Accesses. A high component utilization indicates a potential performance problem, but the actual effect on performance cannot easily be quantified. We modified the latency benchmark to check which stall counters provide good estimates for the delays caused by memory accesses: additional multiplications are inserted between the loads. Two versions exist. With independent operations that can overlap with the memory accesses, the reported number of stalls should decrease accordingly. With multiplications that are part of the dependency chain, an ideal counter reports the same results as for the plain latency benchmark.
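The two variants can be sketched as follows. This only illustrates the idea; a real implementation uses assembler to keep the compiler from simplifying the inserted multiplications away, and the constants here are arbitrary illustrative choices:

```c
#include <stddef.h>

/* Variant 1: the multiplications are independent of the pointer chain,
   so out-of-order execution can overlap them with the pending loads.
   A good stall counter should report fewer stalls here. */
size_t chase_independent(const size_t *chain, size_t hops, double *result) {
    size_t pos = 0;
    double x = 1.000001;
    for (size_t i = 0; i < hops; i++) {
        pos = chain[pos];
        x *= 1.000001;                 /* can overlap with the next load */
    }
    *result = x;                       /* keep the extra work observable */
    return pos;
}

/* Variant 2: the multiplication feeds the next index, so it extends the
   dependency chain. An ideal memory-stall counter reports the same
   stalls as the unmodified latency benchmark. */
size_t chase_dependent(const size_t *chain, size_t hops) {
    size_t pos = 0;
    for (size_t i = 0; i < hops; i++) {
        pos = chain[pos];
        pos = (size_t)((double)pos * 1.0);   /* value-preserving, but in
                                                the dependency chain */
    }
    return pos;
}
```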

Haswell Stall Counters. [Plots: stall cycles reported by the various Haswell stall counters for the benchmark variants described above.]

Estimating Performance Impact of Memory Accesses. Stall cycles in bandwidth-bound scenarios also need to be considered. They are best reflected by events that indicate full request queues, although the correlation is far from optimal. Events for loads and stores can overlap, but do not have to. On Haswell, the following events can be used to categorize stall cycles, with limited accuracy: productive cycles = CPU_CLK_UNHALTED - CYCLE_ACTIVITY:CYCLES_NO_EXECUTE; stall cycles = CYCLE_ACTIVITY:CYCLES_NO_EXECUTE; memory-bound stall cycles = max(RESOURCE_STALLS:SB, CYCLE_ACTIVITY:STALLS_LDM_PENDING); bandwidth-bound stall cycles = max(RESOURCE_STALLS:SB, L1D_PEND_MISS:FB_FULL + OFFCORE_REQUESTS_BUFFER:SQ_FULL). The remaining memory-bound stall cycles are considered latency bound; stall cycles that are not memory bound have other causes.
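Put together, the categorization is simple arithmetic over six event counts. The counter values in the sketch below are hypothetical and purely for illustration:

```c
/* Hypothetical counter readings for one measurement interval. */
typedef struct {
    unsigned long clk_unhalted;        /* CPU_CLK_UNHALTED (active cycles) */
    unsigned long cycles_no_execute;   /* CYCLE_ACTIVITY:CYCLES_NO_EXECUTE */
    unsigned long resource_stalls_sb;  /* RESOURCE_STALLS:SB */
    unsigned long l1d_fb_full;         /* L1D_PEND_MISS:FB_FULL */
    unsigned long sq_full;             /* OFFCORE_REQUESTS_BUFFER:SQ_FULL */
    unsigned long stalls_ldm_pending;  /* CYCLE_ACTIVITY:STALLS_LDM_PENDING */
} haswell_counters;

static unsigned long max_ul(unsigned long a, unsigned long b) {
    return a > b ? a : b;
}

/* Split active cycles into productive and stall cycles, and attribute
   the memory-bound stalls to bandwidth- and latency-bound parts. */
void categorize(const haswell_counters *c,
                unsigned long *productive, unsigned long *mem_bound,
                unsigned long *bw_bound, unsigned long *lat_bound) {
    *productive = c->clk_unhalted - c->cycles_no_execute;
    *mem_bound  = max_ul(c->resource_stalls_sb, c->stalls_ldm_pending);
    *bw_bound   = max_ul(c->resource_stalls_sb, c->l1d_fb_full + c->sq_full);
    *lat_bound  = *mem_bound > *bw_bound ? *mem_bound - *bw_bound : 0;
}
```

For example, 1000 active cycles with 600 no-execute cycles, 100 store-buffer stalls, 150 fill-buffer-full cycles, 50 super-queue-full cycles, and 500 load-miss-pending cycles yield 400 productive, 500 memory-bound, 200 bandwidth-bound, and 300 latency-bound cycles.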


Summary. Raw performance counter data is typically difficult to interpret, and selecting the (most) relevant events is not a trivial task. Some events do not show the expected behavior, e.g., the LDM_PENDING event, so verification is needed before relying on the reported event rates. The presented micro-benchmark based approach can be used to tackle these challenges. Acknowledgment: This work has been funded in part by the European Union's Horizon 2020 program in the READEX project and by the Bundesministerium für Bildung und Forschung via the research project Score-E.

Thank You For Your Attention