
By Dan Stafford

- Background
  - Heterogeneous Architectures
- Performance Modeling
  - Single Core Performance Profiling
  - Multicore Performance Estimation
- Test Cases
  - Multicore Design Space
- Results & Observations
  - General
  - Limited Off-Chip Bandwidth
  - Impact of LLC Size
  - Optimal Heterogeneous Design
- Job Scheduling
- Summary of Design Considerations
- References

- Typically used to obtain higher performance for a lower power budget
- CPU/GPU heterogeneous systems
  - Intel Core series, AMD Fusion, NVIDIA Tegra
- Single-ISA heterogeneous systems
  - Energy-optimized cores
  - Performance-optimized cores
  - Every core shares a common ISA
  - ARM big.LITTLE, NVIDIA Kal-El
- Clock-rate heterogeneous systems
  - Homogeneous cores
  - Different clock rates

- Single-core CPI
- Memory CPI
  - Fraction of single-core CPI spent waiting for memory
- Stack distance counters (SDCs)
  - Capture the program's temporal memory access behavior in the last-level cache (LLC)
- All metrics captured every 20M instructions
- SPEC CPU2006 workloads
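As a rough illustration of what stack distance counters record, the sketch below builds an LRU stack distance histogram from a block-address trace. It is a simplified, fully associative model and not the profiler used in [1]; the function names, the 0-based distance convention, and the `max_distance` bucket for cold misses are assumptions made for this example.

```python
from collections import Counter


def stack_distance_histogram(addresses, max_distance):
    """Count accesses per LRU stack distance (0 = most recently used).

    Simplifying assumption: one fully associative LRU stack, whereas real
    SDCs are typically kept per cache set. First-time accesses and any
    distance >= max_distance fall into the max_distance bucket.
    """
    stack = []  # index 0 holds the most recently used block
    histogram = Counter()
    for block in addresses:
        if block in stack:
            distance = stack.index(block)
            stack.remove(block)
        else:
            distance = max_distance  # cold miss
        histogram[min(distance, max_distance)] += 1
        stack.insert(0, block)
    return histogram


def misses_for_ways(histogram, ways):
    """Estimate misses for an LRU cache that keeps `ways` blocks.

    An access with stack distance d hits only if d < ways.
    """
    return sum(count for d, count in histogram.items() if d >= ways)


# Hypothetical trace of block addresses, for illustration only.
trace = ["A", "B", "C", "A", "B", "D", "A"]
hist = stack_distance_histogram(trace, max_distance=16)
```

Because one histogram answers the miss question for every cache capacity up to `max_distance`, a single profiling pass can be reused when estimating behavior under different LLC shares.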

- Profiles each core type across all SPEC CPU2006 workloads
- Single-core profiles are then used to estimate multicore performance
- Traditional methods take more than 80 days
- This approach takes only a single day
- Accurate to within 2.1%

- Out-of-order cores
  - 4-wide: 128-entry reorder buffer
  - 2-wide: 32-entry reorder buffer
- In-order cores
  - 4-wide, 2-wide, and scalar
- Caches
  - LRU policy
  - Private L1 instruction and data caches: 32 KB, 8-way set associative
  - Private L2 cache: 256 KB, 8-way set associative
  - Shared L3 cache (LLC): 1-4 MB, 16-way set associative

- BCE: Base Core Equivalent
  - Relative chip area measurement
- Heterogeneous designs configured to use 40 BCEs

Component                  #BCEs
scalar in-order core       1
2-wide in-order core       2
4-wide in-order core       3
2-wide out-of-order core   4
4-wide out-of-order core   8
512 KB LLC slice           1
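The area accounting above can be sketched as a small budget check. The BCE costs and the 40-BCE budget come from the slide; the dictionary keys, function names, and the example configuration are hypothetical and chosen only for illustration.

```python
# BCE cost per component, as listed on the slide.
BCE_COST = {
    "scalar_inorder": 1,
    "2wide_inorder": 2,
    "4wide_inorder": 3,
    "2wide_ooo": 4,
    "4wide_ooo": 8,
}
LLC_SLICE_KB = 512   # one LLC slice
LLC_SLICE_BCE = 1    # area of one slice in BCEs
BUDGET_BCE = 40      # area budget for every design


def chip_area_bce(core_counts, llc_kb):
    """Total area in BCEs for a core mix plus an LLC of llc_kb kilobytes."""
    if llc_kb % LLC_SLICE_KB != 0:
        raise ValueError("LLC size must be a multiple of the 512 KB slice")
    core_area = sum(BCE_COST[kind] * n for kind, n in core_counts.items())
    llc_area = (llc_kb // LLC_SLICE_KB) * LLC_SLICE_BCE
    return core_area + llc_area


def fits_budget(core_counts, llc_kb):
    """True if the design does not exceed the 40-BCE budget."""
    return chip_area_bce(core_counts, llc_kb) <= BUDGET_BCE


# Hypothetical design: two 4-wide out-of-order cores, ten 2-wide in-order
# cores, and a 2 MB LLC (four slices): 16 + 20 + 4 = 40 BCEs.
example = {"4wide_ooo": 2, "2wide_inorder": 10}
example_area = chip_area_bce(example, llc_kb=2048)
```

Under this accounting, shrinking the LLC from 4 MB (8 BCEs) to 1 MB (2 BCEs) frees 6 BCEs, enough for six scalar in-order cores or three 2-wide in-order cores.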

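- System Throughput
  - Multicore performance from the system perspective
  - (equation not preserved in this transcript)
- Average Normalized Turnaround Time
  - User-perceived performance
  - (equation not preserved in this transcript)
- Note: n independent jobs and cores, p programs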
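The formulas for these two metrics did not survive in this transcript. For reference, the definitions commonly used in the multiprogram performance literature are given below. This is an assumption about what the slide showed, not a recovery of it. Here $C_i^{SP}$ denotes the execution time (in cycles) of program $i$ running alone and $C_i^{MP}$ its execution time when co-running with the other $n-1$ programs; which core type serves as the single-program reference is likewise an assumption left open here.

```latex
\mathrm{STP} = \sum_{i=1}^{n} \frac{C_i^{SP}}{C_i^{MP}}
\qquad
\mathrm{ANTT} = \frac{1}{n} \sum_{i=1}^{n} \frac{C_i^{MP}}{C_i^{SP}}
```

Higher STP is better (more work completed per unit time), while lower ANTT is better (each program is slowed down less on average).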

[1]

- Simple in-order cores have better system throughput
- Aggressive out-of-order cores have better turnaround time

[1]

[1]

- Same tradeoff between system throughput and turnaround time
- Some heterogeneous configurations outperform homogeneous configurations
- Heterogeneity allows more precise control over system throughput and turnaround time
- Two different core types provide the majority of the benefit from heterogeneity

[1]

[1]

- Limiting the off-chip bandwidth proportionally affects per-program performance more
- Best performance is achieved using heterogeneous configurations

[1]

- Cache reduces off-chip bandwidth pressure
- Under unlimited bandwidth
  - Less cache allows more cores to be integrated
  - Assuming the same chip area as the design with cache

[1]

- High throughput: single-issue and dual-issue in-order cores
- Per-program performance: at least one out-of-order core

- Optimal mapping
  - Jobs are mapped to cores so that performance is optimized
  - Requires prior profiling of all configurations
  - Not feasible
- Cache miss rate
  - Jobs with higher LLC miss rates are mapped to lower-end cores
- Relative slowdown
  - Assumes the relative performance of each job is known
  - The job with the highest slowdown on the smaller core is assigned to the higher-performing core
- Random
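The two practical heuristics above can be sketched as simple rankings. This is a minimal illustration, not the schedulers evaluated in [1]; the function names, the big/small core labels, and all job metrics below are hypothetical.

```python
def schedule_by_miss_rate(miss_rates, n_big):
    """Cache-miss-rate heuristic.

    miss_rates maps job name -> LLC miss rate. Jobs with the lowest miss
    rates get the big cores, so the highest miss-rate jobs land on the
    lower-end cores.
    """
    ranked = sorted(miss_rates, key=miss_rates.get)  # ascending miss rate
    return {"big": ranked[:n_big], "small": ranked[n_big:]}


def schedule_by_slowdown(slowdowns, n_big):
    """Relative-slowdown heuristic.

    slowdowns maps job name -> (time on small core / time on big core),
    assumed to be known in advance. Jobs that slow down the most on the
    small core get the big cores.
    """
    ranked = sorted(slowdowns, key=slowdowns.get, reverse=True)
    return {"big": ranked[:n_big], "small": ranked[n_big:]}


# Hypothetical per-job metrics, for illustration only.
miss_rates = {"a": 0.02, "b": 0.30, "c": 0.10, "d": 0.01}
slowdowns = {"a": 1.2, "b": 2.5, "c": 1.8, "d": 3.0}

by_miss = schedule_by_miss_rate(miss_rates, n_big=2)
by_slow = schedule_by_slowdown(slowdowns, n_big=2)
```

In this made-up example the two heuristics disagree about job `b`: its high miss rate sends it to a small core under the first policy, while its large slowdown earns it a big core under the second.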

- Two core types
  - 4-wide out-of-order
  - 2-wide in-order
- 6 separate heterogeneous configurations
  - 4-wide out-of-order cores
  - 2-wide in-order cores
- 500 randomly chosen multi-program workload mixes

[1]

- None of the scheduling techniques is quantitatively better than the others
- Cache miss rate does not take memory-level parallelism into account
- Relative slowdown requires a substantial amount of information
- Active area of research for all types of heterogeneous architectures

- Perform many simulations before committing to a specific architecture
- Large LLC vs. additional cores
  - Increase the LLC if bandwidth constrained
  - Add cores otherwise
- System throughput vs. per-program performance
  - In-order cores have better system throughput
  - Out-of-order cores have better per-program performance

[1] K. Van Craeynest and L. Eeckhout, "Understanding Fundamental Design Choices in Single-ISA Heterogeneous Multicore Architectures", ACM Transactions on Architecture and Code Optimization (TACO), vol. 9, no. 4, pp. 1-23, 2013.
[2] R. Kumar, D. Tullsen, P. Ranganathan, N. Jouppi and K. Farkas, "Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance", ACM SIGARCH Computer Architecture News, vol. 32, no. 2, p. 64, 2004.