Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore

1 By Dan Stafford

2
- Background
    Heterogeneous Architectures
- Performance Modeling
    Single Core Performance Profiling
    Multicore Performance Estimation
- Test Cases
    Multicore Design Space
- Results & Observations
    General
    Limited Off-Chip Bandwidth
    Impact of LLC Size
    Optimal Heterogeneous Design
- Job Scheduling
- Summary of Design Considerations
- References

4 Heterogeneous architectures are typically used to obtain higher performance for a lower power budget
- CPU/GPU heterogeneous systems
    Intel Core series, AMD Fusion, NVIDIA Tegra
- Single-ISA heterogeneous systems
    Energy-optimized cores
    Performance-optimized cores
    Every core shares a common ISA
    ARM big.LITTLE, NVIDIA Kal-El
- Clock-rate heterogeneous systems
    Homogeneous cores
    Different clock rates

6
- Single-core CPI
- Memory CPI
    Fraction of single-core CPI spent waiting for memory
- Stack Distance Counters (SDCs)
    Capture the program's temporal memory access behavior in the last-level cache (LLC)
- All metrics are captured every 20M instructions
- SPEC CPU2006 workloads
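To make the stack distance idea concrete, here is a minimal sketch of how a stack distance histogram can be built from a block-address trace and then used to estimate misses for a given cache capacity. It assumes a fully associative LRU cache, which is a simplification of the set-associative LLC described in the slides; the function names and the toy trace are hypothetical and are not taken from the profiling tool used in [1].

```python
from collections import Counter


def stack_distance_histogram(addresses):
    """Return a Counter mapping LRU stack distance -> number of accesses.

    Distance 0 means the block was the most recently used one.
    The key None counts cold (first-touch) accesses.
    """
    stack = []          # index 0 holds the most recently used block
    hist = Counter()
    for block in addresses:
        if block in stack:
            depth = stack.index(block)
            hist[depth] += 1
            stack.pop(depth)
        else:
            hist[None] += 1
        stack.insert(0, block)   # the accessed block becomes most recent
    return hist


def misses_for_capacity(hist, capacity_blocks):
    """Accesses that miss in a fully associative LRU cache of this size."""
    return sum(count for dist, count in hist.items()
               if dist is None or dist >= capacity_blocks)


trace = ["A", "B", "C", "A", "B", "D", "A"]   # toy trace, not real data
hist = stack_distance_histogram(trace)
print(misses_for_capacity(hist, 3))
```

One histogram answers the miss-count question for every cache size at once, which is why a single profiling run can feed estimates for several LLC capacities.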

7
- Each core type is profiled across all SPEC CPU2006 workloads
- The single-core profiles are then used to estimate multicore performance
- Traditional methods take more than 80 days
- This approach takes only a single day
- Accurate to within 2.1%

9
- Out-of-order cores
    4-wide: 128-entry reorder buffer
    2-wide: 32-entry reorder buffer
- In-order cores
    4-wide, 2-wide, and scalar
- Caches
    LRU policy
    Private L1 instruction and data caches: 32 KB, 8-way set associative
    Private L2 cache: 256 KB, 8-way set associative
    Shared L3 cache (LLC): 1-4 MB, 16-way set associative
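As a quick check on the cache geometry, the number of sets follows from size, associativity, and line size. The slide does not state the line size, so the 64-byte default below is an assumption, and `num_sets` is a hypothetical helper name.

```python
def num_sets(size_bytes, ways, line_bytes=64):
    """Sets in a set-associative cache; line_bytes=64 is an assumption."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size is not a multiple of ways * line size")
    return sets


KB = 1024
MB = 1024 * KB
print(num_sets(32 * KB, 8))     # private L1
print(num_sets(256 * KB, 8))    # private L2
print(num_sets(1 * MB, 16))     # smallest shared LLC
print(num_sets(4 * MB, 16))     # largest shared LLC
```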

10 BCE: Base Core Equivalent
- Relative chip-area measurement
- Heterogeneous designs are configured to use 40 BCEs

  Component                  #BCEs
  scalar in-order core       1
  2-wide in-order core       2
  4-wide in-order core       3
  2-wide out-of-order core   4
  4-wide out-of-order core   8
  512 KB LLC slice           1
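The BCE costs above make the area budget easy to check mechanically. The sketch below totals the area of a design and enumerates the two-core-type mixes that exactly fill 40 BCEs for a given LLC size. The dictionary keys and function names are hypothetical, and requiring an exact fill is an assumption made for illustration; it is not necessarily how the configurations in [1] were chosen.

```python
# BCE costs as listed on the slide
BCE_COST = {
    "scalar_io": 1,
    "2wide_io": 2,
    "4wide_io": 3,
    "2wide_ooo": 4,
    "4wide_ooo": 8,
}
LLC_SLICE_KB = 512      # one LLC slice costs 1 BCE
BUDGET_BCES = 40


def llc_bces(llc_kb):
    """Area of the LLC in BCEs; the size must be whole 512 KB slices."""
    slices, remainder = divmod(llc_kb, LLC_SLICE_KB)
    if remainder:
        raise ValueError("LLC size must be a multiple of 512 KB")
    return slices


def area_in_bces(core_counts, llc_kb):
    """Total area of a design given {core_type: count} and LLC size in KB."""
    cores = sum(BCE_COST[kind] * n for kind, n in core_counts.items())
    return cores + llc_bces(llc_kb)


def two_type_configs(big, small, llc_kb, budget=BUDGET_BCES):
    """List (n_big, n_small) mixes that exactly fill the budget."""
    left = budget - llc_bces(llc_kb)
    configs = []
    for n_big in range(left // BCE_COST[big] + 1):
        rest = left - n_big * BCE_COST[big]
        if rest % BCE_COST[small] == 0:
            configs.append((n_big, rest // BCE_COST[small]))
    return configs


print(area_in_bces({"4wide_ooo": 2, "2wide_io": 10}, 2048))
print(two_type_configs("4wide_ooo", "2wide_io", 2048))
```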

12
- System Throughput (STP)
    Multicore performance from the system perspective
- Average Normalized Turnaround Time (ANTT)
    User-perceived performance
- Note: n independent jobs and cores, p programs
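For reference, the commonly used definitions of these two metrics for multiprogram workloads are STP = sum over programs of CPI_single / CPI_multi (higher is better) and ANTT = the average over programs of CPI_multi / CPI_single (lower is better). The sketch below assumes the slides use these standard definitions; the function names and CPI values are hypothetical, and which core serves as the single-program reference is left to the caller.

```python
def system_throughput(cpi_single, cpi_multi):
    """STP: sum of per-program normalized progress (higher is better)."""
    return sum(s / m for s, m in zip(cpi_single, cpi_multi))


def avg_normalized_turnaround_time(cpi_single, cpi_multi):
    """ANTT: mean per-program slowdown (lower is better)."""
    n = len(cpi_single)
    return sum(m / s for s, m in zip(cpi_single, cpi_multi)) / n


# Hypothetical CPIs: each program runs twice as slowly when co-scheduled.
cpi_single = [1.0, 2.0]
cpi_multi = [2.0, 4.0]
print(system_throughput(cpi_single, cpi_multi))
print(avg_normalized_turnaround_time(cpi_single, cpi_multi))
```

The two metrics pull in different directions: adding more small cores can raise the sum in STP while raising every program's individual slowdown in ANTT.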

13 [1]

14
- Simple in-order cores have better system throughput
- Aggressive out-of-order cores have better turnaround time

15 [1]

16 [1]

17
- Same tradeoff between system throughput and turnaround time
- Some heterogeneous configurations outperform homogeneous configurations
- Heterogeneity allows more precise control over system throughput and turnaround time
- Two different core types provide the majority of the benefit from heterogeneity

18 [1]

19 [1]

20
- Limiting the off-chip bandwidth has a proportionally larger effect on per-program performance
- The best performance is achieved using heterogeneous configurations

21 [1]

22
- Cache reduces the off-chip bandwidth pressure
- Under unlimited bandwidth
    Less cache allows more cores to be integrated
    Assuming the same chip area as the design with cache

23 [1]

24
- High throughput: single-issue and dual-issue in-order cores
- Per-program performance: at least one out-of-order core

26
- Optimal mapping
    Jobs are mapped so that performance is optimized
    Requires prior profiling of all configurations
    Not feasible
- Cache-miss rate
    Jobs with a higher LLC miss rate are mapped to lower-end cores
- Relative slowdown
    Assumes the relative performance of each job is known
    The job with the highest slowdown on the smaller core is assigned to the higher-performing core
- Random
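The two heuristic policies above can be sketched as simple sorts. This is an illustrative sketch only, assuming two core types ("big" and "small") and per-job statistics that are already known; the function names, job names, and numbers are hypothetical and do not come from [1].

```python
def map_by_miss_rate(llc_miss_rate, n_big):
    """Cache-miss-rate policy: high-miss-rate jobs go to the small cores.

    llc_miss_rate maps job name -> LLC miss rate.
    """
    ranked = sorted(llc_miss_rate, key=llc_miss_rate.get)  # lowest first
    return {"big": ranked[:n_big], "small": ranked[n_big:]}


def map_by_relative_slowdown(slowdown, n_big):
    """Relative-slowdown policy: jobs hurt most by a small core go big.

    slowdown maps job name -> (time on small core / time on big core).
    """
    ranked = sorted(slowdown, key=slowdown.get, reverse=True)
    return {"big": ranked[:n_big], "small": ranked[n_big:]}


# Hypothetical per-job statistics
miss_rate = {"a": 0.30, "b": 0.05, "c": 0.10}
slowdown = {"a": 1.2, "b": 2.5, "c": 1.8}
print(map_by_miss_rate(miss_rate, n_big=1))
print(map_by_relative_slowdown(slowdown, n_big=1))
```

The first policy needs only a miss-rate counter per job, while the second needs each job's performance on both core types, which is the information cost the slides attribute to relative slowdown.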

27
- Two core types
    4-wide out-of-order
    2-wide in-order
- 6 separate heterogeneous configurations
    4-wide out-of-order cores
    2-wide in-order cores
- 500 randomly chosen multi-program workload mixes

28 [1]

29
- None of the scheduling techniques is quantitatively better
- Cache-miss rate does not take memory parallelism into account
- Relative slowdown requires a substantial amount of information
- Active area of research for all types of heterogeneous architectures

31
- Perform many simulations before committing to a specific architecture
- Large LLC vs. additional cores
    Increase the LLC if bandwidth constrained
    Add cores otherwise
- System throughput vs. per-program performance
    In-order cores have better system throughput
    Out-of-order cores have better per-program performance

33
[1] K. Van Craeynest and L. Eeckhout, "Understanding fundamental design choices in single-ISA heterogeneous multicore architectures", TACO, vol. 9, no. 4, pp. 1-23, 2013.
[2] R. Kumar, D. Tullsen, P. Ranganathan, N. Jouppi and K. Farkas, "Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance", ACM SIGARCH Computer Architecture News, vol. 32, no. 2, p. 64, 2004.
