Performance and Power Impact of Issue-width in Chip-Multiprocessor Cores

1 Performance and Power Impact of Issue-width in Chip-Multiprocessor Cores
Magnus Ekman and Per Stenstrom
Department of Computer Engineering

2 Outline
Problem statement
Assumptions and studied system
Methodology
Results
Conclusion

3 Problem
What is the best trade-off between the number of cores and their complexity in a CMP?
Wide design space, ranging from a few very complex superscalar processors to many very simple single-issue cores.

4 Assumptions
Chip-area requirements are constant in all designs
Clock frequency is constant in all designs
Parallel applications

5 Assumptions & Disclaimers
Chip-area requirements are constant in all designs
  Very rough area estimates
Clock frequency is constant in all designs
  Perhaps more realistic with a faster clock for simpler designs
Parallel applications
  The world is not entirely parallel

6 Four basic systems studied
2 cores, 8-issue
4 cores, 4-issue
8 cores, dual-issue
16 cores, single-issue
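
One way to read the four configurations is that each keeps the product of core count and issue width at 16, so the peak aggregate issue rate of the chip stays the same while the balance between ILP and TLP shifts. The short sketch below only checks that arithmetic; the function name is illustrative and does not come from the slides.

```python
# (cores, issue width per core) for the four systems listed on the slide
SYSTEMS = [(2, 8), (4, 4), (8, 2), (16, 1)]


def aggregate_issue_width(cores, issue_width):
    """Peak number of instructions the whole chip can issue per cycle."""
    return cores * issue_width


for cores, width in SYSTEMS:
    total = aggregate_issue_width(cores, width)
    print(f"{cores:2d} cores x {width}-issue = {total} issue slots per cycle")
```

Peak issue rate is only an upper bound; how much of it an application actually uses is what the execution-time and IPC results measure.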

7 Things that we study
Total execution time of the same task on all systems
  How do applications exploit ILP vs. TLP?
Power consumption for the different systems
  Gives hints about hot-spots in the designs
Total energy consumption of executing the same task on all systems
  How efficient is the system?
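
The three metrics are linked: energy for a task is average power multiplied by execution time, so a design that draws less power can still spend more energy if it runs longer. The sketch below illustrates this with two made-up systems; the numbers are hypothetical placeholders and are not results from this study.

```python
def energy_joules(avg_power_watts, exec_time_seconds):
    """Energy for one task: average power multiplied by execution time."""
    return avg_power_watts * exec_time_seconds


# Hypothetical values chosen only to illustrate the relationship.
system_a = energy_joules(avg_power_watts=40.0, exec_time_seconds=2.0)
system_b = energy_joules(avg_power_watts=25.0, exec_time_seconds=4.0)

print(f"System A: {system_a} J, System B: {system_b} J")
# System B draws less power but uses more energy because it runs longer.
```

This is why power and energy are examined separately: power points to hot-spots, while energy reflects the efficiency of completing the whole task.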

8 Simulation methodology (complexity effective?)
Multiprocessor version of SimWattch [1]
SimWattch is based on Simics [2] and Wattch [3] (which is based on SimpleScalar [4] and Cacti [5]).
[1] SimWattch, 2003 IEEE International Symposium on Performance Analysis of Systems and Software
[2]
[3] ISCA 2000
[4]
[5] research.compaq.com/wrl/people/jouppi/cacti.html

9 How it works
Simics generates traces dynamically.
The traces are fed into the detailed processor simulators, which tell Simics whether they can accept more instructions or whether it should stall.
Activity counters are used to estimate energy consumption.
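
The coupling described above can be sketched as a toy loop: a trace source hands instructions to a timing model only while the model reports that it has room, and per-unit activity counters are turned into an energy estimate afterwards. This is a minimal illustration, not the SimWattch implementation; `CoreModel`, `run`, `estimate_energy_nj`, the unit names, and the per-access energy values are all hypothetical.

```python
from collections import Counter, deque

# Hypothetical per-access energies in nanojoules; placeholder values only.
ENERGY_PER_ACCESS_NJ = {"fetch": 1.0, "window": 2.0, "alu": 1.5}


class CoreModel:
    """Toy timing model: a bounded window that retires one instruction per cycle."""

    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.activity = Counter()  # per-unit access counts
        self.cycles = 0

    def can_accept(self):
        # Back-pressure signal to the trace source.
        return len(self.window) < self.window_size

    def accept(self, instr):
        self.window.append(instr)
        self.activity["fetch"] += 1
        self.activity["window"] += 1

    def tick(self):
        self.cycles += 1
        if self.window:
            self.window.popleft()
            self.activity["alu"] += 1


def run(trace, core, fetch_width):
    """Feed the trace to the core, stalling the source whenever the core is full."""
    trace = iter(trace)
    pending = next(trace, None)
    while pending is not None or core.window:
        fetched = 0
        while pending is not None and fetched < fetch_width and core.can_accept():
            core.accept(pending)
            pending = next(trace, None)
            fetched += 1
        core.tick()
    return core


def estimate_energy_nj(activity):
    """Energy estimate: access count times per-access energy, summed over units."""
    return sum(ENERGY_PER_ACCESS_NJ[unit] * count for unit, count in activity.items())


core = run(range(10), CoreModel(window_size=4), fetch_width=2)
print(core.cycles, dict(core.activity), estimate_energy_nj(core.activity))
```

In a real setup the per-access energies would come from a power model such as Wattch rather than from constants like these.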

10 Simulation parameters, all systems
SimpleScalar pipeline
Snoop-based MOESI protocol
Shared bus, with contention modeled
Shared L2-Cache: 2M, 8-way
L1-latency: 1 cycle
L2-latency: 12 cycles + bus arbitration
Mem-latency: 128 cycles

11 Simulation parameters, 8-issue core
Issue-width: 8
Window and ROB-size: 128
Load/Store-queue: 64
G-Share BP: 16K entries
Branch Target Buffer: 4K entries
Return Address Stack: 8 entries
L1I-Cache: 64K, 2-way
L1D-Cache: 64K, 4-way

12 Scaling methodology
Everything except the return address stack is scaled linearly.
This tends to favor systems with many cores.
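
The slides do not list the parameters of the narrower cores. Assuming "scaled linearly" means proportional to issue width, they can be derived from the 8-issue configuration as in the sketch below. The dictionary keys and the `scale_core` helper are illustrative names; treating cache capacity as scaled while leaving associativity out of the picture is also an assumption.

```python
BASE_ISSUE_WIDTH = 8

# 8-issue core parameters as given on the slide (cache sizes in KB).
BASE_8_ISSUE = {
    "issue_width": 8,
    "window_rob_entries": 128,
    "load_store_queue_entries": 64,
    "gshare_entries": 16 * 1024,
    "btb_entries": 4 * 1024,
    "return_address_stack_entries": 8,
    "l1i_kbytes": 64,
    "l1d_kbytes": 64,
}

# The return address stack is the one structure that is not scaled.
NOT_SCALED = {"return_address_stack_entries"}


def scale_core(issue_width):
    """Scale every parameter in proportion to issue width, except NOT_SCALED ones."""
    return {
        name: value if name in NOT_SCALED else value * issue_width // BASE_ISSUE_WIDTH
        for name, value in BASE_8_ISSUE.items()
    }


for width in (8, 4, 2, 1):
    print(width, scale_core(width))
```

Under this reading a single-issue core would get a 16-entry window and 8 KB L1 caches while keeping the full 8-entry return address stack.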

13 Benchmarks
Parallel applications from Splash-2:
Cholesky
Raytrace
FFT
Radix
Water-sp

14 Execution time

15 Instructions per cycle

16 Executed instructions (figure; series: 1IPC system, Baseline system)

17 IPC with perfect memory

18 Execution time with longer memory latency (3x)
Increased execution time:
Cholesky: 114%
Radix: 112%
FFT: 103%
Water: 61%
Raytrace: 94%
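
Reading each percentage as the increase over the run with the baseline memory latency (an interpretation; the slide does not spell out the reference point), the corresponding slowdown factors can be computed as follows. The helper name `slowdown_factor` is illustrative.

```python
# Increase in execution time with 3x memory latency, as listed on the slide.
INCREASE_PERCENT = {
    "Cholesky": 114,
    "Radix": 112,
    "FFT": 103,
    "Water": 61,
    "Raytrace": 94,
}


def slowdown_factor(increase_percent):
    """Convert a percentage increase into a multiplicative slowdown."""
    return 1 + increase_percent / 100


for name, pct in INCREASE_PERCENT.items():
    print(f"{name}: {slowdown_factor(pct):.2f}x")
```

Under this reading, tripling memory latency roughly doubles execution time for most of the benchmarks, with Water the least sensitive.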

19 Power consumption (figure panels: Radix, FFT, Water-sp)

20 Energy consumption (figure panels: Radix, FFT, Water-sp)

21 Conclusions
Four 4-issue cores seem to yield almost as good performance as more cores for these multi-threaded applications.
Considering power and energy, four or eight cores seem beneficial.
Choose four cores in order to achieve good single-thread performance!
