Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas

Size: px

Start display at page:

Download "Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas"

Janice Harris
5 years ago
Views:

1 Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas

2 2 Increasing number of transistors on chip Power and energy limited Single- thread performance limited => parallelism Many opeons: heavy mulecore, manycore, SIMD- like GPUs, low- power designs Intel s SCC presents manycore 2

3 3 Various trade- offs Performance, power and energy Programmability Different for each applicaeon Each architecture has advantages No single best solueon (But manycore is a good opeon) 3

4 4 Benchmark different parallel applicaeons On all architectures Analyze results, learn about architecture Where does each architecture stand? In various tradeoffs What could be probably bezer for future? Reasonable in all metrics: power, performance For most applicaeons 4

5 5 Jacobi: nearest neighbor communicaeon Standard floaeng- point benchmark NAMD: Molecular dynamics Dynamic scienefic applicaeon Nqueens: state space search Integer intensive 5

6 6 CG: Conjugate Gradient, from NAS Parallel Benchmarks (NPB) A lot of fine- grain communicaeon Integer Sort: represents commercial workloads Integer intensive, heavy communicaeon 6

7 7 All codes reasonably opemized, not extensively! Programmer producevity MPI and Charm++ CUDA and OpenCL codes for GPU Reasonably opemized, same complexity as CPU ones Fair apple- to- apple comparison 7

8 8 MulE- core: Intel Core i7 4 cores, 8 threads, 2.8 GHz Low- power: Intel Atom D 2 cores, 4 threads, 1.8 GHz SIMD- like GPU: Nvidia ION2 16 CUDA cores, low power, 475 MHz Manycore: Intel s SCC 48 cores, 800 MHz 8

9 9 Single- Chip Cloud Computer (SCC) Just a name, not only for cloud Experimental chip, for research on manycore Slower than produceon chips, but we can analyze 48 PenEum cores, Mesh interconnect Not cache coherent, message- passing programs 9

10 10 We ported Charm infrastructure to SCC To run Charm++ and MPI parallel codes Charm++ is message- passing parallel objects No conceptual difficulty SCC is similar to a small cluster Just headaches of experimental chip Old compiler, OS and libraries 10

11 11 Apps can use parallelism in architecture Speedup NAMD Jacobi NQueens CG Sort Number of Threads 11

12 12 150W NAMD Jacobi NQueens CG Sort Power (W) W Number of Threads 12

13 Speedup offsets Power increase Normalized Energy 2.8 2.6 2.4 2.2 2 1.8 1.

13 13 Speedup offsets Power increase Normalized Energy NAMD Jacobi NQueens CG Sort Number of Threads 13

14 NAMD Jacobi NQueens CG Sort Speedup Number of Threads 14

15 15 31W NAMD Jacobi NQueens CG Sort Power (W) W Number of Threads 15

16 Speedup and near constant power Normalized Energy 3 2.8 2.6 2.4 2.2 2 1.

16 16 Speedup and near constant power Normalized Energy NAMD Jacobi NQueens CG Sort Number of Threads 16

17 40 35 Speedup Power (W) Relative Energy 2 1.8 1.

17 Speedup Power (W) Relative Energy Energy Speedup, Power (W) Relative Energy Jacobi NAMD NQueens Sort 0 17

18 18 No change in source code! NAMD Jacobi NQueens CG Sort Except CG (fine- grain collec5ves) Speedup Number of Cores

80 75 70 65 60 55 50 45 40 35 NAMD Jacobi

19 19 19 Different power 85W 60W (Communica5on bound so processors stall) 40W Power (W) NAMD Jacobi NQueens CG Sort Number of Cores

not for CG Normalized Energy 22 20 18 16 14 12

20 20 Apps scalability needed to offset power increase! not for CG Normalized Energy NAMD Jacobi NQueens CG Sort Number of Cores

21 Regular, highly parallel GPU good for

ION2 (16 Threads) Mul5core good for most

21 21 Regular, highly parallel GPU good for some apps experimental chip! Relative Speed SCC beats mul5core 10 on Nqueens 5 Atom (4 Threads) Even dynamic, Core i7 (8 irregular, Threads) less parallel SCC (48 Threads) ION2 (16 Threads) Mul5core good for most apps Low- power is really slow! 21 0 Jacobi NAMD NQueens CG Sort

22 22 GPU and have low consump5on Power (W) Power nearly same for different apps Atom (4 Threads) Core i7 (8 Threads) SCC (48 Threads) ION2 (16 Threads) High frequency, fancy features Mul5core has high consump5on More parallelism, less frequency SCC has moderate power 22 0 Jacobi NAMD NQueens CG Sort

23 23 Despite low speed SCC is OK High speed offsets power Relative Energy Mul5core is energy efficient! Atom (4 Threads) Core i7 (8 Threads) SCC (48 Threads) ION2 (16 Threads) Low speed offsets power savings Low- power design is not energy efficient! Jacobi NAMD NQueens CG Sort

24 24 MulEcore sell effeceve for many apps Dynamic, irregular, heavy communicaeon GPU is excepeonal for some apps Not for all apps Programmability, portability issues Low- power design just low power Slow! So not energy efficient 24

25 25 SCC: a good opeon Low power and can run legacy code fast Balanced design point Lower power than mulecore, faster than low- power design No programmability and portability issues of GPU Experimental, slow but can be improved 25

26 26 SequenEal performance, floaeng point Apps show good scaling on more cores of SCC Easy with more CMOS transistors Interconnect also has to improve with it (sell easy) Global colleceves network Like BlueGene supercomputers, low cost To support fine- grain global communicaeons As in CG 26

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance