Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

1 Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs. A paper comparing modern architectures. Joakim Skarding, Christian Chavez

2 Motivation. Continued scaling of performance. Paradigm shift towards parallelism. Many different options for the future. Intel SCC: an emerging option.

3 Goal. Compare state-of-the-art platforms that differ in architectural build, in terms of performance, power consumption, and energy efficiency. Platforms: Intel SCC, Intel Core i7, Intel Atom, Nvidia ION2.

4 Methodology and Result. Introduce platforms and applications. Compare results. Suggest improvements to the SCC.

5 Libraries and benchmarks. Applications: for i7, Atom, and SCC: CHARM++ and MPI; for ION2: MPI and CUDA/OpenCL. For load balancing: LBTest (part of CHARM++), with RefineLB as the actual balancer.

6 Intel Single-chip Cloud Computer. Research many-core architecture. 48 cores. Power management: DVFS (dynamic voltage and frequency scaling).

8 Intel SCC. P = F*V^2 (power scales with frequency and the square of the voltage). High power: 125 W at 1.14 V and 1 GHz. Low power: 25 W at 0.7 V and 125 MHz. The lack of standard libraries made implementation time-consuming, and therefore some applications were not ported.
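The relation on this slide can be checked against the two operating points it lists. The following is a minimal sketch, assuming P = F*V^2 is read as a proportionality for dynamic power; the function name `dynamic_power_ratio` is hypothetical and not from the paper.

```python
def dynamic_power_ratio(f_high_hz, v_high, f_low_hz, v_low):
    """Ratio of dynamic power between two operating points, assuming P ~ F * V^2."""
    return (f_high_hz * v_high ** 2) / (f_low_hz * v_low ** 2)


# Operating points quoted on the slide: 1 GHz at 1.14 V and 125 MHz at 0.7 V.
predicted = dynamic_power_ratio(1e9, 1.14, 125e6, 0.7)

# Ratio of the chip power figures quoted on the slide (125 W and 25 W).
measured = 125.0 / 25.0

print(f"ratio predicted by F*V^2: {predicted:.1f}x")
print(f"ratio of quoted power figures: {measured:.1f}x")
```

The quoted figures differ by a smaller factor than F*V^2 alone would predict. One possible reading, offered here as an assumption rather than a claim from the slides, is that part of the chip's power does not scale with frequency and voltage in this simple way.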

9 Other platforms 1/3: Intel i7. Heavy processor: power consumption, high performance. Versatile due to Hyperthreading, out-of-order execution, pipelining, and SSE.
Processor Number/ID: i7-860
Architecture: Nehalem
# of Cores: 4
# of Threads: 8
Clock Speed: 2.8 GHz
Cache Size: 8 MB
Lithography: 45 nm
Max TDP: 95 W
VID Voltage Range: 0.625 V-1.40 V
Processing Die Size: 296 mm^2
# of Processing Transistors on Die: 774 million

10 Other platforms 2/3: Intel Atom. x86-64. Ultra-low voltage. Mainly used in embedded devices and smartphones.
Processor Number/ID: D525
Architecture: -
# of Cores: 2
# of Threads: 4
Clock Speed: 1.80 GHz
Cache Size: 512 KB
Lithography: 45 nm
Max TDP: 13 W
VID Voltage Range: 0.800 V-1.175 V
Processing Die Size: 87 mm^2
# of Processing Transistors on Die: 176 million

11 Other platforms 3/3: Nvidia ION2. System/motherboard platform. Uses an Intel Atom processor. Uses a GeForce 305M/310M (GT218). Intended as GPGPU. Supports CUDA and OpenCL. A Pinetrail 12 netbook platform was used in this paper.
ION Series: ION2
GPU Number: GT218
# of CUDA Cores: 16
Clock Speed: 475 MHz
Memory: 256 MB
Memory Bus Width: 64-bit
Power Consumption: 12 W

12 Applications used 1/2. Timed and power-measured applications. Iterative Jacobi: low communication (border exchange) after initial distribution and final collection; a typical example of a scientific application. NAMD: highly scalable molecular dynamics simulation; small memory footprint, representing dynamic and complicated scientific applications.
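To make the Jacobi workload concrete, here is a minimal serial sketch of an iterative Jacobi solver on a 2D grid. It is not the benchmark code used in the paper; the grid size, boundary values, tolerance, and the function names `jacobi_step` and `jacobi` are all assumptions chosen for illustration.

```python
def jacobi_step(grid):
    """One Jacobi sweep: each interior cell becomes the mean of its 4 neighbours."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary cells are kept fixed
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def jacobi(grid, tol=1e-6, max_iters=10_000):
    """Repeat sweeps until the largest cell change drops below tol."""
    for iteration in range(1, max_iters + 1):
        new = jacobi_step(grid)
        delta = max(abs(new[i][j] - grid[i][j])
                    for i in range(len(grid)) for j in range(len(grid[0])))
        grid = new
        if delta < tol:
            return grid, iteration
    return grid, max_iters


# Hypothetical 5x5 problem: top edge held at 1.0, all other cells start at 0.0.
start = [[1.0] * 5] + [[0.0] * 5 for _ in range(4)]
solution, iterations = jacobi(start)
print(f"center value after {iterations} sweeps: {solution[2][2]:.4f}")
```

Each cell only reads its four direct neighbours, so a parallel version that gives every worker a block of the grid would only need to exchange the cells on block borders between sweeps, which matches the "border exchange" note on the slide.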

13 Applications used 2/2. NQueens: typical state space search application; an integer program, as opposed to floating-point problems. CG: iterative, highly parallel, communication-heavy application from the NAS Parallel Benchmarks; not available for the GPGPU system (ION2). Integer Sort: a radix sort implementation; highly parallelizable, with few floating-point operations.
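As an illustration of what a state space search looks like, the sketch below counts N-Queens solutions with a depth-first search. It is a serial toy example, not the CHARM++ benchmark from the paper, and the function name `count_n_queens` is hypothetical.

```python
def count_n_queens(n):
    """Count placements of n non-attacking queens via depth-first state space search."""

    def place(row, cols, diag1, diag2):
        if row == n:
            return 1  # every row holds a queen: one complete solution
        total = 0
        for col in range(n):
            # Skip squares attacked along a column or either diagonal.
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue
            total += place(row + 1,
                           cols | {col},
                           diag1 | {row - col},
                           diag2 | {row + col})
        return total

    return place(0, frozenset(), frozenset(), frozenset())


print(count_n_queens(8))  # prints 92
```

The search uses only integer comparisons and set lookups, which is the property the slide highlights. The subtrees under different first-row placements are independent of each other, which is what makes this kind of search easy to split across cores.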

14 Evaluation results. Platforms presented one by one. Focus: parallelism -> power and energy consumption.

15 Results 1/7: Intel SCC Speedup. [Figure: speedup plot; times listed with it: 24.60 s, 28.00 s, 32.70 s, 4.91 s - 32 cores.] CG: fine-grained global communication in the algorithm.
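For readers unfamiliar with the metric, speedup on p cores is conventionally the single-core run time divided by the p-core run time, and parallel efficiency is speedup divided by p. The sketch below applies those standard definitions; the timings are hypothetical placeholders, not measurements from the paper, and `speedup_table` is an illustrative name.

```python
def speedup_table(times_by_cores):
    """Map core count -> (speedup, efficiency), relative to the 1-core run time."""
    base = times_by_cores[1]
    return {cores: (base / t, base / t / cores)
            for cores, t in sorted(times_by_cores.items())}


# Hypothetical run times in seconds, keyed by number of cores used.
timings = {1: 96.0, 2: 48.0, 4: 24.0, 8: 16.0}

for cores, (speedup, efficiency) in speedup_table(timings).items():
    print(f"{cores:2d} cores: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

An application with fine-grained global communication would show efficiency falling as cores are added, as in the last row of this made-up data.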

16 Results 2/7: Intel SCC Power. Communication bound: processors stall.

17 Results 3/7: Intel SCC Energy. Energy = Power * Time_consumed. Normalized Energy = Energy_#cores / Energy_n, where Energy_x is the energy consumed when running on x cores and n is the total number of cores on the platform.
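The two definitions on this slide can be written out as a short calculation. This is a sketch under stated assumptions: the power and time values are invented placeholders rather than results from the paper, power is treated as an average over the run, and the names `energy_joules` and `normalized_energy` are hypothetical.

```python
def energy_joules(power_watts, time_seconds):
    """Energy = Power * Time_consumed, with power taken as the average over the run."""
    return power_watts * time_seconds


def normalized_energy(runs, total_cores):
    """Energy at each core count divided by the energy when all cores are used.

    runs maps core count -> (average power in watts, run time in seconds).
    """
    full = energy_joules(*runs[total_cores])
    return {cores: energy_joules(power, time) / full
            for cores, (power, time) in sorted(runs.items())}


# Hypothetical measurements on a 4-core platform.
runs = {1: (30.0, 100.0), 2: (35.0, 52.0), 4: (45.0, 28.0)}

for cores, value in normalized_energy(runs, total_cores=4).items():
    print(f"{cores} cores: normalized energy {value:.2f}")
```

With this normalization the full-platform run is always 1.0, and a value above 1.0 means that using fewer cores cost more energy than using all of them.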

18 Results 4/7: Intel i7

19 Results 4/7: Intel i7. Energy reduction not as large as in the SCC case.

20 Results 5/7: Intel Atom

21 Results 6/7: Intel Atom. <1 W increase per thread added. More threads == less energy consumption, but not as efficiently as on the i7.

22 Results 7/7: Nvidia ION2. All 16 CUDA cores. Some applications could not be run or did not perform as well as expected: NAMD could be optimized better, but the effort was deemed too much; Sort is not fit for GPGPUs; CG was not tested due to its algorithmic need for data transfers. Relative Energy == normalized energy required by the Atom.

23 Load Balancing 1/5. # of cores per chip increasing => load balancing becomes more important and more challenging. The LBTest benchmark in CHARM++ is used, with RefineLB as the balancer. LBTest creates a 3D mesh graph to facilitate communication and spread the work. RefineLB attempts to balance by removing work from overloaded threads.
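The idea of balancing by moving work off overloaded threads can be sketched as a simple greedy refinement. This is not the CHARM++ RefineLB implementation: the tolerance value, the data layout, and the function name `refine_balance` are assumptions made for illustration only.

```python
def refine_balance(assignment, tolerance=1.05):
    """Move tasks from the most loaded worker to the least loaded one.

    assignment maps worker id -> list of positive task loads. A worker counts
    as overloaded when its load exceeds tolerance * average load.
    """
    tasks = {worker: list(loads) for worker, loads in assignment.items()}
    load = {worker: sum(loads) for worker, loads in tasks.items()}
    limit = tolerance * sum(load.values()) / len(tasks)

    while True:
        src = max(load, key=load.get)
        if load[src] <= limit:
            break  # nobody is overloaded any more
        dst = min(load, key=load.get)
        # Largest task on src that fits on dst without overloading it.
        movable = [t for t in sorted(tasks[src], reverse=True)
                   if load[dst] + t <= limit]
        if not movable:
            break  # no single move helps; stop refining
        task = movable[0]
        tasks[src].remove(task)
        tasks[dst].append(task)
        load[src] -= task
        load[dst] += task
    return tasks


# Hypothetical starting point: worker 0 holds most of the work.
before = {0: [4, 3, 3, 2], 1: [1], 2: [1], 3: [2]}
after = refine_balance(before)
print({worker: sum(loads) for worker, loads in after.items()})
```

Only tasks on overloaded workers are moved, so workers that are already balanced keep their tasks. That is the appeal of a refinement strategy compared with reassigning everything from scratch.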

24 Load Balancing 2/5: SCC. Graph => utilization of threads during runtime. Only 35% load balanced.

25 Load Balancing 3/5: SCC. 50% load balancing achieved on the 48 cores.

26 Load Balancing 4/5: i7. Graph => utilization of threads during runtime. 59% load balanced.

27 Load Balancing 5/5: i7. 99% load balancing achieved. LB improves the performance of the benchmarks by 30% for the SCC and 45% for the i7.

28 Architecture Comparison 1/6. Important to remember: the SCC is a research chip, not an optimized production machine like the others. Source code is identical for all platforms except ION2. A reasonably tuned CG was not found for ION2.

29 Architecture Comparison 2/6: ION2. ION2 outperforms the others on Jacobi and NQueens, but not on NAMD or Sort. Jacobi and NQueens have highly parallel memory access and communication patterns.

30 Architecture Comparison 3/6: i7. The i7 outperforms the others on NAMD, CG, and Sort. Heavy-weight multi-cores are therefore attractive for applications with complex execution/control flows, where the architecture supports the following: SSE, prediction/speculation, a high degree of ILP, and high floating-point performance.

31 Architecture Comparison 4/6: SCC. SCC speedup not that good, like the Atom: sequential performance, floating point. Atom: low power => slower.

32 Architecture Comparison 5/6: Power. The SCC falls between the two low-power platforms and the Intel i7.

33 Architecture Comparison 6/6: Energy. i7: energy efficient. Atom: less efficient. SCC: still competitive. Exception: CG (communication).

34 Related work. First paper to compare the same benchmarks across multi-core, many-core, and GPGPU architectures without changing the source code. Power management: DVFS. Porting Charm++ on top of communication libraries may result in performance improvements.

35 Conclusion. Important factors affecting choice: speed, power, energy, programmability, portability.

36 Conclusion.
Intel SCC: research chip; many core; lower power than heavy-weight processors; faster than low-power processors; no generality or portability issue; sequential performance.
Intel i7: dynamic & complicated applications; irregular access.
GPGPU: powerful for many applications; high programming effort.

37 Conclusion. No single best solution: it depends on applications and goals.

Ehsan Totoni Babak Behzad Swapnil Ghike Josep Torrellas
