Thread and Data parallelism in CPUs - will GPUs become obsolete?

1 Thread and Data parallelism in CPUs - will GPUs become obsolete?
Carsten Trinitis (Carsten.Trinitis@tum.de)
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Institut für Informatik, Technische Universität München, Germany
USP, Sao Paulo, March 25, 2011

2 Microprocessor

4 Graphics Processor

5 The Future: 22nm Intel Sample...

6 Evolution of Architectures (Intel Tera-Scale Computing Research Program)
- Dual core: symmetric multithreading
- Multi-core array: CMP with ~10 cores
- Many-core array: CMP with 10s-100s low power cores; scalar cores; capable of TFLOPS+; full System-on-Chip; servers, workstations, embedded
- Large, scalar cores for high single-thread performance
- Scalar plus many core for highly threaded workloads

7 Goals of Parallel Computing
- Reduction of applications' execution time
- Increased extensibility and configurability
- Natural organization for systems with special purpose processors
- Numerical applications!
- Scientific interest

8 Parallelism for Performance
Processor:
- Bit-level: up to 128 bit
- Instruction-level: pipelining, functional units
- Latency gets very important: branch prediction
- Toleration of latency: hyper-threading
Multiprocessors on a chip
Multiple processors
Multiple nodes...
Data parallelism!

9 Classification
Parallel Systems
- SIMD: Array, Vector
- MIMD:
  - Distributed Memory: MPP, NOW, Cluster
  - Shared Memory: UMA, NUMA

10 What to expect in High Performance Computing
Accelerators will become increasingly important! Hence: adaptation of code to
- Many-core architectures, and
- Vector registers (AVX, LRB, ...)
Currently/soon under investigation: Intel Parallel Building Blocks, comprising
- ArBB, formerly known as Ct, RapidMind...
- CILK+ (CEAN + CILK)
- TBB (...)

11 Knights Ferry

13 Intel MIC
- 32 cores, 1.2GHz / core
- 8MB shared L2 cache
- 1-2GB GDDR5 RAM
- Cores directly support vector instructions on 512-bit registers, e.g. Multiply-Add, Scatter/Gather
- Prototypes available
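To make the vector operations named on this slide concrete, here is a minimal pure-Python model of what multiply-add, gather and scatter do on one 512-bit register. It is only a sketch: the function names `fmadd`, `gather` and `scatter` and the sample data are illustrative assumptions, not the MIC instruction set or any Intel API, and the code runs lane by lane rather than in hardware parallel. The only fact taken from the slide is the 512-bit register width, which holds 16 single-precision values.

```python
# Pure-Python model of 512-bit vector operations (illustrative only, not real SIMD code).
LANES = 512 // 32  # 16 single-precision (32-bit) lanes fit into one 512-bit register


def fmadd(a, b, c):
    """Lane-wise multiply-add: returns a[i] * b[i] + c[i] for every lane."""
    assert len(a) == len(b) == len(c) == LANES
    return [x * y + z for x, y, z in zip(a, b, c)]


def gather(memory, indices):
    """Load one value per lane from arbitrary (non-contiguous) memory positions."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Store one value per lane to arbitrary (non-contiguous) memory positions."""
    for i, v in zip(indices, values):
        memory[i] = v


# Hypothetical data: 64 floats, accessed with a stride of 3.
memory = [float(i) for i in range(64)]
idx = [3 * i for i in range(LANES)]

a = gather(memory, idx)        # a[i] = memory[3 * i]
b = [2.0] * LANES
c = [1.0] * LANES
result = fmadd(a, b, c)        # result[i] = 2 * memory[3 * i] + 1
scatter(memory, idx, result)   # write the results back to the strided positions
```

On real hardware each of these three steps would map to a single vector instruction operating on all 16 lanes at once, which is where the data-parallel speedup comes from.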

14 Intel MIC
- Fully x86 compatible
- C/C++, Fortran
- Intel tools & libraries
- Shipped to selected customers/
Taken from Intel ISC 2010, Hamburg, Germany

15 SuperMUC: LRZ's next generation High Performance Computer
- Manufactured by IBM
- Powered by Intel chips: Sandy Bridge
- Plenty of cores :)
- Novel hot water cooling technology
- Supposed to consume 40 percent less energy
- Heating is accomplished through the cooling system!

16 SuperMUC: LRZ's next generation High Performance Computer
- Estimated costs: 135 Million Euros
- Supposed to start operation mid 2012
- Peak Performance: 3 Petaflop/s (1 Petaflop/s = 10^15 FLOPS)

17 SuperMUC: LRZ's next generation High Performance Computer
- Fastest Computer in Germany in 2011/
- Nodes with 2 Intel Sandy Bridge EP CPUs
- 209 Nodes with 4 Intel Westmere EX CPUs
- 9623 Nodes with total CPUs / Cores
- 3 PetaFLOP/s Peak Performance

18 SuperMUC: LRZ's next generation High Performance Computer
- 327 TB Memory
- Infiniband Interconnect
- Large File Space for multiple purposes:
  - 10 PetaByte File Space based on IBM GPFS with 200 GigaByte/s aggregated I/O Bandwidth
  - 2 PetaByte NAS Storage with 10 GigaByte/s aggregated I/O Bandwidth
- No GPGPUs or other Accelerator Technology

19 Controlled Component- and Assembly-Level Optimization of Industrial Devices

20 The CASOPT Project
- 7th Framework Program (FP7) of the European Union
- Marie Curie IAPP
- 1.3 Million Euros
- Partners: TUM (Germany), TUG (Austria), UoC (UK), and ABB Corporate Research (Switzerland)
- Secondment & Recruitment

21 Test Case: HV GIS Disconnector
- Three phases, six identical contacts.
- Shielding electrodes are subject to shape optimization.
- Test case: one contact at 1050 Volts (red), the remaining five contacts at 0 Volts (blue).

22 Test Case: HV GIS Disconnector
- Field distribution on initial model: rather heterogeneous!
- Maximum: 35.2 V/m

24 Test Case: HV GIS Disconnector
- Red curve made of segments and arcs determines shape to be optimized.
- Lengths of segments and radii of arcs form design parameters.

27 Dimensions and lower / upper bounds

28 Computing Environment
Computations carried out by three modules in local network:
- Mesher (Pro/E): x86, Intel Core i5 3.33GHz, 8GB
- Optimizer: x86, Intel Pentium M 1.6GHz, 1.2GB
- Solver (POLOPT): x86 Cluster, AMD single core Opteron processor, 2.4GHz, 2.0GB/processor

29 Speedup Curve on Local Grid

30 Relative Runtime on Local Grid
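The two plots are not recoverable from the transcript, but the quantities they show can be sketched. The snippet below assumes the usual definitions: speedup as serial runtime divided by parallel runtime, and relative runtime as the inverse ratio. Whether the slides used exactly these definitions is an assumption, and the timings are hypothetical placeholders, not measurements from the talk.

```python
def speedup(t_serial, t_parallel):
    """Speedup S(p) = T(1) / T(p)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Parallel efficiency E(p) = S(p) / p; 1.0 means ideal scaling."""
    return speedup(t_serial, t_parallel) / processors


def relative_runtime(t_serial, t_parallel):
    """Runtime relative to the single-processor run, T(p) / T(1)."""
    return t_parallel / t_serial


# Hypothetical runtimes in seconds, keyed by processor count (NOT data from the talk).
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}

for p, t in sorted(timings.items()):
    s = speedup(timings[1], t)
    e = efficiency(timings[1], t, p)
    r = relative_runtime(timings[1], t)
    print(f"p={p}: speedup={s:.2f}, efficiency={e:.2f}, relative runtime={r:.2f}")
```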

31 Optimization: Covariance Matrix Adaptation (CMA-ES)
Number of iterations: 260
Number of evaluations: 3900
Number of analyses: 3012
Total runtime (h):
Initial fitness value:
Final fitness value: 31.2
Fitness reduction: 11.34%
Sigma: 6.15 *
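The figures above imply 3900 / 260 = 15 fitness evaluations per iteration. As a rough illustration of how such an optimization loop is organized, the sketch below implements a deliberately simplified evolution strategy: isotropic Gaussian sampling, recombination of the better half, and a crude step-size rule. It is not CMA-ES, since it adapts no covariance matrix, and it is not the optimizer used in CASOPT. The `sphere` test function, the bounds, the starting point and the step-size factors are all hypothetical; only the population size of 15 and the 260 iterations mirror the slide.

```python
import random


def sphere(x):
    """Hypothetical stand-in fitness; the real objective was the maximum field strength."""
    return sum(v * v for v in x)


def clip(x, lower, upper):
    """Keep every design parameter inside its lower / upper bound."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]


def simple_es(fitness, x0, lower, upper, sigma=0.5, lam=15, iterations=260, seed=0):
    """Simplified (mu/mu, lambda) evolution strategy with isotropic sampling."""
    rng = random.Random(seed)
    mu = lam // 2                      # number of parents kept per iteration
    mean = list(x0)
    best_x, best_f = list(x0), fitness(x0)   # starting point, not counted below
    evaluations = 0
    for _ in range(iterations):
        population = []
        for _ in range(lam):
            candidate = clip([m + sigma * rng.gauss(0.0, 1.0) for m in mean],
                             lower, upper)
            population.append((fitness(candidate), candidate))
            evaluations += 1
        population.sort(key=lambda pair: pair[0])   # minimization: best first
        if population[0][0] < best_f:
            best_f, best_x = population[0]
            sigma *= 1.1               # progress: widen the search slightly
        else:
            sigma *= 0.9               # no progress: narrow the search
        parents = [cand for _, cand in population[:mu]]
        mean = [sum(column) / mu for column in zip(*parents)]
    return best_x, best_f, evaluations


start = [3.0, -2.0, 1.5, -1.0]
best_x, best_f, n_evals = simple_es(sphere, start, [-5.0] * 4, [5.0] * 4)
print(n_evals, best_f <= sphere(start))
```

In the setup described on the slides, each fitness evaluation would correspond to a mesher and field-solver run rather than a closed-form function, which is why the number of evaluations dominates the total runtime.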

32 Optimization: CMA-ES

34 Placema / autopin+
- Ongoing research project.
- Keep in mind that pinning can be crucial (see last year's talk :) )!
- Depends on system architecture and cache hierarchy.
- Use of autopin tool: pins threads to specific processor cores.
- New joint project DFG / ANR.
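For readers unfamiliar with pinning, the following sketch shows the basic operating-system mechanism: restricting the caller to a single core through the Linux CPU affinity interface exposed by Python's `os` module. It does not reproduce how autopin works or how it selects cores; the helper name `pin_to_core` and the choice of the lowest allowed core are assumptions made for illustration, and the calls exist only on platforms that provide `os.sched_setaffinity` (Linux, for example).

```python
import os


def pin_to_core(core_id):
    """Pin the caller to one core; return the new affinity set, or None if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return None                      # platform without the affinity interface
    os.sched_setaffinity(0, {core_id})   # pid 0 refers to the caller
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)    # cores the caller may currently run on
    target = min(allowed)                # hypothetical choice: lowest allowed core
    pinned = pin_to_core(target)
    os.sched_setaffinity(0, allowed)     # restore the original affinity mask
else:
    allowed, target, pinned = None, None, None

print("pinned to:", pinned)
```

Which core is the right target depends on the cache hierarchy and socket layout of the machine, which is the decision a tool has to make per system.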

35 Placema / autopin+
- Focus on so-called performance dominating factors: OS interaction, program phases, cache effects, etc.
- Attempt to get car industry on board (PAM-Crash).

36 Conclusions and Outlook
New levels of parallelism show up:
- Data parallelism (SSE, AVX, KNF, ...)
- Thread parallelism (multi- and many-core, multi-socket)
- Distributed memory: across cluster nodes, through dedicated interconnect / topology
Two ongoing research projects:
- CASOPT (D-A-CH-GB) within IAPP
- autopin
