SUPPORT FOR ADVANCED COMPUTING IN PROJECT CHRONO

1 SUPPORT FOR ADVANCED COMPUTING IN PROJECT CHRONO
Dan Negrut
Vilas Associate Professor, NVIDIA CUDA Fellow
Simulation-Based Engineering Lab, University of Wisconsin-Madison
December 9, 2015

2 Acknowledgement
Funding for Project Chrono comes from US Army TARDEC
  Ends in September 2016 [looking for organizations to partner with for transfer of technology and joint projects]

3 Overview
Part 1: discussion of two trends in computing
Part 2: how we position Chrono to accommodate future trends in computing

4 The Price of 1 Gflop/second
1961: combine 17 million IBM 1620 computers
  At $64K apiece, when adjusted for inflation, this would cost $8.3 trillion
2000: About $1, : 8 cents
[Wikipedia]

5 "The inside of a computer is as dumb as hell but it goes like mad." --Richard Feynman

6 Adopting a Positive Outlook
The inside of a computer goes like mad but needs some hand-holding.

7 First Trend Discussed Here: Memory Speed
3D Memory: a major breakthrough

8 Basic Fact, Speed of Execution: Math Doesn't Matter

    #include <cmath>

    void somefunction(double* a, double* b, unsigned int arrsize) {
        double dummy[3];
        dummy[0] = sin(a[1]);
        dummy[1] = log(fabs(a[2])) + sqrt(2. + dummy[0]);
        dummy[2] = cos(b[1]) + exp(b[0]);
        a[0] = dummy[1];
        b[0] = dummy[2];
        // and so on...
    }

The same function with the math stripped away, leaving only the memory accesses:

    void somefunction(double* a, double* b, unsigned int arrsize) {
        double dummy[3];
        // dummy[0] <- a[1]
        // dummy[1] <- a[2]
        // dummy[2] <- b[1] and b[0]
        // and so on...
    }

9 Why Math Operations Don't Count
Memory speed almost always dictates the performance of a computation
  One transaction to GPU global memory: 400 clock cycles
  32 fused multiply-add operations (c = α c + b), i.e., 64 operations: 1 clock cycle
Bottom line: 100X more expensive to move data to where it is needed
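The two figures on this slide can be turned into a back-of-the-envelope cost model. The sketch below is only an illustration: the helper names `estimated_cycles` and `flops_per_transaction` are hypothetical, and the model assumes memory and arithmetic costs simply add up, with no overlap and no caching.

```cpp
#include <cassert>
#include <cstdint>

// Figures quoted on the slide, treated here as fixed constants.
constexpr std::uint64_t kCyclesPerTransaction = 400; // one GPU global-memory transaction
constexpr std::uint64_t kFlopsPerCycle = 64;         // 32 fused multiply-adds = 64 operations

// Hypothetical helper: cycles spent if memory transactions and arithmetic
// are simply summed (no overlap, no caching; a deliberate simplification).
std::uint64_t estimated_cycles(std::uint64_t mem_transactions, std::uint64_t flops) {
    // Guard against overflow in the multiplication and the rounding below.
    assert(mem_transactions < (1ULL << 50) && flops < (1ULL << 62));
    // Round the arithmetic cost up to whole cycles.
    const std::uint64_t math_cycles = (flops + kFlopsPerCycle - 1) / kFlopsPerCycle;
    return mem_transactions * kCyclesPerTransaction + math_cycles;
}

// Arithmetic operations that fit in the time of one memory transaction.
constexpr std::uint64_t flops_per_transaction() {
    return kCyclesPerTransaction * kFlopsPerCycle;
}
```

Under these assumptions, `estimated_cycles(3, 64)` charges 1200 cycles to three memory transactions and a single cycle to the 64 arithmetic operations, which is the imbalance the slide is pointing at.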

10 Memory Speed: Hard Nut to Crack
Historically, memory speed has increased at a rate of approx. 1.07/year
Historically, processors improved at faster rates
  1.25/year ( )
  1.52/year ( )
  1.20/year ( )
Growing gap between memory speed and processing speed
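To see how quickly these yearly rates pull apart, the gap can be compounded over a number of years. This is a minimal sketch; the function name `processor_memory_gap` is hypothetical, and it assumes both rates stay constant over the whole period, which the slide itself does not claim.

```cpp
#include <cassert>
#include <cmath>

// Factor by which processor speed outgrows memory speed after `years`,
// assuming both improve by a constant yearly factor (a simplification).
double processor_memory_gap(double processor_rate, double memory_rate, int years) {
    assert(processor_rate > 0.0 && memory_rate > 0.0 && years >= 0);
    return std::pow(processor_rate / memory_rate, years);
}
```

For example, `processor_memory_gap(1.52, 1.07, 10)` comes out at a bit over 30: ten years at the fastest processor rate quoted above leave memory more than an order of magnitude behind.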

11 Memory Speed: Widening of the Processor-DRAM Performance Gap
[Figure: processor vs. DRAM performance over time. Courtesy of Elsevier, Computer Architecture, Hennessy and Patterson, fourth edition]

12 3D Stacked Memory [future looks quite bright]
SK Hynix's High Bandwidth Memory (HBM)
  Developed by AMD and SK Hynix
1st Generation (HBM1) introduced in AMD Fiji GPUs
  1 GB & 128 GB/s per stack
  AMD Radeon R9 Fury X: four stacks, 4 GB & 512 GB/s [AMD]
2nd Generation (HBM2) will be used in NVIDIA Pascal and AMD Arctic Islands GPUs
  2 GB & 256 GB/s bandwidth per stack
  NVIDIA Pascal reported to have 1 TB/s memory bandwidth
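The per-stack and per-card figures above are related by simple multiplication, sketched below. The struct `MemoryConfig` and function `aggregate_hbm` are hypothetical names, and the four-stack HBM2 configuration used in the note afterwards is an assumption, not something the slide states.

```cpp
#include <cassert>

// Totals for a card built from identical memory stacks.
struct MemoryConfig {
    int capacity_gb;     // total capacity in GB
    int bandwidth_gb_s;  // aggregate bandwidth in GB/s
};

// Capacity and bandwidth both scale linearly with the number of stacks.
MemoryConfig aggregate_hbm(int stacks, int gb_per_stack, int gb_s_per_stack) {
    assert(stacks > 0 && gb_per_stack > 0 && gb_s_per_stack > 0);
    return MemoryConfig{stacks * gb_per_stack, stacks * gb_s_per_stack};
}
```

`aggregate_hbm(4, 1, 128)` reproduces the Fury X numbers (4 GB, 512 GB/s). If a Pascal card also carried four stacks, `aggregate_hbm(4, 2, 256)` would give 1024 GB/s, consistent with the reported 1 TB/s.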

13 3D Stacked Memory [leap in technology]
[Figure: old vs. emerging memory packaging. Source: AMD]

14 3D Stacked Memory
More power efficient
  Electrons move shorter distances
  Less power wasted moving data
Smaller memory footprint
  More memory can be packed into the same space
[AMD]

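16 Memory Speeds in a CPU-GPU System
[Diagram: CPU core, cache, system memory, GPU cores, GPU memory, and an InfiniBand link to the next node, each connection labeled with bandwidth and latency. Labels that survive: CPU core to cache, 80 GB/s, latency low; InfiniBand to next node, 6 GB/s, latency high; 8-16 GB/s, latency medium; the GPU-side bandwidth figures are missing.]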

17 Second Trend Discussed Here: Moore's Law
Number of transistors per unit area has been steadily going up
ILP and clock speed have stagnated

18 Intel Roadmap, and Relevance to Us
nm Tick: Ivy Bridge, Tock: Haswell
nm Tick: Broadwell, Tock: Skylake
nm Refresh: Kaby Lake
nm Tick: Cannonlake (delayed to 2nd half of 2017)
nm
nm 2023??? (carbon nanotubes?)
Happening now: Moore's law moving from month cycle to month cycle for the first time in 50 years

19 Transistor Densities Still Going Up
Although not as fast as before, transistor densities are still going up
Consequence: lots of cores in one chip
  CPU cores: 18 today, probably 32 in two to three years
  GPU scalar processors: 3,000 today (Maxwell), probably 4,500 in two years (Pascal)
  Intel Xeon Phi: 61 today, very likely close to 200 in two years

20 Parallel Computing: Some Black Spots
More transistors = more computational units
  2015 vintage: 18-core Xeon Haswell-EX E V3, 5.6 billion transistors ($7200)
Black silicon (commonly called dark silicon): owing to high density and power leaks, we are not able to fully power these chips
  Black silicon: transistors that today don't get used and are dead weight
Dennard scaling started to break down at the end of the last decade
  Dennard's law is the secret sauce for Moore's law

21 Lots of Cores: There's More Than Meets the Eye
Solutions rarely scale beyond 20 cores when using shared memory
  Cache coherence and NUMA slow things down
  I have 32 cores and only see the net effect of 20 of them???
  What if I have one workstation with four sockets, i.e., 128 cores: can I only scale up to 20???
It looks like shared memory solutions don't scale well

22 Scaling, with Lots of Cores: Via Distributed Memory
To scale, a different parallel programming paradigm is needed: distributed memory
  Distributed memory eliminates cache coherence issues
  Also good since you can solve very large problems: lots of memory available to the user
Why not always do this?
  A distributed memory solution calls for a major code rewrite
  If not implemented well, a distributed memory solution has high data access latencies: 1000X higher than accessing memory on a workstation

23 Distributed Memory, Good for Long Run Though
Five to six years from now, it's not clear what will replace Moore's law
  No technology yet to continue the past steady increase in core count
  Can't improve speeds any further by using more cores on one workstation
Distributed memory is the path towards running by drawing on multiple workstations
  Called nodes

24 Project Chrono Goal Solve one billion degrees of freedom by the time we get together in December

25 Positioning Chrono for Advanced Computing
Cluster: group of nodes communicating through fast interconnect
Node: group of processors communicating through shared memory
Coprocessors/Accelerators: special compute devices attached to the local node through special interconnect
Socket: group of cores communicating through shared cache
Core: group of functional units communicating through registers
Hyper-Threads: group of thread contexts sharing functional units
Superscalar: group of instructions sharing functional units
Pipeline: sequence of instructions sharing functional units
Vector: single instruction using multiple functional units
[Intel]
We have full control
We have little to no control

26 HPC in Computational Dynamics: Is MPI the Way to Go?
Applications are getting more sophisticated
  Multi-scale, multi-module, multi-physics
The traditional approach based on MPI is not attractive
  Working on 100s of nodes is run of the mill in HPC
  Load imbalance emerges as a big issue for some apps

31 MPI or Charm++?
Charm++ is a generalized approach to writing parallel programs
  An alternative to the likes of MPI, Chapel, UPC, etc.
Charm++, three facets
  A style of writing parallel programs
  An ecosystem that facilitates the act of writing parallel programs
    Debugger, profiler, ability to define own load balancing, etc.
  A runtime system

32 Charm++ Attribute: Overdecomposition
Decompose the work units & data units into many more pieces than execution units
  Cores/Nodes/..
Why do this?
  Central idea: oversubscription of the hardware
  Hide memory latency w/ useful execution
  This oversubscription idea is a general tenet
    Done by the GPU
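The idea of overdecomposition can be illustrated without any Charm++ code. The sketch below is plain C++ and does not use or imitate the Charm++ API; the function names, the `factor` parameter, and the greedy placement rule are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Split `total_items` into `pes * factor` chunks of near-equal size,
// i.e. `factor` times more pieces than there are execution units.
std::vector<std::size_t> overdecompose(std::size_t total_items,
                                       std::size_t pes, std::size_t factor) {
    assert(pes > 0 && factor > 0);
    const std::size_t chunks = pes * factor;
    std::vector<std::size_t> sizes(chunks, total_items / chunks);
    // Spread the remainder over the first few chunks.
    for (std::size_t i = 0; i < total_items % chunks; ++i) ++sizes[i];
    return sizes;
}

// Greedy placement: each chunk goes to the currently least-loaded PE.
// `cost[i]` is a hypothetical cost estimate for chunk i.
std::vector<double> place_on_least_loaded(const std::vector<double>& cost,
                                          std::size_t pes) {
    assert(pes > 0);
    std::vector<double> load(pes, 0.0);
    for (double c : cost) {
        *std::min_element(load.begin(), load.end()) += c;
    }
    return load;
}
```

With one uneven chunk per PE, costs `{5, 3}` on two PEs stay at loads 5 and 3. With more and smaller chunks, costs `{4, 1, 1, 1, 1}` on two PEs even out to 4 and 4, which is why having many more pieces than execution units gives a load balancer room to work.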

33 Charm++ Attribute: Migratability
Make the work and data units on the previous slide migratable at runtime
  That is, the programmer or runtime can move them from execution unit (PE, from processing element) to execution unit
  From PE to PE, that is
Consequences for the app developer
  Communication must now be addressed to logical units with global names, not to physical processors
  But this is a good thing
Consequences for the runtime system (RTS)
  Must keep track of where each unit is
  Naming and location management
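The naming and location bookkeeping described above can be sketched as a small lookup table. This is a toy single-process model, not the Charm++ runtime or its API; the class `LocationManager`, its methods, and the string-based names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Toy location manager: senders address logical names, and the manager
// resolves each name to whichever PE currently hosts that unit.
class LocationManager {
public:
    explicit LocationManager(std::size_t num_pes) : inbox_(num_pes) {
        assert(num_pes > 0);
    }

    // Register a new unit under a global name on a given PE.
    void create(const std::string& name, std::size_t pe) {
        assert(pe < inbox_.size());
        location_[name] = pe;
    }

    // Move an existing unit; senders do not need to be told.
    void migrate(const std::string& name, std::size_t new_pe) {
        assert(new_pe < inbox_.size() && location_.count(name) == 1);
        location_[name] = new_pe;
    }

    // Deliver a message by logical name; returns the PE that received it.
    std::size_t send(const std::string& name, const std::string& message) {
        const std::size_t pe = location_.at(name);
        inbox_[pe].push_back(name + ": " + message);
        return pe;
    }

    std::size_t delivered_to(std::size_t pe) const { return inbox_.at(pe).size(); }

private:
    std::unordered_map<std::string, std::size_t> location_;
    std::vector<std::vector<std::string>> inbox_;
};
```

A sender that calls `send("block_7", ...)` before and after `migrate("block_7", 3)` uses the same call both times; only the manager's table changes, which is the sense in which addressing logical units is "a good thing" for the application developer.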

36 HMMWV on Deformable Terrain. Year:

37 Chrono GPU: HMMWV on Deformable Terrain. Year:

38 HMMWV on Discrete Terrain
k rigid spheres
  Length of simulation: 15 seconds
  Hardware used: CPU (Intel), multicore, based on OpenMP
  Integration time step: 0.001s
  Velocity Based Complementarity
  17 seconds per time step
  Simulation time: ~2.5 days
2015
  ~1.5 million rigid spheres
  Length of simulation: 15 seconds
  Hardware: GPU (NVIDIA) Tesla K40X
  Integration time step: s
  Position Based Dynamics
  0.3 seconds per time step
  Simulation time: ~2.5 hours
2015 simulation: although 5X more bodies, runs about 25 times faster

40 4-wide add operation (SSE 1.0)
C++ code:

    #include <xmmintrin.h>

    __m128 Add(const __m128 &x, const __m128 &y) {
        return _mm_add_ps(x, y);
    }

    __m128 z, x, y;
    x = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    y = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    z = Add(x, y);

Lane-wise result:

    x:  x3 x2 x1 x0
    +
    y:  y3 y2 y1 y0
    =
    z:  z3 z2 z1 z0

Compiled with: gcc -S -O3 sse_example.cpp

Assembly:

    _Z10AddRKDv4_fS1_:
        movaps (%rsi), %xmm0   # move y into SSE register xmm0
        addps  (%rdi), %xmm0   # add x to y and store the sum in xmm0
        ret                    # xmm0 is returned as result

[Hammad]
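The same intrinsic extends naturally from one 4-wide register to whole arrays. The sketch below is an illustration rather than code from Chrono: the function name `add_arrays_sse` is hypothetical, it assumes an x86 or x86-64 target with SSE, and it uses unaligned loads and stores so the caller does not need 16-byte aligned buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Element-wise z = x + y, processing four floats per instruction.
void add_arrays_sse(const float* x, const float* y, float* z, std::size_t n) {
    assert(x != nullptr && y != nullptr && z != nullptr);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128 vx = _mm_loadu_ps(x + i);     // unaligned load of 4 floats
        const __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(z + i, _mm_add_ps(vx, vy));  // 4 additions at once
    }
    // Scalar tail for lengths that are not a multiple of four.
    for (; i < n; ++i) z[i] = x[i] + y[i];
}
```

The main loop is the vector level of the hierarchy discussed earlier (a single instruction using multiple functional units), while the scalar tail handles whatever is left over. Modern compilers often generate similar code automatically at `-O3`, so explicit intrinsics are mostly worthwhile where auto-vectorization fails.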

41 Conclusions, Chrono::HPC
Moore's law reaching terminus in six years: distributed memory solutions all we have left
Looking at all of the above opportunities to speed up large simulations in Chrono
Large simulations in Chrono: billion degree of freedom dynamic systems
  Fluid-solid interaction
  Granular material (high/low saturation)
  Large nonlinear FEA
Aiming to present a billion-DOF simulation in Chrono at the Fall 2016 MaGIC meeting

42 Thank You.
