FVM - How to program the Multi-Core FVM instead of MPI

1 FVM - How to program the Multi-Core FVM instead of MPI
DLR, 15 October 2009
Dr. Mirko Rahn, Competence Center High Performance Computing and Visualization

2 Fraunhofer Institute for Industrial Mathematics (ITWM)
- Founded: 1996
- Staff: 400
- Revenue: 12 million
- Industry: 50%
- Fields: Modeling, Simulation, Optimization, Visualization, Scientific Computing, Data Analysis

3 Fraunhofer Institute for Industrial Mathematics (ITWM): Departments
- Transport Processes
- System Analysis
- Image Analysis
- Financial Mathematics
- Optimization
- CC-HPC
- Flow and Material Simulation
- Dynamics + Durability

4 Competence Center High Performance Computing
- Service-oriented Computing
- Parallelization and Benchmarking
- Seismic Imaging
- Visualization
- HPC Tools
- Hercules, PS3
- FhG-FS
- MIC: Multicore Innovation Center
[Figure: Scalability (128 clients); write and read throughput in MB/s versus number of servers]

5 Visualization beyond the GPU
- Parallel volume rendering
- Surface rendering: 56 million triangles

6 2010 : Interactive Photorealistic Rendering

7 Multicore Innovation Centre - MIC
- Operations research
- Industry applications: Oil and Gas**, Finance**, CFD...
- Numerical algorithms at the multicore barrier: new algorithms, new implementations
- Development of production-ready industry applications
- Programming models: PGAS, Fraunhofer VM, GraPA, PHASTGrid
- Cell-based cluster, next-generation Cell, other architectures (Larrabee, GPU)
- Embedded in projects with industry partners

8 PEGASUS versus Herkules
PEGASUS
- 70 QS22 Cell blades = 1120 cores
- Peak: 28 TFlop/s
- Power consumption: 18 kW
- Green500 2008: #1
- Top500 2008: #464
Herkules
- 270 blades = 1080 cores (dual-core Woodcrest)
- Peak: 10.3 TFlop/s
- Power consumption: 80 kW
- Green500: #33
- Top500: #134
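The comparison above can be turned into a rough peak-efficiency figure. The following is a minimal sketch that uses only the peak and power numbers quoted on the slide; the function name is hypothetical, and the result is a peak ratio, not the measured Linpack efficiency on which the Green500 ranking is based.

```python
def peak_gflops_per_watt(peak_tflops, power_kw):
    """Peak GFlop/s per watt from peak TFlop/s and power in kW."""
    gflops = peak_tflops * 1000.0   # TFlop/s -> GFlop/s
    watts = power_kw * 1000.0       # kW -> W
    return gflops / watts


# Figures as quoted on the slide (peak values, not measured ones).
pegasus = peak_gflops_per_watt(peak_tflops=28.0, power_kw=18.0)
herkules = peak_gflops_per_watt(peak_tflops=10.3, power_kw=80.0)
ratio = pegasus / herkules

print(f"PEGASUS:  {pegasus:.3f} GFlop/s per W (peak)")
print(f"Herkules: {herkules:.3f} GFlop/s per W (peak)")
print(f"Ratio:    {ratio:.1f}x")
```

On these peak figures the Cell cluster comes out roughly an order of magnitude ahead per watt.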

9 Application field: seismic imaging
PreStack: typical data volumes 1--30 TB

\[
\left( \nabla^2 - \frac{1}{c^2(\mathbf{x})} \frac{\partial^2}{\partial t^2} \right) u(\mathbf{x}, t) = 0,
\qquad \mathbf{x} = (x, y, z)
\]

Given: $u(x, y, z = 0, t)$
To find: $u(x, y, z, t = 0)$
Domain: $\{ (x, y, z) \in \mathbb{R}^3 : z \ge 0 \}$

10 MPI
[Diagram: Node 1 and Node 2, each with MPI processes, local memory and buffers]
- Communication implies synchronization
- Copy to buffer; MPI RMA via windows
- No global address space
- No fault tolerance
- Multicore: MPI/OpenMP or explicit thread management

11 Fraunhofer Virtual Machine
- Global address space
- Direct one-sided access
- No cycles for transfer
- No handshake -> perfect overlap
- MCTP: Multicore Threadpool Management
- Basic fault tolerance

12 FVM
API
- read and write global data
- send and receive messages (sync)
- send commands
- global atomic counter
- global spinlocks
- barrier
- socket communication
- DMA queues (separation of concerns)
Execution model (daemon)
- start, stop
- IB check
- cleanup
- authentication
- autodetection of IP ports and channel bonding
- machine file
- ranks
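To make the first API items concrete, here is a toy, single-process model of a global address space with one-sided reads and writes and a global atomic counter. This is not the FVM API: the class `ToyGlobalMemory` and all method names are hypothetical and only illustrate the idea that the owner of a memory segment does not take part in the transfer.

```python
import threading


class ToyGlobalMemory:
    """Toy stand-in for a partitioned global address space (not the FVM API)."""

    def __init__(self, num_ranks, segment_size):
        # One byte segment per rank; together they form the "global" memory.
        self.segments = [bytearray(segment_size) for _ in range(num_ranks)]
        self._counter = 0
        self._counter_lock = threading.Lock()

    def write(self, rank, offset, data):
        # One-sided write: the owner of the segment takes no action.
        self.segments[rank][offset:offset + len(data)] = data

    def read(self, rank, offset, length):
        # One-sided read: no matching send on the owner's side.
        return bytes(self.segments[rank][offset:offset + length])

    def fetch_add(self, increment=1):
        # Global atomic counter: return the old value, then add.
        with self._counter_lock:
            old = self._counter
            self._counter += increment
            return old


mem = ToyGlobalMemory(num_ranks=4, segment_size=16)
mem.write(rank=2, offset=4, data=b"abc")
remote = mem.read(rank=2, offset=4, length=3)
first_ticket = mem.fetch_add()
second_ticket = mem.fetch_add()
```

In contrast to a send/receive pair, neither `write` nor `read` needs a matching call on the target rank, which is what removes the handshake.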

13 Application: Real Time Image Composer
- Parallel rendering of parts of images
- Each node transfers directly into GPU memory
- Load balancing with atomic counter
- Overlap of calculation and communication
Benchmark: DDR IB (MT25208), old dual-core AMD host
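Load balancing with an atomic counter can be sketched as follows: each worker repeatedly fetches the next tile index from a shared counter until no tiles are left, so fast workers automatically take more tiles. This is a minimal thread-based illustration; the function `render_tiles` is hypothetical, a lock emulates the atomic fetch-and-add, and the rendering step is a placeholder.

```python
import threading


def render_tiles(num_tiles, num_workers):
    """Distribute tiles dynamically; returns which worker handled each tile."""
    next_tile = 0
    lock = threading.Lock()
    done_by = [None] * num_tiles

    def worker(worker_id):
        nonlocal next_tile
        while True:
            with lock:                 # emulates an atomic fetch-and-add
                tile = next_tile
                next_tile += 1
            if tile >= num_tiles:      # no work left
                return
            done_by[tile] = worker_id  # placeholder for rendering the tile

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done_by


assignment = render_tiles(num_tiles=20, num_workers=4)
```

Every tile is handed out exactly once, and no static partitioning of the image is needed.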

14 Microbenchmarks: Latency
GASNet
- developed at LBNL, supervised by Kathy Yelick
- basis for UPC, Co-Array Fortran
- no atomic counters
Hardware: quad-core HP, 3.16 GHz, DDR ConnectX

15 Microbenchmark: Throughput
FVM achieves the speed of the native IB tools. Wire speed!
Hardware: quad-core HP, 3.16 GHz, DDR ConnectX

16 Microbenchmark: Barrier
The multicast barrier is even faster, but still not reliable.

17 FVM Performance (barrier, multi-rail bandwidth)

18 3D Angle Domain Migration
Data for x,y coordinate: times oversampled
Goal: reflection power as a function of reflection angle

19 3D Angle Domain Migration
- O(10^10) ops per image point
- Summation of all contributions to reflection power in a single subsurface point as a function of the reflection angle
Ray tracing
- Calculation of ray paths for a given velocity field (high-frequency approximation)
- Results in travel time and weight
- 10^9 points

20 For each output point x:
     For each opening angle
       For each opening angle azimuth
         For each dip angle inclination
           For each dip angle azimuth
             Build s-x-r ray pair and sum event from closest trace to image
Needs:
- all data in memory
- each core randomly accesses the memory (10 GB from 4 TB are summed up)
- calculation time: sec per point -> accumulates to days on 1000 cores
- fault tolerance
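The loop nest above can be written down as a short sketch. The function names and the callback `trace_amplitude` are hypothetical; the callback stands in for building the s-x-r ray pair and picking the event from the closest trace. Accumulating the result per opening angle is an assumption made here to reflect the stated goal of reflection power as a function of angle.

```python
from itertools import product


def migrate_point(x, opening_angles, opening_azimuths,
                  dip_inclinations, dip_azimuths, trace_amplitude):
    """Sum all contributions to one image point, binned by opening angle."""
    image = {theta: 0.0 for theta in opening_angles}
    for theta, phi, dip_inc, dip_az in product(
            opening_angles, opening_azimuths, dip_inclinations, dip_azimuths):
        # Placeholder for: build the s-x-r ray pair, find the closest trace,
        # and read the event amplitude from it.
        image[theta] += trace_amplitude(x, theta, phi, dip_inc, dip_az)
    return image


def toy_amplitude(x, theta, phi, dip_inc, dip_az):
    # Constant contribution, only to make the sketch runnable.
    return 1.0


result = migrate_point(
    x=(0.0, 0.0, 1.0),
    opening_angles=[0, 10, 20],
    opening_azimuths=[0, 90],
    dip_inclinations=[0, 15, 30],
    dip_azimuths=[0, 180],
    trace_amplitude=toy_amplitude,
)
```

With the toy amplitude, each opening angle collects 2 x 3 x 2 = 12 contributions. In the real application the inner call touches trace data that may live on any node, which is why random access to all data in memory is listed as a requirement.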

21 Role of the FVM
- 5 TByte of data distributed in memory
- Each core calculates the result for a point or a group of points
- Overlap of transfer with calculation
- Dynamic load balancing with AC (atomic counter)
- Caching of non-local data: local for the next step
[Diagram: compute thread loop: request non-local data -> calculate ray pairs -> interpolate -> summation; FVM: RDMA read via IB card]
Result: location of data is nearly unimportant!
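Overlapping transfer with calculation amounts to requesting the next block of non-local data before working on the current one. The sketch below illustrates this with a background thread standing in for the RDMA read; `process_blocks`, `toy_fetch` and the use of a thread pool are assumptions for illustration, not the FVM mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def process_blocks(block_ids, fetch, compute):
    """Compute on block i while block i+1 is being fetched."""
    results = []
    if not block_ids:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, block_ids[0])
        for next_id in block_ids[1:]:
            data = pending.result()                # wait for current block
            pending = pool.submit(fetch, next_id)  # start the next transfer
            results.append(compute(data))          # compute while it runs
        results.append(compute(pending.result()))  # last block
    return results


def toy_fetch(block_id):
    # Stand-in for a remote read: returns four consecutive integers.
    return list(range(block_id, block_id + 4))


sums = process_blocks([0, 10, 20], toy_fetch, sum)
```

If the fetch takes no longer than the computation, the transfer time is hidden completely, which is the situation the slide calls perfect overlap.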

22 Fault tolerance
- FVM detects transfer failures and informs the application
- The application saves the DMA request
- FVM starts a new node from a set of spare nodes
- Lost data is restored from hard disk
- DMA requests are re-issued
Result: computational power is lost during the restart, and some remote accesses to this node may be stalled.
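The recovery sequence can be sketched as a retry loop around a remote read: the caller keeps the request, a recovery step restores the lost data, and the same request is issued again. Everything here is a toy model under stated assumptions; `TransferError`, `read_with_recovery` and the dictionary-based "nodes" are hypothetical and do not reflect the actual FVM interface.

```python
class TransferError(Exception):
    """Raised by the toy transport when a remote read fails."""


def read_with_recovery(request, issue, recover, max_attempts=3):
    """Issue a request; on failure, recover and re-issue the saved request."""
    for _ in range(max_attempts):
        try:
            return issue(request)
        except TransferError:
            # The request is still held by the caller: repair, then retry.
            recover(request)
    raise TransferError(f"request {request!r} failed {max_attempts} times")


nodes = {"node7": None}            # None marks a failed node
disk_copy = {"node7": [1, 2, 3]}   # data that can be restored from disk
log = []


def toy_issue(request):
    node, index = request
    if nodes[node] is None:
        raise TransferError(node)
    return nodes[node][index]


def toy_recover(request):
    node, _ = request
    nodes[node] = list(disk_copy[node])   # spare node loads data from disk
    log.append(f"restored {node}")


value = read_with_recovery(("node7", 1), toy_issue, toy_recover)
```

Only the requests that hit the failed node are delayed; all other work continues, which matches the stated cost of lost computational power during the restart.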

23 Result of simulation
- 32 nodes, dual Woodcrest, DDR IB
- Memory access time only
- Overlap of communication and calculation
- GASNet has no atomic counters and no fault tolerance
- Real life: latency bound!
- Improvement: reorder blocks to form larger transfer units -> weakly regular alltoallv
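Reordering blocks into larger transfer units can be illustrated by packing all blocks for the same destination contiguously and deriving the per-destination counts and displacements that an alltoallv-style exchange expects. This is a minimal sketch; `pack_by_destination` is a hypothetical helper, and the actual reordering used in the application is not described on the slides.

```python
def pack_by_destination(blocks, num_nodes):
    """blocks: list of (destination, payload_list). Returns buffer, counts, displs."""
    # Number of elements going to each destination.
    counts = [0] * num_nodes
    for dest, payload in blocks:
        counts[dest] += len(payload)

    # Start offset of each destination's region (exclusive prefix sum).
    displs = [0] * num_nodes
    for node in range(1, num_nodes):
        displs[node] = displs[node - 1] + counts[node - 1]

    # Copy every small block into its destination's contiguous region.
    send_buffer = [None] * sum(counts)
    cursor = list(displs)
    for dest, payload in blocks:
        send_buffer[cursor[dest]:cursor[dest] + len(payload)] = payload
        cursor[dest] += len(payload)
    return send_buffer, counts, displs


blocks = [(2, ["a"]), (0, ["b", "c"]), (2, ["d"]), (1, ["e"])]
buf, counts, displs = pack_by_destination(blocks, num_nodes=3)
```

After packing, each destination receives one large message instead of many small ones, so the per-message latency is paid once per destination rather than once per block.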

24 alltoall, regular
MPI has problems handling many small blocks properly.

25 alltoallv, weakly regular, in-place
For real permutations, the performance is independent of the number of nodes.

26 alltoallv, weakly regular, in-place
5 secs for reordering: from latency bound to compute bound again

27 Future
- FVM on Cell: up and running; each SPE can read remote memory
- FVM on 10 Gb Ethernet: some restrictions
- Richer API: collectives, even fault-tolerant ones, e.g. AllReduce
- GraPA for FVM
- FVM4: in production since September; even lower latency, new startup mechanism, extensive environment checking
- In work: passive recv with wakeup; on top: all kinds of load balancing
- Under development: communication infrastructure for bus-memory devices (e.g. Larrabee and co.)
- Future: direct hardware support
Thank you!
