From the latency to the throughput age. Prof. Jesús Labarta Director Computer Science Dept (BSC) UPC


1 From the latency to the throughput age. Prof. Jesús Labarta, Director, Computer Science Dept (BSC), UPC. ETP4HPC Post-H2020 HPC Vision, Frankfurt, June 24th 2018

2 To exascale... and beyond


3 Vision. The multicore and memory revolution: ISA leak, plethora of architectures, heterogeneity, memory hierarchies. The power wall made us go multicore and the ISA interface to leak: our world is shaking. Complexity + variability = divergence between our mental models and actual system behavior. [Figure labels: Applications, ISA / API] What programmers need? HOPE!!!

4 Vision: similar effect at system level/coarse grain. Plethora of architectures, heterogeneity, memory hierarchies. New usage practices: online simulation, analytics and visualization; interactive supercomputing, response time; value-based computing; urgent computing. Important: integration of concurrency and data; dynamic resource sharing. [Figure labels: Simulation 1, data 1, Simul 2, data 2] BSC vision. BDEC. Fukuoka. Feb

5 Evolution vs. revolution. Revolutions: change of mindset (before / after). Do we think outside the box?

6 Do we think outside the box? Very strong walls in the HPC box!!!

7 Do we think outside the box? Very strong walls in the HPC box!!! Sometimes we try to blow them up.

8 Do we think outside the box? Very strong walls in the HPC box!!! Sometimes we try to blow them up. But the walls are in our mind!!!

9 Do we think outside the box? We do (I may be exaggerating, or maybe not that much): Proudly show the performance we achieve and not the code we write. Use variables about resources (cores, GPUs), e.g. omp_get_num_threads(). Run sequences of jobs with 5K cores because each of them takes 20% less time than with 2K cores. Believe that overlap == changing sends to isends or using one-sided calls. Burn millions of hours to estimate a good configuration. Integrate simulation, analytics and visualization in a single MPI binary.
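The 5K versus 2K cores remark on the slide above can be made concrete with a little arithmetic. The sketch below is only an illustration: the 10,000-core machine size and the helper name campaign_stats are assumptions, not from the talk; only the job sizes and the 20% figure come from the slide.

```python
# Hypothetical machine size; the slide gives only the job sizes and the 20% figure.
MACHINE_CORES = 10_000


def campaign_stats(cores_per_job, time_per_job, machine_cores=MACHINE_CORES):
    """Return (jobs finished per unit of time, core-time consumed per job)."""
    concurrent_jobs = machine_cores // cores_per_job  # jobs that fit side by side
    throughput = concurrent_jobs / time_per_job
    cost_per_job = cores_per_job * time_per_job
    return throughput, cost_per_job


# Time is normalized so that one job on 2K cores takes 1.0.
small = campaign_stats(cores_per_job=2_000, time_per_job=1.0)
large = campaign_stats(cores_per_job=5_000, time_per_job=0.8)  # 20% faster per job

print("2K-core jobs: %.2f jobs per time unit, %.0f core-units per job" % small)
print("5K-core jobs: %.2f jobs per time unit, %.0f core-units per job" % large)
```

Under these assumptions each 5K-core job has lower latency, but the campaign finishes half as many jobs per unit of time and spends twice the core-time per job, which is the latency versus throughput trade-off the talk is pointing at.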

10 Do we think outside the box? Do we? Interleave processes? Think of using MPI + OpenMP with just 1 OpenMP thread? Share nodes among jobs? Serialize (and overlap) reductions? Taskify MPI calls to allow their out-of-order execution? Spawn packing and unpacking tasks to allow for fast draining of incoming messages by the main process? Parallelize packing and unpacking of messages? Depending on message size?

11 Do we think outside the box? Why? Follow recommended best practices. Never thought of it? Some bad experience: never again. I can do it better!!!!! Dazzled by performance!!

12 All about the mindset. The real parallel programming revolution is in the mindset of programmers: from the latency to the throughput age!!! It can and should be achieved productively: incrementally, on a standard programming model/language (MPI+OpenMP, Python, ...). Real revolution, real effort. Issue everywhere. At home first. Shape minds vs. reshape minds.

13 Key aspects. Actual behavior/performance analysis: avoid flying blind!! Towards insight and understanding of fundamental issues, for application & system developers. Programming practices and models: decouple programmer from machine; programs to convey ideas to humans that happen to be executable by machines; enable productive/evolutionary/composable approaches. Can we avoid/contain the complexity explosion? Dynamic resource sharing.

14 Behavior awareness. A common language about fundamental issues. Evolution of bottlenecks. Methodology. 195 studies: ~25% industry. Awareness: opportunity to improve, and examples how. Co-design input.

15 Behavior awareness. Tracking scaling behavior of computation regions (strong scaling MPI+OpenMP example).

16 Behavior awareness. Coupled codes: multiple physics, domains; compute & I/O. [Figure: EC-EARTH, 1600 cores, 2.5 s; Atmosphere, Ocean; 26.7MB trace] Eff: 0.43; LB: 0.52; Comm:
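The slide reports efficiency (Eff), load balance (LB) and communication (Comm) figures without defining them. A minimal sketch, assuming the multiplicative model used in the POP methodology (load balance = average / maximum compute time, communication efficiency = maximum compute time / runtime, parallel efficiency = their product), is shown here. The function name and the sample timings are hypothetical and are not the EC-EARTH measurements.

```python
def efficiency_metrics(compute_times, runtime):
    """Compute POP-style efficiency factors from per-process compute times.

    compute_times: useful computation time of each process (same units as runtime)
    runtime: elapsed wall-clock time of the region being analysed
    """
    average = sum(compute_times) / len(compute_times)
    longest = max(compute_times)
    load_balance = average / longest          # 1.0 means perfectly balanced
    comm_efficiency = longest / runtime       # 1.0 means no time lost outside compute
    parallel_efficiency = load_balance * comm_efficiency
    return load_balance, comm_efficiency, parallel_efficiency


# Hypothetical timings for four processes in a region that took 8.0 s.
lb, comm, eff = efficiency_metrics([6.0, 3.0, 2.0, 1.0], runtime=8.0)
print("LB: %.3f  Comm: %.3f  Eff: %.3f" % (lb, comm, eff))
```

In this model the parallel efficiency is also the average compute time divided by the runtime, so a low value can be traced back to either imbalance or communication rather than read as a single opaque number.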

17 Vision in the programming revolution. General purpose. Applications. PM: high-level, clean, abstract interface. Power to the runtime. ISA / API. Decouple: forget about resources. Minimal & sufficient permeability? Intelligence & resource management. Reuse & expand old architectural ideas under new constraints.

18 Vision in the programming revolution. Applications; DSL1, DSL2, DSL3. PM: high-level, clean, abstract interface. Fast prototyping. Special purpose: must be easy to develop/maintain. Power to the runtime. ISA / API.


19 Integrate concurrency and data. Single mechanism. Concurrency: dependences built from data accesses. Lookahead: about instantiating work. Locality & data management: from data accesses. Task based parallel programming.
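The idea of building dependences from data accesses can be sketched in a few lines. This is a toy illustration, not the algorithm of any particular runtime: each task declares the data it reads and writes, tasks are instantiated in program order, and the ordering constraints (read-after-write, write-after-read, write-after-write) are derived from those declarations. All task and data names are hypothetical.

```python
def build_dependences(tasks):
    """Derive task dependences from declared data accesses.

    tasks: list of (name, inputs, outputs) tuples in program order.
    Returns a dict mapping each task name to the set of earlier tasks
    it must wait for.
    """
    last_writer = {}          # data item -> task that last wrote it
    readers_since_write = {}  # data item -> tasks that read it since that write
    deps = {}
    for name, inputs, outputs in tasks:
        wait_for = set()
        for item in inputs:
            if item in last_writer:
                wait_for.add(last_writer[item])              # read after write
        for item in outputs:
            if item in last_writer:
                wait_for.add(last_writer[item])              # write after write
            wait_for.update(readers_since_write.get(item, ()))  # write after read
        wait_for.discard(name)
        deps[name] = wait_for
        # Record this task's accesses for the tasks instantiated after it.
        for item in inputs:
            readers_since_write.setdefault(item, set()).add(name)
        for item in outputs:
            last_writer[item] = name
            readers_since_write[item] = set()
    return deps


tasks = [
    ("pack", {"field"}, {"buffer"}),
    ("send", {"buffer"}, set()),
    ("compute_interior", {"field"}, {"result"}),
    ("refill", set(), {"buffer"}),
]
deps = build_dependences(tasks)
for name, wait_for in deps.items():
    print(name, "waits for", sorted(wait_for))
```

Here compute_interior ends up with no dependences, so a runtime that has looked ahead far enough to instantiate it could run it while pack and send are still in flight, without the programmer stating that overlap explicitly.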

20 Task based parallel programming. Some important features: dependences, lookahead; taskloops; nesting; array sections / regions. Exploiting malleability: Dynamic Load Balance (DLB), within app, across apps. MPI+OpenMP interoperability. Think global, specify local.
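To see why malleability matters, the following sketch compares a rigid allocation with an idealized one in which a process that runs out of work lends its cores to the others. It is a back-of-the-envelope model, not the DLB library interface: it assumes perfectly divisible work and free, instantaneous lending, and the function names and workloads are hypothetical.

```python
def static_time(work, cores_per_proc):
    """Elapsed time when every process keeps its own cores to the end."""
    return max(w / cores_per_proc for w in work)


def lend_when_idle_time(work, cores_per_proc):
    """Elapsed time when finished processes lend their cores to the rest.

    Idealized: work is perfectly divisible and lending costs nothing.
    """
    total_cores = cores_per_proc * len(work)
    elapsed = 0.0
    done = 0.0                 # work already completed by each still-active process
    active = len(work)
    for w in sorted(work):
        share = total_cores / active      # cores per active process right now
        elapsed += (w - done) / share     # time until the smallest one finishes
        done = w
        active -= 1                       # its cores are lent to the others
    return elapsed


work = [120.0, 40.0]   # hypothetical work units for two MPI processes
print("static:", static_time(work, cores_per_proc=4))
print("lending:", lend_when_idle_time(work, cores_per_proc=4))
```

In this idealized model the lending result equals the total work divided by the total number of cores, which is why the amount of resources, rather than how they were initially partitioned, determines the achievable time.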

21 Towards the throughput age. By: express potential concurrency; malleability; dynamic resource sharing/management. Configuration independence: amount of resources is what really matters. Side effects: Nx1 can be better than pure MPI!!! Hope for lazy programmers.

22 Infrastructures for new usage modes. Persistent KVS: alternative for parallel programs' I/O? Flexible querying: 3D indexing, data-thinning. Need/opportunity of clean integration of concurrency and data: within one app; shared communication space between multiple apps. Malleable/elastic/opportunistic resource management/sharing.

23 Impact on architecture? High throughput devices. Long vectors: decouple front end - back end engines, reduce front end pressure, optimize memory throughput, explicit locality management. Specialized compute and data motion engines. Tuned numerical precision. ISA is important: decouple/hide again hardware details, reuse SW technologies (compilers, OS, ...). Specific instructions? Limited number of control flows. Hierarchical acceleration. Nesting. Homogenize heterogeneity. Runtime aware architectures (RAA).

24 Age before beauty. Behavior (insight/models) before syntax. Detailed performance analytics before aggregated profiles. Work instantiation and order before overhead. Malleability before fitted rigid structure. Possibilities before how-tos. Elegance before one-day shine.

25 The challenge. Think of fundamentals, think out of the box. Revolution: change everything so that nothing changes. Should we change as little as possible so that everything is different? Programmers!!!! Develop a culture of: efficiency awareness; latency-to-throughput mindset; dynamic sharing of resources. To exascale and before.

