Orchestration: Turning material breakthroughs into application performance

1 Orchestration: Turning material breakthroughs into application performance
Jeronimo Castrillon (on behalf of the Orchestration team)
TU Dresden, Cfaed, Chair for Compiler Construction

2 Cfaed: Research program (revisited)
Goal: to explore new technologies for electronic information processing which overcome the limits of CMOS technology

3 Cfaed: Research program (revisited)
Goal: to explore new technologies for electronic information processing which overcome the limits of CMOS technology
- Systems research to handle heterogeneity: F Orchestration, G Resilience, H HAEC (Highly Adaptive Energy-Efficient Computing)
- Material research for post-CMOS technologies: A Silicon Nanowires, B Carbon, C Organic, D Biomolecular Assembly, E Chemical Biology
- I Biological: to inspire solutions
- CMOS (industry focus)
[Figure: research program overview with the layers Information Processing, Systems, Devices & Circuits, Materials & Functions]

4 Orchestration Path (F Orchestration)
Mission: turn materials breakthroughs into application performance

5 Orchestration Path (F Orchestration)
Mission: turn materials breakthroughs into application performance
HW/SW stack (top to bottom):
- Applications: mobile communication, high-performance computing, database application
- App. algorithms
- Skeletons
- Compilers
- Operating system
- Hardware
- Materials-inspired paths (Paths A-E)
Alongside the stack: Coordination, Formal Analysis
Annotations:
- Underlying algorithms: self-adjusting
- Code synthesis: variant generation
- Programming abstractions
- Micro-kernels
- Heterogeneous template
- Cross-layer strategies
- System analysis: modeling, verification & reasoning

6 Orchestration Path: the team
Investigators: Prof. Uwe Aßmann, Prof. Franz Baader, Prof. Christel Baier, Prof. Jeronimo Castrillon, Prof. Gerhard Fettweis, Prof. Christof Fetzer, Prof. Jochen Fröhlich, Prof. Hermann Härtig, Prof. Wolfgang Lehner, Prof. Wolfgang E. Nagel, Dr. Marcus Völp, Prof. Axel Voigt

7 Orchestration Path: the team (2)
PhD/Postdocs: Nils Asmußen, Andres Goens, Sebastian Haas, Dirk Habich, Immo Huismann, Tomas Karnagel, Sven Karol, Sascha Klüppelholz, Linda Leuschner, Matthias Lieber, Siqi Ling, Steffen Märker, Johannes Mey, Benedikt Nöthen, Michael Raitza, Norman Rink, Annett Ungethüm

8 Outline
- Introduction
- HW/SW stacks
- Orchestration stack
- Outlook
- Concluding remarks

9 What is a stack?
- Software components that form a complete platform (on which you can run applications)
  - Web app: operating system, web server, database, programming language
- Provides means to handle complexity

10 HW/SW Stack: Example 1
[Figure: Android stack]
Source:

11 HW/SW Stack: Example 2
[Figure: HSA stack]
Source: Heterogeneous System Architecture, http://developer.amd. [...] eous-computing/what-is-heterogeneous-system-architecture-hsa/

12 Stacks and wild heterogeneity
- Abstract/hide complexity
- Leave as much visible as needed for efficiency
- Emerging technologies: rethink established layers
  - Languages, compilers, operating systems, ...

13 Example: Compiler Runtime
- Compiler protocol assumptions
- Runtime environment
- Endurance problem: try not to use the same locations too often
[Figure: memory layout between addresses 0x0 and 0xF..F with code, data, stack and heap regions]

14 Example: Flash FTL
[Figure: two storage stacks. One: file system on a hard disk. The other: file system on a Flash Translation Layer (FTL: mapping, wear leveling) on a flash drive]
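The mapping and wear-leveling roles of the FTL can be illustrated with a toy model. This is a minimal sketch under simplifying assumptions (one payload per block, no garbage collection, erase on every overwrite); the class `SimpleFTL` and its methods are hypothetical names for illustration, not part of any real flash controller or of the presented work.

```python
class SimpleFTL:
    """Toy flash translation layer: logical-to-physical mapping plus wear leveling."""

    def __init__(self, num_physical_blocks):
        self.erase_counts = [0] * num_physical_blocks  # wear per physical block
        self.mapping = {}                              # logical block -> physical block
        self.free = set(range(num_physical_blocks))    # erased, writable blocks
        self.data = {}                                 # physical block -> payload

    def write(self, logical, payload):
        # Flash cannot be overwritten in place, so each write goes to a fresh
        # block. Wear leveling: choose the free block with the fewest erases.
        if not self.free:
            raise RuntimeError("no free physical block")
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        self.data[target] = payload
        old = self.mapping.get(logical)
        self.mapping[logical] = target
        if old is not None:
            # The previous copy is stale: erase it and return it to the pool.
            del self.data[old]
            self.erase_counts[old] += 1
            self.free.add(old)

    def read(self, logical):
        # The file system only sees logical addresses; the FTL resolves them.
        return self.data[self.mapping[logical]]
```

Repeated writes to the same logical block rotate over all physical blocks, so the file system above can keep its hard-disk assumptions while the layer below spreads the wear.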

15 Outline
- Introduction
- HW/SW stacks
- Orchestration stack
- Outlook
- Concluding remarks

16 Recall: orchestration stack
- Heterogeneous materials
  - In Cfaed: along with orchestration (see other DMA presentations)
- Research plan
  - Start with heterogeneous CMOS
  - Tiled for scalability, integrability
  - Pilot projects with materials as they become available
[Figure: orchestration stack, as on slide 5]

17 Applications/Drivers
- Mobile communication: typically runs on heterogeneous platforms
- Databases: benefits of heterogeneity (see below)
- HPC: high-level languages, abstractions for scalability, productivity and portability
  - Adaptive Finite Element Methods (FEM)
  - Computational Fluid Dynamics (CFD)

18 Applications/Drivers (2)
- Adaptive Finite Element Methods (FEM)
  - Working with multiple meshes
  - Communication across subdomains
  - Requires repartitioning -> load balancing
- Working on
  - Portable templatized library [Witkowski15]
  - First results on heterogeneous platforms
[Figure: dendritic growth]

19 Applications/Drivers (3)
- Computational Fluid Dynamics (CFD)
  - Large applications (10^10 unknowns)
  - Communication/computation is key
- Working on
  - Heterogeneous implementation
  - More versatile fundamental concepts [Huismann15]

20 Hardware
- Tomahawk: heterogeneous CMOS
- Tiled (with NoC) for scalability
- Allows integrating wildly heterogeneous components
- CoreManager (CM): HW support for fine-grained task dispatching [Arnold13]
[Figure: tiled architecture with CM/µk]

21 Hardware
- DB-processor for sorting algorithms
Selectivity: [Arnold14]

22 Operating Systems
- Microkernel for heterogeneity
  - Isolation (at NoC level)
- Dynamic compartments
- Isolation at network boundary
  - Remote control: no need for OS everywhere
- Simplify integration of novel materials
- OS services on remote cores

23 Operating Systems
- Microkernel for heterogeneity [Asmussen15]
[Figure: App 1, File System, App 2 and CM/µk; isolation & remote control]

24 Compilers
- Parallel dataflow programming languages and compilers [Castrillon14]

    PNkpn AudioAmp PNin(short A[2]) PNout(short B[2]) PNparam(short boost)
    {
      while (1)
        PNin(A) PNout(B) {
          for (int i = 0; i < 2; i++)
            B[i] = A[i]*boost;
        }
    }
    PNprocess Amp1 = AudioAmp PNin(C) PNout(F) PNparam(3);
    PNprocess Amp2 = AudioAmp PNin(D) PNout(G) PNparam(10);
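The dataflow semantics of the AudioAmp example can be approximated in plain Python with threads and FIFO queues. This is only a sketch: the channel names (C, D, F, G) and boost factors (3, 10) come from the slide, while the `STOP` sentinel, `run_network` and `drain` are assumptions added so the example terminates (the original processes loop forever).

```python
import queue
import threading

STOP = None  # assumed end-of-stream marker; the slide's processes never stop


def audio_amp(inp, out, boost):
    """One process instance: read a 2-sample token, scale it, write it."""
    while True:
        token = inp.get()  # blocking read, as in a Kahn process network
        if token is STOP:
            out.put(STOP)
            return
        out.put([sample * boost for sample in token])


def drain(channel):
    """Collect all tokens from a channel up to the STOP marker."""
    items = []
    while True:
        item = channel.get()
        if item is STOP:
            return items
        items.append(item)


def run_network(tokens_c, tokens_d):
    # Channels C and D feed Amp1 and Amp2; F and G carry their outputs.
    c, d, f, g = (queue.Queue() for _ in range(4))
    amp1 = threading.Thread(target=audio_amp, args=(c, f, 3))
    amp2 = threading.Thread(target=audio_amp, args=(d, g, 10))
    amp1.start()
    amp2.start()
    for token in tokens_c:
        c.put(token)
    c.put(STOP)
    for token in tokens_d:
        d.put(token)
    d.put(STOP)
    amp1.join()
    amp2.join()
    return drain(f), drain(g)
```

Because each process only communicates through blocking FIFO reads, the result does not depend on how the two threads are scheduled, which is the property that makes such networks attractive for mapping onto heterogeneous cores.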

25 Compilers
- SW synthesis for Tomahawk
  - Static/dynamic mapping
  - Automatic code generation
[Figure: processes P1 to P6 and CM/µk]

26 Skeletons
- Basic building blocks for parallelism
- Heterogeneous programming
  - Different possible implementations of same principles (CUDA, Fortran, ...)
  - Advanced pre-compiler to generate and manage variants -> productivity

27 Skeletons
- Basic building blocks for parallelism [Karol15]

1 Sequential program

    do concurrent (i = 1:n, j = 1:m)
      mat(i,j) = x [...] y(i,j,k) [...] 2
    end do

2 Style definitions

    Style: OpenACC
    Type: parallelization
    Fragment Target:DoConcurrent
      !$acc loop private(#PRIVATE_VARS#)
      #INNER#
      !$acc end do
    Attribute Loop.private_vars()

2 Style application recipe (OpenACC shown; OpenMP as an alternative style)

    RECIPE {parallelization:OpenACC}

3 Parallel program

    !$acc parallel
    !$acc loop private(i, j)
    do k = 1, num_ele
      do i = 0, po
        mat(i,j) = x [...] y(i,j,k) [...] 2
      end do
    end do
    !$acc end loop
    !$acc end parallel
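The idea of a style fragment with placeholders can be sketched as plain string substitution. The OpenACC fragment and the placeholder names `#PRIVATE_VARS#` and `#INNER#` are taken from the slide; the OpenMP fragment, the `STYLES` table and the functions `apply_style` and `generate_variants` are assumptions for illustration and do not reflect the actual pre-compiler described in [Karol15].

```python
# Style fragments with placeholders. Only the OpenACC entry appears on the
# slide; the OpenMP entry is an assumed counterpart using standard directives.
STYLES = {
    "OpenACC": "!$acc loop private(#PRIVATE_VARS#)\n#INNER#\n!$acc end do",
    "OpenMP": "!$omp parallel do private(#PRIVATE_VARS#)\n#INNER#\n!$omp end parallel do",
}


def apply_style(style, inner, private_vars):
    """Expand one style fragment around a sequential loop body."""
    fragment = STYLES[style]
    return (fragment
            .replace("#PRIVATE_VARS#", ", ".join(private_vars))
            .replace("#INNER#", inner))


def generate_variants(inner, private_vars):
    """Produce one parallel variant of the same loop per known style."""
    return {style: apply_style(style, inner, private_vars) for style in STYLES}
```

For example, `generate_variants("do i = 1, n\n  a(i) = 2 * a(i)\nend do", ["i", "j"])` yields an OpenACC and an OpenMP version of the same loop from a single sequential source.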

28 Algorithms
- Raise the level of abstraction
  - High-level domain-specific languages (more later)
  - Algorithmic transformations: more optimization room
  - Algorithmic specification: allow for adaptability

29 Formal analysis
- Probabilistic model checking (PMC)
  - Analyze approaches across layers of the stack
  - Work on models for heterogeneous systems
- Examples:
  - Probability of finishing N tasks in T time
  - Probability of finishing N tasks with energy budget B
  - Trade-off analysis cost-utility (energy vs. performance)
- Model protocols: spinlock, barriers, ebond, ...
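The first example query, the probability of finishing N tasks in T time, can be made concrete on a deliberately tiny model. The sketch below assumes a discrete-time Markov chain in which each time step completes the current task with a fixed probability; this model and the function name `finish_probability` are illustrative assumptions, far simpler than the system models used in [Baier14].

```python
def finish_probability(n_tasks, horizon, p_done):
    """Probability that n_tasks are finished within `horizon` time steps.

    Assumed model: tasks run one after another, and in every step the
    current task completes with probability p_done.
    """
    # dist[k] = probability that exactly k tasks are finished so far;
    # the state k == n_tasks is absorbing (all work done).
    dist = [0.0] * (n_tasks + 1)
    dist[0] = 1.0
    for _ in range(horizon):
        nxt = [0.0] * (n_tasks + 1)
        for k, prob in enumerate(dist):
            if k == n_tasks:
                nxt[k] += prob
            else:
                nxt[k + 1] += prob * p_done        # task completes this step
                nxt[k] += prob * (1.0 - p_done)    # task still running
        dist = nxt
    return dist[n_tasks]
```

This is a time-bounded reachability computation: the state distribution is pushed forward step by step, and the mass in the goal state after T steps answers the query. An energy-budget query has the same shape if the steps are weighted by energy cost instead of time.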

30 Formal analysis
- Probabilistic model checking (PMC) [Baier14]
[Figure: ebond; minimal expected energy [watts] against maximal required ebond+ bandwidth [Mbit/s]]

31 Outline
- Introduction
- HW/SW stacks
- Orchestration stack
- Outlook
- Concluding remarks

32 Current and future work
- Further work on languages and methodologies
  - Computational biology
  - Chemical information processing
- Technology analysis at system level
  - Pros (USPs) and cons
  - Attempt to skip incremental improvement
  - System-level benefit based on assumptions

33 Language for computational biology
- Particle- and mesh-based simulations
  - Discrete models of physical phenomena
- Goal: abstract programming language + optimization for heterogeneous HPC clusters
[Figure: Particle-Mesh DSL; analysis, static opt., runtime opt.] [Karol15]

34 Language for computational biology [Karol15]

    client grayscott
      integer, dimension(6) :: bcdef =
      real(..), dimension(:,:), pointer :: displace
      integer :: interval
      add_arg(k_rate,<#real(mk)#>,1.0_mk,0.0_mk,'k_rate', )
      add_arg(f,<#real(mk)#>,1.0_mk,0.0_mk,'f_param', ')
      add_arg(d_u,<#real(mk)#>,1.0_mk,0.0_mk,'Du_param', )
      add_arg(d_v,<#real(mk)#>,1.0_mk,0.0_mk,'Dv_param', ')
      ppm_init(1)
      U = create_field(1, "U")
      V = create_field(1, "V")
      topo = create_topology(bcdef)
      c = create_particles(topo)
      allocate(displace(ppm_dim,c%npart))
      call random_number(displace)
      call c%move(displace, info)
      call c%apply_bc(info)
      global_mapping(c, topo)
      discretize(u,c)
      discretize(v,c)
      ghost_mapping(c)
      foreach p in particles(c) with sca_fields(u,v)
        U_p = 1.0_mk
        V_p = 0.0_mk
        if (((x_p(1)-0.5)**2+(x_p(2)-0.5_mk)**2).lt.0.01) then
          U_p = 0.5_mk _mk*noise
          V_p = 0.25_mk _mk*noise
        end if
      end foreach
      n = create_neighlist(c,cutoff=<#4._mk * c%h_avg#>)
      Lap = define_op(2, [2,0, 0,2], [1.0_mk, 1.0_mk], "Lap")
      W = discretize_op(lap, c, ppm_param_op_dcpse, [order=>2,c=>1.0_mk])
      o,nstag = create_ode([u,v],grayscott_rhs,[u=>c,v], rk4)
      interval = 1
      t = timeloop()
        do istage=1,nstag
          ghost_mapping(c)
          ode_step(o, t, time_step, 1)
        end do
        print([u=>c, V=>c],interval )
      end timeloop
      ppm_finalize()
    end client

    rhs grayscott_rhs(u=>parts,v)
      get_fields(du,dv)
      du = apply_op(w, U)
      dv = apply_op(w, V)
      foreach p in particles(parts) with sca_fields(u,v,du,dv)
        du_p = D_u*dU_p - U_p*(V_p**2) + F*(1.0_mk-U_p)
        dv_p = D_v*dV_p + U_p*(V_p**2) - (F+k_rate)*V_p
      end foreach
    end rhs
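The right-hand side computed by `grayscott_rhs` above can be written out in plain Python to show what the DSL abstracts away. The reaction terms follow the two update lines on the slide. As a simplifying assumption, the sketch uses a 1D periodic grid with a second-order finite-difference Laplacian instead of the particle-based operator of the slide, and `laplacian_periodic` is a hypothetical helper name.

```python
def laplacian_periodic(values, spacing):
    """Second-order finite-difference Laplacian on a 1D periodic grid (assumed discretization)."""
    n = len(values)
    # values[i - 1] wraps around for i == 0, giving the periodic left neighbor.
    return [(values[i - 1] - 2.0 * values[i] + values[(i + 1) % n]) / spacing**2
            for i in range(n)]


def grayscott_rhs(u, v, d_u, d_v, f, k_rate, spacing=1.0):
    """Time derivatives of the Gray-Scott fields U and V at every grid point."""
    lap_u = laplacian_periodic(u, spacing)
    lap_v = laplacian_periodic(v, spacing)
    # Same terms as on the slide: diffusion, reaction U*V^2, feed and kill.
    du = [d_u * lu - up * vp**2 + f * (1.0 - up)
          for lu, up, vp in zip(lap_u, u, v)]
    dv = [d_v * lv + up * vp**2 - (f + k_rate) * vp
          for lv, up, vp in zip(lap_v, u, v)]
    return du, dv
```

The uniform state U = 1, V = 0 used as the initial condition on the slide is a fixed point of this right-hand side, which is why the DSL program perturbs a small region to start the pattern formation.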

35 SiNW: Components and reconfigurability
- Edge of SiNW over other technologies
  - P/N reconfigurability
  - Multi-gate devices (with low latency penalty)
- Components based on assumptions
  - New microarchitectures?
  - New pipeline structures and compiler optimizations?

36 Outline
- Introduction
- HW/SW stacks
- Orchestration stack
- Outlook
- Concluding remarks

37 Summary
- Orchestration path: multi-disciplinary effort to turn materials breakthroughs into application performance
- HW/SW stack to handle heterogeneity
  - Using CMOS as vehicle
  - Prepare for wildly heterogeneous systems
- Interact with material experts, understand technologies and analyze at system level

38 Thanks!
Retreat: Orchestration and resilience paths ( )

39 References
[Witkowski15] Witkowski, T., Ling, S., Praetorius, S., & Voigt, A. (2015). Software concepts and numerical algorithms for a scalable adaptive parallel finite element method. Advances in Computational Mathematics, 1-33.
[Huismann15] Huismann, I., Stiller, J., & Fröhlich, J. (2015). Two-level parallelization of a fluid mechanics algorithm exploiting hardware heterogeneity. Computers & Fluids.
[Arnold13] Arnold, O., Matus, E., Noethen, B., Winter, M., Limberg, T., & Fettweis, G. (2014). Tomahawk: Parallelism and heterogeneity in communications signal processing MPSoCs. ACM Transactions on Embedded Computing Systems (TECS), 13(3s), 107.
[Arnold14] Arnold, O., Haas, S., Fettweis, G., Schlegel, B., Kissinger, T., Karnagel, T., & Lehner, W. (2014). HASHI: An Application-Specific Instruction Set Extension for Hashing. ADMS@VLDB.
[Asmussen15] Asmussen, N., Volp, M., Nothen, B., & Ungethum, A. (2015, April). Demo abstract: Taming many heterogeneous cores. In Real-Time and Embedded Technology and Applications Symposium (RTAS), 2015 IEEE (pp ).
[Castrillon14] Castrillon, J., & Leupers, R. (2014). Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap. Springer.
[Karol15] Karol, S. (2015, May). Well-Formed and Scalable Invasive Software Composition. PhD thesis.
[Baier14] Baier, C., Dubslaff, C., & Klüppelholz, S. (2014, July). Trade-off analysis meets probabilistic model checking. In Proceedings of the Joint Meeting of the Twenty-Third EACSL Annual Conference on Computer Science Logic (CSL) and the Twenty-Ninth Annual ACM/IEEE Symposium on Logic in Computer Science (LICS).
[Karol15] Karol, S., Incardona, P., Afshar, Y., Sbalzarini, I., & Castrillon, J. (2015, July). Towards a Next-Generation Parallel Particle-Mesh Language. In Domain-Specific Language Design and Implementation (DSLDI).
