Reactive Runtime Systems for Heterogeneous Extreme Computations

1 Reactive Runtime Systems for Heterogeneous Extreme Computations
Thomas Sterling
Chief Scientist, CREST
Professor, School of Informatics and Computing
Indiana University
November 19, 2014

2 Shifting Paradigms of Computing
Abacus
Counting tables
Pascaline
Difference engine
  Charles Babbage
  Per Georg Scheutz
Tabulators
  Herman Hollerith
Analog computer
  Vannevar Bush
Machine Harvard Architecture
  Howard Aiken
  Konrad Zuse
Charles Babbage Analytical Engine

3 Foundations: The von Neumann Age
Information Theory: Claude Shannon
Computability: Turing/Church
Cybernetics: Norbert Wiener
Stored Program Computer Architecture: von Neumann
The von Neumann Shift:
  Vacuum tubes, core memory
  Technology assumptions
    ALUs are most expensive components
    Memory capacity and clock rate are scale drivers (mainframes)
    Data movement of secondary importance
Von Neumann extended, from 1960:
  Semiconductor, exploitation of parallelism
  Out of order completion
  Vector
  SIMD
  Multiprocessor (MIMD)
    SMP: maintain sequential consistency
    MPP/Clusters: ensemble computations with message passing

4 Capability Computing
[Figure: memory requirements (1 MB to 100 PB) plotted against computational complexity (1 Gigaflops to 1 Exaflops) for brain simulation at increasing scale.]
Model scales: Single Cellular Model, Cellular Neocortical Column, Cellular Mesocircuit, Cellular Rodent Brain, Cellular Human Brain, Subcellular Detail
Machines:
  EPFL 4-rack BlueGene/L
  CADMOS 4-rack BlueGene/P
  BBP / CSCS 4-rack BlueGene/Q + BGAS
  Jülich 28-rack BlueGene/Q machine
  Planned EU HBP Exaflop machine
Additional cost factors:
  Glia-Cell/Vasculature O(1-10x)
  Reaction-Diffusion O(100-1,000x)
  Molecular Dynamics O(>1,000,000,000x?)
  Plasticity O(1-10x)
  Learning & Memory O(10-100x)
  Behavior O( x)

5 The Negative Impact of Global Barriers in Astrophysics Codes
Computational phase diagram from the MPI-based GADGET code (used for N-body and SPH simulations) using 1M particles over four time steps on 128 procs.
Red indicates computation.
Blue indicates waiting for communication.

6 Clock Rate
[Figure: clock rate (MHz) versus date, 1/1/1992 to 1/1/2012, for heavyweight, lightweight, and heterogeneous systems.]
Courtesy of Peter Kogge, UND

7 Total Concurrency
[Figure: total concurrency, TC (Flops/Cycle), from 1.E+01 to 1.E+07, versus date, 1/1/1992 to 1/1/2012, for heavyweight, lightweight, and heterogeneous systems.]
Courtesy of Peter Kogge, UND

8 Gain with Respect to Cores per Node and Latency; 8 Tasks per Core, Overhead of 16 reg-ops, 8% Network Ops
[Figure: performance gain plotted against cores per node and latency (reg-ops).]
Performance Gain of Non-Blocking Programs over Blocking Programs with Respect to Core Count (Memory Contention) and Net. Latency

9 Gain with Respect to Cores per Node and Overhead; Latency of 8192 reg-ops, 64 Tasks per Core
[Figure: performance gain as a function of cores per node and overhead.]
Performance Gain of Non-Blocking Programs over Blocking Programs with Varying Core Counts (Memory Contention) and Overheads

10 The Purpose of a QUARK
Objectives
  High utilization of each core
  Scaling to large number of cores
  Synchronization reducing algorithms
Methodology
  Dynamic DAG scheduling (QUARK)
  Explicit parallelism
  Implicit communication
  Fine granularity / block data layout
Arbitrary DAG with dynamic scheduling
[Figure: execution traces comparing runtime DAG scheduled parallelism with fork-join parallelism.]
Notice the synchronization penalty in the presence of heterogeneity.
Courtesy of Jack Dongarra, UTK
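The contrast on this slide between runtime DAG scheduling and fork-join execution can be sketched with a small makespan model. This is a minimal illustration and not QUARK's implementation: the function names (`dag_makespan`, `fork_join_makespan`), the task names, and the durations are all hypothetical, and the scheduler is a plain greedy list scheduler assumed for the example.

```python
import heapq


def dag_makespan(durations, deps, workers):
    """Greedy list scheduling: a task starts as soon as all of its
    predecessors have finished and a worker is free (no global barrier)."""
    remaining = {t: len(deps.get(t, ())) for t in durations}
    children = {t: [] for t in durations}
    for task, preds in deps.items():
        for p in preds:
            children[p].append(task)
    ready = sorted(t for t, n in remaining.items() if n == 0)
    running = []  # min-heap of (finish_time, task)
    now, free = 0.0, workers
    while ready or running:
        # Start every ready task for which a worker is available.
        while ready and free:
            task = ready.pop(0)
            heapq.heappush(running, (now + durations[task], task))
            free -= 1
        # Advance time to the next task completion.
        now, done = heapq.heappop(running)
        free += 1
        for child in children[done]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return now


def fork_join_makespan(durations, levels, workers):
    """Fork-join: every level ends with a barrier, so each level costs as
    much as its most heavily loaded worker."""
    total = 0.0
    for level in levels:
        finish = [0.0] * workers  # per-worker finish times (valid min-heap)
        for task in level:
            start = heapq.heappop(finish)
            heapq.heappush(finish, start + durations[task])
        total += max(finish)
    return total


# Two independent chains with uneven (heterogeneous) task costs.
durations = {"a1": 1, "a2": 3, "b1": 3, "b2": 1}
deps = {"a2": ["a1"], "b2": ["b1"]}
levels = [["a1", "b1"], ["a2", "b2"]]

print(dag_makespan(durations, deps, workers=2))          # 4.0
print(fork_join_makespan(durations, levels, workers=2))  # 6.0
```

In this toy case the barrier after each level forces the short task to wait for the long one twice, while the dependency-driven schedule lets each chain proceed on its own, which is the synchronization penalty the slide points to.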

11 Asynchronous Many-Task Runtime Systems
Dharma: Sandia National Laboratories
Legion: Stanford University
Charm++: University of Illinois
Uintah: University of Utah
STAPL: Texas A&M University
HPX: Indiana University
OCR: Rice University

12 Semantic Components of ParalleX

13 Overlapping computational phases for hydrodynamics
Computational phases for LULESH (mini-app for hydrodynamics codes).
Red indicates work.
White indicates waiting for communication.
[Figure panels: MPI, HPX]
Overdecomposition: MPI used 64 processes, while HPX used 1E3 threads spread across 64 cores.

14 HPX-5 Development Progress
All cases run on 16 cores (1 locality)
Zoom-in on best performers
Courtesy of Matt Anderson, Indiana University

15 [Block diagram of a processing element, with components labeled A through P: Memory Vault, Interconnect, Parcel Handler, AGAS Translation, Memory Interface, Row Buffers, Scratchpad, Thread 0 registers through Thread N-1 registers, Access Control, Wide-word Struct ALU, Wide Instruction Buffer, Power Management, Thread Manager, Dataflow Control State, PRECISE Decoder, Fault Detection and Handling. Legend: datapaths with associated control; control-only interfaces.]

16 Goal and Objectives
Neo-Digital Age
  Devise means of exploiting nano-scale semiconductor technologies at end of Moore's Law to leverage fabrication facilities investments
  Create scalable structures and semantics capable of optimal performance (time to solution) within technology and power limitations
Technical Strategy
  1. Liberate parallel computer architecture from von Neumann (vn) archaism for efficiency and scalability; eliminate vn bottleneck
  2. Rebalance and integrate functional elements for data movement, operations, storage, control to minimize time & energy
  3. Emphasize tightly coupled logic locality for low time & energy
  4. Dynamic adaptive localized control to address asynchrony and expose parallelism with emergent behavior of global computing
  5. Innovation in execution model for governing principles at all levels

17 Neo-Digital Age extending past foundations
Near nano-scale semiconductor technology
  Flat-lining of Moore's Law at single-digit nanometers
  Cost benefits of fab lines for economy of scale through mass-market
Logic function modules
  Exploit existing and new IP for efficient functional units
  ALUs, latches/registers, nearest-neighbor data paths
Integrated optics
  Orders of magnitude bandwidth increase
  Inter-socket
  On chip
Advanced packaging and cooling
  Dramatic improvement opportunities in volumetric utilization
  3-D integration of combined memory/logic/communication dies
  Return to wafer-scale integration

18 Neo-Digital Age Principles
Elimination of vn-based parallel architecture
  Constrained parallel control state
  Processor-centric approach optimizes for ALU utilization with very deep memory hierarchy; wrong answer
ALU-pervasive structures
  Merge ALUs with storage and communication structures
  High availability of ALUs rather than high utilization
  Slashes access latencies to save time and energy
Cells of logic/memory/data-passing for single-cycle actions
  Optimized for space/energy/performance
  Optimized for memory bandwidth utilization
Emphasis on fine-grain nearest-neighbor data-movement structures
  Direct access to adjacent state storage
  Enables communication through nearest neighbor
Virtual objects in global name space
  Data, instructions, synchronization
  Intra-medium packet-switched wormhole routing
  Dynamic allocation

19 Prior Art: Non-von Neumann Architectures
Dataflow
  Static (Dennis)
  Dynamic (Arvind)
Systolic Arrays
  Streaming (HT Kung)
Neural Networks
  Connectionist
Processor in Memory (PIM)
  Bit or word-level logic on-chip at/near sense amps of memory
  SIMD (Iobst) or MIMD (Kogge)
Cellular Automata (CA)
  Global emergent behavior from local rules and state
  von Neumann (ironically)
Continuum Computer Architecture
  CA with global naming and active packets (Sterling/Brodowicz)
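The cellular automata entry, global behavior emerging from local rules and state, can be made concrete with a one-dimensional elementary automaton. The sketch below is only an illustration of that idea and is not taken from the talk: the function name `step`, the choice of Rule 110, and the wrap-around boundary are assumptions made for the example.

```python
def step(cells, rule=110):
    """Advance a 1-D binary cellular automaton by one generation.

    Each cell's next state depends only on itself and its two nearest
    neighbors (periodic boundary). The 3-bit neighborhood selects one bit
    of the 8-bit rule number.
    """
    n = len(cells)
    out = []
    for i in range(n):
        left, mid, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (mid << 1) | right
        out.append((rule >> index) & 1)
    return out


cells = [0, 0, 0, 1, 0]
gen1 = step(cells)
gen2 = step(gen1)
print(gen1)  # [0, 0, 1, 1, 0]
print(gen2)  # [0, 1, 1, 1, 0]
```

No cell ever reads state beyond its immediate neighbors, yet the pattern as a whole evolves; that locality is the property the later slides credit to cellular automata and to CCA.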

20 Properties
Local communication
  Systolic array: mostly
  Dataflow: classically not
  Processor in Memory: logic at the sense amps
  Cellular Automata: fully
  Continuum Computer Architecture: fully, with pipelined packet switched
  Neural networks: not
Event driven for asynchrony management
  Systolic array: classically not, iWarp yes
  Dataflow: yes
  PIM: no
  Cellular Automata: can be
  CCA: yes
  Neural networks: classically no, neuromorphic yes
Merged logic/storage/communication
  Systolic array: yes
  Dataflow: classically not, possible
  PIM: yes
  Cellular Automata: yes
  CCA: yes
  Neural networks: yes

21 [Diagram of a cell: three tag/value pairs, match logic, register file, ALU, network interface, control.]
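The slide gives only the component labels, so how the parts interact is not stated. One plausible reading, assumed here in the spirit of dynamic dataflow tag matching, is that an arriving value is held under its tag until a second value with the same tag arrives, at which point the ALU fires. The class name `Cell`, the method `receive`, and the two-operand firing rule are all hypothetical.

```python
import operator


class Cell:
    """Toy cell: tagged storage, match logic, and a two-operand ALU.

    Assumption for illustration only: an operation fires when two values
    carrying the same tag have arrived.
    """

    def __init__(self, op=operator.add):
        self.op = op        # the ALU operation
        self.waiting = {}   # tag -> first operand that arrived

    def receive(self, tag, value):
        """Handle one incoming (tag, value) packet.

        Returns a (tag, result) packet if the match logic finds a partner
        for this tag, otherwise stores the value and returns None.
        """
        if tag in self.waiting:
            first = self.waiting.pop(tag)
            return (tag, self.op(first, value))
        self.waiting[tag] = value
        return None


cell = Cell()
print(cell.receive("x", 2))   # None: no partner yet, value is stored
print(cell.receive("y", 10))  # None: different tag, also stored
print(cell.receive("x", 3))   # ('x', 5): tags match, ALU fires
```

Execution here is driven by the arrival of data rather than by a program counter, which is the event-driven property listed for dataflow and CCA on the preceding slide.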

22 [Diagram of the CCA packaging hierarchy. Labels: fonton, CCA die, TSVs and on-chip network, 3-D die stack, integrated optical transceiver, optical fiber bundle, CCA system.]

23 Performance
Maximum ALUs
  For high utilization
Maximum memory bandwidth
Lots of inter-cell communication bandwidth
Reduced overhead
Reduced latency
Adaptive routing for contention avoidance
Multi-variate storage beyond the bit
Multi-variate logic beyond base-2 Boolean
Dynamic adaptive execution model

24 Energy
Minimize distance between elements
Permeate structures with ALUs
Short latencies between memory & registers
Multiple clock rates to match timings between logic and storage and neighbor cell memory/register access
Eliminate large caches and multi-level caches
Eliminate speculative execution

25 Questions can be Answered Now
1. Trade-offs of DRAM bits and SRAM bits
   Space, Time, Energy
2. Cell rules derived from parallel execution model
3. Cell granularity
   Breakpoint where mitosis is better
4. Design of logic cell (Fonton)
   Very simple compared to full conventional processors
5. Reference implementation
   Emulation, Simulation, FPGA, ASIC, custom
6. 3-D packaging
7. Software environment
8. Application analysis

26 Conclusions
New high density functional structures are required at end of Moore's Law and are emerging
Reactive Runtime systems supported by innovations in hardware architecture mechanisms will exploit extremes of parallelism at high efficiency
Neo-Digital age advances beyond von Neumann architectures to maximize execution concurrency and react to uncertainties of asynchrony
