Computing in the age of parallelism: Challenges and opportunities

Size: px

Start display at page:

Download "Computing in the age of parallelism: Challenges and opportunities"

Hope Johnston
6 years ago
Views:

1 Computing in the age of parallelism: Challenges and opportunities Jorn W. Janneck Computer Science Dept. Lund University Multicore Day, 23 Sep 2013

Hammond, and C. Batten Dotted line extrapolations by C.

2 why we are here Original data collected and plotted by M. Horowitz, F. Labonte, O.Shacham, K. Olukotun, L. Hammond, and C. Batten Dotted line extrapolations by C. Moore. C. Moore, Data Processing in ExaScale-ClassComputer Systems, April

3 physics of computing speed Circuits are still getting larger at a steady pace... but speed-up is no longer the result of a faster clock. Instead, we get more parallel devices. Intel desktop processors, top model for the resp. year. 3

4 physics of computing efficiency Performance = parallelism. Efficiency = locality. It really is that simple. Bill Dally approximate power in 28nm processor Bill Dally 4

5 sin r co e gl P ec U Pentium IV ( Northwood ) m i- c ult or P ec U Core i7 980X ( Gulftown ) pro ce y rra a or s s Epiphany-IV 5

6 programming parallel machines Pentium IV processor 1 Penryn quadcore multi-core 2 ~ 10s Am2045 MPPA, GPU 10s ~ 100s... Nvidia GT200 Virtex FPGA 100s ~ 1000s 6

7 programming parallel machines SOPM ( ajava ) Am2045 C w/ ilib CUDA C + VHDL TILE 64 PC102 GT200 SEAforth 40C18 Virtex VentureForth HDL 7

8 portability unicore Pentium IV then PDP-11 now instruction set all of the above, plus... number/kind of registers number and kind of processing elements address space (size, structure) Fujitsu SuperSPARC system software OS GUI... and possibly other 'stuff', like memories multicore processor array topology of interconnect GPU FPGA 8

9 programming sequential machines Pentium IV PDP-11 Fujitsu SuperSPARC

10 programming sequential machines Pentium IV von Neumann machine* C, Pascal, Forth, Java, Lisp, Fortran, Haskell,... PDP-11 Fujitsu SuperSPARC * for the purpose of this talk von Neumann machine single stored-program computer 10

11 a pivotal model of computing computer architects compiler writers programmers language designers 11

12 programming parallel machines? Pentium IV processor 1 Penryn quadcore multi-core 2 ~ 10s Am2045 MPPA, GPU 10s ~ 100s... Nvidia GT200 Virtex FPGA 100s ~ 1000s 12

13 our work Implement streaming applications efficiently on a wide range of computing platforms. Pentium IV processor 1 Penryn quadcore multi-core 2 ~ 10s Am2045 MPPA, GPU 10s ~ 100s... Nvidia GT200 language, methodology VHDL/Verilog synthesis software synthesis, IDE profiling, hardware synth., STHORM profiling, tools, software synth. software synthesis, applications software synthesis, applications UC Berkeley Xilinx Ericsson EPFL Lund University INSA Rennes Halmstad Högskolan Virtex FPGA 100s ~ 1000s 13

14 streaming applications how a video expert explains an MPEG decoding algorithm digital signal processing media processing video/audio coding image processing / analytics networking / packet processing automatic control cryptography gene sequencing... 14

15 dataflow computational kernels (actors) provide explicit concurrency execute in discrete atomic steps directed point-to-point connections lossless, order-preserving conceptually unbounded asynchrony, abstraction from time communicating streams of discrete data packets (tokens) no shared state Some things not represented in this model: time (incl. speed of actors and communication channels) execution policy of actors communication protocol and communication medium location and implementation technology of actors MPEG/ISO standardized RVC-CAL, a subset of our actor language, in

16 implementation parsing, translation composition, scheduling code generation partitioning architecture interconnect 16

17 questions we try to answer about stream programs Is it determinate? Does it do what it is supposed to do? Is it bounded? Does it deadlock? Can it be implemented efficiently? about their (more or less) parallel implementations Does it still work? Does it deadlock now? Is it sufficiently fast/efficient/small...? What/how many resources does it use? 17

18 dataflow profiling A B C trace: the structure of a computation trace is a DAG vertices are steps edges are dependencies dependencies mediated by tokens state variables A:1 B:1 C:1 A:2 B:2 C:2 A:3 B:3 C:3 18

we...... changed latency somewhere?... multiplexed resources?

19 analysis: post-mortem scheduling basic schedule of trace latency density laboratory for hypothetical designs/architectures What if we changed latency somewhere?... multiplexed resources?... did so dynamically?... used different communication fabric?... 19

20 trace analysis input/output tagging 20

21 trace metrics input (examples) latency (bytes) (incremental) power (Σ power) 21

22 critical path Critical path: the longest time-weighted sequence of events from the start of the program to its termination. 22

23 critical path metrics Impact Analysis: identifying potential for improvement

24 critical path & buffer optimization Objective: Evaluate a trade-off between the buffer size and the critical path length Minimum buffer size configuration for a given throughput Optimize only critical buffers (i.e. buffers that block steps in the current critical path) Critical Path Length CP length with min. buffer size Deadlock Region CP length with unbounded buffer size min. Buffers Size Buffer Size

25 critical path & buffer optimization the throughput is not significatively improved resonable optimal buffer size configuration 25

26 programmers computer architects compiler writers te ac to her o! s, a pivotal model of computing language designers 26

synchronization programming languages objects communication races types determinacy asynchrony threads design

27 here be dragons teaching parallelism There Is No More Sequential Programming. Why Are We Still Teaching It? panel title at the Intel Academic Forum, Supercomputing 08 Parallel Programming Programming preemption synchronization programming languages objects communication races types determinacy asynchrony threads design patterns variables tasks deadlock semaphores complexity data structures processors algorithms loops memory management design-by-contract 27

28 an alternative? Arvind, CSAIL, Massachusetts Institute of Technology Workshop on Directions in Multicore Programming Education, March 8,

29 the larger conversation? These are exciting times. computer architects compiler writers programmers Computing is losing its North Star. language designers Let's imagine that most computers have scores or hundreds of general-purpose cores. (And let's say we do not know exactly how many there are, or how they are connected.) Or programmable logic, or combinations of both. How would that affect... our idea of an algorithm? the languages and abstractions we use to program such a machine? the concept of a module? the things we need to do to validate programs? the techniques for making programs efficient? the concepts and the implementation of an operating system? the way we teach programming? 29

30 Thanks! Johan Eker Ericsson Edward A. Lee UC Berkeley Marco Mattavelli EPFL Carl von Platen Ericsson Dave B. Parlour Tabula Ian Miller Overture Mickaël Raulet INSA Rennes Mathieu Wipliez INSA Rennes Christophe Lucarz EPFL Shuvra S Bhattacharyya U Maryland Karl-Erik Arzen LTH Ruirui Gu Microsoft Endri Bezati EPFL Simone Casale Brunet EPFL Ghislain Roquier EPFL and many others. opendf.org Open Dataflow portal opendf.sf.net open source tool suite orcc.sf.net Open RVC-CAL Compiler ACTORS Project mpeg.chiariglione.org MPEG jwj@cs.lth.se 30

Buffer Dimensioning for Throughput Improvement of Dynamic Dataflow Signal Processing Applications on Multi-Core Platforms

Buffer Dimensioning for Throughput Improvement of Dynamic Dataflow Signal Processing Applications on Multi-Core Platforms Małgorzata Michalska, Endri Bezati, Simone Casale-Brunet, Marco Mattavelli EPFL