Revisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison


1 Revisiting the Past 25 Years: Lessons for the Future Guri Sohi University of Wisconsin-Madison

2 Outline
  VLIW
  OOO Superscalar
  Enhancing Superscalar
  And the future

3 Beyond pipelining to ILP
  Late 1980s to mid 1990s
  Search for post-RISC architecture
    More accurately, instruction processing model
  Desire to do more than one instruction per cycle: exploit ILP
  VLIW/EPIC
  Out-of-order (OOO) superscalar

4 VLIW/EPIC School
  Descendants of HPC experience (array processors)
  Search for independence (by compiler)
  Express independence in static program
  Take program/algorithm parallelism and mold it to a given execution schedule for exploiting parallelism
  Strive for efficiency
    Static scheduling to saturate a resource

5 VLIW/EPIC School
  Creating effective parallel representations (statically) introduces several problems
    Predication
    Statically scheduling loads
    Exception handling
    Recovery code
  Lots of research addressing these problems

6 OOO Superscalar
  Not from HPC school
  Non-scientific influence important (e.g., branch prediction)
  No static search/representation for independence
  Arguably, statically representing dependent operations (e.g., accumulator) would help
    Improvements to superscalar were in this direction

7 OOO Superscalar
  Create dynamic parallel execution from sequential static representation
    Dynamic dependence information is accurate
    Execution schedule is flexible
  Parallelism in application/algorithm required, but program representation is sequential
  None of the problems associated with trying to create a parallel representation statically
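The idea on this slide, a parallel schedule unwound at run time from a sequential instruction stream, can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not a description of any real pipeline: unit latency, unlimited functional units, and register renaming, so that only true (read-after-write) dependences constrain the schedule. The function name `dataflow_schedule` and the register names are hypothetical.

```python
from typing import Dict, List, Tuple

# Each instruction is (destination register, source registers), listed in
# sequential program order. All register names are made up for illustration.
Instr = Tuple[str, Tuple[str, ...]]


def dataflow_schedule(program: List[Instr]) -> List[int]:
    """Return the earliest cycle in which each instruction can execute.

    Assumptions (not from the slides): unit latency, unlimited functional
    units, and renaming, so only true dependences delay an instruction.
    """
    ready_at: Dict[str, int] = {}  # cycle at which each register value is available
    cycles: List[int] = []
    for dest, srcs in program:
        # An instruction may start once all of its source values are ready.
        # Sources never written by the program are treated as ready at cycle 0.
        start = max((ready_at.get(s, 0) for s in srcs), default=0)
        cycles.append(start)
        ready_at[dest] = start + 1
    return cycles


program: List[Instr] = [
    ("r1", ("a",)),        # r1 = load a
    ("r2", ("r1",)),       # r2 = r1 + 1
    ("r3", ("b",)),        # r3 = load b
    ("r4", ("r3",)),       # r4 = r3 * 2
    ("r5", ("r2", "r4")),  # r5 = r2 + r4
]

schedule = dataflow_schedule(program)
print(schedule)  # [0, 1, 0, 1, 2]
```

The five instructions are written sequentially, yet the two independent chains overlap and the whole sequence finishes in three cycles. Nothing in the program text marks the chains as independent; the overlap falls out of tracking dependences alone.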

8 VLIW/Superscalar
  Superscalar not as efficient as VLIW
    Has all this extra hardware for dynamic dataflow execution
    Hard to saturate a resource the way VLIW does
  But provides natural (sequential) interface for program generator
  Much more adaptable to run-time uncertainties
    E.g., resources changing dynamically

9 Lessons from VLIW/Superscalar
  Wisdom from HPC is natural to apply, but is this a good idea?
  Hardware efficiency may not be the best notion of efficiency
  Parallel execution can be achieved even without a parallel representation

10 Lessons from VLIW/Superscalar
  Dataflow execution is much more flexible and adaptable than control flow execution
    E.g., new forms of speculation easily added
    E.g., resource architecture easily changed
  (Hardware) overheads for achieving dynamic dataflow execution are worth the effort

11 Enhancing Superscalar
  How can we build a very large window superscalar processor which can process lots of instructions per cycle?
  Understand the characteristics of the dynamic dataflow graph and exploit patterns/localities
    Values tend to be used locally; most parallelism is non-local
  Break up centralized hardware into clusters
    Most value communication within cluster
    Reduce load on inter-cluster communication network

12 Lessons from Beyond Superscalar
  Parallelism derived from sequential program (instruction stream) has patterns
    Local groups of instructions are dependent
      Most value communication happens here
    Independence is in non-local groups
      Few values passed between independent groups
    Altering this natural pattern requires a lot of storage
  Dependence relationships are highly stable and thus predictable when unknown

13 Lessons from Beyond Superscalar
  For good parallelism exploitation, dependent operations should be packed together
    Opposite of conventional wisdom
    Most value communication local within core
    Reduced demand for inter-core communication network
  To get parallelism, focus on dependence, not independence
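The claim that packing dependent operations together reduces communication can be checked on a small example. The sketch below is illustrative only: the seven-instruction program, the two-cluster machine, and the helper `cross_cluster_values` are all assumptions made for this example. It counts how many values are produced in one cluster and consumed in another under two placements.

```python
from typing import Dict, List, Tuple

# (destination, sources) in sequential program order; names are hypothetical.
Instr = Tuple[str, Tuple[str, ...]]


def cross_cluster_values(program: List[Instr], cluster_of: List[int]) -> int:
    """Count operand reads whose producer ran in a different cluster."""
    producer: Dict[str, int] = {}  # register -> index of instruction that wrote it
    crossings = 0
    for i, (dest, srcs) in enumerate(program):
        for s in srcs:
            if s in producer and cluster_of[producer[s]] != cluster_of[i]:
                crossings += 1
        producer[dest] = i
    return crossings


# Two independent dependence chains (r1->r2->r3 and s1->s2->s3) and one combine.
program: List[Instr] = [
    ("r1", ("a",)),
    ("r2", ("r1",)),
    ("r3", ("r2",)),
    ("s1", ("b",)),
    ("s2", ("s1",)),
    ("s3", ("s2",)),
    ("t", ("r3", "s3")),
]

# Placement A: each dependence chain stays in one cluster.
by_dependence = [0, 0, 0, 1, 1, 1, 0]
# Placement B: consecutive instructions alternate between clusters.
round_robin = [0, 1, 0, 1, 0, 1, 0]

print(cross_cluster_values(program, by_dependence))  # 1
print(cross_cluster_values(program, round_robin))    # 5
```

Both placements expose the same parallelism between the two chains, but grouping by dependence sends a single value across the inter-cluster network, while the alternating placement sends five.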

14 More Lessons
  For good parallelism exploitation, need lots of temporary storage
  Fast access to large storage necessitates creating and exploiting localities
  Creating localities for value communication implies grouping together dependent operations
  Is sequential (or "pipelined") a preferred way of representing parallel computation?

15 Summarizing Lessons
  HPC experience may not be the best for small-scale parallelism
  Hardware efficiency may not be the best notion of efficiency
  Dataflow execution works nicely to unwind available parallelism from a sequential program
    Overheads may be worth paying

16 Summarizing Lessons
  Statically representing dependence may be better than representing independence
    Easier dataflow execution, optimizing value communication, minimal resource assumptions, etc.
  Statically representing parallelism may create problems that are hard to deal with
    E.g., exceptions in VLIW, I/O in transactions
  And many others

17 The Multicore Generation
  How to achieve parallel execution on multiple processors?
  Over four decades of conventional wisdom in parallel processing
    Mostly in the scientific application/HPC arena
  Use this as basis
    Create a program with parallelism expressed statically

18 Hardware Going Forward
  Multiple general-purpose processing cores
  Some special-purpose hardware
    GPUs, specialized units, etc.
  Pool of available (i.e., powered on) resources might change frequently
  Need to optimize storage of values (caches) and value communication (interconnect)
  Use of software (e.g., VM) to hide hardware detail

19 How to use future hardware?
  Program in parallel
  Teach students about parallel programming
  Transactional memory
  Etc.
  Do we really believe this?

20 Going Forward
  Programmers are going to continue to express computation in familiar ways: sequential, object-oriented programs
  May use a parallel algorithm, but it likely won't be a statically parallel program
  How are we going to make it work?

21 Abstraction, Sequential, Dataflow
  Then: abstraction is a friend of software
  Now: abstraction is going to help us use future hardware
  Program is going to be a sequential representation of abstractions of computations, which are going to be executed on a heterogeneous pool of hardware resources in a dataflow manner

22 Lessons Applied
  OO programming naturally creates groups of dependent operations
    Can optimize value communication
    Can optimize storage for values (internal/external)
    Can optimize traditional cache operations for many values

23 Lessons Applied
  Can process methods (chunks of dynamic instructions) in a dataflow manner
    Don't care how internals of method are implemented
    Don't care when, where, and how it is executed
    Only data flow matters
  Dataflow execution can easily be unwound from sequential representation
    With right granularity of methods

24 Lessons Applied
  Dynamic dataflow execution probably not as efficient as bare-bones parallel program, but other efficiencies probably more important
  Achieving dataflow execution from sequential program probably going to have software (or hardware) overhead, but likely worth it

25 Dynamic Serialization: What?
  Data-driven parallel execution from sequential program
  Data-centric (dynamic) expression of dependence
  Determinate, race-free execution
  No locks and no explicit synchronization
    Easier to write, debug, and maintain
  No speculation à la TLS or TM
  Comparable or better performance than conventional parallel models

26 How? Big Picture
  Write program in a well-structured object-oriented style
    Method operates on data of associated object (ver. 1)
  Identify parts of program for potential parallel execution
    Make suitable annotations as needed
    Don't impose how parallelism is executed
  Dynamically determine data object touched by selected code
    Identify dependence
  Program thread assigns selected code to bins in a determined (sequential) order

27 How? Big Picture
  Serialize computations to same object
    Enforce dependence
    Assign them to same bin; delegate thread executes computations in same bin sequentially
  Do not look for/represent independence
    Falls out as an effect of enforcing dependence
    Computations in different bins execute in parallel
  Updates to given state happen in same order as in sequential program
    Determinism
    No races
    If sequential execution is correct, parallel execution is correct (same input)
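The two "Big Picture" slides can be sketched in a few lines. This is a minimal illustration of the bin-and-delegate idea, not the Prometheus API: the class `Serializer`, its methods `delegate` and `finish`, and the `Account` example are all hypothetical names introduced here. The program thread walks the work in sequential order and places each computation in the bin of the object it touches; one delegate thread per bin drains its queue in FIFO order.

```python
import queue
import threading
from typing import Any, Callable, Dict, List


class Serializer:
    """Hypothetical runtime: one FIFO bin and one delegate thread per bin."""

    def __init__(self, num_bins: int = 2) -> None:
        self._bins: List[queue.Queue] = [queue.Queue() for _ in range(num_bins)]
        self._bin_of: Dict[int, int] = {}  # object identity -> bin index
        self._threads = [
            threading.Thread(target=self._drain, args=(b,), daemon=True)
            for b in self._bins
        ]
        for t in self._threads:
            t.start()

    def _drain(self, bin_: queue.Queue) -> None:
        # Delegate thread: run the computations of one bin strictly in order.
        while True:
            task = bin_.get()
            if task is None:  # sentinel placed by finish()
                return
            fn, obj, args = task
            fn(obj, *args)

    def delegate(self, obj: Any, fn: Callable[..., None], *args: Any) -> None:
        # Called only by the program thread, in sequential program order.
        # Objects are spread over bins on first use; every later computation
        # on the same object lands in the same bin. Callers must keep the
        # objects alive, since identity is used as the key.
        index = self._bin_of.setdefault(id(obj), len(self._bin_of) % len(self._bins))
        self._bins[index].put((fn, obj, args))

    def finish(self) -> None:
        for b in self._bins:
            b.put(None)
        for t in self._threads:
            t.join()


class Account:
    def __init__(self) -> None:
        self.balance = 0

    def apply(self, amount: int) -> None:
        # Deliberately order-sensitive, so a reordering would change the result.
        self.balance = self.balance * 2 + amount


def run(parallel: bool) -> List[int]:
    accounts = [Account() for _ in range(3)]
    runtime = Serializer(num_bins=2) if parallel else None
    for i in range(30):
        target = accounts[i % 3]
        if runtime is not None:
            runtime.delegate(target, Account.apply, i)
        else:
            target.apply(i)
    if runtime is not None:
        runtime.finish()
    return [a.balance for a in accounts]


print(run(parallel=True) == run(parallel=False))  # True
```

No locks appear in `Account`, and no independence is declared anywhere: updates to one object are ordered because they share a bin, and updates to objects in different bins may overlap. Under CPython's global interpreter lock the delegate threads interleave rather than run truly in parallel, so the sketch demonstrates the ordering discipline and the resulting determinism, not a speedup.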

28 Methodology
  Study existing parallel programs
  Empirical comparison of multithreaded vs. Prometheus
    x86-64 multi-core and ccNUMA servers
    64-bit binaries, maximum optimization
  Prometheus implementation
    Convert benchmark to idiomatic OOP program in C++
      Use objects, inheritance, STL containers
    Parallelize same operations using Prometheus
  Prometheus version may be more fine-grained
  Some unavoidable differences due to locks, shared data

29 Hardware configurations
  μ-arch: AMD Barcelona (Phenom 9850, Opteron 8350, Opteron 8356); Intel Nehalem (Core i7 965, Xeon X5550)
  Per-processor rows: Sockets, Cores, Threads, Total contexts, Clock (GHz), Memory (GB)

30 Benchmarks
  Program        Source    Language  Synchronization             Description
  barnes-hut     Lonestar  C++       barrier                     N-body simulation
  black-scholes  PARSEC    C         barrier                     financial analysis
  bzip2          pbzip2    C         mutex, condition variables  compression
  canneal        PARSEC    C++       atomic, optimistic          VLSI CAD
  dedup          PARSEC    C         mutex, condition variables  enterprise storage
  histogram      Phoenix   C         barrier                     image analysis
  reverse index  Phoenix   C         mutex                       web indexing
  word count     Phoenix   C         barrier                     text analysis

31 Micro-benchmark results

32 Multicore results

33 Multi-socket results

34 Conclusions
  Lessons from the past 25 years are going to be important for the future
  Think parallel, use parallel algorithms, but program sequentially!
    Focus on dependence, not independence
  Techniques like dynamic serialization can do as well as or better than parallel programming techniques for achieving parallel execution!

35 Questions?
