Breaking Cyclic-Multithreading Parallelization with XML Parsing. Simone Campanoni, Svilen Kanev, Kevin Brownell, Gu-Yeon Wei, David Brooks



Scope
Today's commodity platforms include multiple cores
Use multiple cores for a single program
Distribute loop iterations among cores, a.k.a. Cyclic-Multithreading (CMT)

Cyclic-Multithreading (CMT)
This talk is about the limits of CMT
HELIX is a re-evaluation of CMT for today's multicores

Outline
The HELIX research project
The XML library
Limits of Cyclic-Multithreading parallelizations

Team of the HELIX Project

Project Goal

HELIX Performance
LOC values ending in a comma are truncated in the source; the HELIX-RC speedup values are not available.

Suite  Benchmark  LOC      HELIX-RC Speedup
CFP    mesa       42,
CFP    art        1,
CFP    equake     1,
CFP    ammp       9,805
CINT   gzip       5,
CINT   vpr        11,
CINT   mcf        1,
CINT   parser     7,
CINT   bzip2      3,
CINT   twolf      17,875
       libxml2    170,893

The HELIX Execution Model
Iterations are grouped by modular value (iteration index modulo the number of cores)
Cores are organized as a ring
Thread-level parallelism (TLP) is extracted between loop iterations
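The grouping of iterations by modular value can be illustrated with a minimal sketch. It assumes a loop with a known iteration count and a fixed number of cores; the function names `assign_iterations` and `ring_successor` are hypothetical and are not part of HELIX or its compiler.

```python
from collections import defaultdict


def assign_iterations(num_iterations, num_cores):
    """Group iteration indices by their value modulo the number of cores."""
    groups = defaultdict(list)
    for i in range(num_iterations):
        groups[i % num_cores].append(i)
    return dict(groups)


def ring_successor(core, num_cores):
    """Core that runs the next iteration; the last core wraps around to core 0."""
    return (core + 1) % num_cores


# Example: 10 iterations distributed cyclically over 4 cores.
schedule = assign_iterations(10, 4)
for core, iterations in sorted(schedule.items()):
    print(f"core {core}: iterations {iterations}")
```

Under this assignment, iteration i + 1 always runs on the ring successor of the core that runs iteration i, so a value produced by one iteration only has to travel to the neighbouring core.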

Core-to-Core Communication
Status of HELIX: static code generation; number of cores decided at compile time
Communication: the main obstacle
Today's multicore: Intel Hyper-Threading [CGO 2012, IEEE Micro 2012, DAC 2012]
Enhanced multicore: [ISCA 2014]



Algorithm

Algorithm: Nested Tree Nodes

Algorithm: Single Element Analysis

Algorithm: CMT Opportunity


Evaluation
Architecture: conventional multicore; ring cache [ISCA 2014]; 4 cores (Intel Atom-like)
Compiler: HELIX compiler HCCv3 [ISCA 2014]
Simulator: IRSim, an IR-based simulator [ISCA 2014]

Limits of HELIX

Limits of HELIX (2)

Limits of CMT
Oracle: control and data dependences, invariant variables, function pointers

Multiple CMT: Beyond the Single Loop Parallelism
Goal: Is there any hope in looking for parallelism among loops?
Execution model: Any iteration of any loop can be executed by any core; data and control dependences are properly satisfied
Constraint: No parallelism for recursive loops
Idealization: No communication cost; no dispatching cost; no cost to switch loop iteration
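The idealized execution model above can be approximated with a small scheduling sketch. It assumes each loop iteration is a task with a known cost and an explicit list of predecessor iterations, and it uses a simple greedy list scheduler; the function `ideal_makespan` and the example loops `A` and `B` are hypothetical and are not taken from the talk. The constraint on recursive loops is not modeled.

```python
import heapq
from collections import deque


def ideal_makespan(cost, deps, num_cores):
    """Greedy list scheduling with zero communication, dispatch and switch cost.

    cost: maps a task, e.g. ("A", 0) for iteration 0 of loop A, to its run time.
    deps: maps a task to the tasks that must finish before it may start.
    Returns the finish time of the last task.
    """
    remaining = {t: len(deps.get(t, ())) for t in cost}
    users = {t: [] for t in cost}
    for task, preds in deps.items():
        for pred in preds:
            users[pred].append(task)

    ready = deque(t for t in cost if remaining[t] == 0)
    running = []  # min-heap of (finish_time, task)
    now = 0
    while ready or running:
        # Any free core may take any ready iteration of any loop.
        while ready and len(running) < num_cores:
            task = ready.popleft()
            heapq.heappush(running, (now + cost[task], task))
        now, done = heapq.heappop(running)
        for user in users[done]:
            remaining[user] -= 1
            if remaining[user] == 0:
                ready.append(user)
    return now


# Two independent loops with two unit-cost iterations each, on 4 cores.
cost = {("A", 0): 1, ("A", 1): 1, ("B", 0): 1, ("B", 1): 1}
together = ideal_makespan(cost, {}, 4)

# Baseline: parallelize one loop at a time, so the loops run back to back.
loop_a = {t: c for t, c in cost.items() if t[0] == "A"}
loop_b = {t: c for t, c in cost.items() if t[0] == "B"}
one_at_a_time = ideal_makespan(loop_a, {}, 4) + ideal_makespan(loop_b, {}, 4)
print(together, one_at_a_time)
```

In this toy input, running iterations of both loops at once halves the finish time compared with parallelizing each loop in turn, which is the kind of opportunity the goal above asks about.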

Opportunity of Multiple CMT (MCMT)
Static data dependence graph (DDG): no hope
Dynamic DDG: great potential for parsing flat trees
Nested trees: require parallelism among same-loop invocations
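One way to see the gap between a static and a dynamic dependence graph is a sketch that builds the dynamic graph from a recorded trace of memory accesses. Everything here is an assumption for illustration: the trace format, the names `dynamic_ddg` and `critical_path`, the example of four sibling elements, and the conservative static graph that chains every iteration to the next because a compiler cannot prove the iterations touch different nodes.

```python
def dynamic_ddg(trace):
    """Build dependence edges actually observed between iterations.

    trace: list of (reads, writes) pairs of sets, one per iteration, in order.
    An edge (i, j) with i < j means iteration j must wait for iteration i.
    """
    edges = set()
    for j, (reads_j, writes_j) in enumerate(trace):
        for i in range(j):
            reads_i, writes_i = trace[i]
            # True, anti and output dependences; two reads never conflict.
            if writes_i & reads_j or reads_i & writes_j or writes_i & writes_j:
                edges.add((i, j))
    return edges


def critical_path(num_iterations, edges):
    """Longest dependence chain, counted in iterations."""
    depth = [1] * num_iterations
    # Edges always point forward, so sorting by target finalizes sources first.
    for i, j in sorted(edges, key=lambda edge: edge[1]):
        depth[j] = max(depth[j], depth[i] + 1)
    return max(depth, default=0)


# Hypothetical flat tree: four siblings, each iteration reads the shared
# document and writes only its own node.
trace = [({"doc"}, {f"node{i}"}) for i in range(4)]
observed = dynamic_ddg(trace)

# Conservative static view: each iteration may depend on the previous one.
static = {(i, i + 1) for i in range(len(trace) - 1)}

print(critical_path(len(trace), observed))  # chain length seen at run time
print(critical_path(len(trace), static))    # chain length assumed statically
```

With these inputs the observed graph has no edges, so all four iterations could run at once, while the conservative static graph serializes them.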


Conclusion
CMT: High performance if most of the time is spent in natural loops; libxml highlights the limits of CMT-based approaches
Multiple CMT: There is parallelism among multiple loops; dynamic analyses and/or code transformations are necessary
References: HELIX project
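The first conclusion, that CMT pays off only when most of the time is spent in natural loops, is consistent with Amdahl's law. The sketch below assumes that the time inside parallelized loops scales perfectly with the number of cores and that everything else stays sequential; the function name `amdahl_speedup` and the sample fractions are illustrative and are not measurements from the talk.

```python
def amdahl_speedup(loop_fraction, num_cores):
    """Upper bound on speedup when only loop_fraction of the run time scales."""
    if not 0.0 <= loop_fraction <= 1.0:
        raise ValueError("loop_fraction must be between 0 and 1")
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    sequential = 1.0 - loop_fraction
    return 1.0 / (sequential + loop_fraction / num_cores)


# Hypothetical fractions of run time spent in parallelizable loops, on 4 cores.
for fraction in (0.5, 0.9, 1.0):
    print(fraction, round(amdahl_speedup(fraction, 4), 2))
```

Even under these optimistic assumptions, a program that spends half of its time outside parallelizable loops cannot reach a 2x speedup on any number of cores.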

Thanks for your attention! Questions?
