Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting

1 Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting. Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars. International Symposium on Microarchitecture (MICRO), 2016. October 18, 2016.

7 Rampant Dynamism in Datacenters. Dynamism: dynamic factors that affect application runtime environments, here microarchitectural flexibility, co-running of applications, and platform diversity. Dynamism affects the runtime availability of resources.

19 Static Compiler Optimizations. Compilation assumptions might not be met at runtime. Resource-dependent static optimizations do not react to dynamism. Loop tiling: restructures the memory access pattern to exploit data reuse; conceptualized before the multicore era, which presented little dynamism. [Figure: Static vs. Ideal tiling under four scenarios: normal, co-running application, partitioned cache, different architecture.]
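
To make the loop-tiling idea on this slide concrete, the following is a minimal sketch of a tiled matrix multiply. It assumes square tiles and a hypothetical `tile` parameter; it is an illustration only, not the code that Polly or ShapeShifter generates. The `tile` value plays the role of the resource-dependent constant that a static compiler would fix at compile time.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for list-of-lists matrices."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=32):
    """Same computation, restructured into tile x tile x tile blocks.

    `tile` is an illustrative parameter (an assumption of this sketch).
    Each block touches only a small part of A, B and C, so data loaded
    into the cache can be reused before it is evicted.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer loops walk over blocks.
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # Inner loops stay inside one block; min() handles edges.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + tile, p)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

Both functions compute the same result for any positive `tile`; only the order of memory accesses changes, which is why the best value depends on how much cache the loop actually has at runtime.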

24 Co-runner Tiling Comparison. [Figure: three panels, each comparing Static vs Dynamic.] Dynamism requires rethinking cache tiling.

32 Design Objectives. Dynamic: should react to changes in runtime environment. High accuracy: should identify a high-performance tiling strategy. Low overhead: should have low dynamic performance overhead. Current techniques are not enough: white-box approaches; math kernel libraries like Intel MKL, ATLAS. [Table: White-box approach and BLAS libraries rated on Dynamic, Accuracy, Low-overhead.] Online generation of a black-box model.

33 Shape Shifter

40 Key Components. Dynamic tile generation: companion thread (Protean Code + Polly). Detect tiling opportunities: Runtime Environment Monitor (REM). Find a high-performance tile: Tile Selector. [Diagram labels: Application 1, Companion 1, Application 2, Companion 2, tiled loop, code cache, dynamic compiler, companion controller, REM, tile selector, ShapeShifter.] Protean Code, MICRO 2014 and Polly, PLDI 2008.

46 Overview. Online training: select tile size and generate training data. Tile selection: generate black-box model and select suitable tile shape. Monitored execution: detect tiling opportunities. [Diagram labels: dynamic compiler, tile selector, REM; online training (find tile size, training set, collect cache stats); tile selection (tile performance model, choose tile); monitored execution; runtime environment change.]
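
The three phases on this slide can be read as a small state machine. The sketch below is one possible reading, assuming that a runtime environment change during monitored execution sends the system back to online training; the slide shows the "runtime environment change" label but the exact transition is an assumption here. The names `Phase` and `next_phase` are hypothetical.

```python
from enum import Enum, auto


class Phase(Enum):
    ONLINE_TRAINING = auto()      # select tile size, generate training data
    TILE_SELECTION = auto()       # build black-box model, select tile shape
    MONITORED_EXECUTION = auto()  # run the tiled loop, watch for changes


def next_phase(phase, environment_changed=False):
    """Return the phase that follows `phase` (assumed transitions)."""
    if phase is Phase.ONLINE_TRAINING:
        return Phase.TILE_SELECTION
    if phase is Phase.TILE_SELECTION:
        return Phase.MONITORED_EXECUTION
    # Monitored execution continues until the environment changes,
    # for example when a co-runner arrives or the cache is repartitioned.
    if environment_changed:
        return Phase.ONLINE_TRAINING
    return Phase.MONITORED_EXECUTION
```

Under this reading the cost of training and model building is paid only after a change, and steady-state execution stays in the monitored phase.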

52 Tile Selection. Black-box model is generated online. Uses tile parameters and IPC from tile data. Model is specific to the application and its current runtime environment. Predicts a tile suitable to the current runtime environment. [Diagram: training data feeds a black-box model relating tile parameters to IPC; the model gives IPC pred over the set of tile shapes of the predicted size, and the shape at IPC max is T shapeshifter.]
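
The selection step above can be sketched as follows. The slides do not say which model family is used, so this sketch substitutes a simple k-nearest-neighbour regressor as a stand-in black-box model; the function names, the candidate dimensions, and the distance metric are all assumptions made for illustration.

```python
import math


def predict_ipc(training, tile, k=3):
    """Predict IPC for `tile` from (tile_shape, measured_ipc) samples.

    Stand-in black-box model (an assumption): average the IPC of the k
    training tiles nearest to `tile` in log2 space.
    """
    def dist(p, q):
        return math.sqrt(sum((math.log2(x) - math.log2(y)) ** 2
                             for x, y in zip(p, q)))

    nearest = sorted(training, key=lambda sample: dist(sample[0], tile))[:k]
    return sum(ipc for _, ipc in nearest) / len(nearest)


def candidate_shapes(size, dims=(4, 8, 16, 32, 64, 128)):
    """All 3D shapes (ti, tj, tk) drawn from `dims` whose volume is `size`."""
    return [(i, j, size // (i * j))
            for i in dims for j in dims
            if size % (i * j) == 0 and size // (i * j) in dims]


def select_tile(training, size, k=3):
    """Pick the candidate shape with the highest predicted IPC."""
    return max(candidate_shapes(size),
               key=lambda shape: predict_ipc(training, shape, k))
```

A call such as `select_tile(samples, 4096)` would return one shape of volume 4096. When several shapes share the top predicted IPC, `max` returns the first one encountered, so the choice among ties is arbitrary in this sketch.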

56 Insight for Co-optimization. Challenging to retile multiple applications simultaneously. Tile shape and tile size contribute differently to cache interference. Co-optimization: find tile size for apps and then tile shape one-by-one.
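
One way to picture the size-then-shape strategy is the two-phase search below. It is a hedged sketch, not the authors' algorithm: `co_optimize`, `shapes_of`, and the `measure` callback (which scores a full assignment of 2D tile shapes to applications, higher being better) are hypothetical names, and probing each size with a single representative shape is an assumption.

```python
def shapes_of(size, dims=(2, 4, 8, 16)):
    """All 2D shapes (rows, cols) drawn from `dims` with rows * cols == size."""
    return [(r, size // r) for r in dims
            if size % r == 0 and size // r in dims]


def co_optimize(apps, sizes, shapes_of, measure):
    """Pick a tile shape per application in two phases.

    measure(config) is a caller-supplied score for a dict mapping each
    application to a shape; in a real system it would come from runtime
    measurements, here it is an assumption of the sketch.
    """
    # Start every application from the first shape of the first size.
    config = {app: shapes_of(sizes[0])[0] for app in apps}

    # Phase 1: settle a tile size for each application, probing every
    # candidate size with one representative shape.
    for app in apps:
        probes = [shapes_of(size)[0] for size in sizes]
        config[app] = max(probes, key=lambda s: measure({**config, app: s}))

    # Phase 2: keep each size fixed and refine the shape one application
    # at a time, leaving the other applications untouched.
    for app in apps:
        rows, cols = config[app]
        options = shapes_of(rows * cols)
        config[app] = max(options, key=lambda s: measure({**config, app: s}))

    return config
```

Splitting the search this way evaluates far fewer configurations than trying every combination of shapes for every application at once, at the cost of possibly missing the joint optimum.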

57 Experimental Evaluation

61 Methodology. PolyBench application suite. Three sources of dynamism: co-running applications; microarchitectural flexibility (cache partitioning); platform diversity. Three platforms: AMD Bulldozer, Intel Haswell, Intel Atom. Tiling is performed in the shared cache.

66 Co-runner. Arrival/departure of a co-runner. Static Best: best tile with no co-runner. Co-runner change: syr2k to correlation. Change in cache allocations.

67 Microarchitectural Flexibility. Microarchitectural flexibility: cache partitioning. Static Best: best tile with no cache resizing (16-way enabled).

72 Platform Diversity. Platform diversity: Intel Atom, Intel Haswell and AMD Bulldozer. Static Best: best tile on AMD Bulldozer.

79 Conclusions. ShapeShifter: an end-to-end dynamic loop co-optimization. Adapt tiling strategy to the application runtime environment. Loop co-optimization: tiling multiple applications on the fly. Novel black-box modelling approach: fast and accurate. ShapeShifter achieves significant performance improvements across different sources of dynamism.

80 Q/A

81 Why does the black-box model work? There is a trade-off between the best tiling strategy and performance; we show that ShapeShifter (SS) chooses a close one. Why 3D tiling? We build on Polly, but the technique is not restricted to 3D tiling. Also memorize the compilation times. Two reasons for slowdown: the tile doesn't matter, or the black-box model is not good enough. Remember cache sizes. Prior work refresh.

82 Overhead. Companion thread. Three sources of overhead: dynamic compilation (136 ms on Intel Haswell, 430 ms on AMD Bulldozer); code redirection; training.

83 Overhead: training

84 Black-box model. Multiple high-performance tiles. ShapeShifter chooses one of the high-performance tiles.

85 ShapeShifter vs Dynamic Oracle. ShapeShifter achieves 93% of the dynamic oracle performance on average.

86 Co-runner

Continuous Shape Shifting: Enabling Loop Co-optimization via Near-Free Dynamic Code Rewriting Animesh Jain, Michael A. Laurenzano, Lingjia Tang and Jason Mars University of Michigan, Ann Arbor {anijain,