Towards Intelligent Programing Systems for Modern Computing

Size: px

Start display at page:

Download "Towards Intelligent Programing Systems for Modern Computing"

Magdalene Parks
5 years ago
Views:

1 Towards Intelligent Programing Systems for Modern Computing Xipeng Shen Computer Science, North Carolina State University

2 Unprecedented Scale 50X perf Y202X 1000 petaflops 20mw power 2X power! Y petaflops 10mw power sources: SciDAC, IBM 2

3 Heterogeneity becomes Norm Massively parallel accelerators are becoming ubiquitous. 3

4 Thesis To address the challenges in modern computing, one of the keys exists in making programming systems more intelligent. For advancing programming systems, right problem formulating goes a long way. 4

5 Modern Computing Application Data analytics, Machine learning, TOP Algorithmic optimizer for data analytics [VLDB 15,ICML 15] Infrastructure Data centers, Cloud, IoT, Architecture Heterogeneous parallel processors, Emerging complex memory, GStreamline+ PORPLE Memory optimization for GPU [ASPLOS 11, Micro 14, ICS 16] 5

6 Yufei Ding TOP: Enabling Algorithmic Optimizations for Distance-Related Problems Up to 100s X speedups. VLDB 2015, ICML

7 Learning Problem Role of Compiler ML Algorithm ML experts Implementation compiler co- design Execution Runtime system/architecture Can compilers optimize algorithms? 7

8 Why algorithm level? Reason 1: Large benefits: orders of magnitude speedups at no extra cost. Reason 2: Compiler may outsmart ML experts. Really? 8

9 Example K- Means Triangular Inequality: a- b d a+b d b X a C 2 C 1 9

10 K-Means NIPS 2012 SIAM 2010 ICML

11 K-NN SSDM 10 VisionInterface 10 IJCNN 11 11

12 P2P: Point-to-Point Shortest Path SIAM 05 ALENEX 04 12

13 Observations TI has led to many enhanced algorithms across problems and domains. Applying TI well is tricky, hence the many manual efforts and publications. Thoughts Can we have an abstraction to represent all the problems? Can we then generalize the TI optimizations into compiler- based transformations? 13

14 Abstract Distance Problem (Q, T, D, R, C) Constraints Relation Query Point Set Distance Target Point Set KNN KNN join ICP NBody KMeans Shortest Distance 14

15 Abstract Distance- Related Problem Essence & 7 Principles of TI Optimizations Our Analysis and Abstraction KNN KNN join ICP NBody KMeans Shortest Distance 15

16 Key Insights See VLDB 15 for details. Reuse through Landmarks Spatial & temporal reuses Elasticity through hierarchical landmarks Efficient bounds update through ghosts for iterative alg. Order of comparison q2 L q1 t1 q3 t2 t3 16

17 TOP Framework problem semantic TOP API Compiler building blocks Opt Lib Abstract Distance- Related Problem Essence & 7 Principles of TI Optimizations KNN KNN join ICP NBody KMeans Shortest Distance 17

$TOP_defDistance(Euclidean); T = init(); changedflag = 1; while (changedflag){ N = TOP_findClosestTargets(1, S, T); TOP_update(T,$

18 Usage Systematic TOP API Basic algorithm description Compiler Ad hoc Staged program code TI Opt Lib Efficient execution TOP_defDistance(Euclidean); T = init(); changedflag = 1; while (changedflag){ N = TOP_findClosestTargets(1, S, T); TOP_update(T, &changedflag, N, S); } 18

19 K- Means (K=1024) Clustering results are same as original method s. Baseline: Classic K- means (16GB, 8- core Intel Ivy Bridge) Speedup (X) TOP Yinyang K-Means Code link in ICML 15 paper. 19

20 On K- Means Baseline: Classic K- means (16GB, 8- core) Speedup (X) TOP Yinyang K-Means XX 20

21 On All Benchmarks Speedups # distance calculations Speedups(X) by TOP version TOP Optimized Speedups(X) by manual version Manually Optimized TOP In TOP Optimized version Knn Knnjoin Kmeans ICP Nbody P2P Reference line Insight: 10 6 Knn The right abstraction Knnjoin and formulation turn a Kmeans ICP compiler into an automatic algorithm optimizer, Nbody P2P Reference line giving out large speedups In manual version Manually Optimized Average speedups: 50X vs 20X. Save at least 93% calculations. Intel i CPU and 8G memory 21

22 Modern Computing Application Data analytics, Machine learning, Infrastructure Data centers, Cloud, IoT, Architecture Heterogeneous parallel processors, Emerging complex memory, TOP Algorithmic optimizer for data analytics [VLDB 15,ICML 15] GStreamline+ PORPLE Memory optimization for GPU [ASPLOS 11, Micro 14, ICS 16] 22

23 Zheng Zhang Rutgers Univ) Bo Wu Colorado Mines) Guoyang Chen (Qualcomm) Overcome GPU Limitations 23

24 Graphic Processing Unit (GPU) a SIMD group (warp) Massive parallelism Favorable computing power cost effectiveness energy efficiency Xipeng Shen xshen5@ncsu.edu 24

25 Challenges Scheduling Limitations Irregular Mem & Control Dyn Task Parallelism 25

26 Our Explorations Compiler-based software solutions 11/06 9/07 5/09 6/10 3/11 6/12 9/13 10/11 2/13 5/15 12/15 2/17 12/14 6/15 6/16 5/17 CUDA release LCPC talk by David Kirk IPDPS cross input adap. opt. ICS remove thread diverg. dyn. PACT treat synch. correct. GPU2CPU PPOPP mem coalesc. PACT NVM for GPU Micro PORPLE ICS SM centric ICS Multiview IPDPS Co- sched on Fused System Sweet KNN; VersaPipe; Lean DNN; ASPLOS GStreamline ICS syn. relax. & opt. GPU2CPU HotOS Co- run on Fused Micro Free Launch PPOPP EffiSha 26

27 Solutions Scheduling Limitations Irregular Mem & Control Dyn Task Parallelism SM- Centric & EffiSha [ics15,ppopp17] GStreamline & PORPLE [asplos11,micro14, ics16] FreeLaunch [micro15] Compiler-based software solutions Monday PPoPP Session 1 27

28 Dynamic Irregularities memory control flow (thread divergence) P[ ] = { 0, 5, 1, 1, 2, 7, 3, 4, 3, 5, 6, 2} 7} tid: tid: if (A[tid]) {...}... = A[P[tid]]; A[ ]: A[ ]: { a mem seg. for (i=0;i<a[tid]; i++) {...} Degrade throughput by up to (warp size - 1) times. (warp size = 32 in modern GPUs) Xipeng Shen xshen5@ncsu.edu 28

29 Solution 1: Thread-Data Remapping 4 trans/warp Irregularity in a warp: problematic; across warps: okey! { a mem seg. Principle of solution: Turn intra-warp irreg. into reg. or inter-warp irreg. 1 trans/warp { a mem seg. 29

30 Trans-1: Data Reordering original tid: = A[P[tid]]; P[ ] = {0,5,2,3,2,3,7,6} A[ ]: <redirection> <relocation> maintain mapping between threads & data values transformed... = A [Q[tid]]; A [ ]: Q[ ] = {0,1,2,3,2,3,6,7} tid: tid: thread ID; : a thread; : data access; : data relocation 30

31 Trans-2: Job Swapping Job = operations + data elements accessed original... = A[P[tid]]; A[ ]: P[ ] = {0,5,2,3,2,3,7,6} <redirection> tid: transformed newtid = Q[tid]; = A[P[newtid]]; Q[ ] = {0,4,2,3,1,5,6,7} tid: A[ ]: 31

32 G-Streamline [ASPLOS 2011] First framework enabling runtime thread-data remapping. CPU-GPU pipeline to hide transformation overhead. Kernel splitting to resolve dependences X speedups 32

33 Solution 2: Data Placement Global memory Texture memory Shared memory Constant memory L1/L2 cache Read-only cache Texture cache Xipeng Shen 33

34 GPU Memory Global memory Texture memory Shared memory Constant memory L1/L2 cache Read-only cache Texture cache coalescing; cache hierarchy 2D/3D locality; texture cache; read-only on-chip; bank conflicts broadcasting; cached; read-only private/shared read-only data 2D/3D locality; read-only Xipeng Shen 34

35 Data Placement Problem Global memory Texture memory Shared memory Constant memory (L1/L2 cache) (Read-only cache) (Texture cache)????? A B C D Data in a program 3X performance difference Xipeng Shen xshen5@ncsu.edu 35

36 Data Placement Problem Properties: Machine dependent Changes across models/generations Input dependent Changes across runs Options: Manual efforts by programmers? Offline autotuning? Xipeng Shen 36

37 PORPLE in a Whole offline online architect/user microkernels org. program MSL (mem. spec. lang.) PORPLE-C (compiler) mem spec access patterns staged program PLACER (placing engine) online profile input desired placement efficient execution More details in our Micro 2014 paper. Xipeng Shen xshen5@ncsu.edu 37

38 Properties of PORPLE Good portability to new memory Just need new MSL spec Program adapts automatically Adaptivity to new program inputs On-the-fly placement with placement-agnostic code. Generality to regular & irregular programs Static analysis + lightweight online profiling Xipeng Shen xshen5@ncsu.edu 38

39 GPU Models K20c M2075 C1060

40 Potential for Future Memory Systems 3D Stacked Memory Persistent Memory DRAM (NUMA) Xipeng Shen 40

algorithmic optimizer. Up to 100x speedups.

41 Final Takeaways Large potential of compilers for modern computing Right problem formulation is a key TOP An algorithmic optimizer. Up to 100x speedups. GStreamline PORPLE Portable solution to mem. complexity. Consistent speedups cross GPUs.

On-the-Fly Elimination of Dynamic Irregularities for GPU Computing

On-the-Fly Elimination of Dynamic Irregularities for GPU Computing Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen Graphic Processing Units (GPU) 2 Graphic Processing Units (GPU) 2 Graphic