Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications

1 Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications. Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark

2 Motivation. Hardware trend: put more cores on a single chip. CPU-intensive programs exploit thread-level parallelism. Do more threads always win? No!

3 Optimal Number of Threads. Too many threads: more synchronization and more contention for system resources. Too few threads: resource underutilization. Who can decide the number? Not the programmer.

4 Why NOT? The input changes: working-set sizes vary. The system changes: available resources vary. The hardware changes: L2/L3 cache structure, size, etc. vary. The decision must be made at runtime.

5 Proposal. Programmer: "OK. I will create lots of threads" (> 128 threads). Compile the binary and distribute it. Thread Tailor combines threads and produces a new binary (16 threads). Combining threads: group several threads into a single thread. Threads in the same group are executed serially, on the SAME core.

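6 Details. Development: the profiler instruments the binary (> 128 threads), runs the instrumented code, and records profile information as graphs. Distribution: Thread Tailor collects system information, runs the combine algorithm, and passes the result to the code generator, which produces the combined code.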

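7 Graph Construction. Each node is a thread (Thread 1, Thread 2) labeled with its cycles (e.g., Cycles = 10M) and working-set size (e.g., Working-set = 10K). The edge between two threads carries their synchronization cost (in cycles) and their communication cost.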

8 Communication Cost. Intuition: a STORE instruction causes a coherence miss in the cache. Memory accesses are logged per thread: for Thread 1 and Thread 2, each log lists an LD count and an ST count per address. The graph edge sums MIN terms over the LD and ST counts of each address that both threads access. First address: MIN(5, 7) + MIN(10, 0) + MIN(10, 7) = 12. Second address: MIN(7, 8) + MIN(7, 3) + MIN(7, 8) = 17. Total communication cost: 12 + 17 = 29.
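The MIN terms on slide 8 are consistent with a rule that, for each address both threads touch, adds min(LD1, ST2) + min(ST1, LD2) + min(ST1, ST2). The Python sketch below assumes that rule. The addresses are hypothetical, and the load/store counts are inferred from the MIN terms rather than read from the slide's tables.

```python
from typing import Dict, Tuple

# Per-thread access log: address -> (load count, store count).
AccessLog = Dict[int, Tuple[int, int]]


def communication_cost(log_a: AccessLog, log_b: AccessLog) -> int:
    """Sum the three MIN terms over every address both threads accessed."""
    total = 0
    for addr in log_a.keys() & log_b.keys():
        ld_a, st_a = log_a[addr]
        ld_b, st_b = log_b[addr]
        # Only pairs that involve at least one store are counted.
        total += min(ld_a, st_b) + min(st_a, ld_b) + min(st_a, st_b)
    return total


# Hypothetical addresses; counts chosen to reproduce the slide's MIN terms.
thread1 = {0x1000: (5, 10), 0x2000: (7, 7)}
thread2 = {0x1000: (0, 7), 0x2000: (3, 8)}
cost = communication_cost(thread1, thread2)
```

Under these assumed counts the two addresses contribute 12 and 17, matching the slide's total of 29. Load-load pairs are left out, in line with the slide's intuition that stores are what cause coherence misses.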

9 Combining Algorithm. Kernighan-Lin (KL) graph partitioning heuristic. Goal: minimize execution cycles. Precondition: a constraint relating the number of combined threads to the number of cores. Example (figure): threads A through H on 2 cores, with edges A-B, C-D, E-F, and G-H labeled 60 and a legend of 100 cycles, split into Partition 1 and Partition 2. A table lists, for each step, the node moved and the cycle estimation for Partition 1 and Partition 2.
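To make the idea of refining a partition against a cycle estimate concrete, here is a simplified swap-based refinement in Python. It is only in the spirit of KL: it omits KL's node locking and best-prefix selection, and it is not the authors' algorithm. The cost model in `estimate` is an assumption, as is the reading of the slide's example as eight 100-cycle threads with four edges of cost 60.

```python
from itertools import product
from typing import Dict, Set, Tuple

Edges = Dict[Tuple[str, str], int]


def estimate(p1: Set[str], p2: Set[str], cycles: Dict[str, int], edges: Edges) -> int:
    """Assumed cost model: each partition pays its threads' cycles plus the
    cost of every edge crossing the cut; the slower partition decides."""
    load1 = sum(cycles[n] for n in p1)
    load2 = sum(cycles[n] for n in p2)
    for (a, b), cost in edges.items():
        if (a in p1) != (b in p1):  # edge crosses the cut
            load1 += cost
            load2 += cost
    return max(load1, load2)


def refine(p1: Set[str], p2: Set[str], cycles: Dict[str, int], edges: Edges):
    """Apply the best improving swap until no swap lowers the estimate."""
    p1, p2 = set(p1), set(p2)
    best = estimate(p1, p2, cycles, edges)
    while True:
        best_swap = None
        for a, b in product(sorted(p1), sorted(p2)):
            trial = estimate((p1 - {a}) | {b}, (p2 - {b}) | {a}, cycles, edges)
            if trial < best:
                best, best_swap = trial, (a, b)
        if best_swap is None:
            return p1, p2, best
        a, b = best_swap
        p1, p2 = (p1 - {a}) | {b}, (p2 - {b}) | {a}


# Assumed reading of the slide's example graph.
cycles = {n: 100 for n in "ABCDEFGH"}
edges = {("A", "B"): 60, ("C", "D"): 60, ("E", "F"): 60, ("G", "H"): 60}
p1, p2, cost = refine({"A", "C", "E", "G"}, {"B", "D", "F", "H"}, cycles, edges)
```

Starting from a split that cuts all four edges (estimate 640), two swaps bring every communicating pair onto the same side and the estimate drops to 400. Swaps are used here because any single-node move from this start makes the estimate worse.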

10 Thread Combining. The application runs under a dynamic compiler (translation, code cache). Thread APIs are replaced with wrapper functions. Wrapper function for thread creation, vm_thread_create(): target to combine? No: create a normal thread. Yes: create a user thread. User threads are context switched by the dynamic compiler and executed serially in a real thread.
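The wrapper idea on slide 10 can be illustrated with a small Python sketch. It is not Thread Tailor's implementation: `thread_create_wrapper` and `CombinedThread` are hypothetical names, the group assignment is passed in by hand, and each user thread simply runs to completion instead of being context switched by a dynamic compiler.

```python
import threading
from typing import Callable, Dict, List, Optional


class CombinedThread:
    """One real thread that executes its user threads one after another."""

    def __init__(self) -> None:
        self.user_threads: List[Callable[[], None]] = []
        self._real: Optional[threading.Thread] = None

    def add(self, fn: Callable[[], None]) -> None:
        self.user_threads.append(fn)

    def start(self) -> None:
        self._real = threading.Thread(target=self._run)
        self._real.start()

    def _run(self) -> None:
        for fn in self.user_threads:  # serial execution inside one real thread
            fn()

    def join(self) -> None:
        if self._real is not None:
            self._real.join()


def thread_create_wrapper(fn: Callable[[], None], group: Optional[int],
                          groups: Dict[int, CombinedThread]) -> Optional[threading.Thread]:
    """Hypothetical stand-in for a wrapper around thread creation."""
    if group is None:  # not a target to combine: create a normal thread
        t = threading.Thread(target=fn)
        t.start()
        return t
    groups.setdefault(group, CombinedThread()).add(fn)  # create a user thread
    return None


order: List[int] = []
groups: Dict[int, CombinedThread] = {}
for i in range(4):
    thread_create_wrapper(lambda i=i: order.append(i), group=i % 2, groups=groups)
for g in groups.values():
    g.start()
for g in groups.values():
    g.join()
```

Four user threads end up in two real threads, and within each group they run in creation order. Running each user thread to completion would deadlock if two threads in the same group waited on each other, which is presumably why the slide has the dynamic compiler perform context switches.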

11 Experimental Setup. 2 cores: Intel Core 2 Duo 6600 (2.4 GHz). 4 cores: Intel Core 2 Quad Q6600 (2.4 GHz). 8 cores: 2 quad-core CPUs with SMT, Intel Xeon E5520 (2.26 GHz). 16 cores (logical): 2 quad-core CPUs with SMT and HyperThreading, Intel Xeon E5520 (2.26 GHz).

12 Speedup Results. Figure: speedup versus core number for fluidanimate, transpose, blackscholes, twister, water_n^2, and swaptions.

13 Result Analysis - Transpose. Transpose an m * n matrix into an n * m matrix. Parallel transpose (figure): Thread 1 and Thread 2 are a fixed column distance apart in the input matrix and 128 rows apart in the output matrix.

14 Result Analysis - Transpose. Transpose an m * n matrix into an n * m matrix. Core 0 on Intel Nehalem: private L1 (32K), private L2 (256K), shared L3 (8M). Input matrix and output matrix.

15 Result Analysis - Transpose. Same cache hierarchy, with a 64-byte cache block.

16 Result Analysis - Transpose. Core 0 on Intel Nehalem: private L1 (32K), private L2 (256K), shared L3 (8M). Input matrix and output matrix.

17 Result Analysis - Transpose. Input matrix: 512-byte distance. Output matrix: 128-row distance.

18 Result Analysis - Transpose. Input matrix: iterates 128 times, 8KB (128 * 64 bytes). Output matrix: iterates 128 times, 8KB (128 * 64 bytes).

19 Result Analysis - Transpose. Input matrix: 8KB (128 * 64 bytes). Output matrix: 8KB (128 * 64 bytes). Private L1 (32K), private L2 (256K), shared L3 (8M).

20 Result Analysis - Transpose. Input matrix: 8KB (128 * 64 bytes). Output matrix: 8KB (128 * 64 bytes). WRITE HIT in the private L1 (32K)!

21 Result Analysis - Transpose. Input matrix: 8KB (128 * 64 bytes). Output matrix: 8KB (128 * 64 bytes). The working set fits into the L1 cache (no capacity miss!). WRITE HIT in the private L1 (32K)!
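The cache arithmetic behind slides 18 to 21 can be checked in a few lines of Python. The block size, block count, and L1 size come from the slides. Treating the working set as the sum of the input and output footprints is an assumption of this sketch.

```python
CACHE_BLOCK_BYTES = 64     # cache block size shown on the slides
BLOCKS_PER_MATRIX = 128    # blocks touched per matrix (128 iterations)
L1_BYTES = 32 * 1024       # private L1 size shown on the slides

per_matrix = BLOCKS_PER_MATRIX * CACHE_BLOCK_BYTES  # 8192 bytes = 8 KB
working_set = 2 * per_matrix                        # assumed: input + output
fits_in_l1 = working_set <= L1_BYTES
```

Under that assumption the combined footprint is 16 KB, half of the 32 KB L1, which agrees with the slides' claim that the working set fits with no capacity miss.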

22 Summary. Choosing the optimal number of threads is hard. Thread Tailor eases the pain: graph representation; combining threads at runtime.

23 Thank you
