The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
1 The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
J. Gregory Steffan and Todd C. Mowry, Department of Computer Science, Carnegie Mellon University
The STAMPede Project
2 State-of-the-Art vs. Future Processors
Today: MIPS R10000 (6.8 million transistors). In 15 years: 1 billion transistors.
Challenge: translating these resources into higher performance.
One possibility: multiple processors on a single chip.
3 Performance Benefits of Single-Chip Multiprocessing
Multiprogramming workload: improved throughput (independent programs run on processors P1 through P4 simultaneously).
Single application: to reduce execution time, it must contain parallel threads.
The big question: how do we automatically parallelize all applications?
4 State-of-the-Art in Automatic Parallelization
Numeric applications: dominated by regular array references inside loops:
  FOR i = 1 to N
    FOR j = 1 to N
      FOR k = 1 to N
        C[i][j] += A[i][k] * B[k][j];
Significant progress has been made, e.g., the fastest SPECfp95 number [Bugnion et al. 1996].
Non-numeric applications: access patterns and control flow may be highly irregular (pointer dereferences, recursion, unstructured loops, etc.), and there has been little (if any) success in automatic parallelization. But these applications are important: we must expand the scope of automatic parallelization.
5 Why Is Automatic Parallelization So Difficult?
Current approach: parallelize only if we can statically prove independence:
  FOR i = 1 to N
    A[i] += i;
Transformations can help to eliminate dependences, such as this sequential data dependence:
  FOR i = 1 to N
    A[i] += f(A[i-1]);
For non-numeric codes, understanding memory addresses is extremely difficult:
  while (foo()) {
    x = *q;
    *p = bar();
  }
Major limitation: when the compiler is uncertain, it must be conservative.
6 Expanding the Scope of Automatic Parallelization
The problem: statically proving independence is hopelessly restrictive, and a full understanding of memory addresses is unrealistic. Instead, we should be focusing on performance issues.
Our solution: Thread-Level Data Speculation (TLDS).
7 Overview
Thread-Level Data Speculation (TLDS); An Example: Compress; Experimental Results; Architectural Support; Conclusions.
8 Data Speculation
Basic idea: optimistically perform an access assuming no dependence; if the speculation was unsafe, invoke a recovery action.
Example. Original execution:
  a = f()
  STORE *p = a
  LOAD b = *q
  g(b)
Instruction-level data speculation: the LOAD b = *q is hoisted above the STORE *p = a. If (p != q), the speculation succeeds and g(b) proceeds; if (p == q), recover with b = a before g(b).
9 Thread-Level Data Speculation
Analogous to instruction-level data speculation, except that it involves separate, parallel threads of control.
Original execution (one thread): a = f(); STORE *p = a; LOAD b = *q; g(b).
Thread-level data speculation: Processor 1 executes a = f(); STORE *p = a while Processor 2 speculatively executes LOAD b = *q; g(b) in parallel. If (p != q), the speculation succeeds; if (p == q), Processor 2 re-executes LOAD b = *q and g(b).
10 Related Work
Prior to this study: the Wisconsin Multiscalar architecture [Sohi et al., 1995], a tightly-coupled ring architecture with register forwarding, in which an ARB detects memory dependences and hardware performs rollback.
+ requires relatively little software support
- the large, centralized ARB may increase load latency
- the ring architecture limits flexibility (multiprogramming, locality)
Other recent work: Stanford Hydra [Oplinger et al., 1997]; the Wisconsin Speculative Versioning Cache [Gopal et al., 1997]; Illinois Speculative Run-Time Parallelization [Zhang et al., 1998].
11 Objectives of This Study
Does TLDS offer compelling performance improvements? (a study of the SPEC92 and SPEC95 integer benchmarks)
Can we provide cost-effective hardware support for TLDS? (detecting dependence violations; recovering from failed speculation; goal: the performance of non-TLDS code is not sacrificed)
What compiler support is necessary to exploit TLDS? (optimizations, scheduling, etc.)
12 Overview
Thread-Level Data Speculation (TLDS); An Example: Compress; Experimental Results; Architectural Support; Conclusions.
13 An Example: Compress
  while ((c = getchar()) != EOF) {
    /* perform data compression */
    ... = hash[hash_function(c)];
    ...
    hash[hash_function(...)] = ...;
  }
Potential source of parallelism: data parallelism across the input stream.
From the compiler's perspective: the hash accesses cannot be statically disambiguated.
In reality: consecutive characters rarely hash to the same entry.
14 TLDS Execution of Compress
Each loop iteration becomes an epoch:
  while (...) {
    x = hash[idx1];
    ...
    hash[idx2] = y;
  }
Epochs 1 through 4 run speculatively on Processors 1 through 4, reading hash[3], hash[19], hash[33], and hash[10] respectively, then writing hash[10], hash[21], hash[30], and hash[25], and each calls attempt_commit(). Epoch 1's write to hash[10] violates Epoch 4's speculative read of hash[10], so Epoch 4 must retry (re-reading hash[10] and writing hash[25]) while Epochs 5 through 7 (reading hash[31], hash[9], hash[27]) proceed on the other processors.
15 Other Data Dependences in Compress
  while ((c = getchar()) != EOF) {
    /* perform data compression */
    in_count++;
    if (...) { out_count++; putchar(...); }
    if (free_entries < ...) free_entries = ...;
  }
in_count: an induction variable, implicit in the epoch number.
out_count: a reduction operation; compute partial sums.
getchar(), putchar(): use parallel library routines (also malloc(), etc.).
free_entries: this dependence cannot be eliminated; forward it between epochs.
16 Overview
Thread-Level Data Speculation (TLDS); An Example: Compress; Experimental Results (Relaxing Memory Dependences; Forwarding Data Between Epochs; Speedups); Architectural Support; Conclusions.
17 Benchmark Suite
For each region, the table reports the average dynamic instructions per epoch and the percentage of total dynamic instructions covered.
  SPEC92: compress (r1), gcc (r1, r2), espresso (r1), li (r1, r2), sc (r1)
  SPEC95: m88ksim (r1), ijpeg (r1), perl (r1), go (r1)
  NAS Parallel: buk (r1, r2)
18 Measuring Memory Dependences: Run Lengths
Run length: the number of epochs executed between read-after-write (RAW) dependences.
Example: nine epochs (E1 through E9) run on 4 processors; RAW dependences (at distances d=2 and d=1) cause violations and retries, splitting execution into runs of epochs. With runs of length 3, the average run length = 3.
The average run length is a rough limit on the potential parallelism.
19 Relaxing Memory Data Dependences
Speedup limit for each region (compress.r1, gcc.r1, gcc.r2, li.r1, espresso.r1, li.r2, sc.r1, ijpeg.r1, m88ksim.r1, perl.r1, go.r1, buk.r1, buk.r2) under three configurations: B = the base case; O = compiler optimizations applied to remove dependences; a third configuration additionally removes dependences due to forwardable scalars.
Eliminating dependences and forwarding scalars are important; significant parallelism is available in many cases.
20 Scalar Memory Values: Forwarding Data Between Epochs
Forward a value if dependences on it occur frequently: synchronization is faster than speculation recovery. Forwarding scalar memory values is helpful for performance, but not necessary for correctness.
Register values: these must be forwarded to maintain correctness. Some register dependences may be eliminated (e.g., induction variables, through simple loop rescheduling); all other register dependences are forwarded through memory. What is the impact on performance?
21 Critical Path Lengths
Coarse-grain synchronization: an epoch WAITs for its predecessor, then performs all of its forwarded loads and stores (LOAD A, STORE A, LOAD B, STORE B) before SIGNALing its successor, serializing the entire forwarded region.
Fine-grain synchronization: each forwarded value is synchronized separately (WAIT A, LOAD A, STORE A, SIGNAL A; then WAIT B, LOAD B, STORE B, SIGNAL B), shortening the critical path between epochs.
22 Forwarding Data Between Epochs
Speedup limit for each region under three configurations: c = coarse-grain synchronization; f = fine-grain synchronization; s = fine-grain synchronization with aggressive instruction scheduling.
Fine-grain synchronization is helpful; with aggressive instruction scheduling, forwarding is not a bottleneck.
23 Potential Region Speedup on 4 Processors
Region speedup for each region under two configurations: (1) optimize dependences and forward with fine-grain synchronization; (2) additionally perform aggressive instruction scheduling. Communication latency = 10 cycles.
Aggressive instruction scheduling is a major performance win; potential speedups of twofold or more appear in 11 of 13 regions.
24 Program Speedups
Program speedup for each benchmark (compress, espresso, li, sc, m88ksim, ijpeg, go, gcc, perl, buk), together with the fraction of each program covered by the parallelized regions (coverage ranges from 6.8% to 99.9%) and the corresponding region speedups for all 13 regions.
25 Overview
Thread-Level Data Speculation (TLDS); An Example: Compress; Experimental Results; Architectural Support (Communication Latency; Key Architectural Issues; Detecting Data Dependence Violations; Buffering Speculative State); Conclusions.
26 Base Architecture
A single chip contains four processors, each with its own L1 cache, connected through a 4x4 crossbar switch to an interleaved L2 cache.
Each processor has its own L1 data cache, to maintain single-cycle load latency. The L1 caches are kept coherent, preserving the shared-memory programming model.
27 How Important Is Communication Latency?
Some options: direct L1-to-L1 communication (~2 cycles), or communicating through the L2 cache (~10 cycles).
28 Impact of Communication Latency
Region speedup for each region under two configurations (optimize dependences and forward with fine-grain synchronization; additionally perform aggressive instruction scheduling), at communication latencies of 10 cycles and 2 cycles.
Instruction scheduling reduces the sensitivity to communication latency; communicating through the L2 cache is a viable option.
29 Key Architectural Issues
Thread management: thread creation and epoch scheduling; epoch numbers must be visible to the hardware; speculative vs. non-speculative memory accesses must be distinguished; and recovery from data dependence violations is split (hardware notifies software of the violation; software performs the bulk of the recovery).
Detecting data dependence violations: extend invalidation-based cache coherence.
Buffering speculative state: extend the functionality of the primary data caches.
30 Invalidation-Based Cache Coherence
Example: Processor 2 loads X, and a read request brings X = 1 into its L1 cache. Processor 1 then stores X = 2, which sends an invalidation to Processor 2's cached copy of X.
31 Detecting Data Dependence Violations
Example (p = q = &X): Processor 2, running Epoch 6, calls become_speculative() and speculatively executes LOAD a = *p, marking its cached copy of X as speculatively loaded (SL). Processor 1, running Epoch 5, then executes STORE *q = 2, and the resulting invalidation carries Epoch 5's epoch number. Because Epoch 6 speculatively loaded X and the invalidation comes from a logically earlier epoch, Processor 2's violation flag is set to TRUE, and its attempt_commit() fails.
Each cache line therefore carries two extra bits: speculatively loaded (SL) and speculatively modified (SM).
32 Buffering Speculative State
Speculative stores: software cannot realistically roll back memory side effects. Our solution: buffer speculative stores in the L1 cache until it is safe to commit them to memory.
Speculative loads: if a speculatively loaded line is displaced, we can no longer track dependence violations on it, so the violation flag is set upon eviction of a speculatively accessed line. Correctness is preserved, but performance may suffer.
A 16KB 2-way set-associative L1 cache with a 4-entry victim cache suffices.
33 Overview
Thread-Level Data Speculation (TLDS); An Example: Compress; Experimental Results; Architectural Support; Conclusions.
34 Conclusions
TLDS potentially offers compelling performance improvements: 12 of 13 regions and 7 of 10 programs show speedups on 4 processors.
Only modest hardware modifications are required: the cache coherence protocol is augmented to detect violations, and the primary data cache is used to buffer speculative state.
Compiler support is crucial yet feasible: eliminating data dependences, and aggressive scheduling to minimize the critical path (forwarding).
Ongoing and future work: refining the architecture (described in a technical report) and building the compiler.
Complexity Analysis of A Cache Controller for Speculative Multithreading Chip Multiprocessors Yoshimitsu Yanagawa, Luong Dinh Hung, Chitaka Iwama, Niko Demus Barli, Shuichi Sakai and Hidehiko Tanaka Although
More informationTransactional Memory. Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech
Transactional Memory Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech (Adapted from Stanford TCC group and MIT SuperTech Group) Motivation Uniprocessor Systems Frequency
More informationCOS 318: Operating Systems. NSF, Snapshot, Dedup and Review
COS 318: Operating Systems NSF, Snapshot, Dedup and Review Topics! NFS! Case Study: NetApp File System! Deduplication storage system! Course review 2 Network File System! Sun introduced NFS v2 in early
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationImproving the Performance of Speculatively Parallel Applications on the Hydra CMP
Improving the Performance of Speculatively Parallel Applications on the Hydra CMP Kunle Olukotun, Lance Hammond and Mark Willey Computer Systems Laboratory Stanford University Stanford, CA 94305-4070 http://www-hydra.stanford.edu/
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationThree basic multiprocessing issues
Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationUNIT I (Two Marks Questions & Answers)
UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-
More informationHardware Support for Thread-Level Speculation
Hardware Support for Thread-Level Speculation J. Gregory Steffan CMU-CS-3-122 April 23 School of Computer Science Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 Thesis Committee:
More informationMultiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor
Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor Seon Wook Kim, Chong-Liang Ooi, Il Park, Rudolf Eigenmann, Babak Falsafi, and T. N. Vijaykumar School
More informationOverview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class
Overview of Today s Lecture: Cost & Price, Performance EE176-SJSU Computer Architecture and Organization Lecture 2 Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class EE176
More informationCS 1013 Advance Computer Architecture UNIT I
CS 1013 Advance Computer Architecture UNIT I 1. What are embedded computers? List their characteristics. Embedded computers are computers that are lodged into other devices where the presence of the computer
More informationDeterministic Shared Memory Multiprocessing
Deterministic Shared Memory Multiprocessing Luis Ceze, University of Washington joint work with Owen Anderson, Tom Bergan, Joe Devietti, Brandon Lucia, Karin Strauss, Dan Grossman, Mark Oskin. Safe MultiProcessing
More informationMotivations. Shared Memory Consistency Models. Optimizations for Performance. Memory Consistency
Shared Memory Consistency Models Authors : Sarita.V.Adve and Kourosh Gharachorloo Presented by Arrvindh Shriraman Motivations Programmer is required to reason about consistency to ensure data race conditions
More informationGetting CPI under 1: Outline
CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more
More informationEECS 570 Final Exam - SOLUTIONS Winter 2015
EECS 570 Final Exam - SOLUTIONS Winter 2015 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 21 2 / 32
More informationMemory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System
Center for Information ervices and High Performance Computing (ZIH) Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor ystem Parallel Architectures and Compiler Technologies
More informationHammer Slide: Work- and CPU-efficient Streaming Window Aggregation
Large-Scale Data & Systems Group Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation Georgios Theodorakis, Alexandros Koliousis, Peter Pietzuch, Holger Pirk Large-Scale Data & Systems (LSDS)
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationScalable GPU Graph Traversal!
Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang
More informationMulti-Level Memory Prefetching for Media and Stream Processing
Multi-Level Memory Prefetching for Media and Stream Processing Jason Fritts Assistant Professor, Introduction Multimedia is a dominant computer workload Traditional cache-based memory systems not well-suited
More informationLecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis
Lecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis Credits John Owens / UC Davis 2007 2009. Thanks to many sources for slide material: Computer Organization
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Final Review Shuai Wang Department of Computer Science and Technology Nanjing University Computer Architecture Computer architecture, like other architecture, is the art
More informationLecture 24: Multiprocessing Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationParallel Computer Architecture Spring Memory Consistency. Nikos Bellas
Parallel Computer Architecture Spring 2018 Memory Consistency Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture 1 Coherence vs Consistency
More information