Speculative Parallelization in Decoupled Look-ahead

Size: px

Start display at page:

Download "Speculative Parallelization in Decoupled Look-ahead"

Rosaline Fowler
6 years ago
Views:

1 Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer Engineering University of Rochester, Rochester, NY

new bottleneck Speculative parallelization aptly suited for its acceleration More parallelism opportunities:

2 Motivation Single-thread performance still important design goal Branch mispredictions and cache misses are still important performance hurdles One effective approach: Decoupled Look-ahead Look-ahead thread can often become the new bottleneck Speculative parallelization aptly suited for its acceleration More parallelism opportunities: Look-ahead thread does not contain all dependences Simple architectural support: Look-ahead not correctness critical 1

3 Outline Motivation Baseline decoupled look-ahead Speculative parallelization in look-ahead Experimental analysis Summary 2

shared L2 Sends all branch outcomes through FIFO queue and also helps prefetching *A. Garg and M. Huang.

4 Baseline Decoupled Look-ahead Binary parser is used to generate skeleton from original program The skeleton runs on a separate core and Maintains its own memory image in local L1, no write-back to shared L2 Sends all branch outcomes through FIFO queue and also helps prefetching *A. Garg and M. Huang. A Performance-Correctness Explicitly Decoupled Architecture. In Proc. Int l Symp. On Microarch., November

overhead on main thread Easier for run-time control to disable Natural throttling mechanism

5 Practical Advantages of Decoupled Look-ahead Look-ahead thread is a single, selfreliant agent No need for quick spawning and register communication support Little management overhead on main thread Easier for run-time control to disable Natural throttling mechanism to prevent run-away prefetching Look-ahead thread size comparable to aggregation of short helper threads 4

Application categories Bottlenecks removed Speed of

6 Look-ahead: A New Bottleneck Comparing four systems Baseline Decoupled look-ahead Ideal Look-ahead alone Application categories Bottlenecks removed Speed of look-ahead not the problem Look-ahead is the new bottleneck 5

7 Outline Motivation Baseline decoupled look-ahead Speculative parallelization in look-ahead Experimental analysis Summary 6

8 Unique Opportunities Skeleton code offers more parallelism Certain dependencies removed during slicing for skeleton Look-ahead is inherently errortolerant Can ignore dependence violations Little to no support needed, unlike in conventional TLS Assembly code from equake 7

9 Time Software Support Dependence analysis Profile guided, coarse-grain at basic block level Parallelism at basic block level Spawn Spawn and Target points Basic blocks with consistent dependence distance of more than threshold of DMIN Spawned thread executes from target point Target Loop level parallelism is also exploited Available parallelism for 2 core/contexts system 8

10 Parallelism in Look-ahead Binary Available theoretical parallelism for 2 core/contexts system; DMIN = 15 BB Parallelism potential with stables and consistent target and spawn points 9

Hardware and Runtime Support Thread spawning and merging Not too different from regular thread spawning except Spawned thread shares the same register and memory state Spawning thread will terminate

11 Hardware and Runtime Support Thread spawning and merging Not too different from regular thread spawning except Spawned thread shares the same register and memory state Spawning thread will terminate at the target PC Value communication Register-based naturally through shared registers in SMT Memory-based communication can be supported at different levels Partial versioning in cache at line level Spawning of a new look-ahead thread 10

12 Outline Motivation Baseline decoupled look-ahead Speculative parallelization in look-ahead Experimental analysis Summary 11

13 Experimental Setup Program/binary analysis tool: based on ALTO Simulator: based on heavily modified SimpleScalar SMT, look-ahead and speculative parallelization support True execution-driven simulation (faithfully value modeling) Microarchitectural and cache configurations 12

Speedup of speculative parallel decoupled look-ahead

Speedup of look-ahead systems over single thread

61x Speculative look-ahead over single thread

14 Speedup of speculative parallel decoupled look-ahead 14 applications in which look-ahead is bottleneck Speedup of look-ahead systems over single thread Decoupled look-ahead over single thread baseline: 1.61x Speculative look-ahead over single thread baseline: 1.81x Speculative look-ahead over decoupled look-ahead:1.13x 13

Speculative Parallelization in Lookahead vs

opportunities for parallelization Speedup of

look-ahead over decoupled LA baseline: 1.

baseline: 1.07x Baseline (ST) Baseline TLS 1.

15 Speculative Parallelization in Lookahead vs Main thread Skeleton provides more opportunities for parallelization Speedup of speculative parallelization Speculative look-ahead over decoupled LA baseline: 1.13x Speculative main thread over single thread baseline: 1.07x Baseline (ST) Baseline TLS 1.07x 1.81x IPC Decoupled LA LA 1.13x Speculative LA LA0 LA1 Core1 Core2 14

16 Flexibility in Hardware Design Comparison of regular (partial versioning) cache support with two other alternatives No cache versioning support Dependence violation detection and squash 15

Other Details in the Paper The effect of spawning look-ahead thread in preserving the lead of overall look-ahead system Technique to avoid spurious spawns in dispatch stage which could be subjected

17 Other Details in the Paper The effect of spawning look-ahead thread in preserving the lead of overall look-ahead system Technique to avoid spurious spawns in dispatch stage which could be subjected to branch misprediction Technique to mitigate the damage of runaway spawns Impact of speculative parallelization on overall recoveries Modifications to branch queue in multiple look-ahead system 16

18 Outline Motivation Baseline decoupled look-ahead Speculative parallelization in look-ahead Experimental analysis Summary 17

removes dependences and increases parallelism Hardware design is flexible and can be a simple extension

19 Summary Decoupled look-ahead can significantly improve single-thread performance but look-ahead thread often becomes new bottleneck Look-ahead thread lends itself to TLS acceleration Skeleton construction removes dependences and increases parallelism Hardware design is flexible and can be a simple extension of SMT A straightforward implementation of TLS look-ahead achieves an average 1.13x (up to 1.39x) speedup 18

20 Backup Slides Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer Engineering University of Rochester, Rochester, NY

References (Partial) J. Dundas and T. Mudge. Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss. In Proc. Int l Conf. on Supercomputing, pages 68 75, July 1997. O.

21 References (Partial) J. Dundas and T. Mudge. Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss. In Proc. Int l Conf. on Supercomputing, pages 68 75, July O. Mutlu, J. Stark, C. Wilkerson, and Y. Patt. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors. In Proc. Int l Symp. on High-Perf. Comp. Arch., pages , February J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In Proc. Int l Symp. On Comp. Arch., pages 14 25, June C. Zilles and G. Sohi. Execution-Based Prediction Using Speculative Slices. In Proc. Int l Symp. on Comp. Arch., pages 2 13, June A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proc. Int l Symp. on High-Perf. Comp. Arch., pages 37 48, January Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A Study of Slipstream Processors. In Proc. Int l Symp. on Microarch., pages , December G. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Processors. In Proc. Int l Symp. on Comp. Arch., pages , June

22 References (cont ) H. Zhou. Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window. In Proc. Int l Conf. on Parallel Arch. and Compilation Techniques, pages , September A. Garg and M. Huang. A Performance-Correctness Explicitly Decoupled Architecture. In Proc. Int l Symp. On Microarch., November C. Zilles and G. Sohi. Master/Slave Speculative Parallelization. In Proc. Int l Symp. on Microarch., pages 85 96, Nov D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical report 1342, Computer Sciences Department, University of Wisconsin-Madison, June J. Renau, K. Strauss, L. Ceze,W. Liu, S. Sarangi, J. Tuck, and J. Torrellas. Energy-Efficient Thread- Level Speculation on a CMP. IEEE Micro, 26(1):80 91, January/February P. Xekalakis, N. Ioannou, and M. Cintra. Combining Thread Level Speculation, Helper Threads, and Runahead Execution. In Proc. Intl. Conf. on Supercomputing, pages , June M. Cintra, J. Martinez, and J. Torrellas. Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors. In Proc. Int l Symp. on Comp. Arch., pages 13 24, June

23 Backup Slides For reference only

24 Register Renaming Support 23

25 Modified Banked Branch Queue 24

26 Speedup of all applications 25

27 Impact on Overall Recoveries 26

If we don t kill we gain around 1% performance improvement After recovery in main,

28 Partial Recovery in Look-ahead When recovery happens in primary look-ahead Do we kill the secondary look-ahead or not? If we don t kill we gain around 1% performance improvement After recovery in main, secondary thread survives often (1000 insts) Spawning of a secondary look-ahead thread helps preserving the slip of overall look-ahead system 27

29 Quality of Thread Spawning Successful and run-away spawns: killed after certain time Impact of lazy spawning policy Wait for few cycles when spawning opportunity arrives Avoids spurious spawns; Saves some wrong path execution 28

ECE404 Term Project Sentinel Thread

ECE404 Term Project Sentinel Thread Alok Garg Department of Electrical and Computer Engineering, University of Rochester 1 Introduction Performance degrading events like branch mispredictions and cache