Are We Ready for High-MLP?
Transcription
Slide 1: Are We Ready for High-MLP?
Luis Ceze, James Tuck, and Josep Torrellas
Slide 2: Why MLP?
- Overlapping long-latency misses is a very effective way of tolerating memory latency.
- It hides latency at the cost of bandwidth.
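The latency-for-bandwidth trade can be made concrete with a back-of-the-envelope calculation. The 650-cycle memory round trip comes from the experimental setup slide; the 64-byte line size and the count of 16 misses are assumptions for illustration.

```python
MISS_LATENCY = 650   # round-trip memory latency in cycles (from the setup slide)
LINE_BYTES = 64      # cache line size (assumed, not stated in the slides)
NUM_MISSES = 16

# One miss at a time: latencies add up.
serial_cycles = NUM_MISSES * MISS_LATENCY        # 16 * 650 = 10400 cycles

# All misses overlapped: roughly one miss latency covers them all.
overlapped_cycles = MISS_LATENCY                 # 650 cycles

# The same bytes must now be delivered in 1/16th of the time, so the
# required memory bandwidth goes up by the overlap factor.
serial_bw = NUM_MISSES * LINE_BYTES / serial_cycles       # bytes per cycle
overlap_bw = NUM_MISSES * LINE_BYTES / overlapped_cycles  # 16x higher

print(serial_cycles, overlapped_cycles, overlap_bw / serial_bw)
```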
Slide 3: Motivation
- There have been many recent high-MLP innovations.
- Many rely on processor checkpointing and speculative execution: Runahead, CPR, CAVA, Clear, CFP, ...
- But what happens in the memory system? Misses pile up quickly, very quickly.
Slide 4: Miss Handling Architectures
- An MHA is the logic and resources needed to support outstanding misses in a cache.
- It consolidates misses to the same line (primary/secondary misses).
- It may perform data forwarding.
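A minimal sketch of the consolidation described above, using a plain dictionary rather than any real hardware organization: the first miss to a line is primary and allocates an entry (and would issue the memory request); later misses to the same line are secondary and merge into the existing entry.

```python
class MissHandler:
    """Toy MHA: consolidates misses to the same cache line."""

    def __init__(self):
        self.pending = {}          # line address -> destinations waiting on it

    def access(self, line_addr, dest_reg):
        if line_addr in self.pending:
            self.pending[line_addr].append(dest_reg)   # secondary miss: merge
            return "secondary"
        self.pending[line_addr] = [dest_reg]           # primary miss: new entry,
        return "primary"                               # would go to memory

    def fill(self, line_addr):
        # Data returns from memory: forward to every merged requester
        # and free the entry.
        return self.pending.pop(line_addr)

mha = MissHandler()
print(mha.access(0x100, "r1"))   # primary
print(mha.access(0x100, "r2"))   # secondary
print(mha.fill(0x100))           # ['r1', 'r2']
```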
Slide 5: MSHRs
- The Miss Status Holding Register (MSHR) is the key structure in the MHA, proposed by [Kroft 81].
- As described in [Farkas 94], subentries can be:
  - Implicitly addressed: one [dest, data] subentry per word in the cache line (word 0 .. word 3), so there are as many subentries as words.
  - Explicitly addressed: an arbitrary number of subentries, each of the form [off, dest, data].
- There is limited information on existing designs.
- MSHRs take significant chip area.
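The two subentry organizations can be sketched as a toy model, assuming a 4-word line (field names are illustrative): an implicitly addressed entry dedicates one slot per word, so position encodes the offset but a second miss to the same word cannot merge; an explicitly addressed entry records the offset inside each subentry, so several misses to the same word can coexist until the subentries run out.

```python
WORDS_PER_LINE = 4

def make_implicit_entry():
    return [None] * WORDS_PER_LINE          # slot i serves word i of the line

def add_implicit(entry, word, dest):
    if entry[word] is not None:
        return False                        # word's slot taken: cannot merge
    entry[word] = dest                      # offset is implied by position
    return True

def make_explicit_entry(max_subentries):
    return {"max": max_subentries, "subs": []}

def add_explicit(entry, word, dest):
    if len(entry["subs"]) >= entry["max"]:
        return False                        # out of subentries
    entry["subs"].append((word, dest))      # offset stored explicitly
    return True
```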
Slide 6: Processors Considered
- Superscalar.
- LargeWindow: as above, but with a 512-entry i-window and 2048-entry ROB; hides latency with independent work.
- Checkpointed: checkpoint-assisted value prediction (CAVA [Ceze 04]).
- All in a two-context SMT organization.
Slide 7: Outstanding Misses Distribution
[Figure: cumulative % of time vs. number of outstanding read misses, one plot per processor]
- Superscalar: 90% of the time there are fewer than 16 outstanding read misses.
- Checkpointed: 50% of the time there are more than 40.
- LargeWindow: 50% of the time there are more than 100.
Slide 8: MHA Design Space
- Capacity (number of entries/lines).
- Number of secondary read misses per line.
- Number of secondary write misses per line.
- Associativity.
Slide 9: Capacity (Entries)
Unlimited secondary misses assumed.
[Figure: speedup over Unlimited for 4, 8, 16, and 32 entries; Int.GM, FP.GM, Mix.GM]
- Superscalar: 20% impact.
- Checkpointed: 50% impact.
- LargeWindow: 50% impact.
Slide 10: Subentries (Checkpointed Processor Only)
[Figures: speedup over Unlimited for 4, 8, 16, and 32 read or write subentries; distribution of used MSHRs holding read-only, write-only, or read+write subentries]
- Read subentries: 40% impact.
- Write subentries: 17% impact.
- Used MSHRs rarely hold both read and write subentries.
Slide 11: Associativity (32 Entries Total)
[Figure: speedup over FullyAssoc for 16-, 8-, 4-, and 2-way organizations]
- Checkpointed: 27% impact.
Slide 12: Two High-MLP Designs
[Figure: three MHA organizations]
- Banked: 32 entries spread across 8 banks (bank 0 .. bank 7); each entry holds explicitly addressed read subentries [off, dest] plus implicitly addressed write subentries.
- Unified: 32 entries in one set-associative structure (8-way sets: set 0, set 1, ...), with the same entry format.
- Current: 8 entries for misses of any type, each with per-word [off, dest] and data fields.
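The allocation behavior that distinguishes a banked organization can be sketched with a toy model (the modulo bank-selection hash and the 4-entries-per-bank split are assumptions for illustration, not the paper's exact parameters): a miss can be rejected because its bank is full even while other banks still have free entries.

```python
NUM_BANKS = 8
ENTRIES_PER_BANK = 4          # 8 banks x 4 entries = 32 entries (assumed split)

class BankedMHA:
    """Toy banked MHA: entries are partitioned across address-hashed banks."""

    def __init__(self):
        self.banks = [set() for _ in range(NUM_BANKS)]

    def allocate(self, line_addr):
        bank = self.banks[line_addr % NUM_BANKS]   # simple bank-selection hash
        if line_addr in bank:
            return True                            # secondary miss: merges
        if len(bank) >= ENTRIES_PER_BANK:
            return False                           # this bank is full: reject
        bank.add(line_addr)                        # primary miss: new entry
        return True
```

A fully unified structure would only reject once all 32 entries were in use; the banked one trades that flexibility for cheaper, narrower structures.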
Slide 13: Performance
[Figure: speedup over Current for Current, Unified, Banked, and Unlimited; Int.GM, FP.GM, Mix.GM]
- Superscalar: 15% in Mix.
- Checkpointed: 70% in Mix.
- LargeWindow: 65% in Mix.
Slide 14: Additional Issues
- Bus bandwidth concerns.
- Initial experiments with bus prioritization in CAVA: assign high priority to requests with low-confidence predictions; up to 20% speedup (swim).
- Relationship with the load-store queue.
Slide 15: Conclusion
- New MHAs are needed for high-MLP processors: they must support orders of magnitude more misses.
- It may be time to start rethinking miss handling.
- Bus prioritization might be a good idea, especially in a CMP world.
Slide 16: Bus Concerns
- Lots of misses means a lot of bus traffic.
- Fig. 9: bus contention normalized to Current in the Checkpointed processor (Current, Unified, Banked; SPEC applications and mixes).
Slide 17: Bus Request Prioritization?
- Initial experiments with bus prioritization in CAVA.
- Two priorities (high/low); assign high priority to requests with low-confidence predictions.
[Figure: speedup over No-Prio across applications and mixes]
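The prioritization idea can be sketched as a two-level arbiter (a hypothetical model, not the actual bus implementation): requests carrying low-confidence value predictions are the ones most likely to be on the true critical path, so they always win arbitration over the rest.

```python
from collections import deque

class PrioBus:
    """Toy two-level bus arbiter: high-priority requests always go first."""

    def __init__(self):
        self.high = deque()
        self.low = deque()

    def request(self, req_id, low_confidence_prediction):
        # A low-confidence prediction is likely wrong, so the processor will
        # probably need this data soon: tag the request high priority.
        (self.high if low_confidence_prediction else self.low).append(req_id)

    def grant(self):
        if self.high:
            return self.high.popleft()     # high priority always wins
        return self.low.popleft() if self.low else None
```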
Slide 18: A Few Words on LSQs
- Why have subentries at all? Alternative: make the MHA keep only the primary miss, and have each LSQ entry keep an entry-id that is searched when the miss completes (similar to AMD's Opteron?).
- This may not be a good idea in new high-MLP processors:
  - Scalable LSQ proposals are optimized for accesses from the processor side; cache-side searches would possibly require global searches.
  - With speculative retirement, instructions cause LSQ entries to be recycled.
- Bottom line: decouple things as much as possible and try to leave the LSQ alone; it has enough problems of its own.
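The subentry-free alternative above can be sketched as a toy model (hypothetical structure; the Opteron comparison on the slide is posed as a question, not a confirmed design): the MHA keeps only the primary miss, each LSQ entry records the MHA entry-id it waits on, and miss completion triggers exactly the cache-side global scan the slide warns about.

```python
class LSQ:
    """Toy LSQ where entries tag themselves with an MHA entry-id."""

    def __init__(self):
        self.entries = []                  # (instruction, mha_entry_id or None)

    def add(self, inst, mha_entry_id):
        self.entries.append((inst, mha_entry_id))

    def wakeup(self, mha_entry_id):
        # Cache-side search on miss completion: scan the WHOLE queue for
        # entries waiting on this MHA entry. Scalable LSQ designs optimized
        # for processor-side access are not built for this global search.
        woken = [inst for inst, eid in self.entries if eid == mha_entry_id]
        self.entries = [(i, e) for i, e in self.entries if e != mha_entry_id]
        return woken
```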
Slide 19: Experimental Setup

Processor (Conventional / Checkpointed, with LargeWindow values where they differ):
- Fetch/Issue/Commit width: 6/5/5
- SMT contexts: 2
- I-window/ROB size: 92/... (LargeWindow: 512/2048)
- Int/FP registers: 192/... (LargeWindow: .../2048)
- Ld/St queue entries: 60/50 (LargeWindow: 768/768)

Memory system:
- I-L1: 32 KB, 2-way; round-trip latency 2 cycles
- D-L1: 32 KB, 2-way; round-trip latency 3 cycles
- L2: 2 MB, 8-way; round-trip latency 15 cycles
- 16-stream stride prefetcher (between L2 and memory)
- Bus bandwidth: 10 GB/s; memory round-trip: 650 cycles
Slide 20: Assumptions
- Decoupled processor, cache, and MHA interaction: all requests sent to the cache are sure to be fulfilled.
- When the MHA is full, the cache locks up.
- The MHA is considered full when it is possible that a request won't be accepted.
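The conservative "full" rule above can be sketched like this (a toy model under the stated assumptions, with made-up entry and subentry counts): the MHA declares itself full as soon as some request could be rejected, either because a primary miss would find no free entry or because a secondary miss could land on an entry with no free subentry, so the cache locks up earlier than strictly necessary.

```python
class ConservativeMHA:
    """Toy MHA that locks up whenever a request MIGHT not be accepted."""

    def __init__(self, num_entries, subs_per_entry):
        self.num_entries = num_entries
        self.subs_per_entry = subs_per_entry
        self.entries = {}                    # line address -> subentries used

    def possibly_full(self):
        no_free_entry = len(self.entries) >= self.num_entries
        some_entry_out_of_subs = any(
            used >= self.subs_per_entry for used in self.entries.values()
        )
        # Conservative: either condition means SOME request could bounce.
        return no_free_entry or some_entry_out_of_subs

    def accept(self, line_addr):
        if self.possibly_full():
            return False                     # cache locks up instead
        self.entries[line_addr] = self.entries.get(line_addr, 0) + 1
        return True
```

Note the conservatism: once any one entry runs out of subentries, even a miss to a different line is refused, because the MHA cannot guarantee acceptance of an arbitrary next request.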
More informationTechniques for Efficient Processing in Runahead Execution Engines
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More informationComputer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James
Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving
More informationA Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt
Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3
More informationThe University of Texas at Austin
EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin
More informationPrefetch-Aware DRAM Controllers
Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin
More informationMARACAS: A Real-Time Multicore VCPU Scheduling Framework
: A Real-Time Framework Computer Science Department Boston University Overview 1 2 3 4 5 6 7 Motivation platforms are gaining popularity in embedded and real-time systems concurrent workload support less
More informationAccelerating and Adapting Precomputation Threads for Efficient Prefetching
In Proceedings of the 13th International Symposium on High Performance Computer Architecture (HPCA 2007). Accelerating and Adapting Precomputation Threads for Efficient Prefetching Weifeng Zhang Dean M.
More informationThreshold-Based Markov Prefetchers
Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More informationDynamic Memory Dependence Predication
Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is
More informationStaged Memory Scheduling
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:
More informationDynamic Speculative Precomputation
In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More informationCluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology
More informationDynamically Controlled Resource Allocation in SMT Processors
Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona
More informationMicroarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors
Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt Department of Electrical and Computer Engineering The University
More information15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011
15-740/18-740 Computer Architecture Lecture 14: Runahead Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 Reviews Due Today Chrysos and Emer, Memory Dependence Prediction Using
More informationSEVERAL studies have proposed methods to exploit more
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying
More informationContinual Flow Pipelines
Continual Flow Pipelines Srikanth T. Srinivasan Ravi Rajwar Haitham Akkary Amit Gandhi Mike Upton Microarchitecture Research Labs Intel Corporation {srikanth.t.srinivasan, ravi.rajwar, haitham.h.akkary,
More informationVirtual Memory. Virtual Memory
Virtual Memory Virtual Memory Main memory is cache for secondary storage Secondary storage (disk) holds the complete virtual address space Only a portion of the virtual address space lives in the physical
More informationSpeculative Parallelization in Decoupled Look-ahead
International Conference on Parallel Architectures and Compilation Techniques Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working
More informationMany Cores, One Thread: Dean Tullsen University of California, San Diego
Many Cores, One Thread: The Search for Nontraditional Parallelism University of California, San Diego There are some domains that feature nearly unlimited parallelism. Others, not so much Moore s Law and
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationZero-Value Caches: Cancelling Loads that Return Zero
2 th International Conference on Parallel Architectures and Compilation Techniques Zero-Value Caches: Cancelling Loads that Return Zero Mafijul Md. Islam and Per Stenstrom Department of Computer Science
More informationInstruction Based Memory Distance Analysis and its Application to Optimization
Instruction Based Memory Distance Analysis and its Application to Optimization Changpeng Fang cfang@mtu.edu Steve Carr carr@mtu.edu Soner Önder soner@mtu.edu Department of Computer Science Michigan Technological
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More informationComputer Architecture Lecture 24: Memory Scheduling
18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM
More informationStatistical Simulation of Chip Multiprocessors Running Multi-Program Workloads
Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads Davy Genbrugge Lieven Eeckhout ELIS Depment, Ghent University, Belgium Email: {dgenbrug,leeckhou}@elis.ugent.be Abstract This
More information