EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems

1 EE382N (20): Computer Architecture - Parallelism and Locality, Fall 2011. Lecture 23: Memory Systems. Mattan Erez, The University of Texas at Austin.

2 Outline: DRAM technology organization and mechanism; memory system organization and design options; new emerging trends. Most slides courtesy Jung Ho Ahn, SU.

3 DRAM is Expensive: $/bit is almost nothing (~$2 x 10^-9), but the memory in a system is expensive, 10-50% of system cost. Rule of thumb: the more memory the better, so minimize $/bit.
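
A quick sanity check of the $/bit figure above; the 16 GB capacity used here is an assumed example, not a number from the slides.

```python
# Back-of-the-envelope DRAM cost at the quoted ~$2e-9 per bit.
cost_per_bit = 2e-9                 # dollars per bit (from the slide)
capacity_bits = 16 * 2**30 * 8      # 16 GB, an assumed example capacity
print(f"~${cost_per_bit * capacity_bits:.0f} for 16 GB of DRAM")  # ~$275
```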

4 What is DRAM?

5 DRAM Cell

6 DRAM Cell: capacity/density. 3D Recessed Channel Array Transistor (capacitor on top): Samsung, Hynix, Elpida, Micron (Hynix 80nm). Trench capacitor: Infineon, Nanya, ProMos, Winbond; IBM and Toshiba in embedded-DRAM (Infineon 80nm).

7 DRAM Array

8 Sense Amplifier

9 DRAM Array

10 DRAM Array

11 DRAM Optimization: the more memory the better, so optimize for capacity; also need to worry about power and drivers. The result is a compromise in latency: small capacitors and large sub-arrays increase access time. What about bandwidth? Bandwidth is expensive, $0.05 - $0.10 per package pin, and DDR2 requires ~80 pins, so a secondary goal is optimizing BW/pin.
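
A small worked example of the pin-cost figure above; the 64-bit data bus and DDR2-800 transfer rate used for the BW/pin line are assumptions for illustration.

```python
# Package-pin cost for a DDR2-style interface, using the slide's figures.
pins = 80
pin_cost_low, pin_cost_high = 0.05, 0.10          # dollars per package pin
print(f"interface pin cost: ${pins * pin_cost_low:.2f} - ${pins * pin_cost_high:.2f}")

# BW/pin, assuming a 64-bit data bus at 800 MT/s (DDR2-800) -- illustrative only.
data_pins, transfers_per_s = 64, 800e6
bw_bytes = data_pins / 8 * transfers_per_s        # 6.4e9 bytes/s
print(f"{bw_bytes / 1e9:.1f} GB/s over {pins} pins -> "
      f"{bw_bytes * 8 / pins / 1e9:.2f} Gb/s per pin")
```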

12 Massively parallel processor architecture: the BW demand of the ALUs (> 4000 GB/s) >> the BW supply from the DRAMs (< 200 GB/s).

13 Stream processor architecture (LRF : local register file; SRF : stream register file). The LRF and SRF provide a hierarchy of bandwidth and locality, and the SRF decouples execution from the memory system.

14 Streaming Memory Systems (SMSs): off-chip DRAMs need to meet the processor's bandwidth demands, using multiple address-interleaved memory channels with high bandwidth per channel. (Figure: PEs 0-3 connected through the memory system to four memory channels.)
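
A minimal sketch of address-interleaved channel selection, in the spirit of the multiple channels described above; the channel count matches the figure, but the interleave granularity is an assumption.

```python
NUM_CHANNELS = 4         # four memory channels, as in the figure
INTERLEAVE_WORDS = 8     # assumed interleave granularity (words)

def channel_of(word_addr: int) -> int:
    """Map a word address to a memory channel by low-order interleaving."""
    return (word_addr // INTERLEAVE_WORDS) % NUM_CHANNELS

# A unit-stride stream spreads evenly across channels, balancing their load.
print([channel_of(a) for a in range(0, 64, 8)])   # [0, 1, 2, 3, 0, 1, 2, 3]
```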

15 The performance of a streaming memory system is very sensitive to access patterns: 1) because of load imbalance between the multiple memory channels.

17 The performance of a streaming memory system is very sensitive to access patterns: 2) because the performance of modern DRAMs is itself very sensitive to access patterns.

18 The performance of a streaming memory system is very sensitive to access patterns. (Figure: command and data timelines for a sequence of accesses to (Bank, Row, Column) addresses (0, 0, 0) through (0, 0, 4); blue marks writes, red marks reads.)

19 The performance of a streaming memory system is very sensitive to access patterns. (Figure: the same addresses serviced two ways; an all-write sequence issues act followed by back-to-back wr commands, while a sequence that mixes wr and rd commands to the same row completes fewer transfers in the same time.)

20 The performance of a streaming memory system is very sensitive to access patterns: parallelism and locality are necessary for efficient DRAM usage.

21 Memory system designs rely on the inherent parallelism/locality of memory accesses: a stream load or store operation yields a large number of related memory accesses (Stream A in the figure).

22 Memory system designs rely on the inherent parallelism/locality of memory accesses. Due to the blocked nature of accesses, an SMS can exploit parallelism by generating multiple references per cycle per thread.

23 Memory system designs rely on the inherent parallelism/locality of memory accesses. Due to the blocked nature of accesses, an SMS can exploit parallelism by generating multiple references per cycle per thread, and parallelism by generating references from multiple threads (Stream A and Stream load B).

24 Memory system designs rely on the inherent parallelism/locality of memory accesses. Due to the blocked nature of accesses, an SMS can exploit parallelism by generating multiple references per cycle per thread, parallelism by generating references from multiple threads, parallelism by both of the above, and locality by generating the entire set of references of a thread. It is important to understand how the interactions between these different factors affect performance.

25 Outline: DRAM technology organization, mechanism, and trends; memory system organization and design options; a few results. Most slides courtesy Jung Ho Ahn, HP Labs.

26 A DRAM chip contains multiple memory banks, where each bank is a 2-D array addressed by a (bank, row, column) request. There are many shared resources on a DRAM chip: row & column accesses share the request path, data is read & written through a shared data path, and all banks share the request and data paths. This sharing, and the dynamic nature of DRAM, result in strict access rules and timing constraints. (Figure: banks 0 through n-1, each with a memory array, sense amplifiers, a row decoder, and a column decoder.)
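
A minimal sketch of how a flat word address within one channel could be split into the (bank, row, column) triple used in the following slides; the field widths are assumptions for illustration, not the geometry of any particular part.

```python
# Assumed geometry: 8 banks, 2**14 rows, 2**10 columns (illustrative only).
BANK_BITS, ROW_BITS, COL_BITS = 3, 14, 10

def decompose(addr: int):
    """Split a word address into (bank, row, column); column bits are lowest."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = addr >> (COL_BITS + BANK_BITS)
    return bank, row, col

print(decompose(0))   # (0, 0, 0)
print(decompose(1))   # (0, 0, 1) -- same bank and row, different column
```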

27 A DRAM follows rules and occupies resources to access a location. (Figure: each operation - precharge, activate, read, write - occupies the bank, the request path, and the data path for some number of cycles.)

28 A DRAM follows rules and occupies resources to access a location. Simplified bank state diagram: act, rd, wr, and pre commands move a bank between its states. (Figure highlights the resource utilization of an activate.)

29 A DRAM follows rules and occupies resources to access a location. (Figure highlights the resource utilization of a read.)

30 A DRAM follows rules and occupies resources to access a location. (Figure highlights the resource utilization of a precharge.)

31 DRAM operation sequences for two memory read requests, addressed as (Bank, Row, Column). 1. (0, 0, 0) then (0, 0, 1), different column: act (0, 0, *), rd (0, *, 0), rd (0, *, 1).

32 DRAM operation sequences for two memory read requests, addressed as (Bank, Row, Column). 1. (0, 0, 0) then (0, 0, 1), different column: act (0, 0, *), rd (0, *, 0), rd (0, *, 1). 2. (0, 0, 0) then (1, 1, 0), different bank & row: act (0, 0, *), rd (0, *, 0); act (1, 1, *), rd (1, *, 0).

33 DRAM operation sequences for two memory read requests, addressed as (Bank, Row, Column). 1. (0, 0, 0) then (0, 0, 1), different column: act (0, 0, *), rd (0, *, 0), rd (0, *, 1). 2. (0, 0, 0) then (1, 1, 0), different bank & row: act (0, 0, *), rd (0, *, 0); act (1, 1, *), rd (1, *, 0). 3. (0, 0, 0) then (0, 1, 1), different row & column: act (0, 0, *), rd (0, *, 0), pre (0, 0, *), act (0, 1, *), rd (0, *, 1). This internal bank conflict requires more commands and cycles; a sketch of the command-generation logic follows below.
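
A minimal sketch that reproduces the three cases above by tracking which row each bank has open; it follows the simplified state diagram on these slides rather than any specific controller implementation.

```python
open_row = {}  # bank -> row currently held in the sense amplifiers

def commands_for(bank: int, row: int, op: str):
    """Return the DRAM command sequence needed for one column access."""
    cmds = []
    if open_row.get(bank) != row:
        if bank in open_row:                 # internal bank conflict: wrong row open
            cmds.append(("pre", bank))
        cmds.append(("act", bank, row))      # activate the requested row
        open_row[bank] = row
    cmds.append((op, bank, row))             # the rd or wr column command
    return cmds

print(commands_for(0, 0, "rd"))   # [('act', 0, 0), ('rd', 0, 0)]
print(commands_for(0, 0, "rd"))   # [('rd', 0, 0)]                 -- different column
print(commands_for(1, 1, "rd"))   # [('act', 1, 1), ('rd', 1, 1)]  -- different bank
print(commands_for(0, 1, "rd"))   # [('pre', 0), ('act', 0, 1), ('rd', 0, 1)]
```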

34 The activate-to-activate time determines random-access performance: tRR is the act-to-act time between different banks, and tRC is the act-to-act time within the same bank. (Figure: a command timeline of act, act, pre, act across banks 0 through n-1.)

35 Switches between read and write commands and data transfers require timing delays: tDWR is the write-request-to-read-request time, tDRW is the read-request-to-write-request time, tRWbub is the Hi-Z bubble between read and write data transfers, and tWRbub is the Hi-Z bubble between write and read data transfers. A sketch of checking such constraints follows below.
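
A minimal sketch of how a controller might check the activate-to-activate and read/write turnaround constraints named above before issuing a command; the parameter values are placeholders, not real device timings.

```python
# Placeholder timing parameters in controller cycles (illustrative values only).
tRR, tRC, tDWR, tDRW = 4, 20, 6, 4

last_act = {}              # bank -> cycle of its last activate
last_any_act = -10**9      # cycle of the most recent activate to any bank
last_rw = (None, -10**9)   # (op, cycle) of the last read or write command

def can_issue(cmd: str, bank: int, now: int) -> bool:
    """Check act-to-act and read/write turnaround constraints for a command."""
    if cmd == "act":
        return (now - last_any_act >= tRR and
                now - last_act.get(bank, -10**9) >= tRC)
    if cmd in ("rd", "wr"):
        prev_op, prev_cycle = last_rw
        if prev_op == "wr" and cmd == "rd":
            return now - prev_cycle >= tDWR
        if prev_op == "rd" and cmd == "wr":
            return now - prev_cycle >= tDRW
    return True

def record_issue(cmd: str, bank: int, now: int) -> None:
    """Update the bookkeeping after a command is actually issued."""
    global last_any_act, last_rw
    if cmd == "act":
        last_act[bank] = now
        last_any_act = now
    elif cmd in ("rd", "wr"):
        last_rw = (cmd, now)
```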

36 DRAM parameter trends over various DRAM generations. (Chart: (tCK) x 10, bits per column command, and 1/(pin bandwidth) plotted across the SDRAM, DDR, DDR2, GDDR3, GDDR4, and XDR generations.)

37 DRAM parameter trends over various DRAM generations. (Chart: tCK, tRC (act to act within the same bank), tRR (act to act between different banks), and tDWR (write cmd to read cmd time) across the same generations.) Timing trends show that DRAM performance is very sensitive to the presented access patterns.

38 Outline: DRAM technology organization, mechanism, and trends; memory system organization and design options; a few results. Most slides courtesy Jung Ho Ahn, SU.

39 Streaming Memory Systems: bulk stream loads and stores, hierarchical control, and expressive and effective addressing modes. We can't afford to waste memory bandwidth, so hardware is used when performance is non-deterministic. Between the SRF and MEM, the addressing modes include strided access, gather, and scatter, with automatic SIMD alignment that makes (short-vector) SIMD trivial. The stream memory system helps the programmer and maximizes I/O throughput.

40 A streaming memory system consists of AGs, a cross-point switch, and MCs (AG : address generator; MC : memory channel). (Figure: PEs 0-3 feed two AG & RB pairs, which connect through the switch to four memory channels.)

41 An AG translates a memory access thread into a sequence of individual memory requests, each an [address, data] pair: one AG generates [6, a] [10, b] [14, c] [18, d] and the other generates [7, e] [11, f] [15, g] [19, h].

42 An AG translates a memory access thread into a sequence of individual memory requests. Record size : the number of consecutive words per data record mapped to a PE.

43 An AG translates a memory access thread into a sequence of individual memory requests. Stride : the address gap between consecutive records. A strided address-generation sketch follows below.
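
A minimal sketch of strided address generation with a record size and a stride; interpreting the figure's addresses as base 6, record size 2, and stride 4 is my reading of the example, not something stated on the slide.

```python
def strided_addresses(base: int, record_size: int, stride: int, n_records: int):
    """Yield word addresses for a strided stream access: record_size consecutive
    words per record, with consecutive records separated by stride words."""
    for r in range(n_records):
        for w in range(record_size):
            yield base + r * stride + w

# One reading of the figure: base 6, record size 2, stride 4, four records.
print(list(strided_addresses(6, 2, 4, 4)))   # [6, 7, 10, 11, 14, 15, 18, 19]
```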

44 Cross-point Switch: routes each request from the AGs to the appropriate memory channel ([10, b] [18, d] [11, f] [19, h] to one channel, [6, a] [14, c] [7, e] [15, g] to another). On-chip and off-chip memory have different address spaces.

45 A memory channel contains MSHRs, channel buffer entries, and a memory controller (MSHR : miss status handling register). (Figure: the requests [10, b] [18, d] [11, f] [19, h] arrive at one memory channel.)

46 A memory channel contains MSHRs, channel buffer entries, and a memory controller. (Figure: the requests [10, b] [18, d] [11, f] [19, h] are held in the channel buffer as [(0, 1), b*], [(0, 2), d*], [(0, 1), *f], [(0, 2), *h].) The memory controller checks all the pending requests in the channel buffer and generates the proper commands.

47 The design space of a Streaming Memory System. Address generator: the number of AGs and the AG width. Memory channel: the number of channel buffer entries, the MAS policy, and the channel-split configuration.

48 AG design space: a single wide AG vs. multiple narrow AGs (inter-thread vs. intra-thread parallelism, load balancing across the MCs); the AG width is the number of accesses each AG can generate per cycle.

49 The memory controller in an MC determines the command to issue each cycle based on the MAS policy. Per command cycle, the memory controller looks at the status of every bank (e.g., bank 0 : row 2 active; bank 1 : row 0 active; bank 2 : row 3 precharging; bank 3 : idle).

50 The memory controller in an MC determines the command to issue each cycle based on the MAS policy. Per command cycle, the memory controller looks at the status of every bank and finds an available command per pending access without violating timing and resource constraints. Example (a read occurred in the previous cycle): with the bank status above, the pending requests in the channel buffer yield (0, 0, 0) write - precharge; (1, 0, 0) read - read; (0, 2, 1) write; (1, 0, 1) read - read.

51 The memory controller in an MC determines the command to issue each cycle based on the MAS policy. Per command cycle, the memory controller looks at the status of every bank, finds an available command per pending access without violating timing and resource constraints, and selects the command to issue among all available commands based on the priority of the chosen policy.

52 Scheduling policies. Open : a row is precharged when there are no pending accesses to the row and there is at least one pending access to a different row in the same bank. Example (bank 0 : row 0 is active) with pending requests in the channel buffer - Case 1: (1, 0, 0) write, (1, 0, 0) read, (2, 0, 1) write, (2, 0, 1) read; Case 2: (1, 0, 0) write, (1, 3, 0) read, (2, 5, 1) write, (0, 1, 1) read.

53 Scheduling policies. Open : a row is precharged when there are no pending accesses to the row and there is at least one pending access to a different row in the same bank. Closed : a row is precharged as soon as the last available reference to that row is performed. (Same example cases as above; a sketch of the precharge decision follows below.)
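
A minimal sketch of the precharge decision under the two policies just described; pending is an assumed list of (bank, row, column) requests standing in for the channel buffer.

```python
def should_precharge(bank, active_row, pending, policy):
    """Decide whether to precharge a bank whose active_row is currently open."""
    same_row  = any(b == bank and r == active_row for b, r, _ in pending)
    other_row = any(b == bank and r != active_row for b, r, _ in pending)
    if policy == "open":
        # Precharge only when nothing is left for the open row but another row waits.
        return not same_row and other_row
    if policy == "closed":
        # Precharge as soon as the last access to the open row has been served.
        return not same_row
    return False

# Case 1 from the slide: bank 0 has row 0 open, and no pending request touches bank 0.
pending = [(1, 0, 0), (1, 0, 0), (2, 0, 1), (2, 0, 1)]
print(should_precharge(0, 0, pending, "open"))    # False: keep the row open
print(should_precharge(0, 0, pending, "closed"))  # True: close the row eagerly
```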

54 The MAS determines the command order used to serve pending memory accesses. Policy table (Algorithm / Window size / Reorder row commands / Reorder column commands / Precharging / Access selection): inorder / 1 / N/A / N/A / N/A / N/A. The inorder policy processes pending requests one by one, effectively having a window size of 1.

55 Policy table, continued: inorderla / ncb / Yes / No / Open / Column First. Inorderla looks ahead of other pending requests and generates row commands, but not column commands.

56 Policy table, continued: firstready / ncb / Yes / Yes / Open / N/A. The firstready policy checks and processes pending requests one by one from the oldest until it finishes looking at all the CBEs.

57 Policy table, continued: opcol / ncb / Yes / Yes / Open / Column First; oprow / ncb / Yes / Yes / Open / Row First; clcol / ncb / Yes / Yes / Closed / Column First; clrow / ncb / Yes / Yes / Closed / Row First. The remaining four policies reorder both row and column commands, using open or closed precharging and column-first or row-first access selection.

58 A new micro-architecture design for a memory channel: a channel-split configuration with MSHRs and CBEs per AG (nAG sets in total). The channel switches threads only when a thread's requests are completely drained, avoiding resource monopolization, internal bank conflicts, and read/write turnaround penalties.

59 Outline: DRAM technology and its impact on the memory system; stream architecture review; DRAM organization, mechanism, and trends; memory system organization and design options; a few results. Most slides courtesy Jung Ho Ahn, HP Labs.

60 Six applications from the multimedia and scientific domains are used for this study. DEPTH : stereo depth encoder; MPEG : MPEG-2 video encoder; RTSL : graphics rendering pipeline; QRD : complex matrix decomposition into upper triangular & orthogonal matrices; FEM : finite element method; MOLE : n-body molecular dynamics.

61 A cycle-accurate Imagine simulator is used for performance evaluation: 1 GHz, 8 processing elements. (Figure: PEs 0-3, AGs, and four memory channels.)

62 A cycle-accurate Imagine simulator is used for performance evaluation. Configuration: AG width : 4; # of MCs : 4; # of CBEs : 16 (16 per AG for the channel-split configuration); MAS policy : opcol.

63 A cycle-accurate Imagine simulator is used for performance evaluation. DRAM parameters: burst length : 4 words; peak BW : 4 GW/s; # of internal banks : 8; timing params : XDR DRAM; peak BW : 2; # of commands for accessing an inactive row : 3.

64 Memory system performance for representative configurations on the six apps. (Chart: throughput in GW/s for DEPTH, MPEG, RTSL, MOLE, FEM, and QRD under the configurations (# of AGs, burst length, MAS) 2ag/burst32/inorder, 2ag/burst4/inorder, 2ag/burst4/opcol, 1ag/burst4/opcol, and 2agcs/burst4/opcol, with read-write switch and row-command overheads shown as percentages.)

65 The key memory-system-related characteristics of the six applications. (Table: per application, the average strided-access record size (W), stream length (W), and stride/record; the average indexed-access record size (W), stream length (W), and index range; and the fractions of strided accesses and read accesses. Reads are 63.0% of accesses for DEPTH and 70.2% for MPEG.) DEPTH & MPEG have small record sizes, strides, and index ranges.

66 Memory system performance for representative configurations on the six apps (same chart as above). Small record sizes, strides, and index ranges mean high spatial locality between the requests generated from an access thread.

67 The key memory-system-related characteristics of the six applications (same table as above; reads are 83.5% of accesses for RTSL and 99.5% for MOLE). Streams with small record sizes and large index ranges lack spatial locality between generated requests.

68 Memory system performance for representative configurations on the six apps (same chart as above). Long bursts hurt memory system performance.

69 The key memory-system-related characteristics of the six applications (same table as above; QRD has no indexed accesses, 100% strided accesses, and 69.0% reads; FEM has 74.0% reads). Large record sizes mean high spatial locality in the generated requests.

70 Memory system performance for representative configurations on the six apps (same chart as above). QRD & FEM perform similarly to DEPTH & MPEG.

71 Performance sensitivity to the size of the MC buffers. (Chart: throughput in GW/s for DEPTH, MPEG, RTSL, MOLE, FEM, and QRD under the 1ag, 2agcs, 2ag, 4agcs, and 4ag configurations with ncb = 4, 16, 32, 64, and 256 entries.)

74 Performance sensitivity to the size of the MC buffers rises as the number of AGs is increased (same chart as above).

75 Reordering row & column commands is important, but the specific MAS policy is not. (Chart: throughput in GW/s for DEPTH and FEM under the 1ag, 2agcs, and 2ag configurations with the inorder, inorderla, firstready, oprow, opcol, clrow, and clcol policies.)

76 Emerging Memory Technology. New non-volatile storage devices: fine-grained access like DRAM, non-volatile like FLASH; no refresh (or almost no refresh); density equal to or better than DRAM; potentially more scalable than DRAM; higher latencies, especially for writes; higher write energy; possible endurance issues; very similar array structure and interface design to DRAM. New interface options: 3D integration, memory on top of a processor, memory cubes, optical interconnect.

77 Conclusion. DRAM trends: data BW increases rapidly while latency and command BW improve slowly; DRAM access granularity grows; throughput is very sensitive to access patterns, so locality must be exploited to minimize internal bank conflicts and read-write turnaround penalties. Memory system design space: the number of AGs (inter-thread vs. intra-thread parallelism); load balance across the MCs (channel interleaving, multiple threads, and AG width); the amount of MC buffering determines the window size of the MAS. Design suggestions: a single wide AG exploits locality well; the channel-split mechanism exploits locality and balances loads across multiple channels simultaneously, at the cost of additional hardware.


More information

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache Is the cache indexed with virtual or physical address? To index with a physical address, we

More information

Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.

Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B. Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.5) Memory Technologies Dynamic Random Access Memory (DRAM) Optimized

More information

The Memory Hierarchy & Cache

The Memory Hierarchy & Cache Removing The Ideal Memory Assumption: The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today.

2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today. ESE534: Computer Organization Previously Day 7: February 6, 2012 Memories Arithmetic: addition, subtraction Reuse: pipelining bit-serial (vectorization) Area/Time Tradeoffs Latency and Throughput 1 2 Today

More information

Lecture 11 Cache. Peng Liu.

Lecture 11 Cache. Peng Liu. Lecture 11 Cache Peng Liu liupeng@zju.edu.cn 1 Associative Cache Example 2 Associative Cache Example 3 Associativity Example Compare 4-block caches Direct mapped, 2-way set associative, fully associative

More information

CS 152 Computer Architecture and Engineering. Lecture 6 - Memory

CS 152 Computer Architecture and Engineering. Lecture 6 - Memory CS 152 Computer Architecture and Engineering Lecture 6 - Memory Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste! http://inst.eecs.berkeley.edu/~cs152!

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information