EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 23 Memory Systems
1 EE382N (20): Computer Architecture - Parallelism and Locality, Fall 2011. Lecture 23: Memory Systems. Mattan Erez, The University of Texas at Austin. EE382N: Principles of Computer Architecture, Fall 2011 -- Lecture 23 (c) Mattan Erez, Jung Ho Ahn
2 Outline: DRAM technology - organization and mechanism; memory system organization and design options; new emerging trends. Most slides courtesy Jung Ho Ahn, SNU.
3 DRAM is Expensive: $/bit is almost nothing ($2 x 10^-9), yet the memory in a system is expensive - 10%-50% of system cost. Rule of thumb: the more memory the better, so minimize $/bit.
4 What is DRAM?
5 DRAM Cell
6 DRAM Cell: optimized for capacity/density. 3D Recessed Channel Array Transistor with capacitor on top - Samsung, Hynix, Elpida, Micron. Trench capacitor - Infineon, Nanya, ProMOS, Winbond; IBM and Toshiba in embedded-DRAM. [Micrographs: Hynix 80nm, Infineon 80nm]
7 DRAM Array
8 Sense Amplifier
9 DRAM Array
10 DRAM Array
11 DRAM Optimization: the more memory the better, so optimize capacity; also need to worry about power and drivers. The result is a compromise in latency - small capacitors and large sub-arrays increase access time. What about bandwidth? Bandwidth is expensive: $0.05-$0.10 per package pin, and DDR2 requires 80 pins. The secondary goal is therefore optimizing BW/pin.
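As a back-of-the-envelope check on why BW/pin matters, the pin-cost figures on the slide imply a nontrivial per-device interface cost. A minimal sketch using only the slide's numbers:

```python
# Package-pin cost of a DDR2 interface, using the slide's figures:
# $0.05-$0.10 per pin and 80 pins per DDR2 device.
PINS_DDR2 = 80
COST_PER_PIN_LOW, COST_PER_PIN_HIGH = 0.05, 0.10

low = PINS_DDR2 * COST_PER_PIN_LOW
high = PINS_DDR2 * COST_PER_PIN_HIGH
print(f"DDR2 interface pin cost: ${low:.2f} - ${high:.2f}")
```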
12 Massively parallel processor architecture: > 4000 GB/s demanded vs. < 200 GB/s supplied - the BW demand of the ALUs far exceeds the BW supply from the DRAMs.
13 Stream processor architecture: register files (LRF), stream register file (SRF), and memory system. The LRF and SRF provide a hierarchy of bandwidth and locality; the SRF decouples execution from memory.
14 Streaming Memory Systems (SMSs): PE 0-3 connect to a memory system with multiple memory channels. Off-chip DRAMs need to meet the processor's bandwidth demands: multiple address-interleaved memory channels, with high bandwidth per channel.
15-16 The performance of a streaming memory system is very sensitive to access patterns - 1) because of load imbalance between the multiple memory channels.
17 The performance of a streaming memory system is very sensitive to access patterns - 2) because the performance of modern DRAMs is itself very sensitive to access patterns.
18-19 [Timing diagrams: command and data timelines for the access sequence (0, 0, 0) through (0, 0, 4), given as (Bank, Row, Column); blue = write, red = read. A pure write burst (act, then wr wr wr wr wr) streams data back-to-back, while a mixed sequence (act, wr, rd, wr, ...) leaves idle cycles on the data bus.]
20 The performance of a streaming memory system is very sensitive to access patterns: parallelism and locality are necessary for efficient DRAM usage.
21-24 Memory system designs rely on the inherent parallelism/locality of memory accesses. A stream load or store operation yields a large number of related memory accesses. Due to this blocked nature of accesses, an SMS can exploit: parallelism, by generating multiple references per cycle per thread; parallelism, by generating references from multiple threads; parallelism, by both of the above; and locality, by generating all the references of a thread together. It is important to understand how the interactions between these different factors affect performance.
25 Outline: DRAM technology - organization, mechanism, and trends; memory system organization and design options; a few results. Most slides courtesy Jung Ho Ahn, HP Labs.
26 A DRAM chip contains multiple memory banks, where each bank is a 2-D array addressed by a (bank, row, column) request. [Figure: banks 0 to n-1, each with a memory array, row decoder, sense amplifier, and column decoder.] There are many shared resources on a chip: row and column accesses share the request path; data is read and written through a shared data path; and all banks share the request and data paths. This sharing and the dynamic nature of DRAM result in strict access rules and timing constraints.
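The (bank, row, column) decomposition above can be sketched as follows. The geometry and the bit ordering are illustrative assumptions, not the mapping of any particular part:

```python
# A minimal sketch of decomposing a flat word address into
# (bank, row, column) for a hypothetical DRAM geometry.
NUM_BANKS, NUM_ROWS, NUM_COLS = 8, 4096, 1024

def decompose(addr):
    """Column bits are least significant, so consecutive addresses
    land in the same open row (spatial locality); bank bits come
    next, so larger strides interleave across banks (parallelism)."""
    col = addr % NUM_COLS
    addr //= NUM_COLS
    bank = addr % NUM_BANKS
    row = (addr // NUM_BANKS) % NUM_ROWS
    return bank, row, col

# Consecutive words share a bank and row -> row-buffer hits.
print(decompose(0), decompose(1))
```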
27-30 A DRAM follows rules and occupies resources to access a location. [Simplified bank state diagram: pre brings a bank to idle; act opens a row; rd and wr access the open row; pre closes it again.] Each operation - Precharge, Activate, Read, Write - occupies the bank, the request path, and the data path for its own characteristic number of cycles, as shown in the per-operation resource utilization charts.
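The simplified bank state diagram above can be sketched as a tiny state machine. This is an illustrative model of the rules described on the slides, with timing omitted:

```python
# A minimal sketch of the simplified DRAM bank state diagram:
# a bank is idle (precharged) until an activate (act) opens a row;
# reads/writes (rd/wr) then hit the open row; precharge (pre)
# returns the bank to idle.
class Bank:
    def __init__(self):
        self.open_row = None  # None means the bank is precharged/idle

    def commands_for(self, row):
        """Commands needed before a column access to `row` can issue."""
        if self.open_row == row:
            return []                # row-buffer hit
        if self.open_row is None:
            return ["act"]           # bank idle: just activate
        return ["pre", "act"]        # another row is open: close it first

    def apply(self, cmd, row=None):
        if cmd == "act":
            self.open_row = row
        elif cmd == "pre":
            self.open_row = None

b = Bank()
print(b.commands_for(0))   # ['act'] - bank idle
b.apply("act", 0)
print(b.commands_for(0))   # [] - row-buffer hit
print(b.commands_for(1))   # ['pre', 'act'] - internal bank conflict
```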
31-33 DRAM operation sequences for two memory read requests, Read (Bank, Row, Column):
1. (0, 0, 0) then (0, 0, 1) - different column: act (0, 0, *); rd (0, *, 0); rd (0, *, 1)
2. (0, 0, 0) then (1, 1, 0) - different bank and row: act (0, 0, *); rd (0, *, 0); act (1, 1, *); rd (1, *, 0)
3. (0, 0, 0) then (0, 1, 1) - different row and column: act (0, 0, *); rd (0, *, 0); pre (0, 0, *); act (0, 1, *); rd (0, *, 1)
Case 3 is an internal bank conflict: it requires more commands and more cycles.
34 Activate-to-activate time determines random access performance: tRR - act to act between different banks; tRC - act to act within the same bank. [Figure: clock/request timeline with act, act, pre, act issued across banks 0 to n-1.]
35 Switches between read and write commands and data transfers require timing delays: tDWR - write request to read request time; tDRW - read request to write request time; tRWBUB - Hi-Z bubble between read and write data transfers; tWRBUB - Hi-Z bubble between write and read data transfers.
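A command scheduler has to respect all of these constraints at once. A minimal sketch of such a legality check, using the parameter names from the slides but with made-up cycle counts (they are illustrative assumptions, not datasheet values):

```python
# A minimal sketch of enforcing the command-timing constraints named
# on the slides: tRR, tRC (activate spacing) and tDWR, tDRW
# (read/write turnaround). Cycle counts are illustrative only.
T = {"tRR": 2, "tRC": 10, "tDWR": 4, "tDRW": 2}

last_act = {}           # bank -> cycle of its last activate
last_act_any = -999     # cycle of the last activate to any bank
last_rw = (None, -999)  # (kind, cycle) of the last rd/wr command

def can_issue(cmd, bank, now):
    if cmd == "act":
        return (now - last_act_any >= T["tRR"] and
                now - last_act.get(bank, -999) >= T["tRC"])
    kind, when = last_rw
    if cmd == "rd" and kind == "wr":
        return now - when >= T["tDWR"]   # write-to-read turnaround
    if cmd == "wr" and kind == "rd":
        return now - when >= T["tDRW"]   # read-to-write turnaround
    return True

last_act_any = 0
last_act[0] = 0                 # pretend an act to bank 0 at cycle 0
print(can_issue("act", 1, 1))   # False: tRR not yet met
print(can_issue("act", 1, 2))   # True: different bank, tRR satisfied
print(can_issue("act", 0, 5))   # False: tRC within the same bank
```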
36 DRAM parameter trends over generations (SDR, DDR, DDR2, GDDR3, GDDR4, XDR). [Chart: tCK x 10 (ns), bits per column command, and 1/(pin bandwidth) (ns/bit) across generations.]
37 DRAM timing-parameter trends over generations (SDR, DDR, DDR2, GDDR3, GDDR4, XDR). [Log-scale chart: tCK; tRC - act to act within the same bank; tRR - act to act between different banks; tDWR - write cmd to read cmd time.] The timing trends show that DRAM performance is very sensitive to the presented access patterns.
38 Outline: DRAM technology - organization, mechanism, and trends; memory system organization and design options; a few results. Most slides courtesy Jung Ho Ahn, SNU.
39 Streaming Memory Systems: bulk stream loads and stores; hierarchical control; expressive and effective addressing modes. We can't afford to waste memory bandwidth, so use hardware when performance is non-deterministic. Between SRF and MEM: strided access, gather, and scatter, with automatic SIMD alignment - this makes (short-vector) SIMD trivial. The stream memory system helps the programmer and maximizes I/O throughput.
40 A streaming memory system consists of AGs, a cross-point switch, and MCs. [Figure: PE 0-3 feed two address generators (AG, AG & RB), which connect through the switch to four memory channels.] AG: address generator; MC: memory channel.
41-43 An AG translates a memory access thread into a sequence of individual memory requests, e.g. [6, a] [10, b] [14, c] [18, d] and [7, e] [11, f] [15, g] [19, h] as [address, data] pairs. Record size: the number of consecutive words per data record mapped to a PE. Stride: the address gap between consecutive records.
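The record-size and stride parameters above can be sketched as a simple address generator; the function name and parameters are illustrative, but the example reproduces the slide's address sequence:

```python
# A minimal sketch of strided address generation as described on the
# slide: each record is `record_size` consecutive words, and records
# are `stride` words apart.
def strided_addresses(base, stride, record_size, num_records):
    addrs = []
    for r in range(num_records):
        rec_base = base + r * stride
        addrs.extend(rec_base + w for w in range(record_size))
    return addrs

# base 6, stride 4, 1-word records, 4 records -> addresses 6, 10, 14, 18,
# matching the slide's [6, a] [10, b] [14, c] [18, d] sequence.
print(strided_addresses(6, 4, 1, 4))
```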
44 Cross-point switch: routes each AG's requests (e.g. [10, b] [18, d] [11, f] [19, h] from one AG and [6, a] [14, c] [7, e] [15, g] from the other) to the proper memory channel. On-chip and off-chip have different address spaces.
45-46 A memory channel contains MSHRs, channel buffer entries, and a memory controller (MSHR: miss status handling register). Incoming requests such as [10, b] and [18, d] are translated into bank/row-addressed channel buffer entries such as [(0, 1), b*] and [(0, 2), d*]. The memory controller checks all the pending requests in the channel buffer and generates the proper DRAM commands.
47 The design space of a Streaming Memory System. Address generator: number of AGs; AG width. Memory channel: number of channel buffer entries; MAS (memory access scheduling) policy; channel-split configuration.
48 AG design space: a single wide AG vs. multiple narrow AGs - inter-thread vs. intra-thread parallelism, and load balancing across the MCs. The AG width is the number of accesses each AG can generate per cycle.
49-51 The memory controller in an MC determines the command to issue each cycle based on the MAS policy. Per command cycle, the memory controller: looks at the status of every bank (e.g. bank 0: row 2 active; bank 1: row 0 active; bank 2: row 3 precharging; bank 3: idle); finds an available command per pending access without violating timing and resource constraints (e.g. pending (0, 0, 0) write -> precharge; (1, 0, 0) read -> read; (0, 2, 1) write; (1, 0, 1) read -> read, given that a read occurred in the previous cycle); and selects the command to issue among all available commands based on the priority of the chosen policy.
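The three-step per-cycle loop above can be sketched as follows. This is an illustrative model only - timing checks are omitted, and the state representation and the simple column-first priority order are assumptions, not the controller's actual implementation:

```python
# A minimal sketch of the per-cycle scheduling loop: derive a candidate
# command for each pending request from the bank state, then pick one
# by a column-first priority (timing constraints omitted for brevity).
def candidate(bank_state, req):
    bank, row, col, kind = req
    open_row = bank_state.get(bank)        # None = precharged/idle
    if open_row == row:
        return kind                        # 'rd' or 'wr' can issue now
    if open_row is None:
        return "act"                       # bank idle: activate the row
    return "pre"                           # conflicting row is open

def pick(bank_state, pending):
    cands = [(candidate(bank_state, r), r) for r in pending]
    # Column-first: prefer rd/wr over act over pre; ties go to the oldest.
    order = {"rd": 0, "wr": 0, "act": 1, "pre": 2}
    return min(cands, key=lambda c: order[c[0]])

state = {0: 2, 1: 0, 2: None, 3: None}     # bank -> open row (or None)
pending = [(0, 0, 0, "wr"), (1, 0, 0, "rd"), (0, 2, 1, "wr")]
print(pick(state, pending))  # the row-buffer-hitting read wins
```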
52-53 Scheduling policies. Open: a row is precharged only when there are no pending accesses to that row and there is at least one pending access to a different row in the same bank. Closed: a row is precharged as soon as the last pending reference to that row is performed. Example pending requests in the channel buffer, with bank 0's row 0 active - Case 1: (1, 0, 0) write, (1, 0, 0) read, (2, 0, 1) write, (2, 0, 1) read. Case 2: (1, 0, 0) write, (1, 3, 0) read, (2, 5, 1) write, (0, 1, 1) read.
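The open/closed distinction can be sketched as a single precharge decision over the slide's two example cases. The function is an illustrative assumption of how the rule would be evaluated, using (bank, row, column) tuples as on the slide:

```python
# A minimal sketch contrasting the open and closed precharge policies:
# should bank `bank`, with `open_row` active, be precharged given the
# pending requests?
def should_precharge(policy, bank, open_row, pending):
    to_open = [r for r in pending if r[0] == bank and r[1] == open_row]
    to_other = [r for r in pending if r[0] == bank and r[1] != open_row]
    if policy == "open":
        # Precharge only if nothing wants the open row but another row does.
        return not to_open and bool(to_other)
    # Closed: precharge as soon as the open row has no pending references.
    return not to_open

case1 = [(1, 0, 0), (1, 0, 0), (2, 0, 1), (2, 0, 1)]  # no bank-0 work
case2 = [(1, 0, 0), (1, 3, 0), (2, 5, 1), (0, 1, 1)]  # (0,1,1) waits

print(should_precharge("open", 0, 0, case1))    # False: nothing to gain
print(should_precharge("closed", 0, 0, case1))  # True
print(should_precharge("open", 0, 0, case2))    # True: row 1 is waiting
```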
54-57 MAS determines the command order used to serve pending memory accesses:

Algorithm  | Window size | Reorder row cmds | Reorder column cmds | Precharging | Access selection
inorder    | 1           | N/A              | N/A                 | N/A         | N/A
inorderla  | ncb         | Yes              | No                  | Open        | Column first
firstready | ncb         | Yes              | Yes                 | Open        | N/A
opcol      | ncb         | Yes              | Yes                 | Open        | Column first
oprow      | ncb         | Yes              | Yes                 | Open        | Row first
clcol      | ncb         | Yes              | Yes                 | Closed      | Column first
clrow      | ncb         | Yes              | Yes                 | Closed      | Row first

The inorder policy processes pending requests one by one, effectively having a window size of 1. Inorderla looks ahead of other pending requests and generates row commands, but not column commands. Firstready checks and processes pending requests one by one, from the oldest, until it has looked at all the CBEs. The remaining four policies reorder both row and column commands, combining open/closed precharging with column-first/row-first access selection.
58 A new micro-architecture design for a memory channel - the channel-split configuration: MSHRs and channel buffer entries are provided per AG, feeding a shared memory controller, and the controller switches threads only when the current thread's requests are completely drained. This avoids resource monopolization, internal bank conflicts, and read/write turnaround penalties.
59 Outline: DRAM technology and its impact on the memory system; stream architecture review; DRAM organization, mechanism, and trends; memory system organization and design options; a few results. Most slides courtesy Jung Ho Ahn, HP Labs.
60 Six applications from multimedia and scientific domains are used for the study: DEPTH - stereo depth encoder; MPEG - MPEG-2 video encoder; RTSL - graphics rendering pipeline; QRD - complex matrix decomposition into upper-triangular and orthogonal factors; FEM - finite element method; MOLE - n-body molecular dynamics.
61 A cycle-accurate Imagine simulator is used for performance evaluation: 1 GHz, 8 processing elements.
62 Memory system parameters: AG width 4; 4 MCs; 16 CBEs (16 per AG for the channel-split configuration); MAS policy: opcol.
63 DRAM parameters: burst length 4 words; peak BW 4 GW/s; 8 internal banks; timing parameters from XDR; 3 commands needed to access an inactive row.
64 Memory system performance for representative configurations on the six apps. [Bar charts: throughput (GW/s) for DEPTH, MPEG, RTSL, QRD, MOLE, FEM under the configurations (# of AGs, burst length, MAS): 2ag/burst32/inorder, 2ag/burst4/inorder, 2ag/burst4/opcol, 1ag/burst4/opcol, 2agcs/burst4/opcol; overheads broken down into read-write switches and row commands.]
65 The key memory-system-related characteristics of the six applications. [Table: per application, average strided-stream record size (W), stream length (W), and stride/record; average indexed-stream record size (W), stream length (W), and index range; fraction of strided accesses and of read accesses - DEPTH 63.0% reads, MPEG 70.2% reads.] DEPTH and MPEG have small record sizes, strides, and index ranges.
66 [Same performance charts as slide 64.] Small record size, stride, and index range mean high spatial locality among the requests generated from an access thread.
67 [Characteristics table extended to RTSL (83.5% reads) and MOLE (99.5% reads).] Streams with small record sizes and large index ranges lack spatial locality among the generated requests.
68 [Same performance charts.] Long bursts hurt memory system performance.
69 [Characteristics table extended to QRD (no indexed streams, 100% strided, 69.0% reads) and FEM (74.0% reads).] A large record size means high spatial locality in the generated requests.
70 [Same performance charts.] QRD and FEM perform similarly to DEPTH and MPEG.
71-74 Performance sensitivity to the size of the MC buffers. [Bar charts: throughput (GW/s) for DEPTH, MPEG, RTSL, MOLE, FEM, and QRD under the 1ag, 2agcs, 2ag, 4agcs, and 4ag configurations, with ncb = 4, 16, 32, 64, and 256 entries.] Sensitivity to buffer size rises as the number of AGs is increased.
75 [Bar charts: throughput (GW/s) for DEPTH and FEM under 1ag, 2agcs, and 2ag with the inorder, inorderla, firstready, oprow, opcol, clrow, and clcol policies.] Reordering row and column commands is important, but the specific MAS policy is not.
76 Emerging Memory Technology. New non-volatile storage devices: fine-grained access like DRAM, non-volatile like FLASH; no refresh (or almost no refresh); density equal to or better than DRAM; potentially more scalable than DRAM; but higher latencies (especially for writes), higher write energy, and possible endurance issues; very similar array structure and interface design to DRAM. New interface options: 3D integration - memory on top of a processor, memory cubes; optical interconnect.
77 Conclusion. DRAM trends: data BW increases rapidly while latency and command BW improve slowly, and DRAM access granularity grows; throughput is very sensitive to access patterns, so locality must be exploited to minimize internal bank conflicts and read-write turnaround penalties. Memory system design space: the number of AGs trades inter-thread against intra-thread parallelism; load balance across MCs depends on channel interleaving, the number of threads, and AG width; the amount of MC buffering determines the window size of MAS. Design suggestions: a single wide AG exploits locality well; the channel-split mechanism exploits locality and balances loads across multiple channels simultaneously, at the cost of additional hardware.
More informationSpring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand
Main Memory & DRAM Nima Honarmand Main Memory Big Picture 1) Last-level cache sends its memory requests to a Memory Controller Over a system bus of other types of interconnect 2) Memory controller translates
More information15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 19: Main Memory Prof. Onur Mutlu Carnegie Mellon University Last Time Multi-core issues in caching OS-based cache partitioning (using page coloring) Handling
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More informationDRAM Main Memory. Dual Inline Memory Module (DIMM)
DRAM Main Memory Dual Inline Memory Module (DIMM) Memory Technology Main memory serves as input and output to I/O interfaces and the processor. DRAMs for main memory, SRAM for caches Metrics: Latency,
More informationCS311 Lecture 21: SRAM/DRAM/FLASH
S 14 L21-1 2014 CS311 Lecture 21: SRAM/DRAM/FLASH DARM part based on ISCA 2002 tutorial DRAM: Architectures, Interfaces, and Systems by Bruce Jacob and David Wang Jangwoo Kim (POSTECH) Thomas Wenisch (University
More informationMainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation
Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer
More informationCSE 599 I Accelerated Computing - Programming GPUS. Memory performance
CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth
More informationCOSC 6385 Computer Architecture - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity
More informationLet Software Decide: Matching Application Diversity with One- Size-Fits-All Memory
Let Software Decide: Matching Application Diversity with One- Size-Fits-All Memory Mattan Erez The University of Teas at Austin 2010 Workshop on Architecting Memory Systems March 1, 2010 iggest Problems
More informationM2 Outline. Memory Hierarchy Cache Blocking Cache Aware Programming SRAM, DRAM Virtual Memory Virtual Machines Non-volatile Memory, Persistent NVM
M2 Memory Systems M2 Outline Memory Hierarchy Cache Blocking Cache Aware Programming SRAM, DRAM Virtual Memory Virtual Machines Non-volatile Memory, Persistent NVM Memory Technology Memory MemoryArrays
More informationTopic 21: Memory Technology
Topic 21: Memory Technology COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 Old Stuff Revisited Mercury Delay Line Memory Maurice Wilkes, in 1947,
More informationTopic 21: Memory Technology
Topic 21: Memory Technology COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 Old Stuff Revisited Mercury Delay Line Memory Maurice Wilkes, in 1947,
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory / DRAM SRAM = Static RAM SRAM vs. DRAM As long as power is present, data is retained DRAM = Dynamic RAM If you don t do anything, you lose the data SRAM: 6T per bit
More informationIntroduction to memory system :from device to system
Introduction to memory system :from device to system Jianhui Yue Electrical and Computer Engineering University of Maine The Position of DRAM in the Computer 2 The Complexity of Memory 3 Question Assume
More information18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013
18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory / DRAM SRAM = Static RAM SRAM vs. DRAM As long as power is present, data is retained DRAM = Dynamic RAM If you don t do anything, you lose the data SRAM: 6T per bit
More informationEE382N (20): Computer Architecture - Parallelism and Locality Lecture 10 Parallelism in Software I
EE382 (20): Computer Architecture - Parallelism and Locality Lecture 10 Parallelism in Software I Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality (c) Rodric Rabbah, Mattan
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 5: Zeshan Chishti DRAM Basics DRAM Evolution SDRAM-based Memory Systems Electrical and Computer Engineering Dept. Maseeh College of Engineering and Computer Science
More informationCS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 14 Parallelism in Software I
EE382 (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 14 Parallelism in Software I Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality, Spring 2015
More informationELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II
ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationMemory Hierarchy and Caches
Memory Hierarchy and Caches COE 301 / ICS 233 Computer Organization Dr. Muhamed Mudawar College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals Presentation Outline
More informationChapter 5B. Large and Fast: Exploiting Memory Hierarchy
Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,
More informationCS152 Computer Architecture and Engineering Lecture 16: Memory System
CS152 Computer Architecture and Engineering Lecture 16: System March 15, 1995 Dave Patterson (patterson@cs) and Shing Kong (shing.kong@eng.sun.com) Slides available on http://http.cs.berkeley.edu/~patterson
More informationMemory Access Scheduling
Memory Access Scheduling ECE 5900 Computer Engineering Seminar Ying Xu Mar 4, 2005 Instructor: Dr. Chigan 1 ECE 5900 spring 05 1 Outline Introduction Modern DRAM architecture Memory access scheduling Structure
More informationLecture-14 (Memory Hierarchy) CS422-Spring
Lecture-14 (Memory Hierarchy) CS422-Spring 2018 Biswa@CSE-IITK The Ideal World Instruction Supply Pipeline (Instruction execution) Data Supply - Zero-cycle latency - Infinite capacity - Zero cost - Perfect
More information2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]
EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Review: Major Components of a Computer Processor Devices Control Memory Input Datapath Output Secondary Memory (Disk) Main Memory Cache Performance
More informationComputer Systems Laboratory Sungkyunkwan University
DRAMs Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Main Memory & Caches Use DRAMs for main memory Fixed width (e.g., 1 word) Connected by fixed-width
More informationEE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III)
EE382 (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 18 GPUs (III) Mattan Erez The University of Texas at Austin EE382: Principles of Computer Architecture, Fall 2011 -- Lecture
More informationBasics DRAM ORGANIZATION. Storage element (capacitor) Data In/Out Buffers. Word Line. Bit Line. Switching element HIGH-SPEED MEMORY SYSTEMS
Basics DRAM ORGANIZATION DRAM Word Line Bit Line Storage element (capacitor) In/Out Buffers Decoder Sense Amps... Bit Lines... Switching element Decoder... Word Lines... Memory Array Page 1 Basics BUS
More informationEECS150 - Digital Design Lecture 17 Memory 2
EECS150 - Digital Design Lecture 17 Memory 2 October 22, 2002 John Wawrzynek Fall 2002 EECS150 Lec17-mem2 Page 1 SDRAM Recap General Characteristics Optimized for high density and therefore low cost/bit
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 6: DDR, DDR2 and DDR-3 SDRAM Memory Modules Zeshan Chishti Electrical and Computer Engineering Dept. Maseeh College of Engineering and Computer Science Source: Lecture
More informationThe Memory Hierarchy 1
The Memory Hierarchy 1 What is a cache? 2 What problem do caches solve? 3 Memory CPU Abstraction: Big array of bytes Memory memory 4 Performance vs 1980 Processor vs Memory Performance Memory is very slow
More information2. Link and Memory Architectures and Technologies
2. Link and Memory Architectures and Technologies 2.1 Links, Thruput/Buffering, Multi-Access Ovrhds 2.2 Memories: On-chip / Off-chip SRAM, DRAM 2.A Appendix: Elastic Buffers for Cross-Clock Commun. Manolis
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to
More informationMemory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic
More informationMemory. Lecture 22 CS301
Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch
More informationChapter 8 Memory Basics
Logic and Computer Design Fundamentals Chapter 8 Memory Basics Charles Kime & Thomas Kaminski 2008 Pearson Education, Inc. (Hyperlinks are active in View Show mode) Overview Memory definitions Random Access
More informationLecture 5: Scheduling and Reliability. Topics: scheduling policies, handling DRAM errors
Lecture 5: Scheduling and Reliability Topics: scheduling policies, handling DRAM errors 1 PAR-BS Mutlu and Moscibroda, ISCA 08 A batch of requests (per bank) is formed: each thread can only contribute
More informationCaches. Hiding Memory Access Times
Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationLecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, intro to multi-thread programming models
Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, intro to multi-thread programming models 1 Refresh Every DRAM cell must be refreshed within a 64 ms window A row read/write automatically
More informationCache Refill/Access Decoupling for Vector Machines
Cache Refill/Access Decoupling for Vector Machines Christopher Batten, Ronny Krashinsky, Steve Gerding, Krste Asanović Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationComputer System Components
Computer System Components CPU Core 1 GHz - 3.2 GHz 4-way Superscaler RISC or RISC-core (x86): Deep Instruction Pipelines Dynamic scheduling Multiple FP, integer FUs Dynamic branch prediction Hardware
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address
More informationLecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)
Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache 2 Virtually Indexed Caches 24-bit virtual address, 4KB page size 12 bits offset and 12 bits
More informationVirtualized ECC: Flexible Reliability in Memory Systems
Virtualized ECC: Flexible Reliability in Memory Systems Doe Hyun Yoon Advisor: Mattan Erez Electrical and Computer Engineering The University of Texas at Austin Motivation Reliability concerns are growing
More informationThe University of Adelaide, School of Computer Science 13 September 2018
Computer Architecture A Quantitative Approach, Sixth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationCaches. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University
Caches Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns
More informationComputer Architecture Lecture 24: Memory Scheduling
18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationProcessor and DRAM Integration by TSV- Based 3-D Stacking for Power-Aware SOCs
Processor and DRAM Integration by TSV- Based 3-D Stacking for Power-Aware SOCs Shin-Shiun Chen, Chun-Kai Hsu, Hsiu-Chuan Shih, and Cheng-Wen Wu Department of Electrical Engineering National Tsing Hua University
More informationThe Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)
The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache
More informationMark Redekopp, All rights reserved. EE 352 Unit 10. Memory System Overview SRAM vs. DRAM DMA & Endian-ness
EE 352 Unit 10 Memory System Overview SRAM vs. DRAM DMA & Endian-ness The Memory Wall Problem: The Memory Wall Processor speeds have been increasing much faster than memory access speeds (Memory technology
More informationECE7995 (4) Basics of Memory Hierarchy. [Adapted from Mary Jane Irwin s slides (PSU)]
ECE7995 (4) Basics of Memory Hierarchy [Adapted from Mary Jane Irwin s slides (PSU)] Major Components of a Computer Processor Devices Control Memory Input Datapath Output Performance Processor-Memory Performance
More informationCENG3420 Lecture 08: Memory Organization
CENG3420 Lecture 08: Memory Organization Bei Yu byu@cse.cuhk.edu.hk (Latest update: February 22, 2018) Spring 2018 1 / 48 Overview Introduction Random Access Memory (RAM) Interleaving Secondary Memory
More informationEE482C, L1, Apr 4, 2002 Copyright (C) by William J. Dally, All Rights Reserved. Today s Class Meeting. EE482S Lecture 1 Stream Processor Architecture
1 Today s Class Meeting EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J Dally Computer Systems Laboratory Stanford University billd@cslstanfordedu What is EE482C? Material covered
More informationEE482S Lecture 1 Stream Processor Architecture
EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu 1 Today s Class Meeting What is EE482C? Material covered
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 16
More informationMemory systems. Memory technology. Memory technology Memory hierarchy Virtual memory
Memory systems Memory technology Memory hierarchy Virtual memory Memory technology DRAM Dynamic Random Access Memory bits are represented by an electric charge in a small capacitor charge leaks away, need
More informationCpE 442. Memory System
CpE 442 Memory System CPE 442 memory.1 Outline of Today s Lecture Recap and Introduction (5 minutes) Memory System: the BIG Picture? (15 minutes) Memory Technology: SRAM and Register File (25 minutes)
More informationECE 551 System on Chip Design
ECE 551 System on Chip Design Introducing Bus Communications Garrett S. Rose Fall 2018 Emerging Applications Requirements Data Flow vs. Processing µp µp Mem Bus DRAMC Core 2 Core N Main Bus µp Core 1 SoCs
More informationLecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)
Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache Is the cache indexed with virtual or physical address? To index with a physical address, we
More informationChapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.
Chapter 2: Memory Hierarchy Design (Part 3) Introduction Caches Main Memory (Section 2.2) Virtual Memory (Section 2.4, Appendix B.4, B.5) Memory Technologies Dynamic Random Access Memory (DRAM) Optimized
More informationThe Memory Hierarchy & Cache
Removing The Ideal Memory Assumption: The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationAddressing the Memory Wall
Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the
More information2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today.
ESE534: Computer Organization Previously Day 7: February 6, 2012 Memories Arithmetic: addition, subtraction Reuse: pipelining bit-serial (vectorization) Area/Time Tradeoffs Latency and Throughput 1 2 Today
More informationLecture 11 Cache. Peng Liu.
Lecture 11 Cache Peng Liu liupeng@zju.edu.cn 1 Associative Cache Example 2 Associative Cache Example 3 Associativity Example Compare 4-block caches Direct mapped, 2-way set associative, fully associative
More informationCS 152 Computer Architecture and Engineering. Lecture 6 - Memory
CS 152 Computer Architecture and Engineering Lecture 6 - Memory Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste! http://inst.eecs.berkeley.edu/~cs152!
More informationCOSC 6385 Computer Architecture. - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More information