CISC 360 Fa09
The Memory Hierarchy
Nov 24, 2009
class12.ppt

Random-Access Memory (RAM)

SRAM vs DRAM Summary

        Tran.     Access
        per bit   time     Persist?   Sensitive?   Cost    Applications
SRAM    6         1X       Yes        No           100X    Cache memories
DRAM    1         10X      No         Yes          1X      Main memories, frame buffers

Conventional DRAM Organization

[Figure: a 16 x 8 DRAM chip organized as 4 rows x 4 columns of supercells (rows 0-3, cols 0-3), with supercell (2,1) highlighted. The chip connects to the memory controller (to CPU) through 2 address bits and 8 data bits and contains an internal row buffer.]
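The chip organization above can be sketched in C as a small model. This is only an illustration under stated assumptions: the struct and function names are hypothetical, and the two-step access (select a row into the row buffer, then select a column from it) is modeled as plain function calls rather than bus signals.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ROWS = 4, COLS = 4 };    /* 16 supercells of 8 bits each */

/* Hypothetical model of the 16 x 8 chip; names are illustrative only. */
struct dram_chip {
    uint8_t cells[ROWS][COLS];  /* the supercell array */
    uint8_t row_buffer[COLS];   /* internal row buffer */
};

/* Step 1: a 2-bit row address copies a whole row into the row buffer. */
void dram_select_row(struct dram_chip *chip, unsigned row)
{
    assert(row < ROWS);
    memcpy(chip->row_buffer, chip->cells[row], COLS);
}

/* Step 2: a 2-bit column address picks one supercell out of the buffer. */
uint8_t dram_select_col(const struct dram_chip *chip, unsigned col)
{
    assert(col < COLS);
    return chip->row_buffer[col];
}

/* Convenience wrapper: read supercell (row, col) using both steps. */
uint8_t dram_read(struct dram_chip *chip, unsigned row, unsigned col)
{
    dram_select_row(chip, row);
    return dram_select_col(chip, col);
}
```

Calling dram_read(&chip, 2, 1) corresponds to reading supercell (2,1). A second supercell in the same row can then be fetched with dram_select_col alone, because row 2 is already sitting in the row buffer.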
Reading DRAM Supercell (2,1)

[Figure, step 1: RAS = 2 is sent on the 2-bit address lines of the 16 x 8 DRAM chip; row 2 is copied into the internal row buffer.]

[Figure, step 2: CAS = 1 is sent on the address lines; supercell (2,1) is copied from the internal row buffer onto the 8-bit data lines and sent to the CPU.]

Memory Modules

[Figure: a 64 MB memory module consisting of eight 8M x 8 DRAMs. The memory controller sends addr (row = i, col = j) to DRAM 0 through DRAM 7, and each returns its supercell (i,j). DRAM 0 supplies bits 0-7, and the remaining chips supply bits 8-15, 16-23, 24-31, 32-39, 40-47, 48-55, up to bits 56-63 from DRAM 7. Together they form the 64-bit doubleword at main memory address A.]

Enhanced DRAMs
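For the memory module slide, the way eight 8-bit supercells combine into one 64-bit doubleword can be sketched as follows. The function name is hypothetical; the only thing taken from the slide is the bit layout (DRAM 0 supplies bits 0-7, DRAM 7 supplies bits 56-63).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHIPS = 8 };   /* eight 8M x 8 DRAMs per module */

/*
 * bytes[k] is the supercell (i,j) returned by DRAM k.
 * DRAM k contributes bits 8k through 8k+7 of the doubleword.
 */
uint64_t assemble_doubleword(const uint8_t bytes[CHIPS])
{
    assert(bytes != NULL);
    uint64_t word = 0;
    for (int k = 0; k < CHIPS; k++)
        word |= (uint64_t)bytes[k] << (8 * k);
    return word;
}
```

With the bytes 0x01 through 0x08 coming from DRAM 0 through DRAM 7, the assembled doubleword is 0x0807060504030201.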
Nonvolatile Memories

Typical Bus Structure Connecting CPU and Memory

[Figure: the CPU chip connects over the system bus to the I/O bridge, which connects over the memory bus to main memory.]

Memory Read Transaction (1)

Load operation: movl A, %eax

[Figure: I/O bridge and main memory; word x is stored in main memory.]

Memory Read Transaction (2)

Load operation: movl A, %eax

[Figure: main memory places word x on the bus toward the I/O bridge.]
Memory Read Transaction (3)

Load operation: movl A, %eax

[Figure: word x has reached the CPU; a copy of x remains in main memory.]

Memory Write Transaction (1)

Store operation: movl %eax, A

[Figure: word y is in the CPU; I/O bridge and main memory are shown.]

Memory Write Transaction (2)

Store operation: movl %eax, A

[Figure: the CPU places word y on the bus toward main memory.]

Memory Write Transaction (3)

Store operation: movl %eax, A

[Figure: main memory has stored word y.]
Disk Geometry

[Figure: one surface divided into tracks; track k is divided into sectors separated by gaps.]

Disk Geometry (Multiple-Platter View)

[Figure: platters 0, 1, and 2 provide surfaces 0 through 5; cylinder k is the set of tracks k on all surfaces.]

Disk Capacity

Computing Disk Capacity
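The capacity computation follows directly from the geometry: multiply the bytes per sector by the number of sectors, tracks, surfaces, and platters. The sketch below is a minimal illustration of that product. The struct, the function name, and the sample figures in the usage note are assumptions chosen for the example, not values recovered from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a disk's geometry. */
struct disk_geometry {
    uint64_t bytes_per_sector;
    uint64_t sectors_per_track;     /* average over all tracks */
    uint64_t tracks_per_surface;
    uint64_t surfaces_per_platter;
    uint64_t platters;
};

/* Capacity in bytes is the product of the five geometry parameters. */
uint64_t disk_capacity_bytes(const struct disk_geometry *g)
{
    assert(g != NULL);
    return g->bytes_per_sector * g->sectors_per_track *
           g->tracks_per_surface * g->surfaces_per_platter * g->platters;
}
```

For an assumed disk with 512 bytes per sector, 300 sectors per track, 20,000 tracks per surface, 2 surfaces per platter, and 5 platters, the function returns 30,720,000,000 bytes (30.72 GB, counting 10^9 bytes per GB).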
Disk Operation (Single-Platter View)

The disk surface spins at a fixed rotational rate.
The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.
By moving radially, the arm can position the read/write head over any track.

Disk Operation (Multi-Platter View)

The read/write heads move in unison from cylinder to cylinder.

[Figure: arm carrying one read/write head per surface.]

Disk Access Time

Disk Access Time Example
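Because the arm must first move to the right track and the surface must then rotate the target sector under the head, average access time is commonly modeled as average seek time plus average rotational latency plus transfer time for one sector. The sketch below works through that standard model. The struct, the function names, and the sample numbers in the usage note are assumptions for illustration, not figures recovered from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical timing parameters of a disk. */
struct disk_timing {
    double rpm;                /* rotational rate, revolutions per minute */
    double avg_seek_ms;        /* average seek time in milliseconds */
    double sectors_per_track;  /* average number of sectors per track */
};

/* Average rotational latency: half of one full rotation. */
double avg_rotation_ms(const struct disk_timing *d)
{
    assert(d != NULL && d->rpm > 0.0);
    return 0.5 * (60000.0 / d->rpm);
}

/* Time for one sector to pass under the head. */
double avg_transfer_ms(const struct disk_timing *d)
{
    assert(d != NULL && d->rpm > 0.0 && d->sectors_per_track > 0.0);
    return (60000.0 / d->rpm) / d->sectors_per_track;
}

/* Total: seek + rotational latency + transfer. */
double avg_access_ms(const struct disk_timing *d)
{
    return d->avg_seek_ms + avg_rotation_ms(d) + avg_transfer_ms(d);
}
```

For an assumed 7,200 RPM disk with a 9 ms average seek and 400 sectors per track, rotational latency is about 4.17 ms, transfer time is about 0.02 ms, and total access time is about 13.19 ms. Seek time and rotational latency dominate; the transfer of a single sector is almost free by comparison.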
Logical Disk Blocks

I/O Bus

[Figure: the CPU chip connects over the system bus to the I/O bridge, which connects over the memory bus to main memory. The I/O bus attaches a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]

Reading a Disk Sector (1)

The CPU initiates a disk read by writing a command, logical block number, and destination memory address to a port (address) associated with the disk controller.

Reading a Disk Sector (2)

The disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.
Reading a Disk Sector (3)

When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special "interrupt" pin on the CPU).

Storage Trends

SRAM
metric               1980     1985     1990    1995    2000    2000:1980
$/MB                 19,200   2,900    320     256     100     190
access (ns)          300      150      35      15      2       100

DRAM
metric               1980     1985     1990    1995    2000    2000:1980
$/MB                 8,000    880      100     30      1       8,000
access (ns)          375      200      100     70      60      6
typical size (MB)    0.064    0.256    4       16      64      1,000

Disk
metric               1980     1985     1990    1995    2000    2000:1980
$/MB                 500      100      8       0.30    0.05    10,000
access (ms)          87       75       28      10      8       11
typical size (MB)    1        10       160     1,000   9,000   9,000

(Culled from back issues of Byte and PC Magazine)

CPU Clock Rates

                     1980     1985     1990    1995    2000    2000:1980
processor            8080     286      386     Pent    P-III
clock rate (MHz)     1        6        20      150     750     750
cycle time (ns)      1,000    166      50      6       1.6     750

The CPU-Memory Gap

[Figure: log-scale plot of time in ns (1 to 100,000,000) against year (1980 to 2000), with one curve each for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]
Locality

Locality Example

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

Data
    Reference array elements in succession (stride-1 reference pattern): spatial locality
    Reference sum each iteration: temporal locality
Instructions
    Reference instructions in sequence: spatial locality
    Cycle through loop repeatedly: temporal locality

Locality Example

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        /* inner loop walks along a row */
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Locality Example

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        /* inner loop walks down a column */
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Locality Example

    int sumarray3d(int a[M][N][N])
    {
        int i, j, k, sum = 0;

        /* as indexed, a[k][i][j] stays in bounds only when M == N */
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    sum += a[k][i][j];
        return sum;
    }
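One way to compare the loop orders above is to compute how far apart, in array elements, two consecutive inner-loop references fall. The helpers below are a hypothetical sketch (none of these names come from the slides) that relies only on the fact that C stores arrays in row-major order.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major offset, in elements, of a[i][j] in an array with n columns. */
size_t offset2d(size_t i, size_t j, size_t n)
{
    assert(n > 0);
    return i * n + j;
}

/* Row-major offset, in elements, of a[k][i][j] in an array a[..][n][n]. */
size_t offset3d(size_t k, size_t i, size_t j, size_t n)
{
    assert(n > 0);
    return (k * n + i) * n + j;
}

/* sumarrayrows: the inner loop advances j. */
size_t stride_rows(size_t n)
{
    return offset2d(0, 1, n) - offset2d(0, 0, n);
}

/* sumarraycols: the inner loop advances i. */
size_t stride_cols(size_t n)
{
    return offset2d(1, 0, n) - offset2d(0, 0, n);
}

/* sumarray3d: the inner loop advances k, the leftmost index. */
size_t stride_3d(size_t n)
{
    return offset3d(1, 0, 0, n) - offset3d(0, 0, 0, n);
}
```

For N = 8, stride_rows gives 1, stride_cols gives 8, and stride_3d gives 64. Only the row-wise traversal has the stride-1 pattern that the first example identifies with good spatial locality.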
Memory Hierarchies

An Example Memory Hierarchy

Smaller, faster, and costlier (per byte) storage devices are at the top; larger, slower, and cheaper (per byte) storage devices are at the bottom.

L0: registers
    CPU registers hold words retrieved from L1 cache.
L1: on-chip L1 cache (SRAM)
    L1 cache holds cache lines retrieved from the L2 cache.
L2: off-chip L2 cache (SRAM)
    L2 cache holds cache lines retrieved from main memory.
L3: main memory (DRAM)
    Main memory holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks)
    Local disks hold files retrieved from disks on remote network servers.
L5: remote secondary storage (distributed file systems, Web servers)

Caches

Caching in a Memory Hierarchy

The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1.
Data is copied between levels in block-sized transfer units.
The larger, slower, cheaper storage device at level k+1 is partitioned into blocks.

[Figure: level k+1 holds blocks 0 through 15. Level k holds a subset of them (block labels 4, 8, 9, 10, 14, and 3 appear in the figure), and blocks 10 and 4 are shown being copied between the levels.]
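The two-level arrangement can be sketched as a tiny simulation in which level k has four slots and level k+1 has blocks 0 through 15. A request for a block already held at level k is a hit; any other request is a miss, and the block is then copied up from level k+1. The placement rule used here (block b may only live in slot b mod 4) is an assumption made for the sketch, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SLOTS = 4, EMPTY = -1 };

/* Hypothetical level-k cache: one block number per slot, plus counters. */
struct cache {
    int block[SLOTS];
    unsigned hits;
    unsigned misses;
};

/* Start with every slot empty and both counters at zero. */
void cache_init(struct cache *c)
{
    assert(c != NULL);
    for (int s = 0; s < SLOTS; s++)
        c->block[s] = EMPTY;
    c->hits = 0;
    c->misses = 0;
}

/*
 * Request block b. Returns true on a hit. On a miss the block is
 * fetched from level k+1 and replaces whatever occupied its slot.
 * Assumed placement policy: block b lives in slot b mod SLOTS.
 */
bool cache_request(struct cache *c, int b)
{
    assert(c != NULL && b >= 0);
    int slot = b % SLOTS;
    if (c->block[slot] == b) {
        c->hits++;
        return true;
    }
    c->misses++;
    c->block[slot] = b;
    return false;
}
```

After loading blocks 4, 9, 14, and 3 (four cold misses), a request for block 14 hits. A request for block 12 misses and evicts block 4, because both map to slot 0 under the assumed policy, so a later request for block 4 misses again.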
General Caching Concepts

[Figure: level k has slots 0 through 3 holding blocks 4*, 9, 14, and 3; level k+1 holds blocks 0 through 15. "Request 14" is satisfied from level k. "Request 12" is not: block 12 is copied up from level k+1 and takes the place of block 4*.]

Examples of Caching in the Hierarchy

Cache Type             What Cached            Where Cached          Latency (cycles)   Managed By
Registers              4-byte word            CPU registers         0                  Compiler
TLB                    Address translations   On-Chip TLB           0                  Hardware
L1 cache               32-byte block          On-Chip L1            1                  Hardware
L2 cache               32-byte block          Off-Chip L2           10                 Hardware
Virtual Memory         4-KB page              Main memory           100                Hardware+OS
Buffer cache           Parts of files         Main memory           100                OS
Network buffer cache   Parts of files         Local disk            10,000,000         AFS/NFS client
Browser cache          Web pages              Local disk            10,000,000         Web browser
Web cache              Web pages              Remote server disks   1,000,000,000      Web proxy server