Memory Access Scheduling
ECE 5900 Computer Engineering Seminar, Spring 05
Ying Xu
Mar 4, 2005
Instructor: Dr. Chigan

Outline
- Introduction
- Modern DRAM architecture
- Memory access scheduling
  - Structure of access scheduler
  - Scheduling policies
- Experimental results
  - First-ready scheduling
  - Aggressive reordering
- Conclusions

Introduction
- Memory chip bandwidth has increased dramatically
  - DDR2, SDRAM
- Media processors
  - Streaming memory reference patterns
  - Memory bandwidth is the bottleneck

Introduction (contd)
- Pipelining memory accesses
  - Maximizes memory bandwidth
  - Sequential accesses to different rows of the same bank cannot be pipelined
- Memory access scheduling
  - Reorders memory operations: bank precharge, row activation, column access
  - Memory references may complete out of order

Characteristics of DRAM architecture
- DRAMs are not truly random-access devices
- Three-dimensional memories: bank, row, column
- Three operations: bank precharge, row activation, column access

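The three-dimensional organization can be illustrated by splitting a flat word address into bank, row, and column fields. This is a minimal sketch: the geometry (4 banks, 4096 rows, 256 columns) and the bank-interleaved mapping are assumptions chosen for illustration, not values from the talk.

```python
from typing import NamedTuple

# Hypothetical geometry, chosen only for illustration.
NUM_BANKS = 4
NUM_ROWS = 4096
NUM_COLUMNS = 256


class DramAddress(NamedTuple):
    bank: int
    row: int
    column: int


def decode(word_address: int) -> DramAddress:
    """Split a flat word address into (bank, row, column).

    Assumed mapping: consecutive blocks of NUM_COLUMNS words rotate
    across banks, so a long sequential stream touches every bank.
    """
    column = word_address % NUM_COLUMNS
    bank = (word_address // NUM_COLUMNS) % NUM_BANKS
    row = word_address // (NUM_COLUMNS * NUM_BANKS)
    if word_address < 0 or row >= NUM_ROWS:
        raise ValueError("address outside the modeled DRAM")
    return DramAddress(bank, row, column)
```

With this mapping, addresses 0 and 256 fall in the same row index of different banks, while addresses 0 and 1024 fall in different rows of the same bank, which is the case that cannot be pipelined.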
DRAM organization

Resource constraints of DRAMs
- DRAM resources
  - Internal banks
  - A single set of address lines
  - A single set of data lines
- Each operation places different demands on these resources

Bank state

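The dependence of a reference on bank state can be sketched as a small function that lists the DRAM operations a reference still needs. The function name and the use of None for a precharged bank are assumptions made for this sketch.

```python
from typing import List, Optional


def operations_for(active_row: Optional[int], target_row: int) -> List[str]:
    """Return the operations needed to serve a reference to target_row.

    active_row is the row currently held open by the bank, or None if
    the bank is precharged (no row open).
    """
    if active_row == target_row:
        # Row hit: the data is already in the open row.
        return ["column access"]
    if active_row is None:
        # Bank is precharged: open the row, then access the column.
        return ["row activation", "column access"]
    # Row conflict: close the open row first.
    return ["bank precharge", "row activation", "column access"]
```

A reference that hits the active row needs one operation, while a reference to a different row of the same bank needs all three.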
Memory access scheduling
- The process of ordering DRAM operations, subject to resource constraints
- Simplest policy: serve the oldest pending reference first
  - Inefficient: the DRAM may not be ready for the oldest reference
  - Available resources are left idle
- A more sophisticated scheduling algorithm is needed

Memory access scheduler structure

Memory access scheduling policies

Memory access scheduling algorithm
- A combination of the policies used by the precharge manager, row arbiter, column arbiter, and address arbiter
- The address arbiter decides which of the selected precharge, row, and column operations to perform
- Choices: in-order, priority, precharge-first, row-first, column-first

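The address arbiter's choice among the selected operations can be sketched as follows. Only the three "X-first" choices are modeled. The Candidate type, the age field, and the fallback to the oldest remaining candidate when no operation of the preferred kind is available are assumptions of this sketch, not details given in the talk.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    kind: str  # "precharge", "row", or "column"
    age: int   # cycles the underlying reference has waited (larger = older)


# Which kind of operation each modeled policy prefers.
PREFERRED_KIND = {
    "precharge-first": "precharge",
    "row-first": "row",
    "column-first": "column",
}


def arbitrate(candidates: List[Candidate], policy: str) -> Optional[Candidate]:
    """Pick one operation to place on the shared address lines this cycle."""
    if not candidates:
        return None
    preferred = PREFERRED_KIND[policy]
    matching = [c for c in candidates if c.kind == preferred]
    # Assumed fallback: if nothing of the preferred kind is offered,
    # consider every candidate instead of leaving the address lines idle.
    pool = matching or candidates
    return max(pool, key=lambda c: c.age)
```

For example, given one precharge, one row, and one column candidate, "column-first" selects the column operation even when the precharge belongs to an older reference.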
Experimental setup
- Streaming media processors are preferred
  - Streams lack temporal locality
  - Stream transfer bandwidth drives processor performance
- The Imagine stream processor is simulated
  - Processor frequency: 500 MHz
  - DRAM frequency: 125 MHz
  - Peak system bandwidth: 2 GB/s

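The setup figures imply how much data must move per DRAM cycle to reach peak bandwidth. The short calculation below works this out; it assumes 1 GB = 10**9 bytes, and the variable names are illustrative.

```python
PROCESSOR_HZ = 500_000_000           # 500 MHz processor clock
DRAM_HZ = 125_000_000                # 125 MHz DRAM clock
PEAK_BYTES_PER_S = 2_000_000_000     # 2 GB/s, assuming 1 GB = 10**9 bytes

# Bytes that must be transferred every DRAM cycle to sustain the peak.
bytes_per_dram_cycle = PEAK_BYTES_PER_S // DRAM_HZ

# Processor cycles that elapse during one DRAM cycle.
processor_cycles_per_dram_cycle = PROCESSOR_HZ // DRAM_HZ

print(bytes_per_dram_cycle, processor_cycles_per_dram_cycle)
```

Under these assumptions the memory system must deliver 16 bytes on every DRAM cycle to reach the peak, so every cycle spent on precharge or row activation without overlapping column accesses lowers sustained bandwidth.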
Experimental setup (contd)
- Benchmarks and media processing applications

In-order scheduling
- In-order access scheduler: no access reordering
- A column access is performed only for the oldest pending reference; the same holds for bank precharge and row activation
- Serves as the baseline

First-ready scheduling
- Uses the ordered priority scheme for all units
- Schedules an operation for the oldest pending reference that does not violate resource and timing constraints
- Benefits
  - Accesses targeting other banks can be performed while waiting for a precharge or row activation
  - Parallelism: multiple references in progress

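The difference between in-order and first-ready scheduling can be sketched with a toy cycle-level model. Everything about the model is an assumption made for illustration: the latencies (3 cycles for precharge, 3 for row activation, 1 for column access), the limit of one issued operation per cycle, and the `window` parameter, which is a hypothetical knob where 1 means in-order and a larger value lets the scheduler look past a stalled reference.

```python
# Assumed latencies in DRAM cycles, for illustration only.
PRECHARGE, ACTIVATE, COLUMN = 3, 3, 1


def run(refs, window):
    """Return the cycles needed to serve refs, a list of (bank, row).

    window is how many of the oldest pending references the scheduler
    may consider each cycle (1 = in-order, len(refs) = first-ready).
    """
    pending = list(refs)
    active = {}    # bank -> open row; bank absent means precharged
    ready_at = {}  # bank -> first cycle the bank accepts a new operation
    cycle = 0
    while pending:
        # Scan from oldest to newest and issue the first ready operation.
        for i, (bank, row) in enumerate(pending[:window]):
            if ready_at.get(bank, 0) > cycle:
                continue  # bank is still busy with an earlier operation
            if active.get(bank) == row:
                ready_at[bank] = cycle + COLUMN    # column access
                del pending[i]                     # reference is done
            elif bank in active:
                del active[bank]                   # bank precharge
                ready_at[bank] = cycle + PRECHARGE
            else:
                active[bank] = row                 # row activation
                ready_at[bank] = cycle + ACTIVATE
            break  # shared address lines: one operation per cycle
        cycle += 1
    return max(ready_at.values(), default=0)


refs = [(0, 0), (0, 1), (1, 0), (1, 1)]
in_order = run(refs, window=1)
first_ready = run(refs, window=len(refs))
print(in_order, first_ready)
```

In this toy model the first-ready run finishes sooner because operations on bank 1 overlap with the precharge and activation delays of bank 0. The absolute cycle counts depend entirely on the assumed latencies and say nothing about the measured results.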
Experimental results
- First-ready scheduling: sustained memory bandwidth increased by about 79% on the microbenchmarks

Experimental results
- First-ready scheduling: sustained memory bandwidth increased by about 17% on the applications

Experimental results
- First-ready scheduling: sustained memory bandwidth increased by about 40% on the application traces

Aggressive reordering
- Drawback of first-ready scheduling: it precharges a bank whenever the oldest pending reference targets a row other than the bank's active row, even if multiple references to the active row are still pending
- Aggressive reordering further increases sustained memory bandwidth

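The drawback can be made concrete by counting row activations for a stream of references to a single bank. This is a sketch under simplifying assumptions: one bank, no timing, and a hypothetical flag `prefer_open_row` that serves pending references to the active row before moving on.

```python
def count_activations(rows, prefer_open_row):
    """Count row activations needed to serve references to one bank.

    rows lists the target row of each pending reference, oldest first.
    With prefer_open_row=False the oldest reference is always served
    next; with True, references to the open row are served first.
    """
    pending = list(rows)
    open_row = None
    activations = 0
    while pending:
        if prefer_open_row and open_row in pending:
            # Serve the oldest pending reference that hits the open row.
            pending.remove(open_row)
            continue
        row = pending.pop(0)
        if row != open_row:
            # Row miss: the bank is precharged and the new row activated.
            open_row = row
            activations += 1
    return activations


stream = [0, 1, 0, 1, 0, 1]
print(count_activations(stream, prefer_open_row=False))
print(count_activations(stream, prefer_open_row=True))
```

For the alternating stream above, serving strictly oldest-first activates a row for every reference, while draining the open row first needs only one activation per distinct row.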
Possible reordering scheduling algorithm policies
- Large range of possible memory access schedulers
- Four representative algorithms

Experimental results
- Aggressive reordering improves bandwidth by 106-144% on the microbenchmarks

Experimental results
- Aggressive reordering improves bandwidth by 27-30% on the media processing applications

Experimental results
- Aggressive reordering improves bandwidth by 85-93% on the application traces

Row-first policy vs. column-first policy
- Address arbiter
  - Row-first: always selects a row operation first
  - Column-first: always selects a column operation first
- Little difference across all benchmarks
- Exception: FFT
  - Has less to do with the scheduling algorithm than with the characteristics of the benchmark itself
  - FFT is the most sensitive to stream load latency
  - The col/open policy allows a store stream to delay load streams

Open or closed precharge policy?
- Closed precharge policy: a bank is precharged as soon as there are no pending references to the active row
- Open precharge policy: a bank is precharged only when there are no pending references to the active row and there are pending references to other rows of the same bank
- The difference between the open and closed precharge policies is slight
- Benchmarks with random access patterns prefer the closed precharge policy
  - Little reference locality
  - No benefit in keeping a row open
- FFT prefers the open precharge policy
  - Numerous accesses to each row

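The two precharge policies differ only in when an idle active row is closed. A minimal sketch of that decision for a single bank follows; the function name and argument layout are assumptions.

```python
from typing import Iterable, Optional


def should_precharge(policy: str,
                     active_row: Optional[int],
                     pending_rows: Iterable[int]) -> bool:
    """Decide whether to precharge one bank.

    policy is "open" or "closed"; active_row is the open row (None if
    the bank is already precharged); pending_rows are the rows targeted
    by pending references to this bank.
    """
    pending = list(pending_rows)
    if active_row is None:
        return False  # nothing to close
    if active_row in pending:
        return False  # both policies keep a row that is still needed
    if policy == "closed":
        return True   # close as soon as the active row has no takers
    if policy == "open":
        # Close only when another row of this bank is actually wanted.
        return len(pending) > 0
    raise ValueError(f"unknown policy: {policy}")
```

The policies disagree only when no references to the bank are pending: the closed policy precharges immediately, betting that the next reference targets a different row, while the open policy keeps the row open, betting on reuse.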
Effect of bank buffer size
- Row/closed scheduling algorithm

Conclusions
- Memory access scheduling greatly increases bandwidth utilization
  - Buffers memory references
  - Accesses internal banks in parallel
  - Maximizes the number of column accesses per row access
- First-ready scheduling algorithm
  - 79% bandwidth improvement on microbenchmarks, 40% on application traces
- Aggressive reordering algorithm
  - Up to 144% bandwidth improvement on microbenchmarks, up to 30% on media processing applications, up to 93% on application traces

Conclusions (contd)
- The closed precharge policy is preferred by most benchmarks
  - Banks are precharged as soon as the last column reference to an active row completes
- Little difference in performance between the row-first and column-first policies
- For latency-sensitive applications, scheduling loads ahead of stores is preferred

Paper reference
Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, John D. Owens. "Memory access scheduling." Proceedings of the 27th Annual International Symposium on Computer Architecture, ACM SIGARCH Computer Architecture News, Volume 28, Issue 2, May 2000.

Thank you!