Worst Case Analysis of DRAM Latency in Multi-Requestor Systems. Zheng Pei Wu, Yogen Krish, Rodolfo Pellizzoni



Multi-Requestor Systems: multiple requestors (CPUs, DMA, I/O) access DRAM through a shared interconnect, causing interference. Hard real-time systems must be predictable!

Multi-Requestor Systems: schedulability analysis needs WCET as input, and WCET depends on the hardware platform, in particular the latency to access shared resources (e.g., cache, DRAM). Existing approaches can bound the interference, but they assume the latency of a DRAM access is constant. Problem: DRAM latency is variable and changes depending on the DRAM's state.

Contribution: a timing analysis that bounds the worst-case latency of DRAM accesses for the requestor under analysis. Since we do not know what the interfering requestors are doing, we assume they cause the worst-case interference.

Outline 1. Background & Related Work 2. Memory Controller Model 3. Worst Case Latency Analysis 4. Results & Conclusion

Background: the DRAM storage array contains the data; reads and writes can only target the row buffer.

Background: consider a READ targeting a row while the row buffer contains data from a different row.

Background: to serve the READ, the front end generates the needed commands (P, A, R); the back end issues the commands on the command bus.

Background: pre-charge (P) stores the data in the row buffer back into the array; ACT (A) loads the requested row from the array into the buffer. The pre-charge command is issued on the command bus, and a timing constraint must be satisfied before the next command can follow; the ACT and READ commands are then issued in turn (P, A, READ).

Background: a READ targeting a row already in the row buffer only needs the READ command, which can be issued immediately.

Background: the latency of a close request (P, A, READ) is much longer than the latency of an open request (READ only). The latency of a memory access is variable!
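As a rough illustration of the gap between the two cases, the sketch below computes both latencies from a handful of timing parameters. The parameter values are assumed DDR3-style numbers chosen for illustration, not figures from the talk; real values come from the device datasheet.

```python
# Sketch: open vs. close request latency from DRAM timing parameters.
# All values are in memory clock cycles and are illustrative only.
T_RP  = 9   # pre-charge period (P)
T_RCD = 9   # ACT-to-READ/WRITE delay (A)
CL    = 9   # CAS (read) latency
BURST = 4   # cycles to transfer one data burst

def open_request_latency():
    """Row hit: only the READ command is needed."""
    return CL + BURST

def close_request_latency():
    """Row miss: P, then A, then READ."""
    return T_RP + T_RCD + CL + BURST

print(open_request_latency())   # 13
print(close_request_latency())  # 31
```

Even with these toy numbers the close request takes more than twice as long, which is the variability the analysis must capture.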

Predictable Memory Controllers, Close Row Policy: after each access, the row buffer is automatically pre-charged (an implicit pre-charge, even if the next request targets the same bank), so memory latency is the same for all requests. Drawbacks: it cannot take advantage of locality (row hits), and its latency is much longer than that of an open request.

Predictable Memory Controllers, Interleaved Banks: each request accesses data in multiple banks (Bank 1 to Bank 4), so the data transfers can be pipelined.

Predictable Memory Controllers, Interleaved Banks: since requestors can access all banks, they can close each other's row buffers; thus the close row policy is used to make the latency predictable. The problem of the long latency of the close row policy still exists!

Predictable Memory Controllers, Interleaved Banks: this is good for systems with a small DRAM data bus width (e.g., 16 bits). Larger data buses can transfer the same amount of data without interleaving as many banks.
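The claim above can be made concrete with a small back-of-the-envelope calculation; the 64-byte request size and burst length of 8 transfers are common values assumed here for illustration, not taken from the talk.

```python
# Sketch: how data-bus width affects the number of banks that must be
# interleaved to serve one request. Assumed values, for illustration.
BURST_LENGTH  = 8    # transfers per READ/WRITE command (BL8)
REQUEST_BYTES = 64   # typical cache-line-sized request

def banks_needed(bus_bits: int) -> int:
    """Banks to interleave so one request is served in one pass."""
    bytes_per_command = (bus_bits // 8) * BURST_LENGTH
    return max(1, REQUEST_BYTES // bytes_per_command)

print(banks_needed(16))  # 4: narrow bus, interleave four banks
print(banks_needed(32))  # 2: wider bus, two banks suffice
print(banks_needed(64))  # 1: no interleaving needed at all
```

This is why, as the slide notes, the effectiveness of interleaving diminishes as the data bus widens.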

Predictable Memory Controllers, Interleaved Banks: interleaving two banks for a wider data bus (e.g., 32 bits) wastes time on the data bus. Interleaving problems: 1. requestors can close each other's rows (interference); 2. it must be used with the close row policy to make latency predictable; 3. for a wider data bus, the effectiveness of interleaving is diminished.

Predictable Memory Controllers, Private Banks: banks can be partitioned among requestors or tasks (e.g., Core 1, Core 2, and the DMA each get their own bank). This can be done in hardware if the memory controller supports it, by the compiler, or in the OS using virtual memory.
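A minimal sketch of the OS/virtual-memory option, assuming a hypothetical physical address mapping in which the bank index sits at bits 13-15; real mappings are device- and controller-specific.

```python
# Sketch of OS-level bank partitioning via page coloring. The bit
# positions are an assumed DRAM address mapping, for illustration only.
BANK_SHIFT = 13   # bank bits sit above an 8 KiB column region (assumed)
BANK_MASK  = 0x7  # 8 banks

def bank_of(phys_addr: int) -> int:
    return (phys_addr >> BANK_SHIFT) & BANK_MASK

def frames_for_bank(bank: int, n_frames: int, frame_size: int = 4096):
    """Collect physical frame addresses that all fall in one bank."""
    addr, out = 0, []
    while len(out) < n_frames:
        if bank_of(addr) == bank:
            out.append(addr)
        addr += frame_size
    return out

# Every frame handed to a core maps to that core's private bank.
print(all(bank_of(a) == 2 for a in frames_for_bank(2, 16)))
```

By only giving a core page frames of its own bank color, the OS guarantees that its accesses never touch another requestor's row buffers.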

Related Work. AMC [1] and Predator [2]: close row policy, interleaved banks. Conservative Open-Page [3]: interleaved banks; leaves the row open for a small window of time. PRET DRAM Controller [4]: close row policy, private banks.

Our Approach: private banks eliminate row buffer interference from other requestors, and the open row policy reduces latency by taking advantage of the row hit ratio (locality). Challenges: 1. the analysis is more complex; 2. there are more than 20 timing constraints; 3. the latency depends on the dynamic state of the DRAM.

Outline 1. Background & Related Work 2. Memory Controller Model 3. Worst Case Latency Analysis 4. Results & Conclusion

Memory Controller Model: we focus on the back-end latency and ignore the constant front-end delay. The front end contains per-requestor buffers (Core 1, Core 2, DMA) and the command generator; the back end contains the global FIFO queue, the command bus, and the data bus.

Memory Controller Model: each requestor has a private buffer for its memory commands; a global FIFO is used for arbitration.

Memory Controller Model: the command at the head of each private buffer is inserted into the global FIFO.

Memory Controller Model: the controller scans the global FIFO from front to back for a command that can be issued.

Memory Controller Model: after a command is issued, the requestor's next command must wait until its timing constraints are satisfied before it can be inserted into the FIFO. Intuitively, the arbitration is fair and similar to a round-robin policy.
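The arbitration described above can be sketched as follows. The Command fields and the single ready_at value are simplified placeholders standing in for the paper's full command and timing-constraint model.

```python
# Minimal sketch of the back-end arbitration: each cycle, scan the
# global FIFO from front to back and issue the first command whose
# timing constraints are satisfied.
from collections import deque

class Command:
    def __init__(self, requestor, kind, ready_at):
        self.requestor = requestor  # which requestor inserted it
        self.kind = kind            # 'P', 'A', 'R', or 'W'
        self.ready_at = ready_at    # earliest cycle its constraints allow

def issue_one(fifo, cycle):
    """Issue the first ready command in FIFO order; return it or None."""
    for cmd in fifo:
        if cmd.ready_at <= cycle:
            fifo.remove(cmd)
            return cmd
    return None  # every queued command is still blocked

fifo = deque([Command('core1', 'R', ready_at=5),
              Command('dma',   'W', ready_at=0)])
first = issue_one(fifo, cycle=0)
print(first.requestor)  # dma: core1's READ is blocked, so DMA goes first
```

Note how a blocked command at the front does not stall the queue; a later ready command can overtake it, which is what makes the latency analysis non-trivial.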

Outline 1. Background & Related Work 2. Memory Controller Model 3. Worst Case Latency Analysis 4. Results & Conclusion

Worst Case Analysis. Inputs: the total number of requestors, the memory device parameters, and, for the task under analysis, the number of open reads, close reads, open writes, and close writes. Part 1 (the main contribution, which works for any type of core): worst-case single request latency analysis, yielding the latency of each type of request (open read, close read, open write, close write). Part 2 (only provided for in-order cores): the cumulative worst-case execution time (WCET). Assumption: we do not know the activity of the other interfering requestors, so we assume they produce the worst-case pattern to cause maximum interference.

Single Request Latency: decomposed into two parts. First, from request arrival until the READ/WRITE command is inserted into the global FIFO (arrival to R/W); second, from the READ/WRITE being inserted into the FIFO until the data finishes transmitting (R/W to data).

Single Request Latency: the arrival-to-R/W part may include pre-charge and ACT commands, so its latency depends on the previous request (i.e., on the state of the DRAM); the R/W-to-data part does not depend on the state of the DRAM.

Single Request Latency: both parts depend on the number of interfering requestors as well as on the DRAM timing constraints.

Single Request Latency: for details on the arrival-to-R/W part, refer to the paper; here we focus on the R/W-to-data part.

Read/Write to Data Latency: read-to-read has no timing constraints, only contention on the data bus; the same holds for write-to-write.

Read/Write to Data Latency: therefore, an alternation of read and write commands produces the longest latency, due to the write-to-read and read-to-write timing constraints.
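A toy model of the data bus makes the effect visible. The switching gaps below are assumed illustrative values standing in for the bounds the paper derives from the write-to-read and read-to-write constraints of a real device.

```python
# Sketch: alternating reads and writes cost more data-bus time than
# same-type streams. All cycle counts are illustrative placeholders.
BURST      = 4   # cycles of data transfer per command
W_TO_R_GAP = 8   # extra cycles a READ waits after a WRITE burst
R_TO_W_GAP = 6   # extra cycles a WRITE waits after a READ burst

def data_bus_time(kinds):
    """Total cycles to serve a sequence of 'R'/'W' commands."""
    total, prev = 0, None
    for k in kinds:
        if prev is not None and prev != k:
            total += W_TO_R_GAP if prev == 'W' else R_TO_W_GAP
        total += BURST
        prev = k
    return total

print(data_bus_time(['R'] * 4))             # 16: back-to-back reads
print(data_bus_time(['W', 'R', 'W', 'R']))  # 38: every switch pays a gap
```

Four same-type commands only pay for their bursts, while the alternating pattern pays a switching gap on every command, so the worst-case interfering requestors are assumed to alternate.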

Read/Write to Data Latency, interference on a WRITE command: all other requestors insert READ/WRITE commands ahead of it to create maximum interference.

Read/Write to Data Latency, interference on a WRITE command: a write command could have finished immediately before t0.

Read/Write to Data Latency, interference on a WRITE command: that preceding write therefore further delays the first READ command.

Worst Case Analysis (recap): Part 1, the worst-case single request latency analysis, yields the latency of each type of request (open read, close read, open write, close write); Part 2, only provided for in-order cores, combines these with the task's per-type request counts into a cumulative worst-case execution time (WCET).

Cumulative Latency: the task under analysis issues a sequence of requests over time, each of type open read, close read, open write, or close write.

Cumulative Latency: the worst-case request order depends on input values, code paths, cache state, etc. If the worst-case request order were known, we could simply sum the latency of each request.

Cumulative Latency: static analysis tools can be used to obtain safe bounds on the number of requests of each type.

Cumulative Latency: which pattern of requests leads to the worst-case latency? This problem can be solved in constant time; see the paper for details.
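Ignoring ordering effects, the simplest cumulative bound is a weighted sum of the per-type counts and per-type worst-case latencies. The latency values below are hypothetical placeholders; the paper's constant-time solution additionally finds the worst-case request ordering.

```python
# Sketch: cumulative latency bound from Part 1's per-type latencies and
# static per-type request counts. Latency values are assumed, not real.
LATENCY = {               # worst-case cycles per request type (assumed)
    'open_read':   40,
    'close_read':  75,
    'open_write':  45,
    'close_write': 80,
}

def cumulative_bound(counts):
    """Upper bound on total memory latency for the task under analysis."""
    return sum(LATENCY[t] * n for t, n in counts.items())

bound = cumulative_bound({'open_read': 10, 'close_read': 2,
                          'open_write': 5, 'close_write': 1})
print(bound)  # 855
```

This weighted sum is safe regardless of the order in which the requests actually occur, because each term already assumes its own worst case.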

Outline 1. Background & Related Work 2. Memory Controller Model 3. Worst Case Latency Analysis (Single Request Latency, Cumulative Latency) 4. Results & Conclusion

Results: comparison against the Analyzable Memory Controller (AMC) [1], since it uses fair (round-robin) arbitration similar to our approach. Synthetic benchmarks show how the worst-case latency varies as parameters change. For the CHStone benchmarks, memory traces are obtained from the gem5 simulator and used as input to the worst-case analysis.

Results: synthetic benchmarks.

Results: as memory devices become faster, the difference between open and close accesses grows, so the close row policy is becoming too pessimistic. (50% row hit ratio, 4 requestors, 20% writes; latency in ns.)

Device          | 800D  | 1066F  | 1333H  | 1600K  | 1866L | 2133N | % better
AMC (64 bits)   | 185   | 185.27 | 180.9  | 178    | 169.84 | 163   | 11.89%
Ours (64 bits)  | 125.2 | 112.47 | 104.85 | 102.18 | 96.97  | 92.85 | 25.84%

Results: CHStone benchmarks for a 64-bit bus.

Conclusion: a novel worst-case analysis that takes the dynamic state of the DRAM into account. The open row policy can reduce memory latency as devices become faster. A private bank scheme eliminates row buffer interference from other requestors.

Future Work: discussion of shared data; bus utilization is still poor due to read/write switching, so read/write optimization could reduce the latency bound; handling multiple ranks; implementation in hardware.

References
[1] M. Paolieri, E. Quiñones, F. Cazorla, and M. Valero, "An Analyzable Memory Controller for Hard Real-Time CMPs," IEEE Embedded Systems Letters, vol. 1, no. 4, pp. 86-90, 2009.
[2] B. Akesson, K. Goossens, and M. Ringhofer, "Predator: A Predictable SDRAM Memory Controller," in CODES+ISSS, 2007, pp. 251-256.
[3] S. Goossens, B. Akesson, and K. Goossens, "Conservative Open-Page Policy for Mixed Time-Criticality Memory Controllers," in DATE, 2013.
[4] J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee, "PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation," in CODES+ISSS, 2011, pp. 99-108.