Managing Memory for Timing Predictability. Rodolfo Pellizzoni

Size: px

Start display at page:

Download "Managing Memory for Timing Predictability. Rodolfo Pellizzoni"

Doris Simpson
6 years ago
Views:

1 Managing Memory for Timing Predictability Rodolfo Pellizzoni

Yogen Krish Heechul Yun* Renato Mancuso, Marco Caccamo,

2 Thanks This work would not have been possible without the following students and collaborators Zheng Pei Wu*, Yogen Krish Heechul Yun* Renato Mancuso, Marco Caccamo, Lui Sha * additional thanks for contributing slides! 2 / 67

3 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 3 / 67

4 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 4 / 67

5 Multicore Server Desktop Mobile RT/Embedded 5 / 67

6 Challenges: Shared Memory Hierarchy T1 T2 T 1 T 2 T 3 T 4 T 5 T 6 T 7 T 8 CPU Core 1 Core 2 Core 3 Core 4 Memory Hierarchy Memory Hierarchy Unicore Multicore Performance Impact 6 / 67

7 DO-178C Certification Certification Authorities Software Team (CAST) Position Paper 32, May _software/cast/cast_papers/media/cast-32.pdf All effects due to shared resources in Multi-core Processors must be: Identified Analyzed Mitigated 7 / 67

8 CAST Identifies: DO-178C Certification Shared Cache (L3) Main Memory (DDR2/3) Interconnection 8 but other resources might create interference (more on this later). 8 / 67

9 How Bad Can it Be? 7,0 Slowdown ratio 6,0 5,0 4,0 3,0 2,0 1,0 0,0 Setup: Core0: X-axis, Core1-3: 470.lbm x 3 (interference) Slowdown ratio = Solo IPC / CorunIPC (*) Measured on Intel Xeon 3530 (4 cores), 8GB 1ch DDR3 DRAM 9 9 / 67

10 This Talk Core1 Core2 Core3 Core4 Memory Controller DRAM Focus on main memory implemented as Double Date Rate Dynamic Synchronous RAM (DDR SDRAM DRAM for short). We show how to mitigate and bound the effects of memory contention. 10 / 67

11 Single Core Equivalence (SCE) End goal: Single Core Equivalent (SCE) execution [6] Prove that each core in a M-core platform behaves as a (possibly slower) single-processor system Certify ARINC-653 partition on the (slower) singleprocessor system Hardware Selection Resource Partitioning Software Partition to Core Allocat. Schedulability Analysis I/O Channel Partitioning 11 / 67

12 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 12 / 67

13 DRAM Background Storage Array contains Data Can only Read/Write to Row Buffer 13 / 67

14 DRAM Background Front End generates the needed commands, Back End issues commands on command bus 14 / 67

15 DRAM Background READ Targeting Data in this Row Row Buffer contain data from a different row 15 / 67

16 DRAM Background READ P, A, R 16 / 67

17 DRAM Background P, A, R PRE ACT ACT: Load the data from array Pre-Charge: into buffer store the data back into array P A Pre-charge command issued on command bus Timing Constraint 17 / 67

18 DRAM Background P, A, R CAS (R/W): reads/writes using buffer READ P A R Data 18 / 67

19 DRAM Background READ R Targeting Data Already in Row Buffer Only Need Read Command Can be issued immediately P A R Data R Data 19 / 67

20 DRAM Background READ Latency of a close request is much longer than the R latency of an open request Latency of a close request Latency of a open request P A R Data R Data 20 / 67

21 Bank Parallelism Bank 1 Bank 2 Bank 3 Bank 4 Accessing data in multiple banks A R Data Multiple data can be pipelined A R Data A R Data A R Data 21 / 67

22 Bank Parallelism Requests should be spread over multiple banks Bank 1 Bank 2 Bank 3 Bank 4 Accessing data in same bank, different rows A R Data P A R Data No pipelining much larger delay 22 / 67

23 Write-to-Read Penalty Transactions of the same type can be pipelined R Data W Data Huge Latency Penalty! R Data W Data but a read after a write cannot. W Data Write-to-Read (WtR) timing constraint R Data 23 / 67

24 Summary To extract maximum bandwidth, we need: Successive requests to the same bank should hit the same row (spatial locality) Successive requests to the same bank should be of the same type (reads or writes) Concurrent requests should hit different banks Problem: the way banks are accessed depends on: The memory mapping (how are physical memory locations mapped to banks?) The memory controller schedule 24 / 67

25 Memory Arbitration COTS memory controllers use First-Ready First-Come- First-Serve (FR-FCFS) arbitration. Each bank has a separate request queue Requests targeting an open row are prioritized over requests targeting a close row Reads are prioritized over writes FCFS arbitration among banks Maximum throughput but very bad worse case delay Front R1:W% W Data W Data R2:R% t W L t B U S t W T R t W L t B U S R t W T R R3:W% W Data t = 0 t = 0 t W L t B U S t W T R t = 4 t = 16 t = 20 t = / 67

26 Our Solutions COTS Solution: PALLOC + Memguard OS-level solution No hardware modifications Solves bank parallelism, bandwidth allocation Custom Solution: ROC Controller redesign As before + write-read optimization Supports differentiated service (predictable latency vs optimized throughput) 26 / 67

27 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 27 / 67

28 Problem SMP OS Core1 Core2 Core3 Core4 L3 OS does NOTknow DRAM banks OS memory pages are spread all over multiple banks Memory Controller (MC)???? DRAM DIMM Bank 1 Bank 2 Bank 3 Bank 4 Unpredictable memory performance

29 Best-case Core1 Core2 Core3 Core4 L3 Memory Controller (MC) DRAM DIMM Fast Peak = 10.6 GB/s Bank 1 Bank 2 Bank 3 Bank 4 DDR3 1333Mhz Out-of-order processors

30 Best-case Core1 Core2 Core3 Core4 L3 Memory Controller (MC) DRAM DIMM Fast Peak = 10.6 GB/s Bank 1 Bank 2 Bank 3 Bank 4 DDR3 1333Mhz

31 Most-cases Core1 Core2 Core3 Core4 L3 Memory Controller (MC) DRAM DIMM Mess Bank 1 Bank 2 Bank 3 Bank 4 Performance =??

32 Worst-case Core 1 Core 2 Core 3 Core 4 Request order Bank 1 L3 Memory Controller (MC) Bank 2 Bank 3 DRAM DIMM Bank 4 time Row 1 Row 2 Row 3 Row 4 Row 1 Row 2 1 bank & always row misses ~1/10 peak b/w

33 PALLOC SMP OS Core1 Core2 Core3 Core4 CPC Memory Controller (MC) OS is aware of DRAM mapping Each pagecan be allocated to a desired DRAM bank DRAM DIMM Bank 1 Bank 2 Bank 3 Bank 4 Flexible allocation policy

34 PALLOC Core1 Core2 Core3 Core4 Private banking L3 Memory Controller (MC) Allocatepages on certain exclusively assigned banks DRAM DIMM Bank 1 Bank 2 Bank 3 Bank 4 Eliminate Inter-core bank conflicts

35 PALLOC Implementation Modified Linux kernel s buddy allocator DRAM bank-aware page frame allocation at each page fault Example: per-core private banking (PB) # cd /sys/fs/cgroup # mkdir core0 core1 core2 core3 create 4 cgroup partitions # echo 0-3 > core0/palloc.dram_bank assign bank 0 ~ 3 for the core0 partition. # echo 4-7 > core1/palloc.dram_bank # echo 8-11 > core2/palloc.dram_bank # echo > core3/palloc.dram_bank 35

36 Performance Impact on Unicore 1,20 4banks 8banks 16banks Buddy Normalized IPC 1,00 0,80 0,60 0,40 0,20 0,00 #of bank vs. performance on a single core Finding: bank partitioning negatively affect performance due to reduced MLP, but not significant for most benchmarks 36

37 Performance Isolation on 4 Cores 7,00 buddy PB PB+PC Slowdown ratio 6,00 5,00 4,00 3,00 2,00 1,00 0,00 Setup: Core0: X-axis, Core1-3: 470.lbm x 3 (interference) PB: DRAM bank partitioning only; PB+PC: DRAM bank and Cache partitioning Finding: bank (and cache) partitioning improves isolation, but far from ideal 37

38 Bus Bandwidth Management Why PB+PC not ideal? Cores still contend for access to memory data bus Bandwidth allocation can be highly unfair -> unpredictable behavior FR FCFS scheduling gives higher priority to earlier requests An out-of-order core performing multiple concurrent requests can receive an inordinate amount of bandwidth Solution: bandwidth reservation Similar to a real-time server, but guarantees memory bandwidth rather than CPU time 38

39 MemGuard: Managing Memory MemGuard Bandwidth Reclaim Manager Operating System BW BW Regulator 0.9GB/s BW BW Regulator 0.1GB/s Regulator 0.1GB/s Regulator 0.1GB/s PMC PMC PMC PMC Core1 Core2 Core3 Core4 Memory Controller DRAM DIMM Multicore Processor Reservation: minimum memory bandwidth guarantee Reclaiming/sharing: additional memory bandwidth 39

40 Reservation Idea Reserve per-core memory bandwidth via the OS scheduler Use h/w PMC to monitor memory request rate Budget Core activity 2 1 Suspend the RT idle task Guarantee: as long as sum budgets <= guaranteed memory bandwidth, queuing delay in the memory controller is small (more details in [3]) 0 1ms Schedule a RT idle task 2ms computation memory fetch 40

41 Reclaiming/Sharing Idea Redistribute unused bandwidth to demanding cores Divide core into protected and unprotected cores Protected cores have guaranteed bandwidth; do not require additional bandwidth Unprotected cores use budget prediction Unneeded budget is donated to the system Reclaim on-demand basis 41

42 Reclaiming/Sharing 4 units of budgets donated at time 10, 2 at time 20 Global budget reclaimed when core budget is depleted 42

43 Effects of MemGuard T1's exec. Time (ms) Perf. guarantee No deadline violation 60% solo corun solo corun solo corun T1 Core1 L2 Core2 L2 DRAM Original 94% T2 T1 900M/s Core1 L2 DRAM MemGuard (Reserve only) 14% T2 T1 69% T2 200M/s Core2 L2 900M/s Core1 L2 DRAM 200M/s Core2 L2 MemGuard (Reclaim + Share) 43

44 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 44 / 67

45 ROC Key idea: design new architectural optimizations targeted at reducing worst case, not average case latency ROC: Rank-switching, Open-row Controller We discuss: Design Latency Analysis Implementation Results

46 Multi-Rank Device Memory device with up to 4 ranks, each comprising separate bank set

47 Multi-Rank Device Long Write-to-Read constraint replaced by much shorter Rank-to-Rank one W Data RtR R Data

48 Example: Write-Read-Write-Read 1 Rank: Latency 52 cycles Challenge: Design a command scheduling algorithm to minimize worst case latency by forcing alternation among ranks 2Ranks: Latency 35 cycles 4 Ranks: Latency 29 cycles

49 Worst Case Analysis Total # of Requestors and Ranks Memory Device Parameters Worst Case Single Request Latency Analysis Latency for different types of request Open Read Close Read Open Write Close Write Task Under Analysis # of open reads # of close reads # of open writes # of close writes Cumulative Worst Case Execution Time WCET

50 Back End Model Front End adds constant delay focus on Back End Three levels arbitration Rank 1 Command Queues Level 3 Arbitration P/A Arbiter Level 2 Arbitration Controller Back End CAS Arbiter P/A Arbiter CAS Arbiter Level 1 Command Arbitration To Devic e Rank R

51 Back End Model L1: CAS (R/W) commands have higher priority than PRE/ACT commands priority to data bus contention Rank 1 Command Queues Level 3 Arbitration P/A Arbiter Level 2 Arbitration Controller Back End CAS Arbiter P/A Arbiter CAS Arbiter Level 1 Command Arbitration To Devic e Rank R

52 Back End Model L2 alternates among ranks L3 alternates among requestors within a rank Rank 1 Command Queues Level 3 Arbitration P/A Arbiter Level 2 Arbitration Controller Back End CAS Arbiter P/A Arbiter CAS Arbiter Level 1 Command Arbitration To Devic e Rank R

53 Back End Model Let: : number of ranks : number of requestors for rank Then given the alternation in L2/L3, the latency of each command is a function of Isolation property: the latency of a requestor does not depend on the # of requestors or scheduling policy used in other ranks We can dedicate some ranks to hard real-time requestors, and others to soft real-time requestors Optimize hard requestors for latency, soft requestors for bandwidth

54 CAS Rank Switching Rule Key idea: Schedule ranks in Round-Robin order if transfers on the data bus are separated We by prove: at most Rank-to-Rank constraint When this is not possible (due to Write-to-Read constraint), reorder ranks W Rank 2 can read twice while Rank 1 is waiting for WtR 1. Reordering does not increase worstcase latency if R = 4 2. If the system is backlogged, no more Data than RtRbetween data transfers => WtR Data does not degrade memory RtR throughput R R Data R Rank 1 Data RtR Rank 2

55 Results Comparison against Analyzable Memory Controller (AMC) [1] Predictable memory controller with WCET guarantees for hard real-time tasks Fair arbitration (Round Robin) similar to our approach Synthetic Benchmarks Used to show how worst case latency bound varies as parameters are changed CHStone Benchmarks Memory traces are obtained from gem5 simulator and used as inputs to both analysis and simulators Core under analysis is in-order Interfering requestors are out-of-order running lbm

56 Results DDR3-1333H memory device 64 and 32 bits data bus width Simulations Python simulators base controller [5] and AMC [1] ROC Implementation Three stages pipelined implementation in Verilog RTL Synthesizes to Xilinx FPGA at 340Mhz (original soft memory controller: 400Mhz) ASIC implementation could likely be significantly faster Code available at

Results Synthetic Benchmarks, 20% write Synthetic: 8 Requestors-64bits 400 Avg Worst Case Latency (ns) 350 300 250 200 150 100 50 0 0% 20% 40% 60% 80% 100% Row Hit % This improvement

57 Results Synthetic Benchmarks, 20% write Synthetic: 8 Requestors-64bits 400 Avg Worst Case Latency (ns) % 20% 40% 60% 80% 100% Row Hit % This improvement due to ROC-4 private bank has parallelism > 33% lower latency than [5] for all cases..this due to open-row policy..this due to rank-swithing AMC Analysis [5] ROC-2Rank ROC-4Rank 20/24

58 Results CHStone, 16 requestors 64bits data bus ROC-4 has between 5 and 35% lower WCET than [5] T-bars: worst case bound Solid bars: simulation 2.00 AMC Analysis[5] ROC-2 Rank ROC-4 Rank Normalized Execution Time /26 adpcm aes bf gsm jpeg mips motion sha dfadd dfdiv dfmul dfsin

59 Summary Bank Contention Bandwidth Partitioning Open Row Write-to- Read Opt Differentiated Service ROC Other pred. mem. cont. PALLOC MemGuard PALLOC + MemGuard 59 / 67

60 Outline 1. Introduction and Problem Statement 2. DRAM Background 3. COTS Solution 1. Palloc 2. Memguard 4. Custom Solution (ROC) 1. Call for Action and Conclusions 60 / 67

61 Other Sources Of Interference Help is needed in mitigating and analyzing other sources of interference Cache Allocation interference: multiple tasks/cores contending for cache space. Can be solved with cache partitioning/locking. Timing interference: simultaneous accesses by multiple cores. Generally low in modern architectures. Interconnects How does this work? We rarely have detailed info What about cache coherency? 61 / 67

62 Other Sources Of Interference Other sources of interference might not be well understood Example: Miss Status Holding Register (MSHR) in Intel platforms Caches are non-blocking allow multiple concurrent requests by same/different cores Pending requests are enqueued in global MSHR register A core can be stalled due to contention for MSHR access Based on preliminary measures, we believe this can account for a 2x delay factor 62 / 67

63 Hardware Profiles The described OS techniques rely on basic hw support How do we know if a platform can support them? I strongly believe the community should join to define a set of SCE hardware profiles Techniques can then be specified based on required profile Profile Mem-Predict A hardware platform satisfies this profile if it supports: DRAM controller with per-bank queues Virtual memory Per-core performance counters with interrupt functionality counting number of memory accesses This is sufficient for PALLOC + MemGuard 63 / 67

solution approaches Foster the development of a shared process for the certification of multicore avionics Basic timeline agreed upon

64 Academic Involvement in Multi-Core Certification The community needs to come together to: Identify and review the inter-core interference channels Share and review solution approaches Develop metrics, analysis and testing procedures for the quality of solution approaches Foster the development of a shared process for the certification of multicore avionics Basic timeline agreed upon with FAA and IEEE TCRTS leaderships. Academic workshops to be followed by workshops with FAA and industry First workshop at next CPSWeek in April 64 / 67

65 Conclusions DO-178C certification on multi-core: a huge challenge Main memory is a major interference channel Intra-bank interference Inter-bank interference We discussed solution to mitigate the effect of the memory channel COTS solution relies on available hardware support, mitigates interference but at the cost of performance Custom hardware requires controller re-design A lot more remains to be done consider getting involved! 65 / 67

66 Bibliography [1] M. Paolieri, E. Quinones, F. Cazorla, and M. Valero, An Analyzable Memory Controller for Hard Real-Time CMPs, Embedded Systems Letters, IEEE, vol. 1, no. 4, pp , [2] Y. Krish, Z. P. Wu and R. Pellizzoni, ROC: A Rank-switching, Open-row DRAM Controller for Time-predictable Systems, IEEE ECRTS, July 2014 [3] H. Yun, G. Yao, R. Pellizzoni, M. Caccamoand L. Sha, MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms, 19th IEEE RTAS, April 2013 [4] H. Yun, R. Mancuso, Z. P. Wu, R. Pellizzoni, PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms, 20th IEEE RTAS, April 2014 [5] Z. P. Wu, Y. Krish, and R. Pellizzoni, Worst Case Analysis of DRAM Latency in Multi-Requestor Systems, 34 th IEEE RTSS, December 2013 [6] LuiShaet al.: Single Core Equivalent Virtual Machines for Hard Real-Time Computing on Multicore Processors. Technical Report, 66 / 67

67 Questions? 67 / 67

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,