Administrivia. Mini project is graded. 1st place: Justin (75.45), 2nd place: Liia (74.67), 3rd place: Michael (74.49)

1 Administrivia
  Mini project is graded
    1st place: Justin (75.45)
    2nd place: Liia (74.67)
    3rd place: Michael (74.49)

2 Administrivia
  Project proposal due: 2/27
  Original research
  Related to real-time embedded systems/CPS
  Building a cyber-physical system (robot)
  Must include real-time performance evaluation on a selected hardware platform
  Repeating the evaluation of a chosen paper
  Any one of the suggested papers

3 Administrivia
  Additional presentation schedule
    2 papers/day in Week 15 (a week before the final); eliminate individual meetings
    Or 2 papers/day in Weeks 11, 12, 13; keep individual meetings

4 Real-Time DRAM Controller
  Heechul Yun

5 Memory Performance Isolation
  [Figure: Part 1 to Part 4 on Core1 to Core4, four LLC blocks, Memory Controller, DRAM]
  Q. How to guarantee predictable memory performance?

6 How Page Works
  [Timing diagram]
    REQUEST #1 ARRIVES: close the previous page and load new one (PRE, ACT, READ, DATA)
    REQUEST #1 COMPLETES, REQUEST #2 ARRIVES: page is already open, just issue read command (READ, DATA)
    REQUEST #2 COMPLETES
  Latency of Request #1: PRE + ACT + READ + DATA
  Latency of Request #2 (with open page): READ + DATA
  [Table: columns Latency First Access*, Latency Further Accesses, Data Cycles for each core; row Single Core]
  * in clock cycles on a JEDEC-compliant DDR3 module

7 Effects of Contention
  ALL REQUESTS ARRIVE AT THE SAME TIME, TARGETED AT SAME BANK AND RANK
  [Timing diagram: P A R D, P A R D, P A R D]
  [Table: columns Latency First Access*, Latency Further Accesses, Data Cycles for each core; rows Single Core, Multiple Cores same bank/rank; legible entries: 35*N, 35*N, 4]

8 Effects of Contention
  ALL REQUESTS ARRIVE AT THE SAME TIME, TARGETED AT DIFFERENT RANKS
  [Timing diagram: PRE ACT R DATA, PRE ACT R DATA, PRE ACT R DATA]
  [Table: columns Latency First Access*, Latency Further Accesses, Data Cycles used by each access; rows Single Core, N Cores same bank/rank, N Cores different ranks; legible entries: *(N-1), *(N-1), *(N-1), 9 + 4*(N-1), 4]
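
The two contention cases above can be approximated with a small model. This is only a sketch under stated assumptions: the function names are hypothetical, `miss_latency=35` is taken from the legible `35*N` entry for the same-bank case, `data_cycles=4` from the "4 data cycles" entry, and the different-rank formula assumes that commands to different ranks overlap so that only the shared data bus serializes requests. The slides' own table is only partly legible, so these formulas should not be read as the exact values shown in class.

```python
def worst_case_same_bank(n_cores: int, miss_latency: int = 35) -> int:
    """Latency (cycles) of the last of n simultaneous requests to one bank.

    Assumption: requests fully serialize, each paying PRE + ACT + READ + DATA.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return miss_latency * n_cores


def worst_case_different_ranks(
    n_cores: int, miss_latency: int = 35, data_cycles: int = 4
) -> int:
    """Latency (cycles) of the last of n simultaneous requests to distinct ranks.

    Assumption: PRE/ACT/READ of different ranks overlap; only the data
    transfers queue up on the shared data bus.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return miss_latency + data_cycles * (n_cores - 1)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, worst_case_same_bank(n), worst_case_different_ranks(n))
```

Under these assumptions the same-bank latency grows by a full miss per extra core, while the different-rank latency grows only by one data burst per extra core.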

9 Real-Time Memory Controllers
  Provide guaranteed performance in accessing DRAM

10 Real-Time Memory Controllers
  Common techniques
    Command grouping: force the use of ALL banks for each memory access
    Private banking: assign private DRAM banks to cores
    Scheduling: use analysis-friendly scheduling (e.g., round-robin) over difficult ones (e.g., FR-FCFS)
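
To illustrate why round-robin counts as analysis friendly, here is a minimal sketch of a round-robin arbiter. The class and method names are hypothetical and it models only the grant order, not DRAM commands. The point is that a pending requestor is passed over by at most N-1 other grants, which gives a simple bound that FR-FCFS does not offer.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Grants one pending requestor per call, rotating priority after each grant."""

    def __init__(self, n_requestors: int) -> None:
        if n_requestors < 1:
            raise ValueError("n_requestors must be at least 1")
        self.n = n_requestors
        self.next = 0  # requestor with the highest priority on the next grant

    def grant(self, pending: Sequence[bool]) -> Optional[int]:
        """Return the index of the granted requestor, or None if nothing is pending."""
        if len(pending) != self.n:
            raise ValueError("pending must have one entry per requestor")
        for offset in range(self.n):
            candidate = (self.next + offset) % self.n
            if pending[candidate]:
                # The granted requestor drops to the lowest priority.
                self.next = (candidate + 1) % self.n
                return candidate
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(4)
    # With every core always pending, each core waits for exactly 3 other grants.
    print([arbiter.grant([True] * 4) for _ in range(8)])
```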

11 Predator

12 Worst-case
  [Figure: Core1 to Core4, L3, Memory Controller (MC), DRAM DIMM with Bank 1 to Bank 4]
  Slow 1-bank b/w
  Less than peak b/w
  How much?

13 Worst-Case For Single-Bank: Horrible

14 Bank Interleaving and Groups

15 Arbitration: CCSP

16 Controller Architecture

17 Real-Time Memory Controllers (RTMC)
  Predator: command grouping, CCSP arbitration
  AMC: command grouping, round-robin arbitration
  PRET-MC: private bank, TDMA arbitration
  DCmc, MEDUSA: RR + FR-FCFS hybrid, bank partitioning
  Read/Write Bundling: reduce bus turn-around overhead

18 RTMC References
  Predator: a predictable SDRAM memory controller. CODES+ISSS
  An analyzable memory controller for hard real-time CMPs. IEEE Embedded Systems Letters, 2009
  PRET DRAM controller: Bank privatization for predictability and temporal isolation. CODES+ISSS, 2011
  A dual-criticality memory controller (DCmc): Proposal and evaluation of a space case study. RTAS, 2015
  Improved DRAM Timing Bounds for Real-Time DRAM Controllers with Read/Write Bundling, 2016
  A Comprehensive Study of DRAM Controllers in Real-Time Systems. Danlu Guo, MS Thesis, University of Waterloo,

19 Real-Time Multi/Many-Core Architecture
  Why is it difficult to analyze WCET?
  Projects on Real-Time CPU Architectures

20 Worst-Case Execution Time (WCET)
  [Figure] Image source: [Wilhelm et al., 2008]
  Real-time scheduling theory is based on the assumption of known WCETs of real-time tasks

21 Computing WCET
  Static analysis
    Input: program code, architecture model
    Output: WCET
    Problem: the architecture model is hard to build and pessimistic (recall Parallelism-aware paper)
  Measurement
    No guarantee on the true worst case
    But widely used in practice
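
The measurement approach can be sketched as follows. This is a minimal illustration, not a real measurement harness: the function name `measure_max_time_ns` and the default of 1000 runs are assumptions, and a Python timer on a general-purpose OS stands in for cycle counters on the target hardware. The returned value is the largest time observed, which is only a lower bound on the true WCET because the worst-case input or hardware state may never have been exercised.

```python
import time
from typing import Callable


def measure_max_time_ns(task: Callable[[], object], runs: int = 1000) -> int:
    """Run `task` repeatedly and return the largest observed execution time in ns."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    worst_observed = 0
    for _ in range(runs):
        start = time.perf_counter_ns()
        task()
        elapsed = time.perf_counter_ns() - start
        worst_observed = max(worst_observed, elapsed)
    return worst_observed


if __name__ == "__main__":
    observed = measure_max_time_ns(lambda: sum(range(10_000)))
    print(f"max observed: {observed} ns (a lower bound on the WCET)")
```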

22 Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-Critical Embedded Systems

23 Problematic CPU Features
  Architectures are optimized to improve average-case performance
  WCET estimation is hard because of
    Pipelining
    TLBs/Caches
    Super-scalar
    Out-of-order scheduling
    Branch predictors
    Hardware prefetchers
    Basically anything that affects processor state

24 Static Timing Analysis
  [Figure, p. 968, IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED ...: Fig. 1. Main components of a timing-analysis framework and their interaction.]
  [Cropped excerpt of the accompanying text: items 5) Microarchitectu... and 6) Global bound ...]

25 Control Flow Graph (CFG)
  Analyze code
  Split basic blocks
  Compute per-block WCET
    Use abstract CPU model
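
Once per-block WCETs are known, a bound for the whole program is the longest path through the CFG. The sketch below assumes an acyclic CFG (loops already bounded and unrolled) and uses hypothetical names and made-up block costs. Real analyzers typically use more elaborate path analysis, so treat this as an illustration of the idea only.

```python
from functools import lru_cache
from typing import Dict, List


def wcet_bound(
    cfg: Dict[str, List[str]], block_wcet: Dict[str, int], entry: str
) -> int:
    """Return the cost of the most expensive path from `entry` to any exit block.

    `cfg` maps each basic block to its successors and must be acyclic.
    `block_wcet` maps each basic block to its per-block WCET in cycles.
    """

    @lru_cache(maxsize=None)
    def longest_from(block: str) -> int:
        successors = cfg.get(block, [])
        # An exit block has no successors, so its tail cost is 0.
        tail = max((longest_from(s) for s in successors), default=0)
        return block_wcet[block] + tail

    return longest_from(entry)


if __name__ == "__main__":
    # Hypothetical if/else: entry -> (then | else) -> exit
    cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
    costs = {"entry": 5, "then": 20, "else": 8, "exit": 3}
    print(wcet_bound(cfg, costs, "entry"))  # 5 + 20 + 3 = 28
```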

26 Timing Anomalies
  Locally faster != globally faster
  Image source: [Wilhelm et al., 2008]

27 Timing Anomalies
  Locally faster != globally faster
  Image source: [Wilhelm et al., 2008]

28 Real-Time CPU Architectures
  PRET: UC Berkeley
  MERASA/parMERASA project: EU
  ACROSS: EU
  ARAMIS: Germany
  EMC2: EU

30 PRET Pipeline
  [Pipeline diagram: time axis t, one column per clock; each thread runs Instruction 1 then Instruction 2, and each thread starts one clock after the previous one]
  THREAD#1 FETCH DECODE REGACC MEM EXECUTE EXCEPT FETCH DECODE REGACC MEM EXECUTE EXCEPT
  THREAD#2 FETCH DECODE REGACC MEM EXECUTE EXCEPT FETCH DECODE REGACC MEM EXECUTE
  THREAD#3 FETCH DECODE REGACC MEM EXECUTE EXCEPT FETCH DECODE REGACC MEM
  THREAD#4 FETCH DECODE REGACC MEM EXECUTE EXCEPT FETCH DECODE REGACC
  THREAD#5 FETCH DECODE REGACC MEM EXECUTE EXCEPT FETCH DECODE
  THREAD#6 FETCH DECODE REGACC MEM EXECUTE EXCEPT FETCH
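
The interleaving in the diagram above can be reproduced with a short model. This is a sketch, assuming strict round-robin fetch with one thread per clock, 0-based cycle numbers, and at least as many threads as pipeline stages; the function name `stage_occupancy` is hypothetical. It shows that every stage holds a different thread in any given clock, so two instructions of the same thread are never in the pipeline together.

```python
from typing import Dict, Tuple

STAGES = ["FETCH", "DECODE", "REGACC", "MEM", "EXECUTE", "EXCEPT"]


def stage_occupancy(cycle: int, n_threads: int = 6) -> Dict[str, Tuple[int, int]]:
    """Map each stage to the (thread, instruction) it holds at `cycle`.

    Thread and instruction numbers are 1-based, as on the slide. Thread t
    fetches instruction k at cycle (t - 1) + (k - 1) * n_threads.
    """
    if n_threads < len(STAGES):
        raise ValueError("need at least one thread per pipeline stage")
    occupancy: Dict[str, Tuple[int, int]] = {}
    for depth, stage in enumerate(STAGES):
        fetch_cycle = cycle - depth  # cycle in which this instruction was fetched
        if fetch_cycle < 0:
            continue  # pipeline is still filling, this stage is empty
        thread = fetch_cycle % n_threads + 1
        instruction = fetch_cycle // n_threads + 1
        occupancy[stage] = (thread, instruction)
    return occupancy


if __name__ == "__main__":
    for cycle in range(8):
        print(cycle, stage_occupancy(cycle))
```

At cycle 6 the model places Thread 1, Instruction 2 in FETCH, one clock after Thread 1, Instruction 1 left EXCEPT, which matches the first row of the diagram.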

31 FlexPRET Pipeline

32 MERASA Multicore

34 Acknowledgement
  Some slides are from:
    Prof. Rodolfo Pellizzoni, University of Waterloo
    Prof. Edward A. Lee, University of California, Berkeley

35 Summary
  Timing anomalies
    Locally fast != globally fast on non-timing-compositional architectures (i.e., most architectures)
  Timing-compositional architecture
    Free of timing anomalies
