CEC 450 Real-Time Systems

Size: px

Start display at page:

Download "CEC 450 Real-Time Systems"

Randell Randall
5 years ago
Views:

1 CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert

2 A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation Time Sensed (From Release to Response) Event Sensed Interrupt Dispatch Preemption Dispatch Completion (IO Queued) Actuation (IO Completion) Interference Time Input-Latency Execution Dispatch-Latency Execution Output-Latency Sam Siewert 2

3 Service Response Timeline (With Intermediate Blocking) Response Time = Time Actuation Time Sensed (From Event to Response) Event Sensed Interrupt Dispatch Preemption Dispatch Completion (IO Queued) Actuation (IO Completion) Time Input Latency Dispatch Latency Execution Interference Execution Output Latency Event Sensed Interrupt Dispatch Yield Dispatch Preemption Dispatch Completion (IO Queued) Actuation (IO Completion) Time Input Latency Dispatch Latency Execution BLOCKING Execution Interference Execution Output Latency Sam Siewert 3

Dispatch Latency Execution Interference Execution Output Latency Event Sensed Interrupt Dispatch Preemption Dispatch Completion (IO Queued)

4 Service Response Timeline - WCET (With Intermediate I/O Impacts BCET to Cause WCET) Response Time = Time Actuation Time Sensed (From Event to Response) Event Sensed Interrupt Dispatch Preemption Dispatch Completion (IO Queued) Actuation (IO Completion) Time Input Latency Dispatch Latency Execution Interference Execution Output Latency Event Sensed Interrupt Dispatch Preemption Dispatch Completion (IO Queued) Actuation (IO Completion) Input Latency Dispatch Latency Execution MMIO (Stalls) Execution Interference Execution Output Latency Time Sam Siewert 4

5 Services and I/O Initial Input I/O from Sensors Low-Rate Input: Byte or Word Input from Memory-Mapped Registers or I/O Ports High-Rate Input: Block Input from DMA Channels and Block Reads from FIFOs Final Response I/O to Actuators Low-Rate Output: Byte or Word Writes Posted to Output FIFO or Bus Interface Queue High-Rate Output: Block Output on DMA Channels and Block Writes Intermediate I/O During Service Execution, Memory Mapped Register I/O External Memory I/O Cache Loads and Write-Backs In Memory Input/Output from Service to Service Sam Siewert 5

6 Service Integration Concepts Initial Sensor Input and Final Actuator Response Output Device Interface Drivers Simple Service Has No Intermediate I/O Latency, No Synchronization Sensor Input Interface Digital Control Law Service Actuator Output Interface Sensor Electronics Sensors Effectors Actuator Electronics Environment Sam Siewert 6

7 Complex Multi-Service Systems Multiple Services Synchronization Between Services Communication Between Services Multiple Sensor Input and Actuator Output Interfaces Intermediate I/O, Shared Memory, Messaging Sam Siewert 7

8 Multi-Service Pipelines framerdy interrupt RT Management Event releases / Service control framerdy Control Plane tbtvid (30 Hz) RGB32 buffer tfrmdisp (10 Hz) grayscale buffer tfrmlnk (2 Hz) frame Remote Display tnet (2 Hz) Data Plane P0 topnav (15 Hz) ttlmlnk (5 Hz) TCP/IP pkts P1 tthrustctl (15 Hz) P2 Bt878 PCI NTSC decoder (30 Hz) tcamctl (3 Hz) Serial A/B ISA ethernet Sam Siewert 8 P3 SW HW

9 Pipelined Architecture Review Recall that Pipeline Yields CPI of 1 or Less Instruction Completed Each CPU Clock Unless Pipeline Stalls! IF ID Execute Write- Back IF ID Execute Write- Back IF ID Execute Write- Back IF ID Execute Write- Back IF ID Execute Write- Back IF ID Execute Write- Back Sam Siewert 9

10 Service Execution [ ] WCET = ( CPI Longest _ Path _ Inst _ Count) + Stall _ Cycles Clk _ Period best case Efficiency of Execution Memory Access for Inter-Service Communication Bounded Intermediate I/O Ideally Lock Data and Code into Cache or Use Tightly Coupled Memory [Scratch Pad, Zero Wait State] Sam Siewert 10

11 Service Efficiency Path Length for a Release Instruction Count Longest Path Given Algorithm Branches Loop Iterations Data Driven? Path Execution Number of Cycles to Complete Path Clocks Per Instruction Number of Stall Cycles for Pipeline Data Dependencies (Intermediate IO) Cache Misses (Intermediate Memory Access) Branch Mis-predictions (Small Penalty) Sam Siewert 11

12 Hiding Intermediate IO Latency (Overlapping CPU and Bus I/O) ICT = Instruction Count Time Time to execute a block of instructions with no stalls CPU Cycles x CPU Clock Period IOT = Bus Interface IO Time Bus IO Cycles x Bus Clock Period OR = Overlap Required Percentage of CPU cycles that must be concurrent with I/O cycles NOA = Non-Overlap Allowable for Si to meet Di Percentage of CPU cycles that can be in addition to IO cycle time without missing service deadline Di = Deadline for Service S i relative to release interrupt or system call initiates S i request and S i execution CPI = Clocks Per Instruction for a block of instructions IPC is Instructions Per Clock, Also Used, Just the Inverse (Superscalar, Pipelined) Sam Siewert 12

13 Processing and I/O Time Overlap Di >= IOT is required; otherwise if Di < IOT, Si is IO-Bound Di >= ICT is required; otherwise if Di < ICT, Si is CPU-Bound Di >= (IOT + ICT) requires no overlap of IOT with ICT if Di < (IOT + ICT) where (Di >= IOT and Di >= ICT), overlap of IOT with ICT is required if Di < (IOT + ICT) where (Di < IOT or Di < ICT), deadline Di can t be met regardless of overlap Sam Siewert 13

IOT CPI required = D i / ICT OR = 1 - [(D i - IOT) / ICT] ICT ICT OR IOT CPI

14 Service Execution Efficiency Waiting on I/O from MMIO Bus and Memory CPI worst-case = (ICT + IOT) / ICT ICT IOT CPI best-case = (max(ict, IOT)) / ICT ICT IOT CPI required = D i / ICT OR = 1 - [(D i - IOT) / ICT] ICT ICT OR IOT CPI required = [ICT(1-OR) + IOT] / ICT NOA = (D i IOT) / ICT; OR + NOA = 1 (by definition) Sam Siewert 14

15 Synchronization and Message Message Queues Provide Communication Provide Synchronization Traditional Message Queue Message Size Internal Fragmentation Heap Message Queue Messages Are Pointers Message Data in Heap Passing S1 Sam Siewert 15 Msg Msg Heap Buffer Pool S2 S1 Ptr S2 Allocation Of Message Heap ENEA OSE Message Passing RTOS Advertised Advantages De-Allocation Of Message Heap

16 What About Storage I/O Much Higher Latency than Off-Chip Memory or Bus MMIO Milliseconds of Latency Compared to GHz Core Clock Rates 1 million to 1! Flash Better, But Still Slower than SRAM or DRAM Option #1 Avoid Disk Drives and Flash I/O Option #2 Pre-Buffer Storage I/O (Read-Ahead Cache), Post Write-backs (Write-Back Cache), MFU/MRU (Most Frequently Used / Most Recently Used) Kept in Read Cache Sam Siewert 16

17 Parallel Processing Speed-up (Makes I/O Latency Worse?) Grid Data Processing Speed-up 1. Multi-Core, Multi-threaded, Macro-blocks/Frames 2. SIMD, Vector Instructions Operating over Large Words (Many Times Instruction Set Size) 3. Co-Processor Operates in Parallel to CPU(s) SPMD GPU or GP-GPU Co-Processor PCI-Express Bus Interfaces Transfer Program and Data to Co-Processor Threads and Blocks to Transform Data Concurrently Image Data Processing Few Data Dependencies Good Speed-up by Amdahl s Law Max _ Speed P=Parallel Portion (1-P)=Sequential Portion S=# of Cores (Concurrency) Overhead for Co-Processor IO for Co-Processing _ Up Multicore _ Speed _ Up = S is infinite here 1 = (1 P) (1 P) + Sam Siewert 17 P / S

18 Amdahl s Law Infinite Cores Maximum Speed-up Driven by Sequential and Parallel Portions of Program P = Parallel Portion (1-P) = Sequential Portion Speed-up for Given Multi-core Architecture Function of # of Cores (Speed-up in Parallel Portions) All Code Parallel (Infinite Speed-up) 95% Parallel (20x Speed-up) Amdahl's Law Max Speed-up (Any Number of Processor Cores) Algorithm Speed Up E Sequential Portion (% Computation in Sequential vs. Parallel Execution) Max Speed-up No Parallel Portion All Sequential (No Speed-up) Sam Siewert 18

19 Multi-Core Speed-Up Amdahl's Law - Speed-up with # Cores and Parallel Portion % Parallel Program 14 Speed-up Max Speed-up 2 cores 4 cores 8 cores 12 cores 32 cores Sequential Portion of Algorithm Sam Siewert 19

Hiding Storage I/O Latency Overlapping with Processing Simple Design Each Thread has READ, PROCESS, WRITE-BACK Execution READ F(1) Process F(1) Write-back F(1) READ F(2) Frame rate is

20 Hiding Storage I/O Latency Overlapping with Processing Simple Design Each Thread has READ, PROCESS, WRITE-BACK Execution READ F(1) Process F(1) Write-back F(1) READ F(2) Frame rate is READ+PROCESS+WRITE latency e.g. 10 fps for 100 milliseconds If READ is 70 msec, PROCESS is 10 msec, and WRITE-BACK 20 msec, predominate time is IO time, not processing Disk drive with 100 MB/sec READ rate can only read 16 fps, 62.5 msec READ latency Sam Siewert 20

21 Hiding Storage I/O Latency Schedule Multiple Overlapping Threads? READ F 1 Process F 1 Write-back F 1 READ F 4 Process F 4 Write-back F 4 READ F 2 Process F 2 Write-back F 2 READ F 5 Process F 5 READ F 3 Process F 3 Write-back F 3 Read F 6 Start-up Core #1 Continuous Processing Core #1 Continuous Processing READ F 1 Process F 1 Write-back F 1 READ F 4 Process F 4 Write-back F 4 READ F 2 Process F 2 Write-back F 2 READ F 5 Process F 5 READ F 3 Process F 3 Write-back F 3 Read F 6 Start-up Core #2 Continuous Processing Core #2 Continuous Processing Requires N threads = N stages x N cores 1.5 to 2x Number of Threads for SMT (Hyper-threading) For IO Stage Duration Similar to Processing Time More Threads if IO Time (Read+WB+Read) >> 3 x Processing Time Sam Siewert 21

22 Hiding Latency Dedicated I/O Schedule Reads Ahead of Processing Read F 1 Read F 2 Read F 3 Read F 4 Read F 5 Read F 6 Read F 7 Read F 8 Wait Process F 1 Process F 3 Process F 5 Wait Process F 2 Process F 4 Process F 6 Wait WB F 1 WB F 2 WB F 3 WB F 4 WB F 5 WB F 6 Start-up Dual-Core Concurrent Processing Completion Requires N threads = 2 + N cores Synchronize Frame Ready/Write-backs Balance Stage Read/Write-Back Latency to Processing 1.5 to 2x Threads for SMT (Hyper-threading) Sam Siewert 22

23 Processing Latency Alone Write Code with Memory Resident Frames Load Frames in Advance Process In-Memory Frames Over and Over Do No IO During Processing Provides Baseline Measurement of Processing Latency per Frame Alone Provides Method of Optimizing Processing Without IO Latency Sam Siewert 23

24 I/O Latency Alone Comment Out Frame Transformation Code or Call Stubbed NULL Function Provides Measurement of IO Frame Rate Alone Essentially Zero Latency Transform No Change Between Input Frames and Output Frames Allows for Tuning of IO Scheduler and Threading Sam Siewert 24

25 Tips for Linux I/O Scheduling blockdev --getra /dev/sda Should return 256 Means that reads read-ahead up to 128K Function calls read, fread should request as much as possible Check actual bytes read, re-read as needed in a loop blockdev --setra /dev/sda (8MB) Switch CFQ to Deadline Use lsscsi to verify your disk is /dev/sda substitue block driver interface used for file system if not sda cat /sys/block/sda/queue/scheduler echo deadline > /sys/block/sda/queue/scheduler Options are noop, cfq, deadline Sam Siewert 25

CS A490 Digital Media and Interactive Systems

CS A490 Digital Media and Interactive Systems Lecture 11 Thread Scaling and I/O Threading and Async I/O on Linux October 30, 2013 Sam Siewert Parallel Processing Speed-up Grid Data Processing Speed-up