CS A490 Digital Media and Interactive Systems


Slide 1: Lecture 11, Thread Scaling and I/O
Threading and Async I/O on Linux
October 30, 2013
Sam Siewert

Slide 2: Parallel Processing Speed-up
Grid Data Processing Speed-up
1. Multi-Core, Multi-threaded, Macro-blocks/Frames
2. SIMD, Vector Instructions Operating over Large Words (Many Times Instruction Set Size)
3. Co-Processor Operates in Parallel to CPU(s)
   - SPMD GPU or GP-GPU Co-Processor
   - PCI-Express Bus Interfaces Transfer Program and Data to Co-Processor
   - Threads and Blocks to Transform Data Concurrently
Image Data Processing: Few Data Dependencies, Good Speed-up by Amdahl's Law
- P = Parallel Portion
- (1-P) = Sequential Portion
- S = # of Cores (Concurrency)
- Overhead for Co-Processor: IO for Co-Processing
Max speed-up (S is infinite here):
  \mathrm{Max\_Speed\_Up} = \frac{1}{1 - P}
Multi-core speed-up:
  \mathrm{Multicore\_Speed\_Up} = \frac{1}{(1 - P) + P / S}

Slide 3: Amdahl's Law
- Infinite Cores: Maximum Speed-up Driven by Sequential and Parallel Portions of Program
  - P = Parallel Portion
  - (1-P) = Sequential Portion
- Speed-up for Given Multi-core Architecture: Function of # of Cores (Speed-up in Parallel Portions)
[Figure: Amdahl's Law Max Speed-up (Any Number of Processor Cores). Vertical axis: Algorithm Speed-up. Horizontal axis: Sequential Portion (% Computation in Sequential vs. Parallel Execution). Curve: Max Speed-up. Annotations: All Code Parallel (Infinite Speed-up); 95% Parallel (20x Speed-up); No Parallel Portion, All Sequential (No Speed-up).]
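As a worked check of the 95% annotation, using only the definitions on these slides (P the parallel portion, S the number of cores), the multi-core formula can be taken to its limit as S grows without bound:

```latex
\[
\lim_{S \to \infty} \frac{1}{(1 - P) + P/S} = \frac{1}{1 - P},
\qquad
P = 0.95 \;\Rightarrow\; \frac{1}{1 - 0.95} = \frac{1}{0.05} = 20
\]
```

The same limit gives no speed-up (a factor of 1) when P = 0 and unbounded speed-up as P approaches 1.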

Slide 4: Multi-Core Speed-Up
[Figure: Amdahl's Law - Speed-up with # Cores and Parallel Portion. Vertical axis: Speed-up. Horizontal axis: Sequential Portion of Algorithm. Curves: Max Speed-up, 2 cores, 4 cores, 8 cores, 12 cores, 32 cores. Annotation: 95% Parallel Program.]
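The curves in this plot can be approximated numerically. The following is a minimal sketch, assuming the standard Amdahl formula 1 / ((1 - P) + P / S); the function names `amdahl_speedup` and `max_speedup` are illustrative and not from the lecture.

```python
def amdahl_speedup(parallel_portion, cores):
    """Speed-up = 1 / ((1 - P) + P / S) for P in [0, 1] and S >= 1 cores."""
    if not 0.0 <= parallel_portion <= 1.0:
        raise ValueError("parallel_portion must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / ((1.0 - parallel_portion) + parallel_portion / cores)


def max_speedup(parallel_portion):
    """Limit for infinite cores: 1 / (1 - P); infinite when all code is parallel."""
    if parallel_portion >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_portion)


if __name__ == "__main__":
    p = 0.95  # the 95% parallel program called out on the plot
    for cores in (2, 4, 8, 12, 32):  # core counts taken from the plot legend
        print(f"{cores:2d} cores: {amdahl_speedup(p, cores):.2f}x")
    print(f"max speed-up: {max_speedup(p):.1f}x")
```

Even at 32 cores the 95% parallel program stays well below its 20x ceiling, which is the point the plot makes about the sequential portion.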

Slide 5: Hiding IO Latency, Overlapping with Processing
- Simple Design: Each Thread has READ, PROCESS, WRITE-BACK Execution
  READ F(1) -> Process F(1) -> Write-back F(1) -> READ F(2) ...
- Frame rate is set by READ+PROCESS+WRITE latency, e.g. 10 fps for 100 milliseconds
- If READ is 70 msec, PROCESS is 10 msec, and WRITE-BACK is 20 msec, the predominant time is IO time, not processing
- A disk drive with a 100 MB/sec READ rate can only read 16 fps at 62.5 msec READ latency per frame
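The arithmetic on this slide can be reproduced in a few lines. This is a sketch with illustrative function names; the 6.25 MB frame size is not stated on the slide and is derived here as an assumption from 100 MB/sec x 62.5 msec.

```python
def frame_rate_fps(read_ms, process_ms, write_back_ms):
    """Frames per second when one thread runs the three stages back to back."""
    return 1000.0 / (read_ms + process_ms + write_back_ms)


def io_fraction(read_ms, process_ms, write_back_ms):
    """Share of the per-frame latency spent in READ and WRITE-BACK."""
    total = read_ms + process_ms + write_back_ms
    return (read_ms + write_back_ms) / total


def read_latency_ms(frame_mb, disk_mb_per_s):
    """Time to read one frame at a sustained disk read rate."""
    return 1000.0 * frame_mb / disk_mb_per_s


if __name__ == "__main__":
    print(frame_rate_fps(70, 10, 20))        # slide example: 100 msec per frame
    print(io_fraction(70, 10, 20))           # fraction of time spent on IO
    latency = read_latency_ms(6.25, 100.0)   # 6.25 MB frame is an assumption
    print(latency, 1000.0 / latency)         # READ latency and READ-only frame rate
```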

Slide 6: Hiding IO Latency, Schedule Multiple Overlapping Threads?
[Timeline, Core #1: Start-up, then Continuous Processing, with three overlapping threads]
  READ F1 | Process F1 | Write-back F1 | READ F4 | Process F4 | Write-back F4
  READ F2 | Process F2 | Write-back F2 | READ F5 | Process F5
  READ F3 | Process F3 | Write-back F3 | READ F6
[Timeline, Core #2: Start-up, then Continuous Processing, same schedule as Core #1]
- Requires N threads = N stages x N cores
- 1.5 to 2x Number of Threads for SMT (Hyper-threading)
- For IO Stage Duration Similar to Processing Time
- More Threads if IO Time (Read+WB+Read) >> 3 x Processing Time
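A minimal sketch of this thread-per-frame design, assuming in-memory byte strings stand in for disk reads and write-backs and a hypothetical pixel inversion stands in for PROCESS. All names are illustrative. In CPython the interpreter lock keeps the PROCESS stages from running in parallel, so this shows the structure of the overlap rather than a real multi-core speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def thread_count(stages, cores, smt_factor=1.0):
    """N threads = N stages x N cores, scaled by 1.5 to 2 for SMT."""
    return int(stages * cores * smt_factor)


def handle_frame(index, source, sink):
    """One thread's work for one frame: READ, PROCESS, WRITE-BACK."""
    frame = source[index]                     # READ (stand-in for a disk read)
    result = bytes(255 - b for b in frame)    # PROCESS (hypothetical transform)
    sink[index] = result                      # WRITE-BACK (stand-in for a disk write)
    return index


def run_overlapped(source, cores, stages=3):
    """Run all frames through a pool sized by the stages x cores rule."""
    sink = [None] * len(source)
    workers = thread_count(stages, cores)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception
        list(pool.map(lambda i: handle_frame(i, source, sink), range(len(source))))
    return sink


if __name__ == "__main__":
    frames = [bytes([i] * 4) for i in range(6)]
    print(thread_count(3, 2), run_overlapped(frames, cores=2)[0])
```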

Slide 7: Hiding Latency, Dedicated IO
Schedule Reads Ahead of Processing
[Timeline: Start-up, Dual-Core Concurrent Processing, Completion]
  Read:       Read F1 | Read F2 | Read F3 | Read F4 | Read F5 | Read F6 | Read F7 | Read F8
  Process:    Wait | Process F1 | Process F3 | Process F5
  Process:    Wait | Process F2 | Process F4 | Process F6
  Write-back: Wait | WB F1 | WB F2 | WB F3 | WB F4 | WB F5 | WB F6
- Requires N threads = 2 + N cores
- Synchronize Frame Ready/Write-backs
- Balance Stage Read/Write-Back Latency to Processing
- 1.5 to 2x Threads for SMT (Hyper-threading)
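One way to sketch the dedicated-IO layout is one reader thread, one write-back thread, and one processing thread per core, synchronized through queues. This is an assumption about how the slide's "frame ready" and "write-back" synchronization might be built, not the lecture's implementation; lists stand in for disk IO and the `read_ahead` depth is a hypothetical parameter.

```python
import queue
import threading

STOP = None  # sentinel marking the end of the frame stream


def reader(frames, ready_q, workers):
    for index, frame in enumerate(frames):    # reads run ahead of processing
        ready_q.put((index, frame))           # blocks when read-ahead is full
    for _ in range(workers):                  # one sentinel per processing thread
        ready_q.put(STOP)


def worker(ready_q, done_q, transform):
    while True:
        item = ready_q.get()                  # waits until a frame is ready
        if item is STOP:
            done_q.put(STOP)
            return
        index, frame = item
        done_q.put((index, transform(frame)))


def writer(done_q, out, workers):
    finished = 0
    while finished < workers:                 # stop once every worker is done
        item = done_q.get()
        if item is STOP:
            finished += 1
        else:
            index, frame = item
            out[index] = frame                # stand-in for the write-back


def run_dedicated_io(frames, cores, transform, read_ahead=4):
    ready_q = queue.Queue(maxsize=read_ahead)
    done_q = queue.Queue()
    out = [None] * len(frames)
    threads = [threading.Thread(target=reader, args=(frames, ready_q, cores)),
               threading.Thread(target=writer, args=(done_q, out, cores))]
    threads += [threading.Thread(target=worker, args=(ready_q, done_q, transform))
                for _ in range(cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out, len(threads)                  # thread count is 2 + cores


if __name__ == "__main__":
    print(run_dedicated_io([b"a", b"b", b"c"], 2, bytes.upper))
```

The bounded `ready_q` keeps the reader from running arbitrarily far ahead, which is one simple way to balance read latency against processing.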

Slide 8: Processing Latency Alone
- Write Code with Memory Resident Frames
  - Load Frames in Advance
  - Process In-Memory Frames Over and Over
  - Do No IO During Processing
- Provides Baseline Measurement of Processing Latency per Frame Alone
- Provides Method of Optimizing Processing Without IO Latency
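A minimal sketch of such a baseline measurement, assuming the frames are already loaded into a list and `transform` is whatever per-frame function is under test (both names are illustrative):

```python
import time


def processing_latency_ms(frames, transform, passes=10):
    """Average per-frame processing time over memory-resident frames.

    No IO happens inside the timed loop: frames must be loaded in advance.
    """
    start = time.perf_counter()
    for _ in range(passes):                   # process the same frames over and over
        for frame in frames:
            transform(frame)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (passes * len(frames))


if __name__ == "__main__":
    frames = [bytes(64 * 1024) for _ in range(8)]   # hypothetical 64 KB frames

    def invert(frame):
        return bytes(255 - b for b in frame)        # hypothetical transform

    print(f"{processing_latency_ms(frames, invert, passes=3):.3f} msec per frame")
```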

Slide 9: IO Latency Alone
- Comment Out Frame Transformation Code or Call Stubbed NULL Function
- Provides Measurement of IO Frame Rate Alone
  - Essentially Zero Latency Transform
  - No Change Between Input Frames and Output Frames
- Allows for Tuning of IO Scheduler and Threading
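The stubbed-transform measurement might look like the sketch below, assuming one file per frame; `null_transform` and `io_frame_rate` are illustrative names. Note that re-reading recently written files can be served from the operating system's cache, so a figure measured this way may overstate what the disk alone sustains.

```python
import os
import tempfile
import time


def null_transform(frame):
    """Stubbed transform: returns the frame unchanged."""
    return frame


def io_frame_rate(in_paths, out_dir, transform=null_transform):
    """Frames per second for read plus write-back with a stubbed transform."""
    start = time.perf_counter()
    for path in in_paths:
        with open(path, "rb") as f:
            frame = f.read()
        out_path = os.path.join(out_dir, os.path.basename(path))
        with open(out_path, "wb") as f:
            f.write(transform(frame))
    elapsed = max(time.perf_counter() - start, 1e-9)   # guard against a zero reading
    return len(in_paths) / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        paths = []
        for i in range(4):
            path = os.path.join(src, f"frame{i}.bin")  # hypothetical file naming
            with open(path, "wb") as f:
                f.write(bytes(1024 * 1024))
            paths.append(path)
        print(f"{io_frame_rate(paths, dst):.1f} fps (IO only)")
```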

Slide 10: Tips for IO Scheduling
- blockdev --getra /dev/sda
  - Should return 256
  - Means that reads read-ahead up to 128K (256 sectors of 512 bytes)
- Function calls read, fread should request as much as possible
  - Check actual bytes read, re-read as needed in a loop
- blockdev --setra /dev/sda (8MB)
- Switch CFQ to Deadline
  - Use lsscsi to verify your disk is /dev/sda; if it is not sda, substitute the block driver interface used for the file system
  - cat /sys/block/sda/queue/scheduler
  - echo deadline > /sys/block/sda/queue/scheduler
  - Options are noop, cfq, deadline
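The "check actual bytes read, re-read as needed in a loop" tip can be sketched as follows. The slide refers to the C calls read and fread; this version uses Python's `os.read`, which wraps the same system call, and `read_fully` is an illustrative name.

```python
import os


def read_fully(fd, nbytes):
    """Request nbytes, check the count actually returned, re-read until done or EOF."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = os.read(fd, remaining)    # may return fewer bytes than requested
        if not chunk:                     # empty result means end of file
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

A caller would typically compare `len(result)` with the requested size to detect a truncated frame.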
