S is infinite here 1 =

Size: px

Start display at page:

Download "S is infinite here 1 ="

Victoria Weaver
5 years ago
Views:

1 Lecture 12 ECEN 5653 CPU & IO Threading, Scaling, and Speed-up April 7, 2008 Sam Siewert

2 Reminders Help Sessions with ECEN5033 DEBUG in Subject Choose Meeting Date and Time Choose Meeting Method (Skype or Engineering Ctr. Lobby) ml Turn in Write-Up on CULearn Turn in Final DVD to Prof. Siewert in Person or EE Dept. Mail-box Final Quiz Review on Web Sam Siewert 2

3 Parallel Processing Speed-up Grid Data Processing Speed-up 1. Multi-Core, Multi-threaded, threaded, Macro-blocks/Frames 2. SIMD, Vector Instructions Operating over Large Words (Many Times Instruction Set Size) 3. Co-Processor Operates in Parallel to CPU(s) SPMD GPU or GP-GPU GPU Co-Processor PCI-Express Bus Interfaces Transfer Program and Data to Co-Processor Threads and Blocks to Transform Data Concurrently Image Data Processing Few Data Dependencies Good Speed-up by Amdahl s Law P=Parallel Portion (1-P)=Sequential Portion S=# of Cores (Concurrency) Overhead for Co-Processor IO for Co-Processing Max_ Speed _ Up Multicore_ Speed_ Up= S is infinite here 1 = (1 P) (1 P) + Sam Siewert 3 P/ S

4 Amdahl s Law Infinite Cores Maximum Speed-up Driven by Sequential and Parallel Portions of Program P = Parallel Portion (1-P) = Sequential Portion Speed-up for Given Multi-core Architecture Function of # of Cores (Speed-up in Parallel Portions) Algorithm Speed Up All Code Parallel (Infinite Speed-up) % Parallel (20x Speed-up) Amdahl's Law Max Speed-up (Any Number of Processor Cores) -1.55E Sequential Portion (% Computation in Sequential vs. Parallel Execution) Max Speed-up No Parallel Portion All Sequential (No Speed-up) Sam Siewert 4

5 Multi-Core Speed-Up 20 Amdahl's Law - Speed-up with # Cores and Parallel Portion % Parallel Program 14 -up Speed Max Speed-up 2 cores 4 cores 8 cores 12 cores 32 cores Sequential Portion of Algorithm Sam Siewert 5

6 Hiding IO Latency Overlapping with Processing Simple Design Each Thread has READ, PROCESS, WRITE-BACK Execution READ F(1) Process F(1) Write-back F(1) READ F(2) Frame rate is READ+PROCESS+WRITE latency e.g. 10 fps for 100 milliseconds If READ is 70 msec, PROCESS is 10 msec, and WRITE-BACK 20 msec, predominate time is IO time, not processing Disk drive with 100 MB/sec READ rate can only read 16 fps, 62.5 msec READ latency Sam Siewert 6

7 Hiding IO Latency Schedule Multiple Overlapping Threads? READ F 1 Process F 1 Write-back F 1 READ F 4 Process F 4 Write-back F 4 READ F 2 Process F 2 Write-back F 2 READ F 5 Process F 5 READ F 3 Process F 3 Write-back F 3 Read F 6 Start-up Core #1 Continuous Processing Core #1 Continuous Processing READ F 1 Process F 1 Write-back F 1 READ F 4 Process F 4 Write-back F 4 READ F 2 Process F 2 Write-back F 2 READ F 5 Process F 5 READ F 3 Process F 3 Write-back F 3 Read F 6 Start-up Core #2 Continuous Processing Core #2 Continuous Processing Requires N threads = N stages x N cores 1.5 to 2x Number of Threads for SMT (Hyper-threading) For IO Stage Duration Similar to Processing Time More Threads if IO Time (Read+WB+Read) >> 3 x Processing Time Sam Siewert 7

8 Hiding Latency Dedicated IO Schedule Reads Ahead of Processing Read F 1 Read F 2 Read F 3 Read F 4 Read F 5 Read F 6 Read F 7 Read F 8 Wait Process F 1 Process F 3 Process F 5 Wait Process F 2 Process F 4 Process F 6 Wait WB F 1 WB F 2 WB F 3 WB F 4 WB F 5 WB F 6 Start-up Dual-Core Concurrent Processing Completion Requires N threads = 2 + N cores Synchronize Frame Ready/Write-backs Balance Stage Read/Write-Back Latency to Processing 1.5 to 2x Threads for SMT (Hyper-threading) Sam Siewert 8

9 Processing Latency Alone Write Code with Memory Resident Frames Load Frames in Advance Process In-Memory Frames Over and Over Do No IO During Processing Provides Baseline Measurement of Processing Latency per Frame Alone Provides Method of Optimizing Processing Without IO Latency Sam Siewert 9

10 IO Latency Alone Comment Out Frame Transformation Code or Call Stubbed NULL Function Provides Measurement of IO Frame Rate Alone Essentially Zero Latency Transform No Change Between Input Frames and Output Frames Allows for Tuning of IO Scheduler and Threading Sam Siewert 10

11 Tips for IO Scheduling blockdev --getra /dev/sda Should return 256 Means that reads read-ahead ahead up to 128K Function calls read, fread should request as much as possible Check actual bytes read, re-read read as needed in a loop blockdev --setra /dev/sda (8MB) Switch CFQ to Deadline Use lsscsi to verify your disk is /dev/sda substitue block driver interface used for file system if not sda cat /sys/block/sda/queue/scheduler echo deadline > /sys/block/sda/queue/scheduler Options are noop, cfq, deadline Sam Siewert 11

CS A490 Digital Media and Interactive Systems

CS A490 Digital Media and Interactive Systems Lecture 11 Thread Scaling and I/O Threading and Async I/O on Linux October 30, 2013 Sam Siewert Parallel Processing Speed-up Grid Data Processing Speed-up