EN164: Design of Computing Systems
Topic 08: Parallel Processor Design (introduction)
Professor Sherief Reda
http://scale.engin.brown.edu
Electrical Sciences and Computer Engineering, School of Engineering, Brown University
Spring 2014
[material from Patterson & Hennessy, 4th ed., and Inside the Machine by J. Stokes] 1
Introduction to parallel processor design

Intel Core i7   AMD A10   IBM Power 8

1. SIMD (data-level parallelism)
2. Superscalar (instruction-level parallelism)
3. Multi-cores (thread-level parallelism) 2
1. SIMD architectures

Single Instruction, Multiple Data (SIMD): a single instruction acts on multiple pieces of data at once (aka vector instructions)
- Common applications: scientific computing, graphics
- Requires a vector register file and multiple execution units 3
2. Superscalar architectures

VLIW (Very Long Instruction Word):
- Compiler groups instructions to be issued together and packages them statically into issue slots
- Compiler detects and avoids hazards

Superscalar:
- CPU examines the instruction stream and chooses instructions to issue each cycle
- Compiler can help by reordering instructions
- CPU resolves hazards using advanced techniques at runtime
- Can be static in-order or dynamic out-of-order 4
Difference between superscalar and VLIW [from Fisher et al.] 5
MIPS with static dual issue

Two-issue packets:
- One ALU/branch instruction
- One load/store instruction
- 64-bit aligned: ALU/branch first, then load/store
- Pad an unused slot with a nop

Address   Instruction type   Pipeline stages
n         ALU/branch         IF ID EX MEM WB
n + 4     Load/store         IF ID EX MEM WB
n + 8     ALU/branch            IF ID EX MEM WB
n + 12    Load/store            IF ID EX MEM WB
n + 16    ALU/branch               IF ID EX MEM WB
n + 20    Load/store               IF ID EX MEM WB 6
Pipeline design for multiple issue 7
Scheduling example for dual-issue MIPS

Schedule this code for dual-issue MIPS:

Loop: lw   $t0, 0($s1)        # $t0 = array element
      add  $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0

        ALU/branch              Load/store         Cycle
Loop:   nop                     lw  $t0, 0($s1)    1
        nop                     nop                2
        add  $t0, $t0, $s2      nop                3
        addi $s1, $s1, -4       sw  $t0, 0($s1)    4
        bne  $s1, $zero, Loop   nop                5

IPC = 5/5 = 1 (cf. peak dual-issue IPC = 2, and IPC = 5/6 = 0.83 for a single-issue pipeline) 8
Limits to ILP: data dependencies

Data dependencies determine:
- The order in which results must be computed
- The possibility of hazards
- The degree of freedom in scheduling instructions => a limit to ILP

Data/name dependency hazards:
- Read After Write (RAW)
- Write After Read (WAR)
- Write After Write (WAW) 9
Name dependency (antidependence): WAR

lw  $s0, 0($t0)
add $t0, $s1, $s2     # WAR on $t0

add $s4, $s2, $s0
...
sub $s2, $s1, $s3     # WAR on $s2

lw  $t2, 0($s2)
...
lw  $s2, 4($t0)       # WAR on $s2

Just a name dependency: no values are transmitted. The dependency can be removed by renaming registers (either by the compiler or by hardware). 10
Name dependency (output dependency): WAW

lw  $s0, 0($t0)
...
add $s0, $s1, $s2     # WAW on $s0

add $s2, $s1, $s0
...
sub $s2, $t2, $t3     # WAW on $s2

Just a name dependency: no values are transmitted. The dependency can be removed by renaming registers (either by the compiler or by hardware). 11
Re-schedule example for dual-issue MIPS

Re-schedule this code for dual-issue MIPS:

Loop: lw   $t0, 0($s1)        # $t0 = array element
      add  $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0

        ALU/branch              Load/store         Cycle
Loop:   nop                     lw $t0, 0($s1)     1
        addi $s1, $s1, -4       nop                2
        add  $t0, $t0, $s2      nop                3
        bne  $s1, $zero, Loop   sw $t0, 4($s1)     4

IPC = 5/4 = 1.25 (cf. peak IPC = 2) 12
Exposing ILP using loop unrolling

- Replicate the loop body to expose more parallelism; also reduces loop-control overhead
- Use different registers per replication (register renaming)
- Avoids loop-carried anti-dependencies: a store followed by a load of the same register (aka a name dependence, i.e., reuse of a register name) 13
Loop unrolling example

        ALU/branch              Load/store         Cycle
Loop:   addi $s1, $s1, -16      lw $t0, 0($s1)     1
        nop                     lw $t1, 12($s1)    2
        add  $t0, $t0, $s2      lw $t2, 8($s1)     3
        add  $t1, $t1, $s2      lw $t3, 4($s1)     4
        add  $t2, $t2, $s2      sw $t0, 16($s1)    5
        add  $t3, $t3, $s2      sw $t1, 12($s1)    6
        nop                     sw $t2, 8($s1)     7
        bne  $s1, $zero, Loop   sw $t3, 4($s1)     8

IPC = 14/8 = 1.75 (closer to 2, but at the cost of extra registers and code size) 14
Overall superscalar organization

Dynamic out-of-order (OoO) execution:
- Multiple instructions are fetched and decoded in parallel.
- The decoder checks for dependencies and renames registers to avoid WAW and WAR hazards.
- Instructions wait in a dispatch buffer until their operands are available (avoids RAW hazards).
- When ready, instructions are dispatched to the execution units.
- The re-order buffer puts instructions back in program order for write-back (WB). 15
Summary of superscalar architectures

Pros: improved single-thread throughput: hides memory latency, avoids or reduces stalls, and can fetch and execute multiple instructions per cycle.

Cons:
→ impacts silicon area
→ impacts power consumption
→ impacts design complexity

SW compilation techniques enable:
→ more ILP from the same HW
→ simpler HW (e.g., VLIW) at the expense of code portability

Conclusion: great single-thread performance, but at the expense of energy efficiency. 16
3. Multi-core processors

Moore's Law + ILP wall + power wall → multi-core processors
- Improves total throughput but not single-thread performance
- Applications need to be re-coded to use the parallelism 17
Communication using shared memory

SMP: shared-memory multiprocessor
- Hardware provides a single physical address space for all processors
- Shared memory is used for communication
- Memory access time: UMA (uniform) vs. NUMA (nonuniform) 18
Cache coherence problem

Suppose two CPU cores share a physical address space (write-through caches):

Time step   Event                 CPU A's cache   CPU B's cache   Memory
0                                                                 0
1           CPU A reads X         0                               0
2           CPU B reads X         0               0               0
3           CPU A writes 1 to X   1               0               1 19
Defining cache coherence

Informally: reads return the most recently written value.

Formally:
- P writes X, then P reads X (no intervening writes): the read returns the written value
- P1 writes X, then P2 reads X (sufficiently later): the read returns the written value (cf. CPU B reading X after step 3 in the example)
- P1 writes X and P2 writes X: all processors see the writes in the same order, ending up with the same final value for X 20
Maintaining cache coherence with a snooping protocol

A cache gets exclusive access to a block when it is to be written:
- Broadcasts an invalidate message on the bus
- A subsequent read in another cache misses
- The owning cache supplies the updated value

CPU activity          Bus activity        CPU A's cache   CPU B's cache   Memory
                                                                          0
CPU A reads X         Cache miss for X    0                               0
CPU B reads X         Cache miss for X    0               0               0
CPU A writes 1 to X   Invalidate for X    1                               0
CPU B reads X         Cache miss for X    1               1               1 21
Example: Core i7

- Introduced in 2008; quad-core processor
- Each core has a 4-wide superscalar out-of-order pipeline
- Supports integer/FP SIMD instructions
- 64 KB L1 cache per core (32 KB I-cache and 32 KB D-cache), 256 KB L2 cache per core, shared 4-12 MB L3 cache
- Frequency depends on the grading or binning 22
Summary

Intel Core i7   AMD A10   IBM Power 8

1. SIMD exploits data-level parallelism → needs compiler intervention to explicitly use vector instructions.
2. Superscalar exploits instruction-level parallelism → no need for recompilation, though compiler help improves performance and reduces pressure on the HW.
3. Multi-core exploits thread-level parallelism → needs programmer intervention to re-write the code. 23