An Introduction to Parallel Architectures Andrea Marongiu a.marongiu@unibo.it
Impact of Parallel Architectures From cell phones to supercomputers In regular CPUs as well as GPUs
Parallel HW Processing Why? Which types? Which novel issues?
Why Multicores? The SPECint performance of the hottest chip grew by 52% per year from 1986 to 2002, and then grew only 20% in total over the next three years (about 6% per year). [from Patterson & Hennessy] Diminishing returns from uniprocessor designs
[from Patterson & Hennessy] Power Wall The design goal for the late 1990s and early 2000s was to drive the clock rate up by adding more transistors to a smaller chip. Unfortunately, this increased the power dissipation of the CPU chip beyond the capacity of inexpensive cooling techniques
Roadmap for CPU Clock Speed: Circa 2005 [from Patterson & Hennessy] Here is the result of the best thought in 2005: by 2015, the clock speed of the top hot chip would be in the 20-25 GHz range.
The CPU Clock Speed Roadmap (A Few Revisions Later) [from Patterson & Hennessy] This reflects the practical experience gained with dense chips that were literally hot: they radiated considerable thermal power and were difficult to cool. Law of physics: all electrical power consumed is eventually radiated as heat.
The Multicore Approach Multiple cores on the same chip: simpler, slower, less power demanding
Transition to Multicore Intel Xeon (18 cores) Sequential App Performance 5/11/2017 Spring 2011 Lecture #15
Flynn Taxonomy of parallel computers

                                  Data Streams
                                  Single    Parallel
Instruction Streams    Single     SISD      SIMD
                       Multiple   MISD      MIMD
Alternative Kinds of Parallelism: Single Instruction/Single Data Stream Single Instruction, Single Data stream (SISD): a sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
Alternative Kinds of Parallelism: Multiple Instruction/Single Data Stream Multiple Instruction, Single Data streams (MISD): a computer that exploits multiple instruction streams against a single data stream, for data operations that can be naturally parallelized; for example, certain kinds of array processors. No longer commonly encountered, mainly of historical interest only
Alternative Kinds of Parallelism: Single Instruction/Multiple Data Stream Single Instruction, Multiple Data streams (SIMD): a computer that exploits multiple data streams against a single instruction stream, for operations that may be naturally parallelized, e.g., SIMD instruction extensions or Graphics Processing Units (GPUs)
Alternative Kinds of Parallelism: Multiple Instruction/Multiple Data Streams Multiple Instruction, Multiple Data streams (MIMD): multiple autonomous processors simultaneously executing different instructions on different data. MIMD architectures include multicore processors and Warehouse-Scale Computers (datacenters)
Flynn Taxonomy of parallel computers

                                  Data Streams
                                  Single                   Parallel
Instruction Streams    Single     SISD: Intel Pentium 4    SIMD: SSE x86, ARM NEON, GPGPUs
                       Multiple   MISD: no examples today  MIMD: SMP (Intel, ARM, ...)

As of 2011, SIMD and MIMD are the most common parallel computers. Most common parallel processing programming style: Single Program Multiple Data (SPMD): a single program runs on all processors of an MIMD, with cross-processor execution coordination through conditional expressions (thread parallelism). SIMD (aka HW-level data parallelism): specialized function units for handling lock-step calculations involving arrays; scientific computing, signal processing, multimedia (audio/video processing)
SIMD Architectures Data parallelism: executing one operation on multiple data streams. Single control unit; multiple datapaths (processing elements, PEs) running in parallel. PEs are interconnected and exchange/share data as directed by the control unit. Each PE performs the same operation on its own local data. Example to provide context: multiplying a coefficient vector by a data vector (e.g., in filtering): y[i] := c[i] * x[i], 0 <= i < n
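The filtering example above can be sketched in plain C (the function and array names below are illustrative, not from the slides). Every iteration applies the same multiply to independent elements, which is exactly the pattern a compiler with auto-vectorization (e.g., -O3) can map onto SIMD units:

```c
#include <stddef.h>

/* Element-wise product y[i] = c[i] * x[i], 0 <= i < n.
   One operation applied uniformly across all data elements:
   the data-parallel pattern that SIMD hardware accelerates. */
void vec_mul(float *y, const float *c, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = c[i] * x[i];
}
```

Because the iterations are independent (no loop-carried dependence), a 4-wide SIMD unit can process four elements per instruction.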
Advanced Digital Media Boost To improve performance, SIMD instructions fetch one instruction and do the work of multiple instructions
Example: SIMD Array Processing

for each f in array
    f = sqrt(f)

SISD:
for each f in array {
    load f to the floating-point register
    calculate the square root
    write the result from the register to memory
}

SIMD:
for each 4 members in array {
    load 4 members to the SIMD register
    calculate 4 square roots in one operation
    write the result from the register to memory
}
Data Level Parallelism and SIMD SIMD wants adjacent values in memory that can be operated on in parallel. Usually specified in programs as loops:

for(i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

How can we reveal more data-level parallelism than is available in a single iteration of a loop? Unroll the loop and adjust the iteration rate
Loop Unrolling Loop unrolling transforms the C code

for(i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

into

for(i=1000; i>0; i=i-4) {
    x[ i ] = x[ i ] + s;
    x[i-1] = x[i-1] + s;
    x[i-2] = x[i-2] + s;
    x[i-3] = x[i-3] + s;
}
Loop Unrolling (MIPS) Assumptions: R1 initially holds the address of the array element with the highest address; F2 contains the scalar value s; 8(R2) is the address of the last element to operate on.

for(i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

Loop: 1. l.d   F0,0(R1)    ; F0 = array element
      2. add.d F4,F0,F2    ; add s to F0
      3. s.d   F4,0(R1)    ; store result
      4. addui R1,R1,#-8   ; decrement pointer by 8 bytes
      5. bne   R1,R2,Loop  ; repeat loop if R1 != R2
Loop Unrolled

Loop: l.d   F0,0(R1)
      add.d F4,F0,F2
      s.d   F4,0(R1)
      l.d   F6,-8(R1)
      add.d F8,F6,F2
      s.d   F8,-8(R1)
      l.d   F10,-16(R1)
      add.d F12,F10,F2
      s.d   F12,-16(R1)
      l.d   F14,-24(R1)
      add.d F16,F14,F2
      s.d   F16,-24(R1)
      addui R1,R1,#-32
      bne   R1,R2,Loop

NOTE:
1. Different registers eliminate stalls
2. Only 1 loop overhead every 4 iterations
3. This unrolling works if loop_limit mod 4 = 0
Loop Unrolled Scheduled

Loop: l.d   F0,0(R1)
      l.d   F6,-8(R1)
      l.d   F10,-16(R1)
      l.d   F14,-24(R1)
      add.d F4,F0,F2
      add.d F8,F6,F2
      add.d F12,F10,F2
      add.d F16,F14,F2
      s.d   F4,0(R1)
      s.d   F8,-8(R1)
      s.d   F12,-16(R1)
      s.d   F16,-24(R1)
      addui R1,R1,#-32
      bne   R1,R2,Loop

4 loads side by side: could replace with 4-wide SIMD load
4 adds side by side: could replace with 4-wide SIMD add
4 stores side by side: could replace with 4-wide SIMD store
Generalizing Loop Unrolling Take a loop of n iterations and make k copies of the body. Then run the loop with 1 copy of the body n mod k times, and with k copies of the body floor(n/k) times
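The generalization above can be sketched in C for k = 4 (function and variable names are illustrative): a cleanup loop handles the n mod k leftover iterations one at a time, then the unrolled body runs floor(n/k) times:

```c
#include <stddef.h>

/* Add scalar s to every element of x, unrolled by k = 4.
   First run the n mod 4 cleanup iterations singly, then
   process floor(n/4) groups with 4 copies of the body per
   trip, paying the loop overhead once per 4 iterations. */
void add_scalar_unrolled(double *x, size_t n, double s)
{
    size_t i = 0;
    size_t rem = n % 4;          /* n mod k cleanup iterations */
    for (; i < rem; i++)
        x[i] += s;
    for (; i < n; i += 4) {      /* floor(n/k) unrolled trips */
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
}
```

For n = 7 and k = 4 this runs the single body 3 times (7 mod 4) and the unrolled body once (floor(7/4)), touching every element exactly once.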
MIMD Architectures Multicore architectures: at least 2 processors interconnected via a communication network. Abstractions (HW/SW interface); organizational structure to realize the abstraction efficiently. [ACA, H. Corporaal]
MIMD Architectures Thread-level parallelism: have multiple program counters. Targeted for tightly coupled shared-memory multiprocessors. For n processors, need n threads. Amount of computation assigned to each thread = grain size. Threads can be used for data-level parallelism, but the overheads may outweigh the benefit. Copyright 2012, Elsevier Inc. All rights reserved.
MIMD Architectures And what about memory? How is data to be accessed by multiple cores? How should the memory system be designed accordingly? How do traditional solutions from uniprocessor systems affect multicores?
Recall: The Memory Gap [Figure: CPU vs. memory performance, 1980-2005; H&P Fig. 5.2] Bottom line: memory access is increasingly expensive, and computer architects must devise new ways of hiding this cost
The memory wall How do multicores attack the memory wall? Same issue as uniprocessor systems: by building on-chip cache hierarchies
Memory Hierarchy
The memory wall How do multicores attack the memory wall? Same issue as uniprocessor systems: by building on-chip cache hierarchies. What novel issues arise? Coherence, consistency, scalability
COMPUTING SYSTEMS PERFORMANCE Chapter 1 Computer Abstractions and Technology
Defining Performance Which airplane has the best performance? [Figure: passenger capacity, cruising range (miles), cruising speed (mph), and passengers x mph for the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50]
Response Time and Throughput Response time: how long it takes to do a task. Throughput: total work done per unit time, e.g., tasks/transactions per hour. How are response time and throughput affected by replacing the processor with a faster version? By adding more processors? We'll focus on response time for now
Relative Performance Define Performance = 1/Execution Time. "X is n times faster than Y" means

Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

Example: time taken to run a program is 10s on A, 15s on B. Execution Time_B / Execution Time_A = 15s / 10s = 1.5, so A is 1.5 times faster than B
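The ratio above can be captured as a tiny helper (a minimal sketch; the function name is illustrative), which makes the direction of the ratio explicit, since swapping the two times is a classic mistake:

```c
/* Relative performance of X over Y:
   Perf_X / Perf_Y = ExecTime_Y / ExecTime_X = n
   ("X is n times faster than Y"). */
double speedup(double time_y, double time_x)
{
    return time_y / time_x;
}
```

With the slide's numbers, speedup(15.0, 10.0) gives 1.5: A (10s) is 1.5 times faster than B (15s).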
Measuring Execution Time Elapsed time: total response time, including all aspects (processing, I/O, OS overhead, idle time); determines system performance. CPU time: time spent processing a given job; discounts I/O time and other jobs' shares; comprises user CPU time and system CPU time. Different programs are affected differently by CPU and system performance
CPU Clocking Operation of digital hardware is governed by a constant-rate clock: data transfer and computation, then state update, each cycle. Clock period: duration of a clock cycle, e.g., 250ps = 0.25ns = 250 x 10^-12 s. Clock frequency (rate): cycles per second, e.g., 4.0GHz = 4000MHz = 4.0 x 10^9 Hz
CPU Time

CPU Time = CPU Clock Cycles x Clock Cycle Time = CPU Clock Cycles / Clock Rate

Performance improved by reducing the number of clock cycles or increasing the clock rate. Hardware designers must often trade off clock rate against cycle count
CPU Time Example Computer A: 2GHz clock, 10s CPU time. Designing Computer B: aim for 6s CPU time; can use a faster clock, but it causes 1.2 x the clock cycles. How fast must Computer B's clock be?

Clock Cycles_A = CPU Time_A x Clock Rate_A = 10s x 2GHz = 20 x 10^9
Clock Rate_B = Clock Cycles_B / CPU Time_B = 1.2 x Clock Cycles_A / 6s
             = 1.2 x 20 x 10^9 / 6s = 24 x 10^9 / 6s = 4GHz
Instruction Count and CPI

Clock Cycles = Instruction Count x Cycles per Instruction (CPI)
CPU Time = Instruction Count x CPI x Clock Cycle Time
         = Instruction Count x CPI / Clock Rate

Instruction count for a program: determined by program, ISA and compiler. Average cycles per instruction: determined by CPU hardware. If different instructions have different CPI, the average CPI is affected by the instruction mix
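The fundamental equation above can be written directly as code (a sketch; the function name is illustrative):

```c
/* CPU Time = Instruction Count x CPI / Clock Rate.
   clock_rate_hz is in Hz, so the result is in seconds. */
double cpu_time(double instruction_count, double cpi, double clock_rate_hz)
{
    return instruction_count * cpi / clock_rate_hz;
}
```

For example, 10^9 instructions at CPI 2.0 on a 2GHz clock take 10^9 x 2.0 / (2 x 10^9) = 1 second.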
CPI Example Computer A: cycle time = 250ps, CPI = 2.0. Computer B: cycle time = 500ps, CPI = 1.2. Same ISA. Which is faster, and by how much?

CPU Time_A = Instruction Count x CPI_A x Cycle Time_A = I x 2.0 x 250ps = I x 500ps
CPU Time_B = Instruction Count x CPI_B x Cycle Time_B = I x 1.2 x 500ps = I x 600ps
CPU Time_B / CPU Time_A = (I x 600ps) / (I x 500ps) = 1.2

A is faster by 1.2x
CPI in More Detail If different instruction classes take different numbers of cycles:

Clock Cycles = sum_{i=1..n} (CPI_i x Instruction Count_i)

Weighted average CPI:

CPI = Clock Cycles / Instruction Count
    = sum_{i=1..n} CPI_i x (Instruction Count_i / Instruction Count)

where Instruction Count_i / Instruction Count is the relative frequency of class i
CPI Example Alternative compiled code sequences using instructions in classes A, B, C:

Class             A  B  C
CPI for class     1  2  3
IC in sequence 1  2  1  2
IC in sequence 2  4  1  1

Sequence 1: IC = 5; Clock Cycles = 2x1 + 1x2 + 2x3 = 10; Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6; Clock Cycles = 4x1 + 1x2 + 1x3 = 9; Avg. CPI = 9/6 = 1.5
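The weighted-average CPI computation can be sketched in C (function and parameter names are illustrative):

```c
/* Weighted-average CPI = sum(CPI_i x IC_i) / sum(IC_i),
   where i ranges over the instruction classes. */
double avg_cpi(const double *cpi, const long *ic, int nclasses)
{
    double cycles = 0.0;   /* total clock cycles */
    long total_ic = 0;     /* total instruction count */
    for (int i = 0; i < nclasses; i++) {
        cycles += cpi[i] * (double)ic[i];
        total_ic += ic[i];
    }
    return cycles / (double)total_ic;
}
```

Feeding it the table above reproduces the two results: sequence 1 yields 10/5 = 2.0 and sequence 2 yields 9/6 = 1.5, so the longer sequence actually finishes in fewer cycles.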
Performance Summary The BIG Picture:

CPU Time = (Instructions / Program) x (Clock cycles / Instruction) x (Seconds / Clock cycle)

Performance depends on: algorithm (affects IC, possibly CPI); programming language (affects IC, CPI); compiler (affects IC, CPI); instruction set architecture (affects IC, CPI, Tc)
Power In CMOS IC technology:

Pdyn = Capacitive load x Voltage^2 x Frequency

The Power Wall: over roughly 30 years, clock rates grew about 1000x while supply voltage fell from 5V to 1V. Dynamic power is dominant when active. Static power: Pstat = Voltage x I_static. Total power: P = Pstat + Pdyn
Reducing Power Suppose a new CPU has 85% of the capacitive load of the old CPU, with 15% voltage and 15% frequency reduction:

P_new / P_old = (C_old x 0.85) x (V_old x 0.85)^2 x (F_old x 0.85)
              / (C_old x V_old^2 x F_old) = 0.85^4 = 0.52

The power wall: we can't reduce voltage further, and we can't remove more heat. How else can we improve performance?
Multiprocessors Multicore microprocessors: more than one processor per chip. Requires explicitly parallel programming. Compare with instruction-level parallelism, where the hardware executes multiple instructions at once, hidden from the programmer. Parallel programming is hard to do well: programming for performance, load balancing, optimizing communication and synchronization
Pitfall: Amdahl's Law Improving an aspect of a computer and expecting a proportional improvement in overall performance:

T_improved = T_affected / improvement factor + T_unaffected

Example: multiply accounts for 80s out of 100s total. How much improvement in multiply performance is needed to get a 5x overall speedup?

20 = 80/n + 20

Can't be done! Corollary: make the common case fast
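Amdahl's Law as stated above can be sketched directly (function and parameter names are illustrative):

```c
/* Amdahl's Law: improved execution time when only part
   of the workload benefits from the improvement.
   T_improved = T_affected / factor + T_unaffected. */
double amdahl_time(double t_affected, double t_unaffected, double factor)
{
    return t_affected / factor + t_unaffected;
}
```

For the slide's example (80s of multiply, 20s of other work), even a 4x multiply speedup only gets 80/4 + 20 = 40s; as the factor grows without bound, the time approaches the 20s floor, so a 5x overall speedup (100s down to 20s) is unattainable.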
Fallacy: Low Power at Idle i7 power benchmark: at 100% load, 258W; at 50% load, 170W (66%); at 10% load, 121W (47%). Consider designing processors to make power proportional to load. However, today P = Pdyn + Pstat, where Pstat is due to leakage power and residual activity. If Pstat > 0, be careful with extreme parallelism!
Exercise Processor core, IPC = 1: @1.5V, Pdyn = 0.1W, Pstat = 0.01W, f = 0.5GHz; @1.0V, f = 0.4GHz. Uncore: @1.5V, Pdyn + Pstat = 0.1W. Compare 2-, 4-, 8- and 16-core architectures against a 1-core architecture for running an application of 100 Ginstructions with a 50% sequential part, a 10% sequential part, and a 1% sequential part. Compute power and energy