CS 590: High Performance Computing
Parallel Computer Architectures

Lab 1 starts today. It is already posted on Canvas (under Assignment). Let's look at it.

Fengguang Song
Department of Computer Science, IUPUI

Introduction to Parallel Processing
- Multiprocessor machine: a computer system with at least two processors (vs. a uniprocessor)
- Goal: connect multiple computers to get higher performance and to improve scalability, availability, and power efficiency (multicore era)
  - Type 1: high throughput for independent jobs
  - Type 2: a single program that runs on multiple processors (more difficult)
- Cluster: a set of computers connected over a local area network
  - Can serve as search engines, web servers, databases, etc.
- Multicore microprocessor: a CPU containing multiple cores in a single chip/die/socket

Today's Status
- All CPUs today are multicore
- The number of cores is expected to keep increasing
  - We expect to see 2 additional cores per chip every two years
- All machines are SMPs: shared memory multiprocessors
- Any programmer who cares about performance must become a parallel programmer
  - Before 2004, you didn't have to. Now, sequential programs are slow.
- Unfortunately, no easy software or language is available for writing parallel programs that are both correct and fast

Parallel Programming
- The difficulty of parallelism is NOT the hardware; parallel software is the problem
- It is difficult to use multiple processors to complete one task faster
  - You hope to get a significant performance improvement
  - Otherwise, simply use a faster uniprocessor, since that is easy
- Difficulties:
  - Partitioning the problem (there are too many ways to partition it)
  - Coordination
  - Communication overhead
  - Load balancing
  - Data locality

What is Speedup
- Number of cores = $p$
- Serial run-time = $T_{\text{serial}}$
- Parallel run-time = $T_{\text{parallel}}$
- Speedup: $S = \dfrac{T_{\text{serial}}}{T_{\text{parallel}}}$
- Linear speedup (the ideal case): $T_{\text{parallel}} = T_{\text{serial}} / p$, so $S = p$

What is Parallel Efficiency of a Program
$E = \dfrac{S}{p} = \dfrac{T_{\text{serial}} / T_{\text{parallel}}}{p} = \dfrac{T_{\text{serial}}}{p \cdot T_{\text{parallel}}}$

[Table: An Example of Speedup and Efficiency]
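The speedup and efficiency definitions above can be written as two small C helpers. This is only a sketch: the function names `speedup` and `efficiency` are chosen here for illustration and do not come from the slides.

```c
#include <assert.h>

/* Speedup S = T_serial / T_parallel. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Efficiency E = S / p = T_serial / (p * T_parallel). */
double efficiency(double t_serial, double t_parallel, int p)
{
    assert(p > 0);
    return speedup(t_serial, t_parallel) / p;
}
```

With made-up timings of 24 s serial and 4 s on 8 cores, `speedup(24.0, 4.0)` gives 6 and `efficiency(24.0, 4.0, 8)` gives 0.75, i.e. each core is doing useful work about 75% of the time.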

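[Figure: Efficiencies of a parallel program on different problem sizes]

Amdahl's Law
- $S = \dfrac{1}{F_s + F_p / P} \to \dfrac{1}{F_s}$ as $P \to \infty$, where $F_s$ is the sequential fraction, $F_p = 1 - F_s$ is the parallelizable fraction, and $P$ is the number of processors
- The sequential part of your program limits the speedup of your program on parallel computers

Question: with 100 processors, how do we get a speedup of 90?
- $T_{\text{old}} = T_{\text{parallelizable}} + T_{\text{sequential}}$
- $T_{\text{new}} = T_{\text{parallelizable}} / P + T_{\text{sequential}}$
- $\text{Speedup} = \dfrac{1}{(1 - F_{\text{parallelizable}}) + F_{\text{parallelizable}} / P} = 90$ with $P = 100$
- Solving: $F_{\text{parallelizable}} \approx 99.9\%$
- So the sequential part must be at most about 0.1% of the original time
- Yes, there are such applications with plenty of parallelism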
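To check the numbers in the question above, Amdahl's Law can be coded directly and also solved for the parallel fraction. The sketch below assumes the fractions are measured against the original single-processor time; the function names are illustrative, not part of the slides.

```c
#include <assert.h>

/* Amdahl's Law: speedup on p processors when a fraction f_par of the
   original run time is parallelizable. */
double amdahl_speedup(double f_par, int p)
{
    assert(f_par >= 0.0 && f_par <= 1.0 && p > 0);
    return 1.0 / ((1.0 - f_par) + f_par / p);
}

/* Upper bound on the speedup as p grows without limit: 1 / F_s.
   Requires f_par < 1 so that the sequential fraction is nonzero. */
double amdahl_limit(double f_par)
{
    assert(f_par >= 0.0 && f_par < 1.0);
    return 1.0 / (1.0 - f_par);
}

/* Parallel fraction needed to reach a target speedup on p processors,
   obtained by solving 1 / ((1 - f) + f / p) = target for f. */
double required_parallel_fraction(double target, int p)
{
    assert(target >= 1.0 && p > 1 && target <= p);
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / p);
}
```

`required_parallel_fraction(90.0, 100)` comes out at roughly 0.9989, which matches the 99.9% on the slide, and feeding that value back into `amdahl_speedup` with 100 processors returns 90.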

Another Example of Amdahl's Law
- Example workload: add 10 scalars, then sum two 10 × 10 matrices
  - Assume adding the scalars cannot benefit from parallelism, but the matrix sum can
- Q: What are the speedups with 10 and with 100 processors?
- On a single processor: Time = $(10 + 100)\, t_{\text{add}} = 110\, t_{\text{add}}$
- 10 processors:
  - Time = $10\, t_{\text{add}} + (100/10)\, t_{\text{add}} = 20\, t_{\text{add}}$
  - Speedup = 110/20 = 5.5 (efficiency = 55%)
- 100 processors:
  - Time = $10\, t_{\text{add}} + (100/100)\, t_{\text{add}} = 11\, t_{\text{add}}$
  - Speedup = 110/11 = 10 (efficiency = 10%)

Strong Scaling vs Weak Scaling
- Strong scaling: the problem size is fixed
  - As shown in the first example
- Weak scaling: the problem size is proportional to the number of processors
  - 10 processors, 10 × 10 matrix (i.e., the original size), 10 elements per processor
    Time = $10\, t_{\text{add}} + (100/10)\, t_{\text{add}} = 20\, t_{\text{add}}$
  - 100 processors, 32 × 32 matrix (32 × 32 = 1024, about 1000 elements, since $\sqrt{1000} \approx 31.6$)
    Time = $10\, t_{\text{add}} + (1000/100)\, t_{\text{add}} = 20\, t_{\text{add}}$
  - Constant execution time in this example
- Most often, people solve bigger problems on bigger computers
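The strong-scaling and weak-scaling arithmetic above follows one simple cost model, sketched here in C. It assumes, as the example does, that the scalar additions run serially and the matrix additions divide evenly across processors; `run_time` and `strong_scaling_speedup` are names invented for this sketch.

```c
#include <assert.h>

/* Run time in units of t_add: serial_adds additions cannot be
   parallelized, parallel_adds additions are split evenly over p processors. */
double run_time(double serial_adds, double parallel_adds, int p)
{
    assert(p > 0);
    return serial_adds + parallel_adds / p;
}

/* Strong scaling: the same problem on 1 processor versus on p processors. */
double strong_scaling_speedup(double serial_adds, double parallel_adds, int p)
{
    return run_time(serial_adds, parallel_adds, 1) /
           run_time(serial_adds, parallel_adds, p);
}
```

For strong scaling, `strong_scaling_speedup(10, 100, 10)` is 5.5 and `strong_scaling_speedup(10, 100, 100)` is 10. For weak scaling, `run_time(10, 100, 10)` and `run_time(10, 1000, 100)` both evaluate to 20, which is the constant execution time noted on the slide.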

Load Balancing
- In the previous examples, we assumed that the workload was perfectly balanced!
- Example: 100 × 100 matrix, 100 processors
  - On one processor: Time = (10 + 100 × 100) t = 10010t
  - Balanced: Time = 10t + (100 × 100 / 100) t = 110t, so Speedup = 10010 / 110 ≈ 91× given 100 processors
- However, if one processor has 5% of the workload:
  - 5% × 10000t = 500t
  - The other 99 processors share 95% of the workload
  - Time = max(500t, 9500t / 99) + 10t = 510t
  - Speedup = 10010 / 510 ≈ only 20× given 100 processors

Flynn's Taxonomy
- SISD: single instruction stream, single data stream
- SIMD: single instruction stream, multiple data streams
- MISD: multiple instruction streams, single data stream
- MIMD: multiple instruction streams, multiple data streams
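The load-balancing example above boils down to "the slowest processor sets the parallel time". A minimal C sketch of that rule follows, assuming one overloaded processor and an even split among the rest; `imbalanced_time` and its parameters are hypothetical names.

```c
#include <assert.h>

/* Parallel run time (in units of t) when one processor receives the
   fraction heavy_share of the parallel work and the remaining p - 1
   processors split the rest evenly. The slowest processor determines
   the parallel phase; the serial work is added on top. */
double imbalanced_time(double serial_work, double parallel_work,
                       int p, double heavy_share)
{
    assert(p > 1 && heavy_share >= 0.0 && heavy_share <= 1.0);
    double heavy = heavy_share * parallel_work;
    double others = (1.0 - heavy_share) * parallel_work / (p - 1);
    double slowest = heavy > others ? heavy : others;
    return serial_work + slowest;
}
```

`imbalanced_time(10, 10000, 100, 0.05)` gives 510, so the speedup is 10010 / 510, a little under 20. With `heavy_share` set to 0.01 (a perfectly even split over 100 processors) the time drops back to 110.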

SIMD
- Parallelism achieved by dividing data among the compute units.
- Applies the same instruction to multiple data items.
- Also called data parallelism.

SIMD Example
[Figure: one control unit, n ALUs (ALU 1, ALU 2, ..., ALU n), and n data items x[1], x[2], ..., x[n]]

    for (i = 0; i < n; i++)
        x[i] += val;

SIMD
- What if we don't have as many ALUs as data items?
- Divide the work and process iteratively.
- Ex. m = 4 ALUs and n = 15 data items.

    Round   ALU 1   ALU 2   ALU 3   ALU 4
    1       X[0]    X[1]    X[2]    X[3]
    2       X[4]    X[5]    X[6]    X[7]
    3       X[8]    X[9]    X[10]   X[11]
    4       X[12]   X[13]   X[14]   (idle)

SIMD Drawbacks
- All ALUs are required to execute the same instruction, or remain idle.
- They must also operate synchronously.
- Efficient for large data-parallel problems, but not for other, more complex types of parallel problems.
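The round-by-round schedule above can be imitated in plain C. This is a scalar simulation only, not real SIMD hardware or intrinsics: the inner loop stands in for the m ALUs that would execute the same add simultaneously. The function names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Number of rounds needed to process n items on m ALUs: ceil(n / m). */
size_t simd_rounds(size_t n, size_t m)
{
    assert(m > 0);
    return (n + m - 1) / m;
}

/* Adds val to every element of x, m elements per round. Each pass of
   the inner loop plays the role of one ALU; ALUs whose index falls past
   the end of x stay idle in the last round. Returns the rounds used. */
size_t add_scalar_in_rounds(double *x, size_t n, double val, size_t m)
{
    assert(m > 0);
    size_t rounds = 0;
    for (size_t base = 0; base < n; base += m) {
        for (size_t lane = 0; lane < m; lane++) {
            size_t i = base + lane;
            if (i < n)
                x[i] += val;   /* active ALU */
            /* otherwise this ALU is idle in this round */
        }
        rounds++;
    }
    return rounds;
}
```

For n = 15 and m = 4, both `simd_rounds(15, 4)` and the simulated loop report 4 rounds, with one ALU idle in the final round, as in the table.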

Hardware Multithreading
- Hardware multithreading (about one core, about ILP) vs. MIMD (create n threads running on n processors in parallel)
- Purpose: to increase resource utilization on a single core
- Performs multiple threads of execution in parallel
  - Has replicated registers and PC
  - Supports fast switching between threads
- 3 versions:
  - Fine-grain multithreading
    - Switch threads after each cycle
    - Interleave instruction execution (normally round-robin)
    - If one thread stalls, the others are executed
    - Con: an individual thread is delayed by the other threads' instructions
  - Coarse-grain multithreading
    - Only switch on a long pipeline stall (e.g., an L2-cache miss)
    - Simplifies hardware, but does not hide short stalls (e.g., data hazards)
  - SMT

Simultaneous Multithreading (SMT)
- Used in modern multiple-issue, dynamically scheduled processors
  - Can schedule instructions from multiple threads
  - No thread switching on every cycle
  - Instructions from independent threads execute whenever function units are available
  - Within a thread, dependencies are handled by scheduling and register renaming
- Example: Intel Pentium 4 HT
  - Two threads: duplicated registers, shared function units and caches

[Figure: A HW Multithreading Example]

MIMD
- Supports multiple simultaneous instruction streams operating on multiple data streams.
- Typically consists of a collection of fully independent processing units or cores, each of which has its own control unit and its own ALU.

Shared Memory System
- A collection of autonomous processors is connected to a memory system via an interconnection network.
- Each processor can access each memory location.
- The processors usually communicate implicitly by accessing shared data structures.

[Figure 2.3: Shared Memory System]

UMA Multicore System
- The time to access any memory location is the same for all the cores.
[Figure 2.5]

NUMA Multicore System
- A memory location that a core is directly connected to can be accessed faster than a memory location that must be accessed through another chip.
[Figure 2.6]

Distributed Memory System
- Clusters (most popular)
  - A collection of commodity systems.
  - Connected by a commodity interconnection network.
- Nodes of a cluster are individual computation units joined by a communication network.

[Figure 2.4: Distributed Memory System]

Intel CPUs
- Each tick improves the process technology (smaller nm; thus lower power and a faster clock)
- Each tock introduces new features and improved architectural performance

How to Decide a Computer's Peak Performance?

TABLE III: THEORETICAL PER-CYCLE PEAK FOR HASWELL AVX 2.0

                             SSE       SSE    SSE    AVX+FMA   AVX-128+FMA  AVX-128+FMA  AVX+FMA  AVX+FMA
                             (scalar)  (DP)   (SP)   (scalar)  (DP)         (SP)         (DP)     (SP)
    flop / operation         1         1      1      2         2            2            2        2
    operations / instruction 1         2      4      1         2            4            4        8
    instructions / cycle     2         2      2      2         2            2            2        2
    = flop / cycle           2         4      8      4         8            16           16       32

TABLE IV: THEORETICAL PER-NODE PEAK FOR E5-2695 V3 (Haswell-EP; $2424 as of Nov 2016)

                             SSE       SSE    SSE    AVX+FMA   AVX-128+FMA  AVX-128+FMA  AVX+FMA  AVX+FMA
                             (scalar)  (DP)   (SP)   (scalar)  (DP)         (SP)         (DP)     (SP)
    flop / cycle             2         4      8      4         8            16           16       32
    clock cycles / second    2.3G      2.3G   2.3G   2.3G      2.3G         2.3G         2.3G     2.3G
    cores / socket           14        14     14     14        14           14           14       14
    sockets / node           2         2      2      2         2            2            2        2
    = flops / node           128.8G    257.6G 515.2G 257.6G    515.2G       1030.4G      1030.4G  2060.8G

Paper to read: http://cs.iupui.edu/~fgsong/cs590hpc/how2decide_peak.pdf
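Each column of the two peak-performance tables above is a straight product of its rows, which the following C sketch makes explicit. The function names are invented for illustration; the input values in the usage note are the ones listed in the tables.

```c
#include <assert.h>

/* Theoretical per-cycle peak of one core, in flop per cycle:
   (flop / operation) * (operations / instruction) * (instructions / cycle). */
int flop_per_cycle(int flop_per_op, int ops_per_instr, int instr_per_cycle)
{
    assert(flop_per_op > 0 && ops_per_instr > 0 && instr_per_cycle > 0);
    return flop_per_op * ops_per_instr * instr_per_cycle;
}

/* Theoretical per-node peak, in flop per second:
   (flop / cycle) * (cycles / second) * (cores / socket) * (sockets / node). */
double node_peak_flops(int flop_cycle, double clock_hz,
                       int cores_per_socket, int sockets_per_node)
{
    assert(clock_hz > 0.0 && cores_per_socket > 0 && sockets_per_node > 0);
    return flop_cycle * clock_hz * cores_per_socket * sockets_per_node;
}
```

For the AVX+FMA (SP) column, `flop_per_cycle(2, 8, 2)` is 32 and `node_peak_flops(32, 2.3e9, 14, 2)` is about 2060.8 Gflop/s; for SSE (scalar), `flop_per_cycle(1, 1, 2)` is 2 and the node peak is about 128.8 Gflop/s.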

Intel Xeon vs Intel Xeon Phi vs Nvidia GPUs vs IBM Power
Source: https://www.xcelerit.com/computing-benchmarks/processors/