High Performance Computing in C and C++


1 High Performance Computing in C and C++ Rita Borgo Computer Science Department, Swansea University

2 Announcement No change in lecture schedule: the timetable remains the same: Monday 1 to 2, Glyndwr C; Friday 2 to 3, Glyndwr C. Substitute lecturer for week 3. Finish C. Coursework 1: C and HPC.

3 Summary Goals of HPC: Performance (definition, metrics, Top500 list), Scaling, Efficiency, Cost.

4 Today Computational models: parallel vs. distributed computing. Parallel architectures. Intro to parallel programming.

5 Assignment 1 Due date: November 2, 2012 at 11:00 AM. Three parts: 4 questions, one programming assignment. Coding conventions: TO BE FOLLOWED.

6 Assignment 1 Due date: November 2, 2012 at 11:00 AM. New College Policy: WARNING!! Late submissions get a ZERO!!

7 Course Administration All assignments are individual work (unless stated otherwise). Copying solutions is considered cheating. Submitted documents will be compared. Keep a copy of your listings to provide evidence of creative work. Unfair practice and plagiarism: University definition and procedure: andprogress/unfairpracticeprocedure/ School definition and procedure: refer to the Handbook.

8 COMPUTATIONAL MODELS

9 Two Types of HPC Parallel computing: breaking the problem to be computed into parts that can be run simultaneously on different processors. Example: a program to perform matrix multiplication. Solves tightly coupled problems. Distributed computing: parts of the work are computed in different places (note: this does not necessarily imply simultaneous processing). Example: running a workflow in a Grid. Solves loosely coupled problems (not much communication).

10 Architecture Types Shared memory: usually via threads; all processors can access all memory directly at any time. Distributed memory: a processor can access only its own memory, but processors can share data using message passing.

11 Architecture: Shared Memory Two variants: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA). [Diagram: in UMA, PE 0 ... PE n share a single memory through one interconnect; in NUMA, each group of PEs (up to PE (m-1)n+1 ... PE m*n) has its own memory module (Shared Memory 1 ... Shared Memory m), and the groups are linked by an interconnect.]

12 Architecture: Shared Memory Shared memory (uniform memory access - UMA): multiple CPUs, single memory, shared I/O. All resources in an SMP machine are equally available to each CPU. Processors share access to a common memory space, implemented over a shared memory bus or switch. Support for critical sections is required. Local cache is critical. [Diagram: PE 0 ... PE n connected through an interconnect to a single shared memory.]

13 Shared Memory - UMA Why is local cache critical? Without it, bus/switch contention (or network traffic) reduces the system's efficiency. For this reason, uniform memory access systems do not scale well. Cache introduces problems of coherency (ensuring that caches are updated when other processors alter the memory).

14 Shared Memory - NUMA Shared memory (non-uniform memory access - NUMA): multiple CPUs; each CPU has fast access to its local area of the memory but slower access to other areas. Scales well to a large number of processors. Complicated memory access pattern. Global address space. [Diagram: each memory module (Shared Memory 1 ... Shared Memory m) is local to a group of PEs, and all groups are linked by an interconnect.]

15 Distributed Memory Each processor has its own local memory. Data exchange/sharing is done through explicit communication: message passing (e.g. the MPI library). Larger latencies between processors. Scalability is good if the task to be computed can be divided properly. [Diagram: PE 0 with memory M 0 ... PE n with memory M n, connected by an interconnect.]

16 Why the Schism? Problems whose parts are completely separated from and independent of one another are trivial to parallelize/distribute. But most interesting problems have some irreducible interaction between their parts. The two memory models, or computing paradigms, encourage two very different ways of handling those interactions.

17 Basic Issues Two processes: Alice and Bob. Alice's task: add two numbers; the first number is her own. Bob's task: provide the second number. Three possible scenarios. Whiteboard: shared memory.

18 The Shared Memory Lucky Example

19 The Shared Memory Unlucky Example 1

20 The Shared Memory Unlucky Example 2
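
These lucky/unlucky scenarios were worked through on the whiteboard. As an illustration only (not from the slides), a minimal POSIX-threads sketch of the same exchange with no synchronization shows why the outcome depends on timing; the shared variable plays the role of the whiteboard:

#include <stdio.h>
#include <pthread.h>

int whiteboard = 0;                 /* shared "whiteboard", initially stale */

void *bob(void *arg) {              /* Bob: provide the second number */
    whiteboard = 7;
    return NULL;
}

void *alice(void *arg) {            /* Alice: add her own number to Bob's */
    int my_number = 35;
    /* Data race, shown only to illustrate the problem: a lucky run reads 7
       and prints 42; an unlucky run reads the stale 0 and prints 35. */
    printf("sum = %d\n", my_number + whiteboard);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&b, NULL, bob, NULL);
    pthread_create(&a, NULL, alice, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}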

21 Shared Memory: now what? How do you solve it? Locking mechanism. Semaphore. Synchronization MUST be guaranteed! However, results are non-deterministic...
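
A minimal sketch of one way to guarantee that synchronization, assuming POSIX threads and an unnamed POSIX semaphore (the slide names the mechanisms, not this code): Bob posts the semaphore only after writing, so Alice can never read the whiteboard too early.

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

int whiteboard;                     /* shared second number */
sem_t ready;                        /* signalled once Bob has written */

void *bob(void *arg) {
    whiteboard = 7;                 /* write the value first ...        */
    sem_post(&ready);               /* ... then signal that it is ready */
    return NULL;
}

void *alice(void *arg) {
    int my_number = 35;
    sem_wait(&ready);               /* block until Bob has posted       */
    printf("sum = %d\n", my_number + whiteboard);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    sem_init(&ready, 0, 0);         /* process-private, initial count 0 */
    pthread_create(&a, NULL, alice, NULL);
    pthread_create(&b, NULL, bob, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    sem_destroy(&ready);
    return 0;
}

Compile and link with gcc -pthread. With a plain lock and no ordering, the result would still depend on which thread wins the race; the semaphore adds the ordering that makes the sum deterministic.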

22 Distributed Memory Message Passing
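
A minimal message-passing sketch of the same exchange (an illustration, assuming an MPI installation is available; the slide only names message passing): Bob is rank 1 and sends the second number to Alice, rank 0, who adds it to her own.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, number;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                 /* Bob: provide the second number */
        number = 7;
        MPI_Send(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {          /* Alice: receive it and add her own */
        int my_number = 35;
        MPI_Recv(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("sum = %d\n", my_number + number);
    }

    MPI_Finalize();
    return 0;
}

Built with mpicc and run with mpirun -np 2, each rank has its own private memory; the only shared data is what is explicitly sent.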

23 HPC Strategies Good performance and scalability: how well we can obtain solutions faster or solve larger problems. Solutions are not created equal; in parallelization: luck, increase data locality, reduce dependencies, amortize system overheads.

24 Simple Case Study: test 1

#include <stdio.h>
#include <time.h>
#define N 2048

float x[N], y[N], A[N][N];

int main(void) {
    int i, j, irepeat;
    float total;
    clock_t c1, c2;
    c1 = clock();
    for (irepeat = 0; irepeat < 5; irepeat++) {
        for (i = 0; i < N; i++)          /* row index in the outer loop */
            for (j = 0; j < N; j++)
                y[i] = y[i] + A[i][j]*x[j];
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                x[i] = x[i] + A[i][j]*y[j];
    }
    c2 = clock();
    total = (c2 - c1)*1000.f/CLOCKS_PER_SEC;
    printf("total time = %6.2f milliseconds\n", total);
    return 0;
}

25 Simple Case Study: test 2

#include <stdio.h>
#include <time.h>
#define N 2048

float x[N], y[N], A[N][N];

int main(void) {
    int i, j, irepeat;
    float total;
    clock_t c1, c2;
    c1 = clock();
    for (irepeat = 0; irepeat < 5; irepeat++) {
        for (j = 0; j < N; j++)          /* column index in the outer loop */
            for (i = 0; i < N; i++)
                y[i] = y[i] + A[i][j]*x[j];
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                x[i] = x[i] + A[i][j]*y[j];
    }
    c2 = clock();
    total = (c2 - c1)*1000.f/CLOCKS_PER_SEC;
    printf("total time = %6.2f milliseconds\n", total);
    return 0;
}

26 Simple Case Study

$> gcc -o test1 test1.c
$> test1
total time taken = milliseconds
$> gcc -o test2 test2.c
$> test2
total time taken = milliseconds

27 Data locality Data are stored in a linear address space. The storage pattern makes no difference to program correctness, but it is important for performance. Access data according to how it is stored: local data: follow the language's native layout (C uses row-major order); remote data: reduce the need to move data between nodes.
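
As a small illustration (not part of the original case study) of the row-major rule behind test 1 vs. test 2: C lays out A row after row, so A[i][j] sits i*N + j elements from the start of the array, and stepping j in the inner loop walks memory contiguously while stepping i jumps N elements at a time.

#include <stdio.h>

#define N 4

int main(void) {
    float A[N][N];
    float *base = &A[0][0];
    int i = 2, j = 3;
    /* Row-major layout: &A[i][j] == base + (i*N + j). */
    printf("offset of A[%d][%d] = %td (i*N + j = %d)\n",
           i, j, &A[i][j] - base, i * N + j);
    return 0;
}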

28 PARALLEL PROCESSING & ARCHITECTURES

29 Parallel Processing Concurrent use of multiple processors to process data, either by running the same program on each processor or by running different programs on each processor. Parallel processing may occur in the instruction stream, the data stream, or both. The sequence of instructions read from memory is the instruction stream. The operations performed on the data in the processor form the data stream.

30 Types of Parallelism Pipeline parallelism: instruction stream. Data parallelism: data stream. Task parallelism: more complex, interdependencies cannot be avoided, implies data sharing.

31 Architectural Classification Flynn's Classification (1972): four classes, based on the multiplicity of instruction streams and data streams. Instruction stream: the sequence of instructions read from memory. Data stream: the operations performed on the data in the processor.

                          Single instruction stream    Multiple instruction streams
Single data stream        SISD                         MISD
Multiple data streams     SIMD                         MIMD

32 Flynn's Taxonomy Classes: SISD - Single Instruction, Single Data. Instructions are executed sequentially on a single stream of data in a single memory unit. Classic von Neumann architecture. Machines may still consist of multiple (unicore) processors operating on independent data; these can be considered as multiple SISD systems. Examples: scalar and superscalar processors. Superscalar processors exploit instruction-level parallelism (more than one instruction per clock cycle). Example machines: UNIVAC 1, Cray-1.

33 Flynn's Taxonomy Classes: SIMD - Single Instruction, Multiple Data. A single instruction stream (broadcast to all processors) acting on multiple data. The most common form of this architecture class is the vector processor, which can deliver results several times faster than a scalar processor. Example: GPUs. Limitations: not all algorithms can be vectorized; algorithm implementation is tricky; no compiler support; architecture specific. Example machine: Cray SMP.

34 Flynn's Taxonomy Classes: MISD - Multiple Instruction, Single Data. No practical implementations of this architecture.

35 Flynn's Taxonomy Classes: MISD - Multiple Instruction, Single Data. No practical implementations of this architecture. MIMD - Multiple Instruction, Multiple Data. Multiple instruction streams acting on different data. Most common HPC systems. Can be either shared- or distributed-memory MIMD. Multi-core superscalar processors are classified as MIMD. Example machines: Cray T3, IBM BG/L.

36 Flynn's Taxonomy Classes: SISD - Single Instruction, Single Data; SIMD - Single Instruction, Multiple Data; MISD - Multiple Instruction, Single Data; MIMD - Multiple Instruction, Multiple Data. Parallelism can be achieved within: SISD (superscalar); SIMD (vector processors); MIMD (clusters, massively parallel processors).

37 In-processor Parallelism (single processor) Pipelining: overlap the execution of instructions. Example: $> cat scalar_array | extract_contour | gzip -c > contour_data.z  cat: reads the disk file "scalar_array" and sends its content to "extract_contour" through a pipe. extract_contour: visualization program (contour from a scalar function -> geometry). gzip: compresses the data (geometry) and writes the compressed data to disk.

38 In-processor Parallelism (single processor) Pipelining (SISD): overlap the execution of instructions. Example: $> cat scalar_array | extract_contour | gzip -c > contour_data.z  Three processes running together keep 3 cores busy: speedup. Keeping multiple devices busy at the same time helps to overlap computation with the wait for services done by system devices (OS scheduler). Achieves streaming.

39 In-processor Parallelism (single processor) Pipelining: overlap the execution of instructions. Reduces the idle time of hardware components. Good performance with independent instructions. Performs more operations per clock cycle. Does not reduce latency. Only as fast as the slowest step. Branches can be a problem (loops).

40 Pipelining: Branches Example: loop-level parallelism: exploit parallelism among iterations of a loop.

Example 1
for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];      // instruction-level parallelism

41 Pipelining: Branches Loop-level parallelism: exploit parallelism among iterations of a loop.

Example 1
for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];      // instruction-level parallelism

Example 2
for (i = 1; i <= 100; i = i + 1) {
    a[i] = a[i] + b[i];      // S1
    b[i+1] = c[i] + d[i];    // S2
}
Is this loop parallel?

42 Pipelining: Branches Loop-level parallelism: exploit parallelism among iterations of a loop.

Example 2
for (i = 1; i <= 100; i = i + 1) {
    a[i] = a[i] + b[i];      // S1
    b[i+1] = c[i] + d[i];    // S2
}
Is this loop parallel? There is a loop-carried dependency: S1 depends on S2... but there are no cycles!

43 Loop-level parallelism A loop is parallel unless there is a cycle in the dependencies. The absence of a cycle means that the dependencies give a partial ordering on the statements.

Parallel loop: Example 2 re-written
a[1] = a[1] + b[1];
for (i = 1; i <= 99; i = i + 1) {
    b[i+1] = c[i] + d[i];        // S2
    a[i+1] = a[i+1] + b[i+1];    // S1
}
b[101] = c[100] + d[100];

44 Loop-level parallelism A loop is parallel unless there is a cycle in the dependencies. The absence of a cycle means that the dependencies give a partial ordering on the statements.

Non-parallel loop: Example 3
for (i = 1; i <= 100; i = i + 1) {
    a[i+1] = a[i] + c[i];        // S1
    b[i+1] = b[i] + a[i+1];      // S2
}
S1 and S2 depend on each other.

45 Loop unrolling There are a number of techniques for converting loop-level parallelism into instruction-level parallelism. Such techniques work by unrolling the loop.

Original loop:
for (int i = 0; i < imagesize; i++) {
    pixels[i] *= scale;
}

Unrolled by a factor of four (assuming imagesize is a multiple of 4):
for (int i = 0; i < imagesize; i += 4) {
    pixels[i]     *= scale;
    pixels[i + 1] *= scale;
    pixels[i + 2] *= scale;
    pixels[i + 3] *= scale;
}

Activated by the option -funroll-loops in GCC.

46 In-processor Parallelism (single processor) Pipelining (SISD): overlap the execution of instructions. Reduces the idle time of hardware components. Good performance with independent instructions. Performs more operations per clock cycle. The discrepancy between peak and actual performance is often caused by pipeline effects: it is difficult to keep pipelines full (conditional branches are one reason). Branch prediction helps: a correct prediction is very fast, an incorrect prediction is very slow. Accuracy is about 95%, so 5% of branches cause a pipeline stall (bad!).

47 In-processor Parallelism (single processor) Vector architectures (SIMD): each result is independent of the previous result, allowing a long pipeline (high clock rate). Vector instructions access memory with a known pattern: highly interleaved memory (low latency); no data caches required (the instruction cache is still used). Reduces branches and branch problems in pipelines. Fewer instruction fetches. Bad performance on problems that do not have independent inputs.
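
On current CPUs the same SIMD idea appears as short vector instructions (SSE/AVX). As an illustration under that assumption (not from the slides), a loop whose iterations are fully independent, like the one below, is a typical candidate for GCC's auto-vectorizer, enabled at -O3 or with -ftree-vectorize:

#include <stddef.h>

/* Each y[i] depends only on x[i] and y[i], never on a previous iteration,
   so the compiler may process several elements per SIMD instruction. */
void saxpy(size_t n, float a, const float * restrict x, float * restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

A loop that carries a dependency between iterations, such as Example 3 on slide 44, cannot be vectorized this way.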

48 Vector Processors: Branches Branches are expensive on GPUs.

__kernel void stripe(__global const float4 *input,
                     __global float4 *output)
{
    int i = get_global_id(0);
    // Lighten even pixels, darken odd pixels
    if (i % 2) {
        output[i] = input[i] * 0.9f;   // odd pixels: darken
    } else {
        output[i] = input[i] * 1.1f;   // even pixels: lighten
    }
}

Each pair of threads will take different branches (fragments). Only half will actually be running in parallel.

49 Multiprocessor Parallelism MIMD architectures. Divide the workload up between processors, often by dividing up a data structure. Each processor works on its own data. Typically processors need to communicate: shared memory; message exchange; distributed shared memory (virtual global address space).

50 PARALLEL PROGRAMMING

51 Designing a Parallel Program Granularity Data Dependency Communication

52 Granularity Granularity of parallelism is the ratio of the computation performed in parallel to the communication: Fine: relatively small amounts of computational work are done between communication events. Coarse: relatively large amounts of computational work are done between communication events.

53 Granularity Four types of parallelism (in order of granularity size): instruction-level parallelism (e.g. pipeline); thread-level parallelism (e.g. run a multi-threaded Java program); process-level parallelism (e.g. run an MPI job in a cluster); job-level parallelism (e.g. run a batch of independent single-processor jobs in a cluster). Which is best? The most efficient granularity depends on the algorithm and the hardware environment in which it runs. In most cases the overhead associated with communication and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity. Fine-grain parallelism can help reduce overheads due to load imbalance.

54 Communication vs. Computation The main issues that affect parallel efficiency are: Ratio of computation to communication: higher computation usually yields better performance. Communication bandwidth & latency: latency has the biggest impact. Scalability: inherent limit in the problem; hardware limit: do the bandwidth and latency scale with the number of processors?

55 Dependency Dependencies are one of the primary inhibitors of parallelism. Dependency: if event A must occur before event B, then B is dependent on A. Two types of dependency: Control dependency: waiting for the instruction which controls the execution flow to be completed, e.g. IF (X != 0) THEN Y = 1.0/X: Y has a control dependency on X != 0. Data dependency: dependency because of calculations or memory access. Flow dependency: A=X+Y; B=A+C;  Anti-dependency: B=A+C; A=X+Y;  Output dependency: A=2; X=A+1; A=5;

56 Identifying Dependency Draw a Directed Acyclic Graph (DAG) to identify the dependencies among a sequence of instructions. Anti-dependency: a variable appears as a parent in a calculation and then as a child in a later calculation. Output dependency: a variable appears as a child in a calculation and then as a child again in a later calculation. Example sequence: (1) X=A+B; (2) D=X*17; (3) A=B+C; (4) X=C+E. [DAG diagram: A and B feed statement 1, whose result X feeds statement 2 (with constant 17) producing D; an anti-dependency on A links statements 1 and 3, an anti-dependency on X links statements 2 and 4, and an output dependency on X links statements 1 and 4.]

57 How to Handle Data Dependencies Distributed memory architectures: communicate the required data at synchronization points. Shared memory architectures: synchronize read/write operations between tasks. Loop-carried dependencies are the most important.
