High Performance Computing for Engineers

Size: px

Start display at page:

Download "High Performance Computing for Engineers"

Ruth Alice Horton
5 years ago
Views:

1 High Performance Computing for Engineers David Thomas Room 903 HPCE / dt10/ 2014 / 0.1

2 High Performance Computing for Engineers Research Testing communication protocols Evaluating signal-processing filters Simulating analogue and digital designs HPCE / dt10/ 2014 / 0.2

3 High Performance Computing for Engineers Research Testing communication protocols Evaluating signal-processing filters Simulating analogue and digital designs Tools CAD tools: synthesis, place-and-route, verification Libraries/toolboxes: filter design, compressive sensing HPCE / dt10/ 2014 / 0.3

4 High Performance Computing for Engineers Research Testing communication protocols Evaluating signal-processing filters Simulating analogue and digital designs Tools CAD tools: synthesis, place-and-route, verification Libraries/toolboxes: filter design, compressive sensing Products Oil exploration and discovery Mobile-phone apps Financial computing HPCE / dt10/ 2014 / 0.4

5 High Performance Computing for Engineers Types of performance metrics HPCE / dt10/ 2014 / 0.5

6 High Performance Computing for Engineers Types of performance metrics Throughput Latency Power Design-time Capital and running costs HPCE / dt10/ 2014 / 0.6

7 High Performance Computing for Engineers Types of performance metrics Throughput Latency Power Design-time Capital and running costs Required versus desired performance Subject to a throughput of X, minimise average power Subject to a budget of Y, maximise energy efficiency Subject to Z development days, maximise throughput HPCE / dt10/ 2014 / 0.7

8 What is available to you Types of compute device Multi-core CPUs GPUs (Graphics Processing Units) MPPAs (Massively Parallel Processor Arrays) FPGAs (Field Programmable Gate Arrays) HPCE / dt10/ 2014 / 0.8

9 What is available to you Types of compute device Multi-core CPUs GPUs (Graphics Processing Units) MPPAs (Massively Parallel Processor Arrays) FPGAs (Field Programmable Gate Arrays) Types of compute system Embedded Systems Mobile Phones Tablets Laptops Grid computing Cloud computing HPCE / dt10/ 2014 / 0.9

10 HTC Droid DNA Snapdragon S4 Pro - CPU : Quad-core Krait (ARM derivative) - GPU : Adreno 320 GPU (OpenCL compatible) Images Copyright HTC and Qaulcomm HPCE / dt10/ 2014 / 0.10

11 Lenovo Thinkpad Edge E525 AMD Fusion A8-3500M - CPU : Quad-Core 2.4GHz Phenom-II - GPU : HD 6620G 400MHz (320 cores) Img: HPCE / dt10/ 2014 / 0.11

Imperial HPC Cluster cx2 - SGI Altix ICE 8200 EX Racks and racks of high-performance PCs 3000+ x64 cores running at 3GHz Available to

12 Imperial HPC Cluster cx2 - SGI Altix ICE 8200 EX Racks and racks of high-performance PCs x64 cores running at 3GHz Available to researchers and undergrads (if they ask nicely) Grid-management system Run program on 1000 PCs with one command HPCE / dt10/ 2014 / 0.12

13 Performance and Efficiency Relative to CPU Uniform Gaussian Exponential Mean (Geo) MPPA FPGA GPU Uniform Gaussian 345 Exponential Mean (Geo) FPGA GPU MPPA Performance Power Efficiency HPCE / dt10/ 2014 / 0.13

14 Design tradeoffs 1 Sequential SW 10 Performance hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.14

15 Design tradeoffs 1 10 Performance 100 Sequential SW Thread-based SW hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.15

16 Design tradeoffs 1 10 Performance 100 Sequential SW Thread-based SW hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.16

17 Design tradeoffs 1 10 Performance 100 Sequential SW Task-based SW Thread-based SW hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.17

18 Design tradeoffs Task-based parallelism vs threads Easy to program (less time coding) 1 Easy to get right (less time testing) 10Many implementations and APIs Performance 100 Intel Threaded Building Blocks (TBB) Microsoft.NET Task Parallel Library 1000 OpenCL 1 hour 1 day 1 week 1 month Sequential SW Task-based SW Thread-based SW Design-time HPCE / dt10/ 2014 / 0.18

19 Design tradeoffs 1 10 Performance 100 Sequential SW Task-based SW Thread-based SW hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.19

20 Design tradeoffs 1 10 Performance 100 Sequential SW Task-based SW Thread-based SW hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.20

21 Design tradeoffs 1 10 Performance 100 Sequential SW Task-based SW Thread-based SW GPU hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.21

Design tradeoffs 1 10 Performance 100 Sequential SW Task-based SW Thread-based SW GPU 1000 1 hour 1 day 1 week 1

22 Design tradeoffs 1 10 Performance 100 Sequential SW Task-based SW Thread-based SW GPU hour 1 day 1 week 1 month Design-time Src: NVIDIA CUDA Compute Unified Device Architecture, Programmers Guide HPCE / dt10/ 2014 / 0.22

23 Design tradeoffs 1 10 Performance 100 Sequential SW Task-based SW Thread-based SW GPU hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.23

24 Design tradeoffs 1 10 Performance 100 Sequential SW Task-based SW Thread-based SW GPU hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.24

25 Design tradeoffs 1 10 Performance Sequential SW Task-based SW Thread-based SW GPU FPGA 1 hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.25

26 Design tradeoffs 1 10 Performance Sequential SW Task-based SW Thread-based SW GPU FPGA 1 hour 1 day 1 week 1 month Design-time HPCE / dt10/ 2014 / 0.26

27 What you will learn Systems: what high-performance systems are available Methods: how these systems can be programmed Practise: concrete experience with multi-core and GPUs Analysis: knowing what to use and when Tools: making better use of your time HPCE / dt10/ 2014 / 0.27

28 Developer productivity is also part of performance HPCE / dt10/ 2014 / 0.28

29 HPCE / dt10/ 2014 / 0.29

30 Re: XKCD - My Professional Context Undergraduate degree and PhD from Computing If pushed, I self-identify as a programmer Research focuses on hardware acceleration Both academic and industrial applications My motivation for this course Supervising final year project students Working with PhD students Talking to industry people HPCE / dt10/ 2014 / 0.30

31 100% Coursework Course Assessment Change from last year, used to be 50% exam 40% : Four short course-works to build skills Get familiar with environments and how to do common tasks Structured and quite linear should not be taxing Force people to do work earlier in term 40% : Two larger tasks to apply skills to real problems Allow demonstration of knowledge and skills Unstructured; open-ended; competitive; hard 20% : Oral assessment; individual Test ability to communicate about your code and solutions (Check that you did the work) HPCE / dt10/ 2014 / 0.31

32 Skills needed Basic programming If you can t program in _any_ language then worry Intel TBB uses C++ rather than C Some weird C++ stuff, but not scary: explained in lectures Setup and basics covered in third coursework GPU programming uses OpenCL (C-like) Let s you use whatever graphics card you happen to have Working examples, explained in lectures Language and compiler setup covered in fourth coursework Not expected to become a guru, just make it faster HPCE / dt10/ 2014 / 0.32

33 Key Focus: Engineering How does this apply to you? Examples from Elec. Eng. problems Mathematical analysis Simulation of digital circuits VLSI circuit layout Communication channel evaluation Tools and languages used in EE C / C++ MATLAB HPCE / dt10/ 2014 / 0.33

34 Simple example : Totient function Eulers totient function: totient(n) Number of integers in range 1..n which are relatively prime to n Integers i and j are relatively prime if gcd(i,j)=1 Totient not included in MATLAB HPCE / dt10/ 2014 / 0.34

35 Version 0 : Simple loop Eulers totient function: totient(n) Number of integers in range 1..n which are relatively prime to n Not included in MATLAB Integers i and j are relatively prime if gcd(i,j)=1 function [res]=totient_v0(n) res=0; for i=1:n % Loop over all numbers in 1..n if gcd(i,n)==1 % Check if relatively prime res=res+1; % Count any that are end end HPCE / dt10/ 2014 / 0.35

36 Version 1 : Vectorising Convert loops into vector operations Standard MATLAB optimisation Actually a way of making parallelism explicit function [res]=totient_v1(n) numbers=1:n; % Generate all numbers in 1..n gcd_res= (gcd(numbers,n)==1); % Perform GCD on all numbers res=sum(gcd_res==1); % Count all relatively prime numbers HPCE / dt10/ 2014 / 0.36

37 Version 2 : Parallel for loop MATLAB supports a parfor command Each loop iteration is/may be executed in parallel Can operate on multiple cores, and even multiple machines HPCE / dt10/ 2014 / 0.37

38 Version 2 : Parallel for loop MATLAB supports a parfor command Each loop iteration is/may be executed in parallel Can operate on multiple cores, and even multiple machines function [res]=totient_v2(n) res=0; parfor i=1:n % Loop over all numbers in 1..n if gcd(i,n)==1 % Check if relatively prime res=res+1; % Count any that are end end HPCE / dt10/ 2014 / 0.38

39 Version 3 : Agglomeration Too much overhead with current parallel loop Each parallel iteration has a cost due to scheduling Process space in chunks, using smaller vectors function [res]=totient_v3(n, step) if nargin<2 % How large each chunk should be step=1000; end res=0; % Loop over each chunk parfor i=1:floor(n/step) % Then process each chunk as a vector numbers=(i-1)*step+1:min(i*step,n); rel_prime= (gcd(numbers,n)==1); res=res+sum(rel_prime); end HPCE / dt10/ 2014 / 0.39

40 Results from my 4-core desktop v0: For Loop v1: Vectorised v2: ParFor Loop v3: ParFor Chunked v4: Algorithm X x 10 4 HPCE / dt10/ 2014 / 0.40

41 Results from my 4-core desktop v0: For Loop v1: Vectorised v2: ParFor Loop v3: ParFor Chunked v4: Algorithm X HPCE / dt10/ 2014 / 0.41

Cilk programs as a DAG

Cilk programs as a DAG The pattern of spawn and sync commands defines a graph The graph contains dependencies between different functions spawn command creates a new task with an out-bound link sync command