Accelerating Applications: the art of maximum performance computing
James Spooner, Maxeler VP of Acceleration



3
- Introduction: What do we mean by acceleration?
- The Process: How do we go about it?
- The Tools: What can we automate?
- Case Studies: How can we apply this for real?
- Summary: What do you think?

4 Introduction: What do we mean by acceleration?


5 About Maxeler Technologies
Maxeler offers complete hardware, software and application acceleration solutions for high performance computing.
- Founded 2003, ~65 people, offices in London, UK and Palo Alto, CA
- Main clients in banking and oil and gas exploration

Hardware
- Card: PCI Express x16, compute, memory and local interconnect
- Node: 1U solutions with multiple cards
- Rack: 10U, 20U or 40U, balancing compute, storage & network

Software
- Resource management for Accelerated Computing
- Runtime support: memory management and data choreography
- Compilers and High Level Libraries

Consulting
- HPC System Performance Architecture
- Algorithms and Numerical Optimization
- Integration into business and technical processes

6 Application Acceleration
- Deliberate, focused approach to improving application speed
- May involve using new or additional hardware
- May require (dramatic) changes to the code base
- Makes some of the program faster
- Will be programmed intentionally and be architecture specific
- May have multiple implementations

Maxeler is an acceleration specialist, delivering end-to-end performance for a range of clients in the banking and oil/gas exploration industries. This talk presents some of our methodology and experience across GPU and FPGA acceleration projects.

7 What always makes Acceleration hard?
- Messy code
- Complicated build dependences
- Confused control-flow
- Impenetrable data access
- Pointer-intensive data structures
- Premature optimization

for (i=0; i<n; ++i) { points[i]->incx(); }

[Figure: point records (x y z; r θ) reached through pointers p and q]

8 Conflicting goals
Some well-motivated software structures have real value, but make acceleration harder.

for (i=0; i<n; ++i) { points[i]->incx(); }

Examples:
- Virtual method calls inside a loop
- Collections with non-uniform type
- Substructure sharing

[Figure: point records (x y z; r θ) reached through pointers p and q]

9 What makes Acceleration easier?
- Self-evident data dependences
- Computing on large collections of uniform data
- Appropriate representation hiding
- Getting the abstraction right

[Figure: separate contiguous arrays of x values, y values and z values]
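The contrast between the pointer-based loop on the earlier slides and the uniform arrays pictured here can be sketched in C++. This is only an illustration under assumptions: the names PointsSoA and inc_x are hypothetical and do not come from the talk, and float coordinates are an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays layout: each coordinate lives in its own contiguous
// array, like the "x x x ... y y y ... z z z" picture on the slide.
// PointsSoA and inc_x are hypothetical names used only for this sketch.
struct PointsSoA {
    std::vector<float> x, y, z;

    explicit PointsSoA(std::size_t n) : x(n, 0.0f), y(n, 0.0f), z(n, 0.0f) {}

    std::size_t size() const { return x.size(); }
};

// Add delta to every x coordinate. The loop has no pointer chasing, no
// virtual dispatch and no dependence between iterations, so the data
// dependences are evident to a compiler or to an accelerator toolchain.
inline void inc_x(PointsSoA& pts, float delta) {
    assert(pts.x.size() == pts.y.size() && pts.y.size() == pts.z.size());
    for (std::size_t i = 0; i < pts.size(); ++i) {
        pts.x[i] += delta;
    }
}
```

Compared with `points[i]->incx()`, every iteration here touches adjacent memory and runs the same code, which is the kind of regularity the slide describes.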

10 Maximum Performance Computing
- Identify parallelism and take advantage of it
- Fully understand data dependencies
- Minimize memory bandwidth
- Data reuse and representation
- Regularize the computation and data
- Minimize control flow complexity
- Find optimal balance for underlying architecture
- Memory hierarchy bandwidth(s) and size(s) and latency(s)
- Communication bandwidth(s) and latency(s)
- Math performance
- Branch cost (control divergence)
- Axes of Parallelism

11 The Process: How do we go about it?

12 Maxeler Acceleration Process
Code -> Analysis -> Transformation -> Partitioning -> Implementation -> Result
- Analysis: Run the code with profiling tools. Understand data and loop structures and data access patterns.
- Transformation: Investigate transformation options for these structures and access patterns.
- Partitioning: Decide which parts of the code need acceleration.
- Implementation: Implement and validate.

Diagram labels: Sets theoretical performance bounds; Achieve performance


13 Analysis
- Understand the application (code + data)
- Find parallelism: at all levels
- Understand data dependencies: size, frequency, distance

Transformation
- Change the structure of the code and data
- Put all the computation that matters into one place
- Strive for regular code and data structures in the inner loop
- Split out irregular or complex pre/post processing


14 Partitioning
What should go where?
- Storage: consider capacity, bandwidth, latency; data access patterns, reuse
- Compute: consider input/output data volume, latency, computational throughput

[Figure: APU chip with CPU, NBCORE and embedded GPU; ~7 GB/s coherent path, ~20 GB/s non-coherent path]

Model the performance before implementation.
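One way to model performance before implementation is a simple bound: a kernel cannot finish faster than its data can be moved, nor faster than its arithmetic can be executed. The sketch below is an assumption-laden illustration and not Maxeler's model; the function name min_runtime_seconds and its parameters are hypothetical, and it assumes transfer and compute overlap perfectly.

```cpp
#include <algorithm>
#include <cassert>

// Lower bound on runtime in seconds for one kernel, assuming data transfer
// and computation overlap perfectly (an optimistic assumption).
// All parameter names are hypothetical and chosen for this sketch.
inline double min_runtime_seconds(double bytes_moved,
                                  double bandwidth_bytes_per_s,
                                  double operations,
                                  double ops_per_s) {
    assert(bandwidth_bytes_per_s > 0.0 && ops_per_s > 0.0);
    const double transfer_time = bytes_moved / bandwidth_bytes_per_s;
    const double compute_time = operations / ops_per_s;
    // The slower of the two resources sets the bound.
    return std::max(transfer_time, compute_time);
}
```

For example, moving 14 GB over a ~7 GB/s path needs at least 2 s however fast the compute units are, while the same data over a ~20 GB/s path needs at least 0.7 s. The operation count and throughput in such a calculation would have to come from measurements of the real application.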


15
[Figure labels: Code, Transformations, Data Access Plans, Code Partitioning, Partitioning Options, Pareto Optimal Options; axes: Development Time, Runtime]

Try to minimise runtime and development time, while maximising flexibility and precision.

16 Implementation
- Making it work
- Visibility is a challenge
- Be systematic, don't bite off too much at once
- Get it working first
- Unit Testing
- Make sure everything works
- Have good models and keep them up to date
- Evaluate implementation options and measure them all!
- Optimize quantitatively, not intuitively

17 The Tools: What can we automate?

18 Maxeler Parton
Internal tool suite consisting of the following tools:
- Multithreaded lightweight timing
- Arithmetic precision simulation
- Automatic control / data flow analysis
- Constraints-bound performance estimation

19 Parton in the Acceleration Process
Code -> Analysis -> Transformation -> Partitioning -> Implementation -> Result

Tools shown against the process:
- Loop timing measurement
- Arithmetic precision simulation
- Automatic control / data flow analysis
- Constraints-bound performance estimation

Diagram labels: Sets theoretical performance bounds; Achieve performance

20 Automatic control / data flow analysis
- Run-time analysis of compiled software
- Allows automatic analysis of impenetrable object-oriented C++ software
- Traces data dependencies between functions and libraries
- Three stage process: Trace, Analyse, Visualise

21 Automatic control / data flow analysis
Initial program
1. Trace: control + memory flow
2. Analyse: control flow and data dependencies
3. Visualise: part or all of the program

22 Automatic control / data flow analysis
[Figure annotations:]
- Dependency through heap memory allocation
- Data flow (size in bytes)
- Hot spots identified by color
- Nested for loop structures
- Control flow

23 Maxeler Loop Graphs
- Boxes represent loops
- Diamonds represent data
- Ellipses represent computation

24 Case Studies: How can we apply this for real?

25 MySQL Internals: Performance Review
Current version of MySQL is optimized for:
- Minimizing the disk I/O (B+tree)
- Caching queries & data (hash table, memory buffer pool)
- Finding the best way to execute a query (optimizer)

Current version of MySQL is not optimized for:
- Data Level Parallelism: SIMD is not used
- Thread Level Parallelism to process slow queries: a query is processed by a single thread
- Compute intensive analytical queries

26 MySQL Indexes
- Indexes in MySQL use B+trees, which allow searches in logarithmic time.
- A search starts at the root, traverses downwards, and performs a binary search at each node.
- A linked list (red in the figure) allows rapid in-order traversal.

27 Accelerating Index Search
- Accelerating index search would accelerate many queries (insert, select, update, joins, ...).
- Full text search will also benefit.

Common queries involving a B+tree search:
SELECT * FROM table WHERE id = 25
SELECT title,text FROM page WHERE MATCH(title) AGAINST('name')

Two possible acceleration strategies:
- Option 1: Use GPU to accelerate a single search query.
- Option 2: Use GPU to process many search queries in parallel.

28 Option 1: Accelerating a Single Search
- Use the massively parallel architecture of the GPU to accelerate a single search.
- Replace the binary search in the B+tree by a K-ary search.
- Maximum performance improvement: log(k)/log(2), i.e. log2(k), since the number of steps drops from log2(n) to logk(n).
- Needs more memory bandwidth for reading larger blocks.

[Figure: B+tree with a K-ary search performed at each node]

Approach presented in "Parallel Search on Video Cards" (2009), T. Kaldewey, J. Hagen, et al.

29 K-ary search: Use K GPU cores to accelerate the binary search
- Binary search with one core is done in log2(n) steps
- 4-ary search with 3 cores is done in log4(n) steps
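The step counts above can be illustrated with a sequential C++ sketch of k-ary search over a sorted array. It is written under assumptions and is not the implementation from the cited work: the names SearchResult and kary_search are hypothetical, the data is a plain sorted vector rather than a B+tree node, and the k-1 separator comparisons that a GPU would perform in parallel are done here in an ordinary inner loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch: position found and steps taken.
struct SearchResult {
    std::size_t index;  // first position whose element is >= key
    std::size_t steps;  // number of narrowing steps performed
};

// k-ary search on sorted data. Each step inspects k-1 separators and keeps
// one of k sub-ranges, so the range shrinks by roughly a factor of k.
inline SearchResult kary_search(const std::vector<int>& data, int key,
                                std::size_t k) {
    assert(k >= 2);
    std::size_t lo = 0, hi = data.size();  // the answer lies in [lo, hi]
    std::size_t steps = 0;
    while (lo < hi) {
        const std::size_t len = hi - lo;
        std::size_t new_lo = lo, new_hi = hi;
        // On a GPU these k-1 comparisons would be done by separate threads.
        for (std::size_t j = 1; j < k; ++j) {
            const std::size_t pos = lo + (len * j) / k;  // always in [lo, hi)
            if (data[pos] < key) {
                new_lo = pos + 1;  // key lies to the right of this separator
            } else {
                new_hi = pos;      // key lies at or to the left of it
                break;
            }
        }
        lo = new_lo;
        hi = new_hi;
        ++steps;
    }
    return {lo, steps};
}
```

In this sketch, searching 64 sorted elements for a key larger than all of them takes 6 steps with k = 2 and 3 steps with k = 4, which matches the log(k)/log(2) = 2 ratio quoted for Option 1.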


30 GPUs can process hundreds of queries in parallel [1]
Key features:
- Maximize memory bandwidth (parallel reads).
- Prefetching to use the SIMD architecture.
- Maximize throughput (number of threads).

MySQL architecture implications:
- Need modification of the data layout of the binary tree.
- Need modification of the query optimizer to gather similar queries.
- Needs many search queries to achieve high throughput.

[Figure: Query 1, Query 2, Query 3, Query 4, ... processed side by side (read, compare)]

[1] Approach presented in "Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs" (2010), C. Kim, J. Chhugani, et al.

31 Performance on Zacate

Option                       Speedup vs CPU alone [1]   Bottleneck
Option 1 (projected)         ~2.2x                      GPU clock frequency
Option 2 (projected)         ~3.2x                      Number of GPU cores
Option 2 (initial results)   ~2.1x                      Number of GPU cores

- Option 1 optimizes for latency, so speedup is quoted for a single query running with a single thread.
- Option 2 optimizes for throughput, so speedup assumes sufficient queries to keep the GPU fully occupied and is compared to the CPU using both cores. As Option 2 optimizes for throughput, latency may be increased.
- Speedup is quoted for the overall performance of a full text keyword search on the Simple English Wikipedia database, and not just for the index search (which accounts for ~71% of runtime).

[1] Speedup is quoted for a given system versus the same system not using the GPU.
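Because the index search accounts for only ~71% of runtime, the overall speedup is limited by the part that is not accelerated. A small C++ sketch of Amdahl's law makes this concrete. Applying Amdahl's law here is an assumption of this example and not a method stated in the talk, and the function name overall_speedup is hypothetical.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction of the runtime is sped up
// by kernel_speedup and the remainder is left unchanged.
// overall_speedup is a hypothetical name used only for this sketch.
inline double overall_speedup(double accelerated_fraction,
                              double kernel_speedup) {
    assert(accelerated_fraction >= 0.0 && accelerated_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    const double remaining = 1.0 - accelerated_fraction;
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup);
}
```

With accelerated_fraction = 0.71, the result approaches 1 / 0.29, roughly 3.4x, as kernel_speedup grows, provided the remaining 29% of the runtime stays unchanged and the baseline is the same in both measurements.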

32 Reverse Time Migration
- Geoscience algorithm for imaging the Earth's subsurface
- Runtime of weeks to months on thousands of CPU cores
- Core computational kernel is 3D finite difference wave propagation
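To give a feel for a finite difference wave propagation kernel, here is a deliberately reduced C++ sketch: one time step of the second-order scheme in 1D rather than 3D. It is an illustration under assumptions and not the production kernel: the function name wave_step_1d, the fixed zero boundaries and the parameter c2, standing for (velocity * dt / dx) squared, are all choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit time step of the 1D wave equation using second-order
// central differences in time and space:
//   next[i] = 2*cur[i] - prev[i] + c2 * (cur[i-1] - 2*cur[i] + cur[i+1])
// c2 is assumed to be (velocity * dt / dx)^2. The boundary values are held
// at zero, which is a simplification chosen for this sketch.
inline std::vector<float> wave_step_1d(const std::vector<float>& prev,
                                       const std::vector<float>& cur,
                                       float c2) {
    assert(prev.size() == cur.size());
    std::vector<float> next(cur.size(), 0.0f);
    for (std::size_t i = 1; i + 1 < cur.size(); ++i) {
        const float laplacian = cur[i - 1] - 2.0f * cur[i] + cur[i + 1];
        next[i] = 2.0f * cur[i] - prev[i] + c2 * laplacian;
    }
    return next;
}
```

Every interior point is updated with the same arithmetic from neighbouring values of the previous fields, which is the regular, uniform structure that the earlier slides identify as favourable for acceleration. A 3D kernel applies the same idea along three axes, usually with a wider stencil.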

33 Application Structure
[Figure: source wave and received wave between t=0 and t=tmax]
- Forward extrapolate pressure field
- Cross correlate at each time point
- Reverse extrapolate pressure field

34 RTM Option 1: Storing pressure fields
Accelerator:
- Extrapolate source wavefield
- Extrapolate receiver wavefield

CPU:
- Save wavefields to memory/disk
- Load source wavefields from disk, image with receiver
- Post process / save output image

[Diagram columns: Step 1, Step 2, Step 3]

35 RTM Option 2: Recompute pressure fields
Accelerator:
- Extrapolate source wavefield
- Extrapolate receiver wavefield
- Re-extrapolate source backwards
- Imaging

CPU:
- Save boundaries in memory
- Send boundary elements back to source propagator
- Post process / save output image

[Diagram columns: Step 1, Step 2, Step 3]


36 Modelling Performance Impact for Design Alternatives
[Figure curves: Option 1: partial problem; Option 2; Option 1: full problem size]

Option 1 (Store):
- Higher performance at low speedups
- Limited by disk I/O

Option 2 (Recompute):
- Performance scales

Which is better depends on problem size and relative disk I/O bandwidth.

37 Summary: What do you think?

38 Summary
- Have a clear goal about what needs to be faster
- Identify all the code that needs to be accelerated
- Understand the data and compute requirements
- Map the algorithm to the architecture
- Evaluate all options thoroughly
- Leave intuition at the door, measure quantitatively
- Implement systematically

39 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.
