Alternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood, Department of Computer Science, The University of Sheffield

2 Graphics Processing Units (GPUs)

3 Context: GPU Performance. Serial computing: 1 core, ~40 GigaFLOPS. Parallel computing: 16 cores, 384 GigaFLOPS. Accelerated computing: 4992 cores, 8.74 TeraFLOPS.

4 [Bar chart: peak performance in TFlops (axis 0.0 to 10.0). 1 CPU core: ~40 GigaFLOPS. GPU (4992 cores): 8.74 TeraFLOPS.] 6 hours CPU time vs. 1 minute GPU time.

5 Accelerators. Much of the functionality of CPUs is unused for HPC: complex pipelines, branch prediction, out-of-order execution, etc. Ideally for HPC we want simple, low-power and highly parallel cores.

6 An accelerated system. [Diagram labels: CPU with DRAM; GPU/accelerator with GDRAM; I/O; PCIe.] A co-processor, not a CPU replacement.

7 Thinking Parallel. Hardware considerations: high memory latency (PCI-e); huge numbers of processing cores. Algorithmic changes required: a high level of parallelism is required; data parallel != task parallel. If your problem is not parallel then think again.

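8 Amdahl's Law. $S = \frac{1}{(1 - P) + \frac{P}{N}}$, where S is the speedup, P is the proportion of the program that can be parallelised and N is the number of processors. [Plot: speedup (S) against number of processors (N) for P = 25%, P = 50%, P = 90% and P = 95%.] The speedup of a program is limited by the proportion that can be parallelised. Adding processing cores gives diminishing returns.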
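
The diminishing returns can be checked numerically. The following is a minimal Python sketch of Amdahl's Law for the four values of P plotted on the slide; the function name `amdahl_speedup` and the choice of core counts (taken from the earlier performance slide) are illustrative assumptions, not part of the original talk.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup predicted by Amdahl's Law.

    p is the parallelisable proportion of the program (0 to 1),
    n is the number of processors.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


if __name__ == "__main__":
    # Core counts borrowed from the slides: 1 core, 16 cores, 4992 cores.
    for p in (0.25, 0.50, 0.90, 0.95):
        row = ", ".join(
            f"N={n}: {amdahl_speedup(p, n):.2f}" for n in (1, 16, 4992)
        )
        # As N grows the speedup approaches 1 / (1 - p).
        print(f"P={p:.0%}  {row}  limit: {1.0 / (1.0 - p):.2f}")
```

Even with 4992 cores, a program that is 95% parallel cannot exceed a 20x speedup, which is why the serial remainder has to be attacked as well.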

9 SATALL Optimisation

10 Profile the Application. 11 hour runtime. Function A: 97.4% of runtime, 2000 calls. Hardware: Intel Core i7-4770k 3.50GHz, 16GB DDR3, Nvidia GeForce Titan X. [Bar chart: time (s) per function for the largest network (LoHAM), functions A to F; surviving labels for the smaller bars: 0.3%, 0.1%, 0.1%, 0.1%.]

11 Function A. Input: network (directed weighted graph) and origin-destination (O-D) matrix. Output: traffic flow per edge. Two distinct steps: 1. Single Source Shortest Path (SSSP): an all-or-nothing path for each origin in the O-D matrix. 2. Flow Accumulation: apply the OD value for each trip to each link on the route. [Stacked bar chart: distribution of runtime (LoHAM), serial implementation, split into Flow and SSSP.]

12 Single Source Shortest Path. For a single origin vertex (centroid), find the route to each destination vertex with the lowest cumulative weight (cost).

13 Serial SSSP Algorithm: D'Esopo-Pape (1974). Maintains a priority queue of vertices to explore. Highly serial; not a data-parallel algorithm. We must change the algorithm to match the hardware. Reference: Pape, U. "Implementation and efficiency of Moore-algorithms for the shortest route problem." Mathematical Programming 7.1 (1974).
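
To show why this algorithm is hard to parallelise, here is a minimal Python sketch of the textbook D'Esopo-Pape formulation. It is not the authors' implementation: the textbook version is usually described with a double-ended queue (new vertices join the back, previously visited vertices rejoin the front), and the adjacency-list layout and the name `desopo_pape` are assumptions made for illustration.

```python
import math
from collections import deque


def desopo_pape(num_vertices, adjacency, source):
    """Single source shortest paths; adjacency[u] is a list of (v, weight)."""
    dist = [math.inf] * num_vertices
    pred = [-1] * num_vertices          # previous vertex on the best route
    # 2 = never queued, 1 = currently queued, 0 = queued before and removed
    state = [2] * num_vertices

    dist[source] = 0.0
    queue = deque([source])
    state[source] = 1

    while queue:
        u = queue.popleft()             # one vertex at a time: inherently serial
        state[u] = 0
        for v, w in adjacency[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
                if state[v] == 2:       # first visit: explore later
                    queue.append(v)
                    state[v] = 1
                elif state[v] == 0:     # improved again: explore next
                    queue.appendleft(v)
                    state[v] = 1
    return dist, pred


if __name__ == "__main__":
    # Hypothetical 4-vertex network, not one of the benchmark models.
    network = [[(1, 1.0), (2, 4.0)], [(2, 2.0), (3, 5.0)], [(3, 1.0)], []]
    print(desopo_pape(4, network, 0))
```

Each step depends on the queue state left by the previous step, so there is little independent work to hand to thousands of GPU threads.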

14 Parallel SSSP Algorithm: Bellman-Ford Algorithm (1956). Poor serial performance and time complexity. Performs significantly more work. Highly parallel; suitable for GPU acceleration. [Bar chart: total number of edges considered, Bellman-Ford vs D'Esopo-Pape.] Reference: Bellman, Richard. "On a routing problem."
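
A minimal Python sketch of Bellman-Ford follows, written serially but structured the way a data-parallel version would be: every edge is relaxed independently in each pass. The edge-list layout, the early exit when a pass changes nothing, and the `edges_considered` counter (mirroring the "edges considered" chart) are illustrative assumptions; a real CUDA kernel would also need an atomic minimum or similar to resolve concurrent writes.

```python
import math


def bellman_ford(num_vertices, edges, source):
    """Single source shortest paths; edges is a list of (u, v, weight)."""
    dist = [math.inf] * num_vertices
    pred = [-1] * num_vertices
    dist[source] = 0.0
    edges_considered = 0

    for _ in range(num_vertices - 1):
        changed = False
        # Each relaxation below is independent of the others, so on a GPU
        # this loop could become one thread per edge.
        for u, v, w in edges:
            edges_considered += 1
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
                changed = True
        if not changed:                 # nothing improved: distances are final
            break
    return dist, pred, edges_considered


if __name__ == "__main__":
    # Hypothetical 4-vertex network, not one of the benchmark models.
    links = [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 2.0), (2, 3, 1.0), (1, 3, 5.0)]
    print(bellman_ford(4, links, 0))
```

The sketch touches every edge in every pass, which is the extra work the slide refers to, in exchange for passes that contain no ordering dependency between edges.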

15 Implementation. A2: naïve Bellman-Ford using CUDA. Up to 369x slower. [Bar chart: time (seconds) to compute all required SSSP results per model (Derby, CLoHAM, LoHAM), Serial vs A2; striped bars continue off the scale. Surviving label: Derby 36.5s; the CLoHAM and LoHAM values are not recoverable.]

16 Implementation. Followed an iterative cycle of performance optimisations. A3: early termination. A4: node frontier. A8: multiple origins concurrently (SSSP for each origin in the OD matrix). A10: improved load balancing (cooperative thread array). A11: improved array access. [Bar chart: time (seconds) to compute all required SSSP results per model (Derby, CLoHAM, LoHAM) for Serial, A2, A3, A4, A8, A10 and A11.]

17 Limiting Factor (Function A). [Stacked bar charts: distribution of runtime for CLoHAM and for LoHAM, serial implementation, split into Flow and SSSP.]

18 Limiting Factor (Function A). The limiting factor has now changed: we need to parallelise Flow Accumulation. [Stacked bar charts: distribution of runtime for CLoHAM and for LoHAM, Serial vs A11, split into Flow and SSSP.]

19 Flow Accumulation. Shortest path + OD = flow per link. For each origin-destination pair, trace the route from the destination to the origin, increasing the flow value for each link visited. This is a parallel problem, but it requires synchronised access to a shared data structure for all trips (atomic operations). [Diagram: link flow.]
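
The route tracing described on this slide can be sketched in Python as follows. This is a serial illustration under assumed data layouts: `pred` is a predecessor array from an SSSP run for one origin, `demand` maps each destination to its O-D value, and `flow` is a dictionary keyed by link. None of these names come from the original code.

```python
def accumulate_flow(pred, origin, demand, flow):
    """Add the O-D value of each trip to every link on its shortest route.

    pred[v] is the vertex before v on the shortest path from origin
    (-1 if v is unreachable), demand maps destination -> trips,
    flow maps (u, v) -> accumulated flow and is updated in place.
    """
    for destination, trips in demand.items():
        if trips == 0 or destination == origin:
            continue
        v = destination
        # Walk backwards from the destination towards the origin.
        while v != origin:
            u = pred[v]
            if u < 0:                   # destination not reachable
                break
            # Many trips share links near the origin; on a GPU this
            # update is the one that would need an atomic add.
            flow[(u, v)] = flow.get((u, v), 0.0) + trips
            v = u
    return flow


if __name__ == "__main__":
    # Hypothetical predecessor array for origin 0 on a 4-vertex network.
    print(accumulate_flow([-1, 0, 1, 2], 0, {3: 10.0, 2: 5.0}, {}))
```

Each trip could be traced by its own thread, but every thread writes into the same `flow` structure, which is the synchronisation cost the slide points out.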

20 Flow Accumulation. Problem: A12, lots of atomic operations serialise the execution. [Bar chart: time (seconds) to compute flow values per link, Serial vs A12, for Derby, CLoHAM and LoHAM.]

21 Flow Accumulation. Problem: A12, lots of atomic operations serialise the execution. Solutions: A15, reduce the number of atomic operations by solving in batches using parallel reduction. A14, use fast hardware-supported single-precision atomics and minimise loss of precision using multiple 32-bit summations. [Bar chart: time (seconds) to compute flow values per link for Serial, A12, A15 (double) and A14 (single), for Derby, CLoHAM and LoHAM; a "% total error" annotation survives without its value.]
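
The slide does not say how the multiple 32-bit summations are arranged. One possible reading, shown here purely as an assumption, is that many small partial sums are kept in single precision and then combined, rather than adding every trip into one running total. The Python sketch below emulates 32-bit floats with the standard `struct` module; the helper names and the test data (one million contributions of 0.1) are hypothetical.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Running total where every intermediate result is rounded to 32 bits."""
    total = 0.0
    for v in values:
        total = to_f32(total + v)
    return total


def sum_f32_batched(values, batch_size):
    """Sum fixed-size batches separately, then sum the partial results."""
    partials = [
        sum_f32(values[i:i + batch_size])
        for i in range(0, len(values), batch_size)
    ]
    return sum_f32(partials)


if __name__ == "__main__":
    # Hypothetical per-trip contributions, already stored as 32-bit values.
    trips = [to_f32(0.1)] * 1_000_000
    exact = math.fsum(trips)            # correctly rounded reference sum
    print("single running total error:", abs(sum_f32(trips) - exact))
    print("batched partial sums error:", abs(sum_f32_batched(trips, 1000) - exact))
```

A single running total loses accuracy once it becomes much larger than the values being added, whereas the partial sums stay closer in magnitude to their inputs. Whether A14 works this way is not stated in the slides.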

22 Integrated Results

23 Assignment Speedup relative to Serial. LoHAM runtimes: Serial 12h 12m; double precision 35m 22s; single precision (reduced loss of precision) 17m 32s. Hardware: Intel Core i7-4770k, 16GB DDR3, Nvidia GeForce Titan X. [Bar charts: assignment time relative to serial and assignment speedup relative to serial, for Serial, Multicore, A15 and A14 (single), for Derby, CLoHAM and LoHAM.]

24 Assignment Speedup relative to Multicore. LoHAM runtimes: Multicore 1h 47m; double precision 35m 22s; single precision (reduced loss of precision) 17m 32s. Hardware: Intel Core i7-4770k, 16GB DDR3, Nvidia GeForce Titan X. [Bar charts: assignment time relative to multicore and assignment speedup relative to multicore, for Serial, Multicore, A15 and A14 (single), for Derby, CLoHAM and LoHAM.]

25 GPU Computing at UoS

26 Expertise at Sheffield. Specialists in GPU computing and performance optimisation. Complex systems simulations via FLAME and FLAME GPU. Visual simulation, computer graphics and virtual reality. Training and education for GPU computing.

27 Thank You. Paul Richmond: p.richmond@sheffield.ac.uk, paulrichmond.shef.ac.uk. Peter Heywood: ptheywood.uk. [Table: Largest Model (LoHAM) results, with columns Runtime, Speedup Serial and Speedup Multicore. Serial: runtime 12:12:24, speedup vs serial 1.00. Multicore: runtime 1h 47m. A15 (double precision): runtime 35m 22s. A14 (single precision): runtime 17m 32s. The remaining speedup entries are not recoverable.]

28 Backup Slides

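29 Benchmark Models. 3 benchmark networks, ranging in size from small to very large, with up to 12 hour runtime. [Table: Model, Vertices (Nodes), Edges (Links) and O-D trips for Derby, CLoHAM and LoHAM; the numeric values are not recoverable, only the superscript ² of each O-D trips entry survives.] [Bar chart: benchmark model performance, time (s), for Derby, CLoHAM and LoHAM; legend labels: Serial, Salford Version, Multicore.]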

30 Edges considered per algorithm. [Charts: number of edges considered per iteration, and total edges considered, Bellman-Ford vs D'Esopo-Pape.]

31 Vertex Frontier (A4). Only vertices which were updated in the previous iteration can result in an update. Far fewer threads are launched per iteration: up to 2500 instead of [value missing] per iteration. [Line chart: frontier size per iteration for source 1, for Derby, CLoHAM and LoHAM.]
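
The frontier idea can be sketched in Python as below. This is a serial illustration, not the A4 kernel: the adjacency-list layout, the function name `frontier_sssp` and the recorded `frontier_sizes` list (analogous to the frontier-size chart) are assumptions made for the example.

```python
import math


def frontier_sssp(num_vertices, adjacency, source):
    """Bellman-Ford style SSSP that only relaxes edges leaving vertices
    whose distance changed in the previous iteration."""
    dist = [math.inf] * num_vertices
    pred = [-1] * num_vertices
    dist[source] = 0.0

    frontier = [source]
    frontier_sizes = []                 # work launched in each iteration

    while frontier:                     # empty frontier: early termination
        frontier_sizes.append(len(frontier))
        next_frontier = set()
        # On a GPU, one thread (or group of threads) per frontier vertex.
        for u in frontier:
            for v, w in adjacency[u]:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    pred[v] = u
                    next_frontier.add(v)    # v may improve its neighbours
        frontier = sorted(next_frontier)
    return dist, pred, frontier_sizes


if __name__ == "__main__":
    # Hypothetical 4-vertex network, not one of the benchmark models.
    network = [[(1, 1.0), (2, 4.0)], [(2, 2.0), (3, 5.0)], [(3, 1.0)], []]
    print(frontier_sssp(4, network, 0))
```

Compared with relaxing every edge in every pass, the work per iteration is bounded by the frontier size rather than by the size of the whole network.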

32 Multiple Concurrent Origins (A8). [Line charts: frontier size per iteration for source 1, and frontier size per iteration for all concurrent sources, for Derby, CLoHAM and LoHAM.]

33 Atomic Contention. Atomic operations are guaranteed to occur. Atomic contention: multiple threads atomically modify the same address, so the updates are serialised. atomicAdd(double) is not implemented in hardware (not yet). Solutions: 1. Algorithmic change to minimise atomic contention. 2. Single precision. [Chart: atomic contention per iteration for CLoHAM and LoHAM.]
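
To make the contention concrete, the Python sketch below counts how many trips from one origin would update each link's flow value. It assumes a predecessor array from an SSSP run and treats each trip as one thread; the function name `contention_per_link` and the data are hypothetical.

```python
from collections import Counter


def contention_per_link(pred, origin, destinations):
    """Count how many trips from origin add flow to each link.

    pred[v] is the vertex before v on the shortest path from origin
    (-1 if v is unreachable). A count above 1 means several threads
    would atomically update the same address.
    """
    writers = Counter()
    for destination in destinations:
        v = destination
        while v != origin and pred[v] >= 0:
            writers[(pred[v], v)] += 1
            v = pred[v]
    return writers


if __name__ == "__main__":
    # Hypothetical chain 0 -> 1 -> 2 -> 3 with trips to every vertex.
    counts = contention_per_link([-1, 0, 1, 2], 0, [1, 2, 3])
    print(counts.most_common())
```

Links close to the origin are shared by the most routes, so they attract the most simultaneous updates.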

34 Raw Performance. Hardware: Intel Core i7-4770k, 16GB DDR3, Nvidia GeForce Titan X. [Bar chart: assignment runtime per algorithm, time (seconds), for Serial, Multicore, A8, A10, A11, A15, A13 (single) and A14 (single), for Derby, CLoHAM and LoHAM.]
