HETEROGENEOUS CPU+GPU COMPUTING


1 HETEROGENEOUS CPU+GPU COMPUTING Ana Lucia Varbanescu, University of Amsterdam. Significant contributions by: Stijn Heldens (U Twente) and Jie Shen (NUDT, China).

2 Heterogeneous platforms Systems combining main processors and accelerators, e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC. Found everywhere, from supercomputers to mobile devices.

3 Heterogeneous platforms The host-accelerator hardware model: a host (the CPUs) connected via PCIe or shared memory to one or more accelerators (GPUs, MICs, FPGAs, ...).

4 Our focus today A heterogeneous platform = CPU (few cores) + GPU (thousands of cores). Most solutions work for other/multiple accelerators. An application workload = an application + its input dataset. Workload partitioning = workload distribution among the processing units of a heterogeneous system.

5 Generic multi-core CPU

6 Programming models (CPU) Spanning increasing levels of abstraction: Pthreads + intrinsics, a traditional threading library; TBB (Thread Building Blocks), a threading library; OpenMP, high-level and pragma-based; Cilk, a simple divide-and-conquer model; and OpenCL, to be discussed.
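
To make the abstraction levels concrete, here is a minimal sketch of the pragma-based style (OpenMP); the function and file names are illustrative, not from the talk:

```c++
// saxpy.cpp -- compile with: g++ -fopenmp saxpy.cpp
// One pragma parallelizes the loop; thread creation, scheduling, and
// joining are left to the OpenMP runtime (contrast with Pthreads,
// where all of that is written by hand).
#include <vector>

void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    #pragma omp parallel for   // split iterations over worker threads
    for (long i = 0; i < (long)x.size(); ++i)
        y[i] += a * x[i];
}
```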

7 A GPU Architecture

8 Offloading model The host code (on the CPU) launches kernels on the accelerator.
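
A minimal CUDA sketch of this offloading model (illustrative names; error handling omitted): the host code drives the computation, and the kernel runs on the accelerator.

```c++
#include <cuda_runtime.h>

__global__ void scale(float* d, float a, int n) {    // kernel: runs on the GPU
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

void offload(float* h, float a, int n) {             // host code: runs on the CPU
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // over PCIe
    scale<<<(n + 255) / 256, 256>>>(d, a, n);                     // launch kernel
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // fetch result
    cudaFree(d);
}
```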

9 Programming models (GPU) Spanning increasing levels of abstraction: CUDA, NVIDIA proprietary; OpenCL, an open standard, functionally portable across multi-cores; OpenACC, high-level and pragma-based; plus different libraries, programming models, and DSLs for different domains.

10 CPU vs. GPU CPU: large control logic and caches, a few ALUs; throughput ~500 GFLOPs, bandwidth ~60 GB/s; low latency, high flexibility; excellent for irregular codes with limited parallelism. GPU: many ALUs; high throughput, bandwidth ~350 GB/s; excellent for massively parallel workloads. Heterogeneous computing = CPU + GPU working together. Do we gain anything?

11 Case studies Matrix Multiplication (SGEMM), n = 2048 x 2048 (48MB), partitioning granularity = 10%. We plot, for partitionings from Only-GPU to Only-CPU: T_G = GPU kernel execution time, T_D = data transfer time, T_C = CPU kernel execution time, and T_Max = max(T_G + T_D, T_C). Best partitioning = (80,20): 80% of the work on the GPU, 20% on the CPU.

12 Case studies The same experiment (T_G, T_D, T_C, T_Max; granularity = 10%) for three kernels: Matrix Multiplication (SGEMM), n = 2048 x 2048 (48MB): best = (80,20); Separable Convolution (SConv), n = 6144 x 6144 (432MB): best = (30,70); Dot Product (SDOT), n = (512MB): best = Only-CPU. Very few applications are GPU-only.

13 Possible advantages of heterogeneity Increased performance: both devices work in parallel. Decreased data communication, which is often the bottleneck of the system: different devices for different roles. Increased flexibility and reliability: choose one/all *PUs for execution; a fall-back solution when one *PU fails. Increased energy efficiency: cheaper per flop. Main challenges: programming and workload partitioning!

14 Challenge 1: Programming models

15 Programming models (PMs) Heterogeneous computing = a mix of different processors on the same platform. Programming it means either a mix of programming models, one (or several?) for the CPUs (e.g., OpenMP) and one (or several?) for the GPUs (e.g., CUDA), or a single, unified programming model; OpenCL is a popular choice. The options (OpenMP 4.0, OmpSs, ...) span the whole axis from low level to high level.

16 Heterogeneous Computing PMs Generic, high-level (OpenACC, OpenMP 4.0, OmpSs, StarPU, HPL): high productivity, but not all applications are easy to implement. Specific, high-level (HyGraph, Cashmere, GlassWing): domain- and/or application-specific; focus on productivity and performance. Generic, low-level (OpenCL, OpenMP+CUDA): the most common at the moment; useful for performance, more difficult to use in practice. Specific, low-level (TOTEM): domain-specific, focus on performance; more difficult to use, and quite rare.

17 Heterogeneous computing PMs CUDA + OpenMP/TBB: the typical combination for NVIDIA GPUs; individual development per *PU; the glue code can be challenging. OpenCL (Khronos Group): functional portability => can be used as a unified model; performance portability via code specialization. HPL (University of A Coruña, Spain): a library on top of OpenCL that automates code specialization.
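
A hedged sketch of what that glue code looks like for the CUDA + OpenMP combination: the first β·n elements go to the GPU, the rest to the CPU, and the two parts execute concurrently (all names are illustrative):

```c++
#include <cuda_runtime.h>

__global__ void work_gpu(float* d, int n) {          // GPU version of the kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void work_cpu(float* h, int n) {                     // CPU version, OpenMP-parallel
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) h[i] += 1.0f;
}

void run_partitioned(float* h, int n, float beta) {  // the glue code
    int n_gpu = (int)(beta * n);
    if (n_gpu <= 0) { work_cpu(h, n); return; }      // degenerate Only-CPU case
    float* d;
    cudaMalloc(&d, n_gpu * sizeof(float));
    cudaMemcpy(d, h, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
    work_gpu<<<(n_gpu + 255) / 256, 256>>>(d, n_gpu); // async: GPU partition
    work_cpu(h + n_gpu, n - n_gpu);                   // meanwhile: CPU partition
    cudaMemcpy(h, d, n_gpu * sizeof(float), cudaMemcpyDeviceToHost); // syncs
    cudaFree(d);
}
```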

18 Heterogeneous computing PMs StarPU (INRIA, France): a special API for coding, plus a runtime system for scheduling. OmpSs (UPC + BSC, Spain): C + OpenCL/CUDA kernels, plus a runtime system for scheduling and communication optimization.

19 Heterogeneous computing PMs Cashmere (VU Amsterdam + NLeSC): dedicated to divide-and-conquer solutions; OpenCL backend. GlassWing (VU Amsterdam): dedicated to MapReduce applications. TOTEM (U. of British Columbia, Canada): graph processing; CUDA + multi-threading. HyGraph (TU Delft, U Twente, UvA, NL): graph processing; based on CUDA + OpenMP.

20 Challenge 2: Workload partitioning

21 Workload A workload = a DAG (directed acyclic graph) of kernels. Five classes: I SK-One (a single kernel, executed once), II SK-Loop (a single kernel in a loop), III MK-Seq (multiple kernels in sequence), IV MK-Loop (multiple kernels in a loop), V MK-DAG (multiple kernels in a general DAG).

22 Determining the partition Static partitioning (SP) vs. dynamic partitioning (DP): how to split the workload between the multiple CPU cores and the thousands of GPU cores.

23 Static vs. dynamic Static partitioning: + can be computed before runtime => no overhead; + can detect GPU-only/CPU-only cases; + no unnecessary CPU-GPU data transfers; - does not work for all applications. Dynamic partitioning: + responds to runtime performance variability; + works for all applications; - incurs (high) runtime scheduling overhead; - might introduce (high) CPU-GPU data-transfer overhead; - might not work for CPU-only/GPU-only cases.

24 Determining the partition Static partitioning (SP): (near-)optimal, but low applicability. Dynamic partitioning (DP): often suboptimal, but high applicability.

25 Heterogeneous Computing today Static, single-kernel: systems/frameworks (Qilin, Insieme, SKMD, Glinda, ...) and libraries (HPL); limited applicability, but low overhead => high performance. Static, multi-kernel (complex DAG): Glinda 2.0; low overhead => high performance, still limited in applicability. Dynamic, single-kernel: sporadic attempts and light runtime systems; not interesting, given that static and run-time based systems exist. Dynamic, multi-kernel (complex DAG): run-time based systems (StarPU, OmpSs, HyGraph); high applicability, high overhead.

26 Static partitioning: Glinda

27 Glinda* Modeling the partitioning: the application workload, the hardware capabilities, and the GPU-CPU data transfer. Predicting the optimal partitioning. Making the decision in practice: Only-GPU, Only-CPU, or CPU+GPU with the optimal partitioning. *Jie Shen et al., Look Before You Leap: Using the Right Hardware Resources to Accelerate Applications. HPCC 14.

28 Modeling the partitioning Define the optimal (static) partitioning. β = the fraction of data points assigned to the GPU (which also incurs T_D). The optimal partitioning balances the two devices: T_G + T_D = T_C.

29 Modeling the workload n: the total problem size; w: the workload per work-item. W = Σ_{i=0..n-1} w_i, the area of the rectangle in the workload plot; W (the total workload size) quantifies how much work has to be done. GPU partition: W_G = w x n x β; CPU partition: W_C = w x n x (1-β).

30 Modeling the partitioning T_G = W_G / P_G, T_C = W_C / P_C, T_D = O / Q. W: total workload size, with W_C = (1-β) x W and W_G = β x W; P: processing throughput (work/second); O: CPU-GPU data-transfer size; Q: CPU-GPU bandwidth (bytes/second). From T_G + T_D = T_C: W_G / W_C = (P_G / P_C) x 1 / (1 + (P_G / Q) x (O / W_G)). R_GC relates the CPU kernel execution time to the GPU kernel execution time; R_GD relates the GPU data-transfer time to the GPU kernel execution time.

31 Predicting the optimal partitioning Solving the equation W_G / W_C = (P_G / P_C) x 1 / (1 + (P_G / Q) x (O / W_G)) for β combines the total workload size, the HW capability ratios, and the data transfer size into a β predictor. There are three β predictors, by data transfer type: no data transfer: β = R_GC / (1 + R_GC); partial data transfer: β = R_GC / (1 + (v/w) x R_GD + R_GC); full data transfer: β = (R_GC - (v/w) x R_GD) / (1 + R_GC).
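
The three predictors are easy to turn into code. A sketch in plain C++ (variable names are ours, not from the talk), with R_GC and R_GD as defined above and v/w the bytes transferred per unit of work:

```c++
enum class Transfer { None, Partial, Full };

// Returns the predicted fraction of the work to assign to the GPU.
double predict_beta(double R_gc, double R_gd, double v_over_w, Transfer t) {
    switch (t) {
        case Transfer::None:    return R_gc / (1.0 + R_gc);
        case Transfer::Partial: return R_gc / (1.0 + v_over_w * R_gd + R_gc);
        case Transfer::Full:    return (R_gc - v_over_w * R_gd) / (1.0 + R_gc);
    }
    return 0.0;  // unreachable
}
```

Note that with full data transfer the predicted β can drop to (or below) zero when transfers dominate; the decision step on slide 73 turns such cases into Only-CPU.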

32 Glinda outcome (Any?) data-parallel application can be transformed to support heterogeneous computing. The outcome is a decision on the execution of the application: only on the CPU, only on the GPU, or CPU+GPU, together with the partitioning point.

33 Results [1] Quality (compared to an oracle): the right HW configuration in 62 out of 72 test cases, and optimal partitioning when CPU+GPU is selected; the execution time of the predicted partitioning matches that of the optimal partitioning found by auto-tuning.

34 Results [2] Effectiveness (compared to Only-CPU/Only-GPU): 1.2x-14.6x speedup for our configuration. If taking the GPU for granted, up to 96% of the performance can be lost.

35 Related work Glinda achieves similar or better performance than the other partitioning approaches, at a lower cost.

36 More advantages Supports different types of heterogeneous platforms: multiple GPUs, identical or non-identical. Supports different profiling options, suitable for different execution scenarios: online or offline, partial or full. Determines not only the optimal partitioning but also the best hardware configuration: Only-GPU / Only-CPU / CPU+GPU with the optimal partitioning. Supports both balanced and imbalanced applications.

37 Glinda: limitations? Regular applications only? >> NO, we support imbalanced applications, too. Single kernel only? >> NO, we support two types of multi-kernel applications, too. Single GPU only? >> NO, we support multiple GPUs/accelerators on the same node. Single node only? >> YES, for now.

38 What about energy? The Glinda model can be adapted to optimize for energy (ongoing work at UvA). It is not yet clear whether optimality can be determined as easily. Finding the right energy models: we started with statistical models, and we plan to use existing analytical models in the next stage. There would probably be more limitations regarding the workload types.

39 Dynamic partitioning: HyGraph

40 Dynamic partitioning: StarPU, OmpSs The runtime system takes the DAG of kernels (e.g., k0..k5, class V) and performs the partitioning & scheduling over the devices. Our own take on it: graph processing.

41 Graph processing Graphs are increasing in size and complexity: Facebook ~1.7B users, LinkedIn ~500M users, Twitter ~400M users. We need HPC solutions for graph processing.

42 CPU or GPU? Graph processing: many vertices => massive parallelism; some vertices inactive => irregular workload; varying vertex degree => irregular tasks; low arithmetic intensity => bandwidth limited; irregular access pattern => poor caching. Ideal platform? CPU: few cores, deep caches, asymmetric processing. GPU: many cores, latency hiding, data-parallel processing. No clear winner => attempt CPU+GPU!

43 Design of HyGraph HyGraph: dynamic scheduling. Replicate the vertex states on CPU and GPU. Divide the vertices into fixed-size segments called blocks; processing the vertices inside a block corresponds to one task. CPU: 1 thread per block. GPU: 1 SM (= 1024 threads) per block.
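
A hedged sketch of the block/task structure on the CPU side (illustrative names; the real HyGraph implementation differs): vertices are grouped into fixed-size blocks, and each worker thread claims the next unprocessed block from a shared counter.

```c++
#include <algorithm>
#include <atomic>
#include <vector>

constexpr int BLOCK = 1024;                 // vertices per block = one task
std::atomic<int> next_block{0};             // shared cursor over the task list

// Per-vertex work for one block; the same block could instead be handed
// to the GPU, where one SM (1024 threads) processes it in parallel.
void process_block(std::vector<float>& rank, int first, int last) {
    for (int v = first; v < last; ++v)
        rank[v] *= 0.85f;                   // stand-in for real vertex work
}

void cpu_worker(std::vector<float>& rank) { // run one instance per CPU thread
    int n = (int)rank.size();
    int n_blocks = (n + BLOCK - 1) / BLOCK;
    for (int b; (b = next_block.fetch_add(1)) < n_blocks; )
        process_block(rank, b * BLOCK, std::min((b + 1) * BLOCK, n));
}
```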

44 Design of HyGraph Overlap of computation with communication
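
One common way to get this overlap is CUDA streams: while one chunk's kernel executes, the next chunk's transfer is in flight. A minimal sketch (assumes pinned host memory; names illustrative):

```c++
#include <cuda_runtime.h>

__global__ void step(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void pipelined(float* h_pinned, float* d, int n, int chunk) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    for (int off = 0, c = 0; off < n; off += chunk, c ^= 1) {
        int m = (off + chunk < n) ? chunk : n - off;
        // the copy in stream c overlaps the kernel still running in the other stream
        cudaMemcpyAsync(d + off, h_pinned + off, m * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);
        step<<<(m + 255) / 256, 256, 0, s[c]>>>(d + off, m);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```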

45 Performance evaluation 2 GPUs presented: Kepler (K20) and Maxwell (GTX980). 1 CPU node: dual 8-core CPUs, 2.4GHz. Three algorithms; PageRank (500 iterations) presented here. 10 datasets; the 5 synthetic ones presented here: RMAT22 (|E| = 64.2M), RMAT23 (|E| = 129.3M), RMAT24 (|E| = 260.1M), RMAT25 (|E| = 523.6M), RMAT26. PRELIMINARY results!

46 Preliminary results: execution time [s] CPU vs. GPU K20 vs. Hybrid K20 vs. GPU GTX980 vs. Hybrid GTX980, on RMAT22-RMAT26 (lower is better).

47 Preliminary results: energy [Joule] CPU vs. GPU K20 vs. Hybrid K20 vs. GPU GTX980 vs. Hybrid GTX980, on RMAT22-RMAT26, log scale 1,000-100,000 J (lower is better).

48 Results [3]: TEPS / Watt (TEPS = traversed edges per second) CPU vs. GPU K20 vs. Hybrid K20 vs. GPU GTX980 vs. Hybrid GTX980, on RMAT22-RMAT26 (higher is better).

49 Results [4]: EDP (energy-delay product) CPU vs. GPU K20 vs. Hybrid K20 vs. GPU GTX980 vs. Hybrid GTX980, on RMAT22-RMAT26.

50 Lessons learned For irregular applications: heterogeneous computing helps performance; heterogeneous computing might help energy efficiency. For the K20 it is pretty good; for the GTX980, the GPU alone wins. We cannot generalize these results to other types of graphs: both energy and performance are unpredictable. Is hybrid dynamic processing feasible for energy efficiency?! Still an open question.

51 Instead of a summary

52 Take home message [1] Heterogeneous computing works! More resources, specialized resources. Most efforts are in static partitioning and run-time systems: Glinda = static partitioning for single-kernel, data-parallel applications (and now for multi-kernel applications, too); StarPU, OmpSs = run-time based dynamic partitioning for multi-kernel, complex-DAG applications. Domain-specific efforts, too: HyGraph (graph processing), Cashmere (divide-and-conquer, distributed).

53 Take home message [2] Choose a system based on your application scenario: single-kernel vs. multi-kernel; massively parallel vs. data-dependent; single run vs. multiple runs; programming model of choice. There are models to cover combinations of these choices! No framework combines them all: food for thought?

54 Take home message [3] GPUs are energy efficient on their own. Hybrid processing helps performance, but for energy efficiency the results are heavily application-, workload-, and hardware-specific. Irregular applications remain the key challenge for *both* performance and energy.

55 Current/Future research directions More heterogeneous platforms. Extension to more application classes: multi-kernel with complex DAGs (streaming applications, graph-processing applications). Integration with distributed systems: intra-node workload partitioning + inter-node workload scheduling. Extension of the partitioning model for energy.

56 Backup slides

57 Imbalanced app: Sound ray tracing

58 Example 4: Sound ray tracing

59 Which hardware? Our application has massive data-parallelism, no data dependency between rays, and is compute-intensive per ray. Clearly, this is a perfect GPU workload!!!

60 Results [1] Execution time (s), Only-GPU vs. Only-CPU, on W9 (1.3GB): only a 2.2x performance improvement! We expected 100x.

61 Workload profile (a) Sound speed (m/s) vs. altitude (km) and relative distance from the source (km), with the aircraft position marked. (b) Number of simulation iterations per ray (by ray ID / launch angle [deg]): peak rays take ~7000 processing iterations, bottom rays ~500. The workload is heavily imbalanced.

62 Results [2] Execution time (ms): 62% performance improvement compared to Only-GPU.

63 Model the app workload n: the total problem size; w: the workload per work-item.

64 Model the app workload n: the total problem size; w: the workload per work-item. W = Σ_{i=0..n-1} w_i, the area of the rectangle in the workload plot; W (the total workload size) quantifies how much work has to be done. GPU partition: W_G = w x n x β; CPU partition: W_C = w x n x (1-β).

65 Modeling the partitioning T_G = W_G / P_G, T_C = W_C / P_C, T_D = O / Q. W: total workload size, with W_C = (1-β) x W and W_G = β x W; P: processing throughput (work/second); O: data-transfer size; Q: data-transfer bandwidth (bytes/second). From T_G + T_D = T_C: W_G / W_C = (P_G / P_C) x 1 / (1 + (P_G / Q) x (O / W_G)).

66 Model the HW capabilities Workload: W_G + W_C = W. Execution times: T_G and T_C. P (processing throughput) is measured as workload processed per second; P evaluates the hardware capability of a processor. GPU kernel execution time: T_G = W_G / P_G; CPU kernel execution time: T_C = W_C / P_C.

67 Model the data-transfer O (GPU data-transfer size), measured in bytes. Q (GPU data-transfer bandwidth), measured in bytes per second. Data-transfer time: T_D = O / Q (+ latency; latency < 0.1 ms, negligible impact).

68 Predict the partitioning With T_G = W_G / P_G, T_C = W_C / P_C, and T_D = O / Q, setting T_G + T_D = T_C gives W_G / W_C = (P_G / P_C) x 1 / (1 + (P_G / Q) x (O / W_G)), which we solve for β.

69 Predict the partitioning Solve the equation W_G / W_C = (P_G / P_C) x 1 / (1 + (P_G / Q) x (O / W_G)). The β-dependent terms and their expressions: W_G = w x n x β; W_C = w x n x (1-β); O = the full data transfer (or the full data transfer x β).

70 Predict the partitioning In the same equation, the remaining, β-independent terms (the hardware ratios P_G / P_C and P_G / Q) require estimation.

71 Modeling the partitioning Estimating the HW capability ratios by profiling: R_GC, the ratio of GPU throughput to CPU throughput (equivalently, CPU kernel execution time vs. GPU kernel execution time); R_GD, the ratio of GPU throughput to data-transfer bandwidth (equivalently, GPU data-transfer time vs. GPU kernel execution time).
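
A hedged sketch of how such profiling could look (our own illustrative code, not Glinda's): run the same small trial partition on both devices, time the transfer and the two kernels, and form the ratios.

```c++
#include <chrono>
#include <cuda_runtime.h>

__global__ void trial_kernel(float* d, int n) {      // GPU trial run
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}
void trial_cpu(float* h, int n) {                    // CPU trial run
    for (int i = 0; i < n; ++i) h[i] = h[i] * 2.0f + 1.0f;
}

void estimate_ratios(float* h, int n, double& R_gc, double& R_gd) {
    float *d, t_xfer, t_gpu;
    cudaMalloc(&d, n * sizeof(float));
    cudaEvent_t e0, e1, e2;
    cudaEventCreate(&e0); cudaEventCreate(&e1); cudaEventCreate(&e2);

    cudaEventRecord(e0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // transfer
    cudaEventRecord(e1);
    trial_kernel<<<(n + 255) / 256, 256>>>(d, n);                 // GPU kernel
    cudaEventRecord(e2);
    cudaEventSynchronize(e2);
    cudaEventElapsedTime(&t_xfer, e0, e1);
    cudaEventElapsedTime(&t_gpu, e1, e2);

    auto c0 = std::chrono::steady_clock::now();
    trial_cpu(h, n);                                              // CPU kernel
    double t_cpu = std::chrono::duration<double, std::milli>(
                       std::chrono::steady_clock::now() - c0).count();

    R_gc = t_cpu / t_gpu;    // CPU kernel time vs. GPU kernel time
    R_gd = t_xfer / t_gpu;   // transfer time vs. GPU kernel time
    cudaEventDestroy(e0); cudaEventDestroy(e1); cudaEventDestroy(e2);
    cudaFree(d);
}
```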

72 Predicting the optimal partitioning Solving the equation W_G / W_C = (P_G / P_C) x 1 / (1 + (P_G / Q) x (O / W_G)) for β combines the total workload size, the HW capability ratios, and the data transfer size into a β predictor. There are three β predictors, by data transfer type: no data transfer: β = R_GC / (1 + R_GC); partial data transfer: β = R_GC / (1 + (v/w) x R_GD + R_GC); full data transfer: β = (R_GC - (v/w) x R_GD) / (1 + R_GC).

73 Making the decision in practice From β to a practical HW configuration: calculate N_G = n x β (the GPU partition size) and N_C = n x (1-β) (the CPU partition size). If N_G is below its lower bound => Only-CPU; else, if N_C is below its lower bound => Only-GPU; otherwise, round N_G up => CPU+GPU with the final N_G and N_C.
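
The decision logic itself is a few lines. A sketch (the lower bounds and the rounding granularity are illustrative parameters, not Glinda's):

```c++
enum class Config { OnlyCPU, OnlyGPU, Hybrid };

Config decide(long n, double beta, long ng_min, long nc_min,
              long granularity, long& n_gpu, long& n_cpu) {
    n_gpu = (long)(n * beta);
    n_cpu = n - n_gpu;
    if (n_gpu < ng_min) { n_gpu = 0; n_cpu = n; return Config::OnlyCPU; }
    if (n_cpu < nc_min) { n_gpu = n; n_cpu = 0; return Config::OnlyGPU; }
    // round N_G up to the partitioning granularity, capped at n
    n_gpu = ((n_gpu + granularity - 1) / granularity) * granularity;
    if (n_gpu > n) n_gpu = n;
    n_cpu = n - n_gpu;
    return Config::Hybrid;
}
```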

74 Extensions Different profiling options: online vs. offline, and partial vs. full (profile only a fraction of the workload); the overall profiling overhead depends on the option. CPU + multiple GPUs: identical GPUs, or non-identical GPUs (may be suboptimal).

75 What if we have multiple kernels? Can we still do static partitioning?

76 Multi-kernel applications? Use dynamic partitioning: OmpSs, StarPU. Partition the kernels into chunks, distribute the chunks to the *PUs, keep the data dependencies, and enable multi-kernel asynchronous execution (inter-kernel parallelism). This responds automatically to runtime performance variability, but (often) leads to suboptimal performance: scheduling policies and chunk size matter, and scheduling has overhead (taking the decision, data transfers, etc.). In short: static partitioning = low applicability, high performance; dynamic partitioning = low performance, high applicability. Can we get the best of both worlds?
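
At its core, this style of dynamic partitioning is a shared work queue over chunks. A hedged sketch (heavily simplified; StarPU/OmpSs additionally track data dependencies and manage transfers):

```c++
#include <algorithm>
#include <atomic>

std::atomic<long> next_chunk{0};   // shared cursor over the iteration space

// Run one instance per device worker (a CPU thread, or a thread that
// drives the GPU); run_chunk dispatches to that device's backend.
void device_worker(long n, long chunk,
                   void (*run_chunk)(long first, long last)) {
    for (long c; (c = next_chunk.fetch_add(chunk)) < n; )
        run_chunk(c, std::min(c + chunk, n));
}
```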

77 How to satisfy both requirements? We combine static and dynamic partitioning: we design an application analyzer that chooses the best-performing partitioning strategy for any given application. (1) Implement/get the app source code; (2) analyze the app's kernel structure (its app class, via the application classification); (3) select the partitioning strategy (from the partitioning strategy repository); (4) enable the partitioning in the code.

78 Application classification 5 application classes: I SK-One (a single kernel, executed once), II SK-Loop (a single kernel in a loop), III MK-Seq (multiple kernels in sequence), IV MK-Loop (multiple kernels in a loop), V MK-DAG (multiple kernels in a general DAG).

79 Partitioning strategies Static partitioning, for single-kernel applications: SP-Single (one partitioned kernel), SP-Unified (the same partitioning for every kernel, with a global sync after each kernel), and SP-Varied (a different partitioning per kernel, with a global sync after each kernel). Dynamic partitioning, for multi-kernel applications: DP-Dep and DP-Perf, which schedule partitions over the *PUs based on data-dependency chains and on scheduling policies, respectively.

80 Partitioning strategies Static partitioning (SP-Single, SP-Unified, SP-Varied): in Glinda, for single-kernel + multi-kernel applications. Dynamic partitioning (DP-Dep, DP-Perf): in OmpSs, for multi-kernel applications (and as fall-back scenarios).

81 Results [4] MK-Loop (STREAM-Loop), execution time (ms) vs. Only-GPU and Only-CPU: without sync: SP-Unified > DP-Perf >= DP-Dep > SP-Varied; with sync: SP-Varied > DP-Perf >= DP-Dep > SP-Unified.

82 Results: performance gain Best partitioning strategy vs. Only-CPU and Only-GPU, for MatrixMul, Nbody, BlackScholes, HotSpot, STREAM-Seq (w/ and w/o sync), and STREAM-Loop (w/ and w/o sync). Average speedup: 3.0x (vs. Only-GPU), 5.3x (vs. Only-CPU); peak: 22.2x (MatrixMul).

83 Summary: Glinda Computes a (close-to-)optimal partitioning. Based on static partitioning: single-kernel, and multi-kernel (with restrictions). No programming model attached: CUDA + OpenMP/TBB, OpenCL, or others => we propose HPL.
