Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)
1 Carlos Reaño, Javier Prades and Federico Silla, Technical University of Valencia (Spain). 4th IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB 2018). Vienna, Austria, 24 February 2018.
2 Outline: Introduction; GPU Virtualization; Approach proposed; The NVIDIA Jetson TX1 system; Performance evaluation; Conclusions and future work.
3 (Outline slide repeated; next section: Introduction.)
4-5 Current supercomputers: high performance, but also high energy consumption. Desired future supercomputers: performance of 1 exaflop within an energy budget of 20 MW.
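The two figures on the slide pin down the efficiency future hardware must reach; a quick back-of-the-envelope check:

```python
# Efficiency implied by the exascale goal above: 1 exaflop within 20 MW.
target_flops = 1e18        # 1 exaflop/s
power_budget_w = 20e6      # 20 MW

required_gflops_per_watt = target_flops / power_budget_w / 1e9
print(required_gflops_per_watt)   # 50.0
```

That is, roughly 50 GFLOPS per watt for the whole machine, which is what pushes the search towards low-power processors and shared accelerators.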
6-8 Current high-performance processors (Xeon): high performance, high energy consumption (~165 W). High-performance accelerators (GPU): high performance, moderate energy consumption (~100 W). Current low-power processors (ARM): low energy consumption (~15 W), but low performance.
9-11 How can the efficiency of future exascale systems be improved? Our proposal: heterogeneous platforms in which low-power ARM processors are accelerated by remote HPC servers (Xeon + GPUs), with each remote server shared among all the low-power nodes. How is the sharing done? GPU virtualization: GPUs are transparently shared among all cluster nodes. Two overheads must be kept in check: the virtualization framework itself and the network (hence the need for a high-performance interconnect!).
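To see why the interconnect matters, a minimal transfer-time model helps; the bandwidths below are illustrative assumptions, not measurements from this work:

```python
def transfer_time_s(payload_bytes, link_bw_gbps):
    """Seconds needed to move a payload over a link of the given
    effective bandwidth, expressed in GB/s."""
    return payload_bytes / (link_bw_gbps * 1e9)

payload = 256 * 1024**2    # 256 MiB of kernel input data (illustrative)
local_pcie = 12.0          # assumed effective GB/s to a local GPU
network = 6.0              # assumed effective GB/s across the fabric

t_local = transfer_time_s(payload, local_pcie)
t_remote = transfer_time_s(payload, network)
# With these assumptions the remote path doubles the transfer time,
# which is exactly the overhead a faster interconnect shrinks.
```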
12 (Outline slide repeated; next section: GPU Virtualization.)
13-15 GPGPU programming: OpenCL or CUDA. General architecture of remote-GPU solutions: a client-side library intercepts the GPGPU API and forwards the calls to a server that owns the physical GPU. Existing solutions: rCUDA, which is freely available and supports CUDA 8.
16 rCUDA communication architecture: modular transport layer supporting Ethernet, InfiniBand, ...
17 Local GPU: the application and the GPU are in the same node (node 1), so performance depends on the GPU's local bandwidth. Remote GPU: the application in node 1 uses a GPU in node 2 across the network, so performance depends on the network bandwidth.
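The client/server split behind remote-GPU middleware can be caricatured in a few lines; everything here (names, wire format) is invented for illustration and is not rCUDA's actual protocol:

```python
import pickle

def server_dispatch(message, gpu_backend):
    """Server side: unmarshal a call, run it on the 'GPU', marshal the result."""
    call = pickle.loads(message)
    return pickle.dumps(gpu_backend[call["api"]](*call["args"]))

class RemoteGPUStub:
    """Client side: stands in for the library that intercepts the CUDA API."""
    def __init__(self, gpu_backend):
        self.gpu_backend = gpu_backend   # stands in for network + server daemon

    def call(self, api, *args):
        wire = pickle.dumps({"api": api, "args": args})
        return pickle.loads(server_dispatch(wire, self.gpu_backend))

# Fake backend standing in for the CUDA runtime on the GPU server.
backend = {
    "cudaMemcpy": lambda dst, src: ("copied", len(src)),
    "cudaLaunchKernel": lambda name: ("launched", name),
}

stub = RemoteGPUStub(backend)
print(stub.call("cudaMemcpy", "dev_ptr", b"abcd"))   # ('copied', 4)
```

The application links against the stub instead of the real runtime, which is how an unmodified CUDA program can use a GPU that lives in another node.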
18 (Outline slide repeated; next section: Approach proposed.)
19 [Figure: several ARM nodes connected through a network to a Xeon + GPU server.] Offload GPU workloads from the ARM nodes to a remote GPU server.
20 ARM side, the NVIDIA Jetson TX1: a 4-core ARM processor and a 256-core CUDA-compatible GPU. Xeon side, the Supermicro 1027GR-TRF: two hexa-core Xeon processors and an NVIDIA Tesla K20 GPU with 2496 CUDA cores.
21 (Outline slide repeated; next section: The NVIDIA Jetson TX1 system.)
22-23 Performance comparison: ARM CPU vs. Xeon, and ARM GPU vs. K20.
24-28 [Figure: performance comparison of the NVIDIA Jetson TX1 and Supermicro 1027GR-TRF systems using the sysbench benchmark: CPU (events/s), Memory (MiB/s), Threads (events) and Mutex (seconds); the Y-axis is in logarithmic scale. The Xeon system comes out ahead in every test.]
29 Performance comparison (continued): ARM GPU vs. K20.
33-36 [Figure: performance comparison of the GPU in the Tegra X1 chip and the Tesla K20m GPU in terms of data-transfer bandwidth and computational power; primary Y-axis in logarithmic scale. Memory copy to GPU: 2205.9 vs 2963.3 MiB/s; memory copy from GPU: 2238.2 vs ~2884 MiB/s; matrix multiplication: 32.58 vs 250.56 GFlop/s. The K20m wins all three tests.]
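The K20m speed-ups in the figure follow directly from the measured values; recomputing two of them:

```python
# Values read from the figure (MiB/s for copies, GFlop/s for matrix multiply).
tx1_copy_to = 2205.9     # Tegra X1, memory copy to GPU
k20_copy_to = 2963.3     # Tesla K20m, memory copy to GPU
tx1_gemm = 32.58         # Tegra X1, matrix multiplication
k20_gemm = 250.56        # Tesla K20m, matrix multiplication

print(round(k20_copy_to / tx1_copy_to, 2))   # 1.34
print(round(k20_gemm / tx1_gemm, 2))         # 7.69
```

The gap is modest for transfers but large for computation, which is why offloading only the compute-heavy work to the K20 can pay off.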
37 (Outline slide repeated; next section: Performance evaluation.)
38-39 Three scenarios: (1) ARM alone, (2) Xeon + local GPU, (3) ARM + remote GPU. Performance comparison using: the Rodinia benchmarks; the gaussian benchmark with varying problem size; and the gaussian benchmark with a fixed problem size while sharing the GPU among multiple ARM clients.
40 [Figure: local-GPU configurations (ARM + TX1, Xeon + K20) vs. the remote-GPU configuration (ARM reaching the Xeon's K20 across the network).]
41-45 [Figure: comparison of Rodinia benchmarks executed in three scenarios: (i) a regular server with one NVIDIA K20 GPU, (ii) an NVIDIA Jetson TX1, and (iii) an NVIDIA Jetson TX1 offloading GPU work to a remote regular server with one NVIDIA K20 GPU. The benchmarks fall into three zones by execution time.] Zone A, short execution times: the ARM system wins. Zones B and C, medium and large execution times: the Xeon system wins. For large execution times, offloading to the remote GPU beats running locally on the ARM.
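The three zones suggest a simple placement rule for an ARM node; the sketch below distils that behaviour, with an invented threshold rather than one taken from the measurements:

```python
def offload_from_arm(expected_runtime_s, threshold_s=3.0):
    """Decide whether a TX1 node should run a kernel on its local GPU or
    offload it to a remote K20. The 3 s threshold is illustrative only."""
    if expected_runtime_s < threshold_s:
        return "local TX1"    # Zone A: offload overhead dominates short runs
    return "remote K20"       # Zone C: long runs amortize the network cost

print(offload_from_arm(0.4))    # local TX1
print(offload_from_arm(30.0))   # remote K20
```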
46 (Section transition.) Next comparison: the gaussian benchmark with varying problem size.
48-51 [Figure: the Rodinia gaussian benchmark with varying problem size, executed in the same three scenarios: (i) a regular server with one NVIDIA K20 GPU, (ii) an NVIDIA Jetson TX1, and (iii) an NVIDIA Jetson TX1 offloading GPU work to a remote server with one NVIDIA K20 GPU.] The pattern repeats: short execution times favour the ARM system; medium and large execution times favour the Xeon system; and for large execution times the remote GPU beats local ARM execution.
52 (Section transition.) Next comparison: the gaussian benchmark with a fixed problem size, sharing one GPU among multiple ARM clients.
54-58 [Figure: the Rodinia gaussian benchmark with problem size 8,192, executed in three scenarios: (i) a server with one K20, (ii) a Jetson TX1, and (iii) a Jetson TX1 offloading GPU work to a remote server with one K20; in the third scenario the number of concurrent clients sharing the same rCUDA server varies from 1 to 6. Time in minutes on the primary axis, remote-K20m speed-up on the secondary axis.] With no sharing, offloading yields over a 12x speed-up compared with local TX1 execution; even with 6 ARM clients sharing one remote K20, each client still sees a 2.83x speed-up.
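Sharing looks costly per client but pays off in aggregate; using only the two speed-ups annotated on the slides:

```python
# Speed-ups over local TX1 execution, as annotated on the slides.
dedicated = 12.0     # "over 12x" with one client per remote K20
shared_6 = 2.83      # each of 6 ARM clients sharing one remote K20

# Aggregate progress of the six sharing clients, in "one local TX1" units.
aggregate = 6 * shared_6
print(round(aggregate, 2))   # 16.98
```

A single shared K20 therefore delivers the work of roughly seventeen TX1s running locally, while every client still finishes almost 3x sooner than it would on its own GPU.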
59 (Outline slide repeated; next section: Conclusions and future work.)
60 Conclusions: a proposal for improving the efficiency of future exascale systems, using rCUDA to give ARM nodes access to high-end remote GPUs. Execution time is improved by up to 12x; the power efficiency of the ARM nodes is leveraged alongside the high-end accelerators; and the accelerators are dynamically shared among multiple ARM nodes.
61 Future work: analyse the proposal with production HPC applications; include energy and power measurements; and validate whether the proposal can contribute to the energy efficiency of future exascale systems.
62 Get a free copy of rCUDA at
How GPUs can find your next hit: Accelerating virtual screening with OpenCL Simon Krige ACS 2013 Agenda > Background > About blazev10 > What is a GPU? > Heterogeneous computing > OpenCL: a framework for
More informationWorld s most advanced data center accelerator for PCIe-based servers
NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying
More informationMICROWAY S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE
MICROWAY S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE LEVERAGE OUR EXPERTISE sales@microway.com http://microway.com/tesla NUMBERSMASHER TESLA 4-GPU SERVER/WORKSTATION Flexible form factor 4 PCI-E GPUs + 3 additional
More informationIBM Power Advanced Compute (AC) AC922 Server
IBM Power Advanced Compute (AC) AC922 Server The Best Server for Enterprise AI Highlights IBM Power Systems Accelerated Compute (AC922) server is an acceleration superhighway to enterprise- class AI. A
More informationTR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut
TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1
More informationNOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer
NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY Peter Messmer pmessmer@nvidia.com COMPUTATIONAL CHALLENGES IN HEP Low-Level Trigger High-Level Trigger Monte Carlo Analysis Lattice QCD 2 COMPUTATIONAL
More informationHeterogeneous SoCs. May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1
COSCOⅣ Heterogeneous SoCs M5171111 HASEGAWA TORU M5171112 IDONUMA TOSHIICHI May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1 Contents Background Heterogeneous technology May 28, 2014 COMPUTER SYSTEM COLLOQUIUM
More informationarxiv: v1 [physics.ins-det] 11 Jul 2015
GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University
More informationGPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting
Girona, Spain May 4-5 GPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting David Camp, Hari Krishnan, David Pugmire, Christoph Garth, Ian Johnson, E. Wes Bethel, Kenneth
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationGPU Architecture. Alan Gray EPCC The University of Edinburgh
GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From
More informationEmpirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms
Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms Javier Cuenca Luis-Pedro García Domingo Giménez Francisco J. Herrera Scientific Computing and Parallel
More informationENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION
ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION Vignesh Adhinarayanan Ph.D. (CS) Student Synergy Lab, Virginia Tech INTRODUCTION Supercomputers are constrained by power Power
More informationBuilding the Most Efficient Machine Learning System
Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide
More informationIntroduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29
Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions
More informationPyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent
PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python F.D. Witherden, M. Klemm, P.E. Vincent 1 Overview Motivation. Accelerators and Modern Hardware Python and PyFR. Summary. Motivation
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationConcurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration
Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration Sina Meraji sinamera@ca.ibm.com Berni Schiefer schiefer@ca.ibm.com Tuesday March 17th at 12:00
More informationProfiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency
Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationrepresent parallel computers, so distributed systems such as Does not consider storage or I/O issues
Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines
More informationDocument downloaded from:
Document downloaded from: http://hdl.handle.net/10251/70225 This paper must be cited as: Reaño González, C.; Silla Jiménez, F. (2015). On the Deployment and Characterization of CUDA Teaching Laboratories.
More informationHETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA
HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS
More informationFCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA
1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,
More informationGPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester
NVIDIA GPU Computing A Revolution in High Performance Computing GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester John Ashley Senior Solutions Architect
More informationNVIDIA Update and Directions on GPU Acceleration for Earth System Models
NVIDIA Update and Directions on GPU Acceleration for Earth System Models Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA Carl Ponder, PhD, Applications Software Engineer, NVIDIA,
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationNVIDIA Jetson Platform Characterization
NVIDIA Jetson Platform Characterization Hassan Halawa, Hazem A. Abdelhafez, Andrew Boktor, Matei Ripeanu The University of British Columbia {hhalawa, hazem, boktor, matei}@ece.ubc.ca Abstract. This study
More information"On the Capability and Achievable Performance of FPGAs for HPC Applications"
"On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationCUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation
CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationHigh Performance Computing
High Performance Computing Dror Goldenberg, HPCAC Switzerland Conference March 2015 End-to-End Interconnect Solutions for All Platforms Highest Performance and Scalability for X86, Power, GPU, ARM and
More informationUnderstanding the Role of GPGPU-accelerated SoC-based ARM Clusters
Understanding the Role of GPGPU-accelerated SoC-based ARM Clusters Reza Azimi, Tyler Fox and Sherief Reda Brown University Providence, RI firstname lastname@brown.edu Abstract The last few years saw the
More informationBuilding supercomputers from commodity embedded chips
http://www.montblanc-project.eu Building supercomputers from commodity embedded chips Alex Ramirez Barcelona Supercomputing Center Technical Coordinator This project and the research leading to these results
More informationDeep learning in MATLAB From Concept to CUDA Code
Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks,
More informationAn Introduction to the SPEC High Performance Group and their Benchmark Suites
An Introduction to the SPEC High Performance Group and their Benchmark Suites Robert Henschel Manager, Scientific Applications and Performance Tuning Secretary, SPEC High Performance Group Research Technologies
More information