AutoTVM & Device Fleet
1 AutoTVM & Device Fleet
2-5 Learning to Optimize Tensor Programs (progressive build): frameworks produce a high-level data flow graph and optimizations; a machine-learning-based program optimizer sits between the frameworks and the hardware, learning to generate optimized programs for new operator workloads and hardware.
6-8 Search over Possible Program Transformations. Compute description: C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k)). Between this description and the hardware lie billions of possible optimization choices: loop transformations, thread bindings, cache locality, thread cooperation, tensorization, latency hiding.
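To make the compute description concrete, here is its reference semantics written as plain Python loops. This is a sketch of what the expression means (C = Aᵀ·B with a reduction over k), not the TVM API itself; the function name and list-of-lists layout are assumptions for illustration.

```python
def matmul_ref(A, B, m, n, kdim):
    """C[y][x] = sum over k of A[k][y] * B[k][x], i.e. C = A^T @ B."""
    C = [[0 for _ in range(n)] for _ in range(m)]
    for y in range(m):
        for x in range(n):
            for k in range(kdim):
                C[y][x] += A[k][y] * B[k][x]
    return C

A = [[1, 2], [3, 4]]  # shape (k=2, m=2)
B = [[5, 6], [7, 8]]  # shape (k=2, n=2)
print(matmul_ref(A, B, 2, 2, 2))  # [[26, 30], [38, 44]]
```

Every legal reordering, tiling, or parallelization of these three loops computes the same C; the search space on the slide is exactly the set of such rewrites.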
9-11 Learning-based Program Optimizer: Program Optimizer → Code Generator → Program → Runtime Measurements (fed back to the optimizer). High experiment cost: each trial takes ~1 second.
12-14 Learning-based Program Optimizer: Program Optimizer → Code Generator → Program, guided by a Cost Model. A reliable cost model is needed per hardware target.
15-18 Learning-based Program Optimizer: Program Optimizer → Code Generator → Program. Runtime measurements accumulate into training data D, from which a statistical cost model is learned. Unique problem characteristics: relatively low experiment cost, domain-specific problem structure, and a large quantity of similar tasks.
19-22 Program-aware Cost Modeling. A high-level configuration is lowered to a low-level abstract syntax tree (shared between tasks), e.g.:
    for y in range(8):
        for x in range(8):
            C[y][x] = 0
            for k in range(8):
                C[y][x] += A[k][y] * B[k][x]
Statistical features extracted from this AST (touched memory of C, A, B; outer loop lengths of y, x, k) feed boosted tree ensembles; alternatively, per-loop context vectors (for y, x, k) are combined via soft scatter and a TreeGRU into a final embedding.
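A minimal sketch of the statistical-feature path: given a loop nest's trip counts and which loop variables index each buffer, emit a flat feature vector of loop lengths and touched-memory estimates. The helper name and dict-based representation are assumptions for illustration, not TVM's internal API; in practice these features would be fed to a gradient-boosted tree model.

```python
def loop_features(loops, buffers):
    """loops: {loop var: trip count}; buffers: {name: [loop vars indexing it]}.

    Returns flat statistical features: per-loop lengths and, per buffer,
    the number of distinct elements touched (product of indexing loop extents).
    """
    feats = {}
    for var, length in loops.items():
        feats[f"length_{var}"] = length
    for buf, vars_ in buffers.items():
        touched = 1
        for v in vars_:
            touched *= loops[v]
        feats[f"touched_{buf}"] = touched
    return feats

# The 8x8x8 loop nest from the slide: C[y][x] += A[k][y] * B[k][x]
feats = loop_features(
    {"y": 8, "x": 8, "k": 8},
    {"C": ["y", "x"], "A": ["k", "y"], "B": ["k", "x"]},
)
print(feats)  # each buffer touches 64 elements; each loop has length 8
```

Because the features are derived from the shared low-level AST rather than from any one operator's template, the same extractor (and hence the same cost model) applies across tasks.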
23-28 Effectiveness of the ML-based Model. [Chart: relative speedup (0.25-1.50, baseline: cuDNN) vs. number of trials for one conv2d layer of ResNet-18 on a Titan X, comparing TVM with random search against TVM with the ML-based model.]
29-32 Transfer Learning Among Different Workloads: historical optimization tasks, combined with domain-invariant program representations, yield transferable models that speed up new tasks.
33 NVIDIA GPU Optimization (GTX 1080 Ti). [Chart: latency (ms) of MXNet + TensorRT 4.0 vs. AutoTVM on ResNet-50, MobileNet, VGG-19, Inception V3, and DenseNet-121.]
34 AMD GPU Optimization (Vega FE). [Chart: latency (ms) of MIOpen vs. AutoTVM on ResNet-50, MobileNet, and DenseNet-121.]
35 Bonus (INT8, GTX 1080). [Chart: latency (ms, log scale) of cuDNN vs. AutoTVM.]
36 High Level: Scaling Automatic Performance
37 Low Level: Portable RPC Tracker + Server. A Resource Manager (Tracker) handles resource allocation and hands out resource tokens. Optimization services (each with an ML-based cost model, a prioritizer, a cross compiler, and an RPC client) open RPC sessions to device servers: an Nvidia GPU server (RPC runtime, CUDA tasks), an AMD GPU server (RPC runtime, ROCm tasks), an Android phone (RPC runtime, OpenCL tasks), a Raspberry Pi (RPC runtime, ARM tasks), and a Zynq FPGA board (RPC runtime, JIT driver, hardware bitstream); modules marked in red can be reconfigured remotely in each session. Multiple workloads run optimization services against a shared cluster of heterogeneous devices.
38-44 RPC Communication Flow (progressive build): the Device reports itself free to the Tracker; the Client requests a device from the Tracker; the Tracker returns a handle; the Client uploads code to the Device and runs it; the Device returns the run time.
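The flow above can be mocked in-process in a few lines. The class and method names below are assumptions chosen to mirror the slide's message names, not the real tvm.rpc API; the point is the protocol shape: register free → request → handle → upload → run → run time.

```python
import time

class Tracker:
    """Matches free devices (by key) to requesting clients."""
    def __init__(self):
        self.free = {}                       # device key -> list of devices

    def register(self, key, device):         # "device free"
        self.free.setdefault(key, []).append(device)

    def request(self, key):                  # "request device"
        return self.free[key].pop()          # "return handle"

class Device:
    """Accepts uploaded code and reports how long it took to run."""
    def __init__(self):
        self.code = None

    def upload(self, code):                  # "upload code"
        self.code = code

    def run(self, *args):                    # "run code" -> "return run time"
        start = time.perf_counter()
        self.code(*args)
        return time.perf_counter() - start

tracker = Tracker()
tracker.register("rpi3b", Device())          # device announces itself free
dev = tracker.request("rpi3b")               # client gets a handle
dev.upload(lambda n: sum(range(n)))          # client uploads a kernel
elapsed = dev.run(10000)                     # measured run time flows back
print("run time (s):", elapsed)
```

Keeping the tracker as a thin matchmaker is what lets one shared pool of heterogeneous devices serve many concurrent tuning workloads.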
45 Model to Tuned Implementation: Model → (operator extraction) → Bag of Operators → (AutoTVM tuning) → Tuned Model.
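The pipeline on this slide can be sketched as follows. The list-of-tuples model representation and helper names are invented for illustration (the real flow walks a graph IR); the key step is deduplicating operator workloads so each unique (op, shape) task is tuned only once, however many times it appears in the model.

```python
def extract_tasks(model):
    """Deduplicate (op, shape) workloads into a 'bag of operators'."""
    return sorted(set(model))

def tune(task):                   # stand-in for an AutoTVM tuning run
    op, shape = task
    return f"best_schedule({op}, {shape})"

model = [("conv2d", (1, 64, 56, 56)),
         ("conv2d", (1, 64, 56, 56)),   # repeated layer: same task, tuned once
         ("dense", (1, 512))]
tuned = {task: tune(task) for task in extract_tasks(model)}
print(len(tuned), "unique tasks tuned for", len(model), "layers")
```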
46 Next: Autoscheduler, 16:30. Handcrafted schedule templates today: conv2d (x86); conv2d (GPU, Winograd); conv2d (ARM, spatial packing).
Accelerating cublas/cudnn using Input-Aware Auto-Tuning The ISAAC library Philippe Tillet Harvard University Introduction cublas does not always achieve peak performance: (M, N, K) 1 = (4096, 4096, 4096):
More informationLecture 12: Model Serving. CSE599W: Spring 2018
Lecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications That drink will get you to 2800 calories for today I last saw your keys in the store room Remind Tom of the party You re on page
More informationProfiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency
Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu
More informationGPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13
GPU FOR DEEP LEARNING chandlerz@nvidia.com 周国峰 Wuhan University 2017/10/13 Why Deep Learning Boost Today? Nvidia SDK for Deep Learning? Agenda CUDA 8.0 cudnn TensorRT (GIE) NCCL DIGITS 2 Why Deep Learning
More informationRegister and Thread Structure Optimization for GPUs Yun (Eric) Liang, Zheng Cui, Kyle Rupnow, Deming Chen
Register and Thread Structure Optimization for GPUs Yun (Eric) Liang, Zheng Cui, Kyle Rupnow, Deming Chen Peking University, China Advanced Digital Science Center, Singapore University of Illinois at Urbana
More informationPorting Fabric Engine to NVIDIA Unified Memory: A Case Study. Peter Zion Chief Architect Fabric Engine Inc.
Porting Fabric Engine to NVIDIA Unified Memory: A Case Study Peter Zion Chief Architect Fabric Engine Inc. What is Fabric Engine? A high-performance platform for building 3D content creation applications,
More informationCS 179 Lecture 4. GPU Compute Architecture
CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at
More informationComputational Interdisciplinary Modelling High Performance Parallel & Distributed Computing Our Research
Insieme Insieme-an Optimization System for OpenMP, MPI and OpenCLPrograms Institute of Computer Science University of Innsbruck Thomas Fahringer, Ivan Grasso, Klaus Kofler, Herbert Jordan, Hans Moritsch,
More informationn N c CIni.o ewsrg.au
@NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationOpenMP for next generation heterogeneous clusters
OpenMP for next generation heterogeneous clusters Jens Breitbart Research Group Programming Languages / Methodologies, Universität Kassel, jbreitbart@uni-kassel.de Abstract The last years have seen great
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationCan FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:
More informationThe Heterogeneous Programming Jungle. Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest
The Heterogeneous Programming Jungle Service d Expérimentation et de développement Centre Inria Bordeaux Sud-Ouest June 19, 2012 Outline 1. Introduction 2. Heterogeneous System Zoo 3. Similarities 4. Programming
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationAccelerating Financial Applications on the GPU
Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General
More informationProgrammer's View of Execution Teminology Summary
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 28: GP-GPU Programming GPUs Hardware specialized for graphics calculations Originally developed to facilitate the use of CAD programs
More informationVisualization of OpenCL Application Execution on CPU-GPU Systems
Visualization of OpenCL Application Execution on CPU-GPU Systems A. Ziabari*, R. Ubal*, D. Schaa**, D. Kaeli* *NUCAR Group, Northeastern Universiy **AMD Northeastern University Computer Architecture Research
More informationRUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS
RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS Yash Ukidave, Perhaad Mistry, Charu Kalra, Dana Schaa and David Kaeli Department of Electrical and Computer Engineering
More informationPOWERING THE AI REVOLUTION JENSEN HUANG, FOUNDER & CEO GTC 2017
POWERING THE AI REVOLUTION JENSEN HUANG, FOUNDER & CEO GTC 2017 LIFE AFTER MOORE S LAW 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 Transistors (thousands) 1.1X per year 10 4 10 3 1.5X per year
More informationOpen Standards for Vision and AI Peter McGuinness NNEF WG Chair CEO, Highwai, Inc May 2018
Copyright Khronos Group 2018 - Page 1 Open Standards for Vision and AI Peter McGuinness NNEF WG Chair CEO, Highwai, Inc peter.mcguinness@gobrach.com May 2018 Khronos Mission E.g. OpenGL ES provides 3D
More informationIntroduction to Deep Learning in Signal Processing & Communications with MATLAB
Introduction to Deep Learning in Signal Processing & Communications with MATLAB Dr. Amod Anandkumar Pallavi Kar Application Engineering Group, Mathworks India 2019 The MathWorks, Inc. 1 Different Types
More informationAdvanced CUDA Optimization 1. Introduction
Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines
More informationNVIDIA PLATFORM FOR AI
NVIDIA PLATFORM FOR AI João Paulo Navarro, Solutions Architect - Linkedin i am ai HTTPS://WWW.YOUTUBE.COM/WATCH?V=GIZ7KYRWZGQ 2 NVIDIA Gaming VR AI & HPC Self-Driving Cars GPU Computing 3 GPU COMPUTING
More informationFrom Shader Code to a Teraflop: How Shader Cores Work
From Shader Code to a Teraflop: How Shader Cores Work Kayvon Fatahalian Stanford University This talk 1. Three major ideas that make GPU processing cores run fast 2. Closer look at real GPU designs NVIDIA
More informationParallel Hybrid Computing F. Bodin, CAPS Entreprise
Parallel Hybrid Computing F. Bodin, CAPS Entreprise Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism Various heterogeneous
More information