High-performance, low-power data processing on embedded many-core and FPGA architectures
1 High-performance, low-power data processing on embedded many-core and FPGA architectures
Dott. Alessandro Capotondi alessandro.capotondi@unibo.it
Prof. Davide Rossi davide.rossi@unibo.it
2 Agenda
- Many-core introduction
- CIRI ICT OpenMP technologies
- Productive parallel programming models
- Accelerator virtualization for high-performance and power-efficient computation: resource sharing among applications, heterogeneous unified shared memory
- Conclusions
3 Collaborations
Academia and industry. EU projects: P-SOCRATES (FP7 ICT grant n. ), ERC (grant n. )
4 The Advent of the Heterogeneous Many-Core Architecture Era in High-Performance Embedded Systems
9 The Advent of the Heterogeneous Many-Core Architecture Era in High-Performance Embedded Systems
Embedded systems now need to process workloads that used to be tailored for workstations or HPC.
Multi-Processor Systems-on-Chip (MPSoCs): computing units embedded in the same die, designed to deliver high performance at low power consumption = high energy efficiency (GOPS/Watt).
Various design schemes are available:
- Targeting ADAPTIVITY: heterogeneity and specialization for efficient computing, e.g. ARM big.LITTLE.
- Targeting PARALLELISM: massively parallel many-core accelerators to maximize GOPS/Watt (GPUs, GPGPUs, PMCAs).
Example: NVIDIA Tegra K1, with two levels of heterogeneity:
- Host processor (4 powerful cores + 1 energy-efficient core)
- Parallel many-core coprocessor (192-core accelerator: NVIDIA Kepler GPU)
10 NVIDIA K1 (Jetson)
Hardware features: 5" x 5" (127mm x 127mm) board; Tegra K1 SoC (1 to 5 Watts): NVIDIA Kepler GPU with 192 CUDA cores (326 GFLOPS), NVIDIA "4-Plus-1" 2.32 GHz quad-core ARM Cortex-A15; DRAM: 2GB DDR3L 933 MHz.
I/O features: mini-PCIe, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO.
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs including H.264, OpenCV4Tegra.
200 $. Large community of users!
12 NVIDIA TX1
Hardware features: 8" x 8" board; Tegra TX1 SoC (15 Watts): NVIDIA Maxwell GPU with 256 NVIDIA CUDA cores (1 TFLOP/s), quad-core ARM Cortex-A57 MPCore processor, 4 GB LPDDR4 memory.
I/O features: PCIe x4, 5MP CSI, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO.
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs including H.264, OpenCV4Tegra.
599 $. Workstation-comparable performance!
14 TI Keystone II
Hardware features: 8" x 8" board; TI 66AK2H12 SoC (14 Watts): 8x C6600 DSPs at 1.2 GHz (304 GMACs), quad-core ARM Cortex-A15 MPCore, up to 4 GB DDR3 memory.
I/O features: PCIe, SD/MMC card, USB 3.0/2.0, 2x Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 38x GPIO, HyperLink, SRIO, 20x 64-bit timers, security accelerators.
Software features: OpenMP, OpenCL.
Evaluation board 1000 $. Targets signal-processing acceleration.
16 Kalray MPPA
Evaluation board 1000 $. Targeting high-performance, time-critical missions: aerospace/military/autonomous driving, industrial robotics.
18 Programmable many-core accelerator (PMCA)
22 Challenges
- Fast programmability: high-productivity programming techniques
- Time predictability for industrial applications
- Accelerator virtualization for high-performance and power-efficient computation: resource sharing among applications, heterogeneous unified shared memory
23 Performance is not a free meal
[Chart: thread-level parallelism (TLP) of common mobile apps (iFunny, Netflix, Candy Crush Saga, My Talking Tom, BS Player, LinkedIn, Google Drive, Instagram, YouTube, Dropbox, Facebook, Twitter) on Android and Apple devices; all values below 2.]
Tests [*] based on common mobile applications show that real platforms are still far from exploiting the potential parallelism provided by the hardware: average TLP over 52 apps is 1.22 on Android and 1.36 on Apple.
[*] Analysis of the Effective Use of Thread-Level Parallelism in Mobile Applications: A preliminary study on iOS and Android devices, Ethan Bogdan, Hongin Yun.
34 Parallel programming models
- Proprietary programming models
- OpenCL: Khronos standard for heterogeneous computing
- OpenMP: standard for shared-memory systems
- Academic proposals: OmpSs, OpenHMPP
35 OpenMP
- De-facto standard for shared-memory programming.
- Support for nested (multi-level) parallelism: good for clusters.
- Annotations incrementally convey parallelism to the compiler: increased ease of use.
- Based on well-understood programming practices (shared memory, C language): increased productivity.
- 2x to 10x fewer LOC than OpenCL ("OpenCL for programming shared memory multicore CPUs", Akhtar Ali, Usman Dastgeer, Christoph Kessler).
But:
- Designed for uniform SMPs with a main shared memory.
- Lacks constructs to control accelerators.
- And the compilation toolchain has to deal with multiple ISAs and multiple runtime systems too!
(See also Intel's Parallel Universe magazine, May 2014.)
40 Open-Next OpenMP 4.0 runtime. What's new? UNTIED tasks.
41 Open-Next OpenMP runtime: comparison with other OpenMP implementations on recursive task benchmarks, on x86 (Intel Haswell). libgomp: GNU OpenMP implementation (GCC 4.9.2); iomp: Intel OpenMP implementation (ICC).
42 Open-Next OpenMP 4.0 runtime: comparison with other tasking runtimes on recursive benchmarks, on x86 (Intel Haswell). nanos: BSC OmpSs (Mercurium + Nanos++); Intel Cilk Plus: ICC 15.0.2; Intel TBB: ICC 15.0.2; Wool: GCC 4.9.2.
43 Time predictability
- At compile time, generate the task dependence graph (TDG), including timing information to account for task communication.
- At design time, assign the TDG to OS threads (mapping).
- At run time, schedule OS threads to achieve both predictability and high performance (scheduling).
[Toolflow diagram: C/C++ source code with #pragma omp annotations -> compiler -> binary code + extended TDG (eTDG) -> static scheduler + timing analysis (design time) -> OpenMP RTE dispatcher on the many-core (run time).]
46 Open-Next: offload using OpenMP

void main(){
    int a[];
    int ker_id;

    /* some code here */

    #pragma omp offload \
            shared (a) \
            name (myker, ker_id) \
            nowait
    {
        /* offloaded code block */
    }

    #pragma omp parallel sections \
            proc_bind (spread)
    {
        #pragma omp section
        TASK_A();
        #pragma omp section
        TASK_B();
    }

    /* some more code here */

    #pragma omp wait (ker_id)
}

TASK_A(){
    int i;
    #pragma omp parallel proc_bind (close)
    #pragma omp for
    for( i=0; ... )
        do_smthg();
}

- offload: new OpenMP directive used to offload the execution of a code block to the accelerator.
- shared clause: specifies data that needs to be shared between host and accelerator.
- New name clause: retrieves a handle (an ID) to the kernel instance at runtime, necessary to wait for asynchronous offloads.
- nowait: specifies asynchronous offloads.
- All standard OpenMP constructs and custom extensions can be used within an offload block.
53 Early evaluation
[Charts: speedup vs. number of kernel repetitions for the FAST, CT, Mahala, Strassen, NCC and SHOT kernels.]
Simplifying Heterogeneous Embedded SoC Programming with Directive-Based Offload. Marongiu, Capotondi, Tagliavini, Benini. IEEE Transactions on Industrial Informatics, 2015.
55 Accelerator resource sharing
- There is no dominant standard parallel programming model (PPM): OpenMP, OpenCL, OpenVX, TBB all sit on top of a low-level runtime and a hardware abstraction layer.
- Goal: improve the overall utilization of accelerators in multi-user environments.
- On PMCAs, runtime environments (RTEs) are typically developed on top of bare metal.
- Legacy applications.
61 Accelerator resource sharing: virtual accelerators
Multiple offloads (O1 ... ON) from OpenMP, OpenCL and OpenVX applications go through the host driver to virtual accelerators, enabled by lightweight spatial partitioning support.
65 Runtime efficiency: computer vision use-case
- ORB object detector (OpenCL, 4 clusters) [1]
- Face detector (OpenCL, 1 cluster) [2]
- FAST corner detector (OpenMP, 1 cluster) [3]
- Abandoned/removed object detector (OpenMP, 4 clusters) [4]
[1] Rublee, Ethan, et al. "ORB: an efficient alternative to SIFT or SURF." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE.
[2] Jones, Michael, et al. "Fast multi-view face detection." Mitsubishi Electric Research Lab TR (2003): 14.
[3] Rosten, et al. "Faster and better: A machine learning approach to corner detection." Pattern Analysis and Machine Intelligence, IEEE Transactions on 32.1 (2010).
[4] Magno, Michele, et al. "Multimodal abandoned/removed object detection for low power video surveillance systems." Advanced Video and Signal Based Surveillance, AVSS '09. Sixth IEEE International Conference on. IEEE.
68 Runtime efficiency: computer vision use-case
[Chart: efficiency (% vs. ideal) over number of frames, for MPM-MO, SPM-MO (0%/25%/50%/100%) and SPM-SO configurations.]
Result: ~90% efficient with respect to ideal; +30% efficiency vs. SPM-MO; +40% efficiency vs. SPM-SO.
70 Heterogeneous unified shared memory
Shared memory for accelerators in embedded SoCs: no clear view about practical implementation aspects and performance implications of virtual memory support.
Today's reality: memory partitioning.
- Coherent virtual memory for the host.
- The accelerator can only access a contiguous section of shared main memory: no virtual memory.
- Explicit data management involving copies: limited programmability, low performance.
Open-Next goal: lightweight virtual memory support.
- Sharing of virtual address pointers, transparent to the application developer.
- Zero-copy offload, higher predictability.
- Low complexity, low area, low cost.
76 Heterogeneous unified shared memory: not only many-core accelerators!
Heterogeneous systems increase computing power and energy efficiency:
- The host executes control-intensive and sequential tasks.
- Fine-grained offloading of highly parallel tasks, e.g. to an FPGA CNN deep-learning accelerator.
- Host and accelerator communicate via coherent shared memory; an IOMMU provides hUMA (heterogeneous uniform memory access) in high-end SoCs.
- ZERO-COPY (transparent) virtual pointer sharing moves the complexity from the software to the hardware.
80 Heterogeneous unified shared memory: low-cost IOMMU
- Host: dual-core ARM Cortex-A9, Linux kernel 3.13.
- Accelerator: PULP implemented in the FPGA (
- First open-source RISC-V core.
84 Open-Next CIRI-ICT activities
- Identification of the reference programming model for the multi- and many-core platforms that implement the project's use-cases.
- Implementation of software mechanisms to ease programming and make data exchange more efficient in heterogeneous shared-memory architectures composed of a host with virtual memory support and accelerators without virtual memory support (e.g. GPU, DSP, FPGA).
- Implementation of software mechanisms for the high-level management of functions accelerated by dedicated hardware (FPGA).
- Identification of possible programming-model extensions for the next generation of real-time industrial plants.
- Porting of significant kernels extracted from the applications that implement the use-cases, and performance analysis.
85 Open-Next CIRI-ICT: Unibo + your industrial use-cases!
- More than 10 years of experience in embedded many-core programming.
- 36 person-months for industrial use-case exploration.
- Move from workstations to efficient embedded systems!
86 High-performance, low-power data processing on embedded many-core and FPGA architectures
More informationOpenMP tasking model for Ada: safety and correctness
www.bsc.es www.cister.isep.ipp.pt OpenMP tasking model for Ada: safety and correctness Sara Royuela, Xavier Martorell, Eduardo Quiñones and Luis Miguel Pinho Vienna (Austria) June 12-16, 2017 Parallel
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationIntel Xeon Phi Coprocessors
Intel Xeon Phi Coprocessors Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013 Ring Bus on Intel Xeon Phi Example with 8 cores Xeon
More informationCarlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)
Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain) 4th IEEE International Workshop of High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB
More informationPreparing for Highly Parallel, Heterogeneous Coprocessing
Preparing for Highly Parallel, Heterogeneous Coprocessing Steve Lantz Senior Research Associate Cornell CAC Workshop: Parallel Computing on Ranger and Lonestar May 17, 2012 What Are We Talking About Here?
More informationThe rcuda middleware and applications
The rcuda middleware and applications Will my application work with rcuda? rcuda currently provides binary compatibility with CUDA 5.0, virtualizing the entire Runtime API except for the graphics functions,
More informationWelcome. Altera Technology Roadshow 2013
Welcome Altera Technology Roadshow 2013 Altera at a Glance Founded in Silicon Valley, California in 1983 Industry s first reprogrammable logic semiconductors $1.78 billion in 2012 sales Over 2,900 employees
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More informationTiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation
Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation Jianting Zhang 1,2 Simin You 2, Le Gruenwald 3 1 Depart of Computer Science, CUNY City College (CCNY) 2 Department of Computer
More informationDeep Learning: Transforming Engineering and Science The MathWorks, Inc.
Deep Learning: Transforming Engineering and Science 1 2015 The MathWorks, Inc. DEEP LEARNING: TRANSFORMING ENGINEERING AND SCIENCE A THE NEW RISE ERA OF OF GPU COMPUTING 3 NVIDIA A IS NEW THE WORLD S ERA
More informationBarcelona Supercomputing Center
www.bsc.es Barcelona Supercomputing Center Centro Nacional de Supercomputación EMIT 2016. Barcelona June 2 nd, 2016 Barcelona Supercomputing Center Centro Nacional de Supercomputación BSC-CNS objectives:
More informationTo hear the audio, please be sure to dial in: ID#
Introduction to the HPP-Heterogeneous Processing Platform A combination of Multi-core, GPUs, FPGAs and Many-core accelerators To hear the audio, please be sure to dial in: 1-866-440-4486 ID# 4503739 Yassine
More informationThe Mont-Blanc Project
http://www.montblanc-project.eu The Mont-Blanc Project Daniele Tafani Leibniz Supercomputing Centre 1 Ter@tec Forum 26 th June 2013 This project and the research leading to these results has received funding
More informationBuilding supercomputers from commodity embedded chips
http://www.montblanc-project.eu Building supercomputers from commodity embedded chips Alex Ramirez Barcelona Supercomputing Center Technical Coordinator This project and the research leading to these results
More informationARM+DSP - a winning combination on Qseven
...embedding excellence ARM+DSP - a winning combination on Qseven 1 ARM Conference Munich July 2012 ARM on Qseven your first in module technology Over 6 Billion ARM-based chips sold in 2010 10% market
More informationImplementation of Deep Convolutional Neural Net on a Digital Signal Processor
Implementation of Deep Convolutional Neural Net on a Digital Signal Processor Elaina Chai December 12, 2014 1. Abstract In this paper I will discuss the feasibility of an implementation of an algorithm
More informationProfiling and Debugging OpenCL Applications with ARM Development Tools. October 2014
Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline
More informationA Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013
A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company
More informationHSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!
Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationNOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer
NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY Peter Messmer pmessmer@nvidia.com COMPUTATIONAL CHALLENGES IN HEP Low-Level Trigger High-Level Trigger Monte Carlo Analysis Lattice QCD 2 COMPUTATIONAL
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationEyeCheck Smart Cameras
EyeCheck Smart Cameras 2 3 EyeCheck 9xx & 1xxx series Technical data Memory: DDR RAM 128 MB FLASH 128 MB Interfaces: Ethernet (LAN) RS422, RS232 (not EC900, EC910, EC1000, EC1010) EtherNet / IP PROFINET
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationEmbedded Systems: Architecture
Embedded Systems: Architecture Jinkyu Jeong (Jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ICE3028: Embedded Systems Design, Fall 2018, Jinkyu Jeong (jinkyu@skku.edu)
More informationCOMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES
COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:
More informationAccelerating Financial Applications on the GPU
Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General
More informationFiPS and M2DC: Novel Architectures for Reconfigurable Hyperscale Servers
FiPS and M2DC: Novel Architectures for Reconfigurable Hyperscale Servers Rene Griessl, Meysam Peykanu, Lennart Tigges, Jens Hagemeyer, Mario Porrmann Center of Excellence Cognitive Interaction Technology
More informationECE 574 Cluster Computing Lecture 18
ECE 574 Cluster Computing Lecture 18 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 2 April 2019 HW#8 was posted Announcements 1 Project Topic Notes I responded to everyone s
More informationBuilding supercomputers from embedded technologies
http://www.montblanc-project.eu Building supercomputers from embedded technologies Alex Ramirez Barcelona Supercomputing Center Technical Coordinator This project and the research leading to these results
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationGPUs and GPGPUs. Greg Blanton John T. Lubia
GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware
More informationAccelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing
Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product
More informationIntroduction II. Overview
Introduction II Overview Today we will introduce multicore hardware (we will introduce many-core hardware prior to learning OpenCL) We will also consider the relationship between computer hardware and
More informationWHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016
WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid
More informationThe Era of Heterogeneous Computing
The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------
More informationFrom Application to Technology OpenCL Application Processors Chung-Ho Chen
From Application to Technology OpenCL Application Processors Chung-Ho Chen Computer Architecture and System Laboratory (CASLab) Department of Electrical Engineering and Institute of Computer and Communication
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationParallella: A $99 Open Hardware Parallel Computing Platform
Inventing the Future of Computing Parallella: A $99 Open Hardware Parallel Computing Platform Andreas Olofsson andreas@adapteva.com IPDPS May 22th, Cambridge, MA Adapteva Achieves 3 World Firsts 1. First
More information[Potentially] Your first parallel application
[Potentially] Your first parallel application Compute the smallest element in an array as fast as possible small = array[0]; for( i = 0; i < N; i++) if( array[i] < small ) ) small = array[i] 64-bit Intel
More informationOmpCloud: Bridging the Gap between OpenMP and Cloud Computing
OmpCloud: Bridging the Gap between OpenMP and Cloud Computing Hervé Yviquel, Marcio Pereira and Guido Araújo University of Campinas (UNICAMP), Brazil A bit of background qguido Araujo, PhD Princeton University
More informationTR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut
TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationUse ZCU102 TRD to Accelerate Development of ZYNQ UltraScale+ MPSoC
Use ZCU102 TRD to Accelerate Development of ZYNQ UltraScale+ MPSoC Topics Hardware advantages of ZYNQ UltraScale+ MPSoC Software stacks of MPSoC Target reference design introduction Details about one Design
More informationCSC573: TSHA Introduction to Accelerators
CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures
More informationFeature Detection Plugins Speed-up by
Feature Detection Plugins Speed-up by OmpSs@FPGA Nicola Bettin Daniel Jimenez-Gonzalez Xavier Martorell Pierangelo Nichele Alberto Pomella nicola.bettin@vimar.com, pierangelo.nichele@vimar.com, alberto.pomella@vimar.com
More informationA framework for optimizing OpenVX Applications on Embedded Many Core Accelerators
A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators Giuseppe Tagliavini, DEI University of Bologna Germain Haugou, IIS ETHZ Andrea Marongiu, DEI University of Bologna & IIS
More informationIntegrating CPU and GPU, The ARM Methodology. Edvard Sørgård, Senior Principal Graphics Architect, ARM Ian Rickards, Senior Product Manager, ARM
Integrating CPU and GPU, The ARM Methodology Edvard Sørgård, Senior Principal Graphics Architect, ARM Ian Rickards, Senior Product Manager, ARM The ARM Business Model Global leader in the development of
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationHow GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige
How GPUs can find your next hit: Accelerating virtual screening with OpenCL Simon Krige ACS 2013 Agenda > Background > About blazev10 > What is a GPU? > Heterogeneous computing > OpenCL: a framework for
More informationReal-Time Support for GPU. GPU Management Heechul Yun
Real-Time Support for GPU GPU Management Heechul Yun 1 This Week Topic: Real-Time Support for General Purpose Graphic Processing Unit (GPGPU) Today Background Challenges Real-Time GPU Management Frameworks
More informationMYC-C7Z010/20 CPU Module
MYC-C7Z010/20 CPU Module - 667MHz Xilinx XC7Z010/20 Dual-core ARM Cortex-A9 Processor with Xilinx 7-series FPGA logic - 1GB DDR3 SDRAM (2 x 512MB, 32-bit), 4GB emmc, 32MB QSPI Flash - On-board Gigabit
More informationTutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012
More informationHeterogeneous Architecture. Luca Benini
Heterogeneous Architecture Luca Benini lbenini@iis.ee.ethz.ch Intel s Broadwell 03.05.2016 2 Qualcomm s Snapdragon 810 03.05.2016 3 AMD Bristol Ridge Departement Informationstechnologie und Elektrotechnik
More informationSoC Platforms and CPU Cores
SoC Platforms and CPU Cores COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University
More informationJCudaMP: OpenMP/Java on CUDA
JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems
More informationGPU ACCELERATED DATABASE MANAGEMENT SYSTEMS
CIS 601 - Graduate Seminar Presentation 1 GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS PRESENTED BY HARINATH AMASA CSU ID: 2697292 What we will talk about.. Current problems GPU What are GPU Databases GPU
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationAn Evaluation of Unified Memory Technology on NVIDIA GPUs
An Evaluation of Unified Memory Technology on NVIDIA GPUs Wenqiang Li 1, Guanghao Jin 2, Xuewen Cui 1, Simon See 1,3 Center for High Performance Computing, Shanghai Jiao Tong University, China 1 Tokyo
More informationTrends in the Infrastructure of Computing
Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More information