High-performance, low-power data processing on embedded many-core and FPGA architectures


High-performance, low-power data processing on embedded many-core and FPGA architectures
Dott. Alessandro Capotondi (alessandro.capotondi@unibo.it)
Prof. Davide Rossi (davide.rossi@unibo.it)

Agenda:
- Many-core introduction
- CIRI ICT OpenMP technologies: productive parallel programming models; accelerator virtualization for high-performance and power-efficient computation (resource sharing among applications, heterogeneous unified shared memory)
- Conclusions

Collaborations: academia and industry. EU projects: P-SOCRATES (FP7 ICT GRANT N), ERC GRANT N.

The Advent of the Heterogeneous Many-Core Architecture Era in High-Performance Embedded Systems
Embedded systems now need to process workloads usually tailored for workstations or HPC. Multi-Processor Systems-on-Chip (MPSoCs) integrate computing units in the same die, designed to deliver high performance at low power consumption, i.e. high energy efficiency (GOPS/Watt). Various design schemes are available:
- Targeting ADAPTIVITY: heterogeneity and specialization for efficient computing (e.g. ARM big.LITTLE).
- Targeting PARALLELISM: massively parallel many-core accelerators to maximize GOPS/Watt (GPUs, GPGPUs, PMCAs).
Example: the NVIDIA Tegra K1, with two levels of heterogeneity: a host processor (4 powerful cores + 1 energy-efficient core) and a parallel many-core coprocessor (a 192-core NVIDIA Kepler GPU accelerator).

Nvidia K1 (Jetson)
Hardware features: 5" x 5" (127 mm x 127 mm) board; Tegra K1 SoC (1 to 5 Watts): NVIDIA Kepler GPU with 192 CUDA cores (326 GFLOPS), NVIDIA "4-Plus-1" 2.32 GHz ARM quad-core Cortex-A15; DRAM: 2 GB DDR3L 933 MHz.
I/O features: mini-PCIe, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO.
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs including H.264, OpenCV4Tegra.
200 $, with a big community of users!

Nvidia TX1
Hardware features: 8" x 8" board; Tegra TX1 SoC (15 Watts): NVIDIA Maxwell GPU with 256 NVIDIA CUDA cores (1 TFLOP/s), quad-core ARM Cortex-A57 MPCore processor, 4 GB LPDDR4 memory.
I/O features: PCI-E x4, 5 MP CSI, SD/MMC card, USB 3.0/2.0, HDMI, RS232, Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 7x GPIO.
Software features: CUDA 6.0, OpenGL 4.4, OpenMAX IL multimedia codecs including H.264, OpenCV4Tegra.
599 $, with workstation-comparable performance.

TI Keystone II
Hardware features: 8" x 8" board; TI 66AK2H12 SoC (14 Watts): 8x C6600 DSP cores at 1.2 GHz (304 GMACs), quad-core ARM Cortex-A15 MPCore, up to 4 GB DDR3 memory.
I/O features: PCI-E, SD/MMC card, USB 3.0/2.0, 2x Gigabit Ethernet, SATA, JTAG, UART, 3x I2C, 38x GPIO, HyperLink, SRIO, 20x 64-bit timers, security accelerators.
Software features: OpenMP, OpenCL.
Evaluation board 1000 $. Targeted as a signal-processing accelerator.

Kalray MPPA
Evaluation board 1000 $. Targeting high-performance, time-critical missions: aerospace, military, autonomous driving, industrial robotics.

Programmable many-core accelerators (PMCA)

Challenges:
- Fast programmability: high-productivity programming techniques
- Time predictability for industrial applications
- Accelerator virtualization for high-performance and power-efficient computation: resource sharing among applications, heterogeneous unified shared memory

Performance is not a free meal
[Chart: measured thread-level parallelism (scale 0 to 2) for common mobile apps on Android and Apple devices: iFunny, Netflix, Candy Crush Saga, My Talking Tom, BS Player, LinkedIn, Google Drive, Instagram, YouTube, Dropbox, Facebook, Twitter]
Tests [*] based on common mobile applications show that real platforms are still far from materializing the potential parallelism provided by the hardware: the average TLP over 52 apps is 1.22 on Android and 1.36 on Apple.
[*] Analysis of the Effective Use of Thread-Level Parallelism in Mobile Applications: A preliminary study on iOS and Android devices, Ethan Bogdan, Hongin Yun.

Parallel programming models:
- Proprietary programming models
- The Khronos standard for heterogeneous computing (OpenCL)
- The standard for shared-memory systems (OpenMP)
- Academic proposals: OmpSs, OpenHMPP

OpenMP
- De-facto standard for shared-memory programming.
- Support for nested (multi-level) parallelism: good for clusters.
- Annotations incrementally convey parallelism to the compiler: increased ease of use.
- Based on well-understood programming practices (shared memory, the C language): increased productivity.
- 2x to 10x fewer lines of code than OpenCL ("OpenCL for programming shared memory multicore CPUs", Akhtar Ali, Usman Dastgeer, Christoph Kessler).
But:
1. OpenMP was designed for uniform SMPs with a single main shared memory: it lacks constructs to control accelerators.
2. And the compilation toolchain has to deal with multiple ISAs and multiple runtime systems too!
(Intel's Parallel Universe magazine, May 2014)

Open-Next OpenMP 4.0 runtime: what's new? UNTIED tasks.

Open-Next OpenMP runtime: comparison with other OpenMP implementations on the RECURSIVE benchmark, on x86 (Intel Haswell): libgomp, the GNU OpenMP implementation (GCC 4.9.2), and iomp, the Intel OpenMP implementation (ICC).

Open-Next OpenMP 4.0 runtime: comparison with other tasking runtimes on the RECURSIVE benchmark, on x86 (Intel Haswell): nanos, the BSC OmpSs runtime (Mercurium + Nanos++); Intel Cilk Plus (ICC 15.0.2); Intel TBB (ICC 15.0.2); Wool (GCC 4.9.2).

Time Predictability
- At compile time, generate a task dependency graph (TDG) that includes timing information, to account for the tasks' communication.
- At design time, assign the TDG to OS threads (mapping).
- At run time, schedule the OS threads to achieve both predictability and high performance (scheduling).
Toolflow: C/C++ source code with #pragma omp annotations -> compiler -> binary code with newtask() calls plus extended TDG (eTDG) -> static scheduler + timing analysis (design time) -> OpenMP RTE dispatcher on the many-core (run time).

Open-Next: offload using OpenMP

void main() {
    int a[];
    int ker_id;

    /* some code here */

    #pragma omp offload \
            shared(a) \
            name("myker", ker_id) \
            nowait
    { /* offloaded code */ }

    #pragma omp parallel sections \
            proc_bind(spread)
    {
        #pragma omp section
        TASK_A();
        #pragma omp section
        TASK_B();
    }

    /* some more code here */

    #pragma omp wait(ker_id)
}

TASK_A() {
    int i;
    #pragma omp parallel proc_bind(close)
    #pragma omp for
    for (i = 0; ... )
        do_smthg();
}

- offload: new OpenMP directive used to offload the execution of a code block to the accelerator.
- shared: clause specifying data that needs to be shared between the host and the accelerator.
- name (new clause): retrieves a handle (an ID) to the kernel instance at runtime, necessary to wait for asynchronous offloads.
- nowait: specifies an asynchronous offload; the later #pragma omp wait(ker_id) blocks until the kernel completes.
- All standard OpenMP constructs and the custom extensions can be used within an offload block (e.g. TASK_A runs a parallel loop on the host with proc_bind(close)).

Early evaluation
[Charts: speedup vs. kernel repetitions for the FAST, CT, Mahala, Strassen, NCC, and SHOT kernels]
Simplifying Heterogeneous Embedded SoC Programming with Directive-Based Offload. Marongiu, Capotondi, Tagliavini, Benini. IEEE Transactions on Industrial Informatics, 2015.

Accelerator Resource Sharing
- There is no dominant standard parallel programming model (PPM): OpenMP, OpenCL, OpenVX, and TBB all sit on top of a low-level runtime and a hardware abstraction layer.
- Goal: improve the overall utilization of accelerators in multi-user environments.
- On PMCAs, runtime environments (RTEs) are typically developed on top of bare metal.
- Legacy applications must keep working.

Accelerator Resource Sharing: virtual accelerators
Multiple offloads (O1 ... ON) from different programming models (OpenMP, OpenCL, OpenVX) are dispatched through the host driver to virtual accelerators, backed by lightweight spatial partitioning support in the accelerator.

Runtime Efficiency: computer vision use-case
Four concurrent applications share the accelerator:
- ORB object detector (OpenCL, 4 clusters) [1]
- Face detector (OpenCL, 1 cluster) [2]
- FAST corner detector (OpenMP, 1 cluster) [3]
- Abandoned/removed object detector (OpenMP, 4 clusters) [4]
[Chart: efficiency vs. ideal (0-100%) over the number of frames, for MPM-MO, SPM-MO (0%, 25%, 50%, 100%), and SPM-SO configurations]
Results: 90% efficiency w.r.t. ideal; +30% efficiency w.r.t. SPM-MO; +40% efficiency w.r.t. SPM-SO.
[1] Rublee, Ethan, et al. "ORB: an efficient alternative to SIFT or SURF." ICCV 2011.
[2] Jones, Michael, et al. "Fast multi-view face detection." Mitsubishi Electric Research Lab TR (2003).
[3] Rosten, et al. "Faster and better: a machine learning approach to corner detection." IEEE TPAMI 32.1 (2010).
[4] Magno, Michele, et al. "Multimodal abandoned/removed object detection for low power video surveillance systems." AVSS 2009.

Heterogeneous unified shared memory
Shared memory for accelerators in embedded SoCs: there is no clear view of the practical implementation aspects and performance implications of virtual memory support.
Today's reality: memory partitioning.
- Coherent virtual memory for the host.
- The accelerator can only access a contiguous section of shared main memory; it has no virtual memory.
- Explicit data management involving copies: limited programmability, low performance.
Open-Next goal: lightweight virtual memory support.
- Sharing of virtual address pointers, transparent to the application developer.
- Zero-copy offload, higher predictability.
- Low complexity, low area, low cost.

Heterogeneous unified shared memory: heterogeneous systems
- Increase computing power and energy efficiency.
- The host executes control-intensive and sequential tasks and performs fine-grained OFFLOAD of highly parallel tasks to the accelerator.
- Host and accelerator communicate via coherent shared memory.
- An IOMMU provides heterogeneous uniform memory access (hUMA) in high-end SoCs.
- ZERO-COPY (transparent) virtual pointer sharing moves the complexity from the software to the hardware.
- Not only many-core accelerators: e.g. an FPGA CNN deep-learning accelerator.

Heterogeneous unified shared memory: low-cost IOMMU
- Host: dual-core ARM Cortex-A9, Linux kernel 3.13.
- Accelerator: PULP, implemented on the FPGA; features the first open-source RISC-V core.

Open-Next CIRI-ICT activities
- Identify the reference programming model for the multi- and many-core platforms that implement the project's use cases.
- Implement software mechanisms to ease programming and make data exchange more efficient in heterogeneous shared-memory architectures composed of a host with virtual memory support and accelerators without it (e.g. GPU, DSP, FPGA).
- Implement software mechanisms for high-level management of functions accelerated by dedicated hardware (FPGA).
- Identify possible programming-model extensions for the next generation of real-time industrial plants.
- Port significant kernels extracted from the use-case applications and analyze their performance.

Open-Next CIRI-ICT + Unibo + your industrial use cases!
- More than 10 years of experience in embedded many-core programming.
- 36 person-months for industrial use-case exploration.
- Move from workstations to efficient embedded systems!

High-performance, low-power data processing on embedded many-core and FPGA architectures

Real-time data processing on embedded many-core and FPGA architectures
Davide Rossi, Alessandro Capotondi, Giuseppe Tagliavini, Andrea Marongiu (CIRI-ICT)


INTEGRATING COMPUTER VISION SENSOR INNOVATIONS INTO MOBILE DEVICES. Eli Savransky Principal Architect - CTO Office Mobile BU NVIDIA corp. INTEGRATING COMPUTER VISION SENSOR INNOVATIONS INTO MOBILE DEVICES Eli Savransky Principal Architect - CTO Office Mobile BU NVIDIA corp. Computer Vision in Mobile Tegra K1 It s time! AGENDA Use cases categories

More information

. Micro SD Card Socket. SMARC 2.0 Compliant

. Micro SD Card Socket. SMARC 2.0 Compliant MSC SM2S-IMX6 NXP i.mx6 ARM Cortex -A9 Description The design of the MSC SM2S-IMX6 module is based on NXP s i.mx 6 processors offering quad-, dual- and single-core ARM Cortex -A9 compute performance at

More information

Mapping applications into MPSoC

Mapping applications into MPSoC Mapping applications into MPSoC concurrency & communication Jos van Eijndhoven jos@vectorfabrics.com March 12, 2011 MPSoC mapping: exploiting concurrency 2 March 12, 2012 Computation on general purpose

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Introduction to Runtime Systems

Introduction to Runtime Systems Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents

More information

Renderscript Accelerated Advanced Image and Video Processing on ARM Mali T-600 GPUs. Lihua Zhang, Ph.D. MulticoreWare Inc.

Renderscript Accelerated Advanced Image and Video Processing on ARM Mali T-600 GPUs. Lihua Zhang, Ph.D. MulticoreWare Inc. Renderscript Accelerated Advanced Image and Video Processing on ARM Mali T-600 GPUs Lihua Zhang, Ph.D. MulticoreWare Inc. lihua@multicorewareinc.com Overview More & more mobile apps are beginning to require

More information

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

E4-ARKA: ARM64+GPU+IB is Now Here Piero Altoè. ARM64 and GPGPU

E4-ARKA: ARM64+GPU+IB is Now Here Piero Altoè. ARM64 and GPGPU E4-ARKA: ARM64+GPU+IB is Now Here Piero Altoè ARM64 and GPGPU 1 E4 Computer Engineering Company E4 Computer Engineering S.p.A. specializes in the manufacturing of high performance IT systems of medium

More information

OpenMP tasking model for Ada: safety and correctness

OpenMP tasking model for Ada: safety and correctness www.bsc.es www.cister.isep.ipp.pt OpenMP tasking model for Ada: safety and correctness Sara Royuela, Xavier Martorell, Eduardo Quiñones and Luis Miguel Pinho Vienna (Austria) June 12-16, 2017 Parallel

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Intel Xeon Phi Coprocessors

Intel Xeon Phi Coprocessors Intel Xeon Phi Coprocessors Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013 Ring Bus on Intel Xeon Phi Example with 8 cores Xeon

More information

Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)

Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain) Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain) 4th IEEE International Workshop of High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB

More information

Preparing for Highly Parallel, Heterogeneous Coprocessing

Preparing for Highly Parallel, Heterogeneous Coprocessing Preparing for Highly Parallel, Heterogeneous Coprocessing Steve Lantz Senior Research Associate Cornell CAC Workshop: Parallel Computing on Ranger and Lonestar May 17, 2012 What Are We Talking About Here?

More information

The rcuda middleware and applications

The rcuda middleware and applications The rcuda middleware and applications Will my application work with rcuda? rcuda currently provides binary compatibility with CUDA 5.0, virtualizing the entire Runtime API except for the graphics functions,

More information

Welcome. Altera Technology Roadshow 2013

Welcome. Altera Technology Roadshow 2013 Welcome Altera Technology Roadshow 2013 Altera at a Glance Founded in Silicon Valley, California in 1983 Industry s first reprogrammable logic semiconductors $1.78 billion in 2012 sales Over 2,900 employees

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation

Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation Jianting Zhang 1,2 Simin You 2, Le Gruenwald 3 1 Depart of Computer Science, CUNY City College (CCNY) 2 Department of Computer

More information

Deep Learning: Transforming Engineering and Science The MathWorks, Inc.

Deep Learning: Transforming Engineering and Science The MathWorks, Inc. Deep Learning: Transforming Engineering and Science 1 2015 The MathWorks, Inc. DEEP LEARNING: TRANSFORMING ENGINEERING AND SCIENCE A THE NEW RISE ERA OF OF GPU COMPUTING 3 NVIDIA A IS NEW THE WORLD S ERA

More information

Barcelona Supercomputing Center

Barcelona Supercomputing Center www.bsc.es Barcelona Supercomputing Center Centro Nacional de Supercomputación EMIT 2016. Barcelona June 2 nd, 2016 Barcelona Supercomputing Center Centro Nacional de Supercomputación BSC-CNS objectives:

More information

To hear the audio, please be sure to dial in: ID#

To hear the audio, please be sure to dial in: ID# Introduction to the HPP-Heterogeneous Processing Platform A combination of Multi-core, GPUs, FPGAs and Many-core accelerators To hear the audio, please be sure to dial in: 1-866-440-4486 ID# 4503739 Yassine

More information

The Mont-Blanc Project

The Mont-Blanc Project http://www.montblanc-project.eu The Mont-Blanc Project Daniele Tafani Leibniz Supercomputing Centre 1 Ter@tec Forum 26 th June 2013 This project and the research leading to these results has received funding

More information

Building supercomputers from commodity embedded chips

Building supercomputers from commodity embedded chips http://www.montblanc-project.eu Building supercomputers from commodity embedded chips Alex Ramirez Barcelona Supercomputing Center Technical Coordinator This project and the research leading to these results

More information

ARM+DSP - a winning combination on Qseven

ARM+DSP - a winning combination on Qseven ...embedding excellence ARM+DSP - a winning combination on Qseven 1 ARM Conference Munich July 2012 ARM on Qseven your first in module technology Over 6 Billion ARM-based chips sold in 2010 10% market

More information

Implementation of Deep Convolutional Neural Net on a Digital Signal Processor

Implementation of Deep Convolutional Neural Net on a Digital Signal Processor Implementation of Deep Convolutional Neural Net on a Digital Signal Processor Elaina Chai December 12, 2014 1. Abstract In this paper I will discuss the feasibility of an implementation of an algorithm

More information

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014 Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline

More information

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013 A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company

More information

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY Peter Messmer pmessmer@nvidia.com COMPUTATIONAL CHALLENGES IN HEP Low-Level Trigger High-Level Trigger Monte Carlo Analysis Lattice QCD 2 COMPUTATIONAL

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

EyeCheck Smart Cameras

EyeCheck Smart Cameras EyeCheck Smart Cameras 2 3 EyeCheck 9xx & 1xxx series Technical data Memory: DDR RAM 128 MB FLASH 128 MB Interfaces: Ethernet (LAN) RS422, RS232 (not EC900, EC910, EC1000, EC1010) EtherNet / IP PROFINET

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Embedded Systems: Architecture

Embedded Systems: Architecture Embedded Systems: Architecture Jinkyu Jeong (Jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ICE3028: Embedded Systems Design, Fall 2018, Jinkyu Jeong (jinkyu@skku.edu)

More information

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:

More information

Accelerating Financial Applications on the GPU

Accelerating Financial Applications on the GPU Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General

More information

FiPS and M2DC: Novel Architectures for Reconfigurable Hyperscale Servers

FiPS and M2DC: Novel Architectures for Reconfigurable Hyperscale Servers FiPS and M2DC: Novel Architectures for Reconfigurable Hyperscale Servers Rene Griessl, Meysam Peykanu, Lennart Tigges, Jens Hagemeyer, Mario Porrmann Center of Excellence Cognitive Interaction Technology

More information

ECE 574 Cluster Computing Lecture 18

ECE 574 Cluster Computing Lecture 18 ECE 574 Cluster Computing Lecture 18 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 2 April 2019 HW#8 was posted Announcements 1 Project Topic Notes I responded to everyone s

More information

Building supercomputers from embedded technologies

Building supercomputers from embedded technologies http://www.montblanc-project.eu Building supercomputers from embedded technologies Alex Ramirez Barcelona Supercomputing Center Technical Coordinator This project and the research leading to these results

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product

More information

Introduction II. Overview

Introduction II. Overview Introduction II Overview Today we will introduce multicore hardware (we will introduce many-core hardware prior to learning OpenCL) We will also consider the relationship between computer hardware and

More information

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016

WHAT S NEW IN CUDA 8. Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Siddharth Sharma, Oct 2016 WHAT S NEW IN CUDA 8 Why Should You Care >2X Run Computations Faster* Solve Larger Problems** Critical Path Analysis * HOOMD Blue v1.3.3 Lennard-Jones liquid

More information

The Era of Heterogeneous Computing

The Era of Heterogeneous Computing The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------

More information

From Application to Technology OpenCL Application Processors Chung-Ho Chen

From Application to Technology OpenCL Application Processors Chung-Ho Chen From Application to Technology OpenCL Application Processors Chung-Ho Chen Computer Architecture and System Laboratory (CASLab) Department of Electrical Engineering and Institute of Computer and Communication

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Parallella: A $99 Open Hardware Parallel Computing Platform

Parallella: A $99 Open Hardware Parallel Computing Platform Inventing the Future of Computing Parallella: A $99 Open Hardware Parallel Computing Platform Andreas Olofsson andreas@adapteva.com IPDPS May 22th, Cambridge, MA Adapteva Achieves 3 World Firsts 1. First

More information

[Potentially] Your first parallel application

[Potentially] Your first parallel application [Potentially] Your first parallel application Compute the smallest element in an array as fast as possible small = array[0]; for( i = 0; i < N; i++) if( array[i] < small ) ) small = array[i] 64-bit Intel

More information

OmpCloud: Bridging the Gap between OpenMP and Cloud Computing

OmpCloud: Bridging the Gap between OpenMP and Cloud Computing OmpCloud: Bridging the Gap between OpenMP and Cloud Computing Hervé Yviquel, Marcio Pereira and Guido Araújo University of Campinas (UNICAMP), Brazil A bit of background qguido Araujo, PhD Princeton University

More information

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Use ZCU102 TRD to Accelerate Development of ZYNQ UltraScale+ MPSoC

Use ZCU102 TRD to Accelerate Development of ZYNQ UltraScale+ MPSoC Use ZCU102 TRD to Accelerate Development of ZYNQ UltraScale+ MPSoC Topics Hardware advantages of ZYNQ UltraScale+ MPSoC Software stacks of MPSoC Target reference design introduction Details about one Design

More information

CSC573: TSHA Introduction to Accelerators

CSC573: TSHA Introduction to Accelerators CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures

More information

Feature Detection Plugins Speed-up by

Feature Detection Plugins Speed-up by Feature Detection Plugins Speed-up by OmpSs@FPGA Nicola Bettin Daniel Jimenez-Gonzalez Xavier Martorell Pierangelo Nichele Alberto Pomella nicola.bettin@vimar.com, pierangelo.nichele@vimar.com, alberto.pomella@vimar.com

More information

A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators

A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators A framework for optimizing OpenVX Applications on Embedded Many Core Accelerators Giuseppe Tagliavini, DEI University of Bologna Germain Haugou, IIS ETHZ Andrea Marongiu, DEI University of Bologna & IIS

More information

Integrating CPU and GPU, The ARM Methodology. Edvard Sørgård, Senior Principal Graphics Architect, ARM Ian Rickards, Senior Product Manager, ARM

Integrating CPU and GPU, The ARM Methodology. Edvard Sørgård, Senior Principal Graphics Architect, ARM Ian Rickards, Senior Product Manager, ARM Integrating CPU and GPU, The ARM Methodology Edvard Sørgård, Senior Principal Graphics Architect, ARM Ian Rickards, Senior Product Manager, ARM The ARM Business Model Global leader in the development of

More information

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices

Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller

More information

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige How GPUs can find your next hit: Accelerating virtual screening with OpenCL Simon Krige ACS 2013 Agenda > Background > About blazev10 > What is a GPU? > Heterogeneous computing > OpenCL: a framework for

More information

Real-Time Support for GPU. GPU Management Heechul Yun

Real-Time Support for GPU. GPU Management Heechul Yun Real-Time Support for GPU GPU Management Heechul Yun 1 This Week Topic: Real-Time Support for General Purpose Graphic Processing Unit (GPGPU) Today Background Challenges Real-Time GPU Management Frameworks

More information

MYC-C7Z010/20 CPU Module

MYC-C7Z010/20 CPU Module MYC-C7Z010/20 CPU Module - 667MHz Xilinx XC7Z010/20 Dual-core ARM Cortex-A9 Processor with Xilinx 7-series FPGA logic - 1GB DDR3 SDRAM (2 x 512MB, 32-bit), 4GB emmc, 32MB QSPI Flash - On-board Gigabit

More information

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Tutorial Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012

More information

Heterogeneous Architecture. Luca Benini

Heterogeneous Architecture. Luca Benini Heterogeneous Architecture Luca Benini lbenini@iis.ee.ethz.ch Intel s Broadwell 03.05.2016 2 Qualcomm s Snapdragon 810 03.05.2016 3 AMD Bristol Ridge Departement Informationstechnologie und Elektrotechnik

More information

SoC Platforms and CPU Cores

SoC Platforms and CPU Cores SoC Platforms and CPU Cores COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University

More information

JCudaMP: OpenMP/Java on CUDA

JCudaMP: OpenMP/Java on CUDA JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems

More information

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS CIS 601 - Graduate Seminar Presentation 1 GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS PRESENTED BY HARINATH AMASA CSU ID: 2697292 What we will talk about.. Current problems GPU What are GPU Databases GPU

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

An Evaluation of Unified Memory Technology on NVIDIA GPUs

An Evaluation of Unified Memory Technology on NVIDIA GPUs An Evaluation of Unified Memory Technology on NVIDIA GPUs Wenqiang Li 1, Guanghao Jin 2, Xuewen Cui 1, Simon See 1,3 Center for High Performance Computing, Shanghai Jiao Tong University, China 1 Tokyo

More information

Trends in the Infrastructure of Computing

Trends in the Infrastructure of Computing Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information