E4-ARKA: ARM64+GPU+IB is Now Here. Piero Altoè. ARM64 and GPGPU


Transcription:

E4-ARKA: ARM64+GPU+IB is Now Here. Piero Altoè

E4 Computer Engineering: the Company
E4 Computer Engineering S.p.A. specializes in the manufacturing of mid- to high-range high-performance IT systems. Our products address both industrial and scientific research requirements and are deployed everywhere from universities to computing centres. Thanks to its established experience and quality in this field, E4 has become a valued technology supplier, acknowledged and appreciated by renowned and prestigious organizations.

The first ARM+GPU CUDA-based platform (2011)
ARKA Blade (CARMA DevKit, developed by SECO):
CPU: NVIDIA Tegra 3, ARM Cortex-A9 quad-core
GPU: NVIDIA Quadro 1000M with 96 CUDA cores
Memory: 2 GB (CPU) + 2 GB (GPU)
Peak performance: 270 single-precision GFLOPS
Network: 1x Gigabit Ethernet
Storage: 1x SATA 2.0 connector
USB: 3x USB 2.0
Display: 3x HDMI + serial console available

ARKA Cluster SHOC Benchmark (2011)
The Scalable Heterogeneous Computing Benchmark Suite (SHOC) is a collection of benchmark programs testing performance and stability (using CUDA and/or OpenCL).

Test        ARKA      Intel+M2075   Units    ARKA/M2075
maxspflops  263.12    1001.31       GFLOPS   26%
fft_sp      23.343    162.524       GFLOPS   14%
sgemm_n     132.482   666.272       GFLOPS   20%
dgemm_n     21.5152   315.110       GFLOPS   7%
md_sp_bw    5.5301    25.3975       GB/s     22%
reduction   23.1895   92.7343       GB/s     25%
sort        0.3090    1.6296        GB/s     19%
triad_bw    0.4279    6.0163        GB/s     7%
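
For reference, below is a minimal CUDA sketch of the kind of streaming kernel the triad_bw test above times: c = a + s*b over large device arrays, with bandwidth derived from the bytes moved per element. It is an illustration only, not the SHOC source; names, sizes and launch parameters are arbitrary choices.

// Minimal CUDA sketch of a STREAM-style triad (c = a + s*b), the kind of
// kernel SHOC's triad_bw test measures. Illustrative only, not SHOC code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void triad(const float *a, const float *b, float *c, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + s * b[i];
}

int main() {
    const int n = 1 << 24;                        // 16M elements per array
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    dim3 block(256), grid((n + 255) / 256);
    triad<<<grid, block>>>(a, b, c, 2.0f, n);     // warm-up launch

    cudaEventRecord(t0);
    triad<<<grid, block>>>(a, b, c, 2.0f, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    double gbps = 3.0 * bytes / (ms * 1.0e6);     // 2 loads + 1 store per element
    printf("triad bandwidth: %.2f GB/s\n", gbps);
    return 0;
}

Compiled natively on the ARM node with nvcc (for example, nvcc -arch=sm_35 triad.cu for a K20-class card), this reports sustained global-memory bandwidth in the same units as the table above.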

How we started (2012)

The first ARM+K20 server: EK002 (2013)
[ARKA EK002 block diagram: Q7 Tegra 3 SOM on a mini-ITX carrier board, PCIe x4 Gen 1 to a PLX PCIe switch, which connects the NVIDIA K20/K2000, 1 Gb Ethernet and 1x SATA to a 2.5" SATA SSD]
Node maximum power consumption: 250 W
Rpeak: 1.17 Tflops
4.5+ Gflops/W

EK002 Efficiency
[Chart: 8x more efficient, 5x more efficient]

Pedraforca Cluster (2013)
3 racks
78 compute nodes
2 login nodes
4x 36-port InfiniBand switches (MPI)
2x 50-port GbE switches (storage)

The X-Gene 1 SoC: 8 brawny custom cores
[Block diagram, "X-Gene: Highly Integrated Server on a Chip (TM)": eight 64-bit cores @ 2.4 GHz on an on-chip fabric, a high-speed memory controller (broad, fast memory), network accelerators (SDN etc.), 4x 10 Gb Ethernet, 16x PCIe, SATA, and a high-speed mixed-signal I/O physical layer]

RK003 Prototype
[Block diagram: mini-ITX APM Mustang dev board, PCIe x8 Gen 3 to a PLX PCIe switch, PCIe x4 links to the NVIDIA K20, 10 Gb Ethernet and 1x SATA to a 2.5" SATA SSD]
Node maximum power consumption: 250 W
Rpeak: 1.17 Tflops
4.68 Gflops/W

RK003 Prototype: CUDA bandwidth test
[root@apm2 bandwidthtest]# ./bandwidthtest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K20m
Quick Mode

Host to Device Bandwidth, 1 Device(s), PINNED Memory Transfers
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1430.7

Device to Host Bandwidth, 1 Device(s), PINNED Memory Transfers
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1639.0

Device to Device Bandwidth, 1 Device(s), PINNED Memory Transfers
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                146841.6

Result = PASS
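
For reference, here is a minimal sketch of what the bandwidth test measures in its pinned-memory mode: a page-locked host buffer allocated with cudaMallocHost is copied to the device repeatedly and timed with CUDA events. This is an illustration, not the NVIDIA sample's own code; the 32 MiB buffer mirrors the transfer size above, everything else is an arbitrary choice.

// Minimal sketch of a pinned host-to-device bandwidth measurement,
// in the spirit of the CUDA bandwidthTest sample (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32u << 20;              // 32 MiB, as in the output above
    float *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, bytes);                   // pinned (page-locked) host memory
    cudaMalloc(&d, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // warm-up copy

    const int reps = 20;
    cudaEventRecord(t0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("H2D pinned bandwidth: %.1f MB/s\n",
           (double)bytes * reps / (ms * 1.0e3));
    cudaFreeHost(h); cudaFree(d);
    return 0;
}

On the RK003 prototype, the host-to-device figure is bounded by the PCIe Gen 3 path through the PLX switch, which is why it sits far below the device-to-device number.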

RK003 Prototype: InfiniBand
ibv_rc_pingpong: 19548.12 Mbit/s, 3.69 µs
[Block diagram as above, with a ConnectX-3 HCA and the NVIDIA K20 each on a PCIe x4 link behind the PLX switch]

RK003 Prototype
[Same block diagram and figures as above: node maximum power consumption 250 W, Rpeak 1.17 Tflops, 4.68 Gflops/W]

The ARKA RK003 Server (1)
FEATURES
Form factor: 2U
CPU: 1x APM X-Gene, 8 cores
GPU: up to 2x NVIDIA Kepler K20, K40, K80
Memory: up to 128 GB DDR3 RAM
Network: 2x 10 GbE SFP+, 1x InfiniBand FDR QSFP
Storage: 4x SATA 3.0
Expansion slots: 2x PCI-E 3.0 x8 (in x16)
[Block diagram: 1U motherboard with X-Gene 1, two PCIe x8 Gen 3 links to a Mellanox ConnectX HCA and the NVIDIA K20/K40, 2x 10 Gb Ethernet, 4x SATA to a 2.5" SATA SSD]

The ARKA RK003 Server (2)
OS: Ubuntu derivative for ARM, 14.04 LTS
DEVELOPMENT TOOLS:
NVIDIA CUDA 6.5 (compilers, libraries, SDK)
MPI 2.0 libraries
GNU compilers
Scalable Heterogeneous Computing (SHOC) Benchmark Suite
CLUSTER, MONITORING AND MANAGEMENT TOOLS (optional):
Resource/queue manager
Cluster monitoring tools
Parallel shell for cluster-wide commands
Bare-metal restore
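
As a quick sanity check of that toolchain on the ARM64 node, a few lines of CUDA runtime-API code, compiled natively with nvcc, are enough to confirm the driver sees the Tesla card. This is a minimal sketch in the spirit of the SDK's deviceQuery sample, not the sample itself.

// Minimal device-enumeration sketch (in the spirit of deviceQuery).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        printf("no CUDA device visible\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("Device %d: %s, SM %d.%d, %.1f GB, %d multiprocessors\n",
               i, p.name, p.major, p.minor,
               p.totalGlobalMem / 1073741824.0, p.multiProcessorCount);
    }
    return 0;
}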

The ARKA RK003 Server (3)

The ARKA RK003 GPU Performance (SHOC)
Device 0: 'Tesla K20m'
Device selection not specified: defaulting to device #0.
Using size class: 1
--- Starting Benchmarks ---
Running benchmark BusSpeedDownload
  result for bspeed_download:    3.1592 GB/sec
Running benchmark BusSpeedReadback
  result for bspeed_readback:    3.3575 GB/sec
Running benchmark MaxFlops
  result for maxspflops:      3095.8700 GFLOPS
  result for maxdpflops:      1165.9700 GFLOPS
Running benchmark DeviceMemory
  result for gmem_readbw:      147.9020 GB/s
  result for gmem_writebw:     139.9250 GB/s
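
For context, maxspflops/maxdpflops are compute-bound microbenchmarks: long runs of fused multiply-adds kept in registers, counted as two flops each. The sketch below illustrates the idea under that assumption; it is not the SHOC kernel, and a dependent FMA chain like this will not reach the card's peak the way SHOC's unrolled independent chains do.

// Minimal sketch of a peak-FLOPS style microbenchmark (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_loop(float *out, float seed, int iters) {
    float a = seed + threadIdx.x, b = 0.9999f * seed;
    for (int i = 0; i < iters; ++i) {
        a = a * b + 0.5f;      // 1 FMA = 2 flops
        b = b * a + 0.5f;      // 1 FMA = 2 flops
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;  // keep results live
}

int main() {
    const int blocks = 1024, threads = 256, iters = 100000;
    float *out;
    cudaMalloc(&out, blocks * threads * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    fma_loop<<<blocks, threads>>>(out, 1.0f, iters);   // warm-up
    cudaEventRecord(t0);
    fma_loop<<<blocks, threads>>>(out, 1.0f, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    double flops = 2.0 * 2.0 * (double)iters * blocks * threads;  // 2 FMAs per iteration
    printf("sustained: %.1f GFLOPS\n", flops / (ms * 1.0e6));
    cudaFree(out);
    return 0;
}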


InfiniBand ping-pong test
ibv_rc_pingpong: latency 1.57 µs, bandwidth 39.2 Gb/s
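
The figures above come from the verbs-level ibv_rc_pingpong tool. At the application level the same FDR link is typically exercised through MPI; the sketch below is a two-rank ping-pong that passes device buffers straight to MPI_Send/MPI_Recv. It assumes a CUDA-aware MPI build (e.g. Open MPI compiled with CUDA support) and exactly two ranks, one per node; it is an illustration, not the tool used for the measurement above, and all names and sizes are arbitrary.

// Minimal two-rank ping-pong from GPU memory. Assumes a CUDA-aware MPI;
// illustrative only, not the verbs-level test shown on the slide.
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                        // 1M floats (~4 MB) per message
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));          // device buffer handed to MPI

    const int reps = 100;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i) {
        if (rank == 0) {
            MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("avg one-way time: %.2f us, bandwidth: %.2f Gb/s\n",
               dt * 1e6 / (2.0 * reps),
               (2.0 * reps * n * sizeof(float) * 8) / (dt * 1e9));
    cudaFree(buf);
    MPI_Finalize();
    return 0;
}

Run with two ranks across the two nodes (e.g. mpirun -np 2); with a plain host-memory buffer instead of cudaMalloc, the same code works with any MPI build.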

LAMMPS Benchmark
[Chart: time to solution (s), 1x RK003 vs 32-core x86]
SETUP
Compilers: Intel compilers and MKL on the E5-2650v2; GCC and FFTW 3.2.2 on ARM
Hardware: dual E5-2650v2 with InfiniBand QDR; ARM X-Gene 1 + 1x NVIDIA K20m GPU
Input files: WdV, 1M particles

LAMMPS Benchmark (2)
[Chart: time to solution (s), 1x RK003 vs 1x E5-2630v3 + K20]
SETUP
Compilers: GNU, CUDA 6.5, cuBLAS and cuFFT
Hardware: dual E5-2630v3 + 1x NVIDIA K20m GPU; ARM X-Gene 1 + 1x NVIDIA K20m GPU
Input files: WdV, 1M particles, 1000 steps

Power Consumption
SETUP
Compilers: GNU, CUDA 6.5, cuBLAS and cuFFT
Hardware: dual E5-2630v3 + 1x NVIDIA K20m GPU; ARM X-Gene 1 + 1x NVIDIA K20m GPU
GPU performance (efficiency computed as MaxFlops divided by average load power):
RK003: average load power consumption 229 W, MaxFlops 1165 GFLOPS, 5.1 GFLOPS/W, idle power consumption 92 W
Xeon E5: average load power consumption 410 W, MaxFlops 1165 GFLOPS, 2.8 GFLOPS/W, idle power consumption 151 W

The Cavium ThunderX SoC
Up to 48 custom ARMv8 cores @ 2.5 GHz
Single- and dual-socket configurations, shipping today
Integrated PCIe, SATA, 10/40 GbE, Ethernet fabric
VirtSOC: low-latency end-to-end virtualization from VM to virtual port
Four workload-optimized families: Compute, Networking, Storage and Security

The ARKA RK004 Server
FEATURES
Form factor: 2U
CPU: 1x Cavium ThunderX, 48 cores
GPU: up to 1x NVIDIA Kepler K20, K40, K80
Memory: up to 128 GB DDR3 RAM
Network: 2x 10 GbE SFP+, 1x InfiniBand FDR QSFP, 1x 40 GbE QSFP
Storage: 4x SATA 3.0
Expansion slots: 1x PCI-E 3.0 x8 (in x16), 1x PCI-E 3.0 x8
BOOTH 430

Acknowledgments
Cosimo Gianfreda
Simone Tinti
Wissam Abu-Ahmmad
Francesca Tartaglione (now at OIST)
Marco Ciacala
Alessandro De Filippo
Filippo Mantovani

BOOTH 430