Maxwell: a 64-FPGA Supercomputer


Maxwell: a 64-FPGA Supercomputer
Copyright 2007, the University of Edinburgh
Dr Rob Baxter, Software Development Group Manager, EPCC
R.Baxter@epcc.ed.ac.uk, +44 131 651 3579

Outline
- The FHPCA
- Why build Maxwell?
- Hardware details
- Software environment
- Demo applications: easy, harder, very hard
- Concluding thoughts

Who are the FHPCA? The FPGA High-Performance Computing Alliance:
- EPCC (lead partner)
- Alpha Data Ltd
- Nallatech Ltd
- Xilinx Corporation
- Institute for System Level Integration
- Algotronix Ltd
The Alliance is funded by the partners themselves, the Scottish Enterprise Priority Industries Team, and the Scottish Funding Council under the edikt2 SRDG.

Why build Maxwell? Maxwell was completed earlier this year, but what is it for?
FPGAs can be used to accelerate some computations; their use as coprocessors is well understood. But:
- can FPGAs be used as main processors?
- in parallel arrays?
- against real HPC applications?
These are the questions we set ourselves; the answers are at the end of the talk.

Maxwell at Edinburgh's ACF

Physical architecture
Maxwell comprises:
- five IBM BladeCentre chassis
- 32 IBM Intel Xeon blades
- 64 Xilinx Virtex-4 FPGAs
- a Dell Precision 670 headnode (4 GB memory, 1 TB local SATA)
Each blade is a diskless 2.8 GHz Intel Xeon with 1 GB main memory and hosts two FPGAs through a PCI-X expansion module.
The FPGAs are mounted on two card types: the Nallatech H101 and the Alpha Data ADM-XRC-4FX.

Nallatech H101
- Xilinx V4LX160 main device
- 16 MB SRAM: 4x 4 MB banks, 6.4 GB/s total bandwidth
- 512 MB SDRAM: 1x 512 MB bank, 3.2 GB/s total bandwidth
- V2Pro FX4 device for comms: 4x 2.5 Gb/s MGTs (RocketIO)

Alpha Data ADM-XRC-4FX
- Xilinx V4FX100 main device
- 16 MB SRAM: 4x 4 MB banks, 6.4 GB/s total bandwidth
- 1,024 MB SDRAM: 4x 256 MB banks, 8.4 GB/s total bandwidth
- comms inherent in the V4FX: 4x 3.125 Gb/s MGTs (RocketIO)

Overall topology
All 64 FPGAs are wired together directly in a two-dimensional 8x8 torus; this direct connection allows full distributed-memory parallel programming purely on the FPGAs.
The Xeons are connected over gigabit Ethernet through a single 48-way Netgear switch, which supports any inter-process communication that remains above the FPGA level.
Thus there are two networks:
- an all-to-all software network
- a nearest-neighbour, 8x8, hardwired FPGA network
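
One way to picture the nearest-neighbour FPGA network is as a periodic Cartesian process grid. The C++/MPI sketch below is illustrative only (the slides do not say Maxwell's software is organised this way): it builds a periodic 8x8 grid of ranks and finds each rank's north/south/east/west neighbours, mirroring the hardwired torus.

    // torus_neighbours.cpp -- illustrative sketch, not Maxwell/PTK code
    // Run with 64 ranks (one per FPGA), e.g. mpirun -np 64 ./torus_neighbours
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int dims[2]    = {8, 8};   // 8x8 grid of ranks
        int periods[2] = {1, 1};   // wrap around in both dimensions -> a torus
        MPI_Comm torus;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &torus);

        if (torus != MPI_COMM_NULL) {   // ranks beyond the 64 are left out
            int rank, coords[2];
            MPI_Comm_rank(torus, &rank);
            MPI_Cart_coords(torus, rank, 2, coords);

            // Nearest neighbours in each dimension (shift by 1 with wrap-around).
            int north, south, west, east;
            MPI_Cart_shift(torus, 0, 1, &north, &south);
            MPI_Cart_shift(torus, 1, 1, &west, &east);

            std::printf("rank %d at (%d,%d): N=%d S=%d W=%d E=%d\n",
                        rank, coords[0], coords[1], north, south, west, east);
        }

        MPI_Finalize();
        return 0;
    }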

FPGA topology
[Diagram: the 64 FPGAs connected in a 2D torus of RocketIO links; legend distinguishes FPGAs in Nallatech hardware from FPGAs in Alpha Data hardware.]

Maxwell schematic
[Schematic: Alpha Data half and Nallatech half; blades and FPGAs grouped by BladeCentre, linked by east-west and north-south rings.]

Software environment
- Linux: the Red Hat-variant CentOS
- standard GNU/Linux tools
- Sun Grid Engine (SGE) as the batch scheduling system
- MPI for inter-process communication
Similar to other parallel clusters, but Maxwell also has the FHPCA Parallel Toolkit (PTK): a set of infrastructure and practices intended to address acceleration issues.

What is the Parallel Toolkit?
The PTK is a set of practices and infrastructure intended to address identified acceleration issues, e.g.:
- associating processes with FPGA resources
- associating FPGAs with bitstreams
- managing contention for FPGA resources within a process
- managing code dependencies to facilitate re-use
PTK infrastructure is written mostly in C++, with bash used for scripting tasks.
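
To make the first three bullets concrete, here is a purely hypothetical C++ sketch of the kind of interface such a toolkit might expose; the class and function names are invented for illustration and are not the actual PTK API.

    // fpga_lease_sketch.cpp -- hypothetical illustration, NOT the FHPCA PTK API
    #include <string>
    #include <iostream>

    // An RAII "lease" ties one FPGA to the calling process for its lifetime,
    // which is one way to manage contention for FPGA resources within a process.
    class FpgaLease {
    public:
        FpgaLease(int device_id, const std::string& bitstream_path)
            : device_id_(device_id) {
            // Hypothetical steps: open the device exclusively, then associate
            // it with a bitstream by programming the named .bit file.
            std::cout << "acquiring FPGA " << device_id_
                      << " and loading " << bitstream_path << "\n";
        }
        ~FpgaLease() {
            // Release the device so other processes can use it.
            std::cout << "releasing FPGA " << device_id_ << "\n";
        }
        int id() const { return device_id_; }
    private:
        int device_id_;
    };

    int main() {
        // One process per blade might hold two leases, one per attached card.
        FpgaLease card0(0, "demo_core.bit");
        FpgaLease card1(1, "demo_core.bit");
        // ... launch accelerated kernels on card0/card1 here ...
        return 0;  // leases released automatically in reverse order
    }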

Demo 1 MCopt: Financial Engineering
Monte Carlo simulation of stock option pricing using the classic Black-Scholes model:
    dS = S r dt + S σ ε √dt
with stock price S, interest rate r, time step dt, volatility σ, and Gaussian random number ε.
Simple European options have a closed-form solution; exotic Asian options need MC simulation.
Essentially an exercise in Gaussian RNG. Simple core and small data requirements.
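
As a concrete illustration of the MC kernel (a software sketch only, not the MCopt FPGA implementation; the parameter values are made up), a minimal C++ pricer for an arithmetic-average Asian call under the model above:

    // asian_mc_sketch.cpp -- illustrative software sketch of the MC kernel
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <random>

    int main() {
        // Hypothetical example parameters.
        const double S0 = 100.0, K = 100.0, r = 0.05, sigma = 0.2, T = 1.0;
        const int    steps = 252, paths = 100000;
        const double dt = T / steps;

        std::mt19937_64 rng(42);
        std::normal_distribution<double> gauss(0.0, 1.0);

        double payoff_sum = 0.0;
        for (int p = 0; p < paths; ++p) {
            double S = S0, avg = 0.0;
            for (int t = 0; t < steps; ++t) {
                double eps = gauss(rng);                            // Gaussian RN
                S += S * r * dt + S * sigma * eps * std::sqrt(dt);  // dS = S r dt + S sigma eps sqrt(dt)
                avg += S;
            }
            avg /= steps;
            payoff_sum += std::max(avg - K, 0.0);                   // Asian call payoff
        }
        double price = std::exp(-r * T) * payoff_sum / paths;       // discounted mean payoff
        std::printf("Asian call price ~ %.4f\n", price);
        return 0;
    }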

Demo 2 DI3D: Facial Imaging
Partner: DI3D Ltd, medical imaging specialists.
3D and 4D facial image reconstruction codes: pairwise merging and processing of images into a 3D view.
Main aim is to batch-process video images over 64 FPGAs.
Straightforward serial core: c. 85% of runtime on current data sizes, and significant data requirements (images of 2-4 MB each).
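
The batch-processing pattern here is embarrassingly parallel: video frames can simply be striped across workers. A minimal sketch of that division of labour (illustrative only, not DI3D's code; the frame count is hypothetical):

    // frame_striping_sketch.cpp -- illustrative frame distribution, not DI3D code
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int total_frames = 1024;            // hypothetical video length
        // Rank k processes frames k, k+size, k+2*size, ...
        for (int f = rank; f < total_frames; f += size) {
            // load frame f, run the (FPGA-accelerated) reconstruction core,
            // write the result -- omitted in this sketch
        }
        std::printf("rank %d handled %d frames\n", rank,
                    (total_frames - rank + size - 1) / size);
        MPI_Finalize();
        return 0;
    }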

Demo 3 OHM3D: Oil & Gas
Partner: OHM plc, an oil & gas services company.
3D controlled-source electromagnetics (CSEM) code. Pretty typical physical simulation code:
- double precision
- nine-point stencil (square nearest neighbours + corners)
- logically regular mesh, domain-decomposed using MPI
Has a core parallel iterative solver: c. 90% of runtime for current data sizes, and major data requirements (c. 500,000-point data sets and above).
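
For reference, a nine-point stencil sweep on a logically regular 2D mesh looks like the following serial C++ sketch (illustrative only; the mesh size and weights are placeholders, not OHM3D's operator):

    // nine_point_stencil_sketch.cpp -- illustrative serial stencil core
    #include <vector>
    #include <cstdio>

    int main() {
        const int nx = 512, ny = 512;                    // hypothetical mesh size
        std::vector<double> u(nx * ny, 1.0), unew(nx * ny, 0.0);
        auto idx = [nx](int i, int j) { return j * nx + i; };

        // Placeholder weights for centre, edge (N/S/E/W) and corner neighbours.
        const double wc = 0.5, we = 0.1, wd = 0.025;

        for (int j = 1; j < ny - 1; ++j) {
            for (int i = 1; i < nx - 1; ++i) {
                unew[idx(i, j)] =
                    wc *  u[idx(i, j)] +
                    we * (u[idx(i - 1, j)] + u[idx(i + 1, j)] +
                          u[idx(i, j - 1)] + u[idx(i, j + 1)]) +
                    wd * (u[idx(i - 1, j - 1)] + u[idx(i + 1, j - 1)] +
                          u[idx(i - 1, j + 1)] + u[idx(i + 1, j + 1)]);
            }
        }
        std::printf("u_new(centre) = %f\n", unew[idx(nx / 2, ny / 2)]);
        return 0;
    }

In the MPI-decomposed version, each subdomain exchanges one-cell halos with its nearest neighbours before each sweep, which is what makes a nearest-neighbour FPGA torus a natural fit for this kind of solver.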

Initial demo benchmarks
- Demo 1 MCopt runs 320x faster per FPGA
- Demo 2 DI3D runs at a sustained 2.5x faster per FPGA
- Demo 3 OHM3D runs at a sustained 5.5x faster per FPGA (on 8 nodes); parallel scaling still not working efficiently
Compared to software on a Maxwell CPU (2.8 GHz Xeon):
- IBM HS20 2.8 GHz Xeon: SPECfp_base2000 = 1559
- cf. HS21XM 3.0 GHz Xeon, 2x 2-core: SPECfp_base2000 = 2636
- cf. Intel Core 2 2.13 GHz: SPECfp_base2000 = 2262

[Chart: MCopt performance, wallclock time (s) on a log scale vs. node count (1-16), comparing CPU, Alpha Data (AD) and Nallatech (NT) implementations.]

[Chart: DI3D performance, wallclock time (s) vs. node count (2-32), comparing CPU, AD and NT implementations.]

[Chart: OHM3D performance, wallclock time (s) vs. node count (8-64), comparing CPU, AD and NT implementations.]

Answers to questions
So, can FPGAs be used as main processors?
- yes: you can fit a lot of logic on a V4100/160
- but: complexity and compilation overhead make development slow
In parallel arrays?
- yes: RocketIO is a good connection technology
- but: some form of all-to-all routing is desirable
Against real HPC applications?
- yes: where the numeric kernel is compact and well-defined
- but: memory bandwidth limitations are still critical

Top three challenges for FPGAs in HPC
1. Development costs: the cost of an efficient port of a major code is still too high; four or five times longer than producing optimised software.
2. Memory bandwidth: HPC generally suffers from a lack of memory bandwidth; FPGAs exacerbate this with more compute for the same DDR cost.
3. Amdahl's Law: HPC performance demands parallel scalability; accelerating small cores is not enough (contrast OpenMP vs MPI). A worked example follows below.
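
To see why accelerating small cores is not enough, Amdahl's Law bounds the overall speedup when a fraction p of the runtime is accelerated by a factor s:

    speedup = 1 / ((1 - p) + p / s)

As an illustrative calculation using the DI3D figures above: even if the serial core (p ≈ 0.85) were accelerated as dramatically as the MCopt kernel (s = 320), the whole application could run at most 1 / (0.15 + 0.85/320) ≈ 6.6x faster; only parallel scalability across FPGAs recovers the rest.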

Top three solutions (to do)
1. Better tooling, more standardisation: need common C dialects; need standard APIs while avoiding library-call overheads (e.g. no calls to an FPGA BLAS over PCI vs through L1 cache).
2. Better FPGA-memory connectivity; better memory chips! Multi-banked memory chips are needed, and not just by HPRC.
3. Parallel FPGAs: FPGA-to-FPGA connectivity is essential to keep up with the Joneses; regard FPGAs as the main compute platform, not just an accelerator.

Next steps for FHPCA
Ongoing industrial and academic collaboration programmes:
- we welcome academic collaborators and can pay travel costs for them to visit Edinburgh to work with us
- you can also apply for development time on the supercomputer through the Technology Translator
Proposal to EU FP7 to enhance programmability, with Xilinx, Nallatech, Alpha Data, ZIB, U. Ferrara, QUB.
Ongoing EPCC work supported by edikt2 and HPCx.
http://