APENet: LQCD clusters à la APE


APENet: LQCD clusters à la APE
Concept, Development and Use
Roberto Ammendola
Istituto Nazionale di Fisica Nucleare, Sezione Roma Tor Vergata
Centro Ricerche E. Fermi
SM&FT 2006, Bari, 21st September 2006

Motivation
The APE group has traditionally focused on the design and development of custom silicon, electronics and software optimized for Lattice Quantum ChromoDynamics. The APENet project was started to study the combination of existing off-the-shelf computing technology (CPUs, motherboards and memories for PC clusters) with a custom interconnect architecture derived from the APE group's previous experience. The focus is on building optimized, supercomputer-level platforms suited for numerical applications that are both CPU and memory intensive.

Requested Features
The idea was to build a switch-less network characterized by:
- high bandwidth
- low latency
- a natural fit with LQCD and other numerical grid-based algorithms, not necessarily restricted to first-neighbour communication
- good performance scaling as a function of the number of processors
- very good cost scaling even for large numbers of processors

Development stages
APENet history:
- Sept 2001 - June 2002: release of the first HW prototype
- Sept 2002 - Feb 2003: 2nd HW version
- March - Sept 2003: 3rd HW version development and production
- Dec 2003: electrical validation
- Sept 2004: 16-node APENet prototype cluster
- March - Nov 2005: 128-node APENet cluster
- June 2006: production starts

APENet Main Features
APENet is a 3D network of point-to-point links with toroidal topology:
- each computing node has 6 bidirectional full-duplex communication channels
- computing nodes are arranged in a 3D cubic mesh
- data is transmitted in packets which are routed to the destination node
- lightweight low-level protocol
- wormhole routing
- dimension-ordered routing algorithm
- 2 virtual channels per receiving channel to prevent deadlocks
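
As a concrete illustration of dimension-ordered routing on a 3D torus, here is a minimal sketch in C. It is not the APENet firmware logic: the 4x4x8 size, the preference for the shorter way around each ring, and all the names are assumptions made only for the example.

```c
/* Minimal sketch of dimension-ordered routing on a 3D torus.
 * Illustrative only: the torus size, the shortest-direction choice
 * and all names are assumptions, not the actual APENet logic. */
#include <stdio.h>

typedef struct { int x, y, z; } coord;

static const int DIM[3] = {4, 4, 8};   /* example 4x4x8 torus */

/* Signed shortest displacement from a to b along one ring of size n. */
static int ring_step(int a, int b, int n)
{
    int d = (b - a + n) % n;            /* forward distance 0..n-1 */
    return (d <= n / 2) ? d : d - n;    /* prefer the shorter way  */
}

/* Next hop: consume the X displacement first, then Y, then Z. */
static coord next_hop(coord cur, coord dst)
{
    int dx = ring_step(cur.x, dst.x, DIM[0]);
    int dy = ring_step(cur.y, dst.y, DIM[1]);
    int dz = ring_step(cur.z, dst.z, DIM[2]);

    if (dx != 0)      cur.x = (cur.x + (dx > 0 ? 1 : -1) + DIM[0]) % DIM[0];
    else if (dy != 0) cur.y = (cur.y + (dy > 0 ? 1 : -1) + DIM[1]) % DIM[1];
    else if (dz != 0) cur.z = (cur.z + (dz > 0 ? 1 : -1) + DIM[2]) % DIM[2];
    return cur;                         /* equals dst when already there */
}

int main(void)
{
    coord cur = {0, 0, 0}, dst = {3, 1, 6};
    while (cur.x != dst.x || cur.y != dst.y || cur.z != dst.z) {
        cur = next_hop(cur, dst);
        printf("hop -> (%d,%d,%d)\n", cur.x, cur.y, cur.z);
    }
    return 0;
}
```

Consuming X first, then Y, then Z removes cyclic dependencies between dimensions; the remaining cycle around each ring's wrap-around link is what the two virtual channels per receiving channel are there to break.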

What does it look like?
A 4 x 2 x 2 example is shown both as a logical diagram and as the real implementation. Nodes APE 1 to APE 16 are labelled with their (x,y,z) coordinates, from APE 1 = (0,0,0) and APE 2 = (1,0,0) up to APE 16 = (3,1,1), and are connected by the X+, X-, Y+, Y-, Z+ and Z- links of the torus.
[Figure: logical layout and photograph of the 4 x 2 x 2 configuration]

The interconnection card
- Altera Stratix EP1S30, 1020-pin package, fastest speed grade
- National Semiconductor serializers/deserializers DS90CR485/486, 48 bit at 133 MHz
- The use of a programmable device allows logic redesign and quick on-field firmware upgrades

Functional blocks
The EP1S30 FPGA implements a router and crossbar switch connecting the six torus links (X+, X-, Y+, Y-, Z+, Z-) to the local PCI interface:
- each link carries 6.4 (x2) Gb/s over 8 differential pairs at 800 MHz, seen by the FPGA as 48 bit at 133 MHz
- internal datapath: 64 bit at 133 MHz through the crossbar switch and router
- blocks: (a) PCI & FIFOs, (b) command queue FIFO, (c) gather/scatter FIFO, (d) RDMA table dual-port RAM, (e) channel FIFOs
- host interface: PCI-X core, 64 bit at 133 MHz
[Figure: block diagram of the APENet card]

Software
A new piece of hardware needs new software:
- device driver
- application-level library
- application-level test suites
- MPI library
- MPI-level test suites

Submitting jobs
A job submission environment has been developed which is aware of the network topologies allowed on a given machine. A configuration file describes the allowed topologies:

APE_NET_TOPOLOGY 4 4 8
APE_NET_PARTITION full 4 4 8 0 0 0
APE_NET_PARTITION single 1 1 1 0 0 0
APE_NET_PARTITION zline0 1 1 8 0 0 0
APE_NET_PARTITION zline1 1 1 8 0 1 0

Jobs are submitted with a script derived from mpirun:

aperun -topo zline0 cpi
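
To make the file format above concrete, here is a minimal sketch in C that looks up one partition entry and prints its size and offset. It only illustrates how such a file could be parsed; the file name apenet.conf and everything else in the code are assumptions, not part of the real aperun tooling.

```c
/* Minimal sketch: look up a partition in an APENet-style topology file.
 * Not the real aperun tooling; file name and error handling are assumptions. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *want = (argc > 1) ? argv[1] : "full";
    FILE *f = fopen("apenet.conf", "r");
    if (!f) { perror("apenet.conf"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        char key[32], name[32];
        int sx, sy, sz, ox, oy, oz;
        if (sscanf(line, "%31s %31s %d %d %d %d %d %d",
                   key, name, &sx, &sy, &sz, &ox, &oy, &oz) == 8 &&
            strcmp(key, "APE_NET_PARTITION") == 0 &&
            strcmp(name, want) == 0) {
            printf("partition %s: size %dx%dx%d at offset (%d,%d,%d)\n",
                   name, sx, sy, sz, ox, oy, oz);
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    fprintf(stderr, "partition %s not found\n", want);
    return 1;
}
```

Run against the example file, a lookup of zline1 would print: partition zline1: size 1x1x8 at offset (0,1,0).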

MPI Profiling
Producing optimized parallel code can be very hard. A tool for profiling MPI communication is under development; it helps in understanding the communication requirements of a given code so that its internal parameters can be better tuned. Information is gathered on:
- communication vs computation time
- time spent per communication function
- data size transferred per communication function
- data size transferred per rank
- ...
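
A common way to build this kind of tool is the standard MPI profiling interface, where the profiler provides wrappers for the MPI_ entry points and forwards to the PMPI_ versions. The talk does not say whether the APENet profiler works this way, so the following is only a hedged sketch of the general technique, timing MPI_Send and counting the bytes it moves.

```c
/* Hedged sketch of an MPI profiling wrapper using the standard PMPI
 * interface: it times MPI_Send and accumulates the bytes sent.
 * Illustrates the general technique only; not the APENet profiler. */
#include <mpi.h>
#include <stdio.h>

static double send_time  = 0.0;   /* total time spent inside MPI_Send  */
static long   send_bytes = 0;     /* total payload handed to MPI_Send  */
static long   send_calls = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(type, &size);

    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);  /* real call */
    send_time  += MPI_Wtime() - t0;
    send_bytes += (long)count * size;
    send_calls += 1;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld MPI_Send calls, %ld bytes, %.3f s\n",
           rank, send_calls, send_bytes, send_time);
    return PMPI_Finalize();
}
```

Such wrappers are usually built into a library that is linked (or preloaded) ahead of the MPI library, so the application itself does not need to be modified.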

Testbed: APE128
- 128 dual Xeon "Nocona" nodes at 3.4 GHz, 1 GB DDR-333 RAM each
- 6 TB NFS export
- service network: 100 Mbit
- user access network: 1 Gbit
- APENet network: 4 x 4 x 8

Some notes about the assembly
To host the machine we needed to rework the infrastructure:
- floor
- air conditioning
- power supply with UPS
APENet assembly issues:
- 5% faulty devices
- 20% badly mounted devices
- 10% badly connected links
General issues:
- 10% early dead disks

MPI Bandwidth benchmark
[Figure: unidirectional and bidirectional bandwidth at 133 MHz versus message size (16 bytes to 1 MB), together with the link protocol limit]
- peak unidirectional bandwidth: 570 MB/s
- peak bidirectional bandwidth: 720 MB/s
- link physical limit: 590 MB/s
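
Bandwidth curves of this kind are usually measured with a windowed stream of non-blocking messages between two ranks. The sketch below shows that standard pattern for a single 1 MB message size; it is not the benchmark actually run on APE128, and the window size and repetition count are arbitrary.

```c
/* Hedged sketch of a standard unidirectional bandwidth test between
 * ranks 0 and 1 (run with at least 2 MPI ranks). Not the code used
 * for the plot above; all parameters are arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 16
#define REPS   100

int main(int argc, char **argv)
{
    int rank, size_bytes = 1 << 20;            /* 1 MB message */
    char *buf = malloc(size_bytes);
    MPI_Request req[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank == 0)
                MPI_Isend(buf, size_bytes, MPI_BYTE, 1, 0,
                          MPI_COMM_WORLD, &req[w]);
            else if (rank == 1)
                MPI_Irecv(buf, size_bytes, MPI_BYTE, 0, 0,
                          MPI_COMM_WORLD, &req[w]);
        }
        if (rank <= 1)
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }
    /* small ack so rank 0 stops the clock only after all data arrived */
    if (rank == 0)
        MPI_Recv(buf, 1, MPI_BYTE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    else if (rank == 1)
        MPI_Send(buf, 1, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("unidirectional bandwidth: %.1f MB/s\n",
               (double)REPS * WINDOW * size_bytes / dt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Sweeping size_bytes over the message sizes on the x axis gives one point of the curve per size.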

MPI Latency benchmark
[Figure: one-way and round-trip latency versus message size (16 bytes to 64 KB)]
- minimum streaming latency: 1.9 µs
- minimum round-trip latency: 6.9 µs
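
One-way latency is commonly derived from a ping-pong test, taking half of the averaged round-trip time for small messages. The sketch below shows that standard pattern; it is illustrative only, not the code behind the plot, and the 16-byte message size and repetition count are arbitrary.

```c
/* Hedged sketch of a standard ping-pong latency test between ranks 0
 * and 1 (run with at least 2 MPI ranks); half the round-trip time is
 * reported as one-way latency. Not the code used for the plot above. */
#include <mpi.h>
#include <stdio.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank;
    char msg[16];                                 /* small message */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(msg, sizeof msg, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, sizeof msg, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(msg, sizeof msg, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(msg, sizeof msg, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / REPS;

    if (rank == 0)
        printf("round trip: %.2f us, one-way: %.2f us\n",
               rtt * 1e6, rtt * 1e6 / 2.0);

    MPI_Finalize();
    return 0;
}
```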

Real life benchmark
The performance (in floating-point operations per second) of 4 LQCD core routines is shown. The tests are executed with a fixed local lattice size (8^4). Allowed network topologies:
- 4 CPUs: 4 x 1 x 1
- 8 CPUs: 1 x 1 x 8
- 16 CPUs: 4 x 4 x 1
- 32 CPUs: 1 x 4 x 8
- 128 CPUs: 4 x 4 x 8
The best results are achieved when the global lattice geometry reflects the physical network topology.

Application Scaling
[Figure: aggregate GFlops (0 to 500) versus number of CPUs (1 to 128) for block Dirac inversion, full Dirac inversion, Dirac application 32 bit and Dirac application 64 bit]

Application Scaling
[Figure: MFlops per CPU (0 to 4000) versus number of CPUs (4 to 128) for the same four routines]

Application Scaling
[Figure: percentage of communication time (0 to 100) versus number of CPUs (4 to 128) for the same four routines]

Future work
Development is still in progress, both to enhance performance and to fix bugs. Activity will be focused mainly on:
- introduction of multiport operation
- reducing CPU usage for communication
- reducing latency for medium and large data transfers

Conclusions
Developing APENet has been a very interesting challenge, and its performance can still be compared with commercial interconnect systems in HPC. Technology out there is running fast, and since we are not market driven (and under-sized), it is very easy to become obsolete. The acquired know-how is going to be transferred to the next generation of APE machines (going for PetaFlops?).

APENet people
R. Ammendola, R. Petronzio, D. Rossetti, A. Salamon, N. Tantalo, P. Vicini