APENet: LQCD clusters à la APE


APENet: LQCD clusters à la APE
Concept, Development and Use
Roberto Ammendola
Istituto Nazionale di Fisica Nucleare, Sezione Roma Tor Vergata
Centro Ricerche E. Fermi
SM&FT 2006, Bari, 21st September 2006

Motivation
The APE group has traditionally focused on the design and development of custom silicon, electronics and software optimized for Lattice Quantum ChromoDynamics. The APENet project was started to study the combination of existing off-the-shelf computing technology (CPUs, motherboards and memories for PC clusters) with a custom interconnect architecture derived from the APE group's previous experience. The focus is on building optimized, supercomputer-level platforms suited for numerical applications that are both CPU and memory intensive.

Requested Features
The idea was to build a switch-less network characterized by:
- high bandwidth
- low latency
- a natural fit with LQCD and other numerical grid-based algorithms, not necessarily restricted to first-neighbour communication
- good performance scaling as a function of the number of processors
- very good cost scaling even for large numbers of processors

Development stages
APENet history:
- Sept 2001 - June 2002: release of the first HW prototype
- Sept 2002 - Feb 2003: 2nd HW version
- March - Sept 2003: 3rd HW version development and production
- Dec 2003: electrical validation
- Sept 2004: 16-node APENet prototype cluster
- March - Nov 2005: 128-node APENet cluster
- June 2006: production starts

APENet Main Features
APENet is a 3D network of point-to-point links with toroidal topology:
- each computing node has 6 bidirectional full-duplex communication channels
- computing nodes are arranged in a 3D cubic mesh
- data is transmitted in packets which are routed to the destination node
- lightweight low-level protocol
- wormhole routing
- dimension-ordered routing algorithm
- 2 virtual channels per receiving channel to prevent deadlocks
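
As a concrete illustration of dimension-ordered routing on a 3D torus, here is a minimal sketch in C. It is not the APENet firmware logic: the 4x4x8 size, the preference for the shorter way around each ring, and all the names are assumptions made only for the example.

```c
/* Minimal sketch of dimension-ordered routing on a 3D torus.
 * Illustrative only: the torus size, the shortest-direction choice
 * and all names are assumptions, not the actual APENet logic. */
#include <stdio.h>

typedef struct { int x, y, z; } coord;

static const int DIM[3] = {4, 4, 8};   /* example 4x4x8 torus */

/* Signed shortest displacement from a to b along one ring of size n. */
static int ring_step(int a, int b, int n)
{
    int d = (b - a + n) % n;            /* forward distance 0..n-1 */
    return (d <= n / 2) ? d : d - n;    /* prefer the shorter way  */
}

/* Next hop: consume the X displacement first, then Y, then Z. */
static coord next_hop(coord cur, coord dst)
{
    int dx = ring_step(cur.x, dst.x, DIM[0]);
    int dy = ring_step(cur.y, dst.y, DIM[1]);
    int dz = ring_step(cur.z, dst.z, DIM[2]);

    if (dx != 0)      cur.x = (cur.x + (dx > 0 ? 1 : -1) + DIM[0]) % DIM[0];
    else if (dy != 0) cur.y = (cur.y + (dy > 0 ? 1 : -1) + DIM[1]) % DIM[1];
    else if (dz != 0) cur.z = (cur.z + (dz > 0 ? 1 : -1) + DIM[2]) % DIM[2];
    return cur;                         /* equals dst when already there */
}

int main(void)
{
    coord cur = {0, 0, 0}, dst = {3, 1, 6};
    while (cur.x != dst.x || cur.y != dst.y || cur.z != dst.z) {
        cur = next_hop(cur, dst);
        printf("hop -> (%d,%d,%d)\n", cur.x, cur.y, cur.z);
    }
    return 0;
}
```

Consuming X first, then Y, then Z removes cyclic dependencies between dimensions; the remaining cycle around each ring's wrap-around link is what the two virtual channels per receiving channel are there to break.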

What does it look like?
A 4 x 2 x 2 example is shown both as a logical diagram and as the real implementation. Nodes APE 1 to APE 16 are labelled with their (x,y,z) coordinates, from APE 1 = (0,0,0) and APE 2 = (1,0,0) up to APE 16 = (3,1,1), and are connected by the X+, X-, Y+, Y-, Z+ and Z- links of the torus.
[Figure: logical layout and photograph of the 4 x 2 x 2 configuration]

The interconnection card
- Altera Stratix EP1S30, 1020-pin package, fastest speed grade
- National Semiconductor serializers/deserializers DS90CR485/486, 48 bit at 133 MHz
- The use of a programmable device allows logic redesign and quick on-field firmware upgrades

Functional blocks
The EP1S30 FPGA implements a router and crossbar switch connecting the six torus links (X+, X-, Y+, Y-, Z+, Z-) to the local PCI interface:
- each link carries 6.4 (x2) Gb/s over 8 differential pairs at 800 MHz, seen by the FPGA as 48 bit at 133 MHz
- internal datapath: 64 bit at 133 MHz through the crossbar switch and router
- blocks: (a) PCI & FIFOs, (b) command queue FIFO, (c) gather/scatter FIFO, (d) RDMA table dual-port RAM, (e) channel FIFOs
- host interface: PCI-X core, 64 bit at 133 MHz
[Figure: block diagram of the APENet card]

Software
A new piece of hardware needs new software:
- device driver
- application-level library
- application-level test suites
- MPI library
- MPI-level test suites

Submitting jobs
A job submission environment has been developed which is aware of the network topologies allowed on a given machine. A configuration file describes the allowed topologies:

APE_NET_TOPOLOGY 4 4 8
APE_NET_PARTITION full 4 4 8 0 0 0
APE_NET_PARTITION single 1 1 1 0 0 0
APE_NET_PARTITION zline0 1 1 8 0 0 0
APE_NET_PARTITION zline1 1 1 8 0 1 0

Jobs are submitted with a script derived from mpirun:

aperun -topo zline0 cpi
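
To make the file format above concrete, here is a minimal sketch in C that looks up one partition entry and prints its size and offset. It only illustrates how such a file could be parsed; the file name apenet.conf and everything else in the code are assumptions, not part of the real aperun tooling.

```c
/* Minimal sketch: look up a partition in an APENet-style topology file.
 * Not the real aperun tooling; file name and error handling are assumptions. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *want = (argc > 1) ? argv[1] : "full";
    FILE *f = fopen("apenet.conf", "r");
    if (!f) { perror("apenet.conf"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        char key[32], name[32];
        int sx, sy, sz, ox, oy, oz;
        if (sscanf(line, "%31s %31s %d %d %d %d %d %d",
                   key, name, &sx, &sy, &sz, &ox, &oy, &oz) == 8 &&
            strcmp(key, "APE_NET_PARTITION") == 0 &&
            strcmp(name, want) == 0) {
            printf("partition %s: size %dx%dx%d at offset (%d,%d,%d)\n",
                   name, sx, sy, sz, ox, oy, oz);
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    fprintf(stderr, "partition %s not found\n", want);
    return 1;
}
```

Run against the example file, a lookup of zline1 would print: partition zline1: size 1x1x8 at offset (0,1,0).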

MPI Profiling
Producing optimized parallel code can be very hard. A tool for profiling MPI communication is under development; it helps in understanding the communication requirements of a given code so that its internal parameters can be better tuned. Information is gathered on:
- communication vs computation time
- time spent per communication function
- data size transferred per communication function
- data size transferred per rank
- ...
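
A common way to build this kind of tool is the standard MPI profiling interface, where the profiler provides wrappers for the MPI_ entry points and forwards to the PMPI_ versions. The talk does not say whether the APENet profiler works this way, so the following is only a hedged sketch of the general technique, timing MPI_Send and counting the bytes it moves.

```c
/* Hedged sketch of an MPI profiling wrapper using the standard PMPI
 * interface: it times MPI_Send and accumulates the bytes sent.
 * Illustrates the general technique only; not the APENet profiler. */
#include <mpi.h>
#include <stdio.h>

static double send_time  = 0.0;   /* total time spent inside MPI_Send  */
static long   send_bytes = 0;     /* total payload handed to MPI_Send  */
static long   send_calls = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(type, &size);

    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);  /* real call */
    send_time  += MPI_Wtime() - t0;
    send_bytes += (long)count * size;
    send_calls += 1;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld MPI_Send calls, %ld bytes, %.3f s\n",
           rank, send_calls, send_bytes, send_time);
    return PMPI_Finalize();
}
```

Such wrappers are usually built into a library that is linked (or preloaded) ahead of the MPI library, so the application itself does not need to be modified.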

Testbed: APE128
- 128 dual Xeon "Nocona" nodes at 3.4 GHz, 1 GB DDR-333 RAM each
- 6 TB NFS export
- service network: 100 Mbit
- user access network: 1 Gbit
- APENet network: 4 x 4 x 8

Some notes about the assembly
To host the machine we needed to rework the infrastructure:
- floor
- air conditioning
- power supply with UPS
APENet assembly issues:
- 5% faulty devices
- 20% badly mounted devices
- 10% badly connected links
General issues:
- 10% early dead disks

MPI Bandwidth benchmark
[Figure: unidirectional and bidirectional bandwidth at 133 MHz versus message size (16 bytes to 1 MB), together with the link protocol limit]
- peak unidirectional bandwidth: 570 MB/s
- peak bidirectional bandwidth: 720 MB/s
- link physical limit: 590 MB/s
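
Bandwidth curves of this kind are usually measured with a windowed stream of non-blocking messages between two ranks. The sketch below shows that standard pattern for a single 1 MB message size; it is not the benchmark actually run on APE128, and the window size and repetition count are arbitrary.

```c
/* Hedged sketch of a standard unidirectional bandwidth test between
 * ranks 0 and 1 (run with at least 2 MPI ranks). Not the code used
 * for the plot above; all parameters are arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 16
#define REPS   100

int main(int argc, char **argv)
{
    int rank, size_bytes = 1 << 20;            /* 1 MB message */
    char *buf = malloc(size_bytes);
    MPI_Request req[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        for (int w = 0; w < WINDOW; w++) {
            if (rank == 0)
                MPI_Isend(buf, size_bytes, MPI_BYTE, 1, 0,
                          MPI_COMM_WORLD, &req[w]);
            else if (rank == 1)
                MPI_Irecv(buf, size_bytes, MPI_BYTE, 0, 0,
                          MPI_COMM_WORLD, &req[w]);
        }
        if (rank <= 1)
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
    }
    /* small ack so rank 0 stops the clock only after all data arrived */
    if (rank == 0)
        MPI_Recv(buf, 1, MPI_BYTE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    else if (rank == 1)
        MPI_Send(buf, 1, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("unidirectional bandwidth: %.1f MB/s\n",
               (double)REPS * WINDOW * size_bytes / dt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Sweeping size_bytes over the message sizes on the x axis gives one point of the curve per size.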

MPI Latency benchmark
[Figure: one-way and round-trip latency versus message size (16 bytes to 64 KB)]
- minimum streaming latency: 1.9 µs
- minimum round-trip latency: 6.9 µs
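
One-way latency is commonly derived from a ping-pong test, taking half of the averaged round-trip time for small messages. The sketch below shows that standard pattern; it is illustrative only, not the code behind the plot, and the 16-byte message size and repetition count are arbitrary.

```c
/* Hedged sketch of a standard ping-pong latency test between ranks 0
 * and 1 (run with at least 2 MPI ranks); half the round-trip time is
 * reported as one-way latency. Not the code used for the plot above. */
#include <mpi.h>
#include <stdio.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank;
    char msg[16];                                 /* small message */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(msg, sizeof msg, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, sizeof msg, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(msg, sizeof msg, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(msg, sizeof msg, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / REPS;

    if (rank == 0)
        printf("round trip: %.2f us, one-way: %.2f us\n",
               rtt * 1e6, rtt * 1e6 / 2.0);

    MPI_Finalize();
    return 0;
}
```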

Real life benchmark
The performance (in floating-point operations per second) of 4 LQCD core routines is shown. The tests are executed with a fixed local lattice size (8^4). Allowed network topologies:
- 4 CPUs: 4 x 1 x 1
- 8 CPUs: 1 x 1 x 8
- 16 CPUs: 4 x 4 x 1
- 32 CPUs: 1 x 4 x 8
- 128 CPUs: 4 x 4 x 8
The best results are achieved when the global lattice geometry reflects the physical network topology.

Application Scaling
[Figure: aggregate GFlops (0 to 500) versus number of CPUs (1 to 128) for block Dirac inversion, full Dirac inversion, Dirac application 32 bit and Dirac application 64 bit]

Application Scaling
[Figure: MFlops per CPU (0 to 4000) versus number of CPUs (4 to 128) for the same four routines]

Application Scaling
[Figure: percentage of communication time (0 to 100) versus number of CPUs (4 to 128) for the same four routines]

Future work
Development is still in progress, both to enhance performance and to fix bugs. Activity will be focused mainly on:
- introduction of multiport operation
- reducing CPU usage for communication
- reducing latency for medium and large data transfers

Conclusions
Developing APENet has been a very interesting challenge, and its performance can still be compared with commercial interconnect systems in HPC. Technology out there is running fast, and since we are not market driven (and under-sized), it is very easy to become obsolete. The acquired know-how is going to be transferred to the next generation of APE machines (going for PetaFlops?).

APENet people
R. Ammendola, R. Petronzio, D. Rossetti, A. Salamon, N. Tantalo, P. Vicini