Cray events: Cray User Group (CUG) and Cray Technical Workshop Europe


1 Cray events
! Cray User Group (CUG):
  ! When: May 16-19, 2005
  ! Where: Albuquerque, New Mexico, USA
  ! Registration: reserved to CUG members
  ! Web site:
! Cray Technical Workshop Europe:
  ! When: September 20-22, 2005
  ! Where: Manno, Lugano, Switzerland
  ! Registration: free
  ! Web site:

2 XD1 Presentation Agenda
! Cray XD1
  ! Product Overview
  ! Interconnect
  ! Management
  ! FPGA-Based Application Acceleration
  ! Benchmark results
! Usage of ENEA's system
  ! Login
  ! Compilation
  ! Job submission

3 Cray XD1 Product Overview

4 The Cray XD1
! Built for price/performance
  ! Interconnect bandwidth/latency
  ! System-wide process synchronization
  ! Application Acceleration FPGAs
! Standards-based
  ! 32/64-bit x86, Linux, MPI
! High resiliency
  ! Self-configuring, self-monitoring, self-healing
! Single system command & control
  ! Intuitive, tightly integrated management software
Purpose-built and optimized for high-performance workloads

5 Cray XD1 System Architecture
Compute:
! 12 AMD Opteron 32/64-bit x86 processors
! High-performance Linux
RapidArray Interconnect:
! 12 communications processors
! 1 Tb/s switch fabric
Active Management:
! Dedicated processor
Application Acceleration:
! 6 co-processors
Processors directly connected via integrated switch fabric

6 XD1 Chassis
Chassis front:
! Six SATA hard drives
! Fans
! Six two-way Opteron blades
Chassis rear:
! Six FPGA modules
! 0.5 Tb/s switch
! Three I/O slots (e.g. JTAG)
! Four 133 MHz PCI-X slots
! 12 x 2 GB/s ports to fabric
! Connector for 2nd 0.5 Tb/s switch and 12 more 2 GB/s ports to fabric

7 Compute Blade
! Two AMD Opteron 248 processors (2.2 GHz for ENEA)
! Two banks of 4 DIMM sockets for DDR 400 registered ECC memory (2+2 GB for ENEA)
! RapidArray communications processor
! Connector to main board

8 The AMD Opteron Processor
! Dedicated memory bus
! Native 32- and 64-bit x86 compatibility
! 64 KB + 64 KB L1 caches, 1 MB L2 cache
! Up to 19.2 GB/s I/O

9 Cray Innovations
! Balanced Interconnect
! Active Management
! Application Acceleration
Performance and usability

10 Interconnect

11 Cray XD1 Interconnect System
RapidArray:
! Interconnect processors
! Switch fabric
! Communications software

12 Typical HPC Application
Compute - Communicate - Compute - Communicate - Compute ...
! HPC applications exhibit intense compute/communicate cycles
! 20%-60% of the time, CPUs sit idle, stalled by communications
! Application performance is very sensitive to latency and bandwidth
Interconnect drives system performance
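A minimal MPI sketch of this cycle (the kernel, array size, and step count are illustrative assumptions, not taken from the presentation): every rank computes locally, then blocks in a collective, which is exactly where interconnect latency and bandwidth are felt.

```c
/* Compute/communicate cycle sketch: local work followed by a blocking
 * collective each step. Kernel and sizes are made up for illustration. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000
#define STEPS 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double local[N];
    double partial, global = 0.0;

    for (int step = 0; step < STEPS; step++) {
        /* Compute phase: purely local work. */
        partial = 0.0;
        for (int i = 0; i < N; i++) {
            local[i] = local[i] * 0.5 + rank;
            partial += local[i];
        }
        /* Communicate phase: every rank blocks here; the interconnect
         * determines how long the CPUs sit idle. */
        MPI_Allreduce(&partial, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
    }
    if (rank == 0) printf("result %g\n", global);
    MPI_Finalize();
    return 0;
}
```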

13 Balanced Interconnect
Memory, I/O and interconnect bandwidth (gigabytes per second) per processor:
! Xeon server: 6.4 GB/s DDR memory, ... GB/s PCI-X I/O, 0.25 GB/s GigE interconnect
! Cray XD1: 6.4 GB/s DDR memory, ... GB/s RapidArray interconnect
Removing the communications bottleneck

14 HPC Communications Optimizations
Cray communications libraries:
! MPI 1.2 library
! TCP/IP
! PVM
! Shmem
! Global Arrays
! System-wide process & time synchronization
RapidArray communications processor:
! HT/RA tunnelling
! Routing with route redundancy
! Reliable transport
! Short-message latency optimization
! DMA operations
! System-wide clock synchronization
[Diagram: AMD Opteron 2XX processor linked to the RapidArray communications processor at 3.2 GB/s, with two 2 GB/s RA links into the fabric - the Direct Connected Processor architecture.]
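The Shmem library listed above exposes one-sided puts and gets rather than matched send/receive pairs. A minimal sketch, written against the modern OpenSHMEM API (Cray's 2005-era Shmem differed in detail, e.g. in initialization calls):

```c
/* One-sided communication in the Shmem style: a PE writes directly
 * into a neighbour's memory with no matching receive on the far side. */
#include <shmem.h>
#include <stdio.h>

long target[8];   /* symmetric: exists at the same address on every PE */

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long src[8];
    for (int i = 0; i < 8; i++) src[i] = me * 100 + i;

    /* Put 8 longs into the next PE's 'target' array. */
    shmem_long_put(target, src, 8, (me + 1) % npes);
    shmem_barrier_all();   /* ensure all puts have completed */

    printf("PE %d holds data from PE %d\n", me, (me + npes - 1) % npes);
    shmem_finalize();
    return 0;
}
```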

15 Interconnect Benchmarks (MPI Latency)
[Chart: MPI latency (microseconds) versus message size (bytes) for Cray XD1 (RapidArray), Quadrics (Elan 4), 4x InfiniBand, and Myrinet (D card).]
! Cray XD1 latency is 4 times lower than InfiniBand
! Cray XD1 can send 2 KB before InfiniBand sends its first byte
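Latency curves like this are typically produced with a ping-pong microbenchmark. A sketch of the technique (the message sizes and repetition count are arbitrary choices, not the benchmark actually used for the chart):

```c
/* MPI ping-pong: rank 0 and rank 1 bounce a message back and forth;
 * half the round-trip time approximates one-way latency.
 * Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    static char buf[1 << 20];
    memset(buf, 0, sizeof buf);

    for (int size = 1; size <= (1 << 20); size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        const int reps = 1000;
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;
        if (rank == 0)
            printf("%8d bytes  one-way latency %.2f us\n",
                   size, dt / (2.0 * reps) * 1e6);
    }
    MPI_Finalize();
    return 0;
}
```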

16 Interconnect Benchmarks (MPI Throughput)
[Chart: bandwidth (MB/s) versus message size (bytes) for Cray XD1 (1/2 RapidArray fabric), Quadrics Elan 4, 4x InfiniBand, and Myrinet (D card).]
! Cray XD1 delivers 2X the throughput of InfiniBand (1 KB message size)

17 Management

18 Active Manager System
CLI and web access to the Active Management software
Usability:
! Single system command and control
Resiliency:
! Dedicated management processors, real-time OS and communications fabric
! Proactive background diagnostics with self-healing
Automated management for exceptional reliability, availability, serviceability

19 Active Manager GUI: SysAdmin Portal
The GUI provides quick access to status information and system functions

20 Automated Management
[Diagram: users & administrators managing compute partitions 1 and 2, a file services partition, and a front-end partition.]
! Partition management
! Linux configuration
! Hardware monitoring
! Software upgrades
! File system management
! Data backups
! Network configuration
! Accounting & user management
! Security
! Performance analysis
! Resource & queue management
Single system command and control

21 Active Manager Job Scheduler
Job management is integrated with self-healing features to increase job completion rates

22 Application Acceleration FPGA

23 Application Acceleration
! Reconfigurable computing
! Tightly coupled to the Opteron
! FPGA acts like a programmable co-processor
! Performs vector operations
! Well-suited for: searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, random number generation
Superlinear speedup for key algorithms

24 Application Acceleration Co-Processor
[Diagram: the AMD Opteron connects over 3.2 GB/s HyperTransport to the RAP; the application acceleration FPGA (Xilinx Virtex II Pro) attaches at 3.2 GB/s and carries its own QDR SRAM; the RAP feeds the Cray RapidArray interconnect through two 2 GB/s links.]

25 FPGA Linux API
Administration commands:
! fpga_open - allocate and open FPGA
! fpga_close - close allocated FPGA
! fpga_load - load binary into FPGA
Control commands:
! fpga_start - start FPGA (release from reset)
! fpga_stop - stop FPGA
Status commands:
! fpga_status - get status of FPGA
Data commands:
! fpga_put - put data to FPGA RAM
! fpga_get - get data from FPGA RAM
Interrupt/blocking commands:
! fpga_intwait - blocks process while waiting for FPGA interrupt
The programmer sees a get/put and message-passing programming model
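A hypothetical usage sketch of this API: only the fpga_* names come from the slide; the header name, argument types, return values, device path, and RAM offsets are invented for illustration.

```c
/* Illustrative get/put flow against the fpga_* API listed above.
 * All signatures here are assumptions, not Cray's documented ones. */
#include <stdio.h>
#include "fpga.h"   /* assumed header name */

int main(void) {
    int fd = fpga_open("/dev/fpga0");        /* allocate and open FPGA */
    if (fd < 0) { perror("fpga_open"); return 1; }

    fpga_load(fd, "design.bin");             /* load binary (bitstream) */
    fpga_start(fd);                          /* release from reset */

    unsigned data_in[256] = {0}, data_out[256];
    fpga_put(fd, 0x0000, data_in, sizeof data_in);    /* write to FPGA RAM */
    fpga_intwait(fd);                        /* block until FPGA interrupts */
    fpga_get(fd, 0x1000, data_out, sizeof data_out);  /* read results back */

    fpga_stop(fd);
    fpga_close(fd);
    return 0;
}
```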

26 Programming & Applications Environment

27 Programming Environment
! Operating system: Cray HPC enhanced Linux distribution (derived from SuSE 8.2)
! System management: Active Manager for system administration & workload management
! Application Acceleration Kit: IP cores, reference designs, command-line tools, API, JTAG interface card
! Scientific libraries: AMD Core Math Library (ACML)
! Shared memory access: Shmem, Global Arrays, OpenMP
! 3rd-party tools: Fortran 77/90/95, HPF, C/C++, Java
! Communications libraries: MPI 1.2
The Cray XD1 is standards-based for ease of programming: Linux, x86, MPI
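ACML exports the standard Fortran BLAS symbols, so a C program can reach its tuned kernels through the usual underscore interface. A small sketch (the link line and calling convention vary by compiler and ACML version, and the hidden Fortran string-length arguments are omitted as is common practice):

```c
/* Calling DGEMM via the Fortran BLAS symbol that ACML provides. */
#include <stdio.h>

extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

int main(void) {
    enum { N = 2 };
    /* Column-major 2x2 matrices: a = [[1,3],[2,4]], b = [[5,7],[6,8]]. */
    double a[N * N] = {1, 2, 3, 4};
    double b[N * N] = {5, 6, 7, 8};
    double c[N * N] = {0};
    double alpha = 1.0, beta = 0.0;
    int n = N;

    /* c = alpha * a * b + beta * c */
    dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

    printf("c = [%g %g; %g %g]\n", c[0], c[2], c[1], c[3]);  /* 23 31; 34 46 */
    return 0;
}
```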

28 Benchmark results

29 HPCC Results
! Slide by J. Dongarra (SOS9 meeting, 03/2005)

30 PTRANS Benchmark
! PTRANS (parallel matrix transpose) implements a parallel matrix transpose for two-dimensional block-cyclic storage
! It is an important benchmark because it heavily exercises the machine's communications on a realistic problem in which pairs of processors communicate with each other simultaneously
! It is a useful test of the total communications capacity of the network; unit: GByte/s
! Several molecular dynamics codes and some climate models must transpose large arrays to perform multidimensional FFTs (CPMD, FPMD, VASP, climate spectral models)
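To see why the transpose is communication-bound, consider the simpler case of a matrix distributed by rows: transposing it forces every rank to exchange one block with every other rank, an all-to-all whose aggregate volume is the entire matrix. A simplified sketch of that exchange (this is not the PTRANS code itself, which uses two-dimensional block-cyclic storage):

```c
/* All-to-all communication pattern behind a distributed transpose.
 * Matrix sizes are arbitrary illustration values. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = 256;                 /* rows owned by this rank */
    int n = rows * nprocs;          /* global matrix is n x n */
    double *a  = calloc((size_t)rows * n, sizeof *a);
    double *at = calloc((size_t)rows * n, sizeof *at);

    /* Every rank sends one rows x rows block to every other rank;
     * the aggregate traffic equals the whole matrix. (A real transpose
     * would pack each block before the exchange and transpose each
     * received block locally afterwards.) */
    MPI_Alltoall(a,  rows * rows, MPI_DOUBLE,
                 at, rows * rows, MPI_DOUBLE, MPI_COMM_WORLD);

    free(a); free(at);
    MPI_Finalize();
    return 0;
}
```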

31 G-PTRANS, EP-STREAM-TRIAD
[Chart: PTRANS and EP-STREAM-TRIAD results (GByte/s) for Cray XD1/RA (64 CPUs), Dalco Opteron/QsNetII (64), Dell Xeon/InfiniBand (64), IBM p-series/Federation (128), HP SC/QsNet (32), SGI Altix Itanium2 1.5 GHz, and IBM p655; STREAM results scaled to a single CPU.]

32 FFTE Benchmark
! FFTE measures the floating-point rate of execution of a double-precision complex one-dimensional Fast Fourier Transform
! It is an important benchmark because it exercises both the computation and the all-to-all communications required by a global FFT algorithm
! It is a useful test of the total communications capacity of the network; unit: Gflop/s
! Several molecular dynamics codes and some climate models perform multidimensional FFTs (CPMD, FPMD, VASP, climate spectral models)

33 G-FFTE, EP-DGEMM
[Chart: Global FFT (G-FFTE) and EP-DGEMM results (Gflop/s) for the same systems as the previous slide; DGEMM results scaled to a single CPU.]

34 Cray XD1 Benchmark: FPMD
[Chart: FPMD, H2Obig case, elapsed time (seconds) on 64 CPUs for a 1.8 GHz Opteron/Myrinet cluster, IBM Cluster 1350 (Intel Xeon 3.06 GHz/Myrinet), IBM 1.3 GHz SP4, Cray XD1 (2.4 GHz), and Cray X1.]
! The Cray XD1 is ... times faster than the IBM SP4, ... times faster than the IBM Cluster 1350, and ... times faster than the Opteron/Myrinet cluster

35 ECHAM5, case T63/L...
[Chart: forecast years per day versus number of processors for the Cray X1 and XD1; "5 x vector = scalar", i.e. roughly five XD1 scalar CPUs match one X1 vector CPU (1=5, 6=30, 12=60).]

36 Parallel Performance on XD1
! 5 Opteron CPUs on the XD1 >= 1 CPU of the X1
! At the low end of the scaling curve (no real surprise)
! At the high end of the scaling curve too (proprietary high-speed interconnect)
[Chart: parallel efficiency versus number of processors; series: XD1 non-radiation, radiation, full run (including I/O), and X1 reference (1/5 of the MSPs).]

37 XD1 Benchmark: GROMACS
[Chart: GROMACS (DPPC in water) speedup versus number of CPUs for the XD1 (2.2 GHz), a 1.8 GHz Opteron cluster with Myrinet, and perfect scaling.]
! The Cray XD1 delivers 63% greater speedup than the 1.8 GHz Opteron/Myrinet cluster at ... CPUs (higher is better)

38 Cray XD1 Benchmark: Amber 8 (scaling)
[Chart: Amber8 scaling, XD1 vs. Altix; speedup versus number of CPUs for XD1 jac, XD1 factor_ix, Altix jac, Altix factor_ix, and perfect scaling.]
! The Cray XD1 delivers 40% greater speedup than the Altix Itanium2 cluster (higher is better)

39 Cray XD1 Benchmarks: CHARMM
! Next slide:
  ! Itanium2 data: CHARMM version c31a2
  ! XD1: CHARMM version c31b1

40 Cray XD1 Benchmarks: CHARMM
[Chart: CHARMM, MbCO + 4985 waters (17,491 atoms), 100 steps; elapsed time (seconds) versus number of CPUs for the XD1 (2.2 GHz) and Itanium2 (1.4 GHz).]
! The XD1 is 20% faster than the 1.4 GHz Itanium2 at 16 CPUs, and is less expensive (lower is better)

41 Cray XD1 Benchmarks: LS-DYNA
[Chart: LS-DYNA mpp970, revision 5434a, 3-car collision, simulation time 150 ms; number of runs per day versus number of CPUs for the XD1 (2.2 GHz Opteron/RapidArray), HP (2.2 GHz Opteron/InfiniBand), 1.5 GHz Itanium2 rx2600, and Intel Xeon 3.06 GHz.]
! The Cray XD1 is 29% faster than an HP Opteron cluster and 9% faster than an Itanium2 cluster at ... CPUs, and 12% faster than the Itanium2 cluster at ... CPUs (higher is better)

42 Cray XD1 Benchmarks: LS-DYNA
[Chart: LS-DYNA mpp970, revision 5434a, Neon_refined, simulation time 30 ms; number of runs per day versus number of CPUs for the XD1 (2.2 GHz Opteron/RapidArray), HP (2.2 GHz Opteron/InfiniBand), and 1.5 GHz Itanium2 rx2600.]
! The Cray XD1 is 31% faster than an HP Opteron cluster and 13% faster than an Itanium2 cluster at ... CPUs, and 11% faster than the Itanium2 cluster at ... CPUs (higher is better)

43 Cray XD1 Benchmarks: STAR-HPC
[Chart: STAR-HPC 3.24, engine test case; number of runs per day versus number of CPUs for the XD1 (2.2 GHz Opteron/RapidArray), SGI Altix (1.5 GHz Itanium2), IBM p5-570 (1.9 GHz Power5), and AMD 2 GHz Opteron/Myrinet.]
! The Cray XD1 is 40% faster than the SGI Itanium2 cluster and 2.2X faster than the AMD Opteron cluster at ... CPUs, and 5% faster than the IBM Power5 at ... CPUs (higher is better)

