CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar
CONTENTS
- Introduction
- History
- Specifications
- Cray XK6 Architecture
- Performance
- Industry acceptance and applications
- Summary
INTRODUCTION
- The Cray XK6 supercomputer is a trifecta of scalar, network and many-core innovation.
- Hybrid supercomputer combining:
  - Cray's Gemini interconnect
  - AMD's leading multi-core scalar processors
  - NVIDIA's powerful many-core GPU processors
- Enhanced version of the Cray XE6, using the same blade architecture
- Capable of scaling to 500,000 scalar processors and 50 petaflops of hybrid peak performance
HISTORY
- 1988: Cray Research introduced the Cray Y-MP, the first supercomputer to sustain over 1 gigaflop on many applications.
- 1993: The Intel Paragon, with 1000 to 4000 Intel i860 processors, was ranked the fastest in the world.
- 1994: Fujitsu's Numerical Wind Tunnel supercomputer used 166 vector processors to gain the top spot, with a peak speed of 1.7 gigaflops per processor.
- 1996: The Hitachi SR2201 reached a peak performance of 600 gigaflops using 2048 processors.
SUPER-COMPUTER STATISTICS
COMPARISON WITH THE PRESENT CRAY SUPERCOMPUTERS
CRAY XK6 - ARCHITECTURE
- Four nodes per blade
- Adaptive hybrid computing
- Scalable compute nodes and I/O
- Gemini mezzanine
- Plug-compatible with the Cray XE6 blade
- Configurable processor, memory and SXM GPU
- AMD Opteron 6200 Series processor:
  - Highly associative on-chip data cache supports aggressive out-of-order execution
  - Integrated memory controller
  - Significant performance advantage to algorithms
- NVIDIA Tesla 20-series:
  - Based on the next-generation CUDA GPU architecture codenamed Fermi
NODE - ARCHITECTURE
XK6 ACCELERATOR BLADE
GEMINI INTERCONNECTION NETWORK
GEMINI INTERCONNECTION NETWORKS
- Each node acts as 2 nodes on a 3D torus
- Each node is provided with a high-radix YARC router supporting up to 168 Gbps
- Parallel electrical and optical paths
- High bandwidth and low latency for both long and short messages
- Low cost of integration
- Gemini mezzanine card to avoid memory interconnection network (ICN) bottlenecks
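The 3D torus topology mentioned on this slide can be illustrated with a minimal sketch. It only models the addressing idea (each position has six neighbours, with wrap-around at the edges); the function names and the 4 x 4 x 4 size are hypothetical and do not describe Gemini's actual routing.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbours of `coord` on a 3D torus of size `dims`."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # Wrap around at the edges so every ring is closed.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Minimum hop count between a and b, taking the shorter way around each ring."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


if __name__ == "__main__":
    dims = (4, 4, 4)  # hypothetical torus size, for illustration only
    print(torus_neighbors((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (3, 0, 0), dims))
```

Because of the wrap-around links, the position at coordinate 3 is a direct neighbour of coordinate 0 on a ring of length 4, which is what keeps the worst-case hop count low as the machine grows.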
NVIDIA TESLA X2090
- Special embedded version of the Tesla M2090
- Provides high-performance computing for highly parallel applications
- 512 CUDA cores with 6 GB of GDDR5 memory
- Delivers 600+ GFLOPs peak
- High bandwidth to the host for quick master-slave communication
- CUDA capable for easy programmability
CRAY XK6 CABINETS
- Each cabinet holds up to 96 processors
- Each node pairs two processors, one AMD Opteron and one NVIDIA GPU, on an XE6-compatible blade
- With 1536 cores, a cabinet delivers 70+ TFLOPs peak performance
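The cabinet figures can be cross-checked with a few lines of arithmetic. This is a rough sketch using only numbers from these slides (96 processors, 1536 cores, 600+ GFLOPs per GPU); it assumes one GPU per AMD processor, as in the CSCS configuration described later, and treats 600 GFLOPs as a lower bound.

```python
# Figures taken from the slides.
PROCESSORS_PER_CABINET = 96
CORES_PER_CABINET = 1536
GPU_PEAK_GFLOPS = 600  # "600+ GFLOPs", used here as a lower bound

# Assumption: one GPU accompanies each AMD processor.
GPUS_PER_CABINET = PROCESSORS_PER_CABINET

cores_per_processor = CORES_PER_CABINET // PROCESSORS_PER_CABINET
gpu_peak_tflops = GPUS_PER_CABINET * GPU_PEAK_GFLOPS / 1000

print(cores_per_processor)  # 16
print(gpu_peak_tflops)      # 57.6
```

Under these assumptions the GPUs alone account for at least 57.6 TFLOPs per cabinet, so most of the quoted 70+ TFLOPs comes from the accelerators, with the remainder coming from the 16-core CPUs and from the GPU peak exceeding 600 GFLOPs.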
SPECIFICATIONS
SPECIFICATIONS
PERFORMANCE - LUDWIG
- 10 cabinets of Cray XK6
- 936 GPUs (nodes)
- Only 4% deviation from perfect scaling between 8 and 936 GPUs
- Application sustains 40+ Tflop/s and is still scaling
- Strong scaling is also very good, but physicists want to simulate larger systems
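To make the scaling figures concrete, the sketch below shows how such numbers are typically derived. It assumes the 4% figure refers to weak scaling (fixed work per GPU), which the slide does not state explicitly, and the timings are hypothetical values chosen only to reproduce a 4% deviation.

```python
def weak_scaling_efficiency(t_base, t_scaled):
    """Efficiency = time per step at the base GPU count / time at the larger count.

    With fixed work per GPU, perfect scaling keeps the time constant (efficiency 1.0).
    """
    return t_base / t_scaled


def sustained_per_gpu_gflops(total_tflops, n_gpus):
    """Average sustained rate per GPU, in GFLOPs."""
    return total_tflops * 1000 / n_gpus


# Hypothetical timings (seconds per step) on 8 and 936 GPUs.
t_8, t_936 = 1.00, 1.00 / 0.96
efficiency = weak_scaling_efficiency(t_8, t_936)
deviation = 1.0 - efficiency

# Figures from the slide: 40+ Tflop/s sustained on 936 GPUs.
per_gpu = sustained_per_gpu_gflops(40, 936)

print(round(deviation, 2))  # 0.04
print(round(per_gpu, 1))    # 42.7
```

Taking 40 Tflop/s as a lower bound, each GPU sustains roughly 43 GFLOPs on this application, a small fraction of its 600+ GFLOPs peak.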
PERFORMANCE - HIMENO
- Parallel 3D Poisson equation solver benchmark
- Iterative loop evaluating a 19-point stencil
- Co-Array Fortran version of the code
- Fully ported to accelerators using 27 directive pairs
- Strong scaling: use asynchronous GPU data transfers and kernel launches to help hide transfer and launch overheads
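The shape of a 19-point stencil can be sketched as follows: the centre point, its 6 face neighbours and its 12 edge neighbours, with the 8 cube corners left out. This is a plain-Python illustration, not the Himeno code (which is Co-Array Fortran with its own coefficient arrays); the uniform weights and the function names are hypothetical.

```python
from itertools import product


def stencil_offsets_19():
    """Offsets of a 19-point stencil: centre, 6 face and 12 edge neighbours."""
    return [o for o in product((-1, 0, 1), repeat=3) if sum(map(abs, o)) <= 2]


def jacobi_sweep(p, weights):
    """One Jacobi sweep over interior points; boundary values are left unchanged."""
    nx, ny, nz = len(p), len(p[0]), len(p[0][0])
    new = [[row[:] for row in plane] for plane in p]  # copy so reads use old values
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                new[i][j][k] = sum(
                    w * p[i + di][j + dj][k + dk]
                    for (di, dj, dk), w in weights.items()
                )
    return new


if __name__ == "__main__":
    offsets = stencil_offsets_19()
    print(len(offsets))  # 19
    # Hypothetical uniform weights; the real benchmark uses its own coefficients.
    weights = {o: 1.0 / len(offsets) for o in offsets}
    grid = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
    grid = jacobi_sweep(grid, weights)
```

Every interior update reads 19 values and writes one, so the loop is limited by memory traffic rather than arithmetic, which is why data movement between host and GPU matters so much for this benchmark.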
INDUSTRIAL ACCEPTANCE
- Oak Ridge National Laboratory: Jaguar/Titan
- High computation capacity for scientific research
- 200 cabinets with more than 18,000 nodes
- Estimated 10-20 PFLOPs
- Currently upgrading from the XT5-based Jaguar system to the XK6-based Titan system with increased performance
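The Titan estimates are consistent with the per-cabinet figures given on the cabinet slide. The quick check below assumes one node per processor (96 nodes per cabinet) and uses 70 TFLOPs as a lower bound for cabinet peak; both are readings of these slides rather than published Titan specifications.

```python
CABINETS = 200
NODES_PER_CABINET = 96    # assumption: one node per processor in a cabinet
CABINET_PEAK_TFLOPS = 70  # "70+ TFLOPs", used here as a lower bound

nodes = CABINETS * NODES_PER_CABINET
peak_pflops = CABINETS * CABINET_PEAK_TFLOPS / 1000

print(nodes)        # 19200
print(peak_pflops)  # 14.0
```

Under these assumptions the system has 19,200 nodes, matching "more than 18,000", and at least 14 PFLOPs of peak performance, which falls inside the estimated 10-20 PFLOPs range.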
INDUSTRIAL ACCEPTANCE
INDUSTRIAL ACCEPTANCE
- CSCS: Swiss National Supercomputing Centre
- Cray XE6: 402 Tflops, 1496 nodes, Gemini interconnect
- Cray XK6: 176 nodes with one AMD and one GPU element each
SUMMARY
- Higher supercomputing potential with GPU-accelerated computing
- Better inter-node communication with the Gemini optical interconnects
- Backward compatible with XE6 cabinets; can be merged with XE6 systems
- Highly suited to scientific research computations requiring computational power on the order of hundreds of TFLOPs
REFERENCES
- http://www.cray.com/products/xk6/xk6.aspx
- CrayXK6Brochure.pdf
- http://en.wikipedia.org/wiki/supercomputer
- http://i.top500.org/stats
- Applications on Cray XK6, Roberto Ansaloni