The Stampede Supercomputer


1 The Stampede Supercomputer Niall Gaffney (Dan Stanzione, Karl Schulz, Bill Barth, Tommy Minyard) July 2013

2 Acknowledgements
Thanks/kudos to:
- Sponsor: National Science Foundation
  - NSF Grant #OCI-1134872, Stampede Award: Enabling, Enhancing, and Extending Petascale Computing for Science and Engineering
  - NSF Grant #OCI, Topology-Aware MPI Collectives and Scheduling
- Many, many Dell, Intel, and Mellanox engineers
- All my colleagues at TACC, who constructed such an amazing system and let me take some credit

3 Who Am I?
- Came from the Space Telescope Science Institute: Hubble Space Telescope Archive, Hubble Legacy Archive, James Webb Space Telescope Archive
- Started May 2013 as Director of Data Intensive Computing
- Charged with bringing in new users to do data-intensive computations using Stampede
- Leading the Data Management and Collections and the Data Mining and Statistics groups

4 TACC's Stampede Project
We designed a comprehensive science system:
- huge supercomputing capability, with powerful processors
- high-throughput computing and big shared-memory capabilities
- powerful visualization, with high-end GPUs
- an extensive software stack to support both simulation-based and data-driven science
- expertise, training, and documentation to support diverse users, applications, and modes of computational science

5 Faster, Always Faster
Default R 2.15 on Stampede:
    $ cat R-benchmark-25.R | time R --slave > R.2.15.out
    user 2.20system 3:06.67elapsed 99%CPU (0avgtext+0avgdata maxresident)k
    25984inputs+32outputs (54major minor)pagefaults 0swaps
Rebuilt ("Yaakoubed") R 3.0 on Stampede:
    $ cat R-benchmark-25.R | time R --slave > R.3.0.out
    52.49user 2.10system 0:57.98elapsed 94%CPU (0avgtext+0avgdata maxresident)k
    30640inputs+16outputs (538major minor)pagefaults 0swaps
From the article, default R takes ~9 minutes!
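For context, a minimal sketch of how such a comparison is run, assuming R is provided through environment modules (the module names here are hypothetical):

    # Time the benchmark under the default R build (hypothetical module name)
    module load R/2.15
    cat R-benchmark-25.R | time R --slave > R.2.15.out

    # Swap in the newer build and repeat (hypothetical module name)
    module swap R R/3.0
    cat R-benchmark-25.R | time R --slave > R.3.0.out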

6 TACC's Stampede: The Big Picture
- Dell, Intel, and Mellanox are vendor partners
- Almost 10 petaflops peak in the initial system (2013):
  - 2.2 PF of Intel Xeon E5 (6,400 dual-socket nodes)
  - 7.3 PF of Intel Xeon Phi (MIC) coprocessors (6,400+)
- 14 PB disk, 150+ GB/s I/O bandwidth
- 260 TB RAM
- 56 Gb/s Mellanox FDR InfiniBand interconnect
- 16 x 1 TB large shared-memory nodes
- 128 NVIDIA Kepler K20 GPUs for remote visualization
- $51.5M project over 4 years to enable new science

7 Xeon Phi: The Innovative Component
- One goal of the NSF solicitation was to introduce a major new innovative capability component to the science and engineering research communities
- Included as an experimental HPC component
- We proposed the Intel Xeon Phi coprocessor (Many Integrated Core, or MIC):
  - one first-generation Phi installed per host during initial deployment
  - confirmed injection of 1,600 future-generation MICs in 2015 (5+ PF)

8 What is MIC?
Basic design ideas:
- Leverage the x86 architecture (a CPU with many cores)
  - Use x86 cores that are simpler, but allow for more compute throughput
  - Leverage existing x86 programming models
- Dedicate much of the silicon to floating-point ops; keep some cache(s)
- Keep the cache-coherency protocol
- Increase floating-point throughput per core
- Implement as a separate device
- Strip expensive features (out-of-order execution, branch prediction, etc.)
- Widened SIMD registers for more throughput (512-bit)
- Fast (GDDR5) memory on the card
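Because the card runs its own Linux on x86-compatible cores, existing code can be rebuilt for it directly. A minimal native-mode sketch with the Intel compiler (the mic0 hostname and the copy step are assumptions about the node setup):

    # Cross-compile for the coprocessor in native mode
    icc -mmic -openmp -o hello.mic hello.c
    # Copy the binary onto the card and run it there
    scp hello.mic mic0:~/
    ssh mic0 ./hello.mic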

9 Will My Code Run on MIC?
Yes, but that's the wrong question. Ask instead:
- Will your code run *best* on MIC?
- Will you get great MIC performance without additional work?
You do not HAVE to use the MICs to use Stampede. There will be more opportunities to learn about programming the MICs in the near future.

10 Stampede Footprint
              Ranger     Stampede
Floor space   3,000 ft2  8,000 ft2
Peak          0.6 PF     ~10 PF
Power         3 MW       6.5 MW
Machine room expansion added 6.5 MW of additional power.

11 Stampede: a Fast Deployment of a Fast System
(construction photos: Feb 20, 2012; March 22, 2012; May 16, 2012; Sep 10, 2012)

12 Stampede Datacenter, ~September 10th

13 Some utilities are involved

14 [ I/O Subsystem ]

15 File Systems Build-Out
At the Stampede scale, parallel file systems are universally required for all user file systems. Currently running Lustre.

Logical Volume  Capacity          Target Usage
$HOME           768 TB (524 TB)   Permanent user storage; automatically backed up, quota enforced
$WORK           ~2 PB (1.1 PB)    Large allocated storage; not backed up, quota enforced
$SCRATCH        ~11 PB (7.5 PB)   Large temporary storage; not backed up, purged periodically
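Since quotas are enforced on $HOME and $WORK, usage can be checked with the standard Lustre client utility; a sketch, using the mount points listed later in this deck:

    # Report per-user usage and limits on the quota-enforced volumes
    lfs quota -u $USER /home1
    lfs quota -u $USER /work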

16 File Systems
Before beginning any Lustre work, significant time was spent on designing the md (software RAID) layout and performing thorough drive burn-in.
Basic OSS config (DCS C-series chassis):
- compute blade (left), 4 storage blades (right)
- RAID6 software RAID and Lustre run on a dual-socket Sandy Bridge node
- each storage blade has 16 3 TB drives (64 drives/shelf)

17 File Systems Build-Out: Lustre
With all RAID sets vetted, we commenced with Lustre formatting and full formal I/O testing:
- a single OSS was tested first to verify adequate Lustre performance (measured peaks of almost 3 GB/sec)
- then, verified scalability of multiple servers when writing 2 GB per task in parallel (near-perfect scalability observed across 58 OSS servers)
- finally, full-system I/O tests commenced using more than 6K hosts
Full-system I/O results: Peak Write = 159 GB/sec; Peak Read = GB/sec
[charts: write speed (GB/sec) vs. number of write clients, at two Lustre stripe counts]
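Aggregate rates like these depend on files being striped across many OSTs, which users control per directory. A minimal sketch with the standard Lustre tools (the stripe count and directory are illustrative):

    # New files under this directory will be striped across 8 OSTs
    lfs setstripe -c 8 $SCRATCH/bigrun
    # Verify the default layout new files will inherit
    lfs getstripe -d $SCRATCH/bigrun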

18 [ System Speeds and Feeds ]

19 Speeds and Feeds: Full-System HPL (Sandy Bridge)
- HPL completed on all 6,400 hosts on 12/31/12
- Exceeded 90% efficiency with 8 GB/node
[charts: HPL Gflops and HPL efficiency at 2, 4, and 8 GB per node]
- Prior to these full-system runs, we also ran a heterogeneous SB+MIC run for submission to ISC'13 -> Stampede currently ranked 6th

20 Speeds and Feeds: P2P Bandwidth (FDR)
Comparison to previous-generation IB fabrics.
[chart: MPI bandwidth (MB/sec) vs. message size (bytes) for Stampede, Lonestar 4, and Ranger]
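Curves like this are commonly produced with the OSU micro-benchmarks; that tool choice is an assumption here, the slide does not name it. A sketch of a two-node run under SLURM using TACC's ibrun launcher:

    #!/bin/bash
    #SBATCH -J p2p-bw
    #SBATCH -N 2            # two nodes...
    #SBATCH -n 2            # ...one MPI task on each
    #SBATCH -t 00:10:00

    ibrun ./osu_bw          # prints bandwidth (MB/s) per message size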

21 Topology Considerations
At scale, process mapping with respect to topology can have a significant impact on applications.
[diagrams: full fat-tree (Stampede, TACC) vs. 4x4x4 3D torus (Gordon, SDSC)]

22 Topology Considerations
Topology query service (now in production on Stampede) - NSF STCI with OSU, SDSC:
- caches the entire linear forwarding table (LFT) for each IB switch, via an OpenSM plugin or the ibnetdiscover tools
- exposed via a network (socket) interface such that an MPI stack (or user application) can query the service remotely
- can return the number of hops between each host, or the full directed route between any two hosts
[sample query output; nearest-neighbor application benchmark from Stampede showing up to 45% lower latency with topology-aware placement at 1K-8K processes; courtesy H. Subramoni, SC'12]
We will also be leveraging this service to perform topology-aware scheduling, so that smaller user jobs will have their nodes placed closer together topologically:
- have created a simple tool to create the SLURM topology config file using the above query service
- works, but slows interactivity when users specify a maximum number of switch hops desired during job submission
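For reference, SLURM consumes that information as a topology.conf describing the switch hierarchy; a sketch of what such a generation step might emit (switch and node names are invented for illustration):

    # Write a hypothetical topology.conf fragment: leaf switches,
    # the hosts behind them, and the core switch linking the leaves
    cat > topology.conf <<'EOF'
    SwitchName=leaf01 Nodes=c401-[101-120]
    SwitchName=leaf02 Nodes=c402-[101-120]
    SwitchName=core01 Switches=leaf[01-02]
    EOF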

23 Initial File System Growth (first 2 months)
[chart: disk usage in TB for $WORK and $SCRATCH, mid-January through early March]

24 Disk Usage Today
Size  Used  Avail  Use%  Mounted on
524T  3.3T  521T   1%    /home1
1.1P  125T  923T   12%   /work
7.5P  5.0P  2.6P   67%   /scratch

25 [ Software Stack ]

26 Data-Oriented Software
General data packages:
- Python: Enthought Python Distribution (numpy, scipy, ipython, matplotlib, ...)
- R: statistics package (multicore, snow)
- ParaView: a parallel interactive visualization system
- VisIt: a parallel visualization suite based in part on VTK
- MATLAB 2013a from MathWorks (bring your own license)
Bio/genomic packages:
- Bedtools: a flexible suite of utilities for comparing genomic features
- BioPerl
- Many BLAST (Basic Local Alignment Search Tool) packages
- FreeSurfer: a set of tools for structural and functional brain imaging data
- TopHat2: a fast splice-junction mapper for RNA-Seq reads
Will support more tools for other communities.
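On TACC systems these packages are reached through environment modules; a sketch, where the specific module names are assumptions and module spider shows what is actually installed:

    module spider R       # discover the available R versions
    module load Rstats    # hypothetical name of the R installation
    module load python    # Enthought Python distribution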

27 Stampede So Far
In its first ~4 months of production operations, Stampede has delivered:
- more than 700,000 successful job completions
- more than 250,000,000 core hours of processing time
- jobs for more than 1,100 individual users, on 850 different funded projects
Roughly 12% of projects on Stampede involve data computation.

28 Modes for Data-Intensive Computing
- User-created code: not covered here (OpenMP, MPI, MIC optimization)
- High-throughput computing capabilities: launchers to start and manage jobs (see the sketch below)
- Large shared-memory capabilities: big nodes for large-data requirements
- Parallel data analysis capabilities: R analysis using parallel packages
- Remote visualization capabilities: using the GPUs and visualization packages remotely
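A minimal high-throughput sketch using TACC's launcher utility to fan independent one-line commands out across nodes; the environment variable and script names follow the launcher documentation and should be treated as assumptions:

    #!/bin/bash
    #SBATCH -J htc-demo
    #SBATCH -N 2
    #SBATCH -n 32
    #SBATCH -t 01:00:00

    module load launcher
    export LAUNCHER_JOB_FILE=commands.txt   # one shell command per line
    $LAUNCHER_DIR/paramrun                  # cycles commands through the 32 task slots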

29 Stampede Will Enable New Scientific Discoveries Across Domains science projects, by thousands of researchers

32 Stampede Early Science Highlights: Predicting Earthquakes in California
The Southern California Earthquake Center used Stampede to predict the frequency of damaging earthquakes for the latest Uniform California Earthquake Rupture Forecast (UCERF3). Results will be incorporated into the USGS's National Seismic Hazard Maps, which are used to set building codes and insurance rates.
"We do a lot of HPC calculations, but it's rare that any of them have this level of potential impact." - Thomas Jordan, Director, Southern California Earthquake Center

34 Stampede Early Science Highlights: Stampede as a Computational Microscope
Researchers from the University of Illinois at Urbana-Champaign used Stampede to simulate protein folding and to design enzymes for second-generation biofuels.
"We are extremely excited about the strong computational power of Stampede. It is the fastest machine we have experienced right away, and we have performed a lot of interesting scientific computational experiments on the system." - Klaus Schulten, University of Illinois at Urbana-Champaign

35 Stampede Early Science Highlights: Improving Brain Tumor Imaging
Surgeons want to know how aggressive a tumor is and how far it has infiltrated the surrounding tissue in order to plan for surgery, radiotherapy, and other treatment options. Dr. George Biros is creating new methods that quickly and accurately assimilate massive amounts of data from MRI scans and other imaging modalities, and combine these with biophysical models that represent tumor growth.

36 Stampede Early Science Highlights: Improving Brain Tumor Imaging
The addition of biophysical tumor models increases the accuracy and effectiveness of image interpretation, but involves large amounts of complex computation that must be completed quickly. This is an area where supercomputers can be of great assistance to medicine.
"Just looking at the image is not enough. You have to combine several imaging modalities, many images, do pattern recognition and statistical data analysis, and incorporate machine learning tools and biophysical models, to try to interpret the images. A machine like Stampede makes this possible." - George Biros, The University of Texas at Austin

38 Stampede Early Science Highlights: High Performance Sound Technologies for Access and Scholarship (HiPSTAS)
Stampede will provide high-performance computing, large-scale visualization, and massive storage capabilities to sound archivists to help them search for patterns and gain insights into spoken language and music. Stampede combined with the ARLO software enabled researchers to analyze massive sound archives at unprecedented speeds, increasing the pace of new discoveries.
"Using Stampede was essential for the success of the workshop, as participants sought to run advanced processes over thousands of files at once. This work would not have been possible without the supercomputer." - Tanya Clement, The University of Texas at Austin

39 Summary
- Stampede was successfully deployed into a newly expanded datacenter in 2012; it entered formal production in January 2013
- Xeon Phis were made available to users early; formal acceptance comes this summer
- Sandy Bridge + FDR scaling is looking great: good application scaling, with users having run on up to 64K cores in normal production so far
- Lustre file systems are providing excellent throughput; I/O rates are encouraging for supporting large parallel data computations
- Stampede is producing groundbreaking results, including significant initial data-intensive computing results - start expanding on that today

40 Niall Gaffney
For more information:
