HPC Hardware Overview


HPC Hardware Overview John Lockman III April 19, 2013 Texas Advanced Computing Center The University of Texas at Austin

Outline
- Lonestar: Dell blade-based system, InfiniBand (QDR), Intel processors
- Longhorn: Dell blade-based system, InfiniBand (QDR), Intel processors
- Stampede: Dell system, InfiniBand (FDR), Intel processors
- Ranch & Corral: storage systems

About this Talk
- As an applications programmer you may not care about hardware details, but we need to consider performance issues: better performance means faster turnaround and/or larger problems.
- We will focus on the most relevant architecture characteristics.
- Do not hesitate to ask questions as we go.

High Performance Computing
- In our context, HPC refers to hardware and software tools dedicated to computationally intensive tasks.
- The distinction between an HPC center (throughput focused) and a data center (data focused) is becoming fuzzy.
- High bandwidth, low latency: memory and network.
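
Memory bandwidth is usually the first of these limits an application feels. As a hedged illustration (not part of the original talk), a minimal STREAM-triad-style loop in C can estimate sustained memory bandwidth on any of the systems described below; the array size and timing details are assumptions chosen for the sketch.

    /* Minimal STREAM-triad-style bandwidth estimate (illustrative sketch).
     * Build e.g.: gcc -O2 -fopenmp triad.c -o triad
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 25)   /* ~33M doubles per array; assumed large enough to exceed cache */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];      /* triad: 2 loads + 1 store per element */
        double t1 = omp_get_wtime();

        /* 3 arrays x 8 bytes moved per element */
        printf("Triad bandwidth: %.1f GB/s\n", 3.0 * 8.0 * N / (t1 - t0) / 1e9);
        free(a); free(b); free(c);
        return 0;
    }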

Lonestar: Intel hexa-core system

Lonestar: Introduction
- Lonestar cluster configuration and diagram
- Server blades: Dell PowerEdge M610 (Intel hexa-core) server nodes
- Microprocessor architecture features: instruction pipeline, speeds and feeds, block diagram
- Node interconnect hierarchy: InfiniBand switch and adapters, performance

Lonestar Cluster Overview lonestar.tacc.utexas.edu

Lonestar Cluster Overview

Hardware           Components                      Characteristics
Peak Performance                                   302 TFLOPS
Nodes              2 hexa-core Xeon 5680 per node  1,888 nodes / 22,656 cores
Memory             1333 MHz DDR3 DIMMs             24 GB/node, 45 TB total
Shared Disk        Lustre parallel file system     1 PB
Local Disk         SATA                            146 GB/node, 276 TB total
Interconnect       InfiniBand                      4 GB/s point-to-point

Blade : Rack : System
- 1 node: 2 x 6 cores = 12 cores
- 1 chassis: 16 nodes = 192 cores
- 1 rack: 3 chassis x 16 blades = 576 cores
- 39⅓ racks: 22,656 cores
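
A quick consistency check on these counts (my arithmetic, not a slide from the talk):

\[ 1888\ \text{nodes} \times 12\ \tfrac{\text{cores}}{\text{node}} = 22656\ \text{cores}, \qquad \frac{1888\ \text{nodes}}{48\ \text{nodes/rack}} \approx 39\tfrac{1}{3}\ \text{racks}. \]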

Lonestar Login Nodes
- Dell PowerEdge M610
- Intel dual-socket hexa-core Xeon, 3.33 GHz
- 24 GB DDR3 1333 MHz DIMMs
- Intel QPI, 5520 chipset
- Dell PowerVault 1200: 15 TB HOME disk, 1 GB user quota (5x)

Lonestar Compute Nodes
- 16 blades per 10U chassis
- Dell PowerEdge M610
- Dual-socket Intel hexa-core Xeon, 3.33 GHz (1.25x), 13.3 GFLOPS/core (1.25x)
- 64 KB L1 cache per core (independent), 12 MB L3 cache (unified)
- 24 GB DDR3 1333 MHz DIMMs
- Intel QPI, 5520 chipset, 2x QPI at 6.4 GT/s
- 146 GB 10k RPM SAS/SATA local disk (/tmp)
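
The 13.3 GFLOPS/core figure follows from Westmere's SSE units, which can issue a 2-wide double-precision add and a 2-wide multiply every cycle, i.e. 4 FLOPs/cycle; the system peak then falls out. This derivation is mine, added for context:

\[ 3.33\ \text{GHz} \times 4\ \tfrac{\text{FLOP}}{\text{cycle}} = 13.3\ \tfrac{\text{GFLOPS}}{\text{core}}, \qquad 22656\ \text{cores} \times 13.3\ \tfrac{\text{GFLOPS}}{\text{core}} \approx 302\ \text{TFLOPS}. \]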

Motherboard block diagram: two hexa-core Xeons (12 CPU cores total), each with DDR3-1333 memory channels, connected by QPI to each other and to the IOH (I/O hub, PCIe), which links via DMI to the ICH (I/O controller).

Intel Xeon 5600 (Westmere): 32 KB L1 cache per core, 256 KB L2 cache per core, shared 12 MB L3 cache.

Cluster Interconnect: 16 independent 4 GB/s connections per chassis.
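
The 4 GB/s per-connection figure is the payload rate of a 4X QDR InfiniBand link; this reconstruction of the number is mine, not the slide's:

\[ 4\ \text{lanes} \times 10\ \tfrac{\text{Gb/s}}{\text{lane}} = 40\ \text{Gb/s signaling}, \qquad 40\ \text{Gb/s} \times \tfrac{8}{10} = 32\ \text{Gb/s} = 4\ \text{GB/s} \]

after 8b/10b encoding overhead.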

Lonestar Parallel File Systems: Lustre & NFS
- $HOME: 97 TB, 1 GB per user, served over Ethernet
- $WORK: 226 TB, 200 GB per user
- $SCRATCH: 806 TB, served over InfiniBand (uses IP over IB)

Longhorn: Intel quad-core system

Longhorn: Introduction
- First NSF eXtreme Digital Visualization (XD Vis) grant
- Designed for scientific visualization and data analysis
- Very large memory per computational core
- Two NVIDIA graphics cards per node
- Rendering performance: 154 billion triangles/second

Longhorn Cluster Overview

Hardware                     Components                      Characteristics
Peak Performance (CPUs)      512 Intel Xeon E5540            20.7 TFLOPS
Peak Performance (GPUs, SP)  128 NVIDIA QuadroPlex 2200 S4   500 TFLOPS
System Memory                DDR3 DIMMs                      48 GB/node (240 nodes), 144 GB/node (16 nodes), 13.5 TB total
Graphics Memory              512 FX5800 x 4 GB               2 TB
Disk                         Lustre parallel file system     210 TB
Interconnect                 QDR InfiniBand                  4 GB/s point-to-point

Longhorn Large-Memory Nodes
- Dell R710
- Intel dual-socket quad-core Xeon E5540 @ 2.53 GHz
- 144 GB DDR3 (18 GB/core)
- Intel 5520 chipset
- NVIDIA QuadroPlex 2200 S4: 4 NVIDIA Quadro FX5800, each with 240 CUDA cores, 4 GB memory, 102 GB/s memory bandwidth

Longhorn Standard Nodes
- Dell R610
- Intel dual-socket quad-core Xeon E5540 @ 2.53 GHz
- 48 GB DDR3 (6 GB/core)
- Intel 5520 chipset
- NVIDIA QuadroPlex 2200 S4: 4 NVIDIA Quadro FX5800, each with 240 CUDA cores, 4 GB memory, 102 GB/s memory bandwidth

Motherboard (R610/R710) block diagram: two quad-core Intel Xeon 5500 processors with DDR3 memory channels; 12 DIMM slots (R610) or 18 DIMM slots (R710); Intel 5520 chipset providing 36 lanes of PCI Express 2.0.

Cache Sizes in Intel Nehalem: each core has its own 32 KB L1 and 256 KB L2; all four cores share an 8 MB L3 cache next to the memory controller. Two 20-lane QPI links connect the socket to the rest of the system, with total QPI bandwidth up to 25.6 GB/s (@ 3.2 GHz).
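
The 25.6 GB/s figure can be reconstructed as follows (my arithmetic): at 3.2 GHz the link transfers twice per clock cycle (6.4 GT/s) and moves 2 bytes of payload per transfer in each direction:

\[ 6.4\ \tfrac{\text{GT}}{\text{s}} \times 2\ \text{B} = 12.8\ \tfrac{\text{GB}}{\text{s}}\ \text{per direction} \;\Rightarrow\; 25.6\ \tfrac{\text{GB}}{\text{s}}\ \text{total}. \]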

Nehalem μarchitecture

Stampede: Intel 8-core system

Stampede: NSF 11-511, High Performance Computing System Acquisition: Enhancing the Petascale Computing Environment for Science and Engineering. The goal: enable sustained petascale computational and data-driven science and engineering, and provide an innovative component.

Dell/Intel Partnership
- TACC partnered with Dell and Intel to design Stampede
- Intel MIC (Intel Xeon Phi) is the innovative component: high performance, low power per operation, and highly programmable

Datacenter Expansion

Photos: thermal storage, the new machine room, and new infrastructure.

2x footprint, 2x power, 20x capability (Stampede vs. Ranger)
- 1M gallons of thermal storage; 3,750 tons
- 14 kV in, 480 V out; 6.5 MW
- ~3,000 GPM

Stampede consists of 182 48U cabinets.

Key Datacenter Features
- Thermal energy storage to reduce peak power consumption
- Hot-aisle containment to boost efficiency
- Datacenter power increased to ~12 MW
- Expanded experiments in mineral-oil cooling

Cooling and Electrical Infrastructure

Stampede Performance
Stampede debuted at #7 on the Top500 list.

Stampede Overview
- $27.5M acquisition
- 9+ petaflops (PF) peak performance
- 2+ PF Linux cluster: 6,400 Dell DCS C8220X nodes, 2.7 GHz Intel Xeon E5 (Sandy Bridge), 102,400 total cores, 56 Gb/s FDR Mellanox InfiniBand
- 7+ PF from Intel Xeon Phi coprocessors (TACC has a special release: Intel Xeon Phi SE10P)
- 14+ PB disk, 150 GB/s
- 16 shared-memory nodes with 1 TB each
- 128 NVIDIA Tesla K20 GPUs
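
The 2+ PF and 7+ PF figures can be reconstructed from the vector widths in the processor-spec table on the next slide (my arithmetic, assuming 8 DP FLOPs/cycle on the E5 from a 4-wide AVX add plus a 4-wide multiply, and 16 DP FLOPs/cycle on the Phi from its 8-wide fused multiply-add):

\[ 6400 \times 16\ \text{cores} \times 2.7\ \text{GHz} \times 8\ \tfrac{\text{FLOP}}{\text{cycle}} \approx 2.2\ \text{PF (Xeon E5)} \]
\[ 6400 \times 61\ \text{cores} \times 1.1\ \text{GHz} \times 16\ \tfrac{\text{FLOP}}{\text{cycle}} \approx 6.9\ \text{PF (Xeon Phi)} \]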

Processor Specs

Arch. Feature         Xeon E5                        Xeon Phi SE10P
Frequency             2.7 GHz + turbo                1.1 GHz + turbo
Cores                 8                              61
HW threads/core       2                              4
Vector size           256 bits (4 doubles,           512 bits (8 doubles,
                      8 singles)                     16 singles)
Instr. pipeline       out of order                   in order
Registers             16                             32
Caches                L1: 32 KB, L2: 256 KB,         L1: 32 KB, L2: 512 KB
                      L3: 20 MB
Memory                2 GB/core                      128 MB/core
Sustained memory BW   75 GB/s                        170 GB/s
To sustain peak FLOPS 1 thread/core                  2 threads/core
Instruction set       x86 + AVX                      x86 + new vector instructions
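
The vector-size row is the one programmers feel most directly: the same C loop compiles to 4-wide double-precision AVX on the Xeon E5 and 8-wide vectors on the Phi, provided the compiler can vectorize it. A minimal hedged sketch (the function and the flag choices are illustrative of the Stampede-era Intel compiler, not from the talk):

    /* Simple vectorizable loop: the compiler maps it to 256-bit AVX on
     * Xeon E5 (e.g. icc -xAVX) or 512-bit vectors on Xeon Phi (icc -mmic).
     */
    #include <stddef.h>

    void scale_add(double * restrict y, const double * restrict x,
                   double a, size_t n) {
        /* restrict + a stride-1 loop lets the compiler vectorize:
         * 4 doubles per instruction on E5, 8 per instruction on Phi. */
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }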

MIC Details

What is a MIC? Basic design ideas:
- Leverage the x86 architecture (a CPU with many cores); x86 cores are simpler, but allow for more compute throughput
- Leverage existing x86 programming models
- Dedicate much of the silicon to floating-point ops; cache coherent
- Increase floating-point throughput: strip expensive features (out-of-order execution, branch prediction, etc.) and widen SIMD registers
- Implement as a separate device with fast (GDDR5) memory on the card
- Runs a full Linux operating system (BusyBox)
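
Because the card runs x86 Linux, existing programming models carry over. One option with the Stampede-era Intel compilers was the offload pragma; a minimal hedged sketch (the reduction body is illustrative, not from the talk):

    /* Offload a loop to the Xeon Phi with Intel's offload pragmas (icc).
     * Falls back to running on the host if no coprocessor is present.
     */
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double x[N];
        double sum = 0.0;
        for (int i = 0; i < N; i++) x[i] = 1.0 / (i + 1.0);

        /* Copy x to the card, run the loop there, bring sum back. */
        #pragma offload target(mic) in(x) inout(sum)
        {
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < N; i++)
                sum += x[i];
        }
        printf("sum = %f\n", sum);
        return 0;
    }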

MIC Architecture
- Many cores on the die, each with L1 and L2 cache
- Bidirectional ring network connecting the cores
- Memory and PCIe connection

Stampede Login Nodes

Component                            Technology
4 login nodes                        stampede.tacc.utexas.edu
Processors                           (2) E5-2680, 2.7 GHz, 8 cores each
Sockets per node / cores per socket  2/8
Motherboard                          Dell R720, Intel QPI, C600 chipset
Memory per node                      32 GB DDR3-1333 MHz
Cache                                L2: 256 KB/core, L3: 20 MB
$HOME                                Global: Lustre, 5 GB quota
/tmp                                 Local: shared 432 GB SATA, 10K RPM

Dell DCS C8220z Compute Node

Component                            Technology
Sockets per node / cores per socket  2/8, Xeon E5-2680, 2.7 GHz (turbo: 3.5)
Coprocessors / cores                 1/61, Xeon Phi SE10P, 1.1 GHz
Motherboard                          Dell C8220, Intel QPI, C600 chipset
Memory per host                      32 GB (8 x 4 GB), 4 channels, DDR3-1600 MHz
Memory per coprocessor               8 GB GDDR5
Interconnect, processor-processor    QPI, 8.0 GT/s
Interconnect, processor-coprocessor  PCI Express
PCI Express, processor               x40 lanes, Gen 3
PCI Express, coprocessor             x16 lanes, Gen 2 (extended)
Disk                                 250 GB, 7.5K RPM SATA

Compute Node Configuration: the CPUs and the MIC appear as separate hosts (symmetric computing). A communication layer and a virtual IP service for the MIC run over PCIe; the InfiniBand HCA connects the node to the fabric.
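
In symmetric mode an ordinary MPI program can place some ranks on the host CPUs and others on the Phi, since each appears as a separate host. A minimal hedged sketch (the build commands in the comment are illustrative of the Stampede toolchain):

    /* hello_sym.c: each rank reports where it runs; in symmetric mode
     * some ranks run on the host and some on the coprocessor.
     * Build twice: mpicc hello_sym.c -o hello.host
     *              mpicc -mmic hello_sym.c -o hello.mic
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int rank, size;
        char host[64];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        gethostname(host, sizeof host);
        printf("rank %d of %d on %s\n", rank, size, host);
        MPI_Finalize();
        return 0;
    }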

Stampede Filesystems

Storage Class         Size             Architecture   Features
Local (each node)     Login: 1 TB      SATA           432 GB mounted on /tmp
                      Compute: 250 GB  SATA           80 GB mounted on /tmp
                      Big Mem: 600 GB  SATA           398 GB mounted on /tmp
Parallel              14 PB            Lustre         72 Dell R610 (OSS), 4 Dell R710 (MDS)
Ranch (tape storage)  60 PB            SAM-FS         10 GB/s connection through 4 GridFTP servers

Stampede Filesystems
- $HOME: 5 GB quota, 150K files; filesystem is backed up
- $WORK: 400 GB quota, 3M files; NOT backed up; use cdw to change to $WORK
- $SCRATCH: no quota, total size ~8.5 PB; NOT backed up; use cds to change to $SCRATCH; files older than 10 days are subject to the purge policy
- /tmp: local disk, ~80 GB
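
Jobs should build paths from these environment variables rather than hard-coding them; a small hedged C sketch (the run directory and filename are illustrative):

    /* Build a per-job output path under $SCRATCH instead of hard-coding it. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const char *scratch = getenv("SCRATCH");   /* set in TACC login environments */
        if (!scratch) {
            fprintf(stderr, "SCRATCH is not set\n");
            return 1;
        }
        char path[4096];
        snprintf(path, sizeof path, "%s/run001/output.dat", scratch);
        printf("writing results to %s\n", path);   /* then fopen(path, "w"), etc. */
        return 0;
    }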

Large Memory and Visualization Nodes
- 16 large-memory nodes: 32 cores and 1 TB of memory each; used for data-intensive applications requiring disk caching and large-memory methods
- 128 visualization nodes: 16 cores, NVIDIA Tesla K20 with 8 GB GDDR5 memory

Storage
- Ranch: long-term tape storage
- Corral: 1 PB of spinning disk

Ranch Archival System
- Sun StorageTek silo
- 10,000 T10000B tapes and 6,000 T10000C tapes
- ~40 PB total capacity
- Used for long-term storage

Corral
- 6 PB DataDirect Networks online disk storage
- 8 Dell 1950 servers, 8 Dell 2950 servers, 12 Dell R710 servers
- High-performance parallel file system
- Multiple databases
- iRODS data management
- Replication to tape archive
- Multiple levels of access control
- Web and other data access available globally

References

Lonestar Related References
- User Guide: services.tacc.utexas.edu/index.php/lonestar-user-guide
- Developers: www.intel.com/p/en_us/products/server/processor/xeon5000/technical-documents

Longhorn Related References
- User Guide: services.tacc.utexas.edu/index.php/longhorn-user-guide
- General Information: www.intel.com/technology/architecture-silicon/next-gen/ and www.nvidia.com/object/cuda_what_is.html
- Developers: developer.intel.com/design/pentium4/manuals/index2.htm and developer.nvidia.com/forums/index.php

Stampede Related References
- User Guide: http://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide
- General Information: http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

John Lockman III, jlockman@tacc.utexas.edu. For more information: www.tacc.utexas.edu