Resource allocation and utilization in the Blue Gene/L supercomputer

Resource allocation and utilization in the Blue Gene/L supercomputer
Tamar Domany, Y. Aridor, O. Goldshmidt, Y. Kliteynik, E. Shmueli, U. Silbershtein
IBM Labs in Haifa

Agenda
- Blue Gene/L background
- Blue Gene/L topology
- Resource allocation
- Simulation results

Blue Gene/L Overview
- First member of the IBM Blue Gene family of supercomputers
- Machine configurations range from 1,000 to 64,000 nodes
- The world's fastest supercomputer: rated first in the latest Top500 list (November 2004) with a machine size of 16K nodes
- Selected customers: Lawrence Livermore National Laboratory; Japan's National Institute of Advanced Industrial Science and Technology; the Lofar radio telescope run by Astron in the Netherlands; Argonne National Laboratory

Blue Gene/L Philosophy
- Designed for highly parallel applications
- Traditional Linux and MPI programming models
- Extendable and manageable: simple to build and operate
- Vastly improved price/performance by choosing simple, low-power building blocks: the highest possible single-threaded performance is not what matters, aggregate performance is
- Floor space and power efficiency
- BlueGene/L = cellular architecture + aggressive packaging + scalable software

BlueGene/L cellular architecture
- The design of BlueGene/L differs substantially from traditional supercomputers (NEC Earth Simulator, ASCI machines) that use large clusters of SMP nodes
- Very large number (64K) of simple, identical nodes
- Low-cost, low-power PPC microprocessors (700 MHz)
- Geometry: 64x32x32, based on a 3D torus
- Low-latency, high-bandwidth proprietary interconnect
- I/O physically separated from computation
- At most one process per CPU at a time
- Scalable and extendable architecture: the computational power of the machine can be expanded by adding more building blocks
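
To make the torus geometry concrete, here is a minimal Python sketch (the coordinate convention and function name are illustrative assumptions, not BG/L system code) listing the six wraparound neighbors of a node in a 64x32x32 3D torus:

# Illustrative sketch: wraparound neighbors in a 64 x 32 x 32 3D torus.
DIMS = (64, 32, 32)

def torus_neighbors(x, y, z, dims=DIMS):
    """Return the six direct neighbors of node (x, y, z), wrapping at the edges."""
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]

# A corner node still has six neighbors thanks to the wraparound links:
print(torus_neighbors(0, 0, 0))
# [(1, 0, 0), (63, 0, 0), (0, 1, 0), (0, 31, 0), (0, 0, 1), (0, 0, 31)]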

Jobs in Blue Gene/L
- Blue Gene/L runs parallel jobs: a set of tasks running together, communicating via message passing
- Each job has a set of attributes:
  - Size: number of tasks (and thus nodes)
  - 3D shape: a job of size 8 can be slim (e.g. 8x1x1) or fat (2x2x2)
  - Communication pattern: torus or mesh
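
A rough sketch of these job attributes in Python (field names are illustrative, not the actual Blue Gene/L job-record format):

from dataclasses import dataclass
from typing import Tuple

@dataclass
class Job:
    size: int                      # number of tasks (and thus nodes)
    shape: Tuple[int, int, int]    # requested 3D shape
    connectivity: str              # "torus" or "mesh"

# Two jobs of the same size with different shapes:
slim_job = Job(size=8, shape=(8, 1, 1), connectivity="mesh")
fat_job = Job(size=8, shape=(2, 2, 2), connectivity="torus")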

What is a Job Partition?
- A partition is:
  - A set of nodes
  - A set of communication links which connect the nodes as a torus or a mesh
- Partitions are isolated:
  - A single partition accommodates a single job
  - No sharing of nodes or links between partitions

Job Management for Blue Gene/L
- Users submit jobs to the Blue Gene/L scheduler
- The scheduler maintains a queue of submitted jobs
- The scheduler's tasks:
  - Choose the next job to run from the queue
  - Allocate resources for the job
  - Launch the job
  - Monitor the job until termination (signals, debugging)

Job Management Challenges
- How do we scale beyond a few thousand nodes? Group nodes into midplanes
- How do we maximize machine utilization? Extend the toroidal topology to a multi-toroidal topology

Scalability via Midplanes
- Nodes are grouped into 512-node units called midplanes
- A midplane is an 8x8x8 3D mesh: each internal node is connected directly to at most six internal neighbors
- Midplanes are connected to each other through switches
- Scalability is achieved by sacrificing granularity of management:
  - The midplane is the minimal allocation unit
  - Not all nodes may be utilized for a given job
  - In practice, for all aspects of job management we deal with a 128-midplane machine instead of 64K nodes
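
A back-of-the-envelope illustration of this granularity (the helper name is an assumption for illustration): a full 64K-node machine is managed as 65,536 / 512 = 128 midplanes, and any node request is rounded up to whole midplanes:

import math

NODES_PER_MIDPLANE = 512  # one 8 x 8 x 8 midplane

def midplanes_needed(nodes_requested):
    # Round a node request up to the midplane granularity.
    return math.ceil(nodes_requested / NODES_PER_MIDPLANE)

print(65536 // NODES_PER_MIDPLANE)  # 128 midplanes in a full machine
print(midplanes_needed(600))        # 2 midplanes (1024 nodes); 424 nodes stay idle for this job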

BG/L Topology
[Diagram: midplanes arranged along X-, Y-, and Z-lines; an X-line contains eight midplanes (0..7) connected through X-switches X0..X7]

Line connectivity - properties
- Lines have a multi-toroidal topology
- Can be easily extended
[Diagram: an X-line extended with additional switches, X0..X9]

Line connectivity - properties
- Lines have a multi-toroidal topology
- Can be easily extended
- Can be connected as a torus
[Diagram: switches X0..X7 wired so that the whole line forms a single 3D torus]

Line connectivity - properties
- Lines have a multi-toroidal topology
- Can be easily extended
- Can be connected as one torus
- Multiple toroidal partitions can coexist
[Diagram: switches X0..X7 wired so that several toroidal partitions coexist on one line]

Line connectivity - properties
- Lines have a multi-toroidal topology
- Can be easily extended
- Can be connected as a torus
- Multiple toroidal partitions can coexist
- There is more than one way to wire a given set of midplanes
[Diagram: an alternative wiring of switches X0..X7 for the same set of midplanes]

Resource Allocation Challenges
- High machine utilization
- Short response time (of jobs)
- On-line problem
- Requirements:
  - Satisfy job requests for size, shape, and connectivity (torus or mesh)
  - Deal with faulty resources (nodes and wires)
- Two kinds of dedicated resources to manage:
  - Node allocation
  - Link allocation

Allocation Algorithm
- Finding a partition: scan the 3D machine
  - Find all free partitions that match the shape/size of the job
  - For each candidate partition, determine if and how it can be wired
  - From all wireable partitions, choose the best one using flexible criteria, e.g. the minimal number of links
- Wiring a partition:
  - Static wire lookup tables per dimension
  - Availability of wires (previous allocations or faults) is checked
  - Finds suitable links in (almost) constant time
  - Small memory footprint despite the huge number of links
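
A simplified, runnable sketch of the "scan the 3D machine" step (all names are illustrative and the wiring check and best-partition criterion are only hinted at in comments; this is not the BG/L implementation):

import itertools

def find_free_box(free, shape):
    """free: 3D nested list of booleans over the midplane grid; shape: (dx, dy, dz)."""
    X, Y, Z = len(free), len(free[0]), len(free[0][0])
    dx, dy, dz = shape
    for x, y, z in itertools.product(range(X - dx + 1), range(Y - dy + 1), range(Z - dz + 1)):
        box = [(x + i, y + j, z + k)
               for i in range(dx) for j in range(dy) for k in range(dz)]
        if all(free[a][b][c] for a, b, c in box):
            # The real allocator would now consult the per-dimension wiring tables,
            # keep only candidates that can be wired as requested (torus or mesh),
            # and pick the one that consumes the fewest links.
            return box
    return None  # no free partition of this shape; the job waits in the queue

# Toy example: a 4x2x2 midplane grid with one midplane already allocated.
grid = [[[True] * 2 for _ in range(2)] for _ in range(4)]
grid[0][0][0] = False
print(find_free_box(grid, (2, 1, 1)))  # [(0, 0, 1), (1, 0, 1)]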

Simulated Environment
- Faithful simulation of Blue Gene/L (128 midplanes)
- Scheduler invoked when a job arrives or terminates
- Scheduling policy: aggressive backfilling
  - If the job at the head of the queue cannot be accommodated, we try to allocate another job out of order
- Workloads (benchmarks):
  - Arrival times, runtimes, size, shape, torus/mesh
  - Based on real parallel system logs
  - This presentation: San Diego Supercomputer Center (SDSC)
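
A minimal sketch of the aggressive backfilling policy described above (try_allocate is an assumed callback, e.g. the allocation scan sketched earlier):

def schedule(queue, try_allocate):
    # Walk the queue from the head; if a job cannot be accommodated,
    # try later jobs out of order (aggressive backfilling).
    started = []
    for job in list(queue):
        if try_allocate(job):
            queue.remove(job)
            started.append(job)
    return started

# Toy usage: only jobs needing at most 2 free midplanes fit right now.
free_midplanes = 2
jobs = [{"name": "head", "midplanes": 4}, {"name": "small", "midplanes": 1}]
print(schedule(jobs, lambda j: j["midplanes"] <= free_midplanes))
# [{'name': 'small', 'midplanes': 1}]  (the head job keeps waiting)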

The benefits of multi-toroidal topology
[Plot: system utilization vs. load for the SDSC fat-job workload, comparing BlueGene/L wired as a mesh, BlueGene/L wired as a torus, a mixed torus/mesh configuration, and a plain 3D torus]

The influence of job shapes on utilization
[Plot: system utilization vs. load for the SDSC workload, comparing slim and fat job shapes]

Summary
- Blue Gene/L brings a new level of supercomputer scalability, and with it many new challenges
- Scalability of system management is achieved by sacrificing granularity: the machine is represented as a smaller system consisting of collections of nodes
- Blue Gene/L's novel network topology has considerable advantages over traditional interconnects (such as 3D tori)
- The challenges are successfully met with a combined hardware and software solution

End

Link Allocation
- The problem: given a partition, find links in all the lines that participate in the partition, in all three dimensions, to wire the partition while preserving as many options as possible for future allocations
- Solution main idea: build a lookup table of the partition wiring possibilities
  - The dimensions are independent: one table per dimension
  - All lines in a dimension are equal: each table contains information on one line
  - There are not many ways to wire a partition, so the tables consume a relatively small amount of memory

The Lookup Table
- A table per topology dimension
- The index is a possible set of midplanes
- Each entry contains all sets of links that can wire that set as a torus or as a mesh
- Built once at startup time
- Given a partition, use the tables to find a link set in each dimension: eliminate non-available sets, output the best among the available ones
- Example entries (the index encodes which midplanes are in the set):

  Index   Midplane set
  5       0, 2
  6       1, 2
  14      1, 2, 3
  15      0, 1, 2, 3
  120     3, 4, 5, 6

  Associated link sets (with their connection type): {1} mesh; {2} mesh; {1,2,3} mesh; {3,6} mesh; {0,1,3,6} torus; {1,2,3,6} torus
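
A minimal sketch of a per-dimension table (assuming the index is read as a bitmask of midplanes, so index 5 = 0b00000101 = midplanes {0, 2}; the entries below are placeholders, not the real Blue Gene/L wiring data):

LINE_TABLE = {
    0b00000101: [({1}, "mesh")],                              # midplanes {0, 2}
    0b00001111: [({3, 6}, "mesh"), ({0, 1, 3, 6}, "torus")],  # midplanes {0, 1, 2, 3}
}

def links_for(midplane_mask, want, available_links):
    # Return the first link set that wires this midplane set as `want`
    # ("torus" or "mesh") using only links that are neither faulty nor in use.
    for links, connection in LINE_TABLE.get(midplane_mask, []):
        if connection == want and links <= available_links:
            return links
    return None

print(links_for(0b00001111, "torus", available_links={0, 1, 2, 3, 6}))  # {0, 1, 3, 6}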

Y & Z Lines Connectivity
- No multi-toroidal topology
- Can accommodate only one torus partition at a time
[Diagram: four midplanes (0..3) connected through Y-switches Y0..Y3, also drawn without the midplanes as a ring of switches]