Eldorado
John Feo
Cray Inc.

Outline
- Why multithreaded architectures
- The Cray Eldorado
- Programming environment
- Program examples

Overview
Eldorado is a peak in the North Cascades, and the internal Cray project name for the next MTA system. The goals: make the MTA-2 cost-effective and supportable, and retain the proven performance and programmability advantages of the MTA-2. Eldorado hardware is based on Red Storm; the programming model and software are based on the MTA-2.

~1970s: the generic computer
[Figure: a CPU issuing instructions against a single memory.] Memory keeps pace with the CPU; everything is in balance.

...but balance is short-lived
- Processors have gotten much faster (~1000x)
- Memories have gotten a little faster (~10x) and much larger (~1,000,000x)
- System diameters have gotten much larger (~1000x)
- Flops are free, but bandwidth is very expensive
- Systems are no longer balanced; processors are starved for data

20th-century parallel computer
[Figure: several CPUs, each behind its own cache, sharing one memory through narrow paths.] Avoid the straws and there are no flaws; otherwise, good luck to you.

Constraints on parallel programs
- Place data near computation
- Avoid modifying shared data
- Access data in order, and reuse it
- Avoid indirection and linked data structures
- Partition the program into independent, balanced computations
- Avoid adaptive and dynamic computations
- Avoid synchronization and minimize inter-process communication

The result: a tiny solution space.

Multithreading
Hide latencies via parallelism: maintain multiple active threads per processor, so that gaps introduced by long-latency operations in one thread are filled by instructions in other threads. This requires special hardware support:
- Multiple threads
- Single-cycle context switch
- Multiple outstanding memory requests per thread
- Fine-grain synchronization

The generic transaction
- Allocate memory
- Write/read memory
- Deallocate memory

[Figure: execution time line of the generic transaction with 1 thread.]

[Figure: execution time line with 2 threads; one thread's memory stalls overlap the other's work.]

[Figure: execution time line with 3 threads.]

Transactions per second on the MTA

  Threads      P1           P2           P4
        1     1,490.55
       10    14,438.27
       20    26,209.85
       30    31,300.51
       40    32,629.65
       50    32,896.10    65,890.24    123,261.82
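The shape of the curve follows from a simple latency argument (my arithmetic; the latency figure is illustrative, not from the slides): if a memory reference takes on the order of 100 cycles and each stream issues roughly one instruction per cycle,

    threads needed ≈ memory latency / work per thread between references ≈ 100

so throughput climbs steeply through the first few tens of threads and saturates near 50, at about 33,000 transactions per second on one processor. Past saturation, processors compose almost linearly: 2 and 4 processors deliver roughly 2x and 3.7x the single-processor rate.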

Eldorado processor (logical view)
[Figure: several programs run in parallel; each breaks into concurrent threads of computation (loop iterations, subproblems, serial code), which are mapped onto the processor's 128 hardware streams; ready instructions from the streams feed a pipeline of executing instructions.]

Eldorado system (logical view)
[Figure: the same picture extended across the machine; threads are multithreaded across multiple processors.]

What is not important
- Placing data near computation
- Modifying shared data
- Accessing data in order
- Using indirection or linked data structures
- Partitioning the program into independent, balanced computations
- Using adaptive or dynamic computations
- Minimizing synchronization operations

What is important
[Slide content not captured in the transcription.]

Eldorado system goals
Make the Cray MTA-2 cost-effective and supportable:
- Mainstream Cray manufacturing
- Mainstream Cray service and support
Eldorado is a Red Storm with MTA-2 processors. It leverages Red Storm technology:
- Red Storm cabinet, cooling, and power distribution
- Red Storm circuit boards and system interconnect
- Red Storm RAS system and I/O system (almost)
- Red Storm manufacturing, service, and support teams
And it retains key MTA technology:
- Instruction set
- Operating system
- Programming model and environment

Red Storm
Red Storm consists of over 10,000 AMD Opteron processors connected by an innovative high-speed, high-bandwidth 3D mesh interconnect designed by Cray (Seastar). Cray is responsible for the design, development, and delivery of the Red Storm system to support the Department of Energy's nuclear stockpile stewardship program for advanced 3D modeling and simulation. Red Storm uses a distributed-memory programming model (MPI).

Red Storm compute board
[Figure: board with 4 DIMM slots, redundant VRMs, an L0 RAS computer, and four Cray Seastar chips.]

Eldorado compute board
[Figure: the same board layout, with the four Seastar chips replaced by Seastar2.]

Eldorado system architecture
[Figure: the machine is split into a compute partition and a service & I/O partition.]
- Compute partition: runs MTX (BSD-based)
- Service partition: runs the Linux OS on specialized Linux nodes - login PEs, I/O server PEs, network server PEs, file-system metadata server PEs, and database server PEs
- I/O: PCI-X connections to 10 GigE networks and to Fibre Channel RAID controllers

Eldorado CPU
[Figure: block diagram of the Eldorado CPU.]

Speeds and feeds
[Figure: data rates around the CPU ASIC - 1.5 GFlops per processor; 500M memory ops/s issued by the CPU; 100M memory ops/s to the local 16 GB DDR DRAM; 140M memory ops/s network injection; 90M down to 30M sustained remote memory ops/s as the system grows from 1 to 4K processors.] Sustained memory rates are for random single-word accesses over the entire address space.

Eldorado memory
- Shared memory; some memory can be reserved as local memory at boot time, and only the compiler and runtime system have access to that local memory
- Memory-module cache: decreases latency and increases bandwidth, with no coherency issues
- 8-word data segments randomly distributed across the memory system: eliminates stride sensitivity and hotspots, and makes programming for data locality impossible; the segment moves to the cache, but only the requested word moves to the processor
- Full/empty bits on all data words
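The full/empty bits are what make fine-grain synchronization cheap. A minimal producer/consumer sketch, assuming the MTA C generics readfe() and writeef() (the names come from the MTA programming environment; treat the exact calling conventions shown here as an assumption):

    /* One-word channel synchronized entirely by its full/empty bit.
       No lock variable and no spin loop appear in user code. */
    double channel;              /* assume its full/empty bit starts empty */

    void produce(double v) {
        writeef(&channel, v);    /* wait until empty, write, set full */
    }

    double consume(void) {
        return readfe(&channel); /* wait until full, read, set empty */
    }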

MTA-2 / Eldorado comparison

                                  MTA-2                     Eldorado
  CPU clock speed                 220 MHz                   500 MHz
  Max system size                 256 P                     8192 P
  Max memory capacity             1 TB (4 GB/P)             128 TB (16 GB/P)
  TLB reach                       128 GB                    128 TB
  Network topology                Modified Cayley graph     3D torus
  Network bisection bandwidth     3.5 * P GB/s              15.36 * P^(2/3) GB/s
  Network injection rate          220 MW/s per processor    Variable (next table)

Eldorado scaling example

  Topology                        6x12x8     11x12x8    11x12x16   22x12x16    14x24x24
  Processors                      576        1056       2112       4224        8064
  Memory capacity                 9 TB       16.5 TB    33 TB      66 TB       126 TB
  Remote reference rate (per P)   60 MW/s    60 MW/s    45 MW/s    33 MW/s     30 MW/s
  Remote reference rate (aggr.)   34.6 GW/s  63.4 GW/s  95.0 GW/s  139.4 GW/s  241.9 GW/s
  Relative size                   1.0        1.8        3.7        7.3         14.0
  Relative performance            1.0        1.8        2.8        4.0         7.0
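To put the bisection formula in perspective (my arithmetic, using the formulas in the comparison table): at P = 4096, P^(2/3) = 256, so the Eldorado torus bisects at 15.36 * 256 ≈ 3.9 TB/s. Because that grows only as P^(2/3) while the MTA-2's figure grew linearly in P, the sustainable per-processor remote reference rate in the scaling table falls from 60 MW/s at 576 processors to 30 MW/s at 8064.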

Software
The compute nodes run MTX, a multithreaded Unix operating system; the service nodes run Linux. I/O service calls are offloaded to the service nodes, and the programming environment runs on the service nodes.
- Operating systems: Linux on service and I/O nodes; MTX on compute nodes; syscall offload for I/O
- Run-time system: job launch, node allocator, logarithmic loader, PBS Pro batch system
- Programming environment: Cray MTA compiler (Fortran, C, C++), mdb debugger, performance tools (canal, traceview)
- High-performance file system: Lustre
- System management and administration: accounting, Red Storm Management System (RSMS), graphical user interface

CANAL (Compiler ANALysis)
A static tool that shows how the code is compiled, and why.

Traceview
[Screenshot of the traceview performance tool.]

Dashboard
[Screenshot of the dashboard tool.]

What is Eldorado's sweet spot?
Any cache-unfriendly parallel application is likely to perform better on Eldorado - in particular, any application whose performance depends on:
- Random-access tables (GUPS, hash tables)
- Linked data structures (binary trees, relational graphs)
- Highly unstructured, sparse methods
- Sorting
Some candidate application areas:
- Adaptive meshes
- Graph problems (intelligence, protein folding, bioinformatics)
- Optimization problems (branch-and-bound, linear programming)
- Computational geometry (graphics, scene recognition and tracking)

Sparse matrix-vector multiply
C(n x 1) = A(n x m) * B(m x 1). Store A in packed row form: A[nz], where nz is the number of non-zeros; cols[nz] stores the column index of each non-zero; rows[n+1] stores the start index of each row in A, with rows[n] = nz so the inner loop bound is valid for the last row.

    #pragma mta use 100 streams
    #pragma mta assert no dependence
    for (i = 0; i < n; i++) {
        int j;
        double sum = 0.0;
        for (j = rows[i]; j < rows[i+1]; j++)
            sum += A[j] * B[cols[j]];
        C[i] = sum;
    }
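As a concrete, made-up instance of the packed-row layout (not from the slides), here is a complete toy program with n = 3, m = 4, and nz = 5; note that rows has n+1 entries:

    #include <stdio.h>

    int main(void) {
        /* A = [ 5 0 0 2 ]
               [ 0 3 0 0 ]
               [ 1 0 4 0 ]  in packed row form:                 */
        double A[]  = { 5, 2, 3, 1, 4 };  /* non-zeros, row by row */
        int  cols[] = { 0, 3, 1, 0, 2 };  /* their column indices  */
        int  rows[] = { 0, 2, 3, 5 };     /* row starts; rows[3] == nz */
        double B[]  = { 1, 1, 1, 1 }, C[3];

        for (int i = 0; i < 3; i++) {     /* the slide's loop */
            double sum = 0.0;
            for (int j = rows[i]; j < rows[i+1]; j++)
                sum += A[j] * B[cols[j]];
            C[i] = sum;
        }
        printf("%g %g %g\n", C[0], C[1], C[2]);   /* prints: 7 3 5 */
        return 0;
    }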

Canal report

        #pragma mta use 100 streams
        #pragma mta assert no dependence
        for (i = 0; i < n; i++) {
            int j;
    3 P     double sum = 0.0;
    4 P-    for (j = rows[i]; j < rows[i+1]; j++)
                sum += A[j] * B[cols[j]];
    3 P     C[i] = sum;
        }

    Parallel region 2 in SpMVM
        Multiple processor implementation
        Requesting at least 100 streams
    Loop 3 in SpMVM at line 33 in region 2
        In parallel phase 1
        Dynamically scheduled
    Loop 4 in SpMVM at line 34 in loop 3
        Loop summary: 3 memory operations, 2 floating point operations,
        3 instructions; needs 30 streams for full utilization; pipelined

Performance
N = M = 1,000,000; 0 to 1000 non-zeros per row, uniform distribution; nz = 499,902,410.

      P      T (s)    Sp
      1      7.11     1.0
      2      3.59     1.98
      4      1.83     3.88
      8      0.94     7.56

For comparison, IBM Power4 (1.7 GHz): 26.1 s.

Predicted time = (3 cycles * 499,902,410 iterations) / 220,000,000 cycles/s = 6.82 s,
so the measured 7.11 s corresponds to 6.82 / 7.11 ≈ 96% utilization.

Integer sort
Any parallel sort works well on Eldorado. Bucket sort is best if the universe is large and the elements don't cluster in just a few buckets.

Bucket sort:
1. Count the number of elements in each bucket
2. Calculate the start of each bucket in dst
3. Copy elements from src to dst, placing each element in the correct segment
4. Sort the elements in each segment

Step 1

    for (i = 0; i < n; i++)
        count[src[i] >> shift] += 1;

The compiler automatically parallelizes the loop, using an int_fetch_add instruction to increment count. [Figure: elements of src hashed into the count array, one counter per bucket.]
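Written out by hand, the parallelized counting loop amounts to the following (a sketch of the idea, not the compiler's actual output; int_fetch_add is the MTA's atomic fetch-and-add):

    /* Streams divide the iterations among themselves; the atomic
       int_fetch_add makes concurrent increments of the same
       counter safe without any locks. */
    for (i = 0; i < n; i++)
        int_fetch_add(&count[src[i] >> shift], 1);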

Step 2

    start[0] = 0;
    for (i = 1; i < n_buckets; i++)
        start[i] = start[i-1] + count[i-1];

The compiler automatically parallelizes most first-order recurrences. [Figure: src, dst, and the bucket starts start = 0, 3, 9, 14, 15, 22, 24, 26, with the end of the last bucket at 32.]

Step 3

    #pragma mta assert parallel
    #pragma mta assert noalias *src, *dst
    for (i = 0; i < n; i++) {
        bucket = src[i] >> shift;
        index = start[bucket]++;
        dst[index] = src[i];
    }

The compiler cannot automatically parallelize this loop, because the order in which elements are copied to dst is not deterministic; the assert parallel pragma says the order doesn't matter. The noalias pragma lets the compiler generate optimal code (3 instructions), and it uses an int_fetch_add instruction to increment start.
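Worked example matching the figure (my arithmetic): if count = {3, 6, 5, 1, 7, 2, 2, 6}, the exclusive scan yields start = {0, 3, 9, 14, 15, 22, 24, 26}, and the end of the last bucket is 26 + 6 = 32 = n.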

Step 4
Sort each segment. Since there are many more buckets than streams, and we assume the number of elements per bucket is small, any sort algorithm will do - we used a simple insertion sort. Most of the time is spent skipping through buckets of size 0 or 1.

Performance
N = 100,000,000

      P      T (s)    Sp
      1      9.14     1.0
      2      4.56     2.0
      4      2.41     3.79
      8      1.26     7.25

9.14 s * 220,000,000 cycles/s / 100,000,000 elements = 20.1 cycles/element
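The slides don't show code for step 4; here is a minimal sketch of one bucket's insertion sort (my code - it assumes the bucket boundaries were saved before step 3, since step 3 advances start[] to the ends of the buckets):

    /* Sort dst[lo..hi) in place.  Buckets are tiny, so the O(k^2)
       insertion sort is cheap and needs no synchronization. */
    void sort_bucket(unsigned *dst, int lo, int hi) {
        for (int i = lo + 1; i < hi; i++) {
            unsigned key = dst[i];
            int j = i - 1;
            while (j >= lo && dst[j] > key) {
                dst[j + 1] = dst[j];
                j--;
            }
            dst[j + 1] = key;
        }
    }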

Prefix operations on lists
P(i) = P(i-1) + V(i). List ranking - determining the rank of each node in a linked list - is a common procedure that occurs in many graph algorithms.
[Figure: a 16-node linked list stored in scrambled order, with each node's successor N and rank R.]

Helman and JaJa algorithm:
1. Mark nwalk nodes, including the first node; this divides the list into nwalk sublists
2. Traverse the sublists, computing each node's rank within its sublist
3. From the lengths of the sublists, compute the start of each sublist
4. Re-traverse each sublist, incrementing each node's local rank by the sublist's start
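A sketch of step 2's sublist traversal (the array names - next, head, is_marked, local_rank, sublist, length - are mine, not from the slides). Each walk is independent, so on the MTA every sublist can be ranked by its own thread:

    #define NIL (-1)   /* terminates the list */

    /* next[v]: successor of node v; head[w]: the w-th marked node.
       On exit, local_rank[v] is v's rank within its sublist,
       sublist[v] names that sublist, and length[w] feeds step 3. */
    void rank_sublists(const int *next, const char *is_marked,
                       const int *head, int nwalk,
                       int *local_rank, int *sublist, int *length)
    {
        for (int w = 0; w < nwalk; w++) {   /* parallel across walks */
            int rank = 0, node = head[w];
            do {
                local_rank[node] = rank++;
                sublist[node]    = w;
                node = next[node];
            } while (node != NIL && !is_marked[node]);
            length[w] = rank;
        }
    }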

[Figure, steps 1 and 2: the 16-node list with marked nodes chosen; each sublist is walked in parallel, filling in local ranks R.]

[Figure, step 3: the walk table W with sublist lengths L; a prefix sum over the lengths gives each sublist's starting rank.]

[Figure, step 4: each sublist is re-walked, adding its starting rank to every node's local rank to produce the final ranks.]

Performance comparison
[Charts: execution time of list ranking versus problem size (up to 100M nodes) for 1, 2, 4, and 8 processors. On the Cray MTA, both random and ordered lists complete in under ~4 seconds. On the Sun E4500, random lists take up to ~140 seconds and ordered lists up to ~40 seconds - the MTA is insensitive to list order, while the SMP is strongly order-dependent.]

Two more kernels

Linked-list search (time):

                                 N = 1000    N = 10000
  SunFire 880 MHz (1P)             9.30       107.0
  Intel Xeon 2.8 GHz 32b (1P)      7.15        40.0
  Cray MTA-2 (1P)                  0.49         1.98
  Cray MTA-2 (10P)                 0.05         0.20

Random Access (GUPS), giga-updates per second:

  IBM Power4 1.7 GHz (256P)      0.0055
  Cray MTA-2 (1P)                0.041
  Cray MTA-2 (5P)                0.204
  Cray MTA-2 (10P)               0.405

Eldorado timeline
ASIC tape-out (2004), prototype systems (2005), early systems of moderate size, 500-1000P (2006).
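For reference, the Random Access (GUPS) kernel in the table above streams read-modify-write updates to pseudo-random words of a large table. A minimal sketch in the spirit of the HPCC benchmark (the real benchmark differs in details such as timing rules and error tolerance):

    /* tsize must be a power of two.  GUPS = nupdates / seconds / 1e9. */
    void random_access(unsigned long *table, unsigned long tsize,
                       unsigned long nupdates)
    {
        unsigned long ran = 1;
        for (unsigned long i = 0; i < nupdates; i++) {
            /* step a maximal-length LFSR sequence (HPCC-style) */
            ran = (ran << 1) ^ ((long)ran < 0 ? 0x7UL : 0UL);
            table[ran & (tsize - 1)] ^= ran;   /* random update */
        }
    }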

Eldorado summary
- Eldorado is a low-risk development project: it reuses much of the Red Storm infrastructure, which is part of Cray's standard product path (Rainier module, Cascade LWP)
- Eldorado retains the MTA-2's easy-to-program model, with parallelism managed by the compiler and run-time
- Ideal for applications that do not run well on SMP clusters
- A high-productivity system for algorithms research and development
- Applications scale to very large system sizes
- 15-18 months to system availability