Introducing the Cray XMT. Petr Konecny, May 4th 2007


Agenda
- Origins of the Cray XMT
- Cray XMT system architecture
- Cray XT infrastructure
- Cray Threadstorm processor
- Shared memory programming model: benefits, drawbacks, solutions
- Basic programming environment features
- Strong Cray XMT application areas
- Examples: HPCC RandomAccess, hash tables
- Summary

Origins of the Cray XMT
- Cray XMT (a.k.a. Eldorado): upgrade the Cray XT by replacing the Opteron with the Threadstorm processor
- Multithreaded Architecture (MTA): shared memory programming model, thread-level parallelism, lightweight synchronization
- Cray XT infrastructure: scalable I/O, HSS, support; network efficient for small messages

Cray XMT system architecture
- Compute partition: MTK (BSD-derived) on Threadstorm nodes
- Service and I/O partition: Linux OS on specialized Linux nodes (login PEs, I/O server PEs, network server PEs, FS metadata server PEs, database server PEs)
- PCI-X connections to the 10 GigE network and to Fibre Channel RAID controllers

Cray XMT speeds and feeds (diagram)
- Threadstorm ASIC: 500M instructions/s, 500M memory op/s
- 66M cache lines/s to 4, 8 or 16 GB DDR DRAM
- 500M memory op/s; 140M memory op/s; 110M down to 30M memory op/s (1 to 4K processors); bisection bandwidth impact

Cray Threadstorm architecture
- Streams (128 per processor): registers, program counter, other state
- Protection domains (16 per processor): each provides an address space; every running stream belongs to exactly one protection domain
- Functional units (three per processor)
- Memory buffer (cache): stores only data of the DIMMs attached to the processor; all requests go through the buffer; 128 KB, 4-way associative, 64-byte cache lines

Threadstorm CPU architecture (continued) (diagram)

Shared memory model
Benefits:
- Uniform memory access; memory is distributed across all nodes
- No (need for) explicit message passing: a productivity advantage over MPI
Drawbacks:
- Latency: the time for a single operation
- Network bandwidth limits performance
- Legacy MPI codes

Cray XMT addresses the shared memory drawbacks
Latency:
- Little's law: Concurrency = Bandwidth * Latency; e.g. 800 MB/s at 2 µs latency => 200 concurrent 64-bit word operations
- Need a lot of concurrency to maximize bandwidth
- Concurrency per thread (ILP, vector, SSE) => SPMD
- Many threads (MTA, XMT) => MPMD
Network bandwidth:
- Provision lots of bandwidth: ~1 GB/s per processor, ~5 GB/s per router
- Efficient for small messages
- Software-controlled caching (registers, nearby memory) reduces network bandwidth and eliminates cache-coherency traffic

XMT programming environment supports multithreading
- Flat distributed shared memory
- Rely on the parallelizing compilers: they do great with loop-level parallelism
- Some computations need to be restructured, to expose parallelism and for thread safety
- Light-weight threading: full/empty bit on every word (writeef/readfe/readff/writeff), compact thread state, low thread overhead, low synchronization overhead
- Performance tools: Apprentice2 parses compiler annotations and visualizes runtime behavior

Traits of strong Cray XMT applications
1. Use lots of memory: the Cray XMT supports terabytes
2. Lots of parallelism: Amdahl's law; parallelizing compiler
3. Fine granularity of memory access: the network is efficient for all (including short) packets
4. Data hard to partition: uniform shared memory alleviates the need to partition
5. Difficult load balancing: uniform shared memory enables work migration

Several Cray XMT application areas
- Graph problems (intelligence, protein folding, bioinformatics)
- Optimization problems (branch-and-bound, linear programming)
- Computational geometry (graphics, scene recognition and tracking)
- Coupled physics with multiple materials and time scales

Let's look deeper at two kernels common to these applications.

HPCC RandomAccess (part 1)
Update a large table based on a random number generator.

NEXTRND returns the next value of the RNG:

    unsigned rnd = 1;
    for (i = 0; i < nupdate; i++) {
        rnd = NEXTRND(rnd);
        Table[rnd & (size-1)] ^= rnd;
    }

HPCC_starts(k) returns the k-th value of the RNG:

    for (i = 0; i < nupdate; i++) {
        unsigned rnd = HPCC_starts(i);
        Table[rnd & (size-1)] ^= rnd;
    }

The compiler can automatically parallelize this loop; it generates readfe/writeef for atomicity.

HPCC RandomAccess (part 2)
HPCC_starts is expensive, so restructure the loop to amortize its cost:

    for (i = 0; i < nupdate; i += bigstep) {
        unsigned v = HPCC_starts(i);
        for (j = 0; j < bigstep; j++) {
            v = NEXTRND(v);
            Table[v & (size-1)] ^= v;
        }
    }

The compiler parallelizes the outer loop across all processors. Apprentice2 reports five instructions per update (including NEXTRND) and two (synchronized) memory operations per update.

HPCC RandomAccess (part 3)
Performance analysis:
- Most cache lines are dirty, so loads usually miss the memory buffer: write back and evict some cache line, then load the new data into the freed line
- The data is usually still in the buffer at the time of the store
- Two DIMM transfers per update; peak of 33M updates/s/processor
Results on 64-CPU pre-production hardware (hardware bug workarounds limit memory performance):
- 1.3 Gup/s on 64P: about 60% of peak
- 95% scaling efficiency (from 1P to 64P)

Chained hash table
(diagram: a Table array of bucket pointers, each chaining Nodes that hold the example keys 7, 91, 133, 42, 80, 53, 129, 34, 27, 62)
- Table is an array of pointers to Node
- A Node contains a key and a pointer to the next Node

Key lookup

    Node* lookup(keytype key) {
        Node *node = Table[hash(key)];
        while (node && node->key != key)
            node = node->next;
        return node;
    }

- Low concurrency per thread (one node at a time); poor cache reuse
- Can perform multiple concurrent lookups
- The control dependency in the loop complicates vectorization

Using the hash table
- Use a large table, to get O(1) amortized cost per operation and to avoid contention
- A single lookup is still limited by latency
- Insert random elements, then look up all elements and count how many are absent:

    for (i = 0; i < dsize; i++)
        bad += (lookup(data[i]) == 0);

- The compiler parallelizes the loop across all processors and turns += into a reduction

Software performance rules of thumb
- Instructions are cheap compared to memory operations
- Most workloads will be limited by bandwidth, so keep enough memory operations in flight at all times: load balance, and minimize synchronization
- Use moderately cache-friendly algorithms: cache hits are not necessary to hide latency, but the cache can improve effective bandwidth (~40% cache hit rate for distributed memory, ~80% for nearby memory); reduce cache footprint
- Be careful about speculative loads (bandwidth is scarce)

Summary
- The Cray XMT adds value for an important class of problems: terabytes of memory, irregular access with small granularity, lots of parallelism
- Shared memory programming is good for productivity
- We are starting to gather numbers now on the pre-production system