
Top500 Supercomputer list
- Represents parallel computers, so distributed systems such as SETI@Home are not considered
- Does not consider storage or I/O issues
- Both custom-designed machines and commodity machines win positions in the list
- General trend towards commodity machines (COTS: Commodity Off-The-Shelf)
- Connecting a large number of machines with relatively lower performance is more rewarding than connecting a small number of machines, each with high performance (reference paper: "A note on the Zipf distribution of Top500 supercomputers")
- Performance doubles every year
- Ranked in terms of Flops, obtained by the Linpack benchmark
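As a side note (a sketch added here, not part of the original slide), the theoretical peak performance used for this ranking is commonly estimated by multiplying node count, cores per node, clock rate, and floating-point operations per cycle; all figures below are hypothetical.

#include <stdio.h>

/* Sketch: back-of-the-envelope peak-performance estimate.
 * All numbers are hypothetical, not a real Top500 entry. */
int main(void) {
    double nodes = 1000;           /* hypothetical node count */
    double cores_per_node = 12;    /* hypothetical cores per node */
    double clock_ghz = 3.0;        /* clock rate in GHz */
    double flops_per_cycle = 4;    /* e.g. SIMD units doing 4 FLOPs/cycle */

    double rpeak_gflops = nodes * cores_per_node * clock_ghz * flops_per_cycle;
    printf("Rpeak ~ %.1f TFlops\n", rpeak_gflops / 1000.0);  /* 144.0 TFlops */
    return 0;
}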

Challenging the Top 500 list
Prof. Jack Dongarra, the developer of the Top 500 list and Linpack, rebuts:
- "Graph 500 isn't the first new benchmark to challenge the High-Performance Linpack"
- "The Graph 500 score shouldn't be seen as some definitive number any more than the Linpack score used today"
- "If Graph 500 was the only benchmark we had, we'd criticize that too"

Understand the list

Understand the list
- Rmax: the maximum benchmarked performance
- Rpeak: the theoretical maximum performance
- The ratio of Rmax to Rpeak indicates how efficient the implementation is (see the sketch below)
- Nmax: the problem size at which Rmax is obtained
- Nhalf: the problem size at which half of Rmax is obtained
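A minimal sketch of the Rmax/Rpeak efficiency calculation (my addition), using the BlueGene/L figures quoted later in these notes:

#include <stdio.h>

/* Sketch: Linpack efficiency = Rmax / Rpeak. */
int main(void) {
    double rmax_tflops = 478.2;   /* benchmarked performance */
    double rpeak_tflops = 596.4;  /* theoretical peak */
    printf("Efficiency: %.1f%%\n", 100.0 * rmax_tflops / rpeak_tflops);  /* 80.2% */
    return 0;
}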

Supercomputer distributions

Performance distribution

Supercomputers in the UK
25 supercomputers in the list come from the UK:
- University of Edinburgh: No. 25
- AWE: No. 53
- ECMWF: No. 56
- University of Warwick: Francesca, used to be No. 196 (11/2007) and No. 476 (06/2008) http://www.top500.org/site/2860

TianHe-1A
- 112 compute cabinets, 12 storage cabinets, 6 communications cabinets, and 8 I/O cabinets
- Each compute cabinet is composed of four frames, with each frame containing eight blades plus a 16-port switching board
- Each blade is composed of two compute nodes, with each compute node containing two Xeon X5670 6-core processors (3 GHz) and one Nvidia M2050 GPU
- In total: 14,336 Xeon X5670 processors and 7,168 Nvidia Tesla M2050 GPUs (see the check below)
- The total disk storage of the system is 2 Petabytes, and the total memory size is 262 Terabytes
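A quick arithmetic check (added here, not in the original slide) that the processor totals follow from the cabinet/frame/blade hierarchy above:

#include <stdio.h>

/* Sketch: deriving TianHe-1A's processor counts from its packaging. */
int main(void) {
    int cabinets = 112;       /* compute cabinets */
    int frames = 4;           /* frames per cabinet */
    int blades = 8;           /* blades per frame */
    int nodes_per_blade = 2;  /* compute nodes per blade */
    int cpus_per_node = 2;    /* Xeon X5670 per node */
    int gpus_per_node = 1;    /* Nvidia M2050 per node */

    int nodes = cabinets * frames * blades * nodes_per_blade;  /* 7,168 nodes */
    printf("CPUs: %d, GPUs: %d\n", nodes * cpus_per_node, nodes * gpus_per_node);
    /* prints: CPUs: 14336, GPUs: 7168 -- matching the totals above */
    return 0;
}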

TianHe-1A
- Interconnect: a Chinese-designed (NUDT*) custom proprietary high-speed interconnect called Arch, running at 160 Gbps, twice the bandwidth of InfiniBand
- Applications: computations for petroleum exploration and aircraft simulation
- It is an "open access" computer, meaning it provides services for other countries
- Cost $88 million to build, plus around $20 million annually in energy and operating costs
- Employs 200 workers to keep it operating and functioning properly
*National University of Defense Technology

BlueGene/L
- No. 12 now; used to be No. 1 from 2004 to 2007
- First supercomputer in the Blue Gene project
- Philosophy: high density of processors in a small area; comparatively slow processors, just lots of them! Fast interconnects with low latency.

Architecture of BlueGene/L
- Individual PowerPC 440 processors at 700 MHz
- Two processors reside in a single chip
- Two chips reside on a compute card with 512 MB memory
- 16 of these compute cards are placed on a node board
- 32 node boards fit into one cabinet (see the arithmetic check below)
- 212,992 CPUs in total, with Rmax of 478.2 TFlops and Rpeak of 596.4 TFlops
- Multiple network topologies available, which can be selected depending on the application
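A small arithmetic check (my addition): multiplying out the packaging hierarchy gives 2,048 CPUs per cabinet, so the quoted 212,992-CPU total corresponds to the expanded 104-cabinet configuration rather than the original 64-cabinet installation.

#include <stdio.h>

/* Sketch: CPU count from the BlueGene/L packaging hierarchy above.
 * Note: 64 cabinets give 131,072 CPUs (the original 2005 system);
 * the 212,992-CPU / 478.2 TFlops figures match the expanded
 * 104-cabinet configuration. */
int main(void) {
    int per_cabinet = 2   /* CPUs per chip */
                    * 2   /* chips per compute card */
                    * 16  /* compute cards per node board */
                    * 32; /* node boards per cabinet */
    printf("CPUs per cabinet: %d\n", per_cabinet);         /* 2048 */
    printf("104 cabinets: %d CPUs\n", 104 * per_cabinet);  /* 212992 */
    return 0;
}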

Jaguar (No. 1 from 2009 to 2010, No. 2 now)

Jaguar
How is it built? http://www.youtube.com/watch?v=zx677ccebac

Architecture Classifications
A taxonomy of parallel architectures: in 1972, Flynn categorised HPC architectures into four classes, based on how many instruction and data streams can be observed in the architecture:
- SISD (Single Instruction, Single Data): instructions operate sequentially on a single stream of data in a single memory. The classic Von Neumann architecture. Machines may still consist of multiple processors operating on independent data; these can be considered as multiple SISD systems.
- SIMD (Single Instruction, Multiple Data): a single instruction stream (broadcast to all processors) acting on multiple data. The most common form of this architecture class is vector processors, which can deliver results several times faster than scalar processors.

Architecture Classifications
- MISD (Multiple Instruction, Single Data): no practical implementations of this architecture.
- MIMD (Multiple Instruction, Multiple Data): multiple instruction streams acting on different (but related) data. Note the difference between multiple SISD and MIMD.

Architecture Classifications
- MIMD: MPP, cluster
- SISD: machine with a single scalar processor
- SIMD: machine with vector processors
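To make the SISD/SIMD distinction concrete, here is a small sketch (my addition, not from the slides): the same element-wise addition written as a plain scalar loop and as a loop a vectorising compiler can map onto SIMD hardware.

#include <stdio.h>

#define N 1024

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* SISD view: conceptually one instruction on one datum per step. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* SIMD view: the iterations are independent, so the compiler can
     * issue vector instructions that process many elements at once
     * (e.g. via -O3 auto-vectorisation, or "#pragma omp simd" when
     * compiled with OpenMP support). */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %.1f\n", c[10]);  /* 30.0 */
    return 0;
}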

Parallelism in single processor systems
- Pipelines
  - Perform more operations per clock cycle (reduce the idle time of hardware components)
  - Difficult to keep pipelines full: dependencies among instructions, branches
- Vector architectures
  - One master processor and multiple maths co-processors
  - Large memory bandwidth and low-latency access; no cache because of the above
  - Perform operations involving large matrices, commonly encountered in engineering areas

Multiprocessor Parallelism
Use multiple processors on the same program (see the sketch below):
- Divide the workload up between processors, often by dividing up a data structure
- Each processor works on its own data
- Typically processors need to communicate, via:
  - Shared memory
  - Sending messages explicitly
  - Distributed shared memory (a virtual global address space)
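A minimal shared-memory sketch (my addition, assuming an OpenMP-capable compiler): the array is divided between threads, each thread works on its own chunk, and the partial sums are combined.

#include <stdio.h>
#include <omp.h>

#define N 1000000

/* Sketch: shared-memory parallelism with OpenMP. The loop iterations
 * (the data structure) are divided between threads; the reduction
 * combines each thread's private partial sum.
 * Compile with e.g.: gcc -fopenmp sum.c */
int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];   /* each thread works on its own share of a[] */

    printf("sum = %.0f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}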

Granularity of Parallelism
Defined as the size of the computations that are being performed in parallel
Four types of parallelism (in order of granularity size):
- Instruction-level parallelism (e.g. pipelining)
- Thread-level parallelism (e.g. running a multi-threaded Java program)
- Process-level parallelism (e.g. running an MPI job on a cluster)
- Job-level parallelism (e.g. running a batch of independent jobs on a cluster)

Dependency and Parallelism
Dependency: if event A must occur before event B, then B is dependent on A
Two types of dependency (see the annotated sketch below):
- Control dependency: waiting for the instruction which controls the execution flow to be completed. In IF (X != 0) THEN Y = 1.0/X, the statement Y = 1.0/X has a control dependency on X != 0
- Data dependency: dependency because of calculations or memory access
  - Flow dependency: A = X + Y; B = A + C;
  - Anti-dependency: B = A + C; A = X + Y;
  - Output dependency: A = 2; X = A + 1; A = 5;
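The same dependencies written out in C (my addition, following the slide's snippets), with a comment naming each hazard:

#include <stdio.h>

/* Sketch: the dependency types above, written out in C. */
int main(void) {
    double X = 2.0, Y = 0.0, A, B, C = 1.0;

    /* Control dependency: Y = 1.0/X may only execute once the
     * branch condition X != 0 has been resolved. */
    if (X != 0.0)
        Y = 1.0 / X;

    /* Flow (true) dependency: B reads A, so A = X + Y must finish first. */
    A = X + Y;
    B = A + C;

    /* Anti-dependency: B reads A before A is overwritten;
     * the write to A must not be moved ahead of the read. */
    B = A + C;
    A = X + Y;

    /* Output dependency: two writes to A; their order determines
     * A's final value (here 5), so they must not be reordered. */
    A = 2;
    X = A + 1;
    A = 5;

    printf("A=%.0f B=%.1f X=%.0f Y=%.2f\n", A, B, X, Y);
    return 0;
}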