Chapter 5 Supercomputers


Part I. Preliminaries
  Chapter 1. What Is Parallel Computing?
  Chapter 2. Parallel Hardware
  Chapter 3. Parallel Software
  Chapter 4. Parallel Applications
  Chapter 5. Supercomputers
Part II. Tightly Coupled Multicore
Part III. Loosely Coupled Cluster
Part IV. GPU Acceleration
Part V. Map-Reduce
Appendices

Many readers of this book will never write a large-scale parallel scientific or engineering application that needs to run on a supercomputer to achieve acceptable performance. However, it's important to understand how supercomputer performance is evaluated. Therein lie lessons pertinent to any parallel application.

Since 1993, the Top500 List has been publishing a list of the 500 fastest supercomputers in the world. Top500 invites anyone who owns a supercomputer to run a standard benchmark: the Highly Parallel Computing Benchmark. This is one program from the Linpack Benchmark suite of programs. The program solves a system of simultaneous linear equations, expressed in matrix form as Mx = b, where M is a dense matrix (one where most or all of the elements are nonzero). The size of the matrix (the number of simultaneous equations) is chosen to be as large as possible while still fitting in the computer's memory. Given the size of the matrix, the number of 64-bit floating point operations (additions, multiplications, reciprocals, and so on) needed to solve the matrix equation is determined. The program is executed on the supercomputer, and the program's running time is measured. The performance metric is the rate at which the program executes floating point operations, calculated as the total number of floating point operations divided by the running time in seconds. The metric's units are floating point operations per second, or flops.

Supercomputer owners all over the world run the Highly Parallel Computing Benchmark on their supercomputers and send the measured flops metrics, along with information about their machines, to Top500. Twice a year, in June and November, Top500 publishes a list of the top 500 supercomputers: the 500 fastest supercomputers, in descending order of flops. Supercomputers nowadays are so fast that their performance is expressed in teraflops or petaflops rather than just flops. One teraflops equals one trillion flops, or 10^12 flops. One petaflops equals 10^15 flops. Here are the top five supercomputers on the June 2018 Top500 List, along with their speeds on the Highly Parallel Computing Benchmark:

1. Summit, United States: 122.3 petaflops
2. Sunway TaihuLight, China: 93.0 petaflops
3. Sierra, United States: 71.6 petaflops
4. Tianhe-2A, China: 61.4 petaflops
5. AI Bridging Cloud Infrastructure (AIBCI), Japan: 19.9 petaflops

For comparison, a desktop PC's floating point performance is in the single gigaflops range (10^9 flops), about 100 million times slower than Summit.

Besides flops rates, Top500 publishes information about the supercomputers themselves: the number of cores, the CPU chips, the accelerators if any, the backend network, and so on.
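To make the flops metric concrete, here is a minimal sketch of the calculation in Python. It assumes the standard operation count of (2/3)n^3 + 2n^2 for solving an n-by-n dense system by LU factorization; the matrix size and running time below are made-up illustrative values, not measurements from any actual machine.

    # Sketch of the Top500 flops metric. Assumes the standard
    # (2/3)n^3 + 2n^2 operation count for an LU-based dense solve;
    # n and seconds are made-up illustrative values.

    def linpack_flops(n, seconds):
        """Flops rate for solving an n-by-n dense system in `seconds`."""
        operations = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
        return operations / seconds

    rate = linpack_flops(n=10_000_000, seconds=3600.0)
    print(f"{rate / 1e15:.1f} petaflops")   # about 185.2 petaflops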

Among the machines themselves there is a wide variety. Considering just the top five supercomputers, for example:

- The number of cores ranges from 391,680 (AIBCI) to 2,282,544 (Summit).
- Summit, Sierra, Tianhe-2A, and AIBCI use commodity CPU chips; Sunway TaihuLight uses proprietary CPU chips.
- Summit, Sierra, and AIBCI use GPU accelerators; Sunway TaihuLight and Tianhe-2A do not use accelerators.
- Summit, Sierra, and AIBCI use a commodity backend network; Sunway TaihuLight and Tianhe-2A use a proprietary backend network.

The only thing these five supercomputers have in common is that they are very large clusters of multicore nodes. In fact, all the supercomputers on the Top500 List are variously sized clusters of multicore nodes.

The Top500 List ranks supercomputers based on their performance on solving dense systems of linear equations. Such calculations were typical of scientific and engineering applications in previous decades. In recent years, however, other kinds of calculations have become prevalent. A supercomputer that executes dense matrix calculations quickly will not necessarily execute other kinds of algorithms at the same speed. To measure supercomputer performance on these newer calculations, other benchmarks and their associated top lists have arisen.

The Graph500 List operates like the Top500 List, except it uses different benchmarks and a different performance metric. The Graph500 benchmarks are programs that calculate with graphs, rather than dense matrices. A graph is a collection of vertices and edges connecting pairs of vertices. Graphs are often used in big data analytics applications, such as social network analytics. For example, Facebook maintains an enormous graph where the vertices are Facebook users and the edges are Facebook friend relationships. Facebook wants to know who your friends are, who the friends of your friends are, who the friends of the friends of your friends are, and so on, so they can recommend new friends for you, thereby (once you friend them) increasing the number of posts in your news feed, increasing the number of ads they show you, and increasing their ad revenue, their ultimate goal. Graph models are also used to study metabolic pathways in organisms, the spread of infectious disease through a population, the spread of malware through computer networks, and other scientific applications.

Graphs can be represented as matrices. But unlike the dense matrices in the Top500 benchmark, graph matrices are typically large and sparse. A graph's matrix might have billions of rows and columns, but the nonzero elements might be only a minuscule fraction of the total elements; almost all of the elements are zero. It makes no sense to allocate storage for all those zero elements; only the nonzero elements would be held in the computer's memory.
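As a sketch of what storing only the nonzero elements can look like, here is a tiny compressed sparse row (CSR) style adjacency structure in Python. The four-vertex graph is invented for illustration; real Graph500 inputs have billions of vertices.

    # Compressed sparse row (CSR) style storage for a graph's adjacency
    # matrix: only the nonzero elements (the edges) are stored.
    # Tiny invented graph: edges 0-1, 0-2, 1-2, 2-3 (undirected, so
    # each edge appears in both endpoints' neighbor lists).

    row_start = [0, 2, 4, 7, 8]            # where each vertex's neighbors begin
    adjacent = [1, 2, 0, 2, 0, 1, 3, 2]    # concatenated neighbor lists

    def neighbors(v):
        """The vertices adjacent to vertex v."""
        return adjacent[row_start[v]:row_start[v + 1]]

    print(neighbors(2))   # prints [0, 1, 3]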

But this means that algorithms that operate on graphs are fundamentally different from algorithms that operate on dense matrices. Compared to a matrix program, a graph program spends little or no time on floating point operations. Rather, the bulk of the time consists of traversing the graph: going from one vertex to another along the edges.

The Graph500 benchmark consists of three programs: a program that generates a very large, sparse graph; a program that does a breadth first traversal from selected vertices in that graph; and a program that finds the shortest paths through the graph from selected vertices to every other vertex. (Facebook's analysis mentioned above is based on a breadth first traversal.) The programs are executed on the supercomputer, the programs count the number of edges traversed during execution, and the programs' running times are measured. The performance metric is the rate at which the programs traverse edges, calculated as the total number of edge traversals divided by the running time in seconds. The metric's units are traversed edges per second, or teps.

Since 2010, the Graph500 List has been published twice a year, at the same time as the Top500 List. Here are the top five supercomputers on the June 2018 Graph500 List, along with their speeds on the breadth first search benchmark. The speeds are in units of terateps (10^12 teps).

1. K Computer, Japan: 38.6 terateps
2. Sunway TaihuLight, China: 23.8 terateps
3. Sequoia, United States: 23.8 terateps
4. Mira, United States: 15.0 terateps
5. JUQUEEN, Germany: 5.8 terateps

Note that four of the top five supercomputers on the Top500 List are not among the top five on the Graph500 List. A computer that does dense matrix calculations quickly does not necessarily do graph calculations quickly, and vice versa.
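Here is a minimal sketch, in Python, of how a teps figure is derived from a breadth first traversal: count every edge scanned, then divide by the running time. It is a plain serial BFS, not the actual Graph500 benchmark code; `neighbors` is any function that returns a vertex's adjacent vertices, such as the CSR lookup sketched earlier.

    # Minimal serial sketch of the teps metric: a breadth first traversal
    # that counts traversed edges and divides by elapsed time. Not the
    # actual Graph500 code, which runs in parallel on enormous graphs.
    import time
    from collections import deque

    def bfs_teps(neighbors, source, n_vertices):
        visited = [False] * n_vertices
        visited[source] = True
        queue = deque([source])
        edges_traversed = 0
        start = time.perf_counter()
        while queue:
            v = queue.popleft()
            for w in neighbors(v):
                edges_traversed += 1        # every edge scan counts
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
        elapsed = time.perf_counter() - start
        return edges_traversed / elapsed    # traversed edges per second

Calling bfs_teps(neighbors, 0, 4) on the toy CSR graph above returns a meaninglessly large figure, since the whole graph fits in cache; the metric only becomes interesting on graphs far too large for cache, for the reasons discussed next.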

Why are the top five dense matrix benchmark rates so much faster than the top five graph benchmark rates: petaflops versus only terateps? Much of the difference arises because dense matrix programs tend to be cache friendly, while graph programs tend not to be. A dense matrix's elements are stored in adjacent locations in the computer's main memory, and a dense matrix program tends to access the matrix's elements in the order they are stored in memory. When the program reads a certain element, the entire cache line containing that element is loaded from main memory into the L2 and L1 caches. When the program reads the next element (in an adjacent memory location), much of the time that element is already present in the caches, and so the element can be retrieved from the caches at nearly the full speed of the CPU. In other words, most of the memory accesses in a dense matrix program are fast cache hits.

While a sparse matrix's elements (the nonzero elements) are still stored in adjacent memory locations, a graph program tends not to access the matrix's elements in the order they are stored in memory. Rather, when the graph program traverses an edge from one vertex (matrix element) to another, the target element is often not in an adjacent memory location, and therefore is often not in the cache. The graph program continually experiences cache misses, forcing elements to be loaded from the slow main memory rather than the cache. As a result, the CPU has to spend much of its time waiting for data to arrive from the main memory, leading to much smaller teps rates compared to flops rates. In the context of data-intensive applications like graph analytics, folks have started to refer to the memory wall as a prominent killer of supercomputer performance. It is as though the CPU is a car speeding down a race track, but every 50 feet it runs into a brick wall. The car is not going to finish the race very quickly.

The Top500 List measures supercomputer performance on solving linear systems expressed as dense matrices. Another category of scientific and engineering computation works with partial differential equations (PDEs). Fluid dynamics problems, such as aircraft design, weather forecasting, and climate modeling, are commonly expressed as PDEs. PDE solving programs perform intensive floating point calculations (like dense matrix programs), but on sparse matrices (like graph programs). To rank supercomputer speeds on these kinds of computations, a new top list based on a new benchmark, the High Performance Conjugate Gradient (HPCG) Benchmark, has been published twice a year since 2014. Conjugate gradient refers to a particular technique for solving a PDE. The HPCG benchmark program measures supercomputer performance in flops, like the Top500 benchmark. Here are the top five supercomputers on the June 2018 HPCG List, along with their speeds on the HPCG benchmark in petaflops.

1. Summit, United States: 2.93 petaflops
2. Sierra, United States: 1.80 petaflops
3. K Computer, Japan: 0.60 petaflops
4. Trinity, United States: 0.55 petaflops
5. Piz Daint, Switzerland: 0.49 petaflops

The HPCG flops rates are considerably slower than the Top500 flops rates, again largely due to the memory wall encountered with sparse matrix computations.
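The memory wall can be glimpsed even on a desktop PC. The sketch below sums the same array twice, once in storage order and once in a shuffled order; the second pass tends to run noticeably slower because of cache misses. The exact ratio is machine dependent, and pure Python interpreter overhead mutes the effect compared to what a C program would show.

    # Same work, different access patterns: sequential order is cache
    # friendly, shuffled order is not. Timings are machine dependent.
    import random
    import time

    N = 10_000_000
    data = list(range(N))
    seq_order = list(range(N))
    rnd_order = seq_order[:]
    random.shuffle(rnd_order)       # same indices, cache-hostile order

    def timed_sum(order):
        start = time.perf_counter()
        total = 0
        for i in order:
            total += data[i]
        return time.perf_counter() - start

    print("sequential:", timed_sum(seq_order), "seconds")
    print("shuffled:  ", timed_sum(rnd_order), "seconds")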

Besides raw computational speed, folks are becoming increasingly concerned with supercomputers' energy efficiency. When Summit runs the Top500 benchmark program at a rate of 122.3 petaflops, it consumes electrical energy at a rate of 8.8 megawatts, equivalent to the energy consumption of nearly seven thousand average United States households. The fossil fuels burned to generate the electricity for a supercomputer release carbon into the atmosphere, contributing to global warming. The supercomputer's cooling system pumps heat into the environment, also contributing to global warming. Folks might be willing to trade off a slower computation rate, leading to an increased time to finish a program, to gain a reduction in energy usage. From this point of view, the important question is: How many floating point operations can a supercomputer perform for every unit of energy consumed? A higher number indicates a more energy efficient supercomputer. To quantify energy efficiency, the machine's computation rate in flops is divided by its energy consumption rate in watts, yielding an energy efficiency metric in units of flops per watt.

Since 2013, Top500 has published an additional list, the Green500 List, which ranks supercomputers based on energy efficiency. The benchmark is the same as the Top500 List's, the solution of a dense system of linear equations, but the metric is flops per watt. Here are the top five supercomputers on the June 2018 Green500 List, along with their energy efficiencies in gigaflops per watt (10^9 flops/watt):

1. Shoubu System B, Japan: 18.4 gigaflops/watt
2. Suiren-2, Japan: 16.8 gigaflops/watt
3. Sakura, Japan: 16.7 gigaflops/watt
4. DGX SaturnV Volta, United States: 15.1 gigaflops/watt (1.07 petaflops)
5. Summit, United States: 13.9 gigaflops/watt (122.3 petaflops)

Only one machine, Summit, is in the top five both for raw computational speed and for energy efficiency. Summit, number one on the Top500 List, is number five on the Green500 List. Shoubu System B, number one on the Green500 List, is number 359 on the Top500 List. The fastest supercomputer is not the most energy efficient supercomputer, and vice versa.
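As a quick check that the metric is just a ratio, Summit's Green500 figure follows directly from the two Summit numbers quoted above:

    # Energy efficiency = computation rate / power draw.
    flops = 122.3e15                 # 122.3 petaflops
    watts = 8.8e6                    # 8.8 megawatts
    print(flops / watts / 1e9)       # about 13.9 gigaflops per watt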

Descending from the rarefied heights of supercomputer performance, we can discern three lessons for anyone writing parallel programs.

First lesson: Performance matters. It's not enough merely to write a program that solves the computational problem. The program must also get the answer in as little time as possible. Certain software design choices might lead to smaller running times, even when running on a parallel computer. Other design choices might lead to larger running times; such design choices are to be avoided. In later chapters we will see examples of how different design choices affect program performance.

This leads to the second lesson: Assess performance using measured running time data from actual programs. We will be doing exactly that in subsequent chapters. Insights gained from performance measurements can guide design choices, and we will see examples of those as well.

Third lesson: Memory matters. Both the amount of memory a program requires and the pattern in which the program accesses the memory affect the program's performance. To the extent practical, data structures that use less memory and fewer CPU cycles are preferable. Later chapters will include examples of memory-lean data structures.

Now we're ready to start learning how to write parallel programs.

Points to Remember

- Various benchmark programs are used to assess supercomputer performance on different kinds of calculations.
- Various metrics are used to assess supercomputers, including floating point operations per second (flops), traversed edges per second (teps), and flops per watt.
- The Top500 List rates supercomputer performance in flops on dense matrix calculations. All of the Top500 supercomputers are clusters of multicore nodes.
- The Graph500 List rates supercomputer performance in teps on graph algorithms.
- The HPCG List rates supercomputer performance in flops on partial differential equation (PDE) solving programs.
- The Green500 List rates supercomputer energy efficiency in flops per watt on dense matrix calculations.
- Performance matters.
- Assess performance using measured running time data from actual programs.
- Memory matters.

