Overview of High Performance Computing


1 Overview of High Performance Computing Timothy H. Kaiser, Ph.D. Show_me_some_local_HPC_tutorials/ 1

2 Introduction What is High Performance Computing? Why go parallel? When do you go parallel? What are some limits of parallel computing? Types of parallel computers Some terminology What is available How this all works 2

3 What the Exa?
Exa  = 1,152,921,504,606,846,976 = 2**60 = 1024**6 = 10**18.06
Peta = 1,125,899,906,842,624 = 2**50 = 1024**5 = 10**15.05
Tera = 1,099,511,627,776 = 2**40 = 1024**4 = 10**12.04
Giga = 1,073,741,824 = 2**30 = 1024**3 = 10**9.03
Mega = 1,048,576 = 2**20 = 1024**2 = 10**6.02
Kilo = 1,024 = 2**10 = 1024**1 = 10**3.01
3
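
These values are easy to regenerate; a minimal Python sketch that prints the same table:

from math import log10

prefixes = ["Kilo", "Mega", "Giga", "Tera", "Peta", "Exa"]
for i, name in enumerate(prefixes, start=1):
    value = 2 ** (10 * i)               # same as 1024**i
    print("%-4s = %26d = 2**%d = 10**%.2f" % (name, value, 10 * i, log10(value)))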

4 Top 500 4

5 What is Parallelism? Consider your favorite computational application One processor can give me results in N hours Why not use N processors -- and get the results in just one hour? The concept is simple: Parallelism = applying multiple processors to a single problem 5

6 Parallel computing is computing by committee Parallel computing: the use of multiple computers or processors working together on a common task. Each processor works on its section of the problem. [Figure: grid of a problem to be solved, with processes 0 through 3 each doing the work for its own region.] Processors are allowed to exchange information with other processors. 6

7 Why do parallel computing? Limits of single CPU computing: available memory, performance. Parallel computing allows us to: solve problems that don't fit on a single CPU; solve problems that can't be solved in a reasonable time. 7

8 Why do parallel computing? We can run Larger problems Faster More cases Run simulations at finer resolutions Model physical phenomena more realistically 8

9 Weather Forecasting Atmosphere is modeled by dividing it into three-dimensional regions or cells, 1 mile x 1 mile x 1 mile (10 cells high), about 500 x 10**6 cells. The calculations for each cell are repeated many times to model the passage of time. About 200 floating point operations per cell per time step, or about 10**11 floating point operations per time step. A 10 day forecast with 10 minute resolution => 1.5x10**14 flop. 100 Mflops would take about 17 days. 1.7 Tflops would take 2 minutes. 17 Tflops would take 8 seconds. 105 Tflops would take 1.3 seconds. What might you want to do if running for 1.3 seconds? 9
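
A minimal Python sketch of the arithmetic behind these estimates; the cell count, 200 flops/cell figure, and machine rates come from the slide, the rest is unit conversion:

# Reproduce the weather-forecast flop estimate.
cells = 500e6                 # ~500 x 10**6 cells
flops_per_cell = 200.0        # floating point operations per cell per time step
steps = 10 * 24 * 6           # 10-day forecast at 10-minute resolution

total_flop = cells * flops_per_cell * steps        # about 1.5x10**14 flop
print("total flop: %.2e" % total_flop)

for rate in (100e6, 1.7e12, 17e12, 105e12):        # 100 Mflops ... 105 Tflops
    seconds = total_flop / rate
    print("at %8.2e flops: %10.3g seconds (%.3g days)" % (rate, seconds, seconds / 86400.0))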

10 Modeling Motion of Astronomical bodies (brute force) Each body is attracted to each other body by gravitational forces. Movement of each body can be predicted by calculating the total force experienced by the body. For N bodies, N - 1 forces per body yields N**2 calculations each time step. A galaxy has about 10**11 stars => about 10**9 years for one iteration. Using an N log N efficient approximate algorithm => about a year. NOTE: This is closely related to another hot topic: Protein Folding. 10
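
A minimal sketch of the operation counts; the 10**11 star count is the usual textbook figure and is an assumption here, since the exact number did not survive transcription:

# One N-body time step: brute force O(N**2) vs. tree-code O(N log N) interactions.
from math import log2

N = 1e11                          # assumed number of stars in the galaxy
brute_force = N * N               # ~1e22 force calculations
tree_code = N * log2(N)           # ~3.7e12 force calculations

print("brute force: %.2e calculations" % brute_force)
print("N log N    : %.2e calculations" % tree_code)
print("ratio      : %.2e" % (brute_force / tree_code))

The ratio of roughly 3x10**9 is what turns the 10**9-year brute-force estimate into something under a year.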

11 Types of parallelism two extremes Data parallel Each processor performs the same task on different data Example - grid problems Bag of Tasks or Embarrassingly Parallel is a special case Task parallel Each processor performs a different task Example - signal processing such as encoding multitrack data Pipeline is a special case 11

12 Simple data parallel program Example: integrate a 2-D propagation problem. [Equations on the slide: the starting partial differential equation and its finite difference approximation.] [Figure: the x-y grid is decomposed among PE #0 through PE #7.] 12
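
A minimal data-parallel sketch in plain Python: each "PE" owns a strip of the grid and applies the same update to its own rows. There is no real message passing here, just the decomposition idea, and the grid size and update are made up:

# Toy data-parallel decomposition: split the grid rows among num_pe processing elements.
num_pe = 8
nx, ny = 64, 64
grid = [[0.0] * nx for _ in range(ny)]

rows_per_pe = ny // num_pe
for pe in range(num_pe):                  # in a real code each PE would run concurrently
    start, end = pe * rows_per_pe, (pe + 1) * rows_per_pe
    for j in range(start, end):           # same operation, different data
        for i in range(nx):
            grid[j][i] += 1.0             # stand-in for the finite-difference update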

13 Typical Task Parallel Application Signal processing: DATA -> Normalize Task -> FFT Task -> Multiply Task -> Inverse FFT Task. Use one processor for each task. Can use more processors if one is overloaded. This is a pipeline. 13
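
A minimal task-parallel sketch of the same idea; the stages below are stand-ins (the FFT steps just pass data through), and in a real pipeline each stage would run on its own processor:

# Toy four-stage pipeline.
def normalize(xs):
    peak = max(abs(x) for x in xs)
    return [x / peak for x in xs]

def fft(xs):          return xs           # stand-in for a real FFT
def multiply(xs):     return [2.0 * x for x in xs]
def inverse_fft(xs):  return xs           # stand-in for a real inverse FFT

data = [1.0, -3.0, 2.0]
for stage in (normalize, fft, multiply, inverse_fft):
    data = stage(data)                    # each stage feeds the next
print(data)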

14 Parallel Program Structure [Diagram: Begin, then start parallel; each of the N processes does its own work (work 1a-1d, work 2a-2d, ..., work (N)a-(N)d), with "Communicate & Repeat" steps between the work stages; End Parallel; End.] 14

15 Parallel Problems [Diagram: the same structure as the previous slide, annotated with three problems: subtasks don't finish together; a serial section in which no parallel work is done; and a parallel section that is not using all processors.] 15

16 A Real example

#!/usr/bin/env python
# Two copies of this script, started with ids 0 and 1, communicate through a file.
from sys import argv
from os.path import isfile
from time import sleep
from math import sin,cos
#
fname="message"
my_id=int(argv[1])
print("\n%d starting program \n" % (my_id))
#
if (my_id == 1):
    # copy 1 computes cos(10) and writes it to the message file
    sleep(2)
    myval=cos(10.0)
    mf=open(fname,"w")
    mf.write(str(myval))
    mf.close()
if (my_id == 0):
    # copy 0 computes sin(10), waits for the file, then combines the two values
    myval=sin(10.0)
    notready=True
    while notready :
        if isfile(fname) :
            notready=False
            sleep(3)
            mf=open(fname,"r")
            message=float(mf.readline())
            mf.close()
            total=myval**2+message**2
        else:
            sleep(5)
    print("sin(10)**2+cos(10)**2= %15.12f" % (total))
print("%d done with program \n" %(my_id))

16
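
To try it (assuming the script is saved as twoproc.py, a name not given on the slide), remove any leftover message file and start the two copies by hand with their ids as arguments:

rm -f message
python twoproc.py 0 &
python twoproc.py 1

Copy 1 writes cos(10) to the file; copy 0 waits for the file, reads it, and prints sin(10)**2 + cos(10)**2 = 1.000000000000.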

17 Theoretical upper limits All parallel programs contain: parallel sections and serial sections. Serial sections are when work is being duplicated or no useful work is being done (waiting for others). Serial sections limit the parallel effectiveness: if you have a lot of serial computation then you will not get good speedup. No serial work allows perfect speedup. Amdahl's Law states this formally. 17

18 Amdahl's Law
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
Effect of multiple processors on run time: t_p = (f_p/N + f_s) t_s
Effect of multiple processors on speedup: S = t_s / t_p = 1 / (f_p/N + f_s)
where f_s = serial fraction of code, f_p = parallel fraction of code, N = number of processors
Perfect speedup: t = t_1/n, or S(n) = n
18
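
A minimal sketch of Amdahl's Law as code, straight from the formula above:

# Amdahl's Law: speedup as a function of the serial fraction f_s and processor count N.
def amdahl_speedup(f_s, n):
    f_p = 1.0 - f_s
    return 1.0 / (f_p / n + f_s)

for f_s in (0.0, 0.01, 0.1):
    row = ["%6.1f" % amdahl_speedup(f_s, n) for n in (1, 4, 16, 64, 256)]
    print("f_s=%.2f:" % f_s, row)

Even a 1% serial fraction caps the speedup at 100, no matter how many processors are used.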

19 Illustration of Amdahl's Law It takes only a small fraction of serial content in a code to degrade the parallel performance. 19

20 Amdahl's Law Vs. Reality Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance. [Plot: speedup vs. number of processors for a fixed f_p, comparing the Amdahl's Law curve with measured reality.] 20

21 Sometimes you don't get what you expect! 21

22 Some other considerations Writing effective parallel applications is difficult: communication can limit parallel efficiency, serial time can dominate, and load balance is important. Is it worth your time to rewrite your application? Do the CPU requirements justify parallelization? Will the code be used just once? 22

23 Parallelism Carries a Price Tag Parallel programming: involves a steep learning curve; is effort-intensive. Parallel computing environments are unstable and unpredictable: don't respond to many serial debugging and tuning techniques; may not yield the results you want, even if you invest a lot of time. Will the investment of your time be worth it? 23

24 Terms related to algorithms Amdahl's Law (talked about this already), Superlinear Speedup, Efficiency, Cost, Scalability, Problem Size, Gustafson's Law 24

25 Superlinear Speedup Superlinear speedup, S(n) > n, may be seen on occasion, but usually this is due to using a suboptimal sequential algorithm or some unique feature of the architecture that favors the parallel formation. One common reason for superlinear speedup is the extra cache in the multiprocessor system, which can hold more of the problem data at any instant; this leads to less traffic to the relatively slow main memory. 25

26 Efficiency Efficiency = (execution time using one processor) / (number of processors x execution time using that number of processors). It's just the speedup divided by the number of processors. 26

27 Cost The processor-time product, or cost (or work), of a computation is defined as Cost = (execution time) x (total number of processors used). The cost of a sequential computation is simply its execution time, t_s. The cost of a parallel computation is t_p x n. The parallel execution time, t_p, is given by t_s/S(n). Hence, the cost of a parallel computation is given by Cost = t_s x n / S(n). Cost-Optimal Parallel Algorithm: one in which the cost to solve a problem on a multiprocessor is proportional to the cost on a single processor system. 27
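
A minimal sketch tying speedup, efficiency, and cost together from measured run times (the timings below are made up for illustration):

# Speedup, efficiency, and cost from measured serial and parallel run times.
def metrics(t_serial, t_parallel, n):
    speedup = t_serial / t_parallel
    efficiency = speedup / n
    cost = t_parallel * n                 # processor-time product
    return speedup, efficiency, cost

s, e, c = metrics(t_serial=100.0, t_parallel=8.0, n=16)
print("speedup=%.1f efficiency=%.2f cost=%.1f (serial cost = 100.0)" % (s, e, c))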

28 Scalability Used to indicate a hardware design that allows the system to be increased in size and in doing so to obtain increased performance - could be described as architecture or hardware scalability. Scalability is also used to indicate that a parallel algorithm can accommodate increased data items with a low and bounded increase in computational steps - could be described as algorithmic scalability. 28

29 Problem size Problem size: the number of basic steps in the best sequential algorithm for a given problem and data set size. Intuitively, we would think of the number of data elements being processed in the algorithm as a measure of size. However, doubling the data set size would not necessarily double the number of computational steps; it will depend upon the problem. For example, adding two matrices has this effect, but multiplying matrices quadruples the number of operations. Note: Bad sequential algorithms tend to scale well. 29

30 Other names for Scaling Strong Scaling (Engineering) For a fixed problem size how does the time to solution vary with the number of processors Weak Scaling How the time to solution varies with processor count with a fixed problem size per processor 30
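
A minimal sketch of how the two are usually measured from timings; the numbers below are made up:

# Strong scaling: fixed total problem size. Weak scaling: fixed problem size per processor.
def strong_scaling_efficiency(t1, tn, n):
    # t1: time on 1 processor, tn: time on n processors for the same total problem
    return t1 / (n * tn)

def weak_scaling_efficiency(t1, tn):
    # t1: base problem on 1 processor, tn: n processors with n times the work (ideal: tn == t1)
    return t1 / tn

print("strong scaling efficiency: %.2f" % strong_scaling_efficiency(100.0, 8.0, 16))
print("weak scaling efficiency  : %.2f" % weak_scaling_efficiency(100.0, 110.0))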

31 Some Classes of machines Distributed Memory: [Diagram: four processors, each with its own local memory, connected by a network.] Processors only have access to their local memory and talk to other processors over a network. 31

32 Some Classes of machines Uniform Memory Access (UMA) shared memory: [Diagram: several processors all connected to a single shared memory.] All processors have equal access to the memory and can talk via memory. 32

33 Some Classes of machines Hybrid Shared memory nodes connected by a network... 33

34 Some Classes of machines More common today: each node has a collection of multicore chips... Ra has 268 nodes: 256 dual-socket quad-core and 12 quad-socket dual-core. 34

35 Some Classes of machines Hybrid Machines: add special purpose processors (FPGA, GPU, Vector, Cell...) to normal CPUs. Not a new concept, but regaining traction. Example: our Power8/K80 nodes. Issue: transfer speed between the units. 35

36 Network Topology For ultimate performance you may be concerned with how your nodes are connected: avoid communications between distant nodes. For some machines it might be difficult to control or know the placement of applications. 36

37 Network Terminology Latency: how long it takes to get between nodes in the network. Bandwidth: how much data can be moved per unit time. Bandwidth is limited by the number of wires, the rate at which each wire can accept data, and choke points. 37
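
A common first-order model built from exactly these two terms is time = latency + size/bandwidth. A minimal sketch, with default numbers taken from the DDR Infiniband example a few slides later (1.26 microseconds software latency, 16 Gbit/sec = 2e9 bytes/sec):

# First-order message-time model: time = latency + message_size / bandwidth.
def message_time(nbytes, latency_s=1.26e-6, bandwidth_bytes_per_s=2.0e9):
    return latency_s + nbytes / bandwidth_bytes_per_s

for nbytes in (8, 1024, 1024**2):
    print("%8d bytes: %.2e seconds" % (nbytes, message_time(nbytes)))

For small messages the latency term dominates; for large messages the bandwidth term does.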

38 Ring 38

39 Grid Wrapping produces torus 39

40 Tree Fat tree: the lines get wider as you go up. 40

41 Hypercube [Figure: a 3-dimensional hypercube] 41

42 4D Hypercube Some communications algorithms are hypercube based How big would a 9d hypercube be? 42
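
A minimal sketch answering the question on the slide: a d-dimensional hypercube has 2**d nodes, each with d links.

# Size of a d-dimensional hypercube: 2**d nodes, d links per node, d*2**(d-1) links total.
for d in (3, 4, 9):
    nodes = 2 ** d
    total_links = d * 2 ** (d - 1)
    print("d=%d: %4d nodes, %d links per node, %5d links total" % (d, nodes, d, total_links))

So a 9d hypercube has 512 nodes.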

43 5d Torus [Figure: a 3d grid of nodes; a 3d torus adds wrap-around links; a 5d torus adds links in two more dimensions.] 43

44 5d - Blue Gene Q MidPlane: 512 nodes, 4x4x4x4x2 44

45 5D Torus Network BGQ Layout The network topology of BlueGene/Q is a five-dimensional (5D) torus, with direct links between the nearest neighbors in the ±A, ±B, ±C, ±D, and ±E directions. As such there are only a few optimum block sizes that will use the network efficiently.
Node Boards          Compute Nodes   Cores    Torus Dimensions
1                    32              512      2x2x2x2x2
2 (adjacent pairs)   64              1,024    2x2x4x2x2
4 (quadrants)        128             2,048    2x2x4x4x2
8 (halves)           256             4,096    4x2x4x4x2
16 (midplane)        512             8,192    4x4x4x4x2
32 (1 rack)          1,024           16,384   4x4x4x8x2
64 (2 racks)         2,048           32,768   4x4x8x8x2
45

46 Star? Quality depends on what is in the center 46

47 Example: An Infiniband Switch Infiniband, DDR, Cisco 7024 IB Server Switch - 48 port. Adaptors: each compute node has one DDR 1-port HCA. 4X DDR => 16 Gbit/sec. 140 nanosecond hardware latency; 1.26 microsecond at the software level. 47

48 Measured Bandwidth 48

49 Infiniband Rates 49

50 New Kid on the Block - Intel Omnipath Designed with the technical and cost requirements of future exascale supercomputers in mind Packet Integrity Protection: a link-level error checking capability that is applied to all data traversing the wire. It allows for transparent detection and recovery of transmission errors as they occur. Dynamic Lane Scaling: maintains link continuity in the event of a lane failure. With the help of PIP, Omni-Path uses the remaining lanes in the link to continue operation. Traffic Flow Optimization: improves quality of service by allowing higher priority data packets to preempt lower priority packets, regardless of packet ordering. 50

51 More Omnipath Info 100 gigabits/sec of bandwidth per port, with port-to-port latencies on par with those of EDR InfiniBand. Intel has stated that their host architecture supports message rates of up to 160 million messages per second. Higher density switches. Host Integration Roadmap: Intel is planning to offer an in-package host adapter configuration, where the fabric ASIC is integrated into the processor socket. Further down the road, the Omni-Path host interface will be integrated directly into the processor. 51

52 Back to coprocessors In the simple case all nodes contain just a collection of normal CPUs and memory, similar to desktop or laptop machines, connected together via some network. There are nonstandard nodes: CPU with GPU, FPGA, high core count (Knights XXX or Phi). 52

53 Graphic Processing Unit - GPU Graphics cards are available in many systems. Some years ago people realized graphics cards are good at some operations - vector and matrix. Why not use them for general computing? Difficulties: initially not designed for it; difficult to program; limited bandwidth to/from CPU memory. 53

54 GPU - Now NVIDIA is the biggest supplier of GPU cards for HPC Cards developed specifically for processing Programming has become easier Bandwidth is much improved Special instructions for AI Many libraries available Lots of applications 54

55 Vintage Nvidia GPU Systems 55

56 56

57 Nvidia - IBM Two computers: Summit (ORNL) and Sierra (LLNL), Pflops-scale systems (about 150 Pflops by the back-of-the-envelope estimate two slides on). IBM Power 9 CPUs with Nvidia Volta GPUs. NVLink high speed interconnect. EDR Infiniband. 57

58 DoE IBM/Nvidia Machines Combines IBM Power 9 CPU Nvidia Volta GPU NVLink interconnect 58

59 Key features Volta will peak out at over 7 Tflops. Stacked memory (very dense and lots of it). NVLink is a key technology in Summit's and Sierra's server node architecture, enabling IBM POWER CPUs and NVIDIA GPUs to access each other's memory (unified memory, >512 GB HBM+DDR4). NVLink will be up to 5 to 12 times faster than PCIe Gen 3. Less than half the watts per flop of current generation chips. > 40 Tflops/node * 3,400 nodes is about 150 PFlops. Back of the envelope calculation - Power 9 = 14 Tflops (Don't quote me on this.) 59

60 What is Intel Knights-xxx or Xeon Phi? Xeon Phi (MIC): a processing chip that contains a large number of cores, >60 with >240 threads. Cores are lower performance than a normal Xeon: slower clock speed; 1st gen is in-order only and missing some Xeon instructions. Runs as a coprocessor; can't boot an OS. 60

61 Current and Coming Knights Landing, 3 versions: 1 card and 2 bootable. Support for the full instruction set. On-package memory. On-board external memory with the bootable versions. Much better memory bandwidth. 3 Tflops each implies 576 Tflops/rack (48*4*3). 61

62 Shipping Knights Landing 62

63 What is Intel Knights-xxx or Xeon Phi? Many programming paradigms: runs regular C and Fortran (Intel compilers); supports OpenMP; can run MPI; supports offloaded calculations. Knights Mill (announced) will have special support for AI. 63

64 Our Resources Documentation Hardware Mio BlueM Golden (AuN) Energy (Mc2) Next Machine? 64

65 Getting Help About Resources Getting accounts Mio node information 65

66 Platforms - Overview Mines has three high performance computing platforms available for campus use: AuN, Mc2, and Mio. AuN and Mc2 share a 480 Tbyte file system and are collectively known as BlueM. AuN is a 144 node, 50 Tflop, x86 system. Mc2 is a 512 node Blue Gene Q rated at 104 Tflop. Mio is a shared resource built up using what is commonly known as the condo model: individual research groups own nodes and have priority access to them. There are also nodes owned by students. Mio currently has ~200 x86 nodes (~104 Tflops CPU), three x86/GPU nodes, two 4-way Phi nodes, and two Power8/K80 GPU nodes, and is serviced by a 240 Tbyte file system. 66

67 2010-current Mio. Nodes: ~200 x86, 11 GPUs, 8 Phi, 2 Power8. ~104 Tflops. It's All Mine. 240 TByte file system. 67

68 Mio Concept CCIT Funds infrastructure Groups purchase nodes Groups can use their nodes when they desire Research Groups have priority access to their nodes Students have priority access to TechFee nodes When nodes are not being used by owners they are available for others Owner starting a job will kick others off 68

69 [Diagram: Mio node details]
Mio head node, management node, and network switch; 240 Tbyte parallel (GPFS) file system
TechFee GPU Nodes:
  gpu001: 2x Intel Xeon 5770 CPU, 24 GB total, 8 cores total, 2.93 GHz; 2x Nvidia T10 processors, 4 GB, 240 cores each
  gpu002: 2x Intel Xeon 5770 CPU, 24 GB total, 8 cores total, 2.93 GHz; 2x Nvidia T10 processors, 4 GB, 240 cores each
  gpu003: 2x Intel Xeon 5650 CPU, 48 GB total, 12 cores total, 2.66 GHz; 3x Nvidia M2070 processors, 5.6 GB, 448 cores each
Power8 Nodes:
  ppc001: IBM Power 8 processor, 256 GB total, 20 cores total, 3.49 GHz; 2x Nvidia K80, 24 GB, 4992 cores each
  ppc002: IBM Power 8 processor, 256 GB total, 20 cores total, 3.49 GHz; 2x Nvidia K80, 24 GB, 4992 cores each
Phi Nodes:
  phi001: 2x Intel Xeon E-series CPU, 32 GB total, 12 cores total, 2.3 GHz; 4x Intel Xeon Phi coprocessors (mic0-mic3), 8 GB, 60 cores each
  phi002: 2x Intel Xeon E-series CPU, 32 GB total, 12 cores total, 2.3 GHz; 4x Intel Xeon Phi coprocessors (mic0-mic3), 8 GB, 60 cores each
Compute Nodes: Intel Xeon CPU based nodes, 8-28 cores each
69

70 BlueM - Mines' Supercomputer 154 Tflops, 17.4 Tbytes, 10,496 cores, 85 KW. Dual architecture - best of both worlds: two distinct compute units, idataplex and Blue Gene Q, with a shared 480 Tbyte file system. Compact, low power consumption. 70

71 BlueM's Compute Units - AuN AuN (Golden): idataplex. Intel SandyBridge, 2 sockets x 8 cores per node. 144 nodes, 2,304 cores, 9,216 Gbytes. Features: latest generation Intel processors; large memory per node; common architecture; similar user environment to RA and Mio. Quickly get researchers up and running. 50 Tflops

72 BlueM's Compute Units - MC2 MC2 (Energy): Blue Gene Q. PowerPC A2, 17 cores. 512 nodes, 8,192 cores, 8,192 Gbytes. 104 Tflops. Features: new architecture; designed for large core count jobs; highly scalable; multilevel parallelism - the direction of HPC; room to grow; a future looking machine.

73 Mc2 - AuN Comparison
Feature               Mc2          AuN
Gflops/Node           ~203         ~347
Memory/Node           16 Gbytes    64 Gbytes
Gflops/Gbyte          ~13          ~5
Recommended Loading   16*4=64      16
Bandwidth             Faster       Fast
73

74 Advertised Layout Cooling Distribution Unit Hose for water cooling

75 75

76 Allocations for BlueM By proposal. Must be faculty to propose. Students can work on a faculty member's grant. 76

77 Other Resources XSEDE, RMACC, NCAR/UoWy (computational-systems/cheyenne) 77

78 Extreme Science and Engineering Discovery Environment The Extreme Science and Engineering Discovery Environment (XSEDE) is the most advanced, powerful, and robust collection of integrated advanced digital resources and services in the world. It is a single virtual system that scientists can use to interactively share computing resources, data, and expertise. Scientists and engineers around the world use these resources and services (things like supercomputers, collections of data, and new tools) to make our lives healthier, safer, and better. XSEDE, and the experts who lead the program, will make these resources easier to use and help more people use them. The five-year, $110-million project is supported by the National Science Foundation. In the summer of 2016, the NSF announced that XSEDE was awarded an additional 5 years of funding after the first 5-year award completed. Originally, XSEDE replaced and expanded on the NSF TeraGrid project. More than 10,000 scientists used the TeraGrid to complete thousands of research projects, at no cost to the scientists. 78

79 79

80 Rocky Mountain Advanced Computing Consortium The Rocky Mountain Advanced Computing Consortium is a collaboration among academic and research institutions located throughout the intermountain states. Our mission is to facilitate widespread effective use of high performance computing throughout the Rocky Mountain region by: educating graduate and undergraduate students, faculty, researchers, and industry partners on the use of computational science and high performance computing; coordinating multi-institutional efforts to advance research, practice, and education in computational science in order to address important regional problems; and bringing together a broad range of researchers, faculty, and industry partners with a depth of experience and expertise not available at any single institution and facilitating their collaboration in multi-disciplinary and multi-institutional teams. Mines is a founding member. RMACC High Performance Computing Symposium. 80

81 RMACC available resources Accessing Summit Summit is a new HPC resource for researchers at CU, CSU, and RMACC partners Key features include 400 TFlops peak performance General compute nodes High-memory nodes GPGPU nodes KNL Xeon Phi nodes Omni-Path interconnect GPFS scratch filesystem 81

82 NCAR/UoWy Cheyenne 82

83 NCAR/UoWy Cheyenne Climate Simulation Laboratory Researchers must have funding from NSF awards to address the climate-related questions for which they are requesting CSL allocations. University Community In general, any U.S.-based researcher with an NSF award in the atmospheric sciences or computational science in support of the atmospheric sciences NCAR Community NCAR investigators have access Wyoming-NCAR Alliance The NWSC represents a collaboration between NCAR and the University of Wyoming. As part of the Wyoming-NCAR Alliance (WNA), a portion of the Cheyenne system about 160 million core-hours per year is reserved for Wyoming-led projects and allocated by a University of Wyoming-managed process. 83

84 How do you run? Getting on Programming Running 84

85 Getting on ssh from your machine to Mio or BlueM 85

86 Hello World in Parallel Compile your program with Parallel compilers 86
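
The slide does not show the source; a minimal parallel "Hello World" sketch in Python, assuming the mpi4py package is available on the machine (compiled languages would instead use the MPI wrapper compilers such as mpicc or mpif90):

#!/usr/bin/env python
# hello.py - parallel hello world using mpi4py (assumed package; the slide does not name one)
from mpi4py import MPI

comm = MPI.COMM_WORLD                 # communicator containing every process that was launched
print("Hello from rank %d of %d" % (comm.Get_rank(), comm.Get_size()))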

87 Running We write & run a script that tells the system what we want to do 87
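
A minimal batch-script sketch, assuming a Slurm-style scheduler; the slides do not show the actual script or name the scheduler, so the directives below are illustrative only:

#!/bin/bash
#SBATCH --job-name=hello          # illustrative Slurm directives
#SBATCH --nodes=1                 # one node
#SBATCH --ntasks=4                # four MPI tasks
#SBATCH --time=00:05:00           # five-minute wall-clock limit

cd $SLURM_SUBMIT_DIR              # run from the directory the job was submitted from
mpiexec -n 4 python hello.py      # the hello world from the previous slide

Such a script is handed to the scheduler (with sbatch in the Slurm case), and the program's output lands in the job's output file.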

88 Output After some time our program will run 88
