Modern computer architecture. From multicore to petaflops


1 Modern computer architecture From multicore to petaflops

2 Motivation: Multi-cores, where and why

3 Introduction: Moore's law
Intel Sandy Bridge EP: 2.3 billion transistors; NVIDIA Fermi: 3 billion
1965: G. Moore claimed that the number of transistors on a microchip doubles roughly every one to two years
Computer Architecture 3

4 Introduction: Moore's law, faster cycles and beyond
Moore's law: transistors are getting smaller, so run them faster: higher clock speed, higher throughput (Ops/s). (Figure: Intel x86 clock frequency [MHz] by year.)
Increasing transistor count and clock speed allows / requires architectural changes:
Pipelining
Superscalarity
SIMD / vector ops
Multi-core / threading
Complex on-chip caches

5 Welcome to the multi-/many-core era
The clock speed game is over, but Moore's law continues (by courtesy of D. Vrsalovic, Intel):
Over-clocked (+20%): 1.13x performance at 1.73x power (N transistors)
Dual-core at reduced clock (-20%): 1.73x performance at 1.02x power (2N transistors)
Reason: within a fixed maximum power envelope, power consumption is P = f * (V_core)^2, and since the minimum V_core depends on f, the same process technology gives P ~ f^3.
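The frequency/power trade-off on this slide follows from the cubic law alone. A minimal sketch, assuming P ~ f^3 as derived above (an approximation; real chips also have leakage power):

```python
# Relative power under the slide's model: P = f * V_core^2 with the minimum
# V_core depending on f, hence approximately P ~ f^3.

def relative_power(relative_freq):
    """Power relative to the baseline, for a clock relative to the baseline."""
    return relative_freq ** 3

# Over-clocking one core by 20%: ~1.73x power for at most 1.2x speed.
print(round(relative_power(1.20), 2))       # 1.73

# Two cores at 80% clock instead: ~1.02x power, up to 2 x 0.8 = 1.6x throughput.
print(round(2 * relative_power(0.80), 2))   # 1.02
```

This reproduces the 1.73x and 1.02x power figures on the slide; the performance numbers (1.13x and 1.73x) additionally depend on how well the workload uses the extra frequency or the extra core.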

6 Multi-core: Intel Xeon E5-2600 (2012)
Xeon E5-2600 ("Sandy Bridge EP"): 8 cores running at 2.7 GHz (max. 3.2 GHz turbo)
Simultaneous Multithreading: reports as a 16-way chip
2.3 billion transistors / 32 nm
Die size: 435 mm^2
Typical building block of a 2-socket server

7 From UMA to ccNUMA
Basic architecture of commodity compute cluster nodes:
Yesterday (2006): dual-socket Intel Core2 node: Uniform Memory Architecture (UMA). "Flat" memory; symmetric multiprocessing (SMP). But: system anisotropy.
Today: dual-socket Intel (Westmere) node: cache-coherent Non-Uniform Memory Architecture (ccNUMA). HT / QPI provide scalable bandwidth, at the price of ccNUMA architectures: where does my data finally end up?
On AMD it is even more complicated: ccNUMA within a socket!

8 Back to the 2-chips-per-case age
12-core AMD Magny-Cours: a 2x6-core ccNUMA socket. AMD: single-socket ccNUMA since Magny-Cours.
1 socket: 12-core Magny-Cours built from two 6-core chips = 2 NUMA domains
2-socket server = 4 NUMA domains
4-socket server = 8 NUMA domains
WHY? Shared resources are hard to scale: 2 x 2 memory channels vs. 1 x 4 memory channels per socket

9 Current AMD design: AMD Interlagos / Bulldozer
Up to 16 cores (8 Bulldozer modules) in a single socket, max. 2.6 GHz (+ Turbo Core)
P_max = (2.6 x 8 x 8) GF/s = 166.4 GF/s
Each Bulldozer module: 2 "lightweight" cores; 16 kB L1D cache per core; 2048 kB dedicated L2 cache; 1 shared FPU: 4 MULT & 4 ADD (double precision) per cycle; supports AVX and FMA4
Per chip: 8 (6) MB shared L3 cache; 2 (shared) DDR3 memory channels, > 15 GB/s
2 NUMA domains per socket

10 Cray XE6 "Interlagos" 32-core dual-socket node
Two 8-(integer-)core chips per socket (2.3 GHz turbo)
Separate DDR3 memory interface per chip: ccNUMA on the socket!
Shared FP unit per pair of integer cores ("module"): 256-bit FP unit; SSE4.2, AVX, FMA4
16 kB L1 data cache per core
2 MB L2 cache per module
8 MB L3 cache per chip (6 MB usable)

11 The x86 multicore evolution so far
Intel single-/dual-/quad-/hexa-cores (one-socket view), at approximately constant clock speed:
2005: "fake" dual-core
2006: true dual-core (Woodcrest, Core2 Duo, 65 nm; quad-core: Harpertown, Core2 Quad, 45 nm)
2008: Simultaneous Multi-Threading (SMT), integrated memory interface (Nehalem EP, Core i7, 45 nm)
2010: 6-core chip (Westmere EP, Core i7, 32 nm)
2012: wider SIMD units, AVX: 256 bit (Sandy Bridge EP, Core i7, 32 nm)

12 There is no single driving force for chip performance!
Floating-point (FP) performance: P = n_core * F * S * f
n_core = number of cores: 8 (Intel Xeon "Sandy Bridge EP" socket; 4-, 6-, 8-core variants available)
F = FP instructions per cycle: 2 (1 MULT and 1 ADD)
S = FP ops per instruction: 4 (dp) / 8 (sp) (256-bit SIMD registers, "AVX")
f = clock speed: 2.7 GHz
P = 173 GF/s (dp) / 346 GF/s (sp): that would have been TOP500 rank 1 in 1995!
But: P = 5.4 GF/s (dp) for serial, non-SIMD code
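The formula is worth checking numerically; a minimal sketch with the numbers from the slide (function name is illustrative):

```python
# Peak floating-point performance P = n_core * F * S * f for the
# Sandy Bridge EP example above (double precision).

def peak_gflops(n_core, instr_per_cycle, ops_per_instr, clock_ghz):
    return n_core * instr_per_cycle * ops_per_instr * clock_ghz

full_chip = peak_gflops(8, 2, 4, 2.7)   # all cores, MULT+ADD, AVX (dp)
serial    = peak_gflops(1, 2, 1, 2.7)   # one core, scalar instructions
print(full_chip)   # 172.8 (the slide rounds to 173 GF/s)
print(serial)      # 5.4
```

The gap between 172.8 and 5.4 GF/s is the slide's point: a serial, non-SIMD code leaves a factor of 32 on the table before memory bandwidth is even considered.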

13 Specifications of the NVIDIA Fermi GPU
14 multiprocessors (MP), each with: 32 processors (SP) driven by Single Instruction Multiple Data (SIMD) / Single Instruction Multiple Thread (SIMT); explicit in-order architecture; 32 K registers; 48 kB of local on-chip memory; 1st- and 2nd-level cache hierarchy; clock rate of 1.15 GHz
1030 GFLOP/s (single precision) / 515 GFLOP/s (double precision)
Up to 6 GB of global memory (DRAM): 1500 MHz DDR, 384-bit bus, global gather/scatter, 144 GB/s bandwidth
16 GB/s PCIe 2.0 x16 (bidirectional)
(Table: clock [MHz], peak [GFLOPs], memory [GB], memory clock [MHz], memory interface [bit], and memory bandwidth [GB/sec] for Tesla, GeForce GTX, GeForce 8800 GTX, and a Westmere host.)
September 2012, Parallel multi- and manycore programming, 13

14 Trading single-thread performance for parallelism: GPGPUs vs. CPUs
GPU vs. CPU light speed estimate: 1. compute bound: 2-5x; 2. memory bandwidth: 1-5x

                        Intel Core i5        Intel Xeon E5 DP node   NVIDIA C2070
                        ("Sandy Bridge")     ("Sandy Bridge")        ("Fermi")
Cores @ clock           3.3 GHz              2 x 2.7 GHz             1.1 GHz
Performance+ / core     52.8 GFlop/s         43.2 GFlop/s            2.2 GFlop/s
Threads @ STREAM        < 4                  < 16                    > 8000
Total performance+      -                    691 GFlop/s             1,000 GFlop/s
STREAM BW               18 GB/s              2 x 36 GB/s             90 GB/s (ECC=1)
Transistors / TDP       1 billion* / 95 W    2 x (2.27 billion / 130 W)   3 billion / 238 W

+ single precision; * includes on-chip GPU and PCI-Express; the C2070 is a complete compute device

15 Parallelism in a modern compute node
Parallel and shared resources within a shared-memory node (including GPU #1 and GPU #2 attached via PCIe links, plus other I/O):
Parallel resources: 1. execution/SIMD units; 2. cores; 3. inner cache levels; 4. sockets / memory domains; 5. multiple accelerators
Shared resources: 6. outer cache level per socket; 7. memory bus per socket; 8. intersocket link; 9. PCIe bus(es); 10. other I/O resources
How does your application react to all of those details?

16 Distributed-memory computers & hybrid systems

17 Parallel distributed-memory computers: basics
Pure distributed-memory parallel computer: each processor (P) of a node is connected to exclusive local memory (MM) and a network interface (NI); a (dedicated) communication network connects all nodes.
No global cache-coherent shared address space: No Remote Memory Access (NORMA).
Data exchange between nodes: passing messages via the network ("message passing"). Some architectures provide limited remote memory access for speeding up message passing, e.g. through a global non-coherent address space.
Prototype of the first PC clusters: node = single-core/CPU PC; network = Ethernet.
First Massively Parallel Processing architectures: CRAY T3D/E, Intel Paragon

18 Parallel distributed-memory computers: hybrid systems
Standard concept of most modern large parallel computers: hybrid/hierarchical. A compute node is a 2- or 4-socket shared-memory node with a NI; a communication network (GBit Ethernet, InfiniBand) connects the nodes.
Price / (peak) performance is optimal, but network capability / (peak) performance gets worse.
Parallel programming? Pure message passing is the standard; hybrid programming is an option.
Today: GPUs / accelerators are added to the nodes, further increasing complexity.

19 Networks: What are the basic ideas and performance characteristics of modern networks?

20 Networks: basic performance characteristics
Evaluate the network's capability to transfer data using the same idea as for main memory access: the total transfer time for a message of N bytes is T = T_L + N/B, where T_L is the latency (transfer setup time [sec]) and B is the asymptotic (N -> infinity) network bandwidth [MBytes/sec].
Consider the simplest case ("ping-pong"): two processors in different nodes communicate via the network ("point-to-point"). A single message of N bytes is sent forward and back, so the overall data transfer is 2N bytes!
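The transfer-time model is easy to explore numerically. A minimal sketch (the latency and bandwidth values are illustrative GigE-like numbers, not measurements):

```python
# Latency/bandwidth model: T = T_L + N/B, effective bandwidth
# B_eff = N / (T_L + N/B).

def transfer_time(n_bytes, t_l, b):
    """Time to move n_bytes over a link with latency t_l and bandwidth b."""
    return t_l + n_bytes / b

def b_eff(n_bytes, t_l, b):
    """Effective bandwidth seen for a message of n_bytes."""
    return n_bytes / transfer_time(n_bytes, t_l, b)

T_L, B = 76e-6, 111e6          # 76 us latency, 111 MBytes/s (assumed values)
for n in (1e2, 1e4, 1e6):      # small -> latency bound, large -> BW bound
    print(round(b_eff(n, T_L, B) / 1e6, 1))   # ~1.3, ~60.2, ~110.1 MBytes/s
```

Small messages see only a tiny fraction of B because T_L dominates; only for large N does B_eff approach the asymptotic bandwidth.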

21 Networks: basic performance characteristics
Ping-pong benchmark (pseudo-code):

  myid = get_process_id()
  if (myid.eq.0) then
    targetID = 1
    S = get_walltime()
    call Send_message(buffer, N, targetID)
    call Receive_message(buffer, N, targetID)
    E = get_walltime()
    MBYTES = 2*N/(E-S)/1.d6   ! effective BW rate [MBytes/sec]
    TIME = (E-S)/2*1.d6       ! transfer time for a single message [microsec]
  else
    targetID = 0
    call Receive_message(buffer, N, targetID)
    call Send_message(buffer, N, targetID)
  endif

Effective bandwidth: B_eff = N / (T_L + N/B)

22 Networks: basic performance characteristics
Ping-pong benchmark for a GBit Ethernet (GigE) network; measured effective bandwidth B_eff = 2*N/(E-S)/1.d6.
N_1/2: message size where 50% of the peak bandwidth is achieved.
Asymptotic bandwidth: B = 111 MBytes/sec (~0.9 GBit/s).
Latency (N -> 0): only qualitative agreement between model and measurement: 44 µs vs. 76 µs.

23 Networks: basic performance characteristics
Ping-pong benchmark for a DDR InfiniBand (DDR-IB) network: determine B and T_L independently and combine them.

24 Networks: basic performance characteristics
First-principles modeling of B_eff(N) provides good qualitative results, but the quantitative description, in particular of the latency-dominated region (small N), may fail because:
Overhead for transmission protocols, e.g. message headers
Minimum frame size for message transmission, e.g. TCP/IP over Ethernet always transfers frames of a minimum size
Message setup/initialization involves multiple software layers and protocols; each software layer adds to latency; the hardware-only latency is often small
As the message size increases, the software may switch to a different protocol, e.g. from "eager" to "rendezvous"
Typical message sizes in applications are neither small nor large, so the N_1/2 value is also important: N_1/2 = B * T_L
Network balance: relate network bandwidth (B or B_eff(N_1/2)) to the compute power (or main memory bandwidth) of the nodes
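The N_1/2 rule follows directly from the model: inserting N = B * T_L into B_eff = N/(T_L + N/B) gives exactly B/2. A sketch using the GigE-like values from the previous slides (assumed, not measured here):

```python
# N_1/2 = B * T_L is the message size at which half the asymptotic
# bandwidth is reached.

B, T_L = 111e6, 76e-6          # bytes/s and seconds (assumed GigE values)

n_half = B * T_L
print(round(n_half))           # 8436 bytes

b_eff = n_half / (T_L + n_half / B)
print(round(b_eff / B, 6))     # 0.5, i.e. exactly half of B
```

So on this network, messages need to be several kB long before even half the wire bandwidth is usable, which is why latency matters so much for fine-grained communication.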

25 Latency and bandwidth in modern computer environments
(Figure: latencies range from nanoseconds to milliseconds; bandwidths are on the order of 1 GB/s.)

26 Networks: topologies & bisection bandwidth
Network bisection bandwidth B_b is a general metric for the data transfer capability of a system: the minimum sum of the bandwidths of all connections cut when splitting the system into two equal parts.
A more meaningful metric when comparing systems: bisection BW per core or per node, B_b/N.
Bisection BW depends on: bandwidth per link, network topology, and whether uni- or bi-directional bandwidth is counted.

27 Network topologies: bus
A bus can be used by one connection at a time; bandwidth is shared among all devices. The bisection BW is constant, hence B_b/N ~ 1/N. Collision detection and bus arbitration protocols must be in place.
Examples: PCI bus, memory bus of multi-core chips, diagnostic buses, the internal ring bus of the Cell processor, ...
Advantages: low latency; easy to implement.
Disadvantages: shared bandwidth, not scalable; problems with failure resiliency (one defective agent may block the bus); fast buses for large N require large signal power.

28 Network topologies: switches and fat trees
Standard clusters are built with switched networks: compute nodes ("devices") are split into groups, and each group is connected to a single (small) non-blocking crossbar switch ("leaf switch"). Leaf switches are connected with each other using an additional switch hierarchy ("spine switches") or directly (for small configurations).
In switched networks the distance between any two devices is heterogeneous (number of "hops" in the switch hierarchy). Diameter of a network: the maximum number of hops required to connect two arbitrary devices. Example: diameter of a bus = 1.
Perfect world: fully non-blocking, i.e. any choice of N/2 disjoint device pairs can communicate at full speed.

29 Non-blocking crossbar
A non-blocking crossbar can mediate a number of connections between a group of input and a group of output elements. Folded at the diagonal, it can be used as a 4-port non-blocking switch (built from 2x2 switching elements).
Switches can be cascaded to form hierarchies (the common case). Crossbars can also be used directly as interconnects in computer systems. Example: scalable UMA memory access (NEC SX). (Historic) example: Hitachi SR8000.

30 Fat-tree switch hierarchies
Fully non-blocking: N/2 end-to-end connections with full bandwidth B, i.e. B_b = B * N/2 and B_b/N = const. = B/2. Sounds good, but see the next slide.
Oversubscribed: the spine does not support N/2 full-bandwidth end-to-end connections; B_b/N = const. = B/2k, where k is the oversubscription factor (e.g. k = 3). Intelligent resource management is crucial.
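The two cases differ only by the oversubscription factor k. A minimal sketch (the link bandwidth is an assumed illustrative value, not from the slide):

```python
# Per-node bisection bandwidth of a fat tree: B/2 for a fully non-blocking
# spine, B/(2k) with oversubscription factor k.

def fat_tree_bb_per_node(link_bw, k=1):
    """B_b / N = B / (2k); k = 1 means fully non-blocking."""
    return link_bw / (2 * k)

B = 4.0   # GB/s per link (assumed value for illustration)
print(fat_tree_bb_per_node(B))               # 2.0  (non-blocking, B/2)
print(round(fat_tree_bb_per_node(B, 3), 2))  # 0.67 (k = 3 as in the slide)
```

Oversubscription divides the guaranteed per-node bandwidth by k, which is why job placement and routing matter on such systems.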

31 Fat trees and static routing
If all end-to-end data paths are preconfigured ("static routing"), not all possible combinations of N agents will get full bandwidth.
Example: a collision-free pattern exists here, but changing the connections 2→6, 3→7 to 2→7, 3→6 causes collisions if no other connections are re-routed at the same time.
Static routing is still a quasi-standard in commodity interconnects. However, things are slowly starting to improve.

32 Full fat tree: 288-port IB DDR switch
Basic building blocks: 24-port switches.
SPINE switch level: 12 switches. LEAF switch level: 24 switches with 24*12 ports to devices.
Total: 12 + 24 = 36 switches, 288 device ports.
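The port arithmetic on this slide can be checked in a few lines (all numbers from the slide; variable names are illustrative):

```python
# 288-port fat tree built from 24-port switches: each leaf switch gives
# half of its ports (12) to devices and half to the spine. There are 12
# spine switches: each leaf has 12 uplinks, one per spine, and each
# spine's 24 ports connect to the 24 leaves.

ports_per_switch = 24
leaf_switches = 24

device_ports   = leaf_switches * (ports_per_switch // 2)   # 24 * 12
spine_switches = ports_per_switch // 2
total_switches = leaf_switches + spine_switches

print(device_ports, spine_switches, total_switches)   # 288 12 36
```

Note that reaching 288 ports requires 36 physical 24-port switch chips, which illustrates why large full fat trees are expensive.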

33 Fat-tree networks: examples
Ethernet: 1 Gbit/s & 10 Gbit/s variants; 41% of all Top500 entries (June 2012).
InfiniBand: the dominant high-performance commodity interconnect (42% of Top500 entries). SDR: 10 Gbit/s per link and direction (10 bits/byte). DDR: 20 Gbit/s per link and direction (building blocks: 24-port switches). QDR: you figure that out by yourself. QDR IB is used in RRZE's TinyBlue and Lima clusters; building blocks: 36-port switches, with large 36*18 = 648-port switches.
Myrinet: current version 10 Gbit/s per link and direction; interoperable with 10 Gbit/s Ethernet; waning importance for HPC.
Fat trees are expensive and complex to scale continuously to very high node counts.

34 Meshes
Fat trees can become prohibitively expensive in large systems. Compromise: meshes, e.g. n-dimensional hypercubes, toruses (2D / 3D), and many others (including hybrids).
Each node is a router, with direct connections only between direct neighbors. This is not a non-blocking crossbar! Intelligent resource management and routing algorithms are essential. Example: 2D torus mesh.
Toruses are used in very large systems: Cray XT, IBM Blue Gene. B_b ~ N^((d-1)/d), so B_b/N -> 0 for large N. Sounds bad, but those machines show good scaling for many codes, with well-defined and predictable bandwidth behavior.
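The scaling claim B_b ~ N^((d-1)/d) implies B_b/N ~ N^(-1/d). A sketch for a 3D torus, with the link bandwidth normalized to 1 (node counts are illustrative):

```python
# Per-node bisection bandwidth of a d-dimensional torus:
# B_b ~ N^((d-1)/d)  =>  B_b/N ~ N^(-1/d), which shrinks as N grows.

def torus_bb_per_node(n_nodes, d):
    """Per-node bisection bandwidth, link bandwidth normalized to 1."""
    return n_nodes ** (-1.0 / d)

for n in (64, 512, 4096):                      # cubic 3D toruses
    print(round(torus_bb_per_node(n, 3), 4))   # 0.25, 0.125, 0.0625
```

Growing the machine by 8x halves the per-node bisection bandwidth, yet the decay is slow and predictable, which is why nearest-neighbor-dominated codes still scale well on toruses.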

35 Meshes
Advantages of toroidal/cubic meshes: limited cabling required; cables can be kept short; meshes can come in all shapes and sizes.
Example: a 4-socket dual-core AMD Opteron node with HyperTransport fabric. This mesh is asymmetric, since two sockets use one HT link each for I/O.
4-socket 2x hexa-core AMD Magny-Cours: 3D cube.


More information

Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University

Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University 18-742 Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core Prof. Onur Mutlu Carnegie Mellon University Research Project Project proposal due: Jan 31 Project topics Does everyone have a topic?

More information

COMP 322: Principles of Parallel Programming. Lecture 17: Understanding Parallel Computers (Chapter 2) Fall 2009

COMP 322: Principles of Parallel Programming. Lecture 17: Understanding Parallel Computers (Chapter 2) Fall 2009 COMP 322: Principles of Parallel Programming Lecture 17: Understanding Parallel Computers (Chapter 2) Fall 2009 http://www.cs.rice.edu/~vsarkar/comp322 Vivek Sarkar Department of Computer Science Rice

More information

Interconnection Networks

Interconnection Networks Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact

More information

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp.

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp. Networks for Multi-core hips A A ontrarian View Shekhar Borkar Aug 27, 2007 Intel orp. 1 Outline Multi-core system outlook On die network challenges A simple contrarian proposal Benefits Summary 2 A Sample

More information

Convergence of Parallel Architecture

Convergence of Parallel Architecture Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty

More information

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

INF5063: Programming heterogeneous multi-core processors Introduction

INF5063: Programming heterogeneous multi-core processors Introduction INF5063: Programming heterogeneous multi-core processors Introduction Håkon Kvale Stensland August 19 th, 2012 INF5063 Overview Course topic and scope Background for the use and parallel processing using

More information

Thread and Data parallelism in CPUs - will GPUs become obsolete?

Thread and Data parallelism in CPUs - will GPUs become obsolete? Thread and Data parallelism in CPUs - will GPUs become obsolete? USP, Sao Paulo 25/03/11 Carsten Trinitis Carsten.Trinitis@tum.de Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR) Institut für

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

INTERCONNECTION TECHNOLOGIES. Non-Uniform Memory Access Seminar Elina Zarisheva

INTERCONNECTION TECHNOLOGIES. Non-Uniform Memory Access Seminar Elina Zarisheva INTERCONNECTION TECHNOLOGIES Non-Uniform Memory Access Seminar Elina Zarisheva 26.11.2014 26.11.2014 NUMA Seminar Elina Zarisheva 2 Agenda Network topology Logical vs. physical topology Logical topologies

More information

PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort

PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort rob@cs.vu.nl Schedule 2 1. Introduction, performance metrics & analysis 2. Many-core hardware 3. Cuda class 1: basics 4. Cuda class

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

MULTI-CORE PROCESSORS: CONCEPTS AND IMPLEMENTATIONS

MULTI-CORE PROCESSORS: CONCEPTS AND IMPLEMENTATIONS MULTI-CORE PROCESSORS: CONCEPTS AND IMPLEMENTATIONS Najem N. Sirhan 1, Sami I. Serhan 2 1 Electrical and Computer Engineering Department, University of New Mexico, Albuquerque, New Mexico, USA 2 Computer

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.

More information

Parallel Systems I The GPU architecture. Jan Lemeire

Parallel Systems I The GPU architecture. Jan Lemeire Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution

More information

Intel Workstation Technology

Intel Workstation Technology Intel Workstation Technology Turning Imagination Into Reality November, 2008 1 Step up your Game Real Workstations Unleash your Potential 2 Yesterday s Super Computer Today s Workstation = = #1 Super Computer

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

How to Write Fast Code , spring th Lecture, Mar. 31 st

How to Write Fast Code , spring th Lecture, Mar. 31 st How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture

More information

SMD149 - Operating Systems - Multiprocessing

SMD149 - Operating Systems - Multiprocessing SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction

More information

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

More information

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware

More information

Commercially Available Chip Mul3processors for Research. Welcome to the MulE core Era

Commercially Available Chip Mul3processors for Research. Welcome to the MulE core Era 4/2/11 ommercially Available hip Mul3processors for Research Bruce hilders University of Pi9sburgh h9p://www.cs.pi9.edu/~childers AAO h9p://www.cs.pi9.edu h9p://www.cacao team.org h9p://www.cs.pi9.edu/pm

More information

CAMA: Modern processors. Memory hierarchy: Caches basics Data access locality Cache management

CAMA: Modern processors. Memory hierarchy: Caches basics Data access locality Cache management CAMA: Modern processors Memory hierarchy: Caches basics Data access locality Cache management Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center Johannes Hofmann/Dietmar

More information

Processor Performance. Overview: Classical Parallel Hardware. The Processor. Adding Numbers. Review of Single Processor Design

Processor Performance. Overview: Classical Parallel Hardware. The Processor. Adding Numbers. Review of Single Processor Design Overview: Classical Parallel Hardware Processor Performance Review of Single Processor Design so we talk the same language many things happen in parallel even on a single processor identify potential issues

More information

CS 152, Spring 2011 Section 10

CS 152, Spring 2011 Section 10 CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel

More information

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers

Agenda. System Performance Scaling of IBM POWER6 TM Based Servers System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies

More information

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms. Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading

More information

High Performance Computing: Blue-Gene and Road Runner. Ravi Patel

High Performance Computing: Blue-Gene and Road Runner. Ravi Patel High Performance Computing: Blue-Gene and Road Runner Ravi Patel 1 HPC General Information 2 HPC Considerations Criterion Performance Speed Power Scalability Number of nodes Latency bottlenecks Reliability

More information

Computer parallelism Flynn s categories

Computer parallelism Flynn s categories 04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories

More information

Multi-core Programming Evolution

Multi-core Programming Evolution Multi-core Programming Evolution Based on slides from Intel Software ollege and Multi-ore Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts, Evolution

More information

Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2

Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2 Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era 11/16/2011 Many-Core Computing 2 Gene M. Amdahl, Validity of the Single-Processor Approach to Achieving

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA

EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA SUDHEER CHUNDURI, SCOTT PARKER, KEVIN HARMS, VITALI MOROZOV, CHRIS KNIGHT, KALYAN KUMARAN Performance Engineering Group Argonne Leadership Computing Facility

More information

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information