Modern computer architecture: From multicore to petaflops
1 Modern computer architecture From multicore to petaflops
2 Motivation: Multi-cores where and why
3 Introduction: Moore's law. Intel Sandy Bridge EP: 2.3 billion transistors; NVIDIA Fermi: 3 billion transistors. 1965: G. Moore claimed that the number of transistors on a microchip doubles every 24 months. Computer Architecture 3
4 Introduction: Moore's law → faster cycles and beyond. Moore's law → transistors are getting smaller → run them faster → faster clock speed → higher throughput (Ops/s). [Figure: Intel x86 clock speed, frequency (MHz) over year.] Increasing transistor count and clock speed allows / requires architectural changes: pipelining, superscalarity, SIMD / vector ops, multi-core/threading, complex on-chip caches.
5 Welcome to the multi-/many-core era. The game is over, but Moore's law continues (by courtesy of D. Vrsalovic, Intel): over-clocked (+20%): 1.13x performance at 1.73x power (N transistors); max frequency: 1.00x performance at 1.00x power; dual-core (-20% clock): 1.73x performance at 1.02x power (2N transistors). Power envelope: Max W. Power consumption: P = f * (V_core)^2, and since the minimum V_core depends on f, at the same process technology P ~ f^3.
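The power and performance figures on this slide follow from the P ~ f^3 rule. A minimal sketch; the assumption that a 20% frequency change maps to a ~13% performance change is inferred from the 1.13x number above, not stated explicitly:

```python
def rel_power(freq_scale, n_cores=1):
    # P = f * V_core^2 with minimum V_core ~ f  =>  P ~ f^3 per core
    return n_cores * freq_scale ** 3

# Over-clocking a single core by 20% costs almost the cube in power:
print(round(rel_power(1.2), 2))              # 1.73x power for only 1.13x performance

# Two cores (2N transistors) under-clocked by 20% stay in the power envelope:
print(round(rel_power(0.8, n_cores=2), 2))   # 1.02x power

# Assuming ~13% performance per 20% frequency, the dual-core delivers
# about 2 * 0.87, i.e. the ~1.73x performance quoted on the slide:
print(round(2 * 0.87, 2))
```

This is the whole argument for multi-core: cubing hurts a single fast core far more than doubling slightly slower cores.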
6 Multi-core: Intel Xeon 2600 (2012). Xeon 2600 Sandy Bridge EP : 8 cores running at 2.7 GHz (max. 3.2 GHz); Simultaneous Multithreading reports as 16-way chip; 2.3 billion transistors / 32 nm; die size: 435 mm^2; 2-socket server.
7 From UMA to ccNUMA: basic architecture of commodity compute cluster nodes. Yesterday (2006): dual-socket Intel Core2 node: Uniform Memory Architecture (UMA); flat memory ; symmetric multiprocessing; but: system anisotropy. Today: dual-socket Intel (Westmere) node: cache-coherent Non-Uniform Memory Architecture (ccNUMA). HT / QPI provide scalable bandwidth at the price of ccNUMA architectures: where does my data finally end up? On AMD it is even more complicated: ccNUMA within a socket!
8 Back to the 2-chip-per-case age: 12-core AMD Magny-Cours, a 2x6-core ccNUMA socket. AMD: single-socket ccNUMA since Magny-Cours. 1 socket: 12-core Magny-Cours built from two 6-core chips → 2 NUMA domains; 2-socket server → 4 NUMA domains; 4-socket server → 8 NUMA domains. WHY? Shared resources are hard to scale: 2 x 2 memory channels vs. 1 x 4 memory channels per socket.
9 Current AMD design: AMD Interlagos / Bulldozer. Up to 16 cores (8 Bulldozer modules) in a single socket; max. 2.6 GHz (+ Turbo Core): P_max = (2.6 x 8 x 8) GF/s = 166.4 GF/s. Each Bulldozer module: 2 lightweight cores; 1 FPU: 4 MULT & 4 ADD (double precision) per cycle; supports AVX; supports FMA4. 16 kB L1D cache per core, 2048 kB shared L2 cache per module, 8 (6) MB shared L3 cache. 2 DDR3 (shared) memory channels, > 15 GB/s; 2 NUMA domains per socket.
10 Cray XE6 Interlagos 32-core dual-socket node. Two 8-(integer-)core chips per socket (2.3 GHz turbo); separate DDR3 memory interface per chip → ccNUMA on the socket! Shared FP unit per pair of integer cores ( module ): 256-bit FP unit, SSE4.2, AVX, FMA4. 16 kB L1 data cache per core; 2 MB L2 cache per module; 8 MB L3 cache per chip (6 MB usable).
11 The x86 multicore evolution so far: Intel single-/dual-/quad-/hexa-cores (one-socket view), at approximately constant clock speed. 2005: fake dual-core (two cores communicating via the chipset). 2006: true dual-core (Woodcrest, Core2 Duo, 65 nm; later Harpertown, Core2 Quad, 45 nm). 2008: Simultaneous Multi-Threading (SMT) and on-chip memory interface (Nehalem EP, Core i7, 45 nm). 2010: 6-core chip (Westmere EP, Core i7, 32 nm). 2012: wider SIMD units, AVX: 256 bit (Sandy Bridge EP, Core i7, 32 nm). [Figure: one-socket diagrams showing cores (P), SMT threads (T0, ...), chipset or memory interface (MI), memory, and links to the other socket.]
12 There is no single driving force for chip performance! Floating-point (FP) performance: P = n_core * F * S * ν. n_core, number of cores: 8; F, FP instructions per cycle: 2 (1 MULT and 1 ADD); S, FP ops per instruction: 4 (dp) / 8 (sp) (256-bit SIMD registers, AVX ); ν, clock speed: 2.7 GHz. (Intel Xeon Sandy Bridge EP socket; 4-, 6- and 8-core variants available.) P = 173 GF/s (dp) / 346 GF/s (sp), which would have been TOP500 rank 1 in 1995. But: P = 5.4 GF/s (dp) for serial, non-SIMD code.
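The formula can be checked in a few lines; function and parameter names are mine, the numbers are the slide's:

```python
def peak_gflops(n_core, fp_inst_per_cycle, ops_per_inst, clock_ghz):
    # P = n_core * F * S * nu  (result in GF/s when the clock is in GHz)
    return n_core * fp_inst_per_cycle * ops_per_inst * clock_ghz

# Sandy Bridge EP socket with AVX:
print(peak_gflops(8, 2, 4, 2.7))   # ~172.8 GF/s double precision
print(peak_gflops(8, 2, 8, 2.7))   # ~345.6 GF/s single precision

# Serial, non-SIMD code: one core, one FP operand per instruction:
print(peak_gflops(1, 2, 1, 2.7))   # ~5.4 GF/s
```

The 32x gap between the last number and the first is exactly the point of the slide: most of the peak comes from parallelism (cores and SIMD lanes), not from the clock.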
13 Specifications of the NVIDIA Fermi GPU. 14 multiprocessors (MP), each with: 32 processors (SP) driven by Single Instruction Multiple Data (SIMD) / Single Instruction Multiple Thread (SIMT); explicit in-order architecture; 32 K registers; 48 KB of local on-chip memory; 1st- and 2nd-level cache hierarchy; clock rate of 1.15 GHz; 1030 GFLOP/s (single precision) / 515 GFLOP/s (double precision). Up to 6 GB of global memory (DRAM): 1500 MHz DDR, 384-bit bus, global gather/scatter, 144 GB/s bandwidth; 16 GB/s PCIe 2.0 x16 (bidirectional). [Table: clock (MHz), peak (GFLOP/s), memory (GB), memory clock (MHz), memory interface (bit) and memory bandwidth (GB/s), compared for Tesla, GeForce GTX, GeForce 8800 GTX and the host (Westmere).] September 2012, Parallel multi- and manycore programming 13
14 Trading single-thread performance for parallelism: GPGPUs vs. CPUs. GPU vs. CPU light-speed estimate: 1. compute bound: 2-5x; 2. memory bandwidth: 1-5x.
                     Intel Core i (Sandy Bridge) | Intel Xeon E DP node (Sandy Bridge) | NVIDIA C2070 (Fermi)
Cores@clock:         3.3 GHz                     | 2 x 2.7 GHz                         | 1.1 GHz
Performance+/core:   52.8 GFlop/s                | 43.2 GFlop/s                        | 2.2 GFlop/s
Threads@stream:      <4                          | <16                                 | >8000
Total performance+:                              | 691 GFlop/s                         | 1,000 GFlop/s
Stream BW:           18 GB/s                     | 2 x 36 GB/s                         | 90 GB/s (ECC=1)
Transistors / TDP:   1 billion* / 95 W           | 2 x (2.27 billion / 130 W)          | 3 billion / 238 W
+ single precision; * includes on-chip GPU and PCI-Express; complete compute device.
15 Parallelism in a modern compute node: parallel and shared resources within a shared-memory node (two sockets, GPU #1 and GPU #2 on PCIe links, other I/O). Parallel resources: execution/SIMD units (1), cores (2), inner cache levels (3), sockets / memory domains (4), multiple accelerators (5). Shared resources: outer cache level per socket (6), memory bus per socket (7), intersocket link (8), PCIe bus(es) (9), other I/O resources (10). How does your application react to all of those details?
16 Distributed-memory computers & hybrid systems
17 Parallel distributed-memory computers: Basics. Pure distributed-memory parallel computer: each processor P is connected to exclusive local memory (MM) and a network interface (NI); a (dedicated) communication network connects all nodes. No global cache-coherent shared address space: No Remote Memory Access (NORMA). Data exchange between nodes: passing messages via the network ( message passing ). Some architectures provide limited remote memory access to speed up message passing, e.g. through a global NON-COHERENT address space (NUMA). Prototype of the first PC clusters: node: single-core CPU PC; network: Ethernet. First Massively Parallel Processing architectures: Cray T3D/E, Intel Paragon.
18 Parallel distributed-memory computers: Hybrid systems. Standard concept of most modern large parallel computers: hybrid/hierarchical. Compute nodes are 2- or 4-socket shared-memory nodes with an NI; a communication network (GBit, InfiniBand) connects the nodes. Price / (peak) performance is optimal, but network capability / (peak) performance gets worse. Parallel programming? Pure message passing is the standard; hybrid programming? Today, GPUs / accelerators are added to the nodes to further increase complexity. [Figure: distributed-memory parallel computer built from shared-memory nodes.]
19 Networks What are the basic ideas and performance characteristics of modern networks?
20 Networks: basic performance characteristics. Evaluate the network's capability to transfer data using the same idea as for main memory access: the total transfer time for a message of N bytes is T = T_L + N/B, where T_L is the latency (transfer setup time [sec]) and B is the asymptotic (N → ∞) network bandwidth [MBytes/sec]. Consider the simplest case ( Ping-Pong ): two processors in different nodes communicate via the network ( point-to-point ); a single message of N bytes is sent forward and backward, so the overall data transfer is 2N bytes!
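The model is easy to explore numerically. A sketch; the GigE-like parameter values are illustrative assumptions, not measurements:

```python
def transfer_time(n_bytes, t_lat, bandwidth):
    # T = T_L + N/B for a single message of N bytes
    return t_lat + n_bytes / bandwidth

def effective_bandwidth(n_bytes, t_lat, bandwidth):
    # B_eff = N / (T_L + N/B)
    return n_bytes / transfer_time(n_bytes, t_lat, bandwidth)

T_L = 50e-6   # assumed latency: 50 microseconds
B = 111e6     # assumed asymptotic bandwidth: 111 MBytes/s

# Small messages are latency-dominated; large ones approach B:
for n in (1_000, 100_000, 10_000_000):
    print(n, round(effective_bandwidth(n, T_L, B) / 1e6, 1), "MBytes/s")
```

With these numbers, a 1 kB message reaches only a small fraction of B, while a 10 MB message is essentially bandwidth-limited.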
21 Networks: basic performance characteristics. Ping-Pong benchmark (pseudo-code):
  myid = get_process_id()
  if (myid .eq. 0) then
    targetID = 1
    S = get_walltime()
    call Send_message(buffer, N, targetID)
    call Receive_message(buffer, N, targetID)
    E = get_walltime()
    MBYTES = 2*N/(E-S)/1.d6   ! eff. BW: MBytes/sec rate
    TIME = (E-S)/2*1.d6       ! transfer time in microsecs for single message
  else
    targetID = 0
    call Receive_message(buffer, N, targetID)
    call Send_message(buffer, N, targetID)
  endif
Effective BW: B_eff = N / (T_L + N/B)
22 Networks: basic performance characteristics. Ping-Pong benchmark for a GBit-Ethernet (GigE) network: B_eff = 2*N/(E-S)/1.d6. N_1/2: message size where 50% of the peak bandwidth is achieved. Asymptotic bandwidth B = 111 MBytes/sec (≈ 0.89 GBit/s). Latency (N → 0): only qualitative agreement: 44 µs vs. 76 µs.
23 Networks: basic performance characteristics. Ping-Pong benchmark for a DDR InfiniBand (DDR-IB) network: determine B and T_L independently and combine them.
24 Networks: basic performance characteristics. First-principles modeling of B_eff(N) provides good qualitative results, but the quantitative description, in particular of the latency-dominated region (small N), may fail because: overhead for transmission protocols, e.g. message headers; minimum frame size for message transmission, e.g. TCP/IP over Ethernet always transfers frames with N > 1; message setup/initialization involves multiple software layers and protocols, each software layer adds to latency, and the hardware-only latency is often small; as the message size increases, the software may switch to a different protocol, e.g. from eager to rendezvous. Typical message sizes in applications are neither small nor large, so the N_1/2 value is also important: N_1/2 = B * T_L. Network balance: relate the network bandwidth (B or B_eff(N_1/2)) to the compute power (or main memory bandwidth) of the nodes.
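That N_1/2 = B * T_L really marks the 50% point follows directly from the model; a small check, with latency and bandwidth values assumed for illustration:

```python
T_L = 50e-6   # assumed latency [s]
B = 111e6     # assumed asymptotic bandwidth [bytes/s]

n_half = B * T_L                      # N_1/2 = B * T_L  (~5550 bytes here)
b_eff = n_half / (T_L + n_half / B)   # B_eff evaluated at N = N_1/2

print(b_eff / B)   # ~0.5: exactly half the asymptotic bandwidth
```

Algebraically: B_eff(N_1/2) = B*T_L / (T_L + T_L) = B/2, independent of the particular B and T_L.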
25 Latency and bandwidth in modern computer environments. [Figure: latencies from ns to ms and bandwidths around 1 GB/s and beyond across the data paths of a modern system.]
26 Networks: Topologies & bisection bandwidth. Network bisection bandwidth B_b is a general metric for the data-transfer capability of a system: the minimum sum of the bandwidths of all connections cut when splitting the system into two equal parts. A more meaningful metric when comparing systems is the bisection BW per core or per node, B_b/N. Bisection BW depends on: bandwidth per link; network topology; uni- or bi-directional bandwidth?!
27 Network topologies: Bus. A bus can be used by one connection at a time; bandwidth is shared among all devices. Bisection BW is constant, so B_b/N ~ 1/N. Collision detection and bus arbitration protocols must be in place. Examples: PCI bus, memory bus of multi-core chips, diagnostic buses, internal ring bus of the Cell processor. Advantages: low latency; easy to implement. Disadvantages: shared bandwidth, not scalable; problems with failure resiliency (one defective agent may block the bus); fast buses for large N require large signal power.
28 Network topologies: Switches and fat trees. Standard clusters are built with switched networks: compute nodes ( devices ) are split up into groups, and each group is connected to a single (small) non-blocking crossbar switch ( leaf switches ). Leaf switches are connected with each other using an additional switch hierarchy ( spine switches ) or directly (for small configurations). In switched networks the distance between any two devices is heterogeneous (number of hops in the switch hierarchy). Diameter of a network: the maximum number of hops required to connect two arbitrary devices; example: diameter of a bus = 1. Perfect world: fully non-blocking, i.e. any choice of N/2 disjoint device pairs can communicate at full speed.
29 Non-blocking crossbar. A non-blocking crossbar can mediate a number of connections between a group of input and a group of output elements; built from 2x2 switching elements, it can be used as a 4-port non-blocking switch (fold at the diagonal). Switches can be cascaded to form hierarchies (the common case). Crossbars can also be used directly as interconnects in computer systems. Example: scalable UMA memory access (NEC SX); (historic) example: Hitachi SR8000.
30 Fat-tree switch hierarchies. Fully non-blocking: N/2 end-to-end connections with full bandwidth B; B_b = B * N/2, so B_b/N = const. = B/2. Sounds good, but see the next slide. Oversubscribed: the spine does not support N/2 full-bandwidth end-to-end connections; B_b/N = const. = B/(2k), where k is the oversubscription factor (k = 3 in the sketch); intelligent resource management is crucial. [Figure: leaf-switch and spine-switch levels.]
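The per-node bisection bandwidth of both variants fits in one function; the link bandwidth value below is an assumption for illustration:

```python
def fat_tree_bb_per_node(link_bw, k=1):
    # Fully non-blocking fat tree (k = 1): B_b/N = B/2.
    # Spine oversubscribed by factor k:    B_b/N = B/(2k).
    return link_bw / (2 * k)

B = 4.0  # assumed per-link bandwidth in GB/s

print(fat_tree_bb_per_node(B))        # 2.0 GB/s per node, non-blocking
print(fat_tree_bb_per_node(B, k=3))   # ~0.67 GB/s per node at 3x oversubscription
```

The key property is that B_b/N stays constant as the machine grows; oversubscription only scales that constant down by k.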
31 Fat trees and static routing. If all end-to-end data paths are preconfigured ( static routing ), not all possible combinations of N agents will get full bandwidth. Example: the pattern 2→6, 3→7 is collision-free here; changing 2→6, 3→7 to 2→7, 3→6 gives collisions if no other connections are re-routed at the same time. Static routing is still a quasi-standard in commodity interconnects; however, things are slowly starting to improve.
32 Full fat tree: 288-port IB DDR switch. Basic building blocks: 24-port switches. Spine switch level: 12 switches; leaf switch level: 24 switches with 24*12 = 288 ports to devices.
33 Fat-tree networks: examples. Ethernet: 1 GBit/s and 10 GBit/s variants; 41% of all Top500 entries (June 2012). InfiniBand: dominant high-performance commodity interconnect (42% of Top500 entries); SDR: 10 GBit/s per link and direction (10 bits/byte); DDR: 20 GBit/s per link and direction (building blocks: 24-port switches); QDR: you figure that out by yourself (building blocks: 36-port switches; large 36*18 = 648-port switches); QDR IB is used in the RRZE's TinyBlue and Lima clusters. Myrinet: current version: 10 GBit/s per link and direction; interoperable with 10 GBit/s Ethernet; waning importance for HPC. Fat trees are expensive and complex to scale continuously to very high node counts.
34 Meshes. Fat trees can become prohibitively expensive in large systems. Compromise: meshes: n-dimensional hypercubes, toruses (2D / 3D), many others (including hybrids). Each node is a router; direct connections exist only between direct neighbors: this is not a non-blocking crossbar! Intelligent resource management and routing algorithms are essential. Example: 2D torus mesh. Toruses are used in very large systems (Cray XT, IBM Blue Gene): B_b ~ N^((d-1)/d), so B_b/N → 0 for large N. Sounds bad, but those machines show good scaling for many codes: well-defined and predictable bandwidth behavior!
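The scaling contrast with a fat tree can be illustrated numerically (relative units; the link bandwidth is normalized to 1):

```python
def torus_bb_per_node(n_nodes, d):
    # B_b ~ N^((d-1)/d)  =>  B_b/N ~ N^(-1/d), vanishing for large N
    return n_nodes ** ((d - 1) / d) / n_nodes

# 3D torus: per-node bisection bandwidth shrinks as the machine grows,
# while a non-blocking fat tree would keep B_b/N constant:
for n in (1_000, 10_000, 100_000):
    print(n, round(torus_bb_per_node(n, 3), 4))
```

For d = 3 the per-node value falls as N^(-1/3): growing the machine by a factor of 1000 cuts the per-node bisection bandwidth by a factor of 10.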
35 Meshes: advantages of toroidal/cubic meshes. Limited cabling is required, and cables can be kept short. Meshes can come in all shapes and sizes. Example: a 4-socket dual-core AMD Opteron node with a HyperTransport fabric; this mesh is asymmetric, since two sockets use one HT link each for I/O. A 4-socket 2x hexa-core AMD Magny-Cours forms a 3D cube.
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationINTERCONNECTION TECHNOLOGIES. Non-Uniform Memory Access Seminar Elina Zarisheva
INTERCONNECTION TECHNOLOGIES Non-Uniform Memory Access Seminar Elina Zarisheva 26.11.2014 26.11.2014 NUMA Seminar Elina Zarisheva 2 Agenda Network topology Logical vs. physical topology Logical topologies
More informationPARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort
PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort rob@cs.vu.nl Schedule 2 1. Introduction, performance metrics & analysis 2. Many-core hardware 3. Cuda class 1: basics 4. Cuda class
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationBlue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft
Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationMULTI-CORE PROCESSORS: CONCEPTS AND IMPLEMENTATIONS
MULTI-CORE PROCESSORS: CONCEPTS AND IMPLEMENTATIONS Najem N. Sirhan 1, Sami I. Serhan 2 1 Electrical and Computer Engineering Department, University of New Mexico, Albuquerque, New Mexico, USA 2 Computer
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationAim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group
Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.
More informationParallel Systems I The GPU architecture. Jan Lemeire
Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution
More informationIntel Workstation Technology
Intel Workstation Technology Turning Imagination Into Reality November, 2008 1 Step up your Game Real Workstations Unleash your Potential 2 Yesterday s Super Computer Today s Workstation = = #1 Super Computer
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationHow to Write Fast Code , spring th Lecture, Mar. 31 st
How to Write Fast Code 18-645, spring 2008 20 th Lecture, Mar. 31 st Instructor: Markus Püschel TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred) Introduction Parallelism: definition Carrying
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More informationSMD149 - Operating Systems - Multiprocessing
SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction
More informationOverview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy
Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More informationCommercially Available Chip Mul3processors for Research. Welcome to the MulE core Era
4/2/11 ommercially Available hip Mul3processors for Research Bruce hilders University of Pi9sburgh h9p://www.cs.pi9.edu/~childers AAO h9p://www.cs.pi9.edu h9p://www.cacao team.org h9p://www.cs.pi9.edu/pm
More informationCAMA: Modern processors. Memory hierarchy: Caches basics Data access locality Cache management
CAMA: Modern processors Memory hierarchy: Caches basics Data access locality Cache management Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center Johannes Hofmann/Dietmar
More informationProcessor Performance. Overview: Classical Parallel Hardware. The Processor. Adding Numbers. Review of Single Processor Design
Overview: Classical Parallel Hardware Processor Performance Review of Single Processor Design so we talk the same language many things happen in parallel even on a single processor identify potential issues
More informationCS 152, Spring 2011 Section 10
CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel
More informationAgenda. System Performance Scaling of IBM POWER6 TM Based Servers
System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationHigh Performance Computing: Blue-Gene and Road Runner. Ravi Patel
High Performance Computing: Blue-Gene and Road Runner Ravi Patel 1 HPC General Information 2 HPC Considerations Criterion Performance Speed Power Scalability Number of nodes Latency bottlenecks Reliability
More informationComputer parallelism Flynn s categories
04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories
More informationMulti-core Programming Evolution
Multi-core Programming Evolution Based on slides from Intel Software ollege and Multi-ore Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts, Evolution
More informationSlides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era. 11/16/2011 Many-Core Computing 2
Slides compliment of Yong Chen and Xian-He Sun From paper Reevaluating Amdahl's Law in the Multicore Era 11/16/2011 Many-Core Computing 2 Gene M. Amdahl, Validity of the Single-Processor Approach to Achieving
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationEARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA
EARLY EVALUATION OF THE CRAY XC40 SYSTEM THETA SUDHEER CHUNDURI, SCOTT PARKER, KEVIN HARMS, VITALI MOROZOV, CHRIS KNIGHT, KALYAN KUMARAN Performance Engineering Group Argonne Leadership Computing Facility
More informationAccelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing
Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product
More informationLecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
More information