Introduction: Modern computer architecture. The stored-program computer and its inherent bottlenecks. Multi- and manycore chips and nodes.
1 Introduction: Modern computer architecture. The stored-program computer and its inherent bottlenecks. Multi- and manycore chips and nodes.
2 Introduction: Moore's law. In 1965, G. Moore claimed that the number of transistors on a microchip doubles every 12 to 24 months. Transistor counts today: Intel Sandy Bridge EP: 2.3 billion; Nvidia Kepler: 7 billion; Intel Broadwell: 7.2 billion; Nvidia Pascal: 15 billion.
3 Multi-core today: Intel Xeon E5-2600v3 "Haswell EP" (2014). Up to 18 cores running at 2+ GHz (turbo mode: 3.5+ GHz); Simultaneous Multithreading reports it as a 36-way chip; 5.7 billion transistors; 22 nm process; die size: 662 mm². Optional: Cluster on Die (CoD) mode. (Figure: multi-socket server node.)
4 A deeper dive into core and chip architecture
5 General-purpose cache-based microprocessor core. A modern CPU core implements the stored-program computer concept (Turing 1936); similar designs are used on all modern systems; (still) multiple potential bottlenecks; flexible!
6 Basic resources on a stored-program computer: instruction execution and data movement. 1. Instruction execution: this is the primary resource of the processor; all efforts in hardware design are targeted towards increasing the instruction throughput. Instructions are the concept of "work" as seen by processor designers. Not all instructions count as work as seen by application developers! Example: adding two arrays A(:) and B(:):
  do i=1, N
    A(i) = A(i) + B(i)
  enddo
Processor work per iteration: LOAD r1 = A(i); LOAD r2 = B(i); ADD r1 = r1 + r2; STORE A(i) = r1; INCREMENT i; BRANCH to top if i<N. User work: N flops (ADDs).
7 Basic resources on a stored-program computer: instruction execution and data movement. 2. Data transfer: data transfers are a consequence of instruction execution and therefore a secondary resource; the maximum bandwidth is determined by the request rate of executed instructions and by technical limitations (bus width, speed). Example: adding two arrays A(:) and B(:):
  do i=1, N
    A(i) = A(i) + B(i)
  enddo
Data transfers per iteration: 8 bytes for LOAD r1 = A(i), 8 bytes for LOAD r2 = B(i), 8 bytes for STORE A(i) = r1; sum: 24 bytes. Crucial question: what is the bottleneck, data transfer or code execution?
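The bottleneck question can be answered by comparing the loop's code balance (bytes of traffic per flop) with the machine balance (memory bandwidth per peak flop rate). The minimal Fortran sketch below illustrates the comparison; the peak and bandwidth figures are assumed, illustrative values, not taken from the slides or from any specific CPU.
  program bottleneck
    implicit none
    double precision, parameter :: traffic = 24.d0   ! bytes per loop iteration (2 LOADs + 1 STORE, 8 B each)
    double precision, parameter :: flops   = 1.d0    ! one ADD per iteration
    double precision, parameter :: peak    = 500.d9  ! assumed chip peak: 500 GFlop/s (illustrative)
    double precision, parameter :: membw   = 50.d9   ! assumed memory bandwidth: 50 GB/s (illustrative)
    double precision :: b_code, b_machine
    b_code    = traffic / flops      ! bytes/flop required by the code
    b_machine = membw   / peak       ! bytes/flop the machine can deliver (~0.1 here)
    print *, 'code balance    [bytes/flop]:', b_code
    print *, 'machine balance [bytes/flop]:', b_machine
    if (b_code > b_machine) print *, 'the loop is limited by data transfer, not by execution'
  end program bottleneck
With 24 bytes of traffic per flop required but only on the order of 0.1 bytes per flop delivered, such a loop is clearly memory bound.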
8 Microprocessors Pipelining
9 Pipelining of arithmetic/functional units. Idea: split a complex instruction into several simple/fast steps (stages); each step takes the same amount of time, e.g., a single cycle; execute different steps of different instructions at the same time (in parallel). Benefits: the core can work on several (here: 5) independent instructions simultaneously; one instruction finishes each cycle once the pipeline is full. Drawbacks: the pipeline must be filled, so a large number of independent instructions is required; complex instruction scheduling by hardware (out-of-order execution) or by the compiler (software pipelining) is needed. Pipelining is widely used in modern computer architectures.
10 5-stage multiplication pipeline: A(i) = B(i)*C(i), i = 1, ..., n. The first result is available after 5 cycles (= latency of the pipeline)! Wind-up/wind-down phases: pipeline stages are empty while the pipeline fills and drains.
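A simple back-of-the-envelope model makes the wind-up cost concrete: with m pipeline stages and n independent operations, the pipelined execution time is roughly n + m - 1 cycles (m - 1 cycles to fill the pipeline, then one result per cycle), compared with n*m cycles without pipelining. The short Fortran sketch below evaluates this; the loop lengths are arbitrary.
  program pipeline_timing
    implicit none
    integer, parameter :: m = 5        ! pipeline depth, as on the slide
    integer :: n
    do n = 1, 1000, 333                ! arbitrary operation counts
       print '(a,i5,a,i6,a,i7,a,f6.2)', 'n=', n, '  T_pipe=', n+m-1, &
             '  T_serial=', n*m, '  speedup=', dble(n*m)/dble(n+m-1)
    end do
  end program pipeline_timing
For n much larger than m the speedup approaches the pipeline depth m; for short loops the wind-up/wind-down phases dominate.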
11 Microprocessors Superscalarity and Simultaneous Multithreading
12 Superscalar processors: instruction-level parallelism. Multiple execution units enable the use of instruction-level parallelism (ILP): the instruction stream is parallelized on the fly. (Figure: fetch from L1I, decode, and execute stages of a 4-way superscalar core, with four instructions in flight per stage and cycle.) Example: LOAD, STORE, MULT, ADD issued together. Issuing m concurrent instructions per cycle: m-way superscalar. Modern processors are 3- to 6-way superscalar and can perform 2 floating-point instructions per cycle.
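One common way to expose enough independent instructions to a superscalar, pipelined core is to unroll reduction loops with several partial accumulators. The Fortran sketch below is an illustration, not code from the slides; it assumes n is a multiple of 4 and ignores the slightly different rounding of the result.
  program ilp_demo
    implicit none
    integer, parameter :: n = 1000000            ! assumed to be a multiple of 4
    double precision :: a(n), s1, s2, s3, s4, s
    integer :: i
    a = 1.d0
    s1 = 0.d0; s2 = 0.d0; s3 = 0.d0; s4 = 0.d0
    do i = 1, n, 4                               ! four independent accumulators: the adds in
       s1 = s1 + a(i)                            ! one iteration have no mutual dependencies,
       s2 = s2 + a(i+1)                          ! so the core can overlap them
       s3 = s3 + a(i+2)
       s4 = s4 + a(i+3)
    end do
    s = s1 + s2 + s3 + s4
    print *, 'sum =', s
  end program ilp_demo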
13 Core details: simultaneous multithreading (SMT). SMT principle (2-way example). (Figure: standard core vs. 2-way SMT core.)
14 Microprocessors Single Instruction Multiple Data (SIMD) a.k.a. vectorization
15 Core details: SIMD processing. Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on wide registers. x86 SIMD instruction sets: SSE: register width = 128 bit (2 double-precision floating-point operands); AVX: register width = 256 bit (4 double-precision floating-point operands); AVX-512: you guessed it! Example: adding two registers holding double-precision floating-point operands. Scalar execution: ADD [R0,R1] -> R2 operates on one 64-bit operand pair; SIMD execution: V64ADD [R0,R1] -> R2 operates on all packed operands (A[0..3] + B[0..3] = C[0..3]) in a single instruction.
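As a concrete illustration, the following Fortran sketch shows a unit-stride double-precision loop that a compiler can map to packed SIMD instructions; with 256-bit AVX registers, four iterations are processed per instruction. The !$omp simd directive (OpenMP 4.0) is one portable way to request vectorization; this is an illustrative example, not taken from the slides.
  program simd_demo
    implicit none
    integer, parameter :: n = 1000
    double precision :: a(n), b(n), c(n)
    integer :: i
    a = 1.d0; b = 2.d0
    !$omp simd
    do i = 1, n
       c(i) = a(i) + b(i)     ! one packed ADD handles 4 elements with 256-bit AVX
    end do
    print *, c(1), c(n)
  end program simd_demo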
16 Microprocessors Memory Hierarchy
17 Von Neumann bottleneck reloaded: the DRAM gap. DP peak performance and peak main memory bandwidth for a single Intel processor (chip) differ by roughly a factor of 10 flops per byte (approx. 10 F/B): main memory access speed is not sufficient to keep the CPU busy. Remedy: introduce fast on-chip caches holding copies of recently used data items.
18 Registers and caches: data transfers in a memory hierarchy. Caches help with getting instructions and data to the CPU fast. How does data travel from memory to the CPU and back? Remember: caches are organized in cache lines (CL, e.g., 64 bytes), and only complete cache lines are transferred between memory hierarchy levels (except to/from registers). MISS: a load or store instruction does not find the data in a cache level, so a CL transfer is required. Example: array copy A(:) = C(:). LD C(1) misses and loads the CL of C into the cache; ST A(1) misses and triggers a write-allocate (the CL of A is loaded before being modified); LD C(2..N_cl) and ST A(2..N_cl) then hit; the modified CL of A is evicted (written back) later. Result: 3 CL transfers per cache line of copied data.
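The 3-CL-transfer result translates directly into extra memory traffic: the code "sees" 16 bytes of useful traffic per copied element, but the write-allocate transfer raises the actual traffic to 24 bytes. A minimal sketch of that bookkeeping (the array length is an arbitrary assumption):
  program copy_traffic
    implicit none
    integer, parameter :: n = 10000000     ! arbitrary array length
    double precision :: useful, actual
    useful = 16.d0 * n      ! 8 B read of C + 8 B write of A per element, as seen by the code
    actual = 24.d0 * n      ! plus 8 B write-allocate read of A per element
    print *, 'useful traffic [GB]:', useful / 1.d9
    print *, 'actual traffic [GB]:', actual / 1.d9
    print *, 'traffic overhead   :', actual / useful
  end program copy_traffic
Where the hardware offers non-temporal ("streaming") stores, the write-allocate transfer can be avoided; that option is not discussed on this slide.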
19 Parallel Computers Terminology
20 From core to cluster. (Figure: hierarchy from core, to multi-core chip/socket, to shared-memory compute node, to distributed-memory cluster connected by a network; T = hardware thread.)
21 Shared-memory parallel computers: cache coherence; multicore-multisocket architecture; ccNUMA memory organization.
22 Multiple cores and caches: cache coherence. Data in a cache is only a copy of data in memory; on multiprocessor systems there can be multiple copies of the same data, and a cache coherence protocol/hardware ensures a consistent data view. Without cache coherence, shared cache lines can become clobbered. Example (cache line size = 2 words; A1 and A2 are in a single CL): P1 loads A1 and writes A1=0 in its cache C1; P2 loads A2 and writes A2=0 in its cache C2. Each cache now holds a stale copy of the other word, so write-back to memory leads to incoherent data; the C1 and C2 entries cannot be merged into a single correct cache line.
23 Multiple caches: cache coherence. The cache coherence protocol must keep track of cache line status. Example (both caches hold the CL containing A1 and A2): P1 writes A1=0: (1) request exclusive access to the CL, (2) invalidate the CL in C2, (3) modify A1 in C1. P2 then loads A2 and writes A2=0: (1) request exclusive CL access, (2) C1 writes the CL back and invalidates it, (3) the CL is loaded into C2, which becomes its exclusive owner, (4) modify A2 in C2.
24 Parallel computers: cache coherence. Cache coherence can cause substantial overhead and may reduce the available bandwidth. Different implementations: snoop (on modifying a CL, a CPU must broadcast its address to the whole system); directory / snoop filter (the chipset or network keeps track of which CLs are where and filters coherence traffic). Directory-based ccNUMA can reduce the pain of the additional coherence traffic. But always take care: multiple processors should never write frequently to the same cache line ("false sharing")!
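A minimal OpenMP sketch of the false-sharing problem and the usual fix: per-thread counters packed contiguously would share a cache line, while padding each counter to a full 64-byte line keeps the updates in separate lines. The thread-count limit and iteration count below are arbitrary assumptions for illustration.
  program false_sharing
    use omp_lib
    implicit none
    integer, parameter :: pad = 8                ! 8 doubles = 64 bytes = one cache line
    integer, parameter :: maxthreads = 16        ! assumed upper limit on the thread count
    double precision :: s(pad, maxthreads)       ! one padded counter per thread
    integer :: i, tid
    s = 0.d0
    !$omp parallel private(tid)
    tid = omp_get_thread_num() + 1
    !$omp do
    do i = 1, 10000000
       s(1, tid) = s(1, tid) + 1.d0              ! each thread stays within its own cache line
    end do
    !$omp end do
    !$omp end parallel
    ! declaring s(maxthreads) without the padding dimension would pack all counters into
    ! one or two cache lines, which would then ping-pong between the writing cores
    print *, 'total increments:', sum(s(1, :))
  end program false_sharing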
25 Cray XC30 "SandyBridge-EP" 8-core dual-socket node: 8 cores per socket running at 2.7 GHz (plus turbo mode); DDR3 memory interface with 4 channels per chip; two-way SMT; two 256-bit SIMD FP units (SSE4.2, AVX); 32 KiB L1 data cache per core; 256 KiB L2 cache per core; 20 MiB L3 cache per chip; ccNUMA memory architecture: memory is physically distributed but logically shared.
26 Interlude: A glance at current accelerator technology NVidia Pascal GP100 vs. Intel Xeon Phi Knights Landing
27 NVidia Pascal GP100 block diagram. Architecture: 15.3 billion transistors; ~1.4 GHz clock speed; up to 60 SM units with 64 (SP) cores each; 5.7 TFlop/s DP peak (2:1 SP:DP performance ratio); 4 MB L2 cache; 4096-bit HBM2 memory interface; memory bandwidth ~732 GB/s (theoretical), ~510 GB/s (measured). (Block diagram: NVIDIA Corp.)
28 Intel Xeon Phi "Knights Landing" block diagram. Architecture: 8 billion transistors; up to 1.5 GHz clock speed; up to 2x36 cores (36 tiles of two cores on a 2D mesh); two 512-bit SIMD units (VPUs) per core; 4-way SMT; per core: 32 KiB L1, per tile: 1 MiB shared L2; 3.5 TFlop/s DP peak (SP: 2x); 36 MiB L2 cache total; 16 GiB on-package MCDRAM with ~470 GB/s measured bandwidth; large DDR4 main memory with ~90 GB/s measured bandwidth.
29 Trading single-thread performance for parallelism: GPGPUs vs. CPUs. GPU vs. CPU light-speed estimate (per device): memory bandwidth ~5-10x, peak ~6-15x.
                          2x Intel Xeon E5-2697v4   Intel Xeon Phi 7250      NVidia Tesla P100
                          ("Broadwell")             ("Knights Landing")      ("Pascal")
  Cores @ clock           2 x 2.3 GHz               1.4 GHz                  56 SMs @ ~1.3 GHz
  SP performance/core     73.6 GFlop/s              89.6 GFlop/s             ~166 GFlop/s
  Threads @ STREAM        ~8                        ~40                      >8000?
  SP peak                 2.6 TFlop/s               6.1 TFlop/s              ~9.3 TFlop/s
  STREAM BW (measured)    2 x 62.5 GB/s             470 GB/s (HBM)           510 GB/s
  Transistors / TDP       ~2 x 7 billion / 2 x 145 W   8 billion / 215 W     14 billion / 300 W
30 Node topology and programming models
31 Parallel programming models: pure MPI. The machine structure is invisible to the user: a very simple programming model; MPI knows what to do!? Performance issues: intranode vs. internode MPI; node/system topology.
32 Parallel programming models are topology-agnostic. Example: pure threading on the node (relevant for this tutorial). The machine structure is invisible to the user: a very simple programming model; the threading software (OpenMP, pthreads, TBB, ...) should know about the details. Performance issues: synchronization overhead; memory access; node topology.
33 Distributed-memory computers & hybrid systems
34 Parallel distributed-memory computers: basics. In a distributed-memory parallel computer, each processor P is connected to exclusive local memory (M) and a network interface (NI); together they form a node. A (dedicated) communication network connects all nodes. Data exchange between nodes: passing messages via the network ("message passing"). Variants: no global (shared) address space, i.e., no remote memory access (NORMA); or a non-coherent shared address space (NUMA), e.g., Cray systems and PGAS languages (CoArray Fortran, UPC). Prototype of the first PC clusters: node = single-CPU PC, network = Ethernet. First massively parallel processing (MPP) architectures: CRAY T3D/E, Intel Paragon.
35 Parallel distributed-memory computers: hybrid systems. The standard concept of most modern large parallel computers is hybrid/hierarchical: each compute node is a 2- or 4-socket shared-memory node with a network interface (NI), and a communication network (GBit Ethernet, InfiniBand, ...) connects the nodes. Price per (peak) performance is optimal, but network capability per (peak) performance gets worse as nodes grow. Parallel programming? Pure message passing is the standard; hybrid programming is an option. Today, GPUs/accelerators are added to the nodes, which further increases complexity.
36 Networks Basic ideas and performance characteristics of modern networks
37 Networks: basic performance characteristics. To evaluate the capability of a network to transfer data, use the same idea as for main memory access. The total transfer time for a message of V bytes is T_comm = lambda + V / b_network, where lambda is the latency (transfer setup time, in seconds) and b_network is the asymptotic (V -> infinity) network bandwidth in MBytes/s. Consider the simplest case ("ping-pong"): two processes in different nodes communicate via the network (point-to-point), and a single message of N bytes is sent forward and backward, so the overall data transfer is 2N bytes.
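The latency/bandwidth model above yields an effective-bandwidth curve B_eff(V) = V / (lambda + V/b_network); half of the asymptotic bandwidth is reached at V_1/2 = lambda * b_network. The sketch below evaluates the model for assumed GigE-like parameters; the 40-microsecond latency and 111 MB/s bandwidth are illustrative values, not measurements.
  program beff_model
    implicit none
    double precision, parameter :: lambda = 40.d-6    ! assumed latency: 40 microseconds
    double precision, parameter :: bnet   = 111.d6    ! assumed asymptotic bandwidth: 111 MB/s
    double precision :: v, beff
    integer :: k
    print *, 'V_1/2 [bytes]:', lambda * bnet
    do k = 1, 7
       v    = 10.d0**k                                ! message size in bytes
       beff = v / (lambda + v / bnet)
       print '(a,es9.2,a,f8.2,a)', 'V =', v, ' bytes   B_eff =', beff / 1.d6, ' MB/s'
    end do
  end program beff_model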
38 Networks: basic performance characteristics. Ping-pong benchmark (schematic view):
  myid = get_process_id()
  if (myid .eq. 0) then
     targetID = 1
     S = get_walltime()
     call Send_message(buffer, N, targetID)
     call Receive_message(buffer, N, targetID)
     E = get_walltime()
     MBYTES = 2*N/(E-S)/1.d6      ! MBytes/sec transfer rate
     TIME   = (E-S)/2*1.d6        ! transfer time in microseconds for a single message
  else
     targetID = 0
     call Receive_message(buffer, N, targetID)
     call Send_message(buffer, N, targetID)
  endif
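For reference, the schematic benchmark above maps almost one-to-one onto real MPI calls. The following sketch uses blocking MPI_Send/MPI_Recv and MPI_Wtime and must be run with at least two processes; the buffer size is an arbitrary example, and a production benchmark would repeat the exchange many times and report the best value.
  program pingpong
    use mpi
    implicit none
    integer, parameter :: n = 1000000          ! message length in doubles (8*n bytes), illustrative
    double precision :: buf(n), s, e
    integer :: rank, ierr, stat(MPI_STATUS_SIZE)
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    buf = 1.d0
    if (rank == 0) then
       s = MPI_Wtime()
       call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
       call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, stat, ierr)
       e = MPI_Wtime()
       print *, 'effective bandwidth [MB/s]:', 2.d0 * 8.d0 * n / (e - s) / 1.d6
    else if (rank == 1) then
       call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, stat, ierr)
       call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
    end if
    call MPI_Finalize(ierr)
  end program pingpong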
39 Networks: basic performance characteristics. Ping-pong benchmark for a GBit Ethernet (GigE) network, reporting B_eff = 2*N/(E-S)/1.d6. N_1/2 is the message size where 50% of the peak bandwidth is achieved. Asymptotic bandwidth: b_network = 111 MB/s on a 1 GBit/s link. Latency (N -> 0): only qualitative agreement between model and measurement (44 µs vs. 76 µs).
40 Networks: basic performance characteristics. Ping-pong benchmark for a DDR InfiniBand (DDR-IB) network: determine b_network and lambda independently and combine them in the model. (Figure: measured effective bandwidth vs. message size compared with the model.)
41 Networks: basic performance characteristics. First-principles modeling of B_eff(V) provides good qualitative results, but the quantitative description, in particular of the latency-dominated region (small V), may fail because: there is overhead for transmission protocols, e.g., message headers; there is a minimum frame size for message transmission, e.g., TCP/IP over Ethernet always transfers frames of at least a minimum size; message setup/initialization involves multiple software layers and protocols, and each software layer adds to the latency (the hardware-only latency is often small); as the message size increases, the software may switch to a different protocol, e.g., from eager to rendezvous. Typical message sizes in applications are neither small nor large. The V_1/2 value is also important: V_1/2 = lambda * b_network. Network balance: relate the network bandwidth (b_network or B_eff(V_1/2)) to the compute power (or main memory bandwidth) of the nodes.
42 Networks: topologies & bisection bandwidth. The network bisection bandwidth B_b is a general metric for the data transfer capability of a system: the minimum sum of the bandwidths of all connections cut when splitting the system into two equal parts. A more meaningful metric in terms of system scalability is the bisection bandwidth per node, B_b/N_nodes. The bisection bandwidth depends on the bandwidth per link and on the network topology; also note whether uni- or bi-directional bandwidth is quoted!
43 Network topologies: bus. A bus can be used by one connection at a time; bandwidth is shared among all devices; the bisection bandwidth is constant, so B_b/N_nodes ~ 1/N_nodes; collision detection and bus arbitration protocols must be in place. Examples: PCI bus, diagnostic buses. Advantages: low latency; easy to implement. Disadvantages: shared bandwidth, not scalable; problems with failure resiliency (one defective agent may block the bus); fast buses for large N require large signal power.
44 Non-blocking crossbar. A non-blocking crossbar can mediate a number of connections between a group of input and a group of output elements; built from 2x2 switching elements, it can be used as a 4-port non-blocking switch (fold at the secondary diagonal). Switches can be cascaded to form hierarchies (the common case). This allows scalable communication at high hardware/energy cost. Crossbars can be used as interconnects in computer systems, e.g., the NEC SX9 vector system ("IXS").
45 Network topologies: switches and fat trees. Standard clusters are built with switched networks: compute nodes ("devices") are split into groups, and each group is connected to a single (non-blocking crossbar) switch ("leaf switch"); the leaf switches are connected with each other via an additional switch hierarchy ("spine switches") or directly (for small configurations). In switched networks, the distance between any two devices is heterogeneous (number of hops in the switch hierarchy). Diameter of the network: the maximum number of hops required to connect two arbitrary devices (e.g., the diameter of a bus is 1). Perfect world: fully non-blocking, i.e., any choice of N_nodes/2 disjoint node (device) pairs can communicate at full speed.
46 Fat-tree switch hierarchies. Fully non-blocking: N_nodes/2 end-to-end connections with full link bandwidth B; B_b = B * N_nodes/2, so B_b/N_nodes = const. = B/2. Sounds good, but see the next slide. Oversubscribed: the spine does not support N_nodes/2 full-bandwidth end-to-end connections; B_b/N_nodes = const. = B/(2k), with oversubscription factor k (e.g., k = 3). Resource management (job placement) is crucial. (Figure: nodes connected to leaf switches, leaf switches connected to spine switches.)
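The oversubscription formula is easy to evaluate: with link bandwidth B, the per-node bisection bandwidth of a two-level fat tree is B/(2k). A tiny sketch with an assumed 100 GBit/s link (EDR-class; illustrative only):
  program fattree_bisection
    implicit none
    double precision, parameter :: B = 100.d0   ! assumed GBit/s per link (illustrative)
    integer :: k
    do k = 1, 3                                  ! k = 1 is the fully non-blocking case
       print '(a,i0,a,f6.1,a)', 'k=', k, ':  B_b/N_nodes = ', B/(2*k), ' GBit/s per node'
    end do
  end program fattree_bisection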
47 Fat trees and static routing. If all end-to-end data paths are preconfigured ("static routing"), not all possible combinations of N agents will get full bandwidth. Example: (1->5, 2->6, 3->7, 4->8) is a collision-free pattern here; changing 2->6, 3->7 into 2->7, 3->6 causes collisions if no other connections are re-routed at the same time. Static routing is the quasi-standard in commodity interconnects; however, things are slowly starting to improve.
48 Full fat tree: a single 288-port IB DDR switch. Basic building blocks: 24-port switches. Spine switch level: 12 switches. Leaf switch level: 24 switches with 24 x 12 = 288 ports to devices.
49 Fat-tree networks: examples. Ethernet: 1, 10, and 100 GBit/s variants. InfiniBand: the dominant high-performance commodity interconnect; DDR: 20 GBit/s per link and direction (building blocks: 24-port switches); QDR: 40 GBit/s per link and direction (building blocks: 36-port switches, large 36 x 18 = 648-port switches), used in RRZE's LiMa and Emmy clusters; FDR-10 / FDR: 40/56 GBit/s per link and direction; EDR: 100 GBit/s per link and direction. Intel Omni-Path: up to 100 GBit/s per link, 48-port baseline switches, used in the RRZE Meggie cluster. Fat trees are expensive and complex to scale to very high node counts.
50 Meshes. Fat trees can become prohibitively expensive in large systems; the compromise is a mesh: n-dimensional hypercubes, toruses (2D/3D), and many others (including hybrids). Each node is a router, with direct connections only between direct neighbors; this is not a non-blocking crossbar! Intelligent resource management and routing algorithms are essential. Example: 2D torus mesh. Toruses are used in very large systems: Cray XE/XK series, IBM Blue Gene. B_b ~ N_nodes^((d-1)/d), so B_b/N_nodes -> 0 for large N_nodes. Sounds bad, but those machines show good scaling for many codes, with well-defined and predictable bandwidth behavior.
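The torus scaling law can be made concrete for a 3D torus of k x k x k nodes: cutting it into two halves severs 2*k*k links (two cut planes because of the wraparound), so B_b ~ N_nodes^(2/3) and the per-node bisection bandwidth shrinks as the machine grows. A small sketch with an assumed per-link bandwidth:
  program torus_bisection
    implicit none
    double precision, parameter :: B = 10.d0    ! assumed GB/s per link (illustrative)
    integer :: k
    do k = 4, 32, 4
       ! 3D torus with k**3 nodes: bisection cuts 2*k*k links
       print '(a,i6,a,f8.2,a)', 'N_nodes=', k**3, '  B_b/N_nodes =', B*2*k*k/dble(k**3), ' GB/s per node'
    end do
  end program torus_bisection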
51 Conclusions about architecture. Modern computer architecture has a rich topology. Node-level hardware parallelism takes many forms: sockets/devices (CPU: 1-4 or more; GPGPU/Phi: 1-6 or more); cores (moderate on CPUs: 4-24, and on Phi: 64-72; much higher on GPGPUs); SIMD (moderate: CPU 2-8, Phi 8-16; massive: GPGPU 10s-100s); superscalarity (CPU/Phi: 2-6). System-level architecture is mostly defined by the network topology (fat tree, torus, ...). Exploiting performance requires parallelism plus bottleneck awareness: high performance computing == computing at a bottleneck. The performance of programs is sensitive to architecture: topology/affinity influences the overheads of popular programming models; programming standards do not contain (many) topology-aware features, although things are slowly improving (MPI 3.0, OpenMP 4.0); apart from overheads, performance features are largely independent of the programming model.