Introduction: Modern computer architecture. The stored-program computer and its inherent bottlenecks. Multi- and manycore chips and nodes

1 Introduction: Modern computer architecture. The stored-program computer and its inherent bottlenecks. Multi- and manycore chips and nodes

2 Introduction: Moore's law. 1965: G. Moore observed that the number of transistors on a microchip doubles roughly every 12 to 24 months. Transistor counts of recent chips: Intel Sandy Bridge EP: 2.3 billion; Nvidia Kepler: 7 billion; Intel Broadwell: 7.2 billion; Nvidia Pascal: 15 billion.

3 Multi-core today: Intel Xeon E5-2600v3 (2014). Xeon E5-2600v3 "Haswell EP": up to 18 cores running at 2+ GHz (Turbo Mode: 3.5+ GHz); with Simultaneous Multithreading it reports as a 36-way chip; 5.7 billion transistors; 22 nm process; die size 662 mm²; optional Cluster on Die (CoD) mode; typically deployed in dual-socket servers.

4 A deeper dive into core and chip architecture

5 General-purpose cache-based microprocessor core. The modern CPU core implements the stored-program computer concept (Turing 1936), and similar designs are used on all modern systems. It is flexible, but (still) has multiple potential bottlenecks.

6 Basic resources on a stored-program computer: instruction execution and data movement. 1. Instruction execution: this is the primary resource of the processor; all efforts in hardware design are targeted towards increasing instruction throughput. Instructions are the concept of "work" as seen by processor designers, but not all instructions count as work as seen by application developers! Example: adding two arrays A(:) and B(:): do i=1,N; A(i) = A(i) + B(i); enddo. Processor work per iteration: LOAD r1 = A(i); LOAD r2 = B(i); ADD r1 = r1 + r2; STORE A(i) = r1; INCREMENT i; BRANCH to top if i<N. User work: N flops (ADDs).
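As a concrete illustration, the loop can be written as a minimal compilable program; this is only a sketch, and the array size and the final print statement are illustrative additions, not part of the slide:

! Sketch: the array-update example as a complete program.
! User work: N flops (one ADD per iteration). Scalar processor work per
! iteration: 2 LOADs, 1 ADD, 1 STORE, plus loop increment and branch.
program array_update
  implicit none
  integer, parameter :: n = 1000000
  double precision :: a(n), b(n)
  integer :: i

  a = 1.0d0
  b = 2.0d0

  do i = 1, n
     a(i) = a(i) + b(i)
  end do

  print *, 'a(1) =', a(1)   ! use the result so the loop is not optimized away
end program array_update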

7 Basic resources on a stored-program computer: instruction execution and data movement. 2. Data transfer: data transfers are a consequence of instruction execution and therefore a secondary resource. The maximum bandwidth is determined by the request rate of the executed instructions and by technical limitations (bus width, speed). Example: adding two arrays A(:) and B(:): do i=1,N; A(i) = A(i) + B(i); enddo. Data transfers per iteration: 8 bytes for LOAD r1 = A(i), 8 bytes for LOAD r2 = B(i), 8 bytes for STORE A(i) = r1; sum: 24 bytes. Crucial question: what is the bottleneck, data transfer or code execution?
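To make the crucial question quantitative, one can time the loop and convert the 24 bytes per iteration into an effective bandwidth. The following sketch assumes arrays much larger than the cache and timing via system_clock; the array size and output are illustrative choices, not part of the slide:

! Sketch: estimate the effective memory bandwidth of A(i) = A(i) + B(i),
! assuming 24 bytes of traffic per iteration (2 loads + 1 store).
program bandwidth_estimate
  implicit none
  integer, parameter :: n = 20000000
  double precision, allocatable :: a(:), b(:)
  integer :: i
  integer(kind=8) :: t0, t1, rate
  double precision :: seconds

  allocate(a(n), b(n))
  a = 1.0d0
  b = 2.0d0

  call system_clock(t0, rate)
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
  call system_clock(t1)

  seconds = dble(t1 - t0) / dble(rate)
  print *, 'effective bandwidth [GB/s]:', 24.0d0 * dble(n) / 1.0d9 / seconds
  print *, a(1)   ! keep the compiler from dropping the loop
end program bandwidth_estimate

Comparing the measured bandwidth with the chip's peak memory bandwidth and peak flop rate answers which resource limits this loop.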

8 Microprocessors: Pipelining

9 Pipelining of arithmetic/functional units. Idea: split a complex instruction into several simple, fast steps (stages); each step takes the same amount of time, e.g., a single cycle; execute different steps of different instructions at the same time (in parallel). Benefits: a core with a 5-stage pipeline can work on 5 independent instructions simultaneously, and one instruction finishes each cycle once the pipeline is full. Drawbacks: the pipeline must be filled, so a large number of independent instructions is required; this needs complex instruction scheduling by hardware (out-of-order execution) or by the compiler (software pipelining), as illustrated by the sketch below. Pipelining is widely used in modern computer architectures.
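The need for many independent instructions can be illustrated with a summation: a single accumulator creates a chain of dependent ADDs that exposes the full pipeline latency, while several accumulators (what a compiler does for software pipelining / modulo variable expansion) keep the pipeline filled. A sketch, assuming n is a multiple of 4; names and sizes are illustrative:

! Sketch: loop-carried dependency vs. independent accumulators.
program pipeline_demo
  implicit none
  integer, parameter :: n = 4000000          ! assumed to be a multiple of 4
  double precision :: a(n), s, s1, s2, s3, s4
  integer :: i

  a = 1.0d0

  ! Version 1: every ADD depends on the previous one, so a new result
  ! can start only after the previous ADD has left the pipeline.
  s = 0.0d0
  do i = 1, n
     s = s + a(i)
  end do
  print *, 'dependent sum   :', s

  ! Version 2: four independent accumulators keep the ADD pipeline busy.
  s1 = 0.0d0; s2 = 0.0d0; s3 = 0.0d0; s4 = 0.0d0
  do i = 1, n, 4
     s1 = s1 + a(i)
     s2 = s2 + a(i+1)
     s3 = s3 + a(i+2)
     s4 = s4 + a(i+3)
  end do
  print *, 'independent sum :', s1 + s2 + s3 + s4
end program pipeline_demo

Note that the second version changes the order of the floating-point additions, which a compiler is only allowed to do with relaxed floating-point semantics.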

10 5-stage multiplication pipeline: A(i)=B(i)*C(i), i=1,...,N. The first result is available after 5 cycles (the latency of the pipeline); in general, an m-stage pipeline needs N + m - 1 cycles to produce N results. Wind-up/wind-down phases: empty pipeline stages at the start and end of the loop.

11 Microprocessors: Superscalarity and Simultaneous Multithreading

12 Superscalar processors: instruction-level parallelism. Multiple execution units enable the use of instruction-level parallelism (ILP): the instruction stream is parallelized on the fly. [Figure: pipeline diagram of a 4-way superscalar core, fetching, decoding, and executing four instructions from the L1 instruction cache per cycle.] Example units: LOAD, STORE, MULT, ADD. Issuing m concurrent instructions per cycle makes a core m-way superscalar. Modern processors are 3- to 6-way superscalar and can perform 2 floating-point instructions per cycle.

13 Core details: Simultaneous multi-threading (SMT). SMT principle (2-way example): two architectural threads share the execution resources of one core. [Figure: pipeline occupancy of a standard core vs. a 2-way SMT core.]

14 Microprocessors: Single Instruction Multiple Data (SIMD), a.k.a. vectorization

15 Core details: SIMD processing. Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on wide registers. x86 SIMD instruction sets: SSE: register width = 128 bit (2 double-precision floating-point operands); AVX: register width = 256 bit (4 double-precision operands); AVX-512: you guessed it (512 bit, 8 operands). Example: adding two registers holding double-precision operands. Scalar execution: R2 = ADD [R0,R1] produces one 64-bit result per instruction; SIMD execution: R2 = V64ADD [R0,R1] produces, e.g., four 64-bit results per instruction with 256-bit registers. [Figure: element-wise addition A[0..3] + B[0..3] = C[0..3] in one wide-register operation.]
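In practice SIMD code is mostly generated by the compiler. The subroutine below is a sketch of a loop that maps directly onto SSE/AVX instructions; the !$omp simd directive (OpenMP 4.0) is one portable way to request vectorization, and the routine name and interface are illustrative:

! Sketch: a vectorizable loop. With AVX the compiler can emit one
! 256-bit vector ADD per four scalar iterations.
subroutine vec_add(n, a, b, c)
  implicit none
  integer, intent(in)           :: n
  double precision, intent(in)  :: b(n), c(n)
  double precision, intent(out) :: a(n)
  integer :: i

  !$omp simd
  do i = 1, n
     a(i) = b(i) + c(i)
  end do
end subroutine vec_add

Compiler reports (e.g., -fopt-info-vec with gfortran or -qopt-report with the Intel compiler) show whether the loop was actually vectorized.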

16 Microprocessors: Memory Hierarchy

17 Von Neumann bottleneck reloaded: the DRAM gap. Comparing DP peak performance with peak main memory bandwidth for a single Intel processor (chip) gives a ratio of approx. 10 F/B, i.e., the chip can perform about 10 flops in the time it takes to load one byte from main memory. Main memory access speed is therefore not sufficient to keep the CPU busy. Remedy: introduce fast on-chip caches holding copies of recently used data items.
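The consequence can be quantified with a simple "lightspeed" estimate. The numbers below are assumed values for illustration, not taken from the slide: with 24 bytes of traffic per flop for the array-update loop and a chip that must perform roughly 10 flops per byte of memory bandwidth, the loop is limited to a small fraction of peak.

! Sketch: bandwidth-limited performance estimate (all inputs are assumed values).
program lightspeed
  implicit none
  double precision, parameter :: peak_gflops  = 500.0d0   ! assumed DP peak [GFlop/s]
  double precision, parameter :: membw_gbs    = 50.0d0    ! assumed memory bandwidth [GB/s]
  double precision, parameter :: code_balance = 24.0d0    ! bytes per flop for A(i)=A(i)+B(i)
  double precision :: bw_limit

  bw_limit = membw_gbs / code_balance          ! GFlop/s sustainable from memory
  print *, 'machine balance [B/F] :', membw_gbs / peak_gflops
  print *, 'bandwidth-limited perf:', bw_limit, 'GFlop/s'
  print *, 'fraction of peak      :', bw_limit / peak_gflops
end program lightspeed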

18 Registers and caches: data transfers in a memory hierarchy. Caches help with getting instructions and data to the CPU fast. How does data travel from memory to the CPU and back? Remember: caches are organized in cache lines (e.g., 64 bytes), and only complete cache lines are transferred between memory hierarchy levels (except to/from registers). MISS: a load or store instruction does not find the data in a cache level, so a cache line (CL) transfer is required. Example: array copy A(:)=C(:). LD C(1) misses and loads one CL of C; ST A(1) misses and triggers a write-allocate load of the corresponding CL of A; the subsequent LD C(2..N_cl) and ST A(2..N_cl) hit in the cache; the modified CL of A is evicted to memory later (delayed). Result: 3 CL transfers per cache line of data copied.
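The 3-transfer accounting can be checked with a measurement: per 8-byte element copied, 8 bytes of C(:) are loaded, 8 bytes of A(:) are loaded due to write-allocate, and 8 bytes of A(:) are evicted, i.e., 24 bytes of traffic. The sketch below uses illustrative sizes; on CPUs with non-temporal stores the write-allocate transfer can be avoided.

! Sketch: memory traffic of the array copy A(:) = C(:), assuming 64-byte
! cache lines and write-allocate on store misses (3 line transfers per CL).
program copy_traffic
  implicit none
  integer, parameter :: n = 20000000
  double precision, allocatable :: a(:), c(:)
  integer(kind=8) :: t0, t1, rate
  double precision :: seconds

  allocate(a(n), c(n))
  a = 0.0d0
  c = 1.0d0

  call system_clock(t0, rate)
  a(:) = c(:)
  call system_clock(t1)

  seconds = dble(t1 - t0) / dble(rate)
  print *, 'copy bandwidth incl. write-allocate [GB/s]:', &
           24.0d0 * dble(n) / 1.0d9 / seconds
  print *, a(1)
end program copy_traffic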

19 Parallel computers: Terminology

20 From core to cluster. [Figure: hierarchy from the core (with hardware threads T), to the multicore chip/socket, to the shared-memory compute node with several sockets, to the distributed-memory cluster of nodes coupled by a network.]

21 Shared-memory parallel computers: cache coherence, multicore-multisocket architecture, ccNUMA memory organization

22 Multiple cores and caches: cache coherence. Data in a cache is only a copy of data in memory, and on multiprocessor systems multiple copies of the same data can exist. A cache coherence protocol in hardware ensures a consistent view of the data. Without cache coherence, shared cache lines can become clobbered. Example (cache line size = 2 words; A1 and A2 reside in a single CL): P1 loads A1 and writes A1=0 into its cache C1; P2 loads A2 and writes A2=0 into its cache C2. Each cache now holds a stale value of the other variable, and write-back to memory leads to incoherent data: the C1 and C2 entries cannot be merged into a line containing both the updated A1 and the updated A2.

23 Multiple caches: cache coherence. The cache coherence protocol must keep track of the cache line status. Same example as before: P1 loads A1 and P2 loads A2, so both C1 and C2 hold the CL containing A1 and A2. P1 writes A1=0: 1. request exclusive access to the CL; 2. invalidate the CL in C2; 3. modify A1 in C1. P2 then writes A2=0: 1. request exclusive access to the CL; 2. C1 writes the CL back and invalidates its copy; 3. the CL is loaded into C2, which is now the exclusive owner of the CL; 4. modify A2 in C2.

24 Parallel computers: cache coherence. Cache coherence can cause substantial overhead and may reduce the available bandwidth. Different implementations exist. Snooping: on modifying a CL, a CPU must broadcast its address to the whole system. Directory / snoop filter: the chipset ("network") keeps track of which CLs are where and filters coherence traffic; directory-based ccNUMA can reduce the pain of the additional coherence traffic. But always take care: multiple processors should never write frequently to the same cache line ("false sharing"), as illustrated by the sketch below.
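A sketch of false sharing with OpenMP (the 64-byte line size, thread limit, and iteration count are illustrative): in the "bad" array all per-thread counters are packed into a few cache lines, so every update forces coherence traffic between cores; padding each counter to its own cache line removes the problem.

! Sketch: false sharing of per-thread counters (assumes at most 64 threads
! and 64-byte cache lines). Compile with OpenMP, e.g. gfortran -O2 -fopenmp.
program false_sharing
  use omp_lib
  implicit none
  integer, parameter :: maxthreads = 64, niter = 10000000
  double precision :: bad(maxthreads)       ! counters packed into few cache lines
  double precision :: good(8, maxthreads)   ! one 64-byte line per counter
  integer :: tid, i

  bad  = 0.0d0
  good = 0.0d0

  !$omp parallel private(tid, i)
  tid = omp_get_thread_num() + 1
  do i = 1, niter
     bad(tid)     = bad(tid) + 1.0d0        ! neighbours share a line: coherence traffic
     good(1, tid) = good(1, tid) + 1.0d0    ! private line: no false sharing
  end do
  !$omp end parallel

  print *, sum(bad), sum(good(1, :))
end program false_sharing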

25 Cray XC30 SandyBridge-EP 8-core dual-socket node: 8 cores per socket at 2.7 GHz (plus turbo mode); DDR3 memory interface with 4 channels per chip; two-way SMT; two 256-bit SIMD FP units (SSE4.2, AVX); 32 kB L1 data cache per core; 256 kB L2 cache per core; 20 MB L3 cache per chip. ccNUMA memory architecture: memory is physically distributed but logically shared.
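On such a ccNUMA node, data placement matters for bandwidth: with the usual first-touch policy, a memory page is placed in the locality domain of the core that first writes to it. The following sketch assumes OpenMP and a Linux-like first-touch policy; sizes and the kernel are illustrative. The important point is that the initialization and compute loops use the same static work distribution.

! Sketch: ccNUMA-aware initialization by parallel first touch.
program first_touch
  implicit none
  integer, parameter :: n = 40000000
  double precision, allocatable :: a(:), b(:)
  integer :: i

  allocate(a(n), b(n))

  ! Initialize in parallel: each thread touches (and thereby places)
  ! the pages it will later work on.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 0.0d0
     b(i) = dble(i)
  end do
  !$omp end parallel do

  ! Compute loop with the same static distribution: mostly local memory access.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
  !$omp end parallel do

  print *, a(n)
end program first_touch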

26 Interlude: A glance at current accelerator technology: NVidia Pascal GP100 vs. Intel Xeon Phi Knights Landing

27 NVidia Pascal GP100 block diagram. Architecture: 15.3 billion transistors; ~1.4 GHz clock speed; up to 60 SM units with 64 (SP) cores each; 5.7 TFlop/s DP peak (2:1 SP:DP performance ratio); 4 MB L2 cache; 4096-bit HBM2 memory interface; MemBW ~732 GB/s (theoretical), ~510 GB/s (measured). [Block diagram: NVIDIA Corp.]

28 Intel Xeon Phi Knights Landing block diagram. [Figure: up to 36 tiles (72 cores max.) on a 2D mesh; each tile has two cores (each with 2 VPUs, 4 hardware threads, 32 KiB L1) sharing 1 MiB L2; 8 MCDRAM devices on package and 6 DDR4 channels.] Architecture: 8 billion transistors; up to 1.5 GHz clock speed; up to 2x36 cores (2D mesh) with 2x 512-bit SIMD units each and 4-way SMT; 3.5 TFlop/s DP peak (SP 2x); 36 MiB L2 cache; 16 GiB MCDRAM with MemBW ~470 GB/s (measured); large DDR4 main memory with MemBW ~90 GB/s (measured).

29 Trading single-thread performance for parallelism: GPGPUs vs. CPUs. GPU vs. CPU "light speed" estimate (per device): MemBW ~5-10x, peak ~6-15x.

                         2x Intel Xeon E5-2697v4    Intel Xeon Phi 7250       NVidia Tesla P100
                         (Broadwell)                (Knights Landing)         (Pascal)
  Cores @ clock          2 x 2.3 GHz                1.4 GHz                   56 @ ~1.3 GHz
  SP performance/core    73.6 GFlop/s               89.6 GFlop/s              ~166 GFlop/s
  Threads @ STREAM       ~8                         ~40                       >8000?
  SP peak                2.6 TFlop/s                6.1 TFlop/s               ~9.3 TFlop/s
  STREAM BW (measured)   2 x 62.5 GB/s              470 GB/s (HBM)            510 GB/s
  Transistors / TDP      ~2x7 billion / 2x145 W     8 billion / 215 W         14 billion / 300 W

30 Node topology and programming models

31 Parallel programming models: pure MPI. The machine structure is invisible to the user: a very simple programming model ("MPI knows what to do!?"). Performance issues: intranode vs. internode MPI; node/system topology.

32 Parallel programming models are topology-agnostic. Example: pure threading on the node (relevant for this tutorial). The machine structure is invisible to the user: a very simple programming model; the threading software (OpenMP, pthreads, TBB, ...) should know about the details. Performance issues: synchronization overhead, memory access, node topology.

33 Distributed-memory computers & hybrid systems

34 Parallel distributed-memory computers: basics. In a distributed-memory parallel computer, each processor (P) is connected to exclusive local memory (M) and a network interface (NI), forming a node. A (dedicated) communication network connects all nodes. Data exchange between nodes happens by passing messages via the network ("message passing"). Variants: no global (shared) address space, i.e., no remote memory access (NORMA); or a non-coherent shared address space (NUMA), e.g., on CRAY systems or via PGAS languages (Coarray Fortran, UPC). Prototype of the first PC clusters: single-CPU PC nodes connected by Ethernet. First Massively Parallel Processing (MPP) architectures: CRAY T3D/E, Intel Paragon.

35 Parallel distributed-memory computers: hybrid systems. The standard concept of most modern large parallel computers is hybrid/hierarchical: each compute node is a 2- or 4-socket shared-memory system with a network interface, and a communication network (GBit Ethernet, InfiniBand) connects the nodes. Price per (peak) performance is optimal, but network capability per (peak) performance gets worse with node size. Parallel programming? Pure message passing is standard; hybrid programming is an option. Today, GPUs/accelerators are added to the nodes, further increasing complexity.

36 Networks: Basic ideas and performance characteristics of modern networks

37 Networks: basic performance characteristics. To evaluate the capability of a network to transfer data we use the same idea as for main memory access: the total transfer time for a message of V bytes is T_comm = λ + V / b_network, where λ is the latency (transfer setup time [s]) and b_network is the asymptotic (V → ∞) network bandwidth [MBytes/s]. Consider the simplest case ("Ping-Pong"): two processors in different nodes communicate via the network (point-to-point), and a single message of V bytes is sent forward and backward, so the overall data transfer is 2V bytes.

38 Networks: basic performance characteristics. Ping-Pong benchmark (schematic view):

myid = get_process_id()
if (myid .eq. 0) then
   targetID = 1
   S = get_walltime()
   call Send_message(buffer, N, targetID)
   call Receive_message(buffer, N, targetID)
   E = get_walltime()
   MBYTES = 2*N/(E-S)/1.d6    ! MBytes/sec transfer rate
   TIME   = (E-S)/2*1.d6      ! transfer time for a single message in microseconds
else
   targetID = 0
   call Receive_message(buffer, N, targetID)
   call Send_message(buffer, N, targetID)
endif
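With real MPI calls the benchmark looks like the sketch below (the mpi_f08 module, message size, and repetition count are illustrative choices; it needs at least two ranks, and the timing is averaged over repetitions). Build and run, e.g., with mpif90 pingpong.f90 and mpirun -np 2 ./a.out.

! Sketch: Ping-Pong with MPI (run with at least 2 processes).
program pingpong
  use mpi_f08
  implicit none
  integer, parameter :: nbytes = 1048576, nrep = 100
  character(len=1) :: buffer(nbytes)
  integer :: myid, rep
  double precision :: t0, t1

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, myid)
  buffer = 'x'

  t0 = MPI_Wtime()
  do rep = 1, nrep
     if (myid == 0) then
        call MPI_Send(buffer, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD)
        call MPI_Recv(buffer, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE)
     else if (myid == 1) then
        call MPI_Recv(buffer, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE)
        call MPI_Send(buffer, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD)
     end if
  end do
  t1 = MPI_Wtime()

  if (myid == 0) then
     print *, 'effective bandwidth [MB/s]   :', 2.0d0 * nbytes * nrep / (t1 - t0) / 1.0d6
     print *, 'time per one-way message [us]:', (t1 - t0) / (2.0d0 * nrep) * 1.0d6
  end if
  call MPI_Finalize()
end program pingpong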

39 Networks: basic performance characteristics. Ping-Pong benchmark for a GBit-Ethernet (GigE) network, reporting B_eff = 2*N/(E-S)/1.d6. N_1/2 is the message size at which 50% of the peak bandwidth is achieved. Measured asymptotic bandwidth: b_network = 111 MB/s, close to the GBit/s wire speed. Latency (N → 0): only qualitative agreement between model and measurement, 44 µs vs. 76 µs.

40 Networks: basic performance characteristics. Ping-Pong benchmark for a DDR InfiniBand (DDR-IB) network: determine b_network and λ independently from the measurements and combine them in the model. [Figure: measured bandwidth vs. message size with fitted λ and b_network.]

41 Networks: basic performance characteristics. First-principles modeling of B_eff(V) provides good qualitative results, but the quantitative description, in particular of the latency-dominated region (small V), may fail because: there is overhead for transmission protocols, e.g., message headers; there is a minimum frame size for message transmission, e.g., TCP/IP over Ethernet always transfers frames of at least a minimum size; message setup/initialization involves multiple software layers and protocols, each of which adds to the latency (the hardware-only latency is often small); and as the message size increases the software may switch to a different protocol, e.g., from eager to rendezvous. Typical message sizes in applications are neither small nor large, so the V_1/2 value is also important: V_1/2 = λ · b_network (e.g., λ = 2 µs and b_network = 5 GB/s give V_1/2 = 10 kB). Network balance: relate the network bandwidth (b_network or B_eff(V_1/2)) to the compute power (or main memory bandwidth) of the nodes.
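The model itself is easy to evaluate. The sketch below uses the same illustrative λ = 2 µs and b_network = 5 GB/s as the V_1/2 example above (assumed values, not measurements from the slides) and prints the effective bandwidth B_eff(V) = V / T_comm(V) for a range of message sizes:

! Sketch: effective bandwidth predicted by T_comm = lambda + V / b_network.
program beff_model
  implicit none
  double precision, parameter :: lambda = 2.0d-6   ! assumed latency [s]
  double precision, parameter :: bnet   = 5.0d9    ! assumed asymptotic bandwidth [B/s]
  double precision :: v, tcomm, beff
  integer :: k

  do k = 0, 24, 4                                  ! message sizes 1 B ... 16 MiB
     v     = 2.0d0**k
     tcomm = lambda + v / bnet
     beff  = v / tcomm
     print '(a,es10.3,a,es10.3,a)', 'V = ', v, ' B   B_eff = ', beff / 1.0d6, ' MB/s'
  end do

  print *, 'V_1/2 = lambda * b_network =', lambda * bnet, 'bytes'
end program beff_model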

42 Networks: topologies and bisection bandwidth. The network bisection bandwidth B_b is a general metric for the data transfer capability of a system: the minimum sum of the bandwidths of all connections cut when splitting the system into two equal parts. A more meaningful metric in terms of system scalability is the bisection bandwidth per node, B_b/N_nodes. The bisection bandwidth depends on the bandwidth per link and on the network topology; also beware whether uni- or bi-directional bandwidth is quoted.

43 Network topologies: bus. A bus can be used by only one connection at a time, so the bandwidth is shared among all devices; the bisection bandwidth is constant, hence B_b/N_nodes ~ 1/N_nodes. Collision detection and bus arbitration protocols must be in place. Examples: PCI bus, diagnostic buses. Advantages: low latency, easy to implement. Disadvantages: shared bandwidth, not scalable; problems with failure resiliency (one defective agent may block the bus); fast buses for large N require large signal power.

44 Non-blocking crossbar. A non-blocking crossbar can mediate a number of simultaneous connections between a group of input and a group of output elements; built from 2x2 switching elements, it can be used as a 4-port non-blocking switch (fold at the secondary diagonal). Switches can be cascaded to form hierarchies (the common case). Crossbars allow scalable communication at high hardware/energy cost and can be used as interconnects in computer systems, e.g., the NEC SX9 vector system ("IXS").

45 Network topologies: switches and fat trees. Standard clusters are built with switched networks: the compute nodes ("devices") are split up into groups, and each group is connected to a single (non-blocking crossbar) switch ("leaf switch"). Leaf switches are connected with each other using an additional switch hierarchy ("spine switches") or directly (for small configurations). In switched networks the distance between any two devices is heterogeneous (number of "hops" in the switch hierarchy). Diameter of a network: the maximum number of hops required to connect two arbitrary devices (e.g., the diameter of a bus is 1). Perfect world: fully non-blocking, i.e., any choice of N_nodes/2 disjoint node (device) pairs can communicate at full speed.

46 Fat-tree switch hierarchies. Fully non-blocking: N_nodes/2 end-to-end connections with full link bandwidth B, so B_b = B * N_nodes/2 and B_b/N_nodes = const. = B/2. Sounds good, but see the next slide. Oversubscribed: the spine does not support N_nodes/2 full-bandwidth end-to-end connections; B_b/N_nodes = const. = B/(2k), with oversubscription factor k (the figure shows k=3). With oversubscription, resource management (job placement) is crucial. [Figure: nodes attached to leaf switches, leaf switches attached to spine switches.]

47 Fat trees and static routing. If all end-to-end data paths are preconfigured ("static routing"), not all possible combinations of N agents will get full bandwidth. Example: (1→5, 2→6, 3→7, 4→8) is a collision-free pattern here, but changing 2→6, 3→7 to 2→7, 3→6 causes collisions if no other connections are re-routed at the same time. Static routing is the quasi-standard in commodity interconnects; however, things are starting to improve slowly.

48 Full fat tree: a single 288-port IB DDR switch. Basic building blocks: 24-port switches. Spine switch level: 12 switches. Leaf switch level: 24 switches, each with 12 ports down to devices and 12 ports up to the spine, giving 24*12 = 288 device ports in total.

49 Fat-tree networks: examples. Ethernet: 1 Gbit/s, with 10 and 100 Gbit/s variants. InfiniBand, the dominant high-performance commodity interconnect: DDR: 20 Gbit/s per link and direction (building blocks: 24-port switches); QDR: 40 Gbit/s per link and direction (used in RRZE's LiMa and Emmy clusters; building blocks: 36-port switches, large 36*18 = 648-port switches); FDR-10 / FDR: 40/56 Gbit/s per link and direction; EDR: 100 Gbit/s per link and direction. Intel OmniPath: up to 100 Gbit/s per link, 48-port baseline switches (RRZE's Meggie cluster). Fat trees are expensive and complex to scale to very high node counts.

50 Meshes. Fat trees can become prohibitively expensive in large systems; a compromise is a mesh: n-dimensional hypercubes, toruses (2D/3D), and many others (including hybrids). Each node is a router, and direct connections exist only between direct neighbors. This is not a non-blocking crossbar! Intelligent resource management and routing algorithms are essential. Example: 2D torus mesh. Toruses are used in very large systems: Cray XE/XK series, IBM Blue Gene. For a d-dimensional torus, B_b ~ N_nodes^((d-1)/d), so B_b/N_nodes → 0 for large N_nodes. Sounds bad, but those machines show good scaling for many codes, with well-defined and predictable bandwidth behavior.

51 Conclusions about architecture. Modern computer architecture has a rich topology. Node-level hardware parallelism takes many forms: sockets/devices (CPU: 1-4 or more; GPGPU/Phi: 1-6 or more); cores (moderate: CPU 4-24, Phi 64-72; more on GPGPUs); SIMD (moderate: CPU 2-8, Phi 8-16; massive: GPGPU 10s-100s); superscalarity (CPU/Phi: 2-6). System-level architecture is mostly defined by the network topology (fat tree, torus, ...). Exploiting performance means parallelism plus bottleneck awareness: High Performance Computing == computing at a bottleneck. The performance of programs is sensitive to the architecture: topology/affinity influences the overheads of popular programming models, and programming standards do not contain (many) topology-aware features, though things are starting to improve slowly (MPI 3.0, OpenMP 4.0). Apart from overheads, performance features are largely independent of the programming model.
