Systems Group, Department of Computer Science, ETH Zürich. Lecture 26: Multiprocessing continued. Computer Architecture and Systems Programming (252-0061-00), Timothy Roscoe, Herbstsemester 2012

Today: Non-Uniform Memory Access (NUMA); Multicore; Performance implications; Heterogeneous manycore; Course summary

SMP architecture: More cores bring more cycles, but not necessarily proportionately more cache, nor more off-chip bandwidth or total memory capacity. [Diagram: cores 1-8 sharing a path to main memory.]

SMP architecture: More cores and faster cores use more memory bandwidth, so buses are replaced with interconnection networks. [Diagram: four cores, each with its own cache, connected to main memory and the I/O system.]

Distributed memory architecture: Adding more CPUs also brings more of most other resources, since each node has its own CPUs, memory, and directory. Locality property: a node only goes to the interconnect for real I/O or sharing. This covers DSM/NUMA as well as message passing (e.g. clusters), and could scale to 100s or 1000s of cores. [Diagram: four nodes, each with CPUs, memory, and a directory, attached to an interconnect.]

Non-Uniform Memory Access: Removes the memory bottleneck: multiple, independent memory banks, and processors have independent paths to memory. [Diagram: CPU 0 and CPU 1, each with its own cache and local memory, joined by an interconnect.]

Non-Uniform Memory Access: The interconnect is not a bus any more: it's a network link. It carries messages between nodes (usually processor sockets): read/write requests and responses, cache invalidates, etc.

Non-Uniform Memory Access: All memory is globally addressable, but local memory is fast(ish) while remote memory reached over the interconnect is slow.

Non-Uniform Memory Access: NUMA interconnects are not new, but they are new in PCs: AMD HyperTransport (originally the Alpha EV7 interconnect) and Intel CSI/QuickPath.

NUMA cache coherence: Can't snoop on the bus any more: it's not a bus! NUMA uses a message-passing interconnect. Solution 1: bus emulation. Similar to snooping, but without a shared bus: each node sends a message to all other nodes (e.g. "read exclusive") and waits for a reply from every node (e.g. "acknowledge") before proceeding. Example: AMD coherent HyperTransport.

Solution 2: Cache directory. Augment each node's local memory with a cache directory: for each cache line (e.g. 64 bytes) it records, alongside the line's data in the memory itself, the node ID of the line's owner and one bit per node indicating whether that node holds a copy of the line.

Directory-based cache coherence: The home node maintains the set of nodes that may have a copy of each line. Used in large multiprocessors, plus AMD HT Assist and Intel Beckton (QuickPath). It is more efficient when (1) lines are not widely shared and (2) there are lots of NUMA nodes: it avoids broadcast/incast, reducing interconnect traffic and the load at each node. But it requires lots of fast memory: with HT Assist you lose up to 33% of the L3!
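As a rough illustration of what such a directory entry holds (a sketch, not from the slides; the struct layout, field names, and 4-node system size are assumptions):

    #include <stdint.h>

    #define NUM_NODES      4
    #define CACHE_LINE_SZ  64

    /* One directory entry per cache line of local memory (illustrative layout). */
    struct dir_entry {
        uint8_t owner;                 /* node ID of the current owner of the line     */
        uint8_t presence;              /* one bit per node: who may hold a copy        */
        uint8_t data[CACHE_LINE_SZ];   /* the cache line itself, held in local memory  */
    };

    /* Does node 'n' possibly hold a copy of this line? */
    static inline int node_has_copy(const struct dir_entry *e, int n) {
        return (e->presence >> n) & 1;
    }

On a read or write request, the home node consults the presence bits to decide which nodes (if any) must be invalidated or asked to forward the line, instead of broadcasting to everyone.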

Today: Non-Uniform Memory Access (NUMA); Multicore; Performance implications; Heterogeneous manycore; Course summary

Moore's law: the free lunch. [Chart of exponentially growing transistor counts, from the 4004 and 8086 through the 286, 386, 486, Pentium, Pentium II, Pentium 4, and Itanium 2.] Source: table from http://download.intel.com/pressroom/kits/events/moores_law_40th/mltimeline.pdf

The power wall. [Image: frying an egg on an AMD Athlon 1500 processor; http://www.phys.ncku.edu.tw/~htsu/humor/fry_egg.html]

The power wall: Power dissipation = capacitive load × voltage² × frequency. An increase in clock frequency means more power dissipated and more cooling required; a decrease in voltage reduces dynamic power but increases static leakage power. We've reached the practical power limit for cooling commodity microprocessors: we can't increase the clock frequency without expensive cooling.
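A quick illustrative calculation (numbers chosen for the example, not from the slides): because dynamic power scales as C × V² × f, lowering both voltage and frequency by 15% cuts dynamic power to roughly 0.85 × 0.85² ≈ 0.61 of its original value, a ~39% saving for a 15% performance loss. Once the supply voltage can no longer be lowered (because static leakage grows), frequency is the only remaining knob, which is exactly the power wall.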

The memory wall: [Plot of operations/µs (log scale, 1 to 10000) over time for CPU versus memory, with CPU performance pulling far ahead.] In 1985: a 1 MHz CPU clock and 500 ns to access memory.

The ILP wall: ILP = instruction-level parallelism, the implicit parallelism between instructions within a single thread. The processor can re-order and pipeline instructions, split them into micro-instructions, do aggressive branch prediction, and so on, but this requires hardware safeguards to prevent potential errors from out-of-order execution, and it increases execution-unit complexity and the associated power consumption. The returns are diminishing: serial performance acceleration using ILP has stalled.

End of the road for serial hardware: power wall + ILP wall + memory wall = brick wall. Power wall: can't clock processors any faster. Memory wall: for many workloads, performance is dominated by memory access times. ILP wall: can't keep functional units busy while waiting for memory accesses. (There is also a complexity wall, but chip designers don't like to talk about it.)

Multicore processors: multiple processor cores per chip. This is the future (and present) of computing. Most multicore chips so far are shared-memory multiprocessors (SMP): a single physical address space is shared by all processors, communication between processors happens through shared variables in memory, and the hardware typically provides cache coherence.
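As a minimal sketch of communication through shared variables (not from the slides; the flag and data names are made up), one thread can publish a value and another can poll for it, relying on cache coherence to propagate the writes:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int        shared_data;    /* payload communicated via shared memory */
    static atomic_int ready = 0;      /* flag protecting the payload            */

    static void *producer(void *arg) {
        (void)arg;
        shared_data = 42;                                        /* write the data  */
        atomic_store_explicit(&ready, 1, memory_order_release);  /* then publish it */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                   /* spin until the flag is published */
        printf("got %d\n", shared_data);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

The acquire/release ordering matters: without it, the consumer could see the flag before the data, which previews the memory-consistency issues mentioned in the course summary.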

Implications for software: [Plot of log(sequential performance) versus year, with the historical trend flattening.] Historical single-thread performance gains came from improved clock rates and from transistors used to extract ILP. The number of transistors is still growing, but it is now delivered as additional cores and accelerators, so the software that would have used the lost sequential performance must now be written to use those cores and accelerators.

Intel Nehalem-EX (Beckton) 2 threads per core

Intel Nehalem-EX, 8-socket configuration: the interconnect is in fact a cube with added main diagonals.

8-socket 32-core AMD Barcelona: [Diagram: eight sockets, each with its own L3 cache, plus the I/O complex: PCIe, PCI, SATA, 1GbE, and a floppy disk drive.]

Cache access latency: [Table of load latencies in cycles on the 8-socket machine: an L1 hit is about 2 cycles; a local L3 hit about 75 cycles, rising to 190, 260, 332, and 369 cycles for increasingly remote sockets; intermediate cases range from 15 to 130 cycles.] Memory is a bit faster than L3, but it's complicated.

Today: Non-Uniform Memory Access (NUMA); Multicore; Performance implications; Heterogeneous manycore; Course summary

Performance implications: 1. Memory latency: how long does it take to load from / store to memory? It depends on the physical address, so high-performance code allocates memory accordingly! 2. Contention: loading a cache line from memory is slow, but so is loading it from another cache. Avoid cache-to-cache transfers wherever possible.

Memory latency: [Diagram of the 8-socket machine, each socket with its own L3 cache and local memory bank.] A thread runs on one core; memory attached to a neighbouring socket will be slower to access from that thread, and memory attached to a distant socket will be much slower. Allocate private data from the bank local to the core the thread runs on.
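As a sketch of how allocation can follow this advice on Linux (not from the slides; it assumes the libnuma library is installed, and the choice of remote node is arbitrary):

    /* Link with -lnuma. A sketch, assuming libnuma is available. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not supported on this system\n");
            return EXIT_FAILURE;
        }

        size_t sz = 64 * 1024 * 1024;

        /* Private data: allocate from the bank attached to the node we run on. */
        void *private_buf = numa_alloc_local(sz);

        /* Data known to be used mostly by another node: place it there explicitly.
         * numa_max_node() is used here only to pick some valid node number.      */
        int other_node = numa_max_node();
        void *remote_buf = numa_alloc_onnode(sz, other_node);

        /* ... use the buffers ... */

        numa_free(private_buf, sz);
        numa_free(remote_buf, sz);
        return EXIT_SUCCESS;
    }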

False sharing: a single cache line contains one field written often by thread A and another written often by thread B. For best performance on a single CPU, pack data into a single cache line: better locality, more cache coverage. For threads on different CPUs this is very bad: the cache line ping-pongs between the caches on every write!
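A minimal sketch of the problem and the usual fix (illustrative code, not from the slides; the 64-byte line size, the struct names, and the GCC/Clang aligned attribute are assumptions):

    #include <stdint.h>

    #define CACHE_LINE 64

    /* BAD: both counters share one cache line, so if thread A updates a_hits
     * and thread B updates b_hits, the line ping-pongs between their caches. */
    struct counters_bad {
        uint64_t a_hits;
        uint64_t b_hits;
    };

    /* BETTER: pad each counter out to its own cache line so each thread
     * writes to a line that no other thread touches.                    */
    struct counter_padded {
        uint64_t hits;
        char     pad[CACHE_LINE - sizeof(uint64_t)];
    } __attribute__((aligned(CACHE_LINE)));

    struct counters_good {
        struct counter_padded a;   /* written only by thread A */
        struct counter_padded b;   /* written only by thread B */
    };

The padded layout trades a little memory for eliminating the coherence traffic on every write.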

Example: There are many tricks for making things go fast on a multiprocessor, closely related to lock-free data structures, etc. Example: MCS locks, possibly the best-known locking scheme for multiprocessors. Excellent cache properties: only spin on local data, and only one processor wakes up on release().

MCS locks [Mellor-Crummey and Scott, 1991]. Problem: the cache line containing the lock is a hot spot. It is continuously invalidated as every processor tries to acquire the lock, and this dominates interconnect traffic. Solution: when acquiring, a processor enqueues itself on a list of waiting processors and spins on its own entry in the list; when releasing, only the next processor is awakened.

MCS lock pseudocode (acquire):

    struct qnode {                  // a qnode lives in the acquiring processor's local memory
        struct qnode *next;
        int locked;
    };
    typedef struct qnode *lock_t;   // the lock points to the last element of a queue
                                    // of spinning processors

    void acquire(lock_t *lock, struct qnode *local)
    {
        local->next = NULL;
        struct qnode *prev = XCHG(lock, local);  // 1. add ourselves to the end of the queue
        if (prev) {                              // 3. queue was non-empty: point the previous
            local->locked = 1;                   //    tail at us, and spin on our own entry
            prev->next = local;
            while (local->locked)
                ;                                // spin on local data only
        }
        // 2. if the queue was empty, we have the lock!
    }

MCS lock pseudocode (release):

    void release(lock_t *lock, struct qnode *local)
    {
        // 1. We have the lock. Is someone after us waiting?
        if (local->next == NULL) {
            // 3. No one visible: set the lock to NULL, unless someone appears in the meantime
            if (CAS(lock, local, NULL))
                return;
            // 4. Someone did appear: wait for them to enqueue themselves, then go to (2)
            while (local->next == NULL)
                ;   // spin
        }
        // 2. Tell the next waiter; they will do the rest (see acquire()!)
        local->next->locked = 0;
    }
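The slides treat XCHG and CAS as atomic primitives. As a sketch of how they could be expressed with standard C11 atomics (not from the slides; the type and helper names here are made up):

    #include <stdatomic.h>
    #include <stddef.h>

    struct qnode;                                /* as defined in the slides         */
    typedef _Atomic(struct qnode *) mcs_lock_t;  /* atomic pointer to the queue tail */

    /* XCHG(lock, local): atomically install our node as the tail,
     * returning the previous tail (NULL if the queue was empty).  */
    static inline struct qnode *mcs_xchg(mcs_lock_t *lock, struct qnode *local) {
        return atomic_exchange(lock, local);
    }

    /* CAS(lock, local, NULL): release the lock if we are still the tail;
     * returns non-zero on success.                                       */
    static inline int mcs_cas_release(mcs_lock_t *lock, struct qnode *local) {
        struct qnode *expected = local;
        return atomic_compare_exchange_strong(lock, &expected, NULL);
    }

In real C11 code the locked flag and the next pointer would also need to be atomic (or accessed with acquire/release ordering); the slides elide this, as does this sketch.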

MCS lock performance: [benchmark results on a 4x4-core AMD Opteron running Linux; Boyd-Wickizer et al., 2008]

Today: Non-Uniform Memory Access (NUMA); Multicore; Performance implications; Heterogeneous manycore; Course summary

The future: processors are becoming more diverse. Different numbers of cores and threads; different sizes and numbers of caches, shared in different ways between cores; different extensions to the instruction set (crypto instructions, graphics acceleration); different consistency models; and potentially incoherent memory.

Sun Niagara 2 (UltraSPARC T2) 8 threads per core

Intel Knights Ferry (now Xeon Phi)

Tilera TilePro64

Beyond Multicore: Heterogeneous Manycore. Cores will vary: some large, superscalar, out-of-order cores; many small, in-order cores (less power); some specialized (graphics, crypto, network, etc.); some with different instructions, or different instruction sets. Cores will come and go: constantly powered on or off to save energy, and it may not even be possible to run them all at the same time. Every machine will be different: hardware is now changing faster than system software, so you can't tune your OS to one type of machine any more, and caches may not be coherent any more.

Example: Intel Single-Chip Cloud Computer

24 tiles, each with 2 x P54C processors (the 100 MHz Pentium, in-order execution), 256 KB of cache per core (non-coherent, no write-allocate), 16 KB of fast SRAM (the message-passing buffer, "MPB"), an 8-port router, and configuration registers.

Our part of the picture: Barrelfish. An open-source OS written from scratch, a collaboration between ETHZ and Microsoft Research. Goals: scale to many cores, adapt to different hardware, not rely on cache coherence, and keep up with trends.

Today: Non-Uniform Memory Access (NUMA); Multicore; Performance implications; Heterogeneous manycore; Course summary

Course summary: How to program in C, a close-to-the-metal language; you can understand the disassembly the compiler produces. What hardware looks like to software, how hardware designers make it go fast, and what this means for software. What the compiler does: optimizations and transformations. How to make code correct (e.g. floating point, memory consistency, devices, etc.) and fast (caches, pipelines, vectors, DMA, etc.).

Course goals: Make you into an extraordinary programmer. Most programmers don't know this material, and it shows in their code. Truly great programmers understand all of it, and it shows in their code.