Computer Architecture Crash course


Computer Architecture Crash course
Frédéric Haziza <daz@it.uu.se>
Department of Computer Systems, Uppsala University
Summer 2008

Conclusions

The multicore era is already here, and the cost of parallelism is dropping dramatically:
- Cache per thread is dropping
- Memory bandwidth per thread is dropping

The introduction of multicore processors will have a profound impact on computational software. "For high-performance computing (HPC) applications, multicore processors introduce an additional layer of complexity, which will force users to go through a phase change in how they address parallelism, or risk being left behind." [IDC presentation on HPC at the International Supercomputing Conference, ISC07, Dresden, June 26-29, 2007]

5 OS2 08 Computer Architecture (Crash course)

Why multiple cores/threads? (even in your laptops)
- Instruction-level parallelism is running out
- Wire delay is hurting
- Power dissipation is hurting
- The memory latency/bandwidth bottleneck is becoming even worse

Focus: Performance

Create and explore:
1. Parallelism
   - Instruction-level parallelism (ILP)
   - In some cases: parallelism at the program/algorithm level ("parallel computing")
2. Locality of data
   - Caches: temporal locality
   - Cache blocks/lines: spatial locality

Old Trend 1: Deeper pipelines (exploring ILP)

Old Trend 2: Wider pipelines (exploring more ILP)

Old Trend 3: Deeper memory hierarchy (exploring locality of data)

Are we hitting the wall now?

[Figure: performance (log scale, 1-1000) over time. Moore's law continues past "now"; "business as usual" flattens out, while a possible upward path requires a paradigm shift. Transistors can still be made smaller and faster, but there are other problems.]

Classical microprocessors: whatever it takes to run one program fast

Exploring ILP (instruction-level parallelism):
- Faster clocks
- Deep pipelines
- Superscalar pipelines
- Branch prediction
- Out-of-order execution
- Trace cache
- Speculation
- Predicated execution
- Advanced load address table
- Return address stack
- ...

Bad News #1: Already explored most ILP

#2: Future technology

Bad News #2: Looong wire delays → slow CPUs

#2: How much of the chip area can you reach in one cycle?

Bad News #2: Looong wire delays → slow CPUs

Bad News #3: Memory latency/bandwidth is the bottleneck

[Figure: a CPU stalls on memory accesses for A, B, and C while computing B+C.]

Bad News #4: Power is the limit

Power consumption is the bottleneck:
- Cooling servers is hard
- Battery lifetime for mobile computers
- Energy is money

Dissipated power is proportional to frequency × voltage²

1 Why multiple threads/cores?
  Old trends
  Bad news
  Solutions?
2 Introducing concurrency
  Scenario
  Definitions
  Amdahl's law
3 Can we trust the hardware?

Bad News #1: Not enough ILP → feed one CPU with instructions from many threads

Simultaneous multithreading

Bad News #2: Wire delay → Multiple small CPUs with private L1$

Bad News #2: Wire delay → Multiple small CPUs with private L1$ (multicore processor)

[Figure: threads 1..N, each with its own PC, issue logic, pipeline stages, and registers, sharing a cache and memory.]

Bad News #3: Memory latency/bandwidth → memory accesses from many threads

[Figure: two CPUs each computing B+C, overlapping their memory accesses for A, B, and C.]

Bad News #4: Power consumption → lower the frequency → lower the voltage

P_dyn = C · f · V²   (area × frequency × voltage²)

One CPU at frequency f vs. two CPUs at frequency f/2:
  P_dyn(C, f, V) = C·f·V²
  P_dyn(2C, f/2, <V) < C·f·V²   (halving f allows a lower supply voltage)

One CPU at frequency f vs. several small CPUs at frequency f/2, each dissipating:
  P_dyn(C, f/2, <V) < ½·C·f·V²

Solving all the problems (?): Exploring thread parallelism
#1 Running out of ILP → feed one CPU with instructions from many threads
#2 Wire delay is starting to hurt → multiple small CPUs with private caches
#3 Memory is the bottleneck → memory accesses from many threads
#4 Power is the limit → lower the frequency → lower the voltage

In all computers very soon... Multicore processors, probably also with simultaneous multithreading

Scenario

Several cars want to drive from point A to point B. They can compete for space on the same road and end up either following each other or competing for positions (and having accidents!). Or they could drive in parallel lanes, arriving at about the same time without getting in each other's way. Or they could travel different routes, using separate roads.

Scenario

Several cars want to drive from point A to point B.

Sequential programming: they compete for space on the same road and end up either following each other or competing for positions (and having accidents!).
Parallel programming: they drive in parallel lanes, arriving at about the same time without getting in each other's way.
Distributed programming: they travel different routes, using separate roads.

Definitions

Concurrent program: two or more processes working together to perform a task. Each process is a sequential program (a sequence of statements executed one after another); single thread of control vs. multiple threads of control.
Communication: shared variables or message passing.
Synchronization: mutual exclusion and condition synchronization.

Correctness

Wanna write a concurrent program? What kinds of processes? How many processes? How should they interact?

Correctness: ensure that the processes' interaction is properly synchronized.
- Mutual exclusion: ensuring that critical sections of statements do not execute at the same time.
- Condition synchronization: delaying a process until a given condition is true.

Our focus: imperative programs and asynchronous execution.

Amdahl's law

P is the fraction of a calculation that can be parallelized; (1 − P) is the fraction that is sequential (i.e., cannot benefit from parallelization).

Speedup with N processors: ((1 − P) + P) / ((1 − P) + P/N) = 1 / ((1 − P) + P/N)
Maximum speedup: 1 / (1 − P)

Example: if P = 90%, the maximum speedup is 10, no matter how large the value of N (i.e., even as N → ∞).

Single-Processor machine

CPU → Level 1 cache (SRAM) → Level 2 cache (SRAM) → Memory (DRAM) → Storage

Approximate access times:
2000:  CPU 1ns,  L1 3ns,  L2 10ns,  memory 150ns,  storage 5,000,000ns
1982:  CPU 200ns,  memory 200ns,  storage 10,000,000ns

Memory Hierarchy

CPU → Level 1 cache → Level 2 cache → Main Memory

Why do we miss in the cache?
- Compulsory miss: touching the data for the first time
- Capacity miss: the cache is too small
- Conflict miss: non-ideal cache implementation (data hash to the same cache line)

[Figure: CPU → cache (hit) → main memory (miss).]

Locality
- Temporal locality
- Spatial locality

Inner loop stepping through an array: A, B, C, A+1, B, C, A+2, B, C, ...
The accesses A, A+1, A+2 exhibit spatial locality; the repeated accesses to B and C exhibit temporal locality.

MultiProcessor world - Taxonomy

SIMD: fine-grained / coarse-grained
MIMD: message passing / shared memory (UMA, NUMA, COMA)

Shared-Memory Multiprocessors

[Figure: several memories connected through an interconnection network / bus to several CPUs, each with its own cache.]

Programming Model

[Figure: many threads, each with its own cache ($), all operating on one shared memory.]

Cache coherency

[Figure: shared memory holding data A and B, with copies in several caches. One thread repeatedly reads A, another reads A and then writes A, a third reads B and then A.]

Summing up: Coherence
- There can be many copies of a datum, but only one value
- There is a single global order of value changes to each datum

Too strong!!!

Memory Ordering

Coherence defines a per-datum order of value changes; the memory model defines the order of value changes across all the data.

What ordering does the memory system guarantee? This is the contract between the HW and the SW developers. Without it, we can't say much about the result of a parallel execution.

What order for these threads? (A denotes a modified value of the datum at address A)

Thread 1: LD A, ST B, LD C, ST D, LD E, ...
Thread 2: ST A, LD B, ST C, LD D, ST E, ...

LD A happens before ST A.

Other possible orders?

Thread 1: LD A, ST B, LD C, ST D, LD E, ...
Thread 2: ST A, LD B, ST C, LD D, ST E, ...

[Figure: two different interleavings of the same two instruction streams.]

Memory model flavors
- Sequentially Consistent: programmer's intuition
- Total Store Order: almost programmer's intuition
- Weak/Release Consistency: no guarantees

Dekker's algorithm

Does the write become globally visible before the read is performed?

Initially A = 0, B = 0. After a fork:

  A := 1                          B := 1
  if (B == 0) print("A wins")     if (A == 0) print("B wins")

Can both A and B win?
Left: the read (i.e., the test B == 0) can bypass the store A := 1.
Right: the read (i.e., the test A == 0) can bypass the store B := 1.
Both loads can be performed before either of the stores. Yes, it is possible that both win!

Dekker's algorithm for Total Store Order

Does the write become globally visible before the read is performed?

Initially A = 0, B = 0. After a fork:

  A := 1                          B := 1
  Membar #StoreLoad               Membar #StoreLoad
  if (B == 0) print("A wins")     if (A == 0) print("B wins")

Can both A and B win?
Membar: the read is started only after all previous stores have been globally ordered. The machine behaves like a sequentially consistent one. No, they won't both win. Good job, Mister Programmer!

Dekker's algorithm, in general

Initially A = 0, B = 0. After a fork:

  A := 1                          B := 1
  if (B == 0) print("A wins")     if (A == 0) print("B wins")

Can both A and B win? The answer depends on the memory model. Remember?... The contract between the HW and SW developers.

So... Memory Model is a tricky issue

New issues

Besides compulsory, capacity, and conflict misses:
- Communication miss: cache-to-cache transfer
- False sharing: a side-effect of large cache lines
- What about the compiler? Code reordering? The volatile keyword in C...

[Figure: shared-memory multiprocessor with memories, interconnection network / bus, caches, and CPUs.]

Good to know
- Performance
- Use of cache
- Memory hierarchy
- Consistency problems

To get maximal performance on a given machine, the programmer has to know the characteristics of the memory system and write programs that take them into account.

Distributed Memory Architecture

[Figure: CPUs, each with its own cache and memory, connected by an interconnection network.]

- Communication through message passing
- Each node has its own cache, but memory is not shared
- No coherency problems

Isn't a CMP just an SMP on a chip?

Cost of communication?

Impact on Algorithms

For performance, we need to understand the interaction between algorithms and architecture. The rules have changed; we need to question old algorithms and results!

Criteria for algorithm design

Pre-CMP:
- Communication is expensive: minimize communication
- Data locality is important
- Maximize scalability for large-scale applications

Within a CMP chip today:
- (On-chip) communication is almost free
- Data locality is even more important (SMT may help by hiding some poor locality)
- Scalability to 2-32 threads

In a multi-CMP system tomorrow:
- Communication is sometimes almost free (on-chip), sometimes (very) expensive (between chips)
- Data locality (minimizing off-chip references) is a key to efficiency
- Hierarchical scalability