Parallelism
Marco Serafini
COMPSCI 590S, Lecture 3

Announcements
Reviews: the first paper is posted on the website. Reviews are due by this Wednesday, 11 PM (hard deadline).
Data Science Career Mixer (save the date!): November 5, 4-7 pm, Campus Center Auditorium. A recruiting and industry engagement event.

Why multi-core architectures?

Multi-Cores
We have talked about multi-core architectures. But why do we actually use multi-cores? Why not a single core?

Maximum Clock Rate Is Stagnating
Two major laws are collapsing: Moore's law and Dennard scaling.
Source: https://queue.acm.org/detail.cfm?id=2181798

Moore's Law
The density of transistors in an integrated circuit doubles every two years. Smaller transistors mean that changes propagate faster. (The slide plots transistor counts on an exponential axis.)
So far so good, but the trend is slowing down and it won't last for long (Intel's prediction: until 2021, unless new technologies arise) [1].
[1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/

Dennard Scaling
Reducing transistor size does not increase power density, so power consumption is proportional to chip area.
This stopped holding around 2006: the assumptions break when the physical system gets close to its limits.
We live in a post-Dennard-scaling world today, with huge cooling and power consumption issues. If clock frequency had kept growing at the same rate, today a CPU would have the power density of a nuclear reactor.

Heat Dissipation Problem
Large datacenters consume energy like large cities, and cooling is the main cost factor.
Examples: Google @ Columbia River valley (2006), Facebook @ Luleå (2015).

Where is Luleå?
(Map slide: Luleå is in northern Sweden, near the Arctic Circle.)

Possible Solutions
Dynamic Voltage and Frequency Scaling (DVFS), e.g. Intel's TurboBoost. It only works under low load.
Use part of the chip for coprocessors (e.g. graphics): lower power consumption, but only a limited number of generic functionalities can be offloaded.

More Solutions
Multicores: replace 1 powerful core with multiple weaker cores on a chip.
SIMD (Single Instruction Multiple Data): a massive number of cores with reduced flexibility.
FPGAs: dedicated hardware designed for a specific task.

Multi-Core Processors
Idea: scale computational power linearly. Instead of a single 5 GHz core, use 2 * 2.5 GHz cores.
Heat dissipation then also scales linearly: k cores have ~ k times the heat dissipation of a single core, whereas increasing the frequency of a single core by k times causes a superlinear increase in heat dissipation (see the back-of-the-envelope model below).
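To make the superlinearity concrete, here is a common first-order model of dynamic CMOS power; it is an assumption for illustration, not part of the slides.

```latex
\[
  P_{\mathrm{dyn}} \approx \alpha\, C\, V^2 f,
  \qquad
  V \propto f \;\Longrightarrow\; P_{\mathrm{dyn}} \propto f^3 .
\]
% alpha: activity factor, C: capacitance, V: supply voltage, f: clock frequency.
% Raising f typically requires raising V roughly in proportion, hence the cubic term.
```

Under this model, k cores at frequency f draw about k times the power of one core, while a single core at frequency k*f draws about k^3 times as much.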

Memory Bandwidth Bottleneck
Cores compete for the same main memory bus. Caches help in two ways: they reduce latency (as we have discussed), and they also increase throughput by avoiding bus contention.

How to Leverage Multicores
Run multiple tasks in parallel: multiprocessing and multithreading. For example, PCs run many parallel background apps: the OS, music, antivirus, web browser, and so on.
How to parallelize a single app is not trivial. Embarrassingly parallel tasks can be run by multiple threads with no coordination (see the sketch below).
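A minimal sketch of an embarrassingly parallel task (the example itself is assumed, not from the slides): each thread works on its own disjoint slice of an array, so the threads need no coordination beyond the final join().

```java
public class EmbarrassinglyParallel {
    public static void main(String[] args) throws InterruptedException {
        double[] v = new double[1_000_000];
        int nThreads = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[nThreads];
        int chunk = v.length / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = (t == nThreads - 1) ? v.length : lo + chunk;
            // Each thread squares only its own slice: no shared writes, no locks.
            workers[t] = new Thread(() -> {
                for (int i = lo; i < hi; i++) v[i] = v[i] * v[i];
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();  // wait for all slices to finish
    }
}
```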

SIMD Processors
Single Instruction Multiple Data (SIMD) processors. Examples: Graphical Processing Units (GPUs), Intel Phi coprocessors.
Q: Can these snippets run as SIMD?

for i in [0, n-1] do
    v[i] = v[i] * pi

for i in [0, n-1] do
    if v[i] < 0.01 then v[i] = 0
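For concreteness, here are the two snippets as plain Java loops (this rendering is an assumption): every iteration is independent of the others, which is exactly the data-parallel pattern a SIMD unit can execute as one instruction over many elements.

```java
public class SimdStyleLoops {
    public static void main(String[] args) {
        int n = 1024;
        double[] v = new double[n];
        // Snippet 1: the same multiplication applied to every element.
        for (int i = 0; i < n; i++) {
            v[i] = v[i] * Math.PI;
        }
        // Snippet 2: also SIMD-friendly; the branch becomes predication/masking.
        for (int i = 0; i < n; i++) {
            if (v[i] < 0.01) v[i] = 0;
        }
    }
}
```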

Automatic Parallelization?
The holy grail of the multi-processor era. Approaches: programming languages, systems with APIs that help express parallelism, and efficient coordination mechanisms.

Processes vs. Threads

Processes & Threads
We have discussed that multicores are the future. How do we make use of parallelism? Through OS/PL support for parallel programming: processes and threads.

Processes vs. Threads
A process has a separate memory space; threads share their memory space (except for the stack).

                          Processes     Threads
Heap                      not shared    shared
Global variables          not shared    shared
Local variables (stack)   not shared    not shared
Code                      shared        shared
File handles              not shared    shared

Parallel Programming
Shared memory (threads): threads access the same memory locations (in the heap and in global variables).
Message passing (processes): explicit communication by exchanging messages.

Shared Memory

Shared Memory Example

void main() {
    x = 12;            // assume that x is a global variable
    t = new ThreadX();
    t.start();         // starts thread t
    y = 12 / x;
    System.out.println(y);
    t.join();          // wait until t completes
}

class ThreadX extends Thread {
    void run() {
        x = 0;
    }
}

This is pseudo-Java; in C++ you would use pthread_create and pthread_join.
Question: What is printed as output?
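Here is a runnable Java rendering of the pseudo-code (the class name and the use of a lambda are assumptions). It also makes the answer concrete: the result depends on the interleaving of the two threads.

```java
public class SharedMemoryExample {
    static int x;  // shared global variable

    public static void main(String[] args) throws InterruptedException {
        x = 12;
        Thread t = new Thread(() -> x = 0);  // plays the role of ThreadX
        t.start();
        // If this division runs before t's assignment, the program prints 1;
        // if t runs first, 12 / x throws ArithmeticException (division by zero).
        int y = 12 / x;
        System.out.println(y);
        t.join();  // wait until t completes
    }
}
```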

Desired: Atomicity

void foo() {
    x = 0;
    x = 1;
    y = 1 / x;
}

Threads a and b both call foo. foo should be atomic, in the sense of indivisible (from the ancient Greek).
Desired: the two calls happen one after the other. Thread a runs x = 0, x = 1, y = 1; a happens-before relationship then makes its changes visible to thread b, which runs the same sequence.
Possible without atomicity: thread a runs x = 0 and x = 1, thread b's x = 0 slips in between, and thread a then computes y = 1/0, a division by zero.

Race Condition
Non-deterministic access to shared variables: correctness requires a specific sequence of accesses, but we cannot rely on it because of non-determinism!
Solution: enforce a specific order using synchronization, i.e., enforce a sequence of happens-before relationships.
Locks, mutexes, semaphores: threads block each other.
Lock-free algorithms: threads do not wait for each other, but they are hard to implement correctly!
The typical programmer uses locks. Java also offers optimized thread-safe data structures, e.g., ConcurrentHashMap (see the sketch below).
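A small sketch of the ConcurrentHashMap approach mentioned above (the word-count use case is an assumption): merge() performs the read-modify-write atomically, so no explicit lock is needed.

```java
import java.util.concurrent.ConcurrentHashMap;

public class WordCount {
    private final ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();

    void record(String word) {
        counts.merge(word, 1, Integer::sum);  // atomic update: no race condition
    }

    public static void main(String[] args) {
        WordCount wc = new WordCount();
        wc.record("lock");
        wc.record("lock");
        System.out.println(wc.counts);  // {lock=2}, even with concurrent callers
    }
}
```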

Locks

void foo() {
    x = 0;
    x++;
    y = 1 / x;
}

We declare a lock variable l and use it to synchronize: each thread calls l.lock() before foo() and l.unlock() after it. Equivalent: declare foo as synchronized (synchronized void foo()).
Impossible now: the interleaving where thread b's x = 0 slips between thread a's statements.
Possible: thread a calls l.lock() and runs foo(); thread b's l.lock() waits; once thread a calls l.unlock(), thread b acquires the lock and runs foo().
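A runnable version using java.util.concurrent.locks (the class structure is an assumption): the lock/unlock pair brackets foo() exactly as in the slide's schedule.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockedFoo {
    private final ReentrantLock l = new ReentrantLock();
    private int x, y;

    void foo() {
        l.lock();        // at most one thread can be inside at a time
        try {
            x = 0;
            x++;
            y = 1 / x;   // x is guaranteed to be 1 here: no division by zero
        } finally {
            l.unlock();  // always release, even if an exception is thrown
        }
    }
}
```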

Deadlock
Thread a: l1.lock(); l2.lock(); foo(); l1.unlock(); l2.unlock();
Thread b: l2.lock(); l1.lock(); foo(); l2.unlock(); l1.unlock();
Question: What can go wrong?

Requirements for a Deadlock
Mutual exclusion: resources (locks) are held and non-shareable.
Hold and wait: a thread holds one resource while requesting another.
No preemption: a lock can only be released by the thread holding it.
Circular wait: a chain of threads, each waiting for the next.
Question: A simple solution? Have all threads acquire locks in the same order (see the sketch below).
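A minimal sketch of that fix (the class is an assumption): every thread acquires l1 before l2, so no circular wait can form.

```java
import java.util.concurrent.locks.ReentrantLock;

public class OrderedLocking {
    static final ReentrantLock l1 = new ReentrantLock();
    static final ReentrantLock l2 = new ReentrantLock();

    // Both threads a and b call this same method: with a single global
    // lock order (l1 before l2), a cycle of waiting threads is impossible.
    static void foo() {
        l1.lock();
        l2.lock();
        try {
            // ... critical section ...
        } finally {
            l2.unlock();  // release in reverse acquisition order
            l1.unlock();
        }
    }
}
```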

Notify / Wait
Thread a:
    synchronized (o) {
        o.wait();
        foo();
    }
Thread b:
    synchronized (o) {
        foo();
        o.notify();
    }
Calling notify on an object sends a signal that wakes up other threads waiting on that object: thread a blocks in o.wait() until thread b calls o.notify(). This code guarantees that thread b executes foo before thread a does.
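One caveat worth making explicit (a standard refinement, not on the slide): if thread b runs first, its notify() is lost and thread a waits forever. The usual idiom guards the wait with a condition flag and a loop, which also protects against spurious wakeups.

```java
public class NotifyWait {
    private final Object o = new Object();
    private boolean done = false;  // condition guarded by the lock on o

    void threadA() throws InterruptedException {
        synchronized (o) {
            while (!done) o.wait();  // loop: re-check the condition after waking up
            foo();
        }
    }

    void threadB() {
        synchronized (o) {
            foo();
            done = true;
            o.notify();  // wake a thread waiting on o, if there is one
        }
    }

    private void foo() { /* ... */ }

    public static void main(String[] args) {
        NotifyWait nw = new NotifyWait();
        new Thread(() -> {
            try { nw.threadA(); } catch (InterruptedException ignored) { }
        }).start();
        new Thread(nw::threadB).start();
    }
}
```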

What About Cache Coherency?
Cache coherency ensures atomicity only for single instructions on single cache lines. In reality, different variables may reside on different cache lines, and a variable may be accessed across multiple instructions, because a single high-level instruction may compile to multiple low-level ones. Example: a++ in C may compile to load(a, r0); r0 = r0 + 1; store(r0, a). That is why we need locks.
Main lesson learned from the cache coherency discussion: you should partition data.
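The same read-modify-write problem shows up in Java (this demo is an assumption, not from the slides): plain++ expands to a separate read, add, and write, so concurrent increments can be lost, whereas AtomicInteger performs the whole update as one atomic operation.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LostUpdates {
    static int plain = 0;                                     // racy counter
    static final AtomicInteger atomic = new AtomicInteger();  // safe counter

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                plain++;                   // read + add + write: not atomic
                atomic.incrementAndGet();  // single atomic read-modify-write
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // plain is typically below 200000 (lost updates); atomic is exactly 200000
        System.out.println(plain + " vs " + atomic.get());
    }
}
```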

Challenges with Multi-Threading
Correctness: Heisenbugs are non-deterministic bugs that appear only under certain conditions. Hard to reproduce, hence hard to debug.
Performance: understanding concurrency bottlenecks is hard! Waiting time does not show up in profilers (only CPU time does).
Load balance: make sure all cores work all the time and do not wait.

Critical Path
Start multiple threads, each executing one step at a time, and wait for all of them to complete (a barrier). Coordination at the barrier makes load balancing harder.
The critical path is the maximum sequential path: in the slide's example, thread t1 takes 10 steps while the other threads (t2, t3) finish earlier, so the barrier adds up to 9 extra steps of waiting.

Message Passing

Message Passing
Processes communicate by exchanging messages through sockets, which are communication endpoints.
On a network: UDP sockets, TCP sockets.
Internal to a node: Inter-Process Communication (IPC).
Different technologies, but similar abstractions.

Building a Message
Serialization: message content is stored at random locations in RAM and needs to be packed into a byte array to be sent.
Deserialization: receive the byte array and rebuild the original variable.
Pointers do not make sense across nodes anymore!

Example: Serializing a Binary Tree
Consider a tree with root 10, left child 5, and right child 12, where all four grandchildren are null.
Question: How do we serialize it?
Possible solution: DFS, marking null pointers with -1. This yields: 10 5 -1 -1 12 -1 -1.
How do we deserialize? See the sketch below.
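A minimal implementation of the DFS scheme described above (class and method names are assumptions; since -1 marks null, node values must be non-negative):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class TreeNode {
    int val;
    TreeNode left, right;
    TreeNode(int val) { this.val = val; }
}

public class TreeCodec {
    // Pre-order DFS: node value, then left subtree, then right subtree.
    static void serialize(TreeNode n, List<Integer> out) {
        if (n == null) { out.add(-1); return; }  // mark null pointers with -1
        out.add(n.val);
        serialize(n.left, out);
        serialize(n.right, out);
    }

    // Consume the values in the same pre-order to rebuild the tree.
    static TreeNode deserialize(Iterator<Integer> it) {
        int v = it.next();
        if (v == -1) return null;
        TreeNode n = new TreeNode(v);
        n.left = deserialize(it);
        n.right = deserialize(it);
        return n;
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(10);
        root.left = new TreeNode(5);
        root.right = new TreeNode(12);
        List<Integer> flat = new ArrayList<>();
        serialize(root, flat);
        System.out.println(flat);  // [10, 5, -1, -1, 12, -1, -1]
        TreeNode rebuilt = deserialize(flat.iterator());  // same shape as root
    }
}
```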

Threads + Message Passing
Client-server model: clients send requests; the server computes replies and sends them back.
Threads are often used to hide latency: each client request is handled by a thread. While one request waits for resources (e.g. I/O), other threads execute other requests in the meanwhile (see the sketch below).
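A minimal thread-per-request echo server (the port and the echo behavior are assumptions): each accepted connection gets its own thread, so a request blocked on I/O does not stall the others.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class EchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                Socket client = server.accept();           // one socket per client
                new Thread(() -> handle(client)).start();  // thread hides I/O latency
            }
        }
    }

    static void handle(Socket s) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null) out.println(line);  // echo back
        } catch (IOException e) {
            // client disconnected
        }
    }
}
```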

Processes in Different Languages
Java (interpreted): the Java Virtual Machine (the interpreter) is a process, so creating a new process entails creating a new JVM, via ProcessBuilder.
C/C++ (compiled): how processes are created is OS-specific. The typical call is fork(), which creates a child process that resumes execution at the instruction after fork(). The child process is a full copy of the parent. More on forking later.
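A minimal ProcessBuilder sketch (the command is an assumption and requires a Unix-like system): it launches a child process, which on the JVM side means starting a separate program.

```java
import java.io.IOException;

public class Spawn {
    public static void main(String[] args) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("echo", "hello from a child process")
                .inheritIO()     // child shares this process's stdin/stdout/stderr
                .start();        // creates the new OS process
        int exit = p.waitFor();  // block until the child terminates
        System.out.println("child exited with " + exit);
    }
}
```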