Parallelism Marco Serafini COMPSCI 590S Lecture 3
Announcements
Reviews: first paper posted on the website; review due by this Wednesday 11 PM (hard deadline).
Data Science Career Mixer (save the date!): November 5, 4-7 pm, Campus Center Auditorium. Recruiting and industry engagement event.
Why multi-core architectures?
Multi-Cores
We have talked about multi-core architectures. Why do we actually use multi-cores? Why not a single core?
Maximum Clock Rate is Stagnating
Two major laws are collapsing: Moore's law and Dennard scaling.
Source: https://queue.acm.org/detail.cfm?id=2181798
Moore's Law
Density of transistors in an integrated circuit doubles every two years. Smaller transistors → changes propagate faster. (Note the exponential axis.) So far so good, but the trend is slowing down and it won't last for long (Intel's prediction: until 2021, unless new technologies arise) [1].
[1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/
Dennard Scaling
Reducing transistor size does not increase power density → power consumption proportional to chip area. Stopped holding around 2006: the assumptions break when the physical system is close to its limits. In the post-Dennard-scaling world of today there are huge cooling and power consumption issues. If we had kept the same clock frequency trends, today a CPU would have the power density of a nuclear reactor.
Heat Dissipation Problem
Large datacenters consume energy like large cities, and cooling is the main cost factor. Google @ Columbia River valley (2006). Facebook @ Luleå (2015).
Where is Luleå?
Possible Solutions
Dynamic Voltage and Frequency Scaling (DVFS), e.g. Intel's Turbo Boost: only works under low load. Use part of the chip for coprocessors (e.g. graphics): lower power consumption, but only a limited number of generic functionalities can be offloaded.
More Solutions
Multicores: replace 1 powerful core with multiple weaker cores on a chip. SIMD (Single Instruction Multiple Data): a massive number of cores with reduced flexibility. FPGAs: dedicated hardware designed for a specific task.
Multi-Core Processors
Idea: scale computational power linearly. Instead of a single 5 GHz core, use 2 * 2.5 GHz cores. Heat dissipation also scales linearly: k cores have ~ k times the heat dissipation of a single core, whereas increasing the frequency of a single core by k times creates a superlinear increase in heat dissipation.
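The superlinear effect can be sketched with the standard first-order dynamic-power model (a back-of-the-envelope argument, not from the slides):

```latex
P \approx C V^2 f
\qquad \text{($C$ = switched capacitance, $V$ = supply voltage, $f$ = clock frequency)}
```

Since the maximum stable frequency grows roughly with the supply voltage ($f \propto V$), raising $f$ by a factor $k$ requires raising $V$ by roughly $k$ as well, so $P \propto f^3$: one core at frequency $kf$ dissipates $\approx k^3 P$, while $k$ cores at frequency $f$ dissipate only $\approx kP$.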
Memory Bandwidth Bottleneck
Cores compete for the same main memory bus. Caches help in two ways: they reduce latency (as we have discussed), and they also increase throughput by avoiding bus contention.
How to Leverage Multicores
Run multiple tasks in parallel: multiprocessing and multithreading. E.g. PCs run many background apps in parallel: OS, music, antivirus, web browser, etc. How to parallelize a single app is not trivial. Embarrassingly parallel tasks can be run by multiple threads with no coordination.
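As a concrete illustration, an embarrassingly parallel loop can be expressed with Java's parallel streams (a sketch; the class and method names are made up for the example):

```java
import java.util.stream.IntStream;

public class EmbarrassinglyParallel {
    // Each element is processed independently: no coordination is needed,
    // so the runtime can split the index range across all available cores.
    static long sumOfSquares(int n) {
        return IntStream.range(0, n)
                        .parallel()                    // fork-join across cores
                        .mapToLong(i -> (long) i * i)
                        .sum();                        // associative combine step
    }

    public static void main(String[] args) {
        // Produces the same result as the sequential loop, regardless of core count.
        System.out.println(sumOfSquares(1000));
    }
}
```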
SIMD Processors
Single Instruction Multiple Data (SIMD) processors. Examples: Graphics Processing Units (GPUs), Intel Phi coprocessors.
Q: Possible SIMD snippets?
for i in [0, n-1] do v[i] = v[i] * pi
for i in [0, n-1] do if v[i] < 0.01 then v[i] = 0
Automatic Parallelization?
The holy grail of the multi-processor era. Approaches: programming languages, systems with APIs that help express parallelism, efficient coordination mechanisms.
Processes vs. Threads
Processes & Threads
We have discussed that multi-cores are the future. How do we make use of parallelism? Through OS/PL support for parallel programming: processes and threads.
Processes vs. Threads
Process: separate memory space. Thread: shared memory space (except stack).

                          Processes     Threads
Heap                      not shared    shared
Global variables          not shared    shared
Local variables (stack)   not shared    not shared
Code                      shared        shared
File handles              not shared    shared
Parallel Programming
Shared memory (threads): access the same memory locations (in the heap and global variables). Message passing (processes): explicit communication by exchanging messages.
Shared Memory
Shared Memory Example

void main() {
    x = 12;              // assume that x is a global variable
    t = new ThreadX();
    t.start();           // starts thread t
    y = 12 / x;
    System.out.println(y);
    t.join();            // wait until t completes
}

class ThreadX extends Thread {
    void run() {
        x = 0;
    }
}

This is pseudo-Java; in C++: pthread_create, pthread_join.
Question: What is printed as output?
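A runnable version of the pseudo-code above (the class name SharedX and the volatile keyword are additions for the example; volatile only makes the other thread's write visible, it does not remove the race):

```java
public class SharedX {
    static volatile int x;   // shared global variable

    // Returns 1 if the main thread won the race, -1 if it lost
    // and the division by zero threw.
    static int race() throws InterruptedException {
        x = 12;
        Thread t = new Thread(() -> x = 0);   // plays the role of ThreadX
        t.start();
        int y;
        try {
            y = 12 / x;        // may see x == 12 (y = 1) or x == 0 (exception)
        } catch (ArithmeticException e) {
            y = -1;
        }
        t.join();              // wait until t completes
        return y;
    }

    public static void main(String[] args) throws InterruptedException {
        int y = race();
        // Which outcome occurs is non-deterministic.
        System.out.println(y == 1 ? "printed 1" : "division by zero");
    }
}
```

Both outcomes are legal executions of the original example, which is exactly what the slide's question is probing.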
Desired: Atomicity
Thread a and thread b both call foo():

void foo() {
    x = 0;
    x = 1;
    y = 1 / x;
}

foo should be atomic, in the sense of indivisible (from the ancient Greek).
Desired: thread a runs x = 0, x = 1, y = 1, and only then (happens-before: its changes become visible) thread b runs its three steps.
Possible without atomicity: thread a runs x = 0 and x = 1; thread b's x = 0 becomes visible; thread a then computes y = 1/0.
Race Condition
Non-deterministic access to shared variables. Correctness requires a specific sequence of accesses, but we cannot rely on it because of non-determinism! Solution: enforce a specific order using synchronization, i.e., enforce a sequence of happens-before relationships. Locks, mutexes, semaphores: threads block each other. Lock-free algorithms: threads do not wait for each other, but these are hard to implement correctly! The typical programmer uses locks. Java also provides optimized thread-safe data structures, e.g., ConcurrentHashMap.
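A minimal sketch of a race and its fix (class name and iteration counts are illustrative): two threads increment both a plain int, where increments can be lost, and an AtomicInteger, where they cannot:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RaceDemo {
    static int plain = 0;                                  // unsynchronized shared counter
    static final AtomicInteger atomic = new AtomicInteger();

    static void run(int perThread) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                plain++;                  // load-increment-store: not atomic, updates can be lost
                atomic.incrementAndGet(); // atomic read-modify-write: no lost updates
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();               // join makes both counters safely readable here
    }

    public static void main(String[] args) throws InterruptedException {
        run(100_000);
        // atomic is always 200000; plain is often smaller because of the race
        System.out.println("plain = " + plain + ", atomic = " + atomic.get());
    }
}
```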
Locks
Thread a: l.lock(); foo(); l.unlock();
Thread b: l.lock(); foo(); l.unlock();

void foo() {
    x = 0;
    x++;
    y = 1 / x;
}

We use a lock variable l to synchronize. Equivalent in Java: declare synchronized void foo().
The bad interleaving is now impossible. Possible execution: thread a calls l.lock() and runs foo(); thread b's l.lock() waits; thread a calls l.unlock(); thread b acquires the lock, runs foo(), and calls l.unlock().
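A runnable sketch of the locked foo (the class name and lock object are illustrative). Because the whole body runs under the lock, no other thread's x = 0 can slip in between x++ and the division:

```java
public class LockedFoo {
    static int x, y;
    static final Object l = new Object();   // the lock variable

    static void foo() {
        synchronized (l) {     // plays the role of l.lock() ... l.unlock()
            x = 0;
            x++;
            y = 1 / x;         // x is guaranteed to be 1 here: never divides by zero
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(LockedFoo::foo), b = new Thread(LockedFoo::foo);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(y);   // always 1
    }
}
```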
Deadlock
Thread a: l1.lock(); l2.lock(); foo(); l1.unlock(); l2.unlock();
Thread b: l2.lock(); l1.lock(); foo(); l2.unlock(); l1.unlock();
Question: What can go wrong?
Requirements for a Deadlock
Mutual exclusion: resources (locks) are held and non-shareable. Hold and wait: hold a resource and request another. No preemption: a lock can be released only by its holder. Circular wait: a chain of threads waiting for each other.
Question: Simple solution? All threads acquire locks in the same order.
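The "acquire locks in the same order" fix can be sketched with a hypothetical bank-transfer example (not from the slides): every thread locks the account with the smaller id first, which breaks the circular-wait condition:

```java
public class OrderedLocks {
    static class Account {
        final int id;            // global ordering key for lock acquisition
        int balance;
        Account(int id, int balance) { this.id = id; this.balance = balance; }
    }

    // Always lock the account with the smaller id first: no matter which
    // direction the transfer goes, every thread acquires locks in the same
    // global order, so no circular wait (and hence no deadlock) can form.
    static void transfer(Account from, Account to, int amount) {
        Account first  = from.id < to.id ? from : to;
        Account second = first == from ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance   += amount;
            }
        }
    }

    static int[] demo() throws InterruptedException {
        Account a = new Account(1, 100), b = new Account(2, 100);
        // Opposite-direction transfers: with naive lock order this could deadlock.
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) transfer(b, a, 1); });
        t1.start(); t2.start();
        t1.join(); t2.join();
        return new int[]{a.balance, b.balance};
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = demo();
        System.out.println(r[0] + " " + r[1]);   // 100 100: no deadlock, no lost updates
    }
}
```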
Notify / Wait
Thread a: synchronized(o){ o.wait(); foo(); }
Thread b: synchronized(o){ foo(); o.notify(); }

notify on an object sends a signal that wakes up threads waiting on that object: thread a blocks in o.wait() (releasing the lock on o); thread b runs foo() and calls o.notify(), which wakes thread a. This code guarantees that thread b executes foo before thread a (provided thread a calls wait before thread b calls notify).
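A runnable sketch (class name illustrative). It adds a condition flag checked in a while loop, the standard guard against a lost wakeup (notify firing before wait is called) and against spurious wakeups, so the ordering guarantee holds unconditionally:

```java
public class NotifyWait {
    static final Object o = new Object();
    static boolean bDone = false;                    // condition flag guarded by o
    static final StringBuilder order = new StringBuilder();

    static String demo() throws InterruptedException {
        Thread a = new Thread(() -> {
            synchronized (o) {
                while (!bDone) {                     // guard: re-check after every wakeup
                    try { o.wait(); }                // releases the lock on o while blocked
                    catch (InterruptedException e) { return; }
                }
                order.append("a");                   // thread a's foo()
            }
        });
        Thread b = new Thread(() -> {
            synchronized (o) {
                order.append("b");                   // thread b's foo()
                bDone = true;
                o.notify();                          // wake thread a if it is waiting
            }
        });
        a.start(); b.start();
        a.join(); b.join();
        return order.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());   // always "ba": b's foo runs before a's
    }
}
```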
What About Cache Coherency?
Cache coherency ensures atomicity only for single instructions on single cache lines. In reality, different variables may reside on different cache lines, and a variable may be accessed across multiple instructions: a single high-level instruction may compile to multiple low-level ones. Example: a++ in C may compile to load(a, r0); r0 = r0 + 1; store(r0, a). That's why we need locks. Main lesson learned from the cache coherency discussion: you should partition data.
Challenges with Multi-Threading
Correctness: Heisenbugs, non-deterministic bugs that appear only under certain conditions. Hard to reproduce → hard to debug. Performance: understanding concurrency bottlenecks is hard! Waiting time does not show up in profilers (only CPU time). Load balance: make sure all cores work all the time and do not wait.
Critical Path
Start multiple threads (t1, t2, t3), one step each, then wait for all threads to complete (barrier). Coordination (the barrier) makes load balancing harder. Critical path: the maximum sequential path (here thread t1, 10 steps); the other threads wait 9 extra steps at the barrier.
Message Passing
Message Passing
Processes communicate by exchanging messages. Sockets: communication endpoints. On a network: UDP sockets, TCP sockets. Internal to a node: Inter-Process Communication (IPC). Different technologies, but similar abstractions.
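A minimal sketch of socket-based message passing (here two threads in one JVM stand in for two processes, and all communication goes through a loopback TCP socket rather than shared memory; names are illustrative):

```java
import java.io.*;
import java.net.*;

public class LocalEcho {
    static String demo() throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {   // port 0: any free port
            // "Server" endpoint: accept one connection, echo one line back.
            Thread echo = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println(in.readLine());             // explicit receive + send
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            echo.start();
            // "Client" endpoint: send a message, wait for the reply.
            try (Socket s = new Socket("localhost", server.getLocalPort());
                 PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream()))) {
                out.println("hello");                       // explicit send
                String reply = in.readLine();               // explicit receive
                echo.join();
                return reply;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```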
Building a Message
Serialization: message content is stored at scattered locations in RAM, so it needs to be packed into a byte array to be sent. Deserialization: receive the byte array and rebuild the original variables. Pointers do not make sense across nodes anymore!
Example: Serializing a Binary Tree
Tree: root 10, with left child 5 and right child 12 (both leaves, all their children null).
Question: How to serialize it? Possible solution: DFS, marking null pointers with -1: 10 5 -1 -1 12 -1 -1. How to deserialize?
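A sketch of the DFS serialization with -1 as the null marker (the Node class and the use of an integer list instead of a raw byte array are simplifications for the example; it assumes -1 never occurs as a real key):

```java
import java.util.*;

public class TreeSerde {
    static class Node {
        int val;
        Node left, right;
        Node(int v) { val = v; }
    }

    // Pre-order DFS: emit the node's value, then its left and right subtrees;
    // a null child is encoded as the sentinel -1.
    static void serialize(Node n, List<Integer> out) {
        if (n == null) { out.add(-1); return; }
        out.add(n.val);
        serialize(n.left, out);
        serialize(n.right, out);
    }

    // Consume the values in the same pre-order: pointers are rebuilt on the
    // receiving side, never sent (they would be meaningless across nodes).
    static Node deserialize(Iterator<Integer> it) {
        int v = it.next();
        if (v == -1) return null;
        Node n = new Node(v);
        n.left = deserialize(it);
        n.right = deserialize(it);
        return n;
    }

    public static void main(String[] args) {
        Node root = new Node(10);
        root.left = new Node(5);
        root.right = new Node(12);
        List<Integer> wire = new ArrayList<>();
        serialize(root, wire);
        System.out.println(wire);   // [10, 5, -1, -1, 12, -1, -1]
        Node copy = deserialize(wire.iterator());
        System.out.println(copy.right.val);   // 12
    }
}
```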
Threads + Message Passing
Client-server model: the client sends requests; the server computes replies and sends them back. Threads are often used to hide latency: each client request is handled by a thread, and while a request waits for resources (e.g. I/O), other threads execute other requests in the meantime.
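A sketch of latency hiding with a pool of request-handler threads (the sleep stands in for an I/O wait; the class name, pool size, and request count are illustrative):

```java
import java.util.concurrent.*;

public class RequestPool {
    // Handle `requests` requests, each blocking ~100 ms on simulated I/O.
    // With a pool of workers, requests waiting on I/O do not block the others,
    // so total time is far below requests * 100 ms.
    static long handle(int requests, int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch done = new CountDownLatch(requests);
        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            pool.submit(() -> {
                try { Thread.sleep(100); }           // simulated I/O wait
                catch (InterruptedException ignored) { }
                done.countDown();
            });
        }
        done.await();                                 // wait for all replies
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;   // elapsed ms
    }

    public static void main(String[] args) throws Exception {
        // 8 requests on 8 threads: roughly 100 ms total instead of ~800 ms sequentially.
        System.out.println(handle(8, 8) + " ms");
    }
}
```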
Processes in Different Languages
Java (interpreted): the Java Virtual Machine (interpreter) is a process, so creating a new process entails creating a new JVM (ProcessBuilder). C/C++ (compiled): the details of how processes are created are OS-specific. Typical call: fork(), which creates a child process that executes the instructions after fork(). The child process is a full copy of the parent. More on forking later.
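A minimal ProcessBuilder sketch (it assumes a POSIX shell, sh, is available on the machine; the class name is illustrative):

```java
import java.io.*;

public class Spawn {
    // Spawn a child process and read one line from its stdout.
    static String runEcho() throws Exception {
        ProcessBuilder pb = new ProcessBuilder("sh", "-c", "echo hello from child");
        pb.redirectErrorStream(true);       // merge stderr into stdout
        Process p = pb.start();             // creates the child process
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine();
            p.waitFor();                    // like join(), but for a process
            return line;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runEcho());
    }
}
```

Note that the child is a separate process with its own memory space: the only way to get data back is through an explicit channel, here the child's stdout.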