CS377P Programming for Performance
Multicore Performance: Multithreading
Sreepathi Pai, UTCS
October 14, 2015
Outline
1. Multiprocessor Systems
2. Programming Models for Multicore
3. Multithreading and POSIX Threads
Multiprocessing: The Programmer View
- Multiple processors
- One operating system
- One memory address space
- In contrast, distributed systems have:
  - Multiple operating systems (not necessary)
  - Multiple address spaces (necessary)
Symmetric Multiprocessing (SMP)
[Figure: two sockets, each holding one processor (μp), connected to memory over a shared system bus]
- Multiple, identical processors on the same mainboard
- 1 socket per processor
- (Optional) interconnect between processors
- x86 example: Pentium (Pro)
Simultaneous Multithreading (SMT)
- Observation: not enough instruction-level parallelism (ILP) to keep wide out-of-order (OoO) pipelines busy
- Where can we find extra ILP?
- Reference: Dean Tullsen, Susan Eggers, Henry Levy, Simultaneous Multithreading, ISCA 1998
Simultaneous Multithreading (SMT)
- Idea: let multiple programs feed the pipeline
  - Their instructions are guaranteed to be independent
  - Context switch (not really) in hardware
- Usually, the front-end and back-end are separated
  - I.e., they know instructions are from different programs
- Functional units can be shared
  - These do not need to know about separate programs
Chip Multiprocessing (CMP)
[Figure: one socket holding a processor (μp) with two cores, connected to memory over the system bus]
- Moore's Law delivers transistors for free
- What to do with all these transistors?
  - More complex out-of-order cores?
    - ILP is limited by the program
    - Scalability and power limits
- Solution: duplicate cores
Modern Multicores
[Figure: two sockets/packages, each a processor (μp) with two cores sharing a last-level cache (LLC), and each attached to its own memory]
Multicore Layout
- Near (local) and far (remote):
  - Cores
  - Memories (still one address space!)
- Private and shared resources
  - L1/L2 caches are usually private
  - LLC is usually shared
- Transparent to programs
  - Not to performance!
Performance Implications
- What if the data you're addressing is in:
  - Remote memory vs. local memory
  - Local shared cache vs. remote shared cache
  - Local private cache vs. on-package private cache vs. remote private cache
- Producer/consumer:
  - On SMT threads
  - In different cores, but on the same socket
  - Across sockets
Nomenclature Test
- How many threads? A quad-socket, dual-threaded, decacore machine.
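Reading the machine name term by term gives the count: sockets times cores per socket times SMT threads per core. A minimal sketch of that arithmetic follows; the helper name `hardware_threads` is made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper: total hardware threads exposed by a machine.
 * "Quad-socket, dual-threaded, decacore" maps to (4, 10, 2). */
int hardware_threads(int sockets, int cores_per_socket, int threads_per_core)
{
    assert(sockets > 0 && cores_per_socket > 0 && threads_per_core > 0);
    return sockets * cores_per_socket * threads_per_core;
}
```

For the machine in the question, `hardware_threads(4, 10, 2)` evaluates to 80.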
The Free Lunch is Over
- Reference: Herb Sutter, The Free Lunch is Over
The Free Lunch is Over: Implications
- In the 1990s, you could wait 18 months for program performance to double
  - But: "What Intel giveth, Microsoft taketh away"
  - Extra effort for performance was usually not worth it
  - Just wait!
  - This class would have been uninteresting in the 1990s
- Performance is no longer free
  - Single-thread performance has stagnated
  - Programmers must work to get performance
Very Simplified Programmer's View of Multicore
[Figure: five cores, each with its own program counter (PC)]
- Multiple program counters (PC)
  - MIMD machine
- To what do we set these PCs?
- Can the hardware do this automatically for us?
Automatic Parallelization to the Rescue?
    for (i = 0; i < N; i++) {
        for (j = 0; j < i; j++) {
            // something(i, j)
        }
    }
- Assume a stream of instructions from a single-threaded program
- How do we split this stream into pieces?
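One manual answer to the splitting question is to hand whole outer-loop iterations to threads. Because row `i` of this loop nest runs the inner loop `i` times, the way rows are assigned determines how evenly the work is spread. The sketch below only counts inner iterations per thread under two common assignments; the function names `block_work` and `cyclic_work` are illustrative, not from the slides.

```c
#include <assert.h>

/* Inner-loop iterations executed by thread t when the outer loop
 * (i = 0 .. n-1) is cut into nthreads contiguous blocks. */
long block_work(long n, int nthreads, int t)
{
    assert(n >= 0 && nthreads > 0 && t >= 0 && t < nthreads);
    long lo = t * n / nthreads;
    long hi = (t + 1) * n / nthreads;
    long work = 0;
    for (long i = lo; i < hi; i++)
        work += i;              /* row i runs the inner loop i times */
    return work;
}

/* Same count when rows are dealt out round-robin: t, t + nthreads, ... */
long cyclic_work(long n, int nthreads, int t)
{
    assert(n >= 0 && nthreads > 0 && t >= 0 && t < nthreads);
    long work = 0;
    for (long i = t; i < n; i += nthreads)
        work += i;
    return work;
}
```

With N = 8 and two threads, the block split yields 6 and 22 inner iterations, while the cyclic split yields 12 and 16. Both assignments cover the same 28 iterations, but the block split leaves one thread with most of the work.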
Thread-Level Parallelism
- Break the stream into long continuous streams of instructions
  - Much bigger than the issue window on superscalars
  - 8 instructions vs. hundreds
- Streams are largely independent
- Best performance on current hardware
[Figure: Thread-Level Parallelism, ILP, DLP, MLP]
Parallelization Issues
- Assume we have a parallel algorithm
- Work distribution
  - How to split up the work to be performed among threads?
- Communication
  - How to send and receive data between threads?
- Synchronization
  - How to coordinate different threads?
  - A form of communication
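The three issues can be seen together in a small parallel sum. This is a sketch under stated assumptions, not a tuned implementation: `parallel_sum`, `struct slice`, and `MAX_THREADS` are names invented here, and the slicing scheme is just one possible choice.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16   /* arbitrary cap for this sketch */

struct slice {
    const int *data;     /* shared input, only read */
    size_t begin, end;   /* this thread's half-open range */
    long partial;        /* result slot written only by this thread */
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    long acc = 0;
    for (size_t i = s->begin; i < s->end; i++)
        acc += s->data[i];
    s->partial = acc;    /* communication: publish through shared memory */
    return NULL;
}

long parallel_sum(const int *data, size_t n, int nthreads)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    struct slice part[MAX_THREADS];

    /* Work distribution: contiguous, near-equal slices. */
    for (int t = 0; t < nthreads; t++) {
        part[t].data = data;
        part[t].begin = (size_t)t * n / nthreads;
        part[t].end = (size_t)(t + 1) * n / nthreads;
        part[t].partial = 0;
        int rc = pthread_create(&tid[t], NULL, sum_slice, &part[t]);
        assert(rc == 0);
        (void)rc;
    }

    /* Synchronization: join each thread before reading its partial sum. */
    long total = 0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
        total += part[t].partial;
    }
    return total;
}
```

Each thread writes only its own `partial` slot, so no lock is needed here; `pthread_join` is the only coordination point.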
Multiprocessing
- Simplest way to take advantage of multiple cores
- Traditional way in Unix
  - Processes are cheap
  - Not cheap in Windows
- Nothing-shared model
- Unix Interprocess Communication (IPC)
  - Filesystem
  - Pipes
  - Unix sockets
  - Semaphores
  - SysV shared memory
- Only viable model available in some programming languages
  - Python
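A small sketch of the nothing-shared model using `fork()` and a pipe, one of the IPC mechanisms listed above. The child changes a global variable and sends its value through the pipe; the parent receives the value but its own copy of the variable is untouched. The names `counter` and `run_child` are made up for this example.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 0;   /* each process has its own copy after fork() */

/* Returns the value the child sent through the pipe, or -1 on error. */
int run_child(void)
{
    int fd[2];
    if (pipe(fd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {                  /* child */
        close(fd[0]);
        counter = 42;                /* changes only the child's copy */
        ssize_t w = write(fd[1], &counter, sizeof counter);
        _exit(w == (ssize_t)sizeof counter ? 0 : 1);
    }

    /* parent */
    close(fd[1]);
    int received = -1;
    ssize_t r = read(fd[0], &received, sizeof received);
    close(fd[0]);
    waitpid(pid, NULL, 0);           /* reap the child */
    assert(counter == 0);            /* the parent's copy is unchanged */
    return r == (ssize_t)sizeof received ? received : -1;
}
```

All data exchange is explicit: without the `write`/`read` pair, the parent would never observe the child's update.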
Question
- What happens when you run a parallel program on a single-core machine?
Parallelization and Concurrency
- Parallel programs
  - Can execute concurrently
  - Notions of independence
- Concurrent programs
  - Are executing at the same time
  - Depends on hardware resources
- Can all parallel programs execute correctly when concurrency = 1?
- Important: these definitions apply to this class
Summary
- Multicores are not going away
- Free single-thread performance is dead
- Automatic parallelization is hard (for compilers)
- Must write parallel programs to exploit multicores
Multithreading
- One process
  - Process creates threads ("lightweight processes")
- Everything-shared model
- Communication
  - Read and write to memory
  - Relies on programmers to think carefully about access to shared data
  - Tricky
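In the everything-shared model, a plain store by one thread is how data reaches another. The sketch below shows a thread writing a global that the main thread reads after joining; the names `shared_value` and `run_writer_thread` are invented for illustration.

```c
#include <assert.h>
#include <pthread.h>

int shared_value = 0;   /* one copy, visible to every thread in the process */

static void *writer(void *arg)
{
    (void)arg;
    shared_value = 42;  /* ordinary store to shared memory */
    return NULL;
}

/* Returns what the main thread reads once the writer thread has finished. */
int run_writer_thread(void)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, writer, NULL);
    assert(rc == 0);
    (void)rc;
    /* Reading shared_value at this point would race with the writer. */
    pthread_join(tid, NULL);   /* join orders the store before the read below */
    return shared_value;
}
```

The comment before `pthread_join` marks the tricky part: nothing in the language stops a thread from touching shared data too early, so the ordering has to come from the programmer.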
Multithreading Programming Models
- Roughly in (decreasing) order of power and complexity:
  - POSIX threads (pthreads) (*)
    - C++11 threads may be simpler than this
  - Threading Building Blocks (TBB) from Intel
  - Cilk
  - OpenMP (*)
- (*) What we will use in this class; the others are usually built on top of pthreads.
POSIX Threads on Linux
- Processes == threads for the scheduler in Linux
  - 1:1 threading model
  - See an OS textbook
- pthreads is provided as a library
  - gcc test.c -lpthread
- OS scheduler can affect performance significantly
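The 1:1 model can be observed directly on Linux: every pthread has its own kernel thread ID, while all threads of a process report the same process ID. The sketch below is Linux-specific and uses the raw `gettid` system call through `syscall()`; `struct ids` and `ids_in_new_thread` are hypothetical names introduced here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids {
    pid_t pid;   /* process ID as seen inside the thread */
    pid_t tid;   /* kernel thread ID (Linux-specific) */
};

static void *record_ids(void *arg)
{
    struct ids *out = arg;
    out->pid = getpid();
    out->tid = (pid_t)syscall(SYS_gettid);
    return NULL;
}

/* Fills *out with the IDs observed inside a newly created thread. */
void ids_in_new_thread(struct ids *out)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, record_ids, out);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);
}
```

Called from the main thread, the recorded `pid` matches the caller's `getpid()`, while the recorded `tid` differs from the caller's own thread ID. The file builds with the command from the slide, for example `gcc test.c -lpthread`.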
Multithreading Components
- Thread management
  - Creation, death, waiting, etc.
- Communication
  - Shared variables (ordinary variables)
  - Condition variables
- Synchronization
  - Mutexes (mutual exclusion)
  - Barriers
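Several of these components fit in one short sketch: threads are created and joined, a mutex protects a shared counter, and a condition variable lets the main thread wait until every worker has reported completion. Barriers are not shown. The names `run_workers`, `struct shared`, and `MAX_WORKERS` are assumptions made for this example.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 16   /* arbitrary cap for this sketch */

struct shared {
    pthread_mutex_t lock;
    pthread_cond_t all_done;
    long counter;        /* protected by lock */
    int finished;        /* protected by lock */
    int increments;      /* read-only after setup */
};

static void *worker(void *arg)
{
    struct shared *s = arg;
    for (int i = 0; i < s->increments; i++) {
        pthread_mutex_lock(&s->lock);
        s->counter++;                   /* critical section */
        pthread_mutex_unlock(&s->lock);
    }
    pthread_mutex_lock(&s->lock);
    s->finished++;
    pthread_cond_signal(&s->all_done);  /* wake the waiter to re-check */
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

long run_workers(int nworkers, int increments)
{
    assert(nworkers > 0 && nworkers <= MAX_WORKERS);
    struct shared s = { .counter = 0, .finished = 0, .increments = increments };
    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.all_done, NULL);

    pthread_t tid[MAX_WORKERS];
    for (int t = 0; t < nworkers; t++) {
        int rc = pthread_create(&tid[t], NULL, worker, &s);
        assert(rc == 0);
        (void)rc;
    }

    /* Wait in a loop: condition variables may wake spuriously. */
    pthread_mutex_lock(&s.lock);
    while (s.finished < nworkers)
        pthread_cond_wait(&s.all_done, &s.lock);
    long result = s.counter;
    pthread_mutex_unlock(&s.lock);

    for (int t = 0; t < nworkers; t++)
        pthread_join(tid[t], NULL);
    pthread_cond_destroy(&s.all_done);
    pthread_mutex_destroy(&s.lock);
    return result;
}
```

Without the mutex around `counter++`, concurrent increments could be lost; with it, `run_workers(4, 1000)` returns 4000.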
Conclusion
- Multicores can be programmed in two distinct ways:
  - Multiprocessing
  - Multithreading
- Programs need to be rewritten for multithreading
- Need to pick a programming model
  - Low-level models are pthreads and C++11 threads
  - May not be productive! Threads are usually too low-level
- Operating system and runtime (pthreads) affect performance significantly