CS377P Programming for Performance: Multicore Performance and Multithreading


CS377P Programming for Performance: Multicore Performance and Multithreading. Sreepathi Pai, UTCS, October 14, 2015

Outline: 1. Multiprocessor Systems; 2. Programming Models for Multicore; 3. Multithreading and POSIX Threads


Multiprocessing: The Programmer's View
- Multiple processors
- One operating system
- One memory address space
In contrast, distributed systems have multiple address spaces (necessarily) and usually multiple operating systems (not necessarily).

Symmetric Multiprocessing (SMP)
- Multiple, identical processors on the same mainboard, sharing a memory system bus
- One socket per processor
- (Optional) interconnect between processors
- x86 example: Pentium (Pro)

Simultaneous Multithreading (SMT)
Observation: there is not enough ILP to keep wide out-of-order pipelines busy. Where can we find extra ILP?
Dean Tullsen, Susan Eggers, and Henry Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism", ISCA 1995

Simultaneous Multithreading (SMT)
Idea: let multiple programs feed the pipeline; instructions from different programs are guaranteed to be independent.
- Context switching (of a sort) happens in hardware
- The front end and back end are usually separated: the front end knows which instructions come from which program
- Functional units can be shared; they do not need to know about the separate programs

Chip Multiprocessing (CMP)
Moore's Law keeps delivering transistors "for free". What to do with all these transistors?
- More complex out-of-order execution? ILP is limited by the program, and there are scalability and power limits
- Solution: duplicate cores on one die

Modern Multicores
(Figure: two sockets/packages, each containing multiple cores that share a last-level cache (LLC), with each socket attached to its own memory.)

Multicore Layout
- Near (local) and far (remote) cores and memories (still one address space!)
- Private and shared resources: L1/L2 are usually private, the LLC is usually shared
- All of this is transparent to programs, but not to performance!

Performance Implications
What if the data you're addressing is in:
- Remote memory vs. local memory?
- The local shared cache vs. a remote shared cache?
- The local private cache vs. an on-package private cache vs. a remote private cache?
Producer/consumer pairs can run on SMT threads, on different cores on the same socket, or across sockets.

Nomenclature Test
How many hardware threads does a quad-socket, dual-threaded, deca-core machine have? (4 sockets x 10 cores x 2 threads = 80.)

Outline 1 Multiprocessor Systems 2 Programming Models for Multicore 3 Multithreading and POSIX Threads

The Free Lunch is Over
Herb Sutter, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", Dr. Dobb's Journal, 2005

The Free Lunch is Over: Implications
In the 1990s, you could wait 18 months for program performance to double. But "what Intel giveth, Microsoft taketh away": extra effort spent on performance was usually not worth it; you could just wait. This class would have been uninteresting in the 1990s.
Now, performance is no longer free: single-thread performance has stagnated, and programmers must work to get performance.

Very Simplified Programmer's View of Multicore
Each core has its own program counter (PC), making this a MIMD machine. To what do we set these PCs? Can the hardware do this automatically for us?

Automatic Parallelization to the Rescue?

    for (i = 0; i < N; i++) {
        for (j = 0; j < i; j++) {
            // something(i, j)
        }
    }

Assume a stream of instructions from a single-threaded program. How do we split this stream into pieces?

Thread-Level Parallelism
Break the stream into long, continuous streams of instructions:
- Much bigger than the issue window on superscalars (hundreds of instructions vs. about 8)
- Largely independent of each other
This gives the best performance on current hardware. Thread-level parallelism sits alongside ILP, DLP, and MLP.

Parallelization Issues
Assume we have a parallel algorithm.
- Work distribution: how do we split up the work to be performed among threads?
- Communication: how do threads send and receive data?
- Synchronization: how do we coordinate different threads? (A form of communication.)

Multiprocessing
The simplest way to take advantage of multiple cores, and the traditional way in Unix, where processes are cheap (they are not cheap in Windows). It is a nothing-shared model; processes communicate via Unix interprocess communication (IPC):
- Filesystem
- Pipes
- Unix sockets
- Semaphores
- SysV shared memory
In some programming languages (e.g., Python), this is the only viable model available.

Question: What happens when you run a parallel program on a single-core machine?

Parallelization and Concurrency
- Parallel programs can execute concurrently; parallelism is a notion of independence.
- Concurrent programs are executing at the same time; this depends on hardware resources.
Can all parallel programs execute correctly when concurrency = 1? (Important: these definitions apply to this class.)

Summary
- Multicores are not going away
- Free single-thread performance is dead
- Automatic parallelization is hard (for compilers)
- You must write parallel programs to exploit multicores

Outline: 1. Multiprocessor Systems; 2. Programming Models for Multicore; 3. Multithreading and POSIX Threads

Multithreading
One process creates multiple threads ("lightweight processes") in an everything-shared model. Threads communicate by reading and writing memory, which relies on programmers to think carefully about access to shared data. This is tricky.

Multithreading Programming Models
Roughly in decreasing order of power and complexity:
- POSIX threads (pthreads) (*)
- C++11 threads (may be simpler than pthreads)
- Threading Building Blocks (TBB) from Intel
- Cilk
- OpenMP (*)
(*) What we will use in this class. The others are usually built on top of pthreads.

POSIX Threads on Linux
- Processes == threads as far as the Linux scheduler is concerned: a 1:1 threading model (see your OS textbook)
- pthreads is provided as a library: gcc test.c -lpthread
- The OS scheduler can affect performance significantly

Multithreading Components
- Thread management: creation, death, waiting, etc.
- Communication: shared (ordinary) variables, condition variables
- Synchronization: mutexes (mutual exclusion), barriers

Conclusion
Multicores can be programmed in two distinct ways: multiprocessing and multithreading. Programs need to be rewritten for multithreading, and you need to pick a programming model. The low-level models (pthreads, C++11 threads) may not be productive: threads are usually too low-level. The operating system and runtime (pthreads) affect performance significantly.