Introduction to parallel computers and parallel programming



Content
  A quick overview of modern parallel hardware
    Parallelism within a chip
      Pipelining
      Superscalar execution
      SIMD
      Multiple cores
    Parallelism within a node
      Multiple sockets
      UMA vs. NUMA
    Parallelism across multiple nodes
  A very quick overview of parallel programming
  Note: The textbook [Quinn 2004] is outdated on parallel hardware; please read instead https://computing.llnl.gov/tutorials/parallel_comp/

First things first
  CPU (central processing unit) is the brain of a computer
  CPU processes instructions, many of which require data transfers from/to the memory of a computer
  CPU integrates many components (registers, FPUs, caches, ...)
  CPU has a clock, which at each clock cycle synchronizes the logic units within the CPU to process instructions

An example of a CPU
  Block diagram of an Intel Xeon Woodcrest CPU (core)

Instruction pipelining
  Suppose every instruction has five stages (each needs one cycle)
  Without instruction pipelining
  With instruction pipelining
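A rough count makes the benefit concrete (an idealization assuming, as above, five one-cycle stages and no stalls): without pipelining, n instructions take T_seq = 5n cycles; with pipelining, the first instruction finishes after 5 cycles and each subsequent instruction finishes one cycle later, so T_pipe = 5 + (n - 1) = n + 4 cycles. The speedup T_seq / T_pipe therefore approaches 5 as n grows.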

Superscalar execution
  Multiple execution units: more than one instruction can finish per cycle
  An enhanced form of instruction-level parallelism

Data
  Data are stored in computer memory as sequences of 0s and 1s
  Each 0 or 1 occupies one bit; 8 bits constitute one byte
  Normally, in the C language:
    char: 1 byte
    int: 4 bytes
    float: 4 bytes
    double: 8 bytes
  Bandwidth (the speed of data transfer) is measured as the number of bytes transferred per second
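The sizes quoted above are typical for common 64-bit platforms but are formally implementation-defined in C; a minimal program such as the following sketch can be used to check them on a given machine.

```c
/* Minimal check of the type sizes quoted above; the exact values are
   implementation-defined and may differ on other platforms. */
#include <stdio.h>

int main(void)
{
    printf("char:   %zu byte(s)\n", sizeof(char));
    printf("int:    %zu byte(s)\n", sizeof(int));
    printf("float:  %zu byte(s)\n", sizeof(float));
    printf("double: %zu byte(s)\n", sizeof(double));
    return 0;
}
```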

SIMD
  SISD: single instruction stream, single data stream
  SIMD: single instruction stream, multiple data streams
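As an illustration (not from the slides), a simple element-wise loop like the sketch below is the kind of code a vectorizing compiler can turn into SIMD instructions, so that one instruction operates on several array elements at a time; the array length N is an arbitrary choice.

```c
/* A loop that a vectorizing compiler (e.g. gcc -O3) may compile into
   SIMD instructions: the same addition is applied to many data
   elements per instruction. */
#include <stddef.h>

#define N 1024   /* arbitrary length for the illustration */

void add_arrays(const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] + b[i];   /* single instruction stream, multiple data */
}
```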

An example of a floating-point unit
  FP unit on a Xeon CPU

Multicore processor
  Modern hardware technology can put several independent CPUs (called "cores") on the same chip: a multicore processor
  Intel Xeon Nehalem quad-core processor

Multi-threading
  Modern CPU cores often have threading capability
  Hardware support for multiple threads to be executed within a core
  However, the threads have to share the resources of a core:
    computing units
    caches
    translation lookaside buffer

Vector processor
  Another approach, different from multicore (and multi-threading)
  Massive SIMD
  Vector registers
  Direct pipes into main memory with high bandwidth
  Used to dominate high-performance computing hardware, but now only a niche technology

Multi-socket
  Socket: a connection on the motherboard that a processor is plugged into
  Modern computers often have several sockets
  Each socket holds a multicore processor
  Example: Nehalem-EP (2 sockets, quad-core CPUs, 8 cores in total)
  For historical reasons, the OS maps each core to its own concept of a "CPU"

Shared memory
  Shared memory: all CPU cores can access all memory as a global address space
  Traditionally called a multiprocessor
  Two common definitions of SMP (depending on context):
    1. SMP (shared-memory processing): multiple CPUs (or multiple CPU cores) share access, not necessarily at the same access speed, to the entire memory available on a single system
    2. SMP (symmetric multi-processing): multiple CPUs (or multiple CPU cores) have the same speed and latency to a shared memory

UMA
  UMA (uniform memory access): one type of shared memory
  Another name for symmetric multi-processing
  Dual-socket Xeon Clovertown CPUs

NUMA
  NUMA (non-uniform memory access): another type of shared memory
  Several symmetric multi-processing units are linked together
  Each core should access its closest memory unit as much as possible
  Dual-socket Xeon Nehalem CPUs

Cache coherence
  Important for shared-memory systems
  If one CPU core updates a value in its private cache, all the other cores must be made aware of the update
  Cache coherence is accomplished by hardware

Competition among the cores
  Within a multi-socket multicore computer, some resources are shared
  Within a socket, the cores share the last-level cache
  The memory bandwidth is also shared to a great extent

  # cores      1          2          4          6          8
  Bandwidth    3.42 GB/s  4.56 GB/s  4.57 GB/s  4.32 GB/s  5.28 GB/s

  Actual memory bandwidth measured on a 2-socket quad-core Xeon Harpertown
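A sketch of how such bandwidth numbers can be obtained (an illustrative assumption, not the benchmark behind the table above): time a large array copy with a varying number of OpenMP threads and divide the bytes moved by the elapsed time.

```c
/* Illustrative bandwidth measurement: copy a large array in parallel
   and report bytes moved per second (error checking omitted). */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (64 * 1024 * 1024)   /* 64 Mi doubles = 512 MiB per array */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) a[i] = 1.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        b[i] = a[i];
    double t1 = omp_get_wtime();

    /* the copy reads one array and writes another: 2*N*sizeof(double) bytes */
    printf("approx. bandwidth: %.2f GB/s\n",
           2.0 * N * sizeof(double) / (t1 - t0) / 1e9);

    free(a);
    free(b);
    return 0;
}
```

Compiled with, e.g., gcc -fopenmp -O2, the core count is then varied through the OMP_NUM_THREADS environment variable.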

Distributed memory
  The entire memory consists of several disjoint parts
  A communication network is needed in between
  There is no single global memory space
  A CPU (core) can directly access its own local memory
  A CPU (core) cannot directly access a remote memory
  A distributed-memory system is traditionally called a multicomputer
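A minimal MPI sketch of this model: the variable local lives in each process's own memory, and the only way rank 1 can see rank 0's value is through an explicit message (run with at least two processes, e.g. mpirun -np 2).

```c
/* Each MPI process owns its local memory; data moves only via messages. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, local = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        local = 42;   /* exists only in rank 0's local memory */
        MPI_Send(&local, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&local, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", local);
    }

    MPI_Finalize();
    return 0;
}
```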

Comparing shared memory and distributed memory
  Shared memory:
    User-friendly programming
    Data sharing between processors
    Not cost effective
    Synchronization needed
  Distributed memory:
    Memory is scalable with the number of processors
    Cost effective
    Programmer responsible for data communication

Hybrid memory system
  (Figure: several compute nodes, each with multicore processors, caches, a bus and local memory, connected by an interconnect network)

Different ways of parallel programming
  Threads model, using OpenMP
    Easy to program (inserting a few OpenMP directives)
    Parallelism "behind the scenes" (little user control)
    Difficult to scale to many CPUs (NUMA, cache coherence)
  Message passing model
    Many programming details (MPI or PVM)
    Better user control (data & work decomposition)
    Larger systems and better performance
  Stream-based programming (for using GPUs)
  Some special parallel languages
    Co-Array Fortran, Unified Parallel C, Titanium
  Hybrid parallel programming
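To make the contrast concrete, a minimal sketch of the threads model: a single OpenMP directive splits the loop iterations among the available cores (the loop itself is just an illustrative example).

```c
/* Threads model in a nutshell: one directive parallelizes the loop. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i;
        sum += a[i];
    }

    printf("sum = %.0f (using up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```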

Designing parallel programs
  Determine whether or not the problem is parallelizable
  Identify hotspots
    Where are most of the computations?
    Parallelization should focus on the hotspots
  Partition the problem
  Insert collaboration (unless embarrassingly parallel)

Partitioning
  Break the problem into chunks
  Domain decomposition (data decomposition)
  Functional decomposition
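For domain decomposition of a 1-D array over P processes (or threads), a common index-range computation looks like the sketch below; the function name my_range is a placeholder, not from the slides.

```c
/* Give process p (0 <= p < P) a contiguous chunk of the n elements,
   spreading the remainder over the first chunks so sizes differ by
   at most one; the range is half-open: [start, stop). */
void my_range(int n, int p, int P, int *start, int *stop)
{
    int chunk = n / P;
    int rem   = n % P;

    *start = p * chunk + (p < rem ? p : rem);
    *stop  = *start + chunk + (p < rem ? 1 : 0);
}
```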

Examples of domain decomposition

Collaboration
  Communication
    Overhead depends on both the number and size of messages
    Overlap communication with computation, if possible
    Different types of communication (one-to-one, collective)
  Synchronization
    Barrier
    Lock & semaphore
    Synchronous communication operations
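One way to overlap communication with computation, sketched with non-blocking MPI (the buffer and function names are illustrative placeholders): start the exchange, do the work that does not depend on the incoming data, then wait.

```c
/* Sketch: overlap a non-blocking exchange with independent computation. */
#include <mpi.h>

void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int neighbor, void (*interior_work)(void))
{
    MPI_Request reqs[2];

    MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

    interior_work();   /* computation that does not need the incoming data */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    /* computation that depends on recvbuf can start here */
}
```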

Load balancing
  Objective: idle time is minimized
  Important for parallel performance
  Equal partitioning of work (and/or data)
  Dynamic work assignment may be necessary
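Dynamic work assignment can be expressed very compactly in OpenMP; in the sketch below (process_item and the chunk size 4 are illustrative assumptions), iterations are handed out in small batches to whichever thread becomes idle, which helps when the iterations have very uneven cost.

```c
/* Dynamic scheduling: idle threads grab the next small batch of iterations. */
#include <omp.h>

void process_all(int n, void (*process_item)(int))
{
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++)
        process_item(i);   /* items may take very different amounts of time */
}
```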

Granularity
  Computations are typically separated from communications by synchronization events
  Granularity: the ratio of computation to communication
  Fine-grain parallelism
    Individual tasks are relatively small
    More overhead incurred
    Might be easier for load balancing
  Coarse-grain parallelism
    Individual tasks are relatively large
    Advantageous for performance due to lower overhead
    Might be harder for load balancing

Exercises
  Problem 1 of INF3380 written exam V10
  Problem 1 of INF3380 written exam V11
  Exercise 1.5 from the textbook [Quinn 2004]
    Denote by T_pipe the time, found in the above exercise, required to process n tasks using an m-stage pipeline, and by T_seq the time needed to process the n tasks without pipelining. How large should n be in order to get a speedup of p?
  Exercise 1.10 from the textbook [Quinn 2004]