CS4230 Parallel Programming, Lecture 3: Introduction to Parallel Architectures (8/28/12)


CS4230 Parallel Programming
Lecture 3: Introduction to Parallel Architectures
Mary Hall, August 28, 2012

Homework 1: Parallel Programming Basics
Due before class, Thursday, August 30.
Turn in electronically on the CADE machines using the handin program: handin cs4230 hw1 <probfile>

Problem 1 (from today's lecture): We can develop a model of the performance behavior of the versions of parallel sum in today's lecture based on the sequential execution time S, the number of threads T, the parallelization overhead O (fixed for all versions), and the cost B of the barrier or M of each invocation of the mutex. Let N be the number of elements in the list. For version 5, there is some additional work for thread 0 that you should also model using the variables above.
(a) Using these variables, what is the execution time of the valid parallel versions 2, 3 and 5? (b) Present a model of when parallelization is profitable for version 3. (c) Discuss how varying T and N affects the relative profitability of versions 3 and 5.

Homework 1: Parallel Programming Basics, cont.
Problem 2 (#1.3 in textbook): Try to write pseudo-code for the tree-structured global sum illustrated in Figure 1.1. Assume the number of cores is a power of two (1, 2, 4, 8, ...).
Hints: Use a variable divisor to determine whether a core should send its sum or receive and add. The divisor should start with the value 2 and be doubled after each iteration. Also use a variable core_difference to determine which core should be partnered with the current core. It should start with the value 1 and also be doubled after each iteration. For example, in the first iteration 0 % divisor = 0 and 1 % divisor = 1, so core 0 receives and adds while core 1 sends. Also in the first iteration 0 + core_difference = 1 and 1 - core_difference = 0, so cores 0 and 1 are paired in the first iteration. (One way these hints could translate into code is sketched after this slide group.)

Today's Lecture
Flynn's Taxonomy
Some types of parallel architectures
- Shared memory
- Distributed memory
These platforms are things you will probably use
- CADE Lab1 machines (Intel Nehalem i7)
- Sun Ultrasparc T2 (water, next assignment)
- Nvidia GTX260 GPUs in Lab1 machines
Sources for this lecture:
- Textbook
- Jim Demmel, UC Berkeley
- Notes on various architectures
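The sketch below is one possible reading of the hints in Problem 2, written as C-style pseudo-code executed by every core. It is only a sketch, not the assigned answer: send_to() and receive_from() are hypothetical message primitives standing in for whatever mechanism your solution uses, and the core count p is assumed to be a power of two.

    /* Tree-structured global sum, following the hints above.
     * send_to() and receive_from() are hypothetical primitives. */
    void tree_sum(int my_rank, int p, int my_sum) {
        int divisor = 2;          /* decides whether this core receives or sends */
        int core_difference = 1;  /* distance to this core's partner */

        while (divisor <= p) {
            if (my_rank % divisor == 0) {
                /* Receive the partner's partial sum and add it in. */
                my_sum += receive_from(my_rank + core_difference);
            } else {
                /* Send this core's partial sum to the partner and drop out. */
                send_to(my_rank - core_difference, my_sum);
                break;
            }
            divisor *= 2;
            core_difference *= 2;
        }
        /* After the loop, core 0 holds the global sum. */
    }

In the first iteration cores 0 and 1, 2 and 3, and so on are paired; in the next, cores 0 and 2, 4 and 6, and so on, until only core 0 remains.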

Reading this week: Chapter 2.1-2.3 in textbook
Chapter 2: Parallel Hardware and Parallel Software
2.1 Some background
- The von Neumann architecture
- Processes, multitasking, and threads
2.2 Modifications to the von Neumann model
- The basics of caching
- Cache mappings
- Caches and programs: an example
- Virtual memory
- Instruction-level parallelism
- Hardware multithreading
2.3 Parallel Hardware
- SIMD systems
- MIMD systems
- Interconnection networks
- Cache coherence
- Shared-memory versus distributed-memory

An Abstract Parallel Architecture
- How is parallelism managed?
- Where is the memory physically located? Is it connected directly to processors?
- What is the connectivity of the network?

Why are we looking at a bunch of architectures?
- There is no canonical parallel computer; there is a diversity of parallel architectures
  - Hence, there is no canonical parallel programming language
- Architecture has an enormous impact on performance
  - And we wouldn't write parallel code if we didn't care about performance
- Many parallel architectures fail to succeed commercially
  - Can't always tell what is going to be around in N years
- The challenge is to write parallel code that abstracts away architectural features, focuses on their commonality, and is therefore easily ported from one platform to another.

The von Neumann Architecture
Conceptually, a von Neumann architecture executes one instruction at a time. (Figure 2.1 in the textbook.)

Locality and Parallelism
(Figure: conventional storage hierarchy for several processors -- caches and L3 levels down to memory, with potential interconnects between them.)
- Large memories are slow; fast memories are small
- A program should do most of its work on local data
(A small code illustration of this point appears after this slide group.)

Uniprocessor and Parallel Architectures
Uniprocessors achieve performance by addressing the von Neumann bottleneck:
- Reduce memory latency
  - Access data from nearby storage: registers, caches, scratchpad memory
  - We'll look at this in detail in a few weeks
- Hide or tolerate memory latency
  - Multithreading and, when necessary, context switches while memory is being serviced
  - Prefetching, predication, speculation
- Uniprocessors that execute multiple instructions in parallel
  - Pipelining
  - Multiple issue
  - SIMD multimedia extensions

How Does a Parallel Architecture Improve on This Further?
- Computation and data partitioning focus a single processor on a subset of data that can fit in nearby storage
- Can achieve performance gains with simpler processors
  - Even if individual processor performance is reduced, throughput can be increased
- Complements instruction-level parallelism techniques
  - Multiple threads operate on distinct data
  - Exploit ILP within a thread

Flynn's Taxonomy
- SISD: single instruction stream, single data stream (classic von Neumann)
- SIMD: single instruction stream, multiple data stream
- MISD: multiple instruction stream, single data stream (not covered)
- MIMD: multiple instruction stream, multiple data stream
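As a concrete illustration of the locality point above (assumed, not from the slides): both functions below sum the same N x N matrix stored in row-major order, but the first walks memory in storage order and reuses each cache line, while the second strides by a full row per access and misses far more often.

    /* Locality illustration: same arithmetic, very different memory behavior. */
    #include <stdio.h>
    #define N 1024

    static double a[N][N];

    double sum_row_major(void) {      /* good locality: innermost index is contiguous */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    double sum_column_major(void) {   /* poor locality: stride of N doubles per access */
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_column_major());
        return 0;
    }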

More Concrete: Parallel Control Mechanisms

Two main classes of parallel architecture organizations:
- Shared memory multiprocessor architectures
  - A collection of autonomous processors connected to a memory system.
  - Supports a global address space where each processor can access every memory location.
- Distributed memory architectures
  - A collection of autonomous systems connected by an interconnect.
  - Each system has its own distinct address space, and processors must explicitly communicate to share data.
  - Clusters of PCs connected by a commodity interconnect are the most common example.

Programming Shared Memory Architectures
- A shared-memory program is a collection of threads of control.
- Threads are created at program start or possibly dynamically.
- Each thread has private variables, e.g., local stack variables, and a set of shared variables, e.g., static variables, shared common blocks, or the global heap.
- Threads communicate implicitly by writing and reading shared variables.
- Threads coordinate through locks and barriers implemented using shared variables.
(A minimal threaded sketch of this model appears after this slide group.)

Programming Distributed Memory Architectures
- A distributed-memory program consists of named processes.
- A process is a thread of control plus a local address space -- NO shared data.
- Logically shared data is partitioned over the local processes.
- Processes communicate by explicit send/receive pairs.
- Coordination is implicit in every communication event.
(A minimal message-passing sketch follows the threaded one below.)
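A minimal sketch of the shared-memory model just described, using Pthreads (an assumed choice; the slides do not name an API). The counter shared_sum is a shared variable, my_id and my_part live on each thread's private stack, and a mutex provides the coordination. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static long shared_sum = 0;                      /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long my_id = (long)arg;                      /* private: on this thread's stack */
        long my_part = my_id * 100;                  /* stand-in for real work */

        pthread_mutex_lock(&lock);                   /* coordinate through a lock */
        shared_sum += my_part;                       /* communicate via shared data */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);
        printf("shared_sum = %ld\n", shared_sum);
        return 0;
    }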
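And a correspondingly minimal sketch of the distributed-memory model, using MPI (again an assumed choice): each process owns its local value, and the only way rank 0 learns rank 1's data is through an explicit send/receive pair. Compile with mpicc and run with mpiexec -n 2.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            value = 42;                               /* local data, invisible to rank 0 */
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank 1\n", value);
        }

        MPI_Finalize();
        return 0;
    }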

Shared Memory Architecture 1: Intel i7 860 Nehalem (CADE Lab1)
(Figure: four cores, each with data and instruction caches and a 256KB unified L2 cache, connected by a bus interconnect to a shared 8MB L3 cache and up to 16 GB of main memory over a DDR3 interface.)

More on Nehalem and Lab1 machines -- ILP
- Target users are general-purpose
  - Personal use
  - Games
  - High-end PCs in clusters
- Support for the SSE 4.2 SIMD instruction set (a small SIMD illustration appears after this slide group)
- 8-way hyperthreading (executes two threads per core)
- Multiscalar execution (4-way issue per thread)
- Out-of-order execution
- The usual branch prediction, etc.

Shared Memory Architecture 2: Sun Ultrasparc T2 Niagara (water)
(Figure: processors, caches, and memory connected by a full crossbar interconnect.)

More on Niagara
- Target applications are server-class, business operations
- Characterization: Floating point? Array-based computation?
- Support for the VIS 2.0 SIMD instruction set
- 64-way multithreading (8-way per processor, 8 processors)
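As a tiny illustration (assumed, not from the slides) of what a SIMD multimedia extension buys on the Nehalem machines above: one instruction below adds four packed single-precision floats at once. The example uses plain SSE intrinsics, which Nehalem supports alongside SSE 4.2; compile with gcc -msse on an x86 machine.

    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load 4 floats into a 128-bit register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* one instruction, four additions */
        _mm_storeu_ps(c, vc);

        printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }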

Shared Memory Architecture 3: GPUs
Lab1 has Nvidia GTX 260 accelerators
- 24 multiprocessors, with 8 SIMD processors per multiprocessor
- SIMD execution of warpsize threads (from a single block)
- Multithreaded execution across different instruction streams
- Complex and largely programmer-controlled memory hierarchy
  - Shared device memory
  - Per-multiprocessor shared memory
  - Some other constrained memories (constant and texture memories/caches)
  - No standard data cache
(Figure: a device containing multiprocessors 0 through 23; each multiprocessor holds processors 0 through 7 with registers, an instruction unit, per-multiprocessor shared memory, and constant and texture caches, all attached to device memory.)

Jaguar Supercomputer
- Peak performance of 2.33 petaflops
- 224,256 AMD Opteron cores
- http://www.olcf.ornl.gov/computing-resources/jaguar/

Shared Memory Architecture 4: Each Socket is a 12-core AMD Opteron Istanbul
(Figure: two 6-core processors per socket; each core has 64KB L1 data and instruction caches and a 512KB unified L2 cache; each 6-core processor shares a 6MB L3 cache and connects through a link interconnect to 8 GB of main memory over a DDR3 interface.)

Jaguar is a Cray XT5 (plus XT4)
- The interconnect is a 3-dimensional toroidal mesh
- http://www.cray.com/assets/pdf/products/xt/crayxt5brochure.pdf

Summary of Architectures
Two main classes:
- Complete connection: CMPs, SMPs, X-bar
  - Preserve a single memory image
  - Complete connection limits scaling to a small number of processors (say, 32, or 256 with a heroic network)
  - Available to everyone (multi-core)
- Sparse connection: clusters, supercomputers, networked computers used for parallelism
  - Separate memory images
  - Can grow arbitrarily large
  - Available to everyone with LOTS of air conditioning
Programming differences are significant.

Brief Discussion
Why is it good to have different parallel architectures?
- Some may be better suited for specific application domains
- Some may be better suited for a particular community
- Cost
- Exploring new ideas
And different programming models/languages?
- They relate to architectural features
- Application domains, user community, cost, exploring new ideas