
How to Write Fast Code 18-645, Spring 2008
20th Lecture, Mar. 31st
Instructor: Markus Püschel
TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay (Fred)

Introduction
Parallelism, defined: carrying out instructions simultaneously (think multi-core).
Why parallelism?
- Duh: if you can make it faster, why not?
- More importantly: there are not many other options now
Why not parallelism?
- Difficult to build the hardware
- Difficult to program; not intuitive to humans
- No clear hardware or software model
- Application specific

Lecture overview
- Introduction / background
- Parallel platforms / hardware paradigms
- Parallel software paradigms
- Mapping algorithms to parallel platforms
- SIMD homework feedback
- Administrative: feedback form, scheduling project meetings

Parallelism: motivation
The idea has been around for a while. Why now?
- Frequency scaling is no longer an option, so we are forced into parallelism (think Intel/AMD multicores)
- The future (in fact, the present) will be driven by parallelism

Parallel programming paradigms
- Paradigms: data parallel, task parallel, streaming, EPIC
- Input: the problem plus parallel algorithms
- Target architectures: SMP, cluster, multi-core, SIMD, FPGA

Parallel architectures
Parallelism: one size doesn't fit all.
- SMP (symmetric multiprocessing), for smaller-scale systems: multi-core, multi-CPU, hybrids, FPGAs, etc.
- Distributed/NUMA, for larger systems: clusters (including grid computing), supercomputers
- Other types: SIMD (fine-grained parallelism; a short example follows below), stream computing (pipelined parallelism), instruction-level parallelism (VLIW/EPIC/superscalar)
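To make the finest-grained level concrete (and to connect to the SIMD homework), here is a minimal sketch of SIMD parallelism using SSE intrinsics in C. The function name and the assumptions that n is a multiple of 4 and that the pointers are 16-byte aligned are mine, not from the slides.

#include <xmmintrin.h>

/* Fine-grained (SIMD) parallelism: add two float arrays four elements at
   a time. Assumes n is a multiple of 4 and 16-byte aligned pointers. */
void add_sse(const float *x, const float *y, float *z, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_load_ps(x + i);              /* load 4 floats */
        __m128 b = _mm_load_ps(y + i);
        _mm_store_ps(z + i, _mm_add_ps(a, b));      /* 4 additions in one instruction */
    }
}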

Parallel architectures: constraints
SMP:
- Easy to program (a small pthreads sketch follows below)
- System complexity is pushed to the hardware design: coherency protocols, scalability
Distributed computing:
- Can never match SMP's node-to-node data transfer capability
- The hardware architecture scales very well; commodity hardware
- Grid computing: "virtual supercomputers" built out of untrusted, unreliable nodes
- Programmer burden is higher
[Figure: an SMP with cores Core1-Core4, each with its own cache ($), attached to a shared memory; and a distributed system with Node1-Node4 connected by a network]
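The "easy to program" point is easiest to see in code: on an SMP every thread sees the same memory, so no explicit data shipping is needed. A minimal sketch in C with POSIX threads; the names (NWORKERS, partial, worker) and the sum computation are illustrative, not from the lecture.

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define N 1000000

static double data[N];             /* shared by all threads */
static double partial[NWORKERS];

static void *worker(void *arg)
{
    long id = (long)arg;
    double s = 0.0;
    for (long i = id; i < N; i += NWORKERS)   /* strided slice of the shared array */
        s += data[i];
    partial[id] = s;                          /* write result directly to shared memory */
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    double sum = 0.0;
    for (long i = 0; i < NWORKERS; i++) {
        pthread_join(t[i], NULL);
        sum += partial[i];
    }
    printf("sum = %f\n", sum);
    return 0;
}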

Solving a problem in parallel
So far, we've looked at parallel architectures. To solve problems in parallel, we must identify and extract the parallelism in the problem itself.
- Question 1: does it make sense to parallelize? Amdahl's law predicts the theoretical maximum speedup (spelled out below).
- Question 2: what is the target platform/architecture? Does the problem fit the architecture well?
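For reference, the standard form of Amdahl's law (the slide only names it): if a fraction f of the runtime can be parallelized across p processors, the achievable speedup is bounded by

S(p) = \frac{1}{(1-f) + f/p} \le \frac{1}{1-f}

For example, if only 90% of the work parallelizes (f = 0.9), no number of processors can deliver more than a 10x speedup.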

Parallel paradigms
- Task parallelism: distribute execution processes across computing nodes
- Data parallelism: distribute chunks of data across computing nodes
- Many real-world problems lie on the continuum between the two (a small sketch of both follows below)
- Other models: streaming: sequential tasks performed on the same data, i.e. pipelined parallelism (e.g., DVD decoding)
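A minimal sketch of the two paradigms in C with OpenMP (my choice of API; the lecture does not prescribe one). The data-parallel version splits the iterations of one operation across cores; the task-parallel version runs two different tasks, here illustrative stubs, concurrently.

#include <omp.h>

#define N 1000000
static float a[N], b[N], c[N];

void data_parallel(void)
{
    /* Data parallelism: every core applies the same operation to its own
       chunk of the arrays. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

static void decode_audio(void) { /* placeholder task */ }
static void decode_video(void) { /* placeholder task */ }

void task_parallel(void)
{
    /* Task parallelism: different cores run different tasks. */
    #pragma omp parallel sections
    {
        #pragma omp section
        decode_audio();
        #pragma omp section
        decode_video();
    }
}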

Mapping problems to machines
- Given: the problem plus parallel algorithms
- Use these: the paradigms (data parallel, task parallel, streaming, EPIC)
- Fit to these: the architectures (SMP, cluster, multi-core, SIMD, FPGA)
- The challenge (that is where you come in): map the problem to the architecture using these abstractions. Some problems match some architectures better than others.

Case studies
- Heavy metal: clusters, BlueGene/P (supercomputer)
- Desktop: future Intel (Nehalem), GPU (graphics processing unit), hybrid/accelerated (CPU+GPU+FPGA), Cell Broadband Engine

Cluster computing
- Designed for: supercomputing applications, high availability (redundant nodes), load balancing
- Architecture: multiple nodes connected by fast local area networks; can use COTS nodes
- Peak performance: 100 Teraflop/s (Cluster Platform 3000, 14,000 Intel Xeons!)
- How to program: specific to each cluster (one common approach is sketched below)
- Notes: power/heat concerns; this category covers most supercomputers and grids
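The slides leave the programming model open ("specific to each cluster"); message passing with MPI is the most common choice, so here is a minimal hedged sketch. The MPI calls are standard; the partial-sum computation is purely illustrative.

#include <mpi.h>
#include <stdio.h>

/* Each node computes a partial sum over its share of the work and the
   results are combined with a reduction. Unlike on an SMP, data moves
   between nodes only through explicit messages. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0;
    for (int i = rank; i < 1000000; i += size)   /* this rank's slice of the work */
        local += (double)i;

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}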

BlueGene/P: cluster supercomputer
IBM's supercomputer in the making.
- Architecture: PowerPC based, high-speed 3D toroidal network, dual processors per node
- Peak performance: designed for up to 1 Petaflop/s; 32,000 CPUs for 111 Teraflop/s
- Features: scalable in increments of 1024 nodes

Future Intel: Nehalem
- Designed for: desktop/mainstream computing and servers; successor to Intel's Core architecture (2008-09)
- Microarchitecture: your standard parallel CPU of the future: 1-8 cores native (single die), out-of-order, 8 MB on-chip L3 cache, hyperthreading

Intel Nehalem (continued)
- Peak performance: ~100 Gflop/s (excluding the GPU)
- Features: 45 nm, SSE4, integrated northbridge (to combat the memory wall), GPU in the package (off-die)
- Programming: not very different from current CPUs
- Notes: AMD + ATI are heading toward something similar. What next?

GPU (graphics processing unit)
- Designed for: originally/largely, graphics rendering; also usable as a massive floating-point compute unit ("general-purpose GPU", GPGPU)
- Microarchitecture: multiple shader units (vertex/geometry/pixel); typically a PCI/PCIe card, but can be on-chip
- Peak performance: 0.5-1 Teraflop/s

GPU programming
- Programmable shader units designed for matrix/vector operations
- Stream processing
- Grid computing (ATI / Folding@Home)
- Notes: numeric computation only; shipping data to and from the GPU (i.e., main memory) is a concern

Cell Broadband Engine ("Cell")
- Designed for: heavy desktop numeric computing, general-purpose use, gaming / heavy multimedia vector processing, supercomputer nodes; bridges the gap between the desktop CPU and the GPU
- Microarchitecture: hybrid multi-core CPU: 1 PPE (general-purpose core), 8 SPEs (specialized vector cores), core interconnect
- Peak performance: ~230 Gflop/s (3.2 GHz, using all 9 cores); 204.8 Gflop/s (3.2 GHz, using just the SPEs)
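As a sanity check on those numbers (standard for the Cell, though not derived on the slide): each SPE can issue one 4-wide single-precision fused multiply-add per cycle, i.e. 8 flops/cycle, so 8 SPEs x 3.2 GHz x 8 flops = 204.8 Gflop/s; the PPE's AltiVec unit contributes roughly another 25.6 Gflop/s, which is where the ~230 Gflop/s figure comes from.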

Cell BE
- PPE (Power Processor Element), x1: Power-architecture based; in-order, dual-threaded core; for general-purpose computing (runs the OS); 32+32 KB L1, 512 KB L2; AltiVec vector unit
- SPEs (Synergistic Processing Elements), x8: for numeric computing; dual-pipelined; 256 KB local store, one per SPE
- Interconnect: EIB (Element Interconnect Bus)
- Explicit DMA instead of caching! The programmer has to ship data around manually (and overlap it with computation; see the sketch below). Advantage: fine-grained control over data and deterministic run times.
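To illustrate the explicit-DMA model, here is a minimal SPE-side sketch in C using the Cell SDK's spu_mfcio.h interface. The mfc_get / tag-status calls are the SDK's; the buffer size, tag choice, and the process() stub are illustrative assumptions, and the total size is assumed to be a multiple of the chunk size.

#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA transfer (illustrative) */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

static void process(char *data, int n) { (void)data; (void)n; /* computation placeholder */ }

/* Stream data from main memory (effective address ea) through the 256 KB
   local store, double-buffering so that the DMA for the next chunk
   overlaps with computation on the current one. */
void stream_in(unsigned long long ea, int total)
{
    int cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);               /* start the first transfer */

    for (int off = 0; off < total; off += CHUNK) {
        int next = cur ^ 1;
        if (off + CHUNK < total)                           /* prefetch the next chunk */
            mfc_get(buf[next], ea + off + CHUNK, CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);                      /* wait for the current chunk */
        mfc_read_tag_status_all();

        process((char *)buf[cur], CHUNK);                  /* compute while the next DMA runs */
        cur = next;
    }
}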

Cell: how to program
- SMP or DMP model?
- Parallel model: task parallel / data parallel / stream
- Pushes some complexity onto the programmer
- Libraries can handle the DMA for you (but do not rely on them for high-performance computing)
- Available as: already in use in the PlayStation 3; IBM BladeCenter; add-on accelerator cards

Hybrid/accelerated
- Designed for: customized applications / supercomputing nodes; main use: numeric/specialized computing
- Microarchitecture: CPU and FPGA on the same motherboard, high-speed data interconnect
- Peak performance: the CPU's plus the FPGA's
- Features: the benefits of FPGAs

Hybrid/accelerated: how to program
- Specialized: depends on the setup and the program
- Need to generate the FPGA design (non-trivial), then generate the software
- Spiral-generated FPGAs: [figure]

Parallel hierarchy
From fine-grained to coarse-grained parallelism: CPU SIMD, EPIC/SMT, multi-core, multiprocessor (+ GPU + FPGA), cluster, distributed, grid.
- Parallelism works on multiple levels (possibly nested)
- Numeric fast code must explicitly address many or most of these levels
- A major difference between the levels: the cost of shipping data between nodes

Parallel mathematical constructs (developed on the blackboard)

Conclusion
- Parallelism is the future
- Extracting and using parallelism remains an ongoing challenge
- Hardware is ahead of software: producing parallel hardware is currently easier than producing parallelized software
- "Our industry has bet its future on parallelism(!)" - David Patterson, UC Berkeley
- Challenge: how to map a given problem to a parallel architecture/platform

Meetings: Apr 7 (next Monday)
Markus: 11-11:45 (8), 11:45-12:30 (7), 1:30-2:15 (9), 2:15-3 (16), 3-3:45 (14, 12), 4:30-5:15 (6), 5:15-6 (13)
Fred: 4:30-5:15 (1), 5:15-6 (2), 6-6:45 (3)
Franz: 1-1:45 (17), 2-2:45 (11), 4:30-5:15 (5)
Vas: 3:45-4:30 (4), 4:30-5:15 (10), 5:15-6 (15)