Complexity and Advanced Algorithms. Introduction to Parallel Algorithms


Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia, industry, government, individuals? Two practical motivations: application requirements and architectural concerns. Why now? Soon we may not be able to buy a good single-core computer! We therefore need to study how to use parallel computers.

1. Application Requirements A computational fluid dynamics (CFD) calculation on an airplane wing: a 512 x 64 x 256 grid, 5000 fl-pt operations per grid point, and 5000 steps give 2.1 x 10^14 fl-pt operations, or about 3.5 minutes on a machine sustaining 1 trillion flops. A simulation of a full aircraft needs 3.5 x 10^17 grid points, a total of 8.7 x 10^24 fl-pt operations; on the same machine this requires more than 275,000 years to complete.

1. Application Requirements Digital movies and special effects: 10^14 fl-pt operations per frame and 50 frames per second, so a 90-minute movie represents 2.7 x 10^19 fl-pt operations. It would take 2,000 1-Gflops CPUs approximately 150 days to complete the computation.

1. Application Requirements Simulation of magnetic materials at the level of 2000-atom systems requires 2.64 Tflops of computational power and 512 GB of storage. A full hard-disk simulation needs 30 Tflops and 2 TB. Current investigations are limited to about 1000 atoms (0.5 Tflops, 250 GB); future investigations involving 10,000 atoms will need 100 Tflops and 2.5 TB. Other examples: inventory planning, risk analysis, workforce scheduling and chip design.

2. Architectural Advances Moore's Law: the number of transistors that can be inexpensively placed on an integrated circuit increases exponentially, doubling approximately every two years. Present difficulties: Memory Wall, Power Wall, ILP Wall.

The Brick Wall - 1 Memory Wall Memory latency is up to 200 cycles per load/store, while floating-point operations take no more than 4 cycles. Earlier it was thought that multiply is slow but load and store are fast.

The Brick Wall - 2 Power Wall Enormous increase in power consumption, plus power leakage. However, presently power is expensive but transistors are free.

Basic Architecture Concepts [Figure: instructions flowing through the Fetch, Decode, Execute, Write stages over successive time steps, with a cache between the CPU and memory.] 4 stages of instruction execution. Too many cycles per instruction (CPI). To reduce the CPI, introduce pipelined execution; this needs buffers to store results across stages, and a cache to handle slow memory access times. Further refinements: caches, out-of-order execution, branch prediction,...

The Brick Wall - 3 ILP Wall ILP via branch prediction, out-of-order and speculative execution Diminishing returns from instruction level parallelism.

Conventional Wisdom in Computer Architecture Power Wall + Memory Wall + ILP Wall = Brick Wall Old CW: Uniprocessor performance 2X / 1.5 yrs New CW: Uniprocessor performance only 2X / 5 yrs?

Multicores to the Rescue It is predicted that 100+ core computers will be a reality soon: an increased number of cores without significant improvement in clock rates, due to silicon technology improvements. Big questions: How to exploit these cores in parallel? What are the killer applications that can democratize these new models? Search, web,???

Multicore and Manycore Processors IBM Cell; NVidia GeForce 8800 (128 scalar processors) and Tesla; Sun T1 and T2; Tilera Tile64; Picochip, which combines 430 simple RISC cores; Cisco's 188-core network processor; TRIPS.

Cell Broadband Engine Cell processing elements: a standard PowerPC core and 8 SIMD cores (SPEs), with a dual XDR memory controller and two I/O controllers. Applications: game consoles, HPC. Power: about 100 W. SPE local store of 256 KB.

The NVidia Tesla The Tesla C870 GPU computing processor transforms a standard workstation into a personal supercomputer, with 128 streaming processor cores, the CUDA C-language development environment and developer tools, and a range of applications. Tesla C870 GPU specifications: - One GPU (128 thread processors) - 518 gigaflops (peak) - 1.5 GB dedicated memory - Fits in one full-length, dual-slot position with one open PCI Express x16 slot. Massively multi-threaded processor architecture: solve compute problems at your workstation that previously required a large cluster. 128 floating-point processor cores: achieve up to 350 GFLOPS of performance (512 GFLOPS peak) with one C870 GPU. Multi-GPU computing: solve large-scale problems by dividing them across multiple GPUs. Shared data memory: groups of processor cores can collaborate using shared data. High-speed PCI-Express data transfer: fast and high-bandwidth communication between the CPU and GPU.

The Nvidia Tesla

Some Performance Numbers

Some Performance Numbers The K Supercomputer Country: Japan Rated at 8 PFlops Tianhe 1 Supercomputer Country: China Rated at 2 PFlops The Jaguar Cray XT5 Country: USA Rated at 1.7 PFlops The Nebulae Country: China Rated at 1.27 PFlops Tsubame Supercomputer Country: Japan Rated at 1.19 PFlops The Eka Supercomputer Country: India Rated at 170 TFlops

The Academic Interest Algorithmics and complexity How to design parallel algorithms? What are good theoretical models for parallel computing? How to analyze parallel algorithms? Can every sequential algorithm be parallelized? What are some complexity classes with respect to parallel computing?

The Academic Interest Systems and Programming How to write parallel programs? What are some tools and environments? How to convert algorithms to efficient implementations? What are the differences from sequential programming? What are the performance measures? Can sequential programs be automatically converted to parallel programs?

The Academic Interest Architectures What are standard architectural designs? What new issues are raised due to multiple cores? Downstream concerns: Does a programmer have to worry about this? How to support the systems software as the architecture changes?

The Course Coverage Focus on algorithms and complexity: models for parallel algorithms; algorithm design methodologies with applications (semi-numerical problems, lists, trees and graphs); some parallel programming practice; complexity, characterization, and the connection to sequential complexity classes.

The PRAM Model [Figure: processors P1, P2, P3, ..., Pn connected to a global shared memory.] An extension of the von Neumann model.

The PRAM Model A set of n identical processors. A shared memory to which all processors have common access. Synchronous time steps. Access to the shared memory costs the same as a unit of computation. Different models provide semantics for concurrent access to the shared memory: EREW, CREW, CRCW (Common, Arbitrary, Priority,...).

The Semantics In all cases, it is the programmer's responsibility to ensure that the program meets the required semantics. EREW: Exclusive Read, Exclusive Write. No scope for memory contention. Usually the weakest model, and hence algorithm design is tough. CREW: Concurrent Read, Exclusive Write. Allows processors to read simultaneously from the same memory location at the same instant. Can be made practically feasible with additional hardware.

The Semantics CRCW: Concurrent Read, Concurrent Write. Allows processors to read/write simultaneously from/to the same memory location at the same instant. Requires further specification of semantics for concurrent writes. Popular variants include COMMON: a concurrent write is allowed so long as all the values being attempted are equal. Example: consider finding the Boolean OR of n bits. ARBITRARY: in case of a concurrent write, it is guaranteed that some processor succeeds and its write takes effect. PRIORITY: assumes that processors have numbers that can be used to decide which write succeeds.
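To make the COMMON variant concrete, the Boolean OR of n bits can be computed in a single parallel step: every processor whose bit is 1 writes the value 1 into the same result cell, and since all writers attempt the same value, the Common rule is satisfied. Below is a minimal sketch of that idea, using OpenMP threads to stand in for PRAM processors; the function name and the use of an atomic write are illustrative choices, not part of the PRAM model itself.

    #include <stdio.h>

    /* Boolean OR of n bits in one conceptual parallel step: every thread whose
       bit is 1 writes 1 to the shared result. All concurrent writers write the
       same value, which is exactly what the Common CRCW rule requires. */
    int common_crcw_or(const int *bits, int n) {
        int result = 0;                    /* shared memory cell, initially 0 */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            if (bits[i] == 1) {
                /* On a Common CRCW PRAM the plain concurrent write is legal;
                   on a real machine we make it atomic to avoid a data race. */
                #pragma omp atomic write
                result = 1;
            }
        }
        return result;
    }

    int main(void) {
        int bits[8] = {0, 0, 1, 0, 0, 1, 0, 0};
        printf("OR = %d\n", common_crcw_or(bits, 8));   /* prints OR = 1 */
        return 0;
    }

On an n-processor Common CRCW PRAM this takes O(1) time, whereas the same problem needs Omega(log n) time on an EREW PRAM, which is one way to see that the write semantics genuinely affect algorithmic power.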

PRAM Model Advantages and Drawbacks Advantages: A simple model for algorithm design. Hides architectural details from the designer. A good starting point. Disadvantages: Ignores architectural features such as memory bandwidth, communication cost and latency, scheduling,... The hardware may be difficult to realize.

Example 1 Matrix Multiplication One of the fundamental parallel processing tasks. Applications to several important problems in linear algebra, signal processing and optimization. Several techniques also work well in parallel.

Example 1 Matrix Multiplication Recall that in C = A x B, C[i,j] = sum over k of A[i,k].B[k,j]. Consider the following recursive approach, which works well in practice: partition A, B and C into 2 x 2 blocks,
[ A00 A01 ]   [ B00 B01 ]   [ C00 C01 ]
[ A10 A11 ] x [ B10 B11 ] = [ C10 C11 ]
where
C00 = A00.B00 + A01.B10
C01 = A00.B01 + A01.B11
C10 = A10.B00 + A11.B10
C11 = A10.B01 + A11.B11
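The recursive block decomposition maps naturally onto task parallelism: the four output quadrants C00, C01, C10 and C11 can be computed independently, while the two products that feed the same quadrant stay sequential so they can accumulate into it. The sketch below is one way to write this in C with OpenMP tasks, assuming n is a power of two; the function mm_rec, the base-case cutoff of 64 and the test values in main are illustrative choices, not part of the original slides.

    #include <stdio.h>
    #include <stdlib.h>

    /* C += A * B for n x n blocks living inside matrices of leading dimension ld.
       Follows the 2 x 2 block decomposition: C00 = A00.B00 + A01.B10, etc. */
    static void mm_rec(const double *A, const double *B, double *C, int n, int ld) {
        if (n <= 64) {                              /* base case: naive triple loop */
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
            return;
        }
        int h = n / 2;
        const double *A00 = A, *A01 = A + h, *A10 = A + h * ld, *A11 = A + h * ld + h;
        const double *B00 = B, *B01 = B + h, *B10 = B + h * ld, *B11 = B + h * ld + h;
        double       *C00 = C, *C01 = C + h, *C10 = C + h * ld, *C11 = C + h * ld + h;
        /* The four quadrants of C are independent and run as parallel tasks;
           the two products feeding a quadrant stay sequential to accumulate safely. */
        #pragma omp task
        { mm_rec(A00, B00, C00, h, ld); mm_rec(A01, B10, C00, h, ld); }
        #pragma omp task
        { mm_rec(A00, B01, C01, h, ld); mm_rec(A01, B11, C01, h, ld); }
        #pragma omp task
        { mm_rec(A10, B00, C10, h, ld); mm_rec(A11, B10, C10, h, ld); }
        #pragma omp task
        { mm_rec(A10, B01, C11, h, ld); mm_rec(A11, B11, C11, h, ld); }
        #pragma omp taskwait
    }

    int main(void) {
        int n = 256;
        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);
        for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
        #pragma omp parallel
        #pragma omp single
        mm_rec(A, B, C, n, n);
        printf("C[0][0] = %.1f (expected %.1f)\n", C[0], 2.0 * n);
        free(A); free(B); free(C);
        return 0;
    }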

Example 1 Matrix Multiplication Other approaches include Cannon's algorithm, which computes C = A x B on a 2D grid of processors. It can overlap computation with communication, and works well when the number of processors is large.
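As a rough illustration of the data movement in Cannon's algorithm, the following sequential sketch simulates an N x N grid with one matrix element per "processor" (block size 1): rows of A are initially skewed left by their row index, columns of B are skewed up by their column index, and then N rounds of local multiply-accumulate followed by single-step shifts complete the product. This only simulates the communication pattern; a real implementation would distribute blocks over processors and overlap the shifts with the local multiplications.

    #include <stdio.h>

    #define N 4   /* grid (and matrix) dimension; one element per simulated processor */

    /* Sequential simulation of Cannon's algorithm with block size 1. */
    void cannon(const double A[N][N], const double B[N][N], double C[N][N]) {
        double a[N][N], b[N][N], t[N][N];
        /* Initial alignment: row i of A shifted left by i, column j of B shifted up by j. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = A[i][(j + i) % N];
                b[i][j] = B[(i + j) % N][j];
                C[i][j] = 0.0;
            }
        for (int step = 0; step < N; step++) {
            /* Local multiply-accumulate on every simulated processor. */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    C[i][j] += a[i][j] * b[i][j];
            /* Shift A one step left, with wraparound. */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    t[i][j] = a[i][(j + 1) % N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    a[i][j] = t[i][j];
            /* Shift B one step up, with wraparound. */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    t[i][j] = b[(i + 1) % N][j];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    b[i][j] = t[i][j];
        }
    }

    int main(void) {
        double A[N][N], B[N][N], C[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = i + 1; B[i][j] = j + 1; }
        cannon(A, B, C);
        printf("C[1][2] = %.1f\n", C[1][2]);   /* sum_k A[1][k]*B[k][2] = 24 for N = 4 */
        return 0;
    }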

Example 2 New Parallel Algorithm Prefix Computations: Given an array A of n elements and an associative operation o, compute A(1) o A(2) o ... o A(i) for each i. A very simple sequential algorithm exists for this problem:
Listing 1:
    S(1) = A(1)
    for i = 2 to n do
        S(i) = S(i-1) o A(i)
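For concreteness, Listing 1 translates directly to C; addition stands in for the generic associative operation o (this choice, and the function name, are just for illustration).

    #include <stdio.h>

    /* Listing 1 in C: s[i] holds a[0] + a[1] + ... + a[i]. */
    void sequential_prefix(const int *a, int *s, int n) {
        s[0] = a[0];
        for (int i = 1; i < n; i++)
            s[i] = s[i - 1] + a[i];     /* each step depends on the previous one */
    }

    int main(void) {
        int a[8] = {3, 1, 4, 1, 5, 9, 2, 6};
        int s[8];
        sequential_prefix(a, s, 8);
        for (int i = 0; i < 8; i++)
            printf("%d ", s[i]);        /* prints 3 4 8 9 14 23 25 31 */
        printf("\n");
        return 0;
    }

Note the loop-carried dependence: S(i) needs S(i-1), which is why this algorithm does not parallelize directly.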

Parallel Prefix Computation The sequential algorithm in Listing 1 is not efficient in parallel. We need a new algorithmic approach: the balanced binary tree.

Balanced Binary Tree An algorithm design approach for parallel algorithms Many problems can be solved with this design technique. Easily amenable to parallelization and analysis.

Balanced Binary Tree A complete binary tree with a processor at each internal node. Input is at the leaf nodes. Define the operations to be executed at the internal nodes; the inputs for the operation at a node are the values at the children of that node. Computation proceeds as a tree traversal from the leaves to the root.

Balanced Binary Tree Prefix Sums [Figure: a complete binary tree of + operators with the inputs a0, a1, ..., a7 at the leaves.]

Balanced Binary Tree Sum [Figure: the sum tree for a0, ..., a7. The leaves hold a0, ..., a7; the next level holds the pairwise sums a0+a1, a2+a3, a4+a5, a6+a7; the level above holds a0+a1+a2+a3 and a4+a5+a6+a7; the root holds the sum of all ai.]

Balanced Binary Tree Sum The above approach is called an ``upward traversal'': data flows from the children to the root. It is helpful in other situations too, such as computing the maximum or evaluating an expression. Analogously, we can define a downward traversal, in which data flows from the root to the leaves; this helps in settings such as broadcasting an element.
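As a sketch of the upward traversal, the loop below simulates the tree level by level: in each of the log2(n) rounds, pairs of partial results are combined, and all combinations within a round are independent, so they can run in parallel (here with an OpenMP pragma). n is assumed to be a power of two, + stands in for any associative operation, and the function name is illustrative.

    #include <stdio.h>

    /* Upward traversal (tree reduction): log2(n) rounds; in the round with a given
       stride, element i absorbs element i + stride, so a[i] becomes the value of
       the internal node whose subtree starts at i. Destroys the contents of a. */
    int tree_sum(int *a, int n) {               /* n assumed to be a power of two */
        for (int stride = 1; stride < n; stride *= 2) {
            #pragma omp parallel for
            for (int i = 0; i < n; i += 2 * stride)
                a[i] = a[i] + a[i + stride];    /* parent gets the sum of its children */
        }
        return a[0];                            /* the root holds the total sum */
    }

    int main(void) {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("sum = %d\n", tree_sum(a, 8));   /* prints sum = 36 */
        return 0;
    }

With one processor per pair, this takes O(log n) time on a PRAM instead of the O(n) time of a sequential left-to-right sum.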

Balanced Binary Tree Can use a combination of both upward and downward traversal. Prefix computation requires that. Illustration in the next slide.

Balanced Binary Tree Sum [Figure: the sum tree for a1, ..., a8. The leaves hold a1, ..., a8; the next level holds a1+a2, a3+a4, a5+a6, a7+a8; the level above holds a1+a2+a3+a4 and a5+a6+a7+a8; the root holds the sum of all ai.]

Balanced Binary Tree Prefix Sum Upward traversal [Figure: the same sum tree for a1, ..., a8; the upward traversal computes the pairwise sums and partial sums at the internal nodes.]

Balanced Binary Tree Prefix Sum Downward traversal, even indices [Figure: using the internal-node values from the upward traversal, the downward traversal produces the prefix sums at the even positions: a1+a2, a1+a2+a3+a4, a1+...+a6 and a1+...+a8.]

Balanced Binary Tree Prefix Sum Downward traversal, odd indices [Figure: the prefix sums at the odd positions are obtained by combining the even-position prefix to the left with one more input: a1, (a1+a2)+a3, (a1+...+a4)+a5 and (a1+...+a6)+a7.]

Balanced Binary Tree Prefix Sums Two traversals of a complete binary tree. The tree is only a visual aid: map processors to locations in the tree and perform the equivalent computations. The algorithm is designed in the PRAM model; it works in logarithmic time with an optimal number of operations.
//upward traversal
1. for i = 1 to n/2 do in parallel
       b[i] = a[2i-1] o a[2i]
2. Recursively compute the prefix sums of B = (b[1], b[2], ..., b[n/2]) and store them in C = (c[1], c[2], ..., c[n/2])
//downward traversal
3. for i = 1 to n do in parallel
       i is even      : s[i] = c[i/2]
       i = 1          : s[1] = a[1]
       i is odd, i > 1: s[i] = c[(i-1)/2] o a[i]
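The three steps above translate directly into C. In the sketch below, + plays the role of the associative operation o, n is assumed to be a power of two, arrays are 1-indexed to match the pseudocode, and the "do in parallel" loops become OpenMP parallel-for loops; the recursion contributes the logarithmic depth.

    #include <stdio.h>
    #include <stdlib.h>

    /* Recursive PRAM-style prefix sums: s[i] = a[1] + ... + a[i] for i = 1..n. */
    void prefix_sums(const int *a, int *s, int n) {
        if (n == 1) { s[1] = a[1]; return; }
        int *b = malloc((n / 2 + 1) * sizeof *b);
        int *c = malloc((n / 2 + 1) * sizeof *c);
        /* Step 1 (upward): combine adjacent pairs. */
        #pragma omp parallel for
        for (int i = 1; i <= n / 2; i++)
            b[i] = a[2 * i - 1] + a[2 * i];
        /* Step 2: recursively compute the prefix sums of the half-size array. */
        prefix_sums(b, c, n / 2);
        /* Step 3 (downward): fill in the answers at even and odd positions. */
        #pragma omp parallel for
        for (int i = 1; i <= n; i++) {
            if (i == 1)          s[i] = a[1];
            else if (i % 2 == 0) s[i] = c[i / 2];
            else                 s[i] = c[(i - 1) / 2] + a[i];
        }
        free(b); free(c);
    }

    int main(void) {
        int a[9] = {0, 3, 1, 4, 1, 5, 9, 2, 6};   /* a[1..8]; a[0] is unused */
        int s[9];
        prefix_sums(a, s, 8);
        for (int i = 1; i <= 8; i++)
            printf("%d ", s[i]);                  /* prints 3 4 8 9 14 23 25 31 */
        printf("\n");
        return 0;
    }

Each level of recursion does O(n) work across its two parallel loops and halves the problem size, giving O(n) total operations and O(log n) parallel time, matching the claim on the slide.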