CS 395T: Topics in Multicore Programming
Lecture 1: Administration, Prerequisites, and Why Study Parallel Programming

Administration
- University of Texas at Austin, Fall 2009
- Instructor: Keshav Pingali (CS, ICES), 4.126A ACES, email: pingali@cs.utexas.edu
- TA: Aditya Rawal, email: 83.aditya.rawal@gmail.com

Prerequisites
- A course in computer architecture (e.g., the book by Hennessy and Patterson)
- A course in compilers (e.g., the book by Allen and Kennedy)
- Self-motivation: willingness to learn on your own to fill in gaps in your knowledge

Why study parallel programming?
- There is a fundamental, ongoing change in the computer industry.
- Until recently, Moore's law(s):
  1. The number of transistors on a chip doubles every 1.5 years. Those transistors were used to build complex superscalar processors, deep pipelines, etc. to exploit instruction-level parallelism (ILP); see the sketch below.
  2. Processor frequency doubles every 1.5 years, so speed goes up by a factor of roughly 10 every 5 years.
  Many programs ran faster if you just waited a while.
- The fundamental change:
  - Micro-architectural innovations for exploiting ILP are reaching their limits.
  - Clock speeds are not increasing any more because of power problems.
  - Programs will not run any faster if you wait. Let us understand why.
[Photo: Gordon Moore]
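The ILP mentioned above is easiest to see in code. In the loop below (a generic sketch of my own, not from the slides), every iteration is independent, so a superscalar core can overlap the multiplies, adds, loads, and stores of several iterations, and the compiler can unroll the loop to expose even more of that parallelism, all without any effort from the programmer:

    /* y[i] = a*x[i] + y[i]: iterations are independent, so a superscalar
       processor can issue work from several of them in the same cycle. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }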

(1) Micro-architectural approaches to improving processor performance
- Add functional units
  - Superscalar is known territory; there are diminishing returns for adding more functional blocks.
  - Alternatives like VLIW have been considered and rejected by the market.
- Wider data paths
  - Increasing bandwidth between functional units in a core makes a difference, such as a comprehensive 64-bit design, but then where to?
[Figure: transistor counts from the i4004 through the Pentium and R10000, 1970-2005 (from Paul Teisch, AMD)]

(1) Micro-architectural approaches (contd.)
- Deeper pipeline
  - A deeper pipeline buys frequency at the expense of increased branch misprediction and cache miss penalties.
  - Deeper pipelines mean higher clock frequency and therefore more power.
  - Industry is converging on a middle ground of 9 to 11 stages; successful RISC CPUs are in the same range.
- More cache
  - More cache buys performance until the working set of the program fits in cache.
  - Exploiting caches requires help from the programmer/compiler, as we will see (a blocking sketch follows below).

(2) Processor clock speeds
- Old picture: processor clock frequency doubled every 1.5 years.
- New picture: power problems limit further increases in clock frequency (see the next couple of slides).
[Figure: clock rate (MHz) vs. year, 1970-2000]
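The remark that exploiting caches needs help from the programmer or compiler can be made concrete with loop tiling (cache blocking). The sketch below is a minimal, untuned illustration; the tile size TILE and the function name are my own assumptions, not course code:

    #define TILE 64  /* assumed tile size; a real value must be tuned to the cache */

    /* C += A*B for n x n row-major matrices. Tiling the loops keeps a
       TILE x TILE working set of each matrix in cache instead of
       streaming entire rows and columns on every iteration. */
    void matmul_blocked(int n, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < n; ii += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int jj = 0; jj < n; jj += TILE)
                    for (int i = ii; i < ii + TILE && i < n; i++)
                        for (int k = kk; k < kk + TILE && k < n; k++) {
                            double a = A[i*n + k];
                            for (int j = jj; j < jj + TILE && j < n; j++)
                                C[i*n + j] += a * B[k*n + j];
                        }
    }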

[Figure: static current vs. frequency, ranging from embedded parts (fast, low power) to very high leakage and power (fast, high power). Static current rises non-linearly as processors approach their maximum frequency.]

[Figure: power density (W/cm^2) vs. year, 1970-2010, for the 4004 through the Pentium and P6, with reference lines for a hot plate, a nuclear reactor, a rocket nozzle, and the sun's surface. Source: Patrick Gelsinger, Intel]

Recap
- Old picture, Moore's law(s):
  1. The number of transistors doubled every 1.5 years; these were used to implement micro-architectural innovations for ILP.
  2. Processor clock frequency doubled every 1.5 years.
  Many programs ran faster if you just waited a while.
- New picture, Moore's law:
  1. The number of transistors still doubles every 1.5 years, but micro-architectural innovations for ILP are flat-lining.
  Processor clock frequencies are not increasing very much.
  Programs will not run faster if you wait a while.
- Questions:
  - Hardware: what do we do with all those extra transistors?
  - Software: how do we keep speeding up program execution?

One hardware solution: go multicore
- Design choices: use semiconductor technology improvements to build multiple cores without increasing clock frequency.
  - This does not require micro-architectural breakthroughs.
  - Non-linear scaling of power density with frequency will not be a problem.
- Prediction: from now on, the number of cores will double every 1.5 years (from Saman Amarasinghe, MIT).
- Homogeneous multicore processors: a large number of identical cores.
- Heterogeneous multicore processors: cores have different functionalities.
  - It is likely that future processors will be heterogeneous multicores: migrate important functionality into special-purpose hardware (e.g., codecs), which is much more power-efficient than executing the program on a general-purpose core. The trade-off is programmability.

Problem: multicore software
- Parallel programming gives more aggregate performance for:
  - Multi-tasking
  - Transactional apps: many instances of the same app
  - Multi-threaded apps (our focus)
- Problem: most apps are not multithreaded, and writing multithreaded code increases software costs dramatically (a factor of 3 for the Unreal game engine, per Tim Sweeney, Epic Games).
- The great multicore software quest: can we write programs so that performance doubles when the number of cores doubles?
- This is a very hard problem for many reasons (see later): Amdahl's law (worked out below), locality, overheads of parallel execution, load balancing.
- "We are at the cusp of a transition to multicore, multithreaded architectures, and we still have not demonstrated the ease of programming the move will require. I have talked with a few people at Microsoft Research who say this is also at or near the top of their list [of critical CS research problems]." (Justin Rattner, CTO, Intel)

Parallel programming is still a research problem
- The community has worked on parallel programming for more than 30 years: programming models, machine models, programming languages.
- However, parallel programming is still a research problem:
  - Matrix computations, stencil computations, FFTs, etc. are well understood.
  - There are few insights for other applications; each new application is a new phenomenon.
- Thesis: we need a science of parallel programming.
  - Analysis: a framework for thinking about the parallelism in an application.
  - Synthesis: producing an efficient parallel implementation of the application.
- Analogy: the science of electro-magnetism.
[Painting: "The Alchemist", Cornelius Bega (1663)]

Course objective
- Create a science of parallel programming.
- Structure: understand the patterns of parallelism and locality in applications.
- Analysis: abstractions for reasoning about parallelism and locality in applications; programming models based on these abstractions; tools for quantitative estimates of parallelism and locality.
- Synthesis: exploiting structure to produce efficient implementations.
[Diagram: seemingly unrelated phenomena, unifying abstractions, specialized models that exploit structure]
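Among the obstacles listed above, Amdahl's law is easy to state: if a fraction f of a program's execution can be parallelized and the rest is serial, the best possible speedup on p cores is 1 / ((1 - f) + f/p). A tiny C helper (my own illustration, not course code) makes the point:

    /* Amdahl's law: f = parallelizable fraction, p = number of cores. */
    double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }
    /* Example: even with f = 0.9, amdahl_speedup(0.9, 16) is only
       1 / (0.1 + 0.05625) = 6.4, far short of doubling with every
       doubling of the core count. */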

Approach
- Niklaus Wirth's aphorism: Algorithms + Data structures = Programs.
  - Algorithms: a description of the computation, expressed in terms of abstract data types (ADTs) like sets, matrices, and graphs.
  - Data structures: concrete implementations of ADTs. For example, matrices can be represented using arrays, space-filling curves, etc.; graphs can be represented using adjacency matrices, adjacency lists, etc.
- Strategy:
  - Study parallelism and locality in algorithms, independent of concrete data structures. What structure can we exploit for an efficient implementation?
  - Study the concrete parallel data structures required to support parallelism in algorithms. What structure can we exploit for an efficient implementation?

Example: structure in algorithms
- Iterative algorithms can be classified by:
  - Topology: general graph, grid, tree.
  - Operator: morph (modifies the structure of the graph), local computation (only updates values on nodes/edges), reader (does not modify the graph in any way).
  - Ordering: unordered, ordered.
- We will elaborate on this structure in a couple of weeks.

Course content
- Structure of parallelism and locality in important algorithms: computational science algorithms, graph algorithms.
- Algorithm abstractions: dependence graphs, halographs.
- Multicore architectures: interconnection networks, caches and cache coherence, memory consistency models, locks and lock-free synchronization.
- Parallel data structures: linearizability, array and graph partitioning, lock-free data structures and transactional memory.
- Optimistic parallel execution of programs.
- Scheduling and load balancing.

Course content (contd.)
- Locality: spatial and temporal locality, cache blocking, cache-oblivious algorithms.
- Static program analysis techniques: array dependence analysis, points-to and shape analysis.
- Performance models: PRAM, BPRAM, LogP.
- Special topics: self-optimizing software and machine-learning techniques for optimization; GPUs and GPU programming; parallel programming languages/libraries: Cilk, PGAS languages, OpenMP, TBBs, map-reduce, MPI (a minimal OpenMP sketch follows below).
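As a small taste of one of the programming models listed under special topics, here is a minimal OpenMP loop in C. It is a generic sketch under my own naming, not an example from the course; it needs an OpenMP-enabled compiler (e.g., gcc -fopenmp):

    #include <omp.h>

    /* Each iteration is independent, so OpenMP can split the index range
       across the available cores with a single pragma. */
    void scale(double *y, const double *x, double a, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];
    }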

Course work
- A small number of programming assignments
- Paper presentations
- A substantial final project
- Participation in class discussions