
CS 612: Software Design for High-performance Architectures

Course Organization
Lecturer: Paul Stodghill, stodghil@cs.cornell.edu, Rhodes 496
TA: Jim Ezick, ezick@cs.cornell.edu, 4139 Upson
URL: http://www.cs.cornell.edu/courses/cs612/2001sp/
Prerequisites: Experience in writing moderate-sized (about 2000 lines) programs, and interest in software for high-performance computers. CS 412 is desirable but not essential.
Lectures: two per week
Course-work: Four or five assignments, which will involve work on workstations and the IBM SP-2, and a substantial final programming project.

Resources
Books:
- "Advanced Compiler Design and Implementation", Steve Muchnick, Morgan Kaufmann Publishers.
- "Introduction to Parallel Computing", Vipin Kumar et al., Benjamin/Cummings Publishers.
- "Computer Architecture: A Quantitative Approach", Hennessy and Patterson, Morgan Kaufmann Publishers.
Conferences:
- ACM Symposium on Principles and Practice of Parallel Programming
- ACM SIGPLAN Symposium on Programming Language Design and Implementation
- International Conference on Supercomputing
- Supercomputing

Objective
We will study software systems that permit application programs to exploit the power of modern high-performance computers.
(Diagram: applications such as Computational Science and Database Systems(?) sit on top of Software Systems, which run on three classes of hardware.)
- High-performance workstations: SGI Octane, DEC Alpha
- Shared-memory multiprocessors: SGI Origin 2000, CRAY T3E
- Distributed-memory multiprocessors: IBM SP
There will be some emphasis on applications and architecture, with the primary emphasis on restructuring compilers, parallel languages (HPF), and libraries (OpenMP, MPI).

Conventional Software Environment
- Languages: FORTRAN, C, or C++
- Compiler: GNU (Dragon-book optimizations)
- O/S: UNIX
This software environment is not adequate for modern high-performance computers. To understand this, let us look at some high-performance computers.

The CONVEX Exemplar: A Shared-Memory Multiprocessor
(Diagram: each hypernode contains PA-RISC CPUs with caches, private memories, a network cache, and an agent connected to a global memory interface; a network of hypernodes is joined by the hypernode interconnect.)
Parallelism:
- Coarse-grain parallelism: processors operate in parallel.
- Instruction-level parallelism: each processor is pipelined.
Memory latencies:
- Processor cache: 10 ns
- CPU private memory: 500 ns
- Hypernode private memory: 500 ns
- Network cache: 500 ns
- Inter-hypernode shared memory: 2 microseconds
Within a hypernode: SMP (Symmetric MultiProcessor). Across hypernodes: NUMA (Non-Uniform Memory Access machine). Locality of reference is extremely important.
Programming model: C/FORTRAN + OpenMP

Distributed-memory computers: each processor has a different address space (e.g., IBM SP-2).
Programming model: C/FORTRAN + MPI
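As a minimal, hedged C sketch of this programming model (assuming a standard MPI installation; the variable names are illustrative and not from the course), every process runs its own copy of the program with its own private address space, and data moves only through explicit messages:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes */

        /* x lives in this process's private address space:
           there is one independent copy per process. */
        int x = 100 * rank;

        /* the only way to see another process's copy is to send it */
        if (rank == 0 && size > 1) {
            int remote;
            MPI_Recv(&remote, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 0: my x = %d, process 1's x = %d\n", x, remote);
        } else if (rank == 1) {
            MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Such a program would typically be compiled with mpicc and launched with mpirun, one process per processor.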

The AC3 Cluster: network of SMP nodes
(Diagram: each node has Pentium III CPUs with caches sharing a memory over a fast bus, plus an agent and a network interface; 64 such nodes are joined by a Giganet interconnect.)
- Each node is a 4-way SMP with Pentium III processors.
- 64 nodes are connected by a Giganet interconnect.
- Within each node, we have a shared-memory multiprocessor.
- Across nodes, we have a distributed-memory multiprocessor.
=> Programming such a hybrid machine is even more complex! (A hedged sketch of the hybrid style follows.)
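For illustration only (this code is not from the course materials; the problem size N and the work done are placeholders), a hybrid program typically uses MPI between nodes and OpenMP among the processors within a node:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <omp.h>

    #define N 1048576                /* placeholder problem size */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* distributed memory: each MPI process (one per node) owns a chunk */
        int chunk = N / nprocs;      /* assume N is divisible by nprocs */
        double *local = malloc(chunk * sizeof(double));
        for (int i = 0; i < chunk; i++)
            local[i] = rank * chunk + i;

        /* shared memory: the 4 CPUs within a node sum the chunk in parallel */
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < chunk; i++)
            sum += local[i];

        /* combine per-node results across the distributed-memory machine */
        double total;
        MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %f (%d nodes x %d threads)\n",
                   total, nprocs, omp_get_max_threads());

        free(local);
        MPI_Finalize();
        return 0;
    }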

Pipelined processors: must exploit instruction-level parallelism.
(Diagram: instruction fetch and branch logic feeding separate floating-point and integer pipelines.)

Lessons for software
To obtain good performance on such high-performance computers, an application program must
- exploit coarse-grain parallelism,
- exploit instruction-level parallelism,
- exploit temporal and spatial locality of reference.
Let us study how this is done, and understand why it is so hard to worry about both parallelism and locality simultaneously.

Exploiting coarse-grain parallelism
Matrix-vector multiplication, y = A x:
    do j = 1..N
      do i = 1..N
        Y[i] = Y[i] + A[i,j]*X[j]
Each row of the matrix can be multiplied by x in parallel (i.e., the inner loop is a parallel loop). If addition is assumed to be commutative and associative, then the outer loop is a parallel loop as well.
Question: How do we tell which loops are parallel?
    do i = 1..N                         do i = 1..N
      x(2*i+1) = ...x(2*i)...             x(i+1) = ...x(i)...
One of these loops is parallel, the other is sequential!
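As a hedged C/OpenMP sketch of this observation (array names and the size N are illustrative), the loop over rows can be marked parallel directly, while accumulating within a row reorders the additions and is therefore only legal if floating-point addition is treated as associative:

    #define N 1024    /* illustrative size */

    /* Matrix-vector product Y = Y + A*X.  The loop over rows (i) is a
       parallel loop: different iterations write different Y[i].
       Summing over j inside a row is a reduction, which reorders the
       additions relative to the original j-outer loop. */
    void mvm(double A[N][N], double X[N], double Y[N]) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += A[i][j] * X[j];
            Y[i] += sum;
        }
    }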

To exploit pipelines, instructions must be scheduled properly.
LOADs are not overlapped:
    LOAD R1, M1
    R1 <- R1 + 1
    LOAD R2, M2
    R2 <- R2 + 1
LOADs are overlapped:
    LOAD R1, M1
    LOAD R2, M2
    R1 <- R1 + 1
    R2 <- R2 + 1
- Software pipelining: instruction reordering across loop boundaries.
- Hardware vs. software:
  - Superscalar architectures: the processor performs reordering on the fly (Intel P6, AMD K5, PA-8000).
  - VLIW, in-order issue architectures: hardware issues instructions in order (CRAY, DEC Alpha 21164).

Exploiting locality (I)
    do i = 1..N
      do j = 1..N
        Y[i] = Y[i] + A[i,j]*X[j]
(y = A x on a distributed-memory machine: processor/memory pairs connected by an interconnection network.)
- Computation distribution: which iterations does a processor do?
- Data distribution: which data is mapped to each processor?
Computation and data distributions should be "aligned" to optimize locality: a processor should "own" the data it needs for its computation. Misaligned references: communication.
Question: What are good distributions for MVM?
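One reasonable answer, sketched here in hedged MPI C (a block-row distribution with X replicated; names and sizes are illustrative, not the course's prescribed solution): each processor owns a block of rows of A and the matching entries of Y, so the compute loop touches only local data.

    #include <stdlib.h>
    #include <mpi.h>

    #define N 1024                    /* illustrative size; assume N % nprocs == 0 */

    /* Block-row distribution: process p owns rows [p*rows_per, (p+1)*rows_per)
       of A and the corresponding entries of Y; X is replicated on every process. */
    void distributed_mvm(int rank, int nprocs) {
        int rows_per = N / nprocs;
        double *A_local = malloc((size_t)rows_per * N * sizeof(double));
        double *Y_local = calloc(rows_per, sizeof(double));
        double *X       = malloc(N * sizeof(double));
        double *Y       = malloc(N * sizeof(double));

        /* ... fill A_local with this process's rows (selected by rank) and X ... */
        (void)rank;

        /* Computation distribution matches the data distribution: each
           process computes only the rows it owns, so this loop needs no
           communication (all references are aligned). */
        for (int i = 0; i < rows_per; i++)
            for (int j = 0; j < N; j++)
                Y_local[i] += A_local[i * N + j] * X[j];

        /* If the full result is needed everywhere, gather the owned pieces. */
        MPI_Allgather(Y_local, rows_per, MPI_DOUBLE,
                      Y, rows_per, MPI_DOUBLE, MPI_COMM_WORLD);

        free(A_local); free(Y_local); free(X); free(Y);
    }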

Exploiting locality (II)
    do i = 1..N
      do j = 1..N
        Y[i] = Y[i] + A[i,j]*X[j]
(y = A x on a uniprocessor: the processor has a 1st- and 2nd-level cache and local memory.)
Uniprocessor locality: the program must have spatial and temporal locality of reference to exploit caches. Straightforward coding of most algorithms results in programs with poor locality.
Data shackling: automatic blocking of codes to improve locality.
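A hedged C sketch of what blocking (tiling) does to the matrix-vector loop; the tile size B is a placeholder that would be tuned to the cache size:

    #define N 4096     /* illustrative size; assume N % B == 0 */
    #define B 256      /* tile size: a block of X should fit in cache */

    /* Blocked (tiled) matrix-vector product Y = Y + A*X.  The j loop is
       split into tiles so that the block X[jj..jj+B) is reused from the
       cache across all N rows before the next block is brought in
       (temporal locality); walking each row of A in order gives
       spatial locality. */
    void mvm_blocked(double A[N][N], double X[N], double Y[N]) {
        for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++)
                    Y[i] += A[i][j] * X[j];
    }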

Worrying simultaneously about parallelism and locality is hard.
Radical solution: multithreaded processors.
- Forget about locality.
- The processor maintains a pool of active threads.
- When the current thread makes a non-local memory reference, the processor switches to a different thread.
- If the cost of context-switching is small, this can be a win.
Examples: Tera, IBM Blue Gene machine.

Summary
To obtain good performance, an application program must
- exploit coarse-grain parallelism,
- exploit temporal and spatial locality of reference,
- exploit instruction-level parallelism.
Systems software must support
- low-cost process management,
- low-latency communication,
- efficient synchronization.

Mismatch with conventional software environments:
- Conventional languages do not permit expression of parallelism or locality.
- Optimizing compilers focus only on reducing the operation count of the program.
- O/S protocols for activities like inter-process communication are too heavy-weight.
- New problems: load balancing.
=> Need to re-design languages, compilers and systems software to support applications that demand high-performance computation.

Lecture Topics
- Applications requirements: examples from computational science.
- Architectural concerns: shared- and distributed-memory computers, memory hierarchies, multithreaded processors, pipelined processors.
- Explicitly parallel languages: MPI and OpenMP APIs.
- Restructuring compilers: program analysis and transformation.
- Memory hierarchy management: block algorithms, tiling and shackling.

Lecture Topics (cont.)
- Automatic parallelization: shared- and distributed-memory parallelization, High Performance FORTRAN.
- Program optimization: control dependence, static single assignment form, dataflow analysis, optimizations.
- Exploiting instruction-level parallelism: instruction scheduling, software pipelining.
- Object-oriented languages: object models and inheritance.