Module 13: INTRODUCTION TO COMPILERS FOR HIGH PERFORMANCE COMPUTERS
Lecture 25: Supercomputing Applications

The Lecture Contains:

  Loop Unswitching
  Supercomputing Applications
  Programming Paradigms
  Important Problems
  Scheduling
  Sources and Types of Parallelism
  Model of Compiler
  Code Optimization
  Data Dependence Analysis
  Program Restructurer
  Technique to Improve Detection of Parallelism: Interactive Compilation
  Scalar Processors
  Matrix Multiplication
  Code for Scalar Processor
  Spatial Locality
  Improve Spatial Locality
  Temporal Locality
  Improve Temporal Locality
  Matrix Multiplication on Vector Machine
  Strip-mining
  Shared Memory Model

Loop Unswitching

A loop may be split into two loops: for example, a loop whose body contains a loop-invariant conditional can be replaced by a conditional that selects between two copies of the loop, one specialized for each branch of the condition.
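The before/after loops on the original page appear only as figures. As a generic illustration of the same transformation (the function names, the flag, and the loop body are placeholders, not the lecture's example):

    /* Before unswitching: the loop-invariant test on 'flag' is executed
       on every iteration.                                              */
    void before_unswitch(double *x, int n, int flag)
    {
        for (int i = 0; i < n; i++) {
            if (flag)
                x[i] = 2.0 * x[i];
            else
                x[i] = x[i] + 1.0;
        }
    }

    /* After unswitching: the invariant test is hoisted and the loop is
       duplicated, yielding two loops with no branch in their bodies.   */
    void after_unswitch(double *x, int n, int flag)
    {
        if (flag) {
            for (int i = 0; i < n; i++)
                x[i] = 2.0 * x[i];
        } else {
            for (int i = 0; i < n; i++)
                x[i] = x[i] + 1.0;
        }
    }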

Supercomputing Applications

Used in scientific computing requiring very large compute times:
  Weather prediction: a one-day prediction at 10^9 floating-point operations per second requires 3 hours
  Also used in biology, genetic engineering, astrophysics, aerospace, nuclear and particle physics, medicine, tomography, ...

The power of supercomputers comes from:
  Hardware technology: faster machines
  Multilevel architectural parallelism
    Vector: handles arrays with a single instruction
    Parallel: many processors, each capable of executing an independent instruction stream
    VLIW: handles many instructions in a single cycle

Programming Paradigms

Vector machines: very close to sequential machines; they differ in the use of arrays
Parallel machines: deal with a system of processes, each with its own thread of execution; synchronization is a very important issue. Important things to remember:
  Avoid deadlocks
  Prevent race conditions (a small example follows below)
  Avoid too many parallel threads

Performance evaluation criteria:
  Speedup vs. number of processors
  Synchronization overheads
  Number of processors that can be kept busy
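To make the race-condition point concrete (an illustrative sketch, not part of the lecture): two threads incrementing a shared counter without synchronization can lose updates, while an atomic counter cannot. The thread count and iteration count below are arbitrary.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static long unsafe_count = 0;       /* racy: plain read-modify-write */
    static atomic_long safe_count = 0;  /* race-free: atomic increment   */

    static void *bump(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            unsafe_count++;                    /* data race: updates may be lost */
            atomic_fetch_add(&safe_count, 1);  /* this update is never lost      */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* unsafe_count is often < 2000000; safe_count is always 2000000 */
        printf("unsafe=%ld safe=%ld\n", unsafe_count, (long)atomic_load(&safe_count));
        return 0;
    }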

Current software technology is unable to handle all these issues; bookkeeping is still done by the users.

Main directions of research:
  Design of concurrent algorithms
  Design of concurrent languages
  Construction of tools for software development:
    Sequential compilers and bookkeeping tools
    Parallelizing and vectorizing compilers
    Message passing libraries
    Development of mathematical libraries
  Languages: Fortran, C, Java, X10, etc.
  The dusty decks problem

Important Problems

Restructuring
  Conversion of the large body of existing sequential programs developed over the last 40 years
  Several billion lines of code exist; manual conversion is not possible
  Restructuring compilers are required to identify the parts of a program that can take advantage of the characteristics of the machine
  Manual coding is not feasible; compilers are more efficient

Scheduling
  Exploit parallelism on a given machine
  Static scheduling: done at compile time
  Dynamic scheduling: done at run time (a minimal sketch follows below)
    High overhead
    Implemented through low-level calls without involving the OS (auto-scheduling compilers)
  Auto-scheduling compilers:
    Offer new environments for executing parallel programs
    Small overhead
    Parallel execution at fine granularity is possible
  Loop scheduling: singly nested vs. multiply nested loops
  Runtime overheads: scheduling, synchronization, inter-processor communication
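As a minimal sketch of dynamic (self-)scheduling of loop iterations, assuming a shared atomic counter, a fixed chunk size, and a trivial work() body (all placeholders, not the lecture's mechanism): each worker repeatedly grabs the next chunk of iterations at run time until the iteration space is exhausted.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N 1000            /* total loop iterations (assumed)   */
    #define CHUNK 16          /* iterations handed out per request */
    #define NTHREADS 4        /* worker threads (assumed)          */

    static atomic_int next_iter = 0;   /* shared iteration counter */
    static double result[N];

    static void work(int i) { result[i] = 2.0 * i; }   /* hypothetical loop body */

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            /* grab the next chunk of iterations at run time */
            int start = atomic_fetch_add(&next_iter, CHUNK);
            if (start >= N)
                break;
            int end = (start + CHUNK < N) ? start + CHUNK : N;
            for (int i = start; i < end; i++)
                work(i);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int p = 0; p < NTHREADS; p++)
            pthread_create(&t[p], NULL, worker, NULL);
        for (int p = 0; p < NTHREADS; p++)
            pthread_join(t[p], NULL);
        printf("result[%d] = %g\n", N - 1, result[N - 1]);
        return 0;
    }

Faster threads automatically take more chunks, which is the self-scheduling behaviour the low-overhead runtime calls are meant to provide.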

Sources and Types of Parallelism

Structured: identical tasks on different data sets
Unstructured: different data streams and different instructions
Algorithm level: choice of appropriate algorithms and data structures
Programming:
  Specify parallelism in parallel languages
  Write sequential code and use compilers
  Use coarse-grain parallelism: independent modules
  Use medium-grain parallelism: loop level
  Use fine-grain parallelism: basic block or statement
Expressing parallelism in programs:
  No good languages
  Programmers are unable to exploit whatever little is available

Model of Compiler

CODE OPTIMIZATION

High-level analysis for optimization (a small before/after example follows this list):
  Common subexpression elimination
  Copy propagation
  Dead code elimination
  Code motion
  Strength reduction
  Constant folding
  Loop unrolling
  Induction variable simplification
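A small illustrative before/after pair (not from the lecture; the loop and constants are assumed) showing constant folding, strength reduction, and induction variable simplification applied by hand:

    /* Before: (2 + 3) can be folded, 5 * i is a multiply inside the loop,
       and i is used only to index a.                                      */
    void fill_before(double *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = (2 + 3) * i;
    }

    /* After constant folding (2 + 3 -> 5), strength reduction (5 * i
       becomes a running sum) and induction variable simplification
       (i is replaced by a pointer that walks the array):              */
    void fill_after(double *a, int n)
    {
        double v = 0.0;                 /* tracks 5.0 * i */
        for (double *p = a; p < a + n; p++) {
            *p = v;
            v += 5.0;
        }
    }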

Data Dependence Analysis

Check whether the program can be restructured:
  GCD test (see the worked example below)
  Banerjee's test
  I-test
  Omega test
  Architecture-specific tests

Program Restructurer

Loop restructuring:
  Statement reordering
  Loop distribution (fission)
  Breaking dependence cycles: node splitting
  Loop interchanging
  Loop skewing
  Loop jamming
  Handling while loops

Technique to Improve Detection of Parallelism: Interactive Compilation

  Back-end to the user
  Compiler directives:
    Force parallel
    No parallel
    Max-trips count
  Task identification
  No side effects
  Incremental analysis

Scalar Processors

  Machines use standard CPUs
  Main memory is in a hierarchy: RAM and cache
  Most programs exhibit locality of reference:
    The same items are referenced very close together in time (temporal locality)
    Nearby locations are referenced frequently (spatial locality)
  Programmers do not have control over memory
  Programmers can take advantage of the cache
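As a worked illustration of the GCD test named above (the loop and coefficients are assumed, not taken from the lecture): for two references a[c1*i + k1] and a[c2*i + k2] inside a loop, a dependence can exist only if gcd(c1, c2) divides k2 - k1.

    /* GCD test, worked on an assumed example.
     *
     * The write touches a[2*i] and the read touches a[2*i + 1].  A
     * dependence needs integers i, j in the loop range with
     *     2*i = 2*j + 1,   i.e.   2*i - 2*j = 1.
     * The GCD test asks whether gcd(2, 2) = 2 divides the constant 1.
     * It does not, so the two references can never address the same
     * element and no dependence exists between them.
     */
    void gcd_test_example(double *a, double *c, int n)
    {
        for (int i = 1; i <= n; i++) {
            a[2 * i] = c[i] + 1.0;      /* write: subscript 2*i     */
            c[i]     = a[2 * i + 1];    /* read:  subscript 2*i + 1 */
        }
    }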

Matrix Multiplication

    for i := 1 to n do
      for j := 1 to n do
        for k := 1 to n do
          c[i,j] = c[i,j] + a[i,k] * b[k,j]

Code for Scalar Processor

    ; f1 holds value of C[i,j]
    ; r1 holds value of n
    ; r2 holds address of A[i,1]
    ; r3 holds address of B[1,j]
    ; r13 holds size of row of B
    ; 4 is the size of an element of A

    loop:                      ; loop label
      loadf f2, (r2)           ; load A[i,k]
      loadf f3, (r3)           ; load B[k,j]
      mpyf  f4, f2, f3         ; A[i,k] * B[k,j]
      addf  f1, f1, f4         ; C = C + A * B
      addi  r2, r2, #4         ; incr pointer to A[i,k+1]
      add   r3, r3, r13        ; incr pointer to B[k+1,j]
      subi  r1, r1, #1         ; decr r1 by 1
      bnz   r1, loop           ; branch to loop if r1 != 0
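For comparison with the register-level code above, a rough C rendering of the same inner k loop with the pointer increments made explicit (the row-major layout, the names, and the separate accumulator are assumptions, not the lecture's code):

    #include <stddef.h>

    /* Computes the k-loop contribution to C[i,j]: pa walks along row i
       of A one element at a time (the 'addi r2, r2, #4'), and pb walks
       down column j of B one row at a time (the 'add r3, r3, r13').    */
    double inner_k_loop(const double *pa, const double *pb, int n, size_t row_b)
    {
        double cij = 0.0;              /* accumulator, playing the role of f1 */
        for (int k = 0; k < n; k++) {  /* r1 counts the n iterations          */
            cij += *pa * *pb;          /* C = C + A[i,k] * B[k,j]             */
            pa += 1;                   /* advance to A[i,k+1]                 */
            pb += row_b;               /* advance to B[k+1,j]                 */
        }
        return cij;                    /* caller adds this into c[i][j]       */
    }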

Spatial Locality

  The instruction count can be reduced, but it is not the only issue
  Performance is related to memory access
  Use the cache to improve performance
  Matrices are stored row major:
    Fetching A[i,k] shows good spatial locality
    Fetching B[k,j] is slow
  Re-order the loops

Improve Spatial Locality

    for i := 1 to n do
      for k := 1 to n do
        for j := 1 to n do
          c[i,j] = c[i,j] + a[i,k] * b[k,j]

Temporal Locality

  The previous program has no temporal locality for large matrices
  The entire matrix B is fetched for each i
  If the rows of B or C are large, the code does not benefit from temporal locality
  Use sub-matrix (blocked) multiplication

Improve Temporal Locality

    for it := 1 to n by s do
      for kt := 1 to n by s do
        for jt := 1 to n by s do
          for i = it to min(it+s-1, n) do
            for k = kt to min(kt+s-1, n) do
              for j = jt to min(jt+s-1, n) do
                c[i,j] = c[i,j] + a[i,k] * b[k,j]
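A runnable C rendering of the blocked (tiled) loop nest above, for experimentation; the block size S and the flat row-major layout are assumptions, not part of the lecture:

    #define S 64   /* block (tile) size; chosen here only for illustration */

    static int min_int(int x, int y) { return x < y ? x : y; }

    /* C = C + A * B with n x n row-major matrices, blocked so that the
       sub-matrices touched by the inner three loops fit in the cache.  */
    void matmul_blocked(int n, const double *a, const double *b, double *c)
    {
        for (int it = 0; it < n; it += S)
            for (int kt = 0; kt < n; kt += S)
                for (int jt = 0; jt < n; jt += S)
                    for (int i = it; i < min_int(it + S, n); i++)
                        for (int k = kt; k < min_int(kt + S, n); k++)
                            for (int j = jt; j < min_int(jt + S, n); j++)
                                c[i * n + j] += a[i * n + k] * b[k * n + j];
    }

Choosing S so that three S x S sub-matrices fit in the cache is what lets each block of B and C be reused many times before being evicted.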

Matrix Multiplication on Vector Machine

    for i := 1 to n do
      for k := 1 to n do
        c[i,1:n] = c[i,1:n] + a[i,k] * b[k,1:n]

    ; r1 holds n and r13 holds the row sizes of B and C
    ; r2 holds address of A[i,k]
    ; r3 holds address of B[k,1]
    ; r4 holds address of C[i,1]

    for i = 1 to n do
      setvl  r1                ; set vector length to n
      for k = 1 to n do
        loadf  f2, (r2)        ; load A[i,k]
        loadv  v3, (r3)        ; load B[k,1:n]
        mpyvs  v3, v3, f2      ; A[i,k]*B[k,1:n]
        loadv  v4, (r4)        ; load C[i,1:n]
        addvv  v4, v4, v3      ; update C[i,1:n]
        storev v4, (r4)        ; store C[i,1:n]
        addi   r2, r2, #4      ; point to A[i,k+1]
        add    r3, r3, r13     ; point to B[k+1,1]
      add r4, r4, r13          ; point to C[i+1,1]

The load and store of C[i,1:n] can be floated out of the k loop:

    for i = 1 to n do
      setvl  r1                ; set vector length to n
      loadv  v4, (r4)          ; load C[i,1:n]
      for k = 1 to n do
        loadf  f2, (r2)        ; load A[i,k]
        loadv  v3, (r3)        ; load B[k,1:n]
        mpyvs  v3, v3, f2      ; A[i,k]*B[k,1:n]
        addvv  v4, v4, v3      ; update C[i,1:n]
        addi   r2, r2, #4      ; point to A[i,k+1]
        add    r3, r3, r13     ; point to B[k+1,1]
      storev v4, (r4)          ; store C[i,1:n]
      add    r4, r4, r13       ; point to C[i+1,1]

Strip-mining

If n is larger than the vector register size, the code will not work. To handle the general case, the vector must be divided into strips of size m, where m is no longer than a vector register. Assuming m = 64:

    for i := 1 to n do
      for k := 1 to n do
        for js := 0 to n-1 by 64 do
          vl := min(n-js, 64)
          c[i,js+1:js+vl] = c[i,js+1:js+vl] + a[i,k] * b[k,js+1:js+vl]

Shared Memory Model

  Discover iterations which can be executed in parallel
  The master processor executes the task up to the parallel loop
  Fork a task for each processor
  Synchronize at the end of the loop

One way to parallelize matrix multiplication is:

    for i := 1 to n do
      for k := 1 to n do
        doall j := 1 to n do
          c[i,j] = c[i,j] + a[i,k] * b[k,j]
        endall
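A rough C sketch of how the doall loop above could be executed under the fork/join shared-memory model, with one POSIX thread per processor; the matrix size, the processor count, and the use of raw pthreads (rather than an auto-scheduling runtime) are assumptions for illustration only.

    #include <pthread.h>

    #define N 256        /* matrix size (assumed)          */
    #define NPROC 4      /* number of processors (assumed) */

    static double a[N][N], b[N][N], c[N][N];

    struct slice { int i, k, jlo, jhi; };   /* one processor's share of the j loop */

    static void *doall_body(void *arg)
    {
        struct slice *s = arg;
        for (int j = s->jlo; j < s->jhi; j++)
            c[s->i][j] += a[s->i][s->k] * b[s->k][j];
        return NULL;
    }

    void matmul_doall(void)
    {
        pthread_t t[NPROC];
        struct slice s[NPROC];

        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {
                /* master forks one task per processor for the j loop ... */
                for (int p = 0; p < NPROC; p++) {
                    s[p] = (struct slice){ i, k, p * N / NPROC, (p + 1) * N / NPROC };
                    pthread_create(&t[p], NULL, doall_body, &s[p]);
                }
                /* ... and synchronizes at the end of the parallel loop */
                for (int p = 0; p < NPROC; p++)
                    pthread_join(t[p], NULL);
            }
    }

A real runtime would fork the workers once and reuse them, since creating threads for every (i, k) pair costs far more than the j loop it parallelizes; this matches the lecture's point about scheduling and synchronization overheads.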