CS 380C: Advanced Topics in Compilers

Administration

CS 380C: Advanced Topics in Compilers

Instructor: Keshav Pingali, Professor (CS, ICES)
Office: POB 4.126A
Email: pingali@cs.utexas.edu

TA: TBD, graduate student (CS)
Office:
Email:

Meeting times
Lecture: TTh 12:30-2:00 PM, GDC 2.210
Office hours: Keshav Pingali: Tuesday 3-4 PM, POB 4.126

Prerequisites
Compilers and architecture:
- Some background in compilers (front-end stuff)
- Basic computer architecture
Software and math maturity:
- Able to implement large programs in C/C++
- Comfortable with abstractions like graph theory
- Able to read research papers and understand their content

Course material

Website for the course: http://www.cs.utexas.edu/users/pingali/cs380c/2016/index.html
All lecture notes, announcements, papers, assignments, etc. will be posted there. There is no assigned book for the course, but we will put papers and other material on the website as appropriate.

Coursework
- 4-5 programming assignments and problem sets (work in pairs)
- Term project: a substantial implementation project, based on our ideas or yours, in the area of compilers (work in pairs)
- Paper presentations towards the end of the semester

What do compilers do?

Conventional view of compilers: a program that analyzes and translates a high-level language program automatically into low-level machine code that can be executed by the hardware. It may do simple (scalar) optimizations to reduce the number of operations, and it largely ignores data structures.

Modern view of compilers: a program for translation, transformation, and verification of high-level language programs. Reordering (restructuring) the computations is as important as, if not more important than, reducing the amount of computation, and optimization of data structure computations is critical. Program analysis techniques are also useful for other applications, such as debugging, verifying the correctness of a program against a specification, and detecting malware.

Why do we need translators?

Bridging the semantic gap: programmers prefer to write programs at a high level of abstraction, but modern architectures are very complex, so getting good performance requires attention to many low-level details. Compilers let programmers write high-level programs and still get good performance on complex machine architectures.

Application portability: when a new ISA or architecture comes out, you only need to reimplement the compiler on that machine, and application programs should run without (substantial) modification. This saves a huge amount of programming effort.

Complexity of modern architectures: AMD Barcelona quad-core processor

Discussion

To get good performance on modern processors, a program must exploit:
- coarse-grain (multicore) parallelism
- the memory hierarchy (L1, L2, L3, ...)
- instruction-level parallelism (ILP)
- registers

Key questions:
- How important is it to exploit these hardware features? For cores this is obvious: if you have n cores and you run on only one, you get at most 1/n of peak performance. How about the other hardware features?
- If it is important, how hard is it to do by hand?

Let us look at memory hierarchies to get a feel for this.

Typical latencies:
L1 cache: ~1 cycle
L2 cache: ~10 cycles
Memory: ~500-1000 cycles

Software problem

Caches are useful only if programs have locality of reference:
- temporal locality: references to a given memory address are clustered together in time
- spatial locality: references that are clustered in the address space are clustered in time

Problem: programs obtained by expressing most algorithms in the straightforward way do not have much locality of reference, and worrying about locality when coding algorithms complicates the software process enormously.

Example: matrix multiplication

DO I = 1, N    //assume arrays stored in row-major order
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

Great algorithmic data reuse: each array element is touched O(N) times! All six loop permutations are computationally equivalent (even modulo round-off error). However, the execution times of the six versions can be very different if the machine has a cache.
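
To make the two notions of locality concrete, here is a minimal C sketch (mine, not from the slides) contrasting a row-wise and a column-wise sweep over a row-major array; the function names and the variable n are illustrative.

#include <stddef.h>

/* Row-wise sweep: consecutive addresses, so each cache line of b elements
 * is reused b times (good spatial locality). */
double sum_by_rows(size_t n, const double *a) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i*n + j];
    return s;
}

/* Column-wise sweep: successive accesses are n doubles apart, so once n is
 * large every access touches a different cache line (poor spatial locality). */
double sum_by_columns(size_t n, const double *a) {
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i*n + j];
    return s;
}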

IJK version (large cache) / IJK version (small cache)

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

Small cache scenario:
- Matrices are large compared to the cache; row-major storage
- Cold and capacity misses
- Misses: C: N^2/b misses (good temporal locality); A: N^3/b misses (good spatial locality); B: N^3 misses (poor temporal and spatial locality)
- Miss ratio = 0.25(b+1)/b = 0.3125 (for b = 4)

Large cache scenario:
- Matrices are small enough to fit into cache
- Only cold misses, no capacity misses
- Miss ratio: data size = 3N^2, and each miss brings in b floating-point numbers, so
  miss ratio = 3N^2 / (b * 4N^3) = 0.75/(bN) = 0.019 (b = 4, N = 10)

MMM experiments

[Figure: simulated L1 cache miss ratio for Intel Pentium III; MMM with N = 1 to 1300; 16KB L1 cache, 32B blocks, 4-way associative, 8-byte elements]

Quantifying performance differences

DO I = 1, N    //assume arrays stored in row-major order
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

Typical cache parameters: L2 cache hit: 10 cycles; cache miss: 70 cycles
Time to execute IJK version: 2N^3 + 70*0.13*4N^3 + 10*0.87*4N^3 = 73.2 N^3
Time to execute JKI version: 2N^3 + 70*0.5*4N^3 + 10*0.5*4N^3 = 162 N^3
Speed-up = 2.2

Key transformation: loop permutation
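
As a concrete (if hypothetical) C illustration of the loop-permutation point, here are analogues of the IJK and JKI versions for row-major arrays; only the loop order differs, but the stride of the inner-loop accesses, and hence the miss ratio, changes.

#include <stddef.h>

/* IJK order: A is swept along a row (stride 1), C(I,J) can stay in a register
 * for the whole inner loop, but B is swept down a column (stride n). */
void mmm_ijk(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* JKI order: B(K,J) stays fixed in the inner loop, but both A and C are
 * swept down columns (stride n), which is why this version misses more. */
void mmm_jki(size_t n, const double *a, const double *b, double *c) {
    for (size_t j = 0; j < n; j++)
        for (size_t k = 0; k < n; k++)
            for (size_t i = 0; i < n; i++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}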

Even better: loop tiling/blocking

Break MMM into a bunch of smaller MMMs so that the large cache model is true for each small MMM. Then the large cache model is valid for the entire computation, and the miss ratio will be 0.75/(bt) for the entire computation, where t is the tile size.

DO It = 1, N, t
  DO Jt = 1, N, t
    DO Kt = 1, N, t
      DO I = It, It+t-1
        DO J = Jt, Jt+t-1
          DO K = Kt, Kt+t-1
            C(I,J) = C(I,J) + A(I,K)*B(K,J)

Break the big MMM into a sequence of smaller MMMs, where each smaller MMM multiplies sub-matrices of size t x t. The parameter t (tile size) must be chosen carefully: as large as possible, while the working set of the small matrix multiplication still fits in cache.

Speed-up from tiling/blocking:
Miss ratio for the blocked computation = miss ratio for the large cache model = 0.75/(bt) = 0.001 (b = 4, t = 200)
Time to execute tiled version: 2N^3 + 70*0.001*4N^3 + 10*0.999*4N^3 = 42.3 N^3
Speed-up over JKI version = 4

Observations
- Locality-optimized code is more complex than the high-level algorithm.
- Locality optimization changed the order in which operations were done, not the number of operations.
- A fine-grain view of data structures (arrays) is critical.
- Loop orders and tile size must be chosen carefully: cache size is the key parameter, and associativity matters.
- Actual code is even more complex: it must also optimize for processor resources (registers: register tiling; pipeline: loop unrolling).
- Optimized MMM code can be ~1000s of lines of C code.
Wouldn't it be nice to have all this done automatically by a compiler? Actually, it is done automatically nowadays.
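
A minimal C sketch of the same tiling transformation for row-major arrays follows; the tile size T is a placeholder that would have to be tuned to the target cache, and the min bounds simply handle an n that is not a multiple of T.

#include <stddef.h>

#define T 64   /* assumed tile size; tune so three TxT tiles fit in cache */

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Tiled MMM: each iteration of the outer three loops multiplies one pair of
 * TxT sub-matrices of A and B into a TxT sub-matrix of C. */
void mmm_tiled(size_t n, const double *a, const double *b, double *c) {
    for (size_t it = 0; it < n; it += T)
        for (size_t jt = 0; jt < n; jt += T)
            for (size_t kt = 0; kt < n; kt += T)
                for (size_t i = it; i < min_sz(it + T, n); i++)
                    for (size_t j = jt; j < min_sz(jt + T, n); j++)
                        for (size_t k = kt; k < min_sz(kt + T, n); k++)
                            c[i*n + j] += a[i*n + k] * b[k*n + j];
}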

Performance of MMM code produced by Intel's Itanium compiler (-O3)

[Figure: bar chart of speed-up factor over -O2 (bigger is better) for -O1, -O2, -O2 + prefetch + interchange + unroll-jam + blocking (= -O3, reaching 92% of peak performance), and gcc -O4]

Discussion

Exploiting parallelism, memory hierarchies, etc. is very important:
- If a program uses only one core out of n cores, you get at most 1/n of peak performance.
- Memory hierarchy optimizations are very important and can improve performance by a factor of 10 or more.

Key points:
- Need to focus on data structure manipulation.
- Reorganization of computations and data structure layout are key; there are usually few opportunities to reduce the number of computations.
- Goto BLAS obtains close to 99% of peak, so the compiler is pretty good!

Organization of a modern compiler (our focus)

Front-end
Goal: convert the linear representation of the program into a hierarchical representation.
Input: text file
Output: abstract syntax tree + symbol table
Key modules:
- Lexical analyzer: converts the sequence of characters in the text file into a sequence of tokens
- Parser: converts the sequence of tokens into an abstract syntax tree + symbol table
- Semantic checker: performs (e.g.) type checking
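
As a rough illustration of what the front-end produces, here is a small C sketch (my own, not course code) of a token and an AST node, together with the token stream a lexer might emit for the statement x = y + 1; all names are hypothetical.

#include <stdio.h>

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_PLUS, TOK_ASSIGN, TOK_EOF } TokenKind;

/* Output of the lexical analyzer: a stream of (kind, lexeme) pairs. */
typedef struct {
    TokenKind   kind;
    const char *lexeme;
} Token;

/* Output of the parser: a tree whose interior nodes are operators and whose
 * leaves are identifiers or literals (the symbol table is kept separately). */
typedef struct AstNode {
    TokenKind       op;        /* TOK_ASSIGN, TOK_PLUS, or a leaf kind   */
    const char     *lexeme;    /* leaf payload, NULL for interior nodes  */
    struct AstNode *left, *right;
} AstNode;

int main(void) {
    /* For the source text "x = y + 1" the lexer would emit roughly: */
    Token tokens[] = {
        { TOK_IDENT, "x" }, { TOK_ASSIGN, "=" }, { TOK_IDENT, "y" },
        { TOK_PLUS, "+" },  { TOK_NUMBER, "1" }, { TOK_EOF, "" },
    };
    for (size_t i = 0; tokens[i].kind != TOK_EOF; i++)
        printf("token %zu: %s\n", i, tokens[i].lexeme);
    return 0;
}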

High-level optimizer

Goal: perform high-level analysis and optimization of the program.
Input: AST + symbol table from the front-end
Output: low-level program representation, such as three-address code
Tasks:
- Procedure/method inlining
- Array/pointer dependence analysis
- Loop transformations: unrolling, permutation, tiling, jamming, ...

Low-level optimizer

Goal: perform scalar optimizations on the low-level representation of the program.
Input: low-level representation of the program, such as three-address code
Output: optimized low-level representation + additional information such as def-use chains
Tasks:
- Dataflow analysis: live variables, reaching definitions, ...
- Scalar optimizations: constant propagation, partial redundancy elimination, strength reduction, ...

Code generator

Goal: produce assembly/machine code from the optimized low-level representation of the program.
Input: optimized low-level representation from the low-level optimizer
Output: assembly/machine code for a real or virtual machine
Tasks:
- Register allocation
- Instruction selection

Discussion (I)

Traditionally, all phases of compilation were completed before the program was executed. A new twist: virtual machines.
- Offline compiler: generates code for a virtual machine like the JVM
- Just-in-time compiler: generates code for the real machine from VM code while the program is executing
Advantages: portability, and the JIT compiler can perform optimizations for the particular input.
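
To make the scalar optimizations concrete, here is a small hand-worked example (an assumption for illustration, not the output of any particular compiler) showing a C fragment, a plausible three-address lowering of its loop body, and the effect of constant propagation and strength reduction on it.

/* Source fragment: */
int scaled_sum(int n) {
    int stride = 4;
    int total = 0;
    for (int i = 0; i < n; i++)
        total = total + i * stride;
    return total;
}

/* A plausible three-address lowering of the loop body:
 *     t1 = i * stride
 *     total = total + t1
 *
 * Constant propagation replaces stride by the constant 4:
 *     t1 = i * 4
 *     total = total + t1
 *
 * Strength reduction then replaces the multiplication by a running value t2
 * that is incremented by 4 on every iteration:
 *     total = total + t2
 *     t2 = t2 + 4
 */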

Discussion (II)

On current processors, accessing memory to fetch operands for a computation takes much longer than performing the computation:
- The performance of most programs is limited by memory latency rather than by the speed of computation (the memory wall problem).
- Reducing memory traffic (exploiting locality) is more important than optimizing scalar computations.
Another problem: energy. It takes much more energy to move data than to perform an arithmetic operation, so exploiting locality is critical for power/energy management as well.

Course content (scalar stuff)
- Introduction: compiler structure, architecture and compilation, sources of improvement
- Control flow analysis: basic blocks & loops, dominators, postdominators, control dependence
- Data flow analysis: lattice theory, iterative frameworks, reaching definitions, liveness
- Static single assignment: static single assignment form, constant propagation
- Global optimizations: loop-invariant code motion, common subexpression elimination, strength reduction
- Register allocation: coloring, allocation, live range splitting
- Instruction scheduling: pipelined and VLIW architectures, list scheduling

Course content (data structure stuff)
- Array dependence analysis: integer linear programming, dependence abstractions
- Loop transformations: linear loop transformations, loop fusion/fission, enhancing parallelism and locality
- Self-optimizing programs: empirical search, ATLAS, FFTW
- Analysis of pointer-based programs: points-to and shape analysis
- Parallelizing graph programs: amorphous data parallelism, exploiting amorphous data-parallelism
- Program verification: Floyd-Hoare style proofs, model checking, theorem provers

Lecture schedule

See http://www.cs.utexas.edu/users/pingali/cs380c/2016/index.html. Some lectures will be given by guest lecturers from my group and from industry.