Scalarization of Index Vectors in Compiled APL
Robert Bernecky
Snake Island Research Inc
18 Fifth Street, Ward's Island
Toronto, Canada
bernecky@snakeisland.com
September 30, 2011
Abstract

High-performance compilation of array languages offers several unique challenges to the compiler writer, including fusion of loops over large arrays, detection and elimination of scalars masquerading as arbitrary arrays, and eliminating or minimizing the run-time creation of index vectors. We introduce one of those challenges in the context of SAC, a functional array language, and give preliminary results on the performance of a compiler that eliminates index vectors by scalarizing them within the optimization cycle.
The Question

How much faster is compiled APL than interpreted APL? The answer is NOT a scalar.
Environment

Dyalog APL 13.0 vs. APEX/SAC. The current SAC compiler:
- a functional array language
- data-parallel nested loops: the With-Loop
- array-based optimizations
- functional loops and conditionals as functions

Goal: compiled APL performance competitive with hand-coded C.
Some Reasons Why APL is Slow

- Fixed per-primitive overheads: syntax analysis, conformability checks, function dispatch, memory management
- Variable per-primitive overheads
- Index vector materialization

[Chart: APL vs. APEX CPU time performance, shown as speedup (APL/APEX with AWLF) across about 40 benchmarks; higher is better for APEX. APL: Dyalog APL 13.0; SAC: 17,654:MODIFIED.]
Why APL is Slow: Fixed Per-Primitive Overheads

[Chart: APL primitive overhead, time per element (microseconds) for intvec+intvec, plotted against the number of elements in the array.]

Who suffers? Apps dominated by operations on scalars: CRC, loopy histograms, dynamic programming, RNG.
Why APL is Slow: Fixed Per-Primitive Overheads

[Chart: APL vs. APEX CPU time speedup (higher is better for APEX) for crcaks, histlpaks, lltopaks, loopisaks, scsaks, testforaks. APL: Dyalog APL 13.0; SAC: 17,654:MODIFIED.]
Why APL is Slow: Fixed Per-Primitive Overheads

Scalar-dominated apps have good serial speedup... but poor parallel speedup.

[Chart: APEX/SAC parallel performance, real time, 1-6 threads (mt1-mt6) on a 6-core AMD Phenom II X6 1075T, for crc, histlp, lltop, loopis, scs, testfor. SAC: 17,654:MODIFIED.]
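The fixed-overhead effect can be sketched with a toy cost model (the constants here are invented for illustration, not measured): time per element is dominated by the fixed per-primitive cost for small arrays, and amortizes toward the raw per-element cost for large ones.

```python
def time_per_element(n, fixed_us=50.0, per_elem_us=0.01):
    """Toy cost model for one interpreted primitive (e.g. intvec+intvec):
    a fixed per-primitive overhead plus a per-element cost, reported as
    microseconds per element. The constants are invented for illustration."""
    return (fixed_us + n * per_elem_us) / n

# A 1-element array pays the whole fixed cost per element; a
# million-element array amortizes it to nearly the per-element cost.
assert time_per_element(1) > 1000 * time_per_element(1_000_000)
```

This is why scalar-dominated apps (CRC, loopy histograms) suffer most: they live on the left side of the curve, where the fixed cost is never amortized.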
Why is APL Slow? Variable Per-Primitive Overheads

- Naive execution: limited function composition, e.g. sum( iota( N))
- Array-valued intermediate results: memory madness

Who suffers? Apps dominated by operations on large arrays: signal processing, convolution, normal move-out.

[Chart: APL vs. APEX CPU time speedup (higher is better for APEX) for iotanaks, ipapeaks, ipbdaks, ipopneaks, logd3aks, logd4aks, upgradecharaks. APL: Dyalog APL 13.0; SAC: 17,654:MODIFIED.]
Why is APL Slow? Variable Per-Primitive Overheads

[Chart: APEX/SAC parallel performance, real time, 1-6 threads (mt1-mt6) on a 6-core AMD Phenom II X6 1075T, for iotan, ipape, ipbb, ipbd, ipopne, logd3, logd4, upgradechar. SAC: 17,654:MODIFIED.]
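The composition problem can be illustrated with sum( iota( N)): naive execution materializes the iota( N) intermediate before reducing it, while a fused loop never builds it. A Python sketch of the idea (not SAC code):

```python
def sum_iota_naive(n):
    # Naive: materialize the intermediate array produced by iota(n)...
    intermediate = list(range(n))
    # ...then reduce it. The resulting memory traffic is the
    # "memory madness" of array-valued intermediate results.
    return sum(intermediate)

def sum_iota_fused(n):
    # Fused: producer and reduction share one loop; no intermediate array.
    total = 0
    for i in range(n):
        total += i
    return total

assert sum_iota_naive(1000) == sum_iota_fused(1000) == 499500
```

The fused form is what With-Loop folding produces: one traversal, constant extra memory, regardless of N.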
Why is APL Slow? Materialized Index Vectors

Mike Jenkins' matrix divide model contains:

    a[;i,pi] = a[;pi,i]

[i,pi] and [pi,i] are materialized index vectors. A few simple changes scalarize them:

    tmp = a[;i]
    a[;i] = a[;pi]
    a[;pi] = tmp

The matrix divide model now runs twice as fast!
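The two forms of the column swap can be sketched in Python over a list-of-rows matrix (my own illustrative helpers, not the model's actual code): the first mirrors the indexed assign through materialized index vectors, the second mirrors the three-line scalarized rewrite.

```python
def swap_columns_materialized(a, i, pi):
    # Mirrors a[;i,pi] = a[;pi,i]: builds both index vectors explicitly.
    src, dst = [pi, i], [i, pi]
    for row in a:
        vals = [row[j] for j in src]   # gather via materialized [pi, i]
        for j, v in zip(dst, vals):    # scatter via materialized [i, pi]
            row[j] = v

def swap_columns_scalarized(a, i, pi):
    # Mirrors: tmp = a[;i]; a[;i] = a[;pi]; a[;pi] = tmp
    # No index vector is ever built.
    for row in a:
        row[i], row[pi] = row[pi], row[i]

m1 = [[1, 2, 3], [4, 5, 6]]
m2 = [[1, 2, 3], [4, 5, 6]]
swap_columns_materialized(m1, 0, 2)
swap_columns_scalarized(m2, 0, 2)
assert m1 == m2 == [[3, 2, 1], [6, 5, 4]]
```

Both produce the same result; the scalarized form simply skips all the temp-vector bookkeeping described next.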
Why Materialized Index Vectors are Expensive

Materialization of [i,pi] and [pi,i] requires all of the following steps (* marks the indexing proper; the rest is materialization overhead):

- *Increment reference counts on i and pi
- Allocate 2-element temp vector
- Initialize temp vector descriptor
- Initialize temp vector elements
- *Perform indexing
- Deallocate 2-element temp vector
- *Decrement reference counts on i and pi
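The step list can be made concrete with a toy interpreter model (the bookkeeping here is entirely hypothetical, invented to mirror the list above): the materialized path logs every step it performs, while the scalarized path skips all of the temp-vector traffic.

```python
def index_materialized(a, i, pi, log):
    # Models a[;pi,i]-style selection through a materialized index vector.
    log.append("incref i, pi")         # * part of the indexing proper
    temp = [pi, i]                     # allocate 2-element temp vector
    log.append("alloc temp vector")
    log.append("init descriptor")
    log.append("init elements")
    result = [a[j] for j in temp]
    log.append("perform indexing")     # *
    del temp
    log.append("dealloc temp vector")
    log.append("decref i, pi")         # *
    return result

def index_scalarized(a, i, pi, log):
    # Scalarized form: two direct accesses, no temp vector at all.
    log.append("perform indexing")     # *
    return [a[pi], a[i]]

mat_log, sca_log = [], []
r1 = index_materialized([10, 20, 30], 0, 2, mat_log)
r2 = index_scalarized([10, 20, 30], 0, 2, sca_log)
assert r1 == r2 == [30, 10]
assert len(mat_log) == 7 and len(sca_log) == 1
```

Seven steps per access versus one; in an inner loop, that ratio is the whole story.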
Why Materialized Index Vectors are Expensive

Who suffers?
- Apps using explicit array indexing, e.g., many apps dominated by indexed assign
- Matrix divide, compress, deal, dynamic programming
- Inner products that use the CDC STAR-100 algorithm

The 800x800 ipplusandakd benchmark: CPU time of 45 minutes!
Eliminating Materialized Index Vectors With IVE

IFL2006 paper: Index Vector Elimination (IVE), by Bernecky, Grelck, Herhut, Scholz, Trojahner, and Shafarenko. A post-optimization transformation.

Start with:

    IV = [ i, j, k]
    z = sel( IV, M)

IVE replaces:

    z = M[ IV]

by:

    IV = [ i, j, k]
    offset = vect2offset( shape(M), IV)
    z = idxsel( offset, M)

Implications:
- IV is a materialized index vector!
- The offset calculation may be liftable (LIR).
Eliminating Materialized Index Vectors With IVE

IVE: if IV can be represented as scalars, eliminate vect2offset.

Before:

    IV = [ i, j, k]
    offset = vect2offset( shape(M), IV)
    z = idxsel( offset, M)

After:

    IV = [ i, j, k]
    offset = idxs2offset( shape(M), i, j, k)
    z = idxsel( offset, M)

IV is now dead code, and can be eliminated!
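vect2offset and idxs2offset compute the same row-major offset; the only difference is whether the indices arrive packaged in a vector or as scalars. A sketch of the equivalence (my reconstruction of the semantics, not SAC's actual implementation):

```python
def vect2offset(shape, iv):
    # Row-major offset of index vector iv into an array of shape `shape`:
    # offset = ((i * shape[1]) + j) * shape[2] + k, generalized by Horner's rule.
    offset = 0
    for extent, index in zip(shape, iv):
        offset = offset * extent + index
    return offset

def idxs2offset(shape, *indices):
    # Identical computation over scalar indices: no index vector is built,
    # so the vector argument (and its allocation) disappears.
    offset = 0
    for extent, index in zip(shape, indices):
        offset = offset * extent + index
    return offset

shape = (4, 5, 6)
# (2*5 + 3)*6 + 4 = 82
assert vect2offset(shape, [2, 3, 4]) == idxs2offset(shape, 2, 3, 4) == 82
```

Once the offset is computed from scalars, the [ i, j, k] vector has no remaining consumers, which is exactly why dead-code elimination can remove it.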
Eliminating Materialized Index Vectors With IVE

IVE: if IV is NOT built from scalars, scalarize it.

Before:

    offset = vect2offset( shape(M), IV)
    z = idxsel( offset, M)

After:

    I = IV[0]
    J = IV[1]
    K = IV[2]
    JV = [ I, J, K]
    offset = vect2offset( shape(M), JV)

IV is now dead code, and can be eliminated! The earlier substitution by idxs2offset is now feasible.

Unfortunately...
Dueling Optimizations: IVE vs. LIR

- IVE introduces JV = [ I, J, K]
- Loop-Invariant Removal (LIR) lifts I, J, K, and JV out of the function
- Constant Folding (CF) replaces JV = [ I, J, K] by JV = IV
- Common-Subexpression Elimination (CSE) replaces JV by IV

This is where we came in! So, what can we do? A kludged LIR to deal with this was deemed tasteless.
Biting the Bullet

Another approach: move IVE into the optimization cycle.

- Having IVE in the optimization cycle enables other optimizations!
- But it also breaks many optimizations.
- To prevent dueling, we must scalarize ALL index vector operations!
  - e.g., unroll IV+1, and the guard and extrema functions
  - e.g., extend existing optimizations, such as CF

The Good News: I had already scalarized many index vectors for Algebraic With-Loop Folding!
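Unrolling an index-vector operation such as IV+1 means replacing the vector add with per-component scalar adds, so no vector result is ever materialized for LIR and CF to fight over. A Python sketch of the transformation, assuming a rank-3 index:

```python
def iv_plus_one_vector(iv):
    # Before: a genuine vector operation that materializes its result.
    return [x + 1 for x in iv]

def iv_plus_one_unrolled(i, j, k):
    # After: three scalar adds. The optimizer now sees only scalars,
    # so there is no vector temp for IVE, CF, and LIR to duel over.
    return i + 1, j + 1, k + 1

assert tuple(iv_plus_one_vector([2, 3, 4])) == iv_plus_one_unrolled(2, 3, 4)
```

The same unrolling has to be applied uniformly (guards, extrema, and friends), or any surviving vector op re-creates the materialization the cycle just removed.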
Current Status

Moving IVE into the optimization cycle worked, sort of:

- The Good News: ipplusandakd CPU time is now 8 seconds, instead of 45 minutes
- The Bad News: new primitives in the AST broke some optimizations!
- More Bad News: I am still fixing them!
- The Good News: performance is improving daily
- More Good News: many opportunities exist for further optimization
- Even More Good News: more parallelism is coming
How Fast Is Compiled APL?

- Array sizes affect performance
- Iteration counts affect performance
- Indexed assigns affect performance

Characterize your application, then we can provide an answer.
More informationSAC A Functional Array Language for Efficient Multi-threaded Execution
International Journal of Parallel Programming, Vol. 34, No. 4, August 2006 ( 2006) DOI: 10.1007/s10766-006-0018-x SAC A Functional Array Language for Efficient Multi-threaded Execution Clemens Grelck 1,3
More informationBuilding a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano
Building a Runnable Program and Code Improvement Dario Marasco, Greg Klepic, Tess DiStefano Building a Runnable Program Review Front end code Source code analysis Syntax tree Back end code Target code
More informationCSE341: Programming Languages Lecture 17 Implementing Languages Including Closures. Dan Grossman Autumn 2018
CSE341: Programming Languages Lecture 17 Implementing Languages Including Closures Dan Grossman Autumn 2018 Typical workflow concrete syntax (string) "(fn x => x + x) 4" Parsing Possible errors / warnings
More informationIntroduction to Compilers
Introduction to Compilers Compilers are language translators input: program in one language output: equivalent program in another language Introduction to Compilers Two types Compilers offline Data Program
More informationIR Optimization. May 15th, Tuesday, May 14, 13
IR Optimization May 15th, 2013 Tuesday, May 14, 13 But before we talk about IR optimization... Wrapping up IR generation Tuesday, May 14, 13 Three-Address Code Or TAC The IR that you will be using for
More informationCSE 504: Compiler Design. Intermediate Representations Symbol Table
Intermediate Representations Symbol Table Pradipta De pradipta.de@sunykorea.ac.kr Current Topic Intermediate Representations Graphical IRs Linear IRs Symbol Table Information in a Program Compiler manages
More informationCOMP Parallel Computing. SMM (2) OpenMP Programming Model
COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel
More informationCS 360 Programming Languages Interpreters
CS 360 Programming Languages Interpreters Implementing PLs Most of the course is learning fundamental concepts for using and understanding PLs. Syntax vs. semantics vs. idioms. Powerful constructs like
More informationThe View from 35,000 Feet
The View from 35,000 Feet This lecture is taken directly from the Engineering a Compiler web site with only minor adaptations for EECS 6083 at University of Cincinnati Copyright 2003, Keith D. Cooper,
More informationLecture Notes on Cache Iteration & Data Dependencies
Lecture Notes on Cache Iteration & Data Dependencies 15-411: Compiler Design André Platzer Lecture 23 1 Introduction Cache optimization can have a huge impact on program execution speed. It can accelerate
More informationComputer Architecture. R. Poss
Computer Architecture R. Poss 1 lecture-7-24 september 2015 Virtual Memory cf. Henessy & Patterson, App. C4 2 lecture-7-24 september 2015 Virtual Memory It is easier for the programmer to have a large
More informationA Near Memory Processor for Vector, Streaming and Bit Manipulations Workloads
A Near Memory Processor for Vector, Streaming and Bit Manipulations Workloads Mingliang Wei, Marc Snir, Josep Torrellas (UIUC) Brett Tremaine (IBM) Work supported by HPCS/PERCS Motivation Many important
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationEITF20: Computer Architecture Part2.1.1: Instruction Set Architecture
EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer
More informationCompilers. Lecture 2 Overview. (original slides by Sam
Compilers Lecture 2 Overview Yannis Smaragdakis, U. Athens Yannis Smaragdakis, U. Athens (original slides by Sam Guyer@Tufts) Last time The compilation problem Source language High-level abstractions Easy
More informationBentley Rules for Optimizing Work
6.172 Performance Engineering of Software Systems SPEED LIMIT PER ORDER OF 6.172 LECTURE 2 Bentley Rules for Optimizing Work Charles E. Leiserson September 11, 2012 2012 Charles E. Leiserson and I-Ting
More informationPetalisp. A Common Lisp Library for Data Parallel Programming. Marco Heisig Chair for System Simulation FAU Erlangen-Nürnberg
Petalisp A Common Lisp Library for Data Parallel Programming Marco Heisig 16.04.2018 Chair for System Simulation FAU Erlangen-Nürnberg Petalisp The library Petalisp 1 is a new approach to data parallel
More informationRunning class Timing on Java HotSpot VM, 1
Compiler construction 2009 Lecture 3. A first look at optimization: Peephole optimization. A simple example A Java class public class A { public static int f (int x) { int r = 3; int s = r + 5; return
More informationOpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system
OpenMP A parallel language standard that support both data and functional Parallelism on a shared memory system Use by system programmers more than application programmers Considered a low level primitives
More informationOverview of a Compiler
Overview of a Compiler Copyright 2015, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California have explicit permission to make copies of
More informationA Short History of Array Computing in Python. Wolf Vollprecht, PyParis 2018
A Short History of Array Computing in Python Wolf Vollprecht, PyParis 2018 TOC - - Array computing in general History up to NumPy Libraries after NumPy - Pure Python libraries - JIT / AOT compilers - Deep
More informationPerformance. frontend. iratrecon - rational reconstruction. sprem - sparse pseudo division
Performance frontend The frontend command is used extensively by Maple to map expressions to the domain of rational functions. It was rewritten for Maple 2017 to reduce time and memory usage. The typical
More informationWith-Loop Scalarization Merging Nested Array Operations
With-Loop Scalarization Merging Nested Array Operations Clemens Grelck 1, Sven-Bodo Scholz 2, and Kai Trojahner 1 1 University of Lübeck Institute of Software Technology and Programming Languages {grelck,trojahne}@ispuni-luebeckde
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationOptimization Prof. James L. Frankel Harvard University
Optimization Prof. James L. Frankel Harvard University Version of 4:24 PM 1-May-2018 Copyright 2018, 2016, 2015 James L. Frankel. All rights reserved. Reasons to Optimize Reduce execution time Reduce memory
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationCS122 Lecture 4 Winter Term,
CS122 Lecture 4 Winter Term, 2014-2015 2 SQL Query Transla.on Last time, introduced query evaluation pipeline SQL query SQL parser abstract syntax tree SQL translator relational algebra plan query plan
More informationChapter 7 The Potential of Special-Purpose Hardware
Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture
More informationStatic Single Assignment Form in the COINS Compiler Infrastructure - Current Status and Background -
Static Single Assignment Form in the COINS Compiler Infrastructure - Current Status and Background - Masataka Sassa, Toshiharu Nakaya, Masaki Kohama (Tokyo Institute of Technology) Takeaki Fukuoka, Masahito
More informationEECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141
EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 14 EE141 Outline Parallelism EE141 2 Parallelism Parallelism is the act of doing more
More informationECE 498AL. Lecture 12: Application Lessons. When the tires hit the road
ECE 498AL Lecture 12: Application Lessons When the tires hit the road 1 Objective Putting the CUDA performance knowledge to work Plausible strategies may or may not lead to performance enhancement Different
More informationAdvances in Memory Management and Symbol Lookup in pqr
Advances in Memory Management and Symbol Lookup in pqr Radford M. Neal, University of Toronto Dept. of Statistical Sciences and Dept. of Computer Science http://www.cs.utoronto.ca/ radford http://radfordneal.wordpress.com
More informationDomain Specific Languages for Financial Payoffs. Matthew Leslie Bank of America Merrill Lynch
Domain Specific Languages for Financial Payoffs Matthew Leslie Bank of America Merrill Lynch Outline Introduction What, How, and Why do we use DSLs in Finance? Implementation Interpreting, Compiling Performance
More informationCompiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7
Compiler Optimizations Chapter 8, Section 8.5 Chapter 9, Section 9.1.7 2 Local vs. Global Optimizations Local: inside a single basic block Simple forms of common subexpression elimination, dead code elimination,
More informationCompiler Code Generation COMP360
Compiler Code Generation COMP360 Students who acquire large debts putting themselves through school are unlikely to think about changing society. When you trap people in a system of debt, they can t afford
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationUnder the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world.
Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Supercharge your PS3 game code Part 1: Compiler internals.
More information