Scalarization of Index Vectors in Compiled APL


1 Robert Bernecky, Snake Island Research Inc, 18 Fifth Street, Ward's Island, Toronto, Canada. bernecky@snakeisland.com. September 30, 2011

2 Abstract: High-performance compilation of array languages offers several unique challenges to the compiler writer, including fusion of loops over large arrays, detection and elimination of scalars masquerading as arbitrary arrays, and elimination or minimization of the run-time creation of index vectors. We introduce one of those challenges in the context of SAC, a functional array language, and give preliminary results on the performance of a compiler that eliminates index vectors by scalarizing them within the optimization cycle.

3 The Question How much faster is compiled APL than interpreted APL?

4 The Question How much faster is compiled APL than interpreted APL? The answer is NOT a scalar.

5 Environment Dyalog APL 13.0 vs. APEX/SAC

6 Environment
Dyalog APL 13.0 vs. APEX/SAC
The current SAC compiler:
- a functional array language
- data-parallel nested loops: the With-Loop
- array-based optimizations
- functional loops and conditionals, expressed as functions

7 Environment
Dyalog APL 13.0 vs. APEX/SAC
The current SAC compiler:
- a functional array language
- data-parallel nested loops: the With-Loop
- array-based optimizations
- functional loops and conditionals, expressed as functions
Goal: compiled APL performance competitive with hand-coded C

8 Some Reasons Why APL is Slow
Fixed per-primitive overheads: syntax analysis, conformability checks, function dispatch, memory management
Variable per-primitive overheads
Index vector materialization
[Chart: APL vs. APEX CPU-time performance across the full benchmark suite; speedup (APL/APEX with AWLF), higher is better for APEX; SAC 17654:MODIFIED vs. Dyalog APL 13.0]

9 Why APL is Slow: Fixed Per-Primitive Overheads
[Chart: APL primitive overhead; time per element in microseconds for intvec+intvec, plotted against the number of elements in the array]
Who suffers? Apps dominated by operations on scalars: CRC, loopy histograms, dynamic programming, RNG
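To make the amortization concrete, here is a tiny C model of the curve in that chart. It is a sketch, not measured data: the constants fixed_us and per_elem_us are invented for illustration.

```c
#include <stdio.h>

/* Illustrative cost model for one interpreted primitive such as
   intvec+intvec: a fixed per-call overhead (syntax analysis,
   conformability checks, function dispatch, memory management)
   amortized over N elements, plus a small marginal cost per element.
   The constants are invented for illustration only. */
int main(void) {
    const double fixed_us = 5.0;      /* hypothetical fixed overhead (us) */
    const double per_elem_us = 0.002; /* hypothetical marginal cost (us)  */
    for (long n = 1; n <= 1000000; n *= 10)
        printf("N=%8ld  time/element = %.5f us\n",
               n, fixed_us / (double)n + per_elem_us);
    return 0;
}
```

For small N the fixed term dominates completely, which is why scalar-dominated applications suffer most.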

10 Why APL is Slow: Fixed Per-Primitive Overheads
[Chart: APL vs. APEX CPU-time performance on the scalar-dominated benchmarks crc, histlp, lltop, loopis, scs, and testfor; speedup (APL/APEX with AWLF), higher is better for APEX; SAC 17654:MODIFIED vs. Dyalog APL 13.0]

11 Why APL is Slow: Fixed Per-Primitive Overheads
Scalar-dominated apps have good serial speedup... but poor parallel speedup
[Chart: APEX/SAC parallel performance, real time, SAC 17654:MODIFIED, on a 6-core AMD Phenom II X6 1075T; execution time for crc, histlp, lltop, loopis, scs, and testfor at 1 to 6 threads]

12 Why is APL Slow? Variable Per-Primitive Overheads Naive execution: limited function composition, e.g., sum(iota(N))

13 Why is APL Slow? Variable Per-Primitive Overheads
Naive execution: limited function composition, e.g., sum(iota(N))
Array-valued intermediate results: memory madness

14 Why is APL Slow? Variable Per-Primitive Overheads
Naive execution: limited function composition, e.g., sum(iota(N))
Array-valued intermediate results: memory madness
Who suffers? Apps dominated by operations on large arrays: signal processing, convolution, normal move-out
[Chart: APL vs. APEX CPU-time performance on the large-array benchmarks iotan, ipape, ipbd, ipopne, logd3, logd4, and upgradechar; speedup (APL/APEX with AWLF), higher is better for APEX]

15 Why is APL Slow? Variable Per-Primitive Overheads
Who suffers? Apps dominated by operations on large arrays: signal processing, convolution, normal move-out
[Chart: APEX/SAC parallel performance, real time, SAC 17654:MODIFIED, on a 6-core AMD Phenom II X6 1075T; execution time for iotan, ipape, ipbb, ipbd, ipopne, logd3, logd4, and upgradechar at 1 to 6 threads]

16 Why is APL Slow? Materialized Index Vectors Mike Jenkins' matrix divide model: a[;i,pi] = a[;pi,i]

17 Why is APL Slow? Materialized Index Vectors
Mike Jenkins' matrix divide model: a[;i,pi] = a[;pi,i]
[i,pi] and [pi,i] are materialized index vectors

18 Why is APL Slow? Materialized Index Vectors
Mike Jenkins' matrix divide model: a[;i,pi] = a[;pi,i]
[i,pi] and [pi,i] are materialized index vectors
A few simple changes scalarize the index vectors:
tmp = a[;i]
a[;i] = a[;pi]
a[;pi] = tmp

19 Why is APL Slow? Materialized Index Vectors
Mike Jenkins' matrix divide model: a[;i,pi] = a[;pi,i]
[i,pi] and [pi,i] are materialized index vectors
A few simple changes scalarize the index vectors:
tmp = a[;i]
a[;i] = a[;pi]
a[;pi] = tmp
The matrix divide model now runs twice as fast!
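A minimal C sketch of that fix, assuming a row-major rows x cols matrix of doubles (the function name and layout are my assumptions): the swap uses only scalar indexing, so no 2-element index vectors are ever built.

```c
/* Swap columns i and pi of a row-major rows x cols matrix in place,
   the C analogue of:
       tmp = a[;i]
       a[;i] = a[;pi]
       a[;pi] = tmp
   Only scalar indexing occurs; no index vectors are materialized. */
static void swap_columns(double *a, int rows, int cols, int i, int pi) {
    for (int r = 0; r < rows; r++) {
        double tmp = a[r * cols + i];
        a[r * cols + i] = a[r * cols + pi];
        a[r * cols + pi] = tmp;
    }
}
```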

20 Why Materialized Index Vectors are Expensive Materialization of [i,pi] and [pi,i]:

21 Why Materialized Index Vectors are Expensive
Materialization of [i,pi] and [pi,i] (items marked * are the indexing work proper):
* Increment reference counts on i and pi
- Allocate 2-element temp vector
- Initialize temp vector descriptor
- Initialize temp vector elements
* Perform indexing
- Deallocate 2-element temp vector
* Decrement reference counts on i and pi
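For contrast with the scalarized swap above, here is a C sketch of roughly what an interpreter does for the materialized form a[;i,pi] = a[;pi,i] (my rendering of the steps just listed, not Dyalog's actual code; the reference-count updates are elided):

```c
#include <stdlib.h>

/* The materialized form: build 2-element temp index vectors, copy the
   right-hand side (as APL semantics require), perform the indexed
   assignment, then tear everything down again -- on every call. */
static void indexed_assign_materialized(double *a, int rows, int cols,
                                        int i, int pi) {
    int *dst = malloc(2 * sizeof *dst);        /* allocate temp vectors     */
    int *src = malloc(2 * sizeof *src);
    dst[0] = i;  dst[1] = pi;                  /* initialize their elements */
    src[0] = pi; src[1] = i;
    double *rhs = malloc((size_t)rows * 2 * sizeof *rhs);
    for (int r = 0; r < rows; r++)             /* materialize a[;pi,i]      */
        for (int c = 0; c < 2; c++)
            rhs[r * 2 + c] = a[r * cols + src[c]];
    for (int r = 0; r < rows; r++)             /* perform the indexing      */
        for (int c = 0; c < 2; c++)
            a[r * cols + dst[c]] = rhs[r * 2 + c];
    free(rhs);                                 /* deallocate temp vectors   */
    free(dst);
    free(src);
}
```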

22 Why Materialized Index Vectors are Expensive Who suffers? Apps using explicit array indexing

23 Why Materialized Index Vectors are Expensive Who suffers? Apps using explicit array indexing e.g., many apps dominated by indexed assign

24 Why Materialized Index Vectors are Expensive Who suffers? Apps using explicit array indexing e.g., many apps dominated by indexed assign Who suffers? Matrix divide, compress, deal, dynamic programming

25 Why Materialized Index Vectors are Expensive Who suffers? Apps using explicit array indexing e.g., many apps dominated by indexed assign Who suffers? Matrix divide, compress, deal, dynamic programming Who suffers? Inner products that use the CDC STAR-100 algorithm

26 Why Materialized Index Vectors are Expensive Who suffers? Apps using explicit array indexing e.g., many apps dominated by indexed assign Who suffers? Matrix divide, compress, deal, dynamic programming Who suffers? Inner products that use the CDC STAR-100 algorithm 800x800 ipplusandakd CPU time: 45 minutes!

27 Eliminating Materialized Index Vectors With IVE IFL2006 paper: Index Vector Elimination (IVE), by Bernecky, Grelck, Herhut, Scholz, Trojahner, and Shafarenko

28 Eliminating Materialized Index Vectors With IVE
IFL2006 paper: Index Vector Elimination (IVE), by Bernecky, Grelck, Herhut, Scholz, Trojahner, and Shafarenko
Post-optimization transformation

29 Eliminating Materialized Index Vectors With IVE
IFL2006 paper: Index Vector Elimination (IVE), by Bernecky, Grelck, Herhut, Scholz, Trojahner, and Shafarenko
Post-optimization transformation
Start with:
IV = [ i, j, k]
z = sel( IV, M)

30 Eliminating Materialized Index Vectors With IVE
IFL2006 paper: Index Vector Elimination (IVE), by Bernecky, Grelck, Herhut, Scholz, Trojahner, and Shafarenko
Post-optimization transformation
Start with:
IV = [ i, j, k]
z = sel( IV, M)
IVE: Replace:
z = M[ IV]
by:
IV = [ i, j, k]
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)

31 Eliminating Materialized Index Vectors With IVE
IFL2006 paper: Index Vector Elimination (IVE), by Bernecky, Grelck, Herhut, Scholz, Trojahner, and Shafarenko
Post-optimization transformation
Start with:
IV = [ i, j, k]
z = sel( IV, M)
IVE: Replace:
z = M[ IV]
by:
IV = [ i, j, k]
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)
Implication: IV is a materialized index vector!

32 Eliminating Materialized Index Vectors With IVE
IFL2006 paper: Index Vector Elimination (IVE), by Bernecky, Grelck, Herhut, Scholz, Trojahner, and Shafarenko
Post-optimization transformation
Start with:
IV = [ i, j, k]
z = sel( IV, M)
IVE: Replace:
z = M[ IV]
by:
IV = [ i, j, k]
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)
Implication: IV is a materialized index vector!
Implication: the offset calculation may be liftable (LIR)
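In C terms, the two primitives might look like this for a rank-3 array (the names vect2offset and idxsel come from the slides; the row-major bodies are my sketch):

```c
/* vect2offset: fold a rank-3 index vector into a row-major offset.
   This is the computation that LIR can lift when IV is loop-invariant. */
static int vect2offset3(const int shape[3], const int iv[3]) {
    return (iv[0] * shape[1] + iv[1]) * shape[2] + iv[2];
}

/* idxsel: selection by precomputed offset -- a single indexed load. */
static double idxsel(int offset, const double *m) {
    return m[offset];
}
```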

33 Eliminating Materialized Index Vectors With IVE
IVE: If IV can be represented as scalars, eliminate vect2offset.
Before:
IV = [ i, j, k]
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)

34 Eliminating Materialized Index Vectors With IVE
IVE: If IV can be represented as scalars, eliminate vect2offset.
Before:
IV = [ i, j, k]
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)
After:
IV = [ i, j, k]
offset = idxs2offset( shape(M), i, j, k)
z = idxsel( offset, M)

35 Eliminating Materialized Index Vectors With IVE
IVE: If IV can be represented as scalars, eliminate vect2offset.
Before:
IV = [ i, j, k]
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)
After:
IV = [ i, j, k]
offset = idxs2offset( shape(M), i, j, k)
z = idxsel( offset, M)
IV is now dead code, and can be eliminated!
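A sketch of the scalar variant under the same row-major assumptions: the offset comes straight from i, j, and k, so nothing reads IV any more.

```c
/* idxs2offset: the same row-major offset, computed directly from the
   scalar indices -- no index vector is ever built or consulted. */
static int idxs2offset3(const int shape[3], int i, int j, int k) {
    return (i * shape[1] + j) * shape[2] + k;
}
```

With offset = idxs2offset3(shape, i, j, k) feeding idxsel, the definition IV = [ i, j, k] has no remaining uses, which is precisely why dead-code removal can delete it.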

36 Eliminating Materialized Index Vectors With IVE
IVE: If IV is NOT already scalars, scalarize it.
Before:
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)

37 Eliminating Materialized Index Vectors With IVE
IVE: If IV is NOT already scalars, scalarize it.
Before:
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)
After:
I = IV[0]
J = IV[1]
K = IV[2]
JV = [ I, J, K]
offset = vect2offset( shape(M), JV)

38 Eliminating Materialized Index Vectors With IVE
IVE: If IV is NOT already scalars, scalarize it.
Before:
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)
After:
I = IV[0]
J = IV[1]
K = IV[2]
JV = [ I, J, K]
offset = vect2offset( shape(M), JV)
IV is now dead code, and can be eliminated!

39 Eliminating Materialized Index Vectors With IVE
IVE: If IV is NOT already scalars, scalarize it.
Before:
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)
After:
I = IV[0]
J = IV[1]
K = IV[2]
JV = [ I, J, K]
offset = vect2offset( shape(M), JV)
IV is now dead code, and can be eliminated!
Earlier substitution by idxs2offset is now feasible.

40 Eliminating Materialized Index Vectors With IVE
IVE: If IV is NOT already scalars, scalarize it.
Before:
offset = vect2offset( shape(M), IV)
z = idxsel( offset, M)
After:
I = IV[0]
J = IV[1]
K = IV[2]
JV = [ I, J, K]
offset = vect2offset( shape(M), JV)
IV is now dead code, and can be eliminated!
Earlier substitution by idxs2offset is now feasible.
Unfortunately...
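The non-scalar case in the same sketch style, reusing the vect2offset3 helper from above: the components are pulled out of IV as scalars and a fresh JV is built from them, setting up the idxs2offset rewrite and leaving IV dead.

```c
/* Same helper as in the earlier sketch. */
static int vect2offset3(const int shape[3], const int iv[3]) {
    return (iv[0] * shape[1] + iv[1]) * shape[2] + iv[2];
}

/* Scalarize an index vector whose components are not yet known scalars:
   after this rewrite IV itself is dead, and the vect2offset over JV can
   later be replaced by idxs2offset over i1, j1, k1. */
static int offset_via_scalarized_iv(const int shape[3], const int iv[3]) {
    int i1 = iv[0];                  /* I = IV[0] */
    int j1 = iv[1];                  /* J = IV[1] */
    int k1 = iv[2];                  /* K = IV[2] */
    int jv[3] = { i1, j1, k1 };      /* JV = [ I, J, K] */
    return vect2offset3(shape, jv);  /* offset = vect2offset( shape(M), JV) */
}
```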

41 Dueling Optimizations: IVE vs. LIR IVE introduces JV = [ I, J, K]

42 Dueling Optimizations: IVE vs. LIR IVE introduces JV = [ I, J, K] Loop-Invariant Removal (LIR) lifts I, J, K, JV out of the function

43 Dueling Optimizations: IVE vs. LIR IVE introduces JV = [ I, J, K] Loop-Invariant Removal (LIR) lifts I, J, K, JV out of the function Constant Folding (CF) replaces JV = [ I, J, K] by JV = IV

44 Dueling Optimizations: IVE vs. LIR IVE introduces JV = [ I, J, K] Loop-Invariant Removal (LIR) lifts I, J, K, JV out of the function Constant Folding (CF) replaces JV = [ I, J, K] by JV = IV Common-subexpression elimination (CSE) replaces JV by IV

45 Dueling Optimizations: IVE vs. LIR IVE introduces JV = [ I, J, K] Loop-Invariant Removal (LIR) lifts I, J, K, JV out of the function Constant Folding (CF) replaces JV = [ I, J, K] by JV = IV Common-subexpression elimination (CSE) replaces JV by IV This is where we came in!

46 Dueling Optimizations: IVE vs. LIR IVE introduces JV = [ I, J, K] Loop-Invariant Removal (LIR) lifts I, J, K, JV out of the function Constant Folding (CF) replaces JV = [ I, J, K] by JV = IV Common-subexpression elimination (CSE) replaces JV by IV This is where we came in! So, what can we do?

47 Dueling Optimizations: IVE vs. LIR IVE introduces JV = [ I, J, K] Loop-Invariant Removal (LIR) lifts I, J, K, JV out of the function Constant Folding (CF) replaces JV = [ I, J, K] by JV = IV Common-subexpression elimination (CSE) replaces JV by IV This is where we came in! So, what can we do? A kludged LIR to deal with this was deemed tasteless
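A hand-worked, source-level rendering of the duel (illustrative C, not actual SAC compiler output): the first function shows what IVE produces inside a loop; the second shows what remains after LIR lifts I, J, K, and JV, CF folds JV = [ I, J, K] back to JV = IV, and CSE replaces JV by IV.

```c
/* After IVE: components extracted and a fresh JV built in the loop body. */
static double after_ive(const int shape[3], const int iv[3],
                        const double *m, int n) {
    double z = 0.0;
    for (int t = 0; t < n; t++) {
        int i1 = iv[0], j1 = iv[1], k1 = iv[2];  /* I, J, K         */
        int jv[3] = { i1, j1, k1 };              /* JV = [ I, J, K] */
        z += m[(jv[0] * shape[1] + jv[1]) * shape[2] + jv[2]];
    }
    return z;
}

/* After LIR + CF + CSE: every use of JV is IV again, so this is exactly
   the materialized-index-vector code that IVE set out to eliminate. */
static double after_lir_cf_cse(const int shape[3], const int iv[3],
                               const double *m, int n) {
    double z = 0.0;
    for (int t = 0; t < n; t++)
        z += m[(iv[0] * shape[1] + iv[1]) * shape[2] + iv[2]];
    return z;
}
```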

48 Biting the Bullet Another approach: move IVE into the optimization cycle

49 Biting the Bullet Another approach: move IVE into the optimization cycle Having IVE in opt cycle enables other optimizations!

50 Biting the Bullet Another approach: move IVE into the optimization cycle Having IVE in opt cycle enables other optimizations! But, it also breaks many optimizations

51 Biting the Bullet Another approach: move IVE into the optimization cycle Having IVE in opt cycle enables other optimizations! But, it also breaks many optimizations To prevent dueling, we must scalarize all index vector ops!

52 Biting the Bullet Another approach: move IVE into the optimization cycle Having IVE in opt cycle enables other optimizations! But, it also breaks many optimizations To prevent dueling, we must scalarize all index vector ops! e.g., unroll IV+1, and the guard and extrema functions.

53 Biting the Bullet Another approach: move IVE into the optimization cycle Having IVE in opt cycle enables other optimizations! But, it also breaks many optimizations To prevent dueling, we must scalarize all index vector ops! e.g., unroll IV+1, and the guard and extrema functions. e.g., extend existing optimizations, such as CF

54 Biting the Bullet Another approach: move IVE into the optimization cycle Having IVE in opt cycle enables other optimizations! But, it also breaks many optimizations To prevent dueling, we must scalarize all index vector ops! e.g., unroll IV+1, and the guard and extrema functions. e.g., extend existing optimizations, such as CF The Good News: I had already scalarized many index vectors for Algebraic With-Loop Folding!
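As an example of such scalarization (my sketch of the idea, not SAC code): an index-vector increment like IV+1 unrolls into per-component scalar adds, whose results feed idxs2offset directly, so no vector-valued intermediate survives for LIR, CF, and CSE to stitch back into a materialized IV.

```c
/* Unrolled index-vector arithmetic: IV+1 becomes three scalar adds,
   and the scalars flow straight into the offset computation, so the
   optimizer never sees a vector-valued intermediate. */
static int offset_of_iv_plus_one(const int shape[3], const int iv[3]) {
    int i1 = iv[0] + 1;  /* (IV+1)[0] */
    int j1 = iv[1] + 1;  /* (IV+1)[1] */
    int k1 = iv[2] + 1;  /* (IV+1)[2] */
    return (i1 * shape[1] + j1) * shape[2] + k1;  /* idxs2offset */
}
```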

55 Current Status Moving IVE into optimization cycle worked, sort of:

56 Current Status Moving IVE into optimization cycle worked, sort of: The Good News: ipplusandakd CPU time: 8 seconds, instead of 45 minutes

57 Current Status Moving IVE into optimization cycle worked, sort of: The Good News: ipplusandakd CPU time: 8 seconds, instead of 45 minutes The Bad News: New primitives in AST broke some optimizations!

58 Current Status Moving IVE into optimization cycle worked, sort of: The Good News: ipplusandakd CPU time: 8 seconds, instead of 45 minutes The Bad News: New primitives in AST broke some optimizations! More Bad News: I am still fixing them!

59 Current Status Moving IVE into optimization cycle worked, sort of: The Good News: ipplusandakd CPU time: 8 seconds, instead of 45 minutes The Bad News: New primitives in AST broke some optimizations! More Bad News: I am still fixing them! The Good News: Performance is improving daily

60 Current Status Moving IVE into optimization cycle worked, sort of: The Good News: ipplusandakd CPU time: 8 seconds, instead of 45 minutes The Bad News: New primitives in AST broke some optimizations! More Bad News: I am still fixing them! The Good News: Performance is improving daily More Good News: Many opportunities exist for further optimization

61 Current Status Moving IVE into optimization cycle worked, sort of: The Good News: ipplusandakd CPU time: 8 seconds, instead of 45 minutes The Bad News: New primitives in AST broke some optimizations! More Bad News: I am still fixing them! The Good News: Performance is improving daily More Good News: Many opportunities exist for further optimization Even More Good News: More parallelism is coming

62 How Fast Is Compiled APL? Array sizes affect performance

63 How Fast Is Compiled APL? Array sizes affect performance Iteration counts affect performance

64 How Fast Is Compiled APL? Array sizes affect performance Iteration counts affect performance Indexed assigns affect performance

65 How Fast Is Compiled APL? Array sizes affect performance Iteration counts affect performance Indexed assigns affect performance Characterize your application, then we can provide an answer.
