Module II: Optimizing Serial Programs Part 1 - Compilation and Linking
- Dominic Bailey
1 Performance Programming: Theory, Practice and Case Studies Module II: Optimizing Serial Programs Part 1 - Compilation and Linking 38
2 Compilation overview Outline Compiler optimizations General optimizations Specifying target architecture Function inlining Data alignment optimizations Data prefetching Aliasing optimizations Compiler macro options Compiler pragmas and directives Software pipelining Linking overview Static and dynamic linking Using optimized mathematical libraries 39
3 Compilation Overview 40 Compiler: translates source language program into target language program Analysis stage Synthesis stage Analysis stage converts source program into intermediate representation includes three stages: lexical, syntax and semantic analysis Synthesis stage converts intermediate representation into target language program is divided into two steps: code optimization and code generation Other activities: symbol table management, error handling, OS interface
4 Compiler Organization Example: Sun ONE Studio compilers 41 Different frontends (C, C++, Fortran 77, Fortran 95) generate intermediate representation (SunIR) Backend generates code for target architectures (SPARC, x86)
5 Using Compilers Optimizing compiler: most important performance enhancing tool Compiler optimizations from a usage perspective: Applicability Utility Cost Complementary Optimizations Example: using double-word alignment and architecture-specific optimizations results in better performance than when these are used in isolation Use the latest release of compiler for best performance 42
6 Selecting Compiler Options Determining optimal compiler options is an iterative process Order of flags can make a difference, e.g. later options may take precedence over earlier ones Possible cross-compilation: compilation and target systems may be different Default settings of options should be considered Macro options can be used Trade-off between optimization level and compilation time and resource utilization 43
7 Setting Optimization Level 44 The most basic and most actively used optimization option. Generally set with the -O option. Different compilers have different numbers of levels and different meanings for each of them, for example GNU: -O1 to -O3, Sun: -O1 to -O5. Compaq (DEC) -O2 and HP +O2 don't necessarily mean the same thing. By default (no option) usually no optimization is performed. The default optimization level (usually plain -O) can mean different things and is not recommended. Higher optimization levels lead to longer compilation times and larger binaries
8 Different Optimization Levels 45 Example: Optimization levels for Sun ONE Studio compilers. -O1: Basic local optimization, assembly postpass. -O2: -O1 plus basic global optimization, including algebraic simplification, local and global subexpression elimination, register allocation, dead-code elimination, constant propagation, tail-call elimination. -O3: -O2 plus loop unrolling, fusion, software pipelining. -O4: -O3 plus function inlining within the module, aggressive global optimization. -O5: Highest optimization level, likely to improve performance when used in combination with profile feedback.
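The transformations named for -O2 can be seen in a small hand-annotated sketch. This is an illustrative fragment (the function names are made up, not from the slides); the compiler performs these rewrites itself, the comments only point out where each one applies.

```c
#include <assert.h>

/* Fragment annotated with typical -O2-class optimizations. */
int transform(int x, int y) {
    int unused = x * 100;    /* dead-code elimination: never used, removed */
    int a = (x + y) * 2;
    int b = (x + y) * 3;     /* common subexpression elimination:
                                (x + y) is computed once and reused */
    int k = 4 * 8;           /* constant propagation/folding: k becomes 32 */
    (void)unused;
    return a + b + k;
}

/* Tail-call elimination: the recursive call in tail position can be
   turned into a jump, so no stack frame is consumed per step. */
int gcd(int a, int b) {
    if (b == 0) return a;
    return gcd(b, a % b);
}
```

The point is that at -O0 each line is translated literally, while at -O2 the generated code for `transform` is just a few arithmetic instructions.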
9 Example: Effect of Optimization Runtime for basic matrix-matrix multiplication compiled with different optimization levels, HP C compiler. [Chart: run time in seconds for no optimization, -O1, -O2, -O3, -O4.]
10 Increased Compilation Time Compilation time for the dblat3.f routine from Netlib, Compaq Fortran Compiler X5.4A. [Chart: compilation time in seconds at -g, -O1, -O2, -O3, -O4, -O5.]
11 Setting Target Architecture Selecting specific target architecture allows the compiler backend to optimally use available resources Instructions understood by the CPU Registers (number, types) CPU properties (latencies, pipeline depth, etc.) Example options IBM: -qarch=pwr4 Sun: -xarch=v8plusa SGI: -mips4 GNU: -march=pentiumpro DEC: -arch ev67 48
12 Benefit of Architecture Setting Example: repeated coordinate transformation; Sun Forte 6 Fortran 77 compiler; runs on a Sun Ultra 80 (400MHz UltraSPARC-II). This CPU is characterized by the -xarch=v8plusa option; -xarch=v8 (for SuperSPARC) is suboptimal. [Chart: runtime in seconds for v8 vs. v8plusa.]
13 Function Call Inlining 50 Inlining: replicating the body of the called function at the call site in the caller. Pros: eliminates function call overhead; improved optimization due to code transparency across function calls. Cons: larger binaries and increased compilation time; possibility of increased instruction cache misses. Inlining can be controlled by compiler options (optimization levels -Ox and specialized options). Some environments allow inlining assembly templates
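A minimal C sketch of what inlining buys: at higher -O levels (or with the options above) the compiler substitutes the body of `sq` into the loop, removing per-iteration call overhead and exposing the multiply for scheduling. The function names here are illustrative, not from the slides.

```c
#include <assert.h>

/* A small leaf function: a prime inlining candidate. */
static inline double sq(double x) { return x * x; }

double sum_of_squares(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += sq(a[i]);   /* after inlining this is just a[i]*a[i],
                            with no call/return in the loop body */
    return s;
}
```

With the call inlined, the loop becomes a candidate for unrolling and software pipelining as well, which is the "code transparency" benefit above.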
14 Example: Inlining Basic example. [Diagram: the inlining point in the caller and the function body replicated there.] 51
15 Example: Inlining (continued) Building the executable with and without cross-file inlining; Sun ONE Studio 7 compilers; Ultra 60 (360 MHz). [Chart: runtime in seconds with and without inlining.]
16 Inlining Options 53 Options that control inlining for various compilers: -inline, -xinline, +Oinline, -Q, -Minline, -xcrossfile. High -Ox optimization levels can imply inlining. Functions can be inlined selectively, e.g. -xinline=[f1..fn]. Selective inlining may reduce the number of inlined functions (compared to those otherwise inlined with -Ox)
17 Vectorization of Standard Calls Standard math calls in large loops can be vectorized (exp, log, trig functions, etc.) Compiler can replace calls with vectorized versions provided in libraries Can be a dedicated option HP: +Ovectorize Sun: -xvector Can be included in some optimization -Ox level Can be controlled by directives or pragmas in the code 54
18 Profile/Feedback Optimization 55 Feedback directed optimizations based on runtime execution frequency data: Optimizations performed on more frequently executed portions Register allocation, basic block ordering, code motion and rearrangement, inlining Corresponding compiler options HP: +Oprofile=collect, +Oprofile=use Sun: -xprofile=collect, -xprofile=use Compaq: -feedback (in combination with Pixie) GNU: -fprofile-arcs, -fbranch-probabilities Using profile-feedback optimization Compile program with options to collect data Run on a training data set. Profile file or directory is generated Recompile with option to optimize based on collected data Other options should be used consistently
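One of the things profile feedback tells the compiler is which branches are hot, so the hot path can fall through and the cold path can be moved out of line. GCC and Clang also let the programmer state this by hand with the real `__builtin_expect` builtin; the sketch below wraps it in macros so the code still compiles with other compilers. The function itself is a made-up example.

```c
#include <assert.h>

/* Manual branch-probability hints: the in-source analogue of what
   -fprofile-arcs / -fbranch-probabilities derives from training runs. */
#if defined(__GNUC__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

int checked_div(int a, int b, int *err) {
    if (UNLIKELY(b == 0)) {  /* cold path: laid out off the hot trace */
        *err = 1;
        return 0;
    }
    *err = 0;                /* hot path: falls straight through */
    return a / b;
}
```

Profile feedback is usually preferable to hand hints, since the training run measures the real frequencies instead of guessing them.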
19 Data Prefetching 56 Prefetching allows overlapping execution with fetching data from memory. Benefits memory-latency-bound applications (particularly on high-latency machines). Efficient for programs with repeatable memory access patterns. Best used in combination with options that specify microarchitecture features. Sample compiler options: HP: +Odataprefetch; Sun: -xprefetch, -xprefetch_level; SGI: pf<n>, prefetch, prefetch_ahead, prefetch_manual. Prefetching can also be manually controlled with pragmas or directives
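What the compiler's prefetch insertion amounts to can be shown by hand with the real GCC/Clang builtin `__builtin_prefetch`: touch data a fixed distance ahead of the current element so the memory fetch overlaps with computation. The loop is a daxpy-type kernel like the one in the next slide's example, and the prefetch distance of 16 is an illustrative guess, not a tuned value.

```c
#include <assert.h>

/* daxpy with manual prefetching a fixed distance ahead. Prefetches past
   the end of the array are harmless: the builtin never faults. */
double daxpy_prefetch(double alpha, const double *x, double *y, int n) {
    double checksum = 0.0;
    for (int i = 0; i < n; i++) {
#if defined(__GNUC__)
        __builtin_prefetch(&x[i + 16], 0, 1);  /* 0 = for reading */
        __builtin_prefetch(&y[i + 16], 1, 1);  /* 1 = for writing */
#endif
        y[i] = y[i] + alpha * x[i];
        checksum += y[i];
    }
    return checksum;
}
```

In practice the compiler options above (-xprefetch, +Odataprefetch) do this automatically and pick the distance from the target's latency model.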
20 Benefits of Prefetching Example: daxpy-type loop calculation. Moderate effect on Sun Ultra 60; big impact on the high-latency Sun Enterprise; even higher effect on the UltraSPARC-III based SunFire 480R (extra prefetch cache, and higher clock speed makes relative latencies higher). [Chart: scaled runtimes with and without -xprefetch on Ultra 60, Sun E1000, Sun 480R.] 57
21 Floating Point Optimizations 58 IEEE 754 floating point standard ensures similar numerical behavior on different platforms Binary representation of FP numbers Operation precedence Rounding Underflow, overflow and trap handling Some compiler optimizations relax IEEE 754 requirements (slight numeric differences can occur) Algebraic simplifications Underflow control Rounding control Trap handling Example options that affect FP behavior Sun: -fsimple, -fround, -fns, -ftrap HP: +FPstring, +FPVZO
22 Example: Algebraic Simplifications 59 The Sun -fsimple option allows non-IEEE arithmetic and specifies FP simplifying assumptions. -fsimple=0: No simplifying assumptions; IEEE conformant. -fsimple=1: Conservative simplification; does not strictly conform to IEEE 754, but numeric results are typically unchanged. IEEE 754 default rounding and trapping modes are unchanged. Infinities and NaNs are not propagated; that is, x*0 can be replaced by 0. Computations do not depend on the sign of zero. -fsimple=2: Aggressive optimizations that may lead to different numeric results. Example: with -fsimple=2 the computation of x/y is replaced with x*z, where z=1/y is computed once. Other optimizations: cycle shrinking, height reduction of the directed acyclic graph, cross-iteration common subexpression elimination, and scalar replacement. -fsimple=2 is best used in conjunction with other high optimization flags.
23 Algebraic Simplifications (cont.) Example 1: elimination of redundant FP operations causing exceptions:
  dum1 = 0.d0
  dum2 = 20.d0/dum1
With -fsimple=0 the program aborts; with -fsimple=1 it runs. Example 2: -fsimple=1 vs. -fsimple=2 for a dot product (Forte 6 Update 1 Fortran 90; Ultra 60):
  sum = 0.d0
  do i = 1, nele
    sum = sum + a(i)*b(i)
  enddo
[Chart: runtime in seconds for -fsimple=1 and -fsimple=2.]
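A minimal C sketch of the -fsimple=2 rewrite described above, done by hand: the reciprocal is computed once and the n divides become multiplies. The results can differ from the divide version in the last bits, which is exactly why the option is opt-in. Function names are illustrative.

```c
#include <assert.h>

/* Literal form: n floating-point divides. */
void scale_div(const double *x, double *out, double y, int n) {
    for (int i = 0; i < n; i++)
        out[i] = x[i] / y;
}

/* Simplified form: one divide hoisted out, n multiplies in the loop.
   Divides have much higher latency than multiplies, hence the speedup. */
void scale_recip(const double *x, double *out, double y, int n) {
    double z = 1.0 / y;
    for (int i = 0; i < n; i++)
        out[i] = x[i] * z;
}
```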
24 Data Alignment 61 Most microprocessors have preferred data alignment Data on natural or preferred byte-boundaries accessed faster, e.g. data on 8-byte boundaries can be accessed in 1 doubleword load/store instruction Misaligned data may cause restrictive load/store instructions Programming language standards specify language-specific alignment rules Compiler makes conservative assumption on data-alignment Compiler options (+align, -dalign) cause generation of double-word load/store for DP data Padding may be inserted in Fortran COMMON blocks Must be used consistently (if one module compiled with it, compile all) Optimizer can better carry out other optimizations (e.g. software pipelining)
25 Example: Data Alignment Example: array copy on SunFire 480; Forte Developer 7 compiler. [Chart: runtime in seconds with and without -dalign, for data in memory and in L2 cache.]
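The natural-boundary rule can be inspected directly in C. In the struct below the compiler inserts padding after the char so the double lands on its preferred boundary and can be accessed with a single doubleword load/store. This is a generic illustration (C11 `alignof`), not code from the slides.

```c
#include <stddef.h>
#include <stdalign.h>

/* The compiler pads after 'tag' so 'value' sits on the natural
   alignment boundary of double rather than at offset 1. */
struct sample {
    char   tag;
    double value;
};
```

`offsetof(struct sample, value)` is always a multiple of `alignof(double)`; misaligned placement would force the restrictive load/store sequences mentioned above.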
26 Pointer Alias Analysis Options 63 C programs: pointers can point to overlapping regions of memory (aliasing), causing memory ambiguity. Memory alias disambiguation in general programs is very complex for the compiler to perform, so the compiler is conservative: optimizations (load hoisting, unrolling, software pipelining) are suppressed. Options can be used to inform the compiler about the aliasing properties of the code: IBM: -qalias, SGI: alias, Sun: -xrestrict, -xalias_level, HP: -alias, GNU: -fstrict-aliasing, -fargument-alias. The Fortran standard makes it the programmer's responsibility to ensure absence of aliasing
27 Pointer Alias Analysis Alias relationships in a program:
Does alias:
  int *i, j = 1;
  i = &j;
Does not alias:
  double func(double b) {
      double *c;
      c = (double *) malloc(sizeof(double));
      *c = 10.0;
      b = b + *c;
      free(c);
      return(b);
  }
May alias:
  int sum, *vloc;
  function foo (double *a)
  for (i = 0; i < n; i++) {
      for (j = 0; j < m; j++) {
          a[vloc[i]+j] = a[vloc[i]+j] + sum
      }
  }
Aliasing can have a big performance impact in memory-bound programs. The compiler generates conservative code in the does-alias situation. In large programs with many pointers, may-alias is the most common relationship; it is treated as equivalent to does-alias (conservative code is generated). The more does-not-alias relationships there are, the more flexibility the compiler has in generating optimized code
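Besides compiler-wide options, C99 provides the `restrict` qualifier (a real keyword) as an in-source way to turn a may-alias relationship into does-not-alias for a specific function: the programmer promises that x and y do not overlap, so loads of x can be hoisted and the loop pipelined. The kernel itself is an illustrative axpy.

```c
#include <assert.h>

/* restrict asserts x and y never overlap; the compiler may then keep
   x values in registers across stores to y and pipeline the loop. */
void axpy(int n, double a, const double * restrict x,
          double * restrict y) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

As with the options above, the promise is the programmer's responsibility: calling `axpy` with overlapping arrays is undefined behavior.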
28 Effect of Pointer Alias Analysis Computational program on a Sun Blade 1000, with structures similar to those used in graph partitioning. Type-based alias disambiguation with the -xalias_level=<setting> option; 7 settings: any, basic, weak, layout, strict, std, strong. Without any setting, the default level (layout) is used. For programs conforming to the ISO 1999 C standard: std. [Chart: runtime in seconds; loads generated: -xalias_level=strong (14 ld instructions), std (21), strict (21), layout (31), weak (31), basic (36), any (66).] 65
29 Compiler Macro Options 66 Some compilers provide meta-options that combine the most effective optimizations. Examples: Sun: -fast, SGI: -Ofast. Can be a good starting point for selecting options. Caveats: can interfere with other options; macro options can change between releases; can have components that should be used consistently on all parts of the code (e.g. data alignment); some component options should also be used for linking; macro options can have architecture settings. Sun ONE Studio C compiler: -fast expands to -fns -fsimple=2 -fsingle -ftrap=%none -xalias_level=basic -native -xbuiltin=%all -xdepend -xlibmil -xmemalign=8s -xO5 -xprefetch=auto,explicit
30 Compiler Directives and Pragmas Directives and pragmas: Annotations inserted in source Provide compiler with specific information about parts of the program (usually statements following the annotation) Directives/pragmas specific to a compiler are usually ignored by other compilers First step in source-code modification Directives are used for Parallelization (e.g. OpenMP) Pipelining control Prefetching control Data alignment Alias analysis Storage allocation / array padding 67
31 Pipelining 68 Software pipelining: a technique to extract instruction-level parallelism (ILP) in loops. Runtime of a program is T = N_instr x cycle time x CPI; higher ILP decreases CPI. Software pipelining breaks loop iterations into multiple parts; parts from disjoint iterations are overlapped such that they map to different functional units of the processor. Parallelism across loop iterations is thus mapped onto the ILP of the processor. Software pipelining and loop unrolling are independent but related approaches, often used together: loop unrolling decreases loop overhead but does no scheduling of instructions; software pipelining schedules instructions to decrease processor pipeline stalls by hiding instruction latencies
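The unrolling half of the pair can be sketched by hand. Below is a 4-way unrolled dot-product reduction, the transformation -O3-class unrolling performs automatically: loop overhead (test, branch, increment) is paid once per four elements, and the four independent accumulators give the scheduler the parallelism it needs to pipeline the multiply-adds. The function name is illustrative.

```c
#include <assert.h>

/* 4-way unrolled dot product with independent partial sums. */
double dot4(const double *a, const double *b, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i]     * b[i];      /* the four accumulators have no   */
        s1 += a[i + 1] * b[i + 1];  /* dependences on each other, so   */
        s2 += a[i + 2] * b[i + 2];  /* their latencies can be overlapped */
        s3 += a[i + 3] * b[i + 3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++)              /* remainder loop when n % 4 != 0 */
        s += a[i] * b[i];
    return s;
}
```

Note that reassociating the sum into four partial sums can change FP rounding slightly, which is why compilers tie such reductions to relaxed-FP options like -fsimple.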
32 Pipelining (cont.) Software pipelining via modulo scheduling. In many cases it is not safe (or not profitable) for the compiler to pipeline a loop: loops with indirection; loops with branches or conditionals; fat (computationally dense) loops; loops where the trip-count calculation cannot be safely performed. 69
33 Pipelining Control Example: extracted from a reservoir simulation application (data set fits in the 4MB level-2 cache); Sun Forte 6 update 1 compilers.
      do ivr = isvs(1,ksld), isvs(2,ksld)
        iv1 = isvr(1,ivr)
        iv2 = isvr(2,ivr)
        iv  = isvr(3,ivr)
!$pragma sun pipeloop = 0
        do ic = iv1, iv2
          in1 = ic + iv
          in2 = ic - icf
          wrk(in1) = wrk(in1) - a(in2)*wrk(ic)
        enddo
      enddo
[Chart: runtime in seconds with the pipeloop pragma vs. no pragma.]
34 Linking Overview 71 Linking: stages in generating and running executables (link-editing and runtime linking). Link-editor: concatenates relocatable object files (produced by the compiler or assembler) to generate libraries or executables; performs symbol resolution to bind external symbols to implementations using the symbol tables in object files and libraries. Runtime (dynamic) linker: loads executable files and shared libraries and generates a runnable process; maps the files produced by the link-editor to memory and performs relocations. Linker stages can be hidden: link-editing can be done by the compiler, and the runtime linker is invoked by running the executable
35 Static and Dynamic Linking Static linking: object files from archives (*.a files) go into the executable. The executable contains all the code it needs to run; easier to deploy and test applications; only the code for the required functions is used; limited portability/flexibility. Dynamic linking: the executable may have dependencies on runtime libraries. Executables tend to be smaller (do not replicate the code); libraries can be shared between several processes running on the system (improves memory utilization and reduces paging); greater flexibility for building and testing an application 72
36 Types of Libraries 73 System libraries (usually in /usr/lib) Typically shared libraries (can be static as well, but use of shared libs is recommended for portability) Some environments provide both 64-bit and 32-bit versions Compiler libraries Static and dynamic libraries In some environments these libraries might not be available on user system, developers can either link them statically or link dynamic versions and distribute libraries with the application The dynamically linked compiler libraries can be later replaced with a new version or patch User or application libraries An application can use dynamic or static libraries as well as the mixture of the two Greater flexibility with dynamic linking
37 Linker Mapfiles Mapfiles can be used to specify the layout of functions in libraries (and memory) Can improve instruction cache utilization and reduce paging activity if callers/callees are placed nearby Mapfiles can be generated by profiling tools Can be used for other purposes (e.g. to reduce the scope of symbols) 74
38 Optimized Mathematical Libraries Using optimized libraries takes advantage of the highly efficient implementations of standard mathematical functions Examples of optimized libraries Optimized versions of standard UNIX math (libm) calls Vectorized versions of standard calls Optimized inline templates Optimized and/or parallelized BLAS, FFT, etc. Libraries/calls with SIMD instructions Libraries for distributed computing 75
39 Vectorized Math Libraries Some libraries offer vector versions of some elementary mathematical functions Functions are evaluated for an entire vector of values at once Can be invoked (replacing standard calls) using compiler options Examples: HP: HP-VML (Itanium) Sun: Vector Math Library (libmvec) IBM: Vector Mathematical Acceleration SubSystem (MASS) 76
40 Optimized BLAS, FFT, etc. Optimized/parallelized libraries for BLAS, sparse BLAS, LAPACK, ScaLAPACK, FFT. Versions for 32- and 64-bit, different CPU architectures, etc. Examples: Sun: Performance Library (libsunperf), Scientific Subroutine Library (libs3l); HP: MLIB (VECLIB, LAPACK, ScaLAPACK, and SuperLU_DIST); IBM: Engineering and Scientific Subroutine Library (ESSL); SGI: Scientific Computing Software Library (SCSL); Compaq: Compaq Extended Math Library (CXML) 77
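What these libraries replace is hand-written kernels like the naive matrix-matrix multiply below. A tuned BLAS (libsunperf, ESSL, MLIB, ...) computes the same result through dgemm, but blocked for cache and software-pipelined; swapping this loop for a library call is usually the single biggest win for dense linear algebra. The function name is illustrative, and matrices are stored row-major in flat arrays for simplicity.

```c
#include <assert.h>

/* Naive n-by-n matrix multiply, C = A * B, row-major flat storage.
   This is the reference computation an optimized BLAS dgemm replaces. */
void naive_dgemm(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}
```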
41 Single Instruction Multiple Data Single Instruction Multiple Data (SIMD) One instruction operates on several pieces of data Used in media, bioinformatics, etc. Can be accessed by wrappers provided in libraries SIMD implementations Sun: VIS Intel: MMX, SSE, SSE-2 AMD: 3DNow! Motorola: AltiVec 78
42 Summary 79 Compiler is the most efficient tool for optimizing applications Compiler options should be carefully selected Higher optimization leads to larger binaries and higher compilation times Large performance potential in using options related to setting architecture, data alignment and alias disambiguation Prefetching options can be used to hide memory latency Directives can be used to provide additional information to the compiler about regions of the code Optimized mathematical libraries can efficiently implement common APIs
More informationHY425 Lecture 09: Software to exploit ILP
HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 ILP techniques Hardware Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit
More information55:132/22C:160, HPCA Spring 2011
55:132/22C:160, HPCA Spring 2011 Second Lecture Slide Set Instruction Set Architecture Instruction Set Architecture ISA, the boundary between software and hardware Specifies the logical machine that is
More informationInstruction Set Architecture
Computer Architecture Instruction Set Architecture Lynn Choi Korea University Machine Language Programming language High-level programming languages Procedural languages: C, PASCAL, FORTRAN Object-oriented
More informationHY425 Lecture 09: Software to exploit ILP
HY425 Lecture 09: Software to exploit ILP Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 4, 2010 Dimitrios S. Nikolopoulos HY425 Lecture 09: Software to exploit ILP 1 / 44 ILP techniques
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationThe SGI Pro64 Compiler Infrastructure - A Tutorial
The SGI Pro64 Compiler Infrastructure - A Tutorial Guang R. Gao (U of Delaware) J. Dehnert (SGI) J. N. Amaral (U of Alberta) R. Towle (SGI) Acknowledgement The SGI Compiler Development Teams The MIPSpro/Pro64
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationSRI VENKATESWARA COLLEGE OF ENGINEERING AND TECHNOLOGY DEPARTMENT OF ECE EC6504 MICROPROCESSOR AND MICROCONTROLLER (REGULATION 2013)
SRI VENKATESWARA COLLEGE OF ENGINEERING AND TECHNOLOGY DEPARTMENT OF ECE EC6504 MICROPROCESSOR AND MICROCONTROLLER (REGULATION 2013) UNIT I THE 8086 MICROPROCESSOR PART A (2 MARKS) 1. What are the functional
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationBindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core
Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationCompiler Architecture
Code Generation 1 Compiler Architecture Source language Scanner (lexical analysis) Tokens Parser (syntax analysis) Syntactic structure Semantic Analysis (IC generator) Intermediate Language Code Optimizer
More informationAgenda. What is the Itanium Architecture? Terminology What is the Itanium Architecture? Thomas Siebold Technology Consultant Alpha Systems Division
What is the Itanium Architecture? Thomas Siebold Technology Consultant Alpha Systems Division thomas.siebold@hp.com Agenda Terminology What is the Itanium Architecture? 1 Terminology Processor Architectures
More informationPipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &
More informationCSCE 5610: Computer Architecture
HW #1 1.3, 1.5, 1.9, 1.12 Due: Sept 12, 2018 Review: Execution time of a program Arithmetic Average, Weighted Arithmetic Average Geometric Mean Benchmarks, kernels and synthetic benchmarks Computing CPI
More informationCode optimization techniques
& Alberto Bertoldo Advanced Computing Group Dept. of Information Engineering, University of Padova, Italy cyberto@dei.unipd.it May 19, 2009 The Four Commandments 1. The Pareto principle 80% of the effects
More informationLECTURE 19. Subroutines and Parameter Passing
LECTURE 19 Subroutines and Parameter Passing ABSTRACTION Recall: Abstraction is the process by which we can hide larger or more complex code fragments behind a simple name. Data abstraction: hide data
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationChapter 9 Memory Management
Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual
More informationMath 230 Assembly Programming (AKA Computer Organization) Spring MIPS Intro
Math 230 Assembly Programming (AKA Computer Organization) Spring 2008 MIPS Intro Adapted from slides developed for: Mary J. Irwin PSU CSE331 Dave Patterson s UCB CS152 M230 L09.1 Smith Spring 2008 MIPS
More informationEN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design
EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown
More informationAbout the Authors... iii Introduction... xvii. Chapter 1: System Software... 1
Table of Contents About the Authors... iii Introduction... xvii Chapter 1: System Software... 1 1.1 Concept of System Software... 2 Types of Software Programs... 2 Software Programs and the Computing Machine...
More informationDSP Mapping, Coding, Optimization
DSP Mapping, Coding, Optimization On TMS320C6000 Family using CCS (Code Composer Studio) ver 3.3 Started with writing a simple C code in the class, from scratch Project called First, written for C6713
More informationCache-oblivious Programming
Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix
More informationThese slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information.
11 1 This Set 11 1 These slides do not give detailed coverage of the material. See class notes and solved problems (last page) for more information. Text covers multiple-issue machines in Chapter 4, but
More informationMulti-core processors are here, but how do you resolve data bottlenecks in native code?
Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session
More informationPage # Let the Compiler Do it Pros and Cons Pros. Exploiting ILP through Software Approaches. Cons. Perhaps a mixture of the two?
Exploiting ILP through Software Approaches Venkatesh Akella EEC 270 Winter 2005 Based on Slides from Prof. Al. Davis @ cs.utah.edu Let the Compiler Do it Pros and Cons Pros No window size limitation, the
More informationIntel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth
Intel C++ Compiler Professional Edition 11.1 for Mac OS* X In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. 3 Intel C++ Compiler Professional Edition 11.1 Components:...3 Features...3
More informationMake Your C/C++ and PL/I Code FLY With the Right Compiler Options
Make Your C/C++ and PL/I Code FLY With the Right Compiler Options Visda Vokhshoori/Peter Elderon IBM Corporation Session 13790 Insert Custom Session QR if Desired. WHAT does good application performance
More informationHPC VT Machine-dependent Optimization
HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationOverview Implicit Vectorisation Explicit Vectorisation Data Alignment Summary. Vectorisation. James Briggs. 1 COSMOS DiRAC.
Vectorisation James Briggs 1 COSMOS DiRAC April 28, 2015 Session Plan 1 Overview 2 Implicit Vectorisation 3 Explicit Vectorisation 4 Data Alignment 5 Summary Section 1 Overview What is SIMD? Scalar Processing:
More informationAn Oracle White Paper June Optimizing Applications with Oracle Solaris Studio Compilers and Tools
An Oracle White Paper June 2010 Optimizing Applications with Oracle Solaris Studio Compilers and Tools Introduction...1 Oracle Solaris Studio Compilers and Tools...2 Optimizing Applications for Serial
More informationLecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis
Lecture 7 Instruction Level Parallelism (5) EEC 171 Parallel Architectures John Owens UC Davis Credits John Owens / UC Davis 2007 2009. Thanks to many sources for slide material: Computer Organization
More informationNumber Representations
Number Representations times XVII LIX CLXX -XVII D(CCL)LL DCCC LLLL X-X X-VII = DCCC CC III = MIII X-VII = VIIIII-VII = III 1/25/02 Memory Organization Viewed as a large, single-dimension array, with an
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationDEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level
More informationMemory Management. Reading: Silberschatz chapter 9 Reading: Stallings. chapter 7 EEL 358
Memory Management Reading: Silberschatz chapter 9 Reading: Stallings chapter 7 1 Outline Background Issues in Memory Management Logical Vs Physical address, MMU Dynamic Loading Memory Partitioning Placement
More informationIntroduction to Runtime Systems
Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationIntel Math Kernel Library 10.3
Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)
More informationCHAPTER 5 A Closer Look at Instruction Set Architectures
CHAPTER 5 A Closer Look at Instruction Set Architectures 5.1 Introduction 199 5.2 Instruction Formats 199 5.2.1 Design Decisions for Instruction Sets 200 5.2.2 Little versus Big Endian 201 5.2.3 Internal
More informationIntel s MMX. Why MMX?
Intel s MMX Dr. Richard Enbody CSE 820 Why MMX? Make the Common Case Fast Multimedia and Communication consume significant computing resources. Providing specific hardware support makes sense. 1 Goals
More informationHP PA-8000 RISC CPU. A High Performance Out-of-Order Processor
The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA
More informationRISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.
COMP 212 Computer Organization & Architecture Pipeline Re-Cap Pipeline is ILP -Instruction Level Parallelism COMP 212 Fall 2008 Lecture 12 RISC & Superscalar Divide instruction cycles into stages, overlapped
More information