Optimizations - Compilation for Embedded Processors -


12 Optimizations - Compilation for Embedded Processors - Jian-Jia Chen (slides are based on Peter Marwedel), TU Dortmund, Informatik 12, Germany, 13 January 2015. These slides use Microsoft clip arts; Microsoft copyright restrictions apply. Springer, 2010

Structure of this course (numbers denote the sequence of chapters): 2: Specification & modeling; 3: ES-hardware; 4: System software (RTOS, middleware, ...); 5: Evaluation & validation (energy, cost, performance, ...); 6: Application mapping; 7: Optimization; 8: Test.

Task-level concurrency management (book section 7.1). Granularity: size of tasks (e.g. in instructions). Readable specifications and efficient implementations may require different task structures ⇒ granularity changes.

Merging of tasks: reduced overhead of context switches, more global optimization of machine code, reduced overhead for inter-process/task communication.

Splitting of tasks: no blocking of resources while waiting for input, more flexibility for scheduling, possibly an improved result.

Merging and splitting of tasks: the most appropriate task graph granularity depends upon the context ⇒ merging and splitting may be required. Merging and splitting of tasks should be done automatically, depending upon the context.

Automated rewriting of the task system - Example -

Attributes of a system that needs rewriting: tasks blocking after they have already started running.

Work by Cortadella et al.: 1. Transform each of the tasks into a Petri net; 2. Generate one global Petri net from the nets of the tasks; 3. Partition the global net into sequences of transitions; 4. Generate one task from each such sequence. A mature, commercial approach is not yet available.

Result, as published by Cortadella: an initialization task that reads only at the beginning. [Figure: generated code, with annotations marking conditions that are never true or always true.]

Optimized version of Tin (annotations in the original figure mark conditions that are never true, e.g. j == i-1, or always true):

    Tin () {
      READ(IN, sample, 1);
      sum += sample;
      i++;
      DATA = sample; d = DATA;
    L0: if (i < N) return;
      DATA = sum/N; d = DATA;
      d = d*c;
      WRITE(OUT, d, 1);
      sum = 0; i = 0;
      return;
    }

12 High-level software transformations. Peter Marwedel, TU Dortmund, Informatik 12, Germany. Springer, 2010. These slides use Microsoft clip arts; Microsoft copyright restrictions apply.

High-level optimizations (book section 7.2): floating-point to fixed-point conversion, simple loop transformations, loop tiling/blocking, loop (nest) splitting, array folding.

Fixed-Point Data Format. Floating-point vs. fixed-point, integer vs. fixed-point. Floating-point numbers carry an exponent and a mantissa: each exponent is computed and updated automatically at run-time. Fixed-point numbers use an implicit exponent determined off-line: the bit pattern is that of an integer, with a hypothetical binary point separating the integer word length (IWL) from the fractional word length (FWL). [Figure: bit layouts of (a) integer and (b) fixed-point with IWL=3.] Ki-Il Kum, et al.

Floating-point to fixed-point conversion. Pros: lower cost, faster, lower power consumption; sufficient SQNR, if properly scaled; suitable for portable applications. Cons: decreased dynamic range; finite word-length effects, unless properly scaled (overflow and excessive quantization noise); extra programming effort. Ki-Il Kum, et al. (Seoul National University): A Floating-point To Fixed-point C Converter For Fixed-point Digital Signal Processors, 2nd SUIF Workshop, 1996.

Development procedure (flow): floating-point C program → range estimator (range estimation by C program execution, plus manual specification) → IWL information → floating-point to fixed-point C program converter → fixed-point C program. Ki-Il Kum, et al.

Performance Comparison - Machine Cycles - ADPCM: fixed-point (16b): 26,718 cycles; fixed-point (32b): 61,401 cycles; floating-point: 125,249 cycles. Ki-Il Kum, et al.

Performance Comparison - SNR - [Chart: SNR (dB) of ADPCM for inputs A-D, comparing fixed-point (16b), fixed-point (32b), and floating-point.] Ki-Il Kum, et al.

High-level optimizations (book section 7.2): floating-point to fixed-point conversion, simple loop transformations, loop tiling/blocking, loop (nest) splitting, array folding.

Impact of memory allocation on efficiency. Array p[j][k]: row major order (C) stores the elements of each row j contiguously; column major order (FORTRAN) stores the elements of each column k contiguously. [Figure: memory layouts for both orders.]

Best performance of the innermost loop corresponds to the rightmost array index. Two loops, assuming row major order (C):

Version 1 (k innermost):
    for (j=0; j<=n; j++)
      for (k=0; k<=m; k++)
        p[j][k] = ...

Version 2 (j innermost):
    for (k=0; k<=m; k++)
      for (j=0; j<=n; j++)
        p[j][k] = ...

Same behavior for homogeneous memory access, but for row major order version 1 has good cache behavior and version 2 has poor cache behavior ⇒ memory-architecture-dependent optimization.

⇒ Program transformation: loop interchange. Example:

    #define iter 400000
    int a[20][20][20];

    void computeijk() { int i,j,k;
      for (i = 0; i < 20; i++)
        for (j = 0; j < 20; j++)
          for (k = 0; k < 20; k++)
            a[i][j][k] += a[i][j][k];
    }

    void computeikj() { int i,j,k;
      for (i = 0; i < 20; i++)
        for (j = 0; j < 20; j++)
          for (k = 0; k < 20; k++)
            a[i][k][j] += a[i][k][j];
    }

    /* timing harness */
    start = time(&start); for (z=0; z<iter; z++) computeijk(); ...

⇒ Improved locality.

Results: strong influence of the memory architecture. Dramatic impact of locality: for loop structure i-j-k, run time is reduced to ~57% on TI C6xx, 35% on Sun SPARC, and 3.2% on Intel Pentium. Not always the same impact. [Till Buchwald, Diploma thesis, Univ. Dortmund, Informatik 12, 12/2004]

Transformations: loop fusion (merging) and loop fission.

Fission - two loops:
    for (j=0; j<=n; j++) p[j] = ... ;
    for (j=0; j<=n; j++) p[j] = p[j] + ... ;
Loops small enough to allow zero-overhead loops; better chances for parallel execution.

Fusion - one loop:
    for (j=0; j<=n; j++) { p[j] = ... ; p[j] = p[j] + ... ; }
Better locality for access to p.

Which of the two versions is best? An architecture-aware compiler should select the best version.

Example: simple loops.

    #define size 30
    #define iter 40000
    int a[size][size];
    float b[size][size];

    void ss1() { int i,j;          /* two separate loop nests */
      for (i=0;i<size;i++) for (j=0;j<size;j++) a[i][j] += 17;
      for (i=0;i<size;i++) for (j=0;j<size;j++) b[i][j] -= 13;
    }
    void ms1() { int i,j;          /* merged outer loop, separate inner loops */
      for (i=0;i<size;i++) {
        for (j=0;j<size;j++) a[i][j] += 17;
        for (j=0;j<size;j++) b[i][j] -= 13;
      }
    }
    void mm1() { int i,j;          /* fully merged loops */
      for (i=0;i<size;i++) for (j=0;j<size;j++) { a[i][j] += 17; b[i][j] -= 13; }
    }

Results: simple loops. [Chart: runtime (100% = max) of ss1, ms1, mm1 on x86 gcc 3.2 -o3, x86 gcc 2.95 -o3, Sparc gcc 3.x -o1, Sparc gcc 3.x -o3.] ⇒ Merged loops superior, except Sparc with -o3.

Loop unrolling (factor = 2):

    for (j=0; j<=n; j++)
      p[j] = ... ;

becomes

    for (j=0; j<=n; j+=2)
      { p[j] = ... ; p[j+1] = ... ; }

Better locality for access to p. Fewer branches per execution of the loop. More opportunities for optimizations. Tradeoff between code size and improvement. Extreme case: completely unrolled loop (no branch).

High-level optimizations (book section 7.2): floating-point to fixed-point conversion, simple loop transformations, loop tiling/blocking, loop (nest) splitting, array folding.

Impact of caches on execution times? Execution time for traversal of a linked list, stored in an array, each entry comprising NPAD*8 bytes. Pentium P4: 16 kB L1 data cache, 4 cycles/access; 1 MB L2 cache, 14 cycles/access; main memory, 200 cycles/access. U. Drepper: What every programmer should know about memory*, 2007, http://www.akkadia.org/drepper/cpumemory.pdf; thanks to Prof. Teubner (LS6) for pointing out this source. * Title modeled on David Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Computing Surveys, 1991 (also used for this course). Graphics: U. Drepper, 2007

Cycles/access as a function of the size of the list. [Graph: plateaus for L1d hits, L2 hits, and L2 misses; one curve where prefetching succeeds, one where it fails.] Graphics: U. Drepper, 2007

Impact of TLB misses and larger caches. Elements on different pages: run time increases when exceeding the size of the TLB. Larger caches shift the steps to the right. Graphics: U. Drepper, 2007

Program transformation: loop tiling/loop blocking - original version -

    for (i=1; i<=N; i++)
      for (k=1; k<=N; k++) {
        r = x[i][k]; /* to be allocated to a register */
        for (j=1; j<=N; j++)
          Z[i][j] += r * Y[k][j];
      }

Never reusing information in the cache for Y and Z if N is large or the cache is small (2 N³ references for Z). [Figure: access patterns of the i, j, k loops over x, Y, and Z.]

Loop tiling/loop blocking - tiled version -

    for (kk=1; kk<=N; kk+=B)
      for (jj=1; jj<=N; jj+=B)
        for (i=1; i<=N; i++)
          for (k=kk; k<=min(kk+B-1, N); k++) {
            r = x[i][k]; /* to be allocated to a register */
            for (j=jj; j<=min(jj+B-1, N); j++)
              Z[i][j] += r * Y[k][j];
          }

Same elements of Z are reused in the next iteration of i. Reuse factor of B for Z, N for Y ⇒ O(N³/B) accesses to main memory. The compiler should select the best option. Monica Lam: The Cache Performance and Optimizations of Blocked Algorithms, ASPLOS, 1991

High-level optimizations (book section 7.2): floating-point to fixed-point conversion, simple loop transformations, loop tiling/blocking, loop (nest) splitting, array folding.

Transformation: loop nest splitting. Example: separation of margin handling. Before: many if-statements for margin checking execute inside the loop nest. After: a loop nest without checking (efficient), plus only the few margin elements to be processed separately.

High-level optimizations (book section 7.2): floating-point to fixed-point conversion, simple loop transformations, loop tiling/blocking, loop (nest) splitting, array folding.

Array folding - initial arrays. [Figure: lifetimes and address ranges of several arrays over time.]

Array folding - unfolded arrays. [Figure: each array occupies its own address range for the entire execution.]

Intra-array folding overlays elements of the same array whose lifetimes do not overlap; inter-array folding maps different arrays with disjoint lifetimes onto the same address range.

Summary: task concurrency management (re-partitioning of computations into tasks); floating-point to fixed-point conversion (range estimation, conversion, analysis of the results); high-level loop transformations (fusion, unrolling, tiling, loop nest splitting); array folding.