Code Transformations at System Level


Motivation

- Multimedia applications: MPEG-4 needs > 2 GOP/s, > 6 GB/s, > 10 MB of storage
- Embedded systems, SoC, NoC
- Memory is a power bottleneck [Sakurai & Kuroda, ED&T'97]:
  - MPEG-2 decoder: logic 27%, I/O 24%, clock 24%, on-chip memory 25%
  - CPU + caches: logic 25%, I/O 3%, clock 26%, on-chip memory 46%
- Memory is a speed bottleneck [Patterson]:
  - Moore's Law: μproc 60%/year, DRAM 7%/year, so the processor-memory speed gap grows 50%/year!

[Figure: a pipeline from data in through PE1, PE2, PE3 to data out, with memories M1, M2, M3 attached to the processing elements]

What can be done? Reduce overheads!

- Do necessary calculations only:
  - do not duplicate calculations vs. do not store unnecessary data
- Data must have the right size (bitwidth, array size):
  - select the right data layout
  - the bitwidth can be reduced by coding
  - the number of switchings on a bus can be reduced by coding
- Exploit the memory hierarchy:
  - local copies may significantly reduce data transfers
- Distribute the load:
  - systems are designed to stand the peak performance
- But do not forget trade-offs...

Data layout optimization

- Input/output data ranges are given, but what about the intermediate data?
- Integer vs. fixed point vs. floating point:
  - integer: + simple operations; - no fractions
  - fixed point: + simple add/subtract operations; - normalization needed for multiplication/division
  - floating point: + flexible range; - normalization needed

Data layout optimization: Software implementations

Example: 16-bit vs. 32-bit integers?

- depends on the CPU, bus & memory architectures
- depends on the compiler (optimizations)
- 16-bit:
  + less space in the main memory
  + faster loading from the main memory into the cache
  - unpacking/packing may be needed (compiler- and CPU-dependent)
  - relative redundancy when accessing a random word
- 32-bit:
  + data matches the CPU word [native for most modern CPUs]
  - waste of bandwidth (bus load!) for small data ranges
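
To make the 16-bit vs. 32-bit trade-off concrete, here is a minimal C sketch (the array size and values are illustrative assumptions, not from the slides): the 16-bit layout halves the footprint and doubles the elements per cache line, at the price of explicit packing on store and widening before any arithmetic that would overflow 16 bits.

    #include <stdint.h>
    #include <stdio.h>

    #define N 1024

    static int32_t wide[N];    /* matches the native word; wastes bandwidth for small ranges */
    static int16_t narrow[N];  /* half the footprint; twice the elements per cache line */

    int main(void) {
        for (int i = 0; i < N; i++) {
            wide[i] = i - N / 2;           /* values easily fit in 16 bits */
            narrow[i] = (int16_t)wide[i];  /* the "packing" step the 32-bit layout avoids */
        }
        /* Arithmetic on the narrow layout must widen first ("unpacking"):
           a product of two int16_t values needs 32 bits. */
        int32_t acc = 0;
        for (int i = 0; i < N; i++)
            acc += (int32_t)narrow[i] * narrow[i];
        printf("footprint: %zu vs. %zu bytes, checksum %ld\n",
               sizeof wide, sizeof narrow, (long)acc);
        return 0;
    }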

Data layout optimization: Hardware implementations

Example #1: 16-bit integer vs. 16-bit fixed point vs. 16-bit floating point

- integer, 1+15 bits: ~ -32000 to +32000, precision 1
- fixed point, 1+5+10 bits: ~ -32 to +32, precision ~0.001 (~0.03%); normalization == shifting
- floating point, 1+5+10 bits: ~ -64000 to +64000, precision ~0.1%; normalization == analysis + shifting

Example #2: 2's complement vs. integer with sign

- 2's complement: +1 == 00000001, 0 == 00000000, -1 == 11111111
- integer + sign: +1 == 00000001, 0 == 00000000/10000000, -1 == 10000001
- mind the number of switchings when the data swings around zero!!! [e.g., DSP] (see the toggle-count sketch below)

Optimizing the number of calculations: Input space adaptive design

Wang et al. @ Princeton [DAC 2001]

- The main idea: avoid unnecessary calculations
  - the input data is analyzed for boundary cases
  - boundary cases get simplified calculations (vs. the full algorithm)
  - a boundary case may occur most of the time... while the full algorithm is actually calculating nothing
- Trade-offs:
  - extra hardware is introduced
  - the implementation is faster
  - the implementation consumes less power
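
The switching-activity argument can be demonstrated in software. The sketch below is an assumption-level model, not a gate-level one (the 8-bit bus width and the test signal are invented; __builtin_popcount is a GCC/Clang builtin): it counts bus-line toggles for a signal oscillating around zero in both encodings.

    #include <stdint.h>
    #include <stdio.h>

    /* Number of bus lines that switch between two consecutive values
       (__builtin_popcount: GCC/Clang builtin). */
    static int toggles(uint8_t a, uint8_t b) {
        return __builtin_popcount((unsigned)(a ^ b));
    }

    /* Sign-magnitude encoding of a small signed value into 8 bits. */
    static uint8_t sign_mag(int v) {
        return v < 0 ? (uint8_t)(0x80 | -v) : (uint8_t)v;
    }

    int main(void) {
        const int sig[] = { 1, -1, 2, -2, 1, -1 };  /* swings around zero, DSP-style */
        const int n = sizeof sig / sizeof sig[0];
        int t2c = 0, tsm = 0;
        for (int i = 1; i < n; i++) {
            t2c += toggles((uint8_t)sig[i - 1], (uint8_t)sig[i]);   /* 2's complement */
            tsm += toggles(sign_mag(sig[i - 1]), sign_mag(sig[i])); /* sign-magnitude */
        }
        /* 2's complement flips almost all 8 lines on every sign change;
           sign-magnitude flips only the sign bit plus the changed magnitude bits. */
        printf("toggles: 2's complement %d, sign-magnitude %d\n", t2c, tsm);
        return 0;
    }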

Input space adaptive design: Trade-offs

- Extra hardware is introduced:
  - the algorithm is essentially duplicated (or triplicated, or...)
  - extra units for the analysis are needed
- The implementation is faster:
  - fewer operations require fewer clock steps
  - the operations themselves can be faster
- The implementation consumes less power:
  - a smaller number of units is involved in the calculations
  - the clock frequency can be slowed
  - the supply voltage can be lowered

Input space adaptive design: The Case

Discrete wavelet transformation

- four input words A, B, C, D
- in 95.2% of the cases: A = B = C = D
- four additions, four subtractions and four shifts are replaced with four assignments

[Figure: a behavior with inputs i1, i2, ..., in and outputs o1, o2, ..., om; in the optimized behavior a test such as i2 > i1 selects between the original sub-behavior f(i1, i2) and an optimized sub-behavior]

Results: area +17.5%, speed 52.1% (cycles), power 58.9%, Vdd scaled 26.6%
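
A minimal sketch of the idea in C, under assumed details (the actual DWT step and the 95.2% statistic belong to the cited Princeton design; the Haar-style arithmetic here is a stand-in): test the dominant boundary case first and run the full computation only when it fails.

    /* Input-space-adaptive 2x2 Haar-style DWT step: a sketch, not the cited design.
       out[0] is the average; out[1..3] are the detail (difference) coefficients. */
    void dwt_step(int a, int b, int c, int d, int out[4]) {
        if (a == b && b == c && c == d) {
            /* Boundary case, dominant in the profiled input space: the adds,
               subtracts and shifts collapse to plain assignments. */
            out[0] = a; out[1] = 0; out[2] = 0; out[3] = 0;
        } else {
            /* Full algorithm, only for the rarer general case. */
            out[0] = (a + b + c + d) >> 2;
            out[1] = (a - b + c - d) >> 2;
            out[2] = (a + b - c - d) >> 2;
            out[3] = (a - b - c + d) >> 2;
        }
    }

In hardware the test costs a few comparators and blocking gates; in software it trades a well-predicted branch for a dozen arithmetic operations.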

Input space adaptive design: A few examples

- Multiplication with zero: the result is zero -> nothing to calculate
  - two comparators, a few blocking gates, plus some extras in the controller
  - similarly multiplication with one, etc.
  - it must be cost efficient: the extra HW consumes power too!
- Multiplier size:
  - 32 x 32: A*B = (A_H16*2^16 + A_L16) * (B_H16*2^16 + B_L16)
    = A_H16*B_H16*2^32 + A_H16*B_L16*2^16 + A_L16*B_H16*2^16 + A_L16*B_L16
  - A < 2^16 & B < 2^16 -> A_H16 = 0 & B_H16 = 0, and only the last partial product remains
- Multiplication with a constant -> shift-adds (see the sketch after this slide pair):
  - 6*x = 4*x + 2*x = (x<<2) + (x<<1)
  - -1.75*x = 0.25*x - 2*x = (x>>2) - (x<<1)

Memory architecture optimization

Life time of array elements

- B = f(A); C = g(B);
- [Figure: the life times of a, b and c only partially overlap; a static one-array-after-another allocation leaves the shaded memory area wasted]
- Improved allocations:
  - typical allocation: a[i]==mem[i], b[i]==mem[base_b+i], c[i]==mem[base_c+i]
  - a smarter allocation: a[i]==mem[i], b[i]==mem[base_b+i], c[i]==mem[i]
  - the smallest memory waste: a[i]==mem[i], b[i]==mem[size-i], c[i]==mem[i]
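
A hedged C rendering of the two arithmetic tricks above (plain C99; the function names are invented): the constant multiplication becomes shifts and adds, and the 32x32 product collapses to a single 16x16 partial product whenever both operands fit in 16 bits, which is the test the extra comparators would perform in hardware.

    #include <stdint.h>

    /* 6*x = 4*x + 2*x = (x<<2) + (x<<1); assumes non-negative x here,
       since left-shifting negative values is not portable C. */
    static int32_t mul6(int32_t x) {
        return (x << 2) + (x << 1);
    }

    /* 32x32 multiply via 16-bit halves, with the input-space shortcut. */
    static uint64_t mul32(uint32_t a, uint32_t b) {
        uint32_t ah = a >> 16, al = a & 0xFFFFu;
        uint32_t bh = b >> 16, bl = b & 0xFFFFu;
        if (ah == 0 && bh == 0)           /* both operands < 2^16          */
            return (uint64_t)al * bl;     /* one small multiplier suffices */
        return ((uint64_t)ah * bh << 32)  /* A_H16*B_H16*2^32 */
             + ((uint64_t)ah * bl << 16)  /* A_H16*B_L16*2^16 */
             + ((uint64_t)al * bh << 16)  /* A_L16*B_H16*2^16 */
             + (uint64_t)al * bl;         /* A_L16*B_L16      */
    }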

Memory architecture optimization

Code transformations to reduce/balance data transfers

- ATOMIUM + ACROPOLIS: M. Miranda, A. Vandecappelle, E. Brockmeyer, P. Slock, F. Catthoor, et al. @ IMEC
- Code transformations to reduce/balance data transfers
- Data copies to exploit the memory hierarchy
- Remove redundant transfers, increase locality
- Explicit data reuse: add explicit data copies
- Reduce capacity misses: advanced life time analysis
- Reduce conflict misses: change the data layout

Code transformations

Loop transformations

- improve the regularity of accesses
- improve locality: production -> consumption
- expected influence: reduced temporary and background storage

Before (storage size N, since all of A stays live across the j loop):

    for (j=1; j<=m; j++)
      for (i=1; i<=n; i++)
        A[i] = foo(A[i], ...);
    for (i=1; i<=n; i++)
      out[i] = A[i];

After loop interchange and merging (storage size 1, since each A[i] is fully produced and then consumed):

    for (i=1; i<=n; i++) {
      for (j=1; j<=m; j++) {
        A[i] = foo(A[i], ...);
      }
      out[i] = A[i];
    }
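
The "storage size 1" claim can be realized literally by scalar replacement; the sketch below goes one step beyond the slide (init and foo are invented stand-ins for the pieces the slide elides) and keeps the live value in a single temporary instead of an N-element array.

    /* Stand-ins for the pieces the slide elides. */
    static int init(int i)              { return i; }
    static int foo(int t, int i, int j) { return t + i * j; }

    void produce(int out[], int n, int m) {
        for (int i = 1; i <= n; i++) {
            int t = init(i);           /* one scalar replaces the array A[1..n] */
            for (int j = 1; j <= m; j++)
                t = foo(t, i, j);      /* all m updates stay in a register */
            out[i] = t;                /* single consumption point */
        }
    }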

Polyhedral model: polytope placement

Life time analysis of array elements

    for (i=1; i<=n; i++)
      for (j=n; j>=1; j--)
        A[i][j] = foo(A[i-1][j], ...);

    for (j=1; j<=n; j++)
      for (i=1; i<=n; i++)
        A[i][j] = foo(A[i-1][j], ...);

- The indexing order must preserve causality!
- Storage requirements ~ the number of dependency lines the indexing order keeps live
- [Figure: the (i, j) iteration space with its dependency arrows and the chosen indexing order]
- select the right indexing order
- merge the loops
- can be used for loop pipelining
- possible index range expansion

Polytope placement: initial code (storage size 2*N-1)

    /* A */
    for (i=1; i<=n; i++)
      for (j=1; j<=n-i+1; j++)
        a[i][j] = in[i][j] + a[i-1][j];
    /* B */
    for (p=1; p<=n; p++)
      b[p][1] = f(a[n-p+1][p], a[n-p][p]);
    /* C */
    for (k=1; k<=n; k++)
      for (l=1; l<=k; l++)
        b[k][l+1] = g(b[k][l]);

[Figure: the three polytopes A, B and C in their (i, j), (p) and (k, l) index spaces]

Polytope placement: reordering + loop merging (storage size 2)

    for (j=1; j<=n; j++) {
      for (i=1; i<=n-j+1; i++)
        a[i][j] = in[i][j] + a[i-1][j];
      b[j][1] = f(a[n-j+1][j], a[n-j][j]);
      for (l=1; l<=j; l++)
        b[j][l+1] = g(b[j][l]);
    }

[Figure: the merged polytope in the (i, j) space, with i taking the role of p and l the role of k]

(A small harness that checks this reordering against the initial code is sketched after this slide pair.)

Exploiting memory hierarchy

- Inter-array in-place mapping reduces the size of the combined memories:
  - the source array is consumed while processing and can be replaced with the result
- Introducing local copies reduces the accesses to the (main) memory:
  - the cost of the local memory should be smaller than the cost of the reduced accesses
- Proper base addresses reduce cache penalties:
  - in direct mapped caches the same block may be used by two different arrays accessed in the same loop!
- All optimizations together:
  - ~60% reduction in memory accesses
  - ~80% reduction in total memory size
  - ~75% reduction in cycle count
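
Reorderings like this are easy to get wrong, so it pays to execute both versions and compare. The harness below runs under assumed definitions (f, g, the test input and the zeroed boundary row a[0][*] are stand-ins; the slides do not specify them) and checks that the merged nest reproduces the b computed by the initial code.

    #include <stdio.h>
    #include <string.h>

    #define N 6
    static int f(int x, int y) { return x + 2 * y; }   /* stand-in */
    static int g(int x)        { return x + 1; }       /* stand-in */

    int main(void) {
        int in[N + 1][N + 1], a[N + 1][N + 1] = {0};
        int b1[N + 1][N + 2] = {0}, b2[N + 1][N + 2] = {0};
        for (int i = 0; i <= N; i++)
            for (int j = 0; j <= N; j++)
                in[i][j] = i * 7 + j;                  /* arbitrary test input */

        /* Initial code: three separate loop nests A, B, C. */
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N - i + 1; j++)
                a[i][j] = in[i][j] + a[i - 1][j];
        for (int p = 1; p <= N; p++)
            b1[p][1] = f(a[N - p + 1][p], a[N - p][p]);
        for (int k = 1; k <= N; k++)
            for (int l = 1; l <= k; l++)
                b1[k][l + 1] = g(b1[k][l]);

        /* Reordered + merged version. */
        memset(a, 0, sizeof a);
        for (int j = 1; j <= N; j++) {
            for (int i = 1; i <= N - j + 1; i++)
                a[i][j] = in[i][j] + a[i - 1][j];
            b2[j][1] = f(a[N - j + 1][j], a[N - j][j]);
            for (int l = 1; l <= j; l++)
                b2[j][l + 1] = g(b2[j][l]);
        }
        printf("%s\n", memcmp(b1, b2, sizeof b1) ? "MISMATCH" : "match");
        return 0;
    }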

Introducing local copies

Image processing: blurring, edge detection, etc.

[Figure: an image scanned in the (i, j) order with a 3x3 window around the current pixel]

- Extra copies should be analyzed for efficiency: time & power
- no buffering: 9 accesses per pixel in the main memory
- 3 line buffers (in cache): 3 accesses per pixel go to the buffer memory
- 9 pixels in the register file: 1 access per pixel in the main memory
- (a line-buffer sketch follows this slide pair)

Memory architecture optimization

Packing data into memories (#1)

- Packing arrays into memories
- The memory allocation and binding task:
  - memory access distribution in time
  - simultaneous accesses
  - mapping data arrays onto physical memories
  - single-port vs. multi-port memories
  - unused memory area
- Access sequence: read(a), write(b); read(c)
- [Figure: candidate packings of arrays a, b and c into memories; some waste area, and the simultaneous read(a)/write(b) rules out putting a and b into the same single-port memory; only some mappings are legal]
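
A sketch of the line-buffer variant for the 3x3 window above (generic C; the image width, the box-blur kernel and the buffer management are illustrative assumptions): each pixel is fetched from the slow image memory exactly once, and the nine window reads per output pixel are served from three small row buffers.

    #include <stdint.h>
    #include <string.h>

    #define W 64   /* image width: illustrative */

    /* Produce one output row of a 3x3 box blur.
       Precondition: rows row-1 and row are already in lines[]; valid for
       1 <= row <= height-2. Each call reads one new row from main memory. */
    void blur_row(const uint8_t image[][W], int row, uint8_t out[W],
                  uint8_t lines[3][W]) {
        memcpy(lines[(row + 1) % 3], image[row + 1], W);  /* 1 main-memory access/pixel */
        const uint8_t *top = lines[(row - 1) % 3];
        const uint8_t *mid = lines[row % 3];
        const uint8_t *bot = lines[(row + 1) % 3];
        for (int j = 1; j < W - 1; j++) {
            int sum = 0;   /* the 9 window reads all hit the small buffers */
            for (int dj = -1; dj <= 1; dj++)
                sum += top[j + dj] + mid[j + dj] + bot[j + dj];
            out[j] = (uint8_t)(sum / 9);
        }
    }

Keeping the nine current window pixels in the register file would remove even the buffer-memory reads, matching the 1-access-per-pixel figure on the slide.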

Packing data into memories (#2)

Storage operation scheduling

- Usually scheduled together with the other operations
- Affects the memory architecture:
  - the number of memories
  - single-port versus multi-port memories
- Memory operations can be scheduled before the others, to explore possible memory architectures:
  - puts additional constraints on the ordinary scheduling
  - distributes the accesses more uniformly
  - reduces the required bus bandwidth
  - reduces the number of possible combinations

Storage operation scheduling

[Figure: access traces, conflict graphs and memory allocations for arrays A-D, initially and after scheduling the memory accesses; spreading the accesses removes edges from the conflict graph, so the arrays fit into fewer, simpler memories]

Packing data into memories (#3)

Packing data words (data fields) into memory words

    typedef struct {
      int field_0:3;
      int field_1:11;
    } my_type;

    my_type memo[256];
    my_type *p;
    // ...
    p->field_0 = 2;
    // ...

- Case #1: allocation by a compiler (wasted bits in each word)
- Case #2: separate memories for field_0 and field_1
- Case #3: a single memory
- Case #2 & Case #3: combined mapping
- Case #4: both fields packed into a single memory word
- How to select the candidates for packing?

Packing data fields into memory words

Data access duplications may occur:

- a word (field) is read but discarded
- extra reads may be needed to avoid data corruption when writing

- read-read: r(f1); r(f2) -> one combined r(f1+f2), the field not needed yet is discarded
- write-write: w(f1); w(f2) -> one combined w(f1+f2)
- single write: w(f2) -> r(f1+f2); w(f1+f2), a read-modify-write that preserves f1
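
The single-write case is exactly the read-modify-write that packed layouts force on the code. Below is a hedged sketch (the field widths follow the my_type example above; the bit positions and helper names are invented for illustration) of updating the 3-bit field inside a 16-bit memory word without corrupting its 11-bit neighbour.

    #include <stdint.h>

    /* One possible packing of my_type into a 16-bit word:
       field_0 in bits 0..2, field_1 in bits 3..13 (bits 14..15 unused). */
    #define F0_MASK 0x0007u
    #define F1_MASK (0x07FFu << 3)

    static uint16_t memo[256];                   /* the packed memory */

    /* w(field_0) becomes r(f0+f1); w(f0+f1): a read-modify-write. */
    void write_field_0(unsigned idx, unsigned v) {
        uint16_t word = memo[idx];               /* extra read of both fields */
        word = (uint16_t)((word & ~F0_MASK) | (v & F0_MASK));
        memo[idx] = word;                        /* write back the whole word */
    }

    /* r(field_1) reads the combined word and discards field_0. */
    unsigned read_field_1(unsigned idx) {
        return (memo[idx] & F1_MASK) >> 3;
    }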

Memory architecture optimization

Address code optimization

- ADOPT: processor-architecture-independent ADdress code OPTimization; M. Miranda, F. Catthoor, et al. @ IMEC
- An array access...
  - base address
  - dimensions: multidimensional arrays vs. physical memories
  - repeated access patterns, e.g., windows in image processing
- ... may require a large number of calculations:
  - duplications can be removed
  - operations can be simplified

Code hoisting

Loop-invariant code motion

- address calculations that are repeated at every iteration
- not handled by compilers here: it falls outside basic-block optimization

Before, (3+, 1*) x 49 iterations = 245 weighted operations (a multiplication counted as two additions):

    for (k=-4; k<3; k++)
      for (i=-4; i<3; i++)
        ... d[32*(8+k)+16+i] ...      /* == d[272+32*k+i] */

After hoisting the k-dependent part, (1+, 1*) x 7 + (1+) x 49 = 70:

    for (k=-4; k<3; k++) {
      tmp_k = 272 + 32*k;
      for (i=-4; i<3; i++)
        ... d[tmp_k+i] ...
    }
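
As a sanity check (a throwaway harness, not part of the slides), the snippet below walks the full iteration space and verifies that the hoisted address always equals the original expression.

    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        for (int k = -4; k < 3; k++) {
            int tmp_k = 272 + 32 * k;               /* loop-invariant in i */
            for (int i = -4; i < 3; i++)
                assert(32 * (8 + k) + 16 + i == tmp_k + i);
        }
        printf("hoisted addressing matches the original expression\n");
        return 0;
    }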

Modulo operations

    for (i=0; i<=20; i++)
      addr = i % 3;

The division hidden in the modulo can be replaced with a small counter (a check of this rewrite follows this slide pair):

    ptr = -1;
    for (i=0; i<=20; i++) {
      if (ptr >= 2) ptr -= 2;
      else ptr++;
      addr = ptr;
    }

- Counter-based address generation units
- Customized hardware:
  - special addressing units are cheaper
  - the data-path itself gets simpler
  - essentially distributed controllers

Scheduling of high performance applications

- Critical path:
  - the sequence of operations with the longest finishing time
  - the longest path in the (weighted) graph
  - critical path in clock steps
  - critical path inside the clock period (in time units)
  - topological paths vs. false paths
- Latency ~ clock_period * critical_path_in_clock_steps
- Loop folding and pipelining:
  - loop folding: a fixed number of cycles, no need to calculate the conditions in the loop header, iterations of the loop body can overlap
  - loop pipelining: executing some iterations of the loop body in parallel (without data dependencies)
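
The counter rewrite is cheap to validate exhaustively; this throwaway check (matching the loop bounds on the slide) confirms that the branch-and-increment counter tracks i % 3.

    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        int ptr = -1;
        for (int i = 0; i <= 20; i++) {
            if (ptr >= 2) ptr -= 2;   /* wrap: 2 -> 0 */
            else ptr++;               /* count up: 0 -> 1 -> 2 */
            assert(ptr == i % 3);     /* the counter reproduces the modulo */
        }
        printf("counter matches i %% 3 for i = 0..20\n");
        return 0;
    }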

Scheduling of high performance applications

Speculative execution

- Out-of-order execution:
  - the execution of operations in a conditional branch can be started before the branch itself starts
  - extra resources are required
- Speeding up the algorithm:
  - guaranteed speed-up
  - overall speed-up (statistical)
- Scheduling outside basic blocks:
  - flattens the hierarchy
  - increases the optimization complexity

Conclusion

- Efficient implementation is not only hardware optimization
- Efficient implementation is:
  - selecting the right algorithm
  - selecting the right data types
  - selecting the right architecture
  - making the right modifications and optimizing...
- right == the most suitable
- Think first, implement later!