DSP Mapping, Coding, Optimization

Similar documents
TMS320C6000 Programmer s Guide

TMS320C6000 Programmer s Guide

Leveraging DSP: Basic Optimization for C6000 Digital Signal Processors

Writing Interruptible Looped Code for the TMS320C6x DSP

C6000 Compiler Roadmap

Computer Systems Lecture 9

TMS320C62x/C67x Programmer s Guide

Advanced Computer Architecture

D. Richard Brown III Associate Professor Worcester Polytechnic Institute Electrical and Computer Engineering Department

Computer Architecture

QUIZ Lesson 4. Exercise 4: Write an if statement that assigns the value of x to the variable y if x is in between 1 and 20, otherwise y is unchanged.

Pragma intrinsic and more

Migrating from Keil µvision for 8051 to IAR Embedded Workbench for 8051

Control Flow. COMS W1007 Introduction to Computer Science. Christopher Conway 3 June 2003

Corrections made in this version not in first posting:

Getting Started with Intel Cilk Plus SIMD Vectorization and SIMD-enabled functions

VLIW/EPIC: Statically Scheduled ILP

Program Optimization

Application Specific Signal Processors S

Changes made in this version not seen in first lecture:

Lecture 4 - Number Representations, DSK Hardware, Assembly Programming

Migrating from Keil µvision for 8051 to IAR Embedded Workbench for 8051

April 4, 2001: Debugging Your C24x DSP Design Using Code Composer Studio Real-Time Monitor

The New C Standard (Excerpted material)

Embedded Controller Programming 2

Why Pointers. Pointers. Pointer Declaration. Two Pointer Operators. What Are Pointers? Memory address POINTERVariable Contents ...

Machine Language Instructions Introduction. Instructions Words of a language understood by machine. Instruction set Vocabulary of the machine

Intermediate Programming, Spring 2017*

There are algorithms, however, that need to execute statements in some other kind of ordering depending on certain conditions.

A Fast Review of C Essentials Part I

Basic Types, Variables, Literals, Constants

High Performance Computing in C and C++

Signature: 1. (10 points) Basic Microcontroller Concepts

A Fast Review of C Essentials Part II

ELEC 377 C Programming Tutorial. ELEC Operating Systems

3/7/2018. Sometimes, Knowing Which Thing is Enough. ECE 220: Computer Systems & Programming. Often Want to Group Data Together Conceptually

CSCE 5610: Computer Architecture

Optimising with the IBM compilers

SIMD Programming CS 240A, 2017

Short Notes of CS201

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

Chapter 2A Instructions: Language of the Computer

Agenda. Peer Instruction Question 1. Peer Instruction Answer 1. Peer Instruction Question 2 6/22/2011

CS201 - Introduction to Programming Glossary By

Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world.

Introduction to Embedded Systems

Appendix. Grammar. A.1 Introduction. A.2 Keywords. There is no worse danger for a teacher than to teach words instead of things.

/* defines are mostly used to declare constants */ #define MAX_ITER 10 // max number of iterations

Mechatronics Laboratory Assignment 2 Serial Communication DSP Time-Keeping, Visual Basic, LCD Screens, and Wireless Networks

Cell Programming Tutorial JHD

Contents of Lecture 11

Cortex-R5 Software Development

Binghamton University. CS-211 Fall Syntax. What the Compiler needs to understand your program

C5500 Compiler Build Options One Use Case

CMSC 611: Advanced Computer Architecture

Introduction. C provides two styles of flow control:

Lecture 04: Machine Instructions

Porting BLIS to new architectures Early experiences

Today s topics. MIPS operations and operands. MIPS arithmetic. CS/COE1541: Introduction to Computer Architecture. A Review of MIPS ISA.

HW1 due Monday by 9:30am Assignment online, submission details to come

Review! * follows a pointer to its value! & gets the address of a variable! Pearce, Summer 2010 UCB! ! int x = 1000; Pearce, Summer 2010 UCB!

CS61C : Machine Structures

Advanced Systems Programming

EL2310 Scientific Programming

Model-based Software Development

Multiple Instruction Issue. Superscalars

CSCI 402: Computer Architectures. Instructions: Language of the Computer (3) Fengguang Song Department of Computer & Information Science IUPUI.

Dynamic Control Hazard Avoidance

Reminder: tutorials start next week!

A QUICK INTRO TO PRACTICAL OPTIMIZATION TECHNIQUES

Programming refresher and intro to C programming

MPLAB C1X Quick Reference Card

Stored Program Concept. Instructions: Characteristics of Instruction Set. Architecture Specification. Example of multiple operands

Multicycle Approach. Designing MIPS Processor

Zero-Latency Interrupts and TI-RTOS on C2000 Devices. Todd Mullanix TI-RTOS Apps Manager March, 5, 2018

Generating Rectify( ) Test driven development approach to TigerSHARC

Procedure and Object- Oriented Abstraction

CS61C : Machine Structures

SignalMaster Manual Version PN: M072005

ME 4447/ME Microprocessor Control of Manufacturing Systems/ Introduction to Mechatronics. Instructor: Professor Charles Ume

COMP3221: Microprocessors and. and Embedded Systems. Instruction Set Architecture (ISA) What makes an ISA? #1: Memory Models. What makes an ISA?

Optimization Prof. James L. Frankel Harvard University

More C Pointer Dangers

QUIZ. What are 3 differences between C and C++ const variables?

An introduction to Digital Signal Processors (DSP) Using the C55xx family

MIPS R-format Instructions. Representing Instructions. Hexadecimal. R-format Example. MIPS I-format Example. MIPS I-format Instructions

Programming in C - Part 2

Convolution Soup: A case study in CUDA optimization. The Fairmont San Jose 10:30 AM Friday October 2, 2009 Joe Stam

G Programming Languages - Fall 2012

Important From Last Time

G Programming Languages - Fall 2012

Page 1. Today. Important From Last Time. Is the assembly code right? Is the assembly code right? Which compiler is right?

Optimizing Memory Bandwidth

Branch Addressing. Jump Addressing. Target Addressing Example. The University of Adelaide, School of Computer Science 28 September 2015

ISA: The Hardware Software Interface

CSE A215 Assembly Language Programming for Engineers

Complementing Software Pipelining with Software Thread Integration

Important From Last Time

Martin Kruliš, v

Review of the C Programming Language for Principles of Operating Systems

Transcription:

DSP Mapping, Coding, Optimization On TMS320C6000 Family using CCS (Code Composer Studio) ver 3.3 Started with writing a simple C code in the class, from scratch Project called First, written for C6713 cycle accurate simulator SPRU198i.pdf chapters 2,3,4,5 - Chassaing book chapters 3 and 8, CCS help, Now, rules to optimize DSP C code Data Types: char 8 bits short 16 bits int 32 bits long 40 bits not the same as in PC Windows C programming long long 64 bits float 32 bits double 64 bits Basic Hints: 1) Use short vars when multiplication is involved (1 cycle vs. 5 cycles) 2) Use int or unsigned int for loop counters 3) Use correct libraries e.g. for C67 to use floats 66

Analyzing Performance 1) Use clock() function S =clock(); E= clock(); O= E-S; // Clock overhead S =clock(); /*Function or range To evaluate*/ ; E= clock(); printf 2) Use Profile mode Setup Profile, Enable Clock Run the code 3) Use Simulator Analysis in CCS Tools > Simulator Analysis, reset counters, choose counters, run the code 4) Use Stand-alone Simulator Analysis load6x g code.out generates.vaa file Recognize Critical Areas mostly loops 67

DSP Mapping, Coding, Optimization Analyzing Performance Total cycles in, Including inner functions One call Max cycles including inner functions 68

1) Use Compiler options -o2/o3 SW pipelining, SIMD, loop unrolling -pm combining src files, -ms2/ms3 size optimization 2) Memory Dependencies Compiler must know if two vars are independent to schedule them in parallel Use restrict keyword indicates that the pointer is the only one Use -pm option, (program-level opt) gives global access to compiler Use -mt option allows compiler to assume no dependencies short a[10], b[10]; // calling vecsum vecsum(a, a, b, 10); restrict is a type qualifier, load should not wait for store! void vecsum(short * restrict sum, short * restrict in1, short * restrict in2, unsigned int N) 69

3) Program Level Optimizations pm option In addition to dependency detection All source files go into one file If a particular argument in a function always has the same value, the compiler replaces the argument with the value and passes the value instead of the argument. If a return value of a function is never used, the compiler deletes the return code in the function. If a function is not called, directly or indirectly, the compiler removes the function. 4) Using Intrinsics C6000 compiler provides intrinsics, special functions that map directly to inline C62x/C64x/C64x+/C67x instructions, to optimize C/C++ code quickly 70

DSP Mapping, Coding, Optimization 4) Using Intrinsics 71 Complicated code can be replaced by HW mapped instructions

DSP Mapping, Coding, Optimization 4) Using Intrinsics 72 Part of C6000 C/C++ Compiler intrinsics spru198

DSP Mapping, Coding, Optimization 4) Using Intrinsics 73 Part of C64X/C64X+ C/C++ Compiler intrinsics spru198

DSP Mapping, Coding, Optimization 4) Using Intrinsics 74 Part of C67X C/C++ Compiler intrinsics spru198

5) Wider memory access for smaller data types Maximizing the throughput using single load/store to access more data elements of smaller size Don t use the old style of pointer casting N must be (made) even 75

76 DSP Mapping, Coding, Optimization 5) Wider memory access for smaller data types p can be of any type just to fill the wider memory size long long or double

5) Wider memory access for smaller data types Non-aligned memory access for C64 X and C64X+ only two times slower 8 byte access: 77

5) Wider memory access for smaller data types Non-aligned only on C64X and C64X+ 128 bits / cycle memory BW for aligned data 64 bits / cycle memory BW for non-aligned data used when there is a good reason Generic routines which cannot impose alignment e.g.memcpy() Single sample algorithms which update their input or output pointers by only one sample e.g. adaptive filters Nested loop algorithms e.g. 2D convolution Routines which have an irregular memory access pattern, or whose access pattern is data-dependent and not known until run time e.g. motion compensation in image processing 78

5) Wider memory access for smaller data types Spru187.pdf To tell the compiler about the loop trip count: #pragma MUST_ITERATE(Min trip count, Max trip count, multiple of) To tell the compiler to align arrays of any memory bank on double word boundaries: #pragma DATA_MEM_BANK (pointer, memory bank number) 0 means double word alignment 4k+0(bank0), 4k+1(bank1),, 4k+3(bank3) To tell the compiler about the location of the variable sections, defined in linker file #pragma DATA_SECTION ( symbol, " section name "); 79

5) Wider memory access for smaller data types _nassert(boolean) intrinsic provides alignment information to the compiler Claims the argument is true Example : _nassert(((int)sum & 0x3) == 0); // generates no code With -pm option it might be useless Compiler already has enough information! 6) Software pipelining Scheduling the loop instructions to execute instructions of different iterations together Let s look at loops more carefully Loop Qualification 80

6) Software pipelining Trip count : # of loop iterations Minimum trip count: sometimes by compiler, or MUST_ITERATE pragma Minimum safe trip count: minimum trip count to safely execute the software pipelined version of the loop If min-trip-count < min-safe-trip count Redundant loop! (next slide) Loop reversal: count down loops are more efficiently software pipelined 81

6) Software pipelining Eliminating the redundant loop If compiler cannot determine the minimum trip count makes two versions Non-pipelined to be executed if min<min-safe else pipelined A redundant loop will be required : reduces performance, increases code size Use mhn: n-byte memory pad will be assumed by the compiler, get prolog and epilog in the loop Provide info to the compiler when possible, must iterate and _nassert Loop unrolling Expanding the small loops so that some iterations appear in the loop body Either compiler does it, or programmer suggests by #pragma UNROLL(n) (n times), or programmer does it Compiler performs SW pipelining only on inner loops Unrolling makes larger inner loops 82

DSP Mapping, Coding, Optimization 6) Software pipelining Loop unrolling Increases code size 5 83

6) Software pipelining What Disqualifies a loop from being SW pipelined Having a register value being live too long If the loop has complex condition code, needing 5 (6 for C64) condition registers Having function calls (use inline functions) Having conditional break or early exit in a loop Increasing trip counter (use loop reversal if compiler cannot) Modifying trip counter in the loop Big loop body that uses more than DSP registers Compiler Feedback -k mw options: -k is to make the *.asm files and mw to have feedback in there Understanding the feedback is important 84

DSP Mapping, Coding, Optimization Compiler Feedback (chap 4, spru198) First part Second part Write to read =ceil((.l +.S +.LS) / 2) = ceil((.l +.S +.D +.LS +.LSD) / 3) 85

Compiler Feedback (chap 4) Third part Iteration Interval : Cycles between successive initiations of the loop Removes prolog by increasing the loop trip count -mh option ii increments until a schedule is found ii: always the maximum of loop carried dependency bound and the partitioned resource bound Did not find Schedule parameters: Example: RegsLive Always Max RegsLive Max CondRegs Regs Live Always : 1/5 (A/B side) Max Regs Live : 14/19 Max Cond Regs Live : 1/0 86

DSP Mapping, Coding, Optimization Compiler Feedback 87 More info

DSP Mapping, Coding, Optimization 88 Compiler Feedback Loop disqualification messages ( format as appeared in spru198 chapter 5) Bad Loop Structure Loop Contains a Call. Too Many Instructions.

See code examples enclosed Addressing.rar, showing all modes of addressing And division.rar for next slides Also showing : 1) How to init memory in asm file 2) How to return double data from linear assembly files 3) How to write asm in C 4) How to pass a variable back to C Pipeline: Fetch Packet, Execution Packet VelociTI architecture does not allow execute packets to cross-fetch packet boundaries, VelociTI.2 eliminates this restriction by instruction packing 89

To continue: 1) Go for spru198 chapter 3 for tutorial example, lesson 1-3 C62 version in the class C64 version as an Exercise 2) Move to TI C6000 ROM, chapters 7 and 11 Exercise-3 Added missing stuff like linear assembly directives from spru198 chapter 5 3) And TI C6000 ROM, chapters 12 Hand optimization added from different text books 4) And TI C6000 ROM, chapters 4 DSP BIOS, Simulator examples added 5) Same stuff, Updated for newer DSPPs in Optimization Workshop 90