Integrated Modulo Scheduling and Cluster Assignment for TMS320C64x+ Architecture 1

Similar documents
Integrated Modulo Scheduling and Cluster Assignment for TI TMS320C64x+Architecture

IR-Level Versus Machine-Level If-Conversion for Predicated Architectures

TECH. 9. Code Scheduling for ILP-Processors. Levels of static scheduling. -Eligible Instructions are

The IA-64 Architecture. Salient Points

Impact of Source-Level Loop Optimization on DSP Architecture Design

Instruction Scheduling. Software Pipelining - 3

Generic Software pipelining at the Assembly Level

RS-FDRA: A Register-Sensitive Software Pipelining Algorithm for Embedded VLIW Processors

Model-based Software Development

CE431 Parallel Computer Architecture Spring Compile-time ILP extraction Modulo Scheduling

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Modern Processor Architectures. L25: Modern Compiler Design

Complementing Software Pipelining with Software Thread Integration

Lecture Compiler Backend

VLIW/EPIC: Statically Scheduled ILP

The Elcor Intermediate Representation. Trimaran Tutorial

Code Generation for TMS320C6x in Ptolemy

Advanced Computer Architecture

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs.

An Optimizing Compiler for the TMS320C25 DSP Chip

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Evaluating Inter-cluster Communication in Clustered VLIW Architectures

Dynamic Binary Translation for Generation of Cycle AccurateOctober Architecture 28, 2008 Simulators 1 / 1

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Achieving Out-of-Order Performance with Almost In-Order Complexity

Basics of Performance Engineering

TMS320C3X Floating Point DSP

Yunsup Lee UC Berkeley 1

Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas

Apple LLVM GPU Compiler: Embedded Dragons. Charu Chandrasekaran, Apple Marcello Maggioni, Apple

CMSC411 Fall 2013 Midterm 2 Solutions

If-Conversion SSA Framework and Transformations SSA 09

The University of Texas at Austin

Compiler Optimizations and Auto-tuning. Amir H. Ashouri Politecnico Di Milano -2014

Lecture: Pipeline Wrap-Up and Static ILP

Flexible wireless communication architectures

ECE 4750 Computer Architecture, Fall 2015 T16 Advanced Processors: VLIW Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

Multicore DSP Software Synthesis using Partial Expansion of Dataflow Graphs

Architectures for Instruction-Level Parallelism

A Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware

Embedded Systems Development

C6000 Compiler Roadmap

Stereo Vision II: Dense Stereo Matching

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

An introduction to DSP s. Examples of DSP applications Why a DSP? Characteristics of a DSP Architectures

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts

Static Branch Prediction

Removing Communications in Clustered Microarchitectures Through Instruction Replication

An Evaluation of Multi-Hit Ray Traversal in a BVH Using Existing First-Hit/Any-Hit Kernels: Algorithm Listings and Performance Visualizations

Introduction Architecture overview. Multi-cluster architecture Addressing modes. Single-cluster Pipeline. architecture Instruction folding

Impact of the current LLVM inlining strategy on complex embedded application memory utilization and performance

Lecture 13 - VLIW Machines and Statically Scheduled ILP

ECE 4750 Computer Architecture, Fall 2018 T15 Advanced Processors: VLIW Processors

Multi-cycle Instructions in the Pipeline (Floating Point)

Processors, Performance, and Profiling

Understanding multimedia application chacteristics for designing programmable media processors

CS341l Fall 2009 Test #2

Recursion. Comp Sci 1575 Data Structures. Introduction. Simple examples. The call stack. Types of recursion. Recursive programming

Still Image Processing on Coarse-Grained Reconfigurable Array Architectures

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

Instruction Scheduling

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Native Offload of Haskell Repa Programs to Integrated GPUs

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

Impact of ILP-improving Code Transformations on Loop Buffer Energy

CS 188: Artificial Intelligence. Recap Search I

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

Memory Ordering Mechanisms for ARM? Tao C. Lee, Marc-Alexandre Boéchat CS, EPFL

Microprocessor Extensions for Wireless Communications

UCI. Intel Itanium Line Processor Efforts. Xiaobin Li. PASCAL EECS Dept. UC, Irvine. University of California, Irvine

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

Software De-Pipelining Technique

Exploitation of instruction level parallelism

Native Simulation of Complex VLIW Instruction Sets Using Static Binary Translation and Hardware-Assisted Virtualization

Chapter 2 Parallel Computer Architecture

Introduction. L25: Modern Compiler Design

CSE502 Lecture 15 - Tue 3Nov09 Review: MidTerm Thu 5Nov09 - Outline of Major Topics

Germán Llort

ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors

IA-64 Compiler Technology

Media Instructions, Coprocessors, and Hardware Accelerators. Overview

CPU Structure and Function. Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition

Mapping Vector Codes to a Stream Processor (Imagine)

SSA Construction. Daniel Grund & Sebastian Hack. CC Winter Term 09/10. Saarland University

Lect. 2: Types of Parallelism

Weaving Relations for Cache Performance

CS 152 Computer Architecture and Engineering. Lecture 16 - VLIW Machines and Statically Scheduled ILP

Predicting Program Phases and Defending against Side-Channel Attacks using Hardware Performance Counters

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CS 343: Artificial Intelligence

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

Leveraging Predicated Execution for Multimedia Processing

Applications to MPSoCs

CS 341l Fall 2008 Test #2

ECE 505 Computer Architecture

Transcription:

Integrated Modulo Scheduling and Cluster Assignment for TMS320C64x+ Architecture 1 Nikolai Kim, Andreas Krall {kim,andi}@complang.tuwien.ac.at Institute of Computer Languages University of Technology Vienna computer lang uages ODES-11: Optimizations for DSP and Embedded Systems 1 This work is supported by the Austrian Science Fund (FWF) under contract P21842, Epicopt: Optimal Code Generation for Explicitly Parallel Processors.

Outline Implementation Swing modulo scheduling extension/adaptation Two different cluster assignment heuristics Implemented within LLVM 2.9 Targeting TI s TMS320C64X DSP Evaluation Taking UAS, ILP as baseline Based on a cycle accurate simulator MiBench, mediabench, DSPStone, BenchmarkGames, SingleUnit tests 35 kernels in total, 14 most representative presented

Target architecture cluster A crosspath A Register file A LB SB MB DB LA SA MA DA Register file B crosspath B cluster B Texas Instruments TMS320C64X Clustered VLIW architecture, 2 clusters 4 functional units, 32 GP registers per cluster 3 predicate registers per cluster, 6 cycles branch latency DSP, SIMD subset, predication, soft. pipelining buffer

Intercluster communication cluster B cluster A B0 =... A0 =... B1 = COPY A0 B2 = LOAD B0 [B1] a) B0 =... A0 =... B1 = ADD B0, A0 b) Data transfer a) explicit, via inserted COPY instructions b) implicit, via intercluster crosspaths, 1 cycle delay (crosspath stall) for uses placed directly after definitions

If-Conversion entry entry for.cond for.cond p!p p land.rhs land.rhs BB#6!q for.loop!p BB#6 q land.end land.end for.end!q for.body q!q b) c) for.end for.body a) Basics As preprocessing to modulo scheduling Requires hardware support, removes conditional branches Reduces basic block count, increases ILP

Modulo scheduling (1) General Iterative II scheme, swing scheduling adaptation Extended to address target specific factors such as functional unit support and crosspath stalls Employs modulo variable expansion based on lifetime analysis Utilizes modulo resource table, captures crosspath occupation

Modulo scheduling (2) Schedule nodes Assign clusters yes success? no Reschedule nodes Increase II Emit schedule Specific Two-pass setup: Iteratively generate a preliminary schedule in combination with provided clustering heuristics Distribute intercluster copies, avoid crosspath stalls

Cluster assignment Simple naive heuristic Non-integrated, losely coupled with scheduling routine DG depth ordering, uniform handling of all dependences Processes the DG at once in a top-down manner Decides upon already assigned predecessor nodes only Extended variant Runs inline with the modulo scheduler Operates on a DG with edges annotated prior to scheduling Uses a simple copy cost scheme for DG edge annotation Additionally incorporates cluster utilization counters

Copy-cost annotation 0 a 0 1 b c 0 0 d 1 e f g 0 h 1 1 0 i Details Qualifies adjacent nodes in terms of register copies Annotation only, no cluster information generated Takes crosspath access possibilities into account

Performance factors Optimization objectives Fast schedule generation Minimal initiation interval through iterative scheme Reduction of crosspath stalls through explicit rescheduling Minimization of intercluster copies through DG labeling Even cluster balance through utilization counters

Performance evaluation: UAS as baseline Simple/UAS Extended/UAS 60.00 Speedup in % 40.00 20.00 0.00-20.00-40.00 Figure: Cycle speedup (%) comparison to UAS

Performance evaluation: optimal ILP as baseline 70.00 Simple/ILP Extended/ILP Runtime optimality gap % 60.00 50.00 40.00 30.00 20.00 10.00 0.00 Figure: Optimality gap (%) to ILP

Performance evaluation: initiation intervals 30 Simple/UAS Extended/UAS 25 Initiation interval 20 15 10 5 0 Figure: Absolute initiation interval values

Summary Conclusions Extended clustering heuristic generally more potent Significant speedup compared to UAS (avg. 24.8%) Partially significant gap to ILP (avg. 15.8%) Nearly even cluster load distribution Shortcomings, current research Backend modulo scheduling support currently very basic Rudimentary loop analysis, restricted applicability Clustering still suboptimal in terms of register copies More sophisticated clustering algorithms in development Fair, undistorted comparison to alternative implementations

Thank You Thank you for being my audience!