Integrated Modulo Scheduling and Cluster Assignment for TMS320C64x+ Architecture

1 Integrated Modulo Scheduling and Cluster Assignment for TMS320C64x+ Architecture [1]

Nikolai Kim, Andreas Krall
{kim,andi}@complang.tuwien.ac.at
Institute of Computer Languages, University of Technology Vienna
ODES-11: Optimizations for DSP and Embedded Systems

[1] This work is supported by the Austrian Science Fund (FWF) under contract P21842, Epicopt: Optimal Code Generation for Explicitly Parallel Processors.

2 Outline

Implementation
- Swing modulo scheduling extension/adaptation
- Two different cluster assignment heuristics
- Implemented within LLVM 2.9
- Targeting TI's TMS320C64X DSP

Evaluation
- Taking UAS, ILP as baseline
- Based on a cycle accurate simulator
- MiBench, mediabench, DSPStone, BenchmarkGames, SingleUnit tests
- 35 kernels in total, 14 most representative presented

3 Target architecture

[Figure: block diagram of the two clusters. Cluster A: register file A, functional units LA, SA, MA, DA, crosspath A. Cluster B: register file B, functional units LB, SB, MB, DB, crosspath B.]

Texas Instruments TMS320C64X
- Clustered VLIW architecture, 2 clusters
- 4 functional units, 32 GP registers per cluster
- 3 predicate registers per cluster, 6 cycles branch latency
- DSP, SIMD subset, predication, software pipelining buffer

4 Intercluster communication

   cluster B              cluster A
a) B0 = ...               A0 = ...
   B1 = COPY A0
   B2 = LOAD B0 [B1]

b) B0 = ...               A0 = ...
   B1 = ADD B0, A0

Data transfer
- a) explicit, via inserted COPY instructions
- b) implicit, via intercluster crosspaths, 1 cycle delay (crosspath stall) for uses placed directly after definitions
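The crosspath stall condition on this slide can be illustrated with a small check over a schedule. This is only a sketch: the `Op` record and the `crosspath_stalls` function are hypothetical names, and it assumes every producer has single-cycle latency, which the slide does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str          # destination register, e.g. "A0"
    cluster: str       # "A" or "B"
    cycle: int         # issue cycle
    uses: tuple = ()   # registers read by this operation


def crosspath_stalls(ops):
    """Return (definition, use) pairs where a value defined on one cluster
    is read over the crosspath by the other cluster in the very next cycle."""
    defs = {op.name: op for op in ops}
    stalls = []
    for op in ops:
        for reg in op.uses:
            d = defs.get(reg)
            if d and d.cluster != op.cluster and op.cycle == d.cycle + 1:
                stalls.append((d.name, op.name))
    return stalls


# Case b) from the slide: ADD on cluster B reads A0 over the crosspath.
ops = [
    Op("B0", "B", 0),
    Op("A0", "A", 0),
    Op("B1", "B", 1, uses=("B0", "A0")),
]
stalls = crosspath_stalls(ops)
```

Moving the ADD one cycle later removes the pair from the result, which is the kind of rescheduling the later slides refer to.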

5 If-Conversion

[Figure: control-flow graphs a), b), c) of an example loop with blocks entry, for.cond, land.rhs, land.end, for.body, for.end and BB#6, annotated with predicates p, !p, q, !q.]

Basics
- As preprocessing to modulo scheduling
- Requires hardware support, removes conditional branches
- Reduces basic block count, increases ILP
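The effect of if-conversion can be sketched on a toy example: a two-way branch becomes one straight-line block whose operations are guarded by a predicate and its negation. The functions `branchy` and `if_converted` are hypothetical illustrations, not code from the presented backend, and the guard test in the loop merely models hardware predication.

```python
def branchy(a, b):
    # Original form: a conditional branch selects one of two blocks.
    if a < b:
        x = a + 1
    else:
        x = b - 1
    return x


def if_converted(a, b):
    # If-converted form: a single block of predicated operations.
    p = a < b                      # predicate definition
    x = None
    program = [
        (p, lambda: a + 1),        # [p]  x = a + 1
        (not p, lambda: b - 1),    # [!p] x = b - 1
    ]
    for guard, op in program:
        if guard:                  # models predication, not a branch
            x = op()
    return x
```

Both forms compute the same value for every input; the second one contains no control flow that a modulo scheduler would have to handle.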

6 Modulo scheduling (1)

General
- Iterative II scheme, swing scheduling adaptation
- Extended to address target specific factors such as functional unit support and crosspath stalls
- Employs modulo variable expansion based on lifetime analysis
- Utilizes modulo resource table, captures crosspath occupation
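A modulo resource table and the unroll factor for modulo variable expansion can be sketched as follows. The class and function names are hypothetical, and treating the crosspaths as additional resources named "XA" and "XB" is an assumption about how crosspath occupation might be captured.

```python
import math


class ModuloResourceTable:
    """Resource usage folded into II rows: cycle c occupies row c % II."""

    def __init__(self, ii, units):
        self.ii = ii
        self.slots = {u: [None] * ii for u in units}

    def is_free(self, unit, cycle):
        return self.slots[unit][cycle % self.ii] is None

    def reserve(self, unit, cycle, node):
        if not self.is_free(unit, cycle):
            return False
        self.slots[unit][cycle % self.ii] = node
        return True


def mve_unroll_factor(lifetimes, ii):
    """Kernel copies needed so that no value is overwritten while still live:
    the largest ceil(lifetime / II) over all values, at least 1."""
    return max(1, max((math.ceil(l / ii) for l in lifetimes), default=1))


# Functional units of cluster A plus its crosspath as a resource (assumed naming).
mrt = ModuloResourceTable(2, ["LA", "SA", "MA", "DA", "XA"])
first = mrt.reserve("LA", 0, "n0")     # row 0 of LA is taken
clash = mrt.reserve("LA", 2, "n1")     # cycle 2 maps to row 0 again
second = mrt.reserve("LA", 3, "n1")    # row 1 is still free
```

With II = 2, a value that lives for 5 cycles overlaps three iterations, so `mve_unroll_factor([3, 5], 2)` yields 3.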

7 Modulo scheduling (2)

[Figure: flowchart with the steps Schedule nodes, Assign clusters, success? (yes/no), Reschedule nodes, Increase II, Emit schedule.]

Specific
- Two-pass setup:
  - Iteratively generate a preliminary schedule in combination with provided clustering heuristics
  - Distribute intercluster copies, avoid crosspath stalls
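One possible reading of the two-pass flow is the driver loop below. The ordering of the steps is an interpretation of the flowchart labels, and `modulo_schedule` as well as the three callbacks are hypothetical stand-ins, with toy implementations supplied only to make the sketch runnable.

```python
def modulo_schedule(nodes, min_ii, max_ii, schedule, assign_clusters, reschedule):
    """Try increasing initiation intervals until both passes succeed."""
    for ii in range(min_ii, max_ii + 1):
        prelim = schedule(nodes, ii)             # pass 1: preliminary schedule
        if prelim is None:
            continue                             # no schedule: increase II
        clustering = assign_clusters(prelim)     # clustering heuristic
        if clustering is None:
            continue
        final = reschedule(prelim, clustering, ii)  # pass 2: place copies
        if final is not None:
            return ii, final
    return None


def toy_schedule(nodes, ii):
    # Toy stand-in: a single functional unit, so II must cover all nodes.
    if ii < len(nodes):
        return None
    return {n: i for i, n in enumerate(nodes)}


def toy_assign(prelim):
    # Toy stand-in: alternate clusters by issue cycle.
    return {n: "AB"[c % 2] for n, c in prelim.items()}


def toy_reschedule(prelim, clustering, ii):
    # Toy stand-in: keep the cycles and attach the chosen cluster.
    return {n: (c, clustering[n]) for n, c in prelim.items()}


result = modulo_schedule(["a", "b", "c"], 1, 8,
                         toy_schedule, toy_assign, toy_reschedule)
```

Passing the scheduling and clustering steps as callbacks mirrors the slide's point that the clustering heuristic is provided to the scheduler rather than fixed.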

8 Cluster assignment

Simple naive heuristic
- Non-integrated, loosely coupled with scheduling routine
- DG depth ordering, uniform handling of all dependences
- Processes the DG at once in a top-down manner
- Decides upon already assigned predecessor nodes only

Extended variant
- Runs inline with the modulo scheduler
- Operates on a DG with edges annotated prior to scheduling
- Uses a simple copy cost scheme for DG edge annotation
- Additionally incorporates cluster utilization counters
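A top-down assignment that looks only at already assigned predecessors might look like the sketch below. The majority vote, the tie-breaking rules and the `balance` flag (standing in for the utilization counters of the extended variant) are assumptions; the slides do not give the exact decision rule.

```python
from collections import Counter


def assign_clusters(order, preds, balance=True):
    """order: nodes in top-down (depth) order.
    preds: node -> list of predecessor nodes in the dependence graph."""
    cluster = {}
    load = Counter({"A": 0, "B": 0})    # utilization counters
    for n in order:
        votes = Counter(cluster[p] for p in preds.get(n, ()) if p in cluster)
        a, b = votes["A"], votes["B"]
        if a != b:
            choice = "A" if a > b else "B"                    # follow predecessors
        elif balance:
            choice = "A" if load["A"] <= load["B"] else "B"   # even out the load
        else:
            choice = "A"                                      # naive default
        cluster[n] = choice
        load[choice] += 1
    return cluster


order = ["a", "b", "c", "d"]
preds = {"c": ["a"], "d": ["b", "c"]}
balanced = assign_clusters(order, preds)
naive = assign_clusters(order, preds, balance=False)
```

Without the counters every node in this example ends up on cluster A, while the balanced run splits the four nodes evenly.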

9 Copy-cost annotation

[Figure: example dependence graph with nodes a to i and edges annotated with copy costs 0 or 1.]

Details
- Qualifies adjacent nodes in terms of register copies
- Annotation only, no cluster information generated
- Takes crosspath access possibilities into account
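The edge annotation can be pictured as a pure labeling pass over the dependence graph. The rule used here is an assumption made for illustration: an edge costs 0 if the consumer could read its operand over a crosspath and 1 if an explicit COPY would be needed when the two nodes land on different clusters. The opcode set `CROSSPATH_CAPABLE` is likewise hypothetical and not taken from the TMS320C64x documentation.

```python
# Assumed, illustrative set of opcodes that can read one operand via a crosspath.
CROSSPATH_CAPABLE = {"ADD", "SUB", "MPY"}


def annotate_copy_costs(edges, opcode):
    """edges: iterable of (producer, consumer) pairs.
    opcode: node -> opcode string.
    Returns {(producer, consumer): 0 or 1}; no cluster is chosen here."""
    return {
        (p, c): 0 if opcode[c] in CROSSPATH_CAPABLE else 1
        for p, c in edges
    }


opcode = {"a": "ADD", "b": "LOAD", "c": "ADD"}
costs = annotate_copy_costs([("a", "b"), ("a", "c")], opcode)
```

A clustering heuristic can then prefer to keep the endpoints of cost-1 edges on the same cluster, since only those edges would force an inserted copy.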

10 Performance factors

Optimization objectives
- Fast schedule generation
- Minimal initiation interval through iterative scheme
- Reduction of crosspath stalls through explicit rescheduling
- Minimization of intercluster copies through DG labeling
- Even cluster balance through utilization counters

11 Performance evaluation: UAS as baseline

[Figure: Cycle speedup (%) comparison to UAS. Series: Simple/UAS, Extended/UAS. Axis: speedup in %.]

12 Performance evaluation: optimal ILP as baseline

[Figure: Optimality gap (%) to ILP. Series: Simple/ILP, Extended/ILP. Axis: runtime optimality gap in %.]

13 Performance evaluation: initiation intervals

[Figure: Absolute initiation interval values. Series: Simple/UAS, Extended/UAS. Axis: initiation interval.]

14 Summary

Conclusions
- Extended clustering heuristic generally more potent
- Significant speedup compared to UAS (avg. 24.8%)
- Partially significant gap to ILP (avg. 15.8%)
- Nearly even cluster load distribution

Shortcomings, current research
- Backend modulo scheduling support currently very basic
- Rudimentary loop analysis, restricted applicability
- Clustering still suboptimal in terms of register copies
- More sophisticated clustering algorithms in development
- Fair, undistorted comparison to alternative implementations

15 Thank You

Thank you for being my audience!


IA-64 Compiler Technology

IA-64 Compiler Technology IA-64 Compiler Technology David Sehr, Jay Bharadwaj, Jim Pierce, Priti Shrivastav (speaker), Carole Dulong Microcomputer Software Lab Page-1 Introduction IA-32 compiler optimizations Profile Guidance (PGOPTI)

More information

Media Instructions, Coprocessors, and Hardware Accelerators. Overview

Media Instructions, Coprocessors, and Hardware Accelerators. Overview Media Instructions, Coprocessors, and Hardware Accelerators Steven P. Smith SoC Design EE382V Fall 2009 EE382 System-on-Chip Design Coprocessors, etc. SPS-1 University of Texas at Austin Overview SoCs

More information

CPU Structure and Function. Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition

CPU Structure and Function. Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition CPU Structure and Function Chapter 12, William Stallings Computer Organization and Architecture 7 th Edition CPU must: CPU Function Fetch instructions Interpret/decode instructions Fetch data Process data

More information

Mapping Vector Codes to a Stream Processor (Imagine)

Mapping Vector Codes to a Stream Processor (Imagine) Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream

More information

SSA Construction. Daniel Grund & Sebastian Hack. CC Winter Term 09/10. Saarland University

SSA Construction. Daniel Grund & Sebastian Hack. CC Winter Term 09/10. Saarland University SSA Construction Daniel Grund & Sebastian Hack Saarland University CC Winter Term 09/10 Outline Overview Intermediate Representations Why? How? IR Concepts Static Single Assignment Form Introduction Theory

More information

Lect. 2: Types of Parallelism

Lect. 2: Types of Parallelism Lect. 2: Types of Parallelism Parallelism in Hardware (Uniprocessor) Parallelism in a Uniprocessor Pipelining Superscalar, VLIW etc. SIMD instructions, Vector processors, GPUs Multiprocessor Symmetric

More information

Weaving Relations for Cache Performance

Weaving Relations for Cache Performance Weaving Relations for Cache Performance Anastassia Ailamaki Carnegie Mellon David DeWitt, Mark Hill, and Marios Skounakis University of Wisconsin-Madison Memory Hierarchies PROCESSOR EXECUTION PIPELINE

More information

CS 152 Computer Architecture and Engineering. Lecture 16 - VLIW Machines and Statically Scheduled ILP

CS 152 Computer Architecture and Engineering. Lecture 16 - VLIW Machines and Statically Scheduled ILP CS 152 Computer Architecture and Engineering Lecture 16 - VLIW Machines and Statically Scheduled ILP Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Predicting Program Phases and Defending against Side-Channel Attacks using Hardware Performance Counters

Predicting Program Phases and Defending against Side-Channel Attacks using Hardware Performance Counters Predicting Program Phases and Defending against Side-Channel Attacks using Hardware Performance Counters Junaid Nomani and Jakub Szefer Computer Architecture and Security Laboratory Yale University junaid.nomani@yale.edu

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

CS 343: Artificial Intelligence

CS 343: Artificial Intelligence CS 343: Artificial Intelligence Constraint Satisfaction Problems Prof. Scott Niekum The University of Texas at Austin [These slides are based on those of Dan Klein and Pieter Abbeel for CS188 Intro to

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

Leveraging Predicated Execution for Multimedia Processing

Leveraging Predicated Execution for Multimedia Processing Leveraging Predicated Execution for Multimedia Processing Dietmar Ebner Florian Brandner Andreas Krall Institut für Computersprachen Technische Universität Wien Argentinierstr. 8, A-1040 Wien, Austria

More information

Applications to MPSoCs

Applications to MPSoCs 3 rd Workshop on Mapping of Applications to MPSoCs A Design Exploration Framework for Mapping and Scheduling onto Heterogeneous MPSoCs Christian Pilato, Fabrizio Ferrandi, Donatella Sciuto Dipartimento

More information

CS 341l Fall 2008 Test #2

CS 341l Fall 2008 Test #2 CS 341l all 2008 Test #2 Name: Key CS 341l, test #2. 100 points total, number of points each question is worth is indicated in parentheses. Answer all questions. Be as concise as possible while still answering

More information

ECE 505 Computer Architecture

ECE 505 Computer Architecture ECE 505 Computer Architecture Pipelining 2 Berk Sunar and Thomas Eisenbarth Review 5 stages of RISC IF ID EX MEM WB Ideal speedup of pipelining = Pipeline depth (N) Practically Implementation problems

More information