High performance, power-efficient DSPs based on the TI C64x

1 High performance, power-efficient DSPs based on the TI C64x
Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner
Rice University

2 Recent (2003) Research Results
Stream-based programmable processors meet real-time requirements for a set of base-station PHY-layer algorithms
Mapped algorithms onto stream processors and studied tradeoffs between packing, ALU utilization and memory operations
Improved power efficiency in stream processors by adapting compute resources to workload variations and by varying voltage and clock frequency to real-time requirements
Design exploration between #ALUs and clock frequency to minimize power consumption of the processor
S. Rajagopal, S. Rixner, J. R. Cavallaro, "A programmable baseband processor design for software defined radios", 2002. Paper draft sent previously; the rest of the contributions are in the thesis.

3 Recent (2003) Research Results
Peak computation rate available: ~200 billion arithmetic operations per second at 1.2 GHz
Estimated peak power (0.13 micron): W at 1.2 GHz
Power:
  W for 32 users, constraint length 9 decoding, at 128 Kbps/user (at 1.2 GHz, 1.4 V)
  300 mW for 4 users, constraint length 7 decoding, at 128 Kbps/user (at 433 MHz, V)

4 Motivation
This research could be applied to DSP design!
Designing DSPs
  High performance
  Power-efficient
  Adapt computing resources with workload changes
Such that
  Gradual changes in C64x architecture
  Gradual changes in compilers and tools

5 Levels of changes
To allow changes in TI DSPs and tools gradually
Changes classified into 3 levels
  Level 1: simple, minimum changes (next silicon)
  Level 2: intermediate, handover changes (1-2 years)
  Level 3: actual proposed changes (2-3 years)
We want to go to Level 3, but in steps!

6 Level 1 changes: Power-efficiency

7 Level 1 changes: Power saving features
(1) Use dynamic voltage and frequency scaling
  When the workload changes, such as users, data rates, modulation, coding rates
  Already in industry: Crusoe, XScale
(2) Use voltage gating to turn off unused resources
  When units are idle for a sufficiently long time
  Saves static and dynamic power dissipation
  See example on next page

8 Turning off ALUs
[Figure: instruction schedule across adders and multipliers, default schedule versus schedule after exploration, with sleep instruction]
2 multipliers turned off to save power
Turned off using voltage gating to eliminate static and dynamic power dissipation

9 Level 1: Architecture tradeoffs
DVS: Advanced voltage regulation scheme
  Cannot use NMOS pass gates
  Cannot use tri-state buffers
  Use at a coarser time scale (once in a million cycles)
  cycles settling time
Voltage gating: Gating device design important
  Should be able to supply current to gated circuit
  Use at coarser time scale (once in cycles)
  1-10 cycles settling time

10 Level 1: Tools/Programming impact
Need a DSP/BIOS task running continuously that looks at workload changes and changes voltage/frequency using a look-up table in memory
Compiler should be made re-targetable
  Target a subset of ALUs and explore static performance with different adder-multiplier schedules
Voltage gating using a sleep instruction that the compiler generates for unused ALUs
  ALUs should be idle for > 100 cycles for this to occur
  Other resources can be gated off similarly to save static power dissipation
Programmer is not aware of these changes
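The look-up-table task and the idle-time rule on this slide can be sketched in C as follows. This is only an illustration under stated assumptions: the `dvfs_entry` layout, the function names, and the convention that the table is sorted from lightest to heaviest workload are hypothetical and are not taken from the slides or from DSP/BIOS. The only number taken from the slide is the 100-cycle idle threshold.

```c
#include <stddef.h>

/* Hypothetical table entry: the workload an operating point is sized for. */
typedef struct {
    unsigned users;           /* active users covered by this entry */
    unsigned constraint_len;  /* Viterbi constraint length covered */
    unsigned freq_mhz;        /* clock frequency to program */
    unsigned millivolts;      /* supply voltage to program */
} dvfs_entry;

/* Return the first entry that covers the requested workload.
 * Assumes the table is sorted from lightest to heaviest workload.
 * Returns NULL when no entry is large enough. */
const dvfs_entry *dvfs_lookup(const dvfs_entry *table, size_t n,
                              unsigned users, unsigned constraint_len)
{
    for (size_t i = 0; i < n; i++) {
        if (users <= table[i].users &&
            constraint_len <= table[i].constraint_len) {
            return &table[i];
        }
    }
    return NULL;
}

/* Slide rule: gate an ALU only when it stays idle for more than 100 cycles. */
#define GATING_MIN_IDLE_CYCLES 100u

int should_gate_alu(unsigned idle_cycles)
{
    return idle_cycles > GATING_MIN_IDLE_CYCLES;
}
```

A monitoring task would call `dvfs_lookup` whenever the number of users or the coding parameters change, then program the regulator and clock from the returned entry; how that programming is done is hardware-specific and is left out here.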

11 Level 2 changes: Performance

12 Solutions to increase DSP performance
(1) Increasing clock frequency
  C64x: ?
  Easiest solution but limited benefits
  Not good for power, given the cubic dependence on frequency
(2) Increasing ALUs
  Limited instruction-level parallelism (ILP)
  Register file area, ports explosion
  Compiler issues in extracting more ILP
(3) Multiprocessors (MIMD)
  Usually 3rd party vendors (except C40-types)

13 DSP multiprocessors
[Figure: DSPs, ASSPs, co-processors and a network interface joined by an interconnection]
Source: Texas Instruments Wireless Infrastructure Solutions Guide, Pentek, Sundance, C80

14 Multiprocessing tradeoffs
Advantages: performance, and tools don't have to change!!
Load-balancing algorithms on multiple DSPs not straightforward
  Burden pushed on to the programmer
  Not scalable with number of processors
  Difficult to adapt with workload changes
Traditional DSPs not built for multiprocessing (except C40-types)
  I/O impacts throughput, power and area
  (E)DMA use minimizes the throughput problem
  Power and area problems still remain
R. Baines, "The DSP bottleneck", IEEE Communications Magazine, May 1995, pp (outdated?)
S. Rajagopal, B. Jones and J. R. Cavallaro, "Task partitioning wireless base-station algorithms on multiple DSPs and FPGAs", ICSPAT 2001

15 Options
Chip multiprocessors with SIMD parallelism (Level 3)
  SIMD parallelism can alleviate load balancing (shown in Level 3)
  Scalable with processors
  Automatic SIMD parallelism can be done by the compiler
  Single chip will alleviate I/O bottlenecks
  Tools will need changes
To get to Level 3, an intermediate (Level 2) investigation
  Level 2: do SPMD on a DSP multiprocessor

16 Texas Instruments C64x DSP
[Figure: C64x datapath]
Source: Texas Instruments C64x DSP Generation (sprt236a.pdf)

17 A possible, plausible solution
Exploit data parallelism (DP)
  Available in many wireless algorithms
  This is what ASICs do!
int i, a[n], b[n], sum[n];        // 32 bits
short int c[n], d[n], diff[n];    // 16 bits, packed
for (i = 0; i < 1024; i++) {
    sum[i] = a[i] + b[i];
    diff[i] = c[i] - d[i];
}
[Figure annotations on the loop: Subword, ILP, DP]
Data parallelism is defined as the parallelism available after subword packing and loop unrolling
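To make the "16 bits packed" comment on this slide concrete, here is a minimal portable C sketch of subword packing: two 16-bit values share one 32-bit word, and one call subtracts both lanes. The helper names (`pack2`, `sub2_packed`, `lane_hi`, `lane_lo`, `diff_packed`) are hypothetical. TI's C6000 compiler documents packed-halfword intrinsics such as `_sub2` for this purpose; the code below only emulates that behaviour in standard C and is not the slide authors' implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word: hi in bits 31..16, lo in bits 15..0. */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Subtract the two 16-bit lanes of y from those of x independently.
 * Each lane wraps modulo 2^16, and no borrow crosses from the low lane
 * into the high lane. */
uint32_t sub2_packed(uint32_t x, uint32_t y)
{
    uint32_t lo = (x - y) & 0x0000FFFFu;
    uint32_t hi = ((x >> 16) - (y >> 16)) << 16;
    return hi | lo;
}

/* Extract one lane as a signed 16-bit value (two's complement assumed). */
int16_t lane_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
int16_t lane_lo(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }

/* diff = c - d over arrays of packed words: each iteration handles two
 * 16-bit elements, halving the trip count of the scalar loop. */
void diff_packed(const uint32_t *c, const uint32_t *d, uint32_t *diff,
                 size_t words)
{
    for (size_t i = 0; i < words; i++) {
        diff[i] = sub2_packed(c[i], d[i]);
    }
}
```

A plain 32-bit subtraction of the packed words would let a borrow from the low lane corrupt the high lane, which is why the lanes are handled separately here.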

18 SPMD multiprocessor DSP
[Figure: four C64x datapaths]
Same program running on all DSPs

19 Level 2: Architecture tradeoffs
C64x's: interconnection could be similar to the ones used by 3rd party vendors
  FPGA-based C40 comm ports (Sundance): ~400 MBps
  VIM modules (Pentek): ~300 MBps
  Others developed by TI, BlueWave Systems

20 Level 2: Tools/Programming impact
All DSPs run the same program
  Programmer thinks of only 1 DSP program
  Burden now on tools
Can use C8x compiler and tool support expertise
  Integration of C8x and C6x compilers
  Data parallelism used for SPMD
DMA data movement can be left to the programmer at this stage, to keep data fed to all the processors
MPI (message passing) can alternatively be applied
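As an illustration of "all DSPs run the same program", the sketch below shows one common SPMD pattern in C: every processor executes identical code and uses its own index to pick a contiguous slice of the loop. The names `slice`, `spmd_slice` and `spmd_sum`, and the idea that each DSP knows a `dsp_id` and the total `num_dsps`, are assumptions made for this example; the slides do not specify how the work is divided.

```c
#include <stddef.h>

/* Half-open index range [begin, end) owned by one DSP. */
typedef struct {
    size_t begin;
    size_t end;
} slice;

/* Split n loop iterations as evenly as possible across num_dsps processors.
 * The first (n % num_dsps) processors get one extra iteration.
 * num_dsps must be non-zero and dsp_id must be less than num_dsps. */
slice spmd_slice(size_t n, unsigned dsp_id, unsigned num_dsps)
{
    size_t base = n / num_dsps;
    size_t extra = n % num_dsps;
    slice s;
    s.begin = dsp_id * base + (dsp_id < extra ? dsp_id : extra);
    s.end = s.begin + base + (dsp_id < extra ? 1 : 0);
    return s;
}

/* The same function runs on every DSP; only dsp_id differs. */
void spmd_sum(const int *a, const int *b, int *sum, size_t n,
              unsigned dsp_id, unsigned num_dsps)
{
    slice s = spmd_slice(n, dsp_id, num_dsps);
    for (size_t i = s.begin; i < s.end; i++) {
        sum[i] = a[i] + b[i];
    }
}
```

Moving each slice of `a` and `b` to the DSP that owns it is the DMA step that the slide leaves to the programmer at this level.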

21 Level 3 changes: Performance and Power

22 A chip multiprocessor (CMP) DSP
[Figure: a C64x DSP core (1 cluster) with internal L2 memory and an instruction decoder, exploiting ILP and subword parallelism, next to a C64x-based CMP DSP core with internal L2 memory and an instruction decoder, exploiting ILP, subword parallelism and DP]
Adapt #clusters to DP
Identical clusters, same operations
Power down unused ALUs, clusters

23 A 4-cluster CMP using TI C64x
[Figure: four C64x datapaths]
Significant savings possible in area and power
Increasing benefits with larger #clusters (8, 16, 32 clusters)

24 Alternate view of the CMP DSP
[Figure: DMA controller; L2 internal memory (Bank 1, Bank 2, ..., Bank C); prefetch buffers; clusters of C64x (C64x core 0, C64x core 1, ..., C64x core C); instruction decoder; inter-cluster communication network]

25 Adapting #clusters to Data Parallelism
[Figure: adaptive multiplexer network feeding clusters (C), shown for four cases: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, all clusters off]
Turned off using voltage gating to eliminate static and dynamic power dissipation

26 Level 3: Architecture tradeoffs
Single processor -> SPMD -> SIMD
Single chip: max die size limited to 128 clusters with 8 functional units/cluster at 90 nm technology [estimate]
Number of memory banks = #clusters
Instruction addition to turn off clusters when data parallelism is insufficient

27 Level 3: Tools/Programming impact
Level 2 compiler provides support for data parallelism
  Adapt #clusters to data parallelism for power savings
  Check the loop count index after loop unrolling
  If less than #clusters, provide an instruction to turn off clusters
Design of parallel algorithms and mapping important
Programmer still writes regular C code
  Transparent to the programmer
  Burden on the compiler
Automatic DMA data movement to keep data feeding into the arithmetic units
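The compiler check described on this slide (compare the loop count after unrolling with the number of clusters, and turn off the surplus) can be sketched in C as below. The function names and the bit-mask encoding of which clusters stay powered are hypothetical, and the sketch assumes at most 32 clusters so that the mask fits in one 32-bit word. How the resulting mask would reach the hardware is not covered.

```c
#include <stdint.h>

/* Number of clusters worth keeping on: never more than the data parallelism
 * left after subword packing and loop unrolling, never more than exist. */
unsigned active_clusters(unsigned data_parallelism, unsigned total_clusters)
{
    return data_parallelism < total_clusters ? data_parallelism
                                             : total_clusters;
}

/* Enable mask with one bit per cluster (bit i set = cluster i stays on).
 * Assumes total_clusters <= 32. Clusters whose bit is clear would be the
 * ones the compiler-generated instruction turns off. */
uint32_t cluster_enable_mask(unsigned data_parallelism,
                             unsigned total_clusters)
{
    unsigned n = active_clusters(data_parallelism, total_clusters);
    if (n >= 32) {
        return 0xFFFFFFFFu;   /* avoid an undefined 32-bit shift */
    }
    return (1u << n) - 1u;
}
```

For example, a loop with data parallelism 8 on a 32-cluster processor keeps clusters 0 to 7 on and leaves the remaining 24 as candidates for voltage gating.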

28 Verification of potential benefits
Level 3 potential verification using the Imagine stream processor simulator
Replacing the C64x DSP with a cluster containing 3 adders, 3 multipliers and a distributed register file

29 Need for adapting to flexibility
Base-stations are designed for worst-case workload
Base-stations rarely operate at worst-case workload
Adapting the resources to the workload can save power!

30 Example of flexibility needed in workloads
[Figure: operation count (in GOPs) versus (users, constraint length) for workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9); series: 2G base-station (16 Kbps/user) and 3G base-station (128 Kbps/user)]
Note: GOPs refer only to arithmetic computations
Billions of computations per second needed
Workload variation from ~1 GOPs for 4 users, constraint length 7 Viterbi, to ~23 GOPs for 32 users, constraint length 9 Viterbi

31 Flexibility affects Data Parallelism
U - users, K - constraint length, N - spreading gain, R - decoding rate
[Table: data parallelism per workload (U,K) for estimation f(U,N), detection f(U,N) and decoding f(U,K,R); workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9)]
Data parallelism is defined as the parallelism available after subword packing and loop unrolling

32 Cluster utilization variation with workload
[Figure: cluster utilization versus cluster index, one curve per workload: (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9)]
Cluster utilization variation on a 32-cluster processor
(32,9) = 32 users, constraint length 9 Viterbi

33 Frequency variation with workload
[Figure: real-time frequency (in MHz) for workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9), broken down into Busy, L2 Stall and Mem Stall]

34 Operation
DVS when the system changes significantly
  Users, data rates
  Coarse time scale (every few seconds)
Turn off clusters when parallelism changes significantly
  Parallelism can change within the same algorithm
  E.g., spreading gain changes during matched filtering
  Finer time scales (100s of microseconds)
Turn off ALUs when algorithms change significantly
  Estimation, detection, decoding
  Finer time scales (100s of microseconds)

35 Power savings: Voltage Gating & Scaling
[Table: for each workload (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9): frequency needed and used (MHz), voltage (V), power savings (W) from clocking, memory and clusters, new and base power (W), and savings (%)]
Estimated cluster power consumption: 78 %
Estimated L2 memory power consumption: 11.5 %
Estimated instruction decoder power consumption: 10.5 %
Estimated chip area (0.13 micron process): 45.7 mm^2
Power can change from W to 300 mW depending on workload changes

36 How to decide ALUs vs. clock frequency
No independent variables
  Clusters, ALUs, frequency, voltage
  Trade-offs exist
$P \propto C V^2 f$, $V \propto f$, so $P \propto f^3$
How to find the right combination for lowest power!
[Figure: three design points, each cluster drawn with adders (a) and multipliers (m): (A) 1 cluster at 100 GHz, (B) 100 clusters at 10 MHz, (C) c clusters at f MHz]

37 Setting clusters, adders, multipliers
If there is sufficient DP, frequency decreases linearly with clusters
  Set clusters depending on DP and execution time estimate
To find adders and multipliers,
  Let the compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find the execution time
Put all numbers in the previous equation
  Compare the increase in capacitance due to added ALUs and clusters with the benefits in execution time
Choose the solution that minimizes the power
Details available in Sridhar's thesis
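The selection procedure on this slide can be sketched in C as a search over candidate configurations. The sketch assumes the cubic relation P proportional to C f^3 from the preceding slide, and assumes that each candidate already carries a relative capacitance figure and a cycle count obtained from the compiler schedule. The struct layout, the function names and the use of relative (unitless) capacitance are hypothetical; the actual exploration is described in the thesis the slide refers to.

```c
#include <stddef.h>

/* One candidate configuration. Capacitance is a relative figure supplied by
 * the caller; cycles is the execution time reported by the compiler schedule
 * for this mix of clusters, adders and multipliers. */
typedef struct {
    unsigned clusters;
    unsigned adders;
    unsigned multipliers;
    double capacitance;
    double cycles;
} design_point;

/* Relative power of one candidate. The clock must be at least
 * cycles / deadline to finish in real time, and P is taken as C * f^3. */
double relative_power(const design_point *p, double deadline_s)
{
    double f = p->cycles / deadline_s;
    return p->capacitance * f * f * f;
}

/* Index of the candidate with the lowest relative power, or -1 if n == 0. */
int best_design(const design_point *pts, size_t n, double deadline_s)
{
    int best = -1;
    double best_power = 0.0;
    for (size_t i = 0; i < n; i++) {
        double p = relative_power(&pts[i], deadline_s);
        if (best < 0 || p < best_power) {
            best = (int)i;
            best_power = p;
        }
    }
    return best;
}
```

Because frequency enters cubed, a candidate that adds capacitance still wins whenever it cuts the cycle count enough; the comparison stops favouring more ALUs or clusters once the schedule no longer gets shorter.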

38 Conclusions
We propose a step-by-step methodology to design high-performance, power-efficient DSPs based on the TI C64x architecture
Initial results show benefits in power/performance greater than an order of magnitude over a conventional C64x
We tailor the design to ensure maximum compatibility with TI's C6x architecture and tools
We are interested in exploring opportunities in TI for designing and actual fabrication of a chip and associated tool development
We are interested in feedback
  Limitations that we have not accounted for
  Unreasonable assumptions that we have made
Recommended reading:
  S. Rixner et al., "A register organization for media processing", HPCA 2000
  B. Khailany et al., "Exploring the VLSI scalability of stream processors", HPCA 2003
  U. J. Kapasi et al., "Programmable Stream Processors", IEEE Computer, August 2003


Embedded processors. Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto. Embedded processors Timo Töyry Department of Computer Science and Engineering Aalto University, School of Science timo.toyry(at)aalto.fi Comparing processors Evaluating processors Taxonomy of processors

More information

Ten Reasons to Optimize a Processor

Ten Reasons to Optimize a Processor By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor

More information

EE482C, L1, Apr 4, 2002 Copyright (C) by William J. Dally, All Rights Reserved. Today s Class Meeting. EE482S Lecture 1 Stream Processor Architecture

EE482C, L1, Apr 4, 2002 Copyright (C) by William J. Dally, All Rights Reserved. Today s Class Meeting. EE482S Lecture 1 Stream Processor Architecture 1 Today s Class Meeting EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J Dally Computer Systems Laboratory Stanford University billd@cslstanfordedu What is EE482C? Material covered

More information

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution

More information

Hardware and Software Optimisation. Tom Spink

Hardware and Software Optimisation. Tom Spink Hardware and Software Optimisation Tom Spink Optimisation Modifying some aspect of a system to make it run more efficiently, or utilise less resources. Optimising hardware: Making it use less energy, or

More information

EECS4201 Computer Architecture

EECS4201 Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be

More information

Power Reduction Techniques in the Memory System. Typical Memory Hierarchy

Power Reduction Techniques in the Memory System. Typical Memory Hierarchy Power Reduction Techniques in the Memory System Low Power Design for SoCs ASIC Tutorial Memories.1 Typical Memory Hierarchy On-Chip Components Control edram Datapath RegFile ITLB DTLB Instr Data Cache

More information

NoC Round Table / ESA Sep Asynchronous Three Dimensional Networks on. on Chip. Abbas Sheibanyrad

NoC Round Table / ESA Sep Asynchronous Three Dimensional Networks on. on Chip. Abbas Sheibanyrad NoC Round Table / ESA Sep. 2009 Asynchronous Three Dimensional Networks on on Chip Frédéric ric PétrotP Outline Three Dimensional Integration Clock Distribution and GALS Paradigm Contribution of the Third

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

Classification of Semiconductor LSI

Classification of Semiconductor LSI Classification of Semiconductor LSI 1. Logic LSI: ASIC: Application Specific LSI (you have to develop. HIGH COST!) For only mass production. ASSP: Application Specific Standard Product (you can buy. Low

More information

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors Brent Bohnenstiehl and Bevan Baas Department of Electrical and Computer Engineering University of California, Davis {bvbohnen,

More information

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &

More information

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Computer Organization and Design, 5th Edition: The Hardware/Software Interface Computer Organization and Design, 5th Edition: The Hardware/Software Interface 1 Computer Abstractions and Technology 1.1 Introduction 1.2 Eight Great Ideas in Computer Architecture 1.3 Below Your Program

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

Ascenium: A Continuously Reconfigurable Architecture. Robert Mykland Founder/CTO August, 2005

Ascenium: A Continuously Reconfigurable Architecture. Robert Mykland Founder/CTO August, 2005 Ascenium: A Continuously Reconfigurable Architecture Robert Mykland Founder/CTO robert@ascenium.com August, 2005 Ascenium: A Continuously Reconfigurable Processor Continuously reconfigurable approach provides:

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY

More information

MPSOC 2011 BEAUNE, FRANCE

MPSOC 2011 BEAUNE, FRANCE MPSOC 2011 BEAUNE, FRANCE BOADRES: A SCALABLE BASEBAND PROCESSOR TEMPLATE FOR Gbps RADIOS VICE PRESIDENT, CHAIRMAN OF THE TECHNOLOGY OFFICE PROFESSOR AT THE KATHOLIEKE UNIVERSITEIT LEUVEN STATUS SDR BASEBAND

More information

EECS150 - Digital Design Lecture 09 - Parallelism

EECS150 - Digital Design Lecture 09 - Parallelism EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization

More information

PicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor

PicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor PicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor Taeho Kgil, Shaun D Souza, Ali Saidi, Nathan Binkert, Ronald Dreslinski, Steve Reinhardt, Krisztian Flautner,

More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016 NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering

More information

Interface Design Techniques for Single-Chip Systems

Interface Design Techniques for Single-Chip Systems Interface Design Techniques for Single-Chip Systems Robert H. Bell, Jr. Lizy Kurian John Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 78712-24 {belljr,

More information

Verilog for High Performance

Verilog for High Performance Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes

More information

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design Lecture Objectives Background Need for Accelerator Accelerators and different type of parallelizm

More information

ECE 637 Integrated VLSI Circuits. Introduction. Introduction EE141

ECE 637 Integrated VLSI Circuits. Introduction. Introduction EE141 ECE 637 Integrated VLSI Circuits Introduction EE141 1 Introduction Course Details Instructor Mohab Anis; manis@vlsi.uwaterloo.ca Text Digital Integrated Circuits, Jan Rabaey, Prentice Hall, 2 nd edition

More information

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology

ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology 1 ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology Mikkel B. Stensgaard and Jens Sparsø Technical University of Denmark Technical University of Denmark Outline 2 Motivation ReNoC Basic

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

EE482S Lecture 1 Stream Processor Architecture

EE482S Lecture 1 Stream Processor Architecture EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu 1 Today s Class Meeting What is EE482C? Material covered

More information

An FPGA Architecture Supporting Dynamically-Controlled Power Gating

An FPGA Architecture Supporting Dynamically-Controlled Power Gating An FPGA Architecture Supporting Dynamically-Controlled Power Gating Altera Corporation March 16 th, 2012 Assem Bsoul and Steve Wilton {absoul, stevew}@ece.ubc.ca System-on-Chip Research Group Department

More information

A PROGRAMMABLE BASEBAND PLATFORM FOR SOFTWARE-DEFINED RADIO

A PROGRAMMABLE BASEBAND PLATFORM FOR SOFTWARE-DEFINED RADIO A PROGRAMMABLE BASEBAND PLATFORM FOR SOFTWARE-DEFINED RADIO Hans-Martin Bluethgen, Cyprian Grassmann, Wolfgang Raab, Ulrich Ramacher, Josef Hausner, Infineon Technologies AG, 81609 Munich, Germany, Hans-Martin.Bluethgen@infineon.com

More information

Low Power Design Techniques

Low Power Design Techniques Low Power Design Techniques August 2005, ver 1.0 Application Note 401 Introduction This application note provides low-power logic design techniques for Stratix II and Cyclone II devices. These devices

More information

CS310 Embedded Computer Systems. Maeng

CS310 Embedded Computer Systems. Maeng 1 INTRODUCTION (PART II) Maeng Three key embedded system technologies 2 Technology A manner of accomplishing a task, especially using technical processes, methods, or knowledge Three key technologies for

More information

DSP Co-Processing in FPGAs: Embedding High-Performance, Low-Cost DSP Functions

DSP Co-Processing in FPGAs: Embedding High-Performance, Low-Cost DSP Functions White Paper: Spartan-3 FPGAs WP212 (v1.0) March 18, 2004 DSP Co-Processing in FPGAs: Embedding High-Performance, Low-Cost DSP Functions By: Steve Zack, Signal Processing Engineer Suhel Dhanani, Senior

More information

Trends in the Infrastructure of Computing

Trends in the Infrastructure of Computing Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions

OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions 04/15/14 1 Introduction: Low Power Technology Process Hardware Architecture Software Multi VTH Low-power circuits Parallelism

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

All MSEE students are required to take the following two core courses: Linear systems Probability and Random Processes

All MSEE students are required to take the following two core courses: Linear systems Probability and Random Processes MSEE Curriculum All MSEE students are required to take the following two core courses: 3531-571 Linear systems 3531-507 Probability and Random Processes The course requirements for students majoring in

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information