High performance, power-efficient DSPs based on the TI C64x
1 High performance, power-efficient DSPs based on the TI C64x
Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner
Rice University
2 Recent (2003) Research Results
- Stream-based programmable processors meet real-time requirements for a set of base-station PHY-layer algorithms
- Mapped algorithms onto stream processors and studied tradeoffs between packing, ALU utilization, and memory operations
- Improved power efficiency in stream processors by adapting compute resources to workload variations and by varying voltage and clock frequency to match real-time requirements
- Design exploration between #ALUs and clock frequency to minimize the power consumption of the processor
Reference: S. Rajagopal, S. Rixner, J. R. Cavallaro, "A programmable baseband processor design for software defined radios," 2002. Paper draft sent previously; the rest of the contributions are in the thesis.
3 Recent (2003) Research Results
- Peak computation rate available: ~200 billion arithmetic operations per second at 1.2 GHz
- Estimated peak power (0.13 micron): [value missing] W at 1.2 GHz
- Power:
  - [value missing] W for 32 users, constraint length 9 decoding, at 128 Kbps/user (at 1.2 GHz, 1.4 V)
  - 300 mW for 4 users, constraint length 7 decoding, at 128 Kbps/user (at 433 MHz, [value missing] V)
4 Motivation
- This research could be applied to DSP design!
- Designing DSPs that are:
  - High performance
  - Power-efficient
  - Able to adapt computing resources to workload changes
- Such that there are:
  - Gradual changes in the C64x architecture
  - Gradual changes in compilers and tools
5 Levels of changes
- To allow changes in TI DSPs and tools gradually, changes are classified into 3 levels:
  - Level 1: simple, minimum changes (next silicon)
  - Level 2: intermediate, handover changes (1-2 years)
  - Level 3: actual proposed changes (2-3 years)
- We want to go to Level 3, but in steps!
6 Level 1 changes: Power-efficiency
7 Level 1 changes: Power saving features
(1) Use dynamic voltage and frequency scaling
  - When the workload changes, such as users, data rates, modulation, coding rates, ...
  - Already in industry: Crusoe, XScale
(2) Use voltage gating to turn off unused resources
  - When units are idle for a sufficiently long time
  - Saves static and dynamic power dissipation
  - See the example on the next slide
8 Turning off ALUs
[Figure: instruction schedules across adders and multipliers, comparing the default schedule with the schedule after exploration, in which a sleep instruction turns off 2 multipliers to save power.]
- Units are turned off using voltage gating to eliminate static and dynamic power dissipation
9 Level 1: Architecture tradeoffs
- DVS: advanced voltage regulation scheme
  - Cannot use NMOS pass gates
  - Cannot use tri-state buffers
  - Use at a coarser time scale (once in a million cycles)
  - [value missing] cycles settling time
- Voltage gating: gating device design is important
  - Should be able to supply current to the gated circuit
  - Use at a coarser time scale (once in [value missing] cycles)
  - 1-10 cycles settling time
10 Level 1: Tools/Programming impact
- Need a DSP/BIOS task running continuously that watches for workload changes and sets voltage/frequency using a look-up table in memory
- Compiler should be made re-targetable
  - Target a subset of ALUs and explore static performance with different adder-multiplier schedules
- Voltage gating uses a sleep instruction that the compiler generates for unused ALUs
  - ALUs should be idle for > 100 cycles for this to occur
  - Other resources can be gated off similarly to save static power dissipation
- Programmer is not aware of these changes
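The look-up-table task and the 100-cycle gating rule described on this slide can be sketched in C. This is only an illustrative sketch: the `op_point` row layout, the function names, and the assumption that rows are sorted from cheapest to most expensive are all hypothetical and do not come from the deck or from any TI API. Only the 100-cycle idle threshold is taken from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the voltage/frequency look-up table (hypothetical layout). */
struct op_point {
    int max_users;        /* largest user count this row can serve */
    int max_constraint;   /* largest Viterbi constraint length this row can serve */
    unsigned freq_mhz;    /* clock frequency to program */
    unsigned millivolts;  /* supply voltage to program */
};

/* Threshold from the slide: gate an ALU only if it stays idle for
 * more than 100 cycles. */
enum { ALU_GATE_MIN_IDLE_CYCLES = 100 };

/* Return the first row that covers the workload. Rows are assumed to be
 * sorted from cheapest to most expensive, with the worst case last. */
const struct op_point *select_op_point(const struct op_point *table,
                                       size_t rows, int users, int constraint)
{
    assert(table != NULL && rows > 0);
    for (size_t i = 0; i < rows; i++) {
        if (users <= table[i].max_users &&
            constraint <= table[i].max_constraint)
            return &table[i];
    }
    return &table[rows - 1]; /* fall back to the worst-case setting */
}

/* Non-zero when an ALU is idle long enough to be worth gating off. */
int should_gate_alu(unsigned long idle_cycles)
{
    return idle_cycles > ALU_GATE_MIN_IDLE_CYCLES;
}
```

A monitoring task would call `select_op_point` whenever the number of users or the coding parameters change, then program the returned frequency and voltage. The table contents would have to come from measurements such as the (32 users, constraint length 9) point at 1.2 GHz and 1.4 V reported earlier in the deck.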
11 Level 2 changes: Performance
12 Solutions to increase DSP performance
(1) Increasing clock frequency
  - C64x: ?
  - Easiest solution, but limited benefits
  - Not good for power, given the cubic dependence on frequency
(2) Increasing ALUs
  - Limited instruction-level parallelism (ILP)
  - Register file area and port explosion
  - Compiler issues in extracting more ILP
(3) Multiprocessors (MIMD)
  - Usually 3rd-party vendors (except C40-types)
13 DSP multiprocessors
[Figure: DSPs, ASSPs, and co-processors connected through an interconnection and a network interface.]
Source: Texas Instruments Wireless Infrastructure Solutions Guide, Pentek, Sundance, C80
14 Multiprocessing tradeoffs
- Advantages: performance, and tools don't have to change!
- Load-balancing algorithms on multiple DSPs is not straightforward
  - Burden pushed on to the programmer
  - Not scalable with the number of processors
  - Difficult to adapt to workload changes
- Traditional DSPs are not built for multiprocessing (except C40-types)
  - I/O impacts throughput, power, and area
  - (E)DMA use minimizes the throughput problem
  - Power and area problems still remain
References:
R. Baines, "The DSP bottleneck," IEEE Communications Magazine, May 1995 (outdated?)
S. Rajagopal, B. Jones and J. R. Cavallaro, "Task partitioning wireless base-station algorithms on multiple DSPs and FPGAs," ICSPAT 2001
15 Options
- Chip multiprocessors with SIMD parallelism (Level 3)
  - SIMD parallelism can alleviate load balancing (shown in Level 3)
  - Scalable with processors
  - Automatic SIMD parallelism can be done by the compiler
  - A single chip will alleviate I/O bottlenecks
  - Tools will need changes
- Getting to Level 3 requires an intermediate (Level 2) investigation
  - Level 2: do SPMD on a DSP multiprocessor
16 Texas Instruments C64x DSP
[Figure: C64x datapath.]
Source: Texas Instruments C64x DSP Generation (sprt236a.pdf)
17 A possible, plausible solution
- Exploit data parallelism (DP)
  - Available in many wireless algorithms
  - This is what ASICs do!

    int i, a[n], b[n], sum[n];        // 32 bits
    short int c[n], d[n], diff[n];    // 16 bits, packed
    for (i = 0; i < 1024; i++) {
        sum[i] = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }

The slide labels three kinds of parallelism in this loop: subword, ILP, and DP.
Data parallelism is defined as the parallelism available after subword packing and loop unrolling.
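To make the "16 bits packed" comment concrete, here is a minimal portable sketch of subword arithmetic: two 16-bit elements share one 32-bit word and are subtracted lane by lane. The helper names (`pack16`, `sub16x2`, `packed_diff`) are hypothetical. A real C64x build would more likely rely on the compiler's packed-data support than on hand-written masking, so treat this as an illustration of the idea rather than the deck's method.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word (hi in the upper half). */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | (uint32_t)lo;
}

/* Subtract both 16-bit lanes independently, wrapping modulo 2^16.
 * The mask and shift stop a borrow in one lane from reaching the other. */
uint32_t sub16x2(uint32_t x, uint32_t y)
{
    uint32_t lo = (x - y) & 0x0000FFFFu;
    uint32_t hi = ((x >> 16) - (y >> 16)) << 16;
    return hi | lo;
}

/* diff = c - d over n_words packed words, i.e. 2 * n_words elements. */
void packed_diff(const uint32_t *c, const uint32_t *d, uint32_t *diff,
                 size_t n_words)
{
    assert(c != NULL && d != NULL && diff != NULL);
    for (size_t i = 0; i < n_words; i++)
        diff[i] = sub16x2(c[i], d[i]);
}
```

With packing, the `diff` loop from the slide needs half as many iterations, which is the subword parallelism that is counted before the remaining iterations are treated as data parallelism.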
18 SPMD multiprocessor DSP
[Figure: four C64x datapaths, with the same program running on all DSPs.]
19 Level 2: Architecture tradeoffs
- The interconnection between the C64x DSPs could be similar to the ones used by 3rd-party vendors
  - FPGA-based C40 comm ports (Sundance): ~400 MBps
  - VIM modules (Pentek): ~300 MBps
  - Others developed by TI, BlueWave Systems
20 Level 2: Tools/Programming impact
- All DSPs run the same program
  - Programmer thinks of only 1 DSP program
  - Burden is now on the tools
- Can use C8x compiler and tool support expertise
  - Integration of C8x and C6x compilers
  - Data parallelism used for SPMD
- DMA data movement can be left to the programmer at this stage, to keep data fed to all the processors
- MPI (message passing) can alternatively be applied
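The SPMD idea on this slide, one program on every DSP with each DSP working on its own slice of the data, can be sketched as follows. The function names and the block-partitioning scheme are assumptions made for illustration. The deck does not say how iterations would be divided, and data movement (DMA or MPI) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the half-open iteration range [begin, end) owned by processor
 * `id` out of `nproc` when `n` iterations are split into equal blocks. */
void spmd_range(size_t n, size_t nproc, size_t id,
                size_t *begin, size_t *end)
{
    assert(nproc > 0 && id < nproc && begin != NULL && end != NULL);
    size_t chunk = (n + nproc - 1) / nproc;   /* ceil(n / nproc) */
    size_t b = id * chunk;
    size_t e = b + chunk;
    *begin = b < n ? b : n;                   /* clamp to the array length */
    *end = e < n ? e : n;
}

/* The single program every DSP runs; only `id` differs per processor. */
void spmd_vector_add(const int *a, const int *b, int *sum,
                     size_t n, size_t nproc, size_t id)
{
    size_t begin, end;
    spmd_range(n, nproc, id, &begin, &end);
    for (size_t i = begin; i < end; i++)
        sum[i] = a[i] + b[i];
}
```

For the 1024-iteration loop from slide 17 on four DSPs, processor 1 would own iterations 256 through 511. The programmer still writes one loop, and the tools decide which slice each DSP executes.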
21 Level 3 changes: Performance and Power
22 A chip multiprocessor (CMP) DSP
[Figure: a C64x DSP core (1 cluster) with internal L2 memory and an instruction decoder, exploiting ILP and subword parallelism, shown next to a C64x-based CMP DSP core with internal L2 memory and an instruction decoder, exploiting ILP, subword parallelism, and DP.]
- Adapt #clusters to DP
- Identical clusters, same operations
- Power down unused ALUs and clusters
23 A 4-cluster CMP using TI C64x
[Figure: four C64x datapaths on one chip.]
- Significant savings possible in area and power
- Increasing benefits with larger #clusters (8, 16, 32 clusters)
24 Alternate view of the CMP DSP
[Figure: a DMA controller feeding L2 internal memory (banks 1 to C), prefetch buffers, clusters of C64x cores (core 0 to core C), a shared instruction decoder, and an inter-cluster communication network.]
25 Adapting #clusters to Data Parallelism
[Figure: an adaptive multiplexer network above four clusters (C), shown in four configurations: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, and all clusters off.]
- Unused clusters are turned off using voltage gating to eliminate static and dynamic power dissipation
26 Level 3: Architecture tradeoffs
- Single processor -> SPMD -> SIMD
- Single chip: max die size limited to 128 clusters with 8 functional units/cluster at 90 nm technology [estimate]
- Number of memory banks = #clusters
- Instruction added to turn off clusters when data parallelism is insufficient
27 Level 3: Tools/Programming impact
- Level 2 compiler provides support for data parallelism
  - Adapt #clusters to data parallelism for power savings
  - Check the loop count after loop unrolling
  - If it is less than #clusters, emit an instruction to turn off clusters
- Design of parallel algorithms and mapping is important
- Programmer still writes regular C code
  - Transparent to the programmer
  - Burden on the compiler
  - Automatic DMA data movement to keep data feeding into the arithmetic units
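The compiler check described above (compare the loop count after unrolling with the number of clusters and turn off the surplus) can be sketched as a small helper. The function names and the bit-mask encoding of which clusters to gate are hypothetical. The deck only states that an instruction turns clusters off.

```c
#include <assert.h>
#include <stdint.h>

/* Number of clusters worth keeping powered for a loop that has
 * `loop_count` independent iterations left after unrolling. */
unsigned active_clusters(unsigned loop_count, unsigned total_clusters)
{
    return loop_count < total_clusters ? loop_count : total_clusters;
}

/* Bit i is set when cluster i should be turned off.
 * Supports up to 32 clusters in this sketch. */
uint32_t cluster_off_mask(unsigned loop_count, unsigned total_clusters)
{
    assert(total_clusters >= 1 && total_clusters <= 32);
    unsigned active = active_clusters(loop_count, total_clusters);
    /* Build "low n bits set" masks without shifting by 32. */
    uint32_t all = total_clusters == 32
                       ? 0xFFFFFFFFu
                       : ((UINT32_C(1) << total_clusters) - 1u);
    uint32_t on = active == 32
                      ? 0xFFFFFFFFu
                      : ((UINT32_C(1) << active) - 1u);
    return all & ~on;
}
```

On a 4-cluster chip, a loop with only 2 parallel iterations would yield the mask `0xC`, that is, clusters 2 and 3 gated off, which corresponds to the 4:2 reconfiguration shown on slide 25.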
28 Verification of potential benefits
- Level 3 potential verified using the Imagine stream processor simulator
- The C64x DSP is replaced with a cluster containing 3 adders (+), 3 multipliers (×), and a distributed register file
29 Need for adapting to flexibility
- Base-stations are designed for the worst-case workload
- Base-stations rarely operate at the worst-case workload
- Adapting the resources to the workload can save power!
30 Example of flexibility needed in workloads
[Figure: operation count (in GOPs) versus (users, constraint length) for workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9), plotted for a 2G base-station (16 Kbps/user) and a 3G base-station (128 Kbps/user). Note: GOPs refer only to arithmetic computations.]
- Billions of computations per second are needed
- Workload varies from ~1 GOPs for 4 users, constraint length 7 Viterbi, to ~23 GOPs for 32 users, constraint length 9 Viterbi
31 Flexibility affects Data Parallelism
U = users, K = constraint length, N = spreading gain, R = decoding rate
[Table: data parallelism per workload (U,K) for estimation f(U,N), detection f(U,N), and decoding f(U,K,R), with rows (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9). Cell values are missing from the transcription.]
Data parallelism is defined as the parallelism available after subword packing and loop unrolling.
32 Cluster utilization variation with workload
[Figure: cluster utilization versus cluster index on a 32-cluster processor, one curve per workload: (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9).]
(32,9) = 32 users, constraint length 9 Viterbi
33 Frequency variation with workload
[Figure: real-time frequency (in MHz) for workloads (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9), broken down into Busy, L2 Stall, and Mem Stall.]
34 Operation
- DVS when the system changes significantly
  - Users, data rates
  - Coarse time scale (every few seconds)
- Turn off clusters when parallelism changes significantly
  - Parallelism can change within the same algorithm, e.g. spreading gain changes during matched filtering
  - Finer time scales (100s of microseconds)
- Turn off ALUs when algorithms change significantly
  - Estimation, detection, decoding
  - Finer time scales (100s of microseconds)
35 Power savings: Voltage Gating & Scaling
[Table: for each workload (4,7), (4,9), (8,7), (8,9), (16,7), (16,9), (32,7), (32,9), the columns are frequency needed and used (MHz), voltage (V), power savings (W) from clocking, memory, and clusters, new and base power (W), and savings (%). Cell values are missing from the transcription.]
- Estimated cluster power consumption: 78 %
- Estimated L2 memory power consumption: 11.5 %
- Estimated instruction decoder power consumption: 10.5 %
- Estimated chip area (0.13 micron process): 45.7 mm^2
- Power can change from [value missing] W to 300 mW depending on workload changes
36 How to decide ALUs vs. clock frequency
- No independent variables: clusters, ALUs, frequency, voltage
- Trade-offs exist:
  $P \propto C V^2 f$, $V \propto f$, so $P \propto f^3$
- How to find the right combination for lowest power!
[Figure: three design points, each cluster containing a adders and m multipliers: (A) 1 cluster at 100 GHz, (B) 100 clusters at 10 MHz, (C) c clusters at f MHz.]
37 Setting clusters, adders, multipliers
- If there is sufficient DP, frequency decreases linearly with the number of clusters
  - Set clusters depending on DP and the execution time estimate
- To find adders and multipliers:
  - Let the compiler schedule algorithm workloads across different numbers of adders and multipliers and let it find the execution time
  - Put all numbers in the previous equation
  - Compare the increase in capacitance due to added ALUs and clusters with the benefits in execution time
  - Choose the solution that minimizes the power
- Details available in Sridhar's thesis
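The selection procedure above can be sketched as a search over candidate configurations. This is a hedged sketch under stated assumptions: power is taken as proportional to C·f³ (from P ∝ CV²f with V ∝ f, consistent with the cubic dependence noted on slide 12), the frequency is assumed to be just high enough to finish the compiler-reported cycle count within the real-time deadline, and the `design_point` structure, its field names, and the relative capacitance values are hypothetical inputs rather than figures from the deck.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate design point (all fields are hypothetical inputs). */
struct design_point {
    int clusters;
    int adders;           /* per cluster */
    int multipliers;      /* per cluster */
    double capacitance;   /* relative switched capacitance of the design */
    double cycles;        /* compiler-reported cycles for the workload */
};

/* Relative power C * f^3, where f = cycles / deadline is the lowest
 * frequency that still meets the real-time deadline. */
double relative_power(const struct design_point *p, double deadline_s)
{
    assert(p != NULL && deadline_s > 0.0);
    double f = p->cycles / deadline_s;
    return p->capacitance * f * f * f;
}

/* Index of the candidate with the lowest relative power. */
size_t lowest_power_index(const struct design_point *pts, size_t n,
                          double deadline_s)
{
    assert(pts != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (relative_power(&pts[i], deadline_s) <
            relative_power(&pts[best], deadline_s))
            best = i;
    }
    return best;
}
```

The sketch captures the trade-off the slide describes: doubling the hardware doubles capacitance, but if it halves the cycle count the required frequency halves and the f³ term drops by a factor of eight. Once added clusters or ALUs stop shortening the schedule, the extra capacitance dominates and a smaller configuration wins.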
38 Conclusions
- We propose a step-by-step methodology to design high performance, power-efficient DSPs based on the TI C64x architecture
- Initial results show benefits in power/performance greater than an order of magnitude over a conventional C64x
- We tailor the design to ensure maximum compatibility with TI's C6x architecture and tools
- We are interested in exploring opportunities at TI for the design and actual fabrication of a chip and associated tool development
- We are interested in feedback on:
  - Limitations that we have not accounted for
  - Unreasonable assumptions that we have made
Recommended reading:
S. Rixner et al., "A register organization for media processing," HPCA 2000
B. Khailany et al., "Exploring the VLSI scalability of stream processors," HPCA 2003
U. J. Kapasi et al., "Programmable Stream Processors," IEEE Computer, August 2003
Ascenium: A Continuously Reconfigurable Architecture Robert Mykland Founder/CTO robert@ascenium.com August, 2005 Ascenium: A Continuously Reconfigurable Processor Continuously reconfigurable approach provides:
More informationUnit 9 : Fundamentals of Parallel Processing
Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing
More informationVersal: AI Engine & Programming Environment
Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY
More informationMPSOC 2011 BEAUNE, FRANCE
MPSOC 2011 BEAUNE, FRANCE BOADRES: A SCALABLE BASEBAND PROCESSOR TEMPLATE FOR Gbps RADIOS VICE PRESIDENT, CHAIRMAN OF THE TECHNOLOGY OFFICE PROFESSOR AT THE KATHOLIEKE UNIVERSITEIT LEUVEN STATUS SDR BASEBAND
More informationEECS150 - Digital Design Lecture 09 - Parallelism
EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization
More informationPicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor
PicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor Taeho Kgil, Shaun D Souza, Ali Saidi, Nathan Binkert, Ronald Dreslinski, Steve Reinhardt, Krisztian Flautner,
More informationINTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016
NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering
More informationInterface Design Techniques for Single-Chip Systems
Interface Design Techniques for Single-Chip Systems Robert H. Bell, Jr. Lizy Kurian John Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 78712-24 {belljr,
More informationVerilog for High Performance
Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes
More informationCOPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design
COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design Lecture Objectives Background Need for Accelerator Accelerators and different type of parallelizm
More informationECE 637 Integrated VLSI Circuits. Introduction. Introduction EE141
ECE 637 Integrated VLSI Circuits Introduction EE141 1 Introduction Course Details Instructor Mohab Anis; manis@vlsi.uwaterloo.ca Text Digital Integrated Circuits, Jan Rabaey, Prentice Hall, 2 nd edition
More informationMassively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain
Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationReNoC: A Network-on-Chip Architecture with Reconfigurable Topology
1 ReNoC: A Network-on-Chip Architecture with Reconfigurable Topology Mikkel B. Stensgaard and Jens Sparsø Technical University of Denmark Technical University of Denmark Outline 2 Motivation ReNoC Basic
More informationASSEMBLY LANGUAGE MACHINE ORGANIZATION
ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction
More informationEE482S Lecture 1 Stream Processor Architecture
EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu 1 Today s Class Meeting What is EE482C? Material covered
More informationAn FPGA Architecture Supporting Dynamically-Controlled Power Gating
An FPGA Architecture Supporting Dynamically-Controlled Power Gating Altera Corporation March 16 th, 2012 Assem Bsoul and Steve Wilton {absoul, stevew}@ece.ubc.ca System-on-Chip Research Group Department
More informationA PROGRAMMABLE BASEBAND PLATFORM FOR SOFTWARE-DEFINED RADIO
A PROGRAMMABLE BASEBAND PLATFORM FOR SOFTWARE-DEFINED RADIO Hans-Martin Bluethgen, Cyprian Grassmann, Wolfgang Raab, Ulrich Ramacher, Josef Hausner, Infineon Technologies AG, 81609 Munich, Germany, Hans-Martin.Bluethgen@infineon.com
More informationLow Power Design Techniques
Low Power Design Techniques August 2005, ver 1.0 Application Note 401 Introduction This application note provides low-power logic design techniques for Stratix II and Cyclone II devices. These devices
More informationCS310 Embedded Computer Systems. Maeng
1 INTRODUCTION (PART II) Maeng Three key embedded system technologies 2 Technology A manner of accomplishing a task, especially using technical processes, methods, or knowledge Three key technologies for
More informationDSP Co-Processing in FPGAs: Embedding High-Performance, Low-Cost DSP Functions
White Paper: Spartan-3 FPGAs WP212 (v1.0) March 18, 2004 DSP Co-Processing in FPGAs: Embedding High-Performance, Low-Cost DSP Functions By: Steve Zack, Signal Processing Engineer Suhel Dhanani, Senior
More informationTrends in the Infrastructure of Computing
Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too
More informationOUTLINE Introduction Power Components Dynamic Power Optimization Conclusions
OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions 04/15/14 1 Introduction: Low Power Technology Process Hardware Architecture Software Multi VTH Low-power circuits Parallelism
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationAll MSEE students are required to take the following two core courses: Linear systems Probability and Random Processes
MSEE Curriculum All MSEE students are required to take the following two core courses: 3531-571 Linear systems 3531-507 Probability and Random Processes The course requirements for students majoring in
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More information