Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit
1 Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit
Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi
Computer and IT Engineering Department, Amirkabir University of Technology: {mehdipur,szamani,msedighi}@aut.ac.ir
Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University: noori@c.csce.kyushu-u.ac.jp, {murakami,inoue}@i.kyushu-u.ac.jp
2 Agenda
- Introduction
- Application-specific instruction set extension
- Temporal Partitioning
- Some Definitions
- General overview of the architecture
- RFU Architecture: A Quantitative Approach
- Generating Custom Instructions
- Mapping Custom Instructions
- Integrating RFU with base processor
- Integrated framework for generating and mapping custom instructions
- Performance Evaluation
- References
3 Introduction
An extensible processor with a reconfigurable functional unit (RFU) can be an alternative to General Purpose Processors (GPPs), Application-Specific Integrated Circuits (ASICs) and Application-Specific Instruction set Processors (ASIPs) for achieving enhanced performance in embedded systems.
- ASICs: not flexible; expensive and time-consuming design process
- GPPs: very flexible; may not offer the necessary performance
4 Introduction
ASIPs
- more flexible than ASICs
- more potential than GPPs to meet the high-performance demands of embedded applications
- require generating a complete instruction set architecture for the targeted application
- a full-custom solution is too expensive and has long design turnaround times
5 Application-specific instruction set extension
- Another method for performance improvement: an extensible processor with a reconfigurable functional unit
- Offers a favorable tradeoff between efficiency and flexibility while keeping design turnaround time much shorter
- Critical portions of an application's dataflow graph (DFG) are accelerated by using custom functional units
- Nodes of the DFG -> instructions of the critical portions
- Edges of the DFG -> dependencies between instructions
6 Temporal Partitioning
Partitioning a data flow graph into a number of partitions such that each partition fits into the target hardware and the dependencies among the graph nodes are not violated.
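The two conditions in this definition can be checked mechanically. The sketch below is only an illustration, not the authors' tool: the function name, the dictionary-based graph representation and the single `capacity` limit (maximum nodes per partition) are assumptions introduced here; a real RFU has several constraints (inputs, outputs, routing).

```python
from collections import Counter


def is_valid_temporal_partitioning(edges, partition_of, capacity):
    """Check the two conditions of a temporal partitioning.

    edges        -- iterable of (producer, consumer) node pairs of the DFG
    partition_of -- dict mapping each node to its partition index (0, 1, ...)
    capacity     -- hypothetical hardware limit: maximum nodes per partition
    """
    # Condition 1: every partition fits into the target hardware.
    sizes = Counter(partition_of.values())
    if any(size > capacity for size in sizes.values()):
        return False
    # Condition 2: a consumer never executes in an earlier partition
    # than the node that produces its input.
    return all(partition_of[u] <= partition_of[v] for u, v in edges)
```

For example, with edges 1 -> 2, 2 -> 3 and 1 -> 3, placing nodes 1 and 2 in partition 0 and node 3 in partition 1 is valid for a capacity of 2, while placing node 1 after node 2 is not.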
7 Some definitions
- Hot Basic Block (HBB): a basic block whose execution frequency is greater than a given threshold specified in the profiler
- Custom Instructions (CIs): the extensions to the Instruction Set Architecture (ISA) that are executed on the RFU
- Reconfigurable Functional Unit (RFU): custom hardware for executing CIs
8 General overview of the architecture
Adaptive Dynamic Extensible Processor
- Base processor: N-way in-order general RISC (Fetch, Decode, Execute, Memory, Write stages; Reg File)
- Augmented hardware:
  - Profiler: detects start addresses of Hot Basic Blocks (HBBs)
  - RFU: executes Custom Instructions
  - Sequencer: switches between main processor and RFU
9 Operation modes
Figure: three panels, each showing the processor, ACC, profiler and sequencer.
- Training mode: applications run under binary-level profiling; start addresses of HBBs are detected
- Training mode: tools run for generating custom instructions, generating configuration data for the ACC and initializing the sequencer table; binary rewriting
- Normal mode: applications run; the PC is monitored to switch between the main processor and the ACC; CIs are executed
10 Tool Chain
- Base processor and profiler: SimpleScalar (PISA configuration)
- Reading HBBs from object code
- 22 applications of MiBench
- Detecting start addresses of HBBs
- Results are used for designing the RFU
- Generating DFG for HBBs
- Custom instruction generator
- Mapping CIs on the RFU
- Optimization (constant propagation)
- Updating DFG
11 Reconfigurable Functional Unit (RFU)
- RFU is a matrix of Functional Units (FUs)
- RFU has a two-level configuration memory:
  - a multi-context memory (keeps two or four configurations)
  - a cache
- FUs support only logical operations, add/subtract, shifts and compare
- RFU updates the PC
- RFU has a variable delay that depends on the size of the Custom Instruction
12 RFU Architecture: A Quantitative Approach
- 22 programs of MiBench were chosen
- The SimpleScalar toolset was used for simulation
- RFU is a matrix of FUs; the design parameters are the number of inputs, number of outputs, number of FUs, connections, and location of inputs and outputs
Some definitions:
- Frequency and weight are considered in the measurements: CI execution frequency, and weight (equal to the number of executed instructions)
- Average = Σ (Freq × Weight) over all CIs
- Rejection: percentage of CIs that could not be mapped on the RFU
- Coverage: percentage of CIs that could be mapped on the RFU
- Basic block: a sequence of instructions that terminates in a control instruction
- Hot basic block: a basic block executed more often than a threshold
13 RFU Architecture
Distribution of inputs over the rows:
- Row 1 = 7, Row 2 = 2, Row 3 = 2, Row 4 = 2, Row 5 = 1
Connections with variable length:
- row1 -> row3 = 1, row1 -> row4 = 1, row1 -> row5 = 1, row2 -> row4 = 1
Synthesis results using Hitachi 0.18 μm:
- Area: mm²
- Delay: 9.66 ns
14 Integrating RFU with the Base Processor
Figure: the RFU (FU1, FU2, FU3, FU4) sits between the DEC/EXE and EXE/MEM pipeline registers, together with the register file (Reg0 ... Reg31), configuration memory, decoder and sequencer.
15 Generation of Custom Instructions
Custom instructions
- exclude floating point, multiply, divide and load instructions
- include at most one STORE, at most one BRANCH/JUMP and all other fixed point instructions
Simple algorithm for generating custom instructions
- HBBs usually include 10 to 40 instructions for MiBench
- The custom instruction generator is going to be executed on the base processor (in online training mode)
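The slides do not spell out the "simple algorithm", so the following is only a sketch under stated assumptions: it scans an HBB in program order, cuts a candidate CI at every excluded instruction, and starts a new CI when a second store or a second branch would be added. The instruction classes (`"alu"`, `"load"`, `"store"`, ...), the function name and the rule that a one-instruction CI is dropped are all hypothetical; data dependencies are ignored here.

```python
# Hypothetical instruction classes that may not appear in a CI.
EXCLUDED = {"fp", "mul", "div", "load"}


def generate_cis(hbb):
    """Split a hot basic block into candidate custom instructions.

    hbb -- list of (name, kind) tuples in program order, where kind is
           one of "alu", "store", "branch", "fp", "mul", "div", "load"
    """
    cis, current, stores, branches = [], [], 0, 0

    def flush():
        nonlocal current, stores, branches
        # Assumption: a CI with a single instruction is not worth keeping.
        if len(current) > 1:
            cis.append(current)
        current, stores, branches = [], 0, 0

    for name, kind in hbb:
        if kind in EXCLUDED:
            flush()              # excluded instructions end the current CI
            continue
        if kind == "store" and stores == 1:
            flush()              # at most one STORE per CI
        if kind == "branch" and branches == 1:
            flush()              # at most one BRANCH/JUMP per CI
        current.append(name)
        stores += kind == "store"
        branches += kind == "branch"
    flush()
    return cis
```

Because the pass is linear in the HBB length (10 to 40 instructions), a generator of this kind is cheap enough to run on the base processor during training mode.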
16 Mapping Custom Instructions
- Mapping is the same as the well-known placement problem: determining the appropriate positions for DFG nodes on the RFU
- Assigning CI instructions to FUs is done based on the priority of the nodes
17 Mapping Custom Instructions
- The slack of each node represents its criticality and also its priority for partitioning
- A slack equal to 0 means that the node is on the critical path of the DFG and should be scheduled with the highest priority
- For nodes with the same criticality, their ASAP level determines their mapping order
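The priority rule above can be sketched as follows. This is an illustrative reading, not the authors' code: levels are numbered from 1, slack is taken as ALAP minus ASAP, and ties on slack are broken by ascending ASAP level, which is an assumption since the slide does not state the direction. The function name and graph representation are also introduced here.

```python
def mapping_order(nodes, edges):
    """Order DFG nodes for mapping by (slack, ASAP level).

    nodes -- list of nodes in topological order
    edges -- list of (producer, consumer) pairs
    """
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # ASAP: earliest level at which a node can execute (sources are level 1).
    asap = {}
    for n in nodes:
        asap[n] = 1 + max((asap[p] for p in preds[n]), default=0)
    depth = max(asap.values())

    # ALAP: latest level that does not lengthen the critical path.
    alap = {}
    for n in reversed(nodes):
        alap[n] = min((alap[s] for s in succs[n]), default=depth + 1) - 1

    slack = {n: alap[n] - asap[n] for n in nodes}
    # Slack 0 (critical path) first; equal slack is ordered by ASAP level.
    order = sorted(nodes, key=lambda n: (slack[n], asap[n]))
    return order, asap, alap, slack
```

For a chain a -> b -> c -> d with an extra input e -> d, the chain nodes have slack 0 and are mapped first, while e has slack 2 and is mapped last.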
18 Mapping Algorithm (1/2)
First step: determining an appropriate row for the selected node
- Row number = last row, if the selected node is on a critical path whose length is greater than or equal to the RFU depth
- Row number = ALAP - slack - 1 (this prevents nodes that do not belong to critical paths from occupying FUs in the lower RFU rows)
19 Mapping Algorithm (2/2)
Second step: determining an appropriate column
- The column is determined according to the minimum connection length criterion
- For each row, a maximum capacity is set to prevent many nodes from gathering in one row
- The capacity of the rows is determined with respect to the longest critical path and the number of critical paths in the DFG
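A minimal sketch of the column step, assuming that "connection length" can be approximated by the horizontal distance between a node's column and the columns of its already placed predecessors. The function name, its parameters and the tie-breaking toward the lowest column are hypothetical; the slides do not give these details.

```python
def choose_column(row, pred_columns, occupied, num_columns, row_capacity):
    """Pick a free column in `row` that minimizes connection length.

    row          -- row index chosen in the first step
    pred_columns -- columns of the node's already placed predecessors
    occupied     -- dict: row index -> set of columns already in use
    num_columns  -- number of FU columns in the RFU
    row_capacity -- maximum number of nodes allowed in this row
    Returns the chosen column, or None if the row cannot take the node.
    """
    used = occupied.get(row, set())
    if len(used) >= row_capacity:
        return None                      # row capacity reached
    free = [c for c in range(num_columns) if c not in used]
    if not free:
        return None                      # no FU left in this row
    # Assumed cost: total horizontal distance to the predecessors;
    # ties are broken toward the lowest column index.
    return min(free, key=lambda c: (sum(abs(c - p) for p in pred_columns), c))
```

When this returns None, a mapper would have to try another row or reject the CI; that policy is outside this sketch.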
20 An Example: Mapping of a CI on the RFU
21 Generating Custom Instruction for the Target RFU
- In our initial CI generator we did not consider any constraints for the generated CIs and tried to generate CIs that were as large as possible
- Therefore, some of the generated CIs cannot be mapped on the proposed RFU because of its constraints
22 Customizing CI generator for the Target RFU
First Approach
- Some primary constraints of the RFU (number of inputs, number of outputs and number of nodes) were added to our CI generator tool to generate CIs that are mappable
- In this approach the CI generator is unaware of the results of the mapping process
- Some CIs may ultimately not be mapped to the RFU because of the routing constraints
23 Customizing CI generator for the Target RFU
Second Approach: Integrated Framework
- Performs an integrated temporal partitioning and mapping process
- Takes rejected CIs as input
- Partitions them into appropriate mappable CIs
- Adds nodes to the current partition while the architectural constraints are satisfied
- The ASAP level of the nodes represents their execution order according to their dependencies
Advantages
- Reduces the number of rejected CIs
- Uses a mapping-aware temporal partitioning process
24 Integrated Framework - Temporal Partitioning Algorithms
HTTP (horizontal traversal)
- Traverses the DFG nodes horizontally according to the ASAP level of the nodes
- Usually brings about more parallelism for instruction execution
- May require large intermediate data
VTTP (vertical traversal)
- Traverses the DFG nodes vertically
- Creates partitions with longer critical paths
- Reduces the size of intermediate data
The size of the intermediate data affects the data transfer rate and the size of the configuration memory.
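The difference between the two traversals can be illustrated with a small sketch. Everything here is an assumption made for illustration: partitions are cut purely by a node-count `capacity`, the vertical traversal is modeled as a dependency-respecting depth-first walk, and intermediate data is counted as the number of nodes whose result is consumed in a different partition.

```python
def _neighbours(nodes, edges):
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    return preds, succs


def http_order(nodes, edges):
    """Horizontal traversal: visit nodes level by level (ASAP order)."""
    preds, _ = _neighbours(nodes, edges)
    asap = {}
    for n in nodes:                      # nodes must be in topological order
        asap[n] = 1 + max((asap[p] for p in preds[n]), default=0)
    return sorted(nodes, key=lambda n: asap[n])


def vttp_order(nodes, edges):
    """Vertical traversal: follow each path downward as far as possible."""
    preds, succs = _neighbours(nodes, edges)
    done, order = set(), []

    def visit(n):
        # A node is visited only once all of its producers were visited.
        if n in done or any(p not in done for p in preds[n]):
            return
        done.add(n)
        order.append(n)
        for s in succs[n]:
            visit(s)

    for n in nodes:
        visit(n)
    return order


def cut(order, capacity):
    """Cut a traversal order into partitions of at most `capacity` nodes."""
    return [order[i:i + capacity] for i in range(0, len(order), capacity)]


def intermediate_values(partitions, edges):
    """Count nodes whose result must be passed to another partition."""
    part = {n: i for i, p in enumerate(partitions) for n in p}
    return len({u for u, v in edges if part[u] != part[v]})
```

On two independent chains a1 -> a2 -> a3 and b1 -> b2 -> b3 with a capacity of 3, the horizontal order mixes both chains in each partition and leaves two intermediate values, while the vertical order keeps each chain in one partition and leaves none.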
25 Integrated Framework - Incremental Temporal Partitioning Algorithm
- The incremental temporal partitioning process is performed iteratively
- Each partition that does not satisfy the RFU constraints is modified, and a new iteration starts
- Two different partition modification strategies are used for HTTP and VTTP
- The main difference is in the way the nodes to be moved to the next partition are selected
26 Integrated Framework - Incremental Temporal Partitioning Algorithm
Incremental HTTP
- The node with the highest ASAP level is selected and moved to the subsequent partition
- Node selection and moving order in the example: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7
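The incremental HTTP loop can be sketched as below. This is a simplified illustration: `fits` is a hypothetical callback standing in for the mapping attempt on the RFU, and ties between nodes at the same ASAP level are broken by list order, which need not match the order shown on the slide.

```python
def incremental_http(nodes, asap, fits):
    """Repeatedly shrink a partition until it satisfies the constraints.

    nodes -- list of DFG nodes of a rejected CI
    asap  -- dict: node -> ASAP level
    fits  -- hypothetical callback: list of nodes -> True if mappable
    """
    partitions = []
    remaining = list(nodes)
    while remaining:
        current, moved = list(remaining), []
        while not fits(current):
            if len(current) == 1:
                raise ValueError("a single node violates the constraints")
            # Move the node with the highest ASAP level to the next
            # partition; all of its consumers have already been moved.
            victim = max(current, key=lambda n: asap[n])
            current.remove(victim)
            moved.append(victim)
        partitions.append(current)
        remaining = moved                # next iteration handles the rest
    return partitions
```

Removing the deepest node first keeps every partition consistent with the dependencies, because no node that stays behind can depend on a node that was moved.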
27 Integrated Framework - Incremental Temporal Partitioning Algorithm
Incremental VTTP
- A node with the highest ASAP level is selected and moved
- The other nodes are selected, in ASAP level order, from the path on which the previously moved node was located
- Node selection and moving order in the example: 15, 14, 6, 13, 12, 5, 11, 10, 4 and 7
28 Customizing Mapping Tool
Spiral-shaped mapping is possible thanks to the horizontal connections in the third and fourth rows of the RFU.
29 Performance Evaluation
Issue: 4-way
L1 I-cache: 32K, 2-way, 1 cycle latency
L1 D-cache: 32K, 4-way, 1 cycle latency
Unified L2: 1M, 6 cycle latency
Execution units: 4 integer, 4 floating point
RUU size:
Fetch queue size:
- SimpleScalar was configured to behave as a 4-issue in-order RISC processor
- The base processor supports the MIPS instruction set
- 22 applications of MiBench
30 Delay of RFU according to CI length
Table: CI length versus RFU delay (ns), obtained with Synopsys tools and Hitachi 0.18 μm.
31 CIs length for Mibench applications
32 Intermediate data size
Figure: number of 32-bit intermediate data per application, for HTTP and VTTP. Applications: bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha.
33 Maximum critical path length for CIs
Figure: critical path length per application, for HTTP and VTTP. Applications: bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha.
34 Speedup comparison
Figure: speedup per application, for HTTP, VTTP and CIGen. Applications: bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha.
37 Thank you for listening
More informationNode Prefetch Prediction in Dataflow Graphs
Node Prefetch Prediction in Dataflow Graphs Newton G. Petersen Martin R. Wojcik The Department of Electrical and Computer Engineering The University of Texas at Austin newton.petersen@ni.com mrw325@yahoo.com
More informationPipelined MIPS processor with cache controller using VHDL implementation for educational purpose
Journal From the SelectedWorks of Kirat Pal Singh Winter December 28, 203 Pipelined MIPS processor with cache controller using VHDL implementation for educational purpose Hadeel Sh. Mahmood, College of
More informationDESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL. Shruti Hathwalia* 1, Meenakshi Yadav 2
ISSN 2277-2685 IJESR/November 2014/ Vol-4/Issue-11/799-807 Shruti Hathwalia et al./ International Journal of Engineering & Science Research DESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL ABSTRACT
More informationEnhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support
Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis Section of Electronics and Computers,
More informationRun-time Adaptable Architectures for Heterogeneous Behavior Embedded Systems
Run-time Adaptable Architectures for Heterogeneous Behavior Embedded Systems Antonio Carlos S. Beck 1, Mateus B. Rutzig 1, Georgi Gaydadjiev 2, Luigi Carro 1, 1 Universidade Federal do Rio Grande do Sul
More informationSpeculative Tag Access for Reduced Energy Dissipation in Set-Associative L1 Data Caches
Speculative Tag Access for Reduced Energy Dissipation in Set-Associative L1 Data Caches Alen Bardizbanyan, Magnus Själander, David Whalley, and Per Larsson-Edefors Chalmers University of Technology, Gothenburg,
More informationENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013
ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of
More information15-740/ Computer Architecture Lecture 7: Pipelining. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011
15-740/18-740 Computer Architecture Lecture 7: Pipelining Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011 Review of Last Lecture More ISA Tradeoffs Programmer vs. microarchitect Transactional
More informationAbstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE
A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany
More informationHigh-Level Synthesis
High-Level Synthesis 1 High-Level Synthesis 1. Basic definition 2. A typical HLS process 3. Scheduling techniques 4. Allocation and binding techniques 5. Advanced issues High-Level Synthesis 2 Introduction
More informationVLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT
VLSI ARCHITECTURE FOR NANO WIRE BASED ADVANCED ENCRYPTION STANDARD (AES) WITH THE EFFICIENT MULTIPLICATIVE INVERSE UNIT K.Sandyarani 1 and P. Nirmal Kumar 2 1 Research Scholar, Department of ECE, Sathyabama
More informationPredictive Thermal Management for Hard Real-Time Tasks
Predictive Thermal Management for Hard Real-Time Tasks Albert Mo Kim Cheng and Chen Feng Real-Time System Laboratory, Department of Computer Science University of Houston, Houston, TX 77204, USA {cheng,
More informationCSE Lecture In Class Example Handout
CSE 30321 Lecture 10-11 In Class Example Handout Question 1: First, we briefly review the notion of a clock cycle (CC). Generally speaking a CC is the amount of time required for (i) a set of inputs to
More informationImplementation of Floating Point Multiplier Using Dadda Algorithm
Implementation of Floating Point Multiplier Using Dadda Algorithm Abstract: Floating point multiplication is the most usefull in all the computation application like in Arithematic operation, DSP application.
More informationDesign Partitioning Methodology for Systems on Programmable Chip
Design Partitioning Methodology for Systems on Programmable Chip Abdo Azibi and Ramzi Ayadi Department of Electronics College of Technology at Alkharj, Saudi Arabia Email: aazibi, amzi.ayadi@tvtc.gov.sa
More informationGeneral Purpose Processors
Calcolatori Elettronici e Sistemi Operativi Specifications Device that executes a program General Purpose Processors Program list of instructions Instructions are stored in an external memory Stored program
More informationasoc: : A Scalable On-Chip Communication Architecture
asoc: : A Scalable On-Chip Communication Architecture Russell Tessier, Jian Liang,, Andrew Laffely,, and Wayne Burleson University of Massachusetts, Amherst Reconfigurable Computing Group Supported by
More informationA Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors
A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal
More informationHybrid Signed Digit Representation for Low Power Arithmetic Circuits
Hybrid Signed Digit Representation for Low Power Arithmetic Circuits Dhananjay S. Phatak Steffen Kahle, Hansoo Kim and Jason Lue Electrical Engineering Department State University of New York Binghamton,
More informationFPGA Matrix Multiplier
FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri
More informationUNIT I BASIC STRUCTURE OF COMPUTERS Part A( 2Marks) 1. What is meant by the stored program concept? 2. What are the basic functional units of a
UNIT I BASIC STRUCTURE OF COMPUTERS Part A( 2Marks) 1. What is meant by the stored program concept? 2. What are the basic functional units of a computer? 3. What is the use of buffer register? 4. Define
More informationA Model-based Methodology for Application Specific Energy Efficient Data Path Design using FPGAs
A Model-based Methodology for Application Specific Energy Efficient Data Path Design using FPGAs Sumit Mohanty 1, Seonil Choi 1, Ju-wook Jang 2, Viktor K. Prasanna 1 1 Dept. of Electrical Engg. 2 Dept.
More informationCS152 Computer Architecture and Engineering. Complex Pipelines
CS152 Computer Architecture and Engineering Complex Pipelines Assigned March 6 Problem Set #3 Due March 20 http://inst.eecs.berkeley.edu/~cs152/sp12 The problem sets are intended to help you learn the
More informationProcessor (I) - datapath & control. Hwansoo Han
Processor (I) - datapath & control Hwansoo Han Introduction CPU performance factors Instruction count - Determined by ISA and compiler CPI and Cycle time - Determined by CPU hardware We will examine two
More informationPACE: Power-Aware Computing Engines
PACE: Power-Aware Computing Engines Krste Asanovic Saman Amarasinghe Martin Rinard Computer Architecture Group MIT Laboratory for Computer Science http://www.cag.lcs.mit.edu/ PACE Approach Energy- Conscious
More informationA Novel Profile-Driven Technique for Simultaneous Power and Code-size Optimization of Microcoded IPs
A Novel Profile-Driven Technique for Simultaneous Power and Code-size Optimization of Microcoded IPs Bita Gorjiara, Daniel Gajski Center for Embedded Computer Systems, University of California, Irvine
More informationCS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25
CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem
More information