A Review on Parallel Logic Simulation
Volume 114 No , ISSN: (printed version); ISSN: (on-line version) url: ijpam.eu

1 S. Karthik and 2 S. Saravana Kumar
1 Department of CSE, Vels University, Chennai, India. skarthikvit@gmail.com
2 Department of CSE, Karpagam College of Engineering, Coimbatore, India. saravanakumars81@gmail.com

Abstract

Verification is important in reducing the development and production cost of an IC. Parallel processing is considered one of the key techniques for achieving high speed and performance in logic simulation and verification. Logic simulation is one of the most frequently used design verification approaches. This paper reviews various parallel processing techniques used in the verification of ICs.

Key Words: VHDL, Verilog, PDES, GPGPU, HDL.
1. Introduction

The design of a VLSI circuit starts with a specification, and the behaviour of the system is then described formally in a hardware description language (HDL). VHDL [1] and Verilog [2] are among the best-known HDLs, although other languages such as SystemC and SystemVerilog have also been developed. Verification plays a pivotal role in IC design because it finds bugs at an early stage. It can be done either by formal verification or by simulation. Complex circuit design depends heavily on simulation to ensure that the design matches its specification and to improve system performance. Simulating VLSI systems that contain millions of transistors or gates is time consuming and has become a bottleneck in the design flow. Engineers rely on parallel simulation of HDL-based systems to speed up simulation. This paper reviews various parallel simulation techniques and discusses future trends.

2. Factors Affecting Parallel Simulation

Design partitioning is a central aspect of distributed parallel simulation. Partitioning determines which signals flow between partitions and also affects synchronization. It can be based on functionality, number of modules, number of gates and so on. The main difficulty is minimizing communication between the partitioned modules while still running them concurrently. Another major challenge is communication overhead, defined as the time spent exchanging values or messages between partitions. Synchronization overhead, the cost of coordinating all the simulations running in parallel, is a further factor. As the number of partitions increases, these factors come to dominate. Fig. 1 shows the performance improvement in CPUs, memory latency and interconnect technologies. CPU performance has grown significantly faster than memory access time and interconnect technology have improved.
The figure below shows why parallel simulation has not been widely adopted.

[Figure 1: Shows the idea behind partitioning the design for accelerating simulation. Axis label: performance in decades. Series: interconnect, memory latency, CPU.]
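The trade-off between partition quality and communication overhead described above can be made concrete with a small sketch. The cost model here is an assumption for illustration only (the slowest partition bounds the parallel time, and every signal crossing a partition boundary costs one fixed-size message); the function names `cut_size` and `estimated_speedup` and the six-gate chain are hypothetical and do not come from any of the reviewed papers.

```python
from typing import Dict, List, Tuple


def cut_size(edges: List[Tuple[str, str]], part: Dict[str, int]) -> int:
    """Count signals whose driver and load sit in different partitions."""
    return sum(1 for src, dst in edges if part[src] != part[dst])


def estimated_speedup(t_serial: float, loads: List[float],
                      cut: int, t_msg: float) -> float:
    """Assumed cost model: the slowest partition bounds the parallel time,
    and every cut signal costs one message of t_msg time units."""
    t_parallel = max(loads)
    t_communication = cut * t_msg
    return t_serial / (t_parallel + t_communication)


# Hypothetical six-gate chain g0 -> g1 -> ... -> g5.
edges = [(f"g{i}", f"g{i + 1}") for i in range(5)]
contiguous = {"g0": 0, "g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 1}
interleaved = {f"g{i}": i % 2 for i in range(6)}

for name, part in (("contiguous", contiguous), ("interleaved", interleaved)):
    cut = cut_size(edges, part)
    speedup = estimated_speedup(t_serial=6.0, loads=[3.0, 3.0],
                                cut=cut, t_msg=0.5)
    print(f"{name}: cut={cut}, estimated speedup={speedup:.2f}")
```

Both partitions balance the load equally, yet the interleaved one cuts five signals instead of one, so under this model its estimated speedup drops from about 1.71 to about 1.09. This is the sense in which communication can dominate as partitions multiply.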
[Figure 2: Distributed simulation. Labels: Partition1, Partition2, Partition3; Core1, Core2, Core 3.]

Fig. 2 shows the distributed simulation. Equation (1) gives the speedup, where $t_{p1}$ is the time taken to simulate partition 1, $t_{parallel}$ is the overall simulation time and $t_{communication}$ is the time taken to communicate intermediate messages between the cores.

$$\text{Speedup} = \frac{t_{p1}}{t_{parallel} + t_{communication}} \qquad (1)$$

Speedup suffers for larger designs, because there are many interdependencies between modules. In spite of this, researchers have come up with many techniques, which are reviewed in the following sections.

3. Parallel Simulation Architecture

Paper [1] presented a simulation architecture for running parallel executable code generated from Verilog, using the MPI library and the ParaMID library, which is based on an efficient parallel algorithm. The partitioning algorithm maps the gates or circuits of a given module into the same logic process (LP), whereas earlier algorithms partitioned the design in such a way that gates or circuits of the same module could be mapped to different LPs. Fig. 3 shows the simulation architecture.

[Figure 3: Simulation architecture. Blocks: Partitioning Algorithm, HDL Parser, C++ code generator, Executable code, ParaMID Simulation Kernel.]

4. Distributed Simulator

Paper [2] designed a distributed simulator based on Time Warp to support efficient parallel simulation on distributed computers. The simulator includes an HDL parser that performs syntax and semantic checking, flattens every module and connects it to the top module by renaming gates, and finally generates a netlist. The simulator uses a global virtual time (GVT) algorithm that relies on short, constant-length messages rather than variable-length messages containing vectors. It also employs local fossil collection, in which each logic process collects fossils before attempting to process a new event.
It also pre-allocates a dynamically sized buffer for every event. The simulation was run on the netlist of a Viterbi decoder design containing 1 million gates and showed a high speedup.
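The roles of global virtual time and fossil collection in a Time Warp simulator can be sketched as follows. This is a generic textbook-style illustration, not the message-based GVT algorithm of [2]: it assumes a consistent snapshot of every LP's local virtual time and of all in-transit message timestamps is available in one place. The class `LogicalProcess` and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class LogicalProcess:
    name: str
    lvt: int  # local virtual time of this LP
    # Timestamps of the saved states kept for possible rollback, ascending.
    saved_states: List[int] = field(default_factory=list)


def global_virtual_time(lps: Iterable[LogicalProcess],
                        in_transit: Iterable[int]) -> int:
    """GVT is a lower bound on any future rollback: the minimum over all
    local virtual times and all timestamps of messages still in flight."""
    return min([lp.lvt for lp in lps] + list(in_transit))


def fossil_collect(lp: LogicalProcess, gvt: int) -> int:
    """Discard saved states that can never be needed again.

    The newest state at or before GVT is kept, because a rollback to GVT
    would restore from it. Returns the number of states reclaimed.
    """
    at_or_before = [t for t in lp.saved_states if t <= gvt]
    if not at_or_before:
        return 0
    keep_from = max(at_or_before)
    before = len(lp.saved_states)
    lp.saved_states = [t for t in lp.saved_states if t >= keep_from]
    return before - len(lp.saved_states)


lps = [
    LogicalProcess("lp0", lvt=40, saved_states=[5, 10, 20, 30, 40]),
    LogicalProcess("lp1", lvt=25, saved_states=[15, 25]),
    LogicalProcess("lp2", lvt=60, saved_states=[50, 60]),
]
gvt = global_virtual_time(lps, in_transit=[30])
for lp in lps:
    # Local fossil collection: each LP reclaims its own memory.
    freed = fossil_collect(lp, gvt)
    print(f"{lp.name}: GVT={gvt}, freed {freed} state(s), kept {lp.saved_states}")
```

Running fossil collection locally in each LP, as the paper describes, avoids a global pause: an LP only needs the current GVT value to decide what it can safely free.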
5. Effect of PC Cluster on PDES

Paper [3] demonstrated parallel discrete event simulation (PDES) on PC clusters and measured the communication latencies and performance. An IP router model was simulated. The experiments were conducted on two separate 8-node clusters, one with Ethernet cards and the other with Myrinet network cards. Each PC has a P3 processor with 256 MB of RAM and a clock speed of 600 MHz. Inter-process communication is handled by the Ethernet and Myrinet NICs. The reported latency for 72-byte messages is 142 μs on the Ethernet system and 21 μs on the Myrinet system. The author showed that rollbacks reduce speed, that the effect is worse when the CPU is faster, and that the speedup factor declines compared with schemes that do not roll back.

6. Q-Learning Approach

The author [4] added a Verilog parser to XTW and created a new simulator called VXTW. The simulator model is shown in Fig. 4. The first step is to synthesize the Verilog code using Synopsys Design Compiler, with GTECH as the target library. The next step uses the Verilog parser to check semantics and create bench files, which are fed to the simulator.

[Figure 4: Simulator model. Labels: Synopsys DC compiler, Verilog Parser, Circuit Simulator, .V file, GTECH lib, .bench file.]

The author presented two dynamic load-balancing algorithms for balancing the computational load and communication, and used two Q-learning agents to combine them. The first agent learns the parameters of the dynamic algorithm and the other optimizes the time window value. The author claimed a 46 percent improvement in run time for many benchmark circuits.

7. Parallel Simulation Using NVIDIA GPU

Paper [5] explored the use of general-purpose computing on graphics processing units (GPGPU) for accelerating logic simulation. An adaptable partitioning approach is used to achieve good load balance and variable coarse-grain partitioning.
The author demonstrated the use of NVIDIA CUDA technology, shown in Fig. 5, to achieve speedup; the results showed a speedup by a factor of 21 compared with single-core simulation. An NVIDIA graphics processing unit consists of a varying number of cores on a single chip. The basic processing unit within a streaming multiprocessor (SM) is called a streaming processor (SP). Parallel programming is made possible by hierarchical threads and parallel memory. Programs run in parallel in single-instruction, multiple-thread (SIMT) mode.
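One common way to expose the kind of data parallelism that a SIMT machine exploits is to levelize the netlist, so that every gate in a level depends only on earlier levels and all gates in that level could be evaluated by independent threads. The sketch below emulates this in plain Python; it is not the partitioning scheme of [5], and the netlist format, the `OPS` table and the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# A gate is (operation, input signal names); the dict key is its output name.
Netlist = Dict[str, Tuple[str, Tuple[str, ...]]]

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a: a ^ 1,
}


def levelize(netlist: Netlist, primary_inputs: Iterable[str]) -> List[List[str]]:
    """Group gates into levels; gates in one level never feed each other."""
    known = set(primary_inputs)
    remaining = dict(netlist)
    levels: List[List[str]] = []
    while remaining:
        ready = sorted(out for out, (_, ins) in remaining.items()
                       if all(i in known for i in ins))
        if not ready:
            raise ValueError("combinational loop or undriven signal")
        for out in ready:
            del remaining[out]
        known.update(ready)
        levels.append(ready)
    return levels


def simulate(netlist: Netlist, inputs: Dict[str, int]) -> Dict[str, int]:
    """Evaluate level by level; each level is one data-parallel step."""
    values = dict(inputs)
    for level in levelize(netlist, inputs):
        # On a GPU, each gate in `level` could be handled by its own thread.
        results = [OPS[netlist[out][0]](*(values[i] for i in netlist[out][1]))
                   for out in level]
        values.update(zip(level, results))
    return values


# Hypothetical example circuit: a one-bit full adder.
full_adder: Netlist = {
    "x1": ("XOR", ("a", "b")),
    "a1": ("AND", ("a", "b")),
    "sum": ("XOR", ("x1", "cin")),
    "a2": ("AND", ("x1", "cin")),
    "cout": ("OR", ("a1", "a2")),
}

print(levelize(full_adder, ["a", "b", "cin"]))
print(simulate(full_adder, {"a": 1, "b": 1, "cin": 0}))
```

The number of levels bounds the number of sequential steps, while the width of each level indicates how many threads could be kept busy, which is one reason load balance matters for GPU-based simulation.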
[Figure 5: Basic CUDA architecture]

8. SIMD Parallelization Method

The author [6] proposed a technique in which the netlist is converted into a task graph, shown in Fig. 6. A dependency check identifies logic interdependencies, and multiple tasks are executed in parallel using a SIMD parallelization method in which a single SIMD instruction computes the outputs of several distinct tasks. Task scheduling assigns tasks to machines dynamically.

[Figure 6: Task graph]

9. Multilevel Temporal Parallel Event Driven Simulation

The author [7] suggested a novel approach to gate-level parallel simulation that parallelizes over time slices (shown in Fig. 7) rather than partitioning the design. This avoids synchronization and message communication between partitions. The simulation time is partitioned into slices.

[Figure 7: Simulation slices]

The MULTES approach is as follows. During an initial, or reference, simulation the internal state at each slice is stored. Then, during parallel simulation, each slice is mapped to a processor. The number of slices depends on the number of processors.
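The time-slice idea can be sketched with a toy sequential circuit. This is a simplified illustration under stated assumptions, not the MULTES implementation: the same `step` function stands in for both the reference model and the detailed model, the 4-bit accumulator is a made-up design, and all function names are hypothetical. Python threads are used only to show the structure; they would not give a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def step(state: int, inp: int) -> int:
    """Toy sequential circuit: a 4-bit accumulator."""
    return (state + inp) & 0xF


def reference_run(stimulus: List[int], slice_len: int, init: int = 0) -> List[int]:
    """Reference simulation that records the state at the start of each slice."""
    checkpoints = []
    state = init
    for t, inp in enumerate(stimulus):
        if t % slice_len == 0:
            checkpoints.append(state)
        state = step(state, inp)
    return checkpoints


def simulate_slice(start_state: int, inputs: List[int]) -> List[int]:
    """Detailed simulation of one slice, started from a stored state."""
    trace = []
    state = start_state
    for inp in inputs:
        state = step(state, inp)
        trace.append(state)
    return trace


def parallel_slices(stimulus: List[int], slice_len: int, workers: int = 2) -> List[int]:
    """Simulate all slices independently and stitch the traces together."""
    checkpoints = reference_run(stimulus, slice_len)
    slices = [stimulus[i:i + slice_len]
              for i in range(0, len(stimulus), slice_len)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        traces = list(pool.map(simulate_slice, checkpoints, slices))
    return [state for trace in traces for state in trace]


stimulus = [1, 2, 3, 4, 5, 6, 7, 8]
print(parallel_slices(stimulus, slice_len=3))
print(simulate_slice(0, stimulus))  # sequential run for comparison
```

Because every slice starts from a stored state, the slices never exchange messages during the parallel phase, which is why this style of simulation sidesteps the synchronization cost of spatial partitioning.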
10. CMB based Simulator

The author [8] implemented a compiled simulation of Verilog HDL in which the design is translated into LPs consisting of C++ functions. The Icarus Verilog simulator is used as the Verilog parser for the translation. An LP is the smallest executable block that can be scheduled on a processor. Fig. 8 shows the scheduling of LPs. First-in, first-out queues are used to pass messages between LPs. The author used the CMB (Chandy-Misra-Bryant) algorithm as the backbone of the parallel simulation and obtained a speedup on certain benchmark circuits compared with other simulators such as ModelSim.

[Figure 8: Scheduling of LPs. Labels: LP1, LP2, LP3, LP4, SCHEDULER, OS1, OS2, OS3, OS4.]

11. Parallel Simulation Using OpenMP and Verilator

The author [9] used an open-source Verilog simulator called Verilator, which translates Verilog HDL code into executable C/C++. Because it is open source, it is widely used in industry and academia. The generated code can be executed in parallel using the OpenMP API. The author experimented with the flow using two partitioning schemes, domain partitioning and functional partitioning. Domain partitioning partitions the input data into D1, D2, ..., Dn, whereas functional partitioning partitions the design into different modules. Fig. 9 shows the domain partitioning strategy and Fig. 10 shows the functional partitioning scheme.

[Figure 9: Domain partitioning. Labels: Data1, Design, Core1; Data2, Design, Core2.]

[Figure 10: Functional partitioning. Labels: Module1, Core1; Module2, Core2.]
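Domain partitioning as described above can be sketched in a few lines: the input data is split into chunks and every worker runs its own copy of the same design on one chunk. This is an illustration only and assumes the input vectors are independent of one another (for example, a combinational design or separate test cases). It uses Python threads in place of OpenMP, and the `design` function, a one-bit full adder, and the other names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import List, Tuple

Vector = Tuple[int, int, int]


def design(vector: Vector) -> Tuple[int, int]:
    """The design under test: here, a one-bit full adder (sum, carry)."""
    a, b, cin = vector
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def split(data: List[Vector], parts: int) -> List[List[Vector]]:
    """Split the input data into at most `parts` contiguous chunks D1..Dn."""
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_chunk(chunk: List[Vector]) -> List[Tuple[int, int]]:
    """One worker simulates the whole design on its share of the data."""
    return [design(v) for v in chunk]


def domain_partitioned_run(vectors: List[Vector], cores: int = 2) -> List[Tuple[int, int]]:
    chunks = split(vectors, cores)
    with ThreadPoolExecutor(max_workers=cores) as pool:
        results = list(pool.map(run_chunk, chunks))
    # pool.map preserves chunk order, so results line up with the inputs.
    return [r for chunk in results for r in chunk]


vectors = list(product((0, 1), repeat=3))
print(domain_partitioned_run(vectors, cores=3))
```

Functional partitioning would instead give each worker a different module and require the workers to exchange signal values, which is where the communication and synchronization overheads discussed in Section 2 come back in.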
12. ATPG based True Value Simulation

The author [10] proposed calculating a real value at the output of each gate. The value corresponding to the output distance is calculated as the difference between the bad-machine and good-machine values. The threshold value is set to 0.5. The approach is based on Cheng and Agrawal's value calculation at the gate output. The author tested the methodology on 720 benchmark circuits containing 30,000 SA0 and SA1 faults in total and obtained a speedup of 5.3 using GPGPU compared with serial processing. Fig. 11 shows the calculation time.

[Figure 11: Calculation of output value]

13. Conclusion

We have reviewed many implementations of parallel logic simulation in which the authors used different approaches to speed up simulation. Speedup is difficult to obtain without reducing communication overhead, adopting better partitioning techniques and using better simulators. If all of these factors are taken into account, speedup can be achieved.

References

[1] Li T., Li S., Ao F., Li G., Parallel Verilog simulation: architecture and circuit partition, Proceedings of the Asia and South Pacific Design Automation Conference (2004).
[2] Zhu L., Chen G., Szymanski B.K., Tropper C., Zhang T., Parallel logic simulation of million-gate VLSI circuits, 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (2005).
[3] Le T.T., Rejeb J., Performance of parallel logic event simulation on PC-cluster, 7th International Symposium on Parallel Architectures, Algorithms and Networks (2004).
[4] Meraji S., Tropper C., A machine learning approach for optimizing parallel logic simulation, 39th International Conference on Parallel Processing (ICPP) (2010).
[5] Zhang Y., Wei T., Kai Y., Fan X., Zhang M., Zhao L., Logic simulation acceleration based on GPU, 18th International Conference Mixed Design of Integrated Circuits and Systems (2011).
[6] Kai N., Nishinohara R., Koide H., A SIMD parallelization method for an application for LSI logic simulation, 41st International Conference on Parallel Processing Workshops (ICPPW) (2012).
[7] Kim D., Ciesielski M., Yang S., MULTES: multilevel temporal-parallel event-driven simulation, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 32(6) (2013).
[8] Lingfeng W., Hong C., Yangdong Steve D., Robust Conservative Parallel HDL Simulation on Multi-Core CPUs, International Conference on HPCS (2013).
[9] Tariq B.A., Maciej C., Parallel Multi-core Verilog HDL Simulation based on Domain, IEEE Computer Society Annual Symposium on VLSI (2014).
[10] Goro S., ATGPS Using Real Value Logic Simulation, 12th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT).
Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing
More informationThe Need for Speed: Understanding design factors that make multicore parallel simulations efficient
The Need for Speed: Understanding design factors that make multicore parallel simulations efficient Shobana Sudhakar Design & Verification Technology Mentor Graphics Wilsonville, OR shobana_sudhakar@mentor.com
More informationIntroduction to GPU programming with CUDA
Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU
More informationParallel Implementation of VLSI Gate Placement in CUDA
ME 759: Project Report Parallel Implementation of VLSI Gate Placement in CUDA Movers and Placers Kai Zhao Snehal Mhatre December 21, 2015 1 Table of Contents 1. Introduction...... 3 2. Problem Formulation...
More informationEXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System
EXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System By Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick
More informationEEL 5722C Field-Programmable Gate Array Design
EEL 5722C Field-Programmable Gate Array Design Lecture 19: Hardware-Software Co-Simulation* Prof. Mingjie Lin * Rabi Mahapatra, CpSc489 1 How to cosimulate? How to simulate hardware components of a mixed
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationSubset Sum Problem Parallel Solution
Subset Sum Problem Parallel Solution Project Report Harshit Shah hrs8207@rit.edu Rochester Institute of Technology, NY, USA 1. Overview Subset sum problem is NP-complete problem which can be solved in
More informationInitial Evaluation of a User-Level Device Driver Framework
Initial Evaluation of a User-Level Device Driver Framework Stefan Götz Karlsruhe University Germany sgoetz@ira.uka.de Kevin Elphinstone National ICT Australia University of New South Wales kevine@cse.unsw.edu.au
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationThe Use of Cloud Computing Resources in an HPC Environment
The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes
More informationCover TBD. intel Quartus prime Design software
Cover TBD intel Quartus prime Design software Fastest Path to Your Design The Intel Quartus Prime software is revolutionary in performance and productivity for FPGA, CPLD, and SoC designs, providing a
More informationTrends and Challenges in Multicore Programming
Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores
More informationThe future is parallel but it may not be easy
The future is parallel but it may not be easy Michael J. Flynn Maxeler and Stanford University M. J. Flynn 1 HiPC Dec 07 Outline I The big technology tradeoffs: area, time, power HPC: What s new at the
More informationIntegrating MRPSOC with multigrain parallelism for improvement of performance
Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,
More informationSystemC Implementation of VLSI Embedded Systems for MEMS. Application
Fourth LACCEI International Latin American and Caribbean Conference for Engineering and Technology (LACCET 2006) Breaking Frontiers and Barriers in Engineering: Education, Research and Practice 21-23 June
More information8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments
8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments QII51017-9.0.0 Introduction The Quartus II incremental compilation feature allows you to partition a design, compile partitions
More informationA Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding
A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding N.Rajagopala krishnan, k.sivasuparamanyan, G.Ramadoss Abstract Field Programmable Gate Arrays (FPGAs) are widely
More informationIntroduction to Parallel Programming. Tuesday, April 17, 12
Introduction to Parallel Programming 1 Overview Parallel programming allows the user to use multiple cpus concurrently Reasons for parallel execution: shorten execution time by spreading the computational
More informationParallel Discrete Event Simulation for DEVS Cellular Models using a GPU
Parallel Discrete Event Simulation for DEVS Cellular Models using a GPU Moon Gi Seok and Tag Gon Kim Systems Modeling Simulation Laboratory Korea Advanced Institute of Science and Technology (KAIST) 373-1
More informationParallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model
Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationSystem Verification of Hardware Optimization Based on Edge Detection
Circuits and Systems, 2013, 4, 293-298 http://dx.doi.org/10.4236/cs.2013.43040 Published Online July 2013 (http://www.scirp.org/journal/cs) System Verification of Hardware Optimization Based on Edge Detection
More informationUser Guide for GLU V2.0
User Guide for GLU V2.0 Lebo Wang and Sheldon Tan University of California, Riverside June 2017 Contents 1 Introduction 1 2 Using GLU within a C/C++ program 2 2.1 GLU flowchart........................
More informationFramework of rcuda: An Overview
Framework of rcuda: An Overview Mohamed Hussain 1, M.B.Potdar 2, Third Viraj Choksi 3 11 Research scholar, VLSI & Embedded Systems, Gujarat Technological University, Ahmedabad, India 2 Project Director,
More informationA hardware operating system kernel for multi-processor systems
A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More information