Design and Implementation of Quad Core Architecture Using FPGA

Similar documents
CS425 Computer Systems Architecture

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

Parallel Processing SIMD, Vector and GPU s cont.


Computing architectures Part 2 TMA4280 Introduction to Supercomputing

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications

A hardware operating system kernel for multi-processor systems

Simultaneous Multithreading (SMT)

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Simultaneous Multithreading Architecture

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

Like scalar processor Processes individual data items Item may be single integer or floating point number. - 1 of 15 - Superscalar Architectures

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Parallel Computer Architecture

Computer Architecture

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

A Prototype Multithreaded Associative SIMD Processor

Multi-core Architectures. Dr. Yingwu Zhu

Architectures of Flynn s taxonomy -- A Comparison of Methods

Kaisen Lin and Michael Conley

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Simultaneous Multithreading: a Platform for Next Generation Processors

Lecture 26: Parallel Processing. Spring 2018 Jason Tang

Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor

An FPGA Implementation of 8-bit RISC Microcontroller

Multicore Hardware and Parallelism

Simultaneous Multithreading (SMT)

Hardware-Based Speculation

Online Course Evaluation. What we will do in the last week?

FPGA Implementation of ALU Based Address Generation for Memory

Parallelism in Hardware

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

RISC Processors and Parallel Processing. Section and 3.3.6

WHY PARALLEL PROCESSING? (CE-401)

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)

Multithreaded Processors. Department of Electrical Engineering Stanford University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Introducing Multi-core Computing / Hyperthreading

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan

Cache Justification for Digital Signal Processors

Simultaneous Multithreading (SMT)

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen

THREAD LEVEL PARALLELISM

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

Integrating MRPSOC with multigrain parallelism for improvement of performance

Lecture 14: Multithreading

Microprocessor Architecture Dr. Charles Kim Howard University

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

VLSI Design of Multichannel AMBA AHB

Parallel Computing. Parallel Computing. Hwansoo Han

Reduced Instruction Set Computer

Design and Verification of Serial Peripheral Interface 1 Ananthula Srinivas, 2 M.Kiran Kumar, 3 Jugal Kishore Bhandari

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

Why Parallel Computing? Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Design of a Pipelined 32 Bit MIPS Processor with Floating Point Unit

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Multi-core Programming - Introduction

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

SAE5C Computer Organization and Architecture. Unit : I - V

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Designing a Dual Core Processor

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Why Parallel Computing?

45-year CPU Evolution: 1 Law -2 Equations

Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore

Chapter 13 Reduced Instruction Set Computers

Parallel Programming Multicore systems

Keywords: Soft Core Processor, Arithmetic and Logical Unit, Back End Implementation and Front End Implementation.

High performance, power-efficient DSPs based on the TI C64x

Novel Design of Dual Core RISC Architecture Implementation

32 bit Arithmetic Logical Unit (ALU) using VHDL

Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip

Chip Multiprocessors A Cost-effective Alternative to Simultaneous Multithreading

CS 654 Computer Architecture Summary. Peter Kemper

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

Hardware Software Codesign of Embedded Systems

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Parallel Computing: Parallel Architectures Jin, Hai

Multi-core Architectures. Dr. Yingwu Zhu

CS 152 Computer Architecture and Engineering. Lecture 14: Multithreading

Unit 9 : Fundamentals of Parallel Processing

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics

ISSN Vol.05, Issue.12, December-2017, Pages:

Hyperthreading Technology

Advanced Processor Architecture

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

THE latest generation of microprocessors uses a combination

DESIGN AND IMPLEMENTATION OF APPLICATION SPECIFIC 32-BITALU USING XILINX FPGA

Hakam Zaidan Stephen Moore

Transcription:

RESEARCH ARTICLE OPEN ACCESS Design and Implementation of Quad Core Architecture Using FPGA V.Prasanth 1, K.Raja Sekhar 2 1 M. TECH, ASSOC.PROF., HOD IN ECE, PRAGATI ENGG.COLLEGE, KAKINADA, A.P, INDIA. 2 M. TECH STUDENT (EMBEDDED SYSTEMS), PRAGATI ENGG.COLLEGE, KAKINADA, A.P, INDIA. Abstract The quad processor core is a design philosophy that has become a mainstream in Scientific and engineering applications. Increasing performance and gate capacity of recent FPGA devices permit complex logic systems to be implemented on a single programmable device. The Embedded multiprocessors face a new problem with thread safety [5]. It is caused by the shared memory, when thread safety is violated the processors can access the same value at the same time. Basically the processor performance can be increased by adopting clock scaling technique [4] and micro architectural enhancements. Therefore, designed a new Architecture called quad processor core architecture for SOC applications. This is implemented on the FPGA chip using VHDL.This architecture performs a simultaneous use of both parallel and distributed computing. The full architecture of a Quad processor core designed for realistic to perform arithmetic, logical, shifting and bit manipulate operations. The proposed quad processor core contains Homogeneous RISC processors [3],[7] added with pipelined processing units, multibus organization and I/o ports along with the other functional elements required to implement embedded SoC solutions. The designed Quad core performance issues like area, speed and power dissipation. Keywords MPSoC, Quad processor, Architecture, Embedded system design, Fpga. I. INTRODUCTION Quad core consists of four connected processors that are capable of communicating. This can be done on a single chip where the processors are connected typically by a multibus. Alternatively, the multiprocessor system can be in more than one chip, typically connected by some type of bus, and each chip can then be a multiprocessor system. A third option is a multiprocessor system working with more than one computer connected by a network, in which each computer can contain more than one chip, and each chip can contain more than one processor. Most modern supercomputers are built this way. A parallel system is presented with more than one task, known as threads. It is important to spread the workload over the entire processor, keeping the difference in idle time as low as possible. To do this, it is important to coordinate the work and workload between the processors. Here, it is most important to consider whether or not some processors are specialpurpose IP cores. To keep a system with N processors effective, it has to work with N or more threads so that each processor constantly has something to do. Furthermore, it is necessary for the processors to be able to communicate with each other, usually via a shared memory, where values that other processors can use are stored. This introduces the new problem of thread safety. When thread safety is violated, two processors (working threads) access the same value at the same time. Consider the following code representation: A=A+1 When two processors P1 and P2 execute this code, a number of different out come may arise due to the real fact that the code will be split into three parts. L1: get A; L2: add 1 to A; L3: store A; It could be that P1 will first execute L1, L2 and L3 and afterward P2 will execute L1, L2 and L3. It could also be that P1 will first execute L1 followed by P2 executing L1 and L2, giving another result. Therefore, some methods for restricting access to shared resources are necessary. These methods are known as Thread safety or synchronization. Moreover, it is necessary for each processor to have some internal memory, where the processor does not have to think about thread safety to speed up the processor. As an example, each processor needs to have a private stack. The benefits of having a Multiprocessor are as follows: 1. Faster calculations are made possible. 2. A more responsive system is created. 3. Different processors can be utilized for different tasks. 745 P a g e

In the future, we expect thread and process parallelism to become wide spread for two reasons: the nature of the applications and the nature of the operating system. The Researchers proposes two alternative micro architectures that exploit multiple threads of control. i.e. simultaneous multithreading (SMT) and chip multiprocessors (CMP).Chip multiprocessors (CMPs) use relatively simple singlethread processor cores that exploit only moderate amounts of parallelism within any one thread, while executing multiple threads in parallel across multiple processor cores.wide-issue superscalar processors exploit instruction-level parallelism (ILP) by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit threadlevel parallelism (TLP) by executing different threads in parallel on different multiprocessors. II. QUAD CORE ARCHITECTURE The Quad processor architecture is completely designed after implementing towards the embedded concurrent processor. The concurrent processor focuses more on the interaction between tasks, correct sequencing of the interactions or communication between the tasks, and the coordination of access to their resources to input and output devices are the key concerns during the design of concurrent computing system. For this the concurrent components like SIMD array, mapping and duplicate memories are added to each processor as shown in Fig.i. Fig.i: Designed QUAD CORE Architecture These modules independently work as a quad core that involves the both parallel and distributed concurrent strategy. III. DESIGN The proposed quad core executes the multiple tasks. These multi tasks may be implemented by the Height tree evaluation technique and reordering of instruction Execution is necessary for this proposed architecture. The Clock is the heart beat of any processor. The processor executes one instruction within one clock period. The quad core is responsible for many concurrent computing operations and it consists of three main modules. These are Core processor, Quad RISC Processor, I/O Ports. Generic RISC Processor are called scalar RISC Processor because they are designed to issue one instruction per cycle, similar to the base scalar processor. Four representatives RISC based processors from the year 1990, the sun SPARC, Intel i860, Motorola M88100, and AMD 29000.Each processor use 32 bit instruction length. The instruction set consists of 51 to 124 basic instructions. We consider these four processors as generic scalar RISC, issuing essentially only these four processors as scalar RISC, issuing essentially only one instruction per pipeline cycle. Among the four scalars RISC Processor, we choose to examine the sun SPARC and i860 architectures. The sun SPARC is derived from the original Berkeley RISC design. The concept of parallel and distributed computing Processor Core mainly consists of SIMD array, mapping, and duplicate memories. Duplicate memories: The most important thing in quad processor mapping is the boundaries and data flow of concurrent processors. Duplicate memories are important and make a design flexible and have a high throughput for both parallel and distributed strategies and are too efficient for concurrent Architecture. SIMD array: This SIMD array supports the multidimensional array of data. It allows the simultaneous use of multiple processors for solving a task. An SIMD array is a synchronous array of Processing Elements under the supervision of one 746 P a g e

control unit and all P.E s receive the same instruction broadcast from the control unit, but operate on different multiple data sets from distinct data streams. It is usually loads data into its duplicate memories before starting the computation. All these are working together as a single processor that involves both the parallel and distributed concurrent strategy. RISC Processors (RISC): In the present work, the design of an 8-bit data width Quad Reduced Instruction Set Computer (RISC) processor is presented. It was developed with implementation efficiency and simplicity in mind. It has a complete instruction set, program memories and data memories, general purpose registers and a simple Arithmetical Logical Unit (ALU) for basic operations. In this design, most of the instructions are uniform length and similar format. Arithmetic operations are restricted to CPU registers. The Instruction cycle consists of three stages namely fetches, decode and execute. Many numbers of RISC processors give the highest performance per unit area for parallel codes. A larger number of RISC processor cores allow a fine-grained ability to perform dynamic voltage scaling and power down. The RISC processor core with a simple architecture is easier to design and functionally verify. This processor is an economic element that is easy to shut down in the face of catastrophic defects and easier to reconfigure in the face of large parametric variation. consideration the logic-implementation-specific choices driven by the constraints of the target FPGA chip. It is vital to note that the choices in all three steps are interdependent. For instance, an initially appealing architectural technique may require logic design choices that lead toward a prohibitively expensive implementation. In this case, the choice of the architectural technique has to be reexamined. In case there are feasible implementations of the selected architectural technique, the required performance can be achieved by balancing the tradeoff between throughputs and operating frequency. Higher throughput necessitates higher design complexity. On the other hand, to achieve high operating frequency, the design complexity should be kept low. As the resulting maximum frequency is highly influenced by the CAD tools that automatically place and route the circuit. IV. FPGA IMPLEMENTATION Modern FPGAs are large enough to implement Quad-Processor Systems-on-Chip. Commercial FPGA companies also provide system design tools. To aid our design decisions, we devise a design methodology adapted to the specific challenges we are facing. Our methodology for deriving a digital logic Implementation of the required functionalities encompasses three steps, each of which comprises a set of choices. The criteria we employ to evaluate the final set of choices are complexity and scalability. We define the complexity of a unit by two metrics: the maximum operating frequency and resource usage. Lower complexity is better; that is, the unit has a higher frequency and consumes less resource. We define scalability as the ability of the unit to expand in a chosen dimension with a minimal increase in complexity. To determine complexity and scalability, we examine the FPGA mapping of the design. In the first step, we start by choosing a particular architectural technique that provides a certain subset of the required functionalities. The second step considers the structural design choices of each architectural technique in terms of the temporal and spatial parallelism necessary to meet throughput and latency requirements. In the third step, we take into Fig.ii: Flow Chart for Implementation of Quad Core Architecture 747 P a g e

Fig.iii:Simulation results of Single ALU Executing Addition operation. Fig.vii:Simulation results of Quad Core Processor main module The above figures gives the results of designed quad core architecture which is simulated by using model sim simulator and verifying the parallel and distributed computing which show that the desinge dprocessor exhibits the concurrent results under the quad processorplatform. Fig.iv:Simulation results of Single ALU Executing Subtraction operation. Fig.viii: Floor Planning Diagram. Fig.v:Simulation results of Single Processor Executing Addition operation in both parallel &distributing computing. Fig.vi:Simulation results of Executing 2 s complement operation. Floor planning diagram of implemented design represents the total modules in the system and their internal connections of I/O ports as shown in the Fig.viii. V. CONCLUSION The designed quad processor core is very efficient and powerful core architecture, which is analyzed by using Xilinx software and simulation by using model sim simulator. Although the quad processor core was based on well-proven superscalar architectural techniques, including parallel execution and register renaming, these techniques were originally devised for fine-grained parallelism among instructions. The designed quad core implementation using FPGA which is very advantageous to the architecture,because it uses the parallel processing and pipielinig techniques are easily implemented in FPGA.The concurrent operation of quad processor gives the high speed and takes less power. 748 P a g e

REFERENCES [1] Tai-Hua, Lu, Chung-Ho Chen, KuenJong Lee. Effective Hybrid Test Program Development for Software-Based Self- Testing of Quad Cores,IEEE Manuscript received April 03, 2012, revised August 14, 2012, first published December 18, 2012. [2] Gohringer, D., Hubner, M.Perschke, T., Becker. J. New Dimensions for Quad core Architectures Demand Heterogeneity, Infrastructure and Performance through reconfigurability The EMPSoC Approach. In Proc of FPL 2010, PP.495-498, Sept 2010. [3] Lysaght, P. Blodget, B. Mason, J.Young, B.Bridgford. Invited Paper: Quad core design Methodologies and CAD Tools for Dynamic Reconfiguration of Xilinx FPGAs. In Proceedings of FPL 2009, August 2009. [4] D. Tullsen, S. Eggers, and H. Levy, Simultaneous Multithreading: Maximizing On- Chip Parallelism, Proc. 22nd Ann. Int l Symp. Computer Architecture, ACM Press, New York,1995, pp. 392-403. [5] J. lo, S. Eggers, J. Emer, H. Levy, R. Sstamm, and D. Tullsen. Converting thread level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15(2), pp. 323-354, August 1997. [6] Lance, Hammond, Basem, Ku.umen ANayfeh, Kunle Olukotun.ASingle-Chip multiprocessor. IEEE Computer, vol. 30, no. 9, pp. 79--85, September1997. [7] J. Borkenhagen, R. Eickemeyer, and R. Kalla : A Multithreaded PowerPC Processor for Commercial Servers, IBM Journal of Research and Development, November 2000, Vol. 44, No. 6, pp.1995. AUTHORS PRASANTH VARASALA Completed B.Tech in 2004, M.tech in VLSI system Design in 2010 and Pursuing P.hD in JNTUK in Low power VLSI. Areas of interests are low power designing, realtime applications, Embedded Systems, Digital Designing.He is an Assoc. Professor and HOD of ECE in Pragathi Engineering College, Surampalem. K.RAJA SEKHAR Completed B.Tech in 2010 in Gudlavalleru Engineering College and presently doing his Master degree in Embedded Systems from JNTU Kakinada University in Pragathi Engineering College, Surampalem. Areas of interests are realtime applications, Embedded Systems, Digital Designing. 749 P a g e