Partitioning of computationally intensive tasks between FPGA and CPUs


Tobias Welti, MSc (Author), Institute of Embedded Systems, Zurich University of Applied Sciences, Winterthur, Switzerland
Matthias Rosenthal, PhD (Author), Institute of Embedded Systems, Zurich University of Applied Sciences, Winterthur, Switzerland

Abstract—With the recent development of faster and more complex Multiprocessor System-on-Chips (MPSoCs), a large number of different resources have become available on a single chip. For example, Xilinx's Zynq UltraScale+ is a powerful MPSoC with four ARM Cortex-A53 CPUs, two Cortex-R5 real-time cores, an FPGA fabric and a Mali-400 GPU. Optimal process partitioning between CPUs, real-time cores, GPU and FPGA is therefore a challenge. For many scientific applications with high sampling rates and real-time signal analysis, an FFT needs to be calculated and analyzed directly in the measuring device. The goal of partitioning such an FFT in an MPSoC is to make the best use of the available resources, to minimize latency and to optimize performance. This paper compares different partitioning designs and discusses their advantages and disadvantages. Measurement results with up to 250 MSamples per second are shown.

Keywords—FPGA; UltraScale+ MPSoC; partitioning; ARM NEON; SIMD; asymmetric multi-processing; high performance FFT; low latency processing

I. INTRODUCTION

The transition from field-programmable gate arrays (FPGAs) to System-on-Chips (SoCs) in 2011 was an unavoidable development once FPGAs needed to execute ever more complex software programs. The soft-core processors available for inclusion in the programmable logic were either not powerful enough or took up too many logic resources. The combination of hardware processors with the FPGA, interconnected through a high-performance bus, showed the potential of this architecture. With the recent development of faster and more complex Multiprocessor System-on-Chips (MPSoCs), many different resources are available on one chip.
For example, Xilinx's Zynq UltraScale+ MPSoC combines up to four ARM Cortex-A53 application processor cores, two ARM Cortex-R5 real-time cores, an ARM Mali-400 GPU as well as an FPGA fabric with programmable logic, on-chip memory, hardware multipliers (DSP slices) and many high-throughput I/Os. The challenge for the system architect has now become finding the optimal execution environment for the design's processes: the partitioning. The goal is to make the best use of the available resources, minimizing latency and optimizing performance.

In this paper, we use the Fast Fourier Transform as a computationally expensive algorithm that can be accelerated through several means:

- multiprocessing on several cores of the same type (Symmetric Multiprocessing)
- vector processing using a special instruction set for Single Instruction, Multiple Data (SIMD), available on most current processors
- using additional processors of a different type than the ones the main software runs on (Asymmetric Multiprocessing)
- generating accelerator functions that run in the FPGA fabric and using them as external functions
- running the whole algorithm in an FPGA core, controlled by a CPU core
- running the algorithm standalone in the FPGA

For each method, we present the communication paths and software architecture, along with performance data. The FFT is a well-studied algorithm, and many papers have been published on methods for efficient execution on specific multiprocessor architectures, e.g. [1], [2], [3], [4] and [5]. It is not the goal of this paper to improve on these methods, but to provide an overview and an understanding of the possibilities available in today's devices.

The paper is organized as follows: Section II introduces the FFT algorithm and how it can be calculated on multiple processing devices. In Section III, we discuss the partitioning methods based on software, executing on processor cores. The FPGA-based methods are explored in Section IV.
Section V elaborates on ways of collaboration between the FPGA and the processors. Finally, in Section VI we sum up the advantages of the presented methods.

II. FFT PARTITIONING

The Discrete Fourier Transform (DFT) is used to transform a sequence of samples from the time domain into the frequency domain to analyze the frequency components of the sampled signal. Spectral analysis, measuring and controlling, signal processing and quantum computing are but a few applications of the DFT. The DFT has a very high computational cost of O(N²). The Fast Fourier Transform (FFT) improves the efficiency of the transform by reducing the number of redundant calculations. This is achieved by splitting the sequence into smaller parts and performing the Fourier Transform on these, as shown in Fig. 1. In doing so, the computational cost can be reduced to O(N log N). Note that the splitting includes a reordering of the input values, effectively selecting every other input value for each subset.

Fig. 1. Principle of the FFT algorithm.

Since the FFT algorithm is a divide-and-conquer approach, it is well suited for parallel processing on multiple processors. Each core can perform the smaller FFT on its part of the data, independent of the remaining data. However, as shown in Fig. 1, there will be at least one step requiring data from other cores when combining the smaller FFTs into the complete spectrum. This all-to-all communication is a critical step because it requires synchronization of the cores. The optimization of this step has been the subject of several publications, e.g. [2] and [4]. The technique of calculating smaller FFTs and combining them into larger spectra also makes it possible to process FFTs larger than the memory of the processor, provided that the currently unused data is stored in an efficient way. Refs. [2], [4] and [6] describe possible implementations.

III. MULTICORE PROCESSING FOR FFT

A. Available resources

The Xilinx Zynq UltraScale+ MPSoC portfolio offers multiple ranges of SoCs with varying numbers of processor cores and FPGA fabric resources.
In this paper, we use the XCZU9EG device, an SoC with the following resources:

- four ARM Cortex-A53 application processing cores, running at 1100 MHz and featuring the NEON instruction set
- two ARM Cortex-R5 real-time processing cores with tightly-coupled memory (TCM) for low-latency access, running at 500 MHz
- an ARM Mali-400 GPU
- an FPGA fabric with 600'000 System Logic Cells, 32 Mbit of FPGA memory and 2520 hardware multipliers (DSP slices)

Fig. 2 shows the block diagram of the available resources. The ARM Cortex-A53 core is a mid-range application processing core that balances power usage against performance. It is equipped with the ARMv8 instruction set, including NEONv2 SIMD instructions for vectorized execution on multiple data (up to 128 bit wide). Four A53 cores make up the Application Processing Unit (APU), in the bottom right of Fig. 2. The ARM Cortex-R5 core is a real-time processor with a focus on fast reaction to events. Its 128 KB of TCM allow very fast memory accesses, but in turn limit the amount of data that can be worked on. Two R5 cores form the Real-Time Processing Unit (RPU) in the top right of Fig. 2. The Level 3 interconnect enables fast data transfers between the APU, the RPU and the FPGA fabric with on-chip memory, DSP slices and programmable logic.

B. Executing the FFT in Software

When executing an FFT in software, there is a choice of several FFT libraries, many of them capable of exploiting both multiprocessing and vector processing. ARM Ne10 [7] provides highly optimized ARM NEON intrinsics written in assembler, and its FFT algorithm makes use of these. However, it does not support multiprocessing. kissfft [8] is a very lightweight library with the goal of being easy to use and moderately efficient while supporting multiprocessing. However, it makes use of NEON instructions only to execute four separate FFTs in parallel instead of accelerating one FFT transform.
The fastest and most versatile FFT library tested in our work is FFTW3 [9], which exploits both multi-processing and NEON instructions and includes a mechanism to optimize the algorithm for the available hardware. This mechanism tests many possible FFT optimization algorithms, measuring performance and selecting the fastest one, as described in [10]. This is done to make the best possible use of first and second level caches, memory access speeds and other hardware characteristics. For our implementations, we used FFTW3 on the A53 and Ne10 on the R5.

Fig. 2. Block diagram of MPSoC.

C. Implementations

The software-only implementations were run on the Xilinx PetaLinux operating system, using the FFTW3 library to calculate double-precision floating-point complex FFTs. Using double precision limits the speed improvement for the NEON instruction set to a factor of two, because the NEON registers are 128 bit wide and can therefore accommodate only two double-precision floating-point values. The following five scenarios were tested:

a. Single-core A53
b. Single-core A53 with NEON instructions
c. Symmetric Multi-Processing (SMP) with four A53
d. Symmetric Multi-Processing (SMP) with four A53 with NEON instructions
e. Asymmetric Multi-Processing (AMP) with the R5 as coprocessor

Scenarios a-d require no special software stack except the pthread library for SMP. Scenario e requires additional frameworks and drivers for communication between the master A53 core and the remote R5 core to enable AMP. Fig. 3 shows the software architecture. Two operating systems are required: Linux on the APU master CPU and FreeRTOS on the RPU slave CPU. First, the APU boots Linux and uses the OpenAMP framework to load the RPU firmware into the TCM via a DMA transfer. The RPU is then booted out of the TCM. The remoteproc driver handles the life cycle management, allocates the required resources and creates a virtual I/O (virtio) device for each remote processor. RPMsg is the Remote Processor Messaging API that provides inter-process communication between processes on independent cores. The flow of OpenAMP booting and software execution is as follows (as in [11]):

1. The remote processor is configured as a virtual IO device; shared memory for message passing is reserved.
2. The master loads the firmware for the remote processor into its memory, then boots the remote processor.
3. After booting, the remote processor creates the virtio and RPMsg channels and informs the master.
4. The master invokes the callback channel and acknowledges the remote processor and application.
5. The remote processor invokes the RPMsg channel.
6. The RPMsg channel is established; both master and slave can initiate communication via RPMsg calls.
7. During operation, communication buffers in reserved shared DDR memory are used to pass messages between the master and the slave.

Usually, these buffers are small. To load larger amounts of data, such as the FFT input and output data, the data is written to or read from on-chip memory (OCM) of the R5, and the pointers are passed via the message buffer. Shutdown proceeds in the reverse sequence of the booting and initialization process.

D. Performance

To provide an overview of the performance, FFTs of three sizes (4 096, 16 384 and 65 536 data points) were calculated using the implementation scenarios a-d. The Cortex-R5 can only perform the smallest, 4 096-point FFTs due to its limited amount of OCM.

Fig. 3. Software stack for Asymmetric Multi-Processing (as in [11])

Table I shows the achieved calculation times in microseconds, composed of the times for loading the input data, executing the FFT and storing the result in memory. We also show the feasible sampling rates that would allow the CPUs to keep up a seamless processing of the input data.

TABLE I. SOFTWARE FFT PERFORMANCE
Scenario: a. A53 | b. A53 NEON | c. 4xA53 | d. 4xA53 NEON | e. R5 (a)
a. R5 includes OpenAMP communication overhead (approx. 100 µs)

It is evident from the data that the FFT scales well for multiprocessing. When using four A53 cores, a speed-up of up to a factor of 3.3 is observed. Scaling is better for large FFTs, because there are more calculation steps that do not require all-to-all communication. More detailed tests have been run with one to four cores, but for the sake of brevity, the data is not shown here. In summary, the speedup corresponds almost directly to the number of cores as long as there remains at least one core for the execution of the processes of the operating system. When all cores are used for multi-processing, the speedup is capped. A reasonable explanation is that the FFT is competing with the other processes, resulting in many context switches.

Using the NEON instruction set, a speedup of roughly 10% is observed for single-processing. For multi-processing, enabling NEON yields a speed gain of 5-20%. This is nowhere near the factor of two that could be expected, since two values can be processed at the same time. Among the possible reasons, we suspect the fact that the NEON instructions have their own execution pipeline and registers. If the algorithm cannot be optimized to 100% NEON instructions, there will be data transfers between NEON and standard registers. The R5 is clearly not designed for computationally heavy tasks; its target assignment is to react to events in real-time.
It has to be noted that the communication overhead of the OpenAMP framework contributes approximately 100 µs to the execution time, but even if this overhead could be avoided, the R5 would be no competition for the A53s.

IV. ACCELERATORS IN FPGA

Traditionally, bringing an algorithm to programmable logic means writing HDL code or using an existing IP core. Today, there are tools to generate HDL code from software code. Xilinx provides the SDSoC (Software Defined System on Chip) toolchain that generates logic blocks from C code along with the required data transfer logic. To allow the software to interface with this computation block, SDSoC compiles a software library with the required functions for configuring the FFT core, loading and storing data, as well as the necessary interrupt service functions. We have compared the performance of the following scenarios:

f. SDSoC accelerator controlled by A53
g. FFT IP core controlled by A53
h. FFT IP core working standalone

Scenario f: An SDSoC accelerator core can only be implemented for a fixed FFT size. Therefore, three accelerators were implemented in the programmable logic, clocked at 300 MHz. The big advantage of performing the FFT in programmable logic is that the processor core can perform other tasks in the meantime. The processor will still be required for loading and storing the data. Fig. 4 shows the block diagram of this setup.

Scenario g: The Xilinx FFT IP core can be configured for different FFT sizes at runtime. One instance of the FFT core is therefore sufficient for our tests. The processor will load the input data into on-chip memory in the FPGA fabric, then start the FFT core and finally transfer the processed data back to DDR memory. These transfers can be done by DMA, leaving the CPU free for other tasks. This setup is shown in Fig. 5.

Scenario h: If the input data is acquired in the FPGA fabric, there is no sense in transferring the data to DDR memory first, then loading it to FPGA on-chip memory for the FFT.
Instead, the FFT core is configured for constant, standalone operation on the input data stream, and a DMA stream is set up to transfer the output data to a range of reserved DDR memory, where the APU can retrieve the processed data for analysis (see Fig. 6).

Fig. 4. SDSoC accelerators in FPGA.
Fig. 5. FFT IP block, controlled by A53.

Fig. 6. FFT IP block, self-controlled.

Table II shows the achieved execution times for FPGA-accelerated FFT on the Zynq UltraScale+ MPSoC. Scenarios f and g have similar performance, showing that the SDSoC-generated HDL code is efficient and compares well to the manually optimized HDL code of the FFT IP.

TABLE II. FPGA FFT PERFORMANCE
Scenario: f. SDSoC | g. IP-A53 | h. IP-standalone

The more efficient data path in scenario h easily explains the difference in performance between scenarios g and h: the transfer of the input data from DDR to on-chip memory is omitted. In fact, the limiting factor for the sampling rate is the DMA transfer from the FPGA fabric into DDR memory.

V. COMBINING FPGA AND PROCESSING SYSTEM

The results in Sections III and IV show clearly that the FPGA easily outperforms the pure software implementations. Nevertheless, there are limits to the size of FFT that can be executed in the FPGA. The FFT IP core can process up to 64k points for one FFT. The SDSoC toolchain would allow creating accelerator functions for larger FFT sizes, but the available FPGA resources (DSP slices and on-chip memory) would be exhausted quickly. We have explored ways for the FPGA and processing system to collaborate in processing the FFT. The goal was a 64k-point FFT, using fewer resources in the FPGA while still maintaining good performance. As shown in Fig. 1, the FFT algorithm is divided into clearly defined steps that can be processed in separate units. Our idea was to process the first steps of the FFT in the FPGA fabric, then transfer the data into processor memory and do the remaining steps in software, as shown in Fig. 7. The FFT would be split into four 16k-point FFTs in the FPGA. These smaller FFTs can be processed either in parallel or in series. Parallel processing requires four FFT cores with four times the resource usage. For serial processing (Fig. 8), only one FFT core is implemented, but the data of the three remaining FFTs must be stored until the core is ready for processing. For efficiency reasons, this is best done in on-chip memory (BRAM).

Fig. 7. Partitioning FFT, parallel processing in FPGA.
Fig. 8. Partitioning FFT, serial processing in FPGA.

We found that the amount of BRAM resources used is similar for both the parallel and the serial approach. Table III shows the number of BRAM blocks and DSP slices used. Values in parentheses show the percentage of all available resources. Because the amount of data to be stored or processed is the same as for a 64k-point FFT, our approach even uses roughly the same amount of BRAM as the full 64k-point FFT core. With BRAM being the most limited FPGA resource for this application, there is no gain from partitioning the FFT between FPGA and processing system. Furthermore, the FFT calculation needs to be finished in the processing system, adding more latency and resource usage to the bill.

TABLE III. RESOURCE REQUIREMENTS OF PARTITIONED FFT
Scenario                        | BRAM      | DSP
Parallel FFT (4x 16k FFT)       | 232 (27%) | 180 (7%)
Serial FFT (1x 16k FFT & BRAM)  | 244 (25%) | 45 (2%)
Full FFT (1x 64k FFT)           | 238 (27%) | 54 (2%)

VI. DISCUSSION

For the FFT, we have shown that the FPGA fabric is able to perform several times faster than the complete processing system of the Zynq UltraScale+ MPSoC. This power can be harvested in several ways, be it as a stand-alone FFT processor or as an external accelerator function. Depending on the amount of processing to be done apart from the FFT, doing the whole transform in the processing system can also be an option, leaving more room in the FPGA. The decision where to execute an algorithm depends on many factors, such as:

- Where does the data originate? Try to keep it local, reducing the amount of data transfer.
- What are the required data rates? Can the amount of data be transferred over the L3 interconnect without interfering with the remaining processes?
- How well can the algorithm be split up and processed in parallel? The more an algorithm can be parallelized, the better the FPGA will perform in comparison to the processing system.
- How many FPGA resources can be spared for the algorithm?

In the end, it remains the challenge of the system architect to choose where and how the data is to be processed. A deep understanding of the algorithm and of both the processing system and the FPGA hardware is required.

REFERENCES
[1] P. N. Swarztrauber, "Multiprocessor FFTs," Parallel Computing, vol. 5, issues 1-2, 1987.
[2] J. Sánchez-Curto, P. Chamorro-Posada, "On a faster parallel implementation of the split-step Fourier method," Parallel Computing, vol. 34, issue 9.
[3] T. H. Cormen, D. M. Nicol, "Performing out-of-core FFTs on parallel disk systems," Parallel Computing, vol. 24, issue 1, pp. 5-20.
[4] E. Chu, A. George, "FFT algorithms and their adaptation to parallel processing," Linear Algebra and its Applications, vol. 284, issues 1-3.
[5] S. Xue, J. Wang, Y. Li and Q. Peng, "Parallel FFT implementation based on multi-core DSPs," 2011 International Conference on Computational Problem-Solving (ICCP), Chengdu, 2011.
[6] R. Lyons, "Computing large DFTs using small FFTs." [Online].
[7] ARM Ne10 Project. [Online].
[8] kissfft. [Online].
[9] FFTW3. [Online].
[10] M. Frigo, S. G. Johnson, "The Design and Implementation of FFTW3," Proc. IEEE, vol. 93, no. 2, pp. 216-231, 2005.
[11] Xilinx User Guide UG1211, "Zynq UltraScale+ MPSoC Software Acceleration Targeted Reference Design," Xilinx, Inc., 2017. [Online]: /2017_2/ug1211-zcu102-swaccel-trd.pdf.


More information

Affordable and power efficient computing for high energy physics: CPU and FFT benchmarks of ARM processors

Affordable and power efficient computing for high energy physics: CPU and FFT benchmarks of ARM processors Affordable and power efficient computing for high energy physics: CPU and FFT benchmarks of ARM processors Mitchell A Cox, Robert Reed and Bruce Mellado School of Physics, University of the Witwatersrand.

More information

Asymmetric MultiProcessing for embedded vision

Asymmetric MultiProcessing for embedded vision Asymmetric MultiProcessing for embedded vision D. Berardi, M. Brian, M. Melletti, A. Paccoia, M. Rodolfi, C. Salati, M. Sartori T3LAB Bologna, October 18, 2017 A Linux centered SW infrastructure for the

More information

CSC 553 Operating Systems

CSC 553 Operating Systems CSC 553 Operating Systems Lecture 1- Computer System Overview Operating System Exploits the hardware resources of one or more processors Provides a set of services to system users Manages secondary memory

More information

Designing, developing, debugging ARM Cortex-A and Cortex-M heterogeneous multi-processor systems

Designing, developing, debugging ARM Cortex-A and Cortex-M heterogeneous multi-processor systems Designing, developing, debugging ARM and heterogeneous multi-processor systems Kinjal Dave Senior Product Manager, ARM ARM Tech Symposia India December 7 th 2016 Topics Introduction System design Software

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Objectives of Chapter To provide a grand tour of the major computer system components:

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Ninth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Ninth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

New ARMv8-R technology for real-time control in safetyrelated

New ARMv8-R technology for real-time control in safetyrelated New ARMv8-R technology for real-time control in safetyrelated applications James Scobie Product manager ARM Technical Symposium China: Automotive, Industrial & Functional Safety October 31 st 2016 November

More information

Amber Baruffa Vincent Varouh

Amber Baruffa Vincent Varouh Amber Baruffa Vincent Varouh Advanced RISC Machine 1979 Acorn Computers Created 1985 first RISC processor (ARM1) 25,000 transistors 32-bit instruction set 16 general purpose registers Load/Store Multiple

More information

Simplify System Complexity

Simplify System Complexity Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint

More information

FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS

FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS Mike Ashworth, Graham Riley, Andrew Attwood and John Mawer Advanced Processor Technologies Group School

More information

ARM CORTEX-R52. Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture.

ARM CORTEX-R52. Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture. ARM CORTEX-R52 Course Family: ARMv8-R Cortex-R CPU Target Audience: Engineers and technicians who develop SoCs and systems based on the ARM Cortex-R52 architecture. Duration: 4 days Prerequisites and related

More information

RISC Processors and Parallel Processing. Section and 3.3.6

RISC Processors and Parallel Processing. Section and 3.3.6 RISC Processors and Parallel Processing Section 3.3.5 and 3.3.6 The Control Unit When a program is being executed it is actually the CPU receiving and executing a sequence of machine code instructions.

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

SoC Platforms and CPU Cores

SoC Platforms and CPU Cores SoC Platforms and CPU Cores COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Course Outline & Marks Distribution Hardware Before mid Memory After mid Linux

More information

Introduction to Embedded System Design using Zynq

Introduction to Embedded System Design using Zynq Introduction to Embedded System Design using Zynq Zynq Vivado 2015.2 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Operating Systems: Internals and Design Principles. Chapter 1 Computer System Overview Seventh Edition By William Stallings

Operating Systems: Internals and Design Principles. Chapter 1 Computer System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles No artifact designed by man

More information

FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS

FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS FPGA Acceleration of the LFRic Weather and Climate Model in the EuroExa Project Using Vivado HLS Mike Ashworth, Graham Riley, Andrew Attwood and John Mawer Advanced Processor Technologies Group School

More information

Copyright 2017 Xilinx.

Copyright 2017 Xilinx. All Programmable Automotive SoC Comparison XA Zynq UltraScale+ MPSoC ZU2/3EG, ZU4/5EV Devices XA Zynq -7000 SoC Z-7010/7020/7030 Devices Application Processor Real-Time Processor Quad-core ARM Cortex -A53

More information

Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors

Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors Paul Ekas, DSP Engineering, Altera Corp. pekas@altera.com, Tel: (408) 544-8388, Fax: (408) 544-6424 Altera Corp., 101

More information

Sri Vidya College of Engineering and Technology. EC6703 Embedded and Real Time Systems Unit IV Page 1.

Sri Vidya College of Engineering and Technology. EC6703 Embedded and Real Time Systems Unit IV Page 1. Sri Vidya College of Engineering and Technology ERTS Course Material EC6703 Embedded and Real Time Systems Page 1 Sri Vidya College of Engineering and Technology ERTS Course Material EC6703 Embedded and

More information

Five Ways to Build Flexibility into Industrial Applications with FPGAs

Five Ways to Build Flexibility into Industrial Applications with FPGAs GM/M/A\ANNETTE\2015\06\wp-01154- flexible-industrial.docx Five Ways to Build Flexibility into Industrial Applications with FPGAs by Jason Chiang and Stefano Zammattio, Altera Corporation WP-01154-2.0 White

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Operating Systems. Computer Science & Information Technology (CS) Rank under AIR 100

Operating Systems. Computer Science & Information Technology (CS) Rank under AIR 100 GATE- 2016-17 Postal Correspondence 1 Operating Systems Computer Science & Information Technology (CS) 20 Rank under AIR 100 Postal Correspondence Examination Oriented Theory, Practice Set Key concepts,

More information

Signal Conversion in a Modular Open Standard Form Factor. CASPER Workshop August 2017 Saeed Karamooz, VadaTech

Signal Conversion in a Modular Open Standard Form Factor. CASPER Workshop August 2017 Saeed Karamooz, VadaTech Signal Conversion in a Modular Open Standard Form Factor CASPER Workshop August 2017 Saeed Karamooz, VadaTech At VadaTech we are technology leaders First-to-market silicon Continuous innovation Open systems

More information

Multiprocessor Systems Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message pas

Multiprocessor Systems Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message pas Multiple processor systems 1 Multiprocessor Systems Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message passing multiprocessor, access

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:

More information

Operating Systems, Fall Lecture 9, Tiina Niklander 1

Operating Systems, Fall Lecture 9, Tiina Niklander 1 Multiprocessor Systems Multiple processor systems Ch 8.1 8.3 1 Continuous need for faster computers Multiprocessors: shared memory model, access time nanosec (ns) Multicomputers: message passing multiprocessor,

More information

HKG : OpenAMP Introduction. Wendy Liang

HKG : OpenAMP Introduction. Wendy Liang HKG2018-411: OpenAMP Introduction Wendy Liang Agenda OpenAMP Projects Overview OpenAMP Libraries Changes in Progress Future Improvements OpenAMP Projects Overview Introduction With today s sophisticated

More information

SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems

SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems SysAlloc: A Hardware Manager for Dynamic Memory Allocation in Heterogeneous Systems Zeping Xue, David Thomas zeping.xue10@imperial.ac.uk FPL 2015, London 4 Sept 2015 1 Single Processor System Allocator

More information

An Introduction to Asymmetric Multiprocessing: when this architecture can be a game changer and how to survive it.

An Introduction to Asymmetric Multiprocessing: when this architecture can be a game changer and how to survive it. An Introduction to Asymmetric Multiprocessing: when this architecture can be a game changer and how to survive it. Nicola La Gloria, Laura Nao Kynetics LLC Santa Clara, California Hybrid Architecture:

More information

Simplify System Complexity

Simplify System Complexity 1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller

More information

The ARM Cortex-A9 Processors

The ARM Cortex-A9 Processors The ARM Cortex-A9 Processors This whitepaper describes the details of the latest high performance processor design within the common ARM Cortex applications profile ARM Cortex-A9 MPCore processor: A multicore

More information

Big.LITTLE Processing with ARM Cortex -A15 & Cortex-A7

Big.LITTLE Processing with ARM Cortex -A15 & Cortex-A7 Big.LITTLE Processing with ARM Cortex -A15 & Cortex-A7 Improving Energy Efficiency in High-Performance Mobile Platforms Peter Greenhalgh, ARM September 2011 This paper presents the rationale and design

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

XMC-RFSOC-A. XMC Module Xilinx Zynq UltraScale+ RFSOC. Overview. Key Features. Typical Applications. Advanced Information Subject To Change

XMC-RFSOC-A. XMC Module Xilinx Zynq UltraScale+ RFSOC. Overview. Key Features. Typical Applications. Advanced Information Subject To Change Advanced Information Subject To Change XMC-RFSOC-A XMC Module Xilinx Zynq UltraScale+ RFSOC Overview PanaTeQ s XMC-RFSOC-A is a XMC module based on the Zynq UltraScale+ RFSoC device from Xilinx. The Zynq

More information

Altera SDK for OpenCL

Altera SDK for OpenCL Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group

More information

Leverage Vybrid's asymmetrical multicore architecture for real-time applications by Stefan Agner

Leverage Vybrid's asymmetrical multicore architecture for real-time applications by Stefan Agner Leverage Vybrid's asymmetrical multicore architecture for real-time applications 2014 by Stefan Agner Vybrid Family of ARM processors suitable for embedded devices VF3XX Single core no DDR VF5XX Single

More information

The Next Steps in the Evolution of Embedded Processors

The Next Steps in the Evolution of Embedded Processors The Next Steps in the Evolution of Embedded Processors Terry Kim Staff FAE, ARM Korea ARM Tech Forum Singapore July 12 th 2017 Cortex-M Processors Serving Connected Applications Energy grid Automotive

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study

Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study Soft-Core Embedded Processor-Based Built-In Self- Test of FPGAs: A Case Study Bradley F. Dutton, Graduate Student Member, IEEE, and Charles E. Stroud, Fellow, IEEE Dept. of Electrical and Computer Engineering

More information

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution

More information

Zynq Ultrascale Mpsoc For The System Architect Logtel

Zynq Ultrascale Mpsoc For The System Architect Logtel We have made it easy for you to find a PDF Ebooks without any digging. And by having access to our ebooks online or by storing it on your computer, you have convenient answers with zynq ultrascale mpsoc

More information

Extending Model-Based Design for HW/SW Design and Verification in MPSoCs Jim Tung MathWorks Fellow

Extending Model-Based Design for HW/SW Design and Verification in MPSoCs Jim Tung MathWorks Fellow Extending Model-Based Design for HW/SW Design and Verification in MPSoCs Jim Tung MathWorks Fellow jim@mathworks.com 2014 The MathWorks, Inc. 1 Model-Based Design: From Concept to Production RESEARCH DESIGN

More information

VPX3-ZU1. 3U OpenVPX Module Xilinx Zynq UltraScale+ MPSoC with FMC HPC Site. Overview. Key Features. Typical Applications

VPX3-ZU1. 3U OpenVPX Module Xilinx Zynq UltraScale+ MPSoC with FMC HPC Site. Overview. Key Features. Typical Applications VPX3-ZU1 3U OpenVPX Module Xilinx Zynq UltraScale+ MPSoC with FMC HPC Site Overview PanaTeQ s VPX3-ZU1 is a 3U OpenVPX module based on the Zynq UltraScale+ MultiProcessor SoC device from Xilinx. The Zynq

More information

ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7

ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7 ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7 Scherer Balázs Budapest University of Technology and Economics Department of Measurement and Information Systems BME-MIT 2018 Trends of 32-bit microcontrollers

More information

Lecture 9: MIMD Architecture

Lecture 9: MIMD Architecture Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is

More information

CPU offloading using SoC fabric Avnet Silica & Enclustra Seminar Getting started with Xilinx Zynq SoC Fribourg, April 26, 2017

CPU offloading using SoC fabric Avnet Silica & Enclustra Seminar Getting started with Xilinx Zynq SoC Fribourg, April 26, 2017 1 2 3 Introduction The next few slides give a short introduction of what CPU offloading is and how it can help improving system performance. 4 What is Offloading? Offloading means taking load from one

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Implementation of DSP Algorithms

Implementation of DSP Algorithms Implementation of DSP Algorithms Main frame computers Dedicated (application specific) architectures Programmable digital signal processors voice band data modem speech codec 1 PDSP and General-Purpose

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage ECE Temple University

Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage ECE Temple University Signal Processing Algorithms into Fixed Point FPGA Hardware Dennis Silage silage@temple.edu ECE Temple University www.temple.edu/scdl Signal Processing Algorithms into Fixed Point FPGA Hardware Motivation

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Software Driven Verification at SoC Level. Perspec System Verifier Overview

Software Driven Verification at SoC Level. Perspec System Verifier Overview Software Driven Verification at SoC Level Perspec System Verifier Overview June 2015 IP to SoC hardware/software integration and verification flows Cadence methodology and focus Applications (Basic to

More information