An OpenCL-based Framework for Rapid Virtual Prototyping of Heterogeneous Architectures

Efstathios Sotiriou-Xanthopoulos, Leonard Masing, Kostas Siozios, George Economakos, Dimitrios Soudris and Jürgen Becker

School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece {stasot, ksiop, geconom,
Institute for Information Processing, Karlsruhe Institute of Technology, Karlsruhe, Germany {leonard.masing,

Abstract

The increasing performance and power requirements in embedded systems have led to a variety of heterogeneous hardware architectures, featuring many different types of processing elements. This heterogeneity, however, induces extra effort on system development and programming. To address this heterogeneity, OpenCL provides a portable programming model which enables the use of one source code in various architectures featuring different types of processors. Such systems also impose higher design complexity, due to the increased number of hardware components. Virtual Prototyping aims to alleviate this issue by enabling hardware modeling at higher abstraction levels. This paper combines the benefits of OpenCL with Virtual Prototyping by proposing an OpenCL-based framework for rapid prototyping, which (a) automatically derives a virtual prototype from an OpenCL code; (b) executes the application by running the host program along with the hardware simulation; and (c) proposes a design flow for faster system evaluation, as compared to the state-of-the-art FPGA-based flow. Using a set of benchmarks, it is shown that the proposed framework enables faster prototyping by up to 18×, as compared to the state-of-the-art flow.

I. INTRODUCTION

Due to the ever-increasing need for more processing power despite the limited energy budget available, efficient data processing is becoming more and more imperative.
Thus, heterogeneous multi-processor Systems-on-Chip (MPSoC) have been an effective choice, as their hardware components can be customized to the exact application requirements. To exploit their full potential, choosing the right architecture for the running application is of utmost importance. This, however, imposes increased design effort for both software and hardware, especially when different types of processors (e.g. CPUs, GPUs, FPGAs etc.) are taken into consideration. To solve the difficulties imposed by the programming heterogeneity of such platforms, OpenCL [1] provides a portable programming model which enables the programming of different types of processing elements, without the need for adapting the source code to each type. Hence, the designer is able to investigate multiple data processing architectures without extra programming effort. Although OpenCL was originally directed at GPU programming, the FPGA community is increasingly adopting it, thus enabling easier and more efficient programming of FPGA devices. However, in the FPGA or ASIC world, the nearly limitless customization options during MPSoC design increase the design complexity. This is caused by the numerous architectural parameters in RTL design (e.g. when using FPGAs). Thus, choosing an efficient architecture is a tedious and slow task; even for experienced developers, doing this task manually can take a lot of effort and man-months, making it infeasible for most use cases.

Fig. 1. Typical examples of heterogeneous architectures to be taken into consideration during SoC design. Architectures (b), (c) and (d) are not supported by state-of-the-art OpenCL-based development frameworks.

This work was partially supported by "TEAChER: TEach AdvanCEd Reconfigurable architectures and tools", a project funded by DAAD (2014).
Virtual Prototyping has been proposed to alleviate this problem: The hardware is modeled in a software representation called Virtual Platform (VP), typically written in SystemC. The main benefit of such an approach is that the hardware can be modeled at various abstraction levels, in each of which a number of architectural details is removed. This limits the architectural parameter combinations, especially in early design stages when some of the architectural details are not yet available. As a result, software development and design space exploration can start early, leading to easier bug fixing, better design space coverage and shorter time-to-market. The VP might also serve as a golden reference for the development team.

The goal of this work is to propose a rapid Virtual Prototyping framework which (a) enables the automated modeling of heterogeneous hardware architectures by taking as input an OpenCL source code; (b) provides a platform-independent simulation environment between the hardware model and the host (i.e. the processor that coordinates the simulated hardware) without the need for real hardware platforms; and (c) is accompanied by a design flow that features faster development and evaluation cycles during a design space exploration procedure, as compared to state-of-the-art FPGA-based system design using OpenCL.

The paper is structured as follows: Sections II and III present the motivation and the related work of this paper. Section IV explains the proposed methodology. Section V shows the experimental results and discusses the insights gained. Finally, we conclude our work in Section VI.

II. MOTIVATION

To better clarify the benefits of combining OpenCL with Virtual Prototyping, we consider two scenarios, which are related to state-of-the-art design flows: (i) OpenCL without using VPs (e.g. using FPGA only) and (ii) VP modeling without using OpenCL (i.e. manual hardware modeling and programming according to the hardware architecture).
OpenCL without VPs: The designer would use an FPGA-based platform for the system design and evaluation. This leads to a vendor-dependent low-level RTL design with a specific supported architectural scheme, similar to Figure 1a, which depicts a typical bus-based SoC with one FPGA fabric and a dual-core ARM CPU. However, there might be alternative architectures to be taken into consideration during SoC design: For example, Figure 1b depicts a cluster of sub-systems, following the architectural pattern of Synopsys HAPS [2]: Each sub-system is similar to the SoC of Figure 1a and executes a different set of threads. Figure 1c depicts a SoC where the CPUs and memories are decoupled from the FPGAs, while Figure 1d shows a NoC-based system, quite similar to [3], incorporating CPUs, FPGAs, ASICs and distributed memory into different modules. The architectures of Figures 1b, 1c and 1d are very difficult to prototype in real hardware, especially because of the increased cost of acquiring such hardware platforms. Hence, using VPs in OpenCL-based systems (i) facilitates cost-free vendor-independent rapid prototyping, (ii) allows for easier and faster platform debugging and timing/power metrics evaluation, (iii) provides extensive architectural flexibility and (iv) enables iterative platform refinement with a small set of architectural details in each design stage.

VP modeling without OpenCL: A typical VP for heterogeneous MPSoCs requires (a) the software programming, (b) the modeling of computation for hardware accelerators and (c) the modeling of the interconnection. Modifying one of these elements might result in modifications to the other parts as well. For example, re-assigning a task to another processor type might lead to software changes for handling the newly-assigned component. In another example, a bus might need different accelerator modeling than a NoC: in the latter case, the accelerator might be adapted in order to exploit the parallelization of transactions.
Therefore, various portability issues arise between different architectural schemes incorporating different processors, memory organization and interconnection schemes. This is alleviated by using OpenCL during prototyping, as OpenCL is able to provide a portable programming and simulation environment which is adapted to any architectural scheme, without the need for software or hardware model modifications, while also enabling the easy runtime assignment of the application tasks onto the processing elements.

III. RELATED WORK

Due to the functional portability it provides, OpenCL has been extensively supported in GPU programming, in order to abstract away the complex programming model of GPUs. Typical examples are the development environments of AMD [4] and NVIDIA [5]. A survey on the performance and portability of OpenCL in GPUs is provided by [6]. Apart from GPUs, an ever-increasing effort is made for adopting OpenCL in FPGA design. The most typical example is Altera SDK for OpenCL [7], an OpenCL-based development and execution environment that allows for the automatic synthesis of OpenCL code down to FPGA bitstream, while including the appropriate communication environment between the host program and the FPGA-mapped accelerator(s). Xilinx also adopted OpenCL by providing the Xilinx SDAccel environment [8], which provides an integrated development and runtime solution from C, C++ or OpenCL sources down to FPGA-mapped applications [9], as well as by enhancing the Vivado HLS tool with OpenCL support (however only for high-level synthesis) [10].

Although the above OpenCL-based development environments for FPGA programming are evolving more and more, they suffer from the inherent constraints of FPGA-based system design: The system design is made in RTL, by executing the whole flow which is required in order to (a) transform an OpenCL code into a hardware description and (b) map the description onto an FPGA, e.g. by using Quartus in the case of the Altera SDK for OpenCL. Moreover, the system evaluation is only made by using a real (and potentially expensive) FPGA board. Apart from the cost, there is no explicit support for alternative architectures involving CPUs other than those provided by the SoC fabric of the FPGA board. Last but not least, Altera SDK for OpenCL requires a license for compiling an OpenCL description of the accelerator, while SDAccel can be obtained only after contacting Xilinx.

Fig. 2. OpenCL execution and memory model (for one-dimension indices).

This paper argues that the above issues can be alleviated by combining the portability of OpenCL with the abstracted hardware modeling of Virtual Prototyping. The most relevant example of such a combination is the emulator of Altera SDK for OpenCL, which however is suitable only for functional verification. Moreover, prototyping frameworks that enable automatic VP creation, e.g. Mentor Vista Virtual Prototyping [11], do not explicitly support OpenCL applications, while they also focus on software development. To the best of our knowledge, there is no OpenCL-based framework for virtual prototyping of heterogeneous SoCs in multiple abstraction levels. In contrast, our proposed prototyping framework addresses the above issues by (a) providing an automated flow for deriving a SystemC-based VP from an OpenCL source, and (b) enabling the simulation with different configurations, without the need for existing hardware. The vendor-independent nature of the framework enables the use of numerous different architectural schemes which might be difficult to map onto an FPGA.

IV. SYSTEMC PROTOTYPING METHODOLOGY FOR OPENCL APPLICATIONS

After a brief background of the OpenCL execution model, this section analyzes the proposed prototyping framework.
This analysis includes the framework structure and functionality, as well as a prototyping flow for converting a set of OpenCL kernels into VP modules.

A. Background on OpenCL execution model

The proposed prototyping framework is based on the OpenCL 1.0 specification [12], according to which an OpenCL application consists of two main parts: (a) the host and (b) a number of kernels, as the execution model example of Figure 2 depicts. The kernels part is organized as an OpenCL context, i.e. a unified environment which contains the kernels executable (a.k.a. program), the kernel instances (a.k.a. work-items), the utilized devices (footnote 1) and the memories. Therefore, the host controls the kernel instances and the respective devices which are included in this context. Each work-item corresponds to a specific part of the kernel's execution, such as a single iteration of a for loop or a branch of an if-else block. To define which parts should be executed in each work-item, built-in OpenCL functions return the global and the local index of the work-item. Although the example of Figure 2 utilizes one-dimension indices, the developer can use up to three dimensions. The global index distinguishes each work-item from the others. However, the work-items might be organized in work-groups. In that case, the local index is used for identifying a work-item inside a specific work-group. In the example of Figure 2, given W work-groups of i work-items each, the global index range is $0, \dots, W \cdot i - 1$, while the local index range for each work-group is $0, \dots, i - 1$.

Footnote 1: It is possible to use multiple devices in one context as well.

Fig. 3. The structure of the proposed prototyping framework.

Fig. 4. Using a single OVP-based platform versus using separate platforms.

This grouping is related to the OpenCL memory model. There are four distinct memory types: (a) the global memory, which is visible to any work-item, as well as to the host; (b) the constant memory, i.e. a read-only global memory; (c) the local memory, which is visible only to the work-items of a single work-group (each work-group has its own local memory); and (d) the private memory, which is used only inside a specific work-item. This memory model allows for multiple simultaneous memory accesses when using local memories for temporary data sharing, thus leveraging the parallelization potential provided by OpenCL. To avoid race conditions, synchronization mechanisms known as barriers can be used inside the kernel code, for global, local or both memories.

Therefore, within a specific context, (1) the host selects the execution of one of the kernels and defines a set of buffers for data sending/receiving.
(2) After enqueuing (footnote 2) the input data to be sent to the global and/or constant memory, (3) the host invokes (i.e. triggers) the kernels by enqueuing an NDRange command, which involves the creation of an N-dimensional range of work-items and work-groups. Afterwards, (4) the data and the command are flushed to the deployed devices. (5) When the kernel has been executed, an event is returned to the host. (6) The host enqueues a command for data reception. This typical flow is repeated for each kernel.

Footnote 2: This term describes the buffering of data and/or commands. The buffered content may not be sent immediately to the device(s), but only when the host reaches a specific synchronization point.

B. Structure for the Proposed Prototyping Framework

Based on the OpenCL execution model of Section IV-A, Figure 3 depicts the main structure of the proposed OpenCL-based virtual prototyping framework, which comprises the host and the SystemC-based VP. The host is either an x86 PC or an instruction-set simulator, e.g. one provided by OVP [13]. In both cases, the host software utilizes the OpenCL host API, which provides standardized functions for command/data enqueueing and synchronization. The API manipulates an Inter-Process Communication (IPC) mechanism for the connection with the VP. If the host is a software simulator, the IPC manipulation is made via a Transaction-Level Modeled (TLM) adapter (IPC2TLM), with which the accesses to specific bus addresses are translated to IPC commands. This scheme enables the decoupling of the software simulator from the work-items, following the concept of Figure 4: In software simulators like OVP, each platform component is scheduled serially for a specific time quantum. Therefore, in a single platform including CPU models and work-items, the following behaviour is observed: If a signal is sent from the CPU to the work-items, it takes effect only at the end of the time quantum. The time frame between sending the signal and the end of the quantum is wasted.
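The time waste caused by quantum-based scheduling can be illustrated with a small model. The sketch below is only an illustration under stated assumptions: time is counted in abstract ticks, quanta have a fixed length, a signal raised inside a quantum becomes visible at the end of that quantum, and the function names `delivery_time` and `wasted_time` are hypothetical (they are not part of OVP or of the proposed framework).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a quantum-scheduled simulator: a signal raised at
// time t is seen by the other components only at the end of the quantum
// that contains t. Time is measured in abstract simulation ticks.
std::uint64_t delivery_time(std::uint64_t t, std::uint64_t quantum) {
    assert(quantum > 0);
    // Round t up to the end of its enclosing quantum.
    return (t / quantum + 1) * quantum;
}

// Simulated time lost between raising the signal and its delivery.
std::uint64_t wasted_time(std::uint64_t t, std::uint64_t quantum) {
    return delivery_time(t, quantum) - t;
}
```

For example, with a quantum of 1000 ticks, a trigger raised at tick 250 would be seen by the work-items at tick 1000, so 750 ticks are lost. In an event-driven scheduler the same signal would take effect at tick 250.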
The same time waste occurs when the work-items send a signal to another component before the end of their quantum. To provide a simulator-independent solution for this issue, this work proposes the use of a separate software simulator (e.g. an OVP-based platform with CPU models and memories, all typically connected via a TLM bus), which runs in parallel with the VP. With this scheme, the signal exchange takes effect instantly, based on the event-driven scheduling of SystemC. In addition, this decoupling enables VP parameterization during the application execution.

For data exchange between the host and the VP, a shared memory segment is allocated on the host. This segment includes the global and constant memory for the VP. In addition, the shared memory incorporates a 64-bit variable for the simulated time of the VP. This variable is necessary because, during the application execution, the VP may be restarted in order to execute another kernel or different work-items. Hence, in order to avoid resetting the simulated time, the time-stamp is stored in this external time variable. This variable is also utilized for time profiling through the built-in OpenCL functions.

The VP consists of multiple work-items (footnote 3), which are organized in work-groups. Each work-item is a SystemC-modeled accelerator which includes the kernel code, as well as control and data signals. An important feature is the gated clock input of each work-item: Firstly, it enables a low-power system design in early design stages. Secondly, this technique may lead to significant simulation time improvements, as SystemC is able to omit the unused (e.g. early-finished) work-items. All work-items are controlled by a wrapper module, written in SystemC, which provides (1) the work-item interconnection, including the data access arbitration and the work-item synchronization (i.e. barrier handling), and (2) the control handling from the host via the IPC, i.e. work-item triggering and event notification.
Also, the wrapper includes a pointer to the shared memory segment for global data and constants, as well as local memories, one for each work-group. This organization features modularity and configurability: The designer may use different system architectures by only choosing another wrapper version with a different interconnection scheme (e.g. bus, Network-on-Chip, etc.) and memory model (e.g. distributed memory, etc.), without having to change the behavioural description of the work-items, and vice-versa. Below, we provide an analysis of the layout of a typical wrapper and of the proposed IPC mechanism.

1) Wrapper layout: Although the layout of a work-item wrapper strongly depends on the deployed interconnection and memory model, this section provides a typical wrapper architecture, which can be used as a paradigm for designing a wrapper library with a variety of different architectural features. The wrapper consists of two main modules, which control the work-items: (a) the scheduler and (b) the memory and interconnection model.

Footnote 3: On the VP side, the work-items correspond to the available resources of the platform.

Fig. 5. Work-item wrapper scheduler.

Fig. 6. Work-item wrapper model for interconnection and memory.

i. Scheduler: The host may invoke more work-items than the available resources of the VP. In this case, the scheduler is responsible for the serialization of the invoked work-items, according to the available ones in the VP, as shown in Figure 5. The invoked work-items are separated into parallel groups, the number of which is equal to the number of work-groups (i.e. W). In each group, the invoked work-items are organized into S segments, in each of which the invoked work-items should not exceed the available ones. The scheduler properly adjusts the global and local indices, so that one segment at a time runs on the available work-items. When the execution of a segment is finished, the work-items are re-triggered for the next segment.

ii. Memory and Interconnection: As Figure 6 shows, the wrapper uses separate local and global interconnections for local (one for each work-group) and global data access respectively, thus enabling data access parallelization.
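The segment-based serialization performed by the scheduler can be sketched as follows. This is a minimal illustration, not the framework's implementation: it assumes one-dimensional indices, the consecutive numbering of Figure 2 (global index = work-group index × invoked work-items per group + local index), and a fixed number of available work-items per work-group. The names `Slot`, `segment_count` and `segment_slots` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One invoked work-item mapped onto an available (simulated) work-item.
struct Slot {
    std::size_t resource;   // index of the available work-item in its group
    std::size_t local_id;   // local index seen by the kernel code
    std::size_t global_id;  // global index seen by the kernel code
};

// Number of segments S needed to serialize the invoked work-items of one
// work-group onto the available ones (ceiling division).
std::size_t segment_count(std::size_t invoked_per_group,
                          std::size_t available_per_group) {
    assert(available_per_group > 0);
    return (invoked_per_group + available_per_group - 1) / available_per_group;
}

// Index assignment for one segment of one work-group (assumed numbering).
std::vector<Slot> segment_slots(std::size_t group, std::size_t segment,
                                std::size_t invoked_per_group,
                                std::size_t available_per_group) {
    assert(segment < segment_count(invoked_per_group, available_per_group));
    std::vector<Slot> slots;
    for (std::size_t k = 0; k < available_per_group; ++k) {
        const std::size_t local_id = segment * available_per_group + k;
        if (local_id >= invoked_per_group) break;  // last segment may be partial
        slots.push_back({k, local_id, group * invoked_per_group + local_id});
    }
    return slots;
}
```

Under these assumptions, 10 invoked work-items per group on 4 available ones need 3 segments, and the last segment of work-group 1 uses only two resources, with global indices 18 and 19.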
Each work-item has dedicated input/output signals for local and global interconnection. Every interconnection is a typical crossbar which consists of input/output ports, one pair for each work-item, as well as one pair for the memories. Each pair of ports consists of control and data signals, allowing for transactions in words of multiple bytes, whose length is defined by the designer at compile time. The latter enables single-cycle transfers of vectors of 2, 4, 8 or 16 values (of up to 32 bits each), which are supported by OpenCL [12], thus enabling parallelism in data processing. Each module of the local memory is attached to one local interconnection, while a global/constant memory is attached to the global interconnection. Upon memory access, the work-item source code defines the memory type (global/constant or local) and the address inside the memory. If multiple work-items access the same memory module, round-robin arbitration is applied. We assume that single-port memories are utilized, supporting 32-bit accesses. However, significant bottlenecks may be induced, especially when reading global or constant vectors of data. Hence, a cache module is used for the global data (footnote 4), the size of which is determined at compile time. The cache supports accesses in lengths equal to the word length of the interconnection, in order to retain the interconnection performance and thus avoid bottlenecks. The area/power cost of such a cache depends on its size and word length; however, the designer may fine-tune both parameters for achieving optimized solutions.

Fig. 7. Inter-process communication mechanism.
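The round-robin arbitration applied when several work-items request the same memory module can be sketched as below. This is a generic round-robin arbiter written for illustration, assuming one request flag per work-item port and one grant per cycle; the class name `RoundRobinArbiter` and its interface are hypothetical and not taken from the framework.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Generic round-robin arbiter: the search for the next requester starts
// right after the port that was granted last, so no port can starve.
class RoundRobinArbiter {
public:
    explicit RoundRobinArbiter(std::size_t ports)
        : ports_(ports), last_(ports - 1) {
        assert(ports > 0);
    }

    // Returns the granted port for this cycle, or -1 if nobody requests.
    int grant(const std::vector<bool>& requests) {
        assert(requests.size() == ports_);
        for (std::size_t offset = 1; offset <= ports_; ++offset) {
            const std::size_t candidate = (last_ + offset) % ports_;
            if (requests[candidate]) {
                last_ = candidate;
                return static_cast<int>(candidate);
            }
        }
        return -1;
    }

private:
    std::size_t ports_;
    std::size_t last_;  // port granted most recently
};
```

With three ports and ports 0 and 2 requesting continuously, the grants alternate between 0 and 2, which is the fairness property that matters when many work-items share a single-port memory.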
2) Inter-Process Communication: The IPC mechanism of the proposed framework is based on a set of Unix semaphores, which are utilized for the control between the host and the VP, as shown in Figure 7. In particular, the set includes three semaphores: one for the triggering (i.e. "Trigger") and two for the host notification when processing ends (i.e. "Ready" and "Ack"). Apart from the semaphore-based control, the IPC mechanism incorporates an API for data exchange. In particular, the shared memory segment is manipulated by the host through memcpy() calls. Also, the time variable is updated by the wrapper in every (simulated) clock tick. The host reads this variable when polling the current time-stamp.

Hence, (1) when the host invokes a kernel, the VP process is started by taking as input the number of work-items and work-groups, as well as the input/output data size. During the startup, the semaphores and the shared memory are attached to the VP process. Afterwards, (2) the host triggers the data processing through the "Trigger" semaphore, which is polled by the wrapper. This kind of waiting is non-blocking (footnote 5) in order not to stall the simulated time. (3) During processing, the host waits until the result is ready, using the "Ready" semaphore (typically this is a blocking wait). (4) When the wrapper notifies the host that the processing has finished (through the "Ready" semaphore), (5) the VP process performs a blocking wait on the "Ack" semaphore, which is used for verifying that the host has received the notification.

C. Prototyping Flow for OpenCL Applications

In order to automatically create the work-items prototype, the proposed framework is accompanied by an OpenCL-to-SystemC prototyping flow, presented in Figure 8.
After a syntax check (typically using clang [14]), the OpenCL source is converted into SystemC by using (a) a work-item template; (b) a C++ class for vectors (footnote 6) of different data types, supporting arithmetic/logic operations and vector comparisons according to the OpenCL specification [12], while also enabling different degrees of parallelization in vector processing; (c) mathematical functions for both scalar variables and vectors; and (d) input/output functions. As the OpenCL syntax of vector operations differs from the default C/C++ syntax, any vector-related operation is rewritten according to the methods provided by the deployed C++ vector class. Figure 8 ("Vector Processing") shows typical conversion examples. This conversion is applied recursively: For example, V.odd is firstly converted to V.s13, then to Vector(V(1),V(3)) and finally to Vector(V.array[1],V.array[3]).

Footnote 4: The constants are fetched only once and are saved inside the work-item.

Footnote 5: In non-blocking waiting, the process is not blocked, but it performs active waiting. In blocking waiting, the process is blocked.

Footnote 6: Different from the built-in vector class provided by C++.
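The effect of the recursive vector rewriting can be illustrated in plain C++. The sketch below is not the framework's vector class: it assumes fixed-size integer vectors and hypothetical types `Vec2` and `Vec4`, and it only mirrors the conversions quoted above, where element access V(i) resolves to V.array[i] and the OpenCL selectors V.odd and V.even become the element pairs (1, 3) and (0, 2).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical two-element vector: V(i) resolves to V.array[i].
struct Vec2 {
    std::array<int, 2> array;
    int operator()(std::size_t i) const { assert(i < 2); return array[i]; }
};

// Hypothetical four-element vector with OpenCL-like odd/even selectors.
struct Vec4 {
    std::array<int, 4> array;
    int operator()(std::size_t i) const { assert(i < 4); return array[i]; }
    // V.odd -> V.s13 -> Vector(V(1), V(3))
    Vec2 odd() const { return Vec2{{(*this)(1), (*this)(3)}}; }
    // V.even -> V.s02 -> Vector(V(0), V(2))
    Vec2 even() const { return Vec2{{(*this)(0), (*this)(2)}}; }
};

// Element-wise product, as in V1.odd * V2.even.
inline Vec2 operator*(const Vec2& a, const Vec2& b) {
    return Vec2{{a(0) * b(0), a(1) * b(1)}};
}
```

With V1 = (10, 11, 12, 13) and V2 = (1, 2, 3, 4), the expression `v1.odd() * v2.even()` evaluates to (V1(1)*V2(0), V1(3)*V2(2)) = (11, 39), the same element pairing that the conversion example for V1.odd * V2.even describes.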

Fig. 8. Proposed prototyping flow. Vector Processing examples (applied recursively): V(i) becomes V.array[i]; (V1(i), V2(j)) becomes Vector(V1(i), V2(j)); V.s01 becomes Vector(V(0), V(1)); V.xyzw becomes V.s0123; V.odd becomes V.s13; V.even becomes V.s02; V1 + V2 is kept as is; V1.odd * V2.even becomes (V1(1)*V2(0), V1(3)*V2(2)). I/O Transactions examples for an array A of type T: an input A[pos] becomes value = read_t(addr(a), pos); an output becomes write_t(addr(a), pos, value).

Fig. 9. Typical design flows when using (a) the proposed prototyping framework; and (b) Altera SDK for OpenCL. The Altera-based flow requires the kernel compilation after every parameter change, in contrast to the flow utilizing the proposed framework.

Additionally, the OpenCL-to-SystemC conversion includes the detection of the global constants and the global and local variables (footnote 7). Every access to that data is replaced by input/output function calls for implementing memory accesses to/from the memories, as shown in Figure 8 ("I/O Transactions"). When the SystemC source for the work-items has been created, the next stage of the proposed flow involves the construction of the whole VP, including the wrapper model. Finally, a conventional C++ compiler is utilized, in combination with the SystemC library, so that the VP executable is produced.
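The replacement of array accesses by input/output calls can be sketched as follows. This is an illustration under assumptions, not the generated SystemC code: memory is modeled as a plain vector of 32-bit words, each call counts as one transaction, and the names `MemoryModel`, `read_word`, `write_word` and `increment_item` are hypothetical stand-ins loosely modeled on the read/write calls listed for Figure 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical memory model: a word array plus a transaction counter.
struct MemoryModel {
    std::vector<std::int32_t> words;
    std::size_t transactions = 0;
    explicit MemoryModel(std::size_t n) : words(n, 0) {}
};

// Stand-in for an input call: value = read(addr(A), pos).
std::int32_t read_word(MemoryModel& mem, std::size_t base, std::size_t pos) {
    assert(base + pos < mem.words.size());
    ++mem.transactions;
    return mem.words[base + pos];
}

// Stand-in for an output call: write(addr(A), pos, value).
void write_word(MemoryModel& mem, std::size_t base, std::size_t pos,
                std::int32_t value) {
    assert(base + pos < mem.words.size());
    ++mem.transactions;
    mem.words[base + pos] = value;
}

// Kernel statement "out[i] = in[i] + 1;" after the rewriting step:
// every access to a global array becomes an explicit transaction.
void increment_item(MemoryModel& global, std::size_t in_base,
                    std::size_t out_base, std::size_t i) {
    const std::int32_t value = read_word(global, in_base, i);
    write_word(global, out_base, i, value + 1);
}
```

Routing every access through such calls is what lets a wrapper attach arbitration, caching and timing to each memory access without touching the kernel's behavioural code.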
The main advantage of this flow is that it is applied only once: The OpenCL code is no longer needed in system design, as the created VP will be utilized in all the remaining hardware design stages, i.e. (i) functional verification, (ii) design parameters evaluation in terms of timing, resource utilization and power consumption, as well as (iii) final synthesis. This is also a major contribution of this work.

Footnote 7: Private variables are finally implemented as registers.

To better explain this advantage, Figures 9a and 9b show two typical design flows when using (a) the proposed prototyping framework and (b) Altera SDK for OpenCL respectively. Altera SDK for OpenCL is chosen as the state of the art for mapping OpenCL kernels onto FPGAs. The design flow using the proposed framework (Figure 9a) starts with the prototyping procedure of Figure 8. The produced VP can be used in a typical design space exploration (DSE) and can be refined with more architectural details in later design stages. In a single DSE, the VP is enriched with timing/area/power annotations which are derived using High-Level Synthesis (HLS) [15]. The annotated VP is then compiled and simulated. During simulation, the computation and communication behavior are combined for timing and power estimation, while different execution scenarios are covered by using different input data. As compared to the above flow, the design procedure using Altera SDK for OpenCL (Figure 9b) starts with the kernels compilation, including HLS and RTL synthesis with Quartus. The result is a bitstream for programming an FPGA board, where the system is evaluated.

TABLE I. OPENCL-IMPLEMENTED ALGORITHMS FOR THE EXPERIMENTATION SETUP.
Columns: Algorithm; Work-Items (Invoked, Available, Local); Input.
Pathfinder; input: matrix
BFS; input: node graph
Gaussian Elimination; input: matrix
Particle Filter; input: video, particles (a)
Nearest Neighbor; input: records
Histogram; input: elements
MergeSort; input: elements
BucketSort; input: elements
Back-Propagation; input: ANN (b)
(a) In 10-frame video.
(b) Neural network with 64 inputs, 1 hidden layer with 16 neurons and 1 output.
When using a typical DSE, the whole procedure is repeated after every parameter change. Also, there is no support for higher abstraction levels.

V. EXPERIMENTAL RESULTS

The proposed prototyping framework is evaluated by using the Rodinia benchmark suite [16]. Rodinia provides OpenCL-implemented algorithms, mainly focusing on GPU acceleration. However, the provided kernels can be mapped onto FPGAs as well. For the scope of this work, we used the benchmarks of Table I, which also shows the number of work-items in total (including the invoked and the available work-items in the VP) and locally (i.e. per work-group), as well as the application input size. In applications with a large number of invoked work-items, the kernels have been partially serialized, in order to both avoid excessive memory allocation and provide a more realistic model of the system-under-design.

The rest of this section (i) provides a quantitative comparison between Altera SDK for OpenCL (i.e. the state of the art) and the proposed framework in terms of compilation and application execution time; and (ii) analyzes the simulation time when using x86 or OVP-based hosts, while also evaluating the effect of separating the OVP simulator from the work-item platform. All the experiments have been executed on an Intel Core-i5 quad-core CPU at 3.2 GHz running Fedora 23 with Linux kernel 4.4.

Altera SDK for OpenCL vs. Proposed Framework: The selected kernels have been mapped onto a Cyclone V device, included in an Altera DE1SoC board, on which the applications have been run in order to measure the algorithm execution time. Meanwhile, the kernels have been prototyped in SystemC with the proposed framework and annotated in terms of timing and resource utilization, by using Xilinx Vivado HLS (footnote 8). Afterwards, the SystemC models are simulated by using the x86 CPU as host. The comparison results (footnote 9) are depicted in Figure 10.
The dominant part of the Altera-based flow is compilation, which is from 3× up to 18× slower than the proposed flow (including prototyping, annotation and simulation). Although the simulation depends on the input volume, the proposed methodology enables the designer to perform a rapid evaluation by using a small amount of representative input data in early design stages, for fast decision making, while larger input data volumes can be utilized in later design stages. On the contrary, this feature is not provided by Altera SDK for OpenCL: in case of a parameter modification, the designer has to wait more than 40 minutes (independently of the input data) until the kernels are (re-)implemented. Last but not least, there are cases (e.g. Back-Propagation) where the kernels do not fit into the FPGA fabric; in that case, Altera compilation fails.

8 Altera does not provide a standalone HLS tool. However, despite the use of different vendors, we intend to acquire typical execution time results only. Similar HLS run-time results are expected with the use of any other commercial HLS tool.
9 The board execution time is multiplied by 500 to be visible in the chart.

Fig. 10. Comparison between the compilation/execution time with Altera SDK for OpenCL and the prototyping/simulation time with the proposed prototyping framework.

Simulation Time Analysis: The aim of this analysis is to study how the application execution time is affected when using different hosts, namely an x86 host (i.e. the Intel Core-i5 CPU) and an ARM Cortex-A9 model, provided by OVP. For the second case, two scenarios are investigated: (i) separate platforms for the software simulation and the work-items (i.e. the proposed approach); and (ii) the use of a single TLM platform including the CPU model, memories and the work-items (i.e. the state-of-the-art prototyping approach). In the single-platform scenario, using a set of preliminary simulations, we have adjusted the time quantum of the OVP scheduler appropriately, so as to achieve the minimum possible simulation time overhead, according to the concept of Figure 4. For the separate-platform scenario, such adjustments are not necessary, which is a first indication of the efficiency of the proposed approach.

As Figure 11 depicts, the use of an OVP software simulator as a separate platform is able to leverage the high simulation speed provided by OVP without causing significant communication overhead between the CPU and the work-items, as it achieves simulation times similar to those of an x86 host. On the contrary, the use of a single platform may cause significant simulation time overhead, ranging from 10% up to 5×. The first reason, as explained in Figure 4, is that a single OVP-based platform performs quantum-based OVP scheduling, which may lead to significant time waste. The second reason is that a single platform deploys a constant number of work-items. On the contrary, in separate platforms, only the necessary work-items are allocated. Thus, if the host repeats a kernel with fewer invoked work-items, the simulation will be faster, as fewer components will be simulated.

Fig. 11. Simulation time comparison when using an x86 host and an OVP-based ARM host. For the OVP scenarios, we evaluate the use of separate VPs, as well as the use of a common OVP-based platform.

VI. CONCLUSIONS

This paper presents a rapid prototyping framework, which automatically derives a SystemC-based VP from OpenCL sources, thus combining the OpenCL portability with the abstracted modeling of Virtual Prototyping. The proposed framework supports different hardware architectures and memory models without the need for kernel modifications, while also enabling fast evaluation cycles, without long compilation procedures. In particular, the design flow which accompanies the proposed framework achieves evaluation time improvements of up to 18×, as compared to Altera SDK for OpenCL. The proposed framework also enables the use of any host, which can be either an x86 CPU or a software simulator. The host communicates with the VP through an inter-process communication mechanism, which also allows for the separation of a software simulator from the VP, thus leading to significant simulation time improvements reaching up to 5×.

REFERENCES
[1] OpenCL, by Khronos Group.
[2] (2013) Synopsys High-performance ASIC Prototyping Systems.
[3] J. Cong, M. Ghodrat, M. Gill, B. Grigorian, H. Huang, and G. Reinman, Composable accelerator-rich microprocessor enhanced for adaptivity and longevity, in IEEE International Symposium on Low Power Electronics and Design (ISLPED), Sept. 2013.
[4] AMD.
[5] NVIDIA SDK.
[6] K. Komatsu, K. Sato, Y. Arai, K. Koyama, H. Takizawa, and H. Kobayashi, Evaluating performance and portability of OpenCL programs, in 5th Intl. Workshop on Automatic Performance Tuning, 2010.
[7] Altera SDK for OpenCL.
[8] Xilinx SDAccel.
[9] The Next Logical Step in C/C++, OpenCL Programming, by Xcell Software Journal, Issue 1.
[10] Vivado Design Suite User Guide, UG902 (v2015.4), Nov. 24, manuals/xilinx2015 4/ug902- vivado-high-level-synthesis.pdf.
[11] Vista virtual prototyping, by Mentor Graphics.
[12] The OpenCL specifications, version
[13] Open Virtual Platforms website.
[14] clang: a C language family frontend for LLVM.
[15] E. Sotiriou-Xanthopoulos, S. Xydis, K. Siozios, G. Economakos, and D. Soudris, Effective platform-level exploration for heterogeneous multicores exploiting simulation-induced slacks, in PARMA-DITAM '14. New York, NY, USA: ACM, 2014, pp. 13:13-13:16.
[16] Rodinia: A benchmark suite for heterogeneous computing, version

Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17,

Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17, Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms SAMOS XIV July 14-17, 2014 1 Outline Introduction + Motivation Design requirements for many-accelerator SoCs Design problems

More information

Hardware Design and Simulation for Verification

Hardware Design and Simulation for Verification Hardware Design and Simulation for Verification by N. Bombieri, F. Fummi, and G. Pravadelli Universit`a di Verona, Italy (in M. Bernardo and A. Cimatti Eds., Formal Methods for Hardware Verification, Lecture

More information

Design methodology for multi processor systems design on regular platforms

Design methodology for multi processor systems design on regular platforms Design methodology for multi processor systems design on regular platforms Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline

More information

Efficient Hardware Acceleration on SoC- FPGA using OpenCL

Efficient Hardware Acceleration on SoC- FPGA using OpenCL Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA

More information

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture

More information

A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs

A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs Harrys Sidiropoulos, Kostas Siozios and Dimitrios Soudris School of Electrical & Computer Engineering National

More information

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA 1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Performance Verification for ESL Design Methodology from AADL Models

Performance Verification for ESL Design Methodology from AADL Models Performance Verification for ESL Design Methodology from AADL Models Hugues Jérome Institut Supérieur de l'aéronautique et de l'espace (ISAE-SUPAERO) Université de Toulouse 31055 TOULOUSE Cedex 4 Jerome.huges@isae.fr

More information

FPGA-Based Rapid Prototyping of Digital Signal Processing Systems

FPGA-Based Rapid Prototyping of Digital Signal Processing Systems FPGA-Based Rapid Prototyping of Digital Signal Processing Systems Kevin Banovic, Mohammed A. S. Khalid, and Esam Abdel-Raheem Presented By Kevin Banovic July 29, 2005 To be presented at the 48 th Midwest

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

HVSoCs: A Framework for Rapid Prototyping of 3-D Hybrid Virtual System-on-Chips

HVSoCs: A Framework for Rapid Prototyping of 3-D Hybrid Virtual System-on-Chips on introducing a new design paradigm HVSoCs: A Framework for Rapid Prototyping of 3-D Hybrid Virtual System-on-Chips D. Diamantopoulos, K. Siozios, E. Sotiriou-Xanthopoulos, G. Economakos and D. Soudris

More information

FPGA design with National Instuments

FPGA design with National Instuments FPGA design with National Instuments Rémi DA SILVA Systems Engineer - Embedded and Data Acquisition Systems - MED Region ni.com The NI Approach to Flexible Hardware Processor Real-time OS Application software

More information

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

«Real Time Embedded systems» Multi Masters Systems

«Real Time Embedded systems» Multi Masters Systems «Real Time Embedded systems» Multi Masters Systems rene.beuchat@epfl.ch LAP/ISIM/IC/EPFL Chargé de cours rene.beuchat@hesge.ch LSN/hepia Prof. HES 1 Multi Master on Chip On a System On Chip, Master can

More information

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function Chen-Ting Chang, Yu-Sheng Chen, I-Wei Wu, and Jyh-Jiun Shann Dept. of Computer Science, National Chiao

More information

Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience

Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience Mapping Multi-Million Gate SoCs on FPGAs: Industrial Methodology and Experience H. Krupnova CMG/FMVG, ST Microelectronics Grenoble, France Helena.Krupnova@st.com Abstract Today, having a fast hardware

More information

Chapter 5: ASICs Vs. PLDs

Chapter 5: ASICs Vs. PLDs Chapter 5: ASICs Vs. PLDs 5.1 Introduction A general definition of the term Application Specific Integrated Circuit (ASIC) is virtually every type of chip that is designed to perform a dedicated task.

More information

SDAccel Development Environment User Guide

SDAccel Development Environment User Guide SDAccel Development Environment User Guide Features and Development Flows Revision History The following table shows the revision history for this document. Date Version Revision 05/13/2016 2016.1 Added

More information

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,

More information

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

Cover TBD. intel Quartus prime Design software

Cover TBD. intel Quartus prime Design software Cover TBD intel Quartus prime Design software Fastest Path to Your Design The Intel Quartus Prime software is revolutionary in performance and productivity for FPGA, CPLD, and SoC designs, providing a

More information

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures

ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures 1 ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architectures Yu-Ting Chen, Jason Cong, Zhenman Fang, Bingjun Xiao, Peipei Zhou Center for Domain-Specific Computing, University

More information

ESE Back End 2.0. D. Gajski, S. Abdi. (with contributions from H. Cho, D. Shin, A. Gerstlauer)

ESE Back End 2.0. D. Gajski, S. Abdi. (with contributions from H. Cho, D. Shin, A. Gerstlauer) ESE Back End 2.0 D. Gajski, S. Abdi (with contributions from H. Cho, D. Shin, A. Gerstlauer) Center for Embedded Computer Systems University of California, Irvine http://www.cecs.uci.edu 1 Technology advantages

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Cover TBD. intel Quartus prime Design software

Cover TBD. intel Quartus prime Design software Cover TBD intel Quartus prime Design software Fastest Path to Your Design The Intel Quartus Prime software is revolutionary in performance and productivity for FPGA, CPLD, and SoC designs, providing a

More information

Long Term Trends for Embedded System Design

Long Term Trends for Embedded System Design Long Term Trends for Embedded System Design Ahmed Amine JERRAYA Laboratoire TIMA, 46 Avenue Félix Viallet, 38031 Grenoble CEDEX, France Email: Ahmed.Jerraya@imag.fr Abstract. An embedded system is an application

More information

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Automatic Intra-Application Load Balancing for Heterogeneous Systems Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena

More information

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution

More information

Hardware-Software Codesign. 1. Introduction

Hardware-Software Codesign. 1. Introduction Hardware-Software Codesign 1. Introduction Lothar Thiele 1-1 Contents What is an Embedded System? Levels of Abstraction in Electronic System Design Typical Design Flow of Hardware-Software Systems 1-2

More information

Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs

Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs Niu Feng Technical Specialist, ARM Tech Symposia 2016 Agenda Introduction Challenges: Optimizing cache coherent subsystem

More information

Multi-core microcontroller design with Cortex-M processors and CoreSight SoC

Multi-core microcontroller design with Cortex-M processors and CoreSight SoC Multi-core microcontroller design with Cortex-M processors and CoreSight SoC Joseph Yiu, ARM Ian Johnson, ARM January 2013 Abstract: While the majority of Cortex -M processor-based microcontrollers are

More information

Cosimulation of ITRON-Based Embedded Software with SystemC

Cosimulation of ITRON-Based Embedded Software with SystemC Cosimulation of ITRON-Based Embedded Software with SystemC Shin-ichiro Chikada, Shinya Honda, Hiroyuki Tomiyama, Hiroaki Takada Graduate School of Information Science, Nagoya University Information Technology

More information

Modeling and Simulation of System-on. Platorms. Politecnico di Milano. Donatella Sciuto. Piazza Leonardo da Vinci 32, 20131, Milano

Modeling and Simulation of System-on. Platorms. Politecnico di Milano. Donatella Sciuto. Piazza Leonardo da Vinci 32, 20131, Milano Modeling and Simulation of System-on on-chip Platorms Donatella Sciuto 10/01/2007 Politecnico di Milano Dipartimento di Elettronica e Informazione Piazza Leonardo da Vinci 32, 20131, Milano Key SoC Market

More information

Optimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd

Optimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd Optimizing ARM SoC s with Carbon Performance Analysis Kits ARM Technical Symposia, Fall 2014 Andy Ladd Evolving System Requirements Processor Advances big.little Multicore Unicore DSP Cortex -R7 Block

More information

Optimization of Behavioral IPs in Multi-Processor System-on- Chips

Optimization of Behavioral IPs in Multi-Processor System-on- Chips Optimization of Behavioral IPs in Multi-Processor System-on- Chips Yidi Liu and Benjamin Carrion Schafer # Department of Electronic and Information Engineering b.carrionschafer@polyu.edu.hk # Outline High-Level

More information

FPGA system development What you need to think about. Frédéric Leens, CEO

FPGA system development What you need to think about. Frédéric Leens, CEO FPGA system development What you need to think about Frédéric Leens, CEO About Byte Paradigm 2005 : Founded by 3 ASIC-SoC-FPGA engineers as a Design Center for high-end FPGA and board design. 2007 : GP

More information

A Matlab/Simulink Simulation Approach for Early Field-Programmable Gate Array Hardware Evaluation

A Matlab/Simulink Simulation Approach for Early Field-Programmable Gate Array Hardware Evaluation A Matlab/Simulink Simulation Approach for Early Field-Programmable Gate Array Hardware Evaluation Celso Coslop Barbante, José Raimundo de Oliveira Computing Laboratory (COMLAB) Department of Computer Engineering

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

System Debugging Tools Overview

System Debugging Tools Overview 9 QII53027 Subscribe About Altera System Debugging Tools The Altera system debugging tools help you verify your FPGA designs. As your product requirements continue to increase in complexity, the time you

More information

Vivado HLx Design Entry. June 2016

Vivado HLx Design Entry. June 2016 Vivado HLx Design Entry June 2016 Agenda What is the HLx Design Methodology? New & Early Access features for Connectivity Platforms Creating Differentiated Logic 2 What is the HLx Design Methodology? Page

More information

Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany

Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany 2013 The MathWorks, Inc. 1 Agenda Model-Based Design of embedded Systems Software Implementation

More information

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group Copyright Khronos Group, 2012 - Page 1 OpenCL Overview Shanghai March 2012 Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group Copyright Khronos Group, 2012 - Page 2 Processor

More information

Design Creation & Synthesis Division Avoid FPGA Project Delays by Adopting Advanced Design Methodologies

Design Creation & Synthesis Division Avoid FPGA Project Delays by Adopting Advanced Design Methodologies Design Creation & Synthesis Division Avoid FPGA Project Delays by Adopting Advanced Design Methodologies Alex Vals, Technical Marketing Engineer Mentor Graphics Corporation June 2008 Introduction Over

More information

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA

Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Maheswari Murali * and Seetharaman Gopalakrishnan # * Assistant professor, J. J. College of Engineering and Technology,

More information

Chapter 2 M3-SCoPE: Performance Modeling of Multi-Processor Embedded Systems for Fast Design Space Exploration

Chapter 2 M3-SCoPE: Performance Modeling of Multi-Processor Embedded Systems for Fast Design Space Exploration Chapter 2 M3-SCoPE: Performance Modeling of Multi-Processor Embedded Systems for Fast Design Space Exploration Hector Posadas, Sara Real, and Eugenio Villar Abstract Design Space Exploration for complex,

More information

Xilinx Vivado/SDK Tutorial

Xilinx Vivado/SDK Tutorial Xilinx Vivado/SDK Tutorial (Laboratory Session 1, EDAN15) Flavius.Gruian@cs.lth.se March 21, 2017 This tutorial shows you how to create and run a simple MicroBlaze-based system on a Digilent Nexys-4 prototyping

More information

OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch

OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch Motivation Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 OpenMP FPGA Device

More information

From Application to Technology OpenCL Application Processors Chung-Ho Chen

From Application to Technology OpenCL Application Processors Chung-Ho Chen From Application to Technology OpenCL Application Processors Chung-Ho Chen Computer Architecture and System Laboratory (CASLab) Department of Electrical Engineering and Institute of Computer and Communication

More information

A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs

A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs A Case for Better Integration of Host and Target Compilation When Using OpenCL for FPGAs Taylor Lloyd, Artem Chikin, Erick Ochoa, Karim Ali, José Nelson Amaral University of Alberta Sept 7 FSP 2017 1 University

More information

Cadence SystemC Design and Verification. NMI FPGA Network Meeting Jan 21, 2015

Cadence SystemC Design and Verification. NMI FPGA Network Meeting Jan 21, 2015 Cadence SystemC Design and Verification NMI FPGA Network Meeting Jan 21, 2015 The High Level Synthesis Opportunity Raising Abstraction Improves Design & Verification Optimizes Power, Area and Timing for

More information

Ultra-Fast NoC Emulation on a Single FPGA

Ultra-Fast NoC Emulation on a Single FPGA The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo

More information

Introducing the FPGA-Based Prototyping Methodology Manual (FPMM) Best Practices in Design-for-Prototyping

Introducing the FPGA-Based Prototyping Methodology Manual (FPMM) Best Practices in Design-for-Prototyping Introducing the FPGA-Based Prototyping Methodology Manual (FPMM) Best Practices in Design-for-Prototyping 1 What s the News? Introducing the FPMM: FPGA-Based Prototyping Methodology Manual Launch of new

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors

OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors OpenCL TM & OpenMP Offload on Sitara TM AM57x Processors 1 Agenda OpenCL Overview of Platform, Execution and Memory models Mapping these models to AM57x Overview of OpenMP Offload Model Compare and contrast

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

Chapter 2 Parallel Hardware

Chapter 2 Parallel Hardware Chapter 2 Parallel Hardware Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers

More information

Optimised OpenCL Workgroup Synthesis for Hybrid ARM-FPGA Devices

Optimised OpenCL Workgroup Synthesis for Hybrid ARM-FPGA Devices Optimised OpenCL Workgroup Synthesis for Hybrid ARM-FPGA Devices Mohammad Hosseinabady and Jose Luis Nunez-Yanez Department of Electrical and Electronic Engineering University of Bristol, UK. Email: {m.hosseinabady,

More information

Contents 1 Introduction 2 Functional Verification: Challenges and Solutions 3 SystemVerilog Paradigm 4 UVM (Universal Verification Methodology)

Contents 1 Introduction 2 Functional Verification: Challenges and Solutions 3 SystemVerilog Paradigm 4 UVM (Universal Verification Methodology) 1 Introduction............................................... 1 1.1 Functional Design Verification: Current State of Affair......... 2 1.2 Where Are the Bugs?.................................... 3 2 Functional

More information

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

CMPE 415 Programmable Logic Devices Introduction

CMPE 415 Programmable Logic Devices Introduction Department of Computer Science and Electrical Engineering CMPE 415 Programmable Logic Devices Introduction Prof. Ryan Robucci What are FPGAs? Field programmable Gate Array Typically re programmable as

More information

Implementing MATLAB Algorithms in FPGAs and ASICs By Alexander Schreiber Senior Application Engineer MathWorks

Implementing MATLAB Algorithms in FPGAs and ASICs By Alexander Schreiber Senior Application Engineer MathWorks Implementing MATLAB Algorithms in FPGAs and ASICs By Alexander Schreiber Senior Application Engineer MathWorks 2014 The MathWorks, Inc. 1 Traditional Implementation Workflow: Challenges Algorithm Development

More information

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor

More information

Advanced FPGA Design Methodologies with Xilinx Vivado

Advanced FPGA Design Methodologies with Xilinx Vivado Advanced FPGA Design Methodologies with Xilinx Vivado Alexander Jäger Computer Architecture Group Heidelberg University, Germany Abstract With shrinking feature sizes in the ASIC manufacturing technology,

More information

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS Embedded System System Set of components needed to perform a function Hardware + software +. Embedded Main function not computing Usually not autonomous

More information
