An OpenCL-based Framework for Rapid Virtual Prototyping of Heterogeneous Architectures
Efstathios Sotiriou-Xanthopoulos, Leonard Masing, Kostas Siozios, George Economakos, Dimitrios Soudris and Jürgen Becker
School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece {stasot, ksiop, geconom,
Institute for Information Processing, Karlsruhe Institute of Technology, Karlsruhe, Germany {leonard.masing,

Abstract: The increasing performance and power requirements of embedded systems have led to a variety of heterogeneous hardware architectures featuring many different types of processing elements. This heterogeneity, however, adds effort to system development and programming. To address it, OpenCL provides a portable programming model which enables the use of one source code on various architectures featuring different types of processors. Such systems also impose higher design complexity due to the increased number of hardware components. Virtual Prototyping aims to alleviate this issue by enabling hardware modeling at higher abstraction levels. This paper combines the benefits of OpenCL with Virtual Prototyping by proposing an OpenCL-based framework for rapid prototyping, which (a) automatically derives a virtual prototype from OpenCL code; (b) executes the application by running the host program along with the hardware simulation; and (c) proposes a design flow for faster system evaluation, as compared to the state-of-the-art FPGA-based flow. Using a set of benchmarks, it is shown that the proposed framework enables faster prototyping by up to 18×, as compared to the state-of-the-art flow.

I. INTRODUCTION
Due to the ever-increasing need for more processing power despite the limited energy budget available, efficient data processing is becoming more and more imperative.
Thus, heterogeneous multi-processor Systems-on-Chip (MPSoC) have been an effective choice, as their hardware components can be customized to the exact application requirements. To exploit their full potential, choosing the right architecture for the running application is of utmost importance. This, however, imposes increased design effort for both software and hardware, especially when different types of processors (e.g. CPUs, GPUs, FPGAs etc.) are taken into consideration. To solve the difficulties imposed by the programming heterogeneity of such platforms, OpenCL [1] provides a portable programming model which enables the programming of different types of processing elements without the need for adapting the source code to each type. Hence, the designer is able to investigate multiple data processing architectures without extra programming effort. Although OpenCL was originally directed at GPU programming, the FPGA community is increasingly adopting it, thus enabling easier and more efficient programming of FPGA devices.

However, in the FPGA or ASIC world, the nearly limitless customization options during MPSoC design increase the design complexity. This is caused by the numerous architectural parameters in RTL design (e.g. when using FPGAs). Thus, choosing an efficient architecture is a tedious and slow task; doing this task manually by experienced developers can take a lot of effort and man-months, making it infeasible for most use cases.

[Fig. 1. Typical examples of heterogeneous architectures to be taken into consideration during SoC design, panels (a) to (d). Architectures (b), (c) and (d) are not supported by state-of-the-art OpenCL-based development frameworks.]

This work was partially supported by TEAChER: TEach AdvanCEd Reconfigurable architectures and tools project funded by DAAD (2014).
Virtual Prototyping has been proposed to alleviate this problem: the hardware is modeled in a software representation called a Virtual Platform (VP), typically written in SystemC. The main benefit of such an approach is hardware modeling at various abstraction levels, in each of which a number of architectural details are removed, thus limiting the architectural parameter combinations, especially in early design stages when some of the architectural details are not yet available. This enables early software development and design space exploration, aiming at easier bug fixing, better design space coverage and shorter time-to-market. The VP might also serve as a golden reference for the development team.

The goal of this work is to propose a rapid Virtual Prototyping framework which (a) enables the automated modeling of heterogeneous hardware architectures by taking OpenCL source code as input; (b) provides a platform-independent simulation environment between the hardware model and the host (i.e. the processor that coordinates the simulated hardware) without the need for real hardware platforms; and (c) is accompanied by a design flow that features faster development and evaluation cycles during a design space exploration procedure, as compared to state-of-the-art FPGA-based system design using OpenCL.

The paper is structured as follows: Sections II and III present the motivation and the related work of this paper. Section IV explains the proposed methodology. Section V shows the experimental results and discusses the insights gained. Finally, we conclude our work in Section VI.

II. MOTIVATION
To better clarify the benefits of combining OpenCL with Virtual Prototyping, we consider two scenarios, which are related to state-of-the-art design flows: (i) OpenCL without using VPs (e.g. using FPGA only) and (ii) VP modeling without using OpenCL (i.e. manual hardware modeling and programming according to the hardware architecture).
OpenCL without VPs: The designer would use an FPGA-based platform for the system design and evaluation. This leads to a vendor-dependent low-level RTL design with a specific supported architectural scheme, similar to Figure 1a, which depicts a typical bus-based SoC with one FPGA fabric and a dual-core ARM. However, there might be alternative architectures to be taken into consideration during SoC design. For example, Figure 1b depicts a cluster of sub-systems, following the architectural pattern of Synopsys HAPS [2]: each sub-system is similar to the SoC of Figure 1a and executes a different set of threads. Figure 1c depicts a SoC where the CPUs and memories are decoupled from the FPGAs, while Figure 1d shows a NoC-based system, quite similar to [3], incorporating CPUs, FPGAs, ASICs and distributed memory in different modules. The architectures of Figures 1b, 1c and 1d are very difficult to prototype in real hardware, especially because of the increased cost of acquiring such hardware platforms. Hence, using VPs in OpenCL-based systems (i) facilitates cost-free, vendor-independent rapid prototyping, (ii) allows for easier and faster platform debugging and evaluation of timing/power metrics, (iii) provides extensive architectural flexibility and (iv) enables iterative platform refinement with a small set of architectural details in each design stage.

VP modeling without OpenCL: A typical VP for heterogeneous MPSoCs requires (a) the software programming, (b) the modeling of computation for hardware accelerators and (c) the modeling of the interconnection. Modifying one of these elements might result in modifications to the other parts as well. For example, re-assigning a task to another processor type might lead to software changes for handling the newly assigned component. In another example, a bus might need different accelerator modeling than a NoC: in the latter case, the accelerator might be adapted in order to exploit the parallelization of transactions.
Therefore, various portability issues arise between different architectural schemes incorporating different processors, memory organizations and interconnection schemes. This is alleviated by using OpenCL during prototyping, as OpenCL is able to provide a portable programming and simulation environment which is adapted to any architectural scheme without the need for software or hardware model modifications, while also enabling easy runtime assignment of the application tasks onto the processing elements.

III. RELATED WORK
Due to the functional portability it provides, OpenCL has been extensively supported in GPU programming, in order to abstract away the complex programming model of GPUs. Typical examples are the development environments of AMD [4] and NVIDIA [5]. A survey on the performance and portability of OpenCL on GPUs is provided by [6]. Apart from GPUs, an ever-increasing effort is being made to adopt OpenCL in FPGA design. The most typical example is the Altera SDK for OpenCL [7], an OpenCL-based development and execution environment that allows for the automatic synthesis of OpenCL code down to an FPGA bitstream, while including the appropriate communication environment between the host program and the FPGA-mapped accelerator(s). Xilinx also adopted OpenCL by providing the Xilinx SDAccel environment [8], which provides an integrated development and runtime solution from C, C++ or OpenCL sources down to FPGA-mapped applications [9], as well as by enhancing the Vivado HLS tool with OpenCL support (however, only for high-level synthesis) [10].

Although the above OpenCL-based development environments for FPGA programming keep evolving, they suffer from the inherent constraints of FPGA-based system design: the system design is made in RTL, by executing the whole flow which is required in order to (a) transform OpenCL code into a hardware description and (b) map the description onto an FPGA, e.g. by using Quartus in the case of the Altera SDK for OpenCL. Moreover, the system evaluation is only made by using a real (and potentially expensive) FPGA board. Apart from the cost, there is no explicit support for alternative architectures involving CPUs other than those provided by the SoC fabric of the FPGA board. Last but not least, the Altera SDK for OpenCL requires a license for compiling an OpenCL description of the accelerator, while SDAccel can be obtained only after contacting Xilinx.

[Fig. 2. OpenCL execution and memory model (for one-dimension indices).]

This paper argues that the above issues can be alleviated by combining the portability of OpenCL with the abstracted hardware modeling of Virtual Prototyping. The most relevant example of such a combination is the emulator of the Altera SDK for OpenCL, which however is suitable only for functional verification. Moreover, prototyping frameworks that enable automatic VP creation, e.g. Mentor Vista Virtual Prototyping [11], do not explicitly support OpenCL applications, while they also focus on software development. To the best of our knowledge, there is no OpenCL-based framework for virtual prototyping of heterogeneous SoCs at multiple abstraction levels. In contrast, our proposed prototyping framework addresses the above issues by (a) providing an automated flow for deriving a SystemC-based VP from an OpenCL source, and (b) enabling VP simulation with different configurations, without the need for existing hardware. The vendor-independent nature of the framework enables the use of numerous different architectural schemes which might be difficult to map onto an FPGA.

IV. SYSTEMC PROTOTYPING METHODOLOGY FOR OPENCL APPLICATIONS
After a brief background on the OpenCL execution model, this section analyzes the proposed prototyping framework.
This analysis includes the framework structure and functionality, as well as a prototyping flow for converting a set of OpenCL kernels into VP modules.

A. Background on the OpenCL execution model
The proposed prototyping framework is based on the OpenCL 1.0 specification [12], according to which an OpenCL application consists of two main parts: (a) the host and (b) a number of kernels, as the execution model example of Figure 2 depicts. The kernels part is organized as an OpenCL context, i.e. a unified environment which contains the kernels executable (a.k.a. program), the kernel instances (a.k.a. work-items), the utilized devices (footnote 1) and the memories. Therefore, the host controls the kernel instances and the respective devices which are included in this context. Each work-item corresponds to a specific part of the kernel's execution, such as a single iteration of a for loop or a branch of an if-else block. To define which parts should be executed in each work-item, OpenCL built-in functions return the global and the local index of the work-item. Although the example of Figure 2 utilizes one-dimension indices, the developer can use up to three dimensions. The global index distinguishes each work-item from the others. However, the work-items might be organized in work-groups. In that case, the local index is used for identifying a work-item inside a specific work-group. In the example of Figure 2, given W work-groups of i work-items each, the global index range is 0, ..., W·i - 1, while the local index range for each work-group is 0, ..., i - 1.

Footnote 1: It is possible to use multiple devices in one context as well.

[Fig. 3. The structure of the proposed prototyping framework.]
[Fig. 4. Using a single OVP-based platform versus using separate platforms.]

This grouping is related to the memory model. There are four distinct memory types: (a) the global memory, which is visible to any work-item, as well as to the host; (b) the constant memory, i.e. a read-only global memory; (c) the local memory, which is visible only to the work-items of a single work-group (each work-group has its own local memory); and (d) the private memory, which is used only inside a specific work-item. This memory model allows for multiple simultaneous memory accesses when using local memories for temporary data sharing, thus leveraging the parallelization potential provided by OpenCL. To avoid race conditions, synchronization mechanisms known as barriers can be used inside the kernel code, for global, local or both memories. Therefore, within a specific context, (1) the host selects the execution of one of the kernels and defines a set of buffers for data sending/receiving.
(2) After enqueuing (footnote 2) the input data to be sent to the global and/or constant memory, (3) the host invokes (i.e. triggers) the kernels by enqueuing an NDRange command, which involves the creation of an N-dimensional range of work-items and work-groups. Afterwards, (4) the data and the command are flushed to the deployed devices. (5) When the kernel has been executed, an event is returned to the host. (6) The host enqueues a command for data reception. This typical flow is repeated for each kernel.

Footnote 2: This term describes the buffering of data and/or commands. The buffered content may not be sent immediately to the device(s), but only when the host reaches a specific synchronization point.

B. Structure of the Proposed Prototyping Framework
Based on the OpenCL execution model of Section IV-A, Figure 3 depicts the main structure of the proposed OpenCL-based virtual prototyping framework, which comprises the host and the SystemC-based VP. The host is either an x86 PC or an instruction-set simulator, e.g. provided by OVP [13]. In both cases, the host software utilizes the OpenCL host API, which provides standardized functions for command/data enqueueing and synchronization. The API manipulates an Inter-Process Communication (IPC) mechanism for the connection with the VP. If the host is a software simulator, the IPC manipulation is made via a Transaction-Level Modeled (TLM) adapter (IPC2TLM), with which the accesses to specific bus addresses are translated to IPC commands. This scheme enables the decoupling of the software simulator from the work-items, following the concept of Figure 4: in software simulators like OVP, each platform component is scheduled serially for a specific time quantum. Therefore, in a single platform including CPU models and work-items, the following behaviour is observed: if a signal is sent from the CPU to the work-items, it will take effect only at the end of the time quantum. The time frame between the sending of the signal and the end of the quantum will be wasted.
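The time waste described above can be made concrete with a small Python sketch. It assumes, purely for illustration, that quanta are aligned to multiples of a fixed quantum length and that a signal only becomes visible at the end of the quantum in which it was sent. The function name and the time units are hypothetical and are not part of the framework.

```python
def wasted_time(send_time: int, quantum: int) -> int:
    """Return the simulated time between sending a signal and the end of
    the current quantum, i.e. the interval in which the signal has no effect.

    Assumption (not from the paper): quanta start at multiples of `quantum`.
    """
    if quantum <= 0 or send_time < 0:
        raise ValueError("quantum must be positive and send_time non-negative")
    # End of the quantum that contains send_time.
    quantum_end = (send_time // quantum + 1) * quantum
    return quantum_end - send_time


if __name__ == "__main__":
    # A signal sent early in a 1000-tick quantum wastes most of that quantum.
    for t in (10, 500, 990):
        print(f"sent at {t}: {wasted_time(t, 1000)} ticks wasted")
```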
This also occurs when the work-items send a signal to another component before the end of their quantum. To provide a simulator-independent solution for this issue, this work proposes the use of a separate software simulator (e.g. an OVP-based platform with CPU models and memories, all typically connected via a TLM bus), which runs in parallel with the VP. With this scheme, the signal exchange takes effect instantly, based on the event-driven scheduling of SystemC. In addition, this decoupling enables VP parameterization during the application execution.

For data exchange between the host and the VP, a shared memory segment is allocated on the host. This segment includes the global and constant memory for the VP. In addition, the shared memory incorporates a 64-bit variable for the simulated time of the VP. This variable is necessary because, during the application execution, the VP may be restarted in order to execute another kernel or different work-items. Hence, in order to avoid resetting the simulated time, the time-stamp is stored in the external time variable. This variable is also utilized for time profiling through the OpenCL built-in functions.

The VP consists of multiple work-items (footnote 3), which are organized in work-groups. Each work-item is a SystemC-modeled accelerator which includes the kernel code, as well as control and data signals. An important feature is the gated clock input for each work-item: firstly, it enables a low-power system design in early design stages. Secondly, this technique may lead to significant simulation time improvements, as SystemC is able to omit the unused (e.g. early-finished) work-items. All work-items are controlled by a wrapper module, written in SystemC, which provides (1) the work-item interconnection, including the data access arbitration and the work-item synchronization (i.e. barrier handling), and (2) the control handling from the host via the IPC, i.e. work-item triggering and event notification.
Also, the wrapper includes a pointer to the shared memory segment for global data and constants, as well as local memories, one for each work-group. This organization features modularity and configurability: the designer may use different system architectures by only choosing another wrapper version with a different interconnection scheme (e.g. bus, Network-on-Chip, etc.) and memory model (e.g. distributed memory, etc.), without having to change the behavioural description of the work-items, and vice versa. Below, we provide an analysis of the layout of a typical wrapper and the proposed IPC mechanism.

1) Wrapper layout: Although the layout of a work-item wrapper strongly depends on the deployed interconnection and memory model, this section provides a typical wrapper architecture, which can be used as a paradigm for designing a wrapper library with a variety of different architectural features. The wrapper consists of two main modules, which control the work-items: (a) the scheduler and (b) the memory and interconnection model.

Footnote 3: On the VP side, the work-items correspond to the available resources of the platform.
[Fig. 5. Work-item wrapper scheduler. Labels: W × N available work-items in W work-groups; W × S × N invoked work-items.]
[Fig. 6. Work-item wrapper model for interconnection and memory.]

i. Scheduler: The host may invoke more work-items than the available resources of the VP. In this case, the scheduler is responsible for the serialization of the invoked work-items, according to the available ones in the VP, as shown in Figure 5. The invoked work-items are separated into parallel groups, the number of which is equal to the number of work-groups (i.e. W). In each group, the invoked work-items are organized into S segments, in each of which the invoked work-items should not exceed the available ones. The scheduler adjusts the global and local indices accordingly, so that one segment runs on the available work-items at a time. When the execution of a segment has finished, the work-items are re-triggered for the next segment.

ii. Memory and Interconnection: As Figure 6 shows, the wrapper uses separate local and global interconnections for local (one for each work-group) and global data access respectively, thus enabling data access parallelization.
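Returning to the scheduler of item i, the serialization of invoked work-items into segments can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the function name is hypothetical, and the index rule used here (global index = work-group index times the number of invoked work-items per group, plus the local index) is the usual OpenCL convention rather than the exact formula of Figure 5.

```python
def schedule(invoked_per_group: int, available_per_group: int, num_groups: int):
    """Serialize invoked work-items onto the available ones.

    Returns a list of segments; each segment is a list of
    (group, slot, local_id, global_id) tuples that run together.
    Assumed index rule: global_id = group * invoked_per_group + local_id.
    """
    if min(invoked_per_group, available_per_group, num_groups) <= 0:
        raise ValueError("all arguments must be positive")
    # Ceiling division: number of segments S needed per work-group.
    num_segments = -(-invoked_per_group // available_per_group)
    segments = []
    for s in range(num_segments):            # segments run one after another
        segment = []
        for group in range(num_groups):      # work-groups run side by side
            for slot in range(available_per_group):
                local_id = s * available_per_group + slot
                if local_id >= invoked_per_group:
                    break  # the last segment may be only partially filled
                segment.append((group, slot, local_id,
                                group * invoked_per_group + local_id))
        segments.append(segment)
    return segments


if __name__ == "__main__":
    # 2 work-groups, 5 invoked but only 2 available work-items per group.
    for s, segment in enumerate(schedule(5, 2, 2)):
        print(f"segment {s}: {segment}")
```

Each available work-item (slot) is re-used once per segment, which mirrors the re-triggering step described for the scheduler.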
Each work-item has dedicated input/output signals for the local and global interconnections. Every interconnection is a typical crossbar which consists of input/output ports, one pair for each work-item, as well as one pair for the memories. Each pair of ports consists of control and data signals, allowing for transactions in words of multiple bytes, defined by the designer at compile time. The latter enables single-cycle transfers of vectors of 2, 4, 8 or 16 values (of up to 32 bits each), which are supported by OpenCL [12], thus enabling parallelism in data processing. Each module of the local memory is attached to one local interconnection, while a global/constant memory is attached to the global interconnection. Upon memory access, the work-item source code defines the memory type (global/constant or local) and the address inside the memory. If multiple work-items access the same memory module, round-robin arbitration is applied. We assume that single-port memories are utilized, supporting 32-bit accesses. However, significant bottlenecks may be induced, especially when reading global or constant vectors of data. Hence, a cache module is used for the global data (footnote 4), the size of which is determined at compile time. The cache supports accesses in lengths equal to the word length of the interconnection, in order to retain the interconnection performance and thus avoid bottlenecks. The area/power cost of such a cache depends on its size and word length; however, the designer may fine-tune both parameters to achieve optimized solutions.

[Fig. 7. Inter-process communication mechanism.]
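The round-robin arbitration applied when several work-items access the same memory module can be illustrated with a minimal Python model. The class name, the port numbering and the list-of-booleans request format are assumptions made for this sketch; the wrapper itself is a SystemC crossbar whose implementation is not shown in the text.

```python
class RoundRobinArbiter:
    """Grant one requesting port per cycle, rotating the priority so that
    the most recently served port gets the lowest priority next time."""

    def __init__(self, num_ports: int):
        if num_ports <= 0:
            raise ValueError("num_ports must be positive")
        self.num_ports = num_ports
        self.last_granted = num_ports - 1  # port 0 has priority at start

    def grant(self, requests):
        """`requests[p]` is True if port p wants the memory this cycle.
        Returns the granted port index, or None if nobody requests."""
        for offset in range(1, self.num_ports + 1):
            port = (self.last_granted + offset) % self.num_ports
            if requests[port]:
                self.last_granted = port
                return port
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    # Three work-items keep requesting the same single-port memory.
    print([arbiter.grant([True, True, True]) for _ in range(4)])
```

With a single-port memory, every additional requester adds one cycle of waiting in the worst case, which is consistent with the bottleneck that the cache is meant to relieve.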
2) Inter-Process Communication: The IPC mechanism of the proposed framework is based on a set of Unix semaphores, which are utilized for the control between the host and the VP, as shown in Figure 7. In particular, the set includes three semaphores: one for the triggering (i.e. Trigger) and two for the host notification when processing ends (i.e. Ready and Ack). Apart from the semaphore-based control, the IPC mechanism incorporates an API for data exchange. In particular, the shared memory segment is manipulated by the host through memcpy() calls. Also, the time variable is updated by the wrapper on every (simulated) clock tick. The host reads this variable when polling the current time-stamp. Hence, (1) when the host invokes a kernel, the VP process is started by taking as input the number of work-items and work-groups, as well as the input/output data size. During startup, the semaphores and the shared memory are attached to the VP process. Afterwards, (2) the host triggers the data processing through the Trigger semaphore, which is polled by the wrapper. This kind of waiting is non-blocking (footnote 5) in order not to stall the simulated time. (3) During processing, the host waits until the result is ready, using the Ready semaphore (typically this is a blocking wait). (4) When the wrapper notifies the host that the processing has finished (through the Ready semaphore), (5) the VP process performs a blocking wait on the Ack semaphore, which is used for verifying that the host has received the notification.

C. Prototyping Flow for OpenCL Applications
In order to automatically create the work-items prototype, the proposed framework is accompanied by an OpenCL-to-SystemC prototyping flow, presented in Figure 8.
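The Trigger/Ready/Ack handshake of the IPC mechanism (Section IV-B2) can be sketched in Python as follows. This is a simplified stand-in, not the framework's code: it uses threads and `threading.Semaphore` instead of separate processes with Unix semaphores, a plain object instead of a shared memory segment, and a hypothetical "kernel" that doubles each input value.

```python
import threading


class SharedSegment:
    """Stand-in for the shared memory segment (data plus simulated time)."""

    def __init__(self, data):
        self.data = list(data)
        self.sim_time = 0  # the paper uses a 64-bit variable for this


def wrapper(shared, trigger, ready, ack):
    # Step 2: poll Trigger without blocking, so simulated time keeps advancing.
    while not trigger.acquire(blocking=False):
        shared.sim_time += 1
    # Processing (hypothetical kernel): one simulated tick per element.
    shared.data = [2 * x for x in shared.data]
    shared.sim_time += len(shared.data)
    ready.release()   # Step 4: notify the host that the result is ready
    ack.acquire()     # Step 5: block until the host acknowledges


def run_kernel(inputs):
    shared = SharedSegment(inputs)
    trigger, ready, ack = (threading.Semaphore(0) for _ in range(3))
    # Step 1: start the VP side (a thread here, a process in the framework).
    vp = threading.Thread(target=wrapper, args=(shared, trigger, ready, ack))
    vp.start()
    trigger.release()            # Step 2: host triggers the processing
    ready.acquire()              # Step 3: host blocks until Ready
    result = list(shared.data)   # host copies the output data back
    ack.release()                # host acknowledges the notification
    vp.join()
    return result, shared.sim_time


if __name__ == "__main__":
    print(run_kernel([1, 2, 3]))
```

The Ack step matters because the VP side must not reuse or tear down the shared state before the host has seen the notification.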
After a syntax check (typically using clang [14]), the OpenCL source is converted into SystemC by using (a) a work-item template; (b) a C++ class for vectors (footnote 6) of different data types, supporting arithmetic/logic operations and vector comparisons according to the OpenCL specification [12], while also enabling different degrees of parallelization in vector processing; (c) mathematical functions for both scalar variables and vectors; and (d) input/output functions. As the OpenCL syntax of vector operations differs from the default C/C++ syntax, any vector-related operation is rewritten according to the methods provided by the deployed C++ vector class. Figure 8 ("Vector Processing") shows typical conversion examples. This conversion is applied recursively: for example, V.odd is first converted to V.s13, then to Vector(V(1),V(3)) and finally to Vector(V.array[1],V.array[3]).

Footnote 4: The constants are fetched only once and are saved inside the work-item.
Footnote 5: In non-blocking waiting, the process is not blocked, but it performs active waiting. In blocking waiting, the process is blocked.
Footnote 6: Different from the built-in vector class provided by C++.
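The recursive rewriting of vector accesses can be illustrated with a small regex-based Python sketch. It reproduces the V.odd chain given above for a 4-element vector, but it is only a toy under stated assumptions: the function names are hypothetical, the rules cover just the three rewrite stages named in the text, and a real converter would work on a parsed syntax tree rather than on strings.

```python
import re

# Named swizzles of a 4-element vector and their numeric equivalents.
NAMED = {"odd": "s13", "even": "s02", "xyzw": "s0123"}


def _named_to_numeric(expr: str) -> str:
    # V.odd -> V.s13
    return re.sub(r"\b(\w+)\.(odd|even|xyzw)\b",
                  lambda m: f"{m.group(1)}.{NAMED[m.group(2)]}", expr)


def _numeric_to_vector(expr: str) -> str:
    # V.s13 -> Vector(V(1),V(3)); swizzle digits are hexadecimal (0..f)
    def repl(m):
        name = m.group(1)
        parts = ",".join(f"{name}({int(d, 16)})" for d in m.group(2))
        return f"Vector({parts})"
    return re.sub(r"\b(\w+)\.s([0-9a-fA-F]+)\b", repl, expr)


def _call_to_array(expr: str) -> str:
    # V(1) -> V.array[1]
    return re.sub(r"\b(\w+)\((\d+)\)",
                  lambda m: f"{m.group(1)}.array[{m.group(2)}]", expr)


def convert(expr: str) -> list:
    """Apply the rules repeatedly and return every intermediate form."""
    steps = [expr]
    while True:
        for rule in (_named_to_numeric, _numeric_to_vector, _call_to_array):
            rewritten = rule(steps[-1])
            if rewritten != steps[-1]:
                steps.append(rewritten)
                break
        else:
            return steps  # no rule changed the expression: fixed point


if __name__ == "__main__":
    print(" -> ".join(convert("V.odd")))
```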
[Fig. 8. Proposed prototyping flow. Stages: Syntax Check [clang]; OpenCL-to-SystemC Code Conversion [proposed]; VP Construction [proposed]; Compilation [gcc].
Vector Processing (recursive) examples:
  V(i) -> V.array[i]
  (V1(i), V2(j)) -> Vector(V1(i), V2(j))
  V.s01 -> Vector(V(0), V(1))
  V.xyzw -> V.s0123
  V.odd -> V.s13
  V.even -> V.s02
  V1 + V2 -> V1 + V2 (as is)
  V1.odd * V2.even -> (V1(1)*V2(0), V1(3)*V2(2))
I/O Transactions examples, for "Type T A[pos]":
  input: value = read_t(addr(a), pos)
  output: write_t(addr(a), pos, value)]

[Fig. 9. Typical design flows when using (a) the proposed prototyping framework; and (b) Altera SDK for OpenCL. The Altera-based flow requires the kernel compilation after every parameter change, in contrast to the flow utilizing the proposed framework.]

Additionally, the OpenCL-to-SystemC conversion includes the detection of the global constants and the global and local variables (footnote 7). Every access to such data is replaced by input/output function calls that implement memory accesses to/from the memories, as shown in Figure 8 ("I/O Transactions"). When the SystemC source for the work-items has been created, the next stage of the proposed flow involves the construction of the whole VP, including the wrapper model. Finally, a conventional C++ compiler is utilized, in combination with the SystemC library, so that the VP executable is produced.
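The replacement of global, constant and local data accesses by input/output calls can be sketched in the same string-rewriting style. The sketch below is an assumption-laden illustration: the `read_<type>`/`write_<type>` names are derived from the `read_t`/`write_t` pattern of Figure 8, the `buffers` dictionary (buffer name to element type) is hypothetical, and only simple one-level subscripts are handled.

```python
import re


def _rewrite_reads(expr: str, buffers: dict) -> str:
    """Replace every NAME[index] read of a known buffer by a read call."""
    def repl(m):
        name, index = m.group(1), m.group(2)
        if name not in buffers:
            return m.group(0)  # private data: left untouched
        return f"read_{buffers[name]}(addr({name}), {index})"
    # Only subscripts without nested brackets are handled in this sketch.
    return re.sub(r"\b(\w+)\[([^\[\]]+)\]", repl, expr)


def rewrite_statement(stmt: str, buffers: dict) -> str:
    """Rewrite one C-like statement so that buffer accesses become I/O calls."""
    body = stmt.strip().rstrip(";")
    # Assignment to a buffer element: NAME[index] = value  (but not '==').
    m = re.fullmatch(r"(\w+)\[([^\[\]]+)\]\s*=(?!=)\s*(.+)", body)
    if m and m.group(1) in buffers:
        name, index, value = m.groups()
        value = _rewrite_reads(value, buffers)
        return f"write_{buffers[name]}(addr({name}), {index}, {value});"
    return _rewrite_reads(body, buffers) + ";"


if __name__ == "__main__":
    bufs = {"A": "float", "B": "float"}  # hypothetical global buffers
    print(rewrite_statement("x = A[i];", bufs))
    print(rewrite_statement("B[i] = A[i] + 1;", bufs))
```

Keeping all memory traffic behind read/write calls is what lets the wrapper, rather than the work-item code, decide how a memory access is routed and arbitrated.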
The main advantage of this flow is that it is applied only once: the OpenCL code is not needed any more in system design, as the created VP will be utilized in all the remaining hardware design stages, i.e. (i) functional verification, (ii) design parameter evaluation in terms of timing, resource utilization and power consumption, as well as (iii) final synthesis. This is also a major contribution of this work. To better explain this advantage, Figures 9a and 9b show two typical design flows when using (a) the proposed prototyping framework and (b) the Altera SDK for OpenCL respectively. The Altera SDK for OpenCL is chosen as the state of the art for mapping OpenCL kernels onto FPGAs. The design flow using the proposed framework (Figure 9a) starts with the prototyping procedure of Figure 8. The produced VP can be used in a typical design space exploration (DSE) and can be refined with more architectural details in later design stages. In a single DSE, the VP is enriched with timing/area/power annotations which are derived using High-Level Synthesis (HLS) [15]. The annotated VP is then compiled and simulated. During simulation, the computation and communication behavior are combined for timing and power estimation, while different execution scenarios are covered by using different input data.

Footnote 7: Private variables are finally implemented as registers.

TABLE I. OPENCL-IMPLEMENTED ALGORITHMS FOR THE EXPERIMENTATION SETUP.
Columns: Algorithm; Work-Items (Invoked, Available, Local); Input.
Rows (algorithm: input type): Pathfinder: matrix; BFS: node graph; Gaussian Elimination: matrix; Particle Filter: video, particles (a); Nearest Neighbor: records; Histogram: elements; MergeSort: elements; BucketSort: elements; Back-Propagation: input ANN (b).
(a) In 10-frame video. (b) Neural network with 64 inputs, 1 hidden layer with 16 neurons and 1 output.

As compared to the above flow, the design procedure using the Altera SDK for OpenCL (Figure 9b) starts with the kernel compilation, including HLS and RTL synthesis with Quartus. The result is a bitstream for programming an FPGA board, where the system is evaluated.
When using a typical DSE, the whole procedure is repeated after every parameter change. Also, there is no support for higher abstraction levels.

V. EXPERIMENTAL RESULTS
The proposed prototyping framework is evaluated by using the Rodinia benchmark suite [16]. Rodinia provides OpenCL-implemented algorithms, mainly focusing on GPU acceleration. However, the provided kernels can be mapped onto FPGAs as well. For the scope of this work, we used the benchmarks of Table I, which also shows the number of work-items in total (including the invoked and the available work-items in the VP) and locally (i.e. per work-group), as well as the application input size. In applications with a large number of invoked work-items, the kernels have been partially serialized, in order to both avoid excessive memory allocation and provide a more realistic model of the system under design. The rest of this section (i) provides a quantitative comparison between the Altera SDK for OpenCL (i.e. the state of the art) and the proposed framework in terms of compilation and application execution time; and (ii) analyzes the simulation time when using x86 or OVP-based hosts, while also evaluating the effect of separating the OVP simulator from the work-item platform. All the experiments have been executed on an Intel Core-i5 Quad-Core at 3.2 GHz running Fedora 23 with Linux kernel 4.4.

Altera SDK for OpenCL vs. Proposed Framework: The selected kernels have been mapped onto a Cyclone V device, included in an Altera DE1-SoC board, on which the applications have been run in order to measure the algorithm execution time. Meanwhile, the kernels have been prototyped in SystemC with the proposed framework and annotated in terms of timing and resource utilization by using Xilinx Vivado HLS (footnote 8). Afterwards, the SystemC models are simulated by using the x86 as host. The comparison results (footnote 9) are depicted in Figure 10.
The dominant part of the Altera-based flow is compilation, which is from 3× up to 18× slower than the proposed flow (including prototyping, annotation and simulation). Although the simulation time depends on the input volume, the proposed methodology enables the designer to perform a rapid evaluation by using a small amount of representative input data in early design stages, for fast decision making, while larger input data volumes can be utilized in later design stages. In contrast, this feature is not provided by Altera SDK for OpenCL: in case of a parameter modification, the designer has to wait more than 40 minutes (independently of the input data) until the kernels are (re-)implemented. Last but not least, there are cases (e.g. Back-Propagation) where the kernels do not fit into the FPGA fabric; in that case, Altera compilation fails.

8 Altera does not provide a standalone HLS tool. However, despite the use of different vendors, we intend to acquire typical execution time results only. Similar HLS run-time results are expected with the use of any other commercial HLS tool.
9 The board execution time is multiplied by 500 to be visible in the chart.

Fig. 10. Comparison between the compilation/execution time with Altera SDK for OpenCL and the prototyping/simulation time with the proposed prototyping framework.

Simulation Time Analysis: The aim of this analysis is to study how the application execution time is affected when using different hosts, namely an x86 host (i.e. the Intel Core-i5 CPU) and an ARM Cortex-A9 model, provided by OVP. For the second case, two scenarios are investigated: (i) separate platforms for the software simulation and the work-items (i.e. the proposed approach); and (ii) the use of a single TLM platform including the CPU model, memories and the work-items (i.e. the state-of-the-art prototyping approach). In the single-platform scenario, using a set of preliminary simulations, we have adjusted the time quantum of the OVP scheduler appropriately, so as to achieve the minimum possible simulation time overhead, according to the concept of Figure 4. For the separate-platform scenario, such adjustments are not necessary, which is initial evidence of the efficiency of the proposed approach.

Fig. 11. Simulation time comparison when using an x86 host and an OVP-based ARM host. For the OVP scenarios, we evaluate the use of separate VPs, as well as the use of a common OVP-based platform.

As Figure 11 depicts, the use of an OVP software simulator as a separate platform is able to leverage the high simulation speed provided by OVP without causing significant communication overhead between the CPU and the work-items, as it achieves simulation times similar to those of an x86 host. On the contrary, the use of a single platform may cause significant simulation time overhead, ranging from 10% up to 5×. The first reason, as explained in Figure 4, is that a single OVP-based platform performs quantum-based OVP scheduling, which may lead to significant time waste. The second reason is that a single platform deploys a constant number of work-items. On the contrary, in separate platforms, only the necessary work-items are allocated. Thus, if the host repeats a kernel with fewer invoked work-items, the simulation will be faster, as fewer components will be simulated.

VI. CONCLUSIONS

This paper presents a rapid prototyping framework, which automatically derives a SystemC-based VP from OpenCL sources, thus combining the portability of OpenCL with the abstracted modeling of Virtual Prototyping. The proposed framework supports different hardware architectures and memory models without the need for kernel modifications, while also enabling fast evaluation cycles, without long compilation procedures. In particular, the design flow which accompanies the proposed framework achieves evaluation time improvements of up to 18×, as compared to Altera SDK for OpenCL. The proposed framework also enables the use of any host CPU, which can be either an x86 CPU or a CPU software simulator. The host communicates with the VP through an inter-process communication mechanism, which also allows for the separation of a software simulator from the VP, thus leading to significant simulation time improvements reaching up to 5×.

REFERENCES

[1] OpenCL, by Khronos Group.
[2] (2013) Synopsys High-performance ASIC Prototyping Systems.
[3] J. Cong, M. Ghodrat, M. Gill, B. Grigorian, H. Huang, and G. Reinman, Composable accelerator-rich microprocessor enhanced for adaptivity and longevity, in IEEE International Symposium on Low Power Electronics and Design (ISLPED), Sept 2013, pp
[4] AMD.
[5] NVIDIA SDK.
[6] K. Komatsu, K. Sato, Y. Arai, K. Koyama, H. Takizawa, and H. Kobayashi, Evaluating performance and portability of OpenCL programs, in 5th Intl. Workshop on Automatic Performance Tuning, 2010, pp
[7] Altera SDK for OpenCL.
[8] Xilinx SDAccel.
[9] The Next Logical Step in C/C++, Programming, by Xcell Software Journal, Issue 1,
[10] Vivado Design Suite User Guide, UG902 (v2015.4), Nov. 24, manuals/xilinx2015 4/ug902- vivado-high-level-synthesis.pdf.
[11] Vista virtual prototyping, by Mentor Graphics.
[12] The OpenCL specifications, version
[13] Open Virtual Platforms website.
[14] clang: a C language family frontend for LLVM.
[15] E. Sotiriou-Xanthopoulos, S. Xydis, K. Siozios, G. Economakos, and D. Soudris, Effective platform-level exploration for heterogeneous multicores exploiting simulation-induced slacks, in PARMA-DITAM 14. New York, NY, USA: ACM, 2014, pp. 13:13-13:16.
[16] Rodinia: A benchmark suite for heterogeneous computing, version
More informationSIGGRAPH Briefing August 2014
Copyright Khronos Group 2014 - Page 1 SIGGRAPH Briefing August 2014 Neil Trevett VP Mobile Ecosystem, NVIDIA President, Khronos Copyright Khronos Group 2014 - Page 2 Significant Khronos API Ecosystem Advances
More informationINTRODUCTION TO FPGA ARCHITECTURE
3/3/25 INTRODUCTION TO FPGA ARCHITECTURE DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) a b a y 2input Black Box y b Functional Schematic a b y a b y a b y 2 Truth Table (AND) Truth Table (OR) Truth Table (XOR)
More informationSimplify System Complexity
1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller
More information