Characterization of OpenCL on a Scalable FPGA Architecture

Shanyuan Gao and Jeremy Chritz
Pico Computing, Inc.
{sgao, jchritz}@picocomputing.com

Abstract

The recent release of Altera's SDK for OpenCL has greatly eased the development of FPGA-based systems. Research has shown performance improvements from OpenCL on a single FPGA device. However, to meet the objectives of high performance computing, OpenCL needs to be evaluated on multiple FPGAs. This work proposes a scalable FPGA architecture for high performance computing. The design includes multiple FPGA modules and a high performance backplane. The modular nature of this architecture supports combining different FPGAs and allows easy hardware updates. FPGA modules based on Stratix V are compatible with Altera's OpenCL tool flow. The evaluation tests the native IO performance of the architecture, and the results demonstrate scalability using six FPGAs. The host-to-device peak bandwidth is measured as 13.1 GB/s for read operations and 12.1 GB/s for write operations. The FPGA-to-memory bandwidth is measured as 64.5 GB/s in total. An OpenCL AES kernel is selected to test the scalable multi-FPGA architecture. The test results show that peak throughput is achieved when six FPGAs are used. The throughput per watt shows a 5x improvement using four FPGAs over a general-purpose processor.

I. INTRODUCTION

FPGAs have long been recognized for the important role they play in high performance computing (HPC), owing primarily to their inherent parallelism and low power consumption [1]-[3]. However, the application of FPGAs in the HPC domain has historically been limited to those with deep expertise in hardware, firmware, operating systems, and HPC applications. The FPGA community is striving to improve productivity and ease of use in all of these aspects.
However, the many variables between the software application and the underlying FPGA device (defining layers of functionality, selecting communication protocols, developing algorithms, performing verification, and so on) mean that it still takes several engineers weeks or months to deliver a project and maintain it. Moreover, porting or upgrading a design from one FPGA device to another means repeating the whole process. This is not a productive or sustainable engineering model, and it must be changed. To this end, an ideal FPGA system for HPC applications should address these issues and bring:

- Short development time from high level description to system configuration (bitstream).
- Scalable multiple FPGA support to realize maximum parallelism.
- Easy upgrade of hardware devices.

OpenCL (Open Computing Language) is an open specification widely used in many HPC systems that leverages the parallelism inherent to heterogeneous computing devices. The OpenCL standard is supported by various vendors for different devices.

From a performance point of view:
a) These heterogeneous computing devices are capable of processing a large number of threads in parallel, while a general-purpose processor is only capable of a limited number of threads.
b) Threads on heterogeneous computing devices are commonly lightweight, while threads on the host are heavyweight. Consequently, the cost of context switching on computing devices is smaller than on the host processor.

From a software programmer's perspective:
a) The OpenCL code running on the computing device is easy to write, employing the same syntax as high level programming languages (for example, C).
b) The OpenCL code is portable from one computing device to another, so it can easily benefit from a hardware upgrade or a tradeoff evaluation.

Altera's OpenCL solution is selected in this work because the tools automate the build process and greatly reduce the overall development time.
However, Altera's current OpenCL solution only supports PCI Express Gen2 x8 with Configuration via Protocol (CvP) enabled. As most host machines now have Gen3 x16 and most Altera board vendors only have one or two FPGAs per board, at least half of the Gen3 x16 bandwidth is wasted.

To address the application of FPGAs in an HPC system, with all the issues described, in this work we have:

- Designed a scalable FPGA architecture that supports multiple FPGAs on a single backplane.
- Developed a solution using Altera's OpenCL on multiple FPGAs.
- Tested the performance of the architecture and experimented with the scalability of an OpenCL application.
- Calculated the power efficiency (throughput per watt).

The rest of the paper is organized as follows: Section II provides some OpenCL background and related work on OpenCL for FPGAs. Section III presents the design and implementation of the multi-FPGA architecture. Section IV shows the experimental results. Section V concludes this paper and describes future work.

II. BACKGROUND AND RELATED WORK

A. OpenCL

OpenCL [4] is an open standard that enables applications to be executed on various computing architectures, such as general purpose CPUs, GPGPUs, FPGAs, and other special processors. As a result, applications can benefit from the underlying hardware features without programmers spending a great deal of time learning hardware details.

OpenCL adopts the processor-accelerator concept, defining a host and a device in the platform. The application is divided into the host program and the kernel program. When the application starts, the host executes the host program and passes the computationally intensive workload to the device; the device is programmed on the fly using either kernel sources or binaries, executes the computationally intensive workload, and passes the result back to the host. The latency between the host and the device is often larger than that between the host and its memory. To overcome this, data should be sized so that transfers can be pipelined into the device.

The kernel program is a piece of computationally intensive code within the application, identified by the programmer through profiling. The kernel source is written in a C-like language. Hardware vendors often optimize their OpenCL implementation for their hardware; these optimizations can be applied by setting #pragma directives and attributes in the kernel code.

B. OpenCL on FPGA

Since the introduction of the OpenCL standard, there have been several attempts to implement OpenCL on FPGAs. However, some obstacles need to be addressed before implementing OpenCL on FPGAs.

C to Hardware Description Language (HDL): The OpenCL standard uses a C-like language to describe high level functions, whereas FPGAs use HDL to describe functions at the Register Transfer Level (RTL). Translating a high level language to HDL can be done by hand, but it requires expertise in HDL, and the process involves repeated development and verification cycles.
To speed up the translation, several high level synthesis tools help convert C to HDL code [5]-[8]. High level synthesis tools take only seconds or minutes to convert a C function to HDL code.

Integration: The generated HDL core will not work without a framework that provides the necessary control and data IO. For example, if the FPGA board sits in a PCI Express slot, the framework should set up a stable PCI Express link, calibrate the DDR memory, connect all necessary peripheral devices, and control the generated HDL core. The integration process requires at least basic FPGA knowledge of clocks, interfaces, and timing.

Building kernels: OpenCL has two ways of creating the kernel executables on the device: a) clCreateProgramWithSource, which compiles the kernel source and loads the executable at runtime; b) clCreateProgramWithBinary, which loads a pre-compiled binary of the executable. GPGPUs generally use the former method because the compile time is negligible (seconds). Because of place and route, FPGA tools take much longer (hours) to generate a configuration. Pre-compiling the design is therefore the only efficient way to create kernel executables.

Kernel reloading: Programming a configuration onto an FPGA erases the original configuration, which disconnects the physical link between the device and the host. Special engineering work is needed to keep the physical link alive without rebooting the host machine.

One can conclude from the above that, in order to run OpenCL on an FPGA, an ideal tool flow should automatically convert a kernel source into a system-level FPGA configuration, program the FPGA, and run the OpenCL application without disconnecting the physical link.

The initial exploration of OpenCL on FPGAs started with high level synthesis tools, which are designed to convert high level semantics such as C or C++ into HDL code [5]-[8]. Designers can direct the tools to create interfaces, utilize vendor primitives, or optimize hardware logic.
Some tools can even simulate and verify the design. High level synthesis tools greatly reduce the time needed to create HDL code. However, they still require designers to manually integrate the generated HDL code into the final system.

Cartwright et al. [9] have created FSM SYS Builder, which assembles IP components into a system. They map OpenCL APIs to hthread, a hardware-based micro OS in a system-on-chip (SoC) environment.

Owaida et al. [10] have focused on the compiler tools. They present an architectural synthesis tool called Silicon OpenCL, which can generate hardware accelerators and SoC systems from OpenCL programs. The evaluation used a single FPGA and a single static kernel.

The work in [11] has shifted the abstraction level. Using Convey's HC-1 platform, the onboard CPU is used as the compute device, while the four onboard FPGAs (Application Engines) are used as compute units. Kernels are replicated to test different configurations. Source-to-source translation is used to convert OpenCL to C source, which is then fed into Auto-ESL (now Xilinx Vivado HLS) to generate the HDL core. The final integration involves Convey's PDK framework and Xilinx ISE tools. The evaluation used four FPGAs and a static kernel.

Taneem [12], in his Master's thesis, has studied OpenCL as a unified programming framework for CPU, GPGPU, and FPGA. Specifically for FPGA, a static OpenCL framework with a kernel controller, host interface, and memory controller is proposed.

Fig. 1. Block diagram of Pico Computing's Architecture (host connected over x16 to the backplane PCI Express switch, which connects over x8 links to FPGA Module 0 through FPGA Module 5)

Fig. 2. M56 and EX7 backplane

Altera [13], [14] has introduced the industry's first FPGA support for OpenCL. An OpenCL framework similar to the one described in [12] is used for each FPGA board. Kernels are compiled into HDL code using Altera's proprietary tools and stitched into the framework. The kernel is built in a pipelined fashion to emulate parallel work items in progress. The middle layer library translates OpenCL APIs to FPGA transactions. With Altera's CvP, the host can program the FPGA on the fly without disconnecting the physical link.

III. DESIGN AND IMPLEMENTATION

A. Pico Computing's Architecture

We have previously designed several FPGA modules that have been adopted in many projects [15], [16]. The philosophy behind this design approach is to pack as many FPGAs as possible onto a single backplane, while providing the flexibility to change or upgrade the FPGA modules. As shown in Figure 1, the FPGA modules on the backplane are connected to the host through a central PCI Express switch and appear to the host as independent PCI Express devices. Depending on the physical dimensions of the FPGA module, one full-length backplane (312 mm) can carry up to six FPGA modules. Therefore, a single 4U server with eight PCI Express backplanes is able to carry as many as 48 FPGAs. Additionally, because the architecture is based on modules that snap onto the backplane, designers can explore different FPGA options, including FPGAs of different types, sizes, or vendors.

B. M56 Module and EX7 Backplane

The M56 module has a Stratix V A3 FPGA (5SGXA3E3H29C2), 8 GB of DDR3 memory, 256 Mb of EPCQ flash, and a JTAG connector. The GPIO connector provides 46 LVDS signal pairs and two sets of MGT pairs. On the back side of the module, a Samtec connector provides a PCI Express connection capable of Gen3 x8.
The M56 measures (mm), so six M56s can fit on a single full-length PCI Express backplane. With the heat sink installed, the M56 module and the backplane occupy double-slot depth. In this work, we have also designed the EX7 backplane, on which a PLX PEX8780 switch provides a PCI Express Gen3 x16 connection to the host. Figure 2 shows a photo of the EX7 with four M56 modules.

C. OpenCL Framework and Tool Flow

The Altera OpenCL flow uses the Altera Offline Compiler (aoc) to compile OpenCL kernel source into an HDL core and stitches the core into a firmware framework. The framework needs to be developed for each different FPGA board. We have developed our OpenCL framework based on Altera's reference design. As shown in Figure 3, within the framework, the PCI Express block communicates with the host, and the memory controller accesses the DDR3 memory. The blank area circled with a dotted line is where the compiled OpenCL kernels reside.

To fulfill the goal of dynamic configuration at runtime, Altera's CvP is used. The framework (shaded area) is constrained as a LogicLock region and exported as a framework partition. All OpenCL builds share the same framework partition, so a CvP update does not overwrite the framework partition and therefore does not affect the PCI Express link. The CvP function is currently only available with a PCI Express Gen2 interface. As such, the framework of the M56 module is configured with a Gen2 x8 interface.

On the host side, the application calls clCreateProgramWithBinary instead of clCreateProgramWithSource to load the generated configuration (aocx file) onto the FPGAs. The rest of the OpenCL flow in the application remains unchanged.

Fig. 3. OpenCL framework and tool flow (the kernel source and the framework, consisting of the PCI Express Gen2 x8 block and the DDR3 controller, are combined by aoc into an .aocx file)

TABLE I
RESOURCE UTILIZATION OF OPENCL FRAMEWORK

                    Logic   Registers   Memory   DSP blocks
Stratix V A
OpenCL Framework    19%     8%          2%       0%
Pico's Framework    11%     6%          12%      0%

D. AES application

To experiment with the architecture using OpenCL, the application is selected according to the following criteria:

- A well-known application: the application should be common and easy to find, so that the experiment is easy to repeat.
- Suitable for FPGA: the application should fit within the resources available on the FPGA.
- Scalability: the application should scale well with multiple FPGA devices.

In this work, an OpenCL AES kernel implementing the 256-bit AES algorithm in ECB mode is selected. The AES implementation is constructed as an engine (dynamic library) that can be linked into the OpenSSL framework. In the experiment, the host application generates a fixed amount (2 GB) of data and encrypts the data on one or more FPGAs. Throughput (encryption rate) is calculated by dividing the workload size by the execution time. The code was originally developed by Liu et al. [17] at Virginia Tech. With a couple of lines changed, the benchmark builds successfully with a single-line command and runs on the M56 module. Through experiments, the best throughput is achieved when the chunk size used in the pipeline is 256 MB or below.

It is understood that many modern processors have AES instructions built into the Instruction Set Architecture (ISA), which can achieve very high throughput. However, the experiments in this work do not target the AES operation itself, but general OpenCL applications that are not implemented as instructions in the ISA. Therefore, the same OpenCL-AES test is conducted on the CPU for reference.

IV. EVALUATION

A. Experimental Setup

In this work, the host system is set up with an Intel i7-4770K processor and 32 GB of DDR3 memory.
The PCI Express slot on the Intel DH87RL motherboard provides a Gen3 x16 connection. CentOS 6.5 with Linux is installed as the operating system. For reference purposes, Intel's OpenCL 1.2 Development Kit is installed. Altera Quartus 13.1 is used for developing OpenCL applications.

B. Native Performance

The first test measures the resource utilization of the OpenCL framework. To obtain the size of the OpenCL framework, a blank kernel file is used to build the design. The first row in Table I lists the resources available on the Stratix V A3 chip. The second row lists the resource usage reported by Altera's OpenCL tools. As a reference, the third row shows the resource usage of Pico Computing's framework. Pico Computing's framework is currently not compatible with Altera's OpenCL flow, but it has a similar infrastructure, including PCI Express, a DMA engine, an interconnect, and a memory controller. It can be observed that Altera's OpenCL framework uses more logic and registers than our framework.

The second test measures the bandwidth between the host machine and the FPGA devices. The read and write operations are from the perspective of the host. Note the PCI Express Gen3 x16 has a theoretical bandwidth of GB/s, while the theoretical bandwidth on the M56 with a PCI Express Gen2 x8 connection is 4 GB/s. In the test, six M56 modules are deployed on the EX7 backplane. According to [18], threading is not safe with OpenCL, which means multiple threads could corrupt the shared runtime address space. During the test, the Linux system call fork() is therefore used to create multiple processes, with each process accessing one FPGA device in its own independent address space. To verify that the transactions occur at the same time, the system time is reported in each process.
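The process-per-device pattern described above can be sketched as follows. This is a minimal illustration, not the authors' test harness: `run_on_device` is a hypothetical placeholder (here just a short sleep) for the real per-FPGA OpenCL work, and the function names are chosen for this example only. It relies on `os.fork`, so it assumes a Unix-like host such as the Linux system used in the paper.

```python
import os
import time


def run_on_device(device_index):
    """Hypothetical placeholder for the per-FPGA transfer or kernel run."""
    start = time.time()
    time.sleep(0.2)  # stands in for the real OpenCL work on one device
    return start, time.time()


def launch_one_process_per_device(num_devices):
    """Fork one child per device; each child reports its own start/end time."""
    children = []
    for index in range(num_devices):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: separate address space, handles one device only
            os.close(read_fd)
            start, end = run_on_device(index)
            os.write(write_fd, f"{start!r} {end!r}".encode())
            os.close(write_fd)
            os._exit(0)
        os.close(write_fd)  # parent keeps only the read end of the pipe
        children.append((index, pid, read_fd))

    timings = {}
    for index, pid, read_fd in children:
        payload = os.read(read_fd, 256).decode()
        os.close(read_fd)
        os.waitpid(pid, 0)
        start, end = (float(value) for value in payload.split())
        timings[index] = (start, end)
    return timings


def runs_overlap(timings):
    """True if all processes were active during one common time window."""
    latest_start = max(start for start, _ in timings.values())
    earliest_end = min(end for _, end in timings.values())
    return latest_start < earliest_end
```

Calling `launch_one_process_per_device(6)` and passing the result to `runs_overlap` mirrors the check described in the text: comparing the reported system times shows whether the per-device transactions actually ran concurrently.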

Fig. 4. Bandwidth between the host and FPGA devices (bandwidth in GB/s versus number of FPGA devices; series: Gen3 read, Gen3 write, Gen2 read, Gen2 write)

Fig. 5. Bandwidth between the FPGA devices and DDR3 memory (bandwidth in GB/s versus number of FPGA devices; series: ideal, measured)

In Figure 4, the blue lines (with circle and square marks) show the IO performance of multiple M56 modules on the EX7 backplane. The red lines (with asterisk and cross marks) show the same test using the same M56s on a Gen2 x16 backplane. The horizontal lines depict the theoretical bandwidth of Gen3 and Gen2, respectively. It can be seen that the IO bandwidth grows linearly on the Gen3 backplane when fewer than four M56 modules are used. When more than four M56 modules are used, the read bandwidth saturates around 13.1 GB/s; the write bandwidth grows more slowly and reaches 12.1 GB/s. Running the same IO test on a Gen2 backplane, four M56s saturate the IO bandwidth.

The third experiment tests the combined read and write bandwidth between the FPGA and the DDR3 memory. Each M56 module has 8 GB of DDR3 memory running at 800 MHz; the ideal bandwidth between the FPGA and the DDR3 memory is 12.8 GB/s. Figure 5 shows that the peak bandwidth grows linearly as more M56s are involved. The total bandwidth is 64.5 GB/s for six M56 modules.

C. AES on Single M56

During the AES application test, we have experimented with different optimization settings described in [19] on the AES encryption kernel. The attribute num_simd_work_items vectorizes the data path accessing the kernel, while num_compute_units duplicates the entire kernel. The num_simd_work_items attribute only accepts powers of 2, such as 2, 4, or 8, as input, while num_compute_units has no limit on the number of copies.

Table II shows the resource utilization, build time, and power consumption of the different attribute settings. The first row is the original code without optimization. The second and third rows (labeled SIMD 2 and SIMD 4) vectorize the data path by 2 and 4, respectively.
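The theoretical and ideal figures used in the native bandwidth tests of Section IV-B can be reproduced with a few lines of arithmetic. The sketch below is not from the paper: the per-lane rates and line encodings are the standard PCI Express values (5 GT/s with 8b/10b for Gen2, 8 GT/s with 128b/130b for Gen3), and the 64-bit DDR3 interface width and 800 MHz clock are assumptions chosen because they are what the quoted 12.8 GB/s ideal figure implies.

```python
def pcie_bandwidth_gbytes(gtransfers_per_s, payload_bits, line_bits, lanes):
    """Theoretical one-direction PCI Express bandwidth in GB/s."""
    usable_gbits_per_lane = gtransfers_per_s * payload_bits / line_bits
    return usable_gbits_per_lane * lanes / 8.0


def ddr3_bandwidth_gbytes(clock_mhz, bus_bytes):
    """Peak DDR3 bandwidth in GB/s: two transfers per clock cycle."""
    return clock_mhz * 1e6 * 2 * bus_bytes / 1e9


gen2_x8 = pcie_bandwidth_gbytes(5.0, 8, 10, 8)       # 8b/10b line encoding
gen3_x16 = pcie_bandwidth_gbytes(8.0, 128, 130, 16)  # 128b/130b line encoding
ddr3_module = ddr3_bandwidth_gbytes(800, 8)          # assumed 64-bit interface

# Measured peaks reported in the text, relative to these upper bounds.
read_utilization = 13.1 / gen3_x16          # host read peak on the Gen3 x16 link
memory_scaling = 64.5 / (6 * ddr3_module)   # six modules versus six ideal links
```

Under these assumptions, the Gen2 x8 link works out to the 4 GB/s quoted in the text, the measured read peak is roughly 83% of the Gen3 x16 bound, and the aggregate memory bandwidth is roughly 84% of six ideal DDR3 interfaces.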
The fourth and fifth rows (labeled COMP 2 and COMP 3) duplicate the kernel by 2 and 3, respectively. The tool fails to build when vectorizing the data path over 8 or duplicating over 4, because these optimizations require more resources than the Stratix V A3 can provide.

TABLE II
RESOURCE, BUILD TIME AND POWER OF OPENCL AES KERNEL

attr.      Logic   Reg.   Memory   DSP   Time      Power
Original   44%     12%    3%       0%    7 min     29.6 W
SIMD 2     62%     13%    29%      0%    162 min   29.9 W
SIMD 4     98%     14%    31%      0%    523 min   30.1 W
COMP 2     68%     15%    35%      0%    82 min    30.6 W
COMP 3     89%     17%    41%      0%    94 min    31.0 W

Fig. 6. Throughput of OpenCL-AES on M56 (throughput in Gb/s and throughput/power in Gb/s/watt for different OpenCL optimization attributes; bars: CPU, Orig., SIMD 2, SIMD 4, COMP 3, COMP 4)

Figure 6 presents the throughput (encryption rate) measured in the OpenCL-AES test. The first blue (dark color) bar shows the OpenCL throughput of the i7-4770K executing on all eight logical cores at 3.4 GHz. The remaining blue (dark color) bars show the OpenCL throughput of the M56 using different attribute settings. As shown in Figure 6, vectorizing the data path by 4, which consumes the most resources in Table II, achieves the highest throughput of 5.1 Gb/s, while the same test achieves 4.4 Gb/s on the i7-4770K.

During this test, a DC power supply was used to measure the current draw of the FPGA and the backplane. For the i7-4770K processor, the thermal design power (TDP) of 84 W from the datasheet [20] has been used in the calculation. However, the TDP is a nominal value, and the actual power of the i7-4770K running all eight logical cores at 3.4 GHz may be higher. The power consumption listed in the last column of Table II is calculated by multiplying the DC voltage and current. The green (light color) bars plotted in Figure 6 show throughput/power as the power efficiency result. The power efficiency of a single FPGA is 3x better than that of the i7-4770K.

D. AES on Multiple M56

Because the EX7 backplane can accommodate multiple FPGA modules, six M56 modules are used to run the OpenCL-AES test in parallel. In this test, all FPGAs use the AES SIMD 4 configuration. To access multiple FPGAs, the host application creates one process for each FPGA. The wall time is reported to verify that the computation occurs in parallel.

Fig. 7. Total throughput of OpenCL-AES on multiple M56 modules (throughput in Gb/s and throughput/power in Gb/s/watt versus number of FPGA devices)
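The single-FPGA power efficiency comparison from Section IV-C can be checked with a short calculation. This is only a sketch of the arithmetic, not the authors' script: it uses the reported throughputs (5.1 Gb/s for SIMD 4 on the M56, 4.4 Gb/s on the CPU) and the 84 W TDP, and it assumes a measured board power of about 30.1 W for the SIMD 4 build, in line with the other rows of Table II. The function name is illustrative.

```python
def gbits_per_watt(throughput_gbits, power_watts):
    """Power efficiency as throughput divided by power draw."""
    return throughput_gbits / power_watts


# Assumed SIMD 4 power of about 30.1 W (FPGA module plus backplane).
fpga_simd4 = gbits_per_watt(5.1, 30.1)

# The CPU figure uses the datasheet TDP, as in the text, not a measurement.
cpu_i7 = gbits_per_watt(4.4, 84.0)

improvement = fpga_simd4 / cpu_i7
```

With these inputs the ratio comes out slightly above 3, consistent with the 3x figure stated for a single FPGA. Because the CPU side uses the nominal TDP rather than a measured value, the true ratio could be somewhat higher.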

TABLE III
POWER CONSUMPTION OF MULTIPLE FPGAS (columns: # FPGAs, Power (W))

In Figure 7, the blue (dark color) bars show that the peak throughput grows with the number of FPGAs. The peak throughput reaches 19.8 Gb/s using six FPGAs. However, the performance growth slows down as more FPGAs are added, which is due to the overhead of managing multiple FPGAs. When five or six FPGAs are used, the throughput almost saturates due to the limited bandwidth of PCI Express Gen3 x16, which is similar to the result observed in Figure 4.

Table III lists the power consumed by the FPGAs and the backplane. In the first column, where the number of FPGAs is 0, the 17.9 W is consumed by the EX7 backplane alone. In Figure 7, the power efficiency (throughput/total power) is plotted as green (light color) bars. The peak power efficiency is achieved when four FPGAs are used, and is about 5x that of the i7-4770K. When more than four FPGAs are used, the power efficiency drops because the throughput gain per added FPGA drops.

V. CONCLUSION AND FUTURE WORK

In this work, a scalable FPGA architecture for high performance computing has been designed. The design includes multiple FPGA modules and a high performance backplane. The modular nature of this architecture supports combining different FPGAs and allows easy hardware updates. The FPGA module is based on Stratix V, which is compatible with Altera's OpenCL tool flow. The evaluation has tested the native IO performance, and the results have demonstrated linear scalability using six FPGAs. The host-to-device peak bandwidth is measured as 13.1 GB/s for read operations and 12.1 GB/s for write operations. The total FPGA-to-memory bandwidth is measured as 64.5 GB/s. The OpenCL-AES test results show that the peak throughput is achieved when six FPGA modules are used. Compared against a general-purpose processor, the throughput per watt shows a 3x improvement using a single FPGA and a 5x improvement using four FPGAs.
In the course of the experiments, several OpenCL benchmark suites [21]-[23] were evaluated on this architecture. However, some benchmark kernels fail to compile because these benchmarks target GPGPUs. Such instruction-based code can be executed easily on a GPGPU but can yield a large netlist that does not fit the M56. We plan to investigate the no-fit issue in the future.

For the multi-FPGA AES test, which constantly transfers data between the host and the devices, Figure 7 shows that the optimal number of FPGA modules on a Gen3 x16 backplane is four. The overall performance is bounded by the IO between the host and the devices. To investigate how six or more M56s will benefit OpenCL applications, we plan to evaluate other OpenCL applications that are more compute-bound on multiple EX7s in the future.

ACKNOWLEDGEMENT

We would like to thank Dr. Peter Yiannacouras from Altera for his help with the OpenCL flow.

REFERENCES

[1] M. Gokhale et al., "Splash: A reconfigurable linear logic array," in ICPP (1) '90, 1990.
[2] A. Krasnov et al., "RAMP Blue: A message-passing manycore system in FPGAs," in FPL '07, 2007.
[3] A. G. Schmidt et al., "An evaluation of an integrated on-chip/off-chip network for high-performance reconfigurable computing," Int. J. Reconfig. Comp., 2012.
[4] Khronos Group, "The OpenCL specification," October 2009.
[5] Xilinx, "Vivado High-Level Synthesis," Nov. 2013.
[6] Impulse Accelerated Technologies, Inc., Jul. 2014.
[7] J. Villarreal et al., "Designing modular hardware accelerators in C with ROCCC 2.0," in Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, May 2010.
[8] J. Tripp et al., "Trident: An FPGA compiler framework for floating-point algorithms," in Field Programmable Logic and Applications, 2005. International Conference on, Aug. 2005.
[9] E. Cartwright et al., "Creating HW/SW co-designed MPSoPCs from high level programming models," in High Performance Computing and Simulation (HPCS), 2011 International Conference on, July 2011.
[10] M. Owaida et al., "Synthesis of platform architectures from OpenCL programs," in Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, May 2011.
[11] P. Athanas, K. Kepa, and K. Shagrithaya, "Enabling development of OpenCL applications on FPGA platforms," in Proceedings of the 2013 IEEE 24th International Conference on Application-specific Systems, Architectures and Processors (ASAP), ser. ASAP '13, 2013.
[12] A. Taneem, "OpenCL framework for a CPU, GPU, and FPGA platform," Master's thesis, University of Toronto, 2011.
[13] T. Czajkowski et al., "From OpenCL to high-performance hardware on FPGAs," in Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, Aug. 2012.
[14] Altera, "Implementing FPGA Design with the OpenCL Standard," Nov. 2013, White Paper, URL: opencl.pdf.
[15] R. Kirchgessner et al., "VirtualRC: A virtual FPGA platform for applications and tools portability," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA '12, 2012.
[16] C. Olson et al., "Hardware acceleration of short read mapping," in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, April 2012.
[17] Z. Liu and A. R. M. Ganesh, "OpenCL-AES," Dec. 2011.
[18] Altera, "Altera SDK for OpenCL Programming Guide," Nov. 2013.
[19] Altera, "Altera SDK for OpenCL Optimization Guide," Nov. 2013.
[20] Intel, "Intel Core i7-4770K Processor," 2013.
[21] W. Feng et al., "OpenCL and the 13 dwarfs: A work in progress," in Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, ser. ICPE '12, 2012.
[22] S. Seo, G. Jo, and J. Lee, "Performance characterization of the NAS parallel benchmarks in OpenCL," in Workload Characterization (IISWC), 2011 IEEE International Symposium on, Nov. 2011.
[23] A. Danalis et al., "The scalable heterogeneous computing (SHOC) benchmark suite," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, ser. GPGPU '10. New York, NY, USA: ACM, 2010.


More information

The Convey HC-2 Computer. Architectural Overview. Convey White Paper

The Convey HC-2 Computer. Architectural Overview. Convey White Paper The Convey HC-2 Computer Architectural Overview Convey White Paper Convey White Paper The Convey HC-2 Computer Architectural Overview Contents 1 Introduction 1 Hybrid-Core Computing 3 Convey System Architecture

More information

Ten Reasons to Optimize a Processor

Ten Reasons to Optimize a Processor By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor

More information

Gedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort

Gedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort Technology White Paper The CHAMP-AV6 VPX-REDI Digital Signal Processing Card Maximizing Performance with Minimal Porting Effort Introduction The Curtiss-Wright Controls Embedded Computing CHAMP-AV6 is

More information

Cymric A Framework for Prototyping Near-Memory Architectures

Cymric A Framework for Prototyping Near-Memory Architectures A Framework for Prototyping Near-Memory Architectures Chad D. Kersey 1, Hyesoon Kim 2, Sudhakar Yalamanchili 1 The rest of the team: Nathan Braswell, Jemmy Gazhenko, Prasun Gera, Meghana Gupta, Hyojong

More information

EECS4201 Computer Architecture

EECS4201 Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Is There A Tradeoff Between Programmability and Performance?

Is There A Tradeoff Between Programmability and Performance? Is There A Tradeoff Between Programmability and Performance? Robert Halstead Jason Villarreal Jacquard Computing, Inc. Roger Moussalli Walid Najjar Abstract While the computational power of Field Programmable

More information

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

More information

Simplify Software Integration for FPGA Accelerators with OPAE

Simplify Software Integration for FPGA Accelerators with OPAE white paper Intel FPGA Simplify Software Integration for FPGA Accelerators with OPAE Cross-Platform FPGA Programming Layer for Application Developers Authors Enno Luebbers Senior Software Engineer Intel

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

To hear the audio, please be sure to dial in: ID#

To hear the audio, please be sure to dial in: ID# Introduction to the HPP-Heterogeneous Processing Platform A combination of Multi-core, GPUs, FPGAs and Many-core accelerators To hear the audio, please be sure to dial in: 1-866-440-4486 ID# 4503739 Yassine

More information

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA 1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors

Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors Paul Ekas, DSP Engineering, Altera Corp. pekas@altera.com, Tel: (408) 544-8388, Fax: (408) 544-6424 Altera Corp., 101

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

Did I Just Do That on a Bunch of FPGAs?

Did I Just Do That on a Bunch of FPGAs? Did I Just Do That on a Bunch of FPGAs? Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto About the Talk Title It s the measure

More information

World s most advanced data center accelerator for PCIe-based servers

World s most advanced data center accelerator for PCIe-based servers NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying

More information

LegUp: Accelerating Memcached on Cloud FPGAs

LegUp: Accelerating Memcached on Cloud FPGAs 0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are

More information

Requirements for Scalable Application Specific Processing in Commercial HPEC

Requirements for Scalable Application Specific Processing in Commercial HPEC Requirements for Scalable Application Specific Processing in Commercial HPEC Steven Miller Silicon Graphics, Inc. Phone: 650-933-1899 Email Address: scm@sgi.com Abstract: More and more High Performance

More information

Design and Implementation of High Performance DDR3 SDRAM controller

Design and Implementation of High Performance DDR3 SDRAM controller Design and Implementation of High Performance DDR3 SDRAM controller Mrs. Komala M 1 Suvarna D 2 Dr K. R. Nataraj 3 Research Scholar PG Student(M.Tech) HOD, Dept. of ECE Jain University, Bangalore SJBIT,Bangalore

More information

Virtualized SQL Server Performance and Scaling on Dell EMC XC Series Web-Scale Hyper-converged Appliances Powered by Nutanix Software

Virtualized SQL Server Performance and Scaling on Dell EMC XC Series Web-Scale Hyper-converged Appliances Powered by Nutanix Software Virtualized SQL Server Performance and Scaling on Dell EMC XC Series Web-Scale Hyper-converged Appliances Powered by Nutanix Software Dell EMC Engineering January 2017 A Dell EMC Technical White Paper

More information

Pactron FPGA Accelerated Computing Solutions

Pactron FPGA Accelerated Computing Solutions Pactron FPGA Accelerated Computing Solutions Intel Xeon + Altera FPGA 2015 Pactron HJPC Corporation 1 Motivation for Accelerators Enhanced Performance: Accelerators compliment CPU cores to meet market

More information

Exploring FPGA-specific Optimizations for Irregular OpenCL Applications

Exploring FPGA-specific Optimizations for Irregular OpenCL Applications Exploring FPGA-specific Optimizations for Irregular OpenCL Applications Mohamed W. Hassan 1, Ahmed E. Helal 1, Peter M. Athanas 1, Wu-Chun Feng 1,2, and Yasser Y. Hanafy 1 1 Electrical & Computer Engineering,

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

HES-7 ASIC Prototyping

HES-7 ASIC Prototyping Rev. 1.9 September 14, 2012 Co-authored by: Slawek Grabowski and Zibi Zalewski, Aldec, Inc. Kirk Saban, Xilinx, Inc. Abstract This paper highlights possibilities of ASIC verification using FPGA-based prototyping,

More information

Overview of ROCCC 2.0

Overview of ROCCC 2.0 Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment

More information

Meet the Increased Demands on Your Infrastructure with Dell and Intel. ServerWatchTM Executive Brief

Meet the Increased Demands on Your Infrastructure with Dell and Intel. ServerWatchTM Executive Brief Meet the Increased Demands on Your Infrastructure with Dell and Intel ServerWatchTM Executive Brief a QuinStreet Excutive Brief. 2012 Doing more with less is the mantra that sums up much of the past decade,

More information

Welcome. Altera Technology Roadshow 2013

Welcome. Altera Technology Roadshow 2013 Welcome Altera Technology Roadshow 2013 Altera at a Glance Founded in Silicon Valley, California in 1983 Industry s first reprogrammable logic semiconductors $1.78 billion in 2012 sales Over 2,900 employees

More information

Optimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd

Optimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd Optimizing ARM SoC s with Carbon Performance Analysis Kits ARM Technical Symposia, Fall 2014 Andy Ladd Evolving System Requirements Processor Advances big.little Multicore Unicore DSP Cortex -R7 Block

More information

Best Practices for Setting BIOS Parameters for Performance

Best Practices for Setting BIOS Parameters for Performance White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page

More information

Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA. Singh, A.K.; Kumar, A.; Srikanthan, Th.; Ha, Y.

Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA. Singh, A.K.; Kumar, A.; Srikanthan, Th.; Ha, Y. Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA. Singh, A.K.; Kumar, A.; Srikanthan, Th.; Ha, Y. Published in: Proceedings of the 2010 International Conference on Field-programmable

More information

S2C K7 Prodigy Logic Module Series

S2C K7 Prodigy Logic Module Series S2C K7 Prodigy Logic Module Series Low-Cost Fifth Generation Rapid FPGA-based Prototyping Hardware The S2C K7 Prodigy Logic Module is equipped with one Xilinx Kintex-7 XC7K410T or XC7K325T FPGA device

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

The rcuda middleware and applications

The rcuda middleware and applications The rcuda middleware and applications Will my application work with rcuda? rcuda currently provides binary compatibility with CUDA 5.0, virtualizing the entire Runtime API except for the graphics functions,

More information

AMD Opteron Processors In the Cloud

AMD Opteron Processors In the Cloud AMD Opteron Processors In the Cloud Pat Patla Vice President Product Marketing AMD DID YOU KNOW? By 2020, every byte of data will pass through the cloud *Source IDC 2 AMD Opteron In The Cloud October,

More information

PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern FPGA-based Computing

PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern FPGA-based Computing Py: Yet Another Implementation of Memory Architecture for Modern FPGA-based Computing Shinya Kenji Kise Takamaeda-Yamazaki Tokyo Institute of Technology Tokyo Institute of Technology Tokyo, Japan 152-8552

More information

High-Level Synthesis Techniques for In-Circuit Assertion-Based Verification

High-Level Synthesis Techniques for In-Circuit Assertion-Based Verification High-Level Synthesis Techniques for In-Circuit Assertion-Based Verification John Curreri Ph.D. Candidate of ECE, University of Florida Dr. Greg Stitt Assistant Professor of ECE, University of Florida April

More information

ESE Back End 2.0. D. Gajski, S. Abdi. (with contributions from H. Cho, D. Shin, A. Gerstlauer)

ESE Back End 2.0. D. Gajski, S. Abdi. (with contributions from H. Cho, D. Shin, A. Gerstlauer) ESE Back End 2.0 D. Gajski, S. Abdi (with contributions from H. Cho, D. Shin, A. Gerstlauer) Center for Embedded Computer Systems University of California, Irvine http://www.cecs.uci.edu 1 Technology advantages

More information

FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Hybrid Threading: A New Approach for Performance and Productivity

Hybrid Threading: A New Approach for Performance and Productivity Hybrid Threading: A New Approach for Performance and Productivity Glen Edwards, Convey Computer Corporation 1302 East Collins Blvd Richardson, TX 75081 (214) 666-6024 gedwards@conveycomputer.com Abstract:

More information

OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch

OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch Motivation Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 OpenMP FPGA Device

More information

ALTERA FPGAs Architecture & Design

ALTERA FPGAs Architecture & Design ALTERA FPGAs Architecture & Design Course Description This course provides all theoretical and practical know-how to design programmable devices of ALTERA with QUARTUS-II design software. The course combines

More information

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS Embedded System System Set of components needed to perform a function Hardware + software +. Embedded Main function not computing Usually not autonomous

More information

Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective

Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective Nathan Woods XtremeData FPGA 2007 Outline Background Problem Statement Possible Solutions Description

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

ReconOS: Multithreaded Programming and Execution Models for Reconfigurable Hardware

ReconOS: Multithreaded Programming and Execution Models for Reconfigurable Hardware ReconOS: Multithreaded Programming and Execution Models for Reconfigurable Hardware Enno Lübbers and Marco Platzner Computer Engineering Group University of Paderborn {enno.luebbers, platzner}@upb.de Outline

More information

Four-Socket Server Consolidation Using SQL Server 2008

Four-Socket Server Consolidation Using SQL Server 2008 Four-Socket Server Consolidation Using SQL Server 28 A Dell Technical White Paper Authors Raghunatha M Leena Basanthi K Executive Summary Businesses of all sizes often face challenges with legacy hardware

More information

Simplify System Complexity

Simplify System Complexity 1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller

More information

Dell EMC PowerEdge R740xd as a Dedicated Milestone Server, Using Nvidia GPU Hardware Acceleration

Dell EMC PowerEdge R740xd as a Dedicated Milestone Server, Using Nvidia GPU Hardware Acceleration Dell EMC PowerEdge R740xd as a Dedicated Milestone Server, Using Nvidia GPU Hardware Acceleration Dell IP Video Platform Design and Calibration Lab June 2018 H17250 Reference Architecture Abstract This

More information

Minimizing Thermal Variation in Heterogeneous HPC System with FPGA Nodes

Minimizing Thermal Variation in Heterogeneous HPC System with FPGA Nodes Minimizing Thermal Variation in Heterogeneous HPC System with FPGA Nodes Yingyi Luo, Xiaoyang Wang, Seda Ogrenci-Memik, Gokhan Memik, Kazutomo Yoshii, Pete Beckman @ICCD 2018 Motivation FPGAs in data centers

More information

ECE 486/586. Computer Architecture. Lecture # 2

ECE 486/586. Computer Architecture. Lecture # 2 ECE 486/586 Computer Architecture Lecture # 2 Spring 2015 Portland State University Recap of Last Lecture Old view of computer architecture: Instruction Set Architecture (ISA) design Real computer architecture:

More information

Xilinx Vivado/SDK Tutorial

Xilinx Vivado/SDK Tutorial Xilinx Vivado/SDK Tutorial (Laboratory Session 1, EDAN15) Flavius.Gruian@cs.lth.se March 21, 2017 This tutorial shows you how to create and run a simple MicroBlaze-based system on a Digilent Nexys-4 prototyping

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

Energy scalability and the RESUME scalable video codec

Energy scalability and the RESUME scalable video codec Energy scalability and the RESUME scalable video codec Harald Devos, Hendrik Eeckhaut, Mark Christiaens ELIS/PARIS Ghent University pag. 1 Outline Introduction Scalable Video Reconfigurable HW: FPGAs Implementation

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Employing Multi-FPGA Debug Techniques

Employing Multi-FPGA Debug Techniques Employing Multi-FPGA Debug Techniques White Paper Traditional FPGA Debugging Methods Debugging in FPGAs has been difficult since day one. Unlike simulation where designers can see any signal at any time,

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU

More information

Five Ways to Build Flexibility into Industrial Applications with FPGAs

Five Ways to Build Flexibility into Industrial Applications with FPGAs GM/M/A\ANNETTE\2015\06\wp-01154- flexible-industrial.docx Five Ways to Build Flexibility into Industrial Applications with FPGAs by Jason Chiang and Stefano Zammattio, Altera Corporation WP-01154-2.0 White

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

CASE STUDY: Using Field Programmable Gate Arrays in a Beowulf Cluster

CASE STUDY: Using Field Programmable Gate Arrays in a Beowulf Cluster CASE STUDY: Using Field Programmable Gate Arrays in a Beowulf Cluster Mr. Matthew Krzych Naval Undersea Warfare Center Phone: 401-832-8174 Email Address: krzychmj@npt.nuwc.navy.mil The Robust Passive Sonar

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

An FPGA-Based Optical IOH Architecture for Embedded System

An FPGA-Based Optical IOH Architecture for Embedded System An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing

More information

What is PXImc? By Chetan Kapoor, PXI Product Manager National Instruments

What is PXImc? By Chetan Kapoor, PXI Product Manager National Instruments What is PXImc? By Chetan Kapoor, PXI Product Manager National Instruments Overview Modern day test and control systems are growing larger, more complex and more intricate. Most of these intricacies are

More information

ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT

ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT THE FREE AND OPEN RISC INSTRUCTION SET ARCHITECTURE Codasip is the leading provider of RISC-V processor IP Codasip Bk: A portfolio of RISC-V processors Uniquely

More information

SimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels

SimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels SimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels Ahmed Sanaullah, Chen Yang, Daniel Crawley and Martin C. Herbordt Department of Electrical and Computer Engineering, Boston University The Intel

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

International IEEE Symposium on Field-Programmable Custom Computing Machines

International IEEE Symposium on Field-Programmable Custom Computing Machines - International IEEE Symposium on ield-programmable Custom Computing Machines Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Bandwidth Kentaro Sano Yoshiaki Hatsuda

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Simplify System Complexity

Simplify System Complexity Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint

More information

Axiomtek Broadwell-U Embedded Board & SoM White Paper

Axiomtek Broadwell-U Embedded Board & SoM White Paper Axiomtek Broadwell-U Embedded Board & SoM White Paper Copyright 2015 Axiomtek Co., Ltd. All Rights Reserved Axiomtek s embedded board and system-on-module utilizing the latest 5th generation Intel Core

More information

"On the Capability and Achievable Performance of FPGAs for HPC Applications"

On the Capability and Achievable Performance of FPGAs for HPC Applications "On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant

More information

A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs

A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs Harrys Sidiropoulos, Kostas Siozios and Dimitrios Soudris School of Electrical & Computer Engineering National

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

10 Steps to Virtualization

10 Steps to Virtualization AN INTEL COMPANY 10 Steps to Virtualization WHEN IT MATTERS, IT RUNS ON WIND RIVER EXECUTIVE SUMMARY Virtualization the creation of multiple virtual machines (VMs) on a single piece of hardware, where

More information