Characterization of OpenCL on a Scalable FPGA Architecture
Shanyuan Gao and Jeremy Chritz
Pico Computing, Inc.
{sgao, jchritz}@picocomputing.com

Abstract

The recent release of Altera's SDK for OpenCL has greatly eased the development of FPGA-based systems. Research has shown performance improvements brought by OpenCL on a single FPGA device. However, to meet the objectives of high performance computing, OpenCL needs to be evaluated on multiple FPGAs. This work proposes a scalable FPGA architecture for high performance computing. The design includes multiple FPGA modules and a high performance backplane. The modular nature of this architecture supports the combination of different FPGAs and allows for easy hardware updates. FPGA modules based on Stratix V are compatible with Altera's OpenCL tool flow. The evaluation tests the native IO performance of the architecture, and the results demonstrate scalability using six FPGAs. The host-to-device peak bandwidth is measured as 13.1 GB/s for read operations and 12.1 GB/s for write operations. The FPGA-to-memory bandwidth is measured as 64.5 GB/s in total. An OpenCL AES kernel is selected to test the scalable multi-FPGA architecture. The test results show that peak throughput is achieved when six FPGAs are used. The throughput per watt shows a 5× improvement using four FPGAs over a general-purpose processor.

I. INTRODUCTION

FPGAs have long been recognized for the important role they play in high performance computing (HPC), owing primarily to their inherent parallelism and low power consumption [1]-[3]. However, the application of FPGAs in the HPC domain has historically been limited to those with deep expertise in hardware, firmware, operating systems, and HPC applications. The FPGA community is striving to improve productivity and ease of use in all of these aspects.
Yet the work that sits between the software application and the underlying FPGA device, such as defining layers of functionality, selecting communication protocols, developing algorithms, and performing verification, still takes several engineers weeks or months to deliver and maintain. Moreover, porting or upgrading a design from one FPGA device to another means repeating this process. This is not a productive or sustainable engineering model, and it must change.

To this end, an ideal FPGA system for HPC applications should address these issues and provide:
- Short development time from high-level description to system configuration (bitstream).
- Scalable multi-FPGA support to realize maximum parallelism.
- Easy upgrade of hardware devices.

OpenCL (Open Computing Language) is an open specification widely used in HPC systems that leverages the parallelism inherent to heterogeneous computing devices. The OpenCL standard is supported by various vendors for different devices.

From a performance point of view:
(a) Heterogeneous computing devices can process a large number of threads in parallel, while a general-purpose processor can run only a limited number of threads.
(b) Threads on heterogeneous computing devices are commonly lightweight, while threads on the host are heavyweight. Consequently, context switching costs less on a computing device than on the host processor.

From a software programmer's perspective:
(a) OpenCL code running on the computing device is easy to write, because it uses the same syntax as high-level programming languages such as C.
(b) OpenCL code is portable from one computing device to another, so it can easily benefit from a hardware upgrade or be used for tradeoff evaluation.

Altera's OpenCL solution is selected in this work because its tools automate the build process and greatly reduce overall development time.
However, Altera's current OpenCL solution only supports PCI Express Gen2 x8 with Configuration via Protocol (CvP) enabled. As most host machines now have Gen3 x16 slots and most Altera board vendors place only one or two FPGAs on a board, at least half of the Gen3 x16 bandwidth is wasted.

To address the use of FPGAs in an HPC system in light of the issues described above, in this work we have:
- Designed a scalable FPGA architecture that supports multiple FPGAs on a single backplane.
- Developed a solution using Altera's OpenCL on multiple FPGAs.
- Tested the performance of the architecture and experimented with the scalability of an OpenCL application.
- Calculated the power efficiency (throughput per watt).

The rest of the paper is organized as follows: Section II provides OpenCL background and reviews related work on OpenCL for FPGAs. Section III presents the design and implementation of the multi-FPGA architecture. Section IV shows the experimental results. Section V concludes the paper and describes future work.
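The mismatch described in the introduction, Gen2 x8 devices sitting behind a Gen3 x16 host slot, can be checked with a short calculation. The sketch below is illustrative only: the per-lane rates and encoding efficiencies are the standard PCI Express figures rather than values taken from this paper, and the function name is hypothetical. Packet and protocol overhead are ignored, so real transfers land below these numbers.

```python
# Per-lane raw rate (GT/s) and line-code efficiency for each PCI Express generation.
# These are standard PCIe figures (an assumption of this sketch), not values from the paper.
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),     # Gen2: 5 GT/s with 8b/10b encoding
    3: (8.0, 128 / 130),  # Gen3: 8 GT/s with 128b/130b encoding
}


def pcie_bandwidth_gbytes(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    rate_gt, efficiency = PCIE_GENERATIONS[gen]
    return rate_gt * efficiency * lanes / 8  # divide by 8 to convert bits to bytes


gen2_x8 = pcie_bandwidth_gbytes(2, 8)
gen3_x16 = pcie_bandwidth_gbytes(3, 16)

print(f"Gen2 x8 : {gen2_x8:.2f} GB/s")
print(f"Gen3 x16: {gen3_x16:.2f} GB/s")
print(f"Gen2 x8 links needed to fill Gen3 x16: {gen3_x16 / gen2_x8:.1f}")
```

Under these assumptions a Gen2 x8 link offers 4 GB/s and a Gen3 x16 slot a little under 16 GB/s, so roughly four Gen2 x8 modules are needed to fill the host slot, and a board with one or two FPGAs leaves about half of it or more idle.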
II. BACKGROUND AND RELATED WORK

A. OpenCL

OpenCL [4] is an open standard that enables applications to run on various computing architectures, such as general-purpose CPUs, GPGPUs, FPGAs, and other special-purpose processors. As a result, applications can benefit from the underlying hardware features without programmers spending a great deal of time learning hardware details.

OpenCL adopts the processor-accelerator concept, defining a host and a device in the platform. The application is divided into the host program and the kernel program. When the application starts, the host executes the host program and passes the computationally intensive workload to the device; the device is programmed on the fly from either kernel sources or binaries, executes the workload, and passes the result back to the host. The latency between the host and the device is often larger than that between the host and its own memory. To overcome this, data should be sized so that it can be pipelined into the device.

The kernel program is a piece of computationally intensive code within the application, identified by the programmer through profiling. The kernel source is written in a C-like language. Hardware vendors often optimize their OpenCL implementations for their hardware; these optimizations are applied by setting #pragma directives and attributes in the kernel code.

B. OpenCL on FPGA

Since the introduction of the OpenCL standard, there have been several attempts to implement OpenCL on FPGAs. However, several obstacles need to be addressed first.

C to Hardware Description Language (HDL): The OpenCL standard uses a C-like language to describe high-level functions, whereas FPGAs use HDL to describe functions at the Register Transfer Level (RTL). Translating a high-level language to HDL can be done by hand, but it requires expertise in HDL, and the process involves repeated development and verification cycles.
To speed up the translation, several high-level synthesis tools help convert C to HDL code [5]-[8]. These tools take only seconds or minutes to convert a C function to HDL code.

Integration: The generated HDL core will not work without a framework that provides the necessary control and data IO. For example, if the FPGA board sits in a PCI Express slot, the framework should set up a stable PCI Express link, calibrate the DDR memory, connect all necessary peripheral devices, and control the generated HDL core. The integration process requires at least basic FPGA knowledge of clocks, interfaces, and timing.

Building kernel: OpenCL has two ways of creating the kernel executables on the device: (a) clCreateProgramWithSource, which compiles the kernel source and loads the executable at runtime; (b) clCreateProgramWithBinary, which loads a pre-compiled binary of the kernel executable. GPGPUs generally use the former method because the compile time is negligible (seconds). Because of place-and-route time, FPGA tools take much longer (hours) to generate a configuration. Pre-compiling the design is therefore the only efficient way to create kernel executables.

Kernel reloading: Programming a new configuration onto an FPGA erases the original configuration, which disconnects the physical link between the device and the host. Special engineering work is needed to keep the physical link alive without rebooting the host machine.

It follows that, in order to run OpenCL on an FPGA, an ideal tool flow should automatically convert a kernel source into a system-level FPGA configuration, program the FPGA, and run the OpenCL application without disconnecting the physical link.

The initial exploration of OpenCL on FPGAs started with high-level synthesis tools, which are designed to convert high-level semantics such as C or C++ into HDL code [5]-[8]. Designers can direct the tools to create interfaces, use vendor primitives, or optimize hardware logic.
Some tools can even simulate and verify the design. High-level synthesis tools greatly reduce the time needed to create HDL code. However, they still require designers to manually integrate the generated HDL code into the final system.

Cartwright et al. [9] created FSM SYS Builder, which assembles IP components into a system. They follow the OpenCL standard by mapping OpenCL APIs to hthread, a hardware-based micro OS in a system-on-chip (SoC) environment.

Owaida et al. [10] focused on the compiler tools. They present an architectural synthesis tool called Silicon OpenCL, which can generate hardware accelerators and SoC systems from OpenCL programs. The evaluation used a single FPGA and a single static kernel.

The work in [11] shifted the abstraction level. Using Convey's HC-1 platform, the onboard CPU is used as the compute device, while the four onboard FPGAs (Application Engines) are used as compute units. Kernels are replicated to test different configurations. Source-to-source translation converts the OpenCL kernel to C source, which is then fed into AutoESL (now Xilinx Vivado HLS) to generate the HDL core. The final integration involves Convey's PDK framework and the Xilinx ISE tools. The evaluation used four FPGAs and a static kernel.

Taneem [12], in his master's thesis, studied OpenCL as a unified programming framework for CPU, GPGPU, and FPGA. Specifically for FPGAs, a static OpenCL framework with a kernel controller, host interface, and memory controller is proposed.
Fig. 1. Block diagram of Pico Computing's architecture (host, x16 link, PCI Express switch on the backplane, x8 links, FPGA modules).

Fig. 2. M56 and EX7 backplane.

Altera [13], [14] introduced the industry's first FPGA support for OpenCL. An OpenCL framework similar to the one described in [12] is used for each FPGA board. Kernels are compiled into HDL code using Altera's proprietary tools and stitched into the framework. The kernel is built in a pipelined fashion to emulate parallel work items in progress. A middle-layer library translates OpenCL APIs into FPGA transactions. With Altera's CvP, the host can program the FPGA on the fly without disconnecting the physical link.

III. DESIGN AND IMPLEMENTATION

A. Pico Computing's Architecture

We have previously designed several FPGA modules that have been adopted in many projects [15], [16]. The philosophy behind this design approach is to pack as many FPGAs as possible onto a single backplane while retaining the flexibility to change or upgrade the FPGA modules. As shown in Figure 1, FPGA modules on the backplane are connected to the host through a central PCI Express switch and appear to the host as independent PCI Express devices. Depending on the physical dimensions of the FPGA module, one full-length backplane (312 mm) can carry up to six FPGA modules. A single 4U server with eight PCI Express backplanes can therefore carry as many as 48 FPGAs. Additionally, because the architecture is based on modules that snap onto the backplane, designers can explore different FPGA options, including FPGAs of different types, sizes, or vendors.

B. M56 Module and EX7 Backplane

The M56 module has a Stratix V A3 FPGA (5SGXA3E3H29C2), 8 GB of DDR3 memory, 256 Mb of EPCQ flash, and a JTAG connector. The GPIO connector provides 46 LVDS signal pairs and two sets of MGT pairs. On the back side of the module, a Samtec connector provides a PCI Express connection capable of Gen3 x8.
The M56 measures (mm), so six M56s can fit on a single full-length PCI Express backplane. With the heat sink installed, the M56 module and the backplane occupy a double-slot depth. In this work, we have also designed the EX7 backplane, on which a PLX PEX8780 switch provides a PCI Express Gen3 x16 connection to the host. Figure 2 shows a photo of the EX7 with four M56 modules.

C. OpenCL Framework and Tool Flow

The Altera OpenCL flow uses the Altera Offline Compiler (aoc) to compile OpenCL kernel source into an HDL core and to stitch the core into a firmware framework. The framework needs to be developed separately for each FPGA board. We have developed the OpenCL framework for the M56 based on Altera's reference design. As shown in Figure 3, within the framework, the PCI Express block communicates with the host, and the memory controller accesses the DDR3 memory. The blank area circled with a dotted line is where compiled OpenCL kernels reside.

To support dynamic configuration at runtime, Altera's CvP is used. The framework (shaded area) is constrained as a LogicLock region and exported as a framework partition. All OpenCL kernel builds share the same framework partition, so a CvP update does not overwrite the framework partition and therefore does not affect the PCI Express link. The CvP function is currently available only with a PCI Express Gen2 interface. As such, the framework of the M56 module is configured with a Gen2 x8 interface.

On the host side, the application calls clCreateProgramWithBinary instead of clCreateProgramWithSource to load the generated configuration (aocx file) onto the FPGAs. The rest of the OpenCL flow in the application remains unchanged.
Fig. 3. OpenCL framework and tool flow (kernel source plus the framework, with its PCI Express Gen2 x8 block and DDR3 controller, compiled by aoc into a .aocx file).

TABLE I
RESOURCE UTILIZATION OF OPENCL FRAMEWORK

                    Logic   Registers   Memory   DSP blocks
Stratix V A3
OpenCL Framework    19%     8%          2%       0%
Pico's Framework    11%     6%          12%      0%

D. AES application

To experiment with the architecture using OpenCL, the application is selected according to the following criteria:
- A well-known application: the application should be common and easy to find, so that the experiment is easy to repeat.
- Suitable for FPGA: the application should fit within the resources available on the FPGA.
- Scalability: the application should scale well across multiple FPGA devices.

In this work, an AES OpenCL kernel implementing the AES 256-bit algorithm in ECB mode is selected. The AES kernel is constructed as an engine (dynamic library) that can be linked into the OpenSSL framework. In the experiment, the host application generates a fixed amount (2 GB) of data and encrypts the data on one or more FPGAs. Throughput (encryption rate) is calculated by dividing the workload by the execution time. The code was originally developed by Liu et al. [17] at Virginia Tech. With changes to a couple of lines, the benchmark is successfully built with a single-line command and run on the M56 module. Experiments show that the best throughput is achieved when the chunk size used in the pipeline is 256 MB or below.

Many modern processors have AES instructions built into the Instruction Set Architecture (ISA), which can achieve very high throughput. However, the experiments in this work do not target the AES operation itself, but general OpenCL applications that are not implemented as instructions in the ISA. Therefore, the same OpenCL-AES test is run on the CPU for reference.

IV. EVALUATION

A. Experimental Setup

In this work, the host system is set up with an Intel i7-4770K processor and 32 GB of DDR3 memory.
The PCI Express slot on the Intel DH87RL motherboard provides a Gen3 x16 connection. CentOS 6.5 with Linux is installed as the operating system. For reference purposes, Intel's OpenCL 1.2 Development Kit is installed. Altera Quartus 13.1 is used for developing OpenCL applications.

B. Native Performance

The first test measures the resource utilization of the OpenCL framework. To obtain the size of the OpenCL framework, a blank kernel file is used to build the design. The first row in Table I lists the resources available on the Stratix V A3 chip. The second row lists the resource usage reported by Altera's OpenCL tools. As a reference, the third row shows the resource usage of Pico Computing's framework. Pico Computing's framework is currently not compatible with Altera's OpenCL flow, but it has a similar infrastructure, including PCI Express, a DMA engine, an interconnect, and a memory controller. Altera's OpenCL framework occupies more logic and registers than our framework.

The second test measures the bandwidth between the host machine and the FPGA devices. Read and write operations are described from the perspective of the host. Note that PCI Express Gen3 x16 has a theoretical bandwidth of GB/s, while the theoretical bandwidth of the M56 with its PCI Express Gen2 x8 connection is 4 GB/s. In the test, six M56 modules are deployed on the EX7 backplane. According to [18], threading is not safe with OpenCL, which means multiple threads could corrupt the shared runtime address space. The test therefore uses the Linux system call fork() to create multiple processes, with each process accessing one FPGA device in its own address space. To confirm that the transactions occur at the same time, each process reports the system time.
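The process-per-device approach described above can be sketched as follows. This is a minimal illustration assuming a POSIX host: `run_on_device` is a hypothetical placeholder that sleeps instead of driving an FPGA, and the pipe-based reporting is just one possible way to collect timestamps, not necessarily what the authors used. Each child records its start and end wall time, and the parent checks that all intervals share a common instant.

```python
import os
import time


def run_on_device(device_index: int) -> tuple:
    """Hypothetical stand-in for per-device work; a real host would drive one FPGA here."""
    start = time.time()
    time.sleep(0.2)  # placeholder for the transfer or encryption workload
    return (start, time.time())


def launch(num_devices: int) -> list:
    """Fork one child per device and collect each child's (start, end) wall time."""
    children = []
    for dev in range(num_devices):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child process: separate address space, one device
            os.close(read_fd)
            start, end = run_on_device(dev)
            os.write(write_fd, f"{start!r} {end!r}".encode())
            os._exit(0)  # leave immediately without running parent cleanup
        os.close(write_fd)  # parent keeps only the read end
        children.append((pid, read_fd))

    intervals = []
    for pid, read_fd in children:
        with os.fdopen(read_fd) as pipe:
            start, end = map(float, pipe.read().split())
        os.waitpid(pid, 0)
        intervals.append((start, end))
    return intervals


def all_overlap(intervals: list) -> bool:
    """True when every interval shares a common instant, i.e. the runs were concurrent."""
    return max(s for s, _ in intervals) < min(e for _, e in intervals)


intervals = launch(4)
print("ran in parallel:", all_overlap(intervals))
```

Separate processes are used here for the same reason given in the text: each child gets its own address space, so nothing in a runtime that is not thread safe is shared between devices.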
Fig. 4. Bandwidth between the host and FPGA devices (bandwidth in GB/s versus number of FPGA devices; series: Gen3 read, Gen3 write, Gen2 read, Gen2 write).

Fig. 5. Bandwidth between the FPGA devices and DDR3 memory (bandwidth in GB/s versus number of FPGA devices; series: ideal, measured).

In Figure 4, the blue lines (with circle and square marks) show the IO performance of multiple M56 modules on the EX7 backplane. The red lines (with asterisk and cross marks) show the same test using the same M56s on a Gen2 x16 backplane. The horizontal lines depict the theoretical bandwidth of Gen3 and Gen2, respectively. The IO bandwidth grows linearly on the Gen3 backplane when fewer than four M56 modules are used. When more than four M56 modules are used, the read bandwidth saturates around 13.1 GB/s; the write bandwidth grows more slowly and reaches 12.1 GB/s. In the same IO test on a Gen2 backplane, four M56s saturate the IO bandwidth.

The third experiment tests the combined read and write bandwidth between the FPGA and DDR3 memory. Each M56 module has 8 GB of DDR3 memory running at 800 MHz; the ideal bandwidth between the FPGA and DDR3 memory is 12.8 GB/s. Figure 5 shows that the peak bandwidth grows linearly as more M56s are involved. The total bandwidth is 64.5 GB/s for six M56 modules.

C. AES on Single M56

During the AES application test, we experimented with different optimization settings described in [19] on the AES encryption kernel. The attribute num_simd_work_items vectorizes the data path accessing the kernel, while num_compute_units duplicates the entire kernel. The num_simd_work_items attribute accepts only powers of 2, such as 2, 4, or 8, while num_compute_units places no limit on the number of copies.

Table II shows the resource utilization, build time, and power consumption of the different attribute settings. The first row is the original code without optimization. The second and third rows (labeled SIMD 2 and SIMD 4) vectorize the data path by 2 and 4, respectively.
The fourth and fifth rows (labeled COMP 2 and COMP 3) duplicate the kernel by 2 and 3, respectively. The tool fails to build when vectorizing the data path by 8 or more or duplicating the kernel 4 or more times, because these optimizations require more resources than the Stratix V A3 can provide.

Figure 6 presents the throughput (encryption rate) measured in the OpenCL-AES test. The first blue (dark color) bar shows the OpenCL throughput of the i7-4770K executing on all eight logical cores at 3.4 GHz. The remaining blue (dark color) bars show the OpenCL throughput of the M56 using different attribute settings. As shown in Figure 6, vectorizing the data path by 4,
which consumes the most resources in Table II, achieves the highest throughput of 5.1 Gb/s, while the same test achieves 4.4 Gb/s on the i7-4770K.

During this test, a DC power supply was used to measure the current draw of the FPGA and the backplane. For the i7-4770K processor, the thermal design power (TDP) of 84 W from the datasheet [20] is used in the calculation. However, the TDP is a nominal value, and the actual power of the i7-4770K running all eight logical cores at 3.4 GHz may be higher than the TDP. The power consumption listed in the last column of Table II is calculated by multiplying the DC voltage by the current. The green (light color) bars in Figure 6 show throughput/power as the power efficiency result. The power efficiency of a single FPGA is 3× better than that of the i7-4770K.

TABLE II
RESOURCE, BUILD TIME AND POWER OF OPENCL AES KERNEL

Attr.      Logic   Reg.   Memory   DSP   Time      Power
Original   44%     12%    3%       0%    7 min     29.6 W
SIMD 2     62%     13%    29%      0%    162 min   29.9 W
SIMD 4     98%     14%    31%      0%    523 min   30.1 W
COMP 2     68%     15%    35%      0%    82 min    30.6 W
COMP 3     89%     17%    41%      0%    94 min    31.0 W

Fig. 6. Throughput of OpenCL-AES on M56 (throughput in Gb/s and throughput/power in Gb/s/watt for the CPU and for each OpenCL attribute setting).

D. AES on Multiple M56

Because the EX7 backplane can accommodate multiple FPGA modules, six M56 modules are used to run the OpenCL-AES test in parallel. In this test, all FPGAs use the AES SIMD 4 configuration. To access multiple FPGAs, the host application creates one process for each FPGA. The wall time is reported to confirm that the computation occurs in parallel.

Fig. 7. Total throughput of OpenCL-AES on multiple M56 modules (throughput in Gb/s and throughput/power in Gb/s/watt versus number of FPGA devices).
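Power efficiency, as plotted in Figures 6 and 7, is throughput divided by power. The short sketch below reproduces the single-FPGA comparison from the figures quoted in Section IV-C. It assumes the SIMD 4 power entry in Table II reads 30.1 W and, as the text does, uses the 84 W TDP as a stand-in for CPU power. The function name is illustrative.

```python
def throughput_per_watt(throughput_gbps: float, power_watts: float) -> float:
    """Power efficiency in Gb/s per watt."""
    return throughput_gbps / power_watts


# Figures quoted in the text for the single-device OpenCL-AES test.
fpga_eff = throughput_per_watt(5.1, 30.1)  # M56 with SIMD 4 (power value is an assumption, see above)
cpu_eff = throughput_per_watt(4.4, 84.0)   # i7 host, with TDP used in place of measured power

ratio = fpga_eff / cpu_eff
print(f"FPGA: {fpga_eff:.3f} Gb/s/W, CPU: {cpu_eff:.3f} Gb/s/W, ratio: {ratio:.1f}x")
```

With these inputs the ratio comes out slightly above 3, consistent with the roughly 3× figure stated for a single FPGA. Because TDP may understate the CPU's actual draw, the true ratio could be somewhat higher.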
TABLE III
POWER CONSUMPTION OF MULTIPLE FPGAS
(columns: # FPGAs, Power (W))

In Figure 7, the blue (dark color) bars show that the peak throughput grows with the number of FPGAs. The peak throughput reaches 19.8 Gb/s using six FPGAs. However, the performance growth slows as more FPGAs are added, which is due to the overhead of managing multiple FPGAs. When five or six FPGAs are used, the throughput almost saturates because of the limited bandwidth of PCI Express Gen3 x16, similar to the result observed in Figure 4.

Table III lists the power consumed by the FPGAs and the backplane. In the first column, where the number of FPGAs is 0, the 17.9 W is consumed by the EX7 backplane alone. In Figure 7, the power efficiency (throughput/total power) is plotted as green (light color) bars. The peak power efficiency is achieved when four FPGAs are used and is about 5× that of the i7-4770K. When more than four FPGAs are used, the power efficiency drops because the throughput growth falls off.

V. CONCLUSION AND FUTURE WORK

In this work, a scalable FPGA architecture for high performance computing has been designed. The design includes multiple FPGA modules and a high performance backplane. The modular nature of this architecture supports the combination of different FPGAs and allows for easy hardware updates. The FPGA module is based on Stratix V, which is compatible with Altera's OpenCL tool flow. The evaluation tested the native IO performance, and the results demonstrate linear scalability using six FPGAs. The host-to-device peak bandwidth is measured as 13.1 GB/s for read operations and 12.1 GB/s for write operations. The total FPGA-to-memory bandwidth is measured as 64.5 GB/s. The OpenCL-AES test results show that the peak throughput is achieved when six FPGA modules are used. Compared against a general-purpose processor, the throughput per watt shows a 3× improvement using a single FPGA and a 5× improvement using four FPGAs.
In the course of the experiments, several OpenCL benchmark suites [21]-[23] were evaluated on this architecture. However, some benchmark kernels fail to compile because these benchmarks target GPGPUs. Such instruction-based code executes easily on a GPGPU but can yield a large netlist that does not fit the M56. We plan to investigate the no-fit issue in the future.

For the multi-FPGA AES test, which continuously transfers data between the host and the devices, Figure 7 shows that the optimal number of FPGA modules on a Gen3 x16 backplane is four. The overall performance is bounded by the IO between the host and the devices. To investigate how six or more M56s can benefit OpenCL applications, we plan to evaluate other, more compute-bound OpenCL applications on multiple EX7s in the future.

ACKNOWLEDGEMENT

We would like to thank Dr. Peter Yiannacouras from Altera for his help with the OpenCL flow.

REFERENCES

[1] M. Gokhale et al., "Splash: A reconfigurable linear logic array," in ICPP (1) '90, 1990.
[2] A. Krasnov et al., "RAMP Blue: A message-passing manycore system in FPGAs," in FPL '07, 2007.
[3] A. G. Schmidt et al., "An evaluation of an integrated on-chip/off-chip network for high-performance reconfigurable computing," Int. J. Reconfig. Comp., 2012.
[4] Khronos Group, "The OpenCL specification," October 2009.
[5] Xilinx, "Vivado High-Level Synthesis," Nov. 2013.
[6] Impulse Accelerated Technologies, Inc., Jul. 2014.
[7] J. Villarreal et al., "Designing modular hardware accelerators in C with ROCCC 2.0," in Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, May 2010.
[8] J. Tripp et al., "Trident: An FPGA compiler framework for floating-point algorithms," in Field Programmable Logic and Applications, 2005. International Conference on, Aug. 2005.
[9] E. Cartwright et al., "Creating HW/SW co-designed MPSoPCs from high level programming models," in High Performance Computing and Simulation (HPCS), 2011 International Conference on, July 2011.
[10] M. Owaida et al., "Synthesis of platform architectures from OpenCL programs," in Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, May 2011.
[11] P. Athanas, K. Kepa, and K. Shagrithaya, "Enabling development of OpenCL applications on FPGA platforms," in Proceedings of the 2013 IEEE 24th International Conference on Application-specific Systems, Architectures and Processors (ASAP), ser. ASAP '13, 2013.
[12] A. Taneem, "OpenCL framework for a CPU, GPU, and FPGA platform," Master's thesis, University of Toronto, 2011.
[13] T. Czajkowski et al., "From OpenCL to high-performance hardware on FPGAs," in Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, Aug. 2012.
[14] Altera, "Implementing FPGA Design with the OpenCL Standard," Nov. 2013, White Paper, URL: opencl.pdf.
[15] R. Kirchgessner et al., "VirtualRC: A virtual FPGA platform for applications and tools portability," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA '12, 2012.
[16] C. Olson et al., "Hardware acceleration of short read mapping," in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, April 2012.
[17] Z. Liu and A. R. M. Ganesh, "OpenCL-AES," Dec. 2011.
[18] Altera, "Altera SDK for OpenCL Programming Guide," Nov. 2013.
[19] Altera, "Altera SDK for OpenCL Optimization Guide," Nov. 2013.
[20] Intel, "Intel Core i7-4770K Processor," 2013, URL: 9 GHz.
[21] W. Feng et al., "OpenCL and the 13 dwarfs: A work in progress," in Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, ser. ICPE '12, 2012.
[22] S. Seo, G. Jo, and J. Lee, "Performance characterization of the NAS parallel benchmarks in OpenCL," in Workload Characterization (IISWC), 2011 IEEE International Symposium on, Nov. 2011.
[23] A. Danalis et al., "The scalable heterogeneous computing (SHOC) benchmark suite," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, ser. GPGPU '10. New York, NY, USA: ACM, 2010.
Altera SDK for OpenCL
Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group
More informationEfficient Hardware Acceleration on SoC- FPGA using OpenCL
Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA
More informationFPGA Acceleration of 3D Component Matching using OpenCL
FPGA Acceleration of 3D Component Introduction 2D component matching, blob extraction or region extraction, is commonly used in computer vision for detecting connected regions that meet pre-determined
More informationExploring Automatically Generated Platforms in High Performance FPGAs
Exploring Automatically Generated Platforms in High Performance FPGAs Panagiotis Skrimponis b, Georgios Zindros a, Ioannis Parnassos a, Muhsen Owaida b, Nikolaos Bellas a, and Paolo Ienne b a Electrical
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationSDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center
SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently
More informationIntel HLS Compiler: Fast Design, Coding, and Hardware
white paper Intel HLS Compiler Intel HLS Compiler: Fast Design, Coding, and Hardware The Modern FPGA Workflow Authors Melissa Sussmann HLS Product Manager Intel Corporation Tom Hill OpenCL Product Manager
More informationExploring OpenCL Memory Throughput on the Zynq
Exploring OpenCL Memory Throughput on the Zynq Technical Report no. 2016:04, ISSN 1652-926X Chalmers University of Technology Bo Joel Svensson bo.joel.svensson@gmail.com Abstract The Zynq platform combines
More informationConvey Wolverine Application Accelerators. Architectural Overview. Convey White Paper
Convey Wolverine Application Accelerators Architectural Overview Convey White Paper Convey White Paper Convey Wolverine Application Accelerators Architectural Overview Introduction Advanced computing architectures
More informationCover TBD. intel Quartus prime Design software
Cover TBD intel Quartus prime Design software Fastest Path to Your Design The Intel Quartus Prime software is revolutionary in performance and productivity for FPGA, CPLD, and SoC designs, providing a
More informationCover TBD. intel Quartus prime Design software
Cover TBD intel Quartus prime Design software Fastest Path to Your Design The Intel Quartus Prime software is revolutionary in performance and productivity for FPGA, CPLD, and SoC designs, providing a
More informationVXS-610 Dual FPGA and PowerPC VXS Multiprocessor
VXS-610 Dual FPGA and PowerPC VXS Multiprocessor Two Xilinx Virtex -5 FPGAs for high performance processing On-board PowerPC CPU for standalone operation, communications management and user applications
More informationFPGA Solutions: Modular Architecture for Peak Performance
FPGA Solutions: Modular Architecture for Peak Performance Real Time & Embedded Computing Conference Houston, TX June 17, 2004 Andy Reddig President & CTO andyr@tekmicro.com Agenda Company Overview FPGA
Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
Dr. Yassine Hariri CMC Microsystems
Dr. Yassine Hariri Hariri@cmc.ca CMC Microsystems 03-26-2013 Agenda MCES Workshop Agenda and Topics Canada s National Design Network and CMC Microsystems Processor Eras: Background and History Single core
VXS-621 FPGA & PowerPC VXS Multiprocessor
VXS-621 FPGA & PowerPC VXS Multiprocessor Xilinx Virtex-5 FPGA for high performance processing On-board PowerPC CPU for standalone operation, communications management and user applications Two PMC/XMC
The Convey HC-2 Computer. Architectural Overview. Convey White Paper
The Convey HC-2 Computer Architectural Overview Convey White Paper Convey White Paper The Convey HC-2 Computer Architectural Overview Contents 1 Introduction 1 Hybrid-Core Computing 3 Convey System Architecture
Ten Reasons to Optimize a Processor
By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor
Gedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort
Technology White Paper The CHAMP-AV6 VPX-REDI Digital Signal Processing Card Maximizing Performance with Minimal Porting Effort Introduction The Curtiss-Wright Controls Embedded Computing CHAMP-AV6 is
Cymric A Framework for Prototyping Near-Memory Architectures
A Framework for Prototyping Near-Memory Architectures Chad D. Kersey 1, Hyesoon Kim 2, Sudhakar Yalamanchili 1 The rest of the team: Nathan Braswell, Jemmy Gazhenko, Prasun Gera, Meghana Gupta, Hyojong
EECS4201 Computer Architecture
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be
Fundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS
Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory
Lecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
Is There A Tradeoff Between Programmability and Performance?
Is There A Tradeoff Between Programmability and Performance? Robert Halstead Jason Villarreal Jacquard Computing, Inc. Roger Moussalli Walid Najjar Abstract While the computational power of Field Programmable
HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE
HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
Simplify Software Integration for FPGA Accelerators with OPAE
white paper Intel FPGA Simplify Software Integration for FPGA Accelerators with OPAE Cross-Platform FPGA Programming Layer for Application Developers Authors Enno Luebbers Senior Software Engineer Intel
ECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
To hear the audio, please be sure to dial in: ID#
Introduction to the HPP-Heterogeneous Processing Platform A combination of Multi-core, GPUs, FPGAs and Many-core accelerators To hear the audio, please be sure to dial in: 1-866-440-4486 ID# 4503739 Yassine
FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDA-to-FPGA
FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDA-to-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,
Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010
Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:
Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors
Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors Paul Ekas, DSP Engineering, Altera Corp. pekas@altera.com, Tel: (408) 544-8388, Fax: (408) 544-6424 Altera Corp., 101
Higher Level Programming Abstractions for FPGAs using OpenCL
Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center Technology scaling favors programmability
Did I Just Do That on a Bunch of FPGAs?
Did I Just Do That on a Bunch of FPGAs? Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto About the Talk Title It s the measure
World's most advanced data center accelerator for PCIe-based servers
NVIDIA TESLA P100 GPU ACCELERATOR World's most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying
LegUp: Accelerating Memcached on Cloud FPGAs
0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are
Requirements for Scalable Application Specific Processing in Commercial HPEC
Requirements for Scalable Application Specific Processing in Commercial HPEC Steven Miller Silicon Graphics, Inc. Phone: 650-933-1899 Email Address: scm@sgi.com Abstract: More and more High Performance
Design and Implementation of High Performance DDR3 SDRAM controller
Design and Implementation of High Performance DDR3 SDRAM controller Mrs. Komala M 1 Suvarna D 2 Dr K. R. Nataraj 3 Research Scholar PG Student(M.Tech) HOD, Dept. of ECE Jain University, Bangalore SJBIT,Bangalore
Virtualized SQL Server Performance and Scaling on Dell EMC XC Series Web-Scale Hyper-converged Appliances Powered by Nutanix Software
Virtualized SQL Server Performance and Scaling on Dell EMC XC Series Web-Scale Hyper-converged Appliances Powered by Nutanix Software Dell EMC Engineering January 2017 A Dell EMC Technical White Paper
Pactron FPGA Accelerated Computing Solutions
Pactron FPGA Accelerated Computing Solutions Intel Xeon + Altera FPGA 2015 Pactron HJPC Corporation 1 Motivation for Accelerators Enhanced Performance: Accelerators complement CPU cores to meet market
Exploring FPGA-specific Optimizations for Irregular OpenCL Applications
Exploring FPGA-specific Optimizations for Irregular OpenCL Applications Mohamed W. Hassan 1, Ahmed E. Helal 1, Peter M. Athanas 1, Wu-Chun Feng 1,2, and Yasser Y. Hanafy 1 1 Electrical & Computer Engineering,
General Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 Computational Challenges in PWA Rapid Increase in Available Data
HES-7 ASIC Prototyping
Rev. 1.9 September 14, 2012 Co-authored by: Slawek Grabowski and Zibi Zalewski, Aldec, Inc. Kirk Saban, Xilinx, Inc. Abstract This paper highlights possibilities of ASIC verification using FPGA-based prototyping,
Overview of ROCCC 2.0
Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment
Meet the Increased Demands on Your Infrastructure with Dell and Intel. ServerWatchTM Executive Brief
Meet the Increased Demands on Your Infrastructure with Dell and Intel ServerWatchTM Executive Brief a QuinStreet Executive Brief. 2012 Doing more with less is the mantra that sums up much of the past decade,
Welcome. Altera Technology Roadshow 2013
Welcome Altera Technology Roadshow 2013 Altera at a Glance Founded in Silicon Valley, California in 1983 Industry s first reprogrammable logic semiconductors $1.78 billion in 2012 sales Over 2,900 employees
Optimizing ARM SoC's with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd
Optimizing ARM SoC's with Carbon Performance Analysis Kits ARM Technical Symposia, Fall 2014 Andy Ladd Evolving System Requirements Processor Advances big.LITTLE Multicore Unicore DSP Cortex-R7 Block
Best Practices for Setting BIOS Parameters for Performance
White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page
Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA. Singh, A.K.; Kumar, A.; Srikanthan, Th.; Ha, Y.
Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA. Singh, A.K.; Kumar, A.; Srikanthan, Th.; Ha, Y. Published in: Proceedings of the 2010 International Conference on Field-programmable
S2C K7 Prodigy Logic Module Series
S2C K7 Prodigy Logic Module Series Low-Cost Fifth Generation Rapid FPGA-based Prototyping Hardware The S2C K7 Prodigy Logic Module is equipped with one Xilinx Kintex-7 XC7K410T or XC7K325T FPGA device
The S6000 Family of Processors
The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which
The rCUDA middleware and applications
The rCUDA middleware and applications Will my application work with rCUDA? rCUDA currently provides binary compatibility with CUDA 5.0, virtualizing the entire Runtime API except for the graphics functions,
AMD Opteron Processors In the Cloud
AMD Opteron Processors In the Cloud Pat Patla Vice President Product Marketing AMD DID YOU KNOW? By 2020, every byte of data will pass through the cloud *Source IDC 2 AMD Opteron In The Cloud October,
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern FPGA-based Computing
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern FPGA-based Computing Shinya Takamaeda-Yamazaki, Kenji Kise Tokyo Institute of Technology Tokyo, Japan 152-8552
High-Level Synthesis Techniques for In-Circuit Assertion-Based Verification
High-Level Synthesis Techniques for In-Circuit Assertion-Based Verification John Curreri Ph.D. Candidate of ECE, University of Florida Dr. Greg Stitt Assistant Professor of ECE, University of Florida April
ESE Back End 2.0. D. Gajski, S. Abdi. (with contributions from H. Cho, D. Shin, A. Gerstlauer)
ESE Back End 2.0 D. Gajski, S. Abdi (with contributions from H. Cho, D. Shin, A. Gerstlauer) Center for Embedded Computer Systems University of California, Irvine http://www.cecs.uci.edu 1 Technology advantages
FPGA-based Supercomputing: New Opportunities and Challenges
FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:
high performance medical reconstruction using stream programming paradigms
high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming
Hybrid Threading: A New Approach for Performance and Productivity
Hybrid Threading: A New Approach for Performance and Productivity Glen Edwards, Convey Computer Corporation 1302 East Collins Blvd Richardson, TX 75081 (214) 666-6024 gedwards@conveycomputer.com Abstract:
OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch
OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch Motivation Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 OpenMP FPGA Device
ALTERA FPGAs Architecture & Design
ALTERA FPGAs Architecture & Design Course Description This course provides all theoretical and practical know-how to design programmable devices of ALTERA with QUARTUS-II design software. The course combines
SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS
SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS Embedded System System Set of components needed to perform a function Hardware + software +. Embedded Main function not computing Usually not autonomous
Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective
Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective Nathan Woods XtremeData FPGA 2007 Outline Background Problem Statement Possible Solutions Description
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
ReconOS: Multithreaded Programming and Execution Models for Reconfigurable Hardware
ReconOS: Multithreaded Programming and Execution Models for Reconfigurable Hardware Enno Lübbers and Marco Platzner Computer Engineering Group University of Paderborn {enno.luebbers, platzner}@upb.de Outline
Four-Socket Server Consolidation Using SQL Server 2008
Four-Socket Server Consolidation Using SQL Server 2008 A Dell Technical White Paper Authors Raghunatha M Leena Basanthi K Executive Summary Businesses of all sizes often face challenges with legacy hardware
Simplify System Complexity
Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The World's Only Software Designed Controller
Dell EMC PowerEdge R740xd as a Dedicated Milestone Server, Using Nvidia GPU Hardware Acceleration
Dell EMC PowerEdge R740xd as a Dedicated Milestone Server, Using Nvidia GPU Hardware Acceleration Dell IP Video Platform Design and Calibration Lab June 2018 H17250 Reference Architecture Abstract This
Minimizing Thermal Variation in Heterogeneous HPC System with FPGA Nodes
Minimizing Thermal Variation in Heterogeneous HPC System with FPGA Nodes Yingyi Luo, Xiaoyang Wang, Seda Ogrenci-Memik, Gokhan Memik, Kazutomo Yoshii, Pete Beckman @ICCD 2018 Motivation FPGAs in data centers
ECE 486/586. Computer Architecture. Lecture # 2
ECE 486/586 Computer Architecture Lecture # 2 Spring 2015 Portland State University Recap of Last Lecture Old view of computer architecture: Instruction Set Architecture (ISA) design Real computer architecture:
Xilinx Vivado/SDK Tutorial
Xilinx Vivado/SDK Tutorial (Laboratory Session 1, EDAN15) Flavius.Gruian@cs.lth.se March 21, 2017 This tutorial shows you how to create and run a simple MicroBlaze-based system on a Digilent Nexys-4 prototyping
Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])
EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,
SDA: Software-Defined Accelerator for Large-Scale DNN Systems
SDA: Software-Defined Accelerator for Large-Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
Energy scalability and the RESUME scalable video codec
Energy scalability and the RESUME scalable video codec Harald Devos, Hendrik Eeckhaut, Mark Christiaens ELIS/PARIS Ghent University pag. 1 Outline Introduction Scalable Video Reconfigurable HW: FPGAs Implementation
The Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
Employing Multi-FPGA Debug Techniques
Employing Multi-FPGA Debug Techniques White Paper Traditional FPGA Debugging Methods Debugging in FPGAs has been difficult since day one. Unlike simulation where designers can see any signal at any time,
GPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
Five Ways to Build Flexibility into Industrial Applications with FPGAs
Five Ways to Build Flexibility into Industrial Applications with FPGAs by Jason Chiang and Stefano Zammattio, Altera Corporation WP-01154-2.0 White
Parallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
CASE STUDY: Using Field Programmable Gate Arrays in a Beowulf Cluster
CASE STUDY: Using Field Programmable Gate Arrays in a Beowulf Cluster Mr. Matthew Krzych Naval Undersea Warfare Center Phone: 401-832-8174 Email Address: krzychmj@npt.nuwc.navy.mil The Robust Passive Sonar
Outline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
An FPGA-Based Optical IOH Architecture for Embedded System
An FPGA-Based Optical IOH Architecture for Embedded System Saravana.S Assistant Professor, Bharath University, Chennai 600073, India Abstract Data traffic has tremendously increased and is still increasing
What is PXImc? By Chetan Kapoor, PXI Product Manager National Instruments
What is PXImc? By Chetan Kapoor, PXI Product Manager National Instruments Overview Modern day test and control systems are growing larger, more complex and more intricate. Most of these intricacies are
ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT
ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT THE FREE AND OPEN RISC INSTRUCTION SET ARCHITECTURE Codasip is the leading provider of RISC-V processor IP Codasip Bk: A portfolio of RISC-V processors Uniquely
SimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels
SimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels Ahmed Sanaullah, Chen Yang, Daniel Crawley and Martin C. Herbordt Department of Electrical and Computer Engineering, Boston University The Intel
A Configurable Multi-Ported Register File Architecture for Soft Processor Cores
A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box
International IEEE Symposium on Field-Programmable Custom Computing Machines
International IEEE Symposium on Field-Programmable Custom Computing Machines Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Bandwidth Kentaro Sano Yoshiaki Hatsuda
Using Intel Streaming SIMD Extensions for 3D Geometry Processing
Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,
Simplify System Complexity
Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint
Axiomtek Broadwell-U Embedded Board & SoM White Paper
Axiomtek Broadwell-U Embedded Board & SoM White Paper Copyright 2015 Axiomtek Co., Ltd. All Rights Reserved Axiomtek's embedded board and system-on-module utilizing the latest 5th generation Intel Core
"On the Capability and Achievable Performance of FPGAs for HPC Applications"
"On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies
SDA: Software-Defined Accelerator for Large-Scale DNN Systems
SDA: Software-Defined Accelerator for Large-Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs
A Methodology and Tool Framework for Supporting Rapid Exploration of Memory Hierarchies in FPGAs Harrys Sidiropoulos, Kostas Siozios and Dimitrios Soudris School of Electrical & Computer Engineering National
Memory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
10 Steps to Virtualization
AN INTEL COMPANY 10 Steps to Virtualization WHEN IT MATTERS, IT RUNS ON WIND RIVER EXECUTIVE SUMMARY Virtualization the creation of multiple virtual machines (VMs) on a single piece of hardware, where