Mohamed Taher The George Washington University

1 Experience Programming Current HPRCs
Mohamed Taher
The George Washington University

2 Acknowledgements
- GWU: Prof. Tarek El-Ghazawi and Esam El-Araby
- GMU: Prof. Kris Gaj
- ARSC
- SRC
- SGI
- Cray

3 Outline
- Introduction
- Development Flow:
  - SRC-6
  - Cray XD1
  - SGI Altix 350
- Conclusion

4 High Performance Reconfigurable Computers (HPRCs)
HPRCs are computing systems based on the close system-level integration of one or more general-purpose processors and one or more reconfigurable processors (RPs).
- The computational cores are mapped to the reconfigurable hardware.
- The general-purpose processors:
  - Perform the operations that cannot be done efficiently in the reconfigurable hardware
  - Use specific APIs to:
    - Download configuration codes into the RPs
    - Transfer data to/from RP memory
    - Start/stop the program
[Figure: block diagram with memory, µP, and RP connected by data and control paths]

5 Development Flow for RCs

6 SRC-6

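7 SRC Development Flow
[Figure: SRC development flow. Application sources (.c or .f files) go to the µP compiler and the MAP compiler; macro sources are .vhd or .v files. The MAP compiler emits HDL sources (.v files); logic synthesis produces netlists (.ngo files); place & route produces configuration bitstreams (.bin files). The linker combines the object files (.o files) and the bitstreams into the application executable.]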

8 MAP Compiler
- Compiles C/Fortran codes to reconfigurable hardware
  - Generated code: circuits
    - Basic blocks in inner loop bodies are merged and become pipelined circuits
    - Basic blocks in outer loops become special-purpose hardware function units
- C/Fortran code can be extended using macros, allowing for program transformations that cannot be expressed straightforwardly in C/Fortran
- Macros have semantics unlike C/Fortran functions. They:
  - have a period (#clocks between inputs)
  - have a pipeline delay (#clocks between input and output)
  - can have state (kept between macro calls)
  The MAP compiler takes care of period and delay.
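The period and pipeline delay of a macro translate directly into a clock count. The sketch below is a simple timing model, not part of the SRC tools: it assumes the unit accepts a new input every `period` clocks and produces each result `delay` clocks after its input. The function name `pipeline_clocks` is hypothetical.

```c
/* Clocks from the first input until the last result, for a pipelined unit
 * that accepts one input every `period` clocks and returns each result
 * `delay` clocks after the corresponding input (assumed model). */
unsigned long pipeline_clocks(unsigned long n_inputs,
                              unsigned long period,
                              unsigned long delay)
{
    if (n_inputs == 0)
        return 0;
    /* The last input enters at clock (n_inputs - 1) * period and
     * leaves `delay` clocks later. */
    return (n_inputs - 1) * period + delay;
}
```

Under this model a fully pipelined unit (period 1) with a delay of 10 clocks needs 109 clocks for 100 inputs, so the delay is paid once and the period dominates for long loops.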

9 Example Application: Wavelet Main Program
- Allocate the RP
- Configure and start the program execution on the FPGA
  - Pass the input image pointer and the output image buffer pointer to be used by DMA
  - Individual parameters, such as the image dimensions, can be passed to the MAP C function
  - A large parameter array, such as the wavelet coefficients, can be transferred using DMA by passing the pointer to the coefficients array
- Free the RP

    int main (int argc, char *argv[])
    {
        ...
        /* initialize daubechies coefficients */
        float coeff[8];
        coeff[0] = coeff[1] = (1+sqrt(3))/(4*sqrt(2));
        coeff[2] = coeff[3] = (3+sqrt(3))/(4*sqrt(2));
        coeff[4] = coeff[5] = (3-sqrt(3))/(4*sqrt(2));
        coeff[6] = coeff[7] = (1-sqrt(3))/(4*sqrt(2));
        ...
        /* allocate images */
        ...
        map_allocate(1);
        gettimeofday(&time0, NULL);
        proc_fpga (image_in, image_out, coeff, dx, dy, &ht0, &ht1, &ht2, &ht3, mapno);
        gettimeofday(&time1, NULL);
        /* print time difference */
        ...
        map_free(1);
        ...
    }

(Mohamed Taher, GWU; ARSC HPRC Workshop, Fairbanks, AK, August 22-24)
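The four distinct values in the coefficient initialization are the Daubechies D4 scaling (low-pass) coefficients. As a sanity check, the sketch below computes them in plain C and derives the matching high-pass filter with the usual quadrature-mirror relation. It does not reproduce the slide's 8-element layout; the function names and the use of literal constants for sqrt(2) and sqrt(3) are choices made for this illustration.

```c
/* sqrt(2) and sqrt(3) as literals so the sketch needs no math library */
#define SQRT2 1.4142135623730951
#define SQRT3 1.7320508075688772

/* Fill h[0..3] with the four Daubechies D4 scaling (low-pass) coefficients. */
void d4_scaling_coeffs(double h[4])
{
    const double norm = 4.0 * SQRT2;
    h[0] = (1.0 + SQRT3) / norm;
    h[1] = (3.0 + SQRT3) / norm;
    h[2] = (3.0 - SQRT3) / norm;
    h[3] = (1.0 - SQRT3) / norm;
}

/* Quadrature-mirror high-pass filter: g[k] = (-1)^k * h[3 - k]. */
void d4_wavelet_coeffs(const double h[4], double g[4])
{
    g[0] =  h[3];
    g[1] = -h[2];
    g[2] =  h[1];
    g[3] = -h[0];
}
```

Two properties make a quick check possible: the low-pass coefficients sum to sqrt(2) and their squares sum to 1, while the high-pass coefficients sum to 0.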

10 Example Application: Wavelet MAP C Function (FPGA.mc)
[Flowchart]
1. Transfer image data to OBM bank A
2. Transfer coefficients to OBM bank C
3. Load coefficients from bank C to on-chip registers
4. Read one pixel from bank A
5. Compute wavelet
6. Store result into bank B
7. End of image? If no, go back to step 4; if yes, continue
8. Transfer image data from bank B to the host

11 Example Application: Wavelet MAP C Function (FPGA.mc)

    void proc_fpga(int64_t image_in[], int64_t image_out[], int64_t coeff[],
                   int dx, int dy,
                   long long *ht0, long long *ht1, long long *ht2, long long *ht3,
                   int mapnum)
    {
        // coefficients
        float LP0, LP1, LP2, LP3, LP4;
        float HP0, HP1, HP2, HP3, HP4;

        // variables
        int i, j, k;
        int64_t in_pixel, out_pixel;

        // input image
        OBM_BANK_A (AL, int64_t, MAX_OBM_SIZE)
        // output image
        OBM_BANK_B (BL, int64_t, MAX_OBM_SIZE)
        // filter coefficients
        OBM_BANK_C (CL, int64_t, MAX_OBM_SIZE)

12 Example Application: Wavelet MAP C Function (FPGA.mc)

        start_timer();
        read_timer(ht0);

        // transfer image data to an OBM bank
        DMA_CPU (CM2OBM, AL, MAP_OBM_stripe (1, "A"), image_in, 1, Image_size, 0);
        wait_dma (0);

        // transfer coefficients to an OBM bank
        DMA_CPU (CM2OBM, CL, MAP_OBM_stripe (1, "C"), coeff, 1, 4*sizeof(int64_t), 0);
        wait_dma (0);
        read_timer(ht1);

        // load coefficients from the OBM bank to on-chip registers
        for (i = 0; i < 4; i++) {
            LP0 = LP1; HP0 = HP1;
            LP1 = LP2; HP1 = HP2;
            LP2 = LP3; HP2 = HP3;
            split_64to32_flt_flt (CL[i], &HP3, &LP3);
        }
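The coefficient loop relies on each 64-bit OBM word carrying two 32-bit floats, which `split_64to32_flt_flt` separates. The following host-side sketch shows one way such a word could be packed and split in portable C. The helper names are hypothetical, and which half of the word holds which value is an assumption made for this illustration, not something the slides specify.

```c
#include <stdint.h>
#include <string.h>

/* Pack two 32-bit floats into one 64-bit word.
 * Assumption: `hi` occupies bits 63..32 and `lo` bits 31..0. */
uint64_t pack_two_floats(float hi, float lo)
{
    uint32_t hi_bits, lo_bits;
    memcpy(&hi_bits, &hi, sizeof hi_bits);   /* copy the bit pattern, no conversion */
    memcpy(&lo_bits, &lo, sizeof lo_bits);
    return ((uint64_t)hi_bits << 32) | lo_bits;
}

/* Inverse of pack_two_floats: recover both floats from the 64-bit word. */
void split_two_floats(uint64_t word, float *hi, float *lo)
{
    uint32_t hi_bits = (uint32_t)(word >> 32);
    uint32_t lo_bits = (uint32_t)(word & 0xFFFFFFFFu);
    memcpy(hi, &hi_bits, sizeof hi_bits);
    memcpy(lo, &lo_bits, sizeof lo_bits);
}
```

Packing pairs this way halves the number of OBM words that hold the coefficients, which is consistent with the loop reading four words to fill the high-pass and low-pass registers.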

13 Example Application: Wavelet MAP C Function (FPGA.mc)

        for (i = 0; i < Image_size; i++) {
            // read pixel value from the OBM bank
            in_pixel = AL[i];

            // compute Wavelet
            ...

            // store results to the OBM bank
            BL[i] = out_pixel;
        }
        read_timer(ht2);

        // transfer image data to the host
        DMA_CPU (OBM2CM, BL, MAP_OBM_stripe (1, "B"), image_out, 1, Image_size, 0);
        wait_dma (0);
        read_timer(ht3);
    }

14 Using On-Chip Memory (OCM) in SRC

    void sum(int64_t a[], int *c, int mapno)
    {
        OBM_BANK_A (AL, int64_t, SIZE)
        uint64_t ocm_a [SIZE];
        int i;

        DMA_CPU (CM2OBM, AL, MAP_OBM_stripe (1, "A"), a, 1, bytelength, 0);
        wait_dma (0);

        for (i = 0; i < SIZE; i++)
            ocm_a[i] = AL[i];

        for (i = 0; i < SIZE; i++)
            tmp = ocm_a[i] + tmp;
    }

[Figure: inside the FPGA, AL[] in OBM is copied over a 64-bit path into ocm_a[] in OCM; the computations produce the 32-bit result c]

15 FPGA Mapping in SRC

FPGA1.mc

    void fpga1(int64_t a, int64_t b, int64_t *sum, int mapno)
    {
        int64_t c, temp;

        send_to_bridge(b);
        c = a * const1;
        recv_from_bridge(&temp);
        *sum = temp + c;
    }

FPGA2.mc

    void fpga2()
    {
        int64_t a, d;

        recv_from_bridge(&a);
        d = a / const2;
        send_to_bridge(d);
    }

Makefile

    MAPFILES  = FPGA1.mc FPGA2.mc
    PRIMARY   = FPGA1.mc
    SECONDARY = FPGA2.mc
    CHIP2     = FPGA2.mc

[Figure: inputs a and b; FPGA 1 performs the multiply and the add, FPGA 2 performs the divide; output sum]

16 Overlapping Data Transfer with Computation
- Improve performance by overlapping algorithm computation with data loading and unloading
- Parallel sections
  - Multiple parallel code blocks are active in parallel
[Figure: timeline over cycles 1 to 5 showing Read DMA, Algorithm, and Write DMA working on successive data blocks]

    { /* DMA_IN 1st BLOCK BUFFER */ }

    for (LoopSub = 0; LoopSub < LoopMax; LoopSub++) {
    #pragma src parallel sections
        {
    #pragma src section
            {
                for (i = 0; i < InputCountPerLoop; i++) {
                    DO COMPUTATION (Current Data Block)
                }
            } /* end of parallel section with compute loop */
    #pragma src section
            {
                /* DMA_IN NEXT BLOCK BUFFER */
            } /* end of parallel section with DMA */
        } /* end of parallel sections */
    } /* for LoopSub from 0 to LoopMax */
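The buffer bookkeeping behind this pattern is ordinary double buffering. The sketch below mimics it in plain, sequential C: while block b is processed from one buffer, block b+1 is loaded into the other. On the MAP the load and the compute would sit in separate parallel sections and run at the same time; here they simply run one after the other. `BLOCK`, `load_block`, `compute_block`, and `sum_double_buffered` are hypothetical names, and the "computation" is just a sum.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 4  /* elements per block (hypothetical size) */

/* Stand-in for a DMA read: copy one block from host memory into a buffer. */
static void load_block(const int *src, size_t block, int *buf)
{
    memcpy(buf, src + block * BLOCK, BLOCK * sizeof(int));
}

/* Stand-in for the compute loop: sum the elements of one block. */
static long compute_block(const int *buf)
{
    long s = 0;
    for (size_t i = 0; i < BLOCK; i++)
        s += buf[i];
    return s;
}

/* Process nblocks blocks with two alternating buffers. */
long sum_double_buffered(const int *src, size_t nblocks)
{
    int buf[2][BLOCK];
    long total = 0;

    if (nblocks == 0)
        return 0;

    load_block(src, 0, buf[0]);                  /* prologue: first block */
    for (size_t b = 0; b < nblocks; b++) {
        int cur = (int)(b % 2);
        int nxt = 1 - cur;
        if (b + 1 < nblocks)
            load_block(src, b + 1, buf[nxt]);    /* would overlap with compute */
        total += compute_block(buf[cur]);
    }
    return total;
}
```

The key point is that the loader never writes the buffer the compute loop is reading, which is what allows the two sections to run concurrently in hardware.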

17 Streams
A stream is a data structure that allows flexible communication between concurrent producer and consumer loops.

    Stream_64 S0;

    #pragma src parallel sections
    {
    #pragma src section
        {
            int i;
            for (i = 0; i < sz; i++)
                put_stream (&S0, AL[i]+42, 1);
        } /* end of parallel section */
    #pragma src section
        {
            int i;
            for (i = 0; i < sz; i++)
                get_stream (&S0, &BL[i]);
        } /* end of parallel section */
    } /* end of parallel sections */

[Figure: Streams and Conventional Data Flow. Conventional data flow: Compute Loop 1 writes to on-board memory or BRAM, and Compute Loop 2 reads from it afterwards. Streams: Compute Loop 1 feeds Compute Loop 2 through streams, so the two loops overlap in time. This saves accesses to on-board memory; data is flowing in the logic.]
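A software analogue of a stream is a small bounded FIFO between a producer and a consumer. The sketch below is not the SRC `Stream_64` implementation; it is a plain C ring buffer with hypothetical names (`fifo_t`, `fifo_put`, `fifo_get`, `FIFO_DEPTH`) and it interleaves the two loops step by step to imitate them running concurrently.

```c
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4  /* hypothetical stream depth */

typedef struct {
    int64_t slot[FIFO_DEPTH];
    size_t head, tail, count;
} fifo_t;

/* Returns 1 on success, 0 if the FIFO is full (producer must stall). */
static int fifo_put(fifo_t *f, int64_t v)
{
    if (f->count == FIFO_DEPTH)
        return 0;
    f->slot[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty (consumer must stall). */
static int fifo_get(fifo_t *f, int64_t *v)
{
    if (f->count == 0)
        return 0;
    *v = f->slot[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}

/* Producer sends in[i] + 42; consumer stores what it receives in out[]. */
void stream_copy_plus_42(const int64_t *in, int64_t *out, size_t n)
{
    fifo_t f = {{0}, 0, 0, 0};
    size_t produced = 0, consumed = 0;

    while (consumed < n) {
        if (produced < n && fifo_put(&f, in[produced] + 42))
            produced++;                       /* one producer step */
        if (fifo_get(&f, &out[consumed]))
            consumed++;                       /* one consumer step */
    }
}
```

No intermediate array is needed between the two loops: each value goes from producer to consumer through the FIFO, which is the property the slide describes as saving accesses to on-board memory.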

18 Performance Optimizations
- Performance tuning (removing inefficiencies)
  - Avoid re-reading of data from OBMs: use delay queues
  - Avoid read/write conflicts in the same iteration
  - Avoid multiple accesses to a memory in one iteration
- MAP routine optimization
  - Pipelined loops: all function units within the loop are computing at every clock
  - Parallel sections: multiple parallel code blocks are active in parallel
  - Multiple FPGAs: logic in both FPGAs can be computing in parallel
  - Utilize streams: multiple serial code blocks are active in parallel, and all function units within the loop are computing at every clock
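The advice to avoid re-reading data from OBM can be illustrated with a sliding window. The sketch below is plain C rather than an SRC delay-queue macro: it keeps the two previous inputs in local variables, so each iteration performs a single memory read instead of three. The function name `window_sum3` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* out[i] = in[i] + in[i-1] + in[i-2], treating inputs before index 0 as 0.
 * Each in[i] is read from memory exactly once. */
void window_sum3(const int64_t *in, int64_t *out, size_t n)
{
    int64_t d1 = 0;  /* input delayed by one iteration */
    int64_t d2 = 0;  /* input delayed by two iterations */

    for (size_t i = 0; i < n; i++) {
        int64_t x = in[i];   /* the only memory read in this iteration */
        out[i] = x + d1 + d2;
        d2 = d1;             /* shift the delay line */
        d1 = x;
    }
}
```

With one read per iteration, the loop also satisfies the related guideline of avoiding multiple accesses to the same memory in one iteration.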

19 Cray XD1

20 XD1 Development Flow
[Figure: hardware flow and software flow]

21 Operational Scenarios

22 Example Application: Wavelet

    /* Define the address space for user registers and QDR memory */
    #define APP_CFG_REG 0x08UL
    #define USR_REG1    0x40UL
    #define USR_REG2    0x48UL
    #define USR_REG3    0x50UL
    #define USR_REG4    0x58UL
    #define QDR1_OFFSET 0x100UL  /* Offset in FPGA memory space. */

    int main (int argc, char *argv[])
    {
        int fp_id;
        err_e e;
        u_64 coeff[4];
        u_64 *dp_base;
        u_64 *image;

        /* Open the FPGA device */
        fp_id = fpga_open ("/dev/ufp0", O_RDWR | O_SYNC, &e);

        /* Load the FPGA */
        fpga_load (fp_id, "top.bin.ufp", &e);
        ...
        /* Read Image */
        ...
        /* initialize daubechies coefficients */
        ...
        /* Transfer coefficients into the FPGA registers */
        fpga_wrt_appif_val (fp_id, coeff[0], USR_REG1, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[1], USR_REG2, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[2], USR_REG3, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[3], USR_REG4, TYPE_VAL, &e);

23 Example Application: Wavelet

        /* Configure the Wavelet for QDR bridging */
        fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);

        /* Map the entire 4 Mbytes of QDR memory */
        dp_base = fpga_memmap (fp_id, QDR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, QDR1_OFFSET, &e);

        /* Transfer the image into the QDR */
        for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
            dp_base[i] = image[i];

        /* Start processing */
        fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);
        /*... */

        /* Read the FPGA status */
        fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);
        /*... */

        /* Configure the Wavelet for QDR bridging */
        fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);

        /* Read back the image */
        for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
            output_image[i] = dp_base[i];

        /* Close the FPGA device */
        fpga_close (fp_id, &e);
    }

24 Accessing µP Memory from the FPGA
- The APIs support access to a region of the µP memory by the FPGA logic
- The program uses the fpga_set_ftrmem function, which:
  - Allocates an FTR
  - Associates it with the address space of the µP
  - Sets up the FPGA to access it directly
- It does not automatically provide the address of this region to the FPGA application logic
  - One way is to establish an FPGA register for that purpose and use the fpga_wrt_appif_val function to write the value to the register

    unsigned long order;
    void *ftr_mem;
    /*... */
    ftr_mem = fpga_set_ftrmem(fpga_fd, order, &err);
    if (ftr_mem == NULL) {
        /* Handle error. */
    }
    fpga_wrt_appif_val (fp_id, (u_64) ftr_mem, BUFF0_PTR_REG, TYPE_ADDR, &e);
    fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);
    /*... */
    fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);
    /*... */

25 Using MPI on Cray XD1
- Each Cray XD1 chassis consists of 6 dual-processor Opteron nodes, each with:
  - 2 Opteron processors (total 12)
  - 1 Xilinx Virtex-II Pro 50 (total 6)
- Applications can be parallelized across the 6 FPGAs using MPI
- Data are distributed across the 6 FPGAs

    if (mythread == 0)
        read_image (image_file_name, image_buffer, &rows, &cols);

    MPI_Bcast (&rows, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast (&cols, 1, MPI_INT, 0, MPI_COMM_WORLD);

    local_size = rows*cols/threads;

    MPI_Scatter (image_buffer, local_size, MPI_UNSIGNED_LONG,
                 local_image_buffer, local_data_size, MPI_UNSIGNED_LONG,
                 0, MPI_COMM_WORLD);

    /* Execute the wavelet on the Hardware */
    process_image (fp_id, local_image_buffer, local_output_buffer, rows, cols);

    MPI_Gather (local_output_buffer, local_size, MPI_UNSIGNED_LONG,
                output_image_buffer, local_data_size, MPI_UNSIGNED_LONG,
                0, MPI_COMM_WORLD);

    if (mythread == 0)
        write_image (output_file_name, output_image_buffer, rows, cols);
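The expression `local_size = rows*cols/threads` divides the image evenly only when the pixel count is a multiple of the number of processes. As a hedged illustration of how an uneven split could be handled, the sketch below computes per-process counts and displacements in plain C, in the form that `MPI_Scatterv` and `MPI_Gatherv` expect for their `int` count and displacement arrays. The function name `block_partition` is hypothetical and no MPI call is made here.

```c
/* Split `total` items over `parts` processes as evenly as possible.
 * The first (total % parts) processes receive one extra item.
 * counts[p] is the number of items for process p and displs[p] is the
 * index of its first item. Assumes parts > 0. */
void block_partition(int total, int parts, int *counts, int *displs)
{
    int base = total / parts;
    int extra = total % parts;
    int offset = 0;

    for (int p = 0; p < parts; p++) {
        counts[p] = base + (p < extra ? 1 : 0);
        displs[p] = offset;
        offset += counts[p];
    }
}
```

When the total is an exact multiple of the process count, every entry of `counts` equals `total / parts`, which matches the even split used with `MPI_Scatter` on the slide.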

26 SGI Altix 350

27 Development Flow

28 Example Application: Wavelet (Parameter Passing)
Small parameters
- Connect to Algorithm Defined Registers (alg_def_reg0 - alg_def_reg7)
- Pass parameter mapping to software through an extractor directive, type REG_IN:

    -- extractor REG_IN: coeff0 64 u alg_def_reg0[63:0]
    -- extractor REG_IN: coeff1 64 u alg_def_reg1[63:0]
    -- extractor REG_IN: coeff2 64 u alg_def_reg2[63:0]
    -- extractor REG_IN: coeff3 64 u alg_def_reg3[63:0]

Large arrays
- Dedicate a portion of an SRAM bank for the parameter array
- Pass parameter array mapping to software with an extractor comment of type SRAM:

    -- extractor SRAM:a_in sram[0] 0x00 in u fixed

Host code:

    rasclib_algorithm_request_t ar;
    int alg_id;
    int res;

    strcpy (ar.alg_id, "Wavelet");
    ar.num_devices = 1;
    ...
    /* Read Image */
    ...
    /* initialize daubechies coefficients */
    ...
    rasclib_resource_alloc (&ar, 1);
    alg_id = rasclib_algorithm_open ("Wavelet");
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff0", coeff[0]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff1", coeff[1]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff2", coeff[2]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff3", coeff[3]);
    res = rasclib_algorithm_send (alg_id, "a_in", Image_Buff, SIZE);

29 Example Application: Wavelet (Reading Results Back)

    rasclib_algorithm_go (alg_id);
    res = rasclib_algorithm_receive (alg_id, "d_out", out_buff, SIZE);
    rasclib_algorithm_commit (alg_id);
    rasclib_algorithm_wait (alg_id);
    rasclib_algorithm_close (alg_id);

Small parameters
- Connect to Algorithm Defined Registers
- Pass parameter mapping to software through an extractor directive, type REG_OUT
- Use the API function rasclib_algorithm_reg_read

Large arrays
- Dedicate a portion of an SRAM bank for the parameter array
- Pass parameter array mapping to software with an extractor comment of type SRAM:

    -- extractor SRAM:d_out sram[1] 0x00 out u fixed

30 Streaming
- Improve performance by overlapping algorithm computation with data loading and unloading
- Extractor directives are used to tell software:
  - where input/output data arrays are located (SRAM bank + starting index)
  - the sizes of the input/output data arrays
  - which arrays have been enabled for streaming
- Extractor directive type used: SRAM with attribute stream, e.g.:

    -- extractor SRAM:a_in sram[0] 0x00 in u stream
    -- extractor SRAM:d_out sram[1] 0x00 out u stream

[Figure: timeline over cycles 1 to 5 with rows for Read DMA, Algorithm, and Write DMA, each marked active in two cycles]
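The benefit of overlapping the read DMA, the algorithm, and the write DMA can be estimated with a simple three-stage pipeline model. The sketch below is an idealized estimate, not a measurement and not part of the SGI tools: it assumes the three stages advance in lockstep, with every step taking as long as the slowest stage. The function names are hypothetical.

```c
/* Largest of three stage times. */
static unsigned long max3(unsigned long a, unsigned long b, unsigned long c)
{
    unsigned long m = a > b ? a : b;
    return m > c ? m : c;
}

/* No overlap: every block is read, processed, and written back in turn. */
unsigned long sequential_time(unsigned long nblocks, unsigned long t_read,
                              unsigned long t_alg, unsigned long t_write)
{
    return nblocks * (t_read + t_alg + t_write);
}

/* Full overlap: a three-stage pipeline in lockstep. The first block needs
 * three steps to pass through; each further block adds one more step. */
unsigned long overlapped_time(unsigned long nblocks, unsigned long t_read,
                              unsigned long t_alg, unsigned long t_write)
{
    if (nblocks == 0)
        return 0;
    return (nblocks + 2) * max3(t_read, t_alg, t_write);
}
```

With three blocks and unit stage times the model gives 9 time units without overlap and 5 with overlap; for many blocks the overlapped time approaches one slowest-stage time per block.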

31 Conclusions
- In all of the examined cases, µP high-level languages need extensive use of explicit APIs to interact with the FPGA.
- In the case of the Cray XD1, the FPGA can directly access the µP memory for reading and writing. This eliminates the DMA time in pipelined designs.
- In the case of SRC, the FPGA side can be completely programmed using MAP C/Fortran.
- Some explicit optimizations can be specified in the MAP C code (on the FPGA side) to explicitly perform forwarding and chaining.
- In the SGI case, explicit optimization can be used on the FPGA side to overlap I/O with computation.
- Although the two example optimizations from SRC and SGI do different things, they share the same name (streaming). This points to the need for standardized terminology, to avoid confusion for application developers who work across different platforms.


More information

Outline Introduction System development Video capture Image processing Results Application Conclusion Bibliography

Outline Introduction System development Video capture Image processing Results Application Conclusion Bibliography Real Time Video Capture and Image Processing System using FPGA Jahnvi Vaidya Advisors: Dr. Yufeng Lu and Dr. In Soo Ahn 4/30/2009 Outline Introduction System development Video capture Image processing

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information

LegUp HLS Tutorial for Microsemi PolarFire Sobel Filtering for Image Edge Detection

LegUp HLS Tutorial for Microsemi PolarFire Sobel Filtering for Image Edge Detection LegUp HLS Tutorial for Microsemi PolarFire Sobel Filtering for Image Edge Detection This tutorial will introduce you to high-level synthesis (HLS) concepts using LegUp. You will apply HLS to a real problem:

More information

CS333 Intro to Operating Systems. Jonathan Walpole

CS333 Intro to Operating Systems. Jonathan Walpole CS333 Intro to Operating Systems Jonathan Walpole Threads & Concurrency 2 Threads Processes have the following components: - an address space - a collection of operating system state - a CPU context or

More information

HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1

HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1 April 4-7, 2016 Silicon Valley HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1 Max Lv, NVIDIA Brant Zhao, NVIDIA April 7 mlv@nvidia.com https://github.com/madeye Histogram of Oriented Gradients on GPU

More information

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail.

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. PCAP Assignment I 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. The multicore CPUs are designed to maximize the execution speed

More information

Pointers, Pointers, Pointers!

Pointers, Pointers, Pointers! Pointers, Pointers, Pointers! Pointers, Pointers, Pointers, Pointers, Pointers, Pointers, Pointers, Pointers, Pointers, Pointers, Pointers, Pointers, Pointers! Colin Gordon csgordon@cs.washington.edu University

More information

3L Diamond. Multiprocessor DSP RTOS

3L Diamond. Multiprocessor DSP RTOS 3L Diamond Multiprocessor DSP RTOS What is 3L Diamond? Diamond is an operating system designed for multiprocessor DSP applications. With Diamond you develop efficient applications that use networks of

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Ricardo Fonseca https://sites.google.com/view/rafonseca2017/ Outline Shared Memory Programming OpenMP Fork-Join Model Compiler Directives / Run time library routines Compiling and

More information

Vector an ordered series of scalar quantities a one-dimensional array. Vector Quantity Data Data Data Data Data Data Data Data

Vector an ordered series of scalar quantities a one-dimensional array. Vector Quantity Data Data Data Data Data Data Data Data Vector Processors A vector processor is a pipelined processor with special instructions designed to keep the (floating point) execution unit pipeline(s) full. These special instructions are vector instructions.

More information

Day: Thursday, 03/19 Time: 16:00-16:50 Location: Room 212A Level: Intermediate Type: Talk Tags: Developer - Tools & Libraries; Game Development

Day: Thursday, 03/19 Time: 16:00-16:50 Location: Room 212A Level: Intermediate Type: Talk Tags: Developer - Tools & Libraries; Game Development 1 Day: Thursday, 03/19 Time: 16:00-16:50 Location: Room 212A Level: Intermediate Type: Talk Tags: Developer - Tools & Libraries; Game Development 2 3 Talk about just some of the features of DX12 that are

More information

FPGA Provides Speedy Data Compression for Hyperspectral Imagery

FPGA Provides Speedy Data Compression for Hyperspectral Imagery FPGA Provides Speedy Data Compression for Hyperspectral Imagery Engineers implement the Fast Lossless compression algorithm on a Virtex-5 FPGA; this implementation provides the ability to keep up with

More information

Dealing with Heterogeneous Multicores

Dealing with Heterogeneous Multicores Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism

More information

International Training Workshop on FPGA Design for Scientific Instrumentation and Computing November 2013

International Training Workshop on FPGA Design for Scientific Instrumentation and Computing November 2013 2499-20 International Training Workshop on FPGA Design for Scientific Instrumentation and Computing 11-22 November 2013 High-Level Synthesis: how to improve FPGA design productivity RINCON CALLE Fernando

More information

MPI Program Structure

MPI Program Structure MPI Program Structure Handles MPI communicator MPI_COMM_WORLD Header files MPI function format Initializing MPI Communicator size Process rank Exiting MPI 1 Handles MPI controls its own internal data structures

More information

Intel HLS Compiler: Fast Design, Coding, and Hardware

Intel HLS Compiler: Fast Design, Coding, and Hardware white paper Intel HLS Compiler Intel HLS Compiler: Fast Design, Coding, and Hardware The Modern FPGA Workflow Authors Melissa Sussmann HLS Product Manager Intel Corporation Tom Hill OpenCL Product Manager

More information

Lab Exam 1 D [1 mark] Give an example of a sample input which would make the function

Lab Exam 1 D [1 mark] Give an example of a sample input which would make the function Grade: / 20 Lab Exam 1 D500 1. [1 mark] Give an example of a sample input which would make the function scanf( "%f", &f ) return 0? Answer: Anything that is not a floating point number such as 4.567 or

More information

Porting Performance across GPUs and FPGAs

Porting Performance across GPUs and FPGAs Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE

More information

Improving Area and Resource Utilization Lab

Improving Area and Resource Utilization Lab Lab Workbook Introduction This lab introduces various techniques and directives which can be used in Vivado HLS to improve design performance as well as area and resource utilization. The design under

More information

CS510 Operating System Foundations. Jonathan Walpole

CS510 Operating System Foundations. Jonathan Walpole CS510 Operating System Foundations Jonathan Walpole Threads & Concurrency 2 Why Use Threads? Utilize multiple CPU s concurrently Low cost communication via shared memory Overlap computation and blocking

More information

ECE 5775 (Fall 17) High-Level Digital Design Automation. More Pipelining

ECE 5775 (Fall 17) High-Level Digital Design Automation. More Pipelining ECE 5775 (Fall 17) High-Level Digital Design Automation More Pipelining Announcements HW 2 due Monday 10/16 (no late submission) Second round paper bidding @ 5pm tomorrow on Piazza Talk by Prof. Margaret

More information

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Introduction to MPI May 20, 2013 Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Top500.org PERFORMANCE DEVELOPMENT 1 Eflop/s 162 Pflop/s PROJECTED 100 Pflop/s

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

VHX - Xilinx - FPGA Programming in VHDL

VHX - Xilinx - FPGA Programming in VHDL Training Xilinx - FPGA Programming in VHDL: This course explains how to design with VHDL on Xilinx FPGAs using ISE Design Suite - Programming: Logique Programmable VHX - Xilinx - FPGA Programming in VHDL

More information

Programming Scalable Systems with MPI. UvA / SURFsara High Performance Computing and Big Data. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. UvA / SURFsara High Performance Computing and Big Data. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data Message Passing as a Programming Paradigm Gentle Introduction to MPI Point-to-point Communication Message Passing

More information

Implementation of Elliptic Curve Cryptosystems over GF(2 n ) in Optimal Normal Basis on a Reconfigurable Computer

Implementation of Elliptic Curve Cryptosystems over GF(2 n ) in Optimal Normal Basis on a Reconfigurable Computer Implementation of Elliptic Curve Cryptosystems over GF(2 n ) in Optimal Normal Basis on a Reconfigurable Computer Sashisu Bajracharya, Chang Shu, Kris Gaj George Mason University Tarek El-Ghazawi The George

More information

Tutorial - Using Xilinx System Generator 14.6 for Co-Simulation on Digilent NEXYS3 (Spartan-6) Board

Tutorial - Using Xilinx System Generator 14.6 for Co-Simulation on Digilent NEXYS3 (Spartan-6) Board Tutorial - Using Xilinx System Generator 14.6 for Co-Simulation on Digilent NEXYS3 (Spartan-6) Board Shawki Areibi August 15, 2017 1 Introduction Xilinx System Generator provides a set of Simulink blocks

More information

A Hardware / Software Co-Design System using Configurable Computing Technology

A Hardware / Software Co-Design System using Configurable Computing Technology A Hardware / Software Co-Design System using Configurable Computing Technology John Schewel Virtual Computer Corporation 6925 Canby Ave #103 Reseda, California, USA 91335 Abstract Virtual Computer Corporation

More information

The Cray XD1. Technical Overview. Amar Shan, Senior Product Marketing Manager. Cray XD1. Cray Proprietary

The Cray XD1. Technical Overview. Amar Shan, Senior Product Marketing Manager. Cray XD1. Cray Proprietary The Cray XD1 Cray XD1 Technical Overview Amar Shan, Senior Product Marketing Manager Cray Proprietary The Cray XD1 Cray XD1 Built for price performance 30 times interconnect performance 2 times the density

More information

Question 13 1: (Solution, p 4) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate.

Question 13 1: (Solution, p 4) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate. Questions 1 Question 13 1: (Solution, p ) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate. Question 13 : (Solution, p ) In implementing HYMN s control unit, the fetch cycle

More information

Effective Programming in C and UNIX Lab 6 Image Manipulation with BMP Images Due Date: Sunday April 3rd, 2011 by 11:59pm

Effective Programming in C and UNIX Lab 6 Image Manipulation with BMP Images Due Date: Sunday April 3rd, 2011 by 11:59pm 15-123 Effective Programming in C and UNIX Lab 6 Image Manipulation with BMP Images Due Date: Sunday April 3rd, 2011 by 11:59pm The Assignment Summary: In this assignment we are planning to manipulate

More information

Chapter 2. Procedural Programming

Chapter 2. Procedural Programming Chapter 2 Procedural Programming 2: Preview Basic concepts that are similar in both Java and C++, including: standard data types control structures I/O functions Dynamic memory management, and some basic

More information

Resource Efficient Real-Time Processing of Contrast Limited Adaptive Histogram Equalization

Resource Efficient Real-Time Processing of Contrast Limited Adaptive Histogram Equalization Resource Efficient Real-Time Processing of Contrast Limited Adaptive Histogram Equalization Burak Ünal, Ali Akoglu Reconfigurable Computing Lab Department of Electrical and Computer Engineering The University

More information

Evaluation of running FFTs on the Cray XD1 with attached FPGAs

Evaluation of running FFTs on the Cray XD1 with attached FPGAs Evaluation of running FFTs on the Cray XD1 with attached FPGAs Michael Babst DSPlogic, Inc. 13017 Wisteria Drive, #420, Germantown, MD 20874 Phone (301) 977-5970 Mike.Babst@dpslogic.com Roderick Swift

More information

A software platform to support dynamically reconfigurable Systems-on-Chip under the GNU/Linux operating system

A software platform to support dynamically reconfigurable Systems-on-Chip under the GNU/Linux operating system A software platform to support dynamically reconfigurable Systems-on-Chip under the GNU/Linux operating system 26th July 2005 Alberto Donato donato@elet.polimi.it Relatore: Prof. Fabrizio Ferrandi Correlatore:

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

DYNAMIC ARRAYS; FUNCTIONS & POINTERS; SHALLOW VS DEEP COPY

DYNAMIC ARRAYS; FUNCTIONS & POINTERS; SHALLOW VS DEEP COPY DYNAMIC ARRAYS; FUNCTIONS & POINTERS; SHALLOW VS DEEP COPY Pages 800 to 809 Anna Rakitianskaia, University of Pretoria STATIC ARRAYS So far, we have only used static arrays The size of a static array must

More information

Array. Prepared By - Rifat Shahriyar

Array. Prepared By - Rifat Shahriyar Java More Details Array 2 Arrays A group of variables containing values that all have the same type Arrays are fixed length entities In Java, arrays are objects, so they are considered reference types

More information

FPGA Solutions: Modular Architecture for Peak Performance

FPGA Solutions: Modular Architecture for Peak Performance FPGA Solutions: Modular Architecture for Peak Performance Real Time & Embedded Computing Conference Houston, TX June 17, 2004 Andy Reddig President & CTO andyr@tekmicro.com Agenda Company Overview FPGA

More information

Introduction to Embedded System Design using Zynq

Introduction to Embedded System Design using Zynq Introduction to Embedded System Design using Zynq Zynq Vivado 2015.2 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able

More information

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared

More information

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali

Integrating DMA capabilities into BLIS for on-chip data movement. Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali Integrating DMA capabilities into BLIS for on-chip data movement Devangi Parikh Ilya Polkovnichenko Francisco Igual Peña Murtaza Ali 5 Generations of TI Multicore Processors Keystone architecture Lowers

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

CSE 333 Lecture 2 Memory

CSE 333 Lecture 2 Memory CSE 333 Lecture 2 Memory John Zahorjan Department of Computer Science & Engineering University of Washington Today s goals - some terminology - review of memory resources - reserving memory - type checking

More information

CS242 COMPUTER PROGRAMMING

CS242 COMPUTER PROGRAMMING CS242 COMPUTER PROGRAMMING I.Safa a Alawneh Variables Outline 2 Data Type C++ Built-in Data Types o o o o bool Data Type char Data Type int Data Type Floating-Point Data Types Variable Declaration Initializing

More information

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School

More information

MPI Collective communication

MPI Collective communication MPI Collective communication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) MPI Collective communication Spring 2018 1 / 43 Outline 1 MPI Collective communication

More information