Mohamed Taher The George Washington University

1 Experience Programming Current HPRCs Mohamed Taher The George Washington University

2 Acknowledgements
GWU: Prof. Tarek El-Ghazawi and Esam El-Araby
GMU: Prof. Kris Gaj
ARSC, SRC, SGI, CRAY

3 Outline
- Introduction
- Development Flow:
  - SRC-6
  - Cray-XD1
  - SGI Altix-350
- Conclusion

4 High Performance Reconfigurable Computers (HPRCs)
HPRCs are computing systems based on the close system-level integration of one or more general-purpose processors and one or more reconfigurable processors (RPs).
- The computational cores are mapped to the reconfigurable hardware
- The processors:
  - Perform the operations that cannot be done efficiently in the reconfigurable hardware
  - Use specific APIs to:
    - Download configuration codes into the RPs
    - Transfer data to/from RP memory
    - Start/stop the program
[Figure: block diagram with the labels µp, RP, Memory, Control, and Data]

5 Development Flow for RCs

6 SRC-6

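7 SRC Development Flow
[Figure: SRC development flow. Application sources (.c or .f files) pass through the µp compiler to object files (.o files) and through the MAP compiler to HDL sources (.v files). Macro sources (.vhd or .v files) and the HDL sources go through logic synthesis to netlists (.ngo files), then through place & route to configuration bitstreams (.bin files). The linker combines the object files and the bitstreams into the application executable.]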

8 MAP Compiler
- Compiles C/Fortran code to reconfigurable hardware
  - Generated code: circuits
    - Basic blocks in inner loop bodies are merged and become pipelined circuits
    - Basic blocks in outer loops become special-purpose hardware function units
- C/Fortran code can be extended using macros, allowing program transformations that cannot be expressed straightforwardly in C/Fortran
- Macros have semantics unlike C/Fortran functions:
  - have a period (number of clocks between inputs)
  - have a pipeline delay (number of clocks between input and output)
  - the MAP compiler takes care of period and delay
  - can have state (kept between macro calls)

9 Example Application: Wavelet
Main Program
- Allocate the RP
- Configure and start the program execution on the FPGA
  - Pass the input image pointer and the output image buffer pointer, to be used by DMA
  - Individual parameters, such as the image dimensions, can be passed to the MAP C function
  - A large parameter array, such as the wavelet coefficients, can be transferred using DMA by passing the pointer to the coefficients array
- Free the RP

    int main (int argc, char *argv[])
    {
        ..
        /* initialize daubechies coefficients */
        float coeff[8];
        coeff[0] = coeff[1] = (1+sqrt(3))/(4*sqrt(2));
        coeff[2] = coeff[3] = (3+sqrt(3))/(4*sqrt(2));
        coeff[4] = coeff[5] = (3-sqrt(3))/(4*sqrt(2));
        coeff[6] = coeff[7] = (1-sqrt(3))/(4*sqrt(2));
        ..
        /* allocate images */
        .
        map_allocate(1);
        gettimeofday(&time0, NULL);
        proc_fpga (image_in, image_out, coeff, dx, dy, &ht0, &ht1, &ht2, &ht3, mapno);
        gettimeofday(&time1, NULL);
        /* print time difference */
        ..
        map_free(1);
        .
    }

Mohamed Taher, GWU. ARSC HPRC Workshop, Fairbanks, AK, August 22-24,

10 Example Application: Wavelet
MAP C Function (FPGA.mc)
[Flowchart]
1. Transfer image data to OBM bank A
2. Transfer coefficients to OBM bank C
3. Load coefficients from bank C to on-chip registers
4. Read one pixel from bank A
5. Compute wavelet
6. Store result into bank B
7. End of image? If no, return to step 4; if yes, continue
8. Transfer image data from bank B to the host

11 Example Application: Wavelet
MAP C Function (FPGA.mc)

    void proc_fpga(int64_t image_in[], int64_t image_out[], int64_t coeff[],
                   int dx, int dy,
                   long long *ht0, long long *ht1, long long *ht2, long long *ht3,
                   int mapnum)
    {
        // coefficients
        float LP0, LP1, LP2, LP3, LP4;
        float HP0, HP1, HP2, HP3, HP4;
        // variables
        int i, j, k;
        int64_t in_pixel, out_pixel;
        // input image
        OBM_BANK_A (AL, int64_t, MAX_OBM_SIZE)
        // output image
        OBM_BANK_B (BL, int64_t, MAX_OBM_SIZE)
        // filter coefficients
        OBM_BANK_C (CL, int64_t, MAX_OBM_SIZE)

12 Example Application: Wavelet
MAP C Function (FPGA.mc)

        start_timer();
        read_timer(ht0);
        // transfer image data to an OBM bank (DMA input image transfer)
        DMA_CPU (CM2OBM, AL, MAP_OBM_stripe (1, "A"), image_in, 1, Image_size, 0);
        wait_dma (0);
        // transfer coefficients to an OBM bank (DMA coefficients transfer)
        DMA_CPU (CM2OBM, CL, MAP_OBM_stripe (1, "C"), coeff, 1, 4*sizeof(int64_t), 0);
        wait_dma (0);
        read_timer(ht1);

        // load coefficients from the OBM bank to on-chip registers
        for (i = 0; i < 4; i++) {
            LP0 = LP1; HP0 = HP1;
            LP1 = LP2; HP1 = HP2;
            LP2 = LP3; HP2 = HP3;
            split_64to32_flt_flt(CL[i], &HP3, &LP3);
        }
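The coefficient loop above unpacks two 32-bit floats from each 64-bit OBM word. A minimal host-side sketch of that packing, in portable C, may make the data layout easier to picture. The helper names `pack_flt_flt` and `unpack_flt_flt` are hypothetical, and the choice of which float sits in the upper half of the word is an assumption made for illustration; the ordering actually used by `split_64to32_flt_flt` should be checked against the SRC documentation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");

/* Hypothetical helper: pack two floats into one 64-bit word.
 * Assumption: `hi` occupies the upper 32 bits, `lo` the lower 32 bits. */
int64_t pack_flt_flt(float hi, float lo)
{
    uint32_t hi_bits, lo_bits;
    uint64_t bits;
    int64_t word;

    memcpy(&hi_bits, &hi, sizeof hi_bits);   /* reinterpret float bits */
    memcpy(&lo_bits, &lo, sizeof lo_bits);
    bits = ((uint64_t)hi_bits << 32) | (uint64_t)lo_bits;
    memcpy(&word, &bits, sizeof word);       /* same bits, signed type */
    return word;
}

/* Hypothetical helper: inverse of pack_flt_flt. */
void unpack_flt_flt(int64_t word, float *hi, float *lo)
{
    uint32_t hi_bits, lo_bits;

    assert(hi != NULL && lo != NULL);
    hi_bits = (uint32_t)((uint64_t)word >> 32);
    lo_bits = (uint32_t)((uint64_t)word & 0xFFFFFFFFu);
    memcpy(hi, &hi_bits, sizeof *hi);
    memcpy(lo, &lo_bits, sizeof *lo);
}
```

With helpers like these, the host could fill a four-word coefficient array by packing one high-pass and one low-pass coefficient per word before the DMA transfer.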

13 Example Application: Wavelet
MAP C Function (FPGA.mc)

        for (i = 0; i < Image_size; i++) {
            // read pixel value from the OBM bank
            in_pixel = AL[i];
            // compute Wavelet
            ...
            // store results to the OBM bank
            BL[i] = out_pixel;
        }
        read_timer(ht2);
        // transfer image data to the host
        DMA_CPU (OBM2CM, BL, MAP_OBM_stripe (1, "B"), image_out, 1, Image_size, 0);
        wait_dma (0);
        read_timer(ht3);
    }
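The slides elide the wavelet arithmetic itself. As a software reference for checking hardware results, one decomposition level of a Daubechies-4 transform could be sketched as below. This is not the code from the MAP function: it is a 1-D, double-precision version with periodic extension at the signal boundary, the function names are hypothetical, and the sign convention for the high-pass filter is one common choice. The square roots are written as constants so the sketch does not depend on the math library.

```c
#include <assert.h>
#include <stddef.h>

#define SQRT2 1.4142135623730951
#define SQRT3 1.7320508075688772

/* Fill h[] with the four Daubechies-4 low-pass coefficients and g[] with
 * the matching high-pass coefficients (quadrature mirror of h). */
void d4_coefficients(double h[4], double g[4])
{
    const double d = 4.0 * SQRT2;

    h[0] = (1.0 + SQRT3) / d;
    h[1] = (3.0 + SQRT3) / d;
    h[2] = (3.0 - SQRT3) / d;
    h[3] = (1.0 - SQRT3) / d;
    g[0] =  h[3];
    g[1] = -h[2];
    g[2] =  h[1];
    g[3] = -h[0];
}

/* One decomposition level over n samples (n even, n >= 4).
 * approx[] and detail[] each receive n/2 values.
 * Assumption: the signal is extended periodically at the boundary. */
void d4_forward(const double *in, size_t n, double *approx, double *detail)
{
    double h[4], g[4];

    assert(n >= 4 && n % 2 == 0);
    d4_coefficients(h, g);
    for (size_t i = 0; i < n / 2; i++) {
        double a = 0.0, d = 0.0;
        for (size_t k = 0; k < 4; k++) {
            double x = in[(2 * i + k) % n];   /* wrap around at the end */
            a += h[k] * x;
            d += g[k] * x;
        }
        approx[i] = a;
        detail[i] = d;
    }
}
```

A quick sanity check is a constant input: the low-pass coefficients sum to the square root of 2 and the high-pass coefficients sum to zero, so every approximation value is the input scaled by that factor and every detail value vanishes.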

14 Using On-Chip Memory (OCM) in SRC

    void sum(int64_t a[], int *c, int mapno)
    {
        OBM_BANK_A (AL, int64_t, SIZE)
        uint64_t ocm_a [SIZE];
        int i;
        DMA_CPU (CM2OBM, AL, MAP_OBM_stripe (1, "A"), a, 1, bytelength, 0);
        wait_dma (0);
        for (i=0; i<SIZE; i++)
            ocm_a[i] = AL[i];
        for (i=0; i<SIZE; i++)
            tmp = ocm_a[i] + tmp;
    }

[Figure: inside the FPGA, AL[] in OBM feeds ocm_a[] in OCM over a 64-bit path; the computations produce the 32-bit result c]

15 FPGA Mapping in SRC
FPGA1.mc

    void fpga1(int64_t a, int64_t b, int64_t *sum, int mapno)
    {
        int64_t c, temp;
        send_to_bridge(b);
        c = a * const1;
        recv_from_bridge(&temp);
        *sum = temp + c;
    }

FPGA2.mc

    void fpga2()
    {
        int64_t a, d;
        recv_from_bridge(&a);
        d = a/const2;
        send_to_bridge(d);
    }

Makefile

    MAPFILES = FPGA1.mc FPGA2.mc
    PRIMARY = FPGA1.mc
    SECONDARY = FPGA2.mc
    CHIP2 = FPGA2.mc

[Figure: FPGA 1 multiplies a, FPGA 2 divides b, and FPGA 1 adds the two results to produce sum]

16 Overlapping Data Transfer with Computation
- Improve performance by overlapping algorithm computation with data loading and unloading
- Parallel sections: multiple parallel code blocks are active in parallel

[Figure: timeline over cycles 1 to 5 in which Read DMA, Algorithm, and Write DMA overlap on successive data blocks]

    { /* DMA_IN 1st BLOCK BUFFER */ }
    for (i = 0; i < LoopMax; i++) {
    #pragma src parallel sections
        {
    #pragma src section
            {
                for (i = 0; i < InputCountPerLoop; i++) {
                    /* DO COMPUTATION (Current Data Block) */
                }
            } /* end of parallel section with compute loop */
    #pragma src section
            {
                /* DMA_IN NEXT BLOCK BUFFER */
            } /* end of parallel section with DMA */
        } /* end of parallel sections */
    } /* for LoopSub from 0 to LoopMax */
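The structure of this loop is a ping-pong (double) buffer: while one buffer is being processed, the next block is fetched into the other. The sketch below models that control flow in plain C. It runs sequentially, so it shows only the buffer bookkeeping, not real concurrency; `load_block` is a hypothetical stand-in for the DMA call, and the block size and the summing "computation" are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 4   /* words per block (illustrative) */

/* Hypothetical stand-in for a DMA transfer of one block. On real hardware
 * this would run concurrently with the compute loop. */
void load_block(int64_t *dst, const int64_t *src, size_t block_index)
{
    memcpy(dst, src + block_index * BLOCK, BLOCK * sizeof *dst);
}

/* Sum all words of `data`, processed block by block with two buffers. */
int64_t sum_blocks_double_buffered(const int64_t *data, size_t nblocks)
{
    int64_t buf[2][BLOCK];
    int64_t total = 0;

    assert(data != NULL && nblocks > 0);
    load_block(buf[0], data, 0);                 /* DMA in the first block */

    for (size_t b = 0; b < nblocks; b++) {
        int cur = (int)(b & 1);                  /* buffer holding block b */

        /* "DMA section": fetch the next block into the other buffer. */
        if (b + 1 < nblocks)
            load_block(buf[1 - cur], data, b + 1);

        /* "Compute section": work on the current block only. */
        for (size_t i = 0; i < BLOCK; i++)
            total += buf[cur][i];
    }
    return total;
}
```

The two sections inside the loop body touch different buffers, which is what would make it safe to run them in parallel on the hardware.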

17 Streams
- A stream is a data structure that allows flexible communication between concurrent producer and consumer loops
- Saves access to on-board memory: data flows in the logic

    Stream_64 S0;
    #pragma src parallel sections
    {
    #pragma src section
        {
            int i;
            for (i=0; i<sz; i++)
                put_stream (&S0, AL[i]+42, 1);
        } /* end of parallel section */
    #pragma src section
        {
            int i;
            for (i=0; i<sz; i++)
                get_stream (&S0, &BL[i]);
        } /* end of parallel section */
    } /* end of parallel sections */

[Figure: conventional data flow, in which Compute Loop 1 and Compute Loop 2 run one after the other and exchange data through on-board memory or BRAM, compared with streams, in which Compute Loop 1 feeds Compute Loop 2 directly and the two loops overlap in time]
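A stream behaves like a bounded FIFO between the producer and consumer loops. The following software model is a sketch of that behaviour only: `fifo_t`, `fifo_put`, `fifo_get`, and the depth of 4 are hypothetical names and values chosen for illustration, not the SRC stream implementation, and the producer and consumer steps are interleaved in one loop instead of running concurrently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4   /* illustrative depth */

typedef struct {
    int64_t slot[FIFO_DEPTH];
    size_t head;    /* index of the next entry to read */
    size_t count;   /* number of valid entries */
} fifo_t;

/* Returns false when the FIFO is full (the producer must retry). */
bool fifo_put(fifo_t *f, int64_t v)
{
    if (f->count == FIFO_DEPTH)
        return false;
    f->slot[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return true;
}

/* Returns false when the FIFO is empty (the consumer must retry). */
bool fifo_get(fifo_t *f, int64_t *v)
{
    if (f->count == 0)
        return false;
    *v = f->slot[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}

/* Producer writes al[i] + 42 into the FIFO; consumer drains it into bl[].
 * One producer step and one consumer step are taken per iteration. */
void stream_add42(const int64_t *al, int64_t *bl, size_t sz)
{
    fifo_t s0 = { {0}, 0, 0 };
    size_t produced = 0, consumed = 0;

    assert(al != NULL && bl != NULL);
    while (consumed < sz) {
        if (produced < sz && fifo_put(&s0, al[produced] + 42))
            produced++;
        if (fifo_get(&s0, &bl[consumed]))
            consumed++;
    }
}
```

No intermediate array the size of the whole data set is needed; only the few FIFO slots sit between the two loops, which is the saving in on-board memory accesses that the slide describes.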

18 Performance Optimizations
- Performance tuning (removing inefficiencies)
  - avoid re-reading data from OBMs: use delay queues
  - avoid read/write conflicts in the same iteration
  - avoid multiple accesses to a memory in one iteration
- MAP routine optimization
  - Pipelined loops: all function units within the loop are computing at every clock
  - Parallel sections: multiple parallel code blocks are active in parallel
  - Multiple FPGAs: logic in both FPGAs can be computing in parallel
  - Utilize streams: multiple serial code blocks are active in parallel; all function units within the loop are computing at every clock
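The first tuning item, avoiding re-reads by keeping earlier values in a delay queue, can be illustrated with a small shift-register loop in plain C. This is a sketch of the idea only, not SRC macro code: the function name is hypothetical and the four-sample running sum is an arbitrary example of a computation that needs neighbouring samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* out[i] = in[i] + in[i-1] + in[i-2] + in[i-3], treating samples before
 * in[0] as zero. Each in[i] is read from memory exactly once; the three
 * older samples are carried in the local variables d1, d2, d3. */
void sum4_delay_line(const int64_t *in, int64_t *out, size_t n)
{
    int64_t d1 = 0, d2 = 0, d3 = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        int64_t x = in[i];      /* the only memory read in this iteration */

        out[i] = x + d1 + d2 + d3;
        d3 = d2;                /* shift the delay line by one sample */
        d2 = d1;
        d1 = x;
    }
}
```

A version that indexed `in[i-1]`, `in[i-2]`, and `in[i-3]` directly would issue four reads to the same memory per iteration, which is the pattern the slide advises against.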

19 Cray XD1

20 XD1 Development Flow
[Figure: hardware flow and software flow]

21 Operational Scenarios

22 Example Application: Wavelet
- Define the address space for user registers and QDR memory
- Open the FPGA device
- Load the FPGA
- Transfer coefficients into the FPGA registers

    #define APP_CFG_REG 0x08UL
    #define USR_REG1    0x40UL
    #define USR_REG2    0x48UL
    #define USR_REG3    0x50UL
    #define USR_REG4    0x58UL
    #define QDR1_OFFSET 0x100UL /* Offset in FPGA memory space. */

    int main (int argc, char *argv[])
    {
        int fp_id;
        err_e e;
        u_64 coeff[4];
        u_64 * dp_base;
        u_64 * image;
        /* Open the FPGA device */
        fp_id = fpga_open ("/dev/ufp0", O_RDWR | O_SYNC, &e);
        /* Load the FPGA */
        fpga_load (fp_id, "top.bin.ufp", &e);
        ..
        /* Read Image */
        .
        /* initialize daubechies coefficients */
        .
        /* Transfer coefficients into the FPGA registers */
        fpga_wrt_appif_val (fp_id, coeff[0], USR_REG1, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[1], USR_REG2, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[2], USR_REG3, TYPE_VAL, &e);
        fpga_wrt_appif_val (fp_id, coeff[3], USR_REG4, TYPE_VAL, &e);

23 Example Application: Wavelet

        /* Configure the Wavelet for QDR bridging */
        fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);
        /* Map the entire 4 Mbytes of QDR memory */
        dp_base = fpga_memmap (fp_id, QDR_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, QDR1_OFFSET, &e);
        /* Transfer the image into the QDR */
        for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
            dp_base[i] = image[i];
        /* Start processing */
        fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);
        /*... */
        /* Read the FPGA status */
        fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);
        /*... */
        /* Configure the Wavelet for QDR bridging */
        fpga_wrt_appif_val (fp_id, 0x0UL, APP_CFG_REG, TYPE_VAL, &e);
        /* Read back the image */
        for (i = 0; i < QDR_SIZE/sizeof (u_64); i++)
            output_image[i] = dp_base[i];
        /* Close the FPGA device */
        fpga_close (fp_id, &e);
    }

24 Accessing µp memory from FPGA
- The APIs support access to a region of the µp memory by the FPGA logic
- The program uses the fpga_set_ftrmem function to:
  - Allocate an FTR
  - Associate it with the address space of the µp
  - Set up the FPGA to access it directly
- It does not automatically provide the address of this region to the FPGA application logic
  - One way is to establish an FPGA register for that purpose and use the fpga_wrt_appif_val function to write the value to the register

    unsigned long order;
    void *ftr_mem;
    /*... */
    ftr_mem = fpga_set_ftrmem(fpga_fd, order, &err);
    if (ftr_mem == NULL) {
        /* Handle error. */
    }
    fpga_wrt_appif_val (fp_id, (u_64) ftr_mem, BUFF0_PTR_REG, TYPE_ADDR, &e);
    fpga_wrt_appif_val (fp_id, 0x1UL, APP_CFG_REG, TYPE_VAL, &e);
    /*... */
    fpga_rd_appif_val (fp_id, &status, STATUS_REG, &e);
    /*... */

25 Using MPI on Cray XD1
- Each Cray XD1 chassis consists of 6 dual-processor Opteron nodes
  - 2 Opteron processors per node (total 12)
  - 1 Xilinx Virtex-II Pro 50 per node (total 6)
- Applications can be parallelized across the 6 FPGAs using MPI
- Data are distributed across the 6 FPGAs

    if (mythread==0)
        read_image (image_file_name, image_buffer, &rows, &cols);
    MPI_Bcast(&rows, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&cols, 1, MPI_INT, 0, MPI_COMM_WORLD);
    local_size = rows*cols/threads;
    MPI_Scatter(image_buffer, local_size, MPI_UNSIGNED_LONG,
                local_image_buffer, local_data_size, MPI_UNSIGNED_LONG,
                0, MPI_COMM_WORLD);
    /* Execute the wavelet on the Hardware */
    process_image(fp_id, local_image_buffer, local_output_buffer, rows, cols);
    MPI_Gather(local_output_buffer, local_size, MPI_UNSIGNED_LONG,
               output_image_buffer, local_data_size, MPI_UNSIGNED_LONG,
               0, MPI_COMM_WORLD);
    if (mythread==0)
        write_image (output_file_name, output_image_buffer, rows, cols);
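The listing computes `local_size = rows*cols/threads`, which distributes the whole image only when the pixel count divides evenly by the number of ranks. For other sizes, one option is to build per-rank counts and displacements of the kind `MPI_Scatterv` and `MPI_Gatherv` accept. The helper below is a sketch of that bookkeeping in plain C, with no MPI calls; the function name `block_partition` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Split `total` elements over `nprocs` ranks as evenly as possible.
 * The first (total % nprocs) ranks receive one extra element.
 * counts[r] is the number of elements for rank r and displs[r] is the
 * offset of its first element; both arrays must hold nprocs entries.
 * Assumption: total fits in an int, as MPI count arguments require. */
void block_partition(size_t total, int nprocs, int *counts, int *displs)
{
    size_t base, extra, offset = 0;

    assert(nprocs > 0 && counts != NULL && displs != NULL);
    base = total / (size_t)nprocs;
    extra = total % (size_t)nprocs;

    for (int r = 0; r < nprocs; r++) {
        size_t c = base + ((size_t)r < extra ? 1 : 0);

        counts[r] = (int)c;
        displs[r] = (int)offset;
        offset += c;
    }
}
```

For the six-FPGA chassis described above, a 20-element buffer would be split as 4, 4, 3, 3, 3, 3, while a buffer whose size is a multiple of six reduces to the equal split used in the listing.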

26 SGI Altix 350

27 Development Flow

28 Example Application: Wavelet
Parameter Passing
- Small parameters
  - Connect to Algorithm Defined Registers (alg_def_reg0 - alg_def_reg7)
  - Pass parameter mapping to software through an extractor directive, type REG_IN:

    -- extractor REG_IN: coeff0 64 u alg_def_reg0[63:0]
    -- extractor REG_IN: coeff1 64 u alg_def_reg1[63:0]
    -- extractor REG_IN: coeff2 64 u alg_def_reg2[63:0]
    -- extractor REG_IN: coeff3 64 u alg_def_reg3[63:0]

- Large arrays
  - Dedicate a portion of an SRAM bank for the parameter array
  - Pass parameter array mapping to software with an extractor comment of type SRAM:

    -- extractor SRAM:a_in sram[0] 0x00 in u fixed

    rasclib_algorithm_request_t ar;
    int alg_id;
    int res;
    strcpy(ar.alg_id, "Wavelet");
    ar.num_devices = 1;
    ..
    /* Read Image */
    .
    /* initialize daubechies coefficients */
    .
    rasclib_resource_alloc(&ar, 1);
    alg_id = rasclib_algorithm_open("Wavelet");
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff0", coeff[0]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff1", coeff[1]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff2", coeff[2]);
    res = rasclib_algorithm_alg_reg_write (alg_id, "coeff3", coeff[3]);
    res = rasclib_algorithm_send (alg_id, "a_in", Image_Buff, SIZE);

29 Example Application: Wavelet

    rasclib_algorithm_go (alg_id);
    res = rasclib_algorithm_receive (alg_id, "d_out", out_buff, SIZE);
    rasclib_algorithm_commit (alg_id);
    rasclib_algorithm_wait (alg_id);
    rasclib_algorithm_close (alg_id);

Reading results back
- Small parameters
  - Connect to Algorithm Defined Registers
  - Pass parameter mapping to software through an extractor directive, type REG_OUT
  - Use the API function rasclib_algorithm_reg_read
- Large arrays
  - Dedicate a portion of an SRAM bank for the parameter array
  - Pass parameter array mapping to software with an extractor comment of type SRAM:

    -- extractor SRAM:d_out sram[1] 0x00 out u fixed

30 Streaming
- Improve performance by overlapping algorithm computation with data loading and unloading
- Extractor directives are used to tell software:
  - where input/output data arrays are located (SRAM bank + starting index)
  - the sizes of the input/output data arrays
  - which arrays have been enabled for streaming
- Extractor directive type used: SRAM with attribute stream, e.g.:

    -- extractor SRAM:a_in sram[0] 0x00 in u stream
    -- extractor SRAM:d_out sram[1] 0x00 out u stream

[Figure: timeline over cycles 1 to 5 in which Read DMA, Algorithm, and Write DMA overlap in time]
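The benefit of overlapping the three stages can be estimated with a simple cycle model. The sketch below assumes, purely for illustration, that each of the read, compute, and write stages takes one "cycle" per data block; the function names are hypothetical and the model ignores any platform-specific setup cost.

```c
#include <assert.h>

/* Cycles when read, compute, and write run strictly one after another. */
unsigned long serial_cycles(unsigned long nblocks)
{
    return 3UL * nblocks;
}

/* Cycles when the three stages overlap: two cycles to fill the pipeline,
 * then one block completes per cycle. */
unsigned long streamed_cycles(unsigned long nblocks)
{
    return nblocks == 0 ? 0UL : nblocks + 2UL;
}

/* Block handled by a stage (0 = read DMA, 1 = algorithm, 2 = write DMA)
 * in a given 1-based cycle, or -1 when that stage is idle. */
long block_in_stage(unsigned long cycle, int stage, unsigned long nblocks)
{
    long b;

    assert(stage >= 0 && stage <= 2);
    b = (long)cycle - 1 - stage;
    return (b >= 0 && (unsigned long)b < nblocks) ? b : -1;
}
```

Under these assumptions three blocks finish in five cycles when streamed, against nine when the stages are serialized, and the gap widens as the number of blocks grows.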

31 Conclusions
- In all of the examined cases, µp high-level language programs need extensive use of explicit APIs to interact with the FPGA
- In the case of the Cray XD1, the FPGA can directly access the µp memory for reading and writing. This eliminates the DMA time in pipelined designs
- In the case of SRC, the FPGA side can be completely programmed using MAP C/Fortran
- Some explicit optimizations can be specified in the MAP C code (on the FPGA side) to explicitly perform forwarding and chaining
- In the SGI case, explicit optimization can be used on the FPGA side to overlap I/O with computation
- Although the two example optimizations from SRC and SGI do different things, they share the same name (streaming). This points to the need for standardized terminology, to avoid confusing application developers who work across different platforms
