Latency Masking Threads on FPGAs

1 Latency Masking Threads on FPGAs
Walid Najjar, UC Riverside & Jacquard Computing Inc.

2 Credits
- Edward B. Fernandez (UCR)
- Dr. Jason Villarreal (Jacquard Computing)
- Adrian Park (Jacquard Computing)
- Robert Halstead (UCR)

3 FPGAs are for streaming data
- FPGAs excel at streaming data from sensors, disks, or the network
- Avoid memory offloading overhead
- Data is delivered directly to the data path, unlike on CPUs and GPUs

4 Irregular applications?
- Complex data structures
- Dynamically computed data addresses
- Multiple levels of indirection in data fetching
- Many classes of applications, not only graph algorithms
- Quad-, oct-, and kd-tree based algorithms: unstructured grids, finite element analysis, computational fluid dynamics, magnetism, etc.

5 QUESTIONS
- How do you do irregular applications on an FPGA?
- An example? Is it faster?
- Is there an intrinsic advantage to FPGAs for irregular applications?
- Can this structure be compiler generated?

7 Inspiration
- The Tera MTA (Cray XMT)
- Inspiration only: no control over the hardware design
- The FPGA is a blank sheet
- The memory system is already done; work with it

8 Latency masking hardware threads
- MTA-like hardware structure
  - Multiple active threads
  - Threads suspend on a memory access and resume when the data is available
  - Thread state is saved and restored
- Data path customized to the specific computation at hand
  - Minimal data path: only the needed functional units are instantiated

9 MTA-like architecture: in-order memory replies
[Diagram: executing thread; memory request <address>; waiting threads (FIFO ordered); memory reply <data>; ready threads]
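The in-order case can be illustrated with a small software model. This is a sketch only, not the hardware described in the slides: the fixed `latency`, the pointer-chasing workload, and the name `run_in_order` are assumptions made for illustration. Because replies come back in request order, the thread at the head of the FIFO is always the owner of the next reply, so no tag is needed.

```python
from collections import deque

def run_in_order(memory, starts, hops, latency=4):
    """Model threads that each follow `hops` pointers through `memory`.

    Assumes a fixed memory latency, so replies return in request order.
    Returns (final address per thread id, number of cycles simulated).
    """
    ready = deque((tid, addr, hops) for tid, addr in enumerate(starts))
    waiting = deque()    # (tid, hops_left), kept in request order (FIFO)
    in_flight = deque()  # (arrival_cycle, data), also in request order
    results = {}
    cycle = 0
    while ready or waiting:
        # Memory reply: the oldest reply belongs to the oldest waiting thread.
        if in_flight and in_flight[0][0] <= cycle:
            _, data = in_flight.popleft()
            tid, hops_left = waiting.popleft()
            ready.append((tid, data, hops_left))
        # Data path: execute one ready thread per cycle.
        if ready:
            tid, addr, hops_left = ready.popleft()
            if hops_left == 0:
                results[tid] = addr  # thread is finished
            else:
                # Issue the load, then suspend the thread until the reply.
                in_flight.append((cycle + latency, memory[addr]))
                waiting.append((tid, hops_left - 1))
        cycle += 1
    return results, cycle

if __name__ == "__main__":
    final, cycles = run_in_order([1, 2, 3, 0], starts=[0, 1, 2], hops=2)
    print(final, cycles)
```

While one thread waits for its data, the data path keeps issuing requests for the other threads, which is the latency masking effect the slide refers to.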

10 MTA-like architecture: out-of-order memory replies
[Diagram: executing thread; memory request <tag, address>; waiting threads (no order, held in a CAM); memory reply <tag, data>; ready threads]
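When replies can return out of order, each request needs a tag so the reply can be matched to its suspended thread. The sketch below models the CAM as a Python dictionary keyed by tag. It is illustrative only: `run_out_of_order`, the `latency_of` callback, and the ever-increasing tag counter are assumptions (real hardware would reuse a finite set of tags).

```python
import heapq
from collections import deque

def run_out_of_order(memory, starts, hops, latency_of):
    """Model pointer-chasing threads with variable memory latency.

    `latency_of(addr)` is a hypothetical per-address latency in cycles.
    Returns (final address per thread id, number of cycles simulated).
    """
    ready = deque((tid, addr, hops) for tid, addr in enumerate(starts))
    cam = {}         # tag -> (tid, hops_left); lookup by tag, no ordering
    in_flight = []   # min-heap of (arrival_cycle, tag, data)
    results = {}
    next_tag = 0
    cycle = 0
    while ready or cam:
        # Memory reply: deliver at most one reply per cycle, matched by tag.
        if in_flight and in_flight[0][0] <= cycle:
            _, tag, data = heapq.heappop(in_flight)
            tid, hops_left = cam.pop(tag)
            ready.append((tid, data, hops_left))
        # Data path: execute one ready thread per cycle.
        if ready:
            tid, addr, hops_left = ready.popleft()
            if hops_left == 0:
                results[tid] = addr
            else:
                tag = next_tag
                next_tag += 1
                cam[tag] = (tid, hops_left - 1)  # suspend under this tag
                arrival = cycle + latency_of(addr)
                heapq.heappush(in_flight, (arrival, tag, memory[addr]))
        cycle += 1
    return results, cycle

if __name__ == "__main__":
    slow_odd = lambda addr: 1 + 3 * (addr % 2)  # odd addresses are slower
    print(run_out_of_order([1, 2, 3, 0], [0, 1, 2], 2, slow_odd))
```

The results do not depend on the order in which replies arrive; only the cycle count does.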

11 QUESTIONS
- How do you do irregular applications on an FPGA?
- An example? Is it faster?
- Is there an intrinsic advantage to FPGAs for irregular applications?
- Can this structure be compiler generated?

12 FHAST
- FPGA Hardware Accelerated Sequence matching Tool
- Based on FM-Index data structures
- Latency masking hardware threads
- Customized data path, hand coded in VHDL
- On Convey HC-1 and Pico M501
- Over 100X speedup over Bowtie

13 Large-scale string matching
- String matching is found in a wide variety of applications
- Research in pattern searching led to the development of the FM-Index, which combines the properties of suffix arrays and the Burrows-Wheeler Transform (BWT)
- The FM-Index executes in logarithmic time
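To make the FM-Index idea concrete, here is a minimal sketch of backward search that counts the occurrences of a pattern. It is a textbook-style illustration, not FHAST's implementation: the function names are hypothetical, the BWT is derived from a naively sorted suffix array, and the occurrence counts are recomputed on every step rather than precomputed as a real index would do.

```python
def build_bwt(text, terminator="$"):
    """Build the BWT of `text` from its suffix array (naive construction)."""
    t = text + terminator
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    # The BWT symbol for a suffix is the character just before it (cyclically).
    return "".join(t[i - 1] for i in sa)

def fm_count(bwt_text, pattern):
    """Count occurrences of `pattern` using backward search over the BWT."""
    # first[c] = number of symbols in the text that sort strictly before c.
    first, total = {}, 0
    for c in sorted(set(bwt_text)):
        first[c] = total
        total += bwt_text.count(c)

    def occ(c, i):
        # Occurrences of c in bwt_text[:i]; a real index precomputes this.
        return bwt_text[:i].count(c)

    lo, hi = 0, len(bwt_text)  # current range of matching sorted suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

if __name__ == "__main__":
    index = build_bwt("banana")
    print(fm_count(index, "ana"))
```

Each step of the loop needs two lookups whose positions depend on the result of the previous step, which is the kind of dependent, irregular memory access that latency masking threads are meant to hide.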

14 BWT Generation
Burrows-Wheeler Transform
- Created for data compression
- Rearranges the symbols of the text into a form that is easily reversible and compressible
- BZIP2 is one sample application that uses the BWT
Steps to generate the BWT of text Q
1. Terminate the text Q with a unique character: $
2. Generate all rotations of the text
3. Sort all text rotations
4. Extract the last character of every entry in the sorted list
5. Join the characters in the same order they appear in the sorted list; the newly generated text is BWT(Q)
Suffix arrays
- A suffix array indicates the position of each possible suffix in the original string
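The generation steps above can be followed literally in a few lines of Python. This is a naive sketch for small inputs, with hypothetical function names; practical tools use far more memory-efficient constructions. The inverse function is included only to show that the transform is reversible.

```python
def bwt_by_rotations(q, terminator="$"):
    """Compute BWT(q) by sorting all rotations of the terminated text."""
    if terminator in q:
        raise ValueError("terminator must not occur in the text")
    t = q + terminator                                   # terminate the text
    rotations = [t[i:] + t[:i] for i in range(len(t))]   # all rotations
    rotations.sort()                                     # sort the rotations
    return "".join(r[-1] for r in rotations)             # join last characters

def suffix_array(q, terminator="$"):
    """Start position of each suffix of the terminated text, in sorted order."""
    t = q + terminator
    return sorted(range(len(t)), key=lambda i: t[i:])

def inverse_bwt(b, terminator="$"):
    """Recover the original text from its BWT (naive, quadratic method)."""
    table = [""] * len(b)
    for _ in range(len(b)):
        # Prepend the BWT column and re-sort; rows grow into full rotations.
        table = sorted(b[i] + table[i] for i in range(len(b)))
    row = next(r for r in table if r.endswith(terminator))
    return row[:-1]

if __name__ == "__main__":
    b = bwt_by_rotations("banana")
    print(b, suffix_array("banana"), inverse_bwt(b))
```

Note that `$` sorts before the letters in ASCII, which is what makes it a convenient terminator here.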

15 FHAST machine
[Diagram: HC-1 cache coherent shared virtual memory; reads queue; engines E0, E1, E2 (detection of perfect matches, one mismatch, two mismatches); found matches]

16 The Convey HC-1
- Application engines: four Virtex 5 LX330
- Cache coherent virtual shared memory, shared with 8 Xeon 2.13 GHz
- 150 MHz memory

17 FHAST on Convey HC-1
- First FHAST implementation
- Single FPGA used
- Software used only for
  - Instantiating the accelerator
  - Loading data to memory
- 150 MHz data path speed
  - Higher speeds possible

18 Experimental setup
- Sequence matching using two read sizes: 37 and 101 base pairs
- Run time on HC-1
- Compared to Bowtie software running on the HC-1 Xeon (2.13 GHz), same memory system

19 QUESTIONS
- How do you do irregular applications on an FPGA?
- An example? Is it faster?
- Is there an intrinsic advantage to FPGAs for irregular applications?
- Can this structure be compiler generated?

20 Execution time: FHAST vs. Bowtie
[Figure: two plots of execution time (sec) against reads (10^6): Bowtie execution time (Xeon 2.13 GHz, 1 core; axis tick at 2000 sec) and FHAST execution time (Virtex 5 LX330, 150 MHz; axis tick at 30 sec). Series: exact match & 1 mismatch, 2 mismatch.]

21 Comments on execution time
- Execution time difference between one mismatch (including exact match) and two mismatches
  - Bowtie: increases by ~2.5X
  - FHAST: increases by 1.7%
- Multithreaded parallelism & pipelining
- Definite advantage to using FPGAs

22 Measured speedup
[Figure: speedup against millions of reads (10^6), by number of mismatches; 37 bp per read, Virtex 5 LX. Speedup with no replication, on one FPGA.]

23 FPGA area utilization
[Figure: % FPGA utilization against read size (FPGA) for 37 bp (V5), 101 bp (V5), 37 bp (V6), and 101 bp (V6). Series: 0 mismatch, 1 mismatch, 2 mismatch.]

24 Comments on FPGA area utilization
- Minimal area increase from one to two mismatches
- Minimal area increase from 37 to 101 base pairs per read
- Potential speedup using Virtex 6 LX760: 2 × 95X = 190X per FPGA

25 QUESTIONS
- How do you do irregular applications on an FPGA?
- An example? Is it faster?
- Is there an intrinsic advantage to FPGAs for irregular applications?
- Can this structure be compiler generated?

26 CC-MTA
Compiler Customized Multi-Threaded Architectures

27 Compilation approach
- Based on ROCCC
- Detect the unstream condition: data dependent on long-latency loads
- Initiate threads in sequence (stream of threads)
- Suspend a thread after a memory read
- Save its state in a queue
  - Identify state variables
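One way to picture this transformation in software is with Python generators: each loop iteration becomes a thread, every long-latency load becomes a `yield`, and the generator frame plays the role of the saved state variables. This is an analogy under stated assumptions, not ROCCC output; `thread`, `run_threads`, the flat `memory` layout, and the fixed `latency` are all hypothetical.

```python
from collections import deque

def thread(i, b_base, a_base):
    """One loop iteration; each `yield` stands for a long-latency load."""
    idx = yield b_base + i    # load B[i]; the thread suspends here
    val = yield a_base + idx  # load A[B[i]]; address depends on the first load
    return i, val

def run_threads(memory, n, b_base, a_base, latency=3):
    """Start threads in sequence and resume each one when its data arrives."""
    pending = deque()  # (ready_cycle, suspended thread, data), in issue order
    results = {}
    next_i = 0
    cycle = 0
    while next_i < n or pending:
        if pending and pending[0][0] <= cycle:
            # Data is available: restore the thread's state and resume it.
            _, t, data = pending.popleft()
            try:
                addr = t.send(data)
            except StopIteration as done:
                i, val = done.value
                results[i] = val
            else:
                pending.append((cycle + latency, t, memory[addr]))
        elif next_i < n:
            # No reply this cycle: initiate the next thread in the stream.
            t = thread(next_i, b_base, a_base)
            addr = next(t)  # run up to the first load
            pending.append((cycle + latency, t, memory[addr]))
            next_i += 1
        cycle += 1
    return [results[i] for i in range(n)]

if __name__ == "__main__":
    # Memory holds B = [2, 0, 1] at address 0 and A = [10, 20, 30] at address 3.
    print(run_threads([2, 0, 1, 10, 20, 30], n=3, b_base=0, a_base=3))
```

The only values that must survive a suspension are `i` and, after the first load, `idx`; identifying such state variables is what lets the saved thread state stay small.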

28 Current status
- We can do any code, as long as it looks like: $C[j] = \sum_{i=0}^{n-1} A[B[i], j]$
- Speed-up: 29X, on Convey HC-1
- Stay tuned...
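For reference, the supported code shape can be written as a plain loop nest. The sketch below reads the slide's expression as a sum over i of A[B[i], j], which is an assumption since the formula is only partly legible in the transcript; `gather_sum` and its argument layout are hypothetical names chosen for illustration.

```python
def gather_sum(A, B, n_cols):
    """Compute C[j] as the sum over i of A[B[i]][j] (assumed kernel shape)."""
    C = [0] * n_cols
    for j in range(n_cols):
        for i in range(len(B)):
            row = B[i]         # first load: the index itself comes from memory
            C[j] += A[row][j]  # second load: its address depends on the first
    return C

if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    B = [2, 0, 2]
    print(gather_sum(A, B, n_cols=2))
```

The row of A that is read is not known until B[i] has been fetched, so the access pattern cannot be streamed; this dependent load is the point at which a thread would be suspended.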

29 Conclusion
- Latency masking hardware threads on FPGAs can be done
- Customized to a specific computation
- Tremendous speedup
- Very small FPGA footprint
- Roadmap to Compiler Customized MT architectures on FPGAs

30 Taking FPGA accelerators to the next level

31 Thank you! QUESTIONS?
