Tracking Acceleration with FPGAs. Future Tracking, CMS Week 4/12/17 Sioni Summers

1 Tracking Acceleration with FPGAs Future Tracking, CMS Week 4/12/17 Sioni Summers

2 Contents
- Introduction
- FPGAs & 'DataFlow Engines' for computing
- Device architecture
- Maxeler
- HLT Tracking Acceleration

3 Introduction
- With the Combinatorial Kalman Filter, computation time scales exponentially with pileup
- 140 to 200 PU expected for the High-Luminosity LHC
- For the HLT, Moore's-law improvement in CPU power is not enough to meet latency limits
- Can we accelerate the algorithm with FPGAs?
- Previously demonstrated 1.5 μs KF fitting on FPGAs for the L1 track trigger (CMS Note 2017/009)

4 FPGAs for computing
- A collection of digital components
  - Lookup tables (LUTs) for general logic, multipliers, memories
  - Routing fabric
- Program the component configuration and connectivity
  - With degrees of abstraction
- Modern high-end devices contain
  - ~10k multipliers
  - > 10 MB internal memory with Tb/s bandwidth
  - millions of LUTs
  - multi-Tb/s IO bandwidth
  - (and cost > $10k)
- Every operation can execute on every clock cycle
  - Huge parallelism of operations: enables low latency
- Registers hold values between operations
  - Creates a 'pipeline' of data: enables high throughput
- Placing and routing a design is time consuming
  - 'Write once, run many'

5 FPGAs vs. GPUs
FPGAs:
- Processing customised to the application
- High programming effort
- Highest upfront cost
- Best 'Flops per watt'
GPUs:
- Software accelerator
- Lower programming effort
- Lower upfront cost
- More power intensive per Flop
Both:
- Good at: the same operations on every data item
- Bad at: 'control flow' (branching)

6 Maxeler
- Manufacture FPGA PCIe cards ('DataFlow Engines')
- Different arrangements with CPUs available
- Making high-performance FPGA applications easier
  - High-level programming language (MaxJ, MaxCompiler)
  - Software for CPU/FPGA interaction

7 DataFlow Engine connectivity
[Diagram: Host CPU connected to the 'DFE' over PCIe (1 GB/s); on the DFE, the FPGA with BRAM (30 MB, 10 TB/s) and DRAM (100 GB, 100 GB/s)]
- Host CPU and FPGA card connected over the PCIe bus
- FPGA can access on-card memory
  - External DRAM and internal B(lock)RAM
- Possibility for inter-FPGA connections, networking

8 Acceleration Setup
- Maxeler 'MPC-X' node at the STFC Hartree facility
  - 'FPGAs as a shared resource'
  - Altera Stratix V FPGAs, 8 per 1U box
  - Infiniband network & PCIe switch
  - Intel Xeon CPUs

9 Tracking Algorithm Summary
- Track building with the Combinatorial Kalman Filter
  - Propagate, search, update
- Have not changed the algorithm at all
- Using CMSSW for parts not implemented on the FPGA

10 Tracking Algorithm Summary

    candidates = seedingstep();                        // Not considered here
    while(!candidates.empty()){
      newcandidates = emptyvector();
      for(candidate : candidates){
        measurements = findmeasurements(candidate);    // Matrix maths + searching through mem.
        for(measurement : measurements){
          newcand = update(candidate, measurement);    // Matrix maths
          if(newcand.finished){
            addtoresult(newcand);
          }else{
            newcandidates.push_back(newcand);
          }
        }
        newcandidates = bestn(newcandidates);          // Sort
      }
      candidates.swap(newcandidates);
    }
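
The 'Sort' step in the loop above keeps only the best few candidates. A minimal sketch of such a selection follows, assuming each candidate carries a chi2 value where lower is better. `Candidate`, `chi2` and this `bestN` signature are hypothetical names for illustration, not the CMSSW ones.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical candidate: only the fields needed for ranking.
struct Candidate {
    int id;
    double chi2;  // assumed quality measure, lower is better
};

// Keep the n best candidates (lowest chi2), ordered best first.
std::vector<Candidate> bestN(std::vector<Candidate> cands, std::size_t n) {
    const std::size_t keep = std::min(n, cands.size());
    // partial_sort only fully orders the first `keep` elements,
    // which is cheaper than sorting the whole vector.
    std::partial_sort(cands.begin(), cands.begin() + keep, cands.end(),
                      [](const Candidate& a, const Candidate& b) {
                          return a.chi2 < b.chi2;
                      });
    cands.resize(keep);
    return cands;
}
```

Calling `bestN(newcandidates, 5)` would cap the candidate list at five entries per iteration; the cut-off value is a tuning choice and is not given on the slide.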

11 DFE Porting
- Implemented update on the DFE
  - Kalman Filter, single precision float
- FPGA code which is (mostly) readable!
- 'Unroll' all matrix multiplications
  - Every multiply in parallel
  - Adder tree
- Fine-grained parallelism

    K = C * H * Rinv;                               // Matrix
    M = I - K * H;                                  // Matrix
    x_up = x + K * r;                               // Vector
    C_up = similarity(M, C) + similarity(K, V);     // Matrix

    DFELink statein = addStreamFromCPU("state");
    DFELink hitin = addStreamFromCPU("hit");
    ...
    KFUpdator.getInput("state") <== statein;
    KFUpdator.getInput("hit") <== hitin;
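
The four update equations can be checked on the CPU with a small reference implementation. The sketch below is a plain C++ illustration, not the MaxJ kernel: it assumes a 2-parameter state and a scalar measurement so that Rinv is a simple division, and it uses the textbook gain K = C * H^T * Rinv with R = H * C * H^T + V. `Mat`, `TrackState` and `update` are names chosen here for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <std::size_t R, std::size_t C>
using Mat = std::array<std::array<double, C>, R>;

// Matrix product; on an FPGA each multiply below could be its own multiplier.
template <std::size_t R, std::size_t K, std::size_t C>
Mat<R, C> mul(const Mat<R, K>& a, const Mat<K, C>& b) {
    Mat<R, C> out{};
    for (std::size_t i = 0; i < R; ++i)
        for (std::size_t j = 0; j < C; ++j)
            for (std::size_t k = 0; k < K; ++k) out[i][j] += a[i][k] * b[k][j];
    return out;
}

template <std::size_t R, std::size_t C>
Mat<C, R> transpose(const Mat<R, C>& a) {
    Mat<C, R> out{};
    for (std::size_t i = 0; i < R; ++i)
        for (std::size_t j = 0; j < C; ++j) out[j][i] = a[i][j];
    return out;
}

// similarity(a, b) = a * b * a^T
template <std::size_t R, std::size_t C>
Mat<R, R> similarity(const Mat<R, C>& a, const Mat<C, C>& b) {
    return mul(mul(a, b), transpose(a));
}

struct TrackState {
    Mat<2, 1> x{};  // state vector
    Mat<2, 2> C{};  // state covariance
};

// One update of a 2-parameter state with a scalar measurement m (variance v).
TrackState update(const TrackState& s, const Mat<1, 2>& H, double m, double v) {
    const Mat<2, 1> Ht = transpose(H);
    const double r = m - mul(H, s.x)[0][0];           // residual
    const double R = mul(mul(H, s.C), Ht)[0][0] + v;  // residual covariance
    assert(R > 0.0);
    Mat<2, 1> K = mul(s.C, Ht);                       // K = C * H^T * Rinv
    for (auto& row : K) row[0] /= R;
    Mat<2, 2> M = mul(K, H);                          // M = I - K * H
    for (std::size_t i = 0; i < 2; ++i)
        for (std::size_t j = 0; j < 2; ++j) M[i][j] = (i == j ? 1.0 : 0.0) - M[i][j];
    Mat<1, 1> V{};
    V[0][0] = v;
    const Mat<2, 2> a = similarity(M, s.C);           // M * C * M^T
    const Mat<2, 2> b = similarity(K, V);             // K * V * K^T
    TrackState out;
    for (std::size_t i = 0; i < 2; ++i) {
        out.x[i][0] = s.x[i][0] + K[i][0] * r;        // x_up = x + K * r
        for (std::size_t j = 0; j < 2; ++j) out.C[i][j] = a[i][j] + b[i][j];
    }
    return out;
}
```

All loop bounds are compile-time constants, which is the property that lets an FPGA tool unroll every multiply and feed the products into adder trees.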

12 DFE Porting
[Figure: dataflow graph, from Input through Parallel Operations to Output]

13 DFE Porting
- Implemented update on the DFE
  - Kalman Filter, single precision float
- Reorder CPU code to better fill the pipeline
  - Find all measurements for all current states
  - Send to the FPGA in one transaction

    candidates = seedingstep();
    while(!candidates.empty()){
      measurements = emptyvector();
      for(candidate : candidates){
        measurements.push_back(findmeasurements(candidate));
      }
      updated = DFEUpdate(candidates, measurements);   // one transaction per iteration
      newcandidates = emptyvector();
      for(newcand : updated){
        if(newcand.finished){
          addtoresult(newcand);
        }else{
          newcandidates.push_back(newcand);
        }
      }
      newcandidates = bestn(newcandidates);
      candidates.swap(newcandidates);
    }

14 DFE Porting
- Execute the Kalman filter state update on the FPGA
- Find the next measurements on the CPU
- Stream hits and states across the Infiniband & PCIe bus
[Diagram: CPU (findmeasurements) sends Hit and State to the DFE (update); the DFE returns State to the CPU]

15 DFE Porting
- Latency is 181 clks at 250 MHz: ~725 ns
  - ~250 ns measured on CPU
- But fully pipelined: a new state & hit enters on each clock cycle
- First-to-first latency longer than CPU
- First-to-last latency hopefully quicker...
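
The difference between first-to-first and first-to-last latency can be made concrete with a small timing model. This is a sketch of the kernel alone under idealised assumptions (one input accepted per clock, no transfer or start-up cost); the function names are chosen here for illustration.

```cpp
#include <cstdint>

// Time in ns for n items through a fully pipelined kernel: the first
// result appears after latencyClks, then one further result per clock.
double pipelinedTimeNs(std::uint64_t n, std::uint64_t latencyClks, double clockMHz) {
    if (n == 0) return 0.0;
    const double periodNs = 1000.0 / clockMHz;
    return static_cast<double>(latencyClks + n - 1) * periodNs;
}

// Time in ns for n items processed strictly one after another.
double serialTimeNs(std::uint64_t n, double perItemNs) {
    return static_cast<double>(n) * perItemNs;
}
```

With the slide's numbers, `pipelinedTimeNs(1, 181, 250.0)` is 724 ns against 250 ns on the CPU, but `pipelinedTimeNs(1000, 181, 250.0)` is 4720 ns against 250000 ns for 1000 serial updates. The model ignores the data transfer, which the timing slides identify as the real limit.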

16 Interfacing to CMSSW
- Interact with the FPGA through C++ functions
  - Header generated at the FPGA compile step
  - Load a design
  - Send and receive data
- Maxeler software handles the low level
- Just link the Maxeler libraries to CMSSW

    max_file_t* maxfile = KFUpdatorDFE_init();
    max_engine_t* engine = max_load(maxfile, "*");
    KFUpdatorDFE_actions_t actions = {nmeasperstate, packedhits, packedstates, resultstates};
    KFUpdatorDFE_run(engine, &actions);

17 FPGA vs. CPU timing
- Testing with ttbar + 200 PU & Intel Xeon X GHz
- So far, no speedup!
- Long initial latency is costly: 0.5 ms
  - Under investigation
  - Algorithm latency not enough to hide it
- FPGA achieves ~4.5x rate increase
  - Limited at 3 GB/s PCIe bandwidth
  - Cannot use more KFs in parallel, either
- FPGA would become faster at n ~ 2500
[Plot: Throughput (MHz) and FIFO Latency (ms), CPU vs. FPGA]
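
A simple cost model shows why a fixed start-up latency pushes the break-even point to thousands of updates. The sketch assumes the CPU cost grows as n times a per-update time, and the FPGA cost is the fixed latency plus n times that per-update time divided by the rate gain. Using the ~250 ns CPU figure from slide 15 as the per-update time is an assumption made here, not something the slide states.

```cpp
#include <cassert>

// Number of updates n at which
//   n * cpuPerItemNs == fixedLatencyNs + n * cpuPerItemNs / rateGain.
double breakEvenN(double fixedLatencyNs, double cpuPerItemNs, double rateGain) {
    assert(rateGain > 1.0 && cpuPerItemNs > 0.0);
    // Time saved on every item once the pipeline is running.
    const double savedPerItemNs = cpuPerItemNs * (1.0 - 1.0 / rateGain);
    return fixedLatencyNs / savedPerItemNs;
}
```

Under these assumptions `breakEvenN(500000.0, 250.0, 4.5)` gives roughly 2570 updates, the same order as the n ~ 2500 quoted on the slide.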

18 FPGA vs. CPU timing
- Testing with ttbar + 200 PU
- So far, no speedup!
- Long initial latency is costly: 0.5 ms
  - Under investigation
  - Algorithm latency not enough to hide it
- FPGA achieves ~4x rate increase
  - Limited at 3 GB/s PCIe bandwidth
  - Cannot use more KFs in parallel, either
- FPGA would become faster at n ~

19 Rate improvements - hardware
- 3 GB/s is not that much for PCIe
  - PCIe v5.0 specs 64 GB/s (2019?)
  - 20x faster than 3 GB/s
- Algorithm can produce data up to 20 GB/s
  - 80 B/clk at 250 MHz
  - Consumes less
- With higher IO bandwidth, could use more parallel instances
[Diagram: now, one 250 MHz / 20 GB/s instance on the DFE behind 3 GB/s links; with 64 GB/s links, three 250 MHz / 20 GB/s instances on the DFE]
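
The bandwidth figures on this slide follow from simple arithmetic, sketched below. The function names are chosen here for illustration, and the instance count only considers the output side of the link, ignoring resource limits on the FPGA.

```cpp
// Bandwidth in GB/s of one kernel instance emitting bytesPerClk every cycle.
double instanceBandwidthGBs(double bytesPerClk, double clockMHz) {
    return bytesPerClk * clockMHz * 1e6 / 1e9;
}

// Whole kernel instances that a link of linkGBs can keep up with.
unsigned maxInstances(double linkGBs, double instanceGBs) {
    return static_cast<unsigned>(linkGBs / instanceGBs);
}
```

`instanceBandwidthGBs(80.0, 250.0)` gives the 20 GB/s quoted above. `maxInstances(64.0, 20.0)` gives 3, matching the three instances drawn for a 64 GB/s link, while `maxInstances(3.0, 20.0)` gives 0: a 3 GB/s link cannot keep even one instance busy.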

20 Rate improvements - algorithm
- All hits known at event begin
  - Send to DFE memory once
  - Stream hit pointers instead: smaller object
- Execute 'bestn' on the DFE
  - Fewer states to return
- Predict 2-4x rate improvement
[Diagram: CPU (findmeasurements) streams Hit* and State to the DFE; the DFE fetches Hit from memory, runs update then bestn, and returns State]

21 Rate improvements - algorithm
- Ultimate performance by 'closing the loop' in the FPGA
  - Remove PCIe bottleneck completely
  - Pay latency cost only O(1)
  - Write all hits and seeds to the DFE at event begin
- Finding the next measurements is not trivial!
  - Highly data dependent processing
  - Huge internal memory bandwidth must be utilised
[Diagram: on the DFE, a queue of Hit/State pairs feeds update, then bestn, then findmeasurements, which reads hits from memory and refills the queue]

22 Number representations
- FPGA capable of custom data types
  - Non-IEEE floating point
  - Fixed-point with any size integer/fractional part
- Floating point is expensive
  - Resources
  - Latency (= more resources, too)
  - Routing (exponent normalising)
- 7b mantissa, 17b exponent more suitable
- Intel/Altera Stratix 10 promising for floating point
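
To illustrate what a fixed-point type with a chosen integer/fractional split does to a value, here is a minimal CPU-side emulation. It is a sketch only: `toFixed` and `fromFixed` are names chosen here, rounding to nearest and saturation are assumed behaviours, and the bit widths in the usage note are arbitrary examples rather than values from the talk.

```cpp
#include <cmath>
#include <cstdint>

// Quantise x to a signed fixed-point integer with fracBits fractional bits
// and totalBits bits overall, saturating at the representable range.
// Intended for totalBits up to about 50 so the bounds stay exact in double.
std::int64_t toFixed(double x, int totalBits, int fracBits) {
    const double scaled = std::round(std::ldexp(x, fracBits));  // x * 2^fracBits
    const double maxVal = std::ldexp(1.0, totalBits - 1) - 1.0;
    const double minVal = -std::ldexp(1.0, totalBits - 1);
    return static_cast<std::int64_t>(std::fmin(std::fmax(scaled, minVal), maxVal));
}

// Convert the fixed-point integer back to a real value.
double fromFixed(std::int64_t q, int fracBits) {
    return std::ldexp(static_cast<double>(q), -fracBits);
}
```

With 16 bits of which 8 are fractional, 1.5 is stored exactly as 384, 1000.0 saturates at 32767 (about 127.996), and 0.001 rounds to 0. Both failure modes are why the value range has to be known before choosing a format.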

23 Numerical Profiling
- What range is really used?
- Histogram the exponent (base 2) of variables in the code
- Doesn't tell what precision is needed: must be careful with numerical stability
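
A base-2 exponent histogram of this kind can be collected with the standard library. The sketch below assumes the values of one variable have been logged into a vector; `exponentHistogram` is a name chosen here for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Count how often each base-2 exponent occurs among the non-zero,
// finite values. std::ilogb(v) returns floor(log2(|v|)).
std::map<int, std::size_t> exponentHistogram(const std::vector<float>& values) {
    std::map<int, std::size_t> hist;
    for (float v : values) {
        if (v == 0.0f || !std::isfinite(v)) continue;  // no meaningful exponent
        ++hist[std::ilogb(v)];
    }
    return hist;
}
```

The smallest and largest keys of the returned map bound the exponent range actually used, which suggests how many integer bits or exponent bits a custom type needs. As the slide notes, it says nothing about the required precision.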

24 Conclusions
- HLT tracking scales exponentially with pileup: poses a problem for HL-LHC
- Explored porting tracking to FPGAs
  - Kalman Updator implemented
  - Limitation is IO between CPU and FPGA: latency & bandwidth
- Presented steps to further optimise for the architecture
  - Reducing size of data transferred
  - Reducing number of CPU-FPGA transactions
  - Number representation tuning

25 Existing data rate reduction
- Each state will be filtered with multiple hits
- Send a state only once, with a stream for 'n reuses'
[Diagram: before, every hit (Hit0_0 ... Hit0_n, Hit1_0 ... Hit1_n) is streamed next to a repeated copy of its state (State0, State1); after, each state (State0 ... State3) is streamed once with a reuse count (n0 ... n3)]
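
The 'send a state only once' layout can be sketched as a packing step on the CPU. The types and field sizes below are hypothetical stand-ins, and the three output arrays are only assumed to play the role of the packedstates, nmeasperstate and packedhits arguments seen on slide 16; the actual layout is not given in the talk.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical payloads; the real state and hit layouts are not shown.
struct TrackState { float params[5]; };
struct Hit { float pos[2]; };

struct Packed {
    std::vector<TrackState> states;     // each state appears exactly once
    std::vector<std::size_t> nReuses;   // number of hits to apply to each state
    std::vector<Hit> hits;              // all hits, grouped in state order
};

// Pack states and their compatible hits for streaming in one transaction.
Packed pack(const std::vector<TrackState>& states,
            const std::vector<std::vector<Hit>>& hitsPerState) {
    Packed p;
    for (std::size_t i = 0; i < states.size() && i < hitsPerState.size(); ++i) {
        if (hitsPerState[i].empty()) continue;  // nothing to update for this state
        p.states.push_back(states[i]);
        p.nReuses.push_back(hitsPerState[i].size());
        p.hits.insert(p.hits.end(), hitsPerState[i].begin(), hitsPerState[i].end());
    }
    return p;
}
```

For two states with three and two compatible hits, the packed form carries two states instead of five repeated copies, while the hit stream is unchanged.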
