Efficient Event Processing through Reconfigurable Hardware for Algorithmic Trading. University of Toronto

1 Efficient Event Processing through Reconfigurable Hardware for Algorithmic Trading
Martin Labrecque, Harsh Singh, Warren Shum, Hans-Arno Jacobsen
University of Toronto

2 Algorithmic Trading

3 Examples of Financial Strategies & Market Events
Market Feed (event)
    [stock = ABX, TSE ask = 40.04, NYSE ask = 40.05]
Investment Strategies (subscription)
1 Classical arbitrage strategy
    [stock = ABX, TSE ask NYSE ask, ACTION: BUY & SELL]
2 Classical short-sell strategy
    [stock = ABX, TSE ask 40.04, ACTION: SELL]
    [stock = ABX, TSE ask 38.04, ACTION: BUY]
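The event and subscriptions on this slide can be sketched in software as attribute/value records matched by predicates. This is only a minimal illustration, not the authors' implementation. The comparison operators are not shown in the slide text, so the ones used here (`!=`, `>=`, `<=`) are assumptions, as are the names `Strategy` and `match`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Strategy:
    """A subscription: a predicate over event attributes plus an action."""
    name: str
    predicate: Callable[[Event], bool]
    action: str


def match(event: Event, strategies: List[Strategy]) -> List[str]:
    """Return the actions of every strategy whose predicate holds for the event."""
    return [s.action for s in strategies if s.predicate(event)]


# ASSUMPTION: the operators below are illustrative; the slide does not show them.
strategies = [
    Strategy("arbitrage",
             lambda e: e["stock"] == "ABX" and e["TSE ask"] != e["NYSE ask"],
             "BUY & SELL"),
    Strategy("short-sell (open)",
             lambda e: e["stock"] == "ABX" and e["TSE ask"] >= 40.04,
             "SELL"),
    Strategy("short-sell (cover)",
             lambda e: e["stock"] == "ABX" and e["TSE ask"] <= 38.04,
             "BUY"),
]

# The market feed event from the slide.
event = {"stock": "ABX", "TSE ask": 40.04, "NYSE ask": 40.05}
actions = match(event, strategies)
```

Under these assumed operators, the event triggers the arbitrage strategy and the SELL leg of the short-sell strategy, but not the BUY leg.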

4 Algorithmic Trading Vision
Algorithmic Trading: Key Observations
1 Every 1-millisecond reduction in response time is estimated to generate the staggering amount of over 100 million a year
2 Millions of market events (and increasing) are expected per second
Our Solution
We propose a novel FPGA-based event processing platform to significantly speed up algorithmic trading computations, namely, market event parsing and market event matching against strategies

5 Why FPGAs
1 Hardware reconfigurability: the ability to be reconfigured on demand into a highly parallel custom hardware circuit
2 Hardware parallelism: eliminating inter-processor signaling and message-passing overhead at the program and OS level
3 High-throughput packet processing: using multiple high-bandwidth (gigabit) I/O pins to eliminate the OS-layer latency overhead in moving data between input and output ports

6 Propagation Data Structure Overview (SIGMOD 01)
[Figure: hash tables pointing to clusters of strategies; each strategy S i is placed in a cluster by Hash(AP(S i))]
1 Strategies are distributed in disjoint clusters to enable highly parallelizable event matching through custom hardware units
2 In each cluster, strategies are stored as contiguous blocks of memory to enable fast sequential access and improve memory locality
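The clustering idea can be sketched in software as a hash table from a key to a contiguous list of strategies. This is a rough sketch under stated assumptions, not the Propagation structure itself: AP(S i) is assumed to be an equality predicate of the form (attribute, value), each strategy is assumed to live in exactly one cluster, and `ClusterIndex`, `add`, and `match` are hypothetical names.

```python
from collections import defaultdict


class ClusterIndex:
    """Strategies grouped into disjoint clusters keyed by an assumed access predicate."""

    def __init__(self):
        # Key: (attribute, value) equality predicate (ASSUMPTION).
        # Value: a list, i.e. contiguous storage that is scanned sequentially.
        self.clusters = defaultdict(list)

    def add(self, access_predicate, strategy_id, residual):
        """Store a strategy in the single cluster selected by its access predicate."""
        self.clusters[access_predicate].append((strategy_id, residual))

    def match(self, event):
        """Scan only the clusters whose access predicate the event satisfies."""
        matched = []
        for pair in event.items():
            for strategy_id, residual in self.clusters.get(pair, ()):
                if residual(event):  # check the remaining predicates
                    matched.append(strategy_id)
        return matched


index = ClusterIndex()
index.add(("stock", "ABX"), "S1", lambda e: e["ask"] < 40.05)
index.add(("stock", "ABX"), "S2", lambda e: e["ask"] > 50.00)
index.add(("stock", "IBM"), "S5", lambda e: True)

result = index.match({"stock": "ABX", "ask": 40.04})
```

Because clusters are disjoint, separate clusters could in principle be handed to separate matching units without coordination, which is the property the slide highlights.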

7 Soft-Processor Approach
1 Simplest solution, with virtually no deployment effort
2 The identical C program is compiled to execute on an FPGA soft-processor

8 Hybrid Approach
1 Four matching units (custom processors) are run in parallel
2 Strategies are stored both in off-chip DDR2 and in on-chip BRAM
3 DDR2 memory accesses are batched to reduce hand-shaking latency
4 BRAM memory is accessed during the DDR2 hand-shaking phase
5 Scales to the order of hundreds of thousands of strategies

9 Hardware-only Approach
1 Each strategy is encoded as a matching unit (a custom processor)
2 High rate of matching due to the lack of memory accesses
3 High degree of parallelization: all strategies are executed in parallel
4 Resource-exhaustive with respect to the number of strategies
5 Scales to the order of thousands of strategies (latest FPGA chip)

10 Verilog Snippet

    case (CurrentState)
        IDLE: begin
            if (Go)
                NextState = SELECT_CLUSTER_ID;
            else
                NextState = IDLE;
        end
        SELECT_CLUSTER_ID: begin
            // Select a valid cluster index
            if (curcluster > LAST_CLUSTER)
                // Finished reading last cluster
                // When all clusters are invalid
                NextState = WAIT;
            else
                NextState = START_ADDRESS;
        end
        START_ADDRESS: begin
            // Select cluster address
            if (curcluster > LAST_CLUSTER)
                // Finished reading last cluster
                // When all clusters are invalid
                NextState = WAIT;
            else if (can_take_more_requests)
                NextState = MEM_BURST_WAIT;
            else
                // If 'can_take_more_requests' is low, wait
                NextState = START_ADDRESS;
        end
        NEXT_ADDRESS: begin
            // Check data from cluster terminator
            if (clusterendfound)
                NextState = SELECT_CLUSTER_ID;
            else if (can_take_more_requests)
                // Increment curaddr by 16 bytes
                NextState = MEM_BURST_WAIT;
            else
                // If 'can_take_more_requests' is low, wait
                NextState = NEXT_ADDRESS;
        end
        default: NextState = IDLE;
    endcase
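For readers less familiar with Verilog, the next-state logic above can be mirrored as a small software model. This is a sketch for following the transitions, not part of the original design. It covers only the four states the snippet handles; MEM_BURST_WAIT and WAIT appear only as targets in the snippet, so here they fall through to the default branch, exactly as they would in the fragment shown. The Python names are transliterations of the Verilog signals.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    SELECT_CLUSTER_ID = auto()
    START_ADDRESS = auto()
    NEXT_ADDRESS = auto()
    MEM_BURST_WAIT = auto()
    WAIT = auto()


def next_state(state, go=False, cur_cluster=0, last_cluster=0,
               can_take_more_requests=False, cluster_end_found=False):
    """Software model of the case statement in the Verilog snippet."""
    if state is State.IDLE:
        return State.SELECT_CLUSTER_ID if go else State.IDLE
    if state is State.SELECT_CLUSTER_ID:
        # Past the last cluster: nothing left to read.
        if cur_cluster > last_cluster:
            return State.WAIT
        return State.START_ADDRESS
    if state is State.START_ADDRESS:
        if cur_cluster > last_cluster:
            return State.WAIT
        # Stay put until the memory interface can accept another request.
        if can_take_more_requests:
            return State.MEM_BURST_WAIT
        return State.START_ADDRESS
    if state is State.NEXT_ADDRESS:
        if cluster_end_found:
            return State.SELECT_CLUSTER_ID
        if can_take_more_requests:
            return State.MEM_BURST_WAIT
        return State.NEXT_ADDRESS
    # Mirrors the Verilog 'default' branch.
    return State.IDLE
```

A model like this is convenient for checking the transition table before synthesis, for example `next_state(State.IDLE, go=True)` yields `State.SELECT_CLUSTER_ID`.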

11 Verilog Compilation
1 Synthesis: checks syntax and analyzes the design to ensure that it is optimized for the architecture; outputs a design Netlist file
2 Design Translation: merges the input Netlists and design constraints; outputs an NGD file describing the logical design reduced to gate primitives
3 Mapping: maps the NGD logic onto the FPGA; outputs a native circuit description (NCD) that represents the design mapped to the FPGA
4 Place & Route: takes an NCD file and places and routes the design; outputs an NCD file for bitstream generation

12 Evaluation Testbed
1 Throughput is the maximum sustainable input packet rate at which no packet is dropped, determined through a bisection search
2 Latency is the interval from the time a market event packet leaves the Event Monitor output queue to the time the action is received
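The bisection search for throughput can be sketched as follows. This is an illustration under assumptions, not the testbed's code: `drops_packets` is a hypothetical callback that runs one trial at a given input rate and reports whether any packet was dropped, and the search assumes that drops are monotonic in the rate (once a rate drops packets, every higher rate does too).

```python
def max_sustainable_rate(drops_packets, lo, hi, tolerance=1.0):
    """Bisection search for the highest input rate with no dropped packets.

    drops_packets(rate) -> bool is a hypothetical trial function (ASSUMPTION).
    lo must be a rate that is sustained; hi is an upper bound to probe.
    """
    if drops_packets(lo):
        raise ValueError("lower bound already drops packets")
    if not drops_packets(hi):
        return hi  # even the upper bound is sustained
    # Invariant: lo is sustained, hi drops packets.
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if drops_packets(mid):
            hi = mid
        else:
            lo = mid
    return lo


# Simulated system with a made-up capacity, used only to exercise the search.
SIMULATED_CAPACITY = 123_456  # events/sec, not a measured value


def simulated_trial(rate):
    return rate > SIMULATED_CAPACITY


rate = max_sustainable_rate(simulated_trial, lo=1_000, hi=1_000_000)
```

Each trial halves the remaining interval, so the number of measurement runs grows only logarithmically with the width of the initial range.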

13 Experimental Results

End-to-end System Latency (µs)
Workload    PC    Soft-Processor    Hybrid    Hardware-only
K N/A
10K , N/A
100K 2, , , N/A

System Throughput (market events/sec)
Workload    PC    Soft-Processor    Hybrid    Hardware-only
,654 14, ,142 1,024,590
1K , ,500 N/A
10K ,779 N/A
100K N/A

14 Event Sender

15 Stock Event Sender

16 Workload Replayer

17 Event Packet Analyzer

18 Lessons Learned
1 Algorithmic trading trends
    Account for 70% of all trading in equities
    Cost millions per subsecond response delay
2 Expressive predicate language
    Support classical arbitrage strategy
    Support buy-and-hold strategy
3 Reconfigurable hardware (FPGA)
    Accelerate using custom logic circuit
    Utilize hardware parallelism
4 Line-rate algorithmic trading
    Eliminate OS layer latency
    Leverage on-board packet processing
5

19 Thank You,
