Resource-efficient regular expression matching architecture for text analytics

Resource-efficient regular expression matching architecture for text analytics Kubilay Atasu IBM Research - Zurich Presented at ASAP 2014

SystemT: an algebraic approach to declarative information extraction distill structured data from unstructured and semi-structured text exploit the extracted data in your applications For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access. Richard Stallman, founder of the Free Software Foundation, countered saying Annotations Name Title Organization Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman Founder Free Soft.. (from Cohen s IE tutorial, 2003) 2

A simple SystemT information extraction rule Find the names (regex) that are at most 20 chars after a title (dict.) Founder...Bill Gates dict. match regex match start offset end offset start offset end offset at most 20 chars start offset end offset result 3

Finding leftmost regular expression matches Assume that we are searching for the regex.*(a aa aaaa) in the input string aaaa Find the regex match with the smallest start offset value at each end offset position The leftmost maches are marked using solid lines 4

Regex matching: background Consider the regex.*(a b aa aba) Can be transformed into NFA/DFA NFA DFA Traditional architectures do not support start offset reporting & leftmost matching: Reconfigurable NFAs (Sidhu FCCM 2001, Bispo FPT 2006, Yang ANCS 2008) Programmable DFAs (Smith SIGCOMM 2008, Van Lunteren MICRO 2012) 5

Previous solution: network of state machines active_reg state_reg start_offset_reg -Dimension of the network (# state machines) is statically computed for each regex -DRAWBACK: replication of the state transition logic K. Atasu, R. Polig, C.Hagleitner, F. R. Reiss: Hardware-accelerated regular expression matching for high-throughput text analytics. FPL 2013: 1-7 6

Contributions of this work 1. Extending Sidhu and Prasanna s NFA architecture to support start offset reporting 2. A graph coloring based register clustering method to minimize the register usage 3. An efficient leftmost match computation method without using offset comparisons NFA Sidhu and Prasanna s NFA Architecture 7

A straightforward extension of Sidhu & Prasanna s architecture Add a start offset register to each NFA state offset_reg[0] = value of current offset position DRAWBACK: redundant start offset registers offset_reg [0] offset_reg [1] offset_reg [3] offset_reg [4] offset_reg [2] NFA Baseline Architecture 8

Clustering offset registers NFA DFA Build a conflict graph and apply graph coloring States with the same color can share registers 9

Leftmost match computation Assume that state 0 and state 1 are active and the current input is a We have to compute offset_reg[2] = MIN(offset_reg[0], offset_reg[1]) offset_reg [0] offset_reg [1] offset_reg [3] offset_reg [4] offset_reg [2] NFA Baseline Architecture 10

Leftmost match computation without offset comparisons (1) Assume that state K has M incoming state transitions We have to compute the minimum of M offset values offset_reg [1] offset_reg [2] offset_reg [0] offset_reg [M-1] K tree-based implementation (long latency) fully parallel implementation (expensive) offset_reg [K] Better solution: each state keeps track of the states that are activated earlier -Define an N N bit matrix: earlier[i,j]=1 if state j is activated before state i 11

Leftmost match computation without offset comparisons (2) offset_reg[0] = value of the current offset pointer, earlier[0, :] = active_reg Assume that state 1 is active (i.e., earlier[0,1] = 1), and the input is a we have two transitions into state 2: from state 0 and from state 1 since earlier[0,1] = 1, we choose the start offset provided by state 1 due to transitions 0 1 and 1 2, earlier[1,2]=1 in the next cycle offset_reg [0] offset_reg [1] tr[ 0,2] ( tr[1,2] earlier [0,1]) 2 tr[ 1,2] ( tr[0,2] earlier [1,0]) offset_reg [2] 12

Experiments (text analytics regexs) Altera Stratix IV GX530KH40C2, Altera Quartus II V11 tools 32-bit start offset registers, 250 MHz target clock frequency NFA representation: Follow Automata with character classes 13

Experiments (L7 filter regexs) Altera Stratix IV GX530KH40C2, Altera Quartus II V11 tools 32-bit start offset registers, 250 MHz target clock frequency NFA representation: Follow Automata with character classes 14

Summary & future work Support for start offset reporting and leftmost matching without replicating the state transition logic without using redundant offset registers without using expensive offset comparison > threefold reduction in the logic resource usage > 1.25-fold improvement in the clock frequency < 8.6-fold overhead w.r.t. Sidhu & Prasanna s architecture while using 32-bit offset registers Our current and future work includes design of intermediate fabrics for fast compilation analysis and optimization of the energy consumption 15

Resource-efficient regular expression matching architecture for text analytics Kubilay Atasu IBM Research - Zurich Presented at ASAP 2014