FPGA Based Custom Accelerator Architecture Framework for Complex Event Processing


Kavinga Upul Bandara Ekanayaka
Department of Electronic and Telecommunication Engineering, University of Moratuwa, Sri Lanka

Ajith Pasqual
Department of Electronic and Telecommunication Engineering, University of Moratuwa, Sri Lanka

Abstract
Complex Event Processing (CEP) is an emerging field in high-performance computing in which real-time (low-latency) computing capability is expected over big data processing (high throughput). A significant number of software architectures have been developed to improve throughput while reducing latency, but maintaining both aspects at once reaches the limits of software platforms. This paper proposes a novel custom hardware accelerator architecture framework for CEP in the big data domain. The proposed design improves throughput by more than 10 times over its software counterpart while keeping latency below 100 nanoseconds. To preserve flexibility, the same Structured Query Language (SQL) type queries used in the reference software architecture are used. A query compiler based on the same query language grammar was designed to convert the queries into Hardware Description Language (HDL) modules. All modules were parameterized to improve the scalability of the design. The generated modules were synthesized with vendor tools and programmed into a Field Programmable Gate Array (FPGA) platform to implement the system. The proposed hardware architecture framework was verified using a sensor network data set from a football field, and the results were compared with the software counterpart to show the performance improvement.

Keywords: Complex Event Processing, Hardware Acceleration, FPGA, Big data.

I. INTRODUCTION

Big data applications such as stock exchanges, smart grids, wireless sensor networks, RFID networks and social networks have an essential need to process huge amounts of serial data in real time.
Complex Event Processing (CEP) is one of the most rapidly emerging fields in data processing, and it is a principal technology for processing large volumes of moving data in real time. A CEP engine identifies meaningful patterns, relationships and data abstractions among apparently unrelated events and fires an immediate response. A significant number of software architectures with different algorithms have been developed as CEP engines over the past few years, such as Aurora [1], PIPES [2], STREAM [3], Borealis [4] and S4 [5], to satisfy the requirement of high-throughput, low-latency (real-time) processing. Siddhi [6] is a recently published software architecture for CEP. It uses concepts such as pipelining and multithreading to achieve the above-mentioned main targets of a CEP, which led the authors of this paper to select it as the base architecture to follow. As a result of its architectural improvements over traditional CEP architectures, Siddhi shows a significant performance improvement over ESPER [7], one of the well-established CEP architectures. Even so, its throughput still lies in the range of a few megabits per second. All of the above-mentioned software architectures lack the ability to deliver both main aspects of a CEP system at the same time, because of software platform limitations such as CPU processing power and the CPU-memory data latency bottleneck. Software CEP platforms will therefore fail to satisfy the requirements of today's high-performance computing applications, which are expected to process data in near real time at a throughput of at least around 1 Gbps. A system with a hardware co-processor working in line with the CPU can give such software CEP systems hardware acceleration that improves performance in terms of both latency and throughput.

/14/$31.00 © 2014 IEEE
The main hardware co-processor design approaches, parallel processing and pipelining, map well onto the architectural requirements of a CEP system and can address its high-throughput and low-latency requirements. Hardware-accelerated systems built on Field Programmable Gate Arrays (FPGAs) perform very well in stream processing and pattern matching applications, as suggested in [8] and [9]. This research proposes a custom hardware acceleration architecture framework to enhance the performance of individual software CEP systems. Here, the architecture and the SQL-based query language are designed after the Siddhi software CEP platform. Moreover, the hardware design approach of this research acts as a generalized framework for CEP in hardware, and it can function as a co-processor for any such CEP software application with minor changes to the query compiler of the design, which is explained in Section IV. A SQL-based hardware approach for CEP is proposed in [10] and a C-based approach is proposed in [11]; both are very similar designs that take a market trading application as the motivating example. Both of them process the data streams directly from the network port and thereby achieve a throughput of 20 Gbps. A hardware design of a CEP with

a query compiler is proposed in [12], which inspired the authors to use the query compiler approach, and [13] proposes a pattern matching architecture in hardware, giving a detailed overview of the advantages of Nondeterministic Finite Automata (NFA) design approaches for pattern-recognizing state machines. That work led the authors to use NFA design methodologies to implement highly scalable and generalized sequence and pattern matching modules. Both of the latter papers achieved a data throughput of 1 Gbps each. All of the hardware CEP designs identified above work as standalone processors, so the gap between them and the software platform is fairly large. This paper proposes a novel approach to hardware acceleration for CEP: a hardware co-processor that works in line with the software platform, using a high-speed PCI Express (PCIe) communication link between the two architectures. Compared with earlier approaches, this design increases the flexibility of the hardware by functioning as a hardware API to the software platform, while delivering higher throughput than the software platform alone. This idea makes it possible to move parts of the whole design back and forth between the hardware and software platforms according to the trade-off between the flexibility and the efficiency of the design.

The rest of the paper is organized as follows. Section II details the modeling of the custom hardware accelerator system in terms of its basic building blocks and digital system design theory, while Section III explains the overall system architecture. Section IV describes the query compiler. An evaluation example for the system is discussed in Section V, and the results are compared with the software counterpart in Section VI. Finally, Section VII concludes the paper.

II.
MODELING OF THE CUSTOM ARCHITECTURE COMPONENTS

The main architecture of this research is built upon custom component models based on the basic query types of the reference CEP system. Any complex application in the CEP domain can be divided into five main query types: select, filter, window and aggregation, pattern recognition, and sequence recognition. These are defined in the Siddhi language specification [14]; Siddhi is also the reference software platform for this research. Hence any complex CEP architecture can be built from the basic component models, which increases the flexibility of the system. Since the whole architecture is based on component models of the basic query types, the flexibility of the hardware-accelerated CEP architecture is much closer to that of its software counterpart. This design approach is the novelty of this research, as the hardware works as a co-processor in line with the existing software-based system architecture. Therefore any system can be designed so that part of it is in hardware and the other part is on the software platform, with the partitioning depending on the flexibility and efficiency concerns of the particular application. One basic component model may be part of the architecture of another component model. Every input event is collected in an input buffer register whose length equals the event data length. The length of the input buffer register is determined by the input stream event definition provided by the define query. The output event is likewise collected in an output buffer register whose length is determined by the output data fields named in the select query.

A. Selector Modeling

The selector module is the most basic design component of this architecture, because all other basic component models use it as a sub-part of their systems. This component model is built directly on the select query type [14].
As depicted in Fig. 1, the selector module is constructed as a simple hardwired connection of the selected output data fields between the input buffer register and the output buffer register. The function of this module is to select particular data fields from among all input data fields in the input buffer register and copy them into the output buffer register.

Fig. 1: Selector module

In addition to the basic hardwired connections from the input register to the output register, an advanced selector module has hardwired connections to the output register from the aggregator outputs and from the stored registers of the pattern and sequence filters. In every case a single clock cycle is enough to transfer data between the input and output registers through the hardwired architecture, and the module can operate at a high clock frequency because it contains no complex combinational circuitry. This basic architecture illustrates how a simple hardware parallelism technique can increase throughput while keeping latency very low.

B. Filter Modeling

The next most basic module type is the filter module, which is based on the filter query type in the reference query language [14]. Fig. 2 shows the hardware implementation

of a filter module. Here, one or more data fields are filtered according to given conditions. The input register is partitioned into registers of separate data types. The data fields to be filtered are sent through a set of filters that are designed identically and differ only in the type of comparison operator. Six operators are supported in the filter design: less than (<), less than or equal (<=), greater than (>), greater than or equal (>=), equal (==) and not equal (!=). Each of these filter modules consists of two inputs, namely the value of the data field being filtered from the input register and the constant value to compare it with, together with the combinational logic of the comparison operator.

Fig. 2: Filter module

An advanced implementation of such a filter module takes as its comparison input the stored filter output of an earlier event, instead of the constant used in the normal filter module. Filters of this type are used in the pattern and sequence matching models. The single-bit outputs of the filters are sent through a set of AND or OR gates, depending on the combination defined in the filter query. The whole process forms a combinational circuit that includes only a few levels of gates. Since all the filters function independently, they operate in parallel in the circuit, which demonstrates the parallelism benefit obtained from hardware acceleration. The filter module can therefore also operate at a high frequency and complete within just one clock cycle. The selector module is included in the main filter module as a sub-module that functions in parallel with the internal filter modules. The selection takes effect only if the final output of the filter combination is true, and this total filter output is treated as the event detection signal. Fig. 3 shows the latency improvement obtained by the hardware design compared with the traditional sequential approach.
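The filter behaviour described above can be summarized in a short behavioural model. The following Python sketch is only an illustration, not the authors' HDL: the function name `filter_select`, the tuple layout of a condition and the field names `sid`, `ts` and `v` are assumptions made for this example. It evaluates every comparator independently (standing in for the parallel comparators), combines the single-bit results with AND or OR, and copies the selected fields to the output only when the event detection signal is true.

```python
import operator

# The six comparison operators supported by the filter design.
OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}

def filter_select(event, conditions, combine="and", selected=()):
    """Behavioural model of a filter module with an embedded selector.

    event      -- dict of field name -> value (models the input buffer register)
    conditions -- iterable of (field, operator symbol, constant); hypothetical layout
    combine    -- "and" or "or", the gate combining the single-bit filter outputs
    selected   -- field names wired through to the output buffer register
    """
    # Each comparator depends only on its own field, so in hardware all of
    # them can be evaluated in parallel within one clock cycle.
    bits = [OPS[op](event[field], const) for field, op, const in conditions]
    detected = all(bits) if combine == "and" else any(bits)
    if not detected:
        return None  # no event detection signal, nothing is emitted
    # Selector: copy only the chosen fields into the output register.
    return {name: event[name] for name in selected}

event = {"sid": 4, "ts": 100, "v": 25.0}
hit = filter_select(event, [("v", ">", 20.0), ("sid", "==", 4)], "and", ("sid", "ts"))
miss = filter_select(event, [("v", ">", 30.0), ("sid", "==", 4)], "and", ("sid", "ts"))
```

In this model `hit` holds the two selected fields and `miss` is `None`. A software loop evaluates the comparators one after another, which is exactly the sequential behaviour that the parallel hardware filters avoid.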
If the design followed the sequential approach, it would have to wait several clock cycles for the data to pass through all the filters and the combinational logic before producing the result. In the parallel approach all of this is done within one clock cycle, even though the latency of that clock cycle increases slightly.

Fig. 3: Latency advancement of parallel filter modeling. (a) Sequential approach timing. (b) Parallel approach timing.

C. Window & Aggregator Modeling

The window and aggregation module is built on the Window + Aggregator query type [14]. Fig. 4 shows an architecture model of the window and aggregation module. Both of the modules modeled above are used here as sub-modules that work in parallel and in a pipeline with the window and aggregation modules. First, basic filters are applied to the input buffer register on each event to filter and select the events to be inserted into the window. If all the filter conditions are satisfied, the input event is inserted into a First In First Out (FIFO) window memory whose words are the size of the input event length. The window size is determined by one of two parameters: a time value or a length value. An event in a time window expires after a defined time interval, measured from the timestamp of its arrival at the window. The size of a length window is determined by a defined length, that is, the number of events inside the window. In both cases an aggregation function is applied continuously to a defined data field of the events stored in the window. A separate aggregator module is implemented to handle this functionality, and five aggregation functions are supported in total: sum, average, count, min and max.

Fig. 4: Window and Aggregation module

The aggregator issues outputs at every expired event or input event, as defined in the reference query. Finally another filter module is applied on the aggregator output to

filter the final output according to a given condition. All these modules are arranged in a pipelined architecture, and the usual select module functions in parallel with this pipeline to create the output event. In this module the aggregator function introduces considerable latency compared with the basic modules above. This module therefore runs at a comparatively low frequency, which is still more than enough to keep the throughput far above that of the software counterpart. The latency also remains very low, since every stage of the pipeline takes only one clock cycle and there are only a few stages.

D. Pattern Recognizer Modeling

One of the important modules in this architecture is the pattern recognizer module. This module is designed after the pattern query [14]. The pattern recognizer module is shown in Fig. 6. The aim of this module is to check for a pattern of two or more simple conditional events. The conditions for each event in the pattern are checked by a parallel set of filter modules, as discussed in subsection II-B. Recognition of the pattern, based on the output signals of the condition-checking filters, is done by a Finite State Machine (FSM).

Fig. 5: Pattern recognizer state machine

The FSM for the pattern module scales easily to a considerable number of events, since all the state changes are similar, as depicted in Fig. 5. The states change one by one according to the filter condition outputs at each event, and an event detection is fired when the last event matches. Here, too, the select module works in parallel, and it creates the output using the input buffer register as well as some of the events stored when the filter conditions matched. The filter modules take only one clock cycle to output the condition, and the FSM takes another clock cycle to change the state.
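The state progression of the pattern recognizer can be illustrated with a small behavioural model. The Python sketch below is a simplification under stated assumptions, not the paper's HDL: the class name `PatternRecognizer`, the use of plain predicate functions for the filter outputs and the field name `v` are hypothetical, and the model tracks a single partial match at a time rather than the NFA-based parallel implementation used in the paper. Events that do not satisfy the current condition are assumed to leave the state unchanged.

```python
class PatternRecognizer:
    """Behavioural model of the pattern FSM: one state per event condition."""

    def __init__(self, conditions):
        # conditions: list of predicates, one per event in the pattern.
        # Each predicate stands in for the output bit of one filter module.
        self.conditions = list(conditions)
        self.state = 0    # index of the condition currently awaited
        self.stored = []  # events captured at each matching step

    def step(self, event):
        """Feed one event; return the list of matched events on detection."""
        if self.conditions[self.state](event):
            self.stored.append(event)  # stored registers used by the selector
            self.state += 1
            if self.state == len(self.conditions):
                matched = self.stored
                # The last condition matched: fire detection and reset.
                self.state, self.stored = 0, []
                return matched
        return None  # no detection on this event

# Hypothetical two-event pattern: a reading above 10 followed later by one below 5.
fsm = PatternRecognizer([lambda e: e["v"] > 10, lambda e: e["v"] < 5])
results = [fsm.step({"v": v}) for v in (12, 7, 3)]
```

With the three events shown, only the last call to `step` returns a match, containing the first and third events. Because every state transition has the same shape, adding more events to the pattern only lengthens the list of conditions, which mirrors the scalability argument made for the FSM above.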
Therefore the pattern module can also operate at a high frequency with very low latency for the whole process.

E. Sequence Recognizer Modeling

The last basic type of module is the sequence recognizer, which is designed after the sequence query type [14]. The abstract architecture is almost the same as that of the pattern recognizer module, but the sequence-recognizing FSM has a completely different architecture. The events of a sequence must match consecutively, and each event is built upon a regular expression chosen from exactly one match, one or more matches (+), zero or more matches (*), zero or one match (?), and "or" between two events. Fig. 7 shows the generalized state transitions for each regular expression case. The main FSM module is a combination of two or more of these, in the order defined in the query. The design of the sequence-recognizing FSM shows the scalability and generalization that this research provides even for a considerably complex design.

Fig. 6: Pattern recognizer module

Fig. 7: Sequence recognizer FSM states

Both of the above FSMs are built according to the NFA architecture, since it keeps state explosion to a minimum compared with a DFA approach, as clearly explained in [13]. The NFA architecture suits the hardware platform well because it supports parallel implementation methodologies, which enhance throughput. All of the above modules are highly parameterized to increase the generality of the architecture. Therefore any type of basic query can be built from these modules just by changing the parameters in the top module of the HDL design.

III. OVERALL SYSTEM ARCHITECTURE

Any complex CEP system consists of a combination of one or more of the basic modules discussed above. The overall architecture of the custom accelerated system is shown in Fig. 8. The main architecture consists of a software CEP, a hardware CEP and a PCIe communication link between them.
The PCIe communication link is built using a PCIe kernel driver on the software platform side and a PCIe core on the hardware (FPGA) side. This research chose PCIe as the communication link because, in the current context, it is the only one that can sustain a very high data transfer rate in the gigabits per second range. The software CEP system (which already exists [6]) runs on a CPU-based processor architecture, while the hardware CEP

system (the contribution of this research) runs on an FPGA platform.

Fig. 8: System architecture

A complex CEP system can be divided into separate CEP engines cascaded together to form the whole system, which increases flexibility while reducing complexity; this is explained further in Section V with an evaluation example. This design approach allows the total CEP system to be partitioned into software components and hardware components, which enhances the flexibility and scalability of the design. The software CEP partition sees the hardware co-processor partition as an API connected through the high-speed PCIe link. The whole system is a PC-master design in which the software system writes data to the hardware system to create the input data stream of the hardware CEP and reads back the output stream returned from the hardware. The data writes and reads are done through a kernel driver module specially designed to handle the PCIe protocol, which communicates with a PCIe IP core implemented on the hardware (FPGA) side. A receiver engine and a transmitter engine handle the data transmission between the PCIe core and the hardware CEP application. The reconfigurability of the custom accelerated hardware system is achieved through a fully parameterized module design at the HDL level. Therefore, in any CEP application only the top module (the CEP engine) has to be designed, instantiating the other basic modules with the required parameters. The architecture design approach of this research can serve as inspiration for high-performance big data processing hardware architectures in the cloud computing domain.

IV. QUERIES TO CIRCUIT PROCESS

Fig. 9: Queries to circuit process

The CEP system on the software platform is built with a software query compiler, as stated in [6], which is part of the software CEP application.
This research proposes a similar approach, with a software query compiler that compiles the queries, identifies the basic building blocks and extracts the parameters inside them. Fig. 9 shows the design process. The query compiler application generates the top module of the hardware CEP engine according to a predefined mapping model between the basic queries, their equivalent hardware modules and the connections between them. This query compiler design approach is inspired by a similar design in [12]. The query compiler of this research needs to generate only the top module of the design, since all other basic modules are fully parameterized, which reflects the generality and flexibility of the research outcome. The end user of the application sees the same query abstraction level as in the software system.

V. EVALUATING EXAMPLE

The authors used a real-world application example both to show the flexibility of the design and to evaluate its actual performance. The evaluating example used in this paper is based on the DEBS Grand Challenge, an application problem on CEP described at the 7th ACM International Conference on Distributed Event Based Systems.

A. Query 1 - Running data analysis

The goal of this query is to analyze the running performance of each of the players currently participating in the game. This use case is implemented with CEP event sequences, using a sequence-recognizing query to detect whenever a player crosses a threshold of event speeds. Here, four filters are used; the first and last events share the same set of filters, and two filters are combined in parallel in each case. The whole scenario can be expressed with the sequence matching basic query type, in which the filter and select query types act as sub-queries.

B.
Query 2 - Ball possession

This query needs to calculate the time of ball possession by each player. The design of the query has been divided into three parts in order to reduce its complexity. The three queries are designed separately using basic modules and then cascaded together to form the whole system. As depicted in Fig. 10, one CEP engine detects and outputs "Hit at ball" events while another detects "Ball leave the ground" events, and the two can function in parallel since they act independently. Both outputs are sent to another CEP engine

simultaneously, called Ball Possession, which detects ball possession by a particular player.

Fig. 10: Ball Possession detection CEP architecture

In the Hit at ball query, the hit is identified by the distance between the ball and the player. The getdistance function calculates that distance, and its output is compared with a constant value. This function can also be built in hardware as a user-defined function. The implementation of this query shows how a highly complex system can be built from basic modules cascaded in a pipeline. It demonstrates the flexibility and scalability of the system to a great extent.

VI. RESULTS

The evaluating example discussed above was implemented on a Xilinx Virtex6 ML605 [15] for the hardware CEP part, while the software application that handles the data stream and the driver were implemented on a PC running Linux. The hardware modules were developed using Xilinx ISE. The communication link between the PC and the FPGA was a PCIe Gen 1 x8 link. The sensor data were stored in a file on the PC and were sent to the hardware CEP as a data stream by the PC software application. The total data size was about 49.5 million events of 56 bytes each. The design for the Running analysis query processed data at a rate of approximately 1 million events per second. The processing rate required by the challenge is 15,000 events per second, and the software counterpart, Siddhi, achieved approximately 100,000 events per second. Even with the basic PCIe communication link (8 lanes, Gen 1), the hardware design is 10 times faster. The hardware implementation of the Ball possession query achieved a throughput of about 0.74 million events per second, which is more than 7 times faster than its software counterpart, Siddhi, and far above the data rate required by the challenge.
In both cases the data rate is about 1 Gbps and the latency is only a few nanoseconds.

VII. CONCLUSION

This research has designed and implemented a hardware-based CEP system with a custom accelerator design approach that is highly scalable and flexible compared with existing approaches. The hardware CEP system is able to work as a hardware co-processor in line with an existing software CEP system and shares the same queries to implement the CEP architecture. The design idea has been demonstrated with a practical example, which shows high-throughput, low-latency performance far ahead of the software system, together with scalability and flexibility very close to those of its software counterpart. Dynamic reconfigurability is not yet supported by the current design. The design also uses fixed-size string variables because of the limitations of hardware design in HDL. As future work, the authors hope to add dynamic reconfigurability to the system, as well as a method for using variable-length string variables. In addition, the system performance can be improved further by using a PCIe communication link with higher performance than the one used here. This design approach broadens the scope for developing big data processing systems in cloud computing architectures with hardware acceleration support, while keeping the system flexibility very close to that of existing software platforms.

REFERENCES

[1] Daniel J. Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal 12, 2 (August 2003).
[2] M. Cammert, C. Heinz, et al. Pipes: A multi-threaded publish-subscribe architecture for continuous queries over streaming data sources. Technical report, Citeseer, 2003.
[3] D. Arvind, A. Arasu, B. Babcock, S. Babu, M. Datar, K.
Ito, I. Nishizawa, J. Rosenstein, and J. Widom. STREAM: the Stanford stream data manager. IEEE Data Engineering Bulletin, 2003.
[4] D. Abadi, Y. Ahmad, et al. The design of the Borealis stream processing engine. Second Biennial Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, CA.
[5] Neumeyer, L.; Robbins, B.; Nair, A.; Kesari, A., S4: Distributed Stream Computing Platform, Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, pp. 170-177, Dec.
[6] Suhothayan, Sriskandarajah and Gajasinghe, Kasun and Narangoda, Isuru Loku and Chaturanga, Subash and Perera, Srinath and Nanayakkara, Vishaka. Siddhi: a second look at complex event processing architectures. SC-GCE, ACM, pages 43-50, 2011.
[7] EsperTech - event stream intelligence. [Online]
[8] Sidhu, R.; Prasanna, V.K., Fast Regular Expression Matching Using FPGAs, Field-Programmable Custom Computing Machines, FCCM '01. The 9th Annual IEEE Symposium on, pp. 227-238, March April.
[9] Woods, L.; Teubner, J.; Alonso, G., Real-time pattern matching with FPGAs, Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pp. 1292-1295, April.
[10] Takenaka, T.; Takagi, M.; Inoue, H., A scalable complex event processing framework for combination of SQL-based continuous queries and C/C++ functions, Field Programmable Logic and Applications (FPL), nd International Conference on, pp. 237-242, Aug.
[11] Inoue, H.; Takenaka, T.; Motomura, M., 20Gbps C-Based Complex Event Processing, Field Programmable Logic and Applications (FPL), 2011 International Conference on, pp. 97-102, 5-7 Sept.
[12] Rene Mueller, Jens Teubner, and Gustavo Alonso. Streams on wires: a query compiler for FPGAs. Proc. VLDB Endow. 2, 1 (August 2009).
[13] Louis Woods, Jens Teubner, and Gustavo Alonso. Complex event detection at wire speed with FPGAs. Proc. VLDB Endow. 3, 1-2 (September 2010).
[14] Siddhi Language Specification. [Online]
[15] Xilinx Virtex6 ML605. [Online]
