UDP Packet Monitoring with Stanford Data Stream Manager

Nadeem Akhtar #1, Faridul Haque Siddiqui #2
# Department of Computer Engineering, Aligarh Muslim University, Aligarh, India
1 nadeemalakhtar@gmail.com
2 faridhaq@zhcet.ac.in

Abstract — The purpose of this paper is to monitor a real-time stream of UDP packets with the Data Stream Management System (DSMS) tool Stanford STREAM, using the Continuous Query Language (CQL). The huge amount of data that has to be managed and analyzed, together with the fact that many different analysis tasks are performed over a small set of different network trace formats, motivates us to study whether Data Stream Management Systems (DSMSs) might be useful for developing network traffic analysis tools. We will see how STREAM displays excellent robustness in handling high-speed UDP streams. The system, however, also suffers from several setbacks, such as tuple redundancy, frequent system crashes and a small supported query set.

Keywords — Stanford STREAM, CQL, DSMS, UDP Packet Analysis

I. INTRODUCTION
Data Stream Management Systems are specifically designed for handling continuous data streams. They can handle multiple, time-varying, unpredictable and unbounded streams which cannot be handled using traditional tools. In this paper, we have used a Data Stream Management System, Stanford STREAM, to monitor the traffic of UDP packets in a computer network.

Data stream management systems have been developed to monitor continuously arriving data. They differ from traditional Database Management Systems in that they work on transient tables rather than persistent tables. Database Management Systems (DBMSs) may be used for analyzing continuous stream data. Traditional DBMSs, however, suffer from some serious bottlenecks which limit their use in real-time, complex applications requiring continuous monitoring of ever-changing data streams. A detailed discussion is provided in [1]. Since data stream management has been a hot topic over the last few years, several systems have been developed. Some important ones are Stanford STREAM [2], Aurora [3], TelegraphCQ [4] and Niagara [5].

STREAM supports declarative continuous queries over two types of inputs: streams and relations. A continuous query is simply a long-running query which produces output in a continuous fashion as the input arrives. The queries are expressed in a language called CQL. The two input types, streams and relations, are defined using some ordered time domain, which may or may not be related to wall-clock time.

Definition 2.1 (Stream): A stream is a sequence of timestamped tuples. There could be more than one tuple with the same timestamp. The tuples of an input stream are required to arrive at the system in the order of increasing timestamps. A stream has an associated schema consisting of a set of named attributes, and all tuples of the stream conform to the schema.

Definition 2.2 (Relation): A relation is a time-varying bag of tuples. Here "time" refers to an instant in the time domain. Input relations are presented to the system as a sequence of timestamped updates which capture how the relation changes over time. An update is either a tuple insertion or a tuple deletion. The updates are required to arrive at the system in the order of increasing timestamps. Like streams, relations have a fixed schema to which all tuples conform.

The output of a CQL query is a stream or a relation depending on the query.
The output is produced in a continuous fashion as described below. If the output is a stream, the tuples of the stream are produced in the order of increasing timestamps. The tuples with timestamp τ are produced once all the input stream tuples and relation updates with timestamps ≤ τ have arrived. If the output is a relation, the relation is represented as a sequence of timestamped updates (just like the input relations). The updates are produced in the order of increasing timestamps, and updates with timestamp τ are produced once all input stream tuples and relation updates with timestamps ≤ τ have arrived.

The UDP header consists of four 16-bit fields: Source Port, Destination Port, Length and Checksum, followed by the data octets.

Fig. 1: UDP packet header

The purpose of the paper is to monitor the traffic data based on this header information.
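To make these definitions concrete, the following minimal CQL sketch is our own illustration (not taken from the paper); it assumes the Traffic stream schema introduced later in Section IV, i.e. Traffic(ipsrc, srcport, ipdest, destport, length, checksum), and an arbitrary 512-byte threshold. The first query produces a relation, the second a stream:

    Q1 (relation output): number of UDP packets seen in the last 30 seconds, updated as time advances
        Select Count(*) From Traffic [Range 30 Seconds]

    Q2 (stream output): every arriving packet longer than 512 bytes, emitted as it arrives
        Select Istream(*) From Traffic Where length > 512

Q1 yields a relation because the sliding window turns the stream into a time-varying bag of tuples over which the aggregate is evaluated at every instant; Q2 applies Istream to turn the filtered (default unbounded-window) relation back into a stream of insertions.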

II. CONTINUOUS QUERY LANGUAGE AND ITS RESTRICTIONS
STREAM currently does not support all the features of CQL [6]. In this section, we mention the important features omitted in the current implementation of STREAM that we found based on our experiences with the system. The important omissions are:

Sub-queries are not allowed in the Where clause. For example, the following query is not supported:
    Select * From S Where S.A in (Select R.A From R)

The Having clause is not supported, but the Group By clause is supported. For example, the following query is not supported:
    Select A, SUM(B) Group By A Having MAX(B) > 50

Expressions in the Project clause involving aggregations are not supported. For example, the query
    Select A, (MAX(B) + MIN(B))/2 Group By A
is not supported. However, non-aggregated attributes can participate in arbitrary arithmetic expressions in the Project clause and the Where clause. For example, the following query is supported:
    Select (A + B)/2 Where (A - B) * (A - B) > 25

Attributes can have one of four types: Integer, Float, Char(n), and Byte. Variable-length strings (Varchar(n)) are not supported.

Windows with the slide parameter are not supported.

The binary operations Union and Except are supported, but Intersect is not.
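To make the supported subset concrete, the following is a small sketch of our own (not a query from the paper) over the Traffic stream defined in Section IV; it stays within the supported features listed above, using a time-based Range window, a Group By, and a single un-nested aggregate:

    Select ipsrc, SUM(length)
    From Traffic [Range 60 Seconds]
    Group By ipsrc

Adding a Having clause, a slide parameter on the window, or a projected expression over aggregates such as SUM(length)/COUNT(*) would each move this query into one of the unsupported cases above.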
III. ARCHITECTURE [7]
This section briefly describes the architecture of the STREAM DSMS prototype. The architecture is made up of two broad components: 1. the planning subsystem, which stores metadata and generates query plans, and 2. the execution engine, which executes the continuous queries.

A. Planning Subsystem
Figure 2 shows the main components of the planning subsystem. The components shown with double-bordered rectangles are stateful: they contain the system metadata. The other components are stateless, functional units which are used to transform a query into its plan. The solid arrows indicate the path of a query along these components.

Fig. 2: The planning component [7]

1) Parser: Transforms the query string into a parse tree representation of the query. (The parser is also used to parse the schema of a registered stream or relation.)

2) Semantic Interpreter: Transforms the parse tree into an internal representation of the query. The representation is still block-based (declarative) and not an operator tree. As part of this transformation, the semantic interpreter resolves attribute references, implements CQL defaults (e.g., adding an Unbounded window), performs other miscellaneous syntactic transformations like expanding the "*" in Select *, and converts external string-based identifiers for relations, streams, and attributes to internal integer-based ones. The mapping from string identifiers to integer identifiers is maintained by the Table Manager.

3) Logical Plan Generator: Transforms the internal representation of a query into a logical plan for the query. The logical plan is constructed from logical operators. The logical operators closely resemble the relational algebra operators (e.g., select, project, join), but some are CQL-specific (e.g., window operators and relation-to-stream operators). The logical operators are not necessarily related to the actual operators present in the execution subsystem. The logical plan generator also applies various transformations that (usually) improve performance: pushing selections below cross-products (joins), eliminating redundant Istream operators (an Istream over a stream is redundant), eliminating redundant project operators (e.g., a project operator in a Select * query is usually redundant), and applying Rstream-Now window based transformations.

4) Physical Plan Generator: Transforms a logical plan for a query into a physical plan. The operators in a physical plan are exactly those that are available in the execution subsystem (unlike those in the logical plan). The physical plan generator
is actually part of the Plan Manager (although this is not suggested by Figure 2), and the generated physical plan for a query is linked to the physical plans for previously registered queries. In particular, the physical plans for views that are referenced by the query directly feed into the physical plan for the query.

5) Plan Manager: The Plan Manager stores the combined "mega" physical plan corresponding to all the registered queries. The Plan Manager also contains the routines that flesh out a basic physical plan containing operators with all the subsidiary execution structures like synopses, stores, storage allocators, indexes, and queues, and that instantiate the physical plan before starting execution.

6) Table Manager: The Table Manager stores the names and schemas of all the registered streams and relations. The streams and relations can be either input (base) streams and relations or intermediate streams and relations produced by named queries. The Table Manager also assigns integer identifiers to streams and relations, which are used in the rest of the planning subsystem.

7) Query Manager: The Query Manager stores the text of all the registered queries.

B. Execution Engine
The main purpose of the execution engine is the execution of continuous queries over the streams. The work done by it can be divided into two groups, as shown in Table I. For further details on the execution engine, the reader is referred to the STREAM user manual [7].

TABLE I: COMPONENTS OF THE EXECUTION ENGINE [7]
Data (low-level): Tuple, Element, Heartbeat
Data (high-level): Stream, Relation
Operational units (low-level): Arithmetic Evaluators, Boolean Evaluators, Hash Evaluators
Operational units (high-level): Operators, Queues, Synopses, Indexes, Stores, Storage Allocators, Global Memory Manager, Scheduler

IV. RESULTS
The study of robustness during UDP network traffic analysis was conducted on a server at Zakir Hussain College of Engineering & Technology. The end users were asked to communicate among themselves using UDP connections, and this stream of incoming traffic was tested on the server. A simplified network overview, made in Packet Tracer, is shown in Figure 3. The stream passed to STREAM consisted of the UDP packet header fields: Traffic {ipsrc, srcport, ipdest, destport, length, checksum}, where ipsrc and ipdest are the IP addresses of the source and destination respectively, srcport and destport are the port numbers of the source and destination respectively, and length is the length of the packet sent. The major assumption for the project was that end users were communicating among themselves using UDP connections only. This setup can be easily extrapolated to larger systems; the assumption was made primarily to simplify the computational complexity.

Fig. 3: Network overview

The information retrieved from the traffic is tabulated as shown below:

TABLE II: INFORMATION RETRIEVED FROM TRAFFIC
Figure | Information Retrieved | Conclusion Drawn
4 | Total traffic in the network | Displays the network usage at the moment
5 | Network usage by 102.66.17.202 | Shows the network administrator the usage by a particular user and thus may help in billing the user
6 | Average packet length of DNS requests | Shows the network usage pattern, i.e. network crowding by DNS requests, multimedia requests, etc.
7 | Packets sent to the network by end user 102.66.17.202 over a period of 10 seconds | Helps in customer billing and in determining the network usage pattern
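The monitoring tasks of Table II can be expressed as short CQL queries over the Traffic stream. The sketches below are our own reconstructions (the paper reports only the results in Figures 4 to 7, not the query text); port 53 is assumed for DNS requests and SQL-style quoting of the IP address is assumed:

    Q4 (total traffic): Select Count(*) From Traffic [Range 1 Second]
    Q5 (usage by one end user): Select SUM(length) From Traffic [Range 1 Second] Where ipsrc = '102.66.17.202'
    Q6 (average packet length of DNS requests): Select AVG(length) From Traffic Where destport = 53
    Q7 (packets sent by one end user over 10 seconds): Select SUM(length) From Traffic [Range 10 Seconds] Where ipsrc = '102.66.17.202'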

The results so obtained are displayed in Figures 4, 5, 6 and 7.

Fig. 4: Total packet flow in the network at varying packet speeds
Fig. 5: Network usage by the end user with IP 102.66.17.202 at varying network speeds
Fig. 6: Average packet length of DNS request ports in the traffic
Fig. 7: Packet length sent by 102.66.17.202 over a period of 10 seconds

V. CONCLUSIONS
Based on our experiences with STREAM, we can say that it covered and exceeded our expectations. Below is the list of advantages that we think make STREAM suitable for real-life streaming applications:

Not a single tuple was dropped (tested up to 20,000 tuples per second): The fact that STREAM displays such a level of robustness easily makes it one of the best DSMS tools around. It also makes STREAM highly suitable for high-precision applications like stock exchange streams, weather forecasts, etc.

Extremely accurate aggregation operations: The error percentage in our working environment varied from 0.125% to 0.025%, again portraying STREAM as an accurate DSMS tool giving reliable output.

Supports a subset of SQL queries that is easy to understand: As against ad-hoc development and deployment of conventional stream handling tools, STREAM offers its users CQL, which is easy to use for anyone with previous SQL experience.

Graphical User Interface: STREAM is the only DSMS tool identified by us which has a Graphical User Interface (GUI) environment. This makes STREAM user friendly and, coupled with the fact that it is easy to install and deploy, certainly makes it one of the better DSMSs around.

Generates query plans: This makes STREAM more interactive and graphically shows the relational model of the project.

Based on a server-client architecture: Many clients can simultaneously access the server resources.

Following are the disadvantages of Stanford STREAM:

Robustness drops severely when the number of simultaneous queries increases to about 8 or more: The system hangs frequently on registrations with relations that are strongly dependent on each other.

The system crashes frequently on some aggregation operations like MIN and MAX: The support for these aggregation operators is extremely limited.

Requires conversion of the data stream to a text file before operations can be performed: Instant operation on live streams is still not supported, and hence real-time analysis could not be performed on streaming data.

The system crashes on complex queries at high speed: System robustness drops severely at high speeds coupled with relations that are complex and interrelated.

Tuple duplication because of tuple redundancy: STREAM outputs a result at one instant and then removes it at the next interval. This introduces tuple redundancy, as tuple accumulation occurs at times when the tuple is not meant to be in the system.

Supports only a small subset of SQL queries, as discussed before.

The following improvements could be made to the DSMS:
STREAM could be made real time by taking streams as input rather than text files.
Inputs should be allowed to be taken from sensors.
STREAM should support a wider range of SQL queries.
Robustness levels should be increased and redundancy should be minimized.
Tumbling window support should be enabled.
Relations should be allowed to be formed in real time.

REFERENCES
[1] Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker, M., Tatbul, N., Zdonik, S.: Monitoring Streams - A New Class of Data Management Applications.
[2] Stanford STREAM website: http://www.db.stanford.edu/stream
[3] Aurora: www.cs.brown.edu/research/aurora/
[4] TelegraphCQ: http://telegraph.cs.berkeley.edu/, 2008.
[5] Niagara: http://datalab.cs.pdx.edu/niagara/
[6] A. Arasu, S. Babu and J. Widom: The CQL Continuous Query Language: Semantic Foundations and Query Execution. VLDB Journal, 2005.
[7] STREAM: The Stanford Stream Data Manager, User Guide and Design Document.