UDP Packet Monitoring with Stanford Data Stream Manager

Nadeem Akhtar #1, Faridul Haque Siddiqui #2
# Department of Computer Engineering, Aligarh Muslim University, Aligarh, India
1 nadeemalakhtar@gmail.com
2 faridhaq@zhcet.ac.in

Abstract — The purpose of this paper is to monitor a real-time stream of UDP packets with the Data Stream Management System (DSMS) tool Stanford STREAM, using the Continuous Query Language (CQL). The huge amount of data that has to be managed and analyzed, together with the fact that many different analysis tasks are performed over a small set of different network trace formats, motivates us to study whether Data Stream Management Systems (DSMSs) might be useful for developing network traffic analysis tools. We will see how STREAM displays excellent robustness in handling high-speed UDP streams. The system, however, also suffers from several setbacks, such as tuple redundancy, frequent system crashes and a small supported query set.

Keywords — Stanford STREAM, CQL, DSMS, UDP Packet Analysis

I. INTRODUCTION
Data Stream Management Systems are specifically designed for handling continuous data streams. They can handle multiple, time-varying, unpredictable and unbounded streams which cannot be handled using traditional tools. In this paper, we have used a Data Stream Management System, Stanford STREAM, to monitor the traffic of UDP packets in a computer network.

Data stream management systems have been developed to monitor continuously arriving data. They differ from traditional Database Management Systems in that they work on transient tables rather than persistent tables. Database Management Systems (DBMSs) may be used for analyzing continuous stream data. Traditional DBMSs, however, suffer from some serious bottlenecks which limit their use in real-time, complex applications requiring continuous monitoring of ever-changing data streams. A detailed discussion is provided in [1]. Since data stream management has been a hot topic over the last few years, several systems have been developed. Some important ones are Stanford STREAM [2], Aurora [3], TelegraphCQ [4] and Niagara [5].

STREAM supports declarative continuous queries over two types of inputs: streams and relations. A continuous query is simply a long-running query which produces output in a continuous fashion as the input arrives. The queries are expressed in a language called CQL. The two input types, streams and relations, are defined using some ordered time domain, which may or may not be related to wall-clock time.

Definition 2.1 (Stream): A stream is a sequence of timestamped tuples. There could be more than one tuple with the same timestamp. The tuples of an input stream are required to arrive at the system in the order of increasing timestamps. A stream has an associated schema consisting of a set of named attributes, and all tuples of the stream conform to the schema.

Definition 2.2 (Relation): A relation is a time-varying bag of tuples. Here "time" refers to an instant in the time domain. Input relations are presented to the system as a sequence of timestamped updates which capture how the relation changes over time. An update is either a tuple insertion or a tuple deletion. The updates are required to arrive at the system in the order of increasing timestamps. Like streams, relations have a fixed schema to which all tuples conform.

The output of a CQL query is a stream or a relation depending on the query.
The output is produced in a continuous fashion as described below. If the output is a stream, the tuples of the stream are produced in the order of increasing timestamps. The tuples with timestamp τ are produced once all the input stream tuples and relation updates with timestamps ≤ τ have arrived. If the output is a relation, the relation is represented as a sequence of timestamped updates (just like the input relations). The updates are produced in the order of increasing timestamps, and updates with timestamp τ are produced once all input stream tuples and relation updates with timestamps ≤ τ have arrived.

The UDP header consists of four 16-bit fields: Source Port, Destination Port, Length and Checksum, followed by the data octets.

Fig. 1: UDP packet header

The purpose of the paper is to monitor the traffic data based on this header information.
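To make these definitions concrete, the following minimal CQL sketch is our own illustration (not taken from the paper); it assumes the Traffic stream schema introduced later in Section IV, i.e. Traffic(ipsrc, srcport, ipdest, destport, length, checksum), and an arbitrary 512-byte threshold. The first query produces a relation, the second a stream:

    Q1 (relation output): number of UDP packets seen in the last 30 seconds, updated as time advances
        Select Count(*) From Traffic [Range 30 Seconds]

    Q2 (stream output): every arriving packet longer than 512 bytes, emitted as it arrives
        Select Istream(*) From Traffic Where length > 512

Q1 yields a relation because the sliding window turns the stream into a time-varying bag of tuples over which the aggregate is evaluated at every instant; Q2 applies Istream to turn the filtered (default unbounded-window) relation back into a stream of insertions.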

II. CONTINUOUS QUERY LANGUAGE AND ITS RESTRICTIONS
STREAM currently does not support all the features of CQL [6]. In this section, we mention the important features omitted in the current implementation of STREAM that we found based on our experiences with the system. The important omissions are:

Sub-queries are not allowed in the Where clause. For example, the following query is not supported:
    Select * From S Where S.A in (Select R.A From R)

The Having clause is not supported, but the Group By clause is supported. For example, the following query is not supported:
    Select A, SUM(B) Group By A Having MAX(B) > 50

Expressions in the Project clause involving aggregations are not supported. For example, the query
    Select A, (MAX(B) + MIN(B))/2 Group By A
is not supported. However, non-aggregated attributes can participate in arbitrary arithmetic expressions in the Project clause and the Where clause. For example, the following query is supported:
    Select (A + B)/2 Where (A - B) * (A - B) > 25

Attributes can have one of four types: Integer, Float, Char(n), and Byte. Variable-length strings (Varchar(n)) are not supported.

Windows with the slide parameter are not supported.

The binary operations Union and Except are supported, but Intersect is not.
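To make the supported subset concrete, the following is a small sketch of our own (not a query from the paper) over the Traffic stream defined in Section IV; it stays within the supported features listed above, using a time-based Range window, a Group By, and a single un-nested aggregate:

    Select ipsrc, SUM(length)
    From Traffic [Range 60 Seconds]
    Group By ipsrc

Adding a Having clause, a slide parameter on the window, or a projected expression over aggregates such as SUM(length)/COUNT(*) would each move this query into one of the unsupported cases above.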
III. ARCHITECTURE [7]
This section briefly describes the architecture of the STREAM DSMS prototype. The architecture is made up of two broad components: 1. the planning subsystem, which stores metadata and generates query plans, and 2. the execution engine, which executes the continuous queries.

A. Planning Subsystem
Figure 2 shows the main components of the planning subsystem. The components shown with double-bordered rectangles are stateful: they contain the system metadata. The other components are stateless, functional units which are used to transform a query into its plan. The solid arrows indicate the path of a query along these components.

Fig. 2: The planning component [7]

1) Parser: Transforms the query string into a parse tree representation of the query. (The parser is also used to parse the schema of a registered stream or relation.)

2) Semantic Interpreter: Transforms the parse tree into an internal representation of the query. The representation is still block-based (declarative) and not an operator tree. As part of this transformation, the semantic interpreter resolves attribute references, implements CQL defaults (e.g., adding an Unbounded window), performs other miscellaneous syntactic transformations like expanding the "*" in Select *, and converts external string-based identifiers for relations, streams, and attributes to internal integer-based ones. The mapping from string identifiers to integer identifiers is maintained by the Table Manager.

3) Logical Plan Generator: Transforms the internal representation of a query into a logical plan for the query. The logical plan is constructed from logical operators. The logical operators closely resemble the relational algebra operators (e.g., select, project, join), but some are CQL-specific (e.g., window operators and relation-to-stream operators). The logical operators are not necessarily related to the actual operators present in the execution subsystem. The logical plan generator also applies various transformations that (usually) improve performance: pushing selections below cross-products (joins), eliminating redundant Istream operators (an Istream over a stream is redundant), eliminating redundant project operators (e.g., a project operator in a Select * query is usually redundant), and applying Rstream-Now window based transformations.

4) Physical Plan Generator: Transforms a logical plan for a query into a physical plan. The operators in a physical plan are exactly those that are available in the execution subsystem (unlike those in the logical plan). The physical plan generator
is actually part of the Plan Manager (although this is not suggested by Figure 2), and the generated physical plan for a query is linked to the physical plans for previously registered queries. In particular, the physical plans for views that are referenced by the query directly feed into the physical plan for the query.

5) Plan Manager: The Plan Manager stores the combined "mega" physical plan corresponding to all the registered queries. The Plan Manager also contains the routines that flesh out a basic physical plan containing operators with all the subsidiary execution structures like synopses, stores, storage allocators, indexes, and queues, and that instantiate the physical plan before starting execution.

6) Table Manager: The Table Manager stores the names and schemas of all the registered streams and relations. The streams and relations can be either input (base) streams and relations or intermediate streams and relations produced by named queries. The Table Manager also assigns integer identifiers to streams and relations, which are used in the rest of the planning subsystem.

7) Query Manager: The Query Manager stores the text of all the registered queries.

B. Execution Engine
The main purpose of the execution engine is the execution of continuous queries over the streams. The work done by it can be divided into two groups, as shown in Table I. For further details on the execution engine, the reader is referred to the STREAM user manual [7].

TABLE I: COMPONENTS OF THE EXECUTION ENGINE [7]
Data (low-level): Tuple, Element, Heartbeat
Data (high-level): Stream, Relation
Operational units (low-level): Arithmetic Evaluators, Boolean Evaluators, Hash Evaluators
Operational units (high-level): Operators, Queues, Synopses, Indexes, Stores, Storage Allocators, Global Memory Manager, Scheduler

IV. RESULTS
The study of robustness during UDP network traffic analysis was conducted on a server at Zakir Hussain College of Engineering & Technology. The end users were asked to communicate among themselves using UDP connections, and this stream of incoming traffic was tested on the server. A simplified network overview, made in Packet Tracer, is shown in Figure 3. The stream passed to STREAM consisted of the UDP packet header fields: Traffic {ipsrc, srcport, ipdest, destport, length, checksum}, where ipsrc and ipdest are the IP addresses of the source and destination respectively, srcport and destport are the port numbers of the source and destination respectively, and length is the length of the packet sent. The major assumption for the project was that end users were communicating among themselves using UDP connections only. This setup can be easily extrapolated to larger systems; the assumption was made primarily to simplify the computational complexity.

Fig. 3: Network overview

The information retrieved from the traffic is tabulated as shown below:

TABLE II: INFORMATION RETRIEVED FROM TRAFFIC
Figure | Information Retrieved | Conclusion Drawn
4 | Total traffic in the network | Displays the network usage at the moment
5 | Network usage by 102.66.17.202 | Shows the network administrator the usage by a particular user and thus may help in billing the user
6 | Average packet length of DNS requests | Shows the network usage pattern, i.e. network crowding by DNS requests, multimedia requests, etc.
7 | Packets sent to the network by end user 102.66.17.202 over a period of 10 seconds | Helps in customer billing and in determining the network usage pattern
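The monitoring tasks of Table II can be expressed as short CQL queries over the Traffic stream. The sketches below are our own reconstructions (the paper reports only the results in Figures 4 to 7, not the query text); port 53 is assumed for DNS requests and SQL-style quoting of the IP address is assumed:

    Q4 (total traffic): Select Count(*) From Traffic [Range 1 Second]
    Q5 (usage by one end user): Select SUM(length) From Traffic [Range 1 Second] Where ipsrc = '102.66.17.202'
    Q6 (average packet length of DNS requests): Select AVG(length) From Traffic Where destport = 53
    Q7 (packets sent by one end user over 10 seconds): Select SUM(length) From Traffic [Range 10 Seconds] Where ipsrc = '102.66.17.202'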

The results so obtained are displayed in Figures 4, 5, 6 and 7.

Fig. 4: Total packet flow in the network at varying packet speeds
Fig. 5: Network usage by the end user with IP 102.66.17.202 at varying network speeds
Fig. 6: Average packet length of DNS request ports in the traffic
Fig. 7: Packet length sent by 102.66.17.202 over a period of 10 seconds

V. CONCLUSIONS
Based on our experiences with STREAM, we can say that it covered and exceeded our expectations. Below is the list of advantages that we think make STREAM suitable for real-life streaming applications:

Not a single tuple was dropped (tested up to 20,000 tuples per second): The fact that STREAM displays such a level of robustness easily makes it one of the best DSMS tools around. It also makes STREAM highly suitable for high-precision applications like stock exchange streams, weather forecasts, etc.

Extremely accurate aggregation operations: The error percentage in our working environment varied from 0.125% to 0.025%, again portraying STREAM as an accurate DSMS tool giving reliable output.

Supports a subset of SQL queries that is easy to understand: As against ad-hoc development and deployment of conventional stream handling tools, STREAM offers its users CQL, which is easy to use for anyone with previous SQL experience.

Graphical User Interface: STREAM is the only DSMS tool identified by us which has a Graphical User Interface (GUI) environment. This makes STREAM user friendly and, coupled with the fact that it is easy to install and deploy, certainly makes it one of the better DSMSs around.

Generates query plans: This makes STREAM more interactive and graphically shows the relational model of the project.

Based on a server-client architecture: Many clients can simultaneously access the server resources.

Following are the disadvantages of Stanford STREAM:

Robustness drops severely when the number of simultaneous queries increases to about 8 or more: The system hangs frequently on registrations with relations that are strongly dependent on each other.

The system crashes frequently on some aggregation operations like MIN and MAX: The support for these aggregation operators is extremely limited.

Requires conversion of the data stream to a text file before operations can be performed: Instant operation on live streams is still not supported, and hence real-time analysis could not be performed on streaming data.

The system crashes on complex queries at high speed: System robustness drops severely at high speeds coupled with relations that are complex and interrelated.

Tuple duplication because of tuple redundancy: STREAM outputs a result at one instant and then removes it at the next interval. This introduces tuple redundancy, as tuple accumulation occurs at times when the tuple is not meant to be in the system.

Supports only a small subset of SQL queries, as discussed before.

The following improvements could be made to the DSMS:
STREAM could be made real time by taking streams as input rather than text files.
Inputs should be allowed to be taken from sensors.
STREAM should support a wider range of SQL queries.
Robustness levels should be increased and redundancy should be minimized.
Tumbling window support should be enabled.
Relations should be allowed to be formed in real time.

REFERENCES
[1] Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker, M., Tatbul, N., Zdonik, S.: Monitoring Streams - A New Class of Data Management Applications.
[2] Stanford STREAM website: http://www.db.stanford.edu/stream
[3] Aurora: www.cs.brown.edu/research/aurora/
[4] TelegraphCQ: http://telegraph.cs.berkeley.edu/, 2008.
[5] Niagara: http://datalab.cs.pdx.edu/niagara/
[6] A. Arasu, S. Babu and J. Widom: The CQL Continuous Query Language: Semantic Foundations and Query Execution. VLDB Journal, 2005.
[7] STREAM: The Stanford Stream Data Manager, User Guide and Design Document.