Lecture 2: Streaming Algorithms for Counting Distinct Elements

Lecture 2: Streaming Algorithms for Counting Distinct Elements 20th August, 2008


Streaming Algorithms
Streaming algorithms have the following properties:
1. Items in the stream are presented sequentially.
2. Only a single pass is made over the data.
3. Only limited (sublinear) space is available in which to operate.
4. Updates per item must be very fast.

Notation
A stream is an ordered tuple over the alphabet $\{1, 2, 3, \ldots, n\}$. The alphabet may be, for example:
- source or destination addresses ($n = 2^{32}$)
- source or destination ports ($n = 64\mathrm{K}$)
- any combination of flow header fields ($n \approx 2^{100}$)
The frequency of the $i$th item is $m_i$, and the length of the stream is $m = \sum_{i=1}^{n} m_i$.
The stream is represented by $(a_1, a_2, a_3, \ldots, a_m)$, where each $a_t \in \{1, 2, 3, \ldots, n\}$.

Some Sample Streaming Problems
- Estimating frequency moments (e.g., $F_2$ is a measure of the skewness): $F_k = \sum_{i=1}^{n} m_i^k$
- Estimating entropy (a useful metric for anomaly detection): $H = -\sum_{i=1}^{n} \frac{m_i}{m} \log\left(\frac{m_i}{m}\right)$
- Counting the number of distinct elements: $D = |\{i : m_i \neq 0\}|$

Definition
Example stream: (A, B, B, A, C, A, E, A, C, C, C, B, B)
Number of distinct elements = 4 (namely A, B, C, and E)
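The three quantities defined above ($F_k$, $H$, and $D$) can be computed exactly when memory is not a concern, which gives a baseline to compare streaming estimates against. The following is a minimal sketch; the function name `stream_statistics` is hypothetical, and the use of the natural logarithm for the entropy is an assumption, since the slides do not fix the base.

```python
import math
from collections import Counter

def stream_statistics(stream, k=2):
    """Exact F_k, entropy H (natural log, an assumption), and distinct count D.

    This stores one counter per distinct item, so it is NOT a streaming
    algorithm; it only serves as a ground-truth baseline.
    """
    counts = Counter(stream)            # m_i for every item that appears
    m = sum(counts.values())            # stream length m
    f_k = sum(c ** k for c in counts.values())
    entropy = -sum((c / m) * math.log(c / m) for c in counts.values())
    distinct = len(counts)              # number of i with m_i != 0
    return f_k, entropy, distinct

# The example stream from the slide: A, B, and C appear 4 times each, E once.
stream = list("ABBACAEACCCBB")
f2, h, d = stream_statistics(stream)
print(f2, d)  # 49 4
```

Here $F_2 = 4^2 + 4^2 + 4^2 + 1^2 = 49$ and $D = 4$, matching the slide.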

Motivation
What does it mean when
1. the number of destination ports spikes? Port scan.
2. the number of source addresses spikes? Flash crowd.
3. the number of source addresses to a fixed destination address spikes? DDoS attack.

First-attempt Solutions
- Maintain a record for every distinct flow
  - Needs O(n) memory, or at least O(m)
- Sample (e.g., 1 in 100 packets)
  - Cisco's NetFlow
  - Hohn and Veitch show that sampling can give inaccurate results
- Use a lossy set maintenance data structure (Bloom filter)
  - Still requires O(m) space

Bitmap Algorithm
1. Initialize a large bitmap (binary array) of size $B$ with all zeroes.
2. Choose a hash function $h : \{1, \ldots, n\} \to \{1, \ldots, B\}$.
3. For each flow label $f \in \{1, \ldots, n\}$, compute $h(f)$ and mark that position in the bitmap with a 1.
4. Afterwards, count the number of positions in the bitmap with a 1 and call it $c$.
5. Estimate the number of distinct items by $B \ln\left(\frac{B}{B - c}\right)$.
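The bitmap steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the helper `hash_to_range` (built on SHA-256) is a hypothetical stand-in for the hash function $h$, positions are 0-based instead of 1-based, and raising an error on a full bitmap is a design choice not specified in the slides.

```python
import hashlib
import math

def hash_to_range(item, size):
    """Hypothetical stand-in for h: maps an item to {0, ..., size - 1}."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") % size

def bitmap_estimate(stream, B, h=hash_to_range):
    """Estimate the number of distinct items as B * ln(B / (B - c))."""
    bitmap = [0] * B                    # step 1: all zeroes
    for f in stream:                    # step 3: mark h(f) for every item
        bitmap[h(f, B)] = 1
    c = sum(bitmap)                     # step 4: number of ones
    if c == B:
        # B - c = 0 makes the estimate undefined (assumed handling).
        raise ValueError("bitmap is full; use a larger B")
    return B * math.log(B / (B - c))    # step 5

# 1000 distinct items, each seen three times, in a 4096-bit bitmap.
print(bitmap_estimate(list(range(1000)) * 3, 4096))
```

Because a repeated item always hashes to the same position, duplicates never change the bitmap, which is what lets the estimate depend only on the number of distinct items.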

Brief Proof
Consider the probability of a position in the bitmap being zero at the end: $(1 - 1/B)^D \approx e^{-D/B}$, where $D$ is the number of distinct items. The expected number of zeroes is $B$ times this (since the bitmap is of size $B$), so we have $B - c \approx B e^{-D/B}$. Solving for $D$, we get $D = B \ln\left(\frac{B}{B - c}\right)$. This turns out to be a maximum likelihood estimator for the number of distinct elements.
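The two steps of the argument (the approximation $(1 - 1/B)^D \approx e^{-D/B}$ and the inversion for $D$) can be checked numerically. The sketch below uses hypothetical function names and illustrative values $B = 4096$ and $D = 1000$ that do not come from the slides.

```python
import math

def expected_zeros_exact(B, D):
    """Expected number of zero positions: B * (1 - 1/B)^D."""
    return B * (1 - 1 / B) ** D

def expected_zeros_approx(B, D):
    """The approximation used in the proof: B * e^(-D/B)."""
    return B * math.exp(-D / B)

def solve_for_d(B, zeros):
    """Invert zeros = B * e^(-D/B) to get D = B * ln(B / zeros)."""
    return B * math.log(B / zeros)

B, D = 4096, 1000                       # illustrative values (assumed)
exact = expected_zeros_exact(B, D)
approx = expected_zeros_approx(B, D)
print(exact, approx)                    # the two agree closely for large B
print(solve_for_d(B, approx))           # recovers D up to rounding
```

Here `zeros` plays the role of $B - c$ in the slide, so `solve_for_d` is the same formula as the estimator $B \ln\left(\frac{B}{B - c}\right)$.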

Bitmap Example
With $B = 4$ and $c = 3$: estimated number of distinct elements $= 4 \ln(4/1) \approx 5.55 \approx 6$

The Coupon Collector Problem
A coupon collector wants to collect all of $C$ different types of coupons. Every time he gets a coupon, it is one of the $C$ types, drawn with replacement. Question: How many coupons will he need to draw before he has one of every type?

The Coupon Collector Problem - Solution
We look at the number of coupons that he must pick to get the $i$th distinct coupon:
1. To get the first coupon, he only needs one try.
2. To get the second distinct coupon (any of $C - 1$ of $C$), he needs an expected $\frac{C}{C - 1}$ tries.
3. To get the $i$th distinct coupon (any of $C - i + 1$ of $C$), he needs an expected $\frac{C}{C - i + 1}$ tries.
Hence, the expected number of tries is $1 + \frac{C}{C - 1} + \frac{C}{C - 2} + \cdots + \frac{C}{1} = C \left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{C}\right) \approx C \ln C$.
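The expectation derived above can be compared against a small simulation. This is a sketch with hypothetical function names; the values $C = 50$, 2000 trials, and the fixed seed are illustrative assumptions, and each coupon type is assumed to be equally likely.

```python
import math
import random

def expected_draws(C):
    """Exact expectation: C * (1 + 1/2 + ... + 1/C)."""
    return C * sum(1 / i for i in range(1, C + 1))

def simulate_draws(C, rng):
    """Draw uniformly with replacement until all C types have been seen."""
    seen = set()
    draws = 0
    while len(seen) < C:
        seen.add(rng.randrange(C))
        draws += 1
    return draws

rng = random.Random(0)                  # fixed seed for repeatability
C, trials = 50, 2000                    # illustrative values (assumed)
average = sum(simulate_draws(C, rng) for _ in range(trials)) / trials
print(expected_draws(C), C * math.log(C), average)
```

Note that $C \ln C$ is only the leading term: the harmonic sum is about $\ln C + 0.577$, so the exact expectation exceeds $C \ln C$ by roughly $0.577\,C$.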

Multi-resolution Bitmaps
A virtual bitmap is one where the flows are first sampled before being entered into the bitmap. This cuts down the amount of memory by about the sampling factor. However, from the Coupon Collector's problem, we know that $B \ln B$ distinct flows will (in expectation) fill the bitmap entirely with 1's. If we are unsure beforehand how large the number of distinct items can get, we can use bitmaps of various sizes and approximate the value from the one with the most appropriate size.
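A rough sketch of the virtual bitmap idea follows. Several details are assumptions, not taken from the slides: flows are sampled by hashing the flow label (so every packet of a flow gets the same keep-or-drop decision), the bitmap estimate of the sampled flows is scaled up by the sampling factor, and `salted_hash` is a hypothetical SHA-256-based helper. The multi-resolution variant, which combines bitmaps of several sizes, is not shown.

```python
import hashlib
import math

def salted_hash(item, salt):
    """Hypothetical helper: 64-bit hash of (salt, item) from SHA-256."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def virtual_bitmap_estimate(stream, B, sample_one_in):
    """Bitmap estimate over a 1-in-`sample_one_in` sample of the flows."""
    bitmap = [0] * B
    for f in stream:
        # Keep a flow only if its sampling hash falls in the chosen class.
        if salted_hash(f, "sample") % sample_one_in != 0:
            continue
        # An independent salt picks the bitmap position of a sampled flow.
        bitmap[salted_hash(f, "position") % B] = 1
    c = sum(bitmap)
    if c == B:
        raise ValueError("bitmap is full; sample more sparsely or enlarge B")
    # Scale the estimate of sampled flows back up by the sampling factor.
    return sample_one_in * B * math.log(B / (B - c))

# 20000 distinct flows, sampled 1 in 8, into a 4096-bit bitmap.
print(virtual_bitmap_estimate(range(20000), 4096, 8))
```

With these parameters only about 2500 flows reach the bitmap, so a 4096-bit bitmap stays far from full even though the stream holds 20000 distinct flows.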

Min-hash Algorithm
1. Use a hash function $h : \{1, \ldots, n\} \to \{1, \ldots, 2^N\}$.
2. For each flow label $f \in \{1, \ldots, n\}$, compute $h(f)$ and retain the minimal value.
3. Approximate the number of distinct items by $\frac{2^N}{\min_f h(f)} - 1$.
Only $N = O(\log n)$ bits are needed to maintain the minimum.
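The min-hash steps above translate almost directly into code. In this sketch, `hash_to_n_bits` is a hypothetical SHA-256-based stand-in for $h$ (valid for $N \le 64$), and returning 0 for an empty stream is an assumed convention. The small lookup table in the usage example is invented purely to give a minimum hash value of 3 with $2^N = 16$.

```python
import hashlib

def hash_to_n_bits(item, N):
    """Hypothetical stand-in for h: maps an item to {1, ..., 2**N}, N <= 64."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") % (2 ** N) + 1

def min_hash_estimate(stream, N, h=hash_to_n_bits):
    """Estimate the number of distinct items as 2^N / min h(f) - 1."""
    minimum = None
    for f in stream:
        value = h(f, N)
        if minimum is None or value < minimum:
            minimum = value             # only this one value is stored
    if minimum is None:
        return 0.0                      # empty stream (assumed convention)
    return 2 ** N / minimum - 1

# Invented hash values for illustration: the minimum is 3 and 2^N = 16.
table = {"A": 7, "B": 3, "C": 12, "E": 10}
estimate = min_hash_estimate("ABBACAEACCCBB", 4, h=lambda f, N: table[f])
print(estimate)                         # 16/3 - 1, about 4.33
```

A single minimum gives a high-variance estimate; the only state kept is one $N$-bit value, which is the point of the space bound stated above.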

Min-hash Example
Number of distinct elements $\approx 16/3 - 1 \approx 4.33$