Data Stream Mining. Tore Risch Dept. of information technology Uppsala University Sweden



2 Enormous data growth
- Read landmark article in the Economist
- The traditional Moore's law: processor speed doubles every 1.5 years
- Current data growth rate significantly higher
- Data grows 10-fold every 5 years, which is about the same as Moore's law
- Major opportunities:
  - spot business trends
  - prevent diseases
  - combat crime
  - scientific discoveries, the 4th research paradigm
  - data-centered economy
- Major challenges:
  - Information overload
  - Scalable data processing, Big Data management
  - Data security
  - Data privacy

3 Too much data to store on disk
- Need to mine streaming data

4 Mining a swift data river
- Cover, Economist
- 14-page thematic issue

5 New applications
Data comes as huge data streams, e.g.:
- Satellite data, e.g. satellite receivers
- Scientific instruments, e.g. colliders
- Social networks, e.g. Twitter
- Stock data
- Process industry, e.g. equipment in use
- Traffic control, e.g. car monitoring
- Patient monitoring, e.g. EKG, flow models

6 Data Stream Management Systems (DSMS)
- DataBase Management System (DBMS)
  - General purpose software to handle large volumes of persistent data (usually on disk)
  - Important tool for traditional data mining
- Data Stream Management System (DSMS)
  - General purpose software to handle large volume data streams (often transient data)
  - Important tool for data stream mining

7 Data Base Management System
[Figure: DBMS architecture, labelled SQL Queries, DBMS, Query Processor, Data Manager, Meta data, Stored Data]

8 Data Stream Management System
[Figure: DSMS architecture, labelled Continuous Queries (CQs), DSMS, Query Processor, Data Stream Manager, Data streams (in and out), Meta data]

9 Data Stream Management System
[Figure: DSMS architecture, labelled Continuous Queries (CQs), DSMS, Query Processor, Data & Stream Manager, Data streams (in and out), Meta data, Stored Data]

10 Mining streams vs. databases and files
- Data streams are dynamic and infinite in size
  - Data is continuously generated and changing
  - Live streams may have no upper limit
  - A live stream can be read only once ("Just One Look")
  - The stream rate may be very high
- Traditional data mining does not work for streaming data:
  - Regular data mining is based on finite data collections stored in files or databases
  - Regular data mining is done in batch:
    - access data in the collection several times
    - analyze accessed data
    - store results in database (or file)
  - For example, store for each data object what cluster it belongs to

11 Data stream mining vs. traditional data mining
- Live stream mining must be done on-line (in real time)
  - Not traditional batch processing
- Live streams require main memory processing
  - To keep up with very high data flow rates (speed)
- Live streams must be mined with limited memory
  - Not load-and-analyze as in traditional data mining
  - Iterative processing of statistics
- Live streams must keep up with very high data flow volumes
  - Approximate mining algorithms
  - Parallel on-line computations

12 Requirements for Stream Mining
- Single scan of data ("Just One Look")
  - because of the very large or infinite size of streams
  - because it may be impossible or very expensive to reread the stream for the computation
- Limited memory and CPU usage
  - because the processing should be done in main memory despite the very large stream volume
- Compact, continuously evolving representation of mined data
  - It is not possible to store the mined data in a database as with traditional data mining
  - A compact main memory representation of mined data is needed

13 Iterative stream processing
- Data stream mining requires iterative stream processing
  - Regular load-analyze-store mining does not scale
- Read data tuples iteratively from input stream(s)
  - E.g. measured values
  - Do not store read data in database
- Result of data stream mining is iteratively produced as a derived stream
  - Can be seen as a continuously changing database of mined data (statistics, clusters, association rules, etc.)

14 Differential aggregation
- Data stream mining requires differential aggregation
  - Analyze received tuples in streams differentially
- Continuously keep summary results in main memory
  - E.g. running statistics, such as sum and count
  - Sum: initially sum = 0 in database; for each received x: sum = sum + x
  - Count: initially cnt = 0 in database; for each received x: cnt = cnt + 1
- Continuously emit incrementally analyzed results
  - E.g. running average by dividing sum by count
  - Sum: emit sum
  - Count: emit cnt
  - Avg: emit sum/cnt
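
The running sum, count and average described on this slide can be sketched as a Python generator. This is a minimal illustration, not code from the slides: the function name `running_stats` is hypothetical, and the summary state is held in two local variables in main memory.

```python
from typing import Iterable, Iterator, Tuple


def running_stats(stream: Iterable[float]) -> Iterator[Tuple[float, int, float]]:
    """Yield (sum, count, average) after each received value."""
    total = 0.0  # running sum, the only state kept besides the count
    count = 0    # running count
    for x in stream:
        total += x   # sum = sum + x
        count += 1   # cnt = cnt + 1
        yield total, count, total / count  # emit sum, cnt, sum/cnt


if __name__ == "__main__":
    for s, n, avg in running_stats([4.0, 8.0, 6.0]):
        print(s, n, avg)
```

The input values are never stored; each one updates the summary and is then discarded, so memory use is constant regardless of stream length.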

15 Incremental stream aggregation

16 Incremental stream statistics
Iterative numerically stable one-pass solution:
  set N = 0  // incremental count
  set M = 0  // incremental mean
  set S = 0  // incremental sum of squared deviations from the mean
  for each Number Xi where Xi in X do
    set N = N + 1;                     // incremental count
    set Mnew = M + (Xi - M)/N;         // incremental mean
    set S = S + (Xi - M)*(Xi - Mnew);  // incremental sum of squared deviations
    set M = Mnew;
  return sqrt(S/(N-1))                 // sample standard deviation
Similar numerically stable and incremental stream computation algorithms are needed also for other computations.
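
The one-pass recurrence on this slide is commonly known as Welford's algorithm. A minimal Python sketch follows; the function name `incremental_stdev` and the error raised for fewer than two values are assumptions added for illustration.

```python
import math
from typing import Iterable


def incremental_stdev(xs: Iterable[float]) -> float:
    """One-pass sample standard deviation using the slide's recurrence."""
    n = 0       # N: incremental count
    mean = 0.0  # M: incremental mean
    s = 0.0     # S: incremental sum of squared deviations from the mean
    for x in xs:
        n += 1
        new_mean = mean + (x - mean) / n
        s += (x - mean) * (x - new_mean)
        mean = new_mean
    if n < 2:
        # Assumption: the sample standard deviation is undefined for n < 2.
        raise ValueError("need at least two values")
    return math.sqrt(s / (n - 1))


if __name__ == "__main__":
    print(incremental_stdev([2, 4, 4, 4, 5, 5, 7, 9]))
```

Only three numbers are kept as state, so the computation fits the limited-memory requirement regardless of how many values have been read.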

17 Streams vs. regular data
- Stream processing should keep up with data flow
  - Make computations so fast that they keep up with the flow (on average)
- It should not be catastrophic if the miner cannot keep up with the flow:
  - Drop input data values
  - Use approximate computations such as sampling
- Asynchronous logging in (many) files is often possible
  - At least during limited time (log files)
  - Perhaps of processed (reduced) streaming data
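
One way to sample a stream of unknown length in a single pass is reservoir sampling (Algorithm R). The slides only mention "sampling" in general, so the choice of this particular technique, the function name `reservoir_sample` and its parameters are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # uniform in [0, i], both ends inclusive
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

The memory used is bounded by `k` regardless of the stream length, which matches the limited-memory requirement for stream mining.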

18 Requirements for Stream Mining
- Allow for continuous evolution of mined data
  - Traditional batch mining is for static data
  - Continuous mining makes mined data into a stream too => concept drift
- Often mining over different kinds of windows of streams
  - E.g. sliding or tumbling windows
  - Windows of limited size
- Often only statistical summaries needed (synopses, sketches)

19 Stream windows
- Limited-size section of stream stored temporarily in DSMS
  - Regular database queries can be made over these windows
- Need window operator to chop stream into segments
- Window size (sz) based on:
  - Number of elements, a counting window
    - E.g. last 10 elements, i.e. the window has a fixed size of 10 elements
  - A time window
    - E.g. elements from the last second, i.e. the window contains all events processed during the last second
  - A landmark window
    - All events from time t0 in window, c.f. a growing log file
  - A decaying window
    - Decrease importance of measurement by multiplying with factor λ
    - Remove when importance below threshold

20 Stream windows
- Windows may also have a stride (str)
  - Rule for how fast they move forward
  - E.g. 10 elements for a 10 element counting window: a tumbling window
  - E.g. 2 elements for a 10 element counting window: a sliding window
  - E.g. 100 elements for a 10 element counting window: a sampling window
- Windows need not always be materialized
  - E.g. often sufficient to keep statistics materialized
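
The relation between window size and stride can be illustrated with a small counting-window operator. This is a sketch under assumptions: the name `count_windows` is hypothetical, and windows are taken to start at element positions 0, stride, 2*stride, and so on.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def count_windows(stream: Iterable[T], size: int, stride: int) -> Iterator[List[T]]:
    """Emit counting windows of `size` elements, one every `stride` elements."""
    buf: Deque[T] = deque(maxlen=size)  # holds only the last `size` elements
    for i, item in enumerate(stream):
        buf.append(item)
        start = i - size + 1            # index of the oldest buffered element
        if start >= 0 and start % stride == 0:
            yield list(buf)


if __name__ == "__main__":
    print(list(count_windows(range(7), size=3, stride=3)))   # tumbling
    print(list(count_windows(range(5), size=3, stride=1)))   # sliding
    print(list(count_windows(range(10), size=2, stride=4)))  # sampling
```

With stride equal to size the windows tumble, with a smaller stride they slide and overlap, and with a larger stride elements between windows are skipped, which gives a sampling window.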

21 Continuous (standing) queries over streams from expressways
Schema for stream CarLocStr of tuples:
  CarLocStr(car_id,   /* unique car identifier */
            speed,    /* speed of the car */
            exp_way,  /* expressway: */
            lane,     /* lane: 0,1,2,3 */
            dir,      /* direction: 0(east), 1(west) */
            x-pos);   /* coordinate in expressway */
CQL query to continuously get the cars in a window of the stream every 30 seconds:
  SELECT DISTINCT car_id
  FROM CarLocStr [RANGE 30 SECONDS];
Get the average speed of vehicles per expressway, direction, segment each 5 minutes:
  SELECT exp_way, dir, seg, AVG(speed) as speed
  FROM CarSegStr [RANGE 5 MINUTES]
  GROUP BY exp_way, dir, seg;
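
To make the meaning of the first query concrete, the following Python sketch emulates a 30-second time window with DISTINCT over a stream of (timestamp, car_id) pairs. It is an illustration only, not how a DSMS evaluates CQL: the function name `distinct_cars`, the reduced tuple layout, timestamps in seconds and arrival in timestamp order are all assumptions.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Set, Tuple


def distinct_cars(stream: Iterable[Tuple[float, int]],
                  range_s: float = 30.0) -> Iterator[Tuple[float, Set[int]]]:
    """After each tuple, yield the distinct car ids seen in the last range_s seconds."""
    window: Deque[Tuple[float, int]] = deque()
    for ts, car_id in stream:
        window.append((ts, car_id))
        # Evict tuples that have fallen out of the time window.
        while window and window[0][0] <= ts - range_s:
            window.popleft()
        yield ts, {cid for _, cid in window}


if __name__ == "__main__":
    events = [(0, 1), (10, 2), (25, 1), (31, 3), (60, 2)]
    for ts, cars in distinct_cars(events):
        print(ts, sorted(cars))
```

Only the tuples inside the current window are buffered, so memory is bounded by the stream rate times the window length rather than by the total stream size.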

22 DenStream
- Streamed DBSCAN
- Published: 2006 SIAM Conf. on Data Mining
- Regular DBSCAN:
  - DBSCAN saves cluster memberships of a static database per member object in the database
    - by scanning the database looking for pairs of objects close to each other
  - Database accessed many times
  - For scalable processing a spatial index must be used to index points in hyperspace and answer nearest-neighbor queries

23 DenStream
- One pass processing
- Limited memory
- Evolving clustering => concept drift, i.e. not static cluster membership
- Indefinite stream => cluster memberships not stored in database objects
- No assumption on the number of clusters
- Transient clusters fade in and fade out of decaying window
- Clusters of arbitrary shape allowed
- Good at handling outliers

24 Core micro-clusters
- Core point: anchor in cluster of other points
- Core micro-cluster (c-micro-cluster): an area covering points close to (epsilon similar) a core point
- Cluster defined as set of c-micro-clusters

25 Potential micro-clusters
- Outlier o-micro-cluster
  - New point not included in any micro-cluster
- Potential p-micro-cluster
  - Several clustered points not large enough to form a micro-cluster
- When new data point arrives:
  1. Try to merge with nearest p-micro-cluster
  2. Try to merge with nearest o-micro-cluster
     - If so, convert o-micro-cluster to p-micro-cluster
  3. Otherwise make new o-micro-cluster
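
The three steps for an arriving point can be sketched as follows. This is a strongly simplified illustration and not the published DenStream procedure: the class `MicroCluster`, the function `insert_point`, the merge test (distance from the point to the cluster center at most `eps`) and the promotion rule (weight at least `min_weight`) are all assumptions, and weight decay is left out.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass(eq=False)  # compare clusters by identity, not by field values
class MicroCluster:
    center: Point
    weight: float = 1.0

    def absorb(self, p: Point) -> None:
        """Add a point and update the center as a weighted mean."""
        w = self.weight + 1.0
        self.center = tuple((c * self.weight + x) / w
                            for c, x in zip(self.center, p))
        self.weight = w


def nearest(clusters: List[MicroCluster], p: Point) -> Optional[MicroCluster]:
    return min(clusters, key=lambda mc: math.dist(mc.center, p), default=None)


def insert_point(p: Point, p_mcs: List[MicroCluster], o_mcs: List[MicroCluster],
                 eps: float, min_weight: float) -> str:
    # 1. Try to merge with the nearest p-micro-cluster.
    mc = nearest(p_mcs, p)
    if mc is not None and math.dist(mc.center, p) <= eps:
        mc.absorb(p)
        return "merged into p-micro-cluster"
    # 2. Try to merge with the nearest o-micro-cluster.
    mc = nearest(o_mcs, p)
    if mc is not None and math.dist(mc.center, p) <= eps:
        mc.absorb(p)
        if mc.weight >= min_weight:  # assumed promotion rule
            o_mcs.remove(mc)
            p_mcs.append(mc)
            return "promoted to p-micro-cluster"
        return "merged into o-micro-cluster"
    # 3. Otherwise start a new o-micro-cluster.
    o_mcs.append(MicroCluster(center=p))
    return "new o-micro-cluster"
```

Each point is inspected once and then folded into a cluster summary, so only the micro-cluster centers and weights are kept in memory.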

26 Decaying p-micro-cluster windows
- Maintain weight C_p per p-micro-cluster
- Periodically (each T_p time period) decrease weight exponentially by multiplying old weight with λ
- Weight lower than threshold => delete, i.e. decaying window
- Decaying window of micro-clusters
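
The periodic decay and pruning step can be sketched in a few lines of Python. The function name `decay_step`, the dictionary of cluster names to weights, and the example values for λ and the threshold are hypothetical choices for illustration.

```python
from typing import Dict


def decay_step(weights: Dict[str, float], lam: float,
               threshold: float) -> Dict[str, float]:
    """Multiply every weight by lam and drop clusters that fall below threshold."""
    decayed = {name: w * lam for name, w in weights.items()}
    return {name: w for name, w in decayed.items() if w >= threshold}


if __name__ == "__main__":
    weights = {"a": 8.0, "b": 1.0}
    for period in range(1, 5):
        # One call per T_p time period.
        weights = decay_step(weights, lam=0.5, threshold=1.0)
        print(period, weights)
```

A cluster that receives no new points loses weight geometrically and eventually disappears, which is what makes the set of micro-clusters behave like a decaying window.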

27 Dealing with outliers
- o-micro-clusters are important for forming new evolving p-micro-clusters
  - Keep o-micro-clusters around
- Keeping all o-micro-clusters may be expensive
  - Delete o-micro-cluster by special exponential pruning rule (decaying window)
- Decaying window method proven to make the number of micro-clusters grow logarithmically with stream size
  - Good, but not sufficient for indefinite stream
  - Shown to grow very slowly though

28 Growth of micro-clusters

29 Forming c-micro-cluster sets
- Regularly (e.g. each time period) the user demands forming current c-micro-clusters from the current p-micro-clusters
- Done by running regular DBSCAN over the p-micro-clusters
  - Center of each p-micro-cluster regarded as point
  - Close when p-micro-clusters intersect
  - => Clusters formed

30 Bloom-filters
- Problem: Testing membership in extremely large data sets
  - E.g. all non-spam addresses
- No false negatives, i.e. if address is in set then OK guaranteed
- Few false positives allowed, i.e. a small number of spams may sneak through
- See section 4.3.2

31 Bloom-filters
- Main idea:
  - Assume bitmap B of objects, of size s
  - Hash each object x to h in [1,s]
  - Set bit B[h]
- Smaller than sorted table in memory:
  - Addresses of 40 bytes => 40 GByte if the set is to be stored sorted in memory
  - Would be expensive to extend
  - Bitmap could be e.g. 125 MBytes (number of bits / 8)
- May have false positives
  - Since hash function not perfect

32 Lowering false positives
- Small bitmap => many false positives
- Idea: hash with several independent hash functions h_1(x), h_2(x) and set bits correspondingly (logical OR)
- For each new x check that all h_i(x) are set
  - If so => match
- Chance of false positives decreases exponentially with number of h_i
- Assumes independent h_i(x)
  - h_i(x) and h_j(x) have no common factors if i ≠ j
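
A Bloom filter with several hash functions can be sketched as below. This is a minimal illustration under assumptions: the class name `BloomFilter` is hypothetical, bit positions are numbered from 0 rather than 1, and the "independent" hash functions are approximated by salting SHA-256 with the function index, which is one convenient choice and not something prescribed by the slides.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # bitmap B, all zeros

    def _positions(self, item: str) -> Iterator[int]:
        # Assumption: salting with the index i stands in for h_1 ... h_k.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # set bit (logical OR)

    def __contains__(self, item: str) -> bool:
        # Match only if every h_i(x) bit is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


if __name__ == "__main__":
    good = BloomFilter(num_bits=1024, num_hashes=3)
    good.add("alice@example.org")
    print("alice@example.org" in good)  # always True: no false negatives
    print("spam@example.org" in good)   # usually False, but may be a false positive
```

Membership tests on added items always succeed, while tests on other items can occasionally succeed by accident when all their bit positions happen to be set by other items.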

33 Articles
- L. Golab and T. Özsu: "Issues in Stream Data Management", SIGMOD Record, 32(2), June 2003
- Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy: "Mining Data Streams: A Review", ACM SIGMOD Record, Vol. 34, No. 2, June 2005, pp
