Data Stream Mining. Tore Risch Dept. of information technology Uppsala University Sweden
1 Data Stream Mining Tore Risch Dept. of information technology Uppsala University Sweden
2 Enormous data growth
- Read the landmark article in The Economist
- The traditional Moore's law: processor speed doubles every 1.5 years
- Current data growth rate significantly higher
  - Data grows 10-fold every 5 years, which is about the same as Moore's law
- Major opportunities:
  - spot business trends
  - prevent diseases
  - combat crime
  - scientific discoveries, the 4th research paradigm
  - data-centered economy
- Major challenges:
  - Information overload
  - Scalable data processing, big data management
  - Data security
  - Data privacy
3 Too much data to store on disk
- Need to mine streaming data
4 Mining a swift data river
- Cover of The Economist, 14-page thematic issue
5 New applications
Data comes as huge data streams, e.g.:
- Satellite data, e.g. satellite receivers
- Scientific instruments, e.g. colliders
- Social networks, e.g. Twitter
- Stock data
- Process industry, e.g. equipment in use
- Traffic control, e.g. car monitoring
- Patient monitoring, e.g. EKG, flow models
6 Data Stream Management Systems (DSMS)
- DataBase Management System (DBMS)
  - General purpose software to handle large volumes of persistent data (usually on disk)
  - Important tool for traditional data mining
- Data Stream Management System (DSMS)
  - General purpose software to handle large volume data streams (often transient data)
  - Important tool for data stream mining
7 Database Management System
[Figure: DBMS architecture. Labels: SQL queries, DBMS, query processor, data manager, meta data, stored data]
8 Data Stream Management System
[Figure: DSMS architecture. Labels: continuous queries (CQs), DSMS, query processor, data stream manager, input and output data streams, meta data]
9 Data Stream Management System
[Figure: DSMS architecture with storage. Labels: continuous queries (CQs), DSMS, query processor, data and stream manager, input and output data streams, meta data, stored data]
10 Mining streams vs. databases and files
- Data streams are dynamic and infinite in size
  - Data is continuously generated and changing
  - Live streams may have no upper limit
  - A live stream can be read only once ("Just One Look")
  - The stream rate may be very high
- Traditional data mining does not work for streaming data:
  - Regular data mining is based on finite data collections stored in files or databases
  - Regular data mining is done in batch:
    - access data in collection several times
    - analyze accessed data
    - store results in database (or file)
  - For example, store for each data object what cluster it belongs to
11 Data stream mining vs. traditional data mining
- Live stream mining must be done on-line (in real time)
  - Not traditional batch processing
- Live streams require main memory processing
  - To keep up with very high data flow rates (speed)
- Live streams must be mined with limited memory
  - Not load-and-analyze as in traditional data mining
  - Iterative processing of statistics
- Live streams must keep up with very high data flow volumes
  - Approximate mining algorithms
  - Parallel on-line computations
12 Requirements for Stream Mining
- Single scan of data ("Just One Look")
  - because of the very large or infinite size of streams
  - because it may be impossible or very expensive to reread the stream for the computation
- Limited memory and CPU usage
  - because the processing should be done in main memory despite the very large stream volume
- Compact, continuously evolving representation of mined data
  - It is not possible to store the mined data in a database as with traditional data mining
  - A compact main memory representation of the mined data is needed
13 Iterative stream processing
- Data stream mining requires iterative stream processing
  - Regular load, analyze, store mining does not scale
- Read data tuples iteratively from input stream(s)
  - E.g. measured values
  - Do not store read data in database
- Result of data stream mining is iteratively produced as a derived stream
  - Can be seen as a continuously changing database of mined data (statistics, clusters, association rules, etc.)
14 Differential aggregation
- Data stream mining requires differential aggregation
  - Analyze received tuples in streams differentially
- Continuously keep summary results in main memory
  - E.g. running statistics, such as sum and count
  - Sum: initially sum=0 in database; for each received x: sum = sum + x
  - Count: initially cnt=0 in database; for each received x: cnt = cnt + 1
- Continuously emit incrementally analyzed results
  - E.g. running average by dividing sum by count
  - Sum: emit sum
  - Count: emit cnt
  - Avg: emit sum/cnt
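The running sum, count, and average described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `running_aggregates` and the sample input are assumptions, not part of the slides.

```python
from typing import Iterable, Iterator, Tuple


def running_aggregates(stream: Iterable[float]) -> Iterator[Tuple[float, int, float]]:
    """Emit (sum, cnt, avg) after every received value, keeping O(1) state."""
    total = 0.0  # sum: initially 0
    count = 0    # cnt: initially 0
    for x in stream:
        total += x   # sum = sum + x
        count += 1   # cnt = cnt + 1
        yield total, count, total / count  # emit sum, cnt, and sum/cnt


# Hypothetical input stream of three measured values.
results = list(running_aggregates([2.0, 4.0, 9.0]))
```

Only two numbers are kept in memory regardless of how long the stream is, and the emitted tuples form a derived stream of their own.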
15 Incremental stream aggregation
16 Incremental stream statistics
Iterative, numerically stable one-pass solution:
    set N = 0   // incremental counter
    set M = 0   // incremental mean
    set S = 0   // incremental sum of squared deviations from the mean
    for each Number Xi where Xi in X do
        set N = N + 1;                     // incremental count
        set Mnew = M + (Xi - M)/N;         // incremental mean
        set S = S + (Xi - M)*(Xi - Mnew);  // incremental sum of squared deviations
        set M = Mnew;
    return sqrt(S/(N-1))                   // sample standard deviation
Similar numerically stable and incremental stream computation algorithms are needed also for other computations.
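A direct Python transcription of the one-pass standard deviation pseudocode might look as follows. The function name `streaming_sdev` and the guard for fewer than two values are assumptions added for this sketch; the update rule is the one on the slide (commonly known as Welford's method).

```python
import math
from typing import Iterable


def streaming_sdev(stream: Iterable[float]) -> float:
    """One-pass sample standard deviation with O(1) state."""
    n = 0        # N: incremental counter
    mean = 0.0   # M: incremental mean
    s = 0.0      # S: incremental sum of squared deviations from the mean
    for x in stream:
        n += 1
        new_mean = mean + (x - mean) / n
        s += (x - mean) * (x - new_mean)
        mean = new_mean
    if n < 2:
        # Assumption of this sketch: the sample deviation is undefined for n < 2.
        raise ValueError("need at least two values")
    return math.sqrt(s / (n - 1))
```

In a true streaming setting one would emit `math.sqrt(s / (n - 1))` after each update instead of only at the end; the state kept between updates is the same.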
17 Streams vs. regular data
- Stream processing should keep up with data flow
  - Make computations so fast that they keep up with the flow (on average)
- It should not be catastrophic if the miner cannot keep up with the flow:
  - Drop input data values
  - Use approximate computations such as sampling
- Asynchronous logging in (many) files often possible
  - At least during limited time (log files)
  - Perhaps of processed (reduced) streaming data
18 Requirements for Stream Mining
- Allow for continuous evolution of mined data
  - Traditional batch mining is for static data
  - Continuous mining makes mined data into a stream too => concept drift
- Often mining over different kinds of windows of streams
  - E.g. sliding or tumbling windows
  - Windows of limited size
- Often only statistics summaries needed (synopses, sketches)
19 Stream windows
- Limited-size section of stream stored temporarily in DSMS
  - Regular database queries can be made over these windows
- Need window operator to chop stream into segments
- Window size (sz) based on:
  - Number of elements, a counting window
    - E.g. last 10 elements, i.e. window has fixed size of 10 elements
  - A time window
    - E.g. elements of the last second, i.e. window contains all events processed during the last second
  - A landmark window
    - All events from time t_0 in window, c.f. growing log file
  - A decaying window
    - Decrease importance of measurement by multiplying with factor λ
    - Remove when importance below threshold
20 Stream windows
- Windows may also have a stride (str): rule for how fast they move forward, e.g.
  - 10 elements for a 10-element counting window: a tumbling window
  - 2 elements for a 10-element counting window: a sliding window
  - 100 elements for a 10-element counting window: a sampling window
- Windows need not always be materialized
  - E.g. often sufficient to keep statistics materialized
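The relation between window size and stride for counting windows can be illustrated with a small Python generator. The name `count_windows` and the test stream are assumptions for this sketch; a real DSMS window operator would typically also handle time-based and decaying windows.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def count_windows(stream: Iterable[int], size: int, stride: int) -> Iterator[List[int]]:
    """Emit counting windows of `size` elements that start every `stride` elements."""
    window: Deque[int] = deque(maxlen=size)  # keeps only the last `size` elements
    for i, x in enumerate(stream):
        window.append(x)
        start = i + 1 - size  # index of the first element in the current window
        if start >= 0 and start % stride == 0:
            yield list(window)


tumbling = list(count_windows(range(10), size=4, stride=4))  # stride == size
sliding = list(count_windows(range(10), size=4, stride=2))   # stride < size
sampling = list(count_windows(range(10), size=4, stride=6))  # stride > size
```

With stride equal to the size the windows partition the stream, with a smaller stride they overlap, and with a larger stride some elements are never part of any window.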
21 Continuous (standing) queries over streams from expressways
Schema for stream CarLocStr of tuples:
    CarLocStr(car_id,   /* unique car identifier */
              speed,    /* speed of the car */
              exp_way,  /* expressway: */
              lane,     /* lane: 0,1,2,3 */
              dir,      /* direction: 0(east), 1(west) */
              x_pos);   /* coordinate in expressway */
CQL query to continuously get the cars in a window of the stream every 30 seconds:
    SELECT DISTINCT car_id
    FROM CarLocStr [RANGE 30 SECONDS];
Get the average speed of vehicles per expressway, direction, segment each 5 minutes:
    SELECT exp_way, dir, seg, AVG(speed) AS speed
    FROM CarSegStr [RANGE 5 MINUTES]
    GROUP BY exp_way, dir, seg;
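To make the window semantics of the first query more concrete, here is a rough Python emulation of a time-range window with a DISTINCT projection. It is not how a CQL engine is implemented; the class names `CarLoc` and `RangeWindow`, the explicit timestamp field `ts`, and the assumption that tuples arrive in timestamp order are all introduced for this sketch.

```python
from collections import deque
from typing import Deque, NamedTuple, Set


class CarLoc(NamedTuple):
    ts: float     # arrival time in seconds (assumed; not part of the slide's schema)
    car_id: int
    speed: float


class RangeWindow:
    """Keeps the tuples whose timestamps lie within the last `range_seconds`."""

    def __init__(self, range_seconds: float) -> None:
        self.range_seconds = range_seconds
        self.items: Deque[CarLoc] = deque()

    def insert(self, t: CarLoc) -> None:
        self.items.append(t)
        # Evict tuples that have fallen out of the time range.
        while self.items and self.items[0].ts <= t.ts - self.range_seconds:
            self.items.popleft()

    def distinct_car_ids(self) -> Set[int]:
        # Corresponds to SELECT DISTINCT car_id over the current window.
        return {t.car_id for t in self.items}
```

A usage example: after inserting tuples at times 0, 10, 20, and 35 for cars 1, 2, 1, and 3 into `RangeWindow(30)`, the tuple from time 0 has expired and `distinct_car_ids()` reports cars 1, 2, and 3.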
22 Denstream
- Streamed DBScan
- Published: 2006 SIAM Conf. on Data Mining
- Regular DBScan:
  - DBScan saves cluster memberships of a static database per member object in the database by scanning the database looking for pairs of objects close to each other
  - Database accessed many times
  - For scalable processing a spatial index must be used to index points in hyperspace and answer nearest-neighbor queries
23 Denstream
- One-pass processing
- Limited memory
- Evolving clustering => concept drift, i.e. not static cluster membership
- Indefinite stream => cluster memberships not stored in database objects
- No assumption of number of clusters
- Transient clusters fade in and fade out of decaying window
- Clusters of arbitrary shape allowed
- Good at handling outliers
24 Core micro-clusters
- Core point: anchor in cluster of other points
- Core micro-cluster (c-micro-cluster): an area covering points close to (epsilon similar) a core point
- Cluster defined as a set of c-micro-clusters
25 Potential micro-clusters
- Outlier o-micro-cluster
  - New point not included in any micro-cluster
- Potential p-micro-cluster
  - Several clustered points not large enough to form a micro-cluster
- When a new data point arrives:
  1. Try to merge with nearest p-micro-cluster
  2. Try to merge with nearest o-micro-cluster
     If so, convert o-micro-cluster to p-micro-cluster
  3. Otherwise make new o-micro-cluster
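The three-step rule for an arriving point can be sketched as below. This is a simplified illustration, not the published DenStream procedure: the merge test uses the distance to the micro-cluster center against `eps`, promotion from o- to p-micro-cluster uses a hypothetical weight threshold `min_weight`, and weight decay is left out. All names are assumptions of this sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(eq=False)  # compare by identity so list.remove() targets this exact object
class MicroCluster:
    center: List[float]
    weight: float = 1.0

    def distance(self, p: Sequence[float]) -> float:
        return math.dist(self.center, p)

    def absorb(self, p: Sequence[float]) -> None:
        # Incrementally update the center as the mean of the absorbed points.
        self.weight += 1.0
        self.center = [c + (x - c) / self.weight for c, x in zip(self.center, p)]


def nearest(clusters: List[MicroCluster], p: Sequence[float]) -> Optional[MicroCluster]:
    return min(clusters, key=lambda mc: mc.distance(p), default=None)


def insert_point(p: Sequence[float], p_clusters: List[MicroCluster],
                 o_clusters: List[MicroCluster], eps: float, min_weight: float) -> str:
    # 1. Try to merge with the nearest p-micro-cluster.
    mc = nearest(p_clusters, p)
    if mc is not None and mc.distance(p) <= eps:
        mc.absorb(p)
        return "merged-p"
    # 2. Try to merge with the nearest o-micro-cluster, promoting it if heavy enough.
    mc = nearest(o_clusters, p)
    if mc is not None and mc.distance(p) <= eps:
        mc.absorb(p)
        if mc.weight >= min_weight:
            o_clusters.remove(mc)
            p_clusters.append(mc)
            return "promoted"
        return "merged-o"
    # 3. Otherwise the point starts a new o-micro-cluster.
    o_clusters.append(MicroCluster(center=list(p)))
    return "new-o"
```

Each arriving point touches at most two micro-clusters and is then discarded, which is what keeps the processing one-pass with bounded memory per point.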
26 Decaying p-micro-cluster windows
- Maintain weight C_p per p-micro-cluster
- Periodically (each T_p time period) decrease weight exponentially by multiplying old weight with λ
- Weight lower than threshold => delete, i.e. decaying window
- Decaying window of micro-clusters
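The periodic decay step can be written as a single pass over the stored weights. The sketch below follows the slide's formulation (multiply by a factor λ with 0 < λ < 1, delete below a threshold); the function name, the dictionary representation, and the concrete numbers are assumptions.

```python
from typing import Dict


def decay_and_prune(weights: Dict[str, float], lam: float, threshold: float) -> Dict[str, float]:
    """Multiply every weight by `lam` and drop entries that fall below `threshold`."""
    decayed = {key: w * lam for key, w in weights.items()}
    return {key: w for key, w in decayed.items() if w >= threshold}


# Hypothetical weights of two p-micro-clusters, decayed once per time period.
weights = {"a": 8.0, "b": 1.0}
after_one_period = decay_and_prune(weights, lam=0.5, threshold=1.0)
```

Here cluster "b" is deleted after the first period, while "a" survives three periods unless new points raise its weight in between.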
27 Dealing with outliers
- o-micro-clusters are important for forming new evolving p-micro-clusters
  - Keep o-micro-clusters around
- Keeping all o-micro-clusters may be expensive
  - Delete o-micro-clusters by a special exponential pruning rule (decaying window)
- Decaying window method proven to make # micro-clusters grow logarithmically with stream size
  - Good, but not sufficient for indefinite stream
  - Shown to grow very slowly though
28 Growth of micro-clusters
29 Forming c-micro-cluster sets
- Regularly (e.g. each time period) the user demands forming current c-micro-clusters from the current p-micro-clusters
- Done by running regular DBSCAN over the p-micro-clusters
  - Center of each p-micro-cluster regarded as point
  - Close when p-micro-clusters intersect
  - => Clusters formed
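As a rough illustration of this offline step, the sketch below groups p-micro-clusters that intersect, directly or through a chain of intersecting neighbors. It is a simplification of DBSCAN: micro-clusters are given as hypothetical (center, radius) pairs, two of them count as close when the distance between centers is at most the sum of their radii, and DBSCAN's density (weight) condition is omitted.

```python
import math
from typing import List, Sequence, Tuple

MicroCluster = Tuple[Tuple[float, ...], float]  # (center, radius), assumed representation


def form_clusters(mcs: Sequence[MicroCluster]) -> List[List[int]]:
    """Return groups of micro-cluster indices that are connected by intersection."""
    unvisited = set(range(len(mcs)))
    clusters: List[List[int]] = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        group, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            center_i, radius_i = mcs[i]
            # Neighbors are unvisited micro-clusters that intersect micro-cluster i.
            near = [j for j in unvisited
                    if math.dist(center_i, mcs[j][0]) <= radius_i + mcs[j][1]]
            unvisited.difference_update(near)
            group.extend(near)
            frontier.extend(near)
        clusters.append(sorted(group))
    return clusters
```

Because the input is the small set of p-micro-clusters rather than the raw stream, this step can afford to look at the data several times, unlike the on-line phase.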
30 Bloom-filters
- Problem: testing membership in extremely large data sets
  - E.g. all non-spam addresses
- No false negatives, i.e. if address is in set then OK guaranteed
- Few false positives allowed, i.e. a small number of spams may sneak through
- See section 4.3.2
31 Bloom-filters
- Main idea:
  - Assume bitmap B of size s for the objects
  - Hash each object x to h in [1,s]
  - Set bit B[h]
- Smaller than sorted table in memory:
  - 10^9 addresses of 40 bytes => 40 GByte if set to be stored sorted in memory
  - Would be expensive to extend
  - Bitmap could have e.g. 10^9/8 = 125 MBytes
- May have false positives
  - Since hash function not perfect
32 Lowering false positives
- Small bitmap => many false positives
- Idea: hash with several independent hash functions h_1(x), h_2(x) and set bits correspondingly (logical OR)
- For each new x check that all h_i(x) are set
  - If so => match
- Chance of false positives decreases exponentially with number of h_i
  - Assumes independent h_i(x)
  - h_i(x) and h_j(x) have no common factors if i ≠ j
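A minimal Bloom filter with several hash functions could be sketched as follows. The class and method names are assumptions, and the hash functions are derived by salting SHA-256 with the function index, which is just one convenient way to get approximately independent hashes in Python; the slides do not prescribe a hash family.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, size_bits: int, num_hashes: int) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)  # bitmap B, initially all zero

    def _positions(self, item: str) -> Iterator[int]:
        # h_i(x): salt the item with the index i to obtain distinct hash functions.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # set bit B[h_i(x)]

    def might_contain(self, item: str) -> bool:
        # True if all h_i(x) bits are set: a member, or possibly a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Every added item is always reported as present (no false negatives), while an item that was never added is reported as present only if all of its bit positions happen to have been set by other items.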
33 Articles
- L. Golab and T. Özsu: Issues in Data Stream Management, SIGMOD Record, 32(2), June 2003
- Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy: "Mining Data Streams: A Review", ACM SIGMOD Record, Vol. 34, No. 2, June 2005, pp
More informationDatabase Optimization
Database Optimization June 9 2009 A brief overview of database optimization techniques for the database developer. Database optimization techniques include RDBMS query execution strategies, cost estimation,
More informationMineração de Dados Aplicada
Data Exploration August, 9 th 2017 DCC ICEx UFMG Summary of the last session Data mining Data mining is an empiricism; It can be seen as a generalization of querying; It lacks a unified theory; It implies
More informationAn Overview of various methodologies used in Data set Preparation for Data mining Analysis
An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of
More informationOverview of Implementing Relational Operators and Query Evaluation
Overview of Implementing Relational Operators and Query Evaluation Chapter 12 Motivation: Evaluating Queries The same query can be evaluated in different ways. The evaluation strategy (plan) can make orders
More informationQuery Processing. Lecture #10. Andy Pavlo Computer Science Carnegie Mellon Univ. Database Systems / Fall 2018
Query Processing Lecture #10 Database Systems 15-445/15-645 Fall 2018 AP Andy Pavlo Computer Science Carnegie Mellon Univ. 2 ADMINISTRIVIA Project #2 Checkpoint #1 is due Monday October 9 th @ 11:59pm
More informationOverlay and P2P Networks. Unstructured networks. Prof. Sasu Tarkoma
Overlay and P2P Networks Unstructured networks Prof. Sasu Tarkoma 20.1.2014 Contents P2P index revisited Unstructured networks Gnutella Bloom filters BitTorrent Freenet Summary of unstructured networks
More informationNesnelerin İnternetinde Veri Analizi
Bölüm 4. Frequent Patterns in Data Streams w3.gazi.edu.tr/~suatozdemir What Is Pattern Discovery? What are patterns? Patterns: A set of items, subsequences, or substructures that occur frequently together
More informationPARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH
PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH 1 INTRODUCTION In centralized database: Data is located in one place (one server) All DBMS functionalities are done by that server
More informationLesson n.11 Data Structures for P2P Systems: Bloom Filters, Merkle Trees
Lesson n.11 : Bloom Filters, Merkle Trees Didactic Material Tutorial on Moodle 15/11/2013 1 SET MEMBERSHIP PROBLEM Let us consider the set S={s 1,s 2,...,s n } of n elements chosen from a very large universe
More informationDBSCAN. Presented by: Garrett Poppe
DBSCAN Presented by: Garrett Poppe A density-based algorithm for discovering clusters in large spatial databases with noise by Martin Ester, Hans-peter Kriegel, Jörg S, Xiaowei Xu Slides adapted from resources
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining
More informationGRID Stream Database Managent for Scientific Applications
GRID Stream Database Managent for Scientific Applications Milena Ivanova (Koparanova) and Tore Risch IT Department, Uppsala University, Sweden Outline Motivation Stream Data Management Computational GRIDs
More informationLazyBase: Trading freshness and performance in a scalable database
LazyBase: Trading freshness and performance in a scalable database (EuroSys 2012) Jim Cipar, Greg Ganger, *Kimberly Keeton, *Craig A. N. Soules, *Brad Morrey, *Alistair Veitch PARALLEL DATA LABORATORY
More informationCopyright 2016 Ramez Elmasri and Shamkant B. Navathe
CHAPTER 19 Query Optimization Introduction Query optimization Conducted by a query optimizer in a DBMS Goal: select best available strategy for executing query Based on information available Most RDBMSs
More informationCarnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615
Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos A. Pavlo Lecture#14(b): Implementation of Relational Operations Administrivia HW4 is due today. HW5 is out. Faloutsos/Pavlo
More informationManaging and mining (streaming) sensor data
Petr Čížek Artificial Intelligence Center Czech Technical University in Prague November 3, 2016 Petr Čížek VPD 1 / 1 Stream data mining / stream data querying Problem definition Data can not be stored
More informationTribhuvan University Institute of Science and Technology MODEL QUESTION
MODEL QUESTION 1. Suppose that a data warehouse for Big University consists of four dimensions: student, course, semester, and instructor, and two measures count and avg-grade. When at the lowest conceptual
More informationQuery Evaluation Overview, cont.
Query Evaluation Overview, cont. Lecture 9 Feb. 29, 2016 Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke Architecture of a DBMS Query Compiler Execution Engine Index/File/Record
More information