Epilog: Further Topics

Similar documents
Chapter 5: Outlier Detection

MapReduce: Algorithm Design for Relational Operations

Chapter 7: Numerical Prediction

Big Data Management and NoSQL Databases

Challenges for Data Driven Systems

Kapitel 4: Clustering

Knowledge Discovery in Databases II Summer Term 2017

Databases 2 (VU) ( / )

Introduction to Data Management CSE 344

Big Data Analytics. Rasoul Karimi

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Data Partitioning and MapReduce

Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Introduction to Data Management CSE 344

Lecture Notes to Big Data Management and Analytics Winter Term 2018/2019 Stream Processing

Algorithms for Grid Graphs in the MapReduce Model

Data Mining Concepts & Tasks

Sublinear Algorithms for Big Data Analysis

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Online Social Networks and Media. Community detection

Efficient Aggregation for Graph Summarization

Statistical Pattern Recognition

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Overview of Clustering

CSE 701: LARGE-SCALE GRAPH MINING. A. Erdem Sariyuce

Distributed computing: index building and use

Introduction to Data Mining

D B M G Data Base and Data Mining Group of Politecnico di Torino

Nesnelerin İnternetinde Veri Analizi

Subgraph Isomorphism. Artem Maksov, Yong Li, Reese Butler 03/04/2015

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Managing and mining (streaming) sensor data

Introduction to Data Mining and Data Analytics

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Distributed computing: index building and use

Frequent Pattern Mining in Data Streams. Raymond Martin

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

CompSci 516: Database Systems

An overview of Graph Categories and Graph Primitives

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

Data Mining in Bioinformatics Day 5: Graph Mining

Statistical Pattern Recognition

Graph Analytics in the Big Data Era

CISC 7610 Lecture 2b The beginnings of NoSQL

Data Mining Concepts & Tasks

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Knowledge Discovery in Databases II Summer Semester 2018

Database Systems CSE 414

Knowledge Discovery in Databases

Statistical Pattern Recognition

Data mining, 4 cu Lecture 8:

Introduction to Web Clustering

TI2736-B Big Data Processing. Claudia Hauff

Introduction to Computer Science

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Knowledge Discovery in Databases II Winter Term 2015/2016. Optional Lecture: Pattern Mining & High-D Data Mining

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Exploratory data analysis for microarrays

CS6220: DATA MINING TECHNIQUES

CAIM: Cerca i Anàlisi d Informació Massiva

Spatial Outlier Detection

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

B561 Advanced Database Concepts Streaming Model. Qin Zhang 1-1

DS504/CS586: Big Data Analytics Big Data Clustering II

Local Context Selection for Outlier Ranking in Graphs with Multiple Numeric Node Attributes

Data Mining in Bioinformatics Day 3: Graph Mining

Data Analytics with HPC. Data Streaming

Workload Characterization Techniques

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

Mining Social Network Graphs

Graph Mining and Social Network Analysis

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

Graphs (Part II) Shannon Quinn

Data Preprocessing. Slides by: Shree Jaswal

Introduction to Parallel Computing

Non-exhaustive, Overlapping k-means

Embedded Technosolutions

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

Spectral Clustering and Community Detection in Labeled Graphs

Sublinear Models for Streaming and/or Distributed Data

Image Processing. Image Features

DS504/CS586: Big Data Analytics Big Data Clustering II

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Advanced Data Management Technologies Written Exam

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges

Table Of Contents: xix Foreword to Second Edition

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

Clustering CS 550: Machine Learning

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

K Nearest Neighbor Wrap Up K- Means Clustering. Slides adapted from Prof. Carpuat

Gene Clustering & Classification

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

ECLT 5810 Clustering

Knowledge Discovery. URL - Spring 2018 CS - MIA 1/22

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Next Steps in Data Mining. Sistemas de Apoio à Decisão Cláudia Antunes

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Transcription:

Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Epilog: Further Topics Lecture: Prof. Dr. Thomas Seidl Tutorials: Julian Busch, Evgeniy Faerman, Florian Richter, Klaus Schmid Knowledge Discovery in Databases I: Epilog 1

We are drowning in data but starving for information Exponential grows in data GPS DB RF-ID http://www.popsci.com/announcements/article/2011-10/november-2011-data-power J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org Data contains value and knowledge Big Data Management and Analytics 2

Four Vs of Big Data Big Data Management and Analytics 3

Cluster computing Vertical scalability (scale up) Increase capacity by adding more resources to single machine (processors, storage, memory) Horizontal scalability (scale-out) Increase capacity by adding of more machines https://www.greentree.com/latest-news/avoiding-cu mulus-congestus Big Data Management and Analytics 4

Cluster computing Distributed storage: Distributed file systems (GFS (google), HDFS (hadoop), S3 Amazon, ) NoSQL Databases CONSISTENCY AVAILABILITY PARTITION TOLERANCE CAP Theorem: Any shared-data system can have at most two of the three desired properties! Big Data Management and Analytics 5

Cluster computing MapReduce Programming abstraction for parallel processing of large datasets proposed by Google Programmer specifies the program as sequence of consecutive map and reduce functions Programs are automatically parallelized and executed on cluster Runtime system takes care of data partitioning and parallel execution Big Data Management and Analytics 6

varying data rate constant data rate DATABASE Stream Mining: Applications and tasks Classification Modeling Outlier detection Dynamic Stream Data 7

Stream properties and resulting requirements Infinite stream Single scan, missing random access Limited time Fast access, cheap methods, sampling Limited memory Compression Evolving distribution Aging & updating models Noisy data Noise handling Varying data rates varying time allowances Handle lowest time allowance, reduce idle times Check list for stream algorithms: Single scan, missing random access Fast access, cheap methods Compression Aging & updating models Noise handling Handle varying data rates Dynamic Stream Data 8

Definitions Stream A stream S: N 0 N 0 Q: i (t i, o i ) is an infinite sequence of objects o i Q from a d-dimensional input space Q and t i N 0, t i t j i < j is the discrete arrival time of object o i. Stream algorithms Online algorithms the input is given one at a time Budget algorithms tailored to a specific real-time budget b Anytime algorithms provide a result after any amount of processing time Inter-arrival time The inter-arrival time between two consecutive objects o i and o i+1 is denoted as Δt i = t i+1 t i, i.e. 0 Δt i N. o i o i+1 Constant and varying streams A stream S is called constant Δt i = Δt j i, j Δt t i i t i+1 Dynamic Stream Data 9

Sampling and Buffering Sampling: draw a representative sample of the data distribution Apply a strategy to pick objects (online random sampling) Maintain a finite sample A A F A B B A F F A A F A B A Good for computing statistics like expected values Impossible for anomaly detection or practical tasks like sorting Buffering: insert newly incoming objects into a buffer (FiFo queue) and process objects consecutively from the buffer Can reduce idle times A A F A B B A F F A A F A B A Buffer Probability of failure (buffer overflow) depends on buffer size, budget, arrival distribution and length of the data stream Dynamic Stream Data 10

Stream statistics: mean, standard deviation, correlation Let o i,j R be the value of object o i in dimension j The sample mean μ j of all objects o 1 o n seen so far is n μ j = σ i=1 n The standard deviation σ j of dimension j can be computed as σ j = σ i=1 n n 2 o i,j o i,j σ i=1 n n o i,j 2 All necessary statistics (n, σo i,j, σo 2 i,j, σo i,j o i,k ) can be maintained incrementally Dynamic Stream Data 11

Aging mechanisms Landmark window Restart, reset Sliding window Binary weighting landmark t t Damped window Usually linear or exponential decay w now t Adaptive windowing Compression wrt. mean shift Dynamic Stream Data 12 12

Concept drift Dynamic Stream Data 13

Graph Mining Graphs, graphs everywhere! Chemical data analysis, proteins Biological pathways/networks Program control flow, traffic flow, work flow analysis XML, Web, social network analysis Graphs form a complex and expressive data type Trees, lattices, sequences, and items are degenerated graphs Different applications result in different kinds of graphs and tasks Diversity of graphs and tasks diversity of challenges Complexity of algorithms: many problems are of high complexity (NP-complete or even P-SPACE!) Social Network Graph (facebook, Dez 2010) Yeast Protein Interaction Network source: http://www.nrao.edu/pr/2004/gbtmolecules/ source: http://jchemed.chem.wisc.edu/ from H. Jeong et al Nature 411, 41 (2001) Graph Data 14

Graph Databases vs. Network Data Different applications result in different kinds of graphs and tasks E.g. chemical graphs: relatively small, repeating vertex labels E.g. large scale domains (web, computer networks, social networks): very big, vertex labels are distinct Diversity of graphs and tasks diversity of challenges Graph mining can be divided into two fundamental settings: Mining in a set of graphs, e.g.: Finding similar graphs Determining all frequent subgraphs Classification of graphs Mining in one single large graph, e.g.: How does the network behave? Determine striking patterns, e.g. homogeneous and connected components Graph Data 15

Graph Data: Mining Tasks Some interesting problems investigated in graph mining: How to measure similarity between graphs? How to find frequent patterns in a graph database? How does a real graph look like? How to identify groups in social networks? How to integrate additional information into graph mining techniques? Graph Data 16

Graph Similarity Similarity between objects basic requirement for mining and exploration Retrieval, Clustering, Classification, Many techniques rely on similarity/distance measures Traditional vector data: several distance functions introduced Euclidean Distance, Cosine Distance, Mahalanobis Distance, Similarity between graphs more complex Arbitrary permutation of nodes still results in same graph Computing, e.g., Frobenius norm ( entrywise Euclidean Distance) between two adjacency matrices not meaningful 0 1 1 0 g 1 M 1 = 1 0 1 1 1 1 0 0 g 2 0 1 0 0 v1 v3 v2 v4 v4 v3 v2 v1 M 2 = 0 1 0 0 1 0 1 1 0 1 0 1 0 1 1 0 g 1 g 2 but M 1 M 2 F =2 Graph Data 17

Frequent Subgraph Mining Input: Collection of undirected labeled graphs Aim: Determine all connected graphs that occur as subgraph in at least a given percentage (support) or number (frequency) of all graphs in DB Analogy to "traditional" frequent itemset mining: Each graph of the graph database represents a transaction Each subgraph represents an itemset Applications: As preprocessing: characterizing graph sets, discriminating different groups of graphs, classifying graphs, clustering graphs, building graph indices, facilitating similarity search Bioinformatics, computer vision, video indexing, chemical informatics E.g. frequent molecular fragments (e.g. in drug discovery) Graph Data 18

How does a real network look like? For sure: dependent on application, different graphs possible But: Many real world networks follow certain rules; some characteristics show up regularly Important applications: Detection of abnormal/interesting patterns Specify what is normal Development of graph generators Run experiments on synthetic but realistic graphs Simulation studies E.g. test next-generation internet protocol on graph similar to what Internet will look like a few years into the future Realism of samples Efficient testing on smaller samples of whole data Samples must still reflect original characteristic Graph Data 19

Clustering in Network Data Input: A graph Aim: Find clusters of vertices in the graph Related to "traditional" clustering (e.g. k-means): In traditional clustering, we cluster objects based on attribute data Here, we cluster objects based on graph data (information about relationships between the objects) Example: Given a social network, find groups of people that are densely connected by "friendship" edges Alice Mary John Bob Jeff Tina Paul Graph Data 20

Clustering in Network Data Many different applications: Friendship graph: find circles of friends Protein-/Gene-Interaction Network: find groups of highly interacting proteins/genes Internet graph: Find groups of websites with similar topics Internet snapshot source: the internet mapping project Extension: Integrated Clustering Graph Data Combined data sources: attribute information (for individual objects) and graph information, e.g. Social networks: users interests Gene networks: expression values Citation networks: conference/paper information Find clusters with homogenous attribute values and cohesive subgraph structure 21