SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds

Size: px
Start display at page:

Download "SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds"

Transcription

1 SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Erich Schubert, Michael Weiler, Hans-Peter Kriegel! Institute of Informatics Database Systems Group Ludwig-Maximilians-Universität München

2 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp 56 Term frequency :47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10

3 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp 56 Term frequency A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10

4 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp 56 Term frequency ? A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10

5 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp Pair { Facebook, WhatsApp } 56 Term frequency A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10

6 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp Pair { Facebook, WhatsApp } 56 Term frequency A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Facebook bought WhatsApp Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10

7 Problem description 1. Statistical significance score Popular topics trending topics (e.g. Obama) 2. Track interacting terms Facebook bought WhatsApp Edward Snowden traveled to Moscow 3. Scalability Efficient calculation for all terms and pairs Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 2/10

8 SigniTrend on textual streams tracking both: single terms and pairs A. Preprocessing (stopwords, stemming, duplicates) B. Trend detection cycle Temporal slicing for statistical aggregation Score all terms and pairs based on expectations from past slices C. Refinement with clustering Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 3/10

9 Trend detection cycle Terms and pairs Count frequency exceeds threshold? Trend candidates new alerting thresholds at the end of each time slice Update statistics Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 4/10

10 Update statistics for time slice t and term or pair e How many standard deviations is the current frequency x higher than its mean z (x t,e ):= x t,e µ t 1,e t-1,e Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 5/10

11 Update statistics for time slice t and term or pair e How many standard deviations is the current frequency x higher than its mean! z (x t,e ):= x pt,e EWMA t 1,e EWMVart 1,e Exponential weighted moving average/variance for continuous estimation on a stream [Finch09] 4 t,e x t,e EWMA t 1,e EWMA t,e EWMA t 1,e + 4 t,e EWMVar t,e (1 ) EWMVar t 1,e t,e { [Finch09] T. Finch. Incremental calculation of weighted mean and variance. Technical report, University of Cambridge, 2009 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 5/10

12 Significance and frequency for term Facebook Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 6/10

13 How to track statistics of all pairs efficiently? Problem: Too many terms and pairs to track everything 2013 News Dataset STEMMED TERMS Therefore, we designed an efficient hashing scheme (based on Bloom Filters and Heavy Hitters) for probabilistic upper-bound statistics!!! OBSERVED PAIRS TOTAL 56,661, ,430,059 UNIQUES 300,141 71,289,359 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 7/10

14 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 #1 #2 #3 #4 #5 #6 #7 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

15 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 h1 h2 45 ± ± 30 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

16 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 h1 h2 45 ± 30 2 ± 1 45 ± 30 2 ± 1 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

17 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 h1 h2 45 ± 30 2 ± 1 45 ± 30? 20 ± 10 2 ± 1 20 ± 10 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

18 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 write MAX (upper bound) h1 h2 45 ± 30 2 ± 1 45 ± 30 2 ± 1 20 ± 10 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

19 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 write MAX (upper bound) 45 ± 30 2 ± 1 45 ± 30 2 ± 1 20 ± 10 h1 h2 read {Obama, US}: min(45±30, 20±10) = 20±10 read MIN (lowest collision) Upper-bound estimate for mean and its variance Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

20 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 write MAX (upper bound) 45 ± 30 2 ± 1 45 ± 30 2 ± 1 20 ± 10 read {Obama, US}: min(45±30, 20±10) = 20±10 read MIN (lowest collision) Upper-bound estimate for mean and its variance Performance on news dataset: 104s/day with a Raspberry-Pi Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

21 Artificial trends evaluation Inject artificial words with frequency α e.g. Obama meets <X123> Netanyahu 100% Recall of Injected Trends 80% 60% 40% 20% 0% Hash Table Bits l α=0.01 α=0.03 α=0.05 α=0.07 α=0.09 α=0.11 α=0.13 α=0.15 Hash table size large enough recall saturation Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 9/10

22 Refinement & clustering Inverted index (Apache Lucene) to verify trend candidates and measure exactly (without hashing) for precise reporting (false-positives can be eliminated) Single Link clustering with Ward of remaining trends (similarity matrix is built with the exact significance of all pairs) Future work: include topic modeling techniques (e.g. plsi, LDA) Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 10/10

23 Thank You! Questions? Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning

More information

TiP: Analyzing Periodic Time Series Patterns

TiP: Analyzing Periodic Time Series Patterns ip: Analyzing Periodic ime eries Patterns homas Bernecker, Hans-Peter Kriegel, Peer Kröger, and Matthias Renz Institute for Informatics, Ludwig-Maximilians-Universität München Oettingenstr. 67, 80538 München,

More information

Detection of Outliers

Detection of Outliers Detection of Outliers TNM033 - Data Mining by Anton Auoja, Albert Backenhof & Mikael Dalkvist Holy Outliers, Batman!! An outlying observation, or outlier, is one that appears to deviate markedly from other

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

SourcererCC -- Scaling Code Clone Detection to Big-Code

SourcererCC -- Scaling Code Clone Detection to Big-Code SourcererCC -- Scaling Code Clone Detection to Big-Code What did this paper do? SourcererCC a token-based clone detector, that can detect both exact and near-miss clones from large inter project repositories

More information

Replication on Affinity Propagation: Clustering by Passing Messages Between Data Points

Replication on Affinity Propagation: Clustering by Passing Messages Between Data Points 1 Replication on Affinity Propagation: Clustering by Passing Messages Between Data Points Zhe Zhao Abstract In this project, I choose the paper, Clustering by Passing Messages Between Data Points [1],

More information

Chapter 5: Outlier Detection

Chapter 5: Outlier Detection Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 5: Outlier Detection Lecture: Prof. Dr.

More information

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition What s the BIG deal?! 2011 2011 2008 2010 2012 What s the BIG deal?! (Gartner Hype Cycle) What s the

More information

Data Analytics with HPC. Data Streaming

Data Analytics with HPC. Data Streaming Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Research Students Lecture Series 2015

Research Students Lecture Series 2015 Research Students Lecture Series 215 Analyse your big data with this one weird probabilistic approach! Or: applied probabilistic algorithms in 5 easy pieces Advait Sarkar advait.sarkar@cl.cam.ac.uk Research

More information

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Effective Latent Space Graph-based Re-ranking Model with Global Consistency Effective Latent Space Graph-based Re-ranking Model with Global Consistency Feb. 12, 2009 1 Outline Introduction Related work Methodology Graph-based re-ranking model Learning a latent space graph A case

More information

Epilog: Further Topics

Epilog: Further Topics Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Epilog: Further Topics Lecture: Prof. Dr. Thomas

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Course : Data mining

Course : Data mining Course : Data mining Lecture : Mining data streams Aristides Gionis Department of Computer Science Aalto University visiting in Sapienza University of Rome fall 2016 reading assignment LRU book: chapter

More information

Online Template Matching Over a Stream of Digitized Documents

Online Template Matching Over a Stream of Digitized Documents Online Template Matching Over a Stream of Digitized Documents Michael Stockerl, Christoph Ringlstetter Gini Gmbh Eirini Ntoutsi, Matthias Schubert, Hans Peter Kriegel University of Munich (LMU) CS Departement

More information

Introduction to ThingWorx

Introduction to ThingWorx Introduction to ThingWorx Introduction to Internet of Things (2min) What are the objectives of this video? Figure 1 Welcome to ThingWorx (THWX), in this short video we will introduce you to the main components

More information

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids Marek Lipczak Arash Koushkestani Evangelos Milios Problem definition The goal of Entity Recognition and Disambiguation

More information

Down the event-driven road: Experiences of integrating streaming into analytic data platforms

Down the event-driven road: Experiences of integrating streaming into analytic data platforms Down the event-driven road: Experiences of integrating streaming into analytic data platforms Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH Confluent Meetup Munich, 8.10.2018 Integrate

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Online template matching over a stream of digitized documents

Online template matching over a stream of digitized documents Online template matching over a stream of digitized documents Michael Stockerl Gini GmbH München, Germany stockerl@cip.ifi.lmu.de Christoph Ringlstetter Gini GmbH München, Germany c.ringlstetter@gini.net

More information

ProUD: Probabilistic Ranking in Uncertain Databases

ProUD: Probabilistic Ranking in Uncertain Databases Proc. 20th Int. Conf. on Scientific and Statistical Database Management (SSDBM'08), Hong Kong, China, 2008. ProUD: Probabilistic Ranking in Uncertain Databases Thomas Bernecker, Hans-Peter Kriegel, Matthias

More information

Axibase Time-Series Database. Non-relational database for storing and analyzing large volumes of metrics collected at high-frequency

Axibase Time-Series Database. Non-relational database for storing and analyzing large volumes of metrics collected at high-frequency Axibase Time-Series Database Non-relational database for storing and analyzing large volumes of metrics collected at high-frequency What is a Time-Series Database? A time series database (TSDB) is a software

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

A Library and Proxy for SPDY

A Library and Proxy for SPDY A Library and Proxy for SPDY Interdisciplinary Project Andrey Uzunov Chair for Network Architectures and Services Department of Informatics Technische Universität München April 3, 2013 Andrey Uzunov (TUM)

More information

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch Nick Pentreath Nov / 14 / 16 Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch About @MLnick Principal Engineer, IBM Apache Spark PMC Focused on machine learning

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Heavy-Hitter Detection Entirely in the Data Plane

Heavy-Hitter Detection Entirely in the Data Plane Heavy-Hitter Detection Entirely in the Data Plane VIBHAALAKSHMI SIVARAMAN SRINIVAS NARAYANA, ORI ROTTENSTREICH, MUTHU MUTHUKRSISHNAN, JENNIFER REXFORD 1 Heavy Hitter Flows Flows above a certain threshold

More information

KLEE: A Framework for Distributed Top-k Query Algorithms

KLEE: A Framework for Distributed Top-k Query Algorithms KLEE: A Framework for Distributed Top-k Query Algorithms Sebastian Michel Max-Planck Institute for Informatics Saarbrücken, Germany smichel@mpi-inf.mpg.de Peter Triantafillou RACTI / Univ. of Patras Rio,

More information

Prioritizing Attention in Fast Data

Prioritizing Attention in Fast Data Prioritizing ttention in Fast Data Principles and Promise Peter ailis Edward Gan Kexin Rong Sahaana Suri CIDR 2017 Edward Kexin Sahaana Edward Kexin Sahaana Firas Deepak John Matei Tony Edward Kexin Ihab

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining

More information

Learning Similarity Metrics for Event Identification in Social Media. Hila Becker, Luis Gravano

Learning Similarity Metrics for Event Identification in Social Media. Hila Becker, Luis Gravano Learning Similarity Metrics for Event Identification in Social Media Hila Becker, Luis Gravano Columbia University Mor Naaman Rutgers University Event Content in Social Media Sites Event Content in Social

More information

Privacy Preserving Probabilistic Record Linkage

Privacy Preserving Probabilistic Record Linkage Privacy Preserving Probabilistic Record Linkage Duncan Smith (Duncan.G.Smith@Manchester.ac.uk) Natalie Shlomo (Natalie.Shlomo@Manchester.ac.uk) Social Statistics, School of Social Sciences University of

More information

Overview Problem Statement Related Work KLEE The Histogram Bloom Structure Candidate Filtering Evaluation Conclusion / Future Work

Overview Problem Statement Related Work KLEE The Histogram Bloom Structure Candidate Filtering Evaluation Conclusion / Future Work Sebastian Michel Max-Planck Institute for Informatics Saarbrücken, Germany smichel@mpi-inf.mpg.de KLEE: Approximate Distributed Top-k Query Algorithms Peter Triantafillou RACTI / Univ. of Patras Rio, Greece

More information

13 Frequent Itemsets and Bloom Filters

13 Frequent Itemsets and Bloom Filters 13 Frequent Itemsets and Bloom Filters A classic problem in data mining is association rule mining. The basic problem is posed as follows: We have a large set of m tuples {T 1, T,..., T m }, each tuple

More information

Chapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved.

Chapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved. Chapter 27 Hashing 1 Objectives To know what hashing is for ( 27.3). To obtain the hash code for an object and design the hash function to map a key to an index ( 27.4). To handle collisions using open

More information

CSE 373 Autumn 2012: Midterm #2 (closed book, closed notes, NO calculators allowed)

CSE 373 Autumn 2012: Midterm #2 (closed book, closed notes, NO calculators allowed) Name: Sample Solution Email address: CSE 373 Autumn 0: Midterm # (closed book, closed notes, NO calculators allowed) Instructions: Read the directions for each question carefully before answering. We may

More information

Finding Similar Items:Nearest Neighbor Search

Finding Similar Items:Nearest Neighbor Search Finding Similar Items:Nearest Neighbor Search Barna Saha February 23, 2016 Finding Similar Items A fundamental data mining task Finding Similar Items A fundamental data mining task May want to find whether

More information

Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems. SkimpyStash: Key Value Store on Flash-based Storage

Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems. SkimpyStash: Key Value Store on Flash-based Storage ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems SkimpyStash: Key Value

More information

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala

Master Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala Master Project Various Aspects of Recommender Systems May 2nd, 2017 Master project SS17 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue

More information

Goal of this document: A simple yet effective

Goal of this document: A simple yet effective INTRODUCTION TO ELK STACK Goal of this document: A simple yet effective document for folks who want to learn basics of ELK (Elasticsearch, Logstash and Kibana) without any prior knowledge. Introduction:

More information

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype? Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/

More information

Pattern Mining in Frequent Dynamic Subgraphs

Pattern Mining in Frequent Dynamic Subgraphs Pattern Mining in Frequent Dynamic Subgraphs Karsten M. Borgwardt, Hans-Peter Kriegel, Peter Wackersreuther Institute of Computer Science Ludwig-Maximilians-Universität Munich, Germany kb kriegel wackersr@dbs.ifi.lmu.de

More information

Tools for Social Networking Infrastructures

Tools for Social Networking Infrastructures Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes

More information

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please)

Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please) Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Spring 2017, Prakash Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please) Reminders:

More information

GraphCEP Real-Time Data Analytics Using Parallel Complex Event and Graph Processing

GraphCEP Real-Time Data Analytics Using Parallel Complex Event and Graph Processing Institute of Parallel and Distributed Systems () Universitätsstraße 38 D-70569 Stuttgart GraphCEP Real-Time Data Analytics Using Parallel Complex Event and Graph Processing Ruben Mayer, Christian Mayer,

More information

Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing

Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing SIGMOD, June -5,, Houston, TX, USA Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing Yang Zhou, Tong Yang, Jie Jiang, Bin Cui, Minlan Yu, Xiaoming Li, Steve Uhlig Peking University.

More information

"GET /cgi-bin/purchase?itemid=109agfe111;ypcat%20passwd mail 200

GET /cgi-bin/purchase?itemid=109agfe111;ypcat%20passwd mail 200 128.111.41.15 "GET /cgi-bin/purchase? itemid=1a6f62e612&cc=mastercard" 200 128.111.43.24 "GET /cgi-bin/purchase?itemid=61d2b836c0&cc=visa" 200 128.111.48.69 "GET /cgi-bin/purchase? itemid=a625f27110&cc=mastercard"

More information

Cross Reference Strategies for Cooperative Modalities

Cross Reference Strategies for Cooperative Modalities Cross Reference Strategies for Cooperative Modalities D.SRIKAR*1 CH.S.V.V.S.N.MURTHY*2 Department of Computer Science and Engineering, Sri Sai Aditya institute of Science and Technology Department of Information

More information

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:

More information

Building A Billion Spatio-Temporal Object Search and Visualization Platform

Building A Billion Spatio-Temporal Object Search and Visualization Platform 2017 2 nd International Symposium on Spatiotemporal Computing Harvard University Building A Billion Spatio-Temporal Object Search and Visualization Platform Devika Kakkar, Benjamin Lewis Goal Develop a

More information

SCREAM: Sketch Resource Allocation for Software-defined Measurement

SCREAM: Sketch Resource Allocation for Software-defined Measurement SCREAM: Sketch Resource Allocation for Software-defined Measurement (CoNEXT 15) Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat Measurement is Crucial for Network Management Network Management

More information

Anomaly Detection on Data Streams with High Dimensional Data Environment

Anomaly Detection on Data Streams with High Dimensional Data Environment Anomaly Detection on Data Streams with High Dimensional Data Environment Mr. D. Gokul Prasath 1, Dr. R. Sivaraj, M.E, Ph.D., 2 Department of CSE, Velalar College of Engineering & Technology, Erode 1 Assistant

More information

The Billion Object Platform (BOP): a system to lower barriers to support big, streaming, spatio-temporal data sources

The Billion Object Platform (BOP): a system to lower barriers to support big, streaming, spatio-temporal data sources FOSS4G 2017 Boston The Billion Object Platform (BOP): a system to lower barriers to support big, streaming, spatio-temporal data sources Devika Kakkar and Ben Lewis Harvard Center for Geographic Analysis

More information

networks data threads

networks data threads Informatics Department Aristotle University of Thessaloniki A distributed framework for early trending topics detection on big social networks data threads AT HE NA VA KA L I, N I KOL AOS KI T MER IDIS,

More information

Collective Entity Resolution in Relational Data

Collective Entity Resolution in Relational Data Collective Entity Resolution in Relational Data I. Bhattacharya, L. Getoor University of Maryland Presented by: Srikar Pyda, Brett Walenz CS590.01 - Duke University Parts of this presentation from: http://www.norc.org/pdfs/may%202011%20personal%20validation%20and%20entity%20resolution%20conference/getoorcollectiveentityresolution

More information

Distributed Itembased Collaborative Filtering with Apache Mahout. Sebastian Schelter twitter.com/sscdotopen. 7.

Distributed Itembased Collaborative Filtering with Apache Mahout. Sebastian Schelter twitter.com/sscdotopen. 7. Distributed Itembased Collaborative Filtering with Apache Mahout Sebastian Schelter ssc@apache.org twitter.com/sscdotopen 7. October 2010 Overview 1. What is Apache Mahout? 2. Introduction to Collaborative

More information

Chapter 7: Numerical Prediction

Chapter 7: Numerical Prediction Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 7: Numerical Prediction Lecture: Prof. Dr.

More information

Efficient query processing

Efficient query processing Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking functions In general, document scoring functions are of the form The BM25 function, is one of the best performing:

More information

Automatic Cluster Number Selection using a Split and Merge K-Means Approach

Automatic Cluster Number Selection using a Split and Merge K-Means Approach Automatic Cluster Number Selection using a Split and Merge K-Means Approach Markus Muhr and Michael Granitzer 31st August 2009 The Know-Center is partner of Austria's Competence Center Program COMET. Agenda

More information

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1

Hash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1 Hash Tables Outline Definition Hash functions Open hashing Closed hashing collision resolution techniques Efficiency EECS 268 Programming II 1 Overview Implementation style for the Table ADT that is good

More information

DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li

DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li Welcome to DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li Time: 6:00pm 8:50pm R Location: AK 232 Fall 2016 The Data Equation Oceans of Data Ocean Biodiversity Informatics,

More information

Requirements Engineering

Requirements Engineering Practical Course: Web Development Requirements Engineering Winter Semester 2016/17 Juliane Franze Ludwig-Maximilians-Universität München Practical Course Web Development WS 16/17-01 - 1 Today s Agenda

More information

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION Ms. Nikita P.Katariya 1, Prof. M. S. Chaudhari 2 1 Dept. of Computer Science & Engg, P.B.C.E., Nagpur, India, nikitakatariya@yahoo.com 2 Dept.

More information

Volume 6, Issue 5, May 2018 International Journal of Advance Research in Computer Science and Management Studies

Volume 6, Issue 5, May 2018 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 6, Issue 5, May 2018 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at:

More information

CS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch. Jared Saia University of New Mexico

CS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch. Jared Saia University of New Mexico CS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch Jared Saia University of New Mexico Outline Hash Tables Skip Lists Count-Min Sketch 1 Dictionary ADT A dictionary ADT implements

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining

More information

Efficient Spatial Query Processing in Geographic Database Systems

Efficient Spatial Query Processing in Geographic Database Systems Efficient Spatial Query Processing in Geographic Database Systems Hans-Peter Kriegel, Thomas Brinkhoff, Ralf Schneider Institute for Computer Science, University of Munich Leopoldstr. 11 B, W-8000 München

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

CS535 Big Data Fall 2017 Colorado State University   10/10/2017 Sangmi Lee Pallickara Week 8- A. CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE

More information

Presented by: Dardan Xhymshiti Spring 2016

Presented by: Dardan Xhymshiti Spring 2016 Presented by: Dardan Xhymshiti Spring 2016 Authors: Sean Chester* Darius Sidlauskas` Ira Assent* Kenneth S. Bogh* *Data-Intensive Systems Group, Aarhus University, Denmakr `Data-Intensive Applications

More information

Clustering. Unsupervised Learning

Clustering. Unsupervised Learning Clustering. Unsupervised Learning Maria-Florina Balcan 04/06/2015 Reading: Chapter 14.3: Hastie, Tibshirani, Friedman. Additional resources: Center Based Clustering: A Foundational Perspective. Awasthi,

More information

CAP 6412 Advanced Computer Vision

CAP 6412 Advanced Computer Vision CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong April 21st, 2016 Today Administrivia Free parameters in an approach, model, or algorithm? Egocentric videos by Aisha

More information

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data

More information

Efficient Mining Algorithms for Large-scale Graphs

Efficient Mining Algorithms for Large-scale Graphs Efficient Mining Algorithms for Large-scale Graphs Yasunari Kishimoto, Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka Abstract This article describes efficient graph mining algorithms designed

More information

How Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns

How Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns How Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns Roadmap Next several lectures: universal structural properties of networks Each large-scale network is unique microscopically,

More information

Open Source Search. Andreas Pesenhofer. max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria

Open Source Search. Andreas Pesenhofer. max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria Open Source Search Andreas Pesenhofer max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria max.recall information systems max.recall is a software and consulting company enabling

More information

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams.

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams. http://www.mmds.org High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Network Analysis

More information

Silk Server Adding missing Links while consuming Linked Data

Silk Server Adding missing Links while consuming Linked Data Proceedings Of The First International Workshop On Consuming Linked Data Shanghai, China, November 8, 2010 Silk Server Adding missing Links while consuming Linked Data Robert Isele, Freie Universität Berlin

More information

Real-time Collaborative Filtering Recommender Systems

Real-time Collaborative Filtering Recommender Systems Real-time Collaborative Filtering Recommender Systems Huizhi Liang, Haoran Du, Qing Wang Presenter: Qing Wang Research School of Computer Science The Australian National University Australia Partially

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

CSI33 Data Structures

CSI33 Data Structures Outline Department of Mathematics and Computer Science Bronx Community College September 6, 2017 Outline Outline 1 Chapter 2: Data Abstraction Outline Chapter 2: Data Abstraction 1 Chapter 2: Data Abstraction

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #6: Mining Data Streams Seoul National University 1 Outline Overview Sampling From Data Stream Queries Over Sliding Window 2 Data Streams In many data mining situations,

More information

Here(is(the(XML(Schema(that(describes(the(format:

Here(is(the(XML(Schema(that(describes(the(format: BuildingBusinessDashboardsusingXSLT,SVG,HTML5 Today seconomyhasdramaticallychangedthewaycompaniesdobusiness,havinganonlinepresenceisnolonger sufficient,andordermanagementsystemshavetobereal:timeandaccessibleonalargevarietyofdevices:

More information

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,

More information

Flow and Congestion Control Marcos Vieira

Flow and Congestion Control Marcos Vieira Flow and Congestion Control 2014 Marcos Vieira Flow Control Part of TCP specification (even before 1988) Goal: not send more data than the receiver can handle Sliding window protocol Receiver uses window

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule. CS 188: Artificial Intelligence Fall 2007 Lecture 26: Kernels 11/29/2007 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit your

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

Using Natural Language Processing and Machine Learning to Assist First-Level Customer Support for Contract Management

Using Natural Language Processing and Machine Learning to Assist First-Level Customer Support for Contract Management Using Natural Language Processing and Machine Learning to Assist First-Level Customer Support for Contract Management Master thesis Final presentation Michael Legenc Advisor: Daniel Braun Munich, 08.01.2018

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016

/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016 15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Mining User - Aware Rare Sequential Topic Pattern in Document Streams

Mining User - Aware Rare Sequential Topic Pattern in Document Streams Mining User - Aware Rare Sequential Topic Pattern in Document Streams A.Mary Assistant Professor, Department of Computer Science And Engineering Alpha College Of Engineering, Thirumazhisai, Tamil Nadu,

More information

Tag Based Image Search by Social Re-ranking

Tag Based Image Search by Social Re-ranking Tag Based Image Search by Social Re-ranking Vilas Dilip Mane, Prof.Nilesh P. Sable Student, Department of Computer Engineering, Imperial College of Engineering & Research, Wagholi, Pune, Savitribai Phule

More information