SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
|
|
- Cori Watkins
- 6 years ago
- Views:
Transcription
1 SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Erich Schubert, Michael Weiler, Hans-Peter Kriegel! Institute of Informatics Database Systems Group Ludwig-Maximilians-Universität München
2 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp 56 Term frequency :47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10
3 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp 56 Term frequency A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10
4 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp 56 Term frequency ? A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10
5 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp Pair { Facebook, WhatsApp } 56 Term frequency A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10
6 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp Pair { Facebook, WhatsApp } 56 Term frequency A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Facebook bought WhatsApp Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10
7 Problem description 1. Statistical significance score Popular topics trending topics (e.g. Obama) 2. Track interacting terms Facebook bought WhatsApp Edward Snowden traveled to Moscow 3. Scalability Efficient calculation for all terms and pairs Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 2/10
8 SigniTrend on textual streams tracking both: single terms and pairs A. Preprocessing (stopwords, stemming, duplicates) B. Trend detection cycle Temporal slicing for statistical aggregation Score all terms and pairs based on expectations from past slices C. Refinement with clustering Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 3/10
9 Trend detection cycle Terms and pairs Count frequency exceeds threshold? Trend candidates new alerting thresholds at the end of each time slice Update statistics Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 4/10
10 Update statistics for time slice t and term or pair e How many standard deviations is the current frequency x higher than its mean z (x t,e ):= x t,e µ t 1,e t-1,e Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 5/10
11 Update statistics for time slice t and term or pair e How many standard deviations is the current frequency x higher than its mean! z (x t,e ):= x pt,e EWMA t 1,e EWMVart 1,e Exponential weighted moving average/variance for continuous estimation on a stream [Finch09] 4 t,e x t,e EWMA t 1,e EWMA t,e EWMA t 1,e + 4 t,e EWMVar t,e (1 ) EWMVar t 1,e t,e { [Finch09] T. Finch. Incremental calculation of weighted mean and variance. Technical report, University of Cambridge, 2009 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 5/10
12 Significance and frequency for term Facebook Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 6/10
13 How to track statistics of all pairs efficiently? Problem: Too many terms and pairs to track everything 2013 News Dataset STEMMED TERMS Therefore, we designed an efficient hashing scheme (based on Bloom Filters and Heavy Hitters) for probabilistic upper-bound statistics!!! OBSERVED PAIRS TOTAL 56,661, ,430,059 UNIQUES 300,141 71,289,359 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 7/10
14 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 #1 #2 #3 #4 #5 #6 #7 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10
15 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 h1 h2 45 ± ± 30 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10
16 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 h1 h2 45 ± 30 2 ± 1 45 ± 30 2 ± 1 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10
17 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 h1 h2 45 ± 30 2 ± 1 45 ± 30? 20 ± 10 2 ± 1 20 ± 10 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10
18 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 write MAX (upper bound) h1 h2 45 ± 30 2 ± 1 45 ± 30 2 ± 1 20 ± 10 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10
19 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 write MAX (upper bound) 45 ± 30 2 ± 1 45 ± 30 2 ± 1 20 ± 10 h1 h2 read {Obama, US}: min(45±30, 20±10) = 20±10 read MIN (lowest collision) Upper-bound estimate for mean and its variance Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10
20 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 write MAX (upper bound) 45 ± 30 2 ± 1 45 ± 30 2 ± 1 20 ± 10 read {Obama, US}: min(45±30, 20±10) = 20±10 read MIN (lowest collision) Upper-bound estimate for mean and its variance Performance on news dataset: 104s/day with a Raspberry-Pi Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10
21 Artificial trends evaluation Inject artificial words with frequency α e.g. Obama meets <X123> Netanyahu 100% Recall of Injected Trends 80% 60% 40% 20% 0% Hash Table Bits l α=0.01 α=0.03 α=0.05 α=0.07 α=0.09 α=0.11 α=0.13 α=0.15 Hash table size large enough recall saturation Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 9/10
22 Refinement & clustering Inverted index (Apache Lucene) to verify trend candidates and measure exactly (without hashing) for precise reporting (false-positives can be eliminated) Single Link clustering with Ward of remaining trends (similarity matrix is built with the exact significance of all pairs) Future work: include topic modeling techniques (e.g. plsi, LDA) Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 10/10
23 Thank You! Questions? Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning
Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning
More informationTiP: Analyzing Periodic Time Series Patterns
ip: Analyzing Periodic ime eries Patterns homas Bernecker, Hans-Peter Kriegel, Peer Kröger, and Matthias Renz Institute for Informatics, Ludwig-Maximilians-Universität München Oettingenstr. 67, 80538 München,
More informationDetection of Outliers
Detection of Outliers TNM033 - Data Mining by Anton Auoja, Albert Backenhof & Mikael Dalkvist Holy Outliers, Batman!! An outlying observation, or outlier, is one that appears to deviate markedly from other
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams
More informationSourcererCC -- Scaling Code Clone Detection to Big-Code
SourcererCC -- Scaling Code Clone Detection to Big-Code What did this paper do? SourcererCC a token-based clone detector, that can detect both exact and near-miss clones from large inter project repositories
More informationReplication on Affinity Propagation: Clustering by Passing Messages Between Data Points
1 Replication on Affinity Propagation: Clustering by Passing Messages Between Data Points Zhe Zhao Abstract In this project, I choose the paper, Clustering by Passing Messages Between Data Points [1],
More informationChapter 5: Outlier Detection
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 5: Outlier Detection Lecture: Prof. Dr.
More informationBig Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition
Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition What s the BIG deal?! 2011 2011 2008 2010 2012 What s the BIG deal?! (Gartner Hype Cycle) What s the
More informationData Analytics with HPC. Data Streaming
Data Analytics with HPC Data Streaming Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationResearch Students Lecture Series 2015
Research Students Lecture Series 215 Analyse your big data with this one weird probabilistic approach! Or: applied probabilistic algorithms in 5 easy pieces Advait Sarkar advait.sarkar@cl.cam.ac.uk Research
More informationEffective Latent Space Graph-based Re-ranking Model with Global Consistency
Effective Latent Space Graph-based Re-ranking Model with Global Consistency Feb. 12, 2009 1 Outline Introduction Related work Methodology Graph-based re-ranking model Learning a latent space graph A case
More informationEpilog: Further Topics
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Epilog: Further Topics Lecture: Prof. Dr. Thomas
More informationDeveloping Focused Crawlers for Genre Specific Search Engines
Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com
More informationCourse : Data mining
Course : Data mining Lecture : Mining data streams Aristides Gionis Department of Computer Science Aalto University visiting in Sapienza University of Rome fall 2016 reading assignment LRU book: chapter
More informationOnline Template Matching Over a Stream of Digitized Documents
Online Template Matching Over a Stream of Digitized Documents Michael Stockerl, Christoph Ringlstetter Gini Gmbh Eirini Ntoutsi, Matthias Schubert, Hans Peter Kriegel University of Munich (LMU) CS Departement
More informationIntroduction to ThingWorx
Introduction to ThingWorx Introduction to Internet of Things (2min) What are the objectives of this video? Figure 1 Welcome to ThingWorx (THWX), in this short video we will introduce you to the main components
More informationTulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios
Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids Marek Lipczak Arash Koushkestani Evangelos Milios Problem definition The goal of Entity Recognition and Disambiguation
More informationDown the event-driven road: Experiences of integrating streaming into analytic data platforms
Down the event-driven road: Experiences of integrating streaming into analytic data platforms Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH Confluent Meetup Munich, 8.10.2018 Integrate
More informationOutlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data
Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University
More informationOnline template matching over a stream of digitized documents
Online template matching over a stream of digitized documents Michael Stockerl Gini GmbH München, Germany stockerl@cip.ifi.lmu.de Christoph Ringlstetter Gini GmbH München, Germany c.ringlstetter@gini.net
More informationProUD: Probabilistic Ranking in Uncertain Databases
Proc. 20th Int. Conf. on Scientific and Statistical Database Management (SSDBM'08), Hong Kong, China, 2008. ProUD: Probabilistic Ranking in Uncertain Databases Thomas Bernecker, Hans-Peter Kriegel, Matthias
More informationAxibase Time-Series Database. Non-relational database for storing and analyzing large volumes of metrics collected at high-frequency
Axibase Time-Series Database Non-relational database for storing and analyzing large volumes of metrics collected at high-frequency What is a Time-Series Database? A time series database (TSDB) is a software
More informationMining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams
Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06
More informationA Library and Proxy for SPDY
A Library and Proxy for SPDY Interdisciplinary Project Andrey Uzunov Chair for Network Architectures and Services Department of Informatics Technische Universität München April 3, 2013 Andrey Uzunov (TUM)
More informationBuilding a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch
Nick Pentreath Nov / 14 / 16 Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch About @MLnick Principal Engineer, IBM Apache Spark PMC Focused on machine learning
More informationA BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK
A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific
More informationHeavy-Hitter Detection Entirely in the Data Plane
Heavy-Hitter Detection Entirely in the Data Plane VIBHAALAKSHMI SIVARAMAN SRINIVAS NARAYANA, ORI ROTTENSTREICH, MUTHU MUTHUKRSISHNAN, JENNIFER REXFORD 1 Heavy Hitter Flows Flows above a certain threshold
More informationKLEE: A Framework for Distributed Top-k Query Algorithms
KLEE: A Framework for Distributed Top-k Query Algorithms Sebastian Michel Max-Planck Institute for Informatics Saarbrücken, Germany smichel@mpi-inf.mpg.de Peter Triantafillou RACTI / Univ. of Patras Rio,
More informationPrioritizing Attention in Fast Data
Prioritizing ttention in Fast Data Principles and Promise Peter ailis Edward Gan Kexin Rong Sahaana Suri CIDR 2017 Edward Kexin Sahaana Edward Kexin Sahaana Firas Deepak John Matei Tony Edward Kexin Ihab
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining
More informationLearning Similarity Metrics for Event Identification in Social Media. Hila Becker, Luis Gravano
Learning Similarity Metrics for Event Identification in Social Media Hila Becker, Luis Gravano Columbia University Mor Naaman Rutgers University Event Content in Social Media Sites Event Content in Social
More informationPrivacy Preserving Probabilistic Record Linkage
Privacy Preserving Probabilistic Record Linkage Duncan Smith (Duncan.G.Smith@Manchester.ac.uk) Natalie Shlomo (Natalie.Shlomo@Manchester.ac.uk) Social Statistics, School of Social Sciences University of
More informationOverview Problem Statement Related Work KLEE The Histogram Bloom Structure Candidate Filtering Evaluation Conclusion / Future Work
Sebastian Michel Max-Planck Institute for Informatics Saarbrücken, Germany smichel@mpi-inf.mpg.de KLEE: Approximate Distributed Top-k Query Algorithms Peter Triantafillou RACTI / Univ. of Patras Rio, Greece
More information13 Frequent Itemsets and Bloom Filters
13 Frequent Itemsets and Bloom Filters A classic problem in data mining is association rule mining. The basic problem is posed as follows: We have a large set of m tuples {T 1, T,..., T m }, each tuple
More informationChapter 27 Hashing. Liang, Introduction to Java Programming, Eleventh Edition, (c) 2017 Pearson Education, Inc. All rights reserved.
Chapter 27 Hashing 1 Objectives To know what hashing is for ( 27.3). To obtain the hash code for an object and design the hash function to map a key to an index ( 27.4). To handle collisions using open
More informationCSE 373 Autumn 2012: Midterm #2 (closed book, closed notes, NO calculators allowed)
Name: Sample Solution Email address: CSE 373 Autumn 0: Midterm # (closed book, closed notes, NO calculators allowed) Instructions: Read the directions for each question carefully before answering. We may
More informationFinding Similar Items:Nearest Neighbor Search
Finding Similar Items:Nearest Neighbor Search Barna Saha February 23, 2016 Finding Similar Items A fundamental data mining task Finding Similar Items A fundamental data mining task May want to find whether
More informationPart II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems. SkimpyStash: Key Value Store on Flash-based Storage
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems SkimpyStash: Key Value
More informationMaster Project. Various Aspects of Recommender Systems. Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue Ayala
Master Project Various Aspects of Recommender Systems May 2nd, 2017 Master project SS17 Albert-Ludwigs-Universität Freiburg Prof. Dr. Georg Lausen Dr. Michael Färber Anas Alzoghbi Victor Anthony Arrascue
More informationGoal of this document: A simple yet effective
INTRODUCTION TO ELK STACK Goal of this document: A simple yet effective document for folks who want to learn basics of ELK (Elasticsearch, Logstash and Kibana) without any prior knowledge. Introduction:
More informationNowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?
Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/
More informationPattern Mining in Frequent Dynamic Subgraphs
Pattern Mining in Frequent Dynamic Subgraphs Karsten M. Borgwardt, Hans-Peter Kriegel, Peter Wackersreuther Institute of Computer Science Ludwig-Maximilians-Universität Munich, Germany kb kriegel wackersr@dbs.ifi.lmu.de
More informationTools for Social Networking Infrastructures
Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes
More informationHomework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please)
Virginia Tech. Computer Science CS 5614 (Big) Data Management Systems Spring 2017, Prakash Homework 3: Map-Reduce, Frequent Itemsets, LSH, Streams (due March 16 th, 9:30am in class hard-copy please) Reminders:
More informationGraphCEP Real-Time Data Analytics Using Parallel Complex Event and Graph Processing
Institute of Parallel and Distributed Systems () Universitätsstraße 38 D-70569 Stuttgart GraphCEP Real-Time Data Analytics Using Parallel Complex Event and Graph Processing Ruben Mayer, Christian Mayer,
More informationCold Filter: A Meta-Framework for Faster and More Accurate Stream Processing
SIGMOD, June -5,, Houston, TX, USA Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing Yang Zhou, Tong Yang, Jie Jiang, Bin Cui, Minlan Yu, Xiaoming Li, Steve Uhlig Peking University.
More information"GET /cgi-bin/purchase?itemid=109agfe111;ypcat%20passwd mail 200
128.111.41.15 "GET /cgi-bin/purchase? itemid=1a6f62e612&cc=mastercard" 200 128.111.43.24 "GET /cgi-bin/purchase?itemid=61d2b836c0&cc=visa" 200 128.111.48.69 "GET /cgi-bin/purchase? itemid=a625f27110&cc=mastercard"
More informationCross Reference Strategies for Cooperative Modalities
Cross Reference Strategies for Cooperative Modalities D.SRIKAR*1 CH.S.V.V.S.N.MURTHY*2 Department of Computer Science and Engineering, Sri Sai Aditya institute of Science and Technology Department of Information
More informationInforma)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies
Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:
More informationBuilding A Billion Spatio-Temporal Object Search and Visualization Platform
2017 2 nd International Symposium on Spatiotemporal Computing Harvard University Building A Billion Spatio-Temporal Object Search and Visualization Platform Devika Kakkar, Benjamin Lewis Goal Develop a
More informationSCREAM: Sketch Resource Allocation for Software-defined Measurement
SCREAM: Sketch Resource Allocation for Software-defined Measurement (CoNEXT 15) Masoud Moshref, Minlan Yu, Ramesh Govindan, Amin Vahdat Measurement is Crucial for Network Management Network Management
More informationAnomaly Detection on Data Streams with High Dimensional Data Environment
Anomaly Detection on Data Streams with High Dimensional Data Environment Mr. D. Gokul Prasath 1, Dr. R. Sivaraj, M.E, Ph.D., 2 Department of CSE, Velalar College of Engineering & Technology, Erode 1 Assistant
More informationThe Billion Object Platform (BOP): a system to lower barriers to support big, streaming, spatio-temporal data sources
FOSS4G 2017 Boston The Billion Object Platform (BOP): a system to lower barriers to support big, streaming, spatio-temporal data sources Devika Kakkar and Ben Lewis Harvard Center for Geographic Analysis
More informationnetworks data threads
Informatics Department Aristotle University of Thessaloniki A distributed framework for early trending topics detection on big social networks data threads AT HE NA VA KA L I, N I KOL AOS KI T MER IDIS,
More informationCollective Entity Resolution in Relational Data
Collective Entity Resolution in Relational Data I. Bhattacharya, L. Getoor University of Maryland Presented by: Srikar Pyda, Brett Walenz CS590.01 - Duke University Parts of this presentation from: http://www.norc.org/pdfs/may%202011%20personal%20validation%20and%20entity%20resolution%20conference/getoorcollectiveentityresolution
More informationDistributed Itembased Collaborative Filtering with Apache Mahout. Sebastian Schelter twitter.com/sscdotopen. 7.
Distributed Itembased Collaborative Filtering with Apache Mahout Sebastian Schelter ssc@apache.org twitter.com/sscdotopen 7. October 2010 Overview 1. What is Apache Mahout? 2. Introduction to Collaborative
More informationChapter 7: Numerical Prediction
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 7: Numerical Prediction Lecture: Prof. Dr.
More informationEfficient query processing
Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking functions In general, document scoring functions are of the form The BM25 function, is one of the best performing:
More informationAutomatic Cluster Number Selection using a Split and Merge K-Means Approach
Automatic Cluster Number Selection using a Split and Merge K-Means Approach Markus Muhr and Michael Granitzer 31st August 2009 The Know-Center is partner of Austria's Competence Center Program COMET. Agenda
More informationHash Tables Outline. Definition Hash functions Open hashing Closed hashing. Efficiency. collision resolution techniques. EECS 268 Programming II 1
Hash Tables Outline Definition Hash functions Open hashing Closed hashing collision resolution techniques Efficiency EECS 268 Programming II 1 Overview Implementation style for the Table ADT that is good
More informationDS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li
Welcome to DS504/CS586: Big Data Analytics Data Pre-processing and Cleaning Prof. Yanhua Li Time: 6:00pm 8:50pm R Location: AK 232 Fall 2016 The Data Equation Oceans of Data Ocean Biodiversity Informatics,
More informationRequirements Engineering
Practical Course: Web Development Requirements Engineering Winter Semester 2016/17 Juliane Franze Ludwig-Maximilians-Universität München Practical Course Web Development WS 16/17-01 - 1 Today s Agenda
More informationTEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION
TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION Ms. Nikita P.Katariya 1, Prof. M. S. Chaudhari 2 1 Dept. of Computer Science & Engg, P.B.C.E., Nagpur, India, nikitakatariya@yahoo.com 2 Dept.
More informationVolume 6, Issue 5, May 2018 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 6, Issue 5, May 2018 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at:
More informationCS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch. Jared Saia University of New Mexico
CS 561, Lecture 2 : Hash Tables, Skip Lists, Bloom Filters, Count-Min sketch Jared Saia University of New Mexico Outline Hash Tables Skip Lists Count-Min Sketch 1 Dictionary ADT A dictionary ADT implements
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining
More informationEfficient Spatial Query Processing in Geographic Database Systems
Efficient Spatial Query Processing in Geographic Database Systems Hans-Peter Kriegel, Thomas Brinkhoff, Ralf Schneider Institute for Computer Science, University of Munich Leopoldstr. 11 B, W-8000 München
More informationKapitel 4: Clustering
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.
More informationCS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.
CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE
More informationPresented by: Dardan Xhymshiti Spring 2016
Presented by: Dardan Xhymshiti Spring 2016 Authors: Sean Chester* Darius Sidlauskas` Ira Assent* Kenneth S. Bogh* *Data-Intensive Systems Group, Aarhus University, Denmakr `Data-Intensive Applications
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 04/06/2015 Reading: Chapter 14.3: Hastie, Tibshirani, Friedman. Additional resources: Center Based Clustering: A Foundational Perspective. Awasthi,
More informationCAP 6412 Advanced Computer Vision
CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong April 21st, 2016 Today Administrivia Free parameters in an approach, model, or algorithm? Egocentric videos by Aisha
More informationDATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane
DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data
More informationEfficient Mining Algorithms for Large-scale Graphs
Efficient Mining Algorithms for Large-scale Graphs Yasunari Kishimoto, Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka Abstract This article describes efficient graph mining algorithms designed
More informationHow Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns
How Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns Roadmap Next several lectures: universal structural properties of networks Each large-scale network is unique microscopically,
More informationOpen Source Search. Andreas Pesenhofer. max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria
Open Source Search Andreas Pesenhofer max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria max.recall information systems max.recall is a software and consulting company enabling
More informationHigh dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams.
http://www.mmds.org High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Network Analysis
More informationSilk Server Adding missing Links while consuming Linked Data
Proceedings Of The First International Workshop On Consuming Linked Data Shanghai, China, November 8, 2010 Silk Server Adding missing Links while consuming Linked Data Robert Isele, Freie Universität Berlin
More informationReal-time Collaborative Filtering Recommender Systems
Real-time Collaborative Filtering Recommender Systems Huizhi Liang, Haoran Du, Qing Wang Presenter: Qing Wang Research School of Computer Science The Australian National University Australia Partially
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationCOSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor
COSC160: Detection and Classification Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Problem I. Strategies II. Features for training III. Using spatial information? IV. Reducing dimensionality
More informationIn = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most
In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100
More informationCSI33 Data Structures
Outline Department of Mathematics and Computer Science Bronx Community College September 6, 2017 Outline Outline 1 Chapter 2: Data Abstraction Outline Chapter 2: Data Abstraction 1 Chapter 2: Data Abstraction
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #6: Mining Data Streams Seoul National University 1 Outline Overview Sampling From Data Stream Queries Over Sliding Window 2 Data Streams In many data mining situations,
More informationHere(is(the(XML(Schema(that(describes(the(format:
BuildingBusinessDashboardsusingXSLT,SVG,HTML5 Today seconomyhasdramaticallychangedthewaycompaniesdobusiness,havinganonlinepresenceisnolonger sufficient,andordermanagementsystemshavetobereal:timeandaccessibleonalargevarietyofdevices:
More informationPrivacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras
Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,
More informationFlow and Congestion Control Marcos Vieira
Flow and Congestion Control 2014 Marcos Vieira Flow Control Part of TCP specification (even before 1988) Goal: not send more data than the receiver can handle Sliding window protocol Receiver uses window
More informationFeature Extractors. CS 188: Artificial Intelligence Fall Nearest-Neighbor Classification. The Perceptron Update Rule.
CS 188: Artificial Intelligence Fall 2007 Lecture 26: Kernels 11/29/2007 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit your
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationLecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics
More informationUsing Natural Language Processing and Machine Learning to Assist First-Level Customer Support for Contract Management
Using Natural Language Processing and Machine Learning to Assist First-Level Customer Support for Contract Management Master thesis Final presentation Michael Legenc Advisor: Daniel Braun Munich, 08.01.2018
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More information/ Cloud Computing. Recitation 3 Sep 13 & 15, 2016
15-319 / 15-619 Cloud Computing Recitation 3 Sep 13 & 15, 2016 1 Overview Administrative Issues Last Week s Reflection Project 1.1, OLI Unit 1, Quiz 1 This Week s Schedule Project1.2, OLI Unit 2, Module
More informationA novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems
A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationMining User - Aware Rare Sequential Topic Pattern in Document Streams
Mining User - Aware Rare Sequential Topic Pattern in Document Streams A.Mary Assistant Professor, Department of Computer Science And Engineering Alpha College Of Engineering, Thirumazhisai, Tamil Nadu,
More informationTag Based Image Search by Social Re-ranking
Tag Based Image Search by Social Re-ranking Vilas Dilip Mane, Prof.Nilesh P. Sable Student, Department of Computer Engineering, Imperial College of Engineering & Research, Wagholi, Pune, Savitribai Phule
More information