SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds

Size: px

Start display at page:

Download "SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds"

Cori Watkins
6 years ago
Views:

1 SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Erich Schubert, Michael Weiler, Hans-Peter Kriegel! Institute of Informatics Database Systems Group Ludwig-Maximilians-Universität München

2 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp 56 Term frequency :47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10

3 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp 56 Term frequency A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10

4 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp 56 Term frequency ? A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10

Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp Pair { Facebook, WhatsApp } 56 Term frequency 42 28 14 A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01

5 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp Pair { Facebook, WhatsApp } 56 Term frequency A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Twitter Streaming API on Feb. 19th 2014 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10

6 Trend detection on streams should be early and accurate 70 Term Facebook Term WhatsApp Pair { Facebook, WhatsApp } 56 Term frequency A B 0 10:47 10:49 10:51 10:53 10:55 10:57 10:59 11:01 11:03 11:05 11:07 11:09 11:11 11:13 Facebook bought WhatsApp Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 1/10

7 Problem description 1. Statistical significance score Popular topics trending topics (e.g. Obama) 2. Track interacting terms Facebook bought WhatsApp Edward Snowden traveled to Moscow 3. Scalability Efficient calculation for all terms and pairs Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 2/10

8 SigniTrend on textual streams tracking both: single terms and pairs A. Preprocessing (stopwords, stemming, duplicates) B. Trend detection cycle Temporal slicing for statistical aggregation Score all terms and pairs based on expectations from past slices C. Refinement with clustering Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 3/10

9 Trend detection cycle Terms and pairs Count frequency exceeds threshold? Trend candidates new alerting thresholds at the end of each time slice Update statistics Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 4/10

10 Update statistics for time slice t and term or pair e How many standard deviations is the current frequency x higher than its mean z (x t,e ):= x t,e µ t 1,e t-1,e Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 5/10

11 Update statistics for time slice t and term or pair e How many standard deviations is the current frequency x higher than its mean! z (x t,e ):= x pt,e EWMA t 1,e EWMVart 1,e Exponential weighted moving average/variance for continuous estimation on a stream [Finch09] 4 t,e x t,e EWMA t 1,e EWMA t,e EWMA t 1,e + 4 t,e EWMVar t,e (1 ) EWMVar t 1,e t,e { [Finch09] T. Finch. Incremental calculation of weighted mean and variance. Technical report, University of Cambridge, 2009 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 5/10

12 Significance and frequency for term Facebook Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 6/10

13 How to track statistics of all pairs efficiently? Problem: Too many terms and pairs to track everything 2013 News Dataset STEMMED TERMS Therefore, we designed an efficient hashing scheme (based on Bloom Filters and Heavy Hitters) for probabilistic upper-bound statistics!!! OBSERVED PAIRS TOTAL 56,661, ,430,059 UNIQUES 300,141 71,289,359 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 7/10

14 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 #1 #2 #3 #4 #5 #6 #7 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

15 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 h1 h2 45 ± ± 30 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

16 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 h1 h2 45 ± 30 2 ± 1 45 ± 30 2 ± 1 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

17 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 h1 h2 45 ± 30 2 ± 1 45 ± 30? 20 ± 10 2 ± 1 20 ± 10 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

18 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 write MAX (upper bound) h1 h2 45 ± 30 2 ± 1 45 ± 30 2 ± 1 20 ± 10 Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

19 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 write MAX (upper bound) 45 ± 30 2 ± 1 45 ± 30 2 ± 1 20 ± 10 h1 h2 read {Obama, US}: min(45±30, 20±10) = 20±10 read MIN (lowest collision) Upper-bound estimate for mean and its variance Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

20 Hashing scheme for efficient tracking L=7 buckets, K=2 hash functions {WhatsApp}: 60 {Facebook, WhatsApp}: 2 {Obama, US}: 25 write MAX (upper bound) 45 ± 30 2 ± 1 45 ± 30 2 ± 1 20 ± 10 read {Obama, US}: min(45±30, 20±10) = 20±10 read MIN (lowest collision) Upper-bound estimate for mean and its variance Performance on news dataset: 104s/day with a Raspberry-Pi Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 8/10

Hash Table Bits l α=0.01 α=0.03 α=0.05 α=0.07 α=0.09 α=0.11 α=0.13 α=0.

21 Artificial trends evaluation Inject artificial words with frequency α e.g. Obama meets <X123> Netanyahu 100% Recall of Injected Trends 80% 60% 40% 20% 0% Hash Table Bits l α=0.01 α=0.03 α=0.05 α=0.07 α=0.09 α=0.11 α=0.13 α=0.15 Hash table size large enough recall saturation Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 9/10

22 Refinement & clustering Inverted index (Apache Lucene) to verify trend candidates and measure exactly (without hashing) for precise reporting (false-positives can be eliminated) Single Link clustering with Ward of remaining trends (similarity matrix is built with the exact significance of all pairs) Future work: include topic modeling techniques (e.g. plsi, LDA) Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds Page 10/10

23 Thank You! Questions? Michael Weiler SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning