Clustering the output of Apache Nutch using Apache Spark. May 12, Vancouver, Canada

1 Clustering the output of Apache Nutch using Apache Spark. Thamme Gowda N., Dr. Chris Mattmann. May 12, Vancouver, Canada 1

2 About. ThammeGowda Narayanaswamy (TG in short): contributor to Apache Tika and Apache Nutch; now a grad student, University of Southern California; past: Technical, Datoin. Dr. Chris Mattmann: Adj. Prof. and the director of IRDS, University of Southern California, Los Angeles; Apache Software Foundation; Chief Architect, NASA JPL 2

3 Overview: problem statement; clustering, a solution; structure and style similarity; shared near neighbor clustering; scaling it up using Spark's distributed matrices and GraphX; a demo 3

4 Audience: anyone who crawls the web, extracts data from the web, or filters web pages; anyone who would like to know about web page structure and style similarity and shared near neighbor clustering 4

5 Problem Statement: scraping data from online marketplaces. Start with the homepage, then categories, then listing pages, then the actual stuff (detail pages) 5

6 Sample set of web pages 6 credits:

7 Sample set of web pages, with some pages labeled "USELESS" 7 credits:

8 Sample set of web pages, with pages labeled "USELESS" or "REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS" 8 credits:

9 Sample set of web pages, with pages labeled "USELESS", "REQUIRED FOR CRAWLER, BUT NOT IMPORTANT FOR ANALYSIS", or "USEFUL FOR ANALYSIS" 9 credits:

10 Question: How do we solve this? Answer: Cluster the web pages 10

11 Why Cluster? Separate the interesting web pages. Drop uninteresting/noisy web pages. Categorical treatment of clusters: extract structured data using XPath; automated extraction using alignment 11

12 Goal: group web pages that are similar in terms of CSS styles and DOM structure. Toolkit for experimentation with various thresholds (% of similarity in style and/or structure). Nice visualizations 12

13 How do we cluster? Based on similarity between pages. Semantic similarity: the meaning of the web pages. Syntactic similarity: web page structure and CSS styles. This session focuses on the syntactic aspect 13

14 Structural similarity. Web pages are built with HTML. An HTML document maps to a DOM tree, a labeled ordered tree. Structural similarity is measured using tree edit distance (TED). (Figure: example DOM tree with nodes HTML, HEAD, TITLE, BODY, DIV, P) 14

15 (Minimum) Tree Edit Distance. An edit distance measure similar to the one for strings, but on hierarchical data instead of sequences: the number of editing operations required to transform one tree into another. Three basic editing operations: INSERT, REMOVE and REPLACE. A useful measure to quantify how similar (or dissimilar) two trees are. 15

16 Example: Tree Edit Distance*. Edit operations; normalized distance. * Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6),
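The edit operations and the normalized distance on this slide can be illustrated with a small sketch. It is a plain memoized recursion over ordered forests with unit costs, not the faster Zhang-Shasha algorithm cited above, and it is only practical for small trees. The `tree` helper, the function names, and the normalization (dividing by the total number of nodes in both trees) are assumptions made for illustration; the slides do not state how the distance is normalized.

```python
from functools import lru_cache


def tree(label, *children):
    """Hypothetical helper: a node is (label, tuple_of_child_nodes)."""
    return (label, tuple(children))


def size(forest):
    """Number of nodes in a forest (a tuple of nodes)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests."""
    if not f1:
        return size(f2)  # INSERT every remaining node of f2
    if not f2:
        return size(f1)  # REMOVE every remaining node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # REMOVE the rightmost root of f1; its children take its place
        forest_distance(f1[:-1] + kids1, f2) + 1,
        # INSERT the rightmost root of f2
        forest_distance(f1, f2[:-1] + kids2) + 1,
        # Match the two rightmost roots, REPLACE the label if it differs
        forest_distance(f1[:-1], f2[:-1])
        + forest_distance(kids1, kids2)
        + (label1 != label2),
    )


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def normalized_distance(t1, t2):
    """Assumed normalization: distance divided by the combined node count."""
    return tree_edit_distance(t1, t2) / (size((t1,)) + size((t2,)))


page_a = tree("HTML", tree("HEAD", tree("TITLE")), tree("BODY", tree("DIV"), tree("P")))
page_b = tree("HTML", tree("HEAD", tree("TITLE")), tree("BODY", tree("DIV")))
print(tree_edit_distance(page_a, page_b), normalized_distance(page_a, page_b))
```

Here `page_b` is `page_a` with the `P` node removed, so a single REMOVE operation transforms one into the other.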

17 Style Similarity. Have you noticed? Similar web pages have similar CSS styles. XPath: //*[@class]/@class. Simple measure: Jaccard similarity on CSS class names 17
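A minimal sketch of this measure, assuming Python's standard `html.parser` in place of an XPath engine: it collects the same values that `//*[@class]/@class` would select, splits them into individual class names, and compares the two sets. The names `ClassCollector`, `css_classes`, and `jaccard_similarity` are hypothetical, and treating two pages with no classes at all as identical is an assumption.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name found in a class attribute."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names
                self.classes.update(value.split())


def css_classes(html):
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


def jaccard_similarity(a, b):
    """|A intersect B| / |A union B|; assumed to be 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


page_a = '<div class="nav top"><p class="price">1</p></div>'
page_b = '<div class="nav"><span class="price sale">2</span></div>'
print(jaccard_similarity(css_classes(page_a), css_classes(page_b)))
```

The two sample pages share `nav` and `price` out of four distinct class names, so their style similarity is 0.5.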

18 Web pages consist of: HTML, CSS, JavaScript 18

19 Aggregating the Style and Structure. StructuralSimilarity: normalized tree edit distance. StyleSimilarity: Jaccard distance. Combine on a linear scale: Aggregated = k * Structural + (1 - k) * Style 19
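The linear combination can be sketched as below. The function names are hypothetical, and the sketch assumes both inputs are already similarities in [0, 1]; converting a normalized distance to a similarity as `1 - distance` is an assumption, since the slide does not spell out the conversion.

```python
def distance_to_similarity(normalized_distance):
    """Assumed conversion from a distance in [0, 1] to a similarity in [0, 1]."""
    return 1.0 - normalized_distance


def aggregate_similarity(structural, style, k=0.5):
    """Aggregated = k * Structural + (1 - k) * Style, with weight k in [0, 1]."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    return k * structural + (1.0 - k) * style


structural = distance_to_similarity(0.2)  # e.g. a normalized tree edit distance of 0.2
style = 0.4                               # e.g. Jaccard similarity of CSS class names
print(aggregate_similarity(structural, style, k=0.25))
```

With `k = 1` only structure counts, with `k = 0` only style counts, and values in between trade one off against the other.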

20 Implementation 20

21 Implementation. Read Nutch's segments: sparkContext.sequenceFile(...). Filter web pages: robust content type detection with Tika. Structural similarity: HTML to DOM tree with NekoHTML; tree edit distance with Zhang-Shasha's algorithm 21

22 Implementation. Style similarity: query CSS class names using XPath. Similarity matrix: rdd.cartesian(rdd) to get n x n cells; Spark's distributed (coordinate) matrix; persist the matrix for later experimentation with multiple thresholds 22
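The shape of the similarity matrix can be sketched in plain Python without Spark: every ordered pair of pages yields one (row, column, value) cell, which is the same coordinate form a distributed coordinate matrix stores. This is only an illustration of the data layout, not the talk's Spark code; `similarity_entries`, `above_threshold`, and `toy_similarity` are hypothetical names, and the toy similarity stands in for the aggregated measure.

```python
from itertools import product


def similarity_entries(items, similarity):
    """Yield (row, column, value) for every ordered pair: n x n cells."""
    indexed = list(enumerate(items))
    for (i, a), (j, b) in product(indexed, repeat=2):
        yield (i, j, similarity(a, b))


def above_threshold(entries, threshold):
    """Off-diagonal cells whose similarity reaches the threshold."""
    return [(i, j) for i, j, value in entries if i != j and value >= threshold]


def toy_similarity(a, b):
    """Stand-in measure: Jaccard similarity of two sets of class names."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


pages = [{"nav", "price"}, {"nav", "price", "sale"}, {"footer"}]

# Compute the expensive cells once and keep them ...
entries = list(similarity_entries(pages, toy_similarity))

# ... then try several thresholds against the stored cells.
for threshold in (0.5, 0.9):
    print(threshold, above_threshold(entries, threshold))
```

Keeping the computed cells around is what makes the later threshold experiments cheap, since the pairwise similarities are the costly part.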

23 Clustering. Shared Near Neighbor Clustering* (Jarvis et al., 1973), with improvements. Graph-based implementation: Spark GraphX for the win! * Jarvis, R. A., & Patrick, E. A. (1973). Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers, 100(11),

24 What's good about this algorithm? What's the difficulty with the most popular k-means? Prior knowledge of clusters? Mean/average of documents in a cluster? Average of DOM trees? Average of CSS styles? Circular/spherical/globular shapes? Shared near neighbor clustering: similarity matrix (pluggable similarity measures, generic); thresholds (numbers, percent of match) 24

25 Shared Near Neighbor Algorithm: if two data points share a threshold number of neighbors, then they must belong to the same cluster 25

26 Clustering Implementation. Similarity matrix to graph: clusters as nodes, similarity measures as edges. Check for similar neighbors. Filter on threshold and merge. Immutable: a new graph for the next iteration. Repeat 26
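The shared near neighbor rule can be sketched on a small in-memory similarity matrix. This is a simplified single-pass version in plain Python, not the iterative GraphX implementation described above: neighbors are chosen by a similarity threshold, two neighboring points are linked when they share at least a given number of other neighbors, and clusters are the connected groups of linked points. The function name, both threshold parameters, and the union-find merging are assumptions made for illustration.

```python
def snn_clusters(sim, sim_threshold, shared_threshold):
    """Cluster row indices of a square similarity matrix `sim`."""
    n = len(sim)
    # Neighbors of i: every other point at least sim_threshold similar to i
    neighbors = [
        {j for j in range(n) if j != i and sim[i][j] >= sim_threshold}
        for i in range(n)
    ]

    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in neighbors[i]:
            shared = neighbors[i] & neighbors[j]
            if i < j and len(shared) >= shared_threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
print(snn_clusters(sim, sim_threshold=0.8, shared_threshold=1))
```

In the sample matrix, pages 0, 1 and 2 each share a third neighbor and end up together, while pages 3 and 4 are similar to each other but share no other neighbor, so they stay apart unless `shared_threshold` is lowered to 0.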

27 Shared Near Neighbor Clustering on Apache Spark GraphX 27

28 Challenges Tree Edit Distance is very expensive 28

29 What's ahead on the road? Integrate into Apache Nutch. Auto extraction: unsupervised learning on the structure of pages to scrape the actual data of the web page. Faster tree edit distance, maybe with approximation techniques 29

30 Demo 30

31 Summary: example scenario; similarity measures; clustering as a solution; demo 31

32 Acknowledgements. Dr. Chris Mattmann: my mentor; Professor, Director at USC; Director, Apache Software Foundation. DARPA Memex project 32

33 Thank You! Source code; tutorial; follow up: Thamme Gowda, Chris Mattmann 33

Clustering Web Pages Based on Structure and Style Similarity

2016 IEEE 17th International Conference on Information Reuse and Integration Clustering Web Pages Based on Structure and Style Similarity Thamme Gowda 1 and Chris Mattmann 1,2 1 University of Southern