Focused Crawling with Scalable Ordinal Regression Solvers

1 Focused Crawling with Scalable Ordinal Regression Solvers
Rashmin Babaria, J Saketha Nath, Krishnan S, KR Sivaramakrishnan, Chiranjib Bhattacharyya, M N Murty
Department of Computer Science and Automation, Indian Institute of Science, India

2 Focused Crawling & Large scale OR
Focused crawling
  Given a topic (seed pages), find relevant pages on the web
  Pose as a large scale OR problem
Ordinal Regression
  Fast OR training algorithm that scales to millions of datapoints
  Fast algorithm to solve an SOCP with one SOC constraint
  Low prediction time

3 Baseline OR Formulation [Chu & Keerthi, 2005]
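As a rough illustration of the threshold model commonly used for ordinal regression in the style of [Chu & Keerthi, 2005], the sketch below assigns a rank by counting how many ordered thresholds a linear score exceeds. The names `predict_rank`, `w` and `thresholds`, and the example numbers, are assumptions made for illustration and do not come from the slides.

```python
from bisect import bisect_left


def predict_rank(w, x, thresholds):
    """Return a rank in 1..r for the feature vector x.

    Assumed threshold model: score = w . x, with sorted thresholds
    b_1 <= ... <= b_{r-1}; the rank is 1 plus the number of thresholds
    strictly below the score.
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    # bisect_left returns how many thresholds are strictly less than score.
    return 1 + bisect_left(thresholds, score)


# Hypothetical model with r = 4 ranks.
w = [1.0, 0.5]
thresholds = [-1.0, 1.0, 4.0]
print(predict_rank(w, [2.0, 2.0], thresholds))
```

Prediction costs one dot product plus a binary search over the r - 1 thresholds, which is one reason a model with few support vectors predicts quickly.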

8 Clustering based scalable OR Formulation
Describe data using clusters instead of data points
Class conditional distributions: mixture models with spherical covariance
Using second order moments (µ, σ²I), classify clusters
Proposed formulation will have constraints per cluster
Size of optimization problem: O(clusters) rather than O(datapoints)

9 Proposed OR formulation's solution

12 Proposed OR formulation
Features:
SOCP problem with one SOC constraint
T_train = T_clust + T_SOCP = O(n)
Cluster moments estimated using BIRCH [Zhang et al., 1996]: T_clust = O(n)
SOCP solved using SeDuMi: T_SOCP is independent of n
Can be kernelized using input space cluster moments
Number of support vectors is at most k, so prediction time is low
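To make the moment estimation concrete, here is a minimal sketch of how a cluster's mean and spherical variance can be read off a BIRCH-style clustering feature (N, LS, SS), where N is the point count, LS the per-dimension linear sum and SS the scalar sum of squared norms. The function names are hypothetical, and averaging the variance over the d dimensions is an assumption about how the spherical σ² is defined, not something the slides state.

```python
def cf_add(cf, point):
    """Fold one point into a clustering feature (N, LS, SS) in O(d) time."""
    n, ls, ss = cf
    new_ls = [l + p for l, p in zip(ls, point)]
    new_ss = ss + sum(p * p for p in point)
    return (n + 1, new_ls, new_ss)


def cf_moments(cf):
    """Return (mu, sigma2) for a cluster summarised by (N, LS, SS).

    Assumption: spherical covariance sigma2 * I, with sigma2 taken as the
    mean squared distance to the centroid divided by the dimension d.
    """
    n, ls, ss = cf
    d = len(ls)
    mu = [l / n for l in ls]
    sigma2 = (ss / n - sum(m * m for m in mu)) / d
    return mu, max(sigma2, 0.0)  # guard against tiny negative round-off


# Build the summary for four 2-d points in a single pass over the data.
cf = (0, [0.0, 0.0], 0.0)
for point in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]:
    cf = cf_add(cf, point)
print(cf_moments(cf))
```

Because each point only updates three running sums, the moments of every cluster are available after one pass over the data, consistent with the O(n) clustering cost listed above.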

13 Clustering + SOCP gives speedup
Table: Training times (sec) with SeDuMi and SMO-OR [Chu & Keerthi, 2005] on synthetic dataset.
Columns: S-Rate, S-Size, SMO-OR, SeDuMi
Table: Training times (sec) and test error rate with SeDuMi and SMO-OR [Chu & Keerthi, 2005] on the CS (Census) dataset.
Columns: S-Size, SMO-OR sec (err), SeDuMi sec (err)
5, (.128) 20.4 (.109)
11, (.107) (.112)
15, (.107) (.108)
22, (.119)

14 Large number of clusters is still challenging
Table: Training times (sec) and test error rate with SeDuMi and SMO-OR [Chu & Keerthi, 2005] on the CH (California Housing) dataset.
Columns: S-Size, SMO-OR sec (err), SeDuMi sec (err)
10, (.619) 112 (.623)
13, (.616) (.634)
15, (.617)
17, (.617)
20, (.62)

15 CB-OR Solver
Key idea: exploit the special SOCP form
SOCP problem with one SOC constraint
Erdougan et al., 2006: specialized solvers scale better
Fast algorithm similar in spirit to Platt's SMO for QP
Features:
More scalable than generic solvers
Easy to implement, uses no optimization tools

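16 CB-OR Solver
Rewrite the dual as follows:
$$\min_{\alpha,\alpha^*}\; W\sqrt{(\alpha^*-\alpha)^\top K(\alpha^*-\alpha)} - d^\top(\alpha+\alpha^*)$$
$$\text{s.t.}\quad 0\le\alpha\le 1,\quad 0\le\alpha^*\le 1,\quad s_i\le s_i^*,\ i=1,\dots,r-2,\quad s_{r-1}=s_{r-1}^*$$
K is the Gram matrix for cluster centers,
$$s_i=\sum_{k=1}^{i}\sum_{j=1}^{n_k}\alpha_j^k \quad\text{and}\quad s_i^*=\sum_{k=2}^{i+1}\sum_{j=1}^{n_k}\alpha_j^{*k}$$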

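17 CB-OR Solver
Minimization with respect to two multipliers:
$$\min_{\Delta\alpha}\; \sqrt{a(\Delta\alpha)^2+2b(\Delta\alpha)+c}-e\,\Delta\alpha \quad\text{s.t.}\quad lb\le\Delta\alpha\le ub$$
Has a closed form solution:
$$\Delta\alpha=\begin{cases}\left[\dfrac{e}{a}\sqrt{\dfrac{ac-b^2}{a-e^2}}-\dfrac{b}{a}\right]_{lb}^{ub} & \text{if } ac-b^2>0,\ a-e^2>0\\[2ex] \left[-\dfrac{b}{a}\right]_{lb}^{ub} & \text{if } ac-b^2=0,\ a-e^2>0\\[2ex] ub & \text{if } e-\sqrt{a}\ge 0\\ lb & \text{if } e+\sqrt{a}\le 0\end{cases}$$
where $[x]_{lb}^{ub}$ denotes x clipped to the interval [lb, ub].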
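The one-dimensional step can be sketched in a few lines, assuming the objective is the square root of the quadratic minus the linear term, that is sqrt(a t² + 2 b t + c) - e t, with a > 0 and ac - b² >= 0. The function name `solve_1d_step` and the example coefficients are hypothetical. The unconstrained stationary point is obtained by setting the derivative to zero and is then clipped to [lb, ub], which is valid because the objective is convex in t.

```python
import math


def solve_1d_step(a, b, c, e, lb, ub):
    """Minimise sqrt(a*t**2 + 2*b*t + c) - e*t subject to lb <= t <= ub.

    Assumes a > 0 and a*c - b*b >= 0, so the quadratic is never negative.
    """
    root_a = math.sqrt(a)
    if e - root_a >= 0:        # objective never increases in t: go to ub
        return ub
    if e + root_a <= 0:        # objective never decreases in t: go to lb
        return lb
    disc = a * c - b * b
    if disc > 0:
        # Stationary point of the smooth convex objective.
        t = (e * math.sqrt(disc / (a - e * e)) - b) / a
    else:
        # Degenerate case: the square root reduces to |a*t + b| / sqrt(a).
        t = -b / a
    return min(max(t, lb), ub)  # clip to the box constraint


print(solve_1d_step(2.0, 1.0, 3.0, 0.5, -5.0, 5.0))
```

Since every update is a handful of arithmetic operations, an SMO-style solver built on this step needs no external optimization library.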

18 CB-OR Solver
CB-OR Algorithm
Step 1: Pick the two worst KKT violators.
Step 2: Solve the 1-d minimization problem.
Step 3: Update the unknowns.
Step 4: Check for KKT violators. If there are none, terminate; otherwise go to Step 1.

19 CB-OR Evaluation
Figure: Training time in seconds versus number of clusters on a synthetic dataset. The dashed line represents training time with SeDuMi and the continuous line that with CB-OR.

20 CB-OR Evaluation
Table: Comparison of training times (in sec) with CB-OR, SMO-OR and SeDuMi on benchmark datasets. The test set error rate is given in brackets. (CH: California Housing, CS: Census.)
Columns: S-Size, CB-OR sec (err), SMO-OR sec (err), SeDuMi sec
CH
10,320.5 (.623) (.619)
, (.634) (.616)
15, (.618) 1142 (.617)
17, (.621) 1410 (.617)
20, (.62) (.62)
CS
5,690.3 (.109) 893 (.128)
,393.7 (.112) (.107)
15,191 1 (.108) (.107)
, (.119)

21 Focused Crawling
Given a topic (seed pages), find relevant pages on the web.
S. Chakrabarti et al. (1999, 2002), C. Aggarwal et al. (2001), M. Diligenti et al. (2000)
Requires low bandwidth and low disk space.
Short update cycle.

22 Baseline Focused Crawler [Chakrabarti et al., 1999]

23 Topic Taxonomy

31 Exploit link structure
Grangier and Bengio observe that hyperlinked documents are semantically closer.
Pages one link away are more similar to the seed pages than pages two links away.

32 Link structure in web

35 Focused crawling as OR problem: exploit link structure
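One way to turn the link-structure observation into ordinal training labels is to rank pages by their link distance from the seed set: seeds get rank 1, pages one link away rank 2, and so on. The sketch below does this with a breadth-first search. It is an illustration under assumptions: the function name `link_distance_ranks`, the toy graph, the use of outgoing links and the cut-off `max_rank` are all hypothetical and are not taken from the slides.

```python
from collections import deque


def link_distance_ranks(graph, seeds, max_rank):
    """Assign rank 1 to seeds and rank k+1 to pages k links away (BFS).

    graph maps a page to the pages it links to; pages farther than
    max_rank - 1 links from every seed are left unlabelled.
    """
    ranks = {seed: 1 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if ranks[page] == max_rank:
            continue  # do not expand beyond the last ordinal class
        for target in graph.get(page, ()):
            if target not in ranks:  # first visit gives the shortest distance
                ranks[target] = ranks[page] + 1
                queue.append(target)
    return ranks


# Hypothetical link graph around a single seed page.
graph = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "seed"],
    "c": ["d"],
}
print(link_distance_ranks(graph, ["seed"], max_rank=3))
```

Labels built this way need no explicit negative class: pages are only ordered by how close they are to the seeds.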

39 Baseline architecture

40 Proposed architecture

41 Focused crawling is a large scale OR problem
Columns: Category, Seed
Categories: NASCAR, Soccer, Cancer, Mutual Funds

42 NASCAR harvest rate

43 Cancer harvest rate

44 Mutual Funds harvest rate

45 Harvest rate comparison
Columns: Dataset, Baseline, OR
Datasets: NASCAR, Cancer, Mutual Fund, Soccer
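For reference, harvest rate is usually taken to be the fraction of fetched pages that are relevant to the topic. The sketch below computes it overall and over a sliding window of recent fetches, assuming binary relevance judgements are available for each fetched page. The function names and the sample flags are hypothetical and are not the experimental data from the slides.

```python
def harvest_rate(relevant_flags):
    """Fraction of fetched pages judged relevant (0.0 for an empty crawl)."""
    flags = list(relevant_flags)
    return sum(1 for f in flags if f) / len(flags) if flags else 0.0


def running_harvest_rate(relevant_flags, window):
    """Harvest rate over the most recent `window` fetches after each fetch."""
    flags = [1 if f else 0 for f in relevant_flags]
    rates = []
    for i in range(len(flags)):
        recent = flags[max(0, i - window + 1): i + 1]
        rates.append(sum(recent) / len(recent))
    return rates


# Hypothetical relevance judgements in fetch order.
fetched = [True, False, True, True]
print(harvest_rate(fetched))
print(running_harvest_rate(fetched, window=2))
```

The windowed version is the kind of curve typically plotted against the number of pages fetched when comparing crawlers.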

46 Conclusions
Proposed a scalable clustering based OR formulation
  Training time O(datapoints)
  Support vectors O(clusters)
Exploited the special structure of the formulation to develop a fast solver, CB-OR
  Scalable to tens of thousands of clusters
We formulated focused crawling as large scale ordinal regression
  No need for a negative class definition
  Independent of topic taxonomy
  OR captures the link structure of the web graph

47 Focused crawler code available at

48 Acknowledgments
This project is partially supported by AOL India Pvt Ltd and DST, Government of India (DST/ECA/CB/660)

49 Questions?
