LaSEWeb: Automating Search Strategies over Semi-Structured Web Data


1 LaSEWeb: Automating Search Strategies over Semi-Structured Web Data
Oleksandr Polozov (University of Washington), Sumit Gulwani (Microsoft Research)
KDD 2014, August 27, 2014

2 Motivation: search engine micro-segments



7 Repetitive search tasks
Structured databases:
Precise, but limited in content
No time-sensitive information
Provide no context (sources)
Web mining scripts:
Two extremes: powerful ML, which has to be relearned for each micro-segment, or a fragile HTML layout parser
Inaccessible for end-users

8 LaSEWeb
Query language:
A semantic scripting language for semi-structured information extraction from the Web
Models natural patterns of human search strategies
LaSEWeb interpreter:
Explores multiple webpages, clusters different answer candidates, and provides context for each answer
Makes use of state-of-the-art NLP/ML/PL algorithms


13 Example: phone number
v = ("Sumit Gulwani")
let η_t = Emphasized(v_1) in
let η_b = AttributeLookup(Syn("phone"), l_a) in
Union(η_t, η_b)
where Regex(l_a, "\(\d+\)\w\d+\W\d+")
where Layout(η_t, η_b, Down) and Nearby(η_t, η_b)
Features used by the query: visual attributes, implicit table detection, linguistic patterns, clustering across webpages
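The AttributeLookup and Regex parts of the query can be illustrated with a minimal Python sketch. It is not the LaSEWeb implementation: the synonym set, the phone pattern, and the function name `attribute_lookup` are assumptions made for illustration, and the attribute/value pairs are assumed to have already been produced by some table-detection step.

```python
import re

# Hypothetical synonym table standing in for Syn("phone"); the slides describe
# synonymy detection, not a fixed list like this one.
PHONE_SYNONYMS = {"phone", "telephone", "tel", "mobile"}

# Illustrative phone pattern (an assumption, not the regex from the slide).
PHONE_RE = re.compile(r"\(?\d{3}\)?\W\d{3}\W\d{4}")


def attribute_lookup(pairs, synonyms, value_re):
    """Return values whose attribute name is a synonym and whose text matches value_re."""
    hits = []
    for name, value in pairs:
        key = name.strip().lower().rstrip(":")
        if key in synonyms and value_re.fullmatch(value.strip()):
            hits.append(value.strip())
    return hits


# Toy attribute/value pairs, as they might come out of implicit table detection.
pairs = [
    ("Email", "someone@example.com"),
    ("Telephone:", "(425) 555-0100"),
    ("Office", "99/1234"),
]
matches = attribute_lookup(pairs, PHONE_SYNONYMS, PHONE_RE)
print(matches)
```

The sketch only covers the structural branch of the query; the Emphasized, Layout, and Nearby constraints would additionally need a rendered page, which is outside its scope.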

14 Language Structure
Visual patterns
Match: webpage layout, style, end-user appearance
Use: in-memory rendering, DOM analysis
Constructs: Nearby, Emphasized, Layout, CSS
Structural patterns
Match: relational patterns on implicit tables
Use: table detection, plain text analysis using programming-by-example technologies
Constructs: VLOOKUP, AttributeLookup
Linguistic patterns
Match: semantic text properties
Use: POS tagging, sentence parsing, entity recognition, synonymy detection
Constructs: Syn, POS, Entity, NP, SameSentence
References:
[1] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL.
[2] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL.
[3] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL.
[4] C. Quirk, P. Choudhury, J. Gao, H. Suzuki, K. Toutanova, M. Gamon, W.-t. Yih, L. Vanderwende, and C. Cherry. MSR SPLAT, a language analysis toolkit. In ACL.
[5] W.-t. Yih, G. Zweig, and J. C. Platt. Polarity inducing latent semantic analysis. In ACL.
[6] S. Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL.
[7] M. J. Cafarella, A. Halevy, and J. Madhavan. Structured data on the web. CACM 54.2 (2011).


20 Program interpreter: user emulation algorithm
Inputs: v = "computer", LaSEWeb inventors MS script, seed query
LaSEWeb Engine
Answer candidates: John Atanasoff, John Vincent Atanasoff, Charles Babbage, Babbage, C., konrad zuse
Cluster score: $\mathrm{score}(C_i) = \frac{1}{|U|} \sum_{j=1}^{|U|} \sum_{s \in C_i} \frac{c(s, u_j)}{c(u_j)}$
Ranked answers with sources:
John Atanasoff (14.5%): http://www.computerhope.com, http://www.ehow.com
Charles Babbage (10.5%): http://www.buzzle.com
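The cluster scoring step can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: it reads the slide's formula as averaging, over all explored pages u_j, the fraction of candidate strings on that page that belong to cluster C_i, and it assumes c(s, u_j) is the number of times string s was extracted from page u_j and c(u_j) is the total number of extractions on that page. The function name `score_clusters` and the toy data are hypothetical.

```python
from collections import Counter


def score_clusters(clusters, pages):
    """Average, over pages, the share of each page's candidates that fall in a cluster.

    clusters: label -> set of answer strings grouped as the same answer
    pages: list of pages, each a list of candidate strings extracted from it
    """
    scores = {}
    for label, members in clusters.items():
        total = 0.0
        for page in pages:
            counts = Counter(page)           # c(s, u_j) for every string s
            page_total = sum(counts.values())  # c(u_j)
            if page_total == 0:
                continue  # a page with no candidates contributes 0
            total += sum(counts[s] for s in members) / page_total
        scores[label] = total / len(pages) if pages else 0.0
    return scores


# Toy data for illustration only; these are not results from the talk.
clusters = {
    "John Atanasoff": {"John Atanasoff", "John Vincent Atanasoff"},
    "Charles Babbage": {"Charles Babbage", "Babbage, C."},
    "konrad zuse": {"konrad zuse"},
}
pages = [
    ["John Atanasoff", "Charles Babbage"],
    ["John Vincent Atanasoff", "John Atanasoff", "konrad zuse", "Babbage, C."],
]
scores = score_clusters(clusters, pages)
for label, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {value:.1%}")
```

Normalizing per page before averaging keeps a single page that repeats one name many times from dominating the ranking, which fits the idea of exploiting redundancy across multiple webpages.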


21 Experiments
~95% precision and 71% recall on factoid micro-segments
For micro-segments: precision measured by random sampling, based on top-3 results
For end-user repetitive search tasks: precision/recall measured manually
Average execution time: ~5 sec/webpage
Depends on the rendering settings
Current setting: offline deployment / database population
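One possible way to compute precision and recall over top-3 results is sketched below in Python. The slide does not spell out the evaluation protocol, so the definitions here are assumptions: a query counts as answered if the system returned anything, and as correct if an accepted answer appears among its top k results. The function name `precision_recall_at_k` and the toy queries are hypothetical.

```python
def precision_recall_at_k(results, gold, k=3):
    """Compute (precision, recall) over queries, judging only the top-k answers.

    results: query -> ranked list of answers (empty or missing if nothing was found)
    gold: query -> set of accepted answers
    """
    answered = [q for q in gold if results.get(q)]
    correct = [q for q in answered if any(a in gold[q] for a in results[q][:k])]
    precision = len(correct) / len(answered) if answered else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy data for illustration only; these are not the experiments from the talk.
gold = {"q1": {"a"}, "q2": {"b"}, "q3": {"c"}, "q4": {"d"}}
results = {
    "q1": ["a", "x"],            # correct at rank 1
    "q2": ["x", "y", "b"],       # correct at rank 3
    "q3": ["x", "y", "z", "c"],  # correct answer falls outside the top 3
    "q4": [],                    # no answer returned
}
precision, recall = precision_recall_at_k(results, gold, k=3)
print(f"precision={precision:.2f} recall={recall:.2f}")
```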

22 Summary & Future work
Typical patterns of human search strategies in a scripting language for IE:
Match semi-structured Web content
Existing cross-disciplinary technologies used as building blocks
Exploit information redundancy across multiple webpages
Applications:
1. Micro-segments of factoid questions in search engines
2. Repeatable batch data extraction tasks for end-users
3. Structured database population from free Web text
4. English language comprehension problem generation
Future work:
Automatic query execution plans in the language
Integration with natural language logic engines

23 Summary & Future work
Example for application 4 (English language comprehension problem generation):
1. The principal characterized his pupils as ____ because they were pampered and spoiled by their indulgent parents.
2. The commentator characterized the electorate as ____ because it was unpredictable and given to constantly shifting moods.
(a) cosseted (b) disingenuous (c) corrosive (d) laconic (e) mercurial


25 Thanks for listening! Questions?
