ExpertSearch. Data Sciences Summer Institute, University of Illinois at Urbana-Champaign. July 1, 2011


1 ExpertSearch Data Sciences Summer Institute, University of Illinois at Urbana-Champaign, July 1, 2011

2 Expert Search Group

3 Expert Search Goal Expert Search is a search engine that, given a paper abstract or a list of topics, returns a list of people who are experts in the relevant area of study. Paper Abstract or Topic -> Expert Search -> Expert #1, Expert #2, Expert #3, Expert #4

4 Expert Search Current Use Generate a list of professors who should be invited to a talk on campus. Summary of Talk -> List of people to

5 Presentation Overview

6 Data Crawling

7 Data Crawling Team Oluwadare Ibiyemi Fitzroy Nembhard Froswell Wallace Thapanapong Rukkanchanunt

8 Data Crawling Documents 4445 Experts 987 Mb of Text 247 Programs of Study

9 Data Crawling crawl retrieve obtain save store

10 Data Crawling Get the homepage URL from the UIUC phonebook. Use a search engine to obtain the URL if it is not in the UIUC phonebook. Crawl the homepage for PDF files and process them. Terminate the search at 300 HTML pages. Store the number of documents, the homepage link, and the text crawled from the homepage.
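The crawl steps above can be sketched as a breadth-first traversal with a page cap. This is a minimal sketch, not the team's crawler: the function name `crawl_homepage`, the injected `fetch` callable, the regex-based link extraction, and the restriction to links on the same host are all assumptions made for illustration. Only the 300-page limit and the stored fields come from the slide.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Naive href extraction; a real crawler would use a proper HTML parser.
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def crawl_homepage(start_url, fetch, max_pages=300):
    """Breadth-first crawl from an expert's homepage (hypothetical helper).

    fetch(url) must return the page's HTML as a string, or None on failure.
    Crawling stops once max_pages HTML pages have been stored.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}       # url -> HTML text crawled
    pdf_links = []   # PDF files found, to be processed separately

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for href in HREF_RE.findall(html):
            link = urljoin(url, href).split("#")[0]
            # Assumption: stay on the homepage's own host.
            if link in seen or urlparse(link).netloc != host:
                continue
            seen.add(link)
            if link.lower().endswith(".pdf"):
                pdf_links.append(link)
            else:
                queue.append(link)

    return {
        "homepage": start_url,
        "num_documents": len(pdf_links),  # assumption: documents = PDFs found
        "pdf_links": pdf_links,
        "pages": pages,
    }
```

Passing `fetch` in as a parameter keeps the traversal logic separate from network access, so the page limit can be checked against an in-memory site before pointing it at real URLs.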

11 Data Crawling

12 Data Crawling: Tools

13 Data Classification

14 Classification & Extraction Kendra Clay Eunki Kim Victoria Ko Bekah Van Maanen

15 Classification & Extraction Classification Information Extraction

16 Classification & Extraction Classification Information Extraction

17 Classification Task #1 The first task of classification is to determine whether the URL listed for the expert is his or her homepage. HTML Text Files -> Classified Homepages

18 Extraction Task #2 Information (Keywords) Extraction. Two types: 1. Homepages / 2. Papers. Output is used by Information Retrieval to match a search query to an expert. Expert Text Files -> Expert Interests

19 Extraction Task #2 Expert Text Files (1. HTML code, 2. Parsed Text Files) -> Use methods or apply rules -> Extract Interests -> Expert Interest Text File

20 Extraction Task #2.1 - Challenges Homepages come in various formats. Rules must be set to deal with the various cases.

21 Extraction Task #2.1 - Rules Step 1: Extraction rules - Get a big chunk of information - Example tokens: Research Areas, Interests, Specialization, Areas of Expertise, Field of Study. Step 2: Iteration rules - Find what format the chunk is in and refine the found information - Example formats: List, Comma, Table, Paragraph, Link. Repeat Step 1 and Step 2 until the right part of the information is found.

22 Extraction Task #2.1 - Example Webpage sections: Profile, Research Areas, Courses, Education, Publications. Step 1: Find "Research Areas" (items: Soil erosion and sediment control; Water quality and management). Step 2: Define what format it is; here it is a list with <ul>, so apply the iteration rule with <li>.
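The two-step rule can be illustrated with a small sketch that handles only the List and Comma formats. The function name `extract_interests` and the regex-based parsing are assumptions made for illustration; the heading tokens are the ones listed on the slide, and the original rule set (including Table, Paragraph, and Link handling) is not reproduced here.

```python
import re

# Step 1 tokens, taken from the slide.
HEADING_TOKENS = [
    "Research Areas", "Interests", "Specialization",
    "Areas of Expertise", "Field of Study",
]
TAG_RE = re.compile(r"<[^>]+>")


def extract_interests(html):
    """Return a list of interest strings found after a known heading token."""
    # Step 1: extraction rule. Locate a heading token and keep what follows.
    pattern = "|".join(re.escape(t) for t in HEADING_TOKENS)
    match = re.search(pattern, html, re.IGNORECASE)
    if not match:
        return []
    chunk = html[match.end():]

    # Step 2: iteration rule. Decide the format and refine the chunk.
    # List format: iterate over the <li> items of the first <ul>.
    # Limitation of this sketch: any later <ul> in the page also matches.
    ul = re.search(r"<ul[^>]*>(.*?)</ul>", chunk, re.IGNORECASE | re.DOTALL)
    if ul:
        items = re.findall(r"<li[^>]*>(.*?)</li>", ul.group(1),
                           re.IGNORECASE | re.DOTALL)
        cleaned = [TAG_RE.sub("", item).strip() for item in items]
        return [c for c in cleaned if c]

    # Comma format: take the first block after the heading and split on commas.
    block = re.split(r"</p>|<br\s*/?>|</div>", chunk, maxsplit=1,
                     flags=re.IGNORECASE)[0]
    text = TAG_RE.sub("", block).lstrip(": \n")
    return [part.strip() for part in text.split(",") if part.strip()]
```

On a page shaped like the slide's example, the list branch would return the two `<li>` entries under "Research Areas".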

23 Extraction Task #2.2 - Challenges Papers (PDFs) include some non-word text (e.g. mathematical notation) that may be incorrectly identified as keywords. Solution: take only the abstracts from papers. How long should a keyword be to be useful in associating it with an expert? A maximum length must be defined. How can we identify keywords? Part-of-speech tags, noun phrases.

24 Extraction Task #2.2 Tools Illinois Chunker. Above: part of the abstract from "Pictorial Structures for Object Recognition" by P. Felzenszwalb and D. Huttenlocher. - NP -> Candidate Keywords - Calculate weight for these potential keywords - Take the top 10 highest-weight noun phrases

25 Extraction Task #2.2 Tools Rapid Automatic Keyword Extraction (RAKE). Frequency: total # of word occurrences. Degree: (total # of individual occurrences of word in document) + (length of each noun phrase the word appears in). Word score: $s(w) = \deg(w)/\mathrm{freq}(w)$. NP score: $np\_s(w) = s(w_1) + s(w_2) + \dots + s(w_k)$, where $w = (w_1\, w_2 \dots w_k)$ is a noun phrase and $s(w_k)$ is the individual word score for word $w_k$.
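The scoring above can be sketched as follows, taking already-chunked noun phrases as input. This is a sketch under one stated assumption: degree is computed in the standard RAKE way, as the sum of the lengths of the phrases a word occurs in, which is one reading of the slide's definition. The function names `rake_scores` and `top_keywords` are hypothetical.

```python
from collections import defaultdict


def rake_scores(phrases):
    """Score each noun phrase as the sum of its word scores deg(w)/freq(w)."""
    freq = defaultdict(int)  # freq(w): number of occurrences of w
    deg = defaultdict(int)   # deg(w): summed length of phrases containing w
    for phrase in phrases:
        words = phrase.lower().split()
        for word in words:
            freq[word] += 1
            deg[word] += len(words)

    word_score = {w: deg[w] / freq[w] for w in freq}
    return {
        phrase: sum(word_score[w] for w in phrase.lower().split())
        for phrase in phrases
    }


def top_keywords(phrases, k=10):
    """Return the k highest-scoring phrases (the slide keeps the top 10)."""
    scores = rake_scores(phrases)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Words that appear mostly inside longer phrases get a higher score than words that often stand alone, so multi-word noun phrases tend to rank above single common words.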

26 Classification & Extraction: Tools

27 Topic Modeling

28 Topic Modeling Team Pradip Karki Sam Somuah

29 Topic Modeling Goal: To discover latent topics in the bag of words associated with each expert. Process: Latent Dirichlet Allocation and Gibbs Sampling. Expert Text Files -> Distribution of words over topics, topics over experts
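For readers unfamiliar with the process named above, a collapsed Gibbs sampler for Latent Dirichlet Allocation can be sketched in a few dozen lines. This is a textbook-style illustration, not the team's implementation: the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, and the iteration count are assumptions. Each "document" here stands for one expert's bag of words.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of token lists, one per expert.
    Returns (theta, phi): topics over experts, and words over topics.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    wid = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    ndk = [[0] * num_topics for _ in docs]       # expert-topic counts
    nkw = [[0] * V for _ in range(num_topics)]   # topic-word counts
    nk = [0] * num_topics                        # tokens per topic
    z = []                                       # topic of every token

    # Random initial topic assignment.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][wid[w]] += 1
            nk[k] += 1
        z.append(assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = wid[w], z[d][i]
                # Remove the token, then resample its topic from the
                # conditional distribution given all other assignments.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    theta = [
        [(ndk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        {vocab[v]: (nkw[t][v] + beta) / (nk[t] + V * beta) for v in range(V)}
        for t in range(num_topics)
    ]
    return theta, phi
```

Here `theta[d][t]` plays the role of the expert-topic probability and `phi[t][word]` the topic-word probability, the two distributions the slide lists as output.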

30 Topic Modeling: Motivation Challenges: a large number of documents; experts have multiple areas of expertise. Topic modeling: reduces dimensionality by mapping to a limited number of topics; "hidden" topics can be discovered without the need for labeling.

31 Topic Modeling The probabilities of the topics and words associated with the topics are used to retrieve relevant results by the Information retrieval group

32 Topic Modeling Expert Text Files

33 Topic Modeling: Output Two tables: Expert-Topic (columns: ExpertID, TopicID, Prob) and Topic-Word (columns: TopicID, Word, Prob). Sample values: 87; data, mine, algorithm, pattern

34 Topic Modeling

35 Information Retrieval

36 Information Retrieval Team Sean Massung Fei Wu

37 Information Retrieval Given a user's query, the IR component acts as a search engine, ranking experts based on relevancy. Output: list of experts ordered by relevancy
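One common way to rank experts against a query is a query-likelihood language model with smoothing. The sketch below is offered only as an illustration of that general technique; whether it matches the ranking method used in this system is an assumption, as are the function name `rank_experts`, the Jelinek-Mercer smoothing, and the weight `lam`.

```python
import math
from collections import Counter


def rank_experts(query, expert_texts, lam=0.5):
    """Rank expert IDs by smoothed query log-likelihood, best first.

    expert_texts: dict mapping expert ID -> text associated with that expert.
    lam: weight on the expert's own text versus the whole collection.
    """
    docs = {e: Counter(t.lower().split()) for e, t in expert_texts.items()}
    collection = Counter()
    for counts in docs.values():
        collection.update(counts)
    collection_len = sum(collection.values())

    scores = {}
    for expert, counts in docs.items():
        doc_len = sum(counts.values())
        score = 0.0
        for word in query.lower().split():
            if collection[word] == 0:
                continue  # word unseen in every expert's text: no evidence
            p_coll = collection[word] / collection_len
            p_doc = counts[word] / doc_len if doc_len else 0.0
            score += math.log(lam * p_doc + (1 - lam) * p_coll)
        scores[expert] = score

    return sorted(scores, key=scores.get, reverse=True)
```

Smoothing against the whole collection keeps an expert from being ruled out just because one query word is missing from their text.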

38 Information Retrieval System Flow (diagram labels): HTTP POST Request; HTTP GET with Key; UI; Abstract or query; Key/List; Crawl data; Key; Database; List

40 Information Retrieval

41 Results As expected, longer queries produce more accurate results. The LM method is more accurate for short queries, whereas the TM method performs well on longer queries. Overall, we have good expert recall.

42 Information Retrieval

43 User Interface

44 User Interface Team Fitzroy Nembhard Jerone Dunbar

45 User Interface The user interface allows anyone to search for experts using paper abstracts, keywords, or department. Abstract, Keywords or Department -> Ranked Results

46 User Interface System Flow (diagram labels): HTTP POST Request; HTTP GET with Key; IR SYSTEM; Abstract or query; Key/List; Crawl data; Key; Database; List

47 User Interface

48 User Interface - Ranked List

49 User Interface

50 Expert Word Cloud Above: Screen shot of the word cloud.

51 Distribution of Experts Above: Distribution of experts across a geographical area, with the top ranking experts in red

52 User Interface: Tools

53 Summary Mobile application. Web application. DHS applications: Preparedness, Response and Recovery. Given a particular event, such as a natural disaster, have the capability of searching for experts who can help with the situation. Cuts across various sectors of the economy. For individual use in times of emergency, such as urgent medical conditions. Each component of the Expert Search

54 Questions?
