Entity Search Engine: Towards Agile Best-Effort Information Integration over the Web

Size: px

Start display at page:

Download "Entity Search Engine: Towards Agile Best-Effort Information Integration over the Web"

Kerry Wiggins
5 years ago
Views:

Entity Search Engine: Towards Agile Best-Effort Inforation Integration over the Web Tao Cheng, Kevin Chen-Chuan Chang University of Illinois at Urbana-Chapaign {tcheng3, kcchang}@cs.uiuc.edu.

The dual forces have enriched the Web with all kinds of data, uch beyond the conventional page view of the Web as a corpus of HTML pages, or docuents.

co), as Figure (a) shows. The richness of data, while a proising opportunity, has challenged us to effectively find data we need, fro one or ultiple sources.

The MetaQuerier syste, as we reported in CIDR 2005 [4], ais at finding and querying data sources on the deep Web.

data ebedded in unstructured result pages? (e.g., the query results of aazon.co and bn.co).

While data is proliferating, however, we are currently not able to effectively access such data. To otivate, consider a few scenarios, for user Ay: Scenario : To call Aazon.

A query like aazon custoer service phone ay not work; often a phone is siply shown without keyword phone (e.g., custoer service: 800-77-6688).

1 Entity Search Engine: Towards Agile Best-Effort Inforation Integration over the Web Tao Cheng, Kevin Chen-Chuan Chang University of Illinois at Urbana-Chapaign {tcheng3, INTRODUCTION The iense scale and wide spread has rendered the Web as an ultiate inforation repository as not only the sources where we find but also the destinations where we publish inforation. The dual forces have enriched the Web with all kinds of data, uch beyond the conventional page view of the Web as a corpus of HTML pages, or docuents. The Web has thus becoe a rich collection of data-rich pages, on the surface Web of static URLs (e.g., personal or copany hoepages) as well as the deep Web of database-backed contents (e.g., flights fro aa.co), as Figure (a) shows. The richness of data, while a proising opportunity, has challenged us to effectively find data we need, fro one or ultiple sources. In particular, we are otivated by, when building the Meta- Querier at UIUC, the need of large scale on-the-fly integration for online structured data. The MetaQuerier syste, as we reported in CIDR 2005 [4], ais at finding and querying data sources on the deep Web. However, as the last ile for etaquerying, when users can query ultiple sources on the fly [8], or when data is autoatically crawled fro sources [6], how do we identify and integrate the structured data ebedded in unstructured result pages? (e.g., the query results of aazon.co and bn.co). Further, we realized, as observed earlier, beyond our Meta- Querier experience, such data richness is pervasive fro the depth as well as the surface of the Web. While data is proliferating, however, we are currently not able to effectively access such data. To otivate, consider a few scenarios, for user Ay: Scenario : To call Aazon.co for her online purchase, how can Ay find their phone nuber? To begin with, what should be the right keywords for finding pages with such nubers? A query like aazon custoer service phone ay not work; often a phone is siply shown without keyword phone (e.g., custoer service: ). On the other hand, aazon custoer service could be too broad. Then, for each querying, she ust sift through the returned pages to dig for the phone nubers. This overall process can be unnec- This article is published under a Creative Coons License Agreeent ( You ay copy, distribute, display, and perfor the work, ake derivative works and ake coercial use of the work, but you ust attribute the work to the author and CIDR Cars.co BN.co Figure : The data-rich Web. Aazon.co AA.co essarily tie consuing. Scenario 2: To apply for graduate schools, how can she find the list of professors in the database area? She has to go through all the CS departent pages or, even worse, check faculty s hoepages one by one a very laborious process. Scenario 3: As a graduate student, Ay needs to prepare a seinar presentation for her choice of recent papers. Can she find papers that coe readily with presentations, i.e., a PDF file along with a PPT file, say fro SIGMOD 2006? Scenario 4: Now Ay wants to buy a copy of Shakespeare s Halet to read; how can she find the prices and cover iages of available choices fro, say, Borders.co and BN.co. She would have to look at and copare the results fro ultiple online bookstores one by one. In these scenarios, like every user in any siilar situations, Ay is looking for particular types of data, which we call entities, e.g., a phone nuber, a book cover iage, a PDF, a PPT, a nae, a date, an eail address, etc. She is not, we stress, looking for pages as relevant docuents to read, but entities as data for her subsequent tasks (to contact, to apply, to present, to buy, etc.). We are facing a dilea: On one hand, for accessing over the sheer size of the Web, we are ostly relying on search engines, such as Google, Yahoo, or MSN, which search for pages by keywords. On this extree, while being IR-style with a scalable text processing fraework, they are not data aware. On the other hand, integration services, such as Expedia.co or PriceGrabber.co, exist online for specific doains. On this extree, while providing DB-style precise querying, they can hardly scale the aount of data and the nuber of sources on the Web. As our solution, which this paper will propose, we believe the two extrees ust eet, with a synergistic arriage in

2 the iddle. Fro the search perspective: Can we build a search engine with data awareness? Fro the integration perspective: Can we develop an integration syste with scale awareness, even with liited best-effort seantics? As we will discuss in Section 3, the two perspectives together lead to our objective of entity search As a search syste, we propose our EntitySearch engine to search directly for target inforation entities, and holistically across their occurrences in all pages. As an integration syste, it will assue only a lightweight off-the-shelf entity extraction layer, and thus scalable to any sources; further, it will construct derived relations as the search output, on the fly, and thus flexible for ad-hoc user needs. In suary, this paper proposes the concept of entity search and its initial ipleentation as an agile best-effort fraework for realizing large scale inforation integration over the Web. We will otivate and define entity search in Section 2 and Section 3 respectively. Section 4 will then discuss our initial pilot ipleentation for exploring research issues. Section 5 will deonstrate the usefulness of entity search using real world scenarios. We will conclude in Section THE DILEMMA To access the Web, we ust resort to search engines or integration services. The state of the art, however, presents a clear dilea between data and scale, and we are thus lacking large scale data-based access to the Web. 2. Search Engines: Data Awareness? On one hand, our inforation access is ostly relying on current search systes, such as popular engines like Google, Yahoo, and MSN Search However, while reaching extensive parts of the Web, with their inherent page view of the Web, they are lacking even inial data awareness and thus incapable for such tasks as finding data entities. That is, current search systes are not data aware. In their rather siplistic page view, with a text retrieval syste as their backend engine, they naturally view the Web as siply a repository of interlinked docuents, or pages, each containing soe keywords upon which searches can be perfored. Our inforation access aounts to finding relevant pages by keywords regardless of what data entities we are looking for. In other words, fro this search perspective, we are lacking a search syste that is aware of data. Specifically, this lacking leads to two ajor liitations of current search systes: Indirect Input and Output. In ters of the input and output, current engines are searching indirectly. To begin with, to forulate queries as input, users cannot directly describe what they want. Ay has to forulate her needs indirectly as keyword queries, often in a non-trivial and non-intuitive way, with a hope to hit relevant pages that ay or ay not contain target entities. For Scenario, will aazon custoer service phone work? Then, as query output, users cannot directly get what they want. The engine will only take Ay, indirectly, to a list of pages, and she ust anually sift through the to find the phone nuber. Can we help users to search directly in both describing and getting what they want? Singular Matching Mechanis. In ters of the atching echanis, current search engines are finding each page singularly. Our target entities often coe fro ultiple pages. In Scenario, the sae phone nuber of Aazon ay appear in the copany s Web site, online user forus, or even blogs. In this case, we should collect, for each phone, all its occurrences fro ultiple pages as supporting evidences of atching. In Scenario 2, the list of professors probably cannot be found in any single page. In this case, again, we ust look at any pages to coe up with the list of proising naes (and each nae ay appear in ultiple pages, like the earlier case). Can we help users to search holistically for atching entities across the Web corpus as a whole, instead of each individual page? Thus, fro the search perspective, we are facing the challenge of building data awareness into Web search systes. As data proliferates on the Web, the classic view of the Web as a page repository is increasingly inadequate. A step forward, can we search for data, or specific entities, as our target directly and across all pages holistically? That is, while keyword-based search is scalable to the Web and easy to use, can it be orphed to be data aware? 2.2 Integration Systes: Scale Awareness? On the other hand, there already eerged any integration systes, such as the various specialized engines in coparison shopping (e.g., PriceGrabber.co, NextTag.co and BizRate.co) and vertical search services (e.g., Realtor.co for real estates, Expedia.co for airfares, and Google Base for various doains) However, while offering precise data-based access, with their inherent database view, they are not not designed with scale in ind, for the large aount of data-rich pages on the Web and the large diversity of user needs. That is, current integration systes are not scale aware. In their rather rigid database view, they naturally assue and can only query data prepared in a structured relational forat in certain prescribed scheas upon which SQL queries can be perfored. Our inforation access is thus significantly liited by the availability of data in such rigid scheas, as well as the types of queries that the prescribed schea can support. In other words, fro this integration perspective, we are lacking an integration syste that is aware of scale the proliferation of Web data and online users. In particular, because their database view requires data in well-structured forat, an integration syste ust build wrapper for each source, which works as a relation extractor for each site, to precisely extract its unstructured text into structured DBMS, upon which to perfor SQL queries. We believe that, while achieving structural rigor, this technique is not viable, with two liitations: Liited Sources. In ters of sources, it lacks scalability. As repeatedly reported [7, 5], per-source wrappers are not only laborious to build but also fragile to aintain, and thus a cost barrier. To incorporate yriad potential sources (and pages), we shall not rely on per-source training or construction. Liited Queries. In ters of queries, it lacks flexibility: A wrapper ust ake a hard decision on the scheas of the relations to extract e.g., for book, to extract a relation with forat, publisher, price, cover iage, or all? The ore coplex a schea is, the ore likely its extraction ay fail, while a sipler schee ay be useless for any users. Thus, fro the integration perspective, we are facing the challenge of building scale awareness into Web integration This is perhaps why today s Web integration systes work by assuing a sall set of pre-configured sources, by user subission (e.g., Google Base), or by relying on a centralized database, e.g., Sabre [3] for airfares and MLS [2] for real estate

3 rank rank 2 phone nuber PDF sigod6.pdf surajit2.pdf score PPT urls aazon.co/support.ht yblog.org/shopping Dell.co/supportors xyz.co hp.co sigod6.ppt surajit2.ppt score urls db.co,sigod.co s.co Figure 2: Query Results of Q and Q3. systes. While the prevalence of data on the Web presents novel opportunities for integration, this Web-scale scenario contrasts traditional settings and defeats current techniques We ust rethink not only new techniques but also realistic objectives. To bring yriads heterogeneous sources to eet ad-hoc users, as the scalability and flexibility andate, can we develop agile integration without rigid scheas, even with best effort seantics? 3. OUR PROPOSAL: ENTITY SEARCH Approaching the current barriers fro the dual perspectives, we are inspired that they see to converge to the sae solution: The dual perspectives represent the two extrees of the spectru: siple keyword search to fully transparent inforation integration. Meeting in the iddle, they suggest the arriage of scale-aware search and data-aware integration. Towards searching directly and holistically, for finding specific types of inforation, we thus propose to support entity search. First, as input, users forulate queries to directly describe what types of data they are looking for: She can siply specify what her target entities are, and what keywords ay appear in the context with a right answer. To distinguish between entities to look for and keywords in the context, we use a prefix #, e.g., #phone for the phone entity. Our scenarios will naturally lead to the following queries: Q: (aazon custoer service #phone) Q2: (#professor #university #research= database ) Q3: ow(sigod 2006 #pdf file #ppt file) Q4: (#title= halet #iage #price) In the queries, there are two coponents: () Context pattern, how will the target entities appear? Q says that #phone will appear with these keywords in the default pattern of near (or proxiity). We ay also explicitly specify the pattern, e.g., Q3 uses ow to ean order-window (the keywords ust appear before #pdf file and then #ppt file. (2) Content restriction: A target entity will atch any instances of that entity type, subject to optional restriction on their content values e.g., Q will atch every phone instance, while Q 2 will only atch research area database. (In addition to equality =, other restriction operators are possible, such as contain. ) Second, as output, users will get directly the entities that they are looking for. That is, as a query specifies what entity types are the targets, its results are those entity instances (or literal values) that atch the query, in a ranked order by their atching scores. (We will discuss this atching next.) Figure 2 shows soe exaple results for Q and Q3. Third, as search echanis, entity search will find atching entities holistically, where an instance will be found and atched in all the pages where it occurs; e.g., a #phone ay occur at ultiple URLs as Figure 2 shows. For each instance, all its atching occurrences will be aggregated to for the final ranking; e.g., a phone nuber that ore frequently occurs at where aazon custoer service is entioned ay rank higher than others. Thus, entity search will not only find entities as the priary result, but also return pages where each entity is found as the secondary result and supporting evidences. To support entity-based querying, the syste ust be fundaentally entity aware: As a departure fro current search engines built around the notions of pages and keywords, we ust generalize the to support entity as a first-class concept. With this awareness, as our data odel, we will ove fro the current page view, i.e., the Web as a docuent collection, to the new entity view, i.e., the Web as an entity repository. Upon this foundation, as our query echanis, we can then develop entity search, where users specify what they are looking for with keywords and target entities, as Q Q4 illustrated. We now foralize these notions. Data Model: Entity View. How should we view the Web as our database to search over? In the standard page view, the Web is a set of docuents (or pages) D = {d, d 2,..., d n}. Our data odel takes an entity view: We consider the Web as priarily a repository of entities: E = {E, E 2,..., E n}, where each E i is an entity type. For instance, to support Scenario (Section ), the syste ight be constructed with entities E = {E : #phone, E 2 : #eail}. Further, each entity type E i is a set of entity instances that are extracted fro the corpus, i.e., literal values of entity type E i that occurs soewhere in soe d D. We use e i to denote an entity instance of entity type E i. In our exaple, e.g., by recognizing phonenuber patterns (say, in a regular expression of digits) fro D, we ay extract #phone = { , , (27) ,...} As the starting point for data-aware search, this tagging, or entity extraction, is to recognize each entity fro Web pages. Entity extraction has been well studied in the context of general inforation extraction, and techniques abound, fro siple syntactic pattern atching to sophisticated statistical taggers, with any off-the-shelf tools available. We stress that, however, while ipleentations ay differ, such extractors are inherently iperfect, and entity search ust essentially deal with uncertainty. Used as a black box, in abstraction, an entity extractor will return, for each occurrence, the extraction confidence probability and the position of extraction. We thus transfor the page view into our entity view, E = {E, E 2,..., E n}. The Entity Search Proble: Given: An entity collection E = {E, E 2,..., E N } Input: Query: β-α(k, K 2,..., K l, E, E 2,..., E ) Output: t = e, e 2,..., e : sorted by score(t), the tuple score of t with respect to β and α. As input, as Section 3 discussed, an entity-search query is siilar to standard keyword queries, but now users can specify entity types, E, E 2,..., and E, in addition to keywords K,..., K l. Optionally, in a coplete for, a query can also specify a atching pattern α (e.g., ow in Q3), to restrict when an occurrence of an instance t = e, e 2,..., e is considered a atching tuple, and a scoring easure β, to specify how all

4 the atching instances are ranked. Depending on the ipleentation and application settings, the choices of the β pattern and the α can be syste built-in or user specified, and they together deterine the ranking scores. As output, for a query with target entities, each atching result is a -ary entity tuple, i.e., t = e, e 2,..., e, i.e., a cobined instance. Note that an entity tuple ay contain one or ultiple instances, each of which is associated with one entity type. For exaple, David, david@hotail.co, is an entity tuple of type #nae, #eail, #phone, and Canada, Ottawa of type #country, #capital city. The objective is to find, fro the space of E... E, the atching tuples in ranked order by how well they atch the query, i.e., how well entity tuple t and the specified keywords associate, as atched by pattern α and ranked by easure β, which result in the tuple score score(t). To suarize, and to put our proposal in perspectives, we exaine it for both the search and integration point of views: First, as a search fraework, entity search is data aware. With the probabilistic tagging of entities, we view the Web as a repository of entities. Users directly specify their target entities, and the syste holistically evaluates the atchings. Second, as an integration fraework, entity search is scale aware, towards agile requireent and best-effort seantics for large scale deployent: On one hand, as deployent requireent, it is agile, requiring only inial tagging of the data doain of interest It can be uniforly applied to any sources, without requiring building wrappers. We stress that each entity is independently extracted in a soft probabilistic sense, and thus requires no rigid full relation extraction that often andates per-source wrappers. It can adapt to diverse inforation needs, without fixed pre-scribed scheas. The desired entities are only associated by ad-hoc queries at query tie Thus, users ay ask #phone with ib thinkpad or bill gates, or they ask to pair #phone with, say, #eail for white house. Supporting such online atching and association is exactly the challenge (and usefulness) of entity search. On the other hand, as query seantics, it is only best effort, returning the result in ranked, best-first anner. In essence, the syste assues integration in a probabilistic sense : Given independent entities with probabilities, entity search finds atching instances that ost likely for a desired tuple. To understand the proise of the syste and the practical issues, we decided to start with a quick Version 0. pilot ipleentation, which is the focus of our deonstration. 4. THE PILOT SYSTEM To realize our proposal in Section 3, we ipleented a pilot syste for epirical study. First, as we abstracted in the Entity Search proble, we need to coe up with specific atching patterns α as well as scoring easure β. Figures 3 and 4 suarizes our currently ipleented α and β. To specify a tuple function operator, the user or the application will choose the exact easures to use. The α Measure: An α easure qualifies if atch of occurrences of entities atch the desired tuple, in the atching pattern α(x). In principle, any Web presentation features can be incorporated for pattern expression, e.g., linguistic, proxiity, or visual features. In this work, we treat webpages as linear docuents. Consequently, we ipleented several siple α(x) qualification rule phrase(x) x atches a phrase as sepcified uwn(x) all ters in x occur in window of size n, any order own(x) all ters in x occur in window of size n, in the order nnuwn(x) uwn(x), and the ters are nearest neighbors: i.e., the su of the distances fro the left ost ter to all others is the sallest, aong other choices. nnown(x) own(x), and the ters are nearest neighbors as above tf dtf i Figure 3: Pattern easures. f ( e,..., e ) E tscore f ( e,..., e ), where with D as corpus size: windowsize f ( e l windowsize f k i) ( j ) E = f ( e ) i= 2 j= D D cprod f ( e,..., e) = [ e~,..., e ~ ] α ( x) f ( e,..., e ) log f ( e ) i = i forula [ ~..., ]) [ ~, where (, ] ~ ( ~ D e e = D e,..., e ) ( ~ ~ 2 e. pos e. pos) [ ~ e,..., ~ ] ( ) e x α i= conf ( ei) Figure 4: Scoring easures. position-based, page-bounded α easures as Figure 3 suarizes. The β Measures: Once instances of entities are atched, they for tuples. A β easure quantifies how proising a tuple e,...,e is. A tuple e,...,e ay appear in the corpus any ties let [ẽ,..., ẽ ] be one such occurrence, i.e., [ẽ,..., ẽ ] α(x). We note that a scoring function β for deterining the tuple score, i.e., β(e,...,e ), can build upon the following quantities:. Frequency of the tuple: How any ties has e,...,e occurred? That is, how any [ẽ,..., ẽ ] are atched? 2. Strength of each occurrence: How well does each [ẽ,..., ẽ ] atch the pattern? 3. Frequency of individual entity instance: How any ties has e i appeared in the corpus? A tuple ay be frequent siply because its entity instances are coon. 4. Uncertainty of entity instance: What is the probability of instance e i being of the entity type E i. Our initial ipleentation thus tried, for experientation, to use all four in various ways, as Figure 4 suarizes a few representative ones: tf : tuple frequency (using ); dtf : distance weighted tuple frequency (using, 3); i: utual inforation (using, 2); t-score(using, 3); conf (using 4). While these easures are validated in our experients to be useful, we believe a ore systeatic study of the scoring easure is still needed. Coing up with ore effective scoring easures is itself an interesting research proble. We plan to work towards this direction in the future. Next, we describe the key syste coponents to accoplish the goal of entity search, towards agile best-effort inforation integration. The overall syste architecture is illustrated in Figure 5. We now zoo into each specific syste coponent. Data Collector: Our data webpages could coe fro both the surface Web and the deep Web. To get webpages fro the surface Web, our Data Collector works like a crawler, getting webpages related with a specific topic/application. Our current ipleentation obtains webpages fro the Stanford Web- Base Project 2. To get webpages fro the deep Web sources, 2 testbed/doc2/webbase/ i i+

5 CEO Apple Crop Users, Applications EntitySearch query Query Engine Results, scores Relational Operations Constructor Price 7.99$ 6.99$ 5.00$ 7.35$ Copany Microsoft Apple Intel Aazon Indices Extraction Storage Figure 6: Syste Interface. Annotated Webpages Entity Extractor Entity Models Webpages Data Collector Figure 5: Syste Architecture our Data Collector works by collecting result pages of specific queries on deep Web sources, e.g., houses available in a certain zipcode area, books written by soe specific authors, etc. Entity Extractor: In order to discover seantics of the doain, our Entity Extractor extracts doain entities, the key coponents of the doain, independently. Each entities is extracted using an entity odel, which describes how to identify instances of an entity in the webpages. Query Engine This coponent essentially supports online querying by perforing pattern atching and tuple scoring, as we abstracted in Section 3 and described in the beginning of this Section. As we produce tuples in the results, naturally any relational operations could be perfored, which corresponds to the Relational Operations sub coponent in Figure 5. Our Query Engine is orphed using the Leur Toolkit (version 2.2), an inforation retrieval engine []. This IR engine facilitates easy storage of the extracted entities in the for of ordered lists based on docuent ID, uch like storing inverted indices for keywords. It also enables the construction in a natural way as sort-erge-join based on docuent ID is one of the ost coon operations in answering IR queries. 5. DEMONSTRATION In this deonstration, we specifically show the online coponent of our syste, which is our query engine. The current sever is running on a Pentiu-4 2.6GHz PC with GB eory. This section will first discuss the interface of our query engine. Then two deonstration scenarios are presented to show the effectiveness of our syste. Finally, the deonstration plan is described. 5. Syste Interface We show our query interface for the query engine of Entity- Search in Figure 6. The first three input boxes in Figure 6 directly refers to the operators explained in Section 4. Matching Pattern specifies a pattern α(x), in which x is a list of ters to be joined, as either entities E i or literal keywords K j, and α is a pattern easure, shown in Figure 3, specifying how the ters are connected into a pattern. Figure 7: Saple Output of Query C. Scoring Measure specifies which β scoring easure, aong the ones we support in Figure 4, will be used to score the atched tuples. Entity Filter enables specifying conditions on each entity. In principle, any filter operation could be applied to an entity. Our syste currently supports two operators: equalto (string value of the instance atches exactly with the specified keywords) and contains (string value of the instance contains the specified keywords). In addition, We provide three auxiliary features to iprove the effectiveness of the engine. These features could be regarded as a light-weight Relational Operation layer, which perfors inial but useful relational operations upon the output relation. We discuss these three features one by one. Corpus Restriction: The Corpus Restriction operator is designed to facilitate application s control over the corpus. We support Corpus Restriction by URL doains as regular patterns. For instance, we could set this field to *.azon.co, which atches the sub-corpus of pages at the doains with that pattern (e.g., book.azon.co). Pages Per Answer: This feature requests the nuber of Web pages to return as support pages or evidences for each tuple returned by the query engine because results are integrated fro ultiple Web pages across the corpus. Order By specifies the order of listing results (e.g., by #research alphabetically) uch like the sae clause in SQL. By default it will rank the results according to their ranking score calculated by the β Measure. Application issues queries by filling in the interface (Matching Pattern, Scoring Measure and Entity Filter), as well as the other three auxiliary operators and then clicking the Subit Query button. Figure 7 shows a saple syste output. As you can see fro the figure, it is in the for of a relation. Each tuple has a score associated with it. The URL below each tuple shows the supporting pages where this tuple is found. 5.2 Deonstration Scenarios In this subsection, we use two concrete scenarios on real data to deonstrate the practical usage of our syste. The two scenarios we selected are very different in nature. The first scenario, regarding education doain regarding idwest CS departents, focuses on the surface Web whose pages are

6 ostly unstructured. In contrast, the second scenario, using the book doain, focuses on the deep Web. Result pages collected fro the deep Web tend to be ore structured. We show that our syste works well in both scenarios. Scenario : Midwest Coputer Science Doain Currently, we have collected pages regarding CS departent in six idwest universities (IIT, Illinois, Indiana, Michigan, Purdue and Wisconsin) fro the WebBase project. The Doain Extractor extracts the following attributes: professor, research, university, eail and phone. Three exaple queries: C: eails of professors across universities C2: research areas of professors across universities C3: professors conducting DB related research across universities Above we have shown three exaple queries that could be asked in this doain. Detailed query operators for each query is excluded. In Query C, the application is interested in integrating all the eails of professors across universities into one table. Figure 7 shows a snippet of the result, which is good. We anually checked the result returned by our syste for Query C, over 90% of professor s eails are included in the output relation. And if for each unique professor, we only reain the tuple with the highest score in the relation, we can achieve precision over 85%. Please refer to the results in our online deo for ore details. Siilar results are observed fro other exaple queries, such as Query C2 and C3. Scenario 2: Shakespeare s Book Doain This scenario focuses on the deep Web. We issued a query by filling author attribute with Shakespeare to the three ajor online book stores: We then anually collect webpages containing the top 00 results fro each site. The Doain Extractor extracts the following attributes: title, author, iage, price, date. Three exaple queries: B: title and price of books B2: iages of books with title containing Halet B3: title of books that are in stock Above we have shown three exaple queries that could be asked in this doain, excluding detailed query operators. Query B is a very coon integration scenario where the application wants to find all the books and their prices across ultiple websites. More interestingly, Query B2 asks for iages of titles with keyword Halet in it. As you can see fro the top results shown in Figure 8, the iages indeed are all cover iages for books with keyword Halet in their titles. Query B3 could be issued using the title attribute together with keywords in stock as application finds vendors norally put keywords in stock to indicate the books currently available for purchase. Siilarly, we can query for books that are out of stock, on order, etc. Due to the lack of space, we are only able to briefly show very liited exaple queries and results in our two scenarios. To experience ore about the above two scenarios and have a real feeling of the results, we invite users to the following online deo site Figure 8: Saple Output of Query B Deo Plan Users can follow the link fro our deo site to experience our two deo scenarios. For each scenario, we have listed a set of exaple queries, including all the exaple queries shown in Section 5.2. By clicking on an exaple query, the query operators of the chosen query will be autoatically filled into the input boxes in our query interface. Alternatively, users are welcoe to odify the query operators of the exaple queries or coe up with their own queries. Query results are norally returned within seconds. 6. CONCLUSION In this paper, we proposed the concept of Entity Search for supporting agile best-effort inforation integration over the Web. While we observe entity search, as a id-way arriage of search and integration, is eaningful and useful to access the data-rich Web, there are several open research issues towards its full realization: First, as the core, how to rank each tuple, so that the best atching surfaces to the top? Second, to provide efficient online search, can we generalize the current page-based search engine into an entity search engine? Finally, as entity search returns a ranked list of tuples (e.g., Figure 2), how can we integrate the derived relations with relational SQL-based querying e.g., joining #copany, #phone with #copany, #eail? Or, filtering only those copanies with.co? We are continuing to investigate these challenging and interesting probles as our future research agenda. 7. REFERENCES [] Leur toolkit for language odeling and inforation retrieval. leur. [2] Multiple listing service. Listing Service. [3] Sabre holdings corporation. [4] K. C.-C. Chang, B. He, and Z. Zhang. Toward large scale integration: Building a etaquerier over databases on the web. In Proceedings CIDR [5] W. W. Cohen. Soe practical observations on integration of web inforation. In WebDB (Inforal Proceedings), pages 55 60, 999. [6] S. Raghavan and H. Garcia-Molina. Crawling the hidden web. In VLDB, pages 29 38, 200. [7] L. J. Seligan, A. Rosenthal, P. E. Lehner, and A. Sith. Data integration: Where does the tie go? IEEE Data Eng. Bull., 25(3):3 0, [8] Z. Zhang, B. He, and K. C.-C. Chang. Light-weight doain-based for assistant: Querying web databases on the fly. In Proceedings of VLDB 2005.

1 Extended Boolean Model

1 EXTENDED BOOLEAN MODEL It has been well-known that the Boolean odel is too inflexible, requiring skilful use of Boolean operators to obtain good results. On the other hand, the vector space odel is flexible