Exploring synonyms within large commercial site search engine queries


HP Laboratories, HPL-2011-41

External Posting Date: April 6, 2011 [Fulltext]. Internal Posting Date: April 6, 2011 [Fulltext]. Approved for External Publication.
Copyright 2011 Hewlett-Packard Development Company, L.P.

Julia Kiseleva, Andrey Simanovsky
HP Labs Russia

Abstract. We describe the results of experiments on extracting synonyms from the query log of a large commercial site search engine. Our primary object is product search queries. The resulting dictionary of synonyms can be plugged into a search engine in order to improve the quality of search results. We use a product database to extend the dictionary.

Keywords: synonym mining, query log analysis

1 Introduction

A large commercial site is an information portal for customers where they can find everything about the vendor's products, e.g. manuals, drivers, etc. A large commercial site has a search engine whose main function is to help customers retrieve appropriate information. We regard a user search query as a product query when the user's intent is to retrieve information about HP products and services, including manuals, drivers, and support.

One way to improve search quality is to utilize a dictionary of synonyms that incorporates both the vocabulary of the document collection and the vocabulary of different users. We analyze incoming queries for synonym terms that could be included in a thesaurus. We tried several techniques to detect synonymous terms and queries among queries from a large commercial site search engine.

Query expansion is defined as a stage of the information retrieval process during which a user's initial query statement is extended with additional search terms in order to improve retrieval performance. Query expansion is rationalized by the fact that the initial query formulation does not always reflect the exact information need of a user. The application of thesauri to query expansion and reformulation has become an area of increasing interest.

Three types of query expansion are discussed in the literature: manual, automatic, and interactive (including semi-automatic, user-mediated, and user-assisted). These approaches use different sources of search terms and a variety of expansion techniques. The manual approach does not include any knowledge about a collection, while the interactive approach implies query modification through a feedback process. However, assistance could be sought from other sources, including a dictionary or a thesaurus. In query expansion research, one of the biggest issues is generating appropriate keywords that represent the user's intention.

Spelling correction [1] is also related to synonym detection because its techniques are applicable, especially for product synonyms, which share a lot of common words.

The methods described above use only a query set as input data. There are, however, a few published approaches that use external data sources for synonym detection to make the technique more robust.

The goal of this project is to detect synonymous terms in search queries submitted by users to a large commercial site search engine. We also provide recommendations for enhancing search quality on the large commercial site.

The remainder of the report is organized as follows. We review related work in Section 2. The problem is formulated in Section 3. Section 4 discusses the algorithms that were utilized; in particular, its subsections present the functions we use as measures of similarity. Section 5 describes our experimental data set and compares experimental results. Finally, Section 6 summarizes our contribution.

2 Related Work

There are a lot of papers related to synonym detection in search queries. Thesauri have been recognized as a useful source for enhancing search-term selection for query formulation and expansion [4], [5]. Terminological assistance may be provided through the inclusion of thesauri and classification schemes in the IR system.
In a series of experiments on designing interfaces to the Okapi search engine, it was found that both implicit and explicit use of a thesaurus during automatic and interactive query expansion were beneficial. It was also suggested that while the system could find useful thesaurus terms through an automatic query-expansion process, terms explicitly selected by users are of particular value ([4], [6]).

The paper [3] presents a new approach to query expansion. The authors proposed the Related Word Extraction Algorithm (RWEA). This algorithm extracts words from texts that are supposed to be strongly related to the initial query. RWEA weights were also used in the weighting scheme of Robertson's Selection Value (RSV), a well-known method for relevance feedback [4]. Query expansion was performed based on the results of each method (RSV, RWEA, and RSV with RWEA weights), and a comparison was made. RWEA evaluates a word in a document and RSV evaluates a word among several documents; consequently, the combination should perform uniformly well. Experimental results corroborated that statement: the combined method works effectively for all queries on average. In particular, when a user inputs initial queries whose results have Average Precision (AP) under 0.6, the combined method obtains the highest Mean Average Precision (MAP). It also obtains the highest MAP among the three methods in experiments with navigational queries. However, RWEA obtains the highest MAP in experiments with informational queries. Experimental results show that the effectiveness of a query expansion method depends on the type of queries.

Many research papers on query spelling correction, such as [1], have been published recently. We think that this area is also related to synonym detection because its techniques are applicable. For example, in [7] the authors consider a new class of similarity functions between candidate strings and reference entities. These similarity functions are more accurate than previous string-based similarity functions because they aggregate evidence from multiple documents and exploit web search engines in order to measure similarity. The authors thoroughly evaluate the techniques on real datasets and demonstrate their precision and efficiency.

In [2] the authors present a study on clustering synonym terms in search queries. The main idea is that if users click on the same web page after submitting different search queries, those queries are synonyms.

3 Problem Statement

Our goal is to build a thesaurus of synonym terms that are related to the respective products. We also provide a set of recommendations for enhancing the quality of search results returned by the large commercial site search engine.

4 Algorithms

4.1 Similarity Distance Metrics

We perform experiments with token-based and term-based similarity metrics. We chose these metrics because their efficiency has been demonstrated in the literature [11], [13].

Token-based distance

Many token-based string similarity metrics are described in the literature.

Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For example:

If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are. Levenshtein distance is also called edit distance.

Smith-Waterman distance [11] is similar to Levenshtein distance. It was developed to identify optimal alignments between related DNA and protein sequences. It has two parameters, a function d and a gap G. The function d maps pairs of alphabet characters to cost values for substitutions. The gap G allows costs to be attributed to insert and delete operations. The similarity score D is computed with a dynamic programming algorithm described by the equation below:

$$D(i, j) = \max \begin{cases} 0 & \text{// start} \\ D(i-1, j-1) - d(s_i, t_j) & \text{// subst/copy} \\ D(i-1, j) - G & \text{// insert} \\ D(i, j-1) - G & \text{// delete} \end{cases}$$

The final score is given by the highest-valued cell. Table 1 presents an example of score calculation.
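The two measures defined above can be sketched in Python as follows. This is a minimal illustration and not the implementation used in the experiments. The function names are hypothetical, and the default costs (gap G = 1, cost -2 for equal characters, +1 for different characters) are an assumption about the parameters of the Table 1 example, with costs subtracted in the recurrence.

```python
def levenshtein(s, t):
    """Number of single-character insertions, deletions, or substitutions
    needed to transform s into t."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            substitution = previous[j - 1] + (cs != ct)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))        # substitution or copy
        previous = current
    return previous[-1]


def smith_waterman(s, t, gap=1, match_cost=-2, mismatch_cost=1):
    """Highest cell of D(i, j) = max(0, D(i-1, j-1) - d, D(i-1, j) - G, D(i, j-1) - G).

    The default costs are assumed values, not confirmed parameters."""
    best = 0
    previous = [0] * (len(t) + 1)
    for cs in s:
        current = [0]
        for j, ct in enumerate(t, start=1):
            d = match_cost if cs == ct else mismatch_cost
            cell = max(0,
                       previous[j - 1] - d,    # substitution or copy
                       previous[j] - gap,      # insert
                       current[j - 1] - gap)   # delete
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(levenshtein("test", "test"))         # 0
print(levenshtein("test", "tent"))         # 1
print(smith_waterman("cohen", "mccohn"))   # 7
```

Only two rows of the dynamic programming table are kept in memory at a time, which is enough when just the final score is needed.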

Table 1. Smith-Waterman calculation between the strings "cohen" (columns C, O, H, E, N) and "mccohn" (rows M, C, C, O, H, N), where G = 1, d(c,c) = -2, d(c,d) = +1.

Smith-Waterman-Gotoh [12] is an extension of Smith-Waterman distance that allows affine gaps within the sequence. The affine gap model includes variable gap costs, typically based on the length l of the gap (W_l). If two sequences, A = a_1 a_2 a_3 ... a_n and B = b_1 b_2 b_3 ... b_m, are compared, the formula for the dynamic programming algorithm is:

$$D_{ij} = \max\{ D_{i-1,j-1} + d(a_i, b_j),\ \max_k \{ D_{i-k,j} - W_k \},\ \max_l \{ D_{i,j-l} - W_l \},\ 0 \},$$

where D_ij is in fact the maximum similarity of two segments ending in a_i and b_j respectively. Two affine gap costs are considered: a cost for starting a gap and a cost for continuing a gap.

Definition: The taxicab distance, d_1, between two vectors p, q in an n-dimensional real vector space with a fixed Cartesian coordinate system is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes:

$$d_1(p, q) = \lVert p - q \rVert_1 = \sum_{i=1}^{n} |p_i - q_i|,$$

where p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) are the two vectors.

The taxicab metric is also known as rectilinear distance, L1 distance or l1 norm, city block distance, Manhattan distance, or Manhattan length.

Term-based distance

We chose the cosine similarity metric as a term-based distance. Cosine similarity is a measure of similarity between two vectors that is equal to the cosine of the angle between them. For vectors with non-negative components, such as term weights, the cosine is equal to 1 when the vectors point in the same direction and lies between 0 and 1 otherwise.
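To make the taxicab and cosine definitions concrete, here is a small Python sketch that applies both to queries. It assumes that each query is represented as a vector of raw token counts. That representation and the function names are illustrative assumptions; the report does not state how strings were mapped to vectors.

```python
import math
from collections import Counter


def token_counts(query):
    """Represent a query as a bag of lower-cased, whitespace-separated tokens."""
    return Counter(query.lower().split())


def manhattan_distance(q1, q2):
    """Taxicab (L1) distance between the token-count vectors of two queries."""
    a, b = token_counts(q1), token_counts(q2)
    return sum(abs(a[t] - b[t]) for t in set(a) | set(b))


def cosine_similarity(q1, q2):
    """Cosine of the angle between the token-count vectors of two queries."""
    a, b = token_counts(q1), token_counts(q2)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


print(manhattan_distance("hp deskjet 960c", "deskjet 932c"))  # 3
print(round(cosine_similarity("hp deskjet 960c", "deskjet 932c"), 3))
```

A smaller Manhattan distance and a larger cosine value both indicate more similar queries, so the two scores have to be ranked in opposite directions.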

The cosine of two vectors can be easily derived by using the Euclidean dot product formula:

$$a \cdot b = \lVert a \rVert \, \lVert b \rVert \cos\theta$$

$$\text{similarity} = \cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$$

As a weighting function we used the tf*idf weight. The tf (term frequency) in the given document is the number of times a given term appears in that document, normalized by the document size:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}},$$

where n_{i,j} is the number of occurrences of the considered term t_i in document d_j, and the denominator is the sum of the numbers of occurrences of all terms in document d_j, that is, the size of the document d_j. The idf (inverse document frequency) is a measure of the general importance of the term:

$$idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

We selected the tf*idf weight because it combines two aspects of a word: its importance for a document and its discriminative power within the whole collection. Each query was regarded as a document in the collection. Tf is the frequency of a term in a query; it is almost always equal to 1, and idf is the ordinary inverse document frequency.

4.2 Probabilistic Model

Source Channel Model

In paper [1] the authors apply the source channel model to the error correction task. We explore the possibility of applying it to finding synonyms. The source channel model has been widely used for spelling correction. Using the source channel model, we try to solve an equivalent problem by applying Bayes' rule and dropping the constant denominator:

$$c^* = \arg\max_{c \in C} P(q \mid c) \, P(c),$$

where q is the query and c is a correction candidate. In this approach, two components of the generative model are involved: P(c) characterizes the user's intended query c, and P(q|c) models the error. The two components can be estimated independently.

The source model P(c) can be approximated with an n-gram statistical language model. In practice, for multi-term queries it is estimated from tokenized query logs. Consider, for example, a bigram model. If c is a correction candidate containing n terms, c = c_1 c_2 ... c_n, then P(c) can be written as a product of consecutive bigram probabilities:

$$P(c) = \prod_{i} P(c_i \mid c_{i-1})$$

Similarly, the error model probability of a query is decomposed into generation probabilities of individual terms, which are assumed to be independent:

$$P(q \mid c) = \prod_{i} P(q_i \mid c_i)$$

Now word synonymy can be assessed via correlation. There are different ways to estimate distributional similarity between two words, and the one we propose to use is confusion probability. Formally, the confusion probability P_c estimates the possibility that a word w_1 could be replaced by another word w_2 [1]:

$$P_c(w_2 \mid w_1) = \sum_{w} \frac{P(w \mid w_1) \, P(w \mid w_2) \, P(w_2)}{P(w)},$$

where w belongs to the set of words that co-occur with both w_1 and w_2. For synonym detection we assume that w_1 is an initial word and w_2 is a synonym. The confusion probability P_c(w_2|w_1) models the probability of w_1 being rephrased as w_2 in query logs.

Utilizing a database as an external data container

As we mention in the Related Work section, there is a successful practice of utilizing external sources to discover synonyms. We present a novel method that makes use of a database of product names to enhance the synonym detection described in the previous section. The database provides new ways to detect synonym terms because it contains product names that are related to the queries but could be expressed in other words. Synonym terms from the database are extremely useful for detecting related products during the search process. We introduce an analog of confusion probability between words in the query and terms in the database.

Figure 1. Metrics inside search query tokens and product names database

Figure 1 shows the sets of tokens in a database (D) and in a query log (Q). D ∩ Q is the intersection of terms in the database and the query log; w_i is a token from the intersection. P_c(w_i | w) is the confusion probability from [1]. ρ(w_i, w') denotes a similarity function within the space of database terms between the term w' and the term w_i, which we choose to be Manhattan distance because it performed best as a token-based similarity measure. W = {w_i} is the set of terms that occur in the intersection between the database and the queries (in D ∩ Q).

We extend the notion of confusion probability to pairs w and w', where w is a term that occurs only in queries and w' is a term that occurs only in the database. We propose two ways of introducing the confusion probability extension (in both, i indexes words of the intersection):

1. As a maximum over the intersection:

$$P_C^{\max}(w, w') = \max_i \big( P_c(w_i \mid w) \cdot \rho(w_i, w') \big)$$

2. As a sum over the intersection: P_C(w, w') sums the products ρ(w_i, w') · P_c(w_i | w) over all w_i ∈ W, weighted by the word probabilities P(w_i).
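The confusion probability and the first (maximum) extension can be sketched in Python as below. This is only an illustration under stated assumptions: P(w) is estimated as a relative token frequency, P(w | w_1) as a relative co-occurrence frequency within queries, and all function names, the toy queries, and the similarity function passed as a parameter are hypothetical. The report does not describe its estimators.

```python
from collections import Counter, defaultdict


def build_statistics(queries):
    """Count single words and within-query co-occurrences (assumed estimator)."""
    word_counts = Counter()
    cooccurrence = defaultdict(Counter)
    for query in queries:
        tokens = query.lower().split()
        word_counts.update(tokens)
        distinct = set(tokens)
        for w1 in distinct:
            for w in distinct - {w1}:
                cooccurrence[w1][w] += 1
    return word_counts, cooccurrence


def confusion_probability(w1, w2, word_counts, cooccurrence):
    """P_c(w2 | w1): sum over shared context words w of P(w|w1) P(w|w2) P(w2) / P(w)."""
    total_words = sum(word_counts.values())
    context1 = cooccurrence.get(w1, Counter())
    context2 = cooccurrence.get(w2, Counter())
    n1, n2 = sum(context1.values()), sum(context2.values())
    if not n1 or not n2:
        return 0.0
    p_w2 = word_counts[w2] / total_words
    result = 0.0
    for w in set(context1) & set(context2):  # words co-occurring with both
        p_w = word_counts[w] / total_words
        result += (context1[w] / n1) * (context2[w] / n2) * p_w2 / p_w
    return result


def extended_confusion_max(w, w_db, intersection, similarity, word_counts, cooccurrence):
    """Variant 1: max over intersection words w_i of P_c(w_i | w) * similarity(w_i, w_db)."""
    return max(
        (confusion_probability(w, wi, word_counts, cooccurrence) * similarity(wi, w_db)
         for wi in intersection),
        default=0.0,
    )


toy_queries = ["laptop battery", "notebook battery", "laptop charger",
               "notebook charger", "printer ink"]
counts, cooc = build_statistics(toy_queries)
print(confusion_probability("laptop", "notebook", counts, cooc))
print(confusion_probability("laptop", "printer", counts, cooc))
```

In the toy log, "laptop" and "notebook" share both of their context words, while "laptop" and "printer" share none, so only the first pair receives a non-zero score.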

Note that the natural desired property P_C(w, w') = P_c(w' | w) if w' ∈ D ∩ Q is not automatically met by the introduced extension. Another possible approach to extending the confusion probability is to introduce P_C(w, w') according to the joint distribution of (w_i, w_j), where w_i ∈ D ∩ Q and w_j ∈ D ∩ Q.

5 Experiments

In order to perform initial data filtering, we built basic statistics of the query log and found notable properties of the current large commercial site search engine traffic, which are presented in Section 5.1. Next we evaluated the metrics presented in Section 4. We present the evaluation in the subsequent sections together with sample results.

5.1 Data Description

In this section we present a description of the data and some statistics that help us understand the nature of the data. By the nature of the data we mean answers to the following questions:

Where have the queries come from?
What is the average length of a query?
What is the list of stop words for the large commercial site search engine query log?

The query log used for analysis was collected over 8 days. It contains [...] queries, [...] unique queries, and [...] queries that occur more than once. The average length of a query is [...] words. Table 2 provides a detailed description of the query log. The log does not contain any additional information about users except IP addresses, which do not uniquely identify users.

Field                 Example value
IP address (1)        *.*
Time                  05/Jun/2010:00:00:
Request               GET /query.html?lang=en&search=++&qt=pavilion+6130+add+replace+expansion&la=en&cc=us&charset=utf-8 HTTP/1.1
Browser information   Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv: ) Gecko/ Firefox/
Status
Status1
Return page           www.hp.com/

Table 2. Query log description.

(1) Here and further on, IP addresses are partially obfuscated because of privacy considerations.

The query frequency distribution in the log is presented in Figure 2.

Figure 2. Query frequency distribution.

The most frequent queries are given in Table 3. The most popular queries are non-product queries like "google". We think that those queries are the most frequent because they come from internal corporate users. This probably happens because the commercial site page is the default start page for the company's employees.
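The request field in Table 2 carries the query text in its URL. A minimal sketch of extracting it with the Python standard library follows. It assumes that the `qt` parameter holds the user's query, which is inferred from the sample row rather than stated in the report, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def extract_query(request_line):
    """Return the decoded search query from a request line such as
    'GET /query.html?...&qt=...&... HTTP/1.1', or None if it has no qt value."""
    parts = request_line.split()
    if len(parts) < 2:
        return None
    params = parse_qs(urlsplit(parts[1]).query)  # '+' is decoded to a space
    values = params.get("qt")                    # assumed query-text parameter
    return values[0].strip() if values else None


sample = ("GET /query.html?lang=en&search=++&qt=pavilion+6130+add+replace+expansion"
          "&la=en&cc=us&charset=utf-8 HTTP/1.1")
print(extract_query(sample))  # pavilion 6130 add replace expansion
```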

Query                                  Frequency
search:                                1066
Google                                 610
Drivers                                579
hp officejet j4500 series search       535
Slate                                  439
hp deskjet f2200 series search         421
Warranty                               363
hp business availability center        354
hp deskjet f4200 series search         246
Tablet                                 232
go instant                             214

Table 3. The most frequent queries in the log

The request contains a web section parameter that shows which category on the site was selected by the user. From our point of view, the query distribution by topic could be useful for understanding user behavior. We built statistics by web section from the query URLs. The web section is related to the query topic. The statistics are shown in Table 4. The total number of web section queries is 527, which is 0.35% of all queries, i.e. the web section functionality is not popular with the users.

Web Section Topic                                  Frequency
small & medium business                            153
Home                                               108
compaq.com                                         70
home & home office                                 55
home & home office section only                    42
small & medium business site                       37
hp procurer networking                             27
products and services                              10
home & home office only                            9
hp promotions only                                 6
business technology optimization (bto) software    4
learn about supplies                               3
hp online store                                    2
hp services                                        1
Total                                              527 (0.35%)

Table 4. Distribution of web section queries

5.2 Data Preprocessing

5.2.1 Data Filtering

For some of the approaches that we apply, as well as to distinguish between external and internal use of the site, we need per-user data. To obtain per-user statistics we developed a technique for data filtering. We found that some IP addresses send many requests to the search engine. Examples of such IP addresses, which had more than 1000 requests, are given in Table 5. We believe that most of those search queries are sent from company employees' computers through corporate proxies. The corporate IP addresses are marked in bold in Table 5. We called this set of IPs non-confidential, and they were removed from the data set.

IP address    Frequency
*.*
*.*
*.*
*.*
*.*
*.*
*.*
*.*
*.*
*.*
*.*
*.*           1200

Table 5. Top non-confidential IP addresses

We calculated statistics of requests from all IP addresses and from non-confidential IP addresses. The statistics are presented in Table 6. We conclude that at least 25% of search queries originate from inside the company.

Date      Number of confidential requests    Number of all requests    Delta
1 June
June
June
June
June
June
June
June
Total

Table 6. Daily query statistics per origin

To make our methodology more robust we built a list of stop words. It contains prepositions and the term "hp". We used this list to clean up the queries in the log.
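The two preprocessing steps above, dropping heavy-traffic IP addresses and removing stop words, could look roughly like the following Python sketch. The record layout, the function names, and the concrete stop-word list are assumptions made for illustration. The report identifies corporate addresses with additional knowledge about proxies, which a pure frequency threshold only approximates.

```python
from collections import Counter

# Illustrative stop-word list: the term "hp" plus a few prepositions (assumed).
STOP_WORDS = {"hp", "for", "in", "of", "on", "to", "with"}


def drop_heavy_ips(records, max_requests=1000):
    """Keep only (ip, query) records whose IP sent at most max_requests requests."""
    per_ip = Counter(ip for ip, _ in records)
    return [(ip, query) for ip, query in records if per_ip[ip] <= max_requests]


def remove_stop_words(query, stop_words=STOP_WORDS):
    """Lower-case a query and drop stop words."""
    return " ".join(t for t in query.lower().split() if t not in stop_words)


records = [("ip-a", "google")] * 3 + [("ip-b", "HP drivers for deskjet")]
kept = drop_heavy_ips(records, max_requests=2)
print([(ip, remove_stop_words(q)) for ip, q in kept])
# [('ip-b', 'drivers deskjet')]
```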

5.2.2 Identification of user session time

In one session a user may pursue a single information need and reformulate queries until he or she gets the desired result. Thus, analyzing user sessions in order to find synonymous queries seems reasonable. To identify user sessions, we filtered IP addresses from the log according to the algorithm described in the Data Filtering section.

Definition 1: Delta is the time in seconds between two contiguous clicks from the same IP.
Definition 2: Delta frequency is the frequency of a delta in the whole query log.

For both cases, with and without non-confidential IP addresses, we built the plots presented in Figure 3. We expect to see how a user rephrases or expands the query. We used Manhattan distance to find synonyms because it performed well in previous experiments.
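Definitions 1 and 2 translate directly into a small computation. The Python sketch below assumes that the log has already been reduced to (IP, timestamp in seconds) pairs. The function name is hypothetical, and the default 5-second lower bound mirrors the histograms in Figure 3.

```python
from collections import Counter, defaultdict


def delta_frequencies(events, min_delta=5):
    """events: iterable of (ip, timestamp_in_seconds) pairs.

    Returns a Counter mapping each delta (time between two contiguous
    requests from the same IP) to its frequency, keeping deltas >= min_delta."""
    by_ip = defaultdict(list)
    for ip, timestamp in events:
        by_ip[ip].append(timestamp)

    frequencies = Counter()
    for timestamps in by_ip.values():
        timestamps.sort()
        for earlier, later in zip(timestamps, timestamps[1:]):
            delta = later - earlier
            if delta >= min_delta:
                frequencies[delta] += 1
    return frequencies


events = [("ip-a", 0), ("ip-a", 10), ("ip-a", 12), ("ip-b", 100), ("ip-b", 110)]
print(delta_frequencies(events))  # Counter({10: 2})
```

A session boundary can then be chosen from the histogram, for example at a delta value after which the frequency drops sharply. The report does not state which cut-off was used.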

Figure 3. (a) A histogram of deltas starting from 5 seconds for all IP addresses and (b) a histogram of deltas starting from 5 seconds for the set of IP addresses without non-confidential IPs.

5.3 Evaluation Metrics

We use precision as the evaluation metric for our experiments. Its formula is given below:

$$\text{Precision} = \frac{\#\,\text{correct results}}{\#\,\text{total results}}$$

5.4 Experiments with different token-based similarity metrics

The first approach that we considered for finding synonyms originates in the task of matching similar strings (2). To characterize whether or not a candidate string is synonymous with another string, we compute the string similarity score between the candidate and the reference strings [10, 6].

(2) We use the SimMetrics library.

Unfortunately, there is no gold standard for evaluating synonym discovery in query logs, so we have to build the ground truth ourselves. After performing experiments with different metrics, we select the top results and evaluate them manually. We decided not to perform general pooling and to evaluate precision at 100 instead. We believe that the top similar pairs are more stable than pairs similar to a given one. The results of the evaluation and the size of the gold standard are presented in Table 7.

Token-based Metric               Gold Standard Size    Precision
Levenshtein Distance
Smith-Waterman Distance
Smith-Waterman-Gotoh Distance
Manhattan Distance

Table 7. Results of experiments with the proposed token-based metrics.

Manhattan distance shows the best precision at 100. The main reason for low precision is that string similarity does not imply synonymy. For example, the strings "hp deskjet 960c" and "deskjet 932c" are similar according to the similarity metric, but they represent different printer models and this is not a case of synonymy.

Synonym detection by using clicks on the same URL

A hypothesis suggested in [2] claims that if users click on the same search result URL, their queries should be synonyms. We explored that hypothesis on our data. Table 8 shows a few examples that were obtained:

Id   Queries from the same clicked URL
1    hp deskjet 845c; hp deskjet d
2    hp deskjet d1360; hp deskjet 845c
3    hp laserjet 4350tn; hp laserjet
4    hp laserjet 1102; hp laserjet 4350tn
5    hp photosmart c6380; hp photosmart a524; hp photosmart c
6    hp psc 1300; hp psc 1315; hp psc
7    hp psc 1315; hp psc 1300; hp psc
8    hp photosmart a524; hp photosmart c6380; hp photosmart c
9    hp photosmart c4240; hp photosmart c6380; hp photosmart a
10   hp psc 2410; hp psc 1300; hp psc
11   hp pavilion dv6500; hp pavilion dv2000; hp pavilion dv3

Table 8. Examples of synonyms through clicks on the same URL

One can see that we obtained low precision. The reason is that queries which contain different model numbers are regarded as synonyms. We expected that users would reformulate a query by replacing a term, but we found that users mostly replace a model number.

Synonym detection by using user sessions

We performed experiments with the purpose of finding similar terms within a query session, using the methodology for detecting a user session described in Section 5.2.2. We evaluated 203 queries manually, and this set is our gold standard for expert evaluation. We obtained precision equal to [...]. A few examples of synonyms in one user session are given in Table 9. Table 9 also presents the similarity value between queries within the session.

User IP    Query1    Query2    Similarity Value
audio sp27792 sp hpdv6-1153e drivers hp proliant ml350 g6 dv5-1153e drivers 0.5 ml330 g ml330 ml330 g hp officejet j4500 series search hp officejet j4500 series warranty registration

Table 9. Synonymous queries within a user session

5.5 Experiments with term-based metric

We inflated the weight of terms that are numbers or contain numbers. This was done in order to avoid regarding queries with different model numbers as synonyms. Candidate pairs of synonymous queries that had a cosine similarity of less than 0.7 were filtered out. We evaluated 150 queries and obtained a precision of 0.4. Almost all results are synonym expansions. We did not include the term "hp" and prepositions in the feature space because we consider them stop words. A few examples of synonyms found with cosine similarity are presented in Table 10. The obtained set of synonyms can be divided into two categories:

query expansion (pairs 1, 2, and 3);
query rephrasing (pair 4). In this case we can conclude that the terms "laptop" and "notebook" are synonyms.

Id   Initial Query        Query Synonym
1    photosmart
2    hp laserjet 4250n    4250n
3    rx3715               ipaq rx
4    laptop               4510 notebook

Table 10. Examples of query synonyms obtained with the cosine similarity metric

5.6 Experiments with confusion probability

In this section we apply another approach to synonym detection. This approach detects synonyms at the level of single words rather than whole queries, and it draws on the source channel model.
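The term-based experiment of Section 5.5 (tf*idf weights, inflated weights for terms containing digits, and a 0.7 cosine threshold) can be illustrated with the Python sketch below. The boost factor of 3.0, the stop-word list, the toy queries, and all function names are hypothetical. The report gives only the threshold and does not say by how much the weights were inflated.

```python
import math
from collections import Counter
from itertools import combinations

STOP_WORDS = {"hp", "for", "in", "of", "on", "to", "with"}  # assumed list


def tokenize(query):
    return [t for t in query.lower().split() if t not in STOP_WORDS]


def tfidf_vectors(queries, number_boost=3.0):
    """One sparse tf*idf vector per query; terms containing digits are boosted."""
    docs = [tokenize(q) for q in queries]
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        vector = {}
        for term, n in Counter(doc).items():
            weight = (n / len(doc)) * math.log(len(docs) / df[term])
            if any(ch.isdigit() for ch in term):
                weight *= number_boost  # hypothetical inflation factor
            vector[term] = weight
        vectors.append(vector)
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def candidate_pairs(queries, threshold=0.7):
    """Query pairs whose cosine similarity reaches the threshold."""
    vectors = tfidf_vectors(queries)
    return [(queries[i], queries[j])
            for i, j in combinations(range(len(queries)), 2)
            if cosine(vectors[i], vectors[j]) >= threshold]


log = ["laptop 4510", "notebook 4510", "laptop 4520", "printer ink"]
print(candidate_pairs(log))
```

With the boost in place, "laptop 4510" and "notebook 4510" pass the threshold because they share the model number, while "laptop 4510" and "laptop 4520" do not, which is the behavior the weighting is meant to produce.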

Some of the top results of the described synonym detection method are presented in Table 11. Most of the presented synonyms fall into the following categories:

paronymous terms like "face" and "facial";
misspellings like "Designerjet" and "Designjet";
different forms of the same word like "dv42160us" and "dv4-2164us".

Query term     Query term it should be similar to    Confusion probability
Designerjet    Designjet                             0.75
Windows
Twain          Twin                                  0.2
Michael        Micheal
dv42160us      dv4-2164us
Facial         Face
Vitamine       Vitamin                               0.2
Ms-6390        Ms
Technisch      Farm

Table 11. Synonymous terms in queries detected with confusion probability

6 Conclusion and recommendations

We discovered that all obtained synonyms can be classified into the following groups:

1. Misspellings.
2. Different forms of a word (mostly the plural form).
3. Term and digit: terms matching the regular expressions "Digit Space* Letter" and "Letter Space* Digit".
4. Query expansions.
5. Rephrasings. This is the type of synonym that is the most interesting for us.

Table 12 contains examples of the above categories.

Category                      Initial Query                                                        Synonym Query
Misspelling                   1) alanta 2) laser 3) vdeo 4) Desgnerjet                             1) Atlanta 2) Leser 3) Vdeo 4) Desgnerjet
Different form of the word    Warranties                                                           Warranty
Term and digit                dv 8                                                                 dv8
Query expansion               hp office locations in india                                         hp india
Rephrasing                    1) Remove 2) Activation 3) How to 4) Total care 5) Call center       1) Uninstall 2) Product key 3) Help, not working, support 4) Adviser 5) service center

Table 12. Synonym categories with examples

Based on the discovered groups of synonyms, we give the following recommendations:

1. Perform spelling correction at run time. We can identify and store a list of the most common misspelled terms. Appendix B demonstrates that the search engine on the site currently cannot detect a misspelling: Figure 5 shows that the search engine does not correct the misspelling and returns irrelevant results. For now we cannot say that we have detected the whole list of misspellings because the current query log does not contain enough data.
2. We think that storing different forms of terms will improve search quality.
3. Perform data normalization. Terms matching the regular expressions "Digit Space* Letter" and "Letter Space* Digit" should be normalized. We should normalize both incoming queries and the data in the database. Appendix C contains two figures, 7 and 8, which show how search results can change depending on how hard drive capacity is written.
4. We need more data to detect query expansions. The search engine has a query reformulation service, but sometimes very odd suggestions are returned. One example is presented in Appendix A, Figure 4. The site should have a product-oriented search engine, but the suggested queries look like the most frequent queries and are not related to products. An example can be found in Appendix A, Figures 5 and 6.
5. We present a novel technique for synonym detection in this report. We need more data to detect a strong list of rephrasing synonyms.
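The normalization proposed in the third recommendation can be sketched with two regular expressions in Python. This is one possible reading of the "Digit Space* Letter" and "Letter Space* Digit" patterns, restricted to lower-cased ASCII letters. The function name is hypothetical.

```python
import re

# Whitespace between a digit and a following letter, e.g. "200 gb".
_DIGIT_THEN_LETTER = re.compile(r"(?<=\d)\s+(?=[a-z])")
# Whitespace between a letter and a following digit, e.g. "dv 8".
_LETTER_THEN_DIGIT = re.compile(r"(?<=[a-z])\s+(?=\d)")


def normalize_query(query):
    """Lower-case a query and remove spaces at letter/digit boundaries."""
    normalized = query.lower().strip()
    normalized = _DIGIT_THEN_LETTER.sub("", normalized)
    normalized = _LETTER_THEN_DIGIT.sub("", normalized)
    return normalized


print(normalize_query("dv 8"))  # dv8
print(normalize_query("hp elitebook 200 gb") == normalize_query("hp elitebook 200gb"))  # True
```

Applying both rules also joins a word to a number that follows it ("elitebook 200"), so the result is better treated as a matching key for queries and database entries than as display text.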

We detected two problems with the data set:

The majority of queries come from internal corporate users, and they are not product search queries. We think that this peculiarity is not inherent to the specific query log and reflects general issues with the current search functionality on the site.
Statistics from a one-week log are not enough to detect strong synonym patterns. The total number of extracted synonym pairs is in the tens. We hope that a longer log can increase that number with close to linear dependence on the log size.

7 References

1. Mu Li, Muhua Zhu, Yang Zhang, Ming Zhou. Exploring Distributional Similarity Based Models for Query Spelling Correction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL.
2. Jeonghee Yi, Farzin Maghoul. Query clustering using click-through graph. In Proceedings of SIGIR.
3. Tetsuya Oishi, Shunsuke Kuramoto, Tsunenori Mine, Ryuzo Hasegawa, Hiroshi Fujita, Miyuki Koshimura. A Method for Query Expansion Using the Related Word Extraction Algorithm. Web Intelligence/IAT Workshops 2008.
4. Beaulieu, M. (1997). Experiments of interfaces to support query expansion. Journal of Documentation, 53(1).
5. Brajnik, G., Mizzaro, S., & Tasso, C. (1996, August). Evaluating user interfaces to information retrieval systems: A case study on user support. Proceedings of the 19th Annual Conference on Research and Development in Information Retrieval (ACM/SIGIR). Zurich, Switzerland.
6. Jones, S., Gatford, M., Hancock-Beaulieu, M., Robertson, S.E., Walker, W., & Secker, J. (1995). Interactive thesaurus navigation: Intelligence rules OK? Journal of the American Society for Information Science, 46(1).
7. Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. Exploiting Web Search to Generate Synonyms for Entities. WWW.
8. K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference.
9. W. W. Cohen and S. Sarawagi. Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods. In KDD, pages 89-98.
10. C. H. Bennett, P. Gács, M. Li, P. M. B. Vitányi, and W. Zurek. Information distance. IEEE Trans. Inform. Theory, vol. 44, July.
11. Smith, T. F. and Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol.
12. Gotoh, O. An Improved Algorithm for Matching Biological Sequences. Journal of Molecular Biology, 162.
13. Rishin Haldar, Debajyoti Mukhopadhyay. Levenshtein Distance Technique in Dictionary Lookup Methods: An Improved Approach. CoRR abs/ (2011).

8 Appendix

8.1 A

Figure 4. Controversial query suggestions


8.2 B

Figure 5. Misspelled query "Alanta service"

Figure 6. Search page for query "Atlanta service"

8.3 C

Figure 7. Search page for query "hp elitebook 200 gb".

Figure 8. Search page for query "hp elitebook 200gb".
