Anmol Bhasin abhasn[at]cedar.buffalo.edu
Mohit Devnani mdevnan[at]cse.buffalo.edu
Spring 2005
Information Retrieval
Documents are represented as vectors in term space
Terms are usually stems
Documents are represented by weighted vectors of terms
Queries are also modeled in term space as boolean / weighted vectors
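Term weighting (tf-idf)
\[ w_{i,j} = tf(t_i, d_j) \times idf(t_i) = f_{i,j} \times \log\frac{N}{n_i} \]
\[ f_{i,j} = \frac{freq_{i,j}}{\max_{l} freq_{l,j}} \]
Cosine similarity
\[ sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}} \]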
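The weighting and similarity formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: it assumes documents are already tokenized, uses the natural logarithm, and the function names `tfidf_vectors` and `cosine` are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents that contain term i
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values()) if freq else 1
        # w = (freq / max freq in the document) * log(N / n_i)
        vectors.append({
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical, already stemmed documents
docs = [
    ["election", "president", "vote"],
    ["president", "speech"],
    ["football", "match", "goal"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # positive: the documents share "president"
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms
```

A query can be scored the same way by treating it as one more token list and comparing its vector against each document vector.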
RSS: Really Simple Syndication
RSS is a dialect of XML / an XML-based syndication specification
RSS files conform to the XML 1.0 specification, as published by the W3C
RSS standards: 0.91, 0.92, 1.0, 2.0
Sample RSS Document
Experimental RSS Schema (Jorgen Thelin)
Atom: another form of XML-based syndication
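Because RSS is plain XML, a feed can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a small RSS 2.0 document; the feed content, the example.com URLs, and the helper name `parse_items` are all invented for illustration. It assumes the un-namespaced 0.9x / 2.0 layout, so an RSS 1.0 (RDF) or Atom feed would need its own handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used only for illustration
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed</description>
    <item>
      <title>First story</title>
      <link>http://www.example.com/story1.html</link>
      <description>Text of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/story2.html</link>
      <description>Text of the second story.</description>
    </item>
  </channel>
</rss>"""

def parse_items(rss_text):
    """Return the title, link and description of every <item> in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items

items = parse_items(SAMPLE_RSS)
for entry in items:
    print(entry["title"], entry["link"])
```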
Berkeley DB XML
Native XML Database Engine
Embedded XML database linked to the application
Layered on top of the Berkeley DB database (a key-value pair based database)
Stores XML documents in collections and provides the ability to access multiple collections at the same time
Recently started to support XQuery, XPath, and XML Namespaces
Proof of concept for XML IR using traditional IR techniques
Project Objectives
  Platform for indexing and integration of RSS news feeds from multiple sources
  Provide support for keyword searches and focused queries on the index
  Semantically cluster news feeds based on XML feed data
"%
"% Feed Aggregator Data Cleaner XML Encodng Date Formattng Flter non-nterest enttes Data Preprocessor Stop Word Removal Word Stemmer (Porter Stemmer) IR Indces Generaton Clusterng Framework for Clusterng Item Feeds K Means Implementaton Cosne Smlarty as Dstance Metrc Index & Document Contaner (Berkeley DB XML) XML All IR Indces are themselves Documents Query Framework (Keyword Searches & Focused Top 5 Queres)
Keyword based searching of news feed data, e.g. President of Palestine
Daily news item clustering into Top Five Stories using K-means clustering
Popular Story Search using Google API as well as Corpus Statistic
" IR INDICES Document Dctonary <?xml verson="1.0" encodng="iso-8859-1"?> <DocDctonary> <Document> <ID>0</ID> <LINK>http://www.abz.com/permalnker.html </LINK> </Document> </DocDctonary> Term Dctonary <?xml verson="1.0" encodng="iso-8859-1"?> <!-- Term Dctonary--> <TermDct> <Term> <ID>0</ID> <Strng>azb</Strng> </Term> </TermDct>
" 3 IR INDICES Forward Map <ForwardMap> <Postng> <DID>9</DID> <Term> <TID>5</TID><Freq>3</Freq> </Term> </Postng> </ForwardMap>
" 3 IR INDICES Inverted Map <InvertedMap> <Postng> <TID>2</TID> <Document> <DID>3</DID><FREQ>3</FREQ> </Document> </Postng> </InvertedMap>
" 3 NEWS CLUSTERS K Means Clusterng Bascs An algorthm for parttonng (or clusterng) N data ponts nto K dsont subsets S contanng N data ponts so as to mnmze the sum-of-squares crteron J = x µ = 1 n S where xn s a vector representng the nth data pont and µ s the geometrc centrod of the data ponts n S K n 2
" 3 NEWS CLUSTERS K Means Implementaton Specfcaton K = 5 : Top 5 Stores per day Feature Selecton : Postng Fles of a Document Dstance Metrc : Cosne Smlarty On Ttle & Descrpton Text Data Set : RSS Feeds for a partcular day Crteron Functon : Least Mean Squares
" 3 Query Framework $ %& %& #! " #
Data should be conducive to Information Retrieval
Custom parsers required for different schemas
Adding Precision & Recall Metrics to measure Retrieval Performance
Hierarchical clustering in place of K Means
Client / Server based implementation
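The precision and recall metrics mentioned above are straightforward to compute once a set of relevant documents is known for a query. A minimal sketch, with hypothetical document IDs and an invented function name:

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents returned, 3 documents truly relevant
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)  # 0.5 0.6666666666666666
```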
1. Baeza-Yates R., et al., Modern Information Retrieval.
2. Page L., Brin S., Anatomy of a Large Scale Hypertextual Search Engine.
3. Feinberg P., Anatomy of a Native XML Database.
4. Woodley A., Geva S., NLPX XML IR System.
5. Mihajlovic V., et al., XML-IR DB Sandwich.
6. Thelin J., www.thearchitect.co.uk/weblog/archives/2003/03/000118.html