Text mining: bag of words representation and beyond it

Jasminka Dobša
Faculty of Organization and Informatics, University of Zagreb

Outline
- Definition of text mining
- Vector space model or bag of words representation
  - Representation of documents and queries in VSM
  - Measure of similarity between document and query
  - Example
- Semantic representation of documents
  - Latent semantic indexing
  - Concept indexing
  - Example

Definition and basic concepts
- Text mining: part of the wider discipline of data mining
- Content-based processing of unstructured text documents and extraction of useful information from them
- Treatment based only on the content of the documents and not on metadata (date of publication, source of publication, type of document, ...)
- Basic text mining tasks:
  - Information retrieval
  - Automatic classification of text documents
  - Clustering of text documents

Information retrieval: definition and basic concepts
- The problem: determining the set of documents from a collection that are relevant to a particular user query
- Assessment of the relevance of documents is based on comparison of the index terms that appear in the texts of the documents and in the text of the query
- Index term: any word or group of words that appears in the text
- When the text is represented by a set of index terms, much of the semantics contained in the text is lost
- Information about the relations between index terms is lost

Labels and basic models
- Labels:
  - $d_j$: document from the collection of documents
  - $q$: query
  - $a_i$, $a_j$: vector representations of documents
  - $q$: vector representation of the query
- Basic models:
  - Probabilistic model
  - Boolean model
  - Vector space model or bag of words representation

Vector space model or bag of words model
- The vector space model is today one of the most popular models for information retrieval
- It was developed by Gerard Salton and associates within the search system SMART (System for the Mechanical Analysis and Retrieval of Text)
- Similarity between documents and queries can take real values in the interval [0,1]
- Index terms are assigned weights
- Documents are ranked according to their degree of similarity with the query
- The representation in the vector space model is called the bag of words representation (BOW model)
- The BOW model assumes independence of index terms

Term-document matrix: notation
- $T = \{t_1, t_2, \ldots, t_m\}$: set of index terms
- $D = \{d_1, d_2, \ldots, d_n\}$: set of documents
- $a_{ij}$: weight associated with the pair $(t_i, d_j)$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$
- $q_i$, $i = 1, 2, \ldots, m$: weights associated with the pair $(t_i, q)$, where $q$ is the query
- Document $d_j$, $j = 1, 2, \ldots, n$, is represented by the vector $a_j = (a_{1j}, a_{2j}, \ldots, a_{mj})^T$
- The query is represented by the vector $q = (q_1, q_2, \ldots, q_m)^T$
- The collection of documents is represented by the term-document matrix
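The notation above can be illustrated with a minimal Python sketch that builds a raw-count term-document matrix, with rows as terms and columns as documents. The function name `term_document_matrix` and the tokenized toy documents are assumptions made for illustration; they do not come from the slides.

```python
def term_document_matrix(docs):
    """Return (terms, A) where A[i][j] counts occurrences of term i in document j."""
    terms = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(terms)}
    A = [[0] * len(docs) for _ in terms]
    for j, doc in enumerate(docs):
        for t in doc:
            A[index[t]][j] += 1
    return terms, A

# Hypothetical, already tokenized documents
docs = [["data", "mining"], ["linear", "algebra"], ["data", "algebra", "data"]]
terms, A = term_document_matrix(docs)
```

Each column of `A` is one document vector, and most entries are zero even in this tiny case.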

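Term-document matrix
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$
- Rows: terms $t_1, t_2, \ldots, t_m$
- Columns: documents $d_1, d_2, \ldots, d_n$
- The terms are axes in the space
- Documents and queries are points or vectors in the space
- The space is high-dimensional: its dimension corresponds to the number of index terms
- Documents usually contain only a few index terms from the list of index terms
- Vectors representing documents are sparse: most coordinates are equal to 0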

Measure of similarity
- The similarity between the document and the query is measured by the angle between their representations
- The measure of similarity is the cosine of the angle between the representations of the document and the query:
  $\cos(a_j, q) = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}$

Measure of similarity
- Vector representations of documents are usually normalized so that their Euclidean norm is 1: $\|a_j\|_2 = 1$
- The norm of the representation of the query, $\|q\|_2$, does not affect the ranking of documents by relevance
- The similarity between document and query is therefore usually measured by the dot product of the vectors in the numerator, $a_j^T q$
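The two statements above (cosine as the similarity measure, and the dot product with unit-norm documents giving the same ranking) can be checked with a small Python sketch. The document and query vectors are hypothetical values chosen for illustration.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(a, q):
    na, nq = norm(a), norm(q)
    # Convention assumed here: similarity 0 if either vector is zero
    return dot(a, q) / (na * nq) if na and nq else 0.0

def normalize(v):
    n = norm(v)
    return [x / n for x in v] if n else list(v)

docs = [[1, 1, 0], [0, 1, 1], [3, 0, 0]]   # hypothetical document vectors
q = [2, 1, 0]                               # hypothetical query vector

# Ranking by cosine, and by dot product with unit-norm documents
by_cosine = sorted(range(len(docs)), key=lambda j: -cosine(docs[j], q))
by_dot = sorted(range(len(docs)), key=lambda j: -dot(normalize(docs[j]), q))
```

Both rankings agree because dividing every score by the same constant $\|q\|_2$ cannot change their order.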

Example: BOW representation
- 1st document: about vaccination of children against some virus
- 2nd and 3rd document: about vaccination of citizens against flu
- 4th document: about computer viruses

Example: BOW representation
[Figure: the 1st to 4th documents represented as vectors over the terms Child, Virus, Vaccination, Computer, Flu and Immunization]

Example: BOW representation
- Documents will be similar if there is term matching between them
- If a document does not contain the exact word used in the query, but a synonym of that word, the document will not be recognized as relevant to the query
- Documents containing polysemous words will be recognized as similar although they actually are not
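A minimal Python sketch of both failure modes, using the six terms from the example. The assignment of terms to documents is an assumption based on the document descriptions, not data taken from the slides.

```python
import math

TERMS = ["child", "virus", "vaccination", "computer", "flu", "immunization"]

def bow(words):
    """Binary bag-of-words vector over TERMS."""
    return [1 if t in words else 0 for t in TERMS]

def cosine(u, v):
    d = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return d / (nu * nv) if nu and nv else 0.0

# Assumed term sets for illustration
doc1 = bow({"child", "virus", "vaccination"})   # vaccination of children
doc2 = bow({"vaccination", "flu"})              # vaccination against flu
doc4 = bow({"computer", "virus"})               # computer viruses
query = bow({"immunization"})

synonym_miss = cosine(doc2, query)   # synonymy: no shared term, score 0
polysemy_hit = cosine(doc1, doc4)    # polysemy: shared term "virus", score > 0
```

The query about immunization misses the flu vaccination document entirely, while the children's vaccination document and the computer virus document look related only because they share the word "virus".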

Evaluation of information retrieval
- Measures: recall, precision, average precision
- Recall and precision at rank $i$:
  $\mathrm{recall}_i = \frac{r_i}{r_n}, \qquad \mathrm{precision}_i = \frac{r_i}{i}$
  - $r_i$: number of relevant documents among the $i$ highest-ranking documents
  - $r_n$: total number of relevant documents in the collection
- Insisting on high recall reduces the precision of retrieval and vice versa
- Average precision: precision averaged over several recall levels (usually 11)
Workshop on Data Analysis,
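The recall and precision definitions above translate directly into code. The following Python sketch computes both at every rank; the ranked list and the relevance judgements are hypothetical.

```python
def precision_recall(ranking, relevant):
    """Return [(recall_i, precision_i)] for i = 1..len(ranking)."""
    r_n = len(relevant)            # total number of relevant documents
    if r_n == 0:
        raise ValueError("no relevant documents")
    out, r_i = [], 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            r_i += 1               # relevant documents among the top i
        out.append((r_i / r_n, r_i / i))
    return out

ranking = ["D15", "D3", "D12", "D6"]    # hypothetical ranked result list
relevant = {"D15", "D12", "D9"}         # hypothetical relevance judgements
curve = precision_recall(ranking, relevant)
```

Going further down the list can only keep or raise recall, while precision typically falls, which is the trade-off named above.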

Assigning weights to index terms
- Components:
  - Local: depends on the frequency of occurrence of the term in a particular document (term frequency, TF component)
  - Global: a factor for the index term that depends inversely on the number of documents in which the term appears (inverse document frequency, IDF component)
  - Normalization: the vectors representing documents are divided by their Euclidean norm (this avoids the situation in which longer documents are more often returned as relevant)
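The three components can be combined as in the following Python sketch. It uses one common variant, raw term frequency times $\log(n / \mathrm{df})$ followed by Euclidean normalization; this specific formula is an assumption, since the slides do not state which variant was used.

```python
import math

def tfidf_columns(A):
    """A[i][j] = raw count of term i in document j.
    Returns one weighted, unit-norm vector per document."""
    m, n = len(A), len(A[0])
    # Global component: document frequency and its inverse
    df = [sum(1 for j in range(n) if A[i][j] > 0) for i in range(m)]
    idf = [math.log(n / df[i]) if df[i] else 0.0 for i in range(m)]
    cols = []
    for j in range(n):
        # Local component (raw count) times global component
        col = [A[i][j] * idf[i] for i in range(m)]
        # Normalization by the Euclidean norm
        norm = math.sqrt(sum(x * x for x in col))
        cols.append([x / norm for x in col] if norm else col)
    return cols

A = [[2, 0, 1],   # term appears in 2 of 3 documents
     [1, 1, 1],   # term appears in every document, so its idf is log(1) = 0
     [0, 3, 0]]   # term appears in 1 of 3 documents
W = tfidf_columns(A)
```

A term that occurs in every document gets weight 0 and stops contributing to any similarity score.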

Example
- A collection of 15 documents (titles of books):
  - 9 from the field of data mining
  - 5 from the field of linear algebra
  - One that combines these two areas (application of linear algebra to data mining)
- A list of words is formed as follows:
  - From the words contained in at least two documents
  - The words on the list of stop words are discarded
  - The words are reduced to their basic forms, i.e. variations of a word are mapped to the basic version of the word

Documents 1/2
D1  Survey of text mining: clustering, classification, and retrieval
D2  Automatic text processing: the transformation analysis and retrieval of information by computer
D3  Elementary linear algebra: A matrix approach
D4  Matrix algebra & its applications statistics and econometrics
D5  Effective databases for text & document management
D6  Matrices, vector spaces, and information retrieval
D7  Matrix analysis and applied linear algebra
D8  Topological vector spaces and algebras

Documents 2/2
D9   Information retrieval: data structures & algorithms
D10  Vector spaces and algebras for chemistry and physics
D11  Classification, clustering and data analysis
D12  Clustering of large data sets
D13  Clustering algorithms
D14  Document warehousing and text mining: techniques for improving business operations, marketing and sales
D15  Data mining and knowledge discovery

Terms
- Data mining terms: Text, Mining, Clustering, Classification, Retrieval, Information, Document, Data
- Linear algebra terms: Linear, Algebra, Matrix, Vector, Space
- Neutral terms: Analysis, Application, Algorithm
- The concepts are mapped to their basic versions, e.g. Applications, Applied → Application

Queries
- Q1: Data mining
  - Relevant documents: all data mining documents (blue)
- Q2: Using linear algebra for data mining
  - Relevant document: D6

Search results
Weight      bxx               tfn
Document    Q1      Q2        Q1      Q2
D1          0.4472  0.4472    0.3607  0.2424
D2          0       0         0       0
D3          0       1.1547    0       0.6417
D4          0       0.5774    0       0.1471
D5          0       0         0       0
D6          0       0         0       0
D7          0       0.8944    0       0.4598
D8          0       0.5774    0       0.1654
D9          0.5     0.5       0.2634  0.1771
D10         0       0.5774    0       0.1654
D11         0.5     0.5       0.2634  0.1770
D12         0.7071  0.7071    0.4488  0.3016
D13         0       0         0       0
D14         0.5774  0.5774    0.4092  0.2750
D15         1.4142  1.4142    1.0000  0.6720
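The bxx scores in the table are consistent with binary document vectors scaled to unit Euclidean norm and an unnormalized binary query vector. The Python sketch below reproduces a few of them under that reading. Both the reading and the assignment of index terms to titles are assumptions inferred from the term list and the scores, not statements from the slides.

```python
import math

def bxx_score(doc_terms, query_terms):
    """Dot product of a unit-norm binary document vector with a binary query vector."""
    if not doc_terms:
        return 0.0
    return len(doc_terms & query_terms) / math.sqrt(len(doc_terms))

# Index terms per title (assumed from the term list)
docs = {
    "D1": {"text", "mining", "clustering", "classification", "retrieval"},
    "D3": {"linear", "algebra", "matrix"},
    "D6": {"matrix", "vector", "space", "information", "retrieval"},
    "D12": {"clustering", "data"},
    "D15": {"data", "mining"},
}
q1 = {"data", "mining"}                       # Q1: Data mining
q2 = {"linear", "algebra", "data", "mining"}  # Q2: Using linear algebra for data mining

scores = {d: (round(bxx_score(t, q1), 4), round(bxx_score(t, q2), 4))
          for d, t in docs.items()}
```

D6, the only document relevant to Q2, shares no index term with either query and therefore scores 0 for both.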

Comments on search results
- Assume that the search returns all documents whose similarity with the query is greater than 0
- For the first query (Data mining):
  - All documents returned by the search are relevant
  - Not all relevant documents are returned
  - Only documents containing the terms of the query are returned: the search is lexical, not semantic
- For the second query (Using linear algebra for data mining):
  - None of the returned documents is relevant
  - The only relevant document is not returned
- Searching with the weights bxx and tfn did not give significantly different results in this case

CONCEPTUAL INDEXING

Definition
- Conceptual indexing is indexing by features that capture relations between terms
- Here we compare two techniques for conceptual indexing, both based on projecting the document vectors (in the least squares sense) onto a lower-dimensional vector space:
  - Latent semantic indexing (LSI), used as the benchmark
  - Concept indexing (CI)

Latent semantic indexing
- Introduced in 1990; improved in 1995
  - S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman: Indexing by latent semantic analysis, J. American Society for Information Science, 41, 1990, pp. 391-407
  - M. W. Berry, S. T. Dumais, G. W. O'Brien: Using linear algebra for intelligent information retrieval, SIAM Review, 37, 1995, pp. 573-595
- Based on spectral analysis of the term-document matrix

Latent semantic indexing
- For LSI the truncated SVD is used:
  $A_k = U_k \Sigma_k V_k^T$
  where
  - $U_k$ is the $m \times k$ matrix whose columns are the first $k$ left singular vectors of $A$
  - $\Sigma_k$ is the $k \times k$ diagonal matrix whose diagonal is formed by the $k$ leading singular values of $A$
  - $V_k$ is the $n \times k$ matrix whose columns are the first $k$ right singular vectors of $A$
- Rows of $U_k$ = terms
- Rows of $V_k$ = documents
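One step the slide leaves implicit is how a query is compared with documents in the reduced space. A common convention, following the 1995 paper by Berry, Dumais and O'Brien cited earlier, projects the query with the same map that sends column $a_j$ of $A$ to row $j$ of $V_k$. The hatted symbols are introduced here for illustration only, and the formula assumes that the $k$ leading singular values are nonzero:

```latex
\hat{q} = \Sigma_k^{-1} U_k^T q,
\qquad
\hat{a}_j = \Sigma_k^{-1} U_k^T a_j = \left( V_k^T \right)_{:,j}
```

The second identity follows from $U_k^T A = \Sigma_k V_k^T$. Documents can then be ranked by the cosine between $\hat{q}$ and $\hat{a}_j$.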

LSI: Representations
[Diagram: the full SVD $A$ (m x n) = $U$ (m x m) $\Sigma$ (m x n) $V^T$ (n x n) and its truncation to $U_k$ (m x k), $\Sigma_k$ (k x k), $V_k^T$ (k x n), with labels "Vectors of terms" and "Vectors of documents"]

Concept indexing (CI)
- Instead of the SVD, the concept decomposition (CD) is used
- CD was introduced in 2001
  - I. S. Dhillon, D. S. Modha: Concept decompositions for large sparse text data using clustering, Machine Learning, 42:1, 2001, pp. 143-175

Concept decomposition
- First step: clustering of the documents in the term-document matrix $A$ into $k$ groups
- Clustering algorithms used:
  - k-means algorithm
  - Fuzzy k-means algorithm
- Centroids of groups = concept vectors
- The concept matrix is the matrix whose columns are the centroids of the groups:
  $C_k = [\, c_1 \; c_2 \; \cdots \; c_k \,]$, where $c_j$ is the centroid of the $j$-th group
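The first step can be sketched in Python as follows. This is plain Euclidean k-means with a deterministic initialization, chosen here for illustration only; the slides do not specify the k-means variant, the distance or the initialization, and the toy document vectors are hypothetical.

```python
def kmeans(points, k, iters=10):
    """Plain Euclidean k-means; returns (centroids, assignment)."""
    centroids = [list(p) for p in points[:k]]   # assumed init: first k points
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its cluster members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

# Hypothetical document vectors over the terms (data, mining, linear, algebra)
docs = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1]]
concept_vectors, groups = kmeans(docs, k=2)
```

The returned centroids, written as columns, form the concept matrix; here one concept vector collects the data mining terms and the other the linear algebra terms.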

Concept decomposition
- Second step: computing the CD
- It is obtained by solving the least squares problem
  $\| A - C_k Z \|_2 \to \min$
- The solution to this problem gives the concept decomposition:
  $Z = (C_k^T C_k)^{-1} C_k^T A$
- Rows of $C_k$ = terms
- Columns of $Z$ = documents
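The closed form for $Z$ follows from the normal equations, applied to each document separately. A short derivation, assuming that the concept vectors are linearly independent so that $C_k^T C_k$ is invertible, and writing $z_j$ for column $j$ of $Z$ (notation introduced here):

```latex
\min_{z_j} \| a_j - C_k z_j \|_2 , \quad j = 1, \ldots, n
\;\Longrightarrow\;
C_k^T C_k \, z_j = C_k^T a_j
\;\Longrightarrow\;
Z = (C_k^T C_k)^{-1} C_k^T A
```

Each column $z_j$ holds the coordinates of document $d_j$ in the space spanned by the concept vectors, and $C_k Z$ is the resulting approximation of $A$ of rank at most $k$.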

Example: Projection of terms by SVD
[Figure: two-dimensional scatter plot of the projected terms; legend: data mining terms, linear algebra terms]

Example: Projection of terms by CD
[Figure: two-dimensional scatter plot of the projected terms; legend: data mining terms, linear algebra terms]

Example: Projection of documents by SVD
[Figure: two-dimensional scatter plot of the projected documents and the queries Q1 and Q2; legend: data mining documents, linear algebra documents, D6 document, queries; marked regions: relevant documents for query Q1, relevant documents for query Q2]

Example: Projection of documents by CD
[Figure: two-dimensional scatter plot of the projected documents and the queries Q1 and Q2; legend: data mining documents, linear algebra documents, D6 document, queries; marked regions: relevant documents for query Q1, relevant documents for query Q2]

Results of information retrieval (Q1)
Term-matching method    Latent semantic indexing    Concept indexing
score     document      score     document          score     document
1.4142    D15           0.6141    D1                0.6622    D1
0.7071    D12           0.5480    D11               0.6057    D14
0.5774    D14           0.5465    D12               0.5953    D12
0.5000    D9            0.4809    D9                0.5763    D11
0.5000    D11           0.4644    D15               0.5619    D15
0.4472    D1            0.4301    D2                0.4727    D5
0.0000    D2            0.4127    D14               0.3379    D13
0.0000    D3            0.3858    D13               0.2690    D2
0.0000    D4            0.3165    D5                0.2615    D9
0.0000    D5            0.1585    D6                0.0016    D7
0.0000    D6            0.0013    D7                -0.0063   D6
0.0000    D7            -0.0631   D8                -0.0462   D3
0.0000    D8            -0.0631   D10               -0.0468   D4
0.0000    D10           -0.0712   D4                -0.0473   D8
0.0000    D13           -0.0712   D3                -0.0473   D10

Results of information retrieval (Q2)
Term-matching method    Latent semantic indexing    Concept indexing
score     document      score     document          score     document
1.4142    D15           0.6737    D6                0.7105    D1
1.1547    D3            0.6472    D7                0.6204    D11
0.8944    D7            0.6100    D8                0.5936    D12
0.7071    D12           0.6100    D10               0.5517    D2
0.5774    D4            0.5924    D3                0.5488    D14
0.5774    D8            0.5924    D4                0.5452    D6
0.5774    D10           0.5789    D10               0.5441    D9
0.5774    D14           0.5404    D2                0.5280    D7
0.5000    D9            0.5268    D11               0.5256    D15
0.5000    D11           0.5236    D9                0.4797    D8
0.4472    D1            0.4656    D12               0.4797    D10
0.0000    D2            0.3936    D15               0.4693    D3
0.0000    D5            0.3560    D14               0.4693    D4
0.0000    D6            0.3320    D13               0.4459    D5
0.0000    D13           0.2800    D5                0.4202    D13

Example: Experimental collections of documents
- MEDLINE
  - 1033 documents
  - 30 queries
  - Relevance judgements
- CRANFIELD
  - 1400 documents
  - 225 queries
  - Relevance judgements

Experiment
- Comparison of the mean average precision of information retrieval and of precision-recall plots
- Mean average precision for the term-matching method:
  - MEDLINE: 43.54
  - CRANFIELD: 20.89

MEDLINE: mean average precision
[Figure: mean average precision against rank of approximation (0 to 300); series: k-means, fuzzy k-means, SVD]

CRANFIELD: mean average precision
[Figure: mean average precision against rank of approximation (k) (0 to 300); series: k-means, fuzzy k-means, SVD]

MEDLINE: precision-recall plot
[Figure: precision against recall; series: k-means, fuzzy k-means, SVD, term-matching]

Theorem (Eckart and Young)
- The distance in the Frobenius norm between a matrix $A$ and its approximation by a matrix of rank $k < p = \min\{m, n\}$ is minimized by the truncated SVD of the given matrix
- Despite this theoretical fact, the results of information retrieval are better for the concept decomposition obtained by fuzzy k-means clustering
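Stated as a formula, with $A_k = U_k \Sigma_k V_k^T$ the truncated SVD and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p$ the singular values of $A$ (this explicit form is the standard statement of the theorem, added here for reference):

```latex
\| A - A_k \|_F
= \min_{\operatorname{rank}(B) \le k} \| A - B \|_F
= \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_p^2}
```

Since $C_k Z$ has rank at most $k$, its approximation error in the Frobenius norm can never be smaller than that of $A_k$. The retrieval results therefore show that a smaller approximation error does not by itself guarantee better retrieval.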

Thank you for your attention!