Text mining: bag of words representation and beyond it

Jasminka Dobša
Faculty of Organization and Informatics, University of Zagreb

Outline
- Definition of text mining
- Vector space model or bag of words representation
  - Representation of documents and queries in the VSM
  - Measure of similarity between document and query
  - Example
- Semantic representation of documents
  - Latent semantic indexing
  - Concept indexing
  - Example

Definition and basic concepts
- Text mining: part of the wider discipline of data mining
- Content-based processing of unstructured text documents and extraction of useful information from them
- Treatment is based only on the content of the documents and not on metadata (date of publication, source of publication, type of document...)
- Basic text mining tasks:
  - Information retrieval
  - Automatic classification of text documents
  - Clustering of text documents

Information retrieval: definition and basic concepts
- The problem: determining the set of documents from a collection that are relevant to a particular user query
- The relevance of documents is assessed by comparing the index terms that appear in the texts of the documents with those in the text of the query
- Index term: any word or group of words that appears in the text
- When a text is represented by a set of index terms, much of the semantics contained in the text is lost
- Information about the relations between index terms is lost

Notation and basic models
- Notation:
  - $d_j$: document from the collection of documents
  - $q$: query
  - $a_i$, $a_j$: vector representations of documents
  - $q$: vector representation of the query
- Basic models:
  - Probabilistic model
  - Boolean model
  - Vector space model or bag of words representation

Vector space model or bag of words model
- The vector space model is today one of the most popular models for information retrieval
- It was developed by Gerard Salton and associates within the search system SMART (System for the Mechanical Analysis and Retrieval of Text)
- Similarity between documents and queries can take real values in the interval [0,1]
- Index terms are assigned weights
- Documents are ranked according to their degree of similarity with the query
- The representation in the vector space model is called the bag of words representation (BOW model)
- The BOW model assumes independence of index terms

Term-document matrix: notation
- $T = \{t_1, t_2, \ldots, t_m\}$: set of index terms
- $D = \{d_1, d_2, \ldots, d_n\}$: set of documents
- $a_{ij}$: weight associated with the pair $(t_i, d_j)$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$
- $q_i$, $i = 1, 2, \ldots, m$: weight associated with the pair $(t_i, q)$, where $q$ is the query
- Document $d_j$, $j = 1, 2, \ldots, n$, is represented by the vector $a_j = (a_{1j}, a_{2j}, \ldots, a_{mj})^T$
- The query is represented by the vector $q = (q_1, q_2, \ldots, q_m)^T$
- The collection of documents is represented by the term-document matrix

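Term-document matrix

$$
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
$$

Rows: terms $t_1, t_2, \ldots, t_m$. Columns: documents $d_1, d_2, \ldots, d_n$.
- The terms are the axes of the space
- Documents and queries are points or vectors in the space
- The space is high-dimensional: its dimension corresponds to the number of index terms
- Documents usually contain only a few index terms from the list of index terms
- Vectors representing documents are sparse: most coordinates are equal to 0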

Measure of similarity
- The similarity between a document and the query is measured by the angle between their representations
- The measure of similarity is the cosine of the angle between the representations of the document and the query:

$$
\cos(a_j, q) = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2}
$$

Measure of similarity (continued)
- Vector representations of documents are usually normalized so that their Euclidean norm is 1: $\|a_j\|_2 = 1$
- The norm of the query representation, $\|q\|_2$, does not affect the ranking of documents by relevance
- The similarity between a document and the query is therefore usually measured by the dot product $a_j^T q$ in the numerator
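The claim that the query norm does not change the ranking can be checked with a small sketch. The matrix `A`, the query `q` and the helper `cosine_similarity` below are hypothetical illustrations, not data from the slides; documents are stored as columns, as in the term-document matrix.

```python
import numpy as np

def cosine_similarity(a, q):
    """Cosine of the angle between a document vector a and a query vector q."""
    return float(a @ q / (np.linalg.norm(a) * np.linalg.norm(q)))

# Hypothetical collection: 4 index terms (rows), 3 documents (columns).
A = np.array([[1.0, 0.0, 2.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 1.0, 0.0, 0.0])  # hypothetical query

cosines = np.array([cosine_similarity(A[:, j], q) for j in range(A.shape[1])])

# Normalize each document (column) to unit Euclidean norm,
# then score with the plain dot product a_j^T q.
A_unit = A / np.linalg.norm(A, axis=0)
dots = A_unit.T @ q

# Rank documents from most to least similar under both scores.
ranking_by_cosine = np.argsort(-cosines, kind="stable")
ranking_by_dot = np.argsort(-dots, kind="stable")
```

The dot-product scores differ from the cosines only by the constant factor $\|q\|_2$, so both rankings coincide.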

Example: BOW representation
- 1st document: about vaccination of children against some virus
- 2nd and 3rd documents: about vaccination of citizens against flu
- 4th document: about computer viruses

Example: BOW representation
[Figure: vector representations of the 1st, 2nd, 3rd and 4th documents over the index terms Child, Virus, Vaccination, Computer, Flu, Immunization]

Example: BOW representation
- Documents are similar only if terms match between them
- If a document does not contain the exact word used in the query, but a synonym of that word, it will not be recognized as relevant to the query
- Documents containing polysemous words will be recognized as similar although they actually are not

Evaluation of information retrieval
- Measures: recall, precision, average precision
- Recall and precision at rank $i$:

$$
\mathrm{recall}_i = \frac{r_i}{r_n}, \qquad \mathrm{precision}_i = \frac{r_i}{i}
$$

  - $r_i$: number of relevant documents among the $i$ highest-ranking documents
  - $r_n$: total number of relevant documents in the collection
- Insisting on high recall reduces the precision of retrieval, and vice versa
- Average precision: precision averaged over several recall levels (usually 11)

[Slide footer: Workshop on Data Analysis]
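As a rough illustration of these two measures, the sketch below evaluates the term-matching ranking for query Q1 reported later in the example. The function name `precision_recall_at` is hypothetical, and the set of relevant documents is an assumption: the nine data mining titles as inferred from the document list, since the slides only mark them by colour.

```python
def precision_recall_at(ranked_ids, relevant_ids, i):
    """Return (precision_i, recall_i) for the i highest-ranking documents."""
    r_i = sum(1 for doc in ranked_ids[:i] if doc in relevant_ids)  # relevant among top i
    r_n = len(relevant_ids)                                        # relevant in the collection
    return r_i / i, r_i / r_n

# Documents with a nonzero term-matching score for Q1, in ranked order.
ranked = ["D15", "D12", "D14", "D9", "D11", "D1"]
# Assumption: the nine data mining documents, inferred from the titles.
relevant = {"D1", "D2", "D5", "D9", "D11", "D12", "D13", "D14", "D15"}

p3, r3 = precision_recall_at(ranked, relevant, 3)
p6, r6 = precision_recall_at(ranked, relevant, 6)
```

Under that assumption precision stays at 1 for all six returned documents while recall only reaches 6/9, which matches the later comment that every returned document is relevant but not every relevant document is returned.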

Assigning weights to index terms
Components:
- Local: depends on the frequency of occurrence of the term in a particular document (term frequency, TF component)
- Global: a factor for the index term that depends inversely on the number of documents in which the term appears (inverse document frequency, IDF component)
- Normalization: the vector representations of documents are divided by their Euclidean norm (this prevents longer documents from being returned as relevant more often)
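The three components can be sketched as follows. This is a minimal illustration under stated assumptions: the local weight is the raw term count and the global weight is $\log(n / \mathrm{df})$, one common TF-IDF variant that is not necessarily the scheme used in the slides; `tfidf_matrix` and the count matrix are hypothetical.

```python
import numpy as np

def tfidf_matrix(counts):
    """Weight a term-by-document count matrix: local TF, global IDF, unit-norm columns."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)      # documents containing each term
    idf = np.log(n_docs / np.maximum(df, 1))   # assumed global weight log(n / df)
    weighted = counts * idf[:, None]           # local weight times global weight
    norms = np.linalg.norm(weighted, axis=0)
    norms[norms == 0] = 1.0                    # leave all-zero documents unchanged
    return weighted / norms                    # Euclidean normalization per document

# Hypothetical counts: 3 terms (rows) in 3 documents (columns).
counts = np.array([[2, 0, 1],
                   [1, 1, 1],
                   [0, 3, 0]])
W = tfidf_matrix(counts)
```

The second term occurs in every document, so its global weight is zero. The first and third documents differ only in how often the first term occurs, and after normalization they receive the same representation.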

Example
- A collection of 15 documents (titles of books):
  - 9 from the field of data mining
  - 5 from the field of linear algebra
  - one that combines these two areas (application of linear algebra to data mining)
- The list of index terms is formed as follows:
  - Only words contained in at least two documents are taken
  - Words on the list of stop words are discarded
  - Words are reduced to their basic forms, i.e. variations of a word are mapped to its basic version

Documents 1/2
D1  Survey of text mining: clustering, classification, and retrieval
D2  Automatic text processing: the transformation, analysis and retrieval of information by computer
D3  Elementary linear algebra: A matrix approach
D4  Matrix algebra & its applications to statistics and econometrics
D5  Effective databases for text & document management
D6  Matrices, vector spaces, and information retrieval
D7  Matrix analysis and applied linear algebra
D8  Topological vector spaces and algebras

Documents 2/2
D9   Information retrieval: data structures & algorithms
D10  Vector spaces and algebras for chemistry and physics
D11  Classification, clustering and data analysis
D12  Clustering of large data sets
D13  Clustering algorithms
D14  Document warehousing and text mining: techniques for improving business operations, marketing and sales
D15  Data mining and knowledge discovery

Terms
- Data mining terms: Text, Mining, Clustering, Classification, Retrieval, Information, Document, Data
- Linear algebra terms: Linear, Algebra, Matrix, Vector, Space
- Neutral terms: Analysis, Application, Algorithm

Words are mapped to their basic versions, e.g. Applications, Applied → Application

Queries
- Q1: Data mining
  Relevant documents: all data mining documents (marked in blue)
- Q2: Using linear algebra for data mining
  Relevant document: D6

Search results

            bxx weights         tfn weights
Document    Q1        Q2        Q1        Q2
D1          0,4472    0,4472    0,3607    0,2424
D2          0         0         0         0
D3          0         1,1547    0         0,6417
D4          0         0,5774    0         0,1471
D5          0         0         0         0
D6          0         0         0         0
D7          0         0,8944    0         0,4598
D8          0         0,5774    0         0,1654
D9          0,5       0,5       0,2634    0,1771
D10         0         0,5774    0         0,1654
D11         0,5       0,5       0,2634    0,1770
D12         0,7071    0,7071    0,4488    0,3016
D13         0         0         0         0
D14         0,5774    0,5774    0,4092    0,2750
D15         1,4142    1,4142    1,0000    0,6720
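Several of the bxx scores can be reproduced by hand. The sketch below assumes that bxx means binary term weights with unit-normalized document vectors and an unnormalized binary query; the term sets per document are derived here from the titles and the term list, so they are an interpretation rather than data given in the slides.

```python
import numpy as np

terms = ["text", "mining", "clustering", "classification", "retrieval",
         "information", "document", "data", "linear", "algebra", "matrix",
         "vector", "space", "analysis", "application", "algorithm"]

# Assumed index terms per document, read off the book titles.
docs = {
    "D1": {"text", "mining", "clustering", "classification", "retrieval"},
    "D3": {"linear", "algebra", "matrix"},
    "D6": {"matrix", "vector", "space", "information", "retrieval"},
    "D12": {"clustering", "data"},
    "D15": {"data", "mining"},
}
queries = {
    "Q1": {"data", "mining"},
    "Q2": {"linear", "algebra", "data", "mining"},
}

def binary_vector(term_set):
    """Binary bag of words vector over the fixed term list."""
    return np.array([1.0 if t in term_set else 0.0 for t in terms])

def score(doc_terms, query_terms):
    """Dot product of the unit-normalized document vector with the binary query."""
    a = binary_vector(doc_terms)
    a = a / np.linalg.norm(a)
    return float(a @ binary_vector(query_terms))

scores = {(d, q): round(score(docs[d], queries[q]), 4)
          for d in docs for q in queries}
```

With these assumptions the computed values agree with the bxx columns for the five documents shown, for instance 1,4142 for D15 and 1,1547 for D3 on Q2. D6 scores 0 on both queries because its title shares no index term with either query.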

Comments on search results
Assume that the search returns all documents whose similarity with the query is greater than 0.
- For the first query (Data mining):
  - All documents returned by the search are relevant
  - Not all relevant documents are returned
  - Only documents containing the terms of the query are returned: the search is lexical, not semantic
- For the second query (Using linear algebra for data mining):
  - None of the returned documents is relevant
  - The only relevant document is not returned
- The bxx and tfn weights did not give significantly different results in this case

CONCEPTUAL INDEXING

Definition
- Conceptual indexing is indexing by features which recognize relations between terms
- Here we compare two techniques for conceptual indexing, both based on projecting the document vectors (in the least squares sense) onto a lower-dimensional vector space:
  - Latent semantic indexing (LSI), used as the benchmark
  - Concept indexing (CI)

Latent semantic indexing
- Introduced in 1990; improved in 1995
  - S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman: Indexing by latent semantic analysis, J. American Society for Information Science, 41, 1990, pp. 391-407
  - M. W. Berry, S. T. Dumais, G. W. O'Brien: Using linear algebra for intelligent information retrieval, SIAM Review, 37, 1995, pp. 573-595
- Based on spectral analysis of the term-document matrix

Latent semantic indexing
For LSI the truncated SVD is used:

$$
A_k = U_k \Sigma_k V_k^T
$$

where
- $U_k$ is the $m \times k$ matrix whose columns are the first $k$ left singular vectors of $A$
- $\Sigma_k$ is the $k \times k$ diagonal matrix whose diagonal is formed by the $k$ leading singular values of $A$
- $V_k$ is the $n \times k$ matrix whose columns are the first $k$ right singular vectors of $A$

Rows of $U_k$ = terms. Rows of $V_k$ = documents.
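A minimal NumPy sketch of the truncated SVD follows. The random matrix stands in for a term-document matrix and `truncated_svd` is a hypothetical helper; only `np.linalg.svd` is a library call.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k (m x k), Sigma_k (k x k) and V_k (n x k) of the rank-k truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # singular values sorted descending
    return U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Hypothetical term-document matrix: m = 6 terms, n = 4 documents.
rng = np.random.default_rng(0)
A = rng.random((6, 4))

k = 2
U_k, S_k, V_k = truncated_svd(A, k)
A_k = U_k @ S_k @ V_k.T   # rank-k approximation of A

term_vectors = U_k        # one row per term
document_vectors = V_k    # one row per document
```

Each term and each document is now described by $k$ coordinates instead of $n$ or $m$, which is the lower-dimensional representation used for retrieval.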

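LSI: Representations
[Figure: full SVD $A\,(m \times n) = U\,(m \times m)\;\Sigma\,(m \times n)\;V^T\,(n \times n)$ and its truncation to $U_k\,(m \times k)$, $\Sigma_k\,(k \times k)$, $V_k^T\,(k \times n)$; labels: vectors of terms, vectors of documents]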

Concept indexing (CI)
- Instead of the SVD, the concept decomposition (CD) is used
- CD was introduced in 2001
  - I. S. Dhillon, D. S. Modha: Concept decompositions for large sparse text data using clustering, Machine Learning, 42:1, 2001, pp. 143-175

Concept decomposition
First step: clustering of the documents in the term-document matrix $A$ into $k$ groups
- Clustering algorithms used:
  - k-means algorithm
  - Fuzzy k-means algorithm
- Centroids of the groups = concept vectors
- The concept matrix is the matrix whose columns are the centroids of the groups:

$$
C_k = [\, c_1 \;\; c_2 \;\; \cdots \;\; c_k \,]
$$

  where $c_j$ is the centroid of the $j$-th group

Concept decomposition
Second step: computing the CD
- It is obtained by solving the least squares problem

$$
\|A - C_k Z\|_2 \to \min
$$

- The solution to this problem is

$$
Z = (C_k^T C_k)^{-1} C_k^T A
$$

- The concept decomposition is the resulting approximation $C_k Z$ of $A$

Rows of $C_k$ = terms. Columns of $Z$ = documents.
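Both steps can be sketched in a few lines. To keep the example short, the cluster assignment `labels` is fixed by hand instead of being produced by k-means or fuzzy k-means, and the matrix `A` is a hypothetical toy collection; `concept_matrix` and `concept_decomposition` are illustrative names.

```python
import numpy as np

def concept_matrix(A, labels, k):
    """Columns are the centroids (concept vectors) of the k document groups."""
    return np.column_stack([A[:, labels == j].mean(axis=1) for j in range(k)])

def concept_decomposition(A, C_k):
    """Solve the least squares problem for Z, one column of A at a time."""
    Z, *_ = np.linalg.lstsq(C_k, A, rcond=None)
    return Z

# Hypothetical term-document matrix: 4 terms, 4 documents in two obvious groups.
A = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.8, 1.0, 0.1, 0.0],
              [0.0, 0.1, 1.0, 0.9],
              [0.1, 0.0, 0.9, 1.0]])
labels = np.array([0, 0, 1, 1])   # assumed clustering of the documents

C_k = concept_matrix(A, labels, k=2)
Z = concept_decomposition(A, C_k)

# Closed form from the slide, valid when C_k has full column rank.
Z_closed = np.linalg.solve(C_k.T @ C_k, C_k.T @ A)
A_approx = C_k @ Z
```

`np.linalg.lstsq` is used instead of forming the inverse explicitly, which is the numerically safer route; on this well-conditioned example both give the same `Z`.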

Example: Projection of terms by SVD
[Figure: scatter plot of the terms projected by SVD; legend: data mining terms, linear algebra terms]

Example: Projection of terms by CD
[Figure: scatter plot of the terms projected by CD; legend: data mining terms, linear algebra terms]

Example: Projection of documents by SVD
[Figure: scatter plot of the documents and queries projected by SVD; legend: data mining documents, linear algebra documents, D6 document, queries; annotations mark Q1, Q2 and the relevant documents for each query]

Example: Projection of documents by CD
[Figure: scatter plot of the documents and queries projected by CD; legend: data mining documents, linear algebra documents, D6 document, queries; annotations mark Q1, Q2 and the relevant documents for each query]

Results of information retrieval (Q1)

Term-matching method    Latent semantic indexing    Concept indexing
score      document     score      document         score      document
1,4142     D15          0,6141     D1               0,6622     D1
0,7071     D12          0,5480     D11              0,6057     D14
0,5774     D14          0,5465     D12              0,5953     D12
0,5000     D9           0,4809     D9               0,5763     D11
0,5000     D11          0,4644     D15              0,5619     D15
0,4472     D1           0,4301     D2               0,4727     D5
0,0000     D2           0,4127     D14              0,3379     D13
0,0000     D3           0,3858     D13              0,2690     D2
0,0000     D4           0,3165     D5               0,2615     D9
0,0000     D5           0,1585     D6               0,0016     D7
0,0000     D6           0,0013     D7               -0,0063    D6
0,0000     D7           -0,0631    D8               -0,0462    D3
0,0000     D8           -0,0631    D10              -0,0468    D4
0,0000     D10          -0,0712    D4               -0,0473    D8
0,0000     D13          -0,0712    D3               -0,0473    D10

Results of information retrieval (Q2)

Term-matching method    Latent semantic indexing    Concept indexing
score      document     score      document         score      document
1,4142     D15          0,6737     D6               0,7105     D1
1,1547     D3           0,6472     D7               0,6204     D11
0,8944     D7           0,6100     D8               0,5936     D12
0,7071     D12          0,6100     D10              0,5517     D2
0,5774     D4           0,5924     D3               0,5488     D14
0,5774     D8           0,5924     D4               0,5452     D6
0,5774     D10          0,5789     D10              0,5441     D9
0,5774     D14          0,5404     D2               0,5280     D7
0,5000     D9           0,5268     D11              0,5256     D15
0,5000     D11          0,5236     D9               0,4797     D8
0,4472     D1           0,4656     D12              0,4797     D10
0,0000     D2           0,3936     D15              0,4693     D3
0,0000     D5           0,3560     D14              0,4693     D4
0,0000     D6           0,3320     D13              0,4459     D5
0,0000     D13          0,2800     D5               0,4202     D13

Example: Experimental collections of documents
- MEDLINE
  - 1033 documents
  - 30 queries
  - Relevance judgements
- CRANFIELD
  - 1400 documents
  - 225 queries
  - Relevance judgements

Experiment
- Comparison of the mean average precision of information retrieval and of precision-recall plots
- Mean average precision for the term-matching method:
  - MEDLINE: 43,54
  - CRANFIELD: 20,89

MEDLINE: mean average precision
[Figure: mean average precision (vertical axis, 30,00 to 60,00) against rank of approximation (horizontal axis, 0 to 300) for k-means, fuzzy k-means and SVD]

CRANFIELD: mean average precision
[Figure: mean average precision (vertical axis, 0,00 to 0,25) against rank of approximation k (horizontal axis, 0 to 300) for k-means, fuzzy k-means and SVD]

MEDLINE: precision-recall plot
[Figure: precision (vertical axis, 0 to 90) against recall (horizontal axis, 0 to 120) for k-means, fuzzy k-means, SVD and term-matching]

Theorem (Eckart and Young)
- The distance in the Frobenius norm between a matrix $A$ and its approximation by a matrix of rank $k < p = \min\{m, n\}$ is minimized by the truncated SVD of $A$
- Despite this theoretical fact, the results of information retrieval are better for the concept decomposition obtained by fuzzy k-means clustering
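The theorem can be checked numerically on a small example. The random matrix and the hand-fixed cluster assignment below are hypothetical; the concept decomposition only serves as one competing approximation of rank at most $k$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 6))   # hypothetical term-document matrix
k = 2

# Rank-k truncated SVD and its Frobenius error.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err_svd = np.linalg.norm(A - A_svd, "fro")

# Competing approximation of rank at most k: concept decomposition
# built from an assumed, hand-fixed clustering of the 6 documents.
labels = np.array([0, 1, 0, 1, 0, 1])
C_k = np.column_stack([A[:, labels == j].mean(axis=1) for j in range(k)])
Z, *_ = np.linalg.lstsq(C_k, A, rcond=None)
err_cd = np.linalg.norm(A - C_k @ Z, "fro")

# The SVD error equals the root of the sum of the squared discarded singular values.
err_expected = np.sqrt(np.sum(s[k:] ** 2))
```

Here `err_svd` matches `err_expected` and cannot exceed `err_cd`. A smaller approximation error does not by itself imply better retrieval, which is the point of the second bullet above.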

Thank you for your attention!