An Extension of TF-IDF Model for Extracting Feature Terms


Mingxi Zhang*, Hangfei Hu, Guanying Su, Yuening Zhang, Xiaohong Wang

College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China

*Corresponding author

Acknowledgements: This work was supported by the Natural Science Foundation of Shanghai under grant 16ZR, and by the Training Project of University of Shanghai for Science and Technology under grant 16HJPY-QN04.

Abstract: Extracting feature terms helps users obtain useful information correctly and quickly, which is important in many real applications such as web search, document clustering, and similarity computation. Although the TF-IDF model can be used to generate feature terms, latent terms that are not directly contained in the current document may be neglected. In this paper, an approach for finding such latent feature terms is proposed by extending the TF-IDF model. First, a bipartite graph is constructed to represent the relationship between documents and terms, and an iterative formula is given for finding feature terms that are not directly contained in the current document. Based on the bipartite graph, the transition probabilities from documents to terms are computed, and the feature terms with relatively high transition probabilities are selected. Experimental results on a real dataset demonstrate the effectiveness of the proposed approach in comparison with the TF-IDF model.

Keywords: TF-IDF; transition probability; feature terms

1. Introduction

A document contains many terms, but most of them contribute little to the document; only a few play an important role. These representative terms are viewed as feature terms. The feature terms extracted from a document serve as its signature: they represent the subject of the whole text for information retrieval [19], and a computer can quickly locate the main information when querying by feature terms. Moreover, the number of documents grows ever larger, so more time is needed to process user queries. Feature terms can shorten query time and improve query accuracy, and users can obtain a general impression of a document from its feature terms and decide in a short time whether the document is useful. Feature terms reflect both the main idea of a document and its field, and they help users navigate to the target document [2]. Thus feature terms are not only representative of the main content but also specific. Currently, most feature terms are supplied by the authors of papers. However, many documents, especially older papers, are not annotated with feature terms. Assigning feature terms to every document by hand is difficult: the task is onerous, and not all manually chosen feature terms are suitable, owing to human subjectivity. Inappropriate feature terms would have negative effects on subsequent processing. It is therefore necessary to study how to extract feature terms from documents automatically. Feature term extraction is a foundational technique that serves many fields, such as text classification, data mining, indexing, text analysis [11], and clustering analysis [3].
There has been much related work on extracting feature terms. One of the most popular algorithms is TF-IDF. TF represents the occurrence frequency of a term within a document. The main idea of IDF is that if few documents contain a certain term, the term has good discriminative power for classification. The main idea of TF-IDF is therefore that a term or phrase is discriminative and suitable for classification if its frequency within a document is high while it rarely appears in other documents [5]. Zipf, a linguist at Harvard University, found when studying the frequency of English words that, when terms are ranked by frequency from largest to smallest, the frequency of each term is roughly inversely proportional to a constant power of its rank. Based on Zipf's law, TF-IDF is a useful tool to a certain degree. With the development of the Web, new methods based on the TF-IDF algorithm have appeared. Although TF-IDF is a classical algorithm that has proved very efficient, the traditional TF-IDF algorithm also has disadvantages. This paper dissects the original TF-IDF algorithm, analyzes existing algorithms based on it, and finally extends TF-IDF to extract feature terms. A new approach based on graph topology is presented for extracting feature terms: the weights of terms are calculated not only from how often they appear in a document, but also from the topological structure.
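As a concrete illustration, the classical TF-IDF weighting that the rest of the paper builds on can be sketched as follows. This is a minimal sketch with invented toy documents (the paper's own experiments use DBLP abstracts, which are not reproduced here):

```python
import math

# Toy corpus: each document is a bag of terms (invented example data;
# the paper's experiments use abstracts extracted from DBLP).
docs = [
    ["graph", "keyword", "extraction", "graph"],
    ["keyword", "ranking"],
    ["graph", "clustering", "ranking"],
]

def tf(term, doc):
    # Count of the term in the document, over the document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of the number of documents over the number of documents
    # that contain the term.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

vocab = sorted({t for d in docs for t in d})
# The relational matrix of TF-IDF weights, one row per document.
relational = [[tf(t, d) * idf(t, docs) for t in vocab] for d in docs]
```

A term occurring in every document gets an IDF of zero and thus a zero weight, which is exactly the discriminative-power intuition described above.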

2. Related work

In recent years, work on feature term extraction has received a great deal of interest. An important aspect of most of this work is identifying appropriate weighting techniques to assign weights to the terms in a text corpus; the feature terms are then identified based on the assigned weights. Turney [15] regarded a document as a set of phrases and used a genetic algorithm and the C4.5 decision tree induction algorithm to extract feature phrases. The KEA system obtained feature values using naive Bayesian techniques [17]. A rule induction algorithm [6] showed that adding linguistic knowledge to text could improve extraction accuracy. The KP-Miner system can extract feature terms from English and Arabic documents of any length [4] without any training. IDF was first proposed by Jones [8] in 1972, with the main idea that if a certain feature is concentrated in a small number of documents, it carries higher entropy and its weight should be correspondingly higher. TF-IDF was proposed by Salton [12] based on the IDF algorithm: TF is the frequency of a term in a document, and IDF is the inverse document frequency. TF-IDF is a classical algorithm for finding feature terms or stop words. Although it is efficient in most cases, it also has shortcomings, and most papers on its shortcomings begin with classification. The TF-IDF model is limited precisely because of its universal applicability. In different fields, some terms should be viewed as stop words rather than feature terms, and some terms may have different meanings in a specific field. Within the same document, two or more terms may share the same meaning, and they should all be regarded as stop words rather than feature terms. To avoid these and similar problems, many researchers have worked to improve the basic model.
In [9], a keyword extraction algorithm named IWCN based on TF-IDF was proposed, which takes both TF-IDF and semantic weights into consideration. [14] presented a hybrid statistical-graphical algorithm to extract keyphrases. The author of [7] extracted keywords to classify network texts based on the k-nearest neighbor method. In [18], a keyword extraction algorithm that applies to a single document without using a corpus is presented: frequent terms are extracted by counting term frequencies; two terms occurring in the same sentence are considered to co-occur; and if the co-occurrence distribution between a term a and the frequent terms is biased toward a particular subset of the frequent terms, then a is a possible keyword. The chi-squared measure is used to quantify the bias of the co-occurrence distribution. In [1], TopicRank, a graph-based keyphrase extraction method, relies on a topical representation of the document: candidate keyphrases are clustered into topics, the topics are used as vertices in a complete graph, a graph-based ranking model assigns a significance score to each topic, and keyphrases are extracted according to the scores. [16] proposed a keyword extraction algorithm named TKG (Twitter Keyword Graph): stop words are removed in a preprocessing step, a textual graph is built from the co-occurrence relationships between tokens, and centrality measures, including degree centrality, closeness centrality, and eccentricity, are taken into consideration; the best-ranked vertices are viewed as keywords. [13] proposed an unsupervised, domain-independent, and

corpus-independent approach for automatic keyword extraction. The algorithm combines the information contained in the frequency and the spatial distribution of a term. Terms in the middle range of frequency are important, so low-frequency terms are removed first. In specific contexts or portions of text, a keyword appears frequently to some extent; for this reason, a keyword's distribution pattern may indicate some level of clustering, and the terms with the highest standard deviation are chosen as keywords. In [10], an unsupervised statistical approach called SwiftRank is used to extract keywords and salient sentences. Core sentences and keywords often appear at the beginning of the text and in the last paragraph. The method ranks the sentences of a document by scoring its text as a set of sentences according to their distinct features. After removing stop words, the remaining terms are viewed as candidate terms; the score of every term is computed from its corresponding distributions, and the terms with high scores are extracted as keywords. Compared to existing approaches, this paper solves these problems with a constructed graph: based on a bipartite graph, the transition probabilities from documents to terms are computed. The bigger the probability from a document to a term, the more important the term is for the document, and it is selected as a feature term of that document.

3. Model

3.1 Constructing the relational matrix

The chosen dataset contains information about many documents in the field of computer science (the information may comprise titles, abstracts, authors, paper identifiers, and so on). The useful parts, the abstracts and identifiers, are extracted. First, the abstracts and identifiers are used to build a matrix according to the TF-IDF algorithm. Then the TF-IDF value of each term is viewed as the strength of the relationship from a document to the term. Accordingly, a relational matrix is built.
TF-IDF is formalized as:

$$tfidf_{i,j} = tf_{i,j} \cdot idf_i \quad (1)$$

where TF and IDF are formalized respectively as:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \quad (2)$$

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \quad (3)$$

where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$, $|D|$ is the total number of documents, and $|\{j : t_i \in d_j\}|$ is the number of documents containing $t_i$.

3.2 Normalizing the relational matrix

Each value of the relational matrix reflects, to some extent, the importance of a term to a document; put differently, the value mirrors the probability of moving from the document to the term. So a probability matrix can be obtained by normalizing the relational matrix, which is convenient for calculation. For every document, the sum of the TF-IDF values of all its terms is taken as the denominator, and the original TF-IDF value of each term in the document is the numerator; the resulting value is the corresponding element of the probability matrix and represents the probability from the document to the term. In a clear and ordered pattern, a

probability matrix is constructed. The normalization formula is defined as:

$$P(i,j) = \frac{t(i,j)}{\sum_{k=1}^{n} t(i,k)} \quad (4)$$

where $P(i,j)$ stands for the probability from document $i$ to term $j$, $t(i,k)$ stands for the TF-IDF value of document $i$ and term $k$, and $n$ stands for the number of terms that document $i$ contains.

3.3 Constructing and analyzing the bipartite graph

Documents and terms have a containing/contained relationship; they are objects of different kinds. A bipartite graph can place two classes of objects with different attributes on the same graph, so it is natural to build a bipartite graph containing documents and terms. In this graph, each document is a vertex, and each term is a vertex of the other class. The values of the probability matrix are used as the edge weights between document vertices and term vertices in both directions; in other words, the bipartite graph is undirected. The graph can then be used to obtain useful information for extracting feature terms. Figure 1 shows an example of the connections between documents and terms, where $D_i$ represents document $i$ and $T_j$ represents term $j$. In this example, there are clearly many paths from a document to a term, including the direct path and indirect paths (at least one path exists); for example, there is more than one path from D1 to T2 (D1-T2, D1-T1-D4-T2, and so on). To a certain extent, the probabilities from a document to a term along these paths, both direct and indirect, can be calculated, and their sum is viewed as the new probability from the document to the term. The bigger this probability, the closer the relationship between the document and the term.
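The row normalization of Eq. (4) can be sketched as follows (the TF-IDF values in the matrix are illustrative only, not taken from the dataset):

```python
# Row-normalizing a relational matrix of TF-IDF values into transition
# probabilities, as in Eq. (4). Matrix values are illustrative only.
relational = [
    [0.20, 0.00, 0.10],   # document 0 -> terms 0..2
    [0.00, 0.30, 0.30],   # document 1 -> terms 0..2
]

def normalize_rows(matrix):
    probs = []
    for row in matrix:
        total = sum(row)                       # denominator of Eq. (4)
        probs.append([v / total for v in row] if total else row[:])
    return probs

P = normalize_rows(relational)
# Every row of P sums to 1, so P[i][j] reads as the probability of
# stepping from document i to term j.
```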
The new probabilities of the terms of each document are then sorted, and the terms with high probabilities are selected as the document's feature terms. Through this structure, feature terms that represent a document's primary content can be found; moreover, a found feature term might not even be contained in the given document, yet it still reflects a feature of the document.

3.4 Modeling

Based on the analysis of the graph, a new model for calculating the probability values is proposed for finding feature terms in the bipartite graph. Obviously, starting from a document, only walks of an odd number of steps can reach a term, while walks of an even number of steps reach another document. Since the goal is to find feature terms, odd step counts are chosen. The iterative formula is formalized as:

$$P_{t+1}(i,j) = \sum_k P_t(i,k) \cdot P(k,j) \quad (5)$$

If $t+1 = 2m$, then $i$ stands for a document, $j$ stands for a document, and $k$ stands for a term; if $t+1 = 2m+1$, then $i$ stands for a document, $j$ stands for a term, and $k$ stands for a document.
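The alternation of document and term steps in Eq. (5) can be sketched with plain matrix products. The transition values below are invented for illustration; $P$ walks from documents to terms and its transpose walks back:

```python
# One application of the iterative formula in Eq. (5):
# P_{t+1}(i, j) = sum_k P_t(i, k) * P(k, j).
# Toy 2-document x 3-term transition matrix (illustrative values).
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]

def matmul(A, B):
    # Plain matrix product, implementing the sum over k in Eq. (5).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Two steps (even): document -> term -> document.
doc_to_doc = matmul(P, transpose(P))
# Three steps (odd): document -> term -> document -> term.
doc_to_term_3 = matmul(doc_to_doc, P)
```

Only the odd-step products end in the term columns, which is why the model restricts itself to odd step lengths.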

3.5 Model analysis

It is obvious that a huge tree would be built after several iterations, and much storage space would be needed to keep the information of every node; as the number of iterations increases, more and more time is needed to calculate new values. Fortunately, matrix multiplication can solve this problem: it can calculate the probability [20] from a document to a term, including the direct path and the indirect paths, once the step length is given. In the process of iteration the probability should decay, and $\lambda$ is used as the attenuation coefficient. $P_n(i,j)$ is calculated as:

$$P_n(i,j) = \lambda^{n-1} \prod P(i,j), \quad \lambda = 0.8 \quad (6)$$

where the product is taken over the transition probabilities along each $n$-step path. If the probability matrix were square, it would be easy to obtain the probability from a document to a term in $n$ steps by matrix powers. In most cases, however, the chosen documents and the terms they contain do not form a square matrix, so the transpose of the probability matrix is used for convenience of calculation. In this case, $P_n$ is calculated as:

$$P_n = \underbrace{P \cdot P^{T} \cdot P \cdots P}_{n \text{ factors}} \quad (7)$$

So the probability from a document to a term can be calculated for any given step length. $sum(i,j)$ is used as the final probability from a document to a term, calculated as:

$$sum(i,j) = \sum_k P_{2k+1}(i,j) \quad (8)$$

where $i$ stands for document $i$ and $j$ stands for term $j$.

4. Experimental results

4.1 Setup

The experiments were run on an Intel(R) Core(TM) i5-4210 at 2.60 GHz with 4.0 GB of memory. The development environment is Eclipse with the JDK and JRE. The test data is extracted from DBLP; each record in the dataset contains many fields, including the author, the title, the abstract, and so on, so it is easy to obtain the fields that are needed.
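Before turning to the evaluation, the model of Section 3.5, the decayed odd-step accumulation of Eqs. (6)-(8), can be sketched end to end. The matrix values are invented for illustration; $\lambda = 0.8$ and maximum step length 3 follow the paper:

```python
# Sketch of Eqs. (6)-(8): accumulate document-to-term probabilities over
# odd step lengths 1, 3, 5, ..., attenuating each n-step matrix by
# lambda**(n - 1). Matrix values are illustrative, not from the paper.
LAMBDA = 0.8          # attenuation coefficient from Eq. (6)
MAX_STEP = 3          # step length found best in Section 4.2

P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def final_scores(P, max_step, lam):
    # Eq. (7): extend the doc-to-term matrix two steps at a time;
    # Eq. (8): sum the attenuated odd-step matrices.
    step_matrix = [row[:] for row in P]            # n = 1, no attenuation
    total = [row[:] for row in step_matrix]
    n = 1
    while n + 2 <= max_step:
        # Extend by two steps: term -> document -> term.
        step_matrix = matmul(matmul(step_matrix, transpose(P)), P)
        n += 2
        decay = lam ** (n - 1)                     # Eq. (6)
        for i, row in enumerate(step_matrix):
            for j, v in enumerate(row):
                total[i][j] += decay * v
    return total

sum_matrix = final_scores(P, MAX_STEP, LAMBDA)
# Terms are ranked per document by sum_matrix[i][j]; the top-ranked
# terms become document i's feature terms.
```

Note that document 0 receives a nonzero score for term 2 even though it has no direct link to it: this is exactly the latent feature term effect the model targets.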
The evaluation focuses on whether the feature terms, including potential (latent) feature terms, can be found. As discussed in Section 3, the chosen step length affects the final result. In information retrieval, recall and precision are commonly used to evaluate whether experimental results conform to expectations; they are viewed as important indexes reflecting retrieval effectiveness. Recall is the amount of relevant information detected divided by the total relevant information in the retrieval system, and precision is the amount of relevant information detected divided by the total amount of information detected. In this paper, the number of feature terms detected divided by the total number of feature terms is used to calculate the recall R, and the number of feature terms detected divided by the total number of terms detected to compute the precision P, for the given

document. Recall and precision should be considered together, because they reflect two different aspects of the quality of the experimental results; in the comparisons below, both are reported.

4.2 On varying step length

There may be several paths from a document to a term. A step length is set, and the indirect paths whose lengths are less than or equal to the given step length are chosen; the summation of the probability values that meet this requirement is calculated as the final probability from the document to the term. Following the analysis in Section 3, only an odd number can be chosen as the step length, since an even-length walk only leads from a document to another document: if the step length is set to an even number k, the result obtained is in fact based on step length k-1, while an odd step length yields a result consistent with that step length. The experiment is carried out with different step lengths. Some results were chosen from the experimental results to calculate precision and recall; for convenience, the averages of the precisions and recalls are taken as the final precision and recall in the charts below. From Figure 2 and Figure 3, it is obvious that as the step length increases, both the precision and the recall of the probability algorithm are superior to those of the classical TF-IDF algorithm. When the step length is set to 1, the precision and recall of the classical algorithm equal those of the probability algorithm. This is because every TF-IDF value is viewed as the direct probability from the document to the term, not including any indirect paths; the TF-IDF values are the startup probabilities of the probability algorithm, so the probability values at step length 1 are identical to the TF-IDF values.
It is therefore inevitable that the precisions and recalls equal those of the TF-IDF algorithm when the step length is set to 1. From step length 1 to 3, precision grows markedly; after that, neither precision nor recall grows with further increases of the step length. The reason is probability attenuation in the calculation: the bigger the step length, the more pronounced the attenuation. The attenuation coefficient is set to 0.8 in this paper. When the probability from a document to a term is calculated with step length m, the direct probability is attenuated (m-1) times. If m is very big, most contributions are too small to affect the final value from a document to a term; if m is small, each contribution plays an important part in the final value. So the trend of the curve is as expected. From the above analysis, it is easy to see that the probability algorithm performs well with step length 3. If the given step length is bigger than 3, the result is roughly consistent with that at step length 3, while more computation time is needed. So step length 3 is more suitable than other values.
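The precision and recall used in this evaluation can be computed as follows. This is a sketch: the gold feature terms and the extracted terms below are invented for illustration.

```python
# Precision and recall as defined in Section 4.1: precision is the
# number of feature terms correctly detected over all terms detected,
# and recall is that number over all gold feature terms.
def precision_recall(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)           # feature terms correctly found
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall(["graph", "keyword", "ranking"],
                        ["graph", "keyword", "clustering", "extraction"])
# hits = 2, so precision = 2/3 and recall = 2/4
```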

4.3 On varying k

When evaluating the experimental results, the number of chosen terms affects the final evaluation: the more terms are chosen, the more feature terms they may contain. In fact, only a few feature terms are required for every document. In this part, the top-k terms are considered, where k stands for the number of chosen terms, and the results are evaluated by varying k; Figure 4 and Figure 5 were drawn accordingly. From Figure 4 and Figure 5, the precisions and recalls of the two algorithms are roughly consistent. The probability algorithm is based on the TF-IDF algorithm, and its results combine the TF-IDF values with the structure of the constructed bipartite graph; to some extent, the results of the probability algorithm are another arrangement of the results of the TF-IDF algorithm. When the top two terms are chosen, the results of the probability algorithm are as good as those of the TF-IDF algorithm. As k increases, the two orderings of terms diverge: when k is set to 6, the top 6 terms of the probability algorithm include feature word(s) that the top 6 terms of the TF-IDF algorithm do not, so the P-pro and R-pro lines reach their breaking points earlier than the corresponding TF-IDF lines. With a further increase of k, the results of both algorithms include all the feature terms, and the precisions and recalls of the two algorithms become identical; so the two curves meet again in Figure 4, and likewise in Figure 5. Overall, the probability algorithm performs better than the TF-IDF algorithm as k varies.

5. Conclusion and future work

5.1 Conclusion

This paper finds feature terms from a constructed bipartite graph, with the TF-IDF values of every term taken as the startup probabilities.
The constructed bipartite graph is influenced by the contents and the number of the chosen documents. In the bipartite graph, there may be many paths from a document to a term; a step length is set to choose the paths, both direct and indirect, used to calculate the probability from a document to a term. The bigger the probability value, the more important the term is to the document, and terms with big probability values are viewed as feature terms. Experiments are carried out with the TF-IDF algorithm and with the extended TF-IDF algorithm, the probability algorithm. The experimental results show that the probability algorithm is superior to the classical TF-IDF algorithm, and that it performs better with step length 3 and with an increasing number of chosen articles. However, more time and space are needed as the step length increases.

5.2 Future work

In future work, a more suitable threshold should be considered to filter small probability values when the step length is big. The number of iterations grows with the step length, and more time is required to compute the transition probabilities from documents to terms; in the process of iteration, many vertices of the constructed graph make little contribution to the final results as the step length increases. The existence of these vertices not only wastes storage space but also increases computation time. An appropriate threshold will be sought to further reduce storage space and shorten computation time in future research. Accordingly, a reasonable structure will be designed to reduce the memory and time overhead: the computation of transition probabilities generally consumes the most memory, and a reasonable storage structure can reduce storage space and computing time by adopting existing optimized algorithms for network data [20, 21]. The probability algorithm will also be applied to other real datasets in real applications, including literature search and web search.

References

[1] Bougouin, A., Boudin, F., & Daille, B. (2013). TopicRank: Graph-based topic ranking for keyphrase extraction. International Joint Conference on Natural Language Processing.

[2] Cai, X., & Cao, S. (2017). A keyword extraction method based on learning to rank. International Conference on Semantics, Knowledge and Grids, 13.

[3] Chen, B., Wu, Z., Cai, X., & Xu, X. (2016). New method for the segmentation of polarizing microscope image of rock core based on k-means cluster algorithm. Journal of University of Shanghai for Science & Technology, 38(4). (in Chinese)

[4] El-Beltagy, S. R., & Rafea, A. (2009). KP-Miner: A keyphrase extraction system for English and Arabic documents. Information Systems, 34(1).

[5] Guo, A., & Yang, T. (2016). Research and improvement of feature words weight based on TFIDF algorithm.
Information Technology, Networking, Electronic and Automation Control Conference.

[6] Hulth, A. (2003). Improved automatic keyword extraction given more linguistic knowledge. Proc. of Conference on Empirical Methods in Natural Language Processing.

[7] Jiang, Z. X., & Ding, Y. W. (2005). Network text classification based on k-nearest neighbor method. Journal of University of Shanghai for Science & Technology, 27(1). (in Chinese)

[8] Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1).

[9] Liang, Y. (2017). Chinese keyword extraction based on weighted complex network. International Conference on Intelligent Systems and Knowledge Engineering, 12.

[10] Lynn, H. M., Lee, E., Chang, C., & Kim, P. (2017). SwiftRank: An unsupervised statistical approach of keyword and salient sentence extraction for individual documents. Procedia Computer Science, 113.

[11] Pay, T., & Lucci, S. (2017). Automatic keyword extraction: An ensemble method. IEEE Big Data.

[12] Salton, G., & Yu, C. T. (1973). On the construction of effective vocabularies for information retrieval. ACM Sigplan Notices, 9(3).

[13] Sharan, A., Siddiqi, S., & Singh, J. (2015). Keyword extraction from Hindi documents using statistical approach. Intelligent Computing, Communication and Devices. Springer India, 309.

[14] Danesh, S., Sumner, T., & Martin, J. H. (2015). SGRank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction. Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics.

[15] Turney, P. D. (2002). Learning to extract keyphrases from text. NRC Technical Report ERB-1057. Canada: National Research Council.

[16] Abilhoa, W. D., & de Castro, L. N. (2014). A keyword extraction method from Twitter messages represented as graphs. Applied Mathematics and Computation, 240.

[17] Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., & Nevill-Manning, C. G. (1999). KEA: Practical automatic keyphrase extraction. Proc. of the 4th ACM Conference on Digital Libraries, Berkeley, California.

[18] Matsuo, Y., & Ishizuka, M. (2004). Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(01).

[19] Zhang, X., An, J., & Liu, W. (2018). Research and implementation of keyword extraction algorithm based on professional background knowledge. International Congress on Image and Signal Processing, Biomedical Engineering and Informatics, 10: 1-5.

[20] Zhou, Y., Cheng, H., & Yu, J. X. (2009). Graph clustering based on structural/attribute similarities.
Proceedings of the VLDB Endowment, 2(1).

[21] Zhang, M., Chen, Y., Shen, Y., & Ma, J. (2016). Comparative study on the performances of ANN and SVM and their application in the identification of DMD disease. Journal of University of Shanghai for Science & Technology, 4. (in Chinese)

Figure 1. An example of the connection between documents and terms (documents D1-D4 and terms T1-T4).

Figure 2. Precision on varying step length (P-TFIDF vs. P-Probability).

Figure 3. Recall on varying step length (R-TFIDF vs. R-Probability).

Figure 4. Precision on varying k (P-TFIDF vs. P-Pro).

Figure 5. Recall on varying k (R-TFIDF vs. R-Pro).


More information

Ranking Web Pages by Associating Keywords with Locations

Ranking Web Pages by Associating Keywords with Locations Ranking Web Pages by Associating Keywords with Locations Peiquan Jin, Xiaoxiang Zhang, Qingqing Zhang, Sheng Lin, and Lihua Yue University of Science and Technology of China, 230027, Hefei, China jpq@ustc.edu.cn

More information

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier Wang Ding, Songnian Yu, Shanqing Yu, Wei Wei, and Qianfeng Wang School of Computer Engineering and Science, Shanghai University, 200072

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Automatic Summarization

Automatic Summarization Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization

More information

An Automatic Reply to Customers Queries Model with Chinese Text Mining Approach

An Automatic Reply to Customers  Queries Model with Chinese Text Mining Approach Proceedings of the 6th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 15-17, 2007 71 An Automatic Reply to Customers E-mail Queries Model with Chinese Text Mining Approach

More information

Using Gini-index for Feature Weighting in Text Categorization

Using Gini-index for Feature Weighting in Text Categorization Journal of Computational Information Systems 9: 14 (2013) 5819 5826 Available at http://www.jofcis.com Using Gini-index for Feature Weighting in Text Categorization Weidong ZHU 1,, Yongmin LIN 2 1 School

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Extractive Text Summarization Techniques

Extractive Text Summarization Techniques Extractive Text Summarization Techniques Tobias Elßner Hauptseminar NLP Tools 06.02.2018 Tobias Elßner Extractive Text Summarization Overview Rough classification (Gupta and Lehal (2010)): Supervised vs.

More information

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM Myomyo Thannaing 1, Ayenandar Hlaing 2 1,2 University of Technology (Yadanarpon Cyber City), near Pyin Oo Lwin, Myanmar ABSTRACT

More information

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM

GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM http:// GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM Akshay Kumar 1, Vibhor Harit 2, Balwant Singh 3, Manzoor Husain Dar 4 1 M.Tech (CSE), Kurukshetra University, Kurukshetra,

More information

A Semantic Model for Concept Based Clustering

A Semantic Model for Concept Based Clustering A Semantic Model for Concept Based Clustering S.Saranya 1, S.Logeswari 2 PG Scholar, Dept. of CSE, Bannari Amman Institute of Technology, Sathyamangalam, Tamilnadu, India 1 Associate Professor, Dept. of

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bonfring International Journal of Data Mining, Vol. 7, No. 2, May 2017 11 News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bamber and Micah Jason Abstract---

More information

A Data Classification Algorithm of Internet of Things Based on Neural Network

A Data Classification Algorithm of Internet of Things Based on Neural Network A Data Classification Algorithm of Internet of Things Based on Neural Network https://doi.org/10.3991/ijoe.v13i09.7587 Zhenjun Li Hunan Radio and TV University, Hunan, China 278060389@qq.com Abstract To

More information

A Feature Selection Method to Handle Imbalanced Data in Text Classification

A Feature Selection Method to Handle Imbalanced Data in Text Classification A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University

More information

Video annotation based on adaptive annular spatial partition scheme

Video annotation based on adaptive annular spatial partition scheme Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

What is this Song About?: Identification of Keywords in Bollywood Lyrics

What is this Song About?: Identification of Keywords in Bollywood Lyrics What is this Song About?: Identification of Keywords in Bollywood Lyrics by Drushti Apoorva G, Kritik Mathur, Priyansh Agrawal, Radhika Mamidi in 19th International Conference on Computational Linguistics

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

Research on Design and Application of Computer Database Quality Evaluation Model

Research on Design and Application of Computer Database Quality Evaluation Model Research on Design and Application of Computer Database Quality Evaluation Model Abstract Hong Li, Hui Ge Shihezi Radio and TV University, Shihezi 832000, China Computer data quality evaluation is the

More information

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection Based on Locality Preserving Projection 2 Information & Technology College, Hebei University of Economics & Business, 05006 Shijiazhuang, China E-mail: 92475577@qq.com Xiaoqing Weng Information & Technology

More information

Mining Landmark Papers

Mining Landmark Papers Mining Papers by Annu Tuli, Vikram Pudi in SIGAI Workshop on Emerging Research Trends in AI (ERTAI) Mumbai, India Report : IIIT/TR/2010/32 Centre for Data Engineering International Institute of Information

More information

Multi-Stage Rocchio Classification for Large-scale Multilabeled

Multi-Stage Rocchio Classification for Large-scale Multilabeled Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News Selecting Model in Automatic Text Categorization of Chinese Industrial 1) HUEY-MING LEE 1 ), PIN-JEN CHEN 1 ), TSUNG-YEN LEE 2) Department of Information Management, Chinese Culture University 55, Hwa-Kung

More information

Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM

Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM Lu Chen and Yuan Hang PERFORMANCE DEGRADATION ASSESSMENT AND FAULT DIAGNOSIS OF BEARING BASED ON EMD AND PCA-SOM.

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

Text Clustering Incremental Algorithm in Sensitive Topic Detection

Text Clustering Incremental Algorithm in Sensitive Topic Detection International Journal of Information and Communication Sciences 2018; 3(3): 88-95 http://www.sciencepublishinggroup.com/j/ijics doi: 10.11648/j.ijics.20180303.12 ISSN: 2575-1700 (Print); ISSN: 2575-1719

More information

Documents 1. INTRODUCTION. IJCTA, 9(10), 2016, pp International Science Press. *MG Thushara **Nikhil Dominic

Documents 1. INTRODUCTION. IJCTA, 9(10), 2016, pp International Science Press. *MG Thushara **Nikhil Dominic A Template Based Checking and Automated Tagging Algorithm for Project Documents IJCTA, 9(10), 2016, pp. 4537-4544 International Science Press 4537 A Templa emplate Based Checking and Auto- mated Tagging

More information

Extraction of Web Image Information: Semantic or Visual Cues?

Extraction of Web Image Information: Semantic or Visual Cues? Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Research on Evaluation Method of Product Style Semantics Based on Neural Network

Research on Evaluation Method of Product Style Semantics Based on Neural Network Research Journal of Applied Sciences, Engineering and Technology 6(23): 4330-4335, 2013 ISSN: 2040-7459; e-issn: 2040-7467 Maxwell Scientific Organization, 2013 Submitted: September 28, 2012 Accepted:

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

Knowledge Engineering in Search Engines

Knowledge Engineering in Search Engines San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2012 Knowledge Engineering in Search Engines Yun-Chieh Lin Follow this and additional works at:

More information

Automatic Keyphrase Extractor from Arabic Documents

Automatic Keyphrase Extractor from Arabic Documents Automatic Keyphrase Extractor from Arabic Documents Hassan M. Najadat Department of information Systems Jordan University of and Technology Irbid, Jordan Ismail I. Hmeidi Department of information Systems

More information

A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval

A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval Information and Management Sciences Volume 18, Number 4, pp. 299-315, 2007 A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval Liang-Yu Chen National Taiwan University

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 41 CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 3.1 INTRODUCTION This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms have

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

KeaKAT An Online Automatic Keyphrase Assignment Tool

KeaKAT An Online Automatic Keyphrase Assignment Tool 2012 10th International Conference on Frontiers of Information Technology KeaKAT An Online Automatic Keyphrase Assignment Tool Rabia Irfan, Sharifullah Khan, Irfan Ali Khan, Muhammad Asif Ali School of

More information

An Improved KNN Classification Algorithm based on Sampling

An Improved KNN Classification Algorithm based on Sampling International Conference on Advances in Materials, Machinery, Electrical Engineering (AMMEE 017) An Improved KNN Classification Algorithm based on Sampling Zhiwei Cheng1, a, Caisen Chen1, b, Xuehuan Qiu1,

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques

An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques Doaa M. Alebiary Department of computer Science, Faculty of computers and informatics Benha University

More information

FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP

FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP Dynamics of Continuous, Discrete and Impulsive Systems Series B: Applications & Algorithms 14 (2007) 103-111 Copyright c 2007 Watam Press FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP

More information

REFERENCE ALGORITHM OF TEXT CATEGORIZATION BASED ON FUZZY COGNITIVE MAPS

REFERENCE ALGORITHM OF TEXT CATEGORIZATION BASED ON FUZZY COGNITIVE MAPS REFERENCE ALGORITHM OF TEXT CATEGORIZATION BASED ON FUZZY COGNITIVE MAPS ZHANG Guiyun,LIU Yang, ZHANG Weijuan,WANG Yuanyuan Computer and Information Engineering College, Tianjin Normal University, Tianjin

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Seismic regionalization based on an artificial neural network

Seismic regionalization based on an artificial neural network Seismic regionalization based on an artificial neural network *Jaime García-Pérez 1) and René Riaño 2) 1), 2) Instituto de Ingeniería, UNAM, CU, Coyoacán, México D.F., 014510, Mexico 1) jgap@pumas.ii.unam.mx

More information

Text Similarity Based on Semantic Analysis

Text Similarity Based on Semantic Analysis Advances in Intelligent Systems Research volume 133 2nd International Conference on Artificial Intelligence and Industrial Engineering (AIIE2016) Text Similarity Based on Semantic Analysis Junli Wang Qing

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Linking Entities in Chinese Queries to Knowledge Graph

Linking Entities in Chinese Queries to Knowledge Graph Linking Entities in Chinese Queries to Knowledge Graph Jun Li 1, Jinxian Pan 2, Chen Ye 1, Yong Huang 1, Danlu Wen 1, and Zhichun Wang 1(B) 1 Beijing Normal University, Beijing, China zcwang@bnu.edu.cn

More information

Using Self-Organizing Maps for Sentiment Analysis. Keywords Sentiment Analysis, Self-Organizing Map, Machine Learning, Text Mining.

Using Self-Organizing Maps for Sentiment Analysis. Keywords Sentiment Analysis, Self-Organizing Map, Machine Learning, Text Mining. Using Self-Organizing Maps for Sentiment Analysis Anuj Sharma Indian Institute of Management Indore 453331, INDIA Email: f09anujs@iimidr.ac.in Shubhamoy Dey Indian Institute of Management Indore 453331,

More information

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings. Online Review Spam Detection by New Linguistic Features Amir Karam, University of Maryland Baltimore County Bin Zhou, University of Maryland Baltimore County Karami, A., Zhou, B. (2015). Online Review

More information

Mining High Average-Utility Itemsets

Mining High Average-Utility Itemsets Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Mining High Itemsets Tzung-Pei Hong Dept of Computer Science and Information Engineering

More information

Research Article Apriori Association Rule Algorithms using VMware Environment

Research Article Apriori Association Rule Algorithms using VMware Environment Research Journal of Applied Sciences, Engineering and Technology 8(2): 16-166, 214 DOI:1.1926/rjaset.8.955 ISSN: 24-7459; e-issn: 24-7467 214 Maxwell Scientific Publication Corp. Submitted: January 2,

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Determine the Entity Number in Hierarchical Clustering for Web Personal Name Disambiguation

Determine the Entity Number in Hierarchical Clustering for Web Personal Name Disambiguation Determine the Entity Number in Hierarchical Clustering for Web Personal Name Disambiguation Jun Gong Department of Information System Beihang University No.37 XueYuan Road HaiDian District, Beijing, China

More information

Classification. 1 o Semestre 2007/2008

Classification. 1 o Semestre 2007/2008 Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class

More information

An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

An Empirical Performance Comparison of Machine Learning Methods for Spam  Categorization An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization Chih-Chin Lai a Ming-Chi Tsai b a Dept. of Computer Science and Information Engineering National University

More information

A Bayesian Approach to Hybrid Image Retrieval

A Bayesian Approach to Hybrid Image Retrieval A Bayesian Approach to Hybrid Image Retrieval Pradhee Tandon and C. V. Jawahar Center for Visual Information Technology International Institute of Information Technology Hyderabad - 500032, INDIA {pradhee@research.,jawahar@}iiit.ac.in

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Study on A Recommendation Algorithm of Crossing Ranking in E- commerce

Study on A Recommendation Algorithm of Crossing Ranking in E- commerce International Journal of u-and e-service, Science and Technology, pp.53-62 http://dx.doi.org/10.14257/ijunnesst2014.7.4.6 Study on A Recommendation Algorithm of Crossing Ranking in E- commerce Duan Xueying

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky The Chinese University of Hong Kong Abstract Husky is a distributed computing system, achieving outstanding

More information

Text Classification based on Limited Bibliographic Metadata

Text Classification based on Limited Bibliographic Metadata Text Classification based on Limited Bibliographic Metadata Kerstin Denecke, Thomas Risse L3S Research Center Hannover, Germany denecke@l3s.de Thomas Baehr German National Library of Science and Technology

More information

Template Extraction from Heterogeneous Web Pages

Template Extraction from Heterogeneous Web Pages Template Extraction from Heterogeneous Web Pages 1 Mrs. Harshal H. Kulkarni, 2 Mrs. Manasi k. Kulkarni Asst. Professor, Pune University, (PESMCOE, Pune), Pune, India Abstract: Templates are used by many

More information

Multimodal Medical Image Retrieval based on Latent Topic Modeling

Multimodal Medical Image Retrieval based on Latent Topic Modeling Multimodal Medical Image Retrieval based on Latent Topic Modeling Mandikal Vikram 15it217.vikram@nitk.edu.in Suhas BS 15it110.suhas@nitk.edu.in Aditya Anantharaman 15it201.aditya.a@nitk.edu.in Sowmya Kamath

More information

VisoLink: A User-Centric Social Relationship Mining

VisoLink: A User-Centric Social Relationship Mining VisoLink: A User-Centric Social Relationship Mining Lisa Fan and Botang Li Department of Computer Science, University of Regina Regina, Saskatchewan S4S 0A2 Canada {fan, li269}@cs.uregina.ca Abstract.

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Semi supervised clustering for Text Clustering
