An Extension of TF-IDF Model for Extracting Feature Terms


Mingxi Zhang*, Hangfei Hu, Guanying Su, Yuening Zhang, Xiaohong Wang

College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China

*Corresponding author

Acknowledgements: This work was supported by the Natural Science Foundation of Shanghai under grant 16ZR, and by the Training Project of University of Shanghai for Science and Technology under grant 16HJPY-QN04.

Abstract: Extracting feature terms helps users obtain useful information correctly and quickly, which is important in many real applications such as web search, document clustering, and similarity computation. Although the TF-IDF model can be used to generate feature terms, latent terms that are not directly contained in the current document may be neglected. In this paper, an approach for finding such latent feature terms is proposed by extending the TF-IDF model. First, a bipartite graph is constructed to represent the relationship between documents and terms, and an iterative formula is given for finding feature terms that are not directly contained in the current document. Based on the bipartite graph, the transition probabilities from documents to terms are computed, and the feature terms with relatively high transition probabilities are selected. Experimental results on a real dataset demonstrate the effectiveness of the proposed approach in comparison with the TF-IDF model.

Keywords: TF-IDF; transition probability; feature terms

1. Introduction

A document contains many terms, but most of them contribute little to the document; only a few play an important role. These representative terms are viewed as feature terms. The feature terms extracted from a document serve as its signature: they represent the subject of the whole text for information retrieval [19], and a computer can quickly locate the main information when querying by feature terms. Moreover, the number of documents grows ever larger, so more time is needed to process user queries. Feature terms can shorten query time and improve query accuracy, and users can obtain a general impression of a document from its feature terms and decide in a short time whether the document is useful. Feature terms reflect both the main idea of a document and its field, and they help users navigate to the target document [2]. Thus feature terms are not only representative of the main content but also specific. Currently, most feature terms are supplied by the authors of papers. However, many documents, especially older papers, are not annotated with feature terms. Assigning feature terms to every document by hand is difficult: the task is onerous, and not all manually chosen feature terms are suitable, owing to human subjectivity. Inappropriate feature terms would have negative effects on subsequent processing. It is therefore necessary to study how to extract feature terms from documents automatically. Feature term extraction is a foundational technique that serves many fields, such as text classification, data mining, indexing, text analysis [11], and clustering analysis [3].
There has been much related work on extracting feature terms. One of the most popular algorithms is TF-IDF. TF represents the occurrence frequency of a term within a document. The main idea of IDF is that if few documents contain a certain term, the term has good discriminative power for classification. The main idea of TF-IDF is therefore that a term or phrase is discriminative and suitable for classification if its frequency within a document is high while it rarely appears in other documents [5]. Zipf, a linguist at Harvard University, found when studying the frequency of English words that, when terms are ranked by frequency from largest to smallest, the frequency of each term is roughly inversely proportional to a constant power of its rank. Based on Zipf's law, TF-IDF is a useful tool to a certain degree. With the development of the Web, new methods based on the TF-IDF algorithm have appeared. Although TF-IDF is a classical algorithm that has proved very efficient, the traditional TF-IDF algorithm also has disadvantages. This paper dissects the original TF-IDF algorithm, analyzes existing algorithms based on it, and finally extends TF-IDF to extract feature terms. A new approach based on graph topology is presented for extracting feature terms: the weights of terms are calculated not only from how often they appear in a document, but also from the topological structure.
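As a concrete illustration, the classical TF-IDF weighting that the rest of the paper builds on can be sketched as follows. This is a minimal sketch with invented toy documents (the paper's own experiments use DBLP abstracts, which are not reproduced here):

```python
import math

# Toy corpus: each document is a bag of terms (invented example data;
# the paper's experiments use abstracts extracted from DBLP).
docs = [
    ["graph", "keyword", "extraction", "graph"],
    ["keyword", "ranking"],
    ["graph", "clustering", "ranking"],
]

def tf(term, doc):
    # Count of the term in the document, over the document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of the number of documents over the number of documents
    # that contain the term.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

vocab = sorted({t for d in docs for t in d})
# The relational matrix of TF-IDF weights, one row per document.
relational = [[tf(t, d) * idf(t, docs) for t in vocab] for d in docs]
```

A term occurring in every document gets an IDF of zero and thus a zero weight, which is exactly the discriminative-power intuition described above.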

2. Related work

In recent years, work on feature term extraction has received a great deal of interest. An important aspect of most of this work is identifying appropriate weighting techniques to assign weights to the terms in a text corpus; the feature terms are then identified based on the assigned weights. Turney [15] regarded a document as a set of phrases and used a genetic algorithm and the C4.5 decision tree induction algorithm to extract feature phrases. The KEA system obtained feature values using naive Bayesian techniques [17]. A rule induction algorithm [6] showed that adding linguistic knowledge to text could improve extraction accuracy. The KP-Miner system can extract feature terms from English and Arabic documents of any length [4] without any training. IDF was first proposed by Jones [8] in 1972, with the main idea that if a certain feature is concentrated in a small number of documents, it carries higher entropy and its weight should be correspondingly higher. TF-IDF was proposed by Salton [12] based on the IDF algorithm: TF is the frequency of a term in a document, and IDF is the inverse document frequency. TF-IDF is a classical algorithm for finding feature terms or stop words. Although it is efficient in most cases, it also has shortcomings, and most papers on its shortcomings begin with classification. The TF-IDF model is limited precisely because of its universal applicability. In different fields, some terms should be viewed as stop words rather than feature terms, and some terms may have different meanings in a specific field. Within the same document, two or more terms may share the same meaning, and they should all be regarded as stop words rather than feature terms. To avoid these and similar problems, many researchers have worked to improve the basic model.
In [9], a keyword extraction algorithm named IWCN based on TF-IDF was proposed, which takes both TF-IDF and semantic weights into consideration. [14] presented a hybrid statistical-graphical algorithm to extract keyphrases. The author of [7] extracted keywords to classify network texts based on the k-nearest neighbor method. In [18], a keyword extraction algorithm that applies to a single document without using a corpus is presented: frequent terms are extracted by counting term frequencies; two terms occurring in the same sentence are considered to co-occur; and if the co-occurrence distribution between a term a and the frequent terms is biased toward a particular subset of the frequent terms, then a is a possible keyword. The chi-squared measure is used to quantify the bias of the co-occurrence distribution. In [1], TopicRank, a graph-based keyphrase extraction method, relies on a topical representation of the document: candidate keyphrases are clustered into topics, the topics are used as vertices in a complete graph, a graph-based ranking model assigns a significance score to each topic, and keyphrases are extracted according to the scores. [16] proposed a keyword extraction algorithm named TKG (Twitter Keyword Graph): stop words are removed in a preprocessing step, a textual graph is built from the co-occurrence relationships between tokens, and centrality measures, including degree centrality, closeness centrality, and eccentricity, are taken into consideration; the best-ranked vertices are viewed as keywords. [13] proposed an unsupervised, domain-independent, and

corpus-independent approach for automatic keyword extraction. The algorithm combines the information contained in the frequency and the spatial distribution of a term. Terms in the middle range of frequency are important, so low-frequency terms are removed first. In specific contexts or portions of text, a keyword appears frequently to some extent; for this reason, a keyword's distribution pattern may indicate some level of clustering, and the terms with the highest standard deviation are chosen as keywords. In [10], an unsupervised statistical approach called SwiftRank is used to extract keywords and salient sentences. Core sentences and keywords often appear at the beginning of the text and in the last paragraph. The method ranks the sentences of a document by scoring its text as a set of sentences according to their distinct features. After removing stop words, the remaining terms are viewed as candidate terms; the score of every term is computed from its corresponding distributions, and the terms with high scores are extracted as keywords. Compared to existing approaches, this paper solves these problems with a constructed graph: based on a bipartite graph, the transition probabilities from documents to terms are computed. The bigger the probability from a document to a term, the more important the term is for the document, and it is selected as a feature term of that document.

3. Model

3.1 Constructing the relational matrix

The chosen dataset contains information about many documents in the field of computer science (the information may comprise titles, abstracts, authors, paper identifiers, and so on). The useful parts, the abstracts and identifiers, are extracted. First, the abstracts and identifiers are used to build a matrix according to the TF-IDF algorithm. Then the TF-IDF value of each term is viewed as the strength of the relationship from a document to the term. Accordingly, a relational matrix is built.
TF-IDF is formalized as:

$$tfidf_{i,j} = tf_{i,j} \cdot idf_i \quad (1)$$

where TF and IDF are formalized respectively as:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \quad (2)$$

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|} \quad (3)$$

where $n_{i,j}$ is the number of occurrences of term $t_i$ in document $d_j$, $|D|$ is the total number of documents, and $|\{j : t_i \in d_j\}|$ is the number of documents containing $t_i$.

3.2 Normalizing the relational matrix

Each value of the relational matrix reflects, to some extent, the importance of a term to a document; put differently, the value mirrors the probability of moving from the document to the term. So a probability matrix can be obtained by normalizing the relational matrix, which is convenient for calculation. For every document, the sum of the TF-IDF values of all its terms is taken as the denominator, and the original TF-IDF value of each term in the document is the numerator; the resulting value is the corresponding element of the probability matrix and represents the probability from the document to the term. In a clear and ordered pattern, a

probability matrix is constructed. The normalization formula is defined as:

$$P(i,j) = \frac{t(i,j)}{\sum_{k=1}^{n} t(i,k)} \quad (4)$$

where $P(i,j)$ stands for the probability from document $i$ to term $j$, $t(i,k)$ stands for the TF-IDF value of document $i$ and term $k$, and $n$ stands for the number of terms that document $i$ contains.

3.3 Constructing and analyzing the bipartite graph

Documents and terms have a containing/contained relationship; they are objects of different kinds. A bipartite graph can place two classes of objects with different attributes on the same graph, so it is natural to build a bipartite graph containing documents and terms. In this graph, each document is a vertex, and each term is a vertex of the other class. The values of the probability matrix are used as the edge weights between document vertices and term vertices in both directions; in other words, the bipartite graph is undirected. The graph can then be used to obtain useful information for extracting feature terms. Figure 1 shows an example of the connections between documents and terms, where $D_i$ represents document $i$ and $T_j$ represents term $j$. In this example, there are clearly many paths from a document to a term, including the direct path and indirect paths (at least one path exists); for example, there is more than one path from D1 to T2 (D1-T2, D1-T1-D4-T2, and so on). To a certain extent, the probabilities from a document to a term along these paths, both direct and indirect, can be calculated, and their sum is viewed as the new probability from the document to the term. The bigger this probability, the closer the relationship between the document and the term.
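The row normalization of Eq. (4) can be sketched as follows (the TF-IDF values in the matrix are illustrative only, not taken from the dataset):

```python
# Row-normalizing a relational matrix of TF-IDF values into transition
# probabilities, as in Eq. (4). Matrix values are illustrative only.
relational = [
    [0.20, 0.00, 0.10],   # document 0 -> terms 0..2
    [0.00, 0.30, 0.30],   # document 1 -> terms 0..2
]

def normalize_rows(matrix):
    probs = []
    for row in matrix:
        total = sum(row)                       # denominator of Eq. (4)
        probs.append([v / total for v in row] if total else row[:])
    return probs

P = normalize_rows(relational)
# Every row of P sums to 1, so P[i][j] reads as the probability of
# stepping from document i to term j.
```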
The new probabilities of the terms of each document are then sorted, and the terms with high probabilities are selected as the document's feature terms. Through this structure, feature terms that represent a document's primary content can be found; moreover, a found feature term might not even be contained in the given document, yet it still reflects a feature of the document.

3.4 Modeling

Based on the analysis of the graph, a new model for calculating the probability values is proposed for finding feature terms in the bipartite graph. Obviously, starting from a document, only walks of an odd number of steps can reach a term, while walks of an even number of steps reach another document. Since the goal is to find feature terms, odd step counts are chosen. The iterative formula is formalized as:

$$P_{t+1}(i,j) = \sum_k P_t(i,k) \cdot P(k,j) \quad (5)$$

If $t+1 = 2m$, then $i$ stands for a document, $j$ stands for a document, and $k$ stands for a term; if $t+1 = 2m+1$, then $i$ stands for a document, $j$ stands for a term, and $k$ stands for a document.
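The alternation of document and term steps in Eq. (5) can be sketched with plain matrix products. The transition values below are invented for illustration; $P$ walks from documents to terms and its transpose walks back:

```python
# One application of the iterative formula in Eq. (5):
# P_{t+1}(i, j) = sum_k P_t(i, k) * P(k, j).
# Toy 2-document x 3-term transition matrix (illustrative values).
P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]

def matmul(A, B):
    # Plain matrix product, implementing the sum over k in Eq. (5).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Two steps (even): document -> term -> document.
doc_to_doc = matmul(P, transpose(P))
# Three steps (odd): document -> term -> document -> term.
doc_to_term_3 = matmul(doc_to_doc, P)
```

Only the odd-step products end in the term columns, which is why the model restricts itself to odd step lengths.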

3.5 Model analysis

It is obvious that a huge tree would be built after several iterations, and much storage space would be needed to keep the information of every node; as the number of iterations increases, more and more time is needed to calculate new values. Fortunately, matrix multiplication can solve this problem: it can calculate the probability [20] from a document to a term, including the direct path and the indirect paths, once the step length is given. In the process of iteration the probability should decay, and $\lambda$ is used as the attenuation coefficient. $P_n(i,j)$ is calculated as:

$$P_n(i,j) = \lambda^{n-1} \prod P(i,j), \quad \lambda = 0.8 \quad (6)$$

where the product is taken over the transition probabilities along each $n$-step path. If the probability matrix were square, it would be easy to obtain the probability from a document to a term in $n$ steps by matrix powers. In most cases, however, the chosen documents and the terms they contain do not form a square matrix, so the transpose of the probability matrix is used for convenience of calculation. In this case, $P_n$ is calculated as:

$$P_n = \underbrace{P \cdot P^{T} \cdot P \cdots P}_{n \text{ factors}} \quad (7)$$

So the probability from a document to a term can be calculated for any given step length. $sum(i,j)$ is used as the final probability from a document to a term, calculated as:

$$sum(i,j) = \sum_k P_{2k+1}(i,j) \quad (8)$$

where $i$ stands for document $i$ and $j$ stands for term $j$.

4. Experimental results

4.1 Setup

The experiments were run on an Intel(R) Core(TM) i5-4210 at 2.60 GHz with 4.0 GB of memory. The development environment is Eclipse with the JDK and JRE. The test data is extracted from DBLP; each record in the dataset contains many fields, including the author, the title, the abstract, and so on, so it is easy to obtain the fields that are needed.
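Before turning to the evaluation, the model of Section 3.5, the decayed odd-step accumulation of Eqs. (6)-(8), can be sketched end to end. The matrix values are invented for illustration; $\lambda = 0.8$ and maximum step length 3 follow the paper:

```python
# Sketch of Eqs. (6)-(8): accumulate document-to-term probabilities over
# odd step lengths 1, 3, 5, ..., attenuating each n-step matrix by
# lambda**(n - 1). Matrix values are illustrative, not from the paper.
LAMBDA = 0.8          # attenuation coefficient from Eq. (6)
MAX_STEP = 3          # step length found best in Section 4.2

P = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def final_scores(P, max_step, lam):
    # Eq. (7): extend the doc-to-term matrix two steps at a time;
    # Eq. (8): sum the attenuated odd-step matrices.
    step_matrix = [row[:] for row in P]            # n = 1, no attenuation
    total = [row[:] for row in step_matrix]
    n = 1
    while n + 2 <= max_step:
        # Extend by two steps: term -> document -> term.
        step_matrix = matmul(matmul(step_matrix, transpose(P)), P)
        n += 2
        decay = lam ** (n - 1)                     # Eq. (6)
        for i, row in enumerate(step_matrix):
            for j, v in enumerate(row):
                total[i][j] += decay * v
    return total

sum_matrix = final_scores(P, MAX_STEP, LAMBDA)
# Terms are ranked per document by sum_matrix[i][j]; the top-ranked
# terms become document i's feature terms.
```

Note that document 0 receives a nonzero score for term 2 even though it has no direct link to it: this is exactly the latent feature term effect the model targets.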
The evaluation focuses on whether the feature terms, including potential (latent) feature terms, can be found. As discussed in Section 3, the chosen step length affects the final result. In information retrieval, recall and precision are commonly used to evaluate whether experimental results conform to expectations; they are viewed as important indexes reflecting retrieval effectiveness. Recall is the amount of relevant information detected divided by the total relevant information in the retrieval system, and precision is the amount of relevant information detected divided by the total amount of information detected. In this paper, the number of feature terms detected divided by the total number of feature terms is used to calculate the recall R, and the number of feature terms detected divided by the total number of terms detected to compute the precision P, for the given

document. Recall and precision should be considered together, because they reflect two different aspects of the quality of the experimental results; in the comparisons below, both are reported.

4.2 On varying step length

There may be several paths from a document to a term. A step length is set, and the indirect paths whose lengths are less than or equal to the given step length are chosen; the summation of the probability values that meet this requirement is calculated as the final probability from the document to the term. Following the analysis in Section 3, only an odd number can be chosen as the step length, since an even-length walk only leads from a document to another document: if the step length is set to an even number k, the result obtained is in fact based on step length k-1, while an odd step length yields a result consistent with that step length. The experiment is carried out with different step lengths. Some results were chosen from the experimental results to calculate precision and recall; for convenience, the averages of the precisions and recalls are taken as the final precision and recall in the charts below. From Figure 2 and Figure 3, it is obvious that as the step length increases, both the precision and the recall of the probability algorithm are superior to those of the classical TF-IDF algorithm. When the step length is set to 1, the precision and recall of the classical algorithm equal those of the probability algorithm. This is because every TF-IDF value is viewed as the direct probability from the document to the term, not including any indirect paths; the TF-IDF values are the startup probabilities of the probability algorithm, so the probability values at step length 1 are identical to the TF-IDF values.
It is therefore inevitable that the precisions and recalls equal those of the TF-IDF algorithm when the step length is set to 1. From step length 1 to 3, precision grows markedly; after that, neither precision nor recall grows with further increases of the step length. The reason is probability attenuation in the calculation: the bigger the step length, the more pronounced the attenuation. The attenuation coefficient is set to 0.8 in this paper. When the probability from a document to a term is calculated with step length m, the direct probability is attenuated (m-1) times. If m is very big, most contributions are too small to affect the final value from a document to a term; if m is small, each contribution plays an important part in the final value. So the trend of the curve is as expected. From the above analysis, it is easy to see that the probability algorithm performs well with step length 3. If the given step length is bigger than 3, the result is roughly consistent with that at step length 3, while more computation time is needed. So step length 3 is more suitable than other values.
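The precision and recall used in this evaluation can be computed as follows. This is a sketch: the gold feature terms and the extracted terms below are invented for illustration.

```python
# Precision and recall as defined in Section 4.1: precision is the
# number of feature terms correctly detected over all terms detected,
# and recall is that number over all gold feature terms.
def precision_recall(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)           # feature terms correctly found
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

p, r = precision_recall(["graph", "keyword", "ranking"],
                        ["graph", "keyword", "clustering", "extraction"])
# hits = 2, so precision = 2/3 and recall = 2/4
```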

4.3 On varying k

When evaluating the experimental results, the number of chosen terms affects the final evaluation: the more terms are chosen, the more feature terms they may contain. In fact, only a few feature terms are required for every document. In this part, the top-k terms are considered, where k stands for the number of chosen terms, and the results are evaluated by varying k; Figure 4 and Figure 5 were drawn accordingly. From Figure 4 and Figure 5, the precisions and recalls of the two algorithms are roughly consistent. The probability algorithm is based on the TF-IDF algorithm, and its results combine the TF-IDF values with the structure of the constructed bipartite graph; to some extent, the results of the probability algorithm are another arrangement of the results of the TF-IDF algorithm. When the top two terms are chosen, the results of the probability algorithm are as good as those of the TF-IDF algorithm. As k increases, the two orderings of terms diverge: when k is set to 6, the top 6 terms of the probability algorithm include feature word(s) that the top 6 terms of the TF-IDF algorithm do not, so the P-pro and R-pro lines reach their breaking points earlier than the corresponding TF-IDF lines. With a further increase of k, the results of both algorithms include all the feature terms, and the precisions and recalls of the two algorithms become identical; so the two curves meet again in Figure 4, and likewise in Figure 5. Overall, the probability algorithm performs better than the TF-IDF algorithm as k varies.

5. Conclusion and future work

5.1 Conclusion

This paper finds feature terms from a constructed bipartite graph, with the TF-IDF values of every term taken as the startup probabilities.
The constructed bipartite graph is influenced by the contents and the number of the chosen documents. In the bipartite graph, there may be many paths from a document to a term; a step length is set to choose the paths, both direct and indirect, used to calculate the probability from a document to a term. The bigger the probability value, the more important the term is to the document, and terms with big probability values are viewed as feature terms. Experiments are carried out with the TF-IDF algorithm and with the extended TF-IDF algorithm, the probability algorithm. The experimental results show that the probability algorithm is superior to the classical TF-IDF algorithm, and that it performs better with step length 3 and with an increasing number of chosen articles. However, more time and space are needed as the step length increases.

5.2 Future work

In future work, a more suitable threshold should be considered to filter small probability values when the step length is big. The number of iterations grows with the step length, and more time is required to compute the transition probabilities from documents to terms; in the process of iteration, many vertices of the constructed graph make little contribution to the final results as the step length increases. The existence of these vertices not only wastes storage space but also increases computation time. An appropriate threshold will be sought to further reduce storage space and shorten computation time in future research. Accordingly, a reasonable structure will be designed to reduce the memory and time overhead: the computation of transition probabilities generally consumes the most memory, and a reasonable storage structure can reduce storage space and computing time by adopting existing optimized algorithms for network data [20, 21]. The probability algorithm will also be applied to other real datasets in real applications, including literature search and web search.

References

[1] Bougouin, A., Boudin, F., & Daille, B. (2013). TopicRank: Graph-based topic ranking for keyphrase extraction. International Joint Conference on Natural Language Processing.

[2] Cai, X., & Cao, S. (2017). A keyword extraction method based on learning to rank. International Conference on Semantics, Knowledge and Grids, 13.

[3] Chen, B., Wu, Z., Cai, X., & Xu, X. (2016). New method for the segmentation of polarizing microscope image of rock core based on k-means cluster algorithm. Journal of University of Shanghai for Science & Technology, 38(4). (in Chinese)

[4] El-Beltagy, S. R., & Rafea, A. (2009). KP-Miner: A keyphrase extraction system for English and Arabic documents. Information Systems, 34(1).

[5] Guo, A., & Yang, T. (2016). Research and improvement of feature words weight based on TFIDF algorithm.
Information Technology, Networking, Electronic and Automation Control Conference.

[6] Hulth, A. (2003). Improved automatic keyword extraction given more linguistic knowledge. Proc. of Conference on Empirical Methods in Natural Language Processing.

[7] Jiang, Z. X., & Ding, Y. W. (2005). Network text classification based on k-nearest neighbor method. Journal of University of Shanghai for Science & Technology, 27(1). (in Chinese)

[8] Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1).

[9] Liang, Y. (2017). Chinese keyword extraction based on weighted complex network. International Conference on Intelligent Systems and Knowledge Engineering, 12.

[10] Lynn, H. M., Lee, E., Chang, C., & Kim, P. (2017). SwiftRank: An unsupervised statistical approach of keyword and salient sentence extraction for individual documents. Procedia Computer Science, 113.

[11] Pay, T., & Lucci, S. (2017). Automatic keyword extraction: An ensemble method. IEEE Big Data.

[12] Salton, G., & Yu, C. T. (1973). On the construction of effective vocabularies for information retrieval. ACM Sigplan Notices, 9(3).

[13] Sharan, A., Siddiqi, S., & Singh, J. (2015). Keyword extraction from Hindi documents using statistical approach. Intelligent Computing, Communication and Devices. Springer India, 309.

[14] Danesh, S., Sumner, T., & Martin, J. H. (2015). SGRank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction. Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics.

[15] Turney, P. D. (2002). Learning to extract keyphrases from text. NRC Technical Report ERB-1057. Canada: National Research Council.

[16] Abilhoa, W. D., & de Castro, L. N. (2014). A keyword extraction method from Twitter messages represented as graphs. Applied Mathematics and Computation, 240.

[17] Witten, I. H., Paynter, G. W., Frank, E., Gutwin, C., & Nevill-Manning, C. G. (1999). KEA: Practical automatic keyphrase extraction. Proc. of the 4th ACM Conference on Digital Libraries, Berkeley, California.

[18] Matsuo, Y., & Ishizuka, M. (2004). Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(01).

[19] Zhang, X., An, J., & Liu, W. (2018). Research and implementation of keyword extraction algorithm based on professional background knowledge. International Congress on Image and Signal Processing, Biomedical Engineering and Informatics, 10: 1-5.

[20] Zhou, Y., Cheng, H., & Yu, J. X. (2009). Graph clustering based on structural/attribute similarities.
Proceedings of the VLDB Endowment, 2(1).

[21] Zhang, M., Chen, Y., Shen, Y., & Ma, J. (2016). Comparative study on the performances of ANN and SVM and their application in the identification of DMD disease. Journal of University of Shanghai for Science & Technology, 4. (in Chinese)

Figure 1. An example of the connection between documents and terms (documents D1-D4 and terms T1-T4).

Figure 2. Precision on varying step length (P-TFIDF vs. P-Probability).

Figure 3. Recall on varying step length (R-TFIDF vs. R-Probability).

Figure 4. Precision on varying k (P-TFIDF vs. P-Pro).

Figure 5. Recall on varying k (R-TFIDF vs. R-Pro).


More information

Ranking Web Pages by Associating Keywords with Locations

Ranking Web Pages by Associating Keywords with Locations Ranking Web Pages by Associating Keywords with Locations Peiquan Jin, Xiaoxiang Zhang, Qingqing Zhang, Sheng Lin, and Lihua Yue University of Science and Technology of China, 230027, Hefei, China jpq@ustc.edu.cn

More information

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier Wang Ding, Songnian Yu, Shanqing Yu, Wei Wei, and Qianfeng Wang School of Computer Engineering and Science, Shanghai University, 200072

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Automatic Summarization

Automatic Summarization Automatic Summarization CS 769 Guest Lecture Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of Wisconsin, Madison February 22, 2008 Andrew B. Goldberg (CS Dept) Summarization

More information

An Automatic Reply to Customers Queries Model with Chinese Text Mining Approach

An Automatic Reply to Customers  Queries Model with Chinese Text Mining Approach Proceedings of the 6th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 15-17, 2007 71 An Automatic Reply to Customers E-mail Queries Model with Chinese Text Mining Approach

More information

Using Gini-index for Feature Weighting in Text Categorization

Using Gini-index for Feature Weighting in Text Categorization Journal of Computational Information Systems 9: 14 (2013) 5819 5826 Available at http://www.jofcis.com Using Gini-index for Feature Weighting in Text Categorization Weidong ZHU 1,, Yongmin LIN 2 1 School

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Extractive Text Summarization Techniques

Extractive Text Summarization Techniques Extractive Text Summarization Techniques Tobias Elßner Hauptseminar NLP Tools 06.02.2018 Tobias Elßner Extractive Text Summarization Overview Rough classification (Gupta and Lehal (2010)): Supervised vs.

More information

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM Myomyo Thannaing 1, Ayenandar Hlaing 2 1,2 University of Technology (Yadanarpon Cyber City), near Pyin Oo Lwin, Myanmar ABSTRACT

More information

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD

REMOVAL OF REDUNDANT AND IRRELEVANT DATA FROM TRAINING DATASETS USING SPEEDY FEATURE SELECTION METHOD Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM

GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM http:// GRAPHICAL REPRESENTATION OF TEXTUAL DATA USING TEXT CATEGORIZATION SYSTEM Akshay Kumar 1, Vibhor Harit 2, Balwant Singh 3, Manzoor Husain Dar 4 1 M.Tech (CSE), Kurukshetra University, Kurukshetra,

More information

A Semantic Model for Concept Based Clustering

A Semantic Model for Concept Based Clustering A Semantic Model for Concept Based Clustering S.Saranya 1, S.Logeswari 2 PG Scholar, Dept. of CSE, Bannari Amman Institute of Technology, Sathyamangalam, Tamilnadu, India 1 Associate Professor, Dept. of

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bonfring International Journal of Data Mining, Vol. 7, No. 2, May 2017 11 News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bamber and Micah Jason Abstract---

More information

A Data Classification Algorithm of Internet of Things Based on Neural Network

A Data Classification Algorithm of Internet of Things Based on Neural Network A Data Classification Algorithm of Internet of Things Based on Neural Network https://doi.org/10.3991/ijoe.v13i09.7587 Zhenjun Li Hunan Radio and TV University, Hunan, China 278060389@qq.com Abstract To

More information

A Feature Selection Method to Handle Imbalanced Data in Text Classification

A Feature Selection Method to Handle Imbalanced Data in Text Classification A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University

More information

Video annotation based on adaptive annular spatial partition scheme

Video annotation based on adaptive annular spatial partition scheme Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

What is this Song About?: Identification of Keywords in Bollywood Lyrics

What is this Song About?: Identification of Keywords in Bollywood Lyrics What is this Song About?: Identification of Keywords in Bollywood Lyrics by Drushti Apoorva G, Kritik Mathur, Priyansh Agrawal, Radhika Mamidi in 19th International Conference on Computational Linguistics

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

Research on Design and Application of Computer Database Quality Evaluation Model

Research on Design and Application of Computer Database Quality Evaluation Model Research on Design and Application of Computer Database Quality Evaluation Model Abstract Hong Li, Hui Ge Shihezi Radio and TV University, Shihezi 832000, China Computer data quality evaluation is the

More information

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection

Time Series Clustering Ensemble Algorithm Based on Locality Preserving Projection Based on Locality Preserving Projection 2 Information & Technology College, Hebei University of Economics & Business, 05006 Shijiazhuang, China E-mail: 92475577@qq.com Xiaoqing Weng Information & Technology

More information

Mining Landmark Papers

Mining Landmark Papers Mining Papers by Annu Tuli, Vikram Pudi in SIGAI Workshop on Emerging Research Trends in AI (ERTAI) Mumbai, India Report : IIIT/TR/2010/32 Centre for Data Engineering International Institute of Information

More information

Multi-Stage Rocchio Classification for Large-scale Multilabeled

Multi-Stage Rocchio Classification for Large-scale Multilabeled Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News Selecting Model in Automatic Text Categorization of Chinese Industrial 1) HUEY-MING LEE 1 ), PIN-JEN CHEN 1 ), TSUNG-YEN LEE 2) Department of Information Management, Chinese Culture University 55, Hwa-Kung

More information

Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM

Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM Performance Degradation Assessment and Fault Diagnosis of Bearing Based on EMD and PCA-SOM Lu Chen and Yuan Hang PERFORMANCE DEGRADATION ASSESSMENT AND FAULT DIAGNOSIS OF BEARING BASED ON EMD AND PCA-SOM.

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

Text Clustering Incremental Algorithm in Sensitive Topic Detection

Text Clustering Incremental Algorithm in Sensitive Topic Detection International Journal of Information and Communication Sciences 2018; 3(3): 88-95 http://www.sciencepublishinggroup.com/j/ijics doi: 10.11648/j.ijics.20180303.12 ISSN: 2575-1700 (Print); ISSN: 2575-1719

More information

Documents 1. INTRODUCTION. IJCTA, 9(10), 2016, pp International Science Press. *MG Thushara **Nikhil Dominic

Documents 1. INTRODUCTION. IJCTA, 9(10), 2016, pp International Science Press. *MG Thushara **Nikhil Dominic A Template Based Checking and Automated Tagging Algorithm for Project Documents IJCTA, 9(10), 2016, pp. 4537-4544 International Science Press 4537 A Templa emplate Based Checking and Auto- mated Tagging

More information

Extraction of Web Image Information: Semantic or Visual Cues?

Extraction of Web Image Information: Semantic or Visual Cues? Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

Research on Evaluation Method of Product Style Semantics Based on Neural Network

Research on Evaluation Method of Product Style Semantics Based on Neural Network Research Journal of Applied Sciences, Engineering and Technology 6(23): 4330-4335, 2013 ISSN: 2040-7459; e-issn: 2040-7467 Maxwell Scientific Organization, 2013 Submitted: September 28, 2012 Accepted:

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

Knowledge Engineering in Search Engines

Knowledge Engineering in Search Engines San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2012 Knowledge Engineering in Search Engines Yun-Chieh Lin Follow this and additional works at:

More information

Automatic Keyphrase Extractor from Arabic Documents

Automatic Keyphrase Extractor from Arabic Documents Automatic Keyphrase Extractor from Arabic Documents Hassan M. Najadat Department of information Systems Jordan University of and Technology Irbid, Jordan Ismail I. Hmeidi Department of information Systems

More information

A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval

A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval Information and Management Sciences Volume 18, Number 4, pp. 299-315, 2007 A New Approach for Automatic Thesaurus Construction and Query Expansion for Document Retrieval Liang-Yu Chen National Taiwan University

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 41 CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 3.1 INTRODUCTION This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms have

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

KeaKAT An Online Automatic Keyphrase Assignment Tool

KeaKAT An Online Automatic Keyphrase Assignment Tool 2012 10th International Conference on Frontiers of Information Technology KeaKAT An Online Automatic Keyphrase Assignment Tool Rabia Irfan, Sharifullah Khan, Irfan Ali Khan, Muhammad Asif Ali School of

More information

An Improved KNN Classification Algorithm based on Sampling

An Improved KNN Classification Algorithm based on Sampling International Conference on Advances in Materials, Machinery, Electrical Engineering (AMMEE 017) An Improved KNN Classification Algorithm based on Sampling Zhiwei Cheng1, a, Caisen Chen1, b, Xuehuan Qiu1,

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques

An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques An Efficient Semantic Image Retrieval based on Color and Texture Features and Data Mining Techniques Doaa M. Alebiary Department of computer Science, Faculty of computers and informatics Benha University

More information

FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP

FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP Dynamics of Continuous, Discrete and Impulsive Systems Series B: Applications & Algorithms 14 (2007) 103-111 Copyright c 2007 Watam Press FUZZY C-MEANS ALGORITHM BASED ON PRETREATMENT OF SIMILARITY RELATIONTP

More information

REFERENCE ALGORITHM OF TEXT CATEGORIZATION BASED ON FUZZY COGNITIVE MAPS

REFERENCE ALGORITHM OF TEXT CATEGORIZATION BASED ON FUZZY COGNITIVE MAPS REFERENCE ALGORITHM OF TEXT CATEGORIZATION BASED ON FUZZY COGNITIVE MAPS ZHANG Guiyun,LIU Yang, ZHANG Weijuan,WANG Yuanyuan Computer and Information Engineering College, Tianjin Normal University, Tianjin

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Seismic regionalization based on an artificial neural network

Seismic regionalization based on an artificial neural network Seismic regionalization based on an artificial neural network *Jaime García-Pérez 1) and René Riaño 2) 1), 2) Instituto de Ingeniería, UNAM, CU, Coyoacán, México D.F., 014510, Mexico 1) jgap@pumas.ii.unam.mx

More information

Text Similarity Based on Semantic Analysis

Text Similarity Based on Semantic Analysis Advances in Intelligent Systems Research volume 133 2nd International Conference on Artificial Intelligence and Industrial Engineering (AIIE2016) Text Similarity Based on Semantic Analysis Junli Wang Qing

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Linking Entities in Chinese Queries to Knowledge Graph

Linking Entities in Chinese Queries to Knowledge Graph Linking Entities in Chinese Queries to Knowledge Graph Jun Li 1, Jinxian Pan 2, Chen Ye 1, Yong Huang 1, Danlu Wen 1, and Zhichun Wang 1(B) 1 Beijing Normal University, Beijing, China zcwang@bnu.edu.cn

More information

Using Self-Organizing Maps for Sentiment Analysis. Keywords Sentiment Analysis, Self-Organizing Map, Machine Learning, Text Mining.

Using Self-Organizing Maps for Sentiment Analysis. Keywords Sentiment Analysis, Self-Organizing Map, Machine Learning, Text Mining. Using Self-Organizing Maps for Sentiment Analysis Anuj Sharma Indian Institute of Management Indore 453331, INDIA Email: f09anujs@iimidr.ac.in Shubhamoy Dey Indian Institute of Management Indore 453331,

More information

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings. Online Review Spam Detection by New Linguistic Features Amir Karam, University of Maryland Baltimore County Bin Zhou, University of Maryland Baltimore County Karami, A., Zhou, B. (2015). Online Review

More information

Mining High Average-Utility Itemsets

Mining High Average-Utility Itemsets Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Mining High Itemsets Tzung-Pei Hong Dept of Computer Science and Information Engineering

More information

Research Article Apriori Association Rule Algorithms using VMware Environment

Research Article Apriori Association Rule Algorithms using VMware Environment Research Journal of Applied Sciences, Engineering and Technology 8(2): 16-166, 214 DOI:1.1926/rjaset.8.955 ISSN: 24-7459; e-issn: 24-7467 214 Maxwell Scientific Publication Corp. Submitted: January 2,

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Determine the Entity Number in Hierarchical Clustering for Web Personal Name Disambiguation

Determine the Entity Number in Hierarchical Clustering for Web Personal Name Disambiguation Determine the Entity Number in Hierarchical Clustering for Web Personal Name Disambiguation Jun Gong Department of Information System Beihang University No.37 XueYuan Road HaiDian District, Beijing, China

More information

Classification. 1 o Semestre 2007/2008

Classification. 1 o Semestre 2007/2008 Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class

More information

An Empirical Performance Comparison of Machine Learning Methods for Spam Categorization

An Empirical Performance Comparison of Machine Learning Methods for Spam  Categorization An Empirical Performance Comparison of Machine Learning Methods for Spam E-mail Categorization Chih-Chin Lai a Ming-Chi Tsai b a Dept. of Computer Science and Information Engineering National University

More information

A Bayesian Approach to Hybrid Image Retrieval

A Bayesian Approach to Hybrid Image Retrieval A Bayesian Approach to Hybrid Image Retrieval Pradhee Tandon and C. V. Jawahar Center for Visual Information Technology International Institute of Information Technology Hyderabad - 500032, INDIA {pradhee@research.,jawahar@}iiit.ac.in

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Study on A Recommendation Algorithm of Crossing Ranking in E- commerce

Study on A Recommendation Algorithm of Crossing Ranking in E- commerce International Journal of u-and e-service, Science and Technology, pp.53-62 http://dx.doi.org/10.14257/ijunnesst2014.7.4.6 Study on A Recommendation Algorithm of Crossing Ranking in E- commerce Duan Xueying

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky The Chinese University of Hong Kong Abstract Husky is a distributed computing system, achieving outstanding

More information

Text Classification based on Limited Bibliographic Metadata

Text Classification based on Limited Bibliographic Metadata Text Classification based on Limited Bibliographic Metadata Kerstin Denecke, Thomas Risse L3S Research Center Hannover, Germany denecke@l3s.de Thomas Baehr German National Library of Science and Technology

More information

Template Extraction from Heterogeneous Web Pages

Template Extraction from Heterogeneous Web Pages Template Extraction from Heterogeneous Web Pages 1 Mrs. Harshal H. Kulkarni, 2 Mrs. Manasi k. Kulkarni Asst. Professor, Pune University, (PESMCOE, Pune), Pune, India Abstract: Templates are used by many

More information

Multimodal Medical Image Retrieval based on Latent Topic Modeling

Multimodal Medical Image Retrieval based on Latent Topic Modeling Multimodal Medical Image Retrieval based on Latent Topic Modeling Mandikal Vikram 15it217.vikram@nitk.edu.in Suhas BS 15it110.suhas@nitk.edu.in Aditya Anantharaman 15it201.aditya.a@nitk.edu.in Sowmya Kamath

More information

VisoLink: A User-Centric Social Relationship Mining

VisoLink: A User-Centric Social Relationship Mining VisoLink: A User-Centric Social Relationship Mining Lisa Fan and Botang Li Department of Computer Science, University of Regina Regina, Saskatchewan S4S 0A2 Canada {fan, li269}@cs.uregina.ca Abstract.

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Semi supervised clustering for Text Clustering
