The Research of A multi-language supporting description-oriented Clustering Algorithm on Meta-Search Engine Result Wuling Ren 1, a and Lijuan Liu 2,b


Applied Mechanics and Materials, Vol. 151, Trans Tech Publications, Switzerland

1 Zhejiang Gongshang University, Hangzhou, China
2 Zhejiang Gongshang University, Hangzhou, China
a rwl@zjgsu.edu.cn, b liulijuan2012@163.com

Keywords: Meta-Search Engine; Chinese Segmentation; Text Clustering; DCFC Clustering.

Abstract. Search engines have adopted a variety of techniques to improve the accuracy of information retrieval, but presenting results as a linear list, which mixes unrelated documents with relevant ones, places a heavy burden on users. This article builds clustering of search results on top of meta-search engine techniques. We use popular search engines as data sources; after pre-processing the results returned by the source search engines, hierarchical clustering results are formed and returned to the querying user. We propose a multi-language, label-first clustering algorithm, which we name DCFC. The algorithm supports both Chinese and English queries, focuses on generating human-readable labels, and shows search results in a hierarchical structure.

1 Introduction

Given the scale of the Internet, search engines have become an important means of access to information, and they can bring search engine operators huge economic benefits. Meanwhile, search engines are an emerging cross-disciplinary field that integrates text mining, database theory, natural language understanding, and other areas, so they have high research value from both theoretical and practical perspectives. Data mining is currently a hot research area for institutes at home and abroad [1]. Research on search results clustering can be divided into two kinds: pre-clustering and post-clustering.
Pre-clustering clusters web documents before the search; given the large amount of data on the Internet, this approach requires substantial computing resources. Post-clustering clusters the search results, which greatly reduces the number of documents to be clustered, so this scheme suits real-time requirements. Given the size and the dynamic nature of the Internet, post-clustering is the approach often used.

The main content of this study includes the following:
An analysis of the research status of search engines and data mining, highlighting the principles of meta-search engines and the theory of text clustering.
The design of a meta-search engine based data source for DCFC: on the one hand, it uses source search engines to obtain data; on the other hand, the results of user queries can be indexed and stored to improve the efficiency of the next query.
The DCFC algorithm: after segmentation and other operations, the maximal frequent candidate sets are calculated as candidate class label names, class labels are then generated by hierarchical latent semantic analysis, and finally the relevance between the data and the class labels is calculated to place the data into the corresponding class [2].

2 Meta-search engines and text mining research

A search engine accepts the user's query through its user interface, retrieves data from its index database, and returns relevant information to the end user. To obtain enough relevant data, search engines often need to maintain a large repository for the index data. A typical search engine consists mainly of a web spider, an indexer, a searcher, and a user interface.

New Trends in Mechatronics and Materials Engineering

The characteristics of meta-search engines

Compared with an independent search engine, a meta-search engine has the following characteristics: its data are secondary results drawn from multiple independent search engines; it has no web page library of its own, which saves storage space and network bandwidth; it provides a single query interface that submits a query to multiple search engines; and it re-combines, scores, and sorts the results of the multiple independent search engines.

A meta-search engine (MSE) works as follows [3]:
(1) The user enters query keywords, and the MSE processes them, for example by translating multiple query terms into a Boolean query format that all member search engines support and by determining the topic of the query.
(2) Based on its scheduling policy, the MSE selects search engines from a number of sources; users can also set their own list of member search engines.
(3) According to the query syntax that each source search engine supports, the MSE uses a pre-configured mapping to assemble the query into a URL string that the source search engine accepts.
(4) The MSE submits the queries to the source search engines via HTTP requests and receives the results they return. If a source search engine returns no result within the allotted time, the MSE either resubmits the request or abandons that source's results.
(5) After receiving the results of each source search engine, the MSE re-weights all results, eliminates duplicate pages, and computes the relevance of the results to the query.
(6) According to the user's individual characteristics, the results are presented to the user, for example as a linear list, graphically, or by automatic classification.
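The dispatch, timeout, and merge steps of this workflow can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function metasearch, the engine callables, and the reciprocal-rank scoring rule are all assumptions, since the paper does not state how results are scored. Each engine here is a plain function standing in for a real HTTP request to a source search engine.

```python
from concurrent.futures import ThreadPoolExecutor


def metasearch(query, engines, timeout=2.0):
    """Send `query` to every source engine, then merge the results.

    `engines` maps a name to a callable that returns a ranked list of
    (url, title, snippet) tuples. Both are hypothetical stand-ins for
    HTTP requests to real search engines.
    """
    scores, pages = {}, {}
    # Note: leaving the `with` block waits for all workers to finish.
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        for name, future in futures.items():
            try:
                results = future.result(timeout=timeout)
            except Exception:
                continue  # abandon a source that times out or fails
            for rank, (url, title, snippet) in enumerate(results, start=1):
                key = url.rstrip("/").lower()  # crude URL normalization
                pages.setdefault(key, (url, title, snippet))  # drop duplicates
                # Assumed scoring rule: sum of reciprocal ranks across sources.
                scores[key] = scores.get(key, 0.0) + 1.0 / rank
    ranked = sorted(pages, key=lambda k: scores[k], reverse=True)
    return [pages[k] for k in ranked]


def engine_a(query):
    return [("http://a.com/", "A", "about a"), ("http://b.com", "B", "about b")]


def engine_b(query):
    return [("http://b.com/", "B again", "about b"), ("http://c.com", "C", "about c")]


def engine_down(query):
    raise RuntimeError("source unavailable")


merged = metasearch("clustering", {"a": engine_a, "b": engine_b, "down": engine_down})
```

In this example the page returned by both working engines is ranked first, and the failing source is simply skipped. A real system would replace the engine functions with URL construction and HTTP calls, and might retry a source instead of abandoning it.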
3 Multi-layer label-first text clustering algorithm

3.1 The multi-layer label-first text clustering algorithm DCFC

Most existing studies of search results clustering take the list of document URLs returned by the search engine, crawl the full text of each document, and cluster that content. The label-first text clustering algorithm (Description Comes First Clustering, DCFC) submits the user's keywords to the search engines, clusters the results (the summaries returned by the search engines), and returns the clusters to the user. It is characterized, on the one hand, by support for both Chinese and English queries and, on the other hand, unlike traditional clustering algorithms, by its emphasis on the readability of cluster names.

Fig. 1. Difference between the traditional clustering process and the label-first clustering process

This section gives a complete description of the DCFC algorithm, with pseudo-code and examples where necessary. DCFC includes the following five steps: data preprocessing, segmentation, frequent phrase generation, generation of multi-level class labels, and assignment of the data to the appropriate category.
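As a rough illustration of the segmentation step for mixed Chinese and English snippets, the sketch below uses forward maximum matching, a common dictionary-based technique. The paper does not say which segmenter DCFC uses, so the method, the function names, and the tiny dictionary are all assumptions made for this example.

```python
import re


def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation of a run of Chinese characters."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest window first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def tokenize(snippet, dictionary):
    """Split a mixed Chinese/English snippet into tokens."""
    tokens = []
    for run in re.findall(r"[A-Za-z0-9]+|[\u4e00-\u9fff]+", snippet):
        if re.match(r"[A-Za-z0-9]", run):
            tokens.append(run.lower())  # English word or number
        else:
            tokens.extend(forward_max_match(run, dictionary))
    return tokens


# Hypothetical toy dictionary; a real system would load a large lexicon.
lexicon = {"搜索", "引擎", "搜索引擎", "聚类"}
tokens = tokenize("Meta 搜索引擎聚类 results", lexicon)
```

Here the Chinese run is split into the two dictionary words "搜索引擎" and "聚类", while the English words are lowercased and kept whole.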

3.2 Generate frequent phrases

The data pre-processing step extracts the title and summary information, which gives us the data source required for clustering. Frequent phrase generation finds phrases in the documents that describe the document content, to serve as candidate label names. In this paper we use the classic frequent itemset algorithm Apriori to find frequent phrases. The frequent itemset mining algorithm is described as follows:

    L_1 = find_frequent_1-itemsets(D);          // mine the frequent 1-itemsets, the easy case
    for (k = 2; L_{k-1} ≠ Φ; k++) {
        C_k = apriori_gen(L_{k-1}, min_sup);    // generate the candidate frequent k-itemsets
        for each transaction t ∈ D {            // scan the transaction database D
            C_t = subset(C_k, t);               // candidates contained in t
            for each candidate c ∈ C_t
                c.count++;                      // count each candidate frequent k-itemset
        }
        L_k = {c ∈ C_k | c.count ≥ min_sup};    // k-itemsets that meet the minimum support are frequent
    }
    return L = ∪_k L_k;                         // merge the frequent k-itemsets (k > 0)

3.3 Generation of multi-class labels

In the frequent phrase generation stage, we obtained frequent phrases that represent the semantic information of the documents as candidate class labels. In the following, we first generate the class labels and then organize them into a hierarchy according to their semantic relations. Here, we use latent semantic analysis (Latent Semantic Indexing, LSI) to extract abstract concepts. LSI takes the singular value decomposition (SVD) of a matrix as its mathematical basis and uses the U matrix produced by the SVD to obtain the concepts contained in the documents. Each column vector of U represents an abstract concept of the documents, but because these concepts are expressed as vectors, users cannot understand them directly. First, based on the frequent words, we calculate the term-document frequency matrix A, whose weights are calculated with TF-IDF.
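A minimal sketch of building such a term-document matrix A is given below. It assumes one common TF-IDF variant (raw term count multiplied by log(N/df)); the paper does not state which variant it uses, and the function name tfidf_matrix and the sample snippets are made up for illustration.

```python
import math
from collections import Counter


def tfidf_matrix(docs, terms):
    """Return A as a list of rows; A[i][j] weights terms[i] in docs[j].

    Assumed weighting: raw term count times log(N / document frequency).
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    matrix = []
    for term in terms:
        df = sum(1 for c in counts if c[term] > 0)  # documents containing term
        idf = math.log(n_docs / df) if df else 0.0
        matrix.append([c[term] * idf for c in counts])
    return matrix


# Hypothetical tokenized snippets for the query "apple".
snippets = [["apple", "pie", "recipe"], ["apple", "computer"], ["apple", "pie"]]
frequent_terms = ["apple", "pie", "computer"]
A = tfidf_matrix(snippets, frequent_terms)
```

With this weighting, a term that occurs in every snippet, such as the query word itself, receives weight zero, so it cannot dominate the concepts. The resulting matrix could then be passed to an SVD routine, for example numpy.linalg.svd, to obtain U.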
Decomposing matrix A by SVD yields the matrix U, whose column vectors represent abstract concepts. Second, the abstract concepts produced by the SVD are matched against the frequent phrases, which makes the abstract concepts concrete, in order to find label names that summarize the document contents. The labels obtained in the candidate frequent phrase generation phase have already been screened for ease of understanding, so they have good readability. This process is similar to the query process: we treat each candidate label as a query q, and all candidate labels together form the matrix P, in which each column vector is the vector of one candidate word or frequent phrase. We calculate the distance between each column vector of P and each column vector of U, so that C = U^T P, where each component of C is the distance value between a column vector of P and an abstract concept. From matrix C, the phrase that best matches each abstract concept can be selected [4]. At this point, for each row of C we take the frequent phrase candidate with the maximum value, and likewise the frequent word with the maximum value. If the difference between the two values is within a threshold, the two are very similar and either can be used to express the concept; if the difference is greater than the threshold, the one with the smaller value is further from the abstract concept, so only the one with the larger value (which may be a word or a phrase) is kept to express the abstract concept.

Hierarchy: a label name that consists of a single word or phrase becomes a leaf node of the tree, that is, a node with no children. Labels that share the same word or phrase are placed under the same node of the tree [5].

4 Prototype system running example and conclusions

4.1 Development Environment

The system development and operating environment are as follows:

Table 1. Hardware and software environment
Hardware environment    GATEWAY T6832C
Operating system        Windows XP
Development language    Java
Database                MySQL (5.0.18)
Application server      Tomcat (6.0.18)

4.2 System Architecture

In the practical application of the view annotation tool for collaborative product design of lace, designers and technicians spend their time on design and geometry-based discussion.
The designer, acting as the server side, starts the collaborative annotation tool and provides the assembly model for process design staff to browse, examine, and discuss collaboratively. Customers can annotate flower patterns already in the database and indicate the modifications they need. On this shared collaboration platform, designers and customers can refer to different patterns to speed up the design of new products.

5 Conclusion

This work improves on traditional search engines by bringing together their results to achieve higher data coverage, and it applies data mining clustering techniques to the results to shorten the time users need to locate information. The approach is viable and, to a certain extent, reduces the burden on users, who can quickly find the information they really need. However, the proposed method has some shortcomings, and the accuracy of the system needs further improvement. In addition, many functions in the system implementation can be optimized.

Acknowledgment

This project is supported by the Science and Technology Research Programs of Zhejiang Province, China (No.2009C , 2009C11159) and by Zhejiang Gongshang University Graduate Science and Technology Innovative Projects (No.1130XJ ,3070JQ ).

References

[1] Stanisław Osiński, Jerzy Stefanowski, and Dawid Weiss. Lingo: Search results clustering algorithm based on Singular Value Decomposition. In Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K., eds.: Proceedings of the International IIS: Intelligent Information Processing and Web Mining Conference. Advances in Soft Computing, Zakopane, Poland, Springer (2004)
[2] Chien L.F. PAT-Tree-Based Adaptive Key phrase Extraction for Intelligent Chinese Information Retrieval. In Proceedings of the 20th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (SIGIR 93), Pittsburgh, PA.
[3] Lan Huang. A Survey on Web Information Retrieval Technologies [EB/OL]. ECSL Technical Report, State University of New York.
[4] L Ding, T Finin, A Joshi, R Pan, RS Cost, Y Peng. Swoogle: a search and metadata engine for the semantic web. In CIKM.
[5] Wu L, Mcelean S. Result merging methods in distributed information retrieval with overlapping databases. Information Retrieval, 2007, 10(3).
[6] Carrot2 Framework. Carrot2: Design of a Flexible and Efficient Web Information Retrieval Framework. Third International Atlantic Web Intelligence Conference (AWIC2005), Łódź, Poland, 2005.
[7] P. Ferragina, A. Gulli. A personalized search engine based on web-snippet hierarchical clustering. www14, 2005.
