

Applied Mechanics and Materials Online: 2012-01-24
ISSN: 1662-7482, Vol. 151, pp 549-553
doi:10.4028/www.scientific.net/amm.151.549
© 2012 Trans Tech Publications, Switzerland

The Research of A multi-language supporting description-oriented Clustering Algorithm on Meta-Search Engine Result

Wuling Ren 1,a and Lijuan Liu 2,b
1 Zhejiang Gongshang University, Hangzhou, China
2 Zhejiang Gongshang University, Hangzhou, China
a rwl@zjgsu.edu.cn, b liulijuan2012@163.com

Keywords: Meta-Search Engine; Chinese Segmentation; Text Clustering; DCFC Clustering.

Abstract. Search engines have adopted a variety of techniques to improve retrieval accuracy, but presenting results as a linear list, which mixes unrelated documents with relevant ones, places a heavy burden on users. This article builds a clustering of search results on top of meta-search engine techniques. We use the popular search engines as data sources; after pre-processing the results returned by the source search engines, hierarchical clusters are formed and returned to the querying user. We propose a multi-language, label-first clustering algorithm, which we name DCFC. The algorithm supports both Chinese and English queries, focuses on generating human-readable labels, and presents search results in a hierarchical structure.

1 Introduction

Given the scale of the Internet, search engines have become an important means of accessing information, and they can bring their operators large economic benefits. At the same time, search engines are an emerging interdisciplinary field that integrates text mining, database theory, natural language understanding, and other areas, so they have high research value from both theoretical and practical perspectives. Data mining is currently a hot research area for research institutes at home and abroad [1].
Research on search results clustering can be divided into two approaches: pre-clustering and post-clustering. Pre-clustering clusters web documents before the search takes place; because the Internet holds a very large amount of data, this approach requires substantial computing resources. Post-clustering clusters only the search results, which greatly reduces the number of documents to be clustered and makes the approach suitable for real-time requirements. Given the size and dynamic nature of the Internet, post-clustering is the approach most often used.

The main contents of this study are as follows:
(1) An analysis of the state of research on search engines and data mining, highlighting the principles of meta-search engines and the theory of text clustering.
(2) The design of a meta-search engine that serves as the data source for DCFC. On the one hand it uses source search engines to obtain data; on the other hand the results of user queries can be indexed and stored, which makes subsequent queries more efficient.
(3) The DCFC clustering algorithm. After segmentation and other operations, the maximal frequent candidate sets are computed and used as candidate class label names; class labels are then generated hierarchically by latent semantic analysis; finally, the relevance between the data and the class labels is computed and each item is assigned to the corresponding class [2].

2 Meta-search engines and text mining research

A search engine accepts the user's query through a user interface, retrieves the stored data from its index database, and returns the relevant information to the end user. To return enough relevant data, a search engine usually has to maintain a large store for its index data. A typical search engine consists mainly of a web spider, an indexer, a searcher, and a user interface.

All rights reserved. No part of contents of this paper may be reproduced or transmitted in any form or by any means without the written permission of Trans Tech Publications, www.ttp.net.

550 New Trends in Mechatronics and Materials Engineering

The characteristics of meta-search engines

Compared with an independent search engine, a meta-search engine has the following characteristics: its data are secondary results drawn from the results of multiple independent search engines; it has no web page library of its own, which saves storage space and network bandwidth; it provides a single query interface that submits a query to multiple search engines; and it recombines, scores, and sorts the results of the multiple independent search engines.

A meta-search engine (MSE) works as follows [3]:
(1) The user enters query keywords. The MSE pre-processes the query, for example by translating multiple query terms into the Boolean query format supported by all member search engines and by determining the topic of the query.
(2) The MSE selects several source search engines according to its scheduling strategy; users can also set the list of member search engines themselves.
(3) Using a pre-configured mapping, the MSE assembles the query into the URL string that each source search engine supports.
(4) The MSE submits the query to each source search engine through an HTTP request and receives the results that the source returns. If no result arrives from a source within the allotted time, the MSE either resubmits the request or abandons the results of that source.
(5) After receiving the results from each source search engine, the MSE de-duplicates all results, eliminates reprinted pages, and computes the relevance between the query and each result.
(6) The results are presented to the user according to the user's individual characteristics, for example as a linear list, graphically, or through automatic classification.
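The workflow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the source engines are hypothetical stub functions standing in for real HTTP requests, the URL normalization rule is an assumption, and the reciprocal-rank score is a stand-in because the paper does not give its scoring formula.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def normalize(url):
    """Crude key for spotting the same page returned by different engines (assumed rule)."""
    url = url.lower().rstrip("/")
    for prefix in ("https://", "http://", "www."):
        if url.startswith(prefix):
            url = url[len(prefix):]
    return url


def meta_search(query, engines, timeout=0.2, retries=1):
    """Send the query to every source engine, then merge, de-duplicate and rank."""
    merged = {}
    workers = max(1, len(engines) * (retries + 1))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit to all sources at once so that they run concurrently.
        pending = {name: pool.submit(fn, query) for name, fn in engines.items()}
        for name, future in pending.items():
            results = []
            for attempt in range(retries + 1):
                try:
                    results = future.result(timeout=timeout)
                    break
                except FutureTimeout:
                    if attempt < retries:
                        # Allotted time exceeded: resubmit the request.
                        future = pool.submit(engines[name], query)
            # If every attempt timed out, results is still empty: source abandoned.
            for rank, hit in enumerate(results, start=1):
                entry = merged.setdefault(normalize(hit["url"]), dict(hit, score=0.0))
                entry["score"] += 1.0 / rank  # assumed reciprocal-rank scoring
    return sorted(merged.values(), key=lambda h: h["score"], reverse=True)


# Hypothetical source engines; a real MSE would issue HTTP requests here.
def engine_a(query):
    return [{"url": "http://x.example/1", "title": "one"},
            {"url": "http://x.example/2", "title": "two"}]


def engine_b(query):
    return [{"url": "https://www.x.example/2/", "title": "two"},
            {"url": "http://x.example/3", "title": "three"}]


def engine_slow(query):
    time.sleep(1.0)  # never answers within the allotted time
    return [{"url": "http://x.example/9", "title": "late"}]


if __name__ == "__main__":
    sources = {"a": engine_a, "b": engine_b, "slow": engine_slow}
    for hit in meta_search("jaguar", sources):
        print(round(hit["score"], 2), hit["url"])
```

A page returned by two engines accumulates score from both, so it rises above pages that only one engine found, while the slow source is dropped after one resubmission.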
3 Multi-layer label-first text clustering algorithm

3.1 Multi-label, label-first text clustering algorithm DCFC

Most existing studies of search results clustering take the URLs in the document list returned by the search engine, crawl the full text of each page, and cluster that content. The label-first text clustering algorithm (Description Comes First Clustering, DCFC) instead submits the user's keywords to the search engines as queries, clusters the results (the summaries returned by the search engines), and returns the clusters to the user. It has two distinguishing features: it supports both Chinese and English queries, and, unlike traditional clustering algorithms, it emphasizes the readability of cluster names.

Fig. 1. Difference between the traditional clustering process and the label-first clustering process

This section gives a complete description of the DCFC algorithm, with pseudo-code and examples where necessary. DCFC consists of five steps: data pre-processing, segmentation, frequent phrase generation, generation of multi-class labels, and assignment of the data to the appropriate class.
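The first two of these steps can be illustrated with a short sketch. The paper does not say which segmenter DCFC uses, so the code below substitutes overlapping character bigrams for Chinese text, a common fallback when no dictionary-based segmenter is available. The field names `title` and `snippet` and the function names are assumptions made for this example.

```python
import re

# Runs of ASCII letters and digits, or runs of CJK unified ideographs.
TOKEN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]+")
TAG = re.compile(r"<[^>]+>")


def segment(text):
    """Split mixed Chinese and English text into tokens.

    English runs become lower-cased words. Chinese runs are cut into
    overlapping character bigrams (a stand-in for a real segmenter).
    """
    tokens = []
    for run in TOKEN.findall(text):
        if run.isascii():
            tokens.append(run.lower())
        elif len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


def preprocess(result):
    """Keep only the title and snippet of one search result and tokenize them."""
    text = "{} {}".format(result.get("title", ""), result.get("snippet", ""))
    return segment(TAG.sub(" ", text))  # drop HTML markup before segmenting


if __name__ == "__main__":
    print(segment("Meta Search 搜索引擎"))
    print(preprocess({"title": "<b>Jaguar</b> cars", "snippet": "Jaguar 汽车"}))
```

The token lists produced this way are the "transactions" that the frequent phrase generation step consumes.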

3.2 Frequent phrase generation

Data pre-processing extracts the title and summary of each result, which gives us the data source needed for clustering. Frequent phrase generation then finds phrases in the documents that describe the document content; these phrases serve as candidate label names. In this paper we use the classic frequent itemset algorithm Apriori to find frequent phrases. The frequent itemset mining algorithm is as follows:

    L_1 = find_frequent_1-itemsets(D);             // mine the frequent 1-itemsets, the easy case
    for (k = 2; L_{k-1} ≠ Φ; k++) {
        C_k = apriori_gen(L_{k-1}, min_sup);       // generate the candidate frequent k-itemsets
        for each transaction t ∈ D {               // scan the transaction database D
            C_t = subset(C_k, t);                  // candidates that are contained in t
            for each candidate c ∈ C_t
                c.count++;                         // count the support of each candidate
        }
        L_k = {c ∈ C_k | c.count ≥ min_sup}        // candidates meeting the minimum support are the frequent k-itemsets
    }
    return L = ∪_k L_k;                            // merge the frequent k-itemsets (k > 0)

3.3 Generation of multi-class labels

The frequent phrase generation stage yields frequent phrases that represent the semantic information of the documents; these are the candidate class labels. Next we generate the class labels and organize them into a hierarchy according to their semantic relations. We use latent semantic analysis (Latent Semantic Indexing, LSI) to extract abstract concepts. LSI takes the singular value decomposition (SVD) of a matrix as its mathematical basis, and the matrix U produced by the SVD gives the concepts contained in the documents. Each column vector of U represents one abstract concept of the documents. Because these concepts are expressed as vectors, users cannot understand them directly.
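To make the role of U concrete, the following sketch extracts the leading concept vectors of a small term-document matrix and attaches a readable phrase to each. It is an illustration under stated assumptions rather than the authors' code: the weights in `A` are made-up values, power iteration with deflation on A·Aᵀ stands in for a full SVD routine, and the cosine similarity used to compare phrases with concepts is assumed.

```python
import math


def unit(v):
    """Scale v to length 1 (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def top_concepts(A, k, iters=200):
    """Approximate the first k columns of U in A = U S V^T.

    The columns of U are the eigenvectors of A A^T, so each one is found
    by power iteration and then removed from the matrix (deflation).
    """
    n = len(A)
    M = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(n)] for i in range(n)]
    concepts = []
    for _ in range(k):
        u = unit([1.0] * n)
        for _step in range(iters):
            u = unit(matvec(M, u))
        lam = sum(x * y for x, y in zip(u, matvec(M, u)))  # matching eigenvalue
        M = [[M[i][j] - lam * u[i] * u[j] for j in range(n)] for i in range(n)]
        concepts.append(u)
    return concepts


def phrase_vector(phrase, terms):
    """Represent a candidate label in the same term space as the concepts."""
    words = phrase.split()
    return unit([1.0 if t in words else 0.0 for t in terms])


def label_concepts(A, terms, candidates, k):
    """Pick, for each abstract concept, the candidate phrase closest to it."""
    P = [phrase_vector(p, terms) for p in candidates]
    labels = []
    for u in top_concepts(A, k):
        # Absolute value because the sign of a singular vector is arbitrary.
        scores = [abs(sum(a * b for a, b in zip(u, p))) for p in P]
        labels.append(candidates[scores.index(max(scores))])
    return labels


# Hypothetical weighted term-document matrix: rows are terms, columns are documents.
TERMS = ["frequent", "itemsets", "suffix", "tree"]
A = [
    [1.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.2, 1.0, 1.0],
]
CANDIDATES = ["frequent itemsets", "suffix tree", "frequent", "tree"]

if __name__ == "__main__":
    print(label_concepts(A, TERMS, CANDIDATES, k=2))
```

Each concept is a dense vector over the vocabulary, which is why it is not readable on its own; comparing it with the phrase vectors is what turns it into a label a user can understand.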
First, we build the term-document frequency matrix A from the frequent words, with the weights computed by TF-IDF. Applying SVD to A gives the matrix U, whose column vectors represent the abstract concepts. Second, we match the abstract concepts produced by the SVD against the frequent phrases, which makes the abstract concepts concrete and yields label names that summarize the document content. The candidate labels obtained in the frequent phrase generation stage have already been screened for ease of understanding, so they are readable labels.

This process resembles query processing: each candidate label name is treated as a query q, and the vectors of all candidate labels form a matrix P in which each column is the vector of one candidate frequent word or phrase. We compute the distance between each column vector of P and each column vector of U, so that C = U^T P; each entry of C is the distance value between one column vector of P and one abstract concept. From C we can select the phrase that best matches each abstract concept [4].

For each row of C we take the frequent phrase with the maximum value and the frequent word with the maximum value. If the difference between the two values lies within a threshold, the two are very similar and either can express the same concept. If the difference exceeds the threshold, the one with the smaller value matches the abstract concept less well, so only the one with the larger value (which may be either the word or the phrase) is kept to express the abstract concept.

Hierarchy: a label name consisting of a single word or phrase becomes a leaf node of the tree, that is, a node with no children. Labels that contain the same word or phrase are placed under the same node of the tree [5].

4 Prototype system running example and conclusions

4.1 Development Environment

The system development and operating environment are as follows:

TABLE 1. Hardware and software environment
Hardware environment     GATEWAY T6832C
Operating system         Windows XP
Development language     Java
Database                 MySQL (5.0.18)
Application server       Tomcat (6.0.18)

4.2 System Architecture

In the practical application of the view annotation tool for collaborative lace product design, designers and technicians spend their time on design and on geometry-based discussion.
The designer's view acts as the server side and starts the collaborative annotation tool. Process design staff supply the assembly model for browsing, review, and collaborative discussion. Customers can annotate lace patterns already in the database and indicate what they need modified. On this shared collaboration platform, designers and customers can refer to different patterns, which speeds up the design and development of new ones.

5 Conclusion

This work improves on traditional search engines in two ways: it aggregates the results of several engines to achieve higher data coverage, and it applies data mining clustering techniques to the results to shorten the time users need to locate information. The approach is feasible and, to a certain extent, reduces the burden on users, who can more quickly find the information they really need. However, the proposed method still has shortcomings, and the accuracy of the system needs further improvement. In addition, many functions in the system implementation can be optimized.

Acknowledgment

This project is supported by the Science and Technology Research Programs of Zhejiang Province, China (No. 2009C03016-4, 2009C11159) and by Zhejiang GongShang University Graduate Science and Technology Innovative Projects (No. 1130XJ1510130, 3070JQ4211015).

References

[1] Stanisław Osiński, Jerzy Stefanowski, and Dawid Weiss. Lingo: Search results clustering algorithm based on Singular Value Decomposition. In Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K., eds.: Proceedings of the International IIS: Intelligent Information Processing and Web Mining Conference. Advances in Soft Computing, Zakopane, Poland, Springer (2004) 359-368.
[2] Chien L.F. PAT-Tree-Based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval. In Proceedings of the 20th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (SIGIR 93), pages 125-135, Pittsburgh, PA, 1993.
[3] Lan Huang. A Survey on Web Information Retrieval Technologies [EB/OL]. ECSL Technical Report, State University of New York, 2000.
[4] L. Ding, T. Finin, A. Joshi, R. Pan, R.S. Cost, Y. Peng. Swoogle: a search and metadata engine for the semantic web. In CIKM 2005.
[5] Wu L, Mcelean S. Result merging methods in distributed information retrieval with overlapping databases. Information Retrieval, 2007, 10(3): 297-319.
[6] Carrot2 Framework. Carrot2: Design of a Flexible and Efficient Web Information Retrieval Framework. Third International Atlantic Web Intelligence Conference (AWIC2005), Łodź, Poland, 2005, 439-444.
[7] P. Ferragina, A. Gulli. A personalized search engine based on web-snippet hierarchical clustering. WWW14, 2005.