Research on Full-text Retrieval based on Lucene in Enterprise Content Management System Lixin Xu 1, a, XiaoLin Fu 2, b, Chunhua Zhang 1, c

Applied Mechanics and Materials Vols. 644-650, pp 1950-1953
ISSN: 1662-7482
doi:10.4028/www.scientific.net/amm.644-650.1950
Submitted: 2014-07-18, Accepted: 2014-07-21, Online: 2014-09-22
2014 Trans Tech Publications, Switzerland

Research on Full-text Retrieval based on Lucene in Enterprise Content Management System

Lixin Xu 1, a, XiaoLin Fu 2, b, Chunhua Zhang 1, c

1 Staff Room of Computer, Aviation University of Airforce, Changchun, 130022, China
2 College of Software, Changchun University of Technology, Changchun, 130012, China
a email: henryviolet@sina.com, b email: audreyfxl@gmail.com, c email: zch-cc@163.com

Keywords: Lucene; Full-text Retrieval; Data Retrieval; Enterprise Content Management System

Abstract. This paper studies Lucene search engine technology in an enterprise content management system and extends it effectively. It implements four function modules in a Lucene-based full-text retrieval architecture: a text analysis module, an index module, a query module and a store module. The intelligence of the enterprise content management system is reflected in its combination of intelligent technologies, including data retrieval, with modern open source frameworks, which strengthens its content processing and analysis functions.

Introduction

The information content of enterprises grows rapidly as the informatization of society develops. Enterprises have to handle the collection, creation, storage, management, publication, search and service of information content. Building on basic content management and document management, enterprise content management further integrates enterprise technologies such as collaboration and knowledge. As a result, content management technology shows the following development trends: it is becoming heterogeneous, standardized, intelligent and platform-based.
This paper studies Lucene search engine technology in an enterprise content management system and extends it so that full-text retrieval is implemented on top of Lucene. The intelligence of the enterprise content management system is reflected in its combination of intelligent technologies, including data retrieval, with modern open source frameworks, which strengthens its content processing and analysis functions.

Data retrieval

Data retrieval technology helps users locate the required content quickly. Because information sources, storage formats and systems, and access and retrieval methods all differ, the problem of integrated retrieval over heterogeneous resources arises: users need to access and retrieve content in a uniform way. To maintain retrieval performance under massive data volumes and concurrent retrieval, the system needs to combine distributed cluster retrieval, caching and load balancing. The system effectively combines Lucene-based full-text retrieval with data retrieval.

Lucene[1] is an open source project of the Apache Software Foundation. It is a full-text search engine toolkit: not a complete full-text retrieval application, but an architecture for building one. It provides a complete query engine, a complete index engine, and a partial text analysis engine[2]. Lucene's purpose is to give software developers a simple, easy-to-use toolkit with which to add full-text retrieval functions to a target system or to build a complete full-text retrieval engine on top of it[3].

The advantages of Lucene[4]: the index file format is independent of the application platform; it implements partitioned indexes on top of the inverted index used by traditional full-text retrieval engines[5]; it has an excellent object-oriented system architecture[6]; and it defines a text analysis interface that is independent of language and file format.

Lucene consists of three parts: the basic encapsulation structure, the index core and the external interface. The index core, which operates on the index files directly, is the key part of the system. The Lucene source code is divided into seven modules[7].

Full-text retrieval based on Lucene

The analysis above outlines the structural characteristics of Lucene. On that basis, Lucene can be extended into a comprehensive full-text retrieval engine, and various application systems can be built on top of that engine[8]. Lucene is an architecture for a full-text retrieval engine. It contains many abstract classes, interfaces, document types, scoring logic and so on[9], and leaves the concrete implementation to the specific application.

Developing Lucene-based full-text retrieval in the intelligent enterprise content management system requires two things. First, the system needs to build a retrieval index so that retrieval is possible. Second, it needs to implement the pages of the Lucene retrieval platform so that users can interact with the system and view the retrieval results. Based on these considerations, the architecture of Lucene-based full-text retrieval in the system is shown in Figure 1.

Figure 1 The architecture of full-text retrieval based on Lucene

Figure 1 shows the four function modules in the architecture of full-text retrieval based on Lucene: the text analysis module, the index module, the query module and the store module.

Text analysis module

Text analysis is the critical first step in establishing a full-text retrieval system and an important factor in the quality of the overall system. Word segmentation is the key technology in text analysis. It cuts the text to be indexed into keywords, and an inverted index is then built over those keywords. Chinese word segmentation differs from English word segmentation.
English words are delimited by spaces, but a single Chinese character often carries no meaning on its own. Chinese text must be divided into meaningful words, each composed of one or more characters, and it has no natural delimiter such as the space in English[10]. This increases the complexity of word segmentation. Chinese word segmentation is therefore the first problem that a full-text retrieval system has to solve in a Chinese environment. Lucene provides components for Chinese in its analysis package, mainly CJKAnalyzer, ChineseAnalyzer and StandardAnalyzer[11]. This system uses the open source Chinese word segmenter IKAnalyzer, a lightweight Chinese word segmentation toolkit written in Java.

The main idea of the dictionary-based maximum matching algorithm is:

1. Let Q be the query statement and n the length of the longest word in the dictionary library;
2. Take a substring of length n from the beginning of Q and match it against the words in the dictionary library one by one:
2.1 If the match succeeds, split the substring off from the head of Q as one word;
2.2 Take the next substring of n characters from the rest of Q;
2.3 Repeat the process until all of Q has been split;
2.4 Otherwise, if no dictionary word matches the n-character substring, remove its last character to obtain a substring of n-1 characters, and match the new substring against the dictionary library;
2.4.1 If the match succeeds, split the n-1 substring off from Q and repeat the process;
2.4.2 If the match fails, delete the last character of the n-1 substring and match the n-2 substring against the dictionary library, and so on until a match succeeds.

Index module

The central part of the full-text retrieval system is building the index. Once Chinese word segmentation is complete, the index can be built from the resulting keywords using the Lucene API. A Lucene index is usually divided into several sub-indexes, each of which is a segment, and a segment contains a set of searchable documents. This system mainly uses the IndexWriter class in the org.apache.lucene.index package to build the index.

Indexing in this system is triggered whenever the database changes. Database updates comprise additions, modifications and deletions. After the business logic for an update has run, a step is added to create or update the full-text index. The index management interface shown in Figure 1 provides a convenient way to manage the full-text index on top of the index module, including adding, deleting and merging indexes.
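The maximum matching steps of the text analysis module and the keyword index with add, update and delete operations can be sketched together in plain Java. This is only a minimal illustration under stated assumptions, not the actual implementation of IKAnalyzer or Lucene: the class name MaxMatchIndexSketch and its methods are hypothetical, the index is an in-memory map rather than Lucene segments on disk, and emitting an unmatched single character as its own word is an assumption that the steps above do not spell out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: forward maximum matching feeding a toy inverted index. */
class MaxMatchIndexSketch {
    private final Set<String> dictionary;
    private final int maxWordLength; // "n": length of the longest dictionary word
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    MaxMatchIndexSketch(Set<String> dictionary) {
        this.dictionary = new HashSet<>(dictionary);
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxWordLength = longest;
    }

    /** Splits text by repeatedly taking the longest dictionary word at the current position. */
    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int length = Math.min(maxWordLength, text.length() - start);
            // Drop the last character of the candidate until it is a dictionary word.
            while (length > 1 && !dictionary.contains(text.substring(start, start + length))) {
                length--;
            }
            // Assumption: a single character with no dictionary match becomes its own word.
            words.add(text.substring(start, start + length));
            start += length;
        }
        return words;
    }

    /** Adds or replaces a document, as would follow a database insert or update. */
    void index(int docId, String text) {
        delete(docId);
        for (String word : segment(text)) {
            postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Removes a document from every posting list, as would follow a database delete. */
    void delete(int docId) {
        postings.values().forEach(ids -> ids.remove(docId));
        postings.values().removeIf(TreeSet::isEmpty);
    }

    /** Returns the ids of documents containing the keyword, in ascending order. */
    List<Integer> search(String keyword) {
        return new ArrayList<>(postings.getOrDefault(keyword, new TreeSet<>()));
    }
}
```

With a dictionary containing "full", "text", "fulltext" and "search", segmenting the unspaced string "fulltextsearch" yields the words "fulltext" and "search", because the longest candidate is tried first. The same code applies unchanged to Chinese strings, since it only relies on substring lookups. Calling index again with an existing document id replaces the old postings, which mirrors the update-on-change mechanism described for the index module.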
Query module

Once the index library has been created, the required data can be queried from it. The query module is mainly responsible for receiving the requests submitted by users and handing the submitted queries to the retrieval server. The retrieval logic searches the index library for the keywords entered by the user and returns the sorted results to the user. The query interface shown in Figure 1 adds a user interface and a caching mechanism for query results on top of the Lucene query module, which improves the response speed for identical or similar queries from different users. The system provides a Web-based retrieval function: users open the retrieval page of the system in a browser and enter keywords to query.

Store module

After the index has been built, an index file is generated and saved on disk. Subsequent queries search the index file directly for documents containing the keyword. The index files are organized by segment: all files in the same segment share the same name but have different extensions[12]. For example, *.tii stores the index of terms and *.fdx stores the index of fields. In addition, there are three files without extensions: segments, deletable and lock. The segments file records the segments, the deletable file records the files that can be deleted, and the lock file controls read and write synchronization.

The established index is not static. It needs to be updated regularly as the terminal resources change: new data is added, changed data is modified, and obsolete data is deleted. The system updates the index by combining incremental updates with full updates.

Because Lucene is strongly object-oriented, data indexing and search only require calling its packaged interfaces. Therefore, based on a detailed analysis of the Lucene architecture and on the Lucene-based full-text retrieval framework designed above, the system implements the indexing and retrieval of classified text by implementing or calling Lucene's internal abstract classes.

Conclusion

Current research on the theory and application of enterprise content management is still insufficient to meet the needs of enterprises. In response, this paper analyzed the needs of enterprises, summed up practical experience of enterprise content management, and designed an intelligent enterprise content management system. The intelligence of the system is reflected in its combination of intelligent technologies, including data retrieval, with modern open source frameworks, which strengthens its content processing and analysis functions. On the basis of studying the Lucene toolkit, the paper analyzed its overall architecture and implemented full-text retrieval functions based on Lucene.

References

[1] Erik Hatcher. Lucene in Action-A Guide to the Java Search Engine[M]. Manning Publications Co, 2005.
[2] Zhou Haisong, Liu Jianming, Li Hong. Research and implementation of vertical search engine based on Lucene[J]. Journal of Guilin University of Electronic Technology, June 2014, Vol.34, No.3.
[3] Wang Feihong, Ding Zefa. Research and implementation of vertical search engine based on Lucene[J]. Electronic Technology & Software Engineering, April 2014.
[4] GAO Yan, GU Shi-wen, TAN Li-qiu, FEI Yao-ping. Design and implementation of search engine based on Lucene[J]. Microcomputer Development, October 2004, Vol.14, No.10: 81-84.
[5] IPTC. Documentation for NITF3.2[EB/OL]. http://www.nitf.org/iptc/nitf/3.2/documentation/nitf.html, 2003-10.
[6] WANG Li-yun, WANG Hua, CHEN Gang, YAO Nai-ming. Design and implementation of full-text retrieval based on Lucene[J]. Computer Engineering and Design, Dec. 2007, Vol.28, No.24: 5959-5961.
[7] SU Tan-ying, GUO Xian-yong, JIN Xin. Chinese Full-text Retrieval System Based on Lucene[J]. Computer Engineering, Dec. 2007, Vol.33, No.23: 94-96.
[8] ZHANG Jun, LI Lu-qun, ZHOU Rong. Research and Application of Search Engine Based on Lucene[J]. Computer Technology and Development, June 2013, Vol.23, No.6.
[9] Zhu Chonglai. Research and Implementation of Enterprise Search Engine System Based on Lucene[D]. Chongqing: Chongqing University of Technology, 2012.
[10] Li Wen. Research and Implementation on key technology for Chinese full text database index[D]. Shanghai: Fudan University, 2004.
[11] Yuan Tianyu. Research on Inter-Related Successive Trees model extension[D]. Shanghai: Fudan University, 2007.
[12] Cao Dayuan, He Haijun. Research and Application on full-text retrieval technology[J]. Computer Engineering, 2007, 28(6): 260-262.

Machine Tool Technology, Mechatronics and Information Engineering 10.4028/www.scientific.net/AMM.644-650 Research on Full-Text Retrieval Based on Lucene in Enterprise Content Management System 10.4028/www.scientific.net/AMM.644-650.1950