Research on Full-text Retrieval based on Lucene in Enterprise Content Management System
Lixin Xu 1, a, XiaoLin Fu 2, b, Chunhua Zhang 1, c

Applied Mechanics and Materials, ISSN: 1662-7482, Vols. 644-650, pp 1950-1953
doi:10.4028/www.scientific.net/amm.644-650.1950
Submitted: 2014-07-18; Accepted: 2014-07-21; Online: 2014-09-22
© 2014 Trans Tech Publications, Switzerland

1 Staff Room of Computer, Aviation University of Airforce, Changchun, 130022, China
2 College of Software, Changchun University of Technology, Changchun, 130012, China
a email: henryviolet@sina.com, b email: audreyfxl@gmail.com, c email: zch-cc@163.com

Keywords: Lucene; Full-text Retrieval; Data Retrieval; Enterprise Content Management System

Abstract. This paper studies Lucene search engine technology in an enterprise content management system and extends it effectively. It implements four function modules in a Lucene-based full-text retrieval architecture: a text analysis module, an index module, a query module and a store module. The intelligence of the enterprise content management system lies in its integration of intelligent technologies, including data retrieval, with modern open source frameworks. This strengthens the system's content processing and analysis functions.

Introduction

The information content of enterprises grows swiftly as the informationization of society develops. Enterprises have to deal with the collection, creation, storage, management, publication, search and servicing of information content. Building on basic content management and document management, enterprise content management further integrates enterprise technologies such as collaboration and knowledge. As a result, content management technology is developing in the following directions: heterogeneous, standardized, intelligent and platform-based.
This paper studies Lucene search engine technology in an enterprise content management system and extends it so that full-text retrieval is implemented on top of Lucene. The intelligence of the enterprise content management system lies in its integration of intelligent technologies, including data retrieval, with modern open source frameworks. This strengthens the system's content processing and analysis functions.

Data retrieval

Data retrieval technology helps users locate the required content quickly. Because information sources, storage formats and systems, and access and retrieval methods all differ, integrating heterogeneous resources raises a retrieval problem: users need to access and retrieve content in a uniform way. Under the pressure of massive data volumes and concurrent retrieval, retrieval performance has to be ensured by combining distributed cluster retrieval, caching and load balancing. The system effectively combines Lucene-based full-text retrieval with data retrieval.

Lucene[1] is a subproject of Apache. It is an open source full-text retrieval engine toolkit. It is not a complete full-text retrieval engine but an architecture for one. It provides a complete search engine, an index engine and a partial text analysis engine[2]. Lucene's purpose is to give software developers a simple, easy-to-use toolkit with which they can add full-text retrieval functions to a target system or build a complete full-text retrieval engine on top of it[3].

The advantages of Lucene[4] are as follows: the index file format is independent of the application platform; it implements partitioned indexes on the basis of the inverted index used in traditional full-text retrieval engines[5]; it has an excellent object-oriented system architecture[6]; and it defines a text analysis interface that is independent of language and file format.
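The partitioned inverted index listed among Lucene's advantages can be illustrated with a minimal sketch. It is not Lucene's implementation or file format: the function names (`tokenize`, `build_segment`, `search`, `merge_segments`) are hypothetical, the index lives in memory, and lowercase whitespace tokenization is assumed in place of a real analyzer. It only shows the idea that each partition (segment) is its own small inverted index, that a query consults every segment, and that segments can later be merged.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for a real analyzer.
    return text.lower().split()


def build_segment(docs):
    """Build one segment: a mapping from term to a sorted list of document ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(segments, term):
    """Look the term up in every segment and combine the hits."""
    hits = set()
    for segment in segments:
        hits.update(segment.get(term.lower(), []))
    return sorted(hits)


def merge_segments(segments):
    """Merge several segments into a single segment."""
    merged = defaultdict(set)
    for segment in segments:
        for term, ids in segment.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


# Two segments built at different times; a query sees both.
segments = [
    build_segment({1: "Lucene full-text retrieval", 2: "enterprise content management"}),
    build_segment({3: "Lucene index segments"}),
]
print(search(segments, "lucene"))  # [1, 3]
```

Keeping new documents in a fresh segment avoids rewriting the existing postings on every addition; merging then trades one larger rewrite for fewer lookups per query.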

Lucene consists of three parts: the basic encapsulation structure, the index core and the external interface. The index core, which operates on the index files directly, is the key part of the system. The Lucene source code is divided into seven modules[7].

Full-text retrieval based on Lucene

The analysis above describes the structural characteristics of Lucene. On this basis, the paper extends Lucene into a comprehensive full-text retrieval engine, on top of which all kinds of application systems can be built[8]. Lucene is an architecture for a full-text retrieval engine. It contains many abstract classes, interfaces, document types, scoring logic and so on[9], and it leaves the concrete implementation to be defined by the specific application.

Developing Lucene-based full-text retrieval in the intelligent enterprise content management system involves the following considerations. First, the system needs to build a retrieval index so that retrieval is possible. Second, it needs to implement the pages of the Lucene retrieval platform, through which users interact with the system and view the retrieval results. Based on these considerations, the architecture of Lucene-based full-text retrieval in the system is shown in Figure 1.

Figure 1 The architecture of full-text retrieval based on Lucene

Figure 1 shows the four function modules in the architecture of full-text retrieval based on Lucene: the text analysis module, the index module, the query module and the store module.

Text analysis module

Text analysis is the critical first step in building a full-text retrieval system, and it is an important factor that shapes the overall system. Word segmentation is a key technology in text analysis. It cuts the file to be indexed into keywords and then builds an inverted index over those keywords. Chinese word segmentation differs from English word segmentation.
English words are delimited by spaces, but a single Chinese character often carries no meaning on its own. Chinese text must be divided into meaningful words, each composed of individual characters, and it has no natural delimiter like the space in English[10]. This increases the complexity of word segmentation. Chinese word segmentation is therefore the first problem that a full-text retrieval system has to solve in a Chinese environment. Lucene provides Chinese components in its analysis package, mainly CJKAnalyzer, ChineseAnalyzer and StandardAnalyzer[11]. This system uses the open source Chinese word segmenter IKAnalyzer, a lightweight Chinese word segmentation toolkit based on Java. The main idea of the maximum matching algorithm based on a dictionary library is:

1. Let Q be the query statement and let n be the length of the longest word in the dictionary library;

2. Take a substring of length n from the beginning of Q and match it against the words in the dictionary library one by one:
2.1 If the match succeeds, the substring is split off from the head of Q as one word;
2.2 Take a substring of n characters from the rest of Q;
2.3 Repeat the process until Q has been split completely;
2.4 Otherwise, if no word in the dictionary library matches the n-character substring, remove its last character to obtain a substring of n-1 characters, and match the new substring against the dictionary library;
2.4.1 If the match succeeds, the (n-1)-character substring is removed from Q and the process is repeated.
2.4.2 If the match fails, delete the last character of the (n-1)-character substring and match the resulting (n-2)-character substring against the dictionary library, continuing in this way until a match succeeds.

Index module

The most central part of the full-text retrieval system is the building of the index. Once Chinese word segmentation is complete, the index can be built from the keywords using the Lucene API. A Lucene index is usually divided into many sub-indexes, each of which is a segment. A segment contains a number of searchable documents. The system mainly uses the IndexWriter class in the org.apache.lucene.index package to build the index. The indexing mechanism in this system is that any change in the database triggers index building. Database updates include adding, changing and deleting records. After the business logic for such an update has run, a step is added that builds or updates the full-text index. The index management interface shown in Figure 1 provides a friendly way to manage the full-text index on top of the index module, including adding, deleting and merging indexes.
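The maximum matching steps from the text analysis module, followed by the keyword indexing and update handling described for the index module, can be sketched together as follows. This is a minimal in-memory illustration, not IKAnalyzer's implementation and not Lucene's IndexWriter API: `fmm_segment` and `KeywordIndex` are hypothetical names, the dictionary is a toy example, and emitting an unmatched single character as its own token is an assumption, since the steps above do not say what happens when nothing matches.

```python
from collections import defaultdict


def fmm_segment(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at the head of the text."""
    if not dictionary:
        return list(text)
    max_len = max(len(word) for word in dictionary)  # n in the steps above
    tokens = []
    pos = 0
    while pos < len(text):
        # Start with the longest candidate and drop the last character on each miss.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # Assumption: an unmatched single character is emitted as its own token.
            if candidate in dictionary or size == 1:
                tokens.append(candidate)
                pos += size
                break
    return tokens


class KeywordIndex:
    """Toy inverted index: keyword -> set of document ids."""

    def __init__(self, dictionary):
        self.dictionary = set(dictionary)
        self.postings = defaultdict(set)
        self.doc_terms = {}  # doc id -> keywords, kept so an entry can be undone

    def add(self, doc_id, text):
        self.delete(doc_id)  # re-adding an existing id acts as an update
        terms = set(fmm_segment(text, self.dictionary))
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, keyword):
        return sorted(self.postings.get(keyword, set()))


dictionary = {"企业", "全文", "检索", "全文检索", "系统"}
print(fmm_segment("企业全文检索系统", dictionary))  # ['企业', '全文检索', '系统']

index = KeywordIndex(dictionary)
index.add(1, "企业全文检索系统")  # database insert triggers indexing
index.add(2, "检索系统")
print(index.search("系统"))  # [1, 2]
```

Because the longest match wins, "全文检索" is kept as one keyword even though "全文" and "检索" are both in the dictionary, so document 1 in this sketch is not found under "检索" alone. Calling `add` again with an existing id models a changed database record, and `delete` models a removed one.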
Query module

After the index library has been created, the required data can be queried from it. The query module is mainly responsible for responding to requests submitted by users and handing the submitted queries to the retrieval server. The retrieval logic searches the index library for the keywords entered by the user and returns the sorted results to the user. On top of the Lucene query module, the query interface shown in Figure 1 adds a user interface and a caching mechanism for query results, which improves the response speed when different users issue the same or similar queries. The paper provides a Web-based retrieval function: users access the retrieval page of the system through a browser and enter keywords to query.

Store module

After the index has been built, an index file is generated and saved on disk. Subsequent queries search this index file directly for documents containing the keyword. The index files are stored in groups by segment. All files in the same segment share the same name but have different extensions[12]; for example, *.tii holds the term index and *.fdx holds the field index. In addition, there are the segments, deletable and lock files, which have no file extension. The segments file records the segments, the deletable file records the files that can be deleted, and the lock file controls read and write synchronization.

The established index is not static. It needs to be updated regularly as the terminal resources change: new data is added, changed data is updated, and obsolete data is deleted. The system updates the index by combining incremental updates with full updates. Because Lucene is strongly object-oriented, data indexing and search only require calling its packaged interfaces. Therefore, the system completed the indexing and retrieval of

classified text by implementing or calling Lucene's internal abstract classes, on the basis of a detailed analysis of the Lucene architecture and of the designed Lucene-based full-text retrieval framework.

Conclusion

Current research on the theory and application of enterprise content management is still insufficient to meet the needs of enterprises. Faced with this situation, the paper analyzed the needs of enterprises, summed up practical experience of enterprise content management, and designed an intelligent enterprise content management system. The intelligence of the system lies in its integration of intelligent technologies, including data retrieval, with modern open source frameworks, which strengthens its content processing and analysis functions. On the basis of a study of the Lucene toolkit, the paper analyzed its overall architecture and implemented full-text retrieval functions based on Lucene.

References

[1] Erik Hatcher. Lucene in Action: A Guide to the Java Search Engine[M]. Manning Publications Co, 2005.
[2] Zhou Haisong, Liu Jianming, Li Hong. Research and implementation of vertical search engine based on Lucene[J]. Journal of Guilin University of Electronic Technology, June 2014, Vol.34, No.3.
[3] Wang Feihong, Ding Zefa. Research and implementation of vertical search engine based on Lucene[J]. Electronic Technology & Software Engineering, April 2014.
[4] Gao Yan, Gu Shi-wen, Tan Li-qiu, Fei Yao-ping. Design and implementation of search engine based on Lucene[J]. Microcomputer Development, October 2004, Vol.14, No.10: 81-84.
[5] IPTC. Documentation for NITF3.2[EB/OL]. http://www.nitf.org/iptc/nitf/3.2/documentation/nitf.html, 2003-10.
[6] Wang Li-yun, Wang Hua, Chen Gang, Yao Nai-ming. Design and implementation of full-text retrieval based on Lucene[J]. Computer Engineering and Design, Dec. 2007, Vol.28, No.24: 5959-5961.
[7] Su Tan-ying, Guo Xian-yong, Jin Xin. Chinese Full-text Retrieval System Based on Lucene[J]. Computer Engineering, Dec. 2007, Vol.33, No.23: 94-96.
[8] Zhang Jun, Li Lu-qun, Zhou Rong. Research and Application of Search Engine Based on Lucene[J]. Computer Technology and Development, June 2013, Vol.23, No.6.
[9] Zhu Chonglai. Research and Implementation of Enterprise Search Engine System Based on Lucene[D]. Chongqing: Chongqing University of Technology, 2012.
[10] Li Wen. Research and Implementation on key technology for Chinese full text database index[D]. Shanghai: Fudan University, 2004.
[11] Yuan Tianyu. Research on Inter-Related Successive Trees model extension[D]. Shanghai: Fudan University, 2007.
[12] Cao Dayuan, He Haijun. Research and Application on full-text retrieval technology[J]. Computer Engineering, 2007, 28(6): 260-262.

Machine Tool Technology, Mechatronics and Information Engineering, doi:10.4028/www.scientific.net/AMM.644-650