Development of Search Engines using Lucene: An Experience

Available online at www.sciencedirect.com
Procedia Social and Behavioral Sciences 18 (2011) 282-286
Kongres Pengajaran dan Pembelajaran UKM, 2010

Masnizah Mohd
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Malaysia

Abstract

Lucene is a Java library that performs indexing and searching. It supports the development of text-based information retrieval systems and applications such as search engines. This paper discusses the issues encountered, and shares the experience gained, in using Lucene during course project development. Lucene was used by 28 second-year students in the Information Science programme, who had to apply the Lucene library in the project for the Development of Search Engines (TP2433) course. Results from the analysis contribute guidelines for the future handling of the final TP2433 project.

© 2011 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and/or peer-review under responsibility of Kongres Pengajaran & Pembelajaran UKM, 2010.

Keywords: Lucene; Information Retrieval; Search Engines

1. Introduction

Information retrieval (IR) is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information (Salton 1968). It emphasizes the process of matching user queries against an index to find relevant documents. The main issue in this area is to obtain a good match, with a high similarity score, when the queries are compared with the document index. Search engines such as Google are practical applications of IR techniques to large-scale text collections. Search engines should incorporate the concepts, models, techniques and processes of IR. The two major components of a search engine are the indexing process and the query process (Croft et al. 2010).
The indexing process creates the data structures, or indexes, that make searching possible. The query process then uses these structures together with the user's query to generate a ranked list of documents. Figure 1 depicts the indexing process in search engines. It involves three components: text acquisition, text transformation and index creation, as described in Table 1.

Corresponding author. Tel.: +0-603-8921-6671; fax: +0-603-8926-7950. E-mail address: mas@ftsm.ukm.my
ISSN 1877-0428. doi:10.1016/j.sbspro.2011.05.040

Figure 1. Indexing process in search engines

Table 1. Three components of the indexing process in search engines

Process                  Description
A. Text acquisition      Identifies and stores documents for indexing. Documents are in various formats such as email, websites, memos, letters and articles.
B. Text transformation   Transforms documents into index terms or features, which involves lexical analysis (parsing, tokenizing, stopword removal, stemming).
C. Index creation        Takes index terms and creates data structures (indexes) to support fast searching.

Figure 2 shows the query process in search engines. It involves three components: user interaction, ranking and evaluation, as indicated in Table 2.

Figure 2. Query process in search engines
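
The text transformation and index creation components in Table 1 can be illustrated with a small, library-independent sketch. It is not Lucene code and not the method used in the course: the class name IndexingSketch, the short stopword list and the toy suffix-stripping stemmer are all assumptions made for illustration (a real system would use a proper stemmer such as Porter's and a far longer stopword list).

```java
import java.util.*;

public class IndexingSketch {
    // Hypothetical stopword list, kept tiny for the example.
    static final Set<String> STOPWORDS =
        Set.of("a", "an", "the", "is", "of", "and", "to", "in");

    // Text transformation: lowercase, tokenize, remove stopwords, stem.
    static List<String> transform(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) continue;
            terms.add(stem(token));
        }
        return terms;
    }

    // Toy stemmer: strips "ing" and a plural "s". This is NOT the Porter algorithm.
    static String stem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Index creation: inverted index mapping each term to the sorted ids
    // of the documents that contain it.
    static Map<String, SortedSet<Integer>> buildIndex(List<String> documents) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            for (String term : transform(documents.get(docId))) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "Indexing documents in search engines",
            "The searching process uses the index");
        // prints {document=[0], engine=[0], index=[0, 1], process=[1], search=[0, 1], use=[1]}
        System.out.println(buildIndex(docs));
    }
}
```

Because "indexing" and "index" reduce to the same term, both documents end up in the postings for "index"; this is the kind of effect stemming has on index quality.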

Table 2. Three components of the query process in search engines

Process               Description
A. User interaction   Supports creation and refinement of the query, and display of results.
B. Ranking            Uses the query and indexes to generate a ranked list of documents.
C. Evaluation         Monitors and measures effectiveness and efficiency (primarily offline).

2. Lucene

Lucene is an open source Java library that supports the processes and techniques of information retrieval. Commercial applications such as Amazon are among those that use Lucene for indexing and effective searching. Lucene can index text from various formats such as PDF, HTML and Microsoft Word, and in various languages. The key classes used to build search engines are (Paul 2004):

a. Document - The Document class represents a document in Lucene. We index Document objects and get Document objects back when we do a search.
b. Field - The Field class represents a section of a Document. The Field object contains a name for the section and the actual data.
c. Analyzer - The Analyzer class is an abstract class that provides an interface for taking a Document and turning it into tokens that can be indexed. There are several useful implementations of this class, but the most commonly used is the StandardAnalyzer class.
d. IndexWriter - The IndexWriter class is used to create and maintain indexes.
e. IndexSearcher - The IndexSearcher class is used to search through an index.
f. QueryParser - The QueryParser class is used to build a parser that can search through an index.
g. Query - The Query class is an abstract class that contains the search criteria created by the QueryParser.
h. Hits - The Hits class contains the Document objects that are returned by running the Query object against the index.
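
To make the division of labour between indexing (IndexWriter), searching (IndexSearcher) and the returned hits more concrete without depending on a particular Lucene release, the following plain-JDK sketch builds a small term-frequency index and ranks documents against a query. It deliberately does not call the Lucene API: the class RankingSketch, its methods and the scoring rule (sum of raw term frequencies) are illustrative assumptions and much simpler than Lucene's actual scoring. Note also that later Lucene releases replaced the Hits class described above with TopDocs.

```java
import java.util.*;

public class RankingSketch {
    // Lowercase and split on non-alphanumeric characters
    // (no stopword removal or stemming in this sketch).
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Indexing step: term -> (document id -> term frequency in that document).
    static Map<String, Map<Integer, Integer>> buildIndex(List<String> documents) {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            for (String term : tokens(documents.get(docId))) {
                index.computeIfAbsent(term, t -> new HashMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }

    // Search step: add up the frequencies of the distinct query terms per
    // document and return document ids, highest score first (ties by id).
    static List<Integer> search(Map<String, Map<Integer, Integer>> index, String query) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String term : new LinkedHashSet<>(tokens(query))) {
            Map<Integer, Integer> postings = index.getOrDefault(term, Map.of());
            for (Map.Entry<Integer, Integer> posting : postings.entrySet()) {
                scores.merge(posting.getKey(), posting.getValue(), Integer::sum);
            }
        }
        List<Integer> hits = new ArrayList<>(scores.keySet());
        hits.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "Lucene is a Java library for indexing and searching",
            "Search engines rank documents",
            "Indexing with Lucene: Lucene builds an index");
        Map<String, Map<Integer, Integer>> index = buildIndex(docs);
        // prints [2, 0]: document 2 scores 3, document 0 scores 2, document 1 has no match
        System.out.println(search(index, "lucene indexing"));
    }
}
```

The index is built once and reused for every query, which mirrors why a search engine separates the indexing process from the query process.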
Lucene supports extensive search criteria such as Boolean (AND, OR and NOT) searches, fuzzy searches, proximity searches, wildcard searches and range searches. Some examples of searches are:

i. Find all articles by Masnizah related to information retrieval and search engines:
   author:masnizah "information retrieval" AND "search engines"
ii. Find all articles that contain the phrase "clustering" and exclude "Single Pass":
   "clustering" NOT "Single Pass"
iii. Find all articles written by Masnizah in February this year:
   author:masnizah date:[02/01/2011 TO 02/28/2011]

3. Methodology

The TP2433 course is compulsory for second-year Information Science students at the Faculty of Information Science and Technology (FTSM). It was offered in semester 1. A total of 28 students worked in groups, and they were the first batch to use Lucene. They were asked to use Lucene in their project, which carried 30% of the course assessment. The students had to develop an IR system or search engine that applied the IR techniques learnt during lectures, so that their understanding of IR concepts could be enhanced through the use of Lucene during project implementation. For example, students can deepen their understanding of the indexing process by constructing an index with Lucene. Assessment of the project is divided into two categories:

i. Presentation (10%)
   a. The ability to explain the program.
   b. The ability to answer questions from the lecturer and colleagues.
ii. Development of search engines (20%)
   a. The usage of classes in Lucene: evaluates the students' ability to use and customize Lucene classes in their project.
   b. Index creation: the quality of the index created after the lexical analysis process.
   c. The searching process: offering various search criteria, such as Boolean and wildcard searches, is an advantage.
   d. The accuracy of results: the ability of the search engine to return accurate and relevant results based on the user query.

Students were given a questionnaire after the TP2433 course. The questionnaire aims to validate the students' achievement of the learning outcomes based on their own understanding. The scale used in the questionnaire was:

1 - Unachieved
2 - Less achieved
3 - Average
4 - Achieved
5 - Well achieved

4. Results

We analysed the students' achievement of the learning outcomes based on their understanding. The results are shown in Table 3.

Table 3. Percentage of students' achievement of the learning outcomes based on their understanding

LO    Scale 1  Scale 2  Scale 3  Scale 4  Scale 5   Learning outcome
LO1     0.0      0.0      0.0     32.1     67.9     Able to define the concept of information retrieval (IR) and search engines (SE).
LO2     0.0      0.0      3.6     25.0     71.4     Able to identify the components, techniques and models of IR.
LO3     0.0      0.0     17.9     25.0     57.1     Able to explain the process of an IR system.
LO4     0.0      0.0      0.0      7.1     92.9     Able to identify, analyse and evaluate the effectiveness of search engines.
LO5     3.6     10.7     28.6     46.4     10.7     Able to develop search engines or an IR system using the principles and techniques learned.

Most of the students agreed that they had achieved the first learning outcome (LO1), being able to define the concepts of IR and SE.
67.9% of students agreed that LO1 was well achieved (scale 5) and 32.1% agreed that it was achieved (scale 4). For the second learning outcome (LO2), on the students' ability to identify the components, techniques and models of IR, 71.4% of students agreed that it was well achieved (scale 5) and 25% agreed that it was achieved (scale 4). The findings also indicate that most students agreed that they had achieved the third learning outcome (LO3), on the ability to explain the process of an IR system: 57.1% of students agreed that LO3 was well achieved (scale 5) and 25% agreed that it was achieved (scale 4).
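
Table 3 reports percentages only. As a reading aid for its LO5 row, the short sketch below converts the percentages to head counts and compares the students who agreed (scale 4 and 5) with those who did not (scale 1 and 2). It assumes that all 28 students answered the questionnaire, which the paper does not state explicitly; the class name Lo5Ratio and its methods are hypothetical.

```java
public class Lo5Ratio {
    // Convert percentages of respondents to head counts,
    // rounding each to the nearest whole student.
    static long[] counts(double[] percents, int respondents) {
        long[] result = new long[percents.length];
        for (int i = 0; i < percents.length; i++) {
            result[i] = Math.round(percents[i] * respondents / 100.0);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] lo5 = {3.6, 10.7, 28.6, 46.4, 10.7}; // LO5 row of Table 3, scale 1..5
        int respondents = 28;                          // assumption: every student responded
        long[] c = counts(lo5, respondents);           // 1, 3, 8, 13, 3 students
        long agree = c[3] + c[4];                      // scale 4 and 5
        long disagree = c[0] + c[1];                   // scale 1 and 2
        // prints agree=16 disagree=4 ratio=4.0
        System.out.println("agree=" + agree + " disagree=" + disagree
            + " ratio=" + (double) agree / disagree);
    }
}
```

Under that assumption, 16 students (57.1%) rated LO5 as achieved or well achieved against 4 students (14.3%) who rated it as unachieved or less achieved, with 8 students (28.6%) in the middle.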

The fourth learning outcome (LO4) received the highest percentage, with 92.9% of students agreeing that it was well achieved. This indicates that the students were able to measure the effectiveness and efficiency of search engines.

The focus of this paper is on the fifth learning outcome (LO5), the ability to develop search engines. The results show that students who agreed (scale 4 and 5) that they could develop a search engine using the principles and techniques learned outnumbered those who disagreed (scale 1 and 2) by a ratio of about 4 to 1 (57.1% against 14.3%). It was observed that students initially misperceived Lucene as a complete IR system rather than a library. This misunderstanding was gradually resolved through explanation and demonstration of how Lucene is applied in the final project. Based on the project presentations, students used 70% of the primary classes in Lucene, but their ability to develop a search engine can still be improved. The quality of the indexes created can also be improved, because no stemming was applied. Almost 60% of students used exact matching and Boolean operators in the search process. Finally, the search engines returned relevant documents for user queries, although overall accuracy was still at a moderate level.

These are the guidelines for the future conduct of the TP2433 course:

i. TP2433 course offering
   The course should be offered in the second semester, after the students have acquired good Java programming skills. This is because using Lucene relies heavily on Java skills; Lucene only provides the classes that perform the IR techniques.
ii. Course basic requirement
   TP2433 should have the Java programming course as a prerequisite. Students who do not have a Java programming background will have difficulty understanding the Lucene framework.
   It was observed that these students had weak programming skills, which affected their contribution to the project.

5. Conclusion

Lucene was introduced for the first time in the TP2433 course. Lucene is the state of the art in document preprocessing and index creation, so it was appropriate to require the students to use it. Using Lucene should increase their understanding of IR theory and of approaches to developing a search engine, and should let them see the practicality of IR techniques. This paper has identified the requirements for the course project development and has therefore produced guidelines for the future conduct of the TP2433 course.

6. Acknowledgement

The author would like to thank the Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, for the opportunity to conduct this research.

References

Croft, W.B., Metzler, D. & Strohman, T. (2010). Search Engines: Information Retrieval in Practice. London: Pearson.
Paul, T. (2004). The Lucene Search Engine. http://www.javaranch.com/journal/2004/04/lucene.html [2 November 2010].
Salton, G. (1968). Automatic Information Organization and Retrieval. New York: McGraw-Hill.