IMPLEMENTATION OF DIGITAL LIBRARY USING HDFS AND SOLR

H. K. Khanuja, Amruta Mujumdar, Manashree Waghmare, Mrudula Kulkarni, Mrunal Bajaj
Department of Computer Engineering, Marathwada Mitra Mandal's College of Engineering, Pune, India

ABSTRACT: The majority of digital libraries worldwide are implemented on a client-server architecture. Such libraries require a large infrastructure investment in server setup and maintenance. Servers are also difficult to scale in response to changing institutional requirements and needs. These drawbacks can be overcome by using a distributed database system. Apache Hadoop is a large-scale data processing project that supports data-intensive distributed applications. Hadoop applications use a distributed file system for data storage called the Hadoop Distributed File System (HDFS). Apache Solr is used for processing and searching the required data, and provides search features such as full-text search and hit highlighting. Solr searches an index instead of scanning the text itself, which is how it achieves fast search responses.

Keywords: Hadoop, HDFS, Digital Library, Replication, SOLR

[1] INTRODUCTION

A digital library is a framework for storing large amounts of data and accessing it quickly and effectively. The Hadoop Distributed File System is a platform that permits storage of large data volumes with easy scalability and strong fault tolerance at low operating cost [4]. Hadoop has already seen rapid adoption among multinational organizations such as Facebook, Amazon, and Yahoo!, which run fully operational Hadoop clusters serving large amounts of data every day. Apache Solr is an efficient platform for processing that data: it searches an index of the data rather than the data itself.

[2] EXISTING SYSTEM

Most digital library systems today use a client-server architecture [1]. Scalability is the major issue with such systems: although servers are considered scalable, the cost of scaling is quite high, and it adds to the burden of regular server maintenance. Server-based systems also raise further issues, such as downtime and fault tolerance. Systems built around a single server unit are more likely to fail when handling higher loads or bandwidth, and failed systems in turn lead to downtime. Some setups, such as RAID, are supported by server systems, but they tend to be expensive and are still not fully resistant to faults. Many systems are implemented using the PaaS (Platform as a Service) approach to cloud architecture.

These systems also suffer from scalability issues: PaaS systems are not easy to scale and require a significantly higher infrastructure cost than client-server systems, which is why most institutions and universities do not opt for cloud setups.

[3] PROPOSED SYSTEM

The proposed system focuses on the scalability and fault-tolerance issues present in the existing system, and on storing and processing huge amounts of data. The main goal of the system is to retrieve the required data from a large dataset in less time. For this purpose, HDFS is used to store large amounts of data. HDFS handles scalability by providing dynamic allocation of data nodes [3, 5], and handles load balancing and fault tolerance through its replication factor [3, 5]. These built-in features make HDFS efficient. HDFS is a storage platform; for processing, Apache Solr is used. Here, processing means searching for specific data within a large dataset, and an indexing technique is used for faster searching. Apache Solr provides features that result in automatic searching and sorting of data [8]. The proposed system thus overcomes the issues of the existing system and builds an advanced platform to store and process data using HDFS and Apache Solr.

[4] HDFS ARCHITECTURE

HDFS mainly comprises three components:
1. Name Node
2. Data Node
3. Secondary Name Node

The Name Node stores metadata about the data, such as location, size, permissions, and hierarchy. It divides the data into chunks, allocates them to specific data nodes, and maintains the address of each data node. The Name Node is also called the master node, whereas the data nodes are called slave nodes. A Data Node is responsible for the actual storage of data, which is kept on it in the form of chunks. The Secondary Name Node is also called the helper node, as it helps the Name Node recreate its image in case of failure. This architecture provides features such as load balancing, replication, and scalability [3, 4, 5]. [Figure-1] below shows the HDFS architecture.

Figure 1. HDFS Architecture
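To make the storage side concrete, the sketch below writes a file into HDFS through the Hadoop Java API (FileSystem). It is a minimal illustration under stated assumptions, not the paper's implementation: the NameNode address, the /library path, and the file contents are placeholders, and the replication factor of 3 mirrors the fault-tolerance discussion above.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LibraryUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; a real client would normally read
        // fs.defaultFS from core-site.xml instead of hard-coding it.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        // Replication factor 3: each chunk (block) is copied to three
        // DataNodes, which is what gives HDFS its fault tolerance.
        conf.set("dfs.replication", "3");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/library/books/sample.txt"))) {
            // The NameNode records only metadata; the bytes themselves are
            // streamed to the DataNodes that hold the file's blocks.
            out.write("example document contents".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

The same upload can be performed from the shell with hdfs dfs -put sample.txt /library/books/.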

[5] INITIALIZATION OF SOLR

In Apache Solr, a document is the unit of search and indexing. Before adding documents to Solr, we need to specify the schema, represented in a file called schema.xml. When Tomcat starts up, it begins by reading the related configuration files, including solr.xml, solrconfig.xml, schema.xml, and other related XML configuration files; it then finishes loading the related Solr handlers and the configuration for the searching and indexing processes. Before actual indexing, the data is analyzed: the result of the analysis phase is a series of tokens, and the actual indexing is then performed on those tokens. Tokens are the keywords used when searching the text [8, 9, 10]. A minimal sketch of this index-then-search flow is given after the feature list below. [Figure-2] shows the working and general architecture of Solr.

Figure 2. SOLR Architecture

[6] SYSTEM FEATURES

A. Availability
The system will be available 24/7, as it will be deployed over an intranet.

B. Flexibility
The system is highly flexible, since a web user interface is provided. The user can log in from any computer and work from anywhere within an organization's campus.

C. Interoperability
HDFS works well with other systems while managing storage through its own name node and data nodes.

D. Reliability
HDFS itself is reliable; its built-in replication guards against data loss.
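The sketch below illustrates the index-then-search flow of Section [5] using the SolrJ client. It is illustrative only: the core name "library", the field names id and title (which would have to be declared in schema.xml), and the document values are assumptions, not part of the paper's deployment.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class LibrarySearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical core named "library" on a local Solr instance.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/library").build()) {

            // Indexing: the analysis phase tokenizes the fields declared in
            // schema.xml, and the resulting tokens go into the inverted index.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "book-001");
            doc.addField("title", "Hadoop: The Definitive Guide");
            solr.add(doc);
            solr.commit();

            // Searching: the query runs against the index rather than the raw
            // text, and hit highlighting marks the matched tokens.
            SolrQuery query = new SolrQuery("title:hadoop");
            query.setHighlight(true);
            query.addHighlightField("title");
            QueryResponse response = solr.query(query);
            response.getResults()
                    .forEach(d -> System.out.println(d.getFieldValue("title")));
        }
    }
}
```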

[7] MATHEMATICAL MODEL

Input: I = different types of files from the user.
Output: O = a properly searched and sorted list of data (files) according to hit rates.
Functions:
f1 = function for taking inputs,
f2 = function for storing data,
f3 = function for indexing,
f4 = function for searching,
f5 = function for displaying the data as per the hit rates.
Taken together, the system computes O = f5(f4(f3(f2(f1(I))))).
Success conditions:
succ1 = The user enters a valid user-name and password, and only then is access to the library granted.
succ2 = If the file is present in the database, it is displayed according to the user's requirement, ordered by hit rate.
succ3 = If the file is not present, an appropriate message is displayed.
succ4 = According to the user's requirement, tasks such as previewing or downloading a file are performed.
Failure conditions:
fail1 = The user enters an invalid user-name or password and access is still granted.
fail2 = The file is present in the database but the system reports it as not present, or vice versa.
fail3 = Wrong hit rates are assigned to files, or files are sorted incorrectly.

[8] CONCLUSION

Deploying a digital library on HDFS overcomes many of the issues, such as scalability, fault tolerance, and load balancing, that arise in a conventional client-server architecture; the built-in features of HDFS make it possible to overcome these drawbacks. Storage is handled by HDFS, and searching is done by Solr, whose indexing techniques provide fast searching. As a result, users of this system can access the data they need in less time and more effectively.

REFERENCES

[1] Piyush Puranik, Umesh Kumar Toke, Mohamadali Shaikh, Mohit Raimule, Chetan Patil, "A Study on Digital Library using Hadoop Distributed File System," Department of Computer Engineering, University of Pune, International Journal of Innovative Research in Advanced Engineering (IJIRAE), ISSN: 2349-2163, March 2015, www.ijirae.com
[2] Jeffrey Shafer, Scott Rixner, and Alan L. Cox, "The Hadoop Distributed File System: Balancing Portability and Performance," Rice University, Houston, TX, IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 28-30 March 2010.
[3] Hadoop: The Definitive Guide, ISBN-13: 978-93-502-756-4
[4] http://adcalves.wordpress.com/2010/12/12/a-hadoop-primer/
[5] http://hadoop.apache.org/
[6] http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single
[7] http://www.cloudera.com/content/cloudera/en/why-cloudera/hadoop
[8] http://www.solrtutorial.com
[9] Apache Solr Reference Guide
[10] Solr in Action