Fast and Effective System for Named Entity Recognition on Big Data


International Journal of Computer Sciences and Engineering — Open Access Research Paper, Volume-3, Issue-2, E-ISSN: 2347-2693

Fast and Effective System for Named Entity Recognition on Big Data

Jigyasa Nigam and Sandeep Sahu
Department of CS, SRIT, Jabalpur, RGPV University
jigyasa_926@yahoo.co.in, sandeep.sahu12@gmail.com
www.ijcaonline.org

Received: Jan/9/2015    Revised: Jan/8/2015    Accepted: Jan/2/2015    Published: Jan/31/2015

Abstract — Today almost all data is stored in digital form, and the data size is very large. The problem is how to manage this big data and extract information from it with speed and efficiency. Information extraction is a technique used in text mining: it extracts the information a user requires from unstructured text, using NLP (Natural Language Processing) and NER (Named Entity Recognition). NER systems help a machine recognize proper nouns (entities), events, relationships, and so on. Several NER systems exist, such as GATE, CRFClassifier, OpenNLP, and Stanford NLP. These NER systems work fast for a limited number of documents, but their drawback is that they become slow on huge amounts of data. To overcome this drawback, this paper reports the implementation of an NER system based on MapReduce, a distributed programming model. This improvement achieves faster extraction and reduced storage cost with better performance.

Keywords — Distributed computing, Big textual data, Named Entity Recognition (NER), Natural Language Processing (NLP), MapReduce, Hadoop, Maxent Tagger

1. INTRODUCTION

Information extraction is the process of extracting data from unstructured, semi-structured, and structured sources. Unstructured data is not organized in any pre-defined manner or model; structured data is organized in a pre-defined manner or model.
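As a toy illustration of extracting structure from unstructured text, the sketch below spots candidate entities using a naive capitalization heuristic. The heuristic and the function name are illustrative assumptions only, not the NER method used in this paper:

```python
import re

def toy_ner(text):
    """Toy entity spotter (illustration only, not Stanford NER):
    treat each run of capitalized words as one candidate entity."""
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)

print(toy_ner("Barack Obama visited Google in California"))
# ['Barack Obama', 'Google', 'California']
```

A real NER system replaces this heuristic with a statistical model, which is what makes it slow on huge document collections and motivates the distributed design below.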
Information is generally extracted from human-language texts by means of natural language processing (NLP) [1][8]. Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, and video, can also be seen as information extraction. Information extraction is a technology that originates from the user's point of view on the existing information world: rather than indicating which documents a user needs to read, it extracts the pieces of information that are relevant to the user's needs. Links between the extracted information and the original documents are maintained so that the user can refer back to the context. Information comes in different shapes and sizes; one important form is structured data, where there is a regular and predictable organization of entities and relationships.

Within information extraction, NER systems play an increasingly important role. They can reliably recognize and extract entities such as persons and organizations. The problem is that on huge amounts of data their processing speed becomes very slow, degrading both time and space complexity [4][1]. To improve speed and reduce space, the work proposed in this paper applies an NER system within the distributed MapReduce framework; using MapReduce with the NER system yields fast information extraction and a reduced output size without losing accuracy. This paper uses one of these NER systems, the Stanford POS tagger (part-of-speech tagger) [8], whose Maxent Tagger model is used to extract information in the form of named entities.

The paper is organized as follows. Section 2 introduces related work. Section 3 reports the design of the proposed distributed text-parsing system. Section 4 presents the experimental results, and Section 5 gives the conclusion.

2. Related Work

This section introduces the Stanford POS tagger [8], the MapReduce programming model [6], and Hadoop [6], which are used by the proposed system in single-machine and distributed environments.
First, the Stanford parser was proposed by the NLP lab of Stanford University in the 1990s. The proposed system uses the Maxent Tagger model of the part-of-speech tagger from the Stanford parser. This POS tagger is faster than other available taggers, and the Maxent Tagger model is faster and more accurate than other existing models. It uses a tokenization method in which each and every word becomes a token, which increases parsing accuracy, gives the best parsing result, and makes named entities easy to extract.

MapReduce is a programming model for expressing distributed computations on huge amounts of data, together with an execution framework for large-scale data processing on clusters of commodity servers [5]. It was originally developed by Google and built on well-known principles of parallel and distributed processing that had already been introduced several decades earlier. MapReduce has since enjoyed pervasive adoption via an open-source implementation called Hadoop, whose development was led by Yahoo (it is now an Apache project).

© 2015, IJCSE All Rights Reserved
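The map/shuffle/reduce flow described above can be illustrated with a minimal, single-process Python sketch (an illustration of the programming model only, not of Hadoop itself):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit a (key, value) pair for every token in one document."""
    for token in text.lower().split():
        yield (token, 1)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: aggregate all values that share one key."""
    return (key, sum(values))

docs = {1: "big data needs big tools", 2: "map reduce tools"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["big"] == 2 and counts["tools"] == 2
```

In a real Hadoop job the shuffle step is performed by the framework between the map and reduce phases; here it is written out explicitly to make the model visible.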

Apache™ Hadoop is an open-source framework that supports distributed computing. It grew out of Google's MapReduce and Google File System projects, and it is a platform for data-intensive applications processed in a distributed environment [6][11]. It follows the Map and Reduce programming paradigm: the data is first split, the splits are fed into the distributed network for processing, and the processed data is then integrated into a whole. Hadoop also provides a file system for organizing processed data, the Hadoop Distributed File System (HDFS) [6][11]. The Hadoop framework takes node failures into account and handles them automatically, which makes Hadoop a flexible and versatile platform for data-intensive applications.

The answer to growing volumes of data that demand fast and effective information retrieval lies in applying the principles of data mining over a distributed environment such as Hadoop. This not only reduces the time required to complete an operation but also reduces the per-machine requirements for computing over large volumes of data. Distributed computing is a technique aimed at solving computational problems by sharing the computation over a network of interconnected systems. Each individual system connected to the network is called a node, and a collection of nodes forming a network is called a cluster.

The system proposed in this paper applies MapReduce programming together with the Stanford POS tagger NLP system [2][3]. Figure 1.2 shows the proposed system: a huge amount of input data is accessed in the Hadoop Distributed File System through MapReduce programming. The Maxent Tagger model of the Stanford POS tagger used by the proposed system is loaded into the Hadoop file system so that all mapper functions can share it for tokenizing sentences.
In the proposed system architecture, all input files are stored on the master server, which distributes them to the slave servers. After distribution, the mapper function and the Stanford POS tagger are applied to each file. The mapper function separates the sentences into key/value form, and the Maxent Tagger model of the Stanford POS tagger tokenizes the sentences [1]. Named entities are then recognized in the tokenized sentences, and the reduce method is applied. This paper uses the Maxent Tagger model because it tokenizes each and every word of a sentence, which yields more accurate and faster results; it uses the best parsing method among the existing models [1][9].

Fig. 2.1: The MapReduce framework. The figure shows the input data split equally, with the map function applied to each split. The map function produces (key, value) pairs from the split data; the pairs are then combined and sorted, and the reduce function is applied to the sorted pairs. The reduce function reduces the size of the file while maintaining accuracy and gives the final MapReduce result.

3. Proposed Distributed Parsing System

Fig. 3.1: Proposed work architecture (master server distributing files to slave servers 1 … m).

The reducer method reduces the result, i.e., the length of the output file; the output file is also stored on the master server. In order to extract information from huge amounts of text files, this paper proposes the use of a distributed environment.
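The proposed mapper/reducer flow — tag each token, keep the noun (NN) tokens, and aggregate their counts across documents — can be sketched in plain Python. The toy lexicon standing in for the Maxent tagger model, and all names below, are illustrative assumptions, not the Stanford tagger's API:

```python
from collections import defaultdict

# Toy stand-in for the Maxent tagger model: tag a token "NN" if it is in
# this hand-made lexicon, otherwise "OTHER" (hypothetical, for illustration).
MODEL = {"market": "NN", "profit": "NN", "bank": "NN"}

def mapper(doc_id, text):
    """Tokenize the document and emit (token, doc_id) for NN-tagged tokens."""
    for token in text.lower().split():
        if MODEL.get(token, "OTHER") == "NN":
            yield token, doc_id

def reducer(token, doc_ids):
    """Aggregate per token: total occurrences across all documents."""
    return token, len(doc_ids)

docs = {1: "the bank posted a profit", 2: "the market and the bank"}
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for token, d in mapper(doc_id, text):
        grouped[token].append(d)

output = sorted(reducer(t, ds) for t, ds in grouped.items())
# output == [("bank", 2), ("market", 1), ("profit", 1)]
```

In the real system the lexicon lookup is replaced by the Maxent Tagger model loaded into HDFS, and the grouping step is Hadoop's shuffle; the sketch only mirrors the data flow.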

Fig. 3.2: Flowchart of the proposed method

The proposed system's pseudocode is shown below:

Class Mapper
  1. StanfordParser.setModel(MaxentTagger model)
  2. Method Map(docid a, doc d)
  3.   sentences = MaxentTagger.tokenizeText(d)
  4.   sentence1 <- sentences.startWith("NN")
  5.   Emit(sentence1, d.id)

Class Reducer
  1. Method Reduce(d.id, iterable(sentence1))
  2.   for each sentence1 s1
  3.     sum = sum + s1
  4.   Emit(d.id, sum)

The proposed system has several advantages. First, it reduces parsing time compared to the legacy system, because the legacy system tokenizes files one by one. Second, it reduces the size of the output file: since it provides a count with every word, we can easily see how many times a particular word occurs across all files [1]. Third, it can easily be modified by substituting another parsing method.

4. Experimental Result

To evaluate the performance of the proposed system, we used four autonomous computers, one configured as the master and the rest as slaves. Each computer has an Intel i3-3220 CPU at 3.3 GHz and 2 GB of RAM. The implementation uses Hadoop 1.0.4, Stanford POS tagger 3.3, and the Oracle Java 7 programming language. The data set consists of files of financial data. We compare the space and time complexity of the legacy Stanford POS tagger system with those of the distributed Stanford POS tagger system implemented on the MapReduce programming model in Hadoop, and we evaluate the running time of the distributed system on up to four nodes. The running time is measured from the sentence-separation step to the sentence-parsing step using the distributed Stanford parser. Table 4.1 shows the execution-time comparison between the legacy system and the implemented system with different numbers of nodes; execution times are in seconds.

Table 4.1: Execution time comparison

S.No  Nodes                      Execution time (s)
1     Existing (legacy) system   2180
2     Single-node MapReduce      3067
3     Two-node MapReduce         1533.5
4     Four-node MapReduce         766.75

Fig. 4.1 plots the data of Table 4.1: the execution time of the legacy system against the implemented system with one, two, and four nodes, in seconds. The graph clearly shows that the implemented system is much faster than the legacy system.

Fig. 4.1: Execution time comparison

Table 4.2 shows the output storage cost of the legacy system and of the implemented system, compared across several categories: file size, page count, and word count. In every category, the storage cost of the implemented system's output file is lower than that of the existing system (see Table 4.2).
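The storage-cost reduction comes from writing each distinct word once, together with its total occurrence count, in alphabetical order. A minimal sketch of that output format (the token list below is hypothetical, for illustration only):

```python
from collections import Counter

tokens = ["Attention", "is", "key", "Attention", "Attention", "is", "Attention"]

# Legacy output: every occurrence is written out as its own line.
legacy_lines = tokens

# Implemented output: one line per distinct word with its total count,
# sorted alphabetically -- repetition is removed, only the count remains.
counts = Counter(tokens)
implemented_lines = [f"{word} {n}" for word, n in sorted(counts.items())]
# implemented_lines == ["Attention 4", "is 2", "key 1"]

print(len(legacy_lines), len(implemented_lines))  # 7 vs 3 lines
```

The ratio of the two line counts grows with how often words repeat, which is why the savings reported below are large on real document collections.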

Table 4.2: Storage cost comparison (legacy system vs. implemented system, by file size, page count, and word count)

Fig. 4.2 compares the output file sizes of both systems, storing the same output as a text file and as a Word file. In the legacy system the text file takes 191 KB and the Word file 183 KB, while in the implemented system the text file takes 62.5 KB and the Word file 46.3 KB.

Fig. 4.2: Output file size comparison

Fig. 4.3 shows the page counts of the output files: the legacy system's output occupies 519 pages, while the implemented system's output occupies 89 pages.

Fig. 4.3: Output file page-wise comparison

Fig. 4.4 shows the word counts of the output files: 22,640 words in the legacy system's output versus 1,324 words in the implemented system's output.

Fig. 4.4: Output file word-wise comparison

The implemented system reduces storage cost because it records each word only once, however many times that word is repeated in the data set: it removes the repetition and keeps the total number of occurrences next to the word. For example, if the word "Attention" occurs four times in different places, the legacy system's output contains it four times, while the implemented system's output shows it once as "Attention 4". The implemented system also arranges its output in alphabetical order.

5. Conclusion

This paper proposes an approach for processing huge amounts of text in a distributed environment with MapReduce programming, in which the Stanford POS tagger is applied for named entity recognition. The advantages of the proposed system are that it is less time-consuming than the legacy system, it reduces storage cost, and it arranges the data in alphabetical order. In future work, this system will be evaluated for showing relationships between named entities, along with optimized techniques for parsing in a distributed environment.

6. References
[1] Nigam, Jigyasa, and Sandeep Sahu: An Effective Text Processing Approach With MapReduce.
[2] James J. (Jong Hyuk) Park et al. (eds.): Mobile, Ubiquitous, and Intelligent Computing. Lecture Notes in Electrical Engineering 274, DOI: 10.1007/978-3-642-40675-1_41, Springer-Verlag Berlin Heidelberg (2014).
[3] Kim, J., Lee, S., Jeong, D.-H., Jung, H.: Semantic Data Model and Service for Supporting Intelligent Legislation Establishment. In: The 2nd Joint International Semantic Technology Conference (2012).
[4] Klein, D., Manning, C.D.: Accurate Unlexicalized Parsing. In: Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423–430 (2003).
[5] Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: OSDI, pp. 137–150 (2004).
[6] HDFS (Hadoop Distributed File System) Architecture (2009), http://hadoop.apache.org/common/docs/current/hdfs-design.html
[7] Seo, D., Hwang, M.-N., Shin, S., Choi, S.: Development of Crawler System Gathering Web Document on Science and Technology. In: The 2nd Joint International Semantic Technology Conference (2012).
[8] Morphological Features Help POS Tagging of Unknown Words Across Language Varieties. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp. 425–432, Sydney, July 2006. Association for Computational Linguistics.
[9] Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proceedings of the 20th International Conference on Very Large Databases (VLDB-94), pp. 487–499, Santiago, Chile, September 1994.
[10] en.wikipedia.org/wiki/Information_extraction
[11] Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop Distributed File System. In: IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010. E-ISBN: 978-1-4244-7153-9.