The implementation of web service based text preprocessing to measure Indonesian student thesis similarity level

Similar documents
Stemming Influence on Similarity Detection of Abstract Written in Indonesia

The Observation of Bahasa Indonesia Official Computer Terms Implementation in Scientific Publication

Influence of Word Normalization on Text Classification

Efficient Algorithms for Preprocessing and Stemming of Tweets in a Sentiment Analysis System

An Increasing Efficiency of Pre-processing using APOST Stemmer Algorithm for Information Retrieval

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Implementation of Winnowing Algorithm Based K-Gram to Identify Plagiarism on File Text-Based Document

Stemming Techniques for Tamil Language

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

Automated Text Summarization for Indonesian Article Using Vector Space Model

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

ISSN: [Sugumar * et al., 7(4): April, 2018] Impact Factor: 5.164

The Key Technology and Algorithm Design for the Development of Intelligent Examination System

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 50 Tracing Related Scientific Papers by a Given Seed Paper Using Parscit

TextProc a natural language processing framework

THE DESIGN OF FOREIGN LANGUAGE TEACHING SOFTWARE IN SCHOOL COMPUTER LABORATORY

Implementation of LSI Method on Information Retrieval for Text Document in Bahasa Indonesia

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

Building an Efficient Web Portal for Students at Institutions of Higher Education Based on Web Crawlers

Web Page Similarity Searching Based on Web Content

Computer-assisted English Research Writing Workshop. Adam Turner Director, English Writing Lab

CS 6320 Natural Language Processing

Research on Computer Network Virtual Laboratory based on ASP.NET. JIA Xuebin 1, a

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

June 15, Abstract. 2. Methodology and Considerations. 1. Introduction

The Goal of this Document. Where to Start?

RECSM Summer School: Scraping the web. github.com/pablobarbera/big-data-upf

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

Approach Research of Keyword Extraction Based on Web Pages Document

Open Access Research on the Prediction Model of Material Cost Based on Data Mining

Cosine Similarity Measurement for Indonesian Publication Recommender System

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Open Access The Three-dimensional Coding Based on the Cone for XML Under Weaving Multi-documents

Automated Extraction of Large Scale Scanned Document Images using Google Vision OCR in Apache Hadoop Environment

Improving Suffix Tree Clustering Algorithm for Web Documents

Troubleshooting Guide for Search Settings. Why Are Some of My Papers Not Found? How Does the System Choose My Initial Settings?

Semantic Web Search Model for Information Retrieval of the Semantic Data *

describe the functions of Windows Communication Foundation describe the features of the Windows Workflow Foundation solution

CADIAL Search Engine at INEX

The Application Research of Semantic Web Technology and Clickstream Data Mart in Tourism Electronic Commerce Website Bo Liu

IJRIM Volume 2, Issue 2 (February 2012) (ISSN )

Information Extraction of Important International Conference Dates using Rules and Regular Expressions

Information Retrieval. Chap 7. Text Operations

Information Retrieval and Organisation

Cleveland State University Department of Electrical and Computer Engineering. CIS 408: Internet Computing

Tree-Based Minimization of TCAM Entries for Packet Classification

Text Analytics (Text Mining)

DOT NET Syllabus (6 Months)

BayesTH-MCRDR Algorithm for Automatic Classification of Web Document

Providing link suggestions based on recent user activity

Text Analytics (Text Mining)

DIPLOMA IN PROGRAMMING WITH DOT NET TECHNOLOGIES

Fault Identification from Web Log Files by Pattern Discovery

Smarter text input system for mobile phone

Tutorial to QuotationFinder_0.6

Web of Science. Platform Release Nina Chang Product Release Date: March 25, 2018 EXTERNAL RELEASE DOCUMENTATION

Sentiment Analysis using Weighted Emoticons and SentiWordNet for Indonesian Language

The Design of Model for Tibetan Language Search System

Personalized Search for TV Programs Based on Software Man

EBSCOhost Web 6.0. User s Guide EBS 2065

Creating and Maintaining Vocabularies

LUISSearch LUISS Institutional Open Access Research Repository

The course also includes an overview of some of the most popular frameworks that you will most likely encounter in your real work environments.

The Effect of Diversity Implementation on Precision in Multicriteria Collaborative Filtering

If you do not meet any of the above criteria, you can purchase the voice from for

Automatic Bangla Corpus Creation

TIRA: Text based Information Retrieval Architecture

Automatic Text Summarization System Using Extraction Based Technique

INFORMATION TECHNOLOGY SERVICES

CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task

Tutorial to QuotationFinder_0.4.4

Student/Project Portfolios Using The NEW Google Sites

In this project, I examined methods to classify a corpus of s by their content in order to suggest text blocks for semi-automatic replies.

Database Design on Construction Project Cost System Nannan Zhang1,a, Wenfeng Song2,b

Public Health Surveillance Using Global Health Explorer

Mitchell Bosecke, Greg Burlet, David Dietrich, Peter Lorimer, Robin Miller

A Retrieval Mechanism for Multi-versioned Digital Collection Using TAG

Final Project Discussion. Adam Meyers Montclair State University

Web Mining Team 11 Professor Anita Wasilewska CSE 634 : Data Mining Concepts and Techniques

Integrating Text Mining with Image Processing

Syntax Analysis. Chapter 4

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

Tutorial to QuotationFinder_0.4.3

Increasing access to OA material through metadata aggregation

CIS 408 Internet Computing (3-0-3)

A short introduction to the development and evaluation of Indexing systems

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

Implementation of the common phrase index method on the phrase query for information retrieval

The Ultimate Social Media Setup Checklist. fans like your page before you can claim your custom URL, and it cannot be changed once you

FIT 100: Fluency with Information Technology

Turnitin Instructor Guide

Comparing Two Program Contents with Computing Curricula 2005 Knowledge Areas

Analysis on the technology improvement of the library network information retrieval efficiency

EE-575 INFORMATION THEORY - SEM 092

Chapter 4. Processing Text

COURSE OUTLINE: OD10267A Introduction to Web Development with Microsoft Visual Studio 2010

Database Security MET CS 674 On-Campus/Blended

REST Web Services Objektumorientált szoftvertervezés Object-oriented software design

Plagiarism Detection Using FP-Growth Algorithm

Transcription:

The implementation of web service based text preprocessing to measure Indonesian student thesis similarity level Yan Watequlis Syaifudin *, Pramana Yoga Saputra, and Dwi Puspitasari State Polytechnic of Malang, Information Technology Department, Indonesia Abstract. The plagiarism of scientific work, especially undergraduate thesis, mostly happened in the college. In this research we used text mining, a new method which can be used to do the checking procedure, to obtain specific pattern of the document. After obtaining the document pattern, we compare the pattern with another document pattern. If the level of pattern similarity is high, it can be suspected as plagiarism. This paper will explain the development of the text preprocessing, a part of text mining. We choosed Nazief and Adriani Algorithm as a text preprocessing algorithm for this research. This research will result a text preprocessing web service. The web service is expected to be used for further development of text mining. 1 Introduction Text is a language expression, in which one form is a work of writing. Text can be composed of one or more words, such as phrases, clauses and sentences. A text can contain information the author wants to convey to the reader. In information technology, the text is a combination of both characters and words that can be read by humans. They also can be encoded into a format that can be read by a computer, such as ASCII. Currently the most widely used information exchange is through text exchanges. The text belongs to the category of semi structured and unstructured formats. The example of semi-structured formats is HTML document and the examples of unstructured format are email and full-text document. The structured format is data that stored in the database and obtaining information from the database is relatively easier than retrieving information from unstructured format documents. To retrieve information from the unstructured document, we can use text mining. In this research, the software will be developed from earlier research on implementation of Nazief & Adriani algorithm with preprocessing of similarity level measurement of thesis title. And in this research we will develop a system with web service for stemming process based on previous research. Web service itself is a service provided by the web site and this service can be used by other systems to perform a data processing. With the web service, there is interoperability between systems. Interoperability is the ability to interact between one system with another, where the interacting systems have different characteristics. We have chosen the development of the stemming process into the form of web service, because we want to be able to develop text mining application based on web sustainably, and the software can be developed by anyone by utilizing the web service. 2 Basic Concepts 2.1 Text Mining Text mining is defined as a process for finding hidden, useful, and interesting patterns from unstructured documents [1]. To find the pattern in the document takes many stages because the text in the document is unstructured and complex. Some of the steps done in text mining are as follows: 1. Text preprocessing 2. Knowledge distillation The preprocessing stage is the data preparation stage before the knowledge distillation stage. Once obtained, a collection of messages will be conducted text preprocessing stage [2]. The data in question is the text to be retrieved information. The preprocessing stage consists of: 2.1.1 Tokenizing The cutting phase of input string is based on each word that compiles it. For these deductions, usually commas, periods and spaces are used as markers in separating words. 2.1.2 Filtering Filtering is a phase of taking important words that have been resulted from the tokenization process. At this stage, the less important words are discarded (stopword removal) and important words are stored. * Corresponding author: qulis@polinema.ac.id The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

2.1.3 Stemming Stemming is the process of finding the basic word of a word in the text [3]. The stemming is tightly related to basic word or lemma and the sub lemmas. The lemma and sub lemma of Indonesian Language have been grown and absorb from foreign languages or Indonesian traditional languages [4]. The basic word search process of each word that have been generated from the filtering stage. 2.2 Nazief and Adriani Algorithm In Indonesian, there have been some algorithms, such as Nazief & Andriani, Porter for Indonesian language, Vega, and Arifin & Setiono stemming algorithms. Nazief & Andriani algorithm has the highest accuracy [5]. The stemming algorithm Nazief and Adriani was developed based on the Indonesian morphology rules that grouped the affixes into prefixes, infixes, suffixes, and confixes. There are variations of affixes including prefixes, suffixes, infixes, and confixes [6]. The stemming process of Indonesian texts is become more complex because there are variations of affixes should be removed to get the root word of a word. The words in Bahasa Indonesia have the special and complex morphological structure compared with the words in other languages [7]. Generally, the basic word of Indonesian consists of a combination: Prefix 1 + Prefix 2 + Basic word + Suffix 3 + Suffix 2 + Suffix 1. The Nazief & Adriani algorithm has the following stages: 1. Find the stemmed word in the basic dictionary. If found, then assume the word is a root word. So the algorithm stops. 2. Inflection Suffixes ("-lah", "-kah", "-ku", "-mu", or "nya") are discarded. If it was a particle ("-lah", "- kah", "-tah" or "-pun") then this step is repeated again to remove Possesive Pronouns ("-ku", "-mu" or "-nya"), if there are. 3. Remove Derivation Suffixes ("-i", "-an" or "-kan"). If the word is found in the dictionary, then the algorithm stops. If not found, then go to step 3a a. If "-an" has been deleted and the last letter of the word is "-k", then "-k" is also deleted. If the word is found in the dictionary, then the algorithm stops. If not found, then do step 3. b. The deleted suffix ("-i", "-an" or "-kan") is returned, go to step 4. 4. First find the stemmed word in the basic word dictionary. If found, then it is assumed that word is root word. So the algorithm stops. 5. Inflection Suffixes ("-lah", "-kah", "-ku", "-mu", or "nya") are discarded. If it is a particle ("-lah", "- kah", "-tah" or "-pun") then this step is repeated again to remove Possesive Pronouns ("mu", "ku" or "nya"), if there are. 6. Remove Derivation Suffixes ("-i", "-an" or "-kan"). If the word is found in the dictionary, then the algorithm stops. If not found, then go to step 3a. a. If "-an" has been deleted and the last letter of the word is "-k", then "-k" is also deleted. If the word is found in the dictionary, then the algorithm stops. If not found, then do step 3. b. The deleted suffix ("-i", "-an" or "-kan") is returned, go to step 4. 7. Remove the Derivation Prefix. If in step 3 there is a suffix that have been deleted, then go to step 4a, else go to step 4b. a. Check the end-of-end combination table that was not allowed. If found, then the algorithm stops, else go to step 4b. b. For i = 1 to 3, specify the prefix type and then delete the prefix. If root word has not been found do step 5, if it is then the algorithm stops. Note: if the second prefix is the same as the first prefix the algorithm stops. 8. Doing Recoding. 9. If all the steps have been completed but not succeeded, then the initial word is assumed as root word. The process is complete. 2.3 Windows Communication Foundation (WCF) Windows Communication Foundation (WCF) is a framework used in building service-oriented based applications. Using WCF can send data in the form of asynchronous messages from one endpoint to another, that shown in Figure 1 [8]. A service endpoint can be part of an ongoing service hosted by IIS, or it can be part of a hosted service within the application. An endpoint could be a client of a service that requests data from the service endpoint. Message either could be as simple as a single character or word, sent in XML or JSON or could be a complex message as a binary data stream. Fig. 1. WCF Model. In short, WCF is designed to offer a manageable approach in creating web services and web service clients. 3 Application Design In this research, we designed an architecture of making web-based stemming service shown as in Figure 2. In the picture explained that the client has the text that we want to stem, then the text is sent by the client to the server via web service, in this study the service named Stemming.svc. The text will be stemmed by the server 2

and the results of the stemming will be sent by the server to the client through the web service in JSON format. The JSON documents obtained by the client can be followed up for example presented in tabular form, or can be used as resource in further application development, for example used for comparison of similarity level of a document to obtain similarity value. Explanation of workflow in Fig. 1 is as follows: 1. The client enters the text that you want to be stacked through URL: http: //localhost/stservice/stemming.svc/json/" stemmed text" 2. The text will be stacked by the server 3. The result of stemming is dataset 4. The dataset is then converted into JSON format 5. JSON data is sent to the client 6. JSON data can be directly displayed to the client, or can be presented into a table, or can be used as a resource to check the level of similarity of a document. 4 Result and Discussion Fig. 2. Application Architeture. In Figure 2 there is a web service that uses Windows Communication Foundation technology, created using ASP.NET. As for the client, there are clients A and client B, which is a client that has a platform other than ASP.NET. The workings of the created application can be seen in the flowchart in Figure 3, and the explanation is as follows: This research, a web-based stemming service, produces a web service running on the ASP.NET platform. The web service utilizes Windows Communication Foundation (WCF) technology, which applies the principle of REST. This research uses Microsoft Visual Studio 2010 for code generation program. First made a WCF project with the name STService. In the project there are four main files in making web service, namely: Global.asax Used to set the application connection to the database, which in the database other than used to store data, is also used to process data using stored procedure IStemming.cs Contains the exposed method and will be used by the client. The HTTP function used, as well as the message format (XML / JSON) are also defined in this section Stemming.svc Contains detailed logic of the exposed method in IStemming.cs Web.config Contains service settings, this part is very important, because even if the algorithm used is correct, but if there is an error in the settings, then the service will not run. 4.1 Application Result To access this service, we can use a web browser such as Mozilla, Google Chrome, or Internet Explorer. Make sure that IIS on the server computer is activated and the service is running. Next, enter the following link in the browser: http://localhost/stservice/stemming.svc Next, in the web browser will appear as shown in Figure 4 below. Fig. 3. Application workflow. 3

Fig. 4. Display of web service in the browser. Next, we can enter the sentence to be processed. For example, the sentence to be stemmed is the following sentence: ketika masih anak-anak, dia suka bermain dengan anak yang sebaya dengannya Then input the following link to the web browser: http://localhost/stservice/stemming.svc/json/ketika masih anak-anak, dia suka bermain dengan anak yang sebaya dengannya Next will appear JSON data as in picture 5 below. Fig. 6. JSON data on the JSON online reader. The result is JSON data that will be used as resource to make further process by client. {"Table":[{"No":1,"dasar":"SUKA","jml":1},{"No":2,"dasar":"BAYA","j ml":1},{"no":3,"dasar":"masih","jml":1},{"no":4,"dasar":"ketika","j ml":1},{"no":5,"dasar":"main","jml":1},{"no":6,"dasar":"anak- ANAK,","jml":1},{"No":7,"dasar":"ANAK","jml":1}]} Fig. 5. The JSON data result. Furthermore, if checked on JSON reader online, it will be shown as Figure 6. 4.2 Discussion Software testing is performed to determine the time of measurement of similarity and will then be tested based on the ratio of the number of documents and the length of the process. Test results on 10 documents require a processing time of 12 seconds. Testing time will continue to increase along with the number of documents to be tested in plagiarism. In fact, there are a lot of cases of plagiarism occurred in Indonesia, which is not only done by students but also by lecturers [9]. So, it becomes very important for every university to be able to have a repository of scientific works, both from lecturers and students. Furthermore, from the repository we can check the similarity of scientific work using software from the results of this research. The web service system will greatly assist each university in providing plagiarism checking services that can be accessed by outside systems based on existing repositories. The next step is to build a comprehensive plagiarism checking system and integrate with other systems, and this will be easier with the web service. To effectively deter plagiarism, a comprehensive and effective policy should be in place, which includes definition, detection, and penalties [10]. 5 Conclusion The result of stemming can be used well by the client application that has been made before to measure the level of similarity. Datasets in JSON format can also be converted in database format, so it can be used for applications that use database to store student thesis data. 4

Besides can be used for similarity measurement, web service can also be used to develop applications for learning Indonesian language, checking spelling, and so forth. References 1. K.L. Sumathy, M. Chidambaram. Text Mining: Concepts, Applications, Tools and Issues An Overview. International Journal of Computer Applications, Vol. 80, No. 4. (October 2013) 2. Y.W. Syaifudin, D. Puspitasari. Twitter Data Mining for Sentiment Analysis on Peoples Feedback Against Government Public Policy. MATTER: International Journal of Science and Technology, pp. 100-122. (2017) 3. P.M. Prihatini, I.K.G.D. Putra, I.A.D. Giriantari, M. Sudarma. Stemming Algorithm for Indonesian Digital News Text Processing. International Journal of Engineering and Emerging Technology. p. 1-7, Mar. (2018) 4. R. Setiawan, A. Kurniawan, W. Budiharto, I.H. Kartowisastro, H. Prabowo. Flexible affix classification for stemming Indonesian Language. 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON). (2016) 5. R.B.S. Putra, E. Utami. Non-formal affixed word stemming in Indonesian language. 2018 International Conference on Information and Communications Technology (ICOIACT). (2018) 6. H. Widayanto, A.F. Huda. Comparison Nazief Adriani and CS Stemmer Algorithm for Stemm Real Data. e-proceeding of Engineering: Vol. 4, No.3 Desember 2017, Page 5215. (2017) 7. A.F. Hidayatullah, C.I. Ratnasari, S. Wisnugroho. Analysis of Stemming Influence on Indonesian Tweet Classification. TELKOMNIKA, pp. 665~673, Vol.14, No.2, (June 2016) 8. Microsoft. What Is Windows Communication Foundation, (https://msdn.microsoft.com/enus/id/library/ms731082(v=vs.110).aspx, Accessed 27 Oktober 2017) 9. R. Agustina, P. Raharjo. Exploring Plagiarism into Perspectives of Indonesian Academics and Students. Journal of Education and Learning (EduLearn). Vol. 11 (3) pp. 262-272. (2017) 10. T. Adiningrum. Reviewing Plagiarism: An Input for Indonesian Higher Education. Journal of Academic Ethics 13(1):107-120. (2015) 5