The Observation of Bahasa Indonesia Official Computer Terms Implementation in Scientific Publication

Journal of Physics: Conference Series. PAPER, OPEN ACCESS.
To cite this article: D Gunawan et al 2018 J. Phys.: Conf. Ser. 979 012034

D Gunawan 1, A Amalia 2, M S Lydia 2, M I Muthaqin 1
1 Department of Information Technology, Universitas Sumatera Utara, Jl. dr. Mansur No. 9 Kampus USU Medan 20155
2 Department of Computer Science, Universitas Sumatera Utara, Jl. dr. Mansur No. 9 Kampus USU Medan 20155
Email: danigunawan@usu.ac.id

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Abstract. The government of the Republic of Indonesia issued a regulation to replace foreign-language computer terms already in use with official computer terms in Bahasa Indonesia. The regulation was stipulated in Presidential Decree No. 2 of 2001 concerning the introduction of official computer terms in Bahasa Indonesia (known as Senarai Padanan Istilah, SPI). Sixteen years later, the people of Indonesia, and academics in particular, should have adopted the official computer terms in their official publications. This observation examines how far the official computer terms are used in scientific publications written in Bahasa Indonesia. The data source is publications by academics, particularly in the computer science field. The method is divided into four stages. The first stage is metadata harvesting using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The second stage converts the harvested documents (in PDF format) to plain text. The third stage is text pre-processing in preparation for string matching. The final stage searches for the computer terms, based on the 629 SPI terms, using the Boyer-Moore algorithm. We observed 240,781 foreign computer terms in 1,156 scientific publications from six universities. This result shows that foreign computer terms are still widely used by academics.

1. Introduction

The early Internet era in Indonesia brought new foreign-language computer terms that had no suitable translation in Bahasa Indonesia. In 2001, the government of the Republic of Indonesia responded by issuing Presidential Decree No. 2 of 2001 concerning the introduction of official computer terms in Bahasa Indonesia. The official computer terms are also known as Senarai Padanan Istilah (SPI). There are 629 terms to substitute for the foreign-language computer terms. These terms were generated from the Computer Generation Specific Formatting Guidelines.

Sixteen years after being officially published by the government of the Republic of Indonesia, the official computer terms in Bahasa Indonesia should have been widely adopted in scientific publications written in Bahasa Indonesia. A study conducted in 2015 analyzed the implementation of the official computer terms in Bahasa Indonesia by giving a questionnaire to students. It shows that the students use only 33% of the official computer terms in Bahasa Indonesia [1].

Research in 2016 observed the implementation of official computer terms in Bahasa Indonesia on government websites [2]. It covered 1,084 Indonesian government websites and focused on the website templates, that is, the parts repeated across a website rather than its content. It found that 47% of Indonesian government websites implement 88.4% official computer terms in Bahasa Indonesia compared to foreign-language terms. This result does not depict the condition of the whole websites, because only the template was observed and not the content.

This research analyzes the implementation of official computer terms in Bahasa Indonesia in scientific publications written by academics, especially by students majoring in computer science (or a computer-related field). We use publications written by higher education students, instead of a questionnaire, to analyze the implementation of official computer terms in Bahasa Indonesia. The research stages are data harvesting using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), converting the harvested documents (in PDF format) to plain text, text pre-processing, and finally comparing the computer terms using a string matching algorithm. OAI-PMH was used in previous research to harvest the content of scholarly research journals [3]. The result of this research is proposed as a starting point for the government to evaluate the implementation of official computer terms in Bahasa Indonesia.

The challenges of this research are terms that have more than one meaning and terms with suffixes. A term with more than one meaning, for example address, might lead to miscounting the computer terms, because address can refer to a personal address or a computer address. Only the computer-related meaning should be counted. The next challenge is terms with suffixes. This can be handled by implementing word stemming or a dictionary. Alternatively, the dictionary can be replaced by a lexical database. Recent research to build a lexical database for Bahasa Indonesia is still in the design phase [4]. Another challenge in natural language processing is multiword expressions (MWE), which are combinations of two or more words that form a meaning. Current research is still determining the candidate multiword expressions for Bahasa Indonesia [5]. However, because this research uses the official computer terms listed in Presidential Decree No. 2 of 2001 (the list includes terms of two or more words), multiword expressions are sufficiently handled by string matching. This research uses the Boyer-Moore algorithm to implement string matching.

This paper is organized as follows. The first section introduces the research. Section two discusses the observation approach used in this research. Section three discusses the results. The last section is the conclusion.

2. Research Methodology

The research was conducted in several steps. The first step is collecting documents from several major universities in Indonesia. As we study the usage of official computer terms, we limit the documents to computer-related fields, such as computer science, information technology and informatics. Second, we extracted the documents into tokens to prepare for the next process. According to the writing rules of Bahasa Indonesia, foreign-language words must be written in italics. As the formatting has no effect on counting the official computer terms, this research does not consider it. In the third step, we implemented a string matching algorithm to compare the extracted tokens with the official computer terms for Bahasa Indonesia. Last, we observed the results and drew conclusions. The details of the research are as follows.

2.1 Harvesting the Documents

The aim of this research is to demonstrate the usage of official computer terms for Bahasa Indonesia among academics, especially in the computer field. We harvested the thesis documents published by the universities. We use the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to harvest all the documents in each university. OAI-PMH provides services to publish metadata based on the HyperText Transfer Protocol (HTTP) and the Extensible Markup Language (XML). To enable data interoperability, the harvested metadata should be available in an agreed format. Usually, metadata is available in the Dublin Core standard.

OAI-PMH provides data interoperability between two kinds of participants, namely service providers and data providers. Service providers harvest the metadata published by repositories. These repositories are called data providers. Both communicate using the same protocol, OAI-PMH. There are six types of request (verbs) in OAI-PMH: GetRecord, Identify, ListRecords, ListIdentifiers, ListMetadataFormats and ListSets. GetRecord retrieves a single metadata record from a data provider (repository). Identify retrieves information about a data provider (repository). Usually, part of the retrieved information is required for other requests. ListRecords is used to obtain or harvest records from a repository. ListIdentifiers is similar to ListRecords, but it harvests only the headers instead of the full records. ListMetadataFormats returns the metadata formats available in a data provider (repository). ListSets returns the set (category) structure of a repository.

As mentioned previously, data interoperability in OAI-PMH works only if data providers and service providers communicate in the same format. In this research, we use the simplified Dublin Core standard because it is a common standard for creating digital web catalogs. It consists of 15 metadata elements: title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights. These metadata elements describe the published document. Figure 1 shows an example of a Dublin Core implementation in XML. All the Dublin Core elements belong to the dc namespace, which is bound to http://purl.org/dc/elements/1.1/ by the xmlns attribute. Therefore, as shown in figure 1, each Dublin Core element tag is prefixed with dc. For example, <dc:title>Revitalisasi Perpustakaan</dc:title> means that the title element belongs to the dc (Dublin Core) namespace and has the value Revitalisasi Perpustakaan. Not all metadata elements have to be published or available. As shown in figure 1, the data provider publishes only 9 metadata elements.

Figure 1. Dublin Core implementation in XML.

Initially, we observed that 63 universities in Indonesia have a computer-related study program. However, not all of them support OAI-PMH for their repositories. In addition, not all universities open access to their documents. Some of them do not allow guest access, which prevents a service provider from harvesting all the available documents. Therefore, this research used only 10 universities in Indonesia. The universities and their OAI-PMH addresses are listed in table 1. The universities listed in table 1 provide open access to their documents. Those documents can be retrieved through the records returned by the GetRecord verb. In addition, this research limits the documents to those in PDF format and written in Bahasa Indonesia.
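The six request types described above are plain HTTP GET requests, so a harvester mostly needs to assemble query strings. The following is a minimal sketch, not the harvester used in this research: the helper name build_request and the set value are hypothetical, while the base address is the Universitas Sumatera Utara entry of table 1 and oai_dc is the metadata prefix that OAI-PMH defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs named in this section.
OAI_VERBS = (
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
)


def build_request(base_url, verb, **arguments):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update(arguments)
    return f"{base_url}?{urlencode(query)}"


BASE_URL = "http://repository.usu.ac.id/oai/request"  # address from table 1

# Ask for the category (set) structure of the repository.
list_sets_url = build_request(BASE_URL, "ListSets")

# Harvest the records of one set as simple Dublin Core.
list_records_url = build_request(
    BASE_URL,
    "ListRecords",
    metadataPrefix="oai_dc",
    set="hypothetical_set_spec",  # placeholder, not a real setSpec
)

print(list_sets_url)
print(list_records_url)
```

The sketch only builds the URLs. Fetching them and following the resumption tokens that large repositories return is left out to keep the example short.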

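To illustrate how the dc prefix and its namespace address work together, the sketch below reads a small Dublin Core record with Python's standard XML library. The sample record is invented for illustration (only the title value comes from the example in the text), and the function name dublin_core_fields is hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

# Invented sample record; only four of the 15 elements are published here.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Revitalisasi Perpustakaan</dc:title>
  <dc:creator>Hypothetical Author</dc:creator>
  <dc:language>id</dc:language>
  <dc:format>application/pdf</dc:format>
</oai_dc:dc>"""


def dublin_core_fields(xml_text):
    """Collect the Dublin Core elements of one record into a dict of lists."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root:
        # ElementTree expands a prefixed tag to "{namespace-uri}localname".
        namespace, _, name = element.tag[1:].partition("}")
        if namespace == DC_NAMESPACE:
            fields.setdefault(name, []).append((element.text or "").strip())
    return fields


record = dublin_core_fields(SAMPLE_RECORD)
print(record["title"])
print(sorted(record))
```

Each element maps to a list because Dublin Core allows an element to repeat, and elements that the data provider does not publish are simply absent from the result.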
Table 1. The list of universities.

University                            OAI-PMH Address
Universitas Airlangga                 http://repository.unair.ac.id/cgi/oai2
Universitas Andalas                   http://teknosi.fti.unand.ac.id/index.php/teknosi/oai
Universitas Diponegoro                http://jtsiskom.undip.ac.id/index.php/jtsiskom/oai
Universitas Sumatera Utara            http://repository.usu.ac.id/oai/request
Universitas Sultan Syarif Kasim       http://repository.uin-suska.ac.id/cgi/oai2
Universitas Sriwijaya                 http://eprints.unsri.ac.id/cgi/oai2
Institut Teknologi Sepuluh Nopember   http://iptek.its.ac.id/index.php/index/oai
Universitas Gadjah Mada               http://i-lib.ugm.ac.id/oai-ilib/oai21.php
Universitas Bengkulu                  http://jamal.ub.ac.id/index.php/index/oai
Universitas Malikussaleh              http://ejurnal.tif.unimal.ac.id/index.php/index/oai

2.2 Text Pre-processing

This research requires single words (tokens) to compare with the official computer terms in Bahasa Indonesia. Therefore, after harvesting the metadata and documents, we extracted all the text in the documents and applied text pre-processing: tokenizing, stop-word removal, case folding and stemming. Tokenizing splits the text into single words. Stop-word removal removes common words that carry little meaning on their own. This research uses the stop-word list provided by Tala [6]. Case folding converts all the text to lowercase, which avoids case-sensitivity issues when comparing tokens. The last pre-processing step is stemming. As the official computer terms are given in root-word form, the tokens should be in root-word form as well. Stemming reduces a token to its root word. We used the confix-stripping approach for stemming [7]. In addition, this research does not consider abbreviations of English terms. Such abbreviations are treated as other terms and are not counted.
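The pre-processing steps above can be sketched as a short pipeline. This is an illustration under stated assumptions, not the code used in this research: the stop-word set is a tiny hand-picked sample rather than the list by Tala [6], and the stem function is a placeholder that returns the token unchanged instead of implementing confix-stripping [7].

```python
import re

# Tiny illustrative sample; the research uses the stop-word list by Tala [6].
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "pada", "ini", "itu"}


def stem(token):
    """Placeholder stemmer that returns the token unchanged.

    The research applies confix-stripping stemming [7], which is not
    reproduced in this sketch.
    """
    return token


def preprocess(text):
    """Case folding, tokenizing, stop-word removal and stemming."""
    lowered = text.lower()                     # case folding
    tokens = re.findall(r"[a-z]+", lowered)    # tokenizing on letter runs
    kept = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in kept]             # stemming (placeholder)


print(preprocess("Data disimpan di Database dan dikirim ke Server."))
# ['data', 'disimpan', 'database', 'dikirim', 'server']
```

Case folding is applied first here so that the stop-word comparison is case-insensitive; the order of the other steps follows the text.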
2.3 Comparing Tokens by Using a String Matching Algorithm

String matching algorithms are of two types: 1) exact string matching and 2) approximate string matching. This research uses exact string matching. We use the Boyer-Moore algorithm to match the tokens with the official computer terms for Bahasa Indonesia. This algorithm is known as one of the most famous and efficient algorithms in pattern matching [8].

2.4 The Result Observation

After comparing the tokens with the official computer terms for Bahasa Indonesia, we observed the result for each university. Then we observed the result across all the available universities.

3. Results and Discussion

3.1 Harvesting Document Result

As mentioned earlier, the data are obtained by harvesting documents via OAI-PMH. This research is limited to documents related to the computer field. To obtain only those documents, we need to know the category structure of each university repository. The category structure can be observed via the ListSets verb. For example, we can observe the categories in the Universitas Sumatera Utara repository by accessing http://repository.usu.ac.id/oai/request?verb=listsets. Each repository has to be inspected manually, as there is no naming or coding convention for classifying specific document types. After determining the category to be harvested for each repository, the metadata can be harvested by specifying the setSpec of that category. All the document locations can be extracted by using the ListRecords verb.

There were several issues in the harvesting process. Three universities are registered as data providers but do not provide the required metadata. Therefore, we exclude those three universities from this research. They are Universitas Gadjah Mada, Universitas Bengkulu and Universitas Malikussaleh. In addition, we exclude Institut Teknologi Sepuluh Nopember from this research because it uses English instead of Bahasa Indonesia.

3.2 Data Extraction

Figure 2 shows an example of harvested XML from Universitas Sumatera Utara. We extract several fields: the published date (represented by <atom:published>), the file name (represented by <atom:title>), and the link to the file download (represented by <atom:link>). The extracted data shown in figure 2 (a), (b) and (c) are the published date, title and link respectively. We notice a difference between Universitas Sumatera Utara and the other universities in how the file download link is provided. Universitas Sumatera Utara splits a document into several files by chapter, whereas the other universities provide only one file for each document. The XML is parsed with a DOM parser.

Figure 2. XML from Universitas Sumatera Utara: (a) published date, (b) title, (c) link.

Table 2. The downloaded files.

University                        Number of documents from the XML   Number of downloaded files
Universitas Airlangga                                          130                          123
Universitas Andalas                                             43                           43
Universitas Diponegoro                                         186                          185
Universitas Sumatera Utara                                     325                          325
Universitas Sultan Syarif Kasim                                654                          395
Universitas Sriwijaya                                           99                           85

3.3 Document Download

After extracting the required fields from the XML, the documents are downloaded for processing in the next step. We found that some repositories do not provide file download links for certain documents. Therefore, the number of downloaded files is smaller than the number of documents available in the XML data.
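The field extraction of section 3.2 can be sketched with the DOM parser in Python's standard library. The XML fragment below is invented for illustration, and two details are assumptions rather than facts from the harvested data: that the atom prefix is bound to the usual Atom namespace address, and that the download location sits in the href attribute of <atom:link>.

```python
from xml.dom import minidom

# Assumption: the atom prefix is bound to the standard Atom namespace.
ATOM_NAMESPACE = "http://www.w3.org/2005/Atom"

# Invented fragment; the real repository response is larger.
SAMPLE_ENTRY = """<atom:entry xmlns:atom="http://www.w3.org/2005/Atom">
  <atom:published>2014-01-01T00:00:00Z</atom:published>
  <atom:title>Chapter I.pdf</atom:title>
  <atom:link rel="alternate" href="http://example.org/files/chapter1.pdf"/>
</atom:entry>"""


def text_of(node):
    """Join the text nodes directly under a DOM element."""
    parts = [child.data for child in node.childNodes
             if child.nodeType == child.TEXT_NODE]
    return "".join(parts).strip()


def extract_entry(xml_text):
    """Return the published date, file name and download links of one entry."""
    document = minidom.parseString(xml_text)
    published = document.getElementsByTagNameNS(ATOM_NAMESPACE, "published")
    titles = document.getElementsByTagNameNS(ATOM_NAMESPACE, "title")
    links = document.getElementsByTagNameNS(ATOM_NAMESPACE, "link")
    return {
        "published": text_of(published[0]) if published else None,
        "title": text_of(titles[0]) if titles else None,
        # Assumption: the file location is stored in the href attribute.
        "links": [link.getAttribute("href") for link in links],
    }


entry = extract_entry(SAMPLE_ENTRY)
print(entry)
```

Returning the links as a list covers both cases described in section 3.2: one file per document, or several files per document when a repository splits it by chapter. An empty list corresponds to a document without a download link.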

3.4 Results

Before applying the algorithm, the downloaded files have to be converted to plain text. Next, we apply text pre-processing such as tokenizing, stop-word removal, and case folding. To search for the official computer terms in Bahasa Indonesia, we use the Boyer-Moore string matching algorithm. With this algorithm, we found the ten most used computer terms in English: user, file, input, database, server, output, text, image, error, and password. The result of the search is shown in table 3. The column Year gives the year in which the documents were published. It is followed by UNAND, UA, UNDIP, UNSRI, UIN-SUSKA and USU, which represent Universitas Andalas, Universitas Airlangga, Universitas Diponegoro, Universitas Sriwijaya, Universitas Islam Negeri Sultan Syarif Kasim and Universitas Sumatera Utara respectively.

Table 3. Number of computer terms in English for each university.

Year                  UNAND      UA   UNDIP   UNSRI  UIN-SUSKA      USU  Sub Total
2001                      -       9       -       -          -        -          9
2002                      -       0       -       -          -        -          0
2003                      -       -       -       -          -        -          0
2004                      -      21       -       -          -        -         21
2005                      -       0       -     243          -        -        243
2006                      -       0       -     291          -        -        291
2007                      -       4       -       -          -        -          4
2008                      -       0       -       -          -        -          0
2009                      -      39       -      83      1,974        -      2,096
2010                      -      85       -     334      3,925        -      4,344
2011                      -      85       -     547     25,619        -     26,251
2012                      -      28       -   1,218      7,751        -      8,997
2013                      -      38   1,945   1,900     28,797    5,331     38,011
2014                      -      45   2,802   5,893      3,437   32,634     44,811
2015                    564     510   5,181   1,406          -   63,581     71,242
2016                  3,309   1,477   4,458       -      1,094   25,044     35,382
Total                 3,873   2,341  14,386  11,915     72,597  126,590
Number of documents      43     123     185      85        325      395

Most of the repositories do not have any documents from 2001, except Universitas Airlangga. However, we could not convert the UA documents published in 2005, 2006 and 2008 because the PDF files are scanned images. According to table 3, the usage of computer terms in foreign language is still very high.
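The term search behind these counts can be illustrated with a compact version of Boyer-Moore. The sketch below is a simplification that uses only the bad-character rule and omits the good-suffix rule of the full algorithm; the function names are hypothetical, and the multiword term hard disk is an illustrative example rather than an entry quoted from the SPI list. Terms are padded with spaces so that file does not match inside a longer word, which also lets terms of two or more words be matched directly.

```python
def bad_character_table(pattern):
    """Map each character to the index of its last occurrence in the pattern."""
    return {character: index for index, character in enumerate(pattern)}


def boyer_moore_search(text, pattern):
    """Return the start index of every occurrence of pattern in text.

    Simplified Boyer-Moore: bad-character rule only, no good-suffix rule.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    last = bad_character_table(pattern)
    matches = []
    shift = 0
    while shift <= n - m:
        j = m - 1
        # Compare from the right end of the pattern towards the left.
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1
        if j < 0:
            matches.append(shift)
            if shift + m < n:
                # Align the character after the window with its last
                # occurrence in the pattern.
                shift += m - last.get(text[shift + m], -1)
            else:
                shift += 1
        else:
            # Align the mismatched text character with its last occurrence
            # in the pattern, moving at least one position.
            shift += max(1, j - last.get(text[shift + j], -1))
    return matches


def count_terms(tokens, terms):
    """Count whole-term occurrences in a token list, multiword terms included."""
    padded = " " + " ".join(tokens) + " "
    return {term: len(boyer_moore_search(padded, " " + term + " "))
            for term in terms}


tokens = ["user", "membuka", "file", "hard", "disk", "lalu", "file",
          "dikirim", "server", "profile"]
print(count_terms(tokens, ["user", "file", "hard disk", "password"]))
```

In this usage example the token profile is not counted as file, because the padded term requires a space on both sides.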
This result is proposed as a starting point for the government to conduct more thorough research to evaluate the implementation of official computer terms in Bahasa Indonesia.

4. Conclusion

Sixteen years after being published by the government of the Republic of Indonesia, the official computer terms in Bahasa Indonesia should have been widely adopted, especially by academics. This research observed the usage of official computer terms in scientific publications by academics in computer-related fields. We observed 1,156 scientific publications published by six universities in Indonesia. We found that computer terms in foreign language are still widely used by academics in their scientific publications.

References

[1] Sari C A 2015 Tanggapan Mahasiswa di Kota Surakarta Terhadap Pengindonesiaan Istilah Asing Bidang Pengkomputeran (Kajian Sosiolinguistik) (Universitas Sebelas Maret)
[2] Amalia A, Gunawan D, Lydia M S, and Charlie C 2017 IOP Conf. Ser. Mater. Sci. Eng. vol 180 p 12060
[3] Simek P, Jarolimek J, Vanek J, and Stoces M 2011 Proc. of the 5th International Conference on Information and Communication Technologies for Sustainable Agri-production and Environment (HAICTA 2011) pp 827-834
[4] Gunawan D and Amalia A 2017 IOP Conf. Ser. Mater. Sci. Eng. vol 180 p 12052
[5] Gunawan D, Amalia A, and Charisma I 2016 6th IEEE Intl. Conf. on Control System, Computing and Engineering (ICCSCE) pp 304-309
[6] Tala F Z 2003 A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia (Universiteit van Amsterdam)
[7] Adriani M, Asian J, Nazief B, and Tahaghoghi S M M 2007 ACM Trans. Asian Lang. Inf. Process. vol 6 pp 1-33
[8] Yuan L 2011 Intl. Conf. on Computer Science and Service System (CSSS) pp 182-184