Integrated Framework for Keyword-based Text Data Collection and Analysis


Sensors and Materials, Vol. 30, No. 3 (2018) 439–445
MYU Tokyo, S & M 1506

Minki Cha,1 Jung-Hyok Kwon,1 Sol-Bee Lee,1 Jaehoon Park,2 Sungkwan Youm,3 and Eui-Jik Kim1*

1 Department of Convergence Software, Hallym University, 1 Hallymdaehak-gil, Chuncheon-si, Gangwon-do 24252, South Korea
2 Department of Electronic Engineering, Hallym University, 1 Hallymdaehak-gil, Chuncheon-si, Gangwon-do 24252, South Korea
3 Department of Information and Communication, Cheju Halla University, 38 Halladaehak-ro, Jeju-si, Jeju-do 63092, South Korea

(Received August 1, 2017; accepted November 2, 2017)

Keywords: data analysis, integrated framework, intelligent service, text data collection, web crawling

In this paper, we present an integrated framework for keyword-based text data collection and analysis. The integrated framework consists of four types of components: (1) user interface, (2) web crawler, (3) data analyzer, and (4) database (DB). The user interface is used to set the input keyword and option values for web crawling and text data analysis through a graphical user interface (GUI). In addition, it provides analysis results through data visualization. The web crawler collects the text data of articles posted on the web on the basis of the input keyword. The data analyzer classifies the articles into relevant and nonrelevant articles using predefined knowledge (i.e., a set of words to be included in the articles). It then analyzes the text data of the relevant articles and visualizes the results of the data analysis. Finally, the DB stores the collected text data, the predefined knowledge defined by the user, and the results of the data analysis and data visualization. We verify the feasibility of the integrated framework through proof-of-concept (PoC) prototyping. The experimental results show that the implemented prototype collects and analyzes the text data of articles reliably.

1. Introduction

Recently, intelligent services, such as news curation and preference analysis and recommendation, have received considerable attention from both academia and industry.(1–3) These services commonly use the text data of articles posted on the web to collect the information needed by individuals. The text data can be recorded, collected, and analyzed through a sensor web, which is a special type of web-centric information infrastructure.(4) In particular, data collection and analysis are the most important functionalities of a sensor web. Therefore, web crawling to collect text data and big data analysis to analyze the collected text data are widely considered key enablers of the sensor web for such intelligent services.

* Corresponding author: e-mail: ejkim32@hallym.ac.kr
http://dx.doi.org/10.18494/sam.2018.1764
ISSN 0914-4935, MYU K.K., http://myukk.org/

To date, a number of existing studies have tried to implement these functionalities using open-source programming languages such as R, Python, and Scala.(5–7) However, most of them have not used an integrated application for both web crawling and big data analysis. In other words, the functionalities of web crawling and big data analysis have been implemented separately in most existing studies. Therefore, an unpredictable delay may occur in the targeted intelligent service, because the feasibility of such separate implementations highly depends on the developer's proficiency.(8,9) To enable seamless intelligent services, the design of an integrated architecture encompassing various functionalities (e.g., web crawling, data analysis, and user application) is required.(10)

In this paper, we propose an integrated framework of web crawling and data analysis for keyword-based text data collection and analysis to enable seamless intelligent services. The proposed framework includes the following four components: (1) user interface, (2) web crawler, (3) data analyzer, and (4) database (DB). These components interact with each other to exchange data. The user interface component helps users set the input keyword and detailed option values for web crawling and text data analysis through the graphical user interface (GUI). Moreover, it provides various data visualization results, such as word clouds and word frequency graphs, based on the results of the text data analysis. The web crawler component collects the text data of articles posted on the web and provides the datasets to be used for analysis by storing the collected text data in the DB. The data analyzer component performs data preprocessing and data analysis using the datasets provided by the web crawler component. For data preprocessing, article classification is conducted. Specifically, the articles are classified into relevant and nonrelevant articles using predefined knowledge (i.e., a set of words to be included in the article). Relevant articles are collected articles that are highly associated with the topic the user intends to search, whereas nonrelevant articles are not. The data analysis is performed in two steps. The first is a word extraction step, in which words composed of three or more characters are extracted from the text data of the relevant articles. The second is a word filtering step, which removes unnecessary words from the extracted words. The results of the data analysis are visualized in the form of word clouds and word frequency graphs. Finally, the DB component consists of three DBs: the crawling DB, predefined knowledge DB, and analysis DB. These DBs store the collected text data, the predefined knowledge, and the results of data analysis and data visualization, respectively. We verify the feasibility of the integrated framework through proof-of-concept (PoC) prototyping. The web crawler and data analyzer are implemented using the functional packages provided by open-source R, and the user interface is implemented using the Java Swing framework.(11,12) The experimental results show that the integrated framework reliably provides the web crawling and text data analysis functionalities.

The rest of this paper is organized as follows. In Sect. 2, the detailed design of the integrated framework is described. The implementation and experimental results are presented in Sect. 3. Finally, we conclude the paper in Sect. 4.
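As a rough sketch of how the four components described above might interact (all class and method names here are hypothetical illustrations, not the paper's code; the actual prototype implements crawling and analysis in R and the GUI in Java Swing):

```python
# Illustrative sketch of the four-component architecture: the crawler fills the
# crawling DB, and the analyzer keeps only articles containing a predefined-
# knowledge word. Names, data, and interfaces are hypothetical examples.

class Database:
    """Stands in for the crawling, predefined-knowledge, and analysis DBs."""
    def __init__(self):
        self.crawling = []                 # collected article texts
        self.predefined_knowledge = set()  # words a relevant article must contain
        self.analysis = []                 # analysis results

class WebCrawler:
    def __init__(self, db):
        self.db = db
    def collect(self, articles):
        # In the real framework these would be crawled from the web.
        self.db.crawling.extend(articles)

class DataAnalyzer:
    def __init__(self, db):
        self.db = db
    def run(self):
        # Preprocessing: keep articles containing any predefined-knowledge word.
        relevant = [a for a in self.db.crawling
                    if any(w in a for w in self.db.predefined_knowledge)]
        self.db.analysis = relevant        # real analyzer would extract/filter words
        return relevant

db = Database()
db.predefined_knowledge = {"cloud", "data"}   # example knowledge words
WebCrawler(db).collect(["big data analysis article", "sports news"])
print(DataAnalyzer(db).run())
```

A GUI front end (the user interface component) would only set the keyword and options and trigger these calls; the components exchange data solely through the shared DB, as in the paper's architecture.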

2. Functional Architecture of Integrated Framework

Figure 1 shows the functional architecture of the proposed integrated framework, which consists of the user interface, web crawler, data analyzer, and DB components.

The user interface component consists of three function blocks: the user input panel, result view panel, and predefined knowledge panel. The user input panel is used to set the input keyword and option values, including the range of crawling pages, the word cloud minimum frequency, and the word frequency ranking. The range of crawling pages is the range of webpages from which the text data of articles is crawled. The word cloud minimum frequency refers to the minimum frequency of the words displayed in the word cloud, which is one of the data visualization results. Note that the word cloud is an image composed of various words whose sizes differ according to their frequency. The word frequency ranking is a criterion of word frequency that determines the words displayed in the word frequency graph. The result view panel is used to provide the data visualization results after the data analysis, and the predefined knowledge panel is used to add or delete the words stored in the predefined knowledge DB of the DB component.

The web crawler component consists of the data collector and file creator function blocks. The data collector performs parsing, language support, and extraction. Parsing involves searching the webpages of articles and crawling the text data of articles on specific webpages. For this, the data collector creates a uniform resource locator (URL) to search the webpages of articles on the basis of the input keyword defined by the user and counts the number of webpages of articles using the corresponding URL. In addition, during parsing, it creates a URL for each webpage from which to crawl the articles, on the basis of the number of searched webpages of articles. It then requests a hypertext markup language (HTML) file of all the articles contained in the webpage using the corresponding URL. Language support involves encoding in the UTF-8 format to prevent data loss during web crawling. Extraction involves extracting the text data of the articles from the HTML file. The file creator creates a data file that records the extracted text data and stores the file in the crawling DB of the DB component.

Fig. 1. Functional architecture of integrated framework.
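The parsing, language support, and extraction steps of the data collector can be sketched as follows (an illustrative Python re-implementation of logic the prototype realizes in R; the search URL scheme, site, and tag selection are hypothetical examples, and no network request is actually made here):

```python
# Sketch of the data collector: build a keyword-search URL (parsing), decode
# the response as UTF-8 (language support), and pull article text out of the
# HTML (extraction). URL scheme and markup are hypothetical examples.
from html.parser import HTMLParser
from urllib.parse import urlencode

def search_url(keyword: str, page: int) -> str:
    # Hypothetical URL scheme for searching article webpages by keyword.
    return "http://www.example.com/search?" + urlencode({"q": keyword, "p": page})

class ArticleTextExtractor(HTMLParser):
    """Extraction step: collect the text inside <p> tags."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data.strip())

def extract_text(html_bytes: bytes) -> str:
    # Language support: decode as UTF-8 to avoid data loss while crawling.
    parser = ArticleTextExtractor()
    parser.feed(html_bytes.decode("utf-8", errors="replace"))
    return " ".join(c for c in parser.chunks if c)

html = b"<html><body><div><p>Big data needs crawling.</p></div></body></html>"
print(search_url("big data analysis", 1))
print(extract_text(html))
```

In the real data collector, the bytes would come from an HTTP request to the generated URL, and the extracted text would be written to a data file in the crawling DB by the file creator.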

The data analyzer component consists of three function blocks: the preprocessor, analyzer, and visualizer. The preprocessor preprocesses the extracted text data of the articles stored in the crawling DB. It classifies the articles into relevant and nonrelevant articles by comparing the text data of the articles with the set of words in the predefined knowledge. If an article contains one of the words in the predefined knowledge, the preprocessor classifies it as a relevant article. The text data of relevant articles is stored in the analysis DB of the DB component, whereas the text data of nonrelevant articles is removed. The analyzer performs two steps: word extraction and word filtering. In the word extraction step, words composed of three or more characters are extracted from the text data of the relevant articles. In the word filtering step, unnecessary words (e.g., "a", "an", "the", and "that") are removed from the text data of the relevant articles. The data analysis results are stored in the analysis DB of the DB component. The visualizer performs data visualization on the basis of the data analysis results and the option values set through the user input panel. The data visualization results take the form of word clouds and word frequency graphs and are stored in the analysis DB.

The DB component consists of the crawling DB, predefined knowledge DB, and analysis DB. The crawling DB stores the text data of articles collected from the web through the web crawler component. The predefined knowledge DB stores the set of words added by the user through the predefined knowledge panel. The analysis DB stores the text data of relevant articles, the data analysis results, and the data visualization results.

3. Implementation

The feasibility of the integrated framework is verified through PoC prototyping. For this, the GUI is implemented using the Swing framework provided by the Java Development Kit (JDK) version 1.8, and the functionalities of data collection and data analysis are implemented using open-source R version 3.4.0.

Figure 2 shows the GUI of the integrated framework. The input keyword settings and the request for searching the webpages of articles are executed through the setting group box in the GUI. In the setting group box, the Query field is used to set the input keyword, and the Confirm button is used to search the webpages of articles on the basis of the input keyword. The request for searching the webpages of articles is performed as follows. First, the HTML file containing information on the total number of webpages of articles is requested. Then, the text data whose attribute value is <select class="search-header"> is extracted from the HTML file through the xpathSApply() function to obtain the total number of webpages of articles. Next, blanks and quotes are removed from the extracted text data, and the text data (i.e., the maximum number of webpages of articles) is displayed in the Max Page label in the crawling group box. To collect the text data of articles, the range of crawling pages is set through the Crawling Page field. Then, the request for web crawling is executed through the Crawling button. In web crawling, the HTML file containing the text data of articles on a certain webpage is requested, and the text data whose attribute value is <div class="storybody"> p is extracted from the HTML file through the xpathSApply() function. Web crawling is repeatedly executed to cover the entire range of crawling pages set by the user. Finally, the text data of all the extracted articles is accumulated and stored in the crawling DB.

Fig. 2. (Color online) GUI of integrated framework.

The data preprocessing and data analysis are executed through the Analysis button. For data preprocessing, the predefined knowledge is used for article classification. The words in the predefined knowledge can be added or deleted by the user; the Add and Remove buttons in the predefined knowledge group box are used to add words to and delete words from the predefined knowledge, respectively. The str_detect() function is used to check whether the words in the predefined knowledge are included in the text data of the collected articles. If an article contains one of the words in the predefined knowledge, it is classified as a relevant article; otherwise, it is classified as a nonrelevant article. The text data of relevant articles is stored in the analysis DB, while the text data of nonrelevant articles is removed. Upon completion of the data preprocessing, the data analysis is executed, in which word extraction and word filtering are performed. For word extraction, words composed of three or more characters are extracted from the text data of the relevant articles using the sapply() function. For word filtering, unnecessary words are removed from the text data of the relevant articles using the str_detect() function. The results of the data analysis are stored in the analysis DB.

Data visualization is executed through the Visualization button. The results of data visualization are displayed in the form of word clouds and word frequency graphs. For this, the option values (i.e., the word frequency ranking and word cloud minimum frequency) set in the option group box are used. Word clouds are created using the wordcloud() function, and word frequency graphs are created using the barplot() function.
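The preprocessing and analysis chain described above can be sketched as follows (an illustrative Python re-implementation of the logic the prototype realizes with R's str_detect() and sapply(); the knowledge and filtering word lists here are short examples, not the full experimental lists):

```python
# Sketch of the Analysis-button pipeline: classify articles against the
# predefined knowledge, extract words of three or more characters, remove
# filtering words, and count frequencies for the visualizer. Word lists and
# articles are examples only.
import re
from collections import Counter

PREDEFINED_KNOWLEDGE = {"data", "iot", "hadoop", "cloud"}
FILTERING_WORDS = {"the", "are", "with", "and", "that", "this", "for"}

def is_relevant(article: str) -> bool:
    # Preprocessing: relevant if the text contains any knowledge word
    # (substring match, mirroring str_detect()).
    text = article.lower()
    return any(word in text for word in PREDEFINED_KNOWLEDGE)

def extract_words(article: str) -> list:
    # Word extraction: keep only words of three or more characters.
    return [w for w in re.findall(r"[A-Za-z]+", article.lower()) if len(w) >= 3]

def analyze(articles, top_n=10):
    relevant = [a for a in articles if is_relevant(a)]       # nonrelevant removed
    words = [w for a in relevant for w in extract_words(a)]
    words = [w for w in words if w not in FILTERING_WORDS]   # word filtering
    return Counter(words).most_common(top_n)                 # data for the graph

articles = ["Cloud platforms dominate the big data market.",
            "Local football results for the weekend."]
print(analyze(articles))
```

The returned (word, frequency) pairs are exactly what the visualizer needs: words above the word cloud minimum frequency feed the word cloud, and the top-ranked words feed the word frequency graph.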
The results of data visualization are stored in the analysis DB.

In the experiment, we collect and analyze the text data of articles posted on the ZDNet webpage.(13) For this, we set the keyword to "big data analysis" and the crawling page range to 1–300.

Since each webpage contains ten articles, a total of 3000 articles are collected. The words in the predefined knowledge and the filtering words are listed in Table 1. We set the word cloud minimum frequency to 150; thus, words with frequencies over 150 are displayed in the word cloud. We also set the word frequency ranking to ten to display ten words in the word frequency graph. The detailed experimental parameters are listed in Table 1.

Table 1
Experimental parameters.
Parameter                      Values
Keyword                        Big data analysis
Crawling page                  300
Predefined knowledge           Data, IoT, Hadoop, Cloud
Filtering words                good, will, then, the, are, with, and, that, this, but, have, has, can, for, you, from, been, more, they, said, what, its, about, how, was, which, their, into, these, when, there, those
Word cloud minimum frequency   150
Word frequency ranking         10

Figure 3 shows a word cloud result. In the figure, the words extracted from the articles with frequencies over 150 are displayed. The words with the highest frequencies have the largest font sizes and are located at the center of the word cloud. The words are represented in different colors according to frequency, with words of similar frequency sharing the same color. In our experiment, the frequency of "cloud" is the highest among all the words. Figure 4 shows the word frequency graph indicating the ten words with the highest frequencies among all the words extracted from the articles. The x- and y-axes of the graph represent the word types and the frequencies of the words, respectively. The labels on the bars show the frequency and frequency rate of each word. In the figure, the frequency of "cloud" is 2132, and the frequency rate of "cloud" is 2%.

Fig. 3. (Color online) Word cloud.
Fig. 4. (Color online) Word frequency graph.
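The frequency-rate label on each bar is simply a word's count divided by the total count of extracted words; a minimal sketch (the total word count below is a hypothetical value chosen so that "cloud" at 2132 occurrences comes out to the 2% reported in the figure):

```python
# Sketch of the frequency-rate labels on the word frequency graph:
# rate = 100 * word count / total extracted words, rounded to a whole percent.
# The total (106600) is a hypothetical figure, not from the paper.
from collections import Counter

counts = Counter({"cloud": 2132, "iot": 1500, "hadoop": 900})
total_words = 106600

def frequency_rate(word: str) -> int:
    return round(100 * counts[word] / total_words)

print(frequency_rate("cloud"))
```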

4. Conclusions

In this paper, we proposed an integrated framework of web crawling and big data analysis for keyword-based text data collection and analysis. The integrated framework consists of four components: the user interface, web crawler, data analyzer, and DB. The user interface component helps the user set the input keyword and option values through the GUI, and the web crawler component collects the text data of articles posted on webpages on the basis of the input keyword. The data analyzer component performs data preprocessing, data analysis, and data visualization. Finally, the DB component stores the collected text data, the predefined knowledge, the data analysis results, and the data visualization results. To verify the feasibility of the integrated framework, PoC prototyping was conducted. For the experiment, we collected the text data of articles from ZDNet on the basis of the user-defined input keyword and analyzed the collected text data to check the word frequencies. The data analysis results were visualized in the form of a word cloud and a word frequency graph. The experimental results showed that the integrated framework reliably provides the functionalities of web crawling and text data analysis.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1D1A1B03031055). This work was also supported by the Hallym Leading Research Group Support Program of 2017 (HRF-LGR-2017-0003).

References

1 C. Dobre and F. Xhafa: Future Gener. Comp. Syst. 37 (2014) 267.
2 W. Raghupathi and V. Raghupathi: Health Inf. Sci. Syst. 2 (2014) 1.
3 Z. Khan, A. Anjum, and S. L. Kiani: Proc. 2013 IEEE/ACM 6th Int. Conf. Utility and Cloud Computing (IEEE, 2013) 381.
4 A. Sheth, C. Henson, and S. S. Sahoo: IEEE Internet Comput. 12 (2008) 78.
5 S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin: J. Big Data 2 (2015) 24.
6 F. Morandat, B. Hill, L. Osvald, and J. Vitek: Proc. European Conf. Object-Oriented Programming (Springer, 2012) 104.
7 B. C. D. S. Oliveira and J. Gibbons: J. Funct. Program. 20 (2010) 303.
8 A. Mesbah, A. V. Deursen, and S. Lenselink: ACM Trans. Web 6 (2012) 1.
9 Y. Zhang: IEEE Trans. Serv. Comput. 9 (2016) 786.
10 S. Wang, C. Zhang, and D. Li: Proc. Int. Conf. Industrial IoT Technologies and Applications (Springer, 2016) 12.
11 R Foundation: https://www.r-project.org/ (accessed July 2017).
12 Oracle: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html (accessed July 2017).
13 ZDNet: http://www.zdnet.com/ (accessed July 2017).