A Vision Recognition Based Method for Web Data Extraction

pp. 193-198, http://dx.doi.org/10.14257/astl.2017.143.40
ISSN: 2287-1233 ASTL. Copyright 2017 SERSC

Zehuan Cai, Jin Liu, Lamei Xu, Chunyong Yin, Jin Wang
College of Information Engineering, Shanghai Maritime University, Shanghai, China
{zhcai, jinliu, lmxu}@shmtu.edu.cn

Abstract. This paper proposes a data extraction method for Deep Web pages that combines visual recognition with the Document Object Model (DOM) tree in order to extract large amounts of Deep Web data. Using the way Deep Web data is presented and the visual information of the web page, the method locates multiple target data regions and then extracts the data in each region accurately through DOM analysis. Experiments on several travel websites show that the extraction is more efficient and more accurate than with traditional methods.

Keywords: Deep Web, Data Extraction, Data Region Mining, Visual Feature, DOM, Deep Learning

1 Introduction

With the rapid growth of Web information, various Deep Web information extraction techniques and tools have appeared. Liu B [1] proposes the MDR algorithm, which analyzes the DOM structure of a page and detects similar nodes in it. These nodes form similar sub-trees that are grouped into data regions, where each node corresponds to a data record, and extraction rules are then defined over the DOM structure. Building on MDR, Zhai Y [2], Liu B [3], and Simon K and Lausen G [4] proposed the DEPTA, NET, and VIPER algorithms. All of these algorithms define their extraction rules from an analysis of the DOM structure, so they must traverse a large number of DOM nodes, which costs a lot of time. It is therefore difficult to guarantee extraction efficiency, and as web page structure becomes more complicated these algorithms no longer extract well.

Other researchers have proposed data extraction methods based on natural language processing. Califf M and Mooney R [5], Freitag [6], and Soderland [7] proposed RAPIER, SRV, and WHISK. The main idea of these methods is to treat the entire HTML document as one large text. Some scholars have also put forward methods that use visual features; for example, Cai D [8] and Liu W [9] proposed VIPS and the VIPS-based VIDE method. However, because page designs differ, it is difficult to fix a uniform standard for dividing a page into data regions, so such methods generalize poorly.

This paper proposes a method that combines visual recognition with DOM analysis to address the inefficiency of extracting Deep Web data from the DOM structure alone. The visual recognition is based on deep learning. It uses genuine visual features: the computer imitates the way a person takes in information in order to locate multiple target data regions on a Deep Web page. The approach adapts to heterogeneous web pages, so it applies to Deep Web data from different websites. Combining accurate visual localization with DOM-based extraction effectively improves the efficiency of extracting the data in each data region.

2 Related Researches

2.1 Introduction

In this paper, we propose a new multi-region data extraction method for Deep Web pages based on visual recognition. A convolutional neural network predicts the location of each data region, and the prediction is passed to the HTML engine. The corresponding DOM element is then obtained from the DOM structure, and finally the data of every data region is extracted. This section focuses on this method. We first introduce the general process of the algorithm, then the main technologies it uses, and finally its detailed steps.

2.2 Flow chart

The main steps of the method are shown in Figure 2.2.

Fig. 2.2. Algorithm Flow: The flow chart reflects the entire design process.
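The hand-off from a predicted data region to a DOM element, described in Section 2.1, can be illustrated with a small sketch. The paper does not say how the HTML engine maps coordinates to elements, so everything here is an assumption: the `Node` class, the `(x, y, width, height)` box layout, and the `min_cover` threshold are hypothetical names introduced only for illustration. In a real browser, an API such as `document.elementFromPoint` might play a similar role.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height) in page pixels


@dataclass
class Node:
    """Hypothetical stand-in for a rendered DOM element and its bounding box."""
    tag: str
    box: Box
    children: List["Node"] = field(default_factory=list)


def coverage(box: Box, region: Box) -> float:
    """Fraction of the region's area that lies inside the box."""
    bx, by, bw, bh = box
    rx, ry, rw, rh = region
    if rw <= 0 or rh <= 0:
        return 0.0
    overlap_w = max(0.0, min(bx + bw, rx + rw) - max(bx, rx))
    overlap_h = max(0.0, min(by + bh, ry + rh) - max(by, ry))
    return (overlap_w * overlap_h) / (rw * rh)


def locate_region_node(node: Node, region: Box,
                       min_cover: float = 0.9) -> Optional[Node]:
    """Return the deepest node that covers at least min_cover of the region.

    min_cover is an assumed tolerance for imprecise predictions,
    not a value taken from the paper.
    """
    if coverage(node.box, region) < min_cover:
        return None
    for child in node.children:
        hit = locate_region_node(child, region, min_cover)
        if hit is not None:
            return hit
    return node
```

Under these assumptions, the children of the returned node would be the candidate data records, which the extraction rules can then process.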

2.3 VRDE Mechanism

1) Design of Convolutional Neural Network

First, the training set is constructed. For each training image we record the location and size of the data region and use them as the label of that sample. The training set is then thresholded, and finally the training set file is generated.

In this paper, a convolutional neural network is used to locate the data regions. The convolutional neural network is an efficient image recognition method developed in recent years and an important application of deep learning in image processing. It is widely applied in handwritten character recognition, face recognition, and object detection, where it achieves good performance. A convolutional classification model can take a two-dimensional image directly as input and give the classification result at the output. However, a traditional classification model cannot be used for a regression problem such as predicting the positions of multiple data regions in a Deep Web page. We therefore use the nonlinear sigmoid function for the regression problem. Its values lie between 0 and 1, which conforms to the definition of the target area boundary detection value (IOU).

The CNN model includes four sampling layers (S), five convolution layers (C), and two fully connected layers (F). The preprocessed training set is fed into the network to train the model, and SGD (stochastic gradient descent) is used to optimize the parameters of the whole network. The input of the network is a 128 × 128 image matrix. All network parameters are randomly initialized from a Gaussian distribution. The activation function for the layers is the rectified linear unit (ReLU), which avoids slow training in the early stage.
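To make the layer counts concrete, the sketch below traces the spatial size of a 128 × 128 input through one possible arrangement of the layers. The paper gives only the counts (five C, four S, two F); the alternating order, the size-preserving convolutions, and the 2 × 2 sampling window are all assumptions made for illustration.

```python
# Hypothetical layer order: the paper states the counts of each layer type,
# not how they are interleaved.
LAYERS = ["C", "S", "C", "S", "C", "S", "C", "S", "C", "F", "F"]


def trace_shapes(size=128, pool=2):
    """Return (layer, spatial size) pairs for a square input.

    Assumes convolutions keep the spatial size and each sampling layer
    divides it by `pool`. Fully connected layers have no spatial extent,
    which is recorded here as size 1.
    """
    shapes = []
    for layer in LAYERS:
        if layer == "S":
            size //= pool
        elif layer == "F":
            size = 1
        shapes.append((layer, size))
    return shapes
```

With these assumptions, the feature maps shrink from 128 to 8 pixels per side before the fully connected layers.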
Because the network has many parameters, we set the Dropout parameter to 0.25 in each layer to avoid over-fitting during training. A sigmoid is applied to the final fully connected layer, and its 8-dimensional output is interpreted as the positions and sizes of the data regions in the picture. Let the output of the network for the two data regions of the i-th image be:

Y_pred[i][0] Y_pred[i][1] Y_pred[i][2] Y_pred[i][3]
Y_pred[i][4] Y_pred[i][5] Y_pred[i][6] Y_pred[i][7]

The leading values of each row give the coordinates of the upper-left corner of a data region as a ratio of the width and height of the original picture. That is:

Y_pred[i][0] = startx/new_width
Y_pred[i][4] = startx/new_width

The last two values of each row give the width and height of the data region relative to the width w (new_width) and height h (new_height) of the original image, namely:

Y_pred[i][2] = width/new_width
Y_pred[i][3] = righty1/new_height
Y_pred[i][6] = width/new_width
Y_pred[i][7] = (height-lefty2)/new_height

Here startx is the abscissa of the upper-left corner of the first data region, righty1 is the height of the first data region, (height-lefty2) is the height of the second data region, new_width is the width of the original image, and new_height is its height.
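The encoding above can be inverted to recover pixel boxes from a network output. The following is a minimal sketch, not the authors' code. It assumes that Y_pred[i][1] and Y_pred[i][5] hold the upper-left ordinate as a ratio of new_height, which the text does not spell out, and the function name `decode_regions` is hypothetical.

```python
def decode_regions(y_pred, new_width, new_height):
    """Convert one 8-value output row into two (x, y, width, height) pixel boxes.

    Assumed layout per region: [x ratio, y ratio, width ratio, height ratio].
    The y ratio at indices 1 and 5 is an assumption; the source lists only
    the x, width, and height ratios explicitly.
    """
    if len(y_pred) != 8:
        raise ValueError("expected 8 values, one group of 4 per data region")
    boxes = []
    for start in (0, 4):
        x_ratio, y_ratio, w_ratio, h_ratio = y_pred[start:start + 4]
        boxes.append((
            round(x_ratio * new_width),
            round(y_ratio * new_height),
            round(w_ratio * new_width),
            round(h_ratio * new_height),
        ))
    return boxes
```

Because the sigmoid keeps every output between 0 and 1, the decoded boxes always fall within the bounds of the original image dimensions passed in.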

We define here the error between the true values and the predicted values of the data region. The loss function is:

$$\text{Loss\_function} = 10 \cdot (y_{\text{true}} - y_{\text{pred}})^2 \tag{1}$$

The loss uses the Euclidean distance between the true position and the predicted position of the data region, and a magnification factor (the factor 10 in Eq. (1)) is applied to make training more effective. The network also sets an IOU standard for data region detection: if IOU > 50%, the data region is regarded as a positive sample. A higher IOU value means a more accurate prediction of the data region boundary. IOU is defined as:

$$\text{IOU} = \frac{\text{Area}_{\text{pred}} \cap \text{Area}_{\text{true}}}{\text{Area}_{\text{pred}} \cup \text{Area}_{\text{true}}} \tag{2}$$

2) DOM Tree Construction for Data Extraction

We send a request to the server using the URL of the Deep Web page and obtain the corresponding HTML page. The DOM tree is constructed from the HTML source. The constructed DOM tree has the following characteristics: a DOM tree node contains a data record, and within the same data region the data record nodes are adjacent and share a common parent node.

Once the model is established, we take a screenshot of the visited web page and feed it into the convolutional neural network model. The model returns the predicted positions of the multiple data regions. The coordinates of each position are then passed to the DOM tree, all the root nodes and child nodes related to the current DOM element are searched, and this search yields the DOM elements of each complete data region. Finally, the corresponding extraction rules are used to extract the data from each data region.

3 Experiments

The training set contains 58500 samples, and each training sample has a size of 128 * 128. There are one hundred and ninety-five images with different data sizes.
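The training objective and the detection criterion from Section 2.3, Eq. (1) and Eq. (2), can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: summing the squared error over all output components is an assumption, boxes are taken as `(x, y, width, height)` tuples, and the function names are hypothetical.

```python
def scaled_squared_loss(y_true, y_pred, factor=10.0):
    """Eq. (1) applied to each component and summed (the sum is an assumption)."""
    return factor * sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def iou(box_pred, box_true):
    """Eq. (2): intersection area divided by union area of two boxes."""
    px, py, pw, ph = box_pred
    tx, ty, tw, th = box_true
    inter_w = max(0.0, min(px + pw, tx + tw) - max(px, tx))
    inter_h = max(0.0, min(py + ph, ty + th) - max(py, ty))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    return inter / union if union > 0 else 0.0


def is_positive(box_pred, box_true, threshold=0.5):
    """A prediction counts as a positive sample when IOU exceeds 50%."""
    return iou(box_pred, box_true) > threshold
```

For example, two 2 x 2 boxes offset horizontally by one unit share an area of 2 out of a union of 6, so their IOU is 1/3 and the prediction would not count as positive.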
These images are placed at different locations on a 128 * 128 white background image. After the corresponding preprocessing, the samples are passed to the convolutional neural network to train the model.

Most Deep Web data is presented in DIVs and tables. To verify the validity of the Deep Web multiple data region extraction algorithm based on visual recognition and DOM, this paper combines the data of the same way website and takes a screenshot of the web page. The screenshot contains two data regions presented in divs. The extraction results are compared with those of the VIPS algorithm. The crawl was run on one machine in a 50M shared network environment. In the experiment, we randomly select 30 pages from the same way website and measure the extraction time from the start of extracting one page to the next page.

Figure 3.1 shows the results of the crawl: the abscissa is the number of pages extracted, and the vertical axis is the total time taken to extract those pages. The detailed extraction times are given in Table 3.1.

Table 3.1. Details of Extraction Time

Extraction Algorithm             Our Method    VIPS
Extract Five Pages (s)           20.15         108.13
Extract Ten Pages (s)            39.56         148.27
Extract Fifteen Pages (s)        59.64         192.23
Extract Twenty Pages (s)         77.10         244.17
Extract Twenty-Five Pages (s)    95.56         292.11
Extract Thirty Pages (s)         117.69        345.56

Fig. 3.1. Performance of Data Extraction

4 Conclusions

For Deep Web query result pages, this paper proposes a data extraction method based on the visual information of the web page and the DOM tree. It is characterized by the combination of visual information and DOM node information. Compared with VIPS and other methods, this method does not need to compare the similarity of many DOM trees or to obtain the visual information of all nodes, so the efficiency of data extraction is greatly improved. Finally, an experiment on extracting data records is reported. The results show that the method is effective and can extract the data of Deep Web pages quickly and accurately. In addition, because it introduces deep learning, the method is more general, and it solves the problem of reduced extraction efficiency and accuracy caused by the heterogeneity of Deep Web pages.

Although the method adapts well to extracting data from multiple data regions of Deep Web pages, the interference of web page noise cannot be removed completely. Page noise is other, unrelated data in the web page. Removing it is the next step for improvement and research.

References

1. Liu B, Grossman R, Zhai Y. Mining data records in Web pages. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003: 601-606
2. Zhai Y, Liu B. Web data extraction based on partial tree alignment. In: Proceedings of the 14th International Conference on World Wide Web. ACM, 2005: 76-85
3. Liu B, Zhai Y. NET: A System for Extracting Web Data from Flat and Nested Data Records. In: Proceedings of the 6th International Conference on Web Information Systems Engineering. New York: Springer, 2005: 487-495
4. Simon K, Lausen G. VIPER: Augmenting Automatic Information Extraction with Visual Perceptions. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management. Bremen: ACM, 2005: 381-388
5. Califf M, Mooney R. Relational learning of pattern-match rules for information extraction. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence and Eleventh Conference on Innovative Applications of Artificial Intelligence. Orlando, Florida, 1999: 328-334
6. Freitag D. Machine learning for information extraction in informal domains. Machine Learning, 2000, 39(2-3): 169-202
7. Soderland S. Learning information extraction rules for semi-structured and free text. Machine Learning, 1999, 34(1-3): 233-272
8. Cai D, Yu S, Wen J R, et al. VIPS: a vision-based page segmentation algorithm. Microsoft Technical Report MSR-TR-2003-79, 2003
9. Liu W, Meng X, Meng W. VIDE: A Vision-Based Approach for Deep Web Data Extraction. IEEE Transactions on Knowledge & Data Engineering, 2009, 22(3): 447-460
10. Liu B, Yu Y. Web Data Mining. Tsinghua University Press, 2013: 265-269
11. HTML DOM. http://www.w3school.com.cn/htmldom/