pp.193-198, http://dx.doi.org/10.14257/astl.2017.143.40

A Vision Recognition Based Method for Web Data Extraction

Zehuan Cai, Jin Liu, Lamei Xu, Chunyong Yin, Jin Wang
College of Information Engineering, Shanghai Maritime University, Shanghai, China
{zhcai, jinliu, lmxu}@shmtu.edu.cn

Abstract. This paper proposes a data extraction method for Deep Web pages based on visual recognition and the Document Object Model (DOM) tree, aimed at extracting large amounts of Deep Web data. By exploiting the presentation characteristics of Deep Web data and the visual information of the web page, the method locates multiple target data regions and then extracts the data in each region accurately through DOM analysis. Experiments were conducted on several travel websites, and the results show that the efficiency and accuracy of extraction are higher than those of traditional methods.

Keywords: Deep Web, Data Extraction, Data Region Mining, Visual Feature, DOM, Deep Learning

1 Introduction

With the rapid growth of Web information, various types of Deep Web information extraction techniques and tools have emerged. Liu B [1] proposed the MDR algorithm, which detects similar nodes in a web page by analyzing its DOM structure. These nodes constitute similar sub-trees that are grouped into data regions, where each node corresponds to a data record, and extraction rules defined over the DOM structure are then used to perform the extraction. Building on MDR, Zhai Y [2], Liu B [3], and Simon K and Lausen G [4] proposed the DEPTA, NET, and VIPER algorithms, respectively. All of these algorithms define extraction rules based on an analysis of the DOM structure, which requires traversing a large number of DOM nodes and is therefore time-consuming.
It is thus difficult for them to guarantee extraction efficiency, and as web structures become increasingly complicated, the above algorithms can no longer achieve a good extraction effect. In this paper, a method based on visual recognition combined with DOM analysis is proposed to solve the problem of inefficient, purely DOM-based extraction of Deep Web data. Other researchers, domestic and abroad, have also proposed data extraction methods based on natural language processing: Califf M and Mooney R [5], Freitag D [6], and Soderland S [7] proposed RAPIER, SRV, and WHISK, respectively.

ISSN: 2287-1233 ASTL, Copyright 2017 SERSC

The main idea of these methods is to regard the entire page of the
HTML document as one large text. Meanwhile, some scholars have put forward methods using visual features; for example, Cai D [8] and Liu W [9] proposed VIPS and the VIPS-based ViDE method. However, because page designs differ widely, it is difficult to determine a uniform standard for dividing the corresponding data regions, so the generality of such methods is low.

The visual recognition proposed in this paper is based on the deep learning field. It uses genuine visual features, allowing the computer to simulate the way humans acquire information in order to locate the multiple target data regions of a Deep Web page. It therefore adapts to heterogeneous web pages and generalizes across the Deep Web data of different sites. Combining the accurate positioning of visual recognition with DOM-based data extraction can effectively improve the efficiency of extracting data from the data regions.

2 Related Research

2.1 Introduction

In this paper, we propose a new visual-recognition-based multi-region data extraction method for Deep Web pages. A convolutional neural network is used to obtain the location information of the data regions, and the predicted positions are passed to the HTML engine. We then retrieve the corresponding DOM elements from the DOM structure and finally complete the data extraction for all data regions. This section focuses on this method: first we introduce the general process of the algorithm, then the main technologies it uses, and finally its detailed steps.

2.2 Flow Chart

The method adopted in this paper mainly includes the steps shown in Figure 2.2.

Fig. 2.2. Algorithm Flow: the flow chart reflects the entire design process.
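The steps in the flow chart can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `screenshot_fn`, `predict_boxes_fn`, and `dom_lookup_fn` are hypothetical placeholders standing in for the rendering, CNN prediction, and DOM lookup components described in Section 2.3.

```python
# Sketch of the overall pipeline: screenshot -> CNN prediction of
# data-region boxes -> DOM lookup -> record extraction.
# Collaborators are passed in as callables so the sketch stays
# independent of any particular CNN framework or browser driver.

def extract_data_regions(url, screenshot_fn, predict_boxes_fn, dom_lookup_fn):
    """Return the extracted records for every predicted data region.

    screenshot_fn(url)      -> image of the rendered page
    predict_boxes_fn(image) -> list of (x, y, w, h) boxes in page pixels
    dom_lookup_fn(url, box) -> list of record strings found inside that box
    """
    image = screenshot_fn(url)        # step 1: render and capture the page
    boxes = predict_boxes_fn(image)   # step 2: CNN locates the data regions
    records = []
    for box in boxes:                 # step 3: map each box back to the DOM
        records.extend(dom_lookup_fn(url, box))
    return records
```

In a real deployment, `screenshot_fn` would wrap a headless browser, `predict_boxes_fn` the trained network, and `dom_lookup_fn` the coordinate-based DOM search of Section 2.3.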
2.3 VRDE Mechanism

1) Design of the Convolutional Neural Network

First, the training set is constructed. For each training sample we obtain the location and size of its data regions and use them as the sample's label. The training set is then thresholded, and finally the training-set file is generated.

In this paper, a convolutional neural network is used to locate the data regions. The convolutional neural network is an efficient image recognition method developed in recent years and an important application of deep learning in the image processing field; it is widely applied in handwritten character recognition, face recognition, and object detection, and has achieved good performance. A classification model based on a convolutional neural network can take a two-dimensional image directly as input and produce the classification result at the output. However, a traditional classification model cannot be used for a regression problem such as predicting the positions of multiple data regions in a Deep Web page. We therefore use the nonlinear sigmoid function for the regression output; its range between 0 and 1 conforms to the definition of the boundary detection value (IOU) of the target area.

The CNN model includes five convolutional layers (C), four sampling layers (S), and two fully connected layers (F). The preprocessed training set is fed into the convolutional neural network to train the model, and SGD (stochastic gradient descent) is used to optimize the parameters of the whole network. The input of the network is a 128 × 128 image matrix, and all network parameters are randomly initialized from a Gaussian distribution. For all layers, the activation function is the non-linear rectified linear unit (ReLU), which avoids the problem of the network training too slowly in the early stages.
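To make the layer arrangement concrete, the sketch below traces the spatial size of the 128 × 128 input through one plausible interleaving of the five convolutional and four sampling layers described above. The specific kernel choices (3 × 3 "same"-padded convolutions, 2 × 2 pooling) and the layer order are our assumptions for illustration; the paper does not state them.

```python
# Trace the feature-map size through an assumed C/S layer stack:
# five 3x3 same-padding convolutions (size-preserving) interleaved with
# four 2x2 pooling layers (each halves the spatial size).
# Kernel sizes and ordering are illustrative assumptions, not from the paper.

def feature_map_sizes(input_size=128):
    sizes = [("input", input_size)]
    size = input_size
    for name in ["C1", "S1", "C2", "S2", "C3", "S3", "C4", "S4", "C5"]:
        if name.startswith("S"):
            size //= 2   # 2x2 pooling halves the spatial size
        # 3x3 convolution with 'same' padding leaves the size unchanged
        sizes.append((name, size))
    return sizes
```

Under these assumptions the final feature maps are 8 × 8; they would be flattened and passed through the two fully connected layers, the last of which has the 8 sigmoid units described below.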
Because the whole network has many parameters, in order to avoid over-fitting during training we set the Dropout parameter to 0.25 in each layer. A sigmoid is applied to the final fully connected layer, and the 8-dimensional output is interpreted as the positions and sizes of the data regions in the picture. Let the output of the network for the two data regions of the i-th image be:

Y_pred[i][0], Y_pred[i][1], Y_pred[i][2], Y_pred[i][3],
Y_pred[i][4], Y_pred[i][5], Y_pred[i][6], Y_pred[i][7]

The position values give the coordinates of the upper-left corner of each data region as fractions of the width and height of the original image, e.g.:

Y_pred[i][0] = startx / new_width
Y_pred[i][4] = startx / new_width

and the remaining values represent the width and height of each data region relative to the original image width (new_width) and height (new_height):

Y_pred[i][2] = width / new_width
Y_pred[i][3] = righty1 / new_height
Y_pred[i][6] = width / new_width
Y_pred[i][7] = (height - lefty2) / new_height

Here startx is the abscissa of the upper-left corner of the first data region, righty1 is the vertical extent of the first data region, (height - lefty2) is the vertical extent of the second data region, new_width is the width of the original image, and new_height is its height.
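Under this encoding, the network output can be converted back to pixel rectangles. The sketch below assumes, as one reading of the indices above, that positions 0-3 describe the first region and 4-7 the second, each as (x, y, width, height) fractions of the screenshot; the paper does not spell out every index, so this exact layout is an assumption.

```python
# Decode the 8-dimensional sigmoid output into two pixel-space boxes.
# Assumed layout: [x1, y1, w1, h1, x2, y2, w2, h2], each value a fraction
# of the screenshot's width (for x, w) or height (for y, h).

def decode_boxes(y_pred, new_width, new_height):
    boxes = []
    for i in (0, 4):                       # one slice of four values per region
        x_frac, y_frac, w_frac, h_frac = y_pred[i:i + 4]
        boxes.append((
            round(x_frac * new_width),     # upper-left x in pixels
            round(y_frac * new_height),    # upper-left y in pixels
            round(w_frac * new_width),     # region width in pixels
            round(h_frac * new_height),    # region height in pixels
        ))
    return boxes
```

For example, `decode_boxes([0.1, 0.2, 0.5, 0.25, 0.1, 0.6, 0.5, 0.25], 128, 128)` returns `[(13, 26, 64, 32), (13, 77, 64, 32)]`.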
We define the error between the true and predicted positions of the data regions as follows. The loss function is:

Loss_function = 10 * (y_true - y_pred)^2.    (1)

That is, the loss between the true and predicted positions of a data region is a squared (Euclidean) distance, and the magnification factor 10 makes training more effective. The network also uses the standard IOU criterion for data region detection: if IOU > 50%, the prediction is regarded as a positive sample, and the higher the IOU value, the more accurate the predicted boundary of the data region. IOU is defined as:

IOU = (Area_pred ∩ Area_true) / (Area_pred ∪ Area_true).    (2)

2) DOM Tree Construction for Data Extraction

We send a request to the server using the URL of the Deep Web page to obtain the corresponding HTML page, and construct the DOM syntax tree from the HTML source. The constructed DOM tree has the following characteristics: a DOM tree node contains a data record, and within the same data region the data record nodes are adjacent and share a common parent node. Once the model is established, we feed a screenshot of the visited web page into the convolutional neural network model we built. The model yields the predicted positions of the multiple data regions; we pass these coordinates to the DOM tree, search all root and child nodes related to the current DOM element, and thereby obtain the complete DOM elements of the multiple data regions. Finally, we apply the corresponding extraction rules to accomplish the data extraction for each region.

3 Experiments

In this paper, the training set contains 58,500 samples in total, each of size 128 × 128. There are one hundred and ninety-five images with different data sizes.
Those images are placed at different locations on a 128 × 128 white background image. After the corresponding preprocessing, the samples are passed to the convolutional neural network to train the model.

Because most Deep Web data is presented in DIVs and tables, in order to verify the validity of the proposed multi-region Deep Web extraction algorithm based on visual recognition and the DOM, we combined data from the same travel website into web page screenshots, each containing two data regions presented as DIVs, and compared the extraction results with those of the VIPS algorithm. The experiment measured crawling performance on a single machine in a 50 Mb shared network environment. We randomly selected 30 pages from the website and measured the extraction time from the beginning of extraction of one page to the next
page. Figure 3.1 shows the results of the crawl: the abscissa represents the number of pages extracted, and the ordinate represents the total time taken to extract the corresponding pages. The detailed extraction times are shown in Table 3.1.

Table 3.1. Details of Extraction Time

Extraction Algorithm           Our Method   VIPS
Extract Five Pages (s)         20.15        108.13
Extract Ten Pages (s)          39.56        148.27
Extract Fifteen Pages (s)      59.64        192.23
Extract Twenty Pages (s)       77.10        244.17
Extract Twenty-Five Pages (s)  95.56        292.11
Extract Thirty Pages (s)       117.69       345.56

Fig. 3.1. Performance of Data Extraction

4 Conclusions

For the Deep Web query result page, this paper proposes a data extraction method based on the visual information of the web page and the DOM tree. Its distinguishing feature is the combination of visual information with DOM node information. Compared with VIPS and other methods, it needs neither extensive DOM-tree similarity comparisons nor the visual information of all nodes, so the efficiency of data extraction is considerably improved. Finally, an experiment on extracting data records is presented. The results show that this method is
effective and can be used to extract Deep Web page data quickly and accurately. In addition, because deep learning methods are introduced, the method is more general: the problem of extraction efficiency and accuracy across heterogeneous Deep Web pages is alleviated. Although this method adapts well to extracting multiple data regions from the Deep Web, the interference of web page noise, i.e., other unrelated data in the page, cannot be removed completely. This is the next step for improvement and research.

References

1. Liu B, Grossman R, Zhai Y. Mining data records in Web pages. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003: 601-606
2. Zhai Y, Liu B. Web data extraction based on partial tree alignment. In: Proceedings of the 14th international conference on World Wide Web. ACM, 2005: 76-85
3. Liu B, Zhai Y. NET: A System for Extracting Web Data from Flat and Nested Data Records. In: Proc of the 6th International Conference on Web Information Systems Engineering. New York: Springer, 2005: 487-495
4. Simon K, Lausen G. ViPER: Augmenting Automatic Information Extraction with Visual Perceptions. In: Proc of the 14th ACM International Conference on Information and Knowledge Management. Bremen: ACM, 2005: 381-388
5. Califf M, Mooney R. Relational Learning of pattern-match rules for information extraction. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence and Eleventh Conference on Innovative Applications of Artificial Intelligence. Orlando, Florida, 1999: 328-334
6. Freitag D. Machine learning for information extraction in informal domains. Machine Learning, 2000, 39(2-3): 169-202
7. Soderland S.
Learning information extraction rules for semi-structured and free text. Machine Learning, 1999, 34(1-3): 233-272
8. Cai D, Yu S, Wen J R, et al. VIPS: a vision-based page segmentation algorithm. Microsoft Technical Report, MSR-TR-2003-79, 2003
9. Liu W, Meng X, Meng W. ViDE: A Vision-Based Approach for Deep Web Data Extraction. IEEE Transactions on Knowledge & Data Engineering, 2009, 22(3): 447-460
10. Liu B, Yu Y. Web Data Mining. Tsinghua University Press, 2013: 265-269
11. HTML DOM. http://www.w3school.com.cn/htmldom/