A New Mode of Browsing Web Tables on Small Screens

Wenchang Xu, Xin Yang, Yuanchun Shi
Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China
stefanie8806@gmail.com; yang-x02@mails.tsinghua.edu.cn; shiyc@tsinghua.edu.cn

Abstract

Tables are widely used in web pages. However, most web tables can only be viewed passively: readers cannot rearrange them to suit their needs, which is inconvenient on handheld devices, especially for large tables. We therefore propose a new browsing mode to improve the user experience when encountering large web tables on small screens. Based on automatic table detection and an understanding of table contents, we design and implement a plug-in for Microsoft Internet Explorer that provides a customized user interface enabling several new operations on web tables, such as sorting a table by a column or row and hiding or showing certain columns or rows. In our user study, most participants accepted the work and rated it highly.

Index Terms: Web table, Small screens, Table extraction, Table understanding, User interface design

I. INTRODUCTION

During the past few years, web surfing on small-screen devices such as mobile phones and PDAs has become increasingly popular, as these devices are more convenient and immediate for users than PCs. At the same time, web authoring technology has developed rapidly, and an exploding amount of information reaches users through page elements such as text, images, forms and tables. Among these elements, the table is one of the most widely used: it is a two-dimensional element consisting of many items, which displays the features of each item and shows the relationships between different items and different features.
However, most web tables are designed only for PCs and focus on presenting large amounts of data visually and logically, without considering how they appear on small-screen devices. Therefore, when we look at a web table on a mobile phone, we usually have to press the direction keys repeatedly, scrolling up and down, left and right, to check the column or row heading of a cell or to find the cells we need. This is time-consuming and inconvenient for users of handheld devices.

In this paper, we propose a new mode of browsing web tables on small-screen devices. After studying the features of handheld devices and analyzing users' habits when viewing web tables, we conclude that letting users actively operate on tables, for example sorting a table by a column or row or hiding and showing certain columns or rows, can greatly improve the user experience on handheld devices. As a first step, we designed and implemented a plug-in for Microsoft Internet Explorer, which provides a customized user interface for these operations.

In this paper, a genuine table [12] is one used to display logically related data with significant semantics, while a non-genuine table is one used to group unrelated data or layout elements together in order to improve the appearance and understandability of web pages. Our implementation can be divided into three steps: automatically detecting genuine web tables by machine learning; understanding the contents of a genuine web table and reconstructing a customized structure for it; and designing the user interface so that users can easily access all the operations on a genuine table. In the first step, we analyze the HTML documents, use the web table features summarized in previous work [12], and implement the Naive Bayesian classification algorithm.
In the second step, we analyze the contents of the detected genuine tables and reconstruct their structure as a matrix, which exposes a functional interface that the user interface can call easily. In the last step, we design the user interface and integrate it with Microsoft Internet Explorer.

In the experiment, we conducted a user study, which shows that the new browsing mode can effectively improve the user experience of viewing web tables. Most users think highly of the tool, especially the sorting and hiding functions. They also believe that the new browsing mode will bring more convenience and efficiency when viewing web tables on small-screen devices. We also compare our work with similar work by others, as described in detail in the Related Work section. Based on the experiment and this comparison, we draw conclusions and propose several directions for future work at the end of the paper.

II. NEW BROWSING MODE

Generally speaking, our new browsing mode lets users actively operate on web tables so that the tables can be viewed better on the small screens of handheld devices and users can get the information they need more conveniently. How should a table be operated on? We approach the answer step by step, by examining the features of handheld devices and analyzing users' habits when viewing web tables.

A. Features of Handheld Devices

Compared to PCs, handheld devices have three distinct features. The first is that the screens of handheld devices are much smaller than those of PCs. As a result, content that is fully displayed on a PC may be only partly visible on a handheld device: although we can see a whole web table on a PC, we may see only a few of its cells on a small-screen device. The second is that handheld devices generally have far fewer and smaller keys than PCs, which makes operations harder. For example, when browsing a web table on a PC, users can press the Page Down key to jump to the content below, while on a mobile phone they can only press the Down key again and again to move down gradually. The third is that handheld devices are often operated with a single finger, whereas PC users type with all the fingers of both hands. Therefore, operations on handheld devices are more time-consuming.

B. Users' Habits When Viewing Web Tables

From personal experience and a survey of other people, we summarize users' habits when viewing web tables as follows:

- look at the whole structure of the table
- look at certain columns or rows of the table
- find out the order of certain columns or rows of the table
- compare several columns or rows of the table

So far, few tools have been developed to support these browsing habits. Consequently, a user on a mobile phone who wants to see a web table ordered by a certain column has to scroll up and down again and again to work out the order of that column and then sort the table mentally. Likewise, a user who wants to compare two columns that are far apart has to scroll left and right again and again to match each pair of values. Browsing a web table is therefore rather time-consuming, especially on a small-screen device.

Based on these observations and problems, we propose the following two active operations on web tables:

- Sort the table in ascending or descending order by a certain column or row, when that column or row has a heading
- Hide and show a certain column or row, when that column or row has a heading

Given these two functions, a user who wants to sort a table only needs to move the cursor to the heading to sort by and click the sorting button. Similarly, a user who wants to compare two columns that are far apart only needs to hide the columns between them and then compare them directly. The effect is especially obvious on a small-screen device. The following group of figures shows the effect of sorting and hiding.

Fig. 1. Initial web table.
Fig. 2. The table sorted in ascending order by the Singer heading.
Fig. 3. The table with the Singer column hidden.

III. IMPLEMENTATION

Following the idea above, we designed and implemented a plug-in for Microsoft Internet Explorer that realizes these functions, as the first step of our research. Our implementation can be divided into three steps:

- Genuine tables detecting
- Contents understanding and table structure reconstructing
- User interface designing

A. Genuine tables detecting

We label each web table as either a genuine table or a non-genuine table, following previous work [3, 6, 12]. Here we partially adopt the automatic web table detection method proposed by Wang and Hu [12], which summarizes 15 features of web tables, including 7 layout features and 8 content features.
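Detection of this kind amounts to a binary classifier applied to one numeric feature vector per table. The sketch below is a minimal Naive Bayes classifier in plain Python, offered only as an illustration: the Gaussian likelihood, the class name `GaussianNaiveBayes`, the two toy features and all training values are assumptions made here and are not taken from the plug-in or from [12].

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Tiny Gaussian Naive Bayes for numeric feature vectors (illustrative)."""

    def fit(self, vectors, labels):
        grouped = defaultdict(list)
        for vector, label in zip(vectors, labels):
            grouped[label].append(vector)
        self.priors = {}
        self.stats = {}
        for label, rows in grouped.items():
            # Class prior: share of training tables carrying this label.
            self.priors[label] = len(rows) / len(vectors)
            self.stats[label] = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                var = sum((v - mean) ** 2 for v in column) / len(column)
                # A small constant keeps the variance strictly positive.
                self.stats[label].append((mean, var + 1e-9))
        return self

    def log_score(self, vector, label):
        # log P(label) + sum of per-feature Gaussian log-likelihoods.
        score = math.log(self.priors[label])
        for value, (mean, var) in zip(vector, self.stats[label]):
            score -= 0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        return score

    def predict(self, vector):
        return max(self.priors, key=lambda label: self.log_score(vector, label))


# Made-up training data: [row count, fraction of cells holding plain text].
train_x = [[12, 0.90], [20, 0.80], [15, 0.95], [2, 0.10], [1, 0.20], [3, 0.15]]
train_y = ["genuine"] * 3 + ["non-genuine"] * 3
model = GaussianNaiveBayes().fit(train_x, train_y)
print(model.predict([18, 0.85]))
```

In a real setting the feature vector would hold the 15 layout and content features, and the training set would be the labelled tables rather than six invented rows.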

Layout features are calculated from the number of rows, the number of columns and the cell lengths, while content features are calculated from the number of table cells of each content type, such as image, form, hyperlink, alphabetical and so on. We first implement the Naive Bayesian classification algorithm on top of these extracted features. We collect 1774 web tables, of which 233 are genuine tables and 1541 are non-genuine tables, and extract their features as training data. Then, when a web table is encountered, its features are passed to the classifier, which decides whether it is a genuine table or not.

B. Contents understanding and table structure reconstructing

After a genuine table has been detected, we process it further in the following three respects:

- determining table type
- identifying cell data type
- reconstructing table structure

For the first aspect, we classify genuine web tables into three sub-categories according to their table header style: column-wise, column-row-wise and row-wise. The type of each genuine table is decided by the layout of the TH elements within it. If a genuine table contains no TH element, it is classified as the default type (column-wise) and each cell in the first row is regarded as a table header.

For the second aspect, we define seven basic data types for table cells: image, form, hyperlink, alphabetical, digit, empty and others. Each corresponds to the content type of the same name used by the pre-processor. The type of each cell in a genuine table is identified by six heuristics, H1-H6, shown in Table 1.

Table 1. Heuristics for identifying cell data type
No.  Description
H1   If the cell contains a FORM element, the cell is identified as form; otherwise use H2.
H2   If the cell contains an IMG element, the cell is identified as image; otherwise use H3.
H3   If the cell contains an A element, the cell is identified as hyperlink; otherwise use H4.
H4   If the inner text length of the cell (not including blanks) is 0, the cell is identified as empty; otherwise use H5.
H5   If the inner text of the cell consists only of digits, the cell is identified as digit; otherwise use H6.
H6   If the inner text of the cell consists only of digits and alphabetical characters, the cell is identified as alphabetical; otherwise the cell is identified as others.

For the third aspect, we reconstruct each genuine table as a data matrix. The width of the matrix equals the maximum cell count in a row of the genuine table, while the height of the matrix equals the row count of the genuine table. Each element of the matrix represents the corresponding table cell in the genuine table. Table cells with ROWSPAN or COLSPAN attributes are represented by multiple matrix elements with the same properties. Each matrix element is recorded as an attribute-value pair together with the header information, denoted as <<data-type, content>, isheader, headerinfo>. Compared with the DOM structure of a TABLE element, our customized structure can be traversed at much lower cost.

C. User interface designing

Based on the first two steps, we designed a customized user interface so that users can operate on web tables conveniently while browsing the web. The user interface, shown in Fig. 4, is a floating toolbar with six buttons and adjustable transparency that appears automatically when the cursor hovers over a header cell of a genuine web table.

Fig. 4. User interface of the IE plug-in.

Table 2 presents the function of each button, from the left side of the toolbar to the right. By pressing these buttons, users can control the appearance of the table in different ways without changing the layout of the web page.

Table 2. Functions provided by the toolbar
No.  Name              Description
F1   Ascent Sorting    Sorting the data of the entire table in ascending order by the data of the current column or row
F2   Descent Sorting   Sorting the data of the entire table in descending order by the data of the current column or row
F3   Initial Order     Restoring the data of the entire table to its initial order
F4   Hiding Info       Hiding the current column or row
F5   Showing Info      Showing the column or row next to the current one
F6   Restoring All     Restoring the table to its initial state

Besides the functions above, we provide several other features to improve the user experience:

- Marking the button over which the cursor is currently hovering in yellow.
- Marking the button that was clicked last in blue.
- Creating a log file that records all events the toolbar catches, such as showing the toolbar, clicking a button, adjusting the transparency and so on.
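The cell-typing heuristics H1-H6 and the matrix reconstruction of Section III.B, together with the sorting and hiding operations behind F1, F2 and F4, can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the plug-in's code: `RawCell` is a hypothetical pre-parsed cell, the `headerinfo` field is omitted, blanks are ignored for H5 and H6 as well as H4, only column-wise tables (header in the first row) are handled, and ordering digits before text is a choice made for this sketch.

```python
import re
from dataclasses import dataclass


@dataclass
class RawCell:
    """Hypothetical pre-parsed cell: inner text, child tag names, spans."""
    text: str
    tags: frozenset = frozenset()
    rowspan: int = 1
    colspan: int = 1
    is_header: bool = False


def cell_type(cell):
    """Apply heuristics H1-H6 in order."""
    if "form" in cell.tags:                      # H1
        return "form"
    if "img" in cell.tags:                       # H2
        return "image"
    if "a" in cell.tags:                         # H3
        return "hyperlink"
    text = re.sub(r"\s+", "", cell.text)         # blanks are not counted
    if not text:                                 # H4
        return "empty"
    if re.fullmatch(r"[0-9]+", text):            # H5
        return "digit"
    if re.fullmatch(r"[0-9A-Za-z]+", text):      # H6
        return "alphabetical"
    return "others"


def build_matrix(rows):
    """Expand ROWSPAN/COLSPAN so that every grid position holds an element
    of the form ((data_type, content), is_header)."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            while (r, c) in grid:                # slot taken by a rowspan above
                c += 1
            element = ((cell_type(cell), cell.text), cell.is_header)
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    grid[(r + dr, c + dc)] = element
            c += cell.colspan
    if not grid:
        return []
    width = max(col for (_, col) in grid) + 1
    filler = (("empty", ""), False)              # pads rows shorter than width
    return [[grid.get((r, c), filler) for c in range(width)]
            for r in range(len(rows))]


def sort_by_column(matrix, col, descending=False):
    """Sort the data rows by one column; the first (header) row stays put."""
    def key(row):
        (dtype, content), _ = row[col]
        if dtype == "digit":
            return (0, int("".join(content.split())), "")
        return (1, 0, content)
    return [matrix[0]] + sorted(matrix[1:], key=key, reverse=descending)


def hide_column(matrix, col):
    """Return a copy of the matrix without the given column."""
    return [row[:col] + row[col + 1:] for row in matrix]


rows = [
    [RawCell("Song", is_header=True), RawCell("Singer", is_header=True),
     RawCell("Plays", is_header=True)],
    [RawCell("Hello"), RawCell("Adele"), RawCell("30")],
    [RawCell("One"), RawCell("U2"), RawCell("7")],
    [RawCell("n/a", colspan=2), RawCell("120")],
]
matrix = build_matrix(rows)
by_plays = sort_by_column(matrix, 2)
print([row[0][0][1] for row in by_plays])
```

Because spanned cells are copied into every position they cover, sorting and hiding become plain list operations on the matrix, with no need to walk the DOM for each request.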

IV. EXPERIMENT

The experiment aims to evaluate the effectiveness of our work. It has two parts: the first tests the accuracy of detecting genuine web tables, and the second is a user study to evaluate the tool.

A. Accuracy of detecting genuine web tables

In the experiment, we collect 1774 web tables, of which 233 are genuine tables and 1541 are non-genuine tables. We define Precision as the proportion of genuine tables among all tables that are detected as genuine, and Recall as the proportion of tables detected as genuine among all genuine tables. Table 3 lists the Precision and Recall of our Naive Bayesian implementation, compared with the values of three other classification algorithms reported in Wang and Hu's paper [12].

Table 3. Values of Precision and Recall of several classification algorithms
Algorithm        Precision  Recall
Naive Bayesian   90.50%     93.99%
Decision Tree    97.50%     94.25%
SVM(linear)      91.39%     93.91%
SVM(RBF)         95.81%     95.98%

The results show that our implementation of the simple Naive Bayesian algorithm is quite accurate: its Recall (93.99%) is comparable to that of the other algorithms, although its Precision (90.50%) is the lowest of the four.

B. User study

The user study is based on the assumption that our work helps most when users want to sort a table by a certain column or row, or to compare data in distant columns or rows, while browsing web tables. We designed two scenarios and assigned a specific task to each, as shown in Table 4.

Table 4. Scenarios and tasks designed for user study
No. 1
Scenario: You are browsing a 5-column genuine web table which displays the top 50 popular songs ordered by their rankings, with titles shown in col. 2 and hyperlinks for trial audition shown in col. 5.
Task: Given the titles of three songs, find out whether they are in the list of songs and click on the hyperlinks for trial audition.
No. 2
Scenario: You are browsing a 10-column genuine web table which displays the 20 teams of the England Premier League ordered by their rankings, with team names shown in col. 2 and goals shown in col. 7.
Task: Find the teams with the largest and smallest number of goals.

We recruited 7 participants, all of whom are Chinese graduate students familiar with web browsing on desktops using Microsoft Internet Explorer. They were shown how to use the tool in advance and were then asked to finish the tasks in four ways: toolbar disabled with the browser at normal size; toolbar enabled with the browser at normal size; toolbar disabled with the browser window sized 240*320 pixels; toolbar enabled with the browser window sized 240*320 pixels. We change the browser window size in order to simulate small-screen devices. Table 5 and Table 6 show the results, calculated from automatically recorded timestamps of the relevant operations, with F1-F6 denoting the six functions of the tool.

Table 5. Result of Task 1
Status of the toolbar     Disabled  Enabled  Disabled  Enabled
Browser window size       Normal    Normal   240*320   240*320
Avg. times of using F1    -         1.85     -         0.71
Avg. times of using F2    -         1.29     -         0.43
Avg. times of using F3    -         0.71     -         0.29
Avg. times of using F4    -         1.57     -         1.00
Avg. times of using F5    -         0.57     -         0.14
Avg. times of using F6    -         0.43     -         0.00
Avg. time                 47.06s    16.73s   69.37s    35.08s

Table 6. Result of Task 2
Status of the toolbar     Disabled  Enabled  Disabled  Enabled
Browser window size       Normal    Normal   240*320   240*320
Avg. times of using F1    -         0.85     -         0.71
Avg. times of using F2    -         0.29     -         0.29
Avg. times of using F3    -         0.00     -         0.14
Avg. times of using F4    -         0.57     -         0.43
Avg. times of using F5    -         0.00     -         0.00
Avg. times of using F6    -         0.14     -         0.00
Avg. time                 23.60s    7.52s    35.27s    10.37s

From the results, we can see that both the sorting and the hiding functions are used quite often. Participants spent much less time when the toolbar was enabled, especially when the browser window size was limited. Nearly all participants indicated that they enjoyed the new browsing mode.
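The time savings reported in Table 5 and Table 6 can be made concrete with a little arithmetic. The sketch below recomputes, for each task and window size, the seconds saved and the speed-up factor from the average completion times listed in the two tables; the dictionary layout and the function name `summarize` are illustrative only.

```python
# Average completion times in seconds, copied from Table 5 and Table 6:
# (toolbar disabled, toolbar enabled) for each task and browser window size.
AVG_TIME = {
    ("Task 1", "normal"): (47.06, 16.73),
    ("Task 1", "240*320"): (69.37, 35.08),
    ("Task 2", "normal"): (23.60, 7.52),
    ("Task 2", "240*320"): (35.27, 10.37),
}


def summarize(times):
    """Return {condition: (seconds saved, speed-up factor)}, rounded to 2 places."""
    return {
        condition: (round(disabled - enabled, 2), round(disabled / enabled, 2))
        for condition, (disabled, enabled) in times.items()
    }


for condition, (saved, factor) in summarize(AVG_TIME).items():
    print(condition, f"saved {saved:.2f}s", f"speed-up {factor:.2f}x")
```

With these figures the toolbar cuts the average time by a factor of roughly 2 to 3.4 in every condition, and the absolute saving is larger in the 240*320 window for both tasks (34.29 s versus 30.33 s for Task 1, 24.90 s versus 16.08 s for Task 2).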

V. RELATED WORK

The first step of our implementation is to detect genuine tables in web pages, so we first review existing work on web table analysis. Since the late 1990s, web table analysis has attracted much attention from researchers in Web Data Mining and Information Retrieval [2, 3, 5, 6, 7, 12, 13, 14]. Basically, there are two ways of processing web tables: one is based on the HTML source code, and the other is based on the visual rendition of web pages. For the first way, Lim and Ng [7] proposed to automatically retrieve hierarchical data from HTML tables by constructing a content tree for each table, without requiring prior knowledge of the internal table structure. Yang and Luk [13] first presented a definition of Web Table Mining and developed a framework for comprehensively analyzing the structural aspects of web tables. Wang and Hu [12] focused on web table detection and proposed to automatically classify web tables as either genuine or non-genuine by machine learning, which is the method we draw on in our research. For the visual way, Krüpl and Herzog [6] concentrated on detecting genuine web tables based on the visual rendition of web pages. Later, Gatterbauer et al. [3] extended the idea of visually guided web table detection and used a model of the visual representation of web pages to extract domain-independent information from web tables.

We also review related work on browsing modes for web tables. Asakawa and Itoh [1] developed a non-visual web table navigation method that enables both horizontal and vertical navigation with a table cursor, a table pointer and a cell-jumping key. However, they dealt only with gridded tables, that is, TABLE elements defined in HTML documents without COLSPAN and ROWSPAN. Recently, Tajima and Ohnishi [11] proposed three modes for web table browsing on small screens: normal mode, record mode and cell mode.
They concentrated on how to present a segment of a large web table as the user requires, without considering the relationships among data in different table cells, and they did not present any user evaluation. Compared with our work, Tajima and Ohnishi's approach provides only presentation re-rendering functions such as hiding unnecessary rows or columns, and does not support advanced functions such as sorting the data of the entire table by the data of the current row or column.

VI. CONCLUSION AND FUTURE WORK

In this paper, we propose a new browsing mode to improve the user experience when people encounter large web tables on small-screen devices. Based on automatic table detection and an understanding of table contents, we designed and implemented a plug-in for Microsoft Internet Explorer, which provides a customized user interface for several operations on web tables, such as sorting a table by a column or row and hiding or showing certain columns or rows. In the user study, most participants accepted our work and rated it highly.

However, our work still has three limitations. Firstly, our detection of genuine tables is based only on HTML TABLE elements; web tables laid out with CSS are not considered. In the future, we plan to detect genuine tables based on visual renditions. Secondly, our implementation exists only as a Microsoft IE plug-in, which is still some distance from our goal of running it on handheld devices; we will work on that next. Thirdly, our user interface still needs improvement, which we will carry out based on the suggestions collected in the user study.

REFERENCES

[1] Asakawa, C., Itoh, T.: User Interface of a Nonvisual Table Navigation Method. In: ACM SIGCHI Conference on Human Factors in Computing Systems (CHI '99), pp. 214--215. ACM, New York, NY, USA (1999)
[2] Cohen, W.W., Hurst, M., Jensen, L.S.: A Flexible Learning System for Wrapping Tables and Lists in HTML Documents. In: 11th International Conference on World Wide Web (WWW '02), pp. 232--241. ACM, New York, NY, USA (2002)
[3] Gatterbauer, W., Bohunsky, P., Herzog, M., Krüpl, B., Pollak, B.: Towards Domain-Independent Information Extraction from Web Tables. In: 16th International Conference on World Wide Web (WWW '07), pp. 71--80. ACM, New York, NY, USA (2007)
[4] Hassan, T., Baumgartner, R.: Table Recognition and Understanding from PDF Files. In: 9th International Conference on Document Analysis and Recognition (ICDAR '07), pp. 1143--1147. IEEE Computer Society, Washington, D.C., USA (2007)
[5] Hurst, M.: Classifying TABLE Elements in HTML. In: 11th International Conference on World Wide Web (WWW '02), Poster Paper. (2002)
[6] Krüpl, B., Herzog, M.: Visually Guided Bottom-Up Table Detection and Segmentation in Web Documents. In: 15th International Conference on World Wide Web (WWW '06), pp. 933--934. ACM, New York, NY, USA (2006)
[7] Lim, S.J., Ng, Y.K.: An Automated Approach for Retrieving Hierarchical Data from HTML Tables. In: 8th ACM International Conference on Information and Knowledge Management (CIKM '99), pp. 466--474. ACM, New York, NY, USA (1999)
[8] Liu, Y., Bai, K., Mitra, P., Giles, C.L.: Automatic Searching of Tables in Digital Libraries. In: 16th International Conference on World Wide Web (WWW '07), pp. 1135--1136. ACM, New York, NY, USA (2007)
[9] Pinto, D., McCallum, A., Wei, X., Croft, W.B.: Table Extraction Using Conditional Random Fields. In: 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '03), pp. 235--242. ACM, New York, NY, USA (2003)
[10] Ramel, J.Y., Crucianu, M., Vincent, N., Faure, C.: Detection, Extraction and Representation of Tables. In: 7th International Conference on Document Analysis and Recognition (ICDAR '03), pp. 374--378. IEEE Computer Society, Washington, D.C., USA (2003)
[11] Tajima, K., Ohnishi, K.: Browsing Large HTML Tables on Small Screens. In: 21st Annual ACM Symposium on User Interface Software and Technology (UIST '08), pp. 259--268. ACM, New York, NY, USA (2008)
[12] Wang, Y.L., Hu, J.Y.: A Machine Learning Based Approach for Table Detection on The Web. In: 11th International Conference on World Wide Web (WWW '02), pp. 242--250. ACM, New York, NY, USA (2002)
[13] Yang, Y.C., Luk, W.S.: A Framework for Web Table Mining. In: 4th ACM CIKM International Workshop on Web Information and Data Management (WIDM '02), pp. 36--42. ACM, New York, NY, USA (2002)
[14] Yoshida, M., Torisawa, K., Tsujii, J.: A Method to Integrate Tables of the World Wide Web. In: 1st International Workshop on Web Document Analysis (WDA '01), pp. 31--34. (2001)