EXTRACTION AND ALIGNMENT OF DATA FROM WEB PAGES



Praveen Kumar Malapati 1, M. Harathi 2, Shaik Garib Nawaz 2
1 M.Tech, Computer Science Engineering
2 M.Tech, Associate Professor, Computer Science Engineering
Dept of CSE, SKDEC, Gooty, Anantapuram (Dist), A.P, India

ABSTRACT

A Web search for a particular item through a user query string returns a number of web pages and links. The data behind these pages comes from a resource such as a database or an XML source running on the back-end server. Most of the websites in a search result use their own presentation structure or layout to display their contents. Based on the search query, the content of the result page is parsed so that its relevant data can be extracted, aligned and presented to the user in a template format. This paper studies how to extract data from such web pages without any user input and proposes a model for doing so. The model accepts a web page as input, extracts the data and aligns it in a tabular format.

KEYWORDS: Tag-Value String analysis, Data Extraction, Alignment and Information extraction.

I. INTRODUCTION

The Web is a source of information that expands every day. This information may be structured, semi-structured or unstructured. Most websites display their information in their own way, but within a given website a specific schema for displaying the information can be observed. The source of this information can be a server, a database or an XML source. An example is a company producing electronic goods: the set of products the company produces is published on its site with common properties such as Name, Model, Year of Manufacturing, Price, Rating and User comments. This paper studies the extraction of data from such web pages without any user input.
Automatically extracting the data from such web pages is both useful and challenging. It is an important sub-problem in information integration systems that use data from different websites. Pages belonging to the same site use a similar layout consistently across all pages. As an example, the figure below shows the details of computer-related books displayed on a website.

Figure 1: The sample data from a website.

193 Vol. 7, Issue 1, pp

For example, on the above site the details of the books are published with the Title, Authors, Edition, Publisher, ISBN, Length and Subjects. The website also displays the reviews or ratings given by users. From this we conclude that, to an extent, a site in a particular domain uses a specific format to display its product-related content for the convenience of its users or visitors. This common format is generally called a template. Hence, based on the domain or product background of the site, we deduce certain search strings or keywords to use against the website to extract the data. In the above case, the content relates to book publishing, where publishers follow a similar template to describe book information. Such repeated words are therefore treated as search keywords and used to get as much data as possible from such websites. The extracted and aligned data from the above website is shown in the table below.

TABLE 1: The data aligned after extraction

Title | Authors | Edition | Publisher | ISBN | Length | Subjects | No. of Ratings Reviews
Computers: Tools for an Information age | Harriett Capron, J. A. Johnson | 7, illustrated | Prentice Hall, | , | PAGES | Computers Internet General | 0 Reviews 0 Rating
Computers | Marjorie Eberts, Margaret Gisler | Illustrated | McGraw Hill Professional, | , | PAGES | Business & Economics Careers General | 1 Review 4 Rating

The earliest information extraction techniques rely on a human to encode knowledge of the template into a program called a wrapper, which then extracts the data. In Hammer et al.'s Extracting Semistructured Information from the Web [2], declarative rules are derived manually from templates, and a wrapper generator converts these rules into a wrapper. Most systems, such as XWRAP [3] and STALKER [4], use manually generated training examples that identify data in a small number of pages to learn knowledge of the template.
We build on the techniques mentioned above to get the data from the web pages, deduce the template without any further human input, and use the deduced template to extract data. Human input has several disadvantages. It consumes more time. The result depends on the person's knowledge of the particular domain when deducing the sample data. In general, most content displayed in paragraph format is treated as a comment or description; because web content is text, relevant information may be embedded in a large paragraph and the person may miss it.

Several properties of web pages add to the difficulty. Many pages are in an unstructured format yet contain relevant data. A web page may carry some data in tag attributes. There is no obvious way of differentiating between text that is part of the template and text that is part of the data: any word could be part of the template, the data, or both, and a word that is part of the template need not occur in every page. The schema of the data in pages is usually not a flat set of attributes but is more complex and semi-structured, and it can contain non-atomic attributes that are sets of values. A complex schema makes both defining and automatically recognizing the template harder. In fact, it makes our problem closely related to regular expression inference, which is known to be very hard. These problems can be addressed if the templates related to the data are generated automatically.

The rest of the paper is organized as follows. Section II gives background on the importance of data extraction and the techniques followed by earlier systems, including their wrapper generation methodologies. Section III describes our idea of extracting data from web pages based on HTML tags. Section IV discusses future work.
Section V concludes the paper.

II. RELATED WORK

Data extraction has been studied since the advent of traditional file systems and databases. Early on, a number of string processing algorithms were introduced to search for or identify a particular text string; this line of work extended to tree searches, sequence searches and so on. Later, the wrapper induction technique for information extraction was introduced. Wrapper induction involves creating a wrapper, a set of predefined rules that extracts the data from a page and presents it in a defined format, generally as entities or as a table. Wrapper induction uses supervised learning to learn data extraction rules from manually labeled training examples. This type of extraction is time consuming and difficult, so automated wrapper generation using unsupervised pattern mining is needed. Automated extraction is possible because most Web data objects follow fixed templates, although it is not as easy as it looks. Discovering such templates or patterns enables the system to perform extraction automatically.

Wrapper generation on the Web is an important problem with a wide range of applications. Extracting such data makes it possible to integrate information from multiple websites to provide value-added services such as comparative shopping, object search and information integration. These techniques have been widely used in Artificial Intelligence (AI) [5]. Data extraction has been performed for semi-structured information sources, ontologies, documents, websites and so on. Later, Roadrunner [6] was introduced for automatic data extraction from large websites; it aims to generate wrappers automatically. Similarly, Lixto [7] was introduced for visual Web information extraction. It generates wrappers that translate relevant pieces of HTML pages into XML.
It assists the user in semi-automatically creating wrapper programs by providing a fully visual and interactive user interface. It creates extraction programs using Elog, a logic-based declarative language. Lixto creates an "XML-Companion" for an HTML web page with changing content, containing the continually updated XML translation of the relevant information. Other related works include A Fully Automated Object Extraction System for the World Wide Web [8], Information Extraction Based on Pattern Discovery (IEPAD) [9], A Flexible Learning System for Wrapping Tables and Lists in HTML Documents [10], Data Extraction and Label Assignment for Web Databases [11], Automatic Composite Wrapper Generation for Semi-Structured Biological Data Based on Table Structure Identification [12], and data extraction from flat and nested data records [13].

III. IMPLEMENTATION

Deducing the format a web page uses to display its data is a challenging task. Since there is no obvious way of differentiating text that is part of the template from text that is part of the data, a word can be part of the template, the data, or both. Similarly, a word that is part of the template need not occur in every page. When a page is analyzed, its content is almost always held in HTML tags or in an XML parent/child relation. The structure is a tree with <html> as the root element, followed by the corresponding presentation tags and a closing </html>. Some pages may contain multiple <html> </html> segments, which display content based on user actions by submitting data requests to different handler classes at the back end. Sometimes the data is present in tag attributes. If a set of data records needs to be displayed, the page uses the same tag attribute to represent a particular characteristic of those records. This makes the definition and automatic recognition of templates harder.
Hence the tag data has to be maintained in the database or in the XML document.

1.1 Tag Structure Analysis

A sample HTML source for the example in Figure 1 is shown below.

Figure 2: Source code of the above result page

The tag tree structure (tag hierarchy) for the above HTML source is shown below.

Figure 3: The tag structure hierarchy of the result page

The figure shows the tag hierarchy only for a single property string, Publisher, and its data string, "Prentice Hall, ".

Assumptions made as part of the analysis:
- An <html> or root tag is always present as part of the template of the page.
- Hence there always exists a start tag (<>) and an end tag (</>).
- Either the data or the string representing the property is always present within these start and end tags.
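The tag hierarchy described above can be illustrated with a short sketch. This is not the authors' implementation: it uses Python's standard `html.parser`, and the markup in `SAMPLE_HTML`, the class `TagTreeBuilder` and the helper `tag_paths` are hypothetical names and data chosen only for illustration. For every non-empty text string, the sketch records the chain of start tags that encloses it, which mirrors the assumption that each string sits between a start tag and an end tag under the `<html>` root.

```python
from html.parser import HTMLParser

# Hypothetical stand-in markup; it is NOT the source shown in the paper's Figure 2.
SAMPLE_HTML = """
<html><body>
<table>
<tr><td class="label">Publisher</td><td class="value">Prentice Hall</td></tr>
</table>
</body></html>
"""

# Elements that never have an end tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagTreeBuilder(HTMLParser):
    """Records, for each non-empty text node, the path of enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.paths = []   # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag, skipping any unclosed tags above it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def tag_paths(html):
    """Return (tag path, text) pairs for every non-empty text node in html."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.paths


for path, text in tag_paths(SAMPLE_HTML):
    print(path, "->", text)
```

In this sample both strings share the path `html/body/table/tr/td`. Only their order tells the property string from its data string, which is why the later analysis has to look at the strings themselves and not only at the tags.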

1.2 Tag Data Analysis, Extraction and Alignment

Once the tags are identified, each tag's data is analyzed under the assumption that the data lies between the start and end tags. The length of the data is treated as limited: if a data string is too long, it is considered irrelevant and is excluded from further analysis. The standard HTML tag namespace is used for the analysis and extraction; it can be declared either in an XML document or in a database. The following procedure is applied once the tag structure has been analyzed.
- Extract the tag data strings.
- Check the string length. If the string is of limited length, consider it for further analysis; otherwise discard it.
- Count the occurrences of each string. If a string is repeated multiple times, it is considered a property string.
- The string that follows a property string is considered its data string.
- Continue this procedure until the end of the page.
- Align the data by rows using HTML or XML and present it to the user.

Pictorially, the approach is represented below.

Figure 4: The data extraction and alignment approach

Here the XML source or the database represents the back-end resources. These resources store the general strings that are repeated across web pages. The data in the database can be updated to widen the scope of extraction. The resources can also maintain the occurrence counts, the counts of the most searched strings, and so on.

IV. FUTURE WORK

Here are a few suggestions for future work.
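As a reference point for these suggestions, the procedure of Section 1.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `align`, the length cut-off `MAX_LEN`, the repeat threshold `MIN_REPEATS` and the sample strings are all hypothetical; the paper only speaks of "limited length" and strings "repeated multiple times". The input is assumed to be the page's text strings in document order.

```python
from collections import Counter

MAX_LEN = 60      # assumed cut-off; the paper only says "limited length"
MIN_REPEATS = 2   # assumed threshold for "repeated multiple times"


def align(strings, max_len=MAX_LEN, min_repeats=MIN_REPEATS):
    """Pair repeated property strings with the string that follows each one."""
    # Steps 1-2: keep only non-empty strings of limited length.
    short = [s.strip() for s in strings if 0 < len(s.strip()) <= max_len]
    # Step 3: strings that occur repeatedly are treated as property strings.
    counts = Counter(short)
    properties = {s for s, n in counts.items() if n >= min_repeats}

    rows, current = [], {}
    i = 0
    while i < len(short) - 1:
        name, value = short[i], short[i + 1]
        if name in properties and value not in properties:
            # Step 4: the string after a property string is its data string.
            if name in current:
                # The property has come round again, so a new record starts.
                rows.append(current)
                current = {}
            current[name] = value
            i += 2
        else:
            i += 1
    if current:
        rows.append(current)
    return rows


# Hypothetical page strings in document order.
page_strings = [
    "Title", "Computers: Tools for an Information age",
    "Authors", "Harriett Capron, J. A. Johnson",
    "Publisher", "Prentice Hall",
    "Title", "Computers",
    "Authors", "Marjorie Eberts, Margaret Gisler",
    "Publisher", "McGraw Hill Professional",
    "A long descriptive paragraph that should be ignored. " * 3,
]

rows = align(page_strings)
columns = list(rows[0])
print(" | ".join(columns))
for row in rows:
    print(" | ".join(row.get(c, "") for c in columns))
```

The sketch also exposes a weakness the paper itself lists among its limitations: a data value that happens to repeat across records (the same publisher on two books, for example) would be counted as a property string and misaligned. This is one reason the third suggestion below, extraction without predefined keywords, is difficult.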
First, derive a method to identify and extract data from a large paragraph, since some information may be present within it. Second, devise a method to generate the keywords based on the type of website visited. Third, extract the data from websites without using predefined keywords stored in the database or in the XML source. As described by Weifeng Su et al. in Combining Tag and Value Similarity for Data Extraction and Alignment [1], tag and value similarity can be used to extract and align the data. An approach can then be derived for identifying the exact type of a result string and aligning it appropriately.

V. CONCLUSIONS

We presented a generic model to extract data from web pages. The approach follows a two-step process. The first step identifies the tags in the resulting <html> source page. The second step analyzes the tag data and extracts it. The extracted data is then aligned and presented to the user.

The advantages of the proposed method are:
- The user's search data is available in a single place.
- The user can conveniently compare the resulting data and choose the appropriate item.
- Multiple product details are available on a single page, and most of the data on the page relates to the user's search query.
- Unwanted data such as advertisements and graphic images is removed, and only the data relevant to the search query is displayed.

The disadvantages and limitations of the proposed approach are:
- The approach does not categorize and align the data with 100% accuracy.
- The alignment may be improper when data of a similar type occurs for multiple properties of the table. In such a case, unrelated data may be categorized as similar and aligned under a single column.
- Because the database or the XML resource contains the knowledge about the tags, maintaining the full set of possible tags requires great effort. It has to be updated whenever a new type of tag is encountered, to avoid missing data.
- As the knowledge about the tags grows, its maintenance becomes a challenging task.

REFERENCES

[1] Weifeng Su, Jiying Wang, Frederick H. Lochovsky, and Yi Liu, Combining Tag and Value Similarity for Data Extraction and Alignment, IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 7, pp , July
[2] J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha, Extracting Semistructured Information from the Web, Proc. Workshop on Management of Semistructured Data,
[3] L. Liu, C. Pu, and W. Han, XWRAP: An XML-enabled Wrapper Construction System for Web Information Sources, Proc. 16th Int'l Conf. Data Eng., pp ,
[4] I. Muslea, S. Minton, and C. A. Knoblock, A Hierarchical Approach to Wrapper Induction, Proc. Third Int'l Conf. Autonomous Agents, pp ,
[5] N. Kushmerick, Wrapper Induction: Efficiency and Expressiveness, Artificial Intelligence, vol. 118, nos. 1/2, pp ,
[6] V. Crescenzi, G. Mecca, and P. Merialdo, Roadrunner: Towards Automatic Data Extraction from Large Web Sites, Proc. 27th Int'l Conf. Very Large Data Bases, pp ,
[7] R. Baumgartner, S. Flesca, and G. Gottlob, Visual Web Information Extraction with Lixto, Proc. 27th Int'l Conf. Very Large Data Bases, pp ,
[8] D. Buttler, L. Liu, and C. Pu, A Fully Automated Object Extraction System for the World Wide Web, Proc. 21st Int'l Conf. Distributed Computing Systems, pp ,
[9] C.H. Chang and S.C. Lui, IEPAD: Information Extraction Based on Pattern Discovery, Proc. 10th World Wide Web Conf., pp ,
[10] W. Cohen, M. Hurst, and L. Jensen, A Flexible Learning System for Wrapping Tables and Lists in HTML Documents, Proc. 11th World Wide Web Conf., pp ,
[11] J. Wang and F.H. Lochovsky, Data Extraction and Label Assignment for Web Databases, Proc. 12th World Wide Web Conf., pp ,
[12] L. Chen, H.M. Jamil, and N. Wang, Automatic Composite Wrapper Generation for Semi-Structured Biological Data Based on Table Structure Identification, SIGMOD Record, vol. 33, no. 2, pp ,
[13] B. Liu and Y. Zhai, NET - A System for Extracting Web Data from Flat and Nested Data Records, Proc. Sixth Int'l Conf. Web Information Systems Eng., pp ,

AUTHORS BIOGRAPHY

Praveen Kumar Malapati, M.Tech, Department of CSE, Sri Krishna Devaraya Engineering College, Gooty, Ananthapuram (District).


M. Harathi, M.Tech, Associate Professor, Department of Computer Science Engineering, Sri Krishna Devaraya Engineering College, Gooty, Ananthapuram (District).

Shaik Garib Nawaz, M.Tech, Associate Professor, Department of Computer Science Engineering, Sri Krishna Devaraya Engineering College, Gooty, Ananthapuram (District).

A survey: Web mining via Tag and Value

A survey: Web mining via Tag and Value Khirade Rajratna Rajaram. Information Technology Department SGGS IE&T, Nanded, India Balaji Shetty Information Technology Department SGGS IE&T, Nanded, India Abstract