An Integrated Framework to Enhance the Web Content Mining and Knowledge Discovery

Simon Pelletier
Université de Moncton, Campus of Shippagan, BGI
New Brunswick, Canada

Sid-Ahmed Selouani
Université de Moncton, Campus of Shippagan, BGI
New Brunswick, Canada

ABSTRACT

This paper addresses the issue of distilling relevant information from unstructured data such as the content of Web pages. To solve this issue, a system is designed that uses automated, guided Web mining algorithms for meta-rule extraction. The proposed system can be viewed as an extensible tool to extract metadata and generate multi-format descriptions from existing Web documents. Its effectiveness is demonstrated through two case studies: Acadian literature resources and information on Canadian universities. The results show that the system easily provides meaningful visualizations and delivers powerful text extraction, supporting users in their quest to efficiently investigate and exploit available Web data sources.

Keywords: Knowledge discovery, Web content mining, Information retrieval, Metadata, Visualization capabilities

1. INTRODUCTION

The rapid expansion of largely unstructured data on the Web is causing several problems, including the increasing difficulty of extracting potentially useful knowledge. Distilling relevant information from unstructured data, such as the content of Web pages, can be both challenging and time consuming. Most crawler-based search engines, such as Google, use methods that essentially perform document-level ranking and retrieval, and create their listings automatically. They spider the Web and then let users search through a list of links to Web pages ranked according to their relevance to a given query. Extracting valuable information from such an ever-increasing amount of data remains a tedious task. The main challenge is to drive the next generation of Web search by leveraging data mining and knowledge discovery techniques for information organization, retrieval, and analysis.
These new Web search services are expected to bring increased knowledge and intelligence to users. Enhanced search functions can then effectively dig out understandable information and knowledge from unorganized and unstructured Web data.

This paper is organized as follows. Related work is reviewed in Section 2. Section 3 states the objectives of the designed tool. The components of the proposed tool are described in Section 4 through the presentation of two case studies. Finally, Section 5 concludes the paper.

2. RELATED WORK

It is widely recognized that the huge volume of information on the Web, disseminated to users in a chaotic way, makes it very challenging to use that information in a systematic way. To face this challenge, Web mining is one of the fastest-growing technologies aimed at discovering and analyzing useful information from the Web. According to the classification proposed by Nadeem and Syed in [8], Web mining consists of Web usage mining, Web structure mining, and Web content mining. Web usage mining investigates user access patterns from Web usage logs. Web structure mining aims at discovering useful knowledge from the structure of hyperlinks. Web content mining refers to the extraction and integration of useful data, information, and knowledge from Web page contents. In this paper we are concerned with Web content mining.

To extract structured data from semi-structured Web documents, pattern-discovery-based approaches can be used. Recent variants of these approaches discover extraction patterns from Web pages without user-labeled examples by using several pattern discovery
techniques, including radix trees, multiple string alignments, and pattern matching algorithms [2]. The resulting information extractors can generalize to unseen pages from the same Web data source.

One of the most straightforward ways to extract Web data is copy-and-paste. Tools exist to make copy-and-paste easier; one of them is Quotepad [10]. This tool allows users to store notes or data directly from the Web, and it also offers an option to convert the selected data by exporting and saving them in Extensible Markup Language (XML) format. Excel, the spreadsheet application of the Microsoft Office suite, can also be used to extract data from the Web through its "Import from a website" option [4]. The user may subsequently use the data, for instance by making histograms, or save them as a list. However, the extracted data must be structured beforehand so that the result is clear and easy to analyze and/or navigate. Tools like OutWit Hub [9] are useful to find, grab, and organize data from the Web. However, these tools are better suited to recovering structured information such as tables or lists of data. Note that they do not automatically extract data from all (unseen) Web pages of a given site, but only from the Web page currently being consulted. Moreover, they do not extract data dynamically: for example, if the extracted data are saved in Excel and a histogram is made, a new process must be performed to recreate the histogram whenever the Web page is updated. Screen Scraper is another Web data extraction tool [11], used to store extracted data in databases. Its main advantage is that it can automatically extract targeted data over a given period. It provides various useful features that allow users to easily interface it with their database engines.
Figure 1. Overview of the proposed system (components: queue of URLs, XHTML parser, natural language processing, topic identification, database, predicate dictionary, query composer, association rules, temporary text storage, file accessor, file parser, XSD/XML, XSLT/FO, PDF and graphic output formats, knowledge base)

3. OBJECTIVES

To meet the challenge of delivering more intelligent search results to users, we propose the use of automated, guided Web mining algorithms for meta-rule extraction. The proposed approach combines natural language processing and supervised rule-based guidance algorithms to improve the knowledge discovery process by using information available on the Web. The proposed system can be viewed as an extensible tool to extract metadata and generate multi-format descriptions (including XML, database, and graphic formats) from existing Web documents. It provides a set of features that allow one to analyze documents from the Web without having to manually transcribe the reliable information found. Its effectiveness is demonstrated through two case studies: Acadian literature resources and information on Canadian universities.

4. TRANSFORMING UNSTRUCTURED WEB DATA INTO INTUITIVE VISUAL FORMAT

As illustrated by the block diagram of Figure 1, the proposed framework is composed of parsers, miners, and various output generators. The low-level processing performed by the parsers receives Web documents converted from different formats, analyzes the contents, and divides them into atomic units. For this task, we came up with a simple yet effective algorithm. The parser module contains two engines and a temporary storage area. The first engine is a multi-format parser; typically it selects important attributes through the lexical analysis stage of natural language processing. The second one is used to open raw text documents as well as Microsoft Word and PDF documents that are available for download from the fetched and queued URLs. Once parsing is done, the documents are appended to the storage area for later processing.
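The atomic-unit splitting performed by the first parser engine can be sketched as follows. The framework itself is written in PHP, so this Python fragment is only an illustration of the idea (the class name and the sample markup are our own, not taken from the system): the text nodes of an XHTML page are collected as atomic units for later lexical analysis.

```python
from html.parser import HTMLParser

class AtomicUnitParser(HTMLParser):
    """Collect the text content of an XHTML page as a list of atomic units (sketch)."""

    def __init__(self):
        super().__init__()
        self.units = []

    def handle_data(self, data):
        # Keep each non-empty text node as one atomic unit.
        text = data.strip()
        if text:
            self.units.append(text)

parser = AtomicUnitParser()
parser.feed("<html><body><h1>Acadian Literature</h1><p>Published in 1984.</p></body></html>")
print(parser.units)  # ['Acadian Literature', 'Published in 1984.']
```

A real implementation would also tag each unit with its source URL and position so that the miners can trace extracted attributes back to the original document.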
The miners make use of the parsed information to generate additional metadata properties for the documents. Examples of miners include the language identification module, the metadata extractor, and the classifier. Output generators allow users to highlight relevant information buried in unstructured content, extracted/mined from metadata, and to present this information in an intuitive visual format. The findings are then presented as a consolidated view thanks to the visual (graphic) or structured (database) information discovered and extracted from the processed documents. The framework was written in PHP with SQL functions; a MySQL database is used to store the extracted data. Through two practical case studies, we give details about the algorithms used in the proposed framework.

Case study 1: Acadian literature resources

This application uses the proposed framework to provide knowledge about Acadian literature derived by the mining algorithm given in Figure 2. In the steps of this algorithm, we enter a suitable combination of related keywords and discover meaningful information in documents from the targeted Web sites obtained by search support functions. To visualize the characteristics of the obtained attributes, a graphical user interface (GUI) was developed. In order to operate the Web miner, it is necessary to gather Web pages selectively or entirely. When a request is made for a given feature, the miners check a text file that contains the queued URLs of these pages. Users can therefore control the behavior of the Web miner through this file. Additional selection policies and rules can be added in order to gather and select more relevant Web contents in depth. For instance, according to these defined rules, it is possible to manage intellectual property and copyright issues when storing copies of gathered Web contents on personal servers.
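Reading the text file of queued URLs that controls the miner can be sketched with a small helper like the one below. The file format is our assumption (one URL per line, with '#' lines reserved for policy comments); the original framework performs the equivalent in PHP.

```python
import tempfile

def load_url_queue(path):
    """Return the queued URLs; blank lines and '#' policy/comment lines are skipped."""
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")]

# Small demonstration with a temporary queue file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("# gather only publication pages\nhttp://www.acadielitteraire.ca/\n\n")
    queue_path = f.name

print(load_url_queue(queue_path))  # ['http://www.acadielitteraire.ca/']
```

Because the miner rereads this file on each request, adding or removing a line is enough to change which pages are gathered, which is what gives users control over the miner's behavior.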
Algorithm 1: (deep search & GUI)
Fix the number of targeted Web sites S_max
Generate a set of rules and policies
For S_max sites Do
  For each set of visible and unseen pages Do
    Search for specific items related to publications
    Evaluate the attributes
  End For
  Select and store in the database
End For
Output to various formats and graphics
Discover new sites and update S_max

Figure 2. The mining algorithm used to provide Acadian literature information

The modular architecture of the proposed framework allows administrators to monitor the consistency of Web pages, such as the update time of Web contents and the validity of hyperlinks to other Web pages.

Figure 3. Number of books related to Acadian culture published per year (1980-2009)

In this case study, the system extracts the content of both visible and unseen pages of the Web site [6] and sends it to the parser. Search patterns are then created and transmitted to a pattern matching procedure, which searches a string for specific patterns and stores the results in an array. To extract the content of all the Web pages, and not only the pages covering a single year of Acadian publication activity, we use a loop and change the year in an adapted, dynamic URL. This method iterates over an array that contains all the publication years from 1980 to 2009. The result of this extraction is stored in a database (MySQL) and can be further visualized in various formats. The user may also use the extracted data for future analysis by creating a histogram, as illustrated in Figure 3. The bars of the histogram are drawn from data stored in the database. The main advantage of this histogram is that it is dynamic: if the data change in the database, the histogram changes as well. In this framework, we use a dynamic SQL statement in a loop to retrieve the number of books published annually.
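The year loop with a dynamic URL and the pattern matching procedure can be sketched as follows. The URL template and the HTML pattern are hypothetical, since the actual markup of the site is not shown in the paper; the real framework performs this step in PHP and stores the per-year counts in MySQL.

```python
import re

# Hypothetical URL template: the year placeholder makes the URL dynamic.
BASE_URL = "http://www.acadielitteraire.ca/publications?year={year}"
# Hypothetical markup pattern for one book entry on the page.
BOOK_PATTERN = re.compile(r'<li class="book">(.+?)</li>')

def count_books(pages_by_year):
    """Map each publication year to the number of books matched on its page.

    pages_by_year maps a year (int) to the fetched HTML of the corresponding
    dynamic URL; in the real system the pages would be fetched in the loop.
    """
    counts = {}
    for year, html in pages_by_year.items():
        counts[year] = len(BOOK_PATTERN.findall(html))
    return counts

sample = {
    1984: '<ul><li class="book">A</li><li class="book">B</li></ul>',
    1985: "<ul></ul>",
}
print(count_books(sample))  # {1984: 2, 1985: 0}
```

The dictionary returned here plays the role of the database rows behind the histogram of Figure 3: regenerating it from fresh pages is what makes the visualization dynamic.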
Consequently, this system is convenient for extracting data from unstructured Web pages because it exploits the data dynamically, unlike other tools that offer only a manual process. The major advantage of structured document formats is the possibility of producing multiple deliverables. Given that there are multiple ways of converting unstructured data into structured formats, it seems reasonable to choose the appropriate deliverable according to the type of application and the users' needs. In our application, the analysis, navigation, and browsing of Web site data are facilitated by these new formats. For instance, the framework makes it possible to structure the data collected from the Acadian literature Web site as a bibliographic record for each book. Because the data are saved in XML format, as illustrated in Figure 4, our system can extensively use XSLT (Extensible Stylesheet Language Transformations) or RDF (Resource Description Framework) Schema. Subsequently, we can display the XML data about each book in a user-friendly fashion, as illustrated in Figure 5.
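Building the per-book bibliographic record in XML (as in Figure 4) can be sketched like this. The element names and the sample values are illustrative only, not the exact schema used by the framework.

```python
import xml.etree.ElementTree as ET

def book_record(title, author, year):
    """Build one bibliographic record element (element names are illustrative)."""
    book = ET.Element("book")
    ET.SubElement(book, "title").text = title
    ET.SubElement(book, "author").text = author
    ET.SubElement(book, "year").text = str(year)
    return book

rec = book_record("Sample Title", "A. Author", 1984)
print(ET.tostring(rec, encoding="unicode"))
# <book><title>Sample Title</title><author>A. Author</author><year>1984</year></book>
```

An XSLT stylesheet (or an RDF mapping) can then be applied to such records to produce the user-friendly rendering of Figure 5, or a different stylesheet can produce another deliverable from the same XML.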
Figure 4. XML structure of the Web content

Case Study 2: Information on Canadian Universities (Google Search)

In this application, the data that users want to extract are retrieved from selected URLs. To create this relevant list of URLs, a search procedure (the same as in the first case study) based on pattern matching of Google search results is used. The algorithm given in Figure 6 depicts the steps performed to provide enhanced knowledge from current Web search engines. A text file containing the filtered URLs is automatically created to guide the parsing procedure. Next, we use an array of patterns to extract relevant attributes that were previously defined by the users. Note that XSD rules can be established in order to produce a well-formed XML file containing the final retained attributes extracted from the raw data obtained after mining the selected documents (step 6 of Algorithm 2). The framework allows users to extract only the content they want (metadata, for instance) without having to click on each link that Google provides for a given university. In this example, the metadata of Canadian university Web pages extracted from Google's results can also be stored in a database. Subsequently, they can be saved in an XML file or in any other format, depending on the choice and needs of the users. They can also simply be displayed in XHTML format directly from the framework.

Algorithm 2: (Search & Store metadata)
1) Define user attributes and optional XSD rules
2) Generate a set of templates to filter Google results: T_i
3) Store a set of relevant URLs
4) For U_max URLs Do
5)   Get a URL x
6)   Mine d_x: documents of x
7)   Evaluate the relevance of attributes by scoring the pattern matching with T_i
8)   If d_x ≠ Ø Goto 6
9)   Store temporarily selected attributes
10) End For
11) Output information to XML according to XSD rules, if established

Figure 6. Algorithm providing knowledge discovery through augmented Web search results

Figure 5. XSLT result applied to the XML extracted file

Figure 7 gives an example of the deliverable obtained in step 9 of Algorithm 2, presented in Figure 6. This file contains the temporarily selected attributes corresponding to the user-requested information. These previously invisible data, extracted from the Google search results on Canadian universities, are now accessible. The raw information obtained in step 9 is further structured in XML format according to the predefined XSD rules.
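Step 7 of Algorithm 2, scoring candidate documents against the templates T_i, can be sketched as follows. The templates, weights, and threshold below are our own illustrative choices; the paper does not specify the actual scoring scheme.

```python
import re

# Hypothetical templates T_i: each is a regex over a result snippet, with a weight.
TEMPLATES = [
    (re.compile(r"[Uu]niversit[yé]"), 2.0),
    (re.compile(r"[Cc]anada"), 1.0),
]

def score(snippet):
    """Score a document snippet by weighted pattern matches (step 7 of Algorithm 2)."""
    return sum(weight * len(pattern.findall(snippet))
               for pattern, weight in TEMPLATES)

def filter_urls(results, threshold=2.0):
    """Keep the URLs whose snippets score at or above the threshold (step 9)."""
    return [url for url, snippet in results if score(snippet) >= threshold]

hits = [
    ("http://u1.ca", "Université de Moncton, Canada"),
    ("http://x.com", "unrelated page"),
]
print(filter_urls(hits))  # ['http://u1.ca']
```

The URLs retained this way correspond to the filtered list written to the text file that guides the parsing procedure; the attributes extracted from them are then serialized to XML against the optional XSD rules of step 11.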
Figure 7. Excerpt of the raw data obtained in step 9 of the algorithm presented in Figure 6

5. CONCLUSION

In this paper we proposed a framework that identifies valuable text-based information extracted from Web documents and transforms it into multiple structured formats, facilitating the analytical process. This was demonstrated through two case studies: Acadian literature resources and information on Canadian universities. The algorithms developed within the framework proved effective and intuitive in overcoming some of the difficulties associated with the assimilation of unstructured data. Many uses and possibilities are achievable in order to provide meaningful visualizations, supporting users in their quest to efficiently investigate and exploit the data sources available on the Web.

6. REFERENCES

[1] M. Y. Chau, "Finding order in a chaotic world: A model for organized research using the World Wide Web", Internet Reference Services Quarterly, Vol. 2, No. 2/3, pp. 37-53, 1997.
[2] C. Chia-Hui, H. Chun-Nan and L. Shao-Cheng, "Automatic information extraction from semi-structured Web pages by pattern discovery", Decision Support Systems, Vol. 35, No. 1, pp. 129-147, Elsevier Science Publishers, 2003.
[3] P. Desikan, J. Srivastava, V. Kumar, and P. N. Tan, "Hyperlink Analysis: Techniques and Applications", Technical Report 2002-0152, Army High Performance Computing and Research Center, 2002.
[4] Excel 2007 - Microsoft Office, online at: http://office.microsoft.com/en-us/default.aspx, 2010.
[5] H. Kawano, "Web Archiving Strategies by using Web Mining Techniques", PACRIM IEEE Communications, Computers and Signal Processing Conference, pp. 915-918, 2003.
[6] La littérature francophone en Acadie depuis 1980 (translation: "Acadian literature since 1980"), online at: http://www.acadielitteraire.ca/, 2009.
[7] S. Lawrence and C. Lee Giles, "Searching the World Wide Web", Science, Vol. 280, No. 3, pp. 98-100, 1998.
[8] M. Nadeem and S. H. Syed, "Guided Web Content Mining Approach for Automated Meta-Rule Extraction and Information Retrieval", Proceedings of the 2008 International Conference on Data Mining, pp. 619-625, Las Vegas, USA, 2008.
[9] OutWit Technologies, Harvest the Web, online at: http://www.outwit.com/, 2010.
[10] Quotepad, the free notepad that can save the text selected on the screen, online at: http://quotepad.info/, 2010.
[11] Screen-Scraper, Web data extraction, online at: http://www.screen-scraper.com/, 2010.