An Integrated Framework to Enhance the Web Content Mining and Knowledge Discovery

Simon Pelletier, Université de Moncton, Campus of Shippagan, BGI, New Brunswick, Canada
Sid-Ahmed Selouani, Université de Moncton, Campus of Shippagan, BGI, New Brunswick, Canada

ABSTRACT

This paper addresses the issue of distilling relevant information from unstructured data such as content from Web pages. To solve this issue, a system is designed that applies automated, guided Web mining algorithms to meta-rule extraction. The proposed system can be viewed as an extensible tool to extract metadata and generate multi-format descriptions from existing Web documents. The tool is illustrated through two case studies: one on Acadian literature resources and one on Canadian universities. The results show that the system easily provides meaningful visualizations and delivers powerful text extraction, supporting users in their quest to efficiently investigate and exploit available Web data sources.

Keywords: Knowledge discovery, Web content mining, Information retrieval, Metadata, Visualisation capabilities

1. INTRODUCTION

The rapid expansion of largely unstructured data on the Web causes several problems, such as the increased difficulty of extracting potentially useful knowledge. Distilling relevant information from unstructured data, such as content from Web pages, can be both challenging and time consuming. Most crawler-based search engines, such as Google, use methods that essentially perform document-level ranking and retrieval and create their listings automatically. They spider the Web and then let users search through a proposed list of links to Web pages ranked according to their relevance to a given query. Extracting valuable information from such an ever-increasing amount of data remains a tedious task. The biggest challenge is to drive the next generation of Web search by leveraging data mining and knowledge discovery techniques for information organization, retrieval, and analysis. These new Web search services are expected to bring increased knowledge and intelligence to users. As such, enhanced search functions can effectively dig out understandable information and knowledge from unorganized and unstructured Web data.

This paper is organized as follows. Related work is reviewed in Section 2. Section 3 presents the objectives of the designed tool. The components of the proposed tool are described in Section 4 through the presentation of two case studies. Finally, Section 5 concludes this paper.

2. RELATED WORK

It is widely recognized that the huge volume of information on the Web, which reaches users in a chaotic way, makes it a great challenge to use that information in a systematic way. To face this challenge, Web mining has emerged as one of the fastest growing technologies for discovering and analyzing useful information from the Web. According to the classification proposed by Nadeem and Syed in [8], Web mining consists of Web usage mining, Web structure mining, and Web content mining. Web usage mining investigates user access patterns from Web usage logs. Web structure mining aims at discovering useful knowledge from the structure of hyperlinks. Web content mining refers to the extraction and integration of useful data, information, and knowledge from Web page contents. In this paper we are concerned with Web content mining. To extract structured data from semi-structured Web documents, pattern-discovery-based approaches can be used.
Recent variants of these approaches discover extraction patterns from Web pages without user-labeled examples by using several pattern discovery techniques, including radix trees, multiple string alignments, and pattern matching algorithms [2]. These information extractors can be generalized over unseen pages from the same Web data source.

One of the most straightforward methods to extract Web data is copy-paste. There are tools that make copy-pasting easier, one of which is Quotepad [10]. This tool permits users to store notes or data directly from the Web, and it also offers an option to convert the selected data by exporting and saving them in eXtensible Markup Language (XML) format. Excel, the spreadsheet application of the Microsoft Office suite, can also be used to extract data from the Web by using the "Import from a website" option [4]. The user may subsequently use the data to make histograms or save them as a list. However, the extracted data must be structured beforehand so that the result is clear and easy to analyze and/or navigate. Tools like OutWit Hub [9] are useful to find, grab, and organize data from the Web. However, these tools are more convenient for recovering structured information such as tables or lists of data. Note that they do not automatically extract the data for all (unseen) Web pages of a given site, but only data from the Web page that is currently consulted. Besides this, they do not extract the data dynamically. For example, if the extracted data is saved in Excel and a histogram is made, a new process must be performed to recreate this histogram whenever the Web page is updated. Screen Scraper is another Web data extraction tool [11]. This tool is used to store extracted data in databases. Its main advantage is that it can perform automatic extraction of targeted data over a certain period. It provides various useful features that allow users to easily interface it with their database engines.

Figure 1. Overview of the proposed system (block diagram; components: queue of URLs, XHTML parser, natural language processing, topic identification, predicate dictionary, query composer, association rules, temporary text storage, file accessor, file parser, database, knowledge base, and XSD/XML, XSLT/FO, PDF, and graphic output formats)

3. OBJECTIVES

To meet the challenge of delivering more intelligent search results to users, we propose the use of automated, guided Web mining algorithms for meta-rule extraction. The proposed approach combines Natural Language Processing and supervised rule-based guidance algorithms to improve the knowledge discovery process by using information available on the Web. The proposed system can be viewed as an extensible tool to extract metadata and generate multi-format descriptions (including XML, database, graphics, etc.) from existing Web documents. It provides a set of features that allow one to analyze documents from the Web without having to manually transcribe the reliable information found. The tool is demonstrated through two case studies: Acadian literature resources and information on Canadian universities.

4. TRANSFORMING UNSTRUCTURED WEB DATA INTO INTUITIVE VISUAL FORMAT

As illustrated by the block diagram of Figure 1, the proposed framework is composed of parsers, miners, and various output generators. The low-level processing performed by the parsers receives Web documents converted from different formats, analyzes the contents, and divides them into atomic units. For this task, we came up with a simple yet effective algorithm. The parser module contains two engines and a temporary storage area. The first engine is the system's multi-format parser. Typically it selects important attributes through the lexical analysis stage of natural language processing.
The second engine is used to open raw text documents as well as Microsoft Word and PDF documents that are available for download from the fetched and queued URLs. Once the parsing is done, the documents are appended to the storage area for later processing. The miners make use of the parsed information to generate additional metadata properties for the documents. Examples of miners include a language identification module, a metadata extractor, and a classifier.
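To make the idea of a miner concrete, here is a minimal sketch of a language identification module in PHP, the language the framework is written in (see below). The stopword lists and the function name are our own illustrative assumptions; the paper does not describe how its miners are implemented.

```php
<?php
// Minimal sketch of a language-identification miner (hypothetical; the
// paper does not detail its implementation). It scores a parsed document
// against small stopword lists for English and French, a plausible pair
// here since the Acadian sources are largely French.
function identify_language(string $text): string {
    $stopwords = [
        'en' => ['the', 'and', 'of', 'to', 'in', 'is', 'for', 'with'],
        'fr' => ['le', 'la', 'les', 'et', 'de', 'des', 'dans', 'pour'],
    ];
    // Tokenize on non-letter characters, lower-cased.
    $tokens = preg_split('/[^a-zàâçéèêëîïôùûü]+/iu', mb_strtolower($text),
                         -1, PREG_SPLIT_NO_EMPTY);
    $scores = ['en' => 0, 'fr' => 0];
    foreach ($tokens as $tok) {
        foreach ($stopwords as $lang => $words) {
            if (in_array($tok, $words, true)) {
                $scores[$lang]++;
            }
        }
    }
    // Return the language with the highest stopword count.
    return $scores['fr'] > $scores['en'] ? 'fr' : 'en';
}
```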

Output generators allow users to highlight relevant information buried in the unstructured content that is extracted/mined from metadata, and to present this information in an intuitive visual format. The findings are then presented as a consolidated view thanks to the visual (graphic) or structured (database) information discovered and extracted from the processed documents. This framework was written in PHP and SQL; to store the extracted data, we used a MySQL database. Through two practical case studies, we give details about the algorithms used in the proposed framework.

Case study 1: Acadian literature resources

This application uses the proposed framework to provide knowledge about Acadian literature derived by the mining algorithm given in Figure 2. In the steps of this algorithm, we enter a suitable combination of related keywords and discover the meaningful information of documents from the targeted Web sites obtained by search support functions. To visualize the characteristics of the obtained attributes, a Graphical User Interface (GUI) was developed. In order to operate the Web miner, it is necessary to gather Web pages selectively or entirely. When making a request for a given feature, the miners check a text file that contains the queued URLs of these pages. Users are therefore given the possibility to control the behavior of the Web miner through this file. Additional selection policies and rules can be added in order to gather and select more relevant Web contents in depth. For instance, according to these defined rules, it could be possible to manage problems of intellectual property and copyright when storing copies of gathered Web contents on personal servers.

Algorithm 1: (deep search & GUI)
  Fix the number of targeted Web sites S_max
  Generate a set of rules and policies
  For S_max sites Do
    For each set of visible and unseen pages Do
      Search for specific items related to publications
      Evaluate the attributes
    End For
    Select and store in the database
  End For
  Output to various formats and graphics
  Discover new sites and update S_max

Figure 2. The mining algorithm used to provide Acadian literature information

The modular architecture of the proposed framework allows administrators to consider the consistency of Web pages, such as the updating time of Web contents and the validity of the hyperlinks to other Web pages.

In this case study, the system extracts the content of both visible and unseen pages of the Web site [6] and sends it to the parser. Search patterns are then created and transmitted to a pattern matching procedure. This procedure searches a string for specific patterns and stores the results in an array. To extract the content of all the Web pages, and not only the pages covering a single year of Acadian literary publication activity, we use a loop and change the year in an adapted, dynamic URL. This method iterates through an array that contains all the publication years from 1980 to 2009. The result of this extraction was stored in a MySQL database and can be further visualized in various formats. The user may also use the extracted data for future analysis by creating a histogram as illustrated in Figure 3; the bars of the histogram are drawn from the data stored in the database.

Figure 3. Number of books related to the Acadian culture published per year (1980-2009)
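A minimal sketch of the year-loop extraction just described might look as follows. The URL pattern, the regular expression, and the books table schema are illustrative assumptions on our part, since the paper does not publish them.

```php
<?php
// Hypothetical sketch of the dynamic-URL extraction loop described above.
// The URL pattern, the regex, and the `books` table schema are
// illustrative assumptions, not the paper's actual values.
$db = new mysqli('localhost', 'user', 'password', 'acadian');

for ($year = 1980; $year <= 2009; $year++) {
    // Change the year inside an adapted, dynamic URL.
    $url  = "http://www.acadielitteraire.ca/publications?annee=$year";
    $html = file_get_contents($url);
    if ($html === false) {
        continue; // Skip years whose page cannot be fetched.
    }
    // Pattern matching: collect every book entry found on the page
    // into an array (the pattern here is a placeholder).
    if (preg_match_all('/<li class="livre">(.+?)<\/li>/s', $html, $m)) {
        foreach ($m[1] as $title) {
            $stmt = $db->prepare(
                'INSERT INTO books (title, year) VALUES (?, ?)');
            $stmt->bind_param('si', $title, $year);
            $stmt->execute();
        }
    }
}
```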
The main advantage of this histogram is that it is dynamic: if the data change in the database, the histogram changes as well. In this framework, a dynamic SQL statement is executed in a loop to retrieve the number of books published annually. Consequently, this system is convenient for extracting data from the unstructured Web because it exploits the data dynamically, unlike other tools that offer only a manual process. The major advantage of structured document formats is the possibility of producing multiple deliverables. Given that there are multiple ways of converting unstructured data into structured formats, it seems reasonable to choose the appropriate deliverable according to the type of application and the users' needs. In our application, the analysis, navigation, and browsing of Web site data are facilitated by these new formats. For instance, the framework makes it possible to structure the data collected from the Acadian literature Web site as a bibliographic record for each book. Because the data are saved in XML format, as illustrated in Figure 4, our system is well suited to extensive use of XSLT (eXtensible Stylesheet Language Transformations) or RDF (Resource Description Framework) Schema. Subsequently, we have the ability to display the XML data about each book in a user-friendly fashion, as illustrated in Figure 5.
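To make the dynamic behavior concrete, the following sketch shows both steps just described under stated assumptions: a per-year count query that feeds the histogram, and an XSLT transformation that renders a book's XML record. The table and file names are placeholders, not the paper's actual artifacts.

```php
<?php
// Sketch of the dynamic retrieval feeding the histogram: one count per
// publication year, re-read from MySQL on every request so the chart
// follows the database (the `books` table name is an assumption).
$db = new mysqli('localhost', 'user', 'password', 'acadian');
$counts = [];
for ($year = 1980; $year <= 2009; $year++) {
    $stmt = $db->prepare('SELECT COUNT(*) FROM books WHERE year = ?');
    $stmt->bind_param('i', $year);
    $stmt->execute();
    $stmt->bind_result($n);
    $stmt->fetch();
    $stmt->close();
    $counts[$year] = $n; // One histogram bar per year.
}

// Rendering a book record: apply an XSLT stylesheet to the stored XML
// (file names are placeholders standing in for Figures 4 and 5).
$xml = new DOMDocument();
$xml->load('book_record.xml');
$xsl = new DOMDocument();
$xsl->load('book_record.xsl');
$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);
echo $proc->transformToXML($xml);
```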

Figure 4. XML structure of the Web content

Figure 5. XSLT result applied to the extracted XML file

Case Study 2: Information on Canadian Universities (Google Search)

In this application, the data that users want to extract are retrieved from selected URLs. To create this relevant list of URLs, a search procedure (the same as in the first case study) based on pattern matching of Google search results is used. The algorithm given in Figure 6 depicts the steps performed to provide enhanced knowledge from current Web search engines. A text file containing the filtered URLs is automatically created to guide the parsing procedure. Next, we use an array of patterns to extract relevant attributes that were previously defined by the users. Note that XSD rules can be established in order to produce a well-formed XML file containing the final retained attributes extracted from the raw data obtained after mining the selected documents (step 6 of Algorithm 2). The framework allows users to extract only the content they want (metadata, for instance) without having to click on each link of a given university that Google provides. In this example, the metadata of Canadian universities' Web pages extracted from Google's results can also be stored in a database. Subsequently, they can be saved in an XML file or in any other format depending on the choice and the needs of the users. They can also simply be displayed in XHTML format directly from the framework.

Algorithm 2: (Search & Store metadata)
  1) Define user attributes and optional XSD rules
  2) Generate a set of templates T_i to filter Google results
  3) Store a set of relevant URLs
  4) For U_max URLs Do
  5)   Get a URL x
  6)   Mine d_x: the documents of x
  7)   Evaluate the relevance of attributes by scoring the pattern matching with T_i
  8)   If d_x ≠ Ø Goto 6
  9)   Store temporarily selected attributes
  10) End For
  11) Output information to XML according to the XSD rules, if established

Figure 6. Algorithm providing knowledge discovery through augmented Web search results

Figure 7 gives an example of the deliverable obtained in step 9 of Algorithm 2 (Figure 6). This file contains temporarily selected attributes according to the user-requested information. These invisible data, extracted from the Google search results on Canadian universities, are now accessible. The raw information obtained in step 9 is further structured in XML format according to the predefined XSD rules.
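As a rough illustration of steps 4 to 9, the sketch below walks a list of filtered URLs, mines each page's meta tags with PHP's DOM extension, and keeps attributes whose template-matching score passes a threshold. The template patterns, scoring rule, threshold, and file names are all our own assumptions; the paper does not publish them.

```php
<?php
// Hypothetical sketch of Algorithm 2, steps 4-9: mine metadata from each
// filtered URL and keep attributes matching user-defined templates.
// Patterns, threshold, and file names are illustrative assumptions.
$templates = ['/universit(y|é)/i', '/admission/i', '/campus/i'];
$urls = file('filtered_urls.txt',
             FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$selected = [];

foreach ($urls as $url) {                    // 4) For U_max URLs Do
    $html = @file_get_contents($url);        // 5) Get a URL x
    if ($html === false) continue;
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                  // 6) Mine d_x
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $content = $meta->getAttribute('content');
        // 7) Score the pattern matching against the templates T_i.
        $score = 0;
        foreach ($templates as $t) {
            if (preg_match($t, $content)) $score++;
        }
        if ($score >= 1) {                   // 9) Store selected attributes
            $selected[$url][] = [
                'name'    => $meta->getAttribute('name'),
                'content' => $content,
            ];
        }
    }
}
// Raw step-9 deliverable, kept temporarily before XML/XSD structuring.
file_put_contents('step9_raw.json',
                  json_encode($selected, JSON_PRETTY_PRINT));
```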

Figure 7. Excerpt of the raw data obtained in step 9 of the algorithm presented in Figure 6

5. CONCLUSION

In this paper we proposed a framework that can be used to identify valuable text-based information extracted from Web documents and transform it into multiple structured formats, facilitating the analytical process. The framework was illustrated through two case studies: Acadian literature resources and information on Canadian universities. The algorithms developed within the framework proved effective and intuitive in overcoming some of the difficulties associated with the assimilation of unstructured data. Many uses and possibilities are achievable in order to provide meaningful visualizations, supporting users in their quest to efficiently investigate and exploit the data sources available on the Web.

6. REFERENCES

[1] M. Y. Chau, "Finding order in a chaotic world: A model for organized research using the World Wide Web", Internet Reference Services Quarterly, Vol. 2, No. 2/3, pp. 37-53, 1997.
[2] C. Chia-Hui, H. Chun-Nan and L. Shao-Cheng, "Automatic information extraction from semi-structured Web pages by pattern discovery", Decision Support Systems, Vol. 35, No. 1, pp. 129-147, Elsevier Science Publishers, 2003.
[3] P. Desikan, J. Srivastava, V. Kumar, and P. N. Tan, "Hyperlink Analysis: Techniques and Applications", Technical Report 2002-0152, Army High Performance Computing and Research Center, 2002.
[4] Excel 2007 - Microsoft Office, online at: http://office.microsoft.com/en-us/default.aspx, 2010.
[5] H. Kawano, "Web Archiving Strategies by using Web Mining Techniques", PACRIM IEEE Conference on Communications, Computers and Signal Processing, pp. 915-918, 2003.
[6] La littérature francophone en Acadie depuis 1980 (translation: "Acadian literature since 1980"), online at: http://www.acadielitteraire.ca/, 2009.
[7] S. Lawrence and C. Lee Giles, "Searching the World Wide Web", Science, Vol. 280, No. 3, pp. 98-100, 1998.
[8] M. Nadeem and S. H. Syed, "Guided Web Content Mining Approach for Automated Meta-Rule Extraction and Information Retrieval", Proceedings of the 2008 International Conference on Data Mining, pp. 619-625, Las Vegas, USA, 2008.
[9] OutWit Technologies, Harvest the Web, online at: http://www.outwit.com/, 2010.
[10] Quotepad, The free notepad that can save the text selected on the screen, online at: http://quotepad.info/, 2010.
[11] Screen-Scraper, Web data extraction, online at: http://www.screen-scraper.com/, 2010.