A System's Approach Towards Domain Identification of Web Pages


A System's Approach Towards Domain Identification of Web Pages

Sonali Gupta, Department of Computer Engineering, YMCA University of Science & Technology, Faridabad, India, Sonali.goyal@yahoo.com
Komal Kumar Bhatia, Department of Computer Engineering, YMCA University of Science & Technology, Faridabad, India, komal_bhatia1@rediffmail.com

Abstract- With the proliferation of document corpora (commonly called HTML documents or web pages) on the WWW, efficient ways of exploring relevant documents are of increasing importance [4, 8]. The key challenge lies in tackling the sheer volume of documents on the Web and evaluating relevancy for such a huge number. Efficient exploration needs a web crawler that can semantically understand and predict the domain of a web page through analytical processing. This not only facilitates efficient exploration but also helps in the better organization of web content. As a search engine classifies search results by keyword matches, link analysis and other such mechanisms, the paper proposes a solution to the domain identification problem by finding keywords or key terms that are representative of the page's content through elements such as <META> and <TITLE> in the HTML structure of the web page [11]. The paper proposes a two-step framework that first automatically identifies the domain of the given web page and then, with the domain information thus obtained, classifies the web content into the different pre-specified categories. The former uses the various HTML elements present in the web page, while the latter is achieved using Artificial Neural Networks (ANN).

Keywords- search engine; crawler; domain-specific; HTML elements; META; TITLE; classification; categorization; Artificial Neural Networks

I. INTRODUCTION

In recent years, the Web has become a huge information repository owing to the increasing prevalence of documents and databases online [28]. There are numerous pages accessible on the World Wide Web, and their number continues to increase (by approximately 1.5 million) on a daily basis [24, 28]. The more the web content evolves and varies, the more difficult it becomes to support the design and implementation of automatic information retrieval tools, among which the most typically used are web search engines. A search engine discovers content as the first step of retrieval and, after indexing, presents the required information to the user in ranked order. General-purpose search engines cannot keep up with the growing pace of the Web and are able to index only a small fraction of the content that is available. To tackle the issue of scalability, focused crawlers [4, 5, 17] and domain-specific search services such as vertical search engines [29] have come up. The large and ever-expanding scale, full of promising opportunities such as the varying needs of varied users, raises the issue of efficiently extracting the relevant information and maintaining it in an organized way. Determining the relevancy of a document from a huge corpus involves predicting the topic or domain of the web page (say entertainment, food, or sports) and categorizing it by organizing similar web pages into a common group, usually known as the class or category of the web page. The process of categorization need not be mutually exclusive, and the same web page may be assigned to one or more categories.
Manually classifying each page is not a feasible task, as identifying the domain of each and every page is itself tedious and time-consuming for humans. On the other hand, categorization facilitates the automation of domain identification, since documents on the Web can easily be reached by following the hyperlinked structure. Hence, automatic classification drives automatic domain identification and vice versa; the two activities, being interdependent, can be thought of as two facets of the same coin. We propose a novel approach for identifying the topic or domain of web pages using the information available in the <META> and <TITLE> tags of the web page's HTML structure [1, 2, 11, 15, 20, 21]. The proposed system also makes use of Artificial Neural Networks [27] to achieve the prime goal of domain identification, which may later be used to achieve the secondary goal of web page classification. The paper contributes towards the following:

- Enumerates the utility of HTML elements such as <META> and <TITLE> in the process of domain identification;
- Emphasizes the ability of Artificial Neural Networks to achieve the stated goals;
- Develops a system that solves the web page classification problem based on the above-mentioned features and may help in focusing a crawl.

The rest of the paper is organized as follows: the background of the problem and the state of the art are reviewed in Section 2; the detailed working of our proposed system is presented in Section 3; the experimental results and the advantages of the proposed system are discussed in Section 4; and Section 5 concludes along with some future directions for the work.

II. BACKGROUND AND STATE OF THE ART

Information retrieval and management are the two prime tasks from the perspective of Web users [28]. The aim of any search service employed for information retrieval is to efficiently build a high-quality collection of hypertext documents belonging to a specific domain and to return an effective set of results to the user as quickly as possible, against the posed query. This efficiency in building the index [3] and returning results can only be achieved if the search system deals with an organized document set both in its input and in its output. An organized input set implies a well-organized collection of documents on the WWW, whereas an organized output set specifies that the search engine only indexes and maintains pages belonging to a specific set of topics or domains that together represent a relatively narrow segment of the Web. The advantage of such a search system is that sufficient coverage is achieved with a small investment in hardware and network resources. The crawler for such a search engine must be guided by a predictor that tries to identify the topic or domain of the web page and evaluate its relevance to the search engine's specialization. Managing information on the WWW can be achieved by employing numerous such specialized search engines, which in turn calls for automatic web page classification.

Various machine learning techniques have been developed that automatically learn classification models, called classifiers, from training examples [4, 6, 17]. The learned classifiers can then be applied to predict the classes of new documents. Thus, web page classification has not only assisted the organization of documents into hierarchical collections such as the Open Directory Project (DMOZ) but has also aided a wide variety of information retrieval problems such as focused crawling and question answering; hierarchical organization also facilitates the retrieval of information, though through a tedious process of browsing.

Behind the visual representation of each web page rendered by the browser lies a text representation in HTML. Most approaches for predicting the domain of a web page, and subsequently classifying it, rely on this text representation while simply ignoring the visual layout of the page, which may be useful as well [10]. The following classification mechanisms based on the text representation of the hypertext document have been proposed so far [4, 10, 11, 14, 15, 16, 21, 25]:

1. Manual Categorization or Classification: A number of domain experts analyze the text content of a web page manually and assign a category or domain to it, as in the approach followed by Yahoo for organizing documents in its directory structure. The approach has the obvious advantage of accuracy but is challenged by the unprecedented scale of the WWW and is infeasible for the huge number of web documents.

2. Text Clustering approaches: Clustering is an unsupervised learning process and does not need any background information to create clusters of similar documents. Being easy to apply without labelled data, it has become very popular nowadays; however, being computationally expensive at Web scale, it has not been employed against the sheer number of documents on the Web.

3. Content-based Categorization: This approach relies on first creating an index database for each category that contains only the key terms (after removing stop words and obtaining the frequency of occurrence of each term) belonging to that category, drawn from an exemplary set of documents. A candidate document is then classified by extracting its key terms and choosing the index that it resembles the most. The approach does not take full advantage of the page being a hypertext document and hence does not use other relevant features that can be drawn from its HTML structure, such as images and multimedia content.

4. Link and Content Analysis: Based on the hyperlinked structure (in-links, out-links, associated anchor texts, etc.), these approaches find hints about the contents of documents and use the gathered hints to classify the referred document. They take advantage of neighboring pages that have already been assigned a category label, but may suffer significantly in performance when the category labels of the neighboring pages are not available, as is usually the case. Chakrabarti et al. (1998), Slattery and Mitchell (2000), and Calado et al. (2003) used such labels.

5. Categorization based on META tags: The approach relies solely on the content attributes of the META tags (<META name="keywords"> and <META name="description">) [2]. It faces problems when irrelevant words are specified as keywords merely to increase a page's hit ratio in search engine results.

All the above approaches except the first are a step towards automating the process of domain identification. Experiments in [23] show that the most accurate classifier is obtained by using Meta tags as the only text feature; including any other tag (even Body) along with the Meta tag results in lower accuracy and noticeably decreases the precision of the classifier [23]. However, limiting the functionality of our system to just meta-tags would not help in classifying a large majority of web documents, as Meta tags are not in widespread use. Therefore, our system considers the Meta and Title tags collectively, and uses a link extractor that extracts backlinks to derive hints from neighboring pages and supplement the process whenever no information can be obtained from the tags. Using these various features can significantly help improve identification and classification accuracy.

A critical look at the above literature shows that:

- Most of the existing algorithms have used the text content of a web page for identifying its domain and selecting the most suitable category [6, 7, 10].
- Most HTML tags emphasize presentation rather than semantics, yet using the structural information derived from the tags to predict the domain of the hypertext document can boost a classifier's performance [11, 14, 19].

- If a page has been created with care, the information in the title and headers may be more important than that in the prose; using these various features can significantly help improve identification and classification accuracy [15, 16].
- Most work in the field of web page classification has been accomplished using clustering algorithms and classifiers such as Naïve Bayes, decision trees, and Support Vector Machines [6, 9, 18].

The paper contributes towards developing a system for automatic domain identification of hypertext documents while keeping the above characteristics in view. The proposed system has been developed to take advantage of all the above automatic approaches. The result of our system depends on the weights of the various clusters formed by extracting keywords from the tag structure of the web page. The next section explains the proposed approach in detail.

III. PROPOSED APPROACH OF THE DOMAIN IDENTIFICATION SYSTEM

In order to address the problems associated with the manual approach to domain identification, a system that facilitates efficient exploration and better organization of web content has been proposed, based on automatic domain identification and classification of web pages. Our solution comprises gathering domain knowledge from the HTML structure of the referred web page, extracting any backlinks (in case information cannot be derived from the tags in its HTML structure), and finally assigning an appropriate category or class to the web page using Artificial Neural Networks (ANN) [13, 27]. The major components and modules of our system are listed below:

- An index of web pages, their URLs and domain information,
- a Tag Extractor,
- a Back-link Extractor,
- a Clustering Module,
- a domain-specific repository of keywords & clusters of keywords, and
- a classifier based on ANN that constitutes a training module and a testing module.

Our system's approach is based on the use of artificial neural networks, which must first be trained on an exemplary data set and are later used to carry out the assigned task on new candidate hypertext documents. For training, the system is initially provided with a set of web pages (and their URLs) with known domains. This seed set of web pages and URLs can be obtained either from a Web directory or from the result listing of a search engine. Thereafter, an index is created to store the web pages along with their domain information. In order to train the neural network, a URL or web page is taken from this index and given as input to the tag extractor, which extracts the meta-tag and title keywords associated with the URL or web page up to a pre-specified depth. The extracted keywords are then grouped into clusters based on various similarity metrics already known in the art; the clustering module is responsible for creating these clusters of keywords. However, if no meta-tag and title keywords are found, back-links are extracted for that URL or web page by a back-link extractor. The extracted back-links and their corresponding web pages are added to the index so as to control the process of extracting new keywords. Further, these clusters of similar keywords are saved in a domain-specific repository of keywords, and the keywords and clusters of keywords stored there are assigned weights.

Now, the neural network that will be used for domain identification and classification of any new hypertext document is trained on the weights of the clusters of keywords associated with web sites whose domains are already identified. The process of calculating weights and assigning them to every keyword/cluster is explained in detail in Step 2 below.

Figure 1. The proposed system for domain identification. (The architecture connects the WWW, a search engine providing seed URLs, the index of domain-wise classified web pages, the Meta-Tags & Title Keywords Extractor, the Back-link Extractor, the Clustering Module producing clusters of similar keywords, the Domain-Specific Repository of Keywords & Clusters, and the ANN-based Training and Testing Modules that classify new URLs or web pages domain-wise.)

Similarly, based on the weights of clusters of keywords (fetched from the domain-specific repository of keywords & clusters), the domain of any new web page is identified and the page is classified into one of the categories using the trained neural network. The classified web pages are then stored in the index under their respective domains, and the process continues for any number of candidate documents. Figure 1 illustrates the working of our proposed system. The process can be explained in detail with the following steps:

A. Step 1: Keyword Extraction through the <META> and <TITLE> tags of a web page

In our approach, the URLs and web pages with known domains are given as input to a tag extractor that extracts meta-tag and title keywords by traversing the URL up to a depth specified by the system. If a URL or web page does not contain meta-tag and title keywords, the back-links for the corresponding URL of the web page are extracted by a back-link extractor. The extracted back-links are then also provided to the tag extractor, just like the others, for extracting keywords. These extracted keywords are saved for future reference by the ANN. The process is carried out for all such back-links whenever they are extracted.
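The tag extraction in Step 1 can be prototyped directly with Python's standard-library HTML parser. The sketch below is only illustrative: the class and function names are ours, and it simply collects the <TITLE> text and the content of <META name="keywords"> and <META name="description"> from a page's HTML source.

from html.parser import HTMLParser

class TagKeywordExtractor(HTMLParser):
    """Collects <title> text and the content of keywords/description META tags."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_terms = []
        self.meta_terms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() in ("keywords", "description"):
            content = attrs.get("content", "").replace(",", " ")
            self.meta_terms.extend(term.lower() for term in content.split())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_terms.extend(word.lower() for word in data.split())

def extract_keywords(html_text):
    """Return the META and TITLE terms of one page as a single keyword list."""
    parser = TagKeywordExtractor()
    parser.feed(html_text)
    return parser.meta_terms + parser.title_terms

In the full system these terms would be collected for every page reached up to the pre-specified depth and handed to the clustering module.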

B. Step 2: Assigning weights to the keywords extracted in Step 1, based on their domain

After the keywords have been extracted from the Meta and Title tags, they are stored in a domain-specific repository of keywords used for maintaining information about that domain; in other words, the keywords are saved domain-wise. Weights are then assigned to the keywords based on their number of occurrences. However, before assigning weights, keywords with a similar context may be grouped together to form small clusters of keywords. Both the keywords and their corresponding clusters must be stored in the domain-specific repository, as shown in Figure 1. For example, the following keywords with a similar context have been clustered together: carbohydrates, fats, proteins, minerals, vitamins. Another cluster might contain the related terms sex, sexual, and sexual health, whereas yet another might contain swim, swimmer, swimming, swimming pool, swimsuit. Here, the similarity might be based on words having the same base form, words frequently found together, words with similar meanings, and the like. Assume that our system considers just the following domains: Entertainment, Food, Medicine and Sports. A total of 258 clusters of keywords have been prepared for use by our proposed system. It is also assumed that every domain has a unique set of keywords, i.e., no two domains share a common keyword. In order to assign weights, our system uses a formula based on the number of occurrences of a cluster's keywords and the total number of web pages traversed for that domain (see step 3(d) of Figure 2). The corresponding weights are also saved in the domain-specific repository of keywords and clusters.
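The weight formula itself is not reproduced in the text above, so the following sketch is only an assumption derived from step 3(d) of the training algorithm in Figure 2, where a cluster's weight is driven by its keyword occurrence counts and the number of pages traversed for the domain. The function name and the example numbers are ours.

from collections import Counter

def assign_cluster_weights(keyword_counts, pages_traversed, clusters):
    # Assumed form (not necessarily the paper's exact formula):
    # W_i = (occurrences of the cluster's keywords in the domain repository)
    #       / (total number of web pages traversed for that domain)
    weights = {}
    for cluster_id, keywords in clusters.items():
        occurrences = sum(keyword_counts.get(k, 0) for k in keywords)
        weights[cluster_id] = round(occurrences / max(pages_traversed, 1), 2)
    return weights

# Hypothetical Entertainment repository: a 1-keyword cluster "beach" and a humour cluster.
counts = Counter({"beach": 2, "fun": 3, "jokes": 1, "humor": 1})
clusters = {"c_beach": ["beach"], "c_humour": ["fun", "jokes", "humor"]}
print(assign_cluster_weights(counts, pages_traversed=10, clusters=clusters))
# -> {'c_beach': 0.2, 'c_humour': 0.5}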
C. Step 3: Training the Neural Network from exemplary web pages with known domains

The keywords, along with their associated weights, and the exemplary web pages or hypertext documents are used to train the ANN so that it learns what kinds of keywords belong to which domain; for example, for the entertainment domain the related terms can be fun, humor, jokes, travelling, tourism, etc. In order to train the neural network for various web sites, the tag extractor extracts the meta-tag and title keywords (if available) up to a pre-specified depth for every URL or web page stored in the index of domain-wise classified web pages. Further, keywords with a similar context belonging to a domain may be grouped together, and the clusters of keywords are assigned weights by referring to the domain-specific repository of keywords and clusters. The neural network is then trained by providing these weights and the domains (according to which classification needs to be done) as inputs to the training module, as shown in Figure 1. The algorithm used by our system for training the neural network for web page domain identification and classification is depicted in Figure 2 below.

Input: seed web pages and their URLs, a list of domains, a clustering algorithm
Output: learned data (domain-specific data repositories pertaining to the data of each individual domain)
Procedure:
1. Store each web page and URL, along with its specified domain information, in an initial index.
2. For each stored page and its URL:
3. if (meta and title tags exist)
   a) extract the keywords or terms from them up to a pre-specified depth;
   b) save the extracted key terms and their frequency of occurrence in the corresponding domain-specific data repository;
   c) obtain the clusters of keywords, say C1, C2, C3, ..., Cn, using the specified clustering algorithm;
   d) to each Ci assign a weight Wi using the number of occurrences of its keywords or terms and the total number of web pages traversed for that domain;
   e) based on the assigned weight Wi, allot a domain to the cluster and its constituent keywords, and store the information in the corresponding domain-specific repository.
   else
   a) extract its backlinks and store them in the index;
   b) for each newly added web page and its URL, repeat the same sequence as in step 3.

Figure 2. Algorithm for training the neural network

It may also be the case that a web page contains keywords from more than one domain. The web page then has to be traversed for all the domains, so as to prepare an input matrix and an output matrix that contain the keywords and their associated weights. The weights corresponding to every keyword and/or cluster of every web page are fetched from the domain-specific repository of keywords and clusters in order to prepare the input and output matrices for training the neural network. In our system, the various domains are represented by numbers: Entertainment = 0; Food = 1; Medicine = 2; Sports = 3.

Neural Network Model Specifications: The neural network model used in this work has the following specifications:

TABLE I. NEURAL NETWORK SPECIFICATION
Inputs | Neurons in input layer | Neurons in hidden layer | Neurons in output layer
21*4 = 84 | 4 | 5 | 1

Input Matrix: The input matrix used for training (shown in Table II below) is a 21x4 matrix prepared for every web site: for each of the domains used by our system (entertainment, food, medicine and sports), 20 keyword-cluster weights are provided as input, along with their row-wise sum in the 21st column. For example, the weight in the first column for the entertainment domain is 0.2; it has been calculated using the afore-mentioned formula for the 1-keyword cluster beach. Similarly, all the other keyword clusters are assigned weights using the afore-mentioned formula. The last column depicts the row-wise sum of all the assigned weights; here, the row-wise sum for the entertainment domain is 0.47. If a web site has fewer than 20 keywords for a particular domain, the remaining entries are given an input of 0, as shown in Table II below.
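A minimal sketch of how such an input matrix could be assembled is given below. The function name, padding behaviour and dictionary layout are our assumptions; the 20-column width, the row-wise sum in the 21st column, and the four domain rows follow Tables I and II.

DOMAINS = ["entertainment", "food", "medicine", "sports"]  # codes 0..3
MAX_CLUSTERS = 20                                           # columns K1..K20 in Table II

def build_input_matrix(page_weights):
    """page_weights maps a domain name to the cluster weights found for the page.
    Returns four rows of 21 values (20 weights, zero-padded, plus the row sum),
    i.e. the 21*4 = 84 inputs listed in Table I."""
    matrix = []
    for domain in DOMAINS:
        weights = (page_weights.get(domain, []) + [0.0] * MAX_CLUSTERS)[:MAX_CLUSTERS]
        matrix.append(weights + [round(sum(weights), 2)])
    return matrix

# The Entertainment and Food rows of Table II, for illustration:
example = {"entertainment": [0.2, 0.27],
           "food": [0.36, 1, 0.36, 0.42, 0.36, 0.1]}
for row in build_input_matrix(example):
    print(row)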

Output Matrix: The output matrix used for training is a 21x1 matrix that reflects the domain of the web site to be classified, based on the entry in its last cell. For every column of the input matrix, the maximum value is found. For example, the maximum entry in the first column of the input matrix shown in Table II below is 0.36, which belongs to the food domain; the code for the food domain is 1, which is reflected in the first column of the output matrix. The whole output matrix is prepared in the same way. The overall output (i.e., the domain of the web site) is reflected by the entry in the last cell, here 2, which represents the code for medicine; the last cell corresponds to the domain whose row-wise sum is the maximum among the sums of all four domains. The output can also be supported by counting the number of occurrences of each domain code in the output matrix. Here, the number of occurrences is zero for the entertainment domain, five for the food domain, thirteen for the medicine domain and two for the sports domain, so the maximum number of occurrences is for the medicine domain. The domain whose sum and number of occurrences are maximum is the predicted domain for the corresponding web page; the data in Table II, for example, belongs to a web site of the medicine domain.

TABLE II. TRAINING DATA: INPUT & OUTPUT MATRICES

Input matrix (keyword-cluster weights K1-K20 and the row-wise sum):
Entertainment (0): 0.2, 0.27, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0; Sum = 0.47
Food (1): 0.36, 1, 0.36, 0.42, 0.36, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0; Sum = 2.6
Medicine (2): 0.04, 0.18, 0.54, 0.27, 0.09, 0.04, 0.59, 0.13, 0.09, 0.63, 0.13, 0.95, 0.04, 1.04, 0.04, 0.04, 0.04, 0.13, 0.59, 0.04; Sum = 5.64
Sports (3): 0.16, 0.05, 0.05, 0.33, 0.16, 0.05, 0.11, 0.22, 0.27, 0.16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0; Sum = 1.56

Output matrix (column-wise maximum, coded by domain): 1, 1, 2, 1, 1, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2; Result = 2 (Medicine)
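The decision rule just described can be written down compactly. The sketch below (names are ours) takes the four 21-value rows of Table II, derives the output matrix as the column-wise maxima coded by domain, picks the domain with the largest row-wise sum, and uses the occurrence counts as the supporting metric.

def predict_domain(matrix, domains=("entertainment", "food", "medicine", "sports")):
    """matrix: one row of 21 values (20 cluster weights + their sum) per domain."""
    # Output matrix: for every keyword column, the code of the domain with the largest weight.
    output = [max(range(len(domains)), key=lambda d: matrix[d][col]) for col in range(20)]
    # Primary decision: the domain whose row-wise sum (last cell) is the largest.
    by_sum = max(range(len(domains)), key=lambda d: matrix[d][-1])
    # Supporting metric: how often each domain code occurs in the output matrix.
    by_count = max(range(len(domains)), key=output.count)
    return domains[by_sum], domains[by_count], output

# Fed with the four rows of Table II, both criteria agree on "medicine" (code 2).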
D. Step 4: Testing the Neural Network to predict the domain of web pages whose domain needs to be identified, and classifying the web pages accordingly

After the neural network has been trained with web pages whose domain is already identified, the trained network is used to predict the domain of web pages whose domain needs to be identified, in order to classify them. Here again, input matrices prepared for the various web sites are fed to the neural network, and the network predicts the domain of a web site based on the input matrix fed to it. Initially, the web page whose domain is to be predicted is given as input to the system through the tag extractor, which extracts the meta-tag and title keywords (if available) up to a pre-specified depth; a clustering algorithm then creates clusters of similar keywords. Figure 3 shows the proposed algorithm for using the neural network to predict the domain of a given web page.

Input: a web page with unknown domain
Output: the predicted domain of the web page and the assigned category label
Procedure:
1. For the given web page and its URL:
2. if (meta and title tags exist)
   a) extract the keywords or terms from them up to a pre-specified depth;
   b) obtain a set of clusters of keywords, say C = {C1, C2, C3, ..., Cn}, using the specified clustering algorithm;
   c) for each cluster Ci ∈ C: if a similar cluster Ck exists within the learned training data (the domain-specific repositories), fetch its corresponding weight and domain information from the repository and assign the values to Ci; else discard the cluster;
   d) find a subset SC = {SC1, SC2, SC3, ..., SCm} of C such that each SCi contains all the clusters that have been assigned a common domain in the above step, where m equals the number of domains under consideration;
   e) for each item SCi of the subset SC, add the assigned weights of all the clusters Cj belonging to SCi and assign the value to SCi for later predicting the domain of the hypertext document;
   f) of all the SCi, find the one that has the maximum value for the sum of weights and use its domain as the predicted domain of the web page or hypertext document;
   g) in case of a conflict between the maximum total weights, use the number-of-occurrences metric to resolve the conflict.
   else
   a) extract its backlinks and store them in the index;
   b) for each newly added web page and its URL, repeat step 2.

Figure 3. Algorithm for predicting the domain of a web page using the configured neural network

If a cluster similar to any of the obtained clusters already exists in the repository, the same weight and domain information is associated with the obtained cluster. However, if none of the keywords of the cluster exist in the repository, the system simply discards the cluster. After the weights and domains of all clusters have been fetched, they are gathered together and separated domain-wise for the corresponding web page or URL to finally predict the web page's domain.

IV. EXPERIMENTAL RESULTS

A total of 89.75% of the web pages have been correctly classified by our proposed system, as shown in Table III below. It can also be inferred that the combination of the two tags, META and TITLE, provides an efficient and accurate way to identify and categorize web pages, thus helping an end user explore or find web pages of the desired classes effectively. The prediction is made on the basis of the maximum-sum cell, as shown in Table II above, and the web pages are classified accordingly. For example, for the URL http://www.aapkadoctor.com/, which belongs to the medicine domain, the maximum-sum cell depicts the output to be 2, i.e., the medicine domain, and the page is thus correctly classified.

TABLE III. EXPERIMENTAL RESULTS OBTAINED

Domain of web pages | % correctly classified | % incorrectly classified | % that could not be classified
Entertainment (0) | 66.67% | 22.22% | 11.11%
Food (1) | 100% | 0% | 0%
Medicine (2) | 100% | 0% | 0%
Sports (3) | 93.75% | 6.25% | 0%
Total | 89.75% | 7.69% | 2.56%

However, there are also cases where the system is not able to predict the domain of a web page from the weights that are input to it, i.e., the data is inaccurate for a precise prediction. In such a case, the back-links for the URL are extracted; for each extracted back-link, the meta-tag and title-tag keywords are extracted by the tag extractor, and this data is fed to the neural network again in order to predict the domain of the web page. The system is then able to predict the domain of the web page correctly. Although web pages contain useful features as discussed above, these features are sometimes missing, misleading, or unrecognizable for various reasons in particular web pages, for example web pages containing large images or Flash objects but little textual content. In such cases it is difficult for classifiers to make reasonable judgments based on the features on the page. Our system deals with this problem to some extent by extracting hints from neighboring pages (through a link extractor) that are related in some way to the page under consideration and supply the supplementary information necessary for prediction and classification.

V. CONCLUSION & FUTURE WORK

A novel approach for domain identification of web pages, along with their classification, has been proposed in this paper. In the proposal, both meta-tag and title-tag keywords have been used for the purpose. In the future, classification performance is expected to improve if other factors are taken into account, such as applying a cumulative metric of both the maximum sum and the number of occurrences. Classification performance is also expected to improve if other features of an HTML page are considered. For example, the URL of a web page may provide hints regarding its domain [12]; in this case, the URL may be input to a tokenizer that creates meaningful tokens (n-grams) which provide hints about the domain of the web page. The anchor text present in web pages may also prove useful in determining a web page's domain. Finally, since good-quality document summarization can accurately represent the major topic of a web page, summarization can also help in classifying web pages accurately.
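As a purely illustrative aside on the URL-tokenization idea mentioned above (the splitting rule and the stop-list below are our assumptions, not part of the proposed system), such a tokenizer might look like this:

import re

def url_tokens(url, n=3):
    """Break a URL into word tokens and character n-grams that may hint at its domain."""
    stop = {"http", "https", "www", "com", "org", "html"}
    words = [w for w in re.split(r"[^a-z0-9]+", url.lower()) if w and w not in stop]
    ngrams = [w[i:i + n] for w in words for i in range(len(w) - n + 1)]
    return words, ngrams

# e.g. url_tokens("http://www.aapkadoctor.com/") yields the word "aapkadoctor",
# whose trigrams include "doc" and "tor", overlapping with medical vocabulary.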
REFERENCES

[1] The WWW Consortium, HTML 4.01 Specification, W3C, 1999.
[2] Meta tags, Frontware International.
[3] J. Hayes et al., "A system for content-based indexing of a database of news stories," in Proc. Second Annual Conference on Innovative Applications of Artificial Intelligence, 1990.
[4] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific web resource discovery," Computer Networks, 31(11-16), pp. 1623-1640, 1999.
[5] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, and M. Gori, "Focused crawling using context graphs," in Proc. 26th International Conference on Very Large Data Bases, Cairo, Egypt, pp. 527-534, 2000.
[6] J. Yi and N. Sudershesan, "A classifier for semi-structured documents," in KDD, Boston, MA, USA, 2000.
[7] W.-C. Wong and A. W.-C. Fu, "Incremental Document Clustering for Web Page Classification," Chinese University of Hong Kong, July 2000.
[8] J. M. Pierre, "On the Automated Classification of Web Sites," Linköping Electronic Articles in Computer and Information Science, Vol. 6, 2001.
[9] H. Yu, J. Han, and K. C.-C. Chang, "PEBL: positive example based learning for web page classification using SVM," in KDD '02: Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 239-248, New York, NY, USA, 2002.
[10] D. Cai, S. Yu, and J. Wen, "Extracting Content Structure for Web Pages Based on Visual Representation," in Proc. Asia-Pacific Web Conference (APWeb), 2003.
[11] K. Golub and A. Ardö, "Importance of HTML structural elements and metadata in automated subject classification," in Proc. 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), LNCS Vol. 3652, Berlin, pp. 368-378, Springer, September 2005.
[12] U. Schonfeld, Z. Bar-Yossef, and I. Keidar, "Do not crawl in the DUST: different URLs with similar text," in Proc. 15th International World Wide Web Conference, pp. 1015-1016, New York, NY, USA, 2006.
[13] S. M. Kamruzzaman, "Web Page Categorization Using Artificial Neural Networks," in Proc. 4th International Conference on Electrical Engineering & 2nd Annual Paper Meet, 26-28 January 2006.
[14] X. Qi and B. D. Davison, "Knowing a web page by the company it keeps," in Proc. International Conference on Information and Knowledge Management (CIKM), pp. 228-237, 2006.
[15] X. Qi and B. D. Davison, "Web Page Classification: Features and Algorithms," Department of Computer Science & Engineering, Lehigh University, June 2007.
[16] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer, 2007.
[17] Q. Xu and W. Zuo, "First-order Focused Crawling," in WWW 2007, ACM, pp. 1159-1160.
[18] D. Xhemali, C. J. Hinde, and R. G. Stone, "Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages," IJCSI International Journal of Computer Science Issues, Vol. 4, No. 1, 2009.
[19] G. McGovern, "A step by step approach to web page categorization," www.gerrymcgovern.com.
[20] A. An and X. Huang, "Feature selection with rough sets for web page categorization," York University, Toronto, Ontario, Canada.
[21] A. P. Asirvhatam and K. K. Ravi, "Web Page Categorization based on Document Structure," International Institute of Information Technology, Hyderabad, India.
[22] C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal, "Web search using automated classification," in Proc. Sixth International World Wide Web Conference, Santa Clara, CA.
[23] D. Riboni, "Feature Selection for Web Page Classification," Università degli Studi di Milano, Italy.
[24] E. Adar, J. Teevan, S. T. Dumais, and J. L. Elsas, "The web changes everything: Understanding the dynamics of web content," in Proc. Second ACM International Conference on Web Search and Data Mining, pp. 282-289, February 2009.
[25] S. Shibu, A. Vishwakarma, and N. Bhargava, "A combination approach for Web Page Classification using Page Rank and Feature Selection Technique," International Journal of Computer Theory and Engineering, Vol. 2, No. 6, December 2010.
[26] G. Attardi, A. Gulli, and F. Sebastiani, "Automatic Web Page Categorization by Link and Context Analysis."
[27] The MathWorks, http://www.mathworks.com/products/neuralnet/.
[28] S. Lawrence and C. L. Giles, "Searching the World Wide Web," Science, Vol. 280, pp. 98-100, 1998. www.sciencemag.org
[29] J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo, "Sources of evidence for vertical selection," in Proc. 32nd International Conference on Research and Development in Information Retrieval (SIGIR '09), pp. 315-322, ACM, New York, USA, 2009.