A System's Approach Towards Domain Identification of Web Pages


Sonali Gupta
Department of Computer Engineering, YMCA University of Science & Technology, Faridabad, India
Sonali.goyal@yahoo.com

Komal Kumar Bhatia
Department of Computer Engineering, YMCA University of Science & Technology, Faridabad, India
komal_bhatia1@rediffmail.com

Abstract: With the proliferation of documents (commonly HTML documents, or web pages) on the WWW, efficient ways of exploring relevant documents are of increasing importance [4, 8]. The key challenge lies in tackling the sheer volume of documents on the Web and evaluating relevancy for such a huge number. Efficient exploration needs a web crawler that can semantically understand and predict the domain of a web page through analytical processing. This will not only facilitate efficient exploration but also help in the better organization of web content. Just as a search engine classifies search results by keyword matches, link analysis and other such mechanisms, the paper proposes a solution to the domain identification problem that finds keywords or key terms representative of the page's content through elements such as <META> and <TITLE> in the HTML structure of the web page [11]. The paper proposes a two-step framework that first automatically identifies the domain of the specified web page and then, with the domain information thus obtained, classifies the web content according to the different pre-specified categories. The former uses the various HTML elements present in the web page, while the latter is achieved using Artificial Neural Networks (ANN).

Keywords: Search engine; crawler; domain-specific; HTML elements; META; TITLE; classification; categorization; Artificial Neural Networks

I. INTRODUCTION

In recent years, the Web has become a huge information repository due to the increasing prevalence of documents and databases online [28].
There are numerous pages accessible on the World Wide Web, and the number continues to increase by approximately 1.5 million on a daily basis [24, 28]. The more the web content evolves and varies, the more difficult it becomes to support the design and implementation of automatic information retrieval tools, among which the most commonly used are web search engines. A search engine discovers content as a first step towards its retrieval and then, after indexing, presents the required information to the user in ranked order. General-purpose search engines cannot keep up with the growing pace of the Web and are able to index only a small fraction of the content that is available. To tackle this issue of scalability, focused crawlers [4, 5, 17] and domain-specific search services such as vertical search engines [29] have come up.

The large and ever-expanding scale of the Web, full of promising opportunities such as serving the varying needs of varied users, raises the issue of efficiently extracting relevant information and maintaining it in an organized way. Determining the relevancy of a document from a huge document corpus involves predicting the topic or domain (say entertainment, food or sports) of the web page and categorizing it by organizing similar web pages into a common group, usually known as the class or category of the web page. The categories need not be mutually exclusive, and the same web page may be assigned to one or more categories. Manually classifying each page is not feasible, as identifying the domain of each and every page is itself tedious and time-consuming for humans. At the same time, categorization facilitates the automation of domain identification, as documents on the Web can be easily reached by following the hyperlinked structure. Hence, automatic classification calls for automatic domain identification and vice versa.
The two activities, being interdependent, can be thought of as two sides of the same coin.

We propose a novel approach for identifying the topic or domain of web pages using the information available in the <META> and <TITLE> tags of the web page's HTML structure [1, 2, 11, 15, 20, 21]. The proposed system also makes use of Artificial Neural Networks [27] to achieve the prime goal of domain identification, which may later be used to achieve the secondary goal of web page classification. The paper contributes the following: (i) it enumerates the utility of HTML elements such as <META> and <TITLE> in the process of domain identification; (ii) it emphasizes the ability of Artificial Neural Networks to achieve the stated goals; and (iii) it develops a system that solves the web page classification problem based on the above-mentioned features and may help in focusing a crawl.

The rest of the paper is organized as follows: the background of the problem and the state of the art are reviewed in section 2; the detailed working of our proposed system is presented in section 3; the experimental results and the advantages of the proposed system are discussed in section 4; and section 5 concludes with some future directions for the work.

II. BACKGROUND AND STATE-OF-THE-ART

Information retrieval and management are the two prime tasks from the perspective of Web users [28]. The aim of any search service employed for information retrieval is to efficiently build a high-quality collection of hypertext documents belonging to a specific domain and to return an effective set of results to the user as quickly as possible in response to the posed query. However, this efficiency in building the index [3] and returning results can only be achieved if the search system deals with an organized document set in both its input and its output. An organized input set implies a well-organized collection of documents on the WWW, whereas an organized output set means that the search engine only indexes and maintains pages belonging to a specific set of topics or domains that together represent a relatively narrow segment of the Web. The advantage of such a search system is that sufficient coverage is achieved with a small investment in hardware and network resources. The crawler for such a search engine must be guided by a predictor that tries to identify the topic or domain of the web page and evaluate its relevance to the search engine's specialization.

Managing information on the WWW can be achieved by employing numerous such specialized search engines, which will help in automatic web page classification. Various machine learning techniques have been developed that automatically learn classification models, called classifiers, from training examples [4, 6, 17]. The learned classifiers can then be applied to predict the classes of new documents.
Thus, web page classification not only assists the organization of documents into hierarchical collections such as the Open Directory Project (DMOZ) but also aids a wide variety of information retrieval problems such as focused crawling and question answering. Hierarchical organization also facilitates the retrieval of information, though through a tedious process of browsing.

Behind the visual representation of each web page rendered by the browser lies a text representation in HTML. Most approaches for predicting the domain of a web page and then classifying it rely on the text representation while simply ignoring the visual layout of the page, which may be useful as well [10]. The following classification mechanisms, based on the text representation of the hypertext document, have been proposed so far for the purpose [4, 10, 11, 14, 15, 16, 21, 25]:

1. Manual categorization or classification: A number of domain experts analyze the text contents of a web page manually and assign a category or domain to it. An example is the approach followed by Yahoo for organizing documents in its directory structure. The approach has the obvious advantage of accuracy but is challenged by the unprecedented scale of the WWW and is infeasible for the huge number of web documents.

2. Text clustering approaches: Clustering is an unsupervised learning process and does not need any background information to create clusters of similar documents. Although the process is comparatively easy and fast and has become very popular, it is expensive at scale and has not been employed against the sheer number of documents on the Web.

3. Content-based categorization: This approach relies on first creating, from an exemplary set of documents, an index database for each category that contains only the key terms belonging to that category (after removing stop words and obtaining the frequency of occurrence of each term).
The candidate document is then classified by extracting its key terms and choosing the index that it resembles the most, so that it is assigned to one of the categories. The approach does not take full advantage of the page being a hypertext document and hence does not use any other relevant feature that can be drawn from its HTML structure, such as images or multimedia content.

4. Link and content analysis: Based on the hyperlinked structure (in-links, out-links, associated anchor texts etc.), these approaches find hints about the contents of documents and use the gathered hints to classify the referred document. They take advantage of neighboring pages that have already been assigned a category label but may suffer significantly in performance when the category labels of the neighboring pages are not available, as is usually the case. Chakrabarti et al. (1998), Slattery and Mitchell (2000), and Calado et al. (2003) used such labels.

5. Categorization based on META tags: The approach relies solely on the content attributes of the META tags (<META name="keywords"> and <META name="description">) [2]. The approach faces a problem when irrelevant words are specified as keywords just to increase the page's hit ratio in search engine results.

All the above approaches except the first are a step towards automating the process of domain identification. Experiments in [23] show that the most accurate classifier was obtained by using Meta tags as the only text feature. Including any other tag (even Body) with the Meta tag reduces accuracy and noticeably decreases the precision of the classifier [23]. However, limiting our system to just Meta tags would not help in classifying a large majority of web documents, because Meta tags are not widely used.
Therefore, we consider Meta and Title tags collectively in our system and use a link extractor that extracts back-links to derive hints from the neighboring pages and supplement the process in case no information can be obtained from the tags. Using these features together can significantly improve identification and classification accuracy.

A critical look at the above literature shows that:

- Most of the existing algorithms have used the text content of a web page for identifying its domain and selecting the most suitable category [6, 7, 10].
- Most HTML tags emphasize representation rather than semantics, yet the structural information derived from the tags may prove useful for predicting the domain of the hypertext document and can boost a classifier's performance [11, 14, 19]. If a page has been created with care, the information in the title and headers may be more important than that in the prose. Combining such features can significantly improve identification and classification accuracy [15, 16].
- Most work in the field of web page classification has been accomplished using clustering algorithms and classifiers such as Naïve Bayes, decision trees and Support Vector Machines [6, 9, 18].

The paper contributes towards developing a system for automatic domain identification of hypertext documents while keeping the above characteristics in view. The proposed system has been developed to take advantage of all the above automatic approaches. The result of our system depends on the weights of the various clusters formed by extracting keywords from the tag structure of the web page. The next section explains the proposed approach in detail.

III. PROPOSED APPROACH OF THE DOMAIN IDENTIFICATION SYSTEM

In order to address the problems associated with the manual approach to domain identification, a system that facilitates efficient exploration and better organization of web content has been proposed through automatic domain identification and classification of web pages. Our solution comprises gathering domain knowledge from the HTML structure of the referred web page, extracting back-links (in case information cannot be derived from the tags in its HTML structure) and finally assigning an appropriate category or class to the web page by using Artificial Neural Networks (ANN) [13, 27]. The major components and modules of our system are:

- An index of web pages, their URLs and domain information
- A tag extractor
- A back-link extractor
- A clustering module
- A domain-specific repository of keywords and clusters of keywords
- A classifier based on ANN, which comprises a training module and a testing module
Our system's approach is based on the use of artificial neural networks that must first be trained on an exemplary data set and are later used to carry out the assigned task on new candidate hypertext documents. For training, the system is initially provided with a set of web pages (and their URLs) with known domains. This seed set of web pages and URLs can be obtained from either a Web directory or the result listing of a search engine. Thereafter, an index is created that stores the web pages along with their domain information. To train the neural network, a URL or web page is taken from this index and given as input to the tag extractor. The tag extractor extracts the meta-tag and title keywords associated with the URL or web page up to a pre-specified depth. The extracted keywords are then grouped into clusters based on various similarity metrics that are already known in the art. The clustering module is responsible for the creation of these clusters of keywords. However, if no meta-tag and title keywords are found, back-links are extracted for that URL or web page by the back-link extractor. The extracted back-links and their corresponding web pages are added to the index so that new keywords can be extracted from them in turn. The clusters of similar keywords are saved in a domain-specific repository of keywords. Thereafter, the keywords and clusters of keywords stored in the domain-specific repository are assigned weights. The neural network that will be used for domain identification and classification of any new hypertext document is then trained on the weights of the clusters of keywords associated with web sites whose domains are already identified. The calculation of weights and their assignment to every keyword or cluster is explained in detail in step 2 below.
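The tag-extraction stage described above can be illustrated with a minimal sketch using Python's standard html.parser module. The names MetaTitleExtractor and extract_keywords are hypothetical and are not taken from the paper; traversal up to a pre-specified depth, stop-word removal and back-link retrieval are deliberately left out, so this shows only how keywords might be pulled from the <TITLE> element and the keywords/description <META> elements of a single page.

```python
from html.parser import HTMLParser
import re


class MetaTitleExtractor(HTMLParser):
    """Collects <TITLE> text and the content of keywords/description <META> tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []
        self.meta_contents = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = {name: (value or "") for name, value in attrs}
            if attr.get("name", "").lower() in ("keywords", "description"):
                self.meta_contents.append(attr.get("content", ""))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_keywords(html):
    """Return lowercase word tokens from the title and META tags (assumed tokenization)."""
    parser = MetaTitleExtractor()
    parser.feed(html)
    text = " ".join(parser.title_parts + parser.meta_contents).lower()
    return re.findall(r"[a-z]+", text)


page = (
    "<html><head><TITLE>Healthy Food Recipes</TITLE>"
    '<META NAME="Keywords" CONTENT="vitamins, proteins"></head>'
    "<body>body text is ignored</body></html>"
)
keywords = extract_keywords(page)
# An empty result is the situation in which the paper falls back to back-links.
needs_backlinks = not extract_keywords("<html><body>no tags here</body></html>")
```

An empty keyword list is the signal that, in the described system, would hand the URL over to the back-link extractor.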
[Figure 1. The proposed system for domain identification. Labelled elements: Search Engine, Seed URLs, WWW, URLs or Web Pages, List of URLs, Index of Domain-wise Classified Web Pages, Classifiers, Back-Link Extractor, Meta-Tags & Title Keywords Extractor, Domains, Training Module, Keywords, Domain-Specific Repository of Keywords & Clusters, Clustering Module, Clusters of Similar Keywords, New URL or Web Pages, Testing Module, Domain-wise Classified Web Pages.]

Similarly, based on the weights of the clusters of keywords (fetched from the domain-specific repository of keywords and clusters), the domain of any new web page is identified and the page is thus classified into one of the categories using the trained neural network. The classified web sites are thereafter stored in the index under their respective domains, and the process continues for any number of candidate documents. Figure 1 illustrates the working of our proposed system. The process is explained in detail in the following steps.

A. Step 1: Keyword Extraction through <META> and <TITLE> tags of a web page

In our approach, the URLs and web pages with known domains are given as input to a tag extractor that extracts meta-tag and title keywords by traversing the URL up to a depth specified by the system. If a URL or web page does not contain meta-tag and title keywords, the back-links for the corresponding URL are extracted by a back-link extractor. The extracted back-links are then provided to the tag extractor, just like the other URLs, for extracting keywords. The extracted keywords are saved for future reference by the ANN. The process is carried out for all such back-links whenever they are extracted.

B. Step 2: Assigning weights to the keywords extracted in step 1, based on their domain

After the keywords have been extracted from the Meta and Title tags, they are stored in a domain-specific repository of keywords, which maintains information about that domain. In other words, the keywords are saved domain-wise. Weights are then assigned to the keywords based on their number of occurrences. Before weights are assigned, however, keywords with similar context may be grouped together to form small clusters of keywords. Both the keywords and their corresponding clusters must be stored in the domain-specific repository, as shown in Figure 1. For example, the following keywords with similar context have been clustered together: carbohydrates, fats, proteins, minerals, vitamins. Another cluster might contain related terms such as sex, sexual, and sexual health, whereas another might contain swim, swimmer, swimming, swimming pool, swimsuit. Here, the similarity might be based on words having the same base form, words frequently found together, words with similar meanings and the like.

Assume that our system considers just the following domains: Entertainment, Food, Medicine and Sports. A total of 258 clusters of keywords have been prepared for use by our proposed system. It is also assumed that every domain has a unique set of keywords, i.e., no two domains share a common keyword. To assign weights, our system uses a formula based on the number of occurrences of the keywords or terms and the total number of web pages traversed for that domain. The corresponding weights are also saved in the domain-specific repository of keywords and clusters.

C. Step 3: Training the Neural Network from exemplary web pages with known domains
The keywords, along with their associated weights and the exemplary web pages or hypertext documents, are used to train the ANN so that it learns what kinds of keywords belong to which domain. For example, for the entertainment domain the related terms can be: fun, humor, jokes, travelling, tourism etc. To train the neural network for various web sites, the meta-tag extractor extracts the meta-tag and title keywords (if available) up to a pre-specified depth for every URL or web page stored in the index of domain-wise classified web pages. Keywords with similar context belonging to a domain may then be grouped together. Thereafter, the clusters of keywords are assigned weights by referring to the domain-specific repository of keywords and clusters of keywords. The neural network is then trained by providing these weights and the domains (according to which classification needs to be done) as inputs to the training module, as shown above in Figure 1. The algorithm used by our system for training the neural network for web page domain identification and classification is depicted in Figure 2 below.

Input: seed web pages and their URLs, a list of domains, a clustering algorithm
Output: learned data (domain-specific data repositories that hold the data of each individual domain)
Procedure:
1. Store each web page and its URL, along with the specified domain information, in an initial index
2. For each stored page and its URL
3.   if (meta and title tags exist)
       3 a) extract the keywords or terms from them up to a pre-specified depth;
       3 b) save the extracted key terms and their frequency of occurrence in the domain-specific data repository
       3 c) obtain the clusters of keywords, say C1, C2, C3, ..., Cn, using the specified clustering algorithm
       3 d) to each Ci assign a weight Wi using the no. of occurrences of the keywords or terms and the total no. of web pages traversed for that domain
       3 e) based on the assigned weight Wi, allot a domain to the cluster and its constituent keywords, and store the information in the corresponding domain-specific repository
     else
       3 a) extract its back-links and store them in the index
       3 b) for the newly added web page and its URL, repeat the same sequence as in step 3

Figure 2. Algorithm for training the neural network

A web page might also contain keywords from more than one domain. The web page therefore has to be traversed for all the domains, so as to prepare an input and an output matrix that contain the keywords and their associated weights. The weights corresponding to every keyword and/or cluster for every web page are fetched from the domain-specific repository of keywords and clusters in order to prepare the input and output matrices for training the neural network. In our system, the domains are represented by numbers as follows: Entertainment=0; Food=1; Medicine=2; Sports=3.

Neural Network Model Specifications: The neural network model used in this work has the following specifications.

TABLE I. NEURAL NETWORK SPECIFICATION
[Columns: Neurons in Input layer, Neurons in Hidden layer, Neurons in Output layer. The only entry preserved is "21*4=" inputs.]

Input Matrix: The input matrix used for training (shown in Table II) is a 21x4 matrix prepared for every web site: for each of the 4 domains used by our system (i.e. entertainment, food, medicine and sports) we provide an input of 20 keyword weights, along with their row-wise sum in the 21st column. For example, the weight in the first column for the entertainment domain is 0.2. It has been calculated by using the afore-mentioned formula for a 1-keyword cluster, beach.
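The layout of the input matrix described above can be sketched as follows. This is only an illustration under stated assumptions: the weights are taken as already fetched from the domain-specific repository (the weighting formula itself is not reproduced here), the function name build_input_matrix and the constant names are hypothetical, and truncating a domain to its first 20 weights is an assumption the paper does not state. Only the values 0.2 (entertainment) and 0.36 (food) come from the paper's example; the remaining numbers are invented for the demonstration.

```python
# Domain codes as used in the paper: Entertainment=0, Food=1, Medicine=2, Sports=3.
DOMAINS = ["Entertainment", "Food", "Medicine", "Sports"]
MAX_KEYWORDS = 20  # 20 keyword-weight columns, plus one column for the row-wise sum


def build_input_matrix(weights_by_domain):
    """Return one row of 21 values per domain: 20 weights (zero-padded) and their sum."""
    matrix = []
    for domain in DOMAINS:
        # Assumption: keep at most the first 20 weights found for the domain.
        weights = list(weights_by_domain.get(domain, []))[:MAX_KEYWORDS]
        # Domains with fewer than 20 keywords get 0 in the unused positions.
        row = weights + [0.0] * (MAX_KEYWORDS - len(weights))
        row.append(sum(row))  # 21st column: row-wise sum
        matrix.append(row)
    return matrix


# Hypothetical weights for one web site.
site_weights = {
    "Entertainment": [0.2],
    "Food": [0.36, 0.1],
    "Medicine": [0.3, 0.25, 0.4],
}
input_matrix = build_input_matrix(site_weights)
```

Each row of input_matrix corresponds to one domain code, so the row index doubles as the numeric label used when the output matrix is prepared.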

Similarly, all other keyword clusters are also assigned weights using the afore-mentioned formula. The last column holds the row-wise sum of all the assigned weights. Herein, the row-wise sum of the entertainment domain is [value missing]. However, if a web site has fewer than 20 keywords for a particular domain, the remaining entries are given an input of 0, as shown in Table II below.

Output Matrix: The output matrix used for training is a 21x1 matrix which reflects the domain of the web site to be classified through the entry in its last cell. For every column of the input matrix, the maximum value is found. For example, the maximum entry in the first column of the input matrix shown in Table II is 0.36, which belongs to the food domain. The code for the food domain is 1, which is reflected in the first column of the output matrix. The whole output matrix is prepared in the same way. The overall output (i.e., the domain of the web site) is given by the entry in the last cell, here 2, which is the code for medicine. The last cell is derived from the row-wise sums: it holds the code of the domain whose sum is the maximum amongst the sums of all 4 domains. The output can also be supported by counting the number of occurrences of each domain in the output matrix. For example, here the number of occurrences is zero for the entertainment domain, five for the food domain, thirteen for the medicine domain and two for the sports domain, so the maximum number of occurrences is for the medicine domain. The domain whose sum and number of occurrences are the maximum is the predicted domain for the corresponding web page. For example, the data in Table II belongs to a web site of the medicine domain.

TABLE II. TRAINING DATA: INPUT & OUTPUT MATRICES
[Training Data (Input Matrix): rows Entertainment (0), Food (1), Medicine (2), Sports (3); columns K1 to K20 (keyword weights) and Sum. Testing Data (Output Matrix): row Result (Domain-Wise). The cell values are not preserved.]

D. Step 4: Testing the Neural Network for predicting the domain of web pages whose domain needs to be identified and classifying the web pages accordingly

After the neural network has been trained with web pages whose domains are already identified, the trained network is used to predict the domain of web pages whose domain needs to be identified, in order to classify them. Here again, input matrices prepared for the various web sites are fed to the neural network, and the network predicts the domain of the web site based on the input matrix fed to it. Initially, the web page whose domain is to be predicted is given as input to the system through a tag extractor. The tag extractor extracts the meta-tag and title keywords (if available) up to a pre-specified depth. A clustering algorithm then creates clusters of similar keywords. Figure 3 shows the proposed algorithm for using the neural network to predict the domain of a given web page.

Input: a web page with unknown domain
Output: the predicted domain of the web page and the assigned category label
Procedure:
1. For the given web page and its URL
2.   if (meta and title tags exist)
       2 a) extract the keywords or terms from them up to a pre-specified depth;
       2 b) obtain a set of clusters of keywords, say C = {C1, C2, C3, ..., Cn}, using the specified clustering algorithm
       2 c) for each cluster Ci in C
              if a similar cluster Ck exists within the learned training data (the domain-specific repositories)
                fetch its corresponding weight and domain information from the repository and assign the values to Ci;
              else
                discard the cluster
       2 d) find a subset SC = {SC1, SC2, SC3, ..., SCm} of C such that SCi contains all the clusters that have been assigned a common domain in the above step, where m equals the number of domains under consideration
       2 e) for each item SCi of the subset SC, add the assigned weights of all the clusters Cj belonging to SCi and assign the value to SCi for later predicting the domain of the hypertext document
       2 f) of all the SCi, find the one that has the maximum sum of weights and use its domain as the predicted domain of the web page or hypertext document
       2 g) in case of a conflict between the maximum total weights, use the no. of occurrences to resolve the conflict
     else
       2 a) extract its back-links and store them in the index
       2 b) for the newly added web page and its URL, repeat step 2

Figure 3. Algorithm for predicting the domain of a web page using the configured neural network

If a cluster similar to any of the obtained clusters already exists in the repository, the same weight and domain information is associated with the obtained cluster. However, if none of the keywords of the cluster exist in the repository, the system simply discards the cluster. After the weights and domains of all clusters have been fetched, they are gathered together and separated domain-wise for the corresponding web page or URL to finally predict the web page's domain.

IV. EXPERIMENTAL RESULTS

A total of 89.75% of the web pages have been correctly classified by our proposed system, as shown in Table III below. It can also be inferred that the combination of the two tags META and TITLE provides an efficient and accurate way of identifying and categorizing web pages, thus helping an end user to explore or find web pages of the desired classes effectively. Prediction has been made on the basis of the maximum-sum cell, as shown in Table II above, and web pages have been classified accordingly. For example, for a URL that belongs to the medicine domain, the maximum-sum cell gives the output 2, i.e., the medicine domain, and the page is thus correctly classified.

TABLE III. EXPERIMENTAL RESULTS OBTAINED

Domain of web pages | Percentage of web pages correctly classified | Percentage of web pages incorrectly classified | Percentage of pages that couldn't be classified
Entertainment (0)   | 66.67% | 22.22% | 11.11%
Food (1)            | 100%   | 0%     | 0%
Medicine (2)        | 100%   | 0%     | 0%
Sports (3)          | 93.75% | 6.25%  | 0%
Total pages (4)     | 89.75% | 7.69%  | 2.56%
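The max-sum rule of Step 4 and Figure 3 (column-wise maxima for the output matrix, the largest row-wise sum for the overall domain, and the number of occurrences as a tie-breaker) can be sketched as below. This reproduces only the deterministic rule, not the trained neural network, and rests on assumptions the paper does not spell out: all-zero columns are given no domain code, and the occurrence count ignores the final sum cell. The function names and the small four-column demonstration matrices are hypothetical; in the paper each row would have 21 entries.

```python
from collections import Counter


def build_output_vector(input_matrix):
    """For every column, return the code (row index) of the domain holding the maximum."""
    output = []
    for col in range(len(input_matrix[0])):
        column = [row[col] for row in input_matrix]
        best = max(column)
        # Assumption: a column with no positive weight gets no domain code.
        output.append(column.index(best) if best > 0 else None)
    return output


def predict_domain(input_matrix):
    """Pick the domain with the largest row-wise sum; break ties by occurrences."""
    sums = [row[-1] for row in input_matrix]  # last column holds the row-wise sums
    top = max(sums)
    if top <= 0:
        return None  # nothing usable; the paper resorts to back-links in this case
    candidates = [code for code, total in enumerate(sums) if total == top]
    if len(candidates) == 1:
        return candidates[0]
    # Tie on the sums: count how often each domain wins a keyword column.
    output = build_output_vector(input_matrix)
    counts = Counter(code for code in output[:-1] if code is not None)
    return max(candidates, key=lambda code: counts[code])


# Rows: Entertainment (0), Food (1), Medicine (2), Sports (3); last entry is the sum.
clear_case = [
    [0.2, 0.0, 0.0, 0.2],
    [0.36, 0.0, 0.0, 0.36],
    [0.1, 0.5, 0.4, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
tied_case = [
    [0.6, 0.0, 0.0, 0.6],
    [0.3, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
clear_prediction = predict_domain(clear_case)
tied_prediction = predict_domain(tied_case)
```

In the tied case both Entertainment and Food reach the same sum, and Food is chosen because it holds the maximum in two keyword columns against one for Entertainment.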

However, there are also cases where the system is not able to predict the domain of a web page from the weights input to it, i.e., the data is too inaccurate for the system to make a precise prediction. In such a case, back-links for the URL are extracted. For each extracted back-link, meta-tag and title keywords are extracted by the meta-tag extractor. This data is then fed to the neural network again in order to predict the domain of the web page, and the system is then able to predict the domain correctly.

Although web pages contain useful features, as discussed above, these features are sometimes missing, misleading or unrecognizable in particular web pages, for example in web pages containing large images or Flash objects but little textual content. In such cases, it is difficult for classifiers to make reasonable judgments based on the features of the page. Our system deals with this problem to some extent by extracting hints from neighboring pages (through a link extractor) that are related in some way to the page under consideration and that supply the supplementary information necessary for prediction and classification.

V. CONCLUSION & FUTURE WORK

A novel approach for domain identification of web pages, along with their classification, has been proposed in the paper. In this proposal, both meta-tag and title tag keywords have been used for the purpose. In the future, classification performance is expected to improve if other factors are taken into account, such as applying a cumulative metric of both the maximum sum and the number of occurrences. Classification performance is also expected to improve if other features of an HTML page are considered. For example, the URL of a web page may provide hints regarding the domain of the web page [12].
In this case, the URL of a web page may be input to a tokenizer that creates meaningful tokens (n-grams), which provide hints about the domain of the web page. The anchor text present in web pages may also prove useful in determining a web page's domain. Since a good-quality document summary can accurately represent the major topic of a web page, summarization can also help in classifying web pages accurately.

REFERENCES

[1] The WWW Consortium. HTML 4.01 Specification. W3C, 1999.
[2] Meta tags. Frontware International.
[3] J. Hayes, W.S.P. A system for content-based indexing of a database of news stories. Proc. of Second Annual Conference on Innovative Applications of Artificial Intelligence.
[4] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific web resource discovery. Computer Networks, 31(11-16).
[5] M. Diligenti, F. Coetzee, S. Lawrence, C. L. Giles, M. Gori. Focused crawling using context graphs. In Proc. of 26th International Conference on Very Large Data Bases, Cairo, Egypt.
[6] J. Yi and N. Sudershesan. A classifier for semi structured documents. In KDD, Boston, MA, USA.
[7] Wai-Chiu Wong, A. Wai-C. Fu. Incremental Document Clustering for Web Page Classification. Chinese University of Hong Kong, July.
[8] John M. Pierre. On the Automated Classification of web sites. Linkoping Elec. Articles in Comp. and Info. Science, Vol. 6.
[9] H. Yu, J. Han, K. C. Chang. PEBL: positive example based learning for web page classification using SVM. In KDD 02: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA.
[10] D. Cai, S. Yu, J. Wen. Extracting Content Structure for Web Pages Based on Visual Representation. In Asia-Pacific Web Conference (APWeb).
[11] K. Golub and A. Ardo. Importance of HTML structural elements and metadata in automated subject classification. In Proceedings of the 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL), Volume 3652 of LNCS, Berlin. Springer, September 2005.
[12] U. Schonfeld, Z. Bar-Yossef, I. Keidar. Do not crawl in the DUST: different URLs with similar text. In Proceedings of the 15th International World Wide Web Conference, New York, NY, USA.
[13] S. M. Kamruzzaman. Web Page Categorization Using Artificial Neural Networks. Proceedings of the 4th International Conference on Electrical Engineering & 2nd Annual Paper Meet, January.
[14] X. Qi and B. D. Davison. Knowing a web page by the company it keeps. In International Conference on Information and Knowledge Management (CIKM).
[15] Xiaoguang Qi and Brian D. Davison. Web Page Classification: Features and Algorithms. Department of Computer Science & Engineering, Lehigh University, June.
[16] Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer.
[17] Qingyang Xu, Wanli Zuo. First-order Focused Crawling. ACM, WWW 2007.
[18] Daniela Xhemali, Christopher J. Hinde and Roger G. Stone. Naïve Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages. IJCSI International Journal of Computer Science Issues, Vol. 4, No. 1.
[19] Gerry McGovern. A step to step approach to web page categorization.
[20] Aijun An and Xiangji Huang. Feature selection with rough sets for web page categorization. York University, Toronto, Ontario, Canada.
[21] Arul Prakash Asirvhatam and Kranti Kumar Ravi. Web Page Categorization based on Document Structure. International Institute of Information Technology, Hyderabad, India.
[22] C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal. Web search using automated classification. In Proceedings of the Sixth International World Wide Web Conference, Santa Clara, CA.
[23] Daniele Riboni. Feature Selection for Web Page Classification. Universita degli Studi di Milano, Italy.
[24] Eytan Adar, Jaime Teevan, Susan T. Dumais, and Jonathan L. Elsas. The web changes everything: Understanding the dynamics of web content. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, February.
[25] Sini Shibu, Aishwarya Vishwakarma and Niket Bhargava. A combination approach for Web Page Classification using Page Rank and Feature Selection Technique. International Journal of Computer Theory and Engineering, Vol. 2, No. 6, December.
[26] Giuseppe Attardi, Antonio Gulli, Fabrizio Sebastiani. Automatic Web Page Categorization by Link and Context Analysis.
[27] The MathWorks.
[28] S. Lawrence, C. L. Giles. Searching the World Wide Web. Science, Vol. 280, 1998.
[29] J. Arguello, F. Diaz, J. Callan, J. F. Crespo. Sources of evidence for vertical selection. In 32nd International Conference on Research and Development in Information Retrieval, SIGIR 09. ACM, New York, USA, 2009.

Evaluation Methods for Focused Crawling

Evaluation Methods for Focused Crawling Evaluation Methods for Focused Crawling Andrea Passerini, Paolo Frasconi, and Giovanni Soda DSI, University of Florence, ITALY {passerini,paolo,giovanni}@dsi.ing.unifi.it Abstract. The exponential growth

More information

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Evaluating the Usefulness of Sentiment Information for Focused Crawlers Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

A genetic algorithm based focused Web crawler for automatic webpage classification

A genetic algorithm based focused Web crawler for automatic webpage classification A genetic algorithm based focused Web crawler for automatic webpage classification Nancy Goyal, Rajesh Bhatia, Manish Kumar Computer Science and Engineering, PEC University of Technology, Chandigarh, India

More information

Creating a Classifier for a Focused Web Crawler

Creating a Classifier for a Focused Web Crawler Creating a Classifier for a Focused Web Crawler Nathan Moeller December 16, 2015 1 Abstract With the increasing size of the web, it can be hard to find high quality content with traditional search engines.

More information

Web Crawling As Nonlinear Dynamics

Web Crawling As Nonlinear Dynamics Progress in Nonlinear Dynamics and Chaos Vol. 1, 2013, 1-7 ISSN: 2321 9238 (online) Published on 28 April 2013 www.researchmathsci.org Progress in Web Crawling As Nonlinear Dynamics Chaitanya Raveendra

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Domain Based Categorization Using Adaptive Preprocessing

Domain Based Categorization Using Adaptive Preprocessing Domain Based Categorization Using Adaptive Preprocessing Anam Nikhil 1, Supriye Tiwari 2, Ms. Arti Deshpande 3, Deepak Kaul 4, Saurabh Gaikwad 5 Abstract: As the number users accessing network for various

More information

Deep Web Crawling and Mining for Building Advanced Search Application

Deep Web Crawling and Mining for Building Advanced Search Application Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech

More information

Context Based Web Indexing For Semantic Web

Context Based Web Indexing For Semantic Web IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 12, Issue 4 (Jul. - Aug. 2013), PP 89-93 Anchal Jain 1 Nidhi Tyagi 2 Lecturer(JPIEAS) Asst. Professor(SHOBHIT

More information


INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

Automated Online News Classification with Personalization

Automated Online News Classification with Personalization Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information


TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Prof. T.P.Aher(ME), Ms.Rupal R.Boob, Ms.Saburi V.Dhole, Ms.Dipika B.Avhad, Ms.Suvarna S.Burkul 1 Assistant Professor, Computer

More information

A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information

A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information George E. Tsekouras *, Damianos Gavalas, Stefanos Filios, Antonios D. Niros, and George Bafaloukas

More information

A Supervised Method for Multi-keyword Web Crawling on Web Forums

A Supervised Method for Multi-keyword Web Crawling on Web Forums Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 2, February 2014,

More information

Selection of Best Web Site by Applying COPRAS-G method Bindu Madhuri.Ch #1, Anand Chandulal.J #2, Padmaja.M #3

Selection of Best Web Site by Applying COPRAS-G method Bindu Madhuri.Ch #1, Anand Chandulal.J #2, Padmaja.M #3 Selection of Best Web Site by Applying COPRAS-G method Bindu Madhuri.Ch #1, Anand Chandulal.J #2, Padmaja.M #3 Department of Computer Science & Engineering, Gitam University, INDIA 1. binducheekati@gmail.com,

More information

A Review on Identifying the Main Content From Web Pages

A Review on Identifying the Main Content From Web Pages A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

Focused crawling: a new approach to topic-specific Web resource discovery. Authors Focused crawling: a new approach to topic-specific Web resource discovery Authors Soumen Chakrabarti Martin van den Berg Byron Dom Presented By: Mohamed Ali Soliman m2ali@cs.uwaterloo.ca Outline Why Focused

More information

Context Based Indexing in Search Engines: A Review

Context Based Indexing in Search Engines: A Review International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ Context Based Indexing in Search Engines: A Review Suraksha

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Smartcrawler: A Two-stage Crawler Novel Approach for Web Crawling

Smartcrawler: A Two-stage Crawler Novel Approach for Web Crawling Smartcrawler: A Two-stage Crawler Novel Approach for Web Crawling Harsha Tiwary, Prof. Nita Dimble Dept. of Computer Engineering, Flora Institute of Technology Pune, India ABSTRACT: On the web, the non-indexed

More information

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh

More information

An Approach To Web Content Mining

An Approach To Web Content Mining An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research

More information

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until

More information

Competitive Intelligence and Web Mining:

Competitive Intelligence and Web Mining: Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction

More information

A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations

A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.1, January 2013 1 A Web Page Segmentation Method by using Headlines to Web Contents as Separators and its Evaluations Hiroyuki

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

Automatically Constructing a Directory of Molecular Biology Databases

Automatically Constructing a Directory of Molecular Biology Databases Automatically Constructing a Directory of Molecular Biology Databases Luciano Barbosa Sumit Tandon Juliana Freire School of Computing University of Utah {lbarbosa, sumitt, juliana}@cs.utah.edu Online Databases

More information

Simulation Study of Language Specific Web Crawling

Simulation Study of Language Specific Web Crawling DEWS25 4B-o1 Simulation Study of Language Specific Web Crawling Kulwadee SOMBOONVIWAT Takayuki TAMURA, and Masaru KITSUREGAWA Institute of Industrial Science, The University of Tokyo Information Technology

More information

Correlation Based Feature Selection with Irrelevant Feature Removal

Correlation Based Feature Selection with Irrelevant Feature Removal Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Ontology-Based Web Query Classification for Research Paper Searching

Ontology-Based Web Query Classification for Research Paper Searching Ontology-Based Web Query Classification for Research Paper Searching MyoMyo ThanNaing University of Technology(Yatanarpon Cyber City) Mandalay,Myanmar Abstract- In web search engines, the retrieval of

More information

Smart Crawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces

Smart Crawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces Smart Crawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces Rahul Shinde 1, Snehal Virkar 1, Shradha Kaphare 1, Prof. D. N. Wavhal 2 B. E Student, Department of Computer Engineering,

More information

Introduction to Text Mining. Hongning Wang

Introduction to Text Mining. Hongning Wang Introduction to Text Mining Hongning Wang CS@UVa Who Am I? Hongning Wang Assistant professor in CS@UVa since August 2014 Research areas Information retrieval Data mining Machine learning CS@UVa CS6501:

More information

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System ijade Reporter An Intelligent Multi-agent Based Context Aware Reporting System Eddie C.L. Chan and Raymond S.T. Lee The Department of Computing, The Hong Kong Polytechnic University, Hung Hong, Kowloon,

More information

Taccumulation of the social network data has raised

Taccumulation of the social network data has raised International Journal of Advanced Research in Social Sciences, Environmental Studies & Technology Hard Print: 2536-6505 Online: 2536-6513 September, 2016 Vol. 2, No. 1 Review Social Network Analysis and

More information

Fault Identification from Web Log Files by Pattern Discovery

Fault Identification from Web Log Files by Pattern Discovery ABSTRACT International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 2 ISSN : 2456-3307 Fault Identification from Web Log Files

More information

An Efficient Methodology for Image Rich Information Retrieval

An Efficient Methodology for Image Rich Information Retrieval An Efficient Methodology for Image Rich Information Retrieval 56 Ashwini Jaid, 2 Komal Savant, 3 Sonali Varma, 4 Pushpa Jat, 5 Prof. Sushama Shinde,2,3,4 Computer Department, Siddhant College of Engineering,

More information

Implementation of Enhanced Web Crawler for Deep-Web Interfaces

Implementation of Enhanced Web Crawler for Deep-Web Interfaces Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,

More information

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Mukesh Kumar and Renu Vig University Institute of Engineering and Technology, Panjab University, Chandigarh,

More information

Oleksandr Kuzomin, Bohdan Tkachenko

Oleksandr Kuzomin, Bohdan Tkachenko International Journal "Information Technologies Knowledge" Volume 9, Number 2, 2015 131 INTELLECTUAL SEARCH ENGINE OF ADEQUATE INFORMATION IN INTERNET FOR CREATING DATABASES AND KNOWLEDGE BASES Oleksandr

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Unsupervised Clustering of Web Sessions to Detect Malicious and Non-malicious Website Users

Unsupervised Clustering of Web Sessions to Detect Malicious and Non-malicious Website Users Unsupervised Clustering of Web Sessions to Detect Malicious and Non-malicious Website Users ANT 2011 Dusan Stevanovic York University, Toronto, Canada September 19 th, 2011 Outline Denial-of-Service and

More information



More information

Efficient Crawling Through Dynamic Priority of Web Page in Sitemap

Efficient Crawling Through Dynamic Priority of Web Page in Sitemap Efficient Through Dynamic Priority of Web Page in Sitemap Rahul kumar and Anurag Jain Department of CSE Radharaman Institute of Technology and Science, Bhopal, M.P, India ABSTRACT A web crawler or automatic

More information

I. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80].

I. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80]. Focus: Accustom To Crawl Web-Based Forums M.Nikhil 1, Mrs. A.Phani Sheetal 2 1 Student, Department of Computer Science, GITAM University, Hyderabad. 2 Assistant Professor, Department of Computer Science,

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Title: Artificial Intelligence: an illustration of one approach.

Title: Artificial Intelligence: an illustration of one approach. Name : Salleh Ahshim Student ID: Title: Artificial Intelligence: an illustration of one approach. Introduction This essay will examine how different Web Crawling algorithms and heuristics that are being

More information


AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH Sai Tejaswi Dasari #1 and G K Kishore Babu *2 # Student,Cse, CIET, Lam,Guntur, India * Assistant Professort,Cse, CIET, Lam,Guntur, India Abstract-

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

Crawling the Hidden Web Resources: A Review

Crawling the Hidden Web Resources: A Review Rosy Madaan 1, Ashutosh Dixit 2 and A.K. Sharma 2 Abstract An ever-increasing amount of information on the Web today is available only through search interfaces. The users have to type in a set of keywords

More information

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bonfring International Journal of Data Mining, Vol. 7, No. 2, May 2017 11 News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bamber and Micah Jason Abstract---

More information

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information



More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information



More information

Retrieval of Web Documents Using a Fuzzy Hierarchical Clustering

Retrieval of Web Documents Using a Fuzzy Hierarchical Clustering International Journal of Computer Applications (97 8887) Volume No., August 2 Retrieval of Documents Using a Fuzzy Hierarchical Clustering Deepti Gupta Lecturer School of Computer Science and Information

More information

Enhancing Cluster Quality by Using User Browsing Time

Enhancing Cluster Quality by Using User Browsing Time Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

analyzing the HTML source code of Web pages. However, HTML itself is still evolving (from version 2.0 to the current version 4.01, and version 5.

analyzing the HTML source code of Web pages. However, HTML itself is still evolving (from version 2.0 to the current version 4.01, and version 5. Automatic Wrapper Generation for Search Engines Based on Visual Representation G.V.Subba Rao, K.Ramesh Department of CS, KIET, Kakinada,JNTUK,A.P Assistant Professor, KIET, JNTUK, A.P, India. gvsr888@gmail.com

More information

Web Mining Evolution & Comparative Study with Data Mining

Web Mining Evolution & Comparative Study with Data Mining Web Mining Evolution & Comparative Study with Data Mining Anu, Assistant Professor (Resource Person) University Institute of Engineering and Technology Mahrishi Dayanand University Rohtak-124001, India

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Query Disambiguation from Web Search Logs

Query Disambiguation from Web Search Logs Vol.133 (Information Technology and Computer Science 2016), pp.90-94 http://dx.doi.org/10.14257/astl.2016. Query Disambiguation from Web Search Logs Christian Højgaard 1, Joachim Sejr 2, and Yun-Gyung

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments?

How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments? How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments? A. Hossein Farajpahlou Professor, Dept. Lib. and Info. Sci., Shahid Chamran

More information

Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms.

Index Terms:- Document classification, document clustering, similarity measure, accuracy, classifiers, clustering algorithms. International Journal of Scientific & Engineering Research, Volume 5, Issue 10, October-2014 559 DCCR: Document Clustering by Conceptual Relevance as a Factor of Unsupervised Learning Annaluri Sreenivasa

More information

Highly Efficient Architecture for Scalable Focused Crawling Using Incremental Parallel Web Crawler

Highly Efficient Architecture for Scalable Focused Crawling Using Incremental Parallel Web Crawler Journal of Computer Science Original Research Paper Highly Efficient Architecture for Scalable Focused Crawling Using Incremental Parallel Web Crawler 1 P. Jaganathan and 2 T. Karthikeyan 1 Department

More information

Improving Relevance Prediction for Focused Web Crawlers

Improving Relevance Prediction for Focused Web Crawlers 2012 IEEE/ACIS 11th International Conference on Computer and Information Science Improving Relevance Prediction for Focused Web Crawlers Mejdl S. Safran 1,2, Abdullah Althagafi 1 and Dunren Che 1 Department

More information

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts.

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts. Volume 5, Issue 3, March 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Advanced Preferred

More information

SK International Journal of Multidisciplinary Research Hub Research Article / Survey Paper / Case Study Published By: SK Publisher

SK International Journal of Multidisciplinary Research Hub Research Article / Survey Paper / Case Study Published By: SK Publisher ISSN: 2394 3122 (Online) Volume 2, Issue 1, January 2015 Research Article / Survey Paper / Case Study Published By: SK Publisher P. Elamathi 1 M.Phil. Full Time Research Scholar Vivekanandha College of

More information

Image Similarity Measurements Using Hmok- Simrank

Image Similarity Measurements Using Hmok- Simrank Image Similarity Measurements Using Hmok- Simrank A.Vijay Department of computer science and Engineering Selvam College of Technology, Namakkal, Tamilnadu,india. k.jayarajan M.E (Ph.D) Assistant Professor,

More information
