A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information


George E. Tsekouras*, Damianos Gavalas, Stefanos Filios, Antonios D. Niros, and George Bafaloukas
Department of Cultural Technology and Communication, University of the Aegean, 81100, Mytilene, Lesvos, Greece

Abstract. We present a novel focused crawling method for extracting and processing cultural data from the web in a fully automated fashion. After downloading the pages, we extract from each document a number of words for each thematic cultural area. We then create multidimensional document vectors comprising the most frequent word occurrences. The dissimilarity between these vectors is measured by the Hamming distance. In the last stage, we employ cluster analysis to partition the document vectors into a number of clusters. Finally, our approach is illustrated via a proof-of-concept application which scrutinizes hundreds of web pages spanning different cultural thematic areas.

Keywords: web crawling, HTML parser, document vector, cluster analysis, Hamming distance, similarity measure, filtering.

1 Introduction

Web crawlers typically perform a simulated browsing of the web by extracting links from pages, downloading all of them and repeating the process ad infinitum. This process requires enormous amounts of hardware and network resources, ending up with a large fraction of the visible web on the crawler's storage array. When information about a predefined topic is desired, though, a specialization of the aforementioned process called focused crawling is used [1]. When searching for further relevant web pages, the focused crawler starts from the given pages and recursively explores the linked web pages [1, 2]. While simple crawlers perform a breadth-first search of the whole web, focused crawlers explore only a small portion of the web using a best-first search guided by the user's interest and based on similarity estimations [1, 2].
To maintain a fast information retrieval process, a focused web crawler has to classify web documents according to the characteristics they share. One of the most efficient approaches to address this issue is to use cluster analysis. This article introduces a novel clustering-based focused crawler that involves: (a) creation of a multidimensional document vector comprising the most frequent word occurrences; (b) calculation of the Hamming distances between the culture-related documents; (c) partitioning of the documents into a number of clusters.

* Corresponding author.
J. Darzentas et al. (Eds.): SETN 2008, LNAI 5138, Springer-Verlag Berlin Heidelberg 2008

The remainder of the article is organized as follows: Section 2 describes the structure of the focused web crawler. Section 3 discusses the experimental results and Section 4 presents the conclusions of the present work.

2 The Proposed Method

Prior to clustering the web documents, our algorithm performs document retrieval (crawling) and parsing, followed by the calculation of the document vectors and of their distance matrix. The high-level process of crawling, parsing, filtering and clustering of the downloaded web pages is illustrated in Figure 1. The algorithm is described in detail in the following subsections.

[Figure 1: flowchart with the components Internet, Web crawler, HTML source documents, HTML parser, XML content, General dictionary, Filtering noise words, Cultural dictionary, Counting cultural terms in segmented word XML, Is it a cultural document?, Create document vector, Calculate Hamming distances among documents, Offline documents clustering in thematic cultural clusters]

Fig. 1. The high-level process of the proposed focused web-crawler

2.1 Crawling Procedure

For the retrieval of web pages we utilize a simple recursive procedure which performs a breadth-first search through the links in web pages across the Internet. The application downloads the first document, retrieves the web links included within the page and then recursively downloads the pages these links point to, until the requested number of documents has been downloaded. The respective pseudo-code implementation is given in Figure 2.
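As an illustration only, the breadth-first procedure described above could be sketched in Python roughly as follows. The names `crawl`, `LinkExtractor`, `fetch` and `max_pages` are hypothetical and do not come from the paper; `fetch` is an assumed caller-supplied function that returns the HTML text of a URL (or `None` on failure), so DNS lookup and downloading are left outside the sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of the <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages):
    """Breadth-first crawl starting from `seeds`.

    `fetch(url)` is an assumed helper returning HTML text or None.
    Stops once `max_pages` documents have been downloaded.
    """
    todo = deque(seeds)   # plays the role of UrlsTodo
    done = set()          # plays the role of UrlsDone
    pages = {}            # url -> downloaded HTML
    while todo and len(pages) < max_pages:
        url = todo.popleft()
        if url in done:
            continue
        done.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            new_url = urljoin(url, href)  # resolve relative links
            if new_url not in done:
                todo.append(new_url)
    return pages
```

Passing the download step in as a function keeps the traversal logic testable against an in-memory set of pages; a real crawler would additionally need politeness delays, robots.txt handling and error recovery, none of which are shown here.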

Initialize:
    UrlsDone = ∅
    UrlsTodo = { firstsite.seed.htm, secondsite.seed.htm, ... }
Repeat:
    url = UrlsTodo.getNext()
    ip = DNSlookup( url.gethostname() )
    html = DownloadPage( ip, url.getpath() )
    UrlsDone.insert( url )
    newurls = parseforlinks( html )
    For each newurl
        If not UrlsDone.contains( newurl ) then
            UrlsTodo.insert( newurl )

Fig. 2. The web crawling algorithm

2.2 Parsing of HTML Documents

The parser retains the title and the clear-text content of the document, together with the URL addresses the document's links point to. The title and the document's body content are then translated to XML format. Noise words such as articles, as well as punctuation marks, are filtered out. Documents that do not include sufficient cultural content are deleted. The documents referenced by the extracted URL addresses are retrieved and parsed in the algorithm's next execution round, unless they have already been added to the UrlsDone list.

2.3 Calculation of the Document Vectors

For each parsed document we calculate the respective document vector, denoted as DV. The DV records the descriptive and most useful words included within the document and their frequencies of appearance. For instance, if a document $D_i$ includes the words a, b, c and d with frequencies 3, 2, 8 and 6 respectively, then its document vector will be $DV_i = [3a, 2b, 8c, 6d]$. The dimension of $DV_i$ equals the number of words it includes (for the previous example, $|DV_i| = 4$) and varies from document to document. Next, we reorder each $DV_i$ in descending order of word frequency, so the vector of the previous example becomes $DV_i = [8c, 6d, 3a, 2b]$. Finally, we filter each $DV_i$, retaining only a specific number T of words, so that all $DV_i$ are of equal dimension. Thus, for T = 2, the vector of the previous example becomes $DV_i = [8c, 6d]$. The filtering excludes some information, since we no longer know which low-frequency words are included in each document.
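The document-vector construction described above (count, sort by descending frequency, keep the top T words) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `document_vector` and the `noise_words` parameter are hypothetical, and the input is assumed to be an already tokenized list of words.

```python
from collections import Counter


def document_vector(words, T, noise_words=frozenset()):
    """Return the T most frequent words of a document as (word, frequency)
    pairs in descending order of frequency.

    `noise_words` is an assumed stop-word set that is dropped before counting.
    """
    counts = Counter(w for w in words if w not in noise_words)
    # most_common sorts by descending count and truncates to T entries
    return counts.most_common(T)


# The example from the text: a, b, c, d with frequencies 3, 2, 8, 6
words = ["a"] * 3 + ["b"] * 2 + ["c"] * 8 + ["d"] * 6
print(document_vector(words, T=2))
```

With T = 2 the call returns the pairs for c and d, matching the filtered vector [8c, 6d] of the example.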
If no filtering is applied, the worst-case scenario for a dictionary of W words is a W-dimensional $DV_i$, where each word of the dictionary appears only once in a document.

2.4 Calculation of the Document Vectors Distance Matrix

We calculate the distance matrix DM of the parsed web documents. Each scalar element $DM_{i,j}$ of this matrix equals the distance between the document vectors $DV_i$ and $DV_j$,

$$DM_{i,j} = d(DV_i, DV_j)$$

which represents the dissimilarity of the words included within the corresponding documents, i.e. the more the words (and their frequencies) of documents $D_i$ and $D_j$ differ, the higher the value of $DM_{i,j}$. The distance matrix values $DM_{i,j}$ are calculated with the Hamming distance method: if a word w with frequency f appears in $D_i$ and not in $D_j$, or vice versa, $DM_{i,j}$ increases by f. Otherwise, w appears in $D_i$ with frequency $f_i$ and in $D_j$ with frequency $f_j$, and $DM_{i,j}$ increases by the absolute value of the difference between the two frequencies, $|f_i - f_j|$. The distance matrix provides the input of our clustering algorithm presented in the following section.

2.5 Clustering Process

Let $X = \{DV_1, DV_2, \ldots, DV_N\}$ be a set of N document vectors of equal dimension. The potential of the i-th document vector is defined as follows,

$$Z_i = \sum_{j=1}^{N} S_{ji}, \quad 1 \le i \le N \qquad (1)$$

where $S_{ji}$ is the similarity measure between $DV_j$ and $DV_i$, given as follows,

$$S_{ji} = \exp\{-\alpha\, d(DV_j, DV_i)\}, \quad \alpha \in (0,1) \qquad (2)$$

A document vector that exhibits a high potential value is surrounded by many neighboring document vectors. Therefore, a document vector with a high potential value is a good candidate to be a cluster center. Based on this remark, the potential-based clustering algorithm is described as follows.

Step 1). Select values for the design parameters $\alpha \in (0,1)$ and $\beta \in (0,1)$. Initially, set the number of clusters equal to n = 0.
Step 2). Using eqs (1) and (2) and the distance matrix, calculate the potential values for all document vectors $DV_i$ ($1 \le i \le N$).
Step 3). Set n = n + 1.
Step 4). Calculate the maximum potential value:

$$Z_{\max} = \max_{1 \le i \le N} \{ Z_i \} \qquad (3)$$

Select the document vector $DV_{i_{\max}}$ that corresponds to $Z_{\max}$ as the center element of the n-th cluster: $C_n = DV_{i_{\max}}$.
Step 5). Remove from the set X all the document vectors having similarity (given in eq. (2)) with $DV_{i_{\max}}$ greater than $\beta$ and assign them to the n-th cluster.
Step 6). If X is empty, stop. Otherwise, return to step 2.

The implementation of the above algorithm requires a priori knowledge of the values for $\alpha$ and $\beta$. To simplify the approach, we set $\alpha = 0.5$ and we calculate the optimal value of $\beta$ by using the following cluster validity index:

$$V = \frac{\mathrm{COMP}}{\mathrm{SEP}} \qquad (7)$$
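To make Sections 2.4 and 2.5 concrete, the following Python sketch computes the distance matrix and runs Steps 1 to 6 on it. It is an illustration under stated assumptions rather than the authors' code: document vectors are represented as word-to-frequency dictionaries, the function names are hypothetical, and in Step 2 the potentials are assumed to be recomputed over the vectors still remaining in X. Note that the distance as described coincides with the L1 (Manhattan) distance between word-frequency vectors, with a missing word counted as frequency 0.

```python
import math


def hamming_distance(dv_a, dv_b):
    """Distance of Section 2.4: a word present in only one vector adds its
    frequency f; a shared word adds |f_i - f_j|."""
    words = set(dv_a) | set(dv_b)
    return sum(abs(dv_a.get(w, 0) - dv_b.get(w, 0)) for w in words)


def distance_matrix(vectors):
    """Full N x N matrix DM with DM[i][j] = d(DV_i, DV_j)."""
    n = len(vectors)
    return [[hamming_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


def potential_clustering(dm, alpha=0.5, beta=0.5):
    """Potential-based clustering (Steps 1-6) on a distance matrix.

    Returns a list of (center_index, member_indices) pairs.
    Assumption: potentials are recomputed over the remaining vectors only.
    """
    remaining = set(range(len(dm)))
    clusters = []
    while remaining:                                  # Step 6 loop
        # Step 2: potential Z_i = sum_j exp(-alpha * d(DV_j, DV_i))
        potential = {
            i: sum(math.exp(-alpha * dm[j][i]) for j in remaining)
            for i in remaining
        }
        # Step 4: the vector with maximum potential becomes the center
        center = max(sorted(remaining), key=potential.get)
        # Step 5: assign every vector whose similarity to the center exceeds beta
        members = sorted(
            i for i in remaining
            if math.exp(-alpha * dm[center][i]) > beta
        )
        clusters.append((center, members))
        remaining -= set(members)
    return clusters
```

Because the center always has similarity 1 with itself and beta is below 1, every pass removes at least one vector, so the loop terminates. Selecting beta through the validity index V of eq. (7) would amount to repeating `potential_clustering` for several candidate values and keeping the one with the smallest V; that search is not shown here.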

In eq. (7), COMP is the cluster compactness measure and SEP the cluster separation measure, which are respectively given as

$$\mathrm{COMP} = \sum_{k=1}^{n} Z_{C_k} \qquad (4)$$

and

$$\mathrm{SEP} = g\left( \min_{i,j} \{ d(C_i, C_j) \} \right) \qquad (5)$$

where in (4) $Z_{C_k}$ is the potential value of the k-th cluster center and in (5) the function g is a strictly increasing function. Here, we choose the following form,

$$g(x) = x^{q}, \quad q \in (1, \infty) \qquad (6)$$

In the above equation, the parameter q is used to normalize the separation measure in order to cancel undesired effects related to the number of document vectors and the number of clusters. The main objective is then to select the value of the parameter $\beta$ that minimizes the validity index V.

3 Experimental Evaluation

Target topics were defined and the page samples were obtained through metasearching the Yahoo search engine. We chose 15 categories related to cultural information, which are depicted in Table 1. For each category, we downloaded 1000 web pages to train the algorithm. After generating the dictionary, for each category we selected the 200 most frequently reported words, using the inverse document frequency (IDF) of each word [3]. In the next step, we defined the dimension of each document vector as T = 30. Note that the feature space dimension can be kept larger, but in this case the computational cost will increase. To test the method, we downloaded another 1000 pages for each category and we utilized the well-known Harvest Ratio [4, 5] to compare the method with other algorithms. The results are presented in Table 1: the proposed method achieves the highest Harvest Ratio in 11 of the 15 cases, with the First-Order Crawler [5] scoring higher in the remaining four.

Table 1. Comparative classification results with respect to the Harvest Ratio

Category: Cultural conservation, Cultural heritage, Painting, Sculpture, Dancing, Cinematograph, Architecture, Museum, Archaeology, Folklore, Music, Theatre, Cultural Events, Audiovisual Arts, Graphics Design, Art History
Best-First Search: 48.7%, 52.1%, 70.6%, 52.0%, 66.8%, 67.4%, 55.4%, 59.7%, 60.8%, 65.2%, 71.5%, 58.8%, 63.3%, 68.8%, 48.7%
Accelerated Focused Crawler [4]: 65.3%, 72.4%, 72.5%, 67.2%, 73.8%, 84.7%, 50.5%, 59.8%, 64.9%, 85.4%, 87.2%, 74.0%, 68.4%, 69.0%, 59.6%
First-Order Crawler [5]: 68.9%, 73.7%, 76.4%, 71.6%, 88.8%, 90.2%, 78.3%, 80.0%, 82.5%, 93.0%, 91.5%, 90.6%, 82.8%, 78.6%, 60.9%
Proposed Method: 69.6%, 77.2%, 77.1%, 72.1%, 90.6%, 85.4%, 72.1%, 84.9%, 84.0%, 93.4%, 92.0%, 87.1%, 84.7%, 82.9%, 59.0%

4 Conclusions

We have shown how cluster analysis can be efficiently incorporated into a focused web crawler. The basic idea of the approach is to create multidimensional document vectors, each of which corresponds to a specific web page. The dissimilarity between two distinct document vectors is measured using the Hamming distance. Then, we use a clustering algorithm to classify the set of all document vectors into a number of clusters, where the respective cluster centers are objects from the original data set that satisfy specific conditions. Unknown web pages are classified by assigning each page to the cluster center at minimum Hamming distance. Several experimental simulations were carried out, which verified the efficiency of the proposed method.

References

[1] Huang, Y., Ye, Y.-M.: whunter: A Focused Web Crawler - A Tool for Digital Library. In: Chen, Z., Chen, H., Miao, Q., Fu, Y., Fox, E., Lim, E.-p. (eds.) ICADL. LNCS, vol. 3334. Springer, Heidelberg (2004)
[2] Zhu, Q.: An algorithm for the focused web crawler. In: The Proceedings of the 6th International Conference on Machine Learning and Cybernetics, Hong Kong (2007)
[3] Tsekouras, G.E., Anagnostopoulos, C.N., Gavalas, D., Economou, D.: Classification of Web Documents using Fuzzy Logic Categorical Data Clustering. In: Boukis, C., Pnevmatikakis, A., Polymenakos, L. (eds.) Artificial Intelligence and Innovations: From Theory to Applications. Springer, Heidelberg (2007)
[4] Chakrabarti, S., Punera, K., Subramanyam, M.: Accelerated focused crawling through online relevance feedback. In: WWW (2002)
[5] Xu, Q., Zuo, W.: First-order Focused Crawling. In: The Proceedings of the International Conference on WWW 2007, Banff, Alberta, Canada (2007)


More information

Focused Web Crawling Using Neural Network, Decision Tree Induction and Naïve Bayes Classifier

Focused Web Crawling Using Neural Network, Decision Tree Induction and Naïve Bayes Classifier IJCST Vo l. 5, Is s u e 3, Ju l y - Se p t 2014 ISSN : 0976-8491 (Online) ISSN : 2229-4333 (Print) Focused Web Crawling Using Neural Network, Decision Tree Induction and Naïve Bayes Classifier 1 Prabhjit

More information

Detecting Clusters and Outliers for Multidimensional

Detecting Clusters and Outliers for Multidimensional Kennesaw State University DigitalCommons@Kennesaw State University Faculty Publications 2008 Detecting Clusters and Outliers for Multidimensional Data Yong Shi Kennesaw State University, yshi5@kennesaw.edu

More information

Improving Relevance Prediction for Focused Web Crawlers

Improving Relevance Prediction for Focused Web Crawlers 2012 IEEE/ACIS 11th International Conference on Computer and Information Science Improving Relevance Prediction for Focused Web Crawlers Mejdl S. Safran 1,2, Abdullah Althagafi 1 and Dunren Che 1 Department

More information

Automated Online News Classification with Personalization

Automated Online News Classification with Personalization Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798

More information

Cluster Validation. Ke Chen. Reading: [25.1.2, KPM], [Wang et al., 2009], [Yang & Chen, 2011] COMP24111 Machine Learning

Cluster Validation. Ke Chen. Reading: [25.1.2, KPM], [Wang et al., 2009], [Yang & Chen, 2011] COMP24111 Machine Learning Cluster Validation Ke Chen Reading: [5.., KPM], [Wang et al., 9], [Yang & Chen, ] COMP4 Machine Learning Outline Motivation and Background Internal index Motivation and general ideas Variance-based internal

More information

ABSTRACT: INTRODUCTION: WEB CRAWLER OVERVIEW: METHOD 1: WEB CRAWLER IN SAS DATA STEP CODE. Paper CC-17

ABSTRACT: INTRODUCTION: WEB CRAWLER OVERVIEW: METHOD 1: WEB CRAWLER IN SAS DATA STEP CODE. Paper CC-17 Paper CC-17 Your Friendly Neighborhood Web Crawler: A Guide to Crawling the Web with SAS Jake Bartlett, Alicia Bieringer, and James Cox PhD, SAS Institute Inc., Cary, NC ABSTRACT: The World Wide Web has

More information

A Bagging Method using Decision Trees in the Role of Base Classifiers

A Bagging Method using Decision Trees in the Role of Base Classifiers A Bagging Method using Decision Trees in the Role of Base Classifiers Kristína Machová 1, František Barčák 2, Peter Bednár 3 1 Department of Cybernetics and Artificial Intelligence, Technical University,

More information

Maximizing edge-ratio is NP-complete

Maximizing edge-ratio is NP-complete Maximizing edge-ratio is NP-complete Steven D Noble, Pierre Hansen and Nenad Mladenović February 7, 01 Abstract Given a graph G and a bipartition of its vertices, the edge-ratio is the minimum for both

More information

A New Type of ART2 Architecture and Application to Color Image Segmentation

A New Type of ART2 Architecture and Application to Color Image Segmentation A New Type of ART2 Architecture and Application to Color Image Segmentation Jiaoyan Ai 1,BrianFunt 2, and Lilong Shi 2 1 Guangxi University, China shinin@vip.163.com 2 Simon Fraser University, Canada Abstract.

More information

A genetic algorithm based focused Web crawler for automatic webpage classification

A genetic algorithm based focused Web crawler for automatic webpage classification A genetic algorithm based focused Web crawler for automatic webpage classification Nancy Goyal, Rajesh Bhatia, Manish Kumar Computer Science and Engineering, PEC University of Technology, Chandigarh, India

More information

A new predictive image compression scheme using histogram analysis and pattern matching

A new predictive image compression scheme using histogram analysis and pattern matching University of Wollongong Research Online University of Wollongong in Dubai - Papers University of Wollongong in Dubai 00 A new predictive image compression scheme using histogram analysis and pattern matching

More information

CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task

CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task Xiaolong Wang, Xiangying Jiang, Abhishek Kolagunda, Hagit Shatkay and Chandra Kambhamettu Department of Computer and Information

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

BayesTH-MCRDR Algorithm for Automatic Classification of Web Document

BayesTH-MCRDR Algorithm for Automatic Classification of Web Document BayesTH-MCRDR Algorithm for Automatic Classification of Web Document Woo-Chul Cho and Debbie Richards Department of Computing, Macquarie University, Sydney, NSW 2109, Australia {wccho, richards}@ics.mq.edu.au

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm ISBN 978-93-84468-0-0 Proceedings of 015 International Conference on Future Computational Technologies (ICFCT'015 Singapore, March 9-30, 015, pp. 197-03 Sense-based Information Retrieval System by using

More information

Using Association Rules for Better Treatment of Missing Values

Using Association Rules for Better Treatment of Missing Values Using Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine Intelligence Group) National University

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications

New Optimal Load Allocation for Scheduling Divisible Data Grid Applications New Optimal Load Allocation for Scheduling Divisible Data Grid Applications M. Othman, M. Abdullah, H. Ibrahim, and S. Subramaniam Department of Communication Technology and Network, University Putra Malaysia,

More information

Web-Page Indexing Based on the Prioritized Ontology Terms

Web-Page Indexing Based on the Prioritized Ontology Terms Web-Page Indexing Based on the Prioritized Ontology Terms Sukanta Sinha 1,2, Rana Dattagupta 2, and Debajyoti Mukhopadhyay 1,3 1 WIDiCoReL Research Lab, Green Tower, C-9/1, Golf Green, Kolkata 700095,

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Classification with Diffuse or Incomplete Information

Classification with Diffuse or Incomplete Information Classification with Diffuse or Incomplete Information AMAURY CABALLERO, KANG YEN Florida International University Abstract. In many different fields like finance, business, pattern recognition, communication

More information

RPKM: The Rough Possibilistic K-Modes

RPKM: The Rough Possibilistic K-Modes RPKM: The Rough Possibilistic K-Modes Asma Ammar 1, Zied Elouedi 1, and Pawan Lingras 2 1 LARODEC, Institut Supérieur de Gestion de Tunis, Université de Tunis 41 Avenue de la Liberté, 2000 Le Bardo, Tunisie

More information

Nearest Cluster Classifier

Nearest Cluster Classifier Nearest Cluster Classifier Hamid Parvin, Moslem Mohamadi, Sajad Parvin, Zahra Rezaei, and Behrouz Minaei Nourabad Mamasani Branch, Islamic Azad University, Nourabad Mamasani, Iran hamidparvin@mamasaniiau.ac.ir,

More information

International Journal of Advanced Engineering Technology

International Journal of Advanced Engineering Technology Research Article TYPE OF FREEDOM OF SIMPLE JOINTED PLANAR KINEMATIC CHAINS USING C++ Mr. Chaudhari Ashok R *. Address for correspondence Lecturer, Mechanical Engg. Dept, U. V. Patel college of Engineering,

More information

Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge

Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge Haiqin Yang and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,

More information

A fast parallel algorithm for frequent itemsets mining

A fast parallel algorithm for frequent itemsets mining A fast parallel algorithm for frequent itemsets mining Dora Souliou, Aris Pagourtzis, and Panayiotis Tsanakas "*" School of Electrical and Computer Engineering National Technical University of Athens Heroon

More information

Accelerating XML Structural Matching Using Suffix Bitmaps

Accelerating XML Structural Matching Using Suffix Bitmaps Accelerating XML Structural Matching Using Suffix Bitmaps Feng Shao, Gang Chen, and Jinxiang Dong Dept. of Computer Science, Zhejiang University, Hangzhou, P.R. China microf_shao@msn.com, cg@zju.edu.cn,

More information

Artificial Mosaics with Irregular Tiles BasedonGradientVectorFlow

Artificial Mosaics with Irregular Tiles BasedonGradientVectorFlow Artificial Mosaics with Irregular Tiles BasedonGradientVectorFlow Sebastiano Battiato, Alfredo Milone, and Giovanni Puglisi University of Catania, Image Processing Laboratory {battiato,puglisi}@dmi.unict.it

More information

Fast trajectory matching using small binary images

Fast trajectory matching using small binary images Title Fast trajectory matching using small binary images Author(s) Zhuo, W; Schnieders, D; Wong, KKY Citation The 3rd International Conference on Multimedia Technology (ICMT 2013), Guangzhou, China, 29

More information

A Modified Fuzzy Min-Max Neural Network and Its Application to Fault Classification

A Modified Fuzzy Min-Max Neural Network and Its Application to Fault Classification A Modified Fuzzy Min-Max Neural Network and Its Application to Fault Classification Anas M. Quteishat and Chee Peng Lim School of Electrical & Electronic Engineering University of Science Malaysia Abstract

More information

Parallel Evaluation of Hopfield Neural Networks

Parallel Evaluation of Hopfield Neural Networks Parallel Evaluation of Hopfield Neural Networks Antoine Eiche, Daniel Chillet, Sebastien Pillement and Olivier Sentieys University of Rennes I / IRISA / INRIA 6 rue de Kerampont, BP 818 2232 LANNION,FRANCE

More information

Probability Distribution of Index Distances in Normal Index Array for Normal Vector Compression

Probability Distribution of Index Distances in Normal Index Array for Normal Vector Compression Probability Distribution of Index Distances in Normal Index Array for Normal Vector Compression Deok-Soo Kim 1, Youngsong Cho 1, Donguk Kim 1, and Hyun Kim 2 1 Department of Industrial Engineering, Hanyang

More information

Graph Matching: Fast Candidate Elimination Using Machine Learning Techniques

Graph Matching: Fast Candidate Elimination Using Machine Learning Techniques Graph Matching: Fast Candidate Elimination Using Machine Learning Techniques M. Lazarescu 1,2, H. Bunke 1, and S. Venkatesh 2 1 Computer Science Department, University of Bern, Switzerland 2 School of

More information

Using Natural Clusters Information to Build Fuzzy Indexing Structure

Using Natural Clusters Information to Build Fuzzy Indexing Structure Using Natural Clusters Information to Build Fuzzy Indexing Structure H.Y. Yue, I. King and K.S. Leung Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, New Territories,

More information

Invariant Generation in Vampire

Invariant Generation in Vampire Invariant Generation in Vampire Kryštof Hoder 1,LauraKovács 2, and Andrei Voronkov 1 1 University of Manchester 2 TU Vienna Abstract. This paper describes a loop invariant generator implemented in the

More information

A Framework for Hierarchical Clustering Based Indexing in Search Engines

A Framework for Hierarchical Clustering Based Indexing in Search Engines BIJIT - BVICAM s International Journal of Information Technology Bharati Vidyapeeth s Institute of Computer Applications and Management (BVICAM), New Delhi A Framework for Hierarchical Clustering Based

More information

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT A Patent Retrieval Method Using a Hierarchy of Clusters at TUT Hironori Doi Yohei Seki Masaki Aono Toyohashi University of Technology 1-1 Hibarigaoka, Tenpaku-cho, Toyohashi-shi, Aichi 441-8580, Japan

More information

CLUSTERING ALGORITHMS

CLUSTERING ALGORITHMS CLUSTERING ALGORITHMS Number of possible clusterings Let X={x 1,x 2,,x N }. Question: In how many ways the N points can be Answer: Examples: assigned into m groups? S( N, m) 1 m! m i 0 ( 1) m 1 m i i N

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Efficient Object Extraction Using Fuzzy Cardinality Based Thresholding and Hopfield Network

Efficient Object Extraction Using Fuzzy Cardinality Based Thresholding and Hopfield Network Efficient Object Extraction Using Fuzzy Cardinality Based Thresholding and Hopfield Network S. Bhattacharyya U. Maulik S. Bandyopadhyay Dept. of Information Technology Dept. of Comp. Sc. and Tech. Machine

More information

PersoNews: A Personalized News Reader Enhanced by Machine Learning and Semantic Filtering

PersoNews: A Personalized News Reader Enhanced by Machine Learning and Semantic Filtering PersoNews: A Personalized News Reader Enhanced by Machine Learning and Semantic Filtering E. Banos, I. Katakis, N. Bassiliades, G. Tsoumakas, and I. Vlahavas Department of Informatics, Aristotle University

More information

Linking Entities in Chinese Queries to Knowledge Graph

Linking Entities in Chinese Queries to Knowledge Graph Linking Entities in Chinese Queries to Knowledge Graph Jun Li 1, Jinxian Pan 2, Chen Ye 1, Yong Huang 1, Danlu Wen 1, and Zhichun Wang 1(B) 1 Beijing Normal University, Beijing, China zcwang@bnu.edu.cn

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information