A genetic algorithm based focused Web crawler for automatic webpage classification


Nancy Goyal, Rajesh Bhatia, Manish Kumar
Computer Science and Engineering, PEC University of Technology, Chandigarh, India

Keywords: Genetic algorithm; webpage classification; feature extraction; focused Web crawler; Java.

Abstract

The rapid increase in the amount of information on the World Wide Web makes it difficult for a user to find information of interest. Search engines use focused Web crawlers to gather information about a particular topic. A focused Web crawler seeks, gathers and maintains webpages relevant to a pre-defined set of topics rather than downloading all webpages. During focused crawling, an automatic webpage classification method is used to determine whether a webpage is on-topic or not. This paper discusses a genetic algorithm based automatic webpage classification technique. In this method, tags and terms are considered as features, and the classifier learns from the webpages in the training set. The best features are selected by a genetic algorithm based fitness optimization technique. Using both tags and terms as features, high precision on test data is achieved.

1. Introduction

The growing popularity of the World Wide Web has led to a very large amount of information being placed on the Web. Processing information obtained from the Web requires large storage and is time consuming; it also puts load on the processor and degrades its performance. As a result, hardware and software resources are over-utilized without yielding much useful information. To overcome this problem, a focused Web crawler is used. A focused Web crawler crawls the Web to gather webpages based on a predefined set of topics [1]. The relevancy of a webpage to a topic can be determined manually or automatically. Manual classification is a process in which webpages are classified based on a pre-defined set of categories.
Various open source projects such as DMOZ.org maintain their directories manually and provide a way to find information from a pre-defined set of categories. As the Web grows, the manual approach becomes less effective. During the focused Web crawling process, an automatic webpage classification method is used to ascertain whether the webpage under consideration is on-topic or not [7]. In automatic webpage classification, a set of labelled documents is used to train a classifier, and the classifier is then employed to assign class labels to webpages.

The information present on the Web is so large that it contains a huge number of terms and categories for the classifier. A classifier trained on such data has to process a very high-dimensional feature space. Feature selection, the process of selecting a subset of features from the webpages, can be used to improve the efficiency, scalability and accuracy of the classifier [3].

Machine learning techniques used for classification problems include decision trees, Bayesian classifiers, support vector machines, K-nearest neighbor and genetic algorithms. K-nearest neighbor is the simplest approach, but it has a long classification time. Support vector machines and decision trees are suitable for classification problems in which the number of features is small [5]. Webpage classification problems are high dimensional: they may contain a large number of features, including combined features such as <tags, terms>. In this study, a genetic algorithm is used to select the best features from the large feature set.

A genetic algorithm is a fitness optimisation technique based on heredity and evolution. By imitating the natural selection process, a genetic algorithm tries to find an optimal solution by sampling the region of the search space that has a high probability of containing one. It is useful for various applications such as scheduling and ordering [4]. Here it is used for webpage classification.
In this study, a genetic algorithm is proposed that finds the best features for a webpage; the learned features can then be used to classify webpages.

This paper is organised as follows. In section 2, various webpage classifiers are discussed. The proposed genetic algorithm based webpage classification system is discussed in section 3. In section 4, the data set, the experimental results and a discussion of the results are presented.

2. Related work

A number of approaches to webpage classification are present in the literature. This section discusses approaches that have been studied for automatic webpage classification in focused Web crawlers.

Selamat and Omatu [9] proposed a webpage classification method that uses a neural network for classification. The method is based on principal component analysis and class profile-based features for feature selection. Principal component analysis helps to reduce the feature vector obtained from the document-term matrix. These reduced features, combined with manually selected words, act as the complete feature set fed to the neural network for classification.

Another webpage classification method is proposed by Qi and Davison [7]; it uses latent semantic analysis and a webpage feature selection process to extract semantic and text features. Two support vector machine classifiers are used so that the two classification results can be checked to vote on which category the webpage should be placed in.

Saraç and Özel [8] use ant colony optimization to select the best features from the feature set obtained from the webpages in the training dataset. The best feature group is selected based on the pheromone value, using TF-IDF, and C4.5, naïve Bayes and k-nearest neighbor classifiers are then used to assign class labels to webpages.

Qi and Sun [6] developed a webpage classifier based on a combination of a genetic algorithm and the K-means clustering algorithm. In this approach, a set of keywords is generated for each category from the training webpages. Each keyword is assigned an initial weight for each category. The genetic algorithm then optimizes the keyword weights for each category.

Özel [4] proposed a genetic algorithm based automatic webpage classification system in which both HTML tags and the terms belonging to each tag are used as classification features. The genetic algorithm selects the best feature set by learning an optimal classifier from the positive and negative webpages in the training dataset. Özel [5] also proposed a webpage classification system that uses a genetic algorithm and K-nearest neighbor to select the best features from the feature set, improving the run time performance of the classifier.
Webpages from the test set are then classified based on the top features selected by the genetic algorithm.

Ferdaus and Khan [2] use a genetic algorithm to generate classification rules based on predicting values and a result value. Rules are generated in the form of IF-THEN statements and fed into the genetic algorithm: the IF condition contains the predicting values and the THEN statement contains the result value. The best fitted classification rule is selected to assign the class label.

3. Proposed genetic algorithm based classification system

The proposed system consists of URL filtering, feature extraction, a genetic algorithm based classifier and classification, as shown in Fig. 1. In this study, the aim is to determine whether or not a webpage contains information about a faculty member of Indian origin working in a foreign university; that is, binary classification is used for the class labels. The process starts with URL filtering, in which a subset of URLs is selected based on keywords related to the faculty of a university. In feature extraction, certain tags and terms are used as features, and the best of them are selected using the genetic algorithm. In document formation, a 2-D array is created to represent the presence or absence of each feature in each document. The genetic algorithm based classifier learning part consists of: (i) coding, (ii) generation of the initial population, (iii) evaluation of the population, (iv) selection, (v) crossover, (vi) mutation and (vii) generation of the new population, where steps (iii) to (vii) are repeated until convergence to learn a (sub)optimal classifier. After the learning process, the learned classifier is used to classify unseen data.

3.1. URL filtering

The URLs in the training set are crawled up to a certain depth, and the crawled URLs are filtered based on a filtering list. The filtering list consists of keywords related to faculty, such as people, staff, directory-staff and all-people.
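As a minimal sketch of this filtering step, the following Python snippet keeps only the URLs that contain one of the filtering keywords. The function name `filter_urls`, the example URLs and the case-insensitive substring match are assumptions made for illustration; the paper does not state how the match is performed.

```python
from typing import Iterable, List, Sequence

# Keywords taken from the filtering list described in the text.
FILTER_KEYWORDS = ("people", "staff", "directory-staff", "all-people")


def filter_urls(urls: Iterable[str],
                keywords: Sequence[str] = FILTER_KEYWORDS) -> List[str]:
    """Keep URLs containing at least one keyword.

    Assumption: a case-insensitive substring match on the whole URL.
    """
    kept = []
    for url in urls:
        lowered = url.lower()
        if any(keyword in lowered for keyword in keywords):
            kept.append(url)
    return kept


if __name__ == "__main__":
    # Hypothetical crawled URLs, used only to illustrate the filter.
    crawled = [
        "http://university.example/cs/people/",
        "http://university.example/news/2015/",
        "http://university.example/ee/Directory-Staff.html",
    ]
    for url in filter_urls(crawled):
        print(url)
```

A substring match is deliberately permissive: it favours recall at the filtering stage and leaves the relevance decision to the classifier.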
Since the data is unknown, the presence of these words may lead to URLs containing faculty information.

3.2. Feature extraction

The tags <title>, <h1>, <h2>, <h3>, <h4>, <img>, <b>, <table>, <li>, <a>, <p> and <meta>, which denote title, header at levels 1 to 4, image, bold, table, list item, anchor, paragraph and meta information respectively, are used to extract the features needed in both the classifier learning and the classification process. After analysis and observation, lists of Indian surnames, cities, institutes, designations, departments and universities were chosen as the terms for these tags. The feature set consists of combinations of these tags and terms: each <tag-terms> pair forms one feature, for example <title-list of surnames>, <table-list of institutes> and <bold-list of departments>. The tags <h1>, <h2>, <h3> and <h4> are grouped together as a single header tag to reduce the number of features extracted.

3.3. Document formation

Filtered URLs and extracted features together form documents. Document formation creates a 2-D array with URLs in the rows and features in the columns; each entry is 1 or 0 depending on the presence or absence of the feature in the document, as shown in equation 1.

$$D(i,j) = \begin{cases} 1, & \text{if feature } j \text{ is present in the document of URL } i \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where D(i, j) is the document 2-D array, i denotes the i-th URL from the filtered list T and j denotes the j-th feature from the feature set F. Before creating this 2-D array, stop word removal and stemming using WordNet are performed on the terms fetched from the URLs in the filtered list.

3.4. Coding

A chromosome consists of a list of feature weights, which are real numbers in the range [0, 1], as represented in equation 2.

$$C = (W_{11}, W_{12}, \ldots, W_{ij}, \ldots) \tag{2}$$

where W_ij denotes the weight of term j in tag i. The title, header, image, paragraph, table, list, bold, meta and anchor tags are used in this order. In the proposed work, initial weights are assigned randomly and are updated during the genetic algorithm process.

[Fig. 1. Flow diagram for proposed model: training dataset, URL filtering, feature extraction, document formation, coding, genetic algorithm based classifier (initial population, evaluation, crossover, mutation, new generation, with the loop test "<gen_size or >avg_fitness_prev"), classification of the testing dataset, classified webpages.]

3.5. Initial population

The initial population consists of pop_size chromosomes generated randomly using the coding scheme. The size of each chromosome equals the size of the feature set. The population size taken in the proposed work is 30.

3.6. Evaluation of population

The fitness of every chromosome in the population is computed by evaluating the cosine similarity of the chromosome with every document, as shown in equation 3, where Cos_simi(C, D_i) represents the cosine similarity. After evaluating the cosine similarity, a threshold value is taken, which is the mean of the cosine similarities of the chromosome with all documents, as shown in equation 4. This threshold value may give only an average result, but it does not degrade the overall performance. Then the average of the cosine similarities of the chromosome with the documents whose similarity exceeds the threshold is computed. That average is the fitness of the chromosome, as shown in equation 5.

$$\mathrm{Cos\_simi}(C, D_i) = \frac{\sum_{k=1}^{n} C[k]\, D_i[k]}{\sqrt{\sum_{k=1}^{n} C[k]\, C[k]}\;\sqrt{\sum_{k=1}^{n} D_i[k]\, D_i[k]}} \tag{3}$$

$$\mathrm{threshold} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cos\_simi}(C, D_i) \tag{4}$$

$$\mathrm{Fitness}(C) = \frac{1}{|S|} \sum_{i \in S} \mathrm{Cos\_simi}(C, D_i), \qquad S = \{\, i : \mathrm{Cos\_simi}(C, D_i) > \mathrm{threshold} \,\} \tag{5}$$

where n is the number of elements in the feature set, m is the number of documents in the training dataset, C is the chromosome and D_i is the i-th document of the training set.

3.7. Selection

For the selection of chromosomes, a novel technique is used in which a dummy chromosome serves as the reference for selection. Each element of the dummy chromosome equals the average of the corresponding elements of all chromosomes in the population, as shown in equation 6.

$$C_m[i] = \frac{1}{\mathrm{pop\_size}} \sum_{j=1}^{\mathrm{pop\_size}} C[j][i] \tag{6}$$

where C_m[i] is the i-th element of the dummy chromosome and C[j][i] is the i-th element of the j-th chromosome. The fitness of the dummy chromosome is then computed using equation 5. Chromosomes are selected for further processing based on the minimum difference between the fitness of the dummy chromosome and the fitness of each chromosome in the population.

3.8. Crossover

In the proposed approach, the uniform crossover technique is used: a random dummy chromosome r of the same size as a chromosome is generated, containing random weights. Each element of r is then compared with the crossover probability, as shown in equations 7 and 8.

$$\text{if } r[i] < P_c \text{ then } c_1[i] = P_1[i], \quad c_2[i] = P_2[i] \tag{7}$$

$$\text{if } r[i] > P_c \text{ then } c_1[i] = P_2[i], \quad c_2[i] = P_1[i] \tag{8}$$

where P1[i], P2[i], r[i], c1[i] and c2[i] are the i-th feature weights of the first parent chromosome, second parent chromosome, dummy chromosome, first child and second child respectively, and P_c denotes the crossover probability. The fitness of the newly generated children is then computed using equation 5.

Table I.
Example of crossover operation
[Columns: F1, F2, F3, F4, F5, ..., FN. Rows: P1, P2, r, C1, C2.]

Consider, for example, the creation of child chromosomes by the crossover operation shown in Table I. F1, F2 and so on up to FN are the features in the feature set. P1, P2 and r are the first parent chromosome, the second parent chromosome and the dummy chromosome. C1 and C2 are the new chromosomes generated by comparing the dummy chromosome with the crossover probability using equations 7 and 8.

3.9. Mutation

A modified mutation technique is proposed in which mut_no, the number of features in the chromosome to be changed, is calculated as shown in equation 9, where pop_size is the size of the population, P(m) is the mutation probability and chromosome_size is the number of elements in the chromosome.

$$\mathrm{mut\_no} = \mathrm{pop\_size} \times P(m) \times \mathrm{chromosome\_size} \tag{9}$$

A dummy chromosome is then calculated in which each element is the average of that feature over all chromosomes in the population of the present iteration, and its fitness is computed. The chromosome for mutation is the one with the minimum difference between its fitness and the fitness of the dummy chromosome, as shown in equation 10.

$$C_s = \arg\min_{C_i,\; 1 \le i \le n} \left| \mathrm{fitness}_d - \mathrm{fitness}_i \right| \tag{10}$$

where C_s is the chromosome selected for mutation and C_i is the i-th chromosome; fitness_d and fitness_i are the fitness of the dummy chromosome and of the i-th chromosome of the population respectively, and n is the number of chromosomes in the population. After the chromosome is selected, mut_no features are selected and their weights are updated based on a random number. Fig. 2 shows the algorithm that generates a new child by mutation. In this algorithm, C is the selected chromosome, C_a is the dummy chromosome and C_m is the newly generated chromosome. Arr is an array that stores the randomly generated indices; j, k and i are simple variables. After the new chromosome is generated, its fitness is computed.
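The evaluation step and the dummy-chromosome selection used by both the selection and the mutation stages can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the standard cosine similarity, assumes that the fitness is the mean of the similarities that exceed the per-chromosome threshold, and falls back to the threshold itself when no similarity exceeds it. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def cos_simi(c: Sequence[float], d: Sequence[float]) -> float:
    """Standard cosine similarity between chromosome c and document vector d."""
    dot = sum(x * y for x, y in zip(c, d))
    norm_c = math.sqrt(sum(x * x for x in c))
    norm_d = math.sqrt(sum(y * y for y in d))
    if norm_c == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_c * norm_d)


def fitness(c: Sequence[float], docs: Sequence[Sequence[float]]) -> float:
    """Mean of the similarities that exceed the threshold (mean of all similarities)."""
    sims = [cos_simi(c, d) for d in docs]
    threshold = sum(sims) / len(sims)
    above = [s for s in sims if s > threshold]
    if not above:
        # Assumption: if every similarity equals the mean, use the mean itself.
        return threshold
    return sum(above) / len(above)


def dummy_chromosome(population: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of all chromosomes in the population."""
    size = len(population)
    return [sum(genes) / size for genes in zip(*population)]


def closest_to_dummy(population: Sequence[Sequence[float]],
                     docs: Sequence[Sequence[float]]) -> int:
    """Index of the chromosome whose fitness is nearest to the dummy's fitness."""
    target = fitness(dummy_chromosome(population), docs)
    return min(range(len(population)),
               key=lambda i: abs(target - fitness(population[i], docs)))


if __name__ == "__main__":
    # Hypothetical binary document vectors (rows of the document matrix D).
    documents = [[1, 0], [1, 1], [0, 1]]
    chromosomes = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
    print(closest_to_dummy(chromosomes, documents))
```

Recomputing every fitness inside `closest_to_dummy` keeps the sketch short; an implementation that evaluates many generations would more likely cache the fitness of each chromosome.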

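The two variation operators, uniform crossover and the mutation procedure of Fig. 2, might be sketched as below. Several points are assumptions: that a gene with r[i] < P_c is copied straight from the parents and is swapped otherwise, that the weight at a mutated position j is compared against a fresh random number, and that rand(C[j], C_a[j]) means a uniform draw between the two values. The function names and the explicit `rng` parameter are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def uniform_crossover(p1: Sequence[float], p2: Sequence[float],
                      p_c: float, rng: random.Random
                      ) -> Tuple[List[float], List[float]]:
    """Build two children gene by gene from a random dummy chromosome r."""
    c1: List[float] = []
    c2: List[float] = []
    for g1, g2 in zip(p1, p2):
        r = rng.random()  # element r[i] of the random dummy chromosome
        if r < p_c:
            # Assumed direction: below P_c the children copy their own parent.
            c1.append(g1)
            c2.append(g2)
        else:
            # Otherwise the two genes are swapped between the children.
            c1.append(g2)
            c2.append(g1)
    return c1, c2


def mutate(c: Sequence[float], c_a: Sequence[float],
           mut_no: int, rng: random.Random) -> List[float]:
    """Mutate up to mut_no randomly chosen positions of chromosome c.

    c_a is the dummy (average) chromosome. Positions that are never drawn
    keep the weight of c, which is what the second loop of Fig. 2 achieves.
    """
    c_m = list(c)
    for _ in range(mut_no):
        ran = rng.random()
        j = rng.randrange(len(c))
        if ran >= c[j]:
            # Assumed meaning of rand(C[j], C_a[j]): uniform between both values.
            low, high = sorted((c[j], c_a[j]))
            c_m[j] = rng.uniform(low, high)
        # When ran < c[j] the original weight is kept.
    return c_m


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the demonstration is repeatable
    parent1 = [0.1, 0.2, 0.3, 0.4]
    parent2 = [0.9, 0.8, 0.7, 0.6]
    child1, child2 = uniform_crossover(parent1, parent2, 0.7, rng)
    average = [(a + b) / 2 for a, b in zip(parent1, parent2)]
    print(child1, child2, mutate(child1, average, 2, rng))
```

Passing the random generator in explicitly makes runs reproducible, which helps when comparing parameter settings such as the crossover and mutation probabilities.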
Input: selected chromosome C, Arr, dummy chromosome C_a.
Output: mutated chromosome C_m.
1. k = 0.
2. repeat mut_no times
   a. Generate a random number ran in [0, 1].
   b. Generate a random index j in [1, chromosome_size].
   c. if (ran < C[j]) C_m[j] = C[j].
   d. else C_m[j] = rand(C[j], C_a[j]).
   e. Arr[k++] = j.
3. end of loop.
4. for i = 1 to chromosome_size
   a. if i is present in Arr, continue.
   b. else C_m[i] = C[i].
5. end of for loop.
6. Return C_m.
Fig. 2. Proposed algorithm for mutation

3.10. Generation of new population

All chromosomes in the population of the current iteration, together with the chromosomes newly generated by crossover and mutation, are sorted by fitness, and the pop_size fittest chromosomes are selected for the next iteration. The average fitness of the newly generated population is then computed.

3.11. Termination condition

To achieve convergence, an improved termination condition is used. The convergence conditions are shown in equation 11.

$$\mathrm{terminate} = \begin{cases} \text{true}, & tp > gensz \ \text{or} \ avgftcrp < avgftpvp \\ \text{false}, & \text{otherwise} \end{cases} \tag{11}$$

where tp is the total number of chromosomes generated up to the current iteration, gensz is the maximum number of chromosomes that can be generated in the system, and avgftcrp and avgftpvp are the average fitness of the current and previous population respectively.

When the genetic algorithm terminates, the chromosome with the highest fitness among all chromosomes present is selected and used for classification of the webpages.

3.12. Classification

Fig. 3 shows the classification process of the proposed algorithm. Seed URLs from the testing dataset are crawled up to a certain depth. URL filtering is then performed on the crawled data to filter out the webpages whose URLs do not contain the words in the filtering list. The filtered URLs are passed to the document formation phase, where each is represented as a binary vector of size equal to the number of features under consideration. Then the Cos_simi of the webpage D and the best fitted chromosome C is computed. If Cos_simi is greater than the threshold, the webpage is marked as relevant, otherwise as irrelevant.

[Fig. 3. Proposed classification process: start; testing dataset (seed URLs); crawl up to certain depth; URL filtering; document formation (D); Cos_simi(C, D) with the best fitted chromosome (C); if > threshold, relevant, otherwise irrelevant; stop.]

4. Results and discussion

In this section, the experimental setup and the results obtained are discussed. All implementations for the experiments were made using the NetBeans IDE under the Windows 7 operating system. The hardware used in the experiment had 3 GB of RAM and an Intel Core i3 CPU M 2.53 GHz processor.

4.1. Dataset

To obtain information about academicians of Indian origin working abroad, the dataset consists of the websites of foreign universities. In this experiment, the website of Stanford University is used as the initial URL. To create the dataset, this URL is crawled up to depth 6 to extract all URLs from the Stanford domain. This set of URLs is then filtered based on the URL filtering list, which consists of words such as faculty, directory, people, staff, people-all and directory-people. The filtered URLs are all those URLs that contain one of these words in the URL itself. This set of webpages contains irrelevant as well as relevant webpages.

For the dataset, features are extracted based on tags and terms. The tags used are title <title>, header (<h1>, <h2>, <h3>, <h4>), image <img>, bold <b>, paragraph <p>, table <td>, list <li> and anchor <a>. The terms consist of lists of surnames, institutes, cities, departments and designations. The surnames list contains Indian surnames, the institutes list contains educational institutes in India, and the cities list contains cities of India. The departments list contains science and technology departments of foreign universities, and the designations list contains faculty designations such as professor, assistant professor and associate professor. After analysis, the feature set is created from combinations of tags and terms, such as title-designation, title-surnames, header-department, list-cities and table-institutes. For each filtered URL, document formation marks 1 or 0 in the 2-D document matrix for the presence or absence of each feature in the feature set. For example, the presence of a surname in the title marks the corresponding <title-surname> feature as 1.

4.2. Genetic algorithm parameters

The genetic algorithm parameters were determined experimentally so that they were a good choice for our system. The parameters population size = 30, generation size = 400, crossover probability = 0.7 and mutation probability = 0.5 were chosen after analysis and observation. The learning process took 38 iterations to converge to the best chromosome.

5. RESULTS

Table 2 shows the number of webpages obtained by the proposed system after every stage.

Table 2: No. of webpages achieved at every stage of proposed system
Stages                                  No. of webpages
Crawled up to depth 6
URL filtering
Total faculty of Stanford University
Indian origin faculty

After crawling the Stanford University website up to a certain depth, the proposed system is verified and validated based on the results obtained. Precision is taken as the performance measure of the proposed system. It is defined as the ratio of the number of relevant retrieved items to the total number of retrieved items, as shown in equation 12.

$$\mathrm{Precision} = \frac{\#\,\text{relevant retrieved items}}{\#\,\text{retrieved items}} \tag{12}$$

Using the proposed approach, a precision of around 80% is achieved.

6. CONCLUSION

A focused Web crawler seeks, acquires and gathers webpages relevant to a pre-defined set of topics. In this paper, a genetic algorithm based focused Web crawler is proposed that classifies webpages as relevant or irrelevant. In this approach, the best chromosome is obtained after the learning phase of the genetic algorithm. This chromosome consists of the best weighted feature set. The Stanford University website is crawled for testing, and based on this chromosome all URLs are classified as relevant or irrelevant, that is, according to whether or not the URL contains information about faculty of Indian origin. With the proposed approach, a precision of up to 80% is achieved. The precision can be further improved by using more features.

References

[1] Chakrabarti, Soumen, Martin Van den Berg, and Byron Dom. "Focused crawling: a new approach to topic-specific Web resource discovery." Computer Networks (1999):
[2] Ferdaus, Abu Ahmed, and Mehnaj Afrin Khan. "A Genetic Algorithm Approach using Improved Fitness Function for Classification Rule Mining." International Journal of Computer Applications (2014).
[3] Korde, Vandana, and C. Namrata Mahender. "Text classification and classifiers: A survey." International Journal of Artificial Intelligence & Applications (IJAIA) 3.2 (2012):
[4] Özel, Selma Ayşe. "A web page classification system based on a genetic algorithm using tagged-terms as features." Expert Systems with Applications 38.4 (2011):
[5] Özel, Selma Ayşe. "A genetic algorithm based optimal feature selection for web page classification." Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on. IEEE,
[6] Qi, Dehu, and Bo Sun. "A genetic k-means approaches for automated web page classification." Information Reuse and Integration, IRI Proceedings of the 2004 IEEE International Conference on. IEEE,
[7] Qi, Xiaoguang, and Brian D. Davison. "Web page classification: Features and algorithms." ACM Computing Surveys (CSUR) 41.2 (2009): 12.
[8] Saraç, Esra, and Selma Ayşe Özel. "An Ant Colony Optimization Based Feature Selection for Web Page Classification." The Scientific World Journal 2014 (2014).
[9] Selamat, Ali, and Sigeru Omatu. "Web page feature selection and classification using neural networks." Information Sciences 158 (2004):
