Combining Machine Learning And Data Mining For Intelligent Recommendations On Web Data
Combining Machine Learning And Data Mining For Intelligent Recommendations On Web Data

Mridul Sahu 1 and Samiksha Bharne 2
1 M.Tech Student, Computer Science And Engineering, BIT, Ballarpur, India
2 Professor, Computer Science And Engineering, BIT, Ballarpur, India

Abstract

Web crawlers are the heart of search engines. They continuously crawl the web to find new pages that have been added and pages that have been removed. Due to the growing and dynamic nature of the web, it has become a challenge to traverse all the URLs in web documents and to handle them. A topical crawler is an agent that targets a particular topic, visiting and gathering only relevant web pages. To determine whether a web page is about a particular topic, topical crawlers use classification techniques. In this paper we introduce the Smart Crawler with machine learning and present a statistical analysis of its performance. Further, data mining algorithms are used to improve the efficiency of the Smart Crawler.

Keywords: Crawler, CHARM, Data Mining, Bisecting K-Means, Machine Learning, Web Intelligence.

I. INTRODUCTION

The World Wide Web provides a vast source of information of almost every type. However, this information is often scattered among many web servers and hosts, using many different formats. We all want the best possible search in the least time. Any crawler must consider two issues. First, the crawler should have the capability to plan, i.e., to decide which pages to download next. Second, it needs a highly optimized and robust system architecture that can download a large number of pages per second, remain resilient against crashes, and stay manageable and considerate of resources and web servers. There has been some recent academic interest in the first issue, including work on deciding which important pages the crawler should fetch first.
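The first issue above, planning which page to download next, is commonly handled with a prioritized URL frontier. The sketch below is an illustration under assumed conventions (lower score means fetched sooner; the class name and URLs are hypothetical), not a description of any particular crawler.

```python
import heapq

class URLFrontier:
    """Priority queue of URLs: lower score means fetched sooner."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker keeps insertion order stable

    def push(self, url, priority):
        # a deduplicating set prevents re-queueing pages already discovered
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (priority, self._counter, url))
            self._counter += 1

    def pop(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

frontier = URLFrontier()
frontier.push("http://example.org/a", priority=0.9)
frontier.push("http://example.org/b", priority=0.1)
frontier.push("http://example.org/a", priority=0.5)  # duplicate, ignored
```

The crawler pops the lowest-scored URL next, so whatever scoring function the planner uses directly determines the crawl order.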
Every search engine is divided into different modules. Among these, the crawler module is the one the search engine relies on most, because it helps provide the best possible results. Crawlers are small programs that browse the web on the search engine's behalf, similarly to how a human user would follow links to reach different pages. The programs are given starting seed URLs, whose pages they retrieve from the web. The crawler extracts the URLs appearing in the retrieved pages and gives this information to the crawler control module. This module determines which links to visit next and feeds those links back to the crawlers. The crawler also passes the retrieved pages into a page repository. Crawlers continue visiting the web until local resources, such as storage, are exhausted.

II. BACKGROUND

Web crawlers were written as early as 1993. That year gave birth to four web crawlers: World Wide Web Wanderer, Jump Station, World Wide Web Worm [11], and the RBSE spider. These four spiders mainly collected information and statistics about the web using a set of seed URLs. Early web crawlers iteratively downloaded URLs and updated their repository of URLs from the downloaded web pages. The next year, 1994, two new web crawlers appeared: WebCrawler and MOM spider. In addition to collecting stats and data about the state of the web, these two web crawlers introduced the concepts of politeness and black-lists to traditional web crawlers. WebCrawler is considered to be the first parallel web crawler, downloading 15 links simultaneously. From World Wide Web Worm to WebCrawler, the number of indexed pages increased from 110,000 to 2 million. Shortly after, in the following years, a few commercial web crawlers became available: Lycos, Infoseek, Excite, AltaVista and HotBot.

2.1 Focused crawler

A variety of methods and algorithms have been proposed for building focused crawlers. A comparison of the learning schemes employed by focused crawlers can be found in [5]. Several works employ a link-ranking mechanism [6,7,3] to help the classification of web pages. In [8] the authors describe a focused crawler that searches the web and finds relevant pages on a given topic. They use a classifier to determine the relevancy of a page and a distiller to evaluate page links. The classifier is a Bayesian classifier that determines the relevancy of a page to a predefined topic. Fish search [9] is one of the early works in focused crawling. In this algorithm, each URL behaves like a fish whose chance to survive depends on that page's relevancy. The shark-search algorithm [10] improves on fish search by refining how the relevancy between a page and the topic is calculated.

Figure 1. System architecture

Figure 1 shows the general architecture of the focused crawler system. The focused crawler starts with a set of seed URLs and topic words. The topic words are recorded in a topic-words weight table. First, a query is submitted to a well-known web search engine such as Google in order to find the first n related pages. Words that appear in these pages are added to the topic-words weight table. The weights are calculated from the frequency of occurrence of terms in the pages. Furthermore, the URLs of these first n pages also form the initial set of seed URLs added to the URL frontier. The URL frontier component is a queue that stores links waiting to be crawled. The system works until the URL frontier queue is empty. The URL frontier retrieves a page only if the visited site allows it, as specified in the robots.txt file.
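The topic-words weight table and frequency-based weights described above can be sketched as follows. The function names and the cosine-style similarity formula are assumptions for illustration, since the text does not give the exact formulas.

```python
from collections import Counter
import math

def build_weight_table(seed_pages):
    """Topic-word weights from term frequencies across the first n pages."""
    counts = Counter()
    for text in seed_pages:
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def page_score(page_text, weights):
    """Cosine-style similarity between page term frequencies and topic weights."""
    tf = Counter(page_text.lower().split())
    dot = sum(tf[t] * w for t, w in weights.items())
    page_norm = math.sqrt(sum(v * v for v in tf.values()))
    topic_norm = math.sqrt(sum(w * w for w in weights.values()))
    if page_norm == 0 or topic_norm == 0:
        return 0.0
    return dot / (page_norm * topic_norm)

weights = build_weight_table(["java virtual machine", "java bytecode compiler"])
relevant = page_score("java compiler tutorial", weights)
off_topic = page_score("cooking pasta recipes", weights)
```

A page is then accepted as similar when its score exceeds the chosen threshold.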
We apply preprocessing methods including stemming and stop-word filtering. The page relevance calculator computes a page score as the similarity between the retrieved page and the topic words. Pages with scores above a specific threshold are accepted as similar. If the retrieved page is similar to the query terms given in the topic words, it is called relevant; otherwise, irrelevant. After the page relevance calculation, the links included in each page are analyzed. The crawler then determines the links to be added to the URL frontier queue for further crawling.

III. INTELLIGENT CRAWLING

Intelligent crawling was proposed in [2]; it allows users to specify arbitrary predicates. It suggests the use of arbitrary implementable predicates over four sets of document statistics: source page content, URL tokens, linkage locality and sibling locality, used in the classifier to calculate the relevance of a document. The source page content allows prioritizing different URLs differently. URL tokens help in obtaining approximate semantics about the page. Linkage locality is based on the assumption that web pages on a given topic are more likely to link to pages on the same topic. Sibling locality is based on the assumption that if a web page points to pages of a given topic, it is more likely to point to other pages on the same topic. The focused crawler in [3] uses only the linkage locality and sibling locality properties, which may not be applicable to all topics. [2] proposes to vary the dependence on each set of features online by re-evaluating over the currently crawled pages. The intelligent crawler, given a set of seed pages, initially crawls randomly to gather examples to train its classifier and learns the importance of the different feature sets. Each statistic changes based on the actual visiting of pages. This allows the intelligent crawler to adapt itself in order to maximize performance. The reinforcement-based learning of the classifier allows it to adapt to different topics and seed sets.

So far we have discussed how web crawlers operate; we now describe the proposed work of our project. The project comprises the following modules: collection of datasets for web crawling; developing the crawler to mine the websites and mark each unique website with a number; developing machine learning to form sequences based on these websites; implementing Top-K Rules for mining these sequences for effective data mining; and providing recommendations to the user based on the mined sequences.

Figure 1. Functional flow diagram of topical crawler.

Our crawler operates as follows. The user enters a search text, called a seed URL.
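The website-numbering and sequence-forming modules listed above can be sketched as follows; the helper names and the numbering-from-one convention are illustrative assumptions, not details given in the paper.

```python
def number_websites(visited_urls):
    """Assign each unique website a number, in first-seen order."""
    ids = {}
    for url in visited_urls:
        if url not in ids:
            ids[url] = len(ids) + 1
    return ids

def to_sequence(session_urls, ids):
    """Turn one browsing session into a sequence of website numbers."""
    return [ids[u] for u in session_urls if u in ids]

visits = ["a.com", "b.com", "a.com", "c.com"]
ids = number_websites(visits)
seq = to_sequence(["b.com", "c.com"], ids)
```

Sequences of this form are what a rule-mining step (such as Top-K rule mining) would then consume to produce recommendations.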
The seed URL is input to the search engine, which starts searching from it and passes the traversed URLs to the intelligent crawler together with the trained data, which is already saved. With the help of the crawler's intelligence, the data is crawled using the thematic vertical crawling mechanism. After searching the local data or the web data, a feature cluster related to the seed URL is created. Finally, the page history, the related URLs and their contents are stored in the database.

IV. IMPLEMENTATION OF SMART CRAWLER

4.1 Setup

The system is divided into three main sectors: Crawler, Data Mining Intelligence and Training Data (not required when using unsupervised learning).
Figure 2. Block diagram representing the communication model of the sectors of the system

These sectors have special features which enable them to function with each other fluently. The sectors follow a modular structure, giving the flexibility to replace or update any block without affecting the overall functionality. The crawler and the Data Mining Intelligence have a bidirectional communication link established between them; this serves the purpose of forward communication from the crawler to the Data Mining Intelligence and a feedback link from the Data Mining Intelligence back to the crawler. The Training Data sector is used only when the Data Mining Intelligence sector uses a supervised learning algorithm; this link is unidirectional. The crawling strategy used can be classified as a vertical topical strategy: the crawler follows a focused thematic approach, and the pages it fetches are guided by the interest of the user and the introduced intelligence [1].

4.2 Data mining architecture

Generic crawlers start crawling from a seed page. The seed page plays a critical role in guiding the crawler and in finding a path leading to the required target page. The path followed to reach the target page can be optimized by tapping the performance parameters of a crawler. Consider a crawler l; its output for a given topic can be written as a temporal sequence S_l = (u_1, u_2, ..., u_M), where u_i is the URL of the ith page crawled and M is the maximum number of pages crawled. We should also be able to measure the performance of the crawler; thus we need to define a function f that maps the sequence S_l to a sequence R_l = (r_1, r_2, ..., r_M), where r_i is a scalar quantity representing the relevance of the ith page crawled to the topic. The sequence R_l helps in summarizing the results at various points along the path of the crawl. However, obtaining a complete relevance set for a topic is a difficult task and is prone to errors. This makes the evaluation of topical crawlers challenging.
In such a scenario, utilizing retrieval measures such as the harvest rate (defined below) of a crawler becomes necessary.

Harvest rate: this estimates the proportion of crawled pages that are relevant to the topic among all the pages that have been crawled. We depend heavily on classifiers to make this type of judgment; the classifiers used as part of the data mining intelligence can perform such judgmental decisions. The harvest rate after the first t pages is computed as

    H(t) = (1 / t) * sum(r_i for i = 1..t)
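As a worked example, the harvest rate can be computed directly from a list of binary relevance scores (the helper name is an illustrative assumption):

```python
def harvest_rate(relevance_scores):
    """Average of binary relevance scores r_i over the pages crawled so far."""
    if not relevance_scores:
        return 0.0
    return sum(relevance_scores) / len(relevance_scores)

# r_i = 1 if the classifier judged page i relevant, else 0
r = [1, 0, 1, 1, 0]
h = harvest_rate(r)  # 3 relevant pages out of 5 crawled
```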
Here r_i is the binary relevance score of page i. This score is provided by the classifier and is subject to change depending on the strategy used by the classification software. The general representation of the average harvest rate is h(c, p), where c is the classifier used and p is the number of pages crawled. This value ranges from 0 to 1, with 0 being the worst-case performance and 1 the best-case performance.

SVM (Support Vector Machines)

In a topical crawler, the aim of the SVM algorithm [15] is to find a linear-kernel classification function f(x) which classifies a particular link as useful or redundant. This classification function corresponds to a hyperplane which separates the two classes from each other. Once f(x) is obtained, each new URL entry fed to the crawler passes through the SVM; the classification function analyzes the link and returns a value, and if f(x) is positive the link belongs to the positive class and is determined to be useful. This process occurs recursively until a certain depth of vertical crawling is reached.

C4.5

C4.5 is a statistical classifier algorithm [16]; it generates a decision tree based on the features in place. It is a non-linear algorithm, used in circumstances where the initial classification of the dataset is unknown. Initially it takes a small amount of data containing a collection of factors and describes their values. C4.5 uses two criteria to rank possible tests: the information gain of a class, which minimizes the total randomness of the factors of a class of the sample space (but is heavily biased towards tests with numerous outcomes), and the default gain ratio, which divides the information gain by the information provided by the test outcomes. The topical crawler introduces initial rule sets in the sample space s_i(x). C4.5 checks for the base cases and starts building the classification decision tree. A function f(x) decides the relevance of a link based on the initial rule sets; f(x) returns a floating-point number representing the benefit ranking.
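The two ranking criteria C4.5 uses, information gain and the gain ratio, can be illustrated on a toy link feature. This sketches only the split criteria, not the full tree construction; the feature, labels and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(feature_values, labels):
    """Information gain of splitting on a discrete feature, divided by
    the split information (the information provided by the test outcomes)."""
    total = len(labels)
    before = entropy(labels)
    after = 0.0
    split_info = 0.0
    for value in set(feature_values):
        subset = [lab for v, lab in zip(feature_values, labels) if v == value]
        p = len(subset) / total
        after += p * entropy(subset)
        split_info -= p * math.log2(p)
    gain = before - after
    return gain / split_info if split_info > 0 else 0.0

# Toy link data: feature = "URL contains the topic token?", label = relevant?
feature = ["yes", "yes", "no", "no"]
label = ["rel", "rel", "irr", "irr"]
```

Here the feature perfectly separates relevant from irrelevant links, so its gain ratio is maximal, while a feature uncorrelated with the labels scores zero.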
The ranking decides the next link to traverse while accumulating a new set of attributes from the new link. Each new attribute captured contributes to improved efficiency, and unlike SVM, the crawler keeps gaining intelligence as the number of features increases.

Algorithm: Crawling Process

A summary of the crawling algorithm is given below. A set of seed URLs is given as input to the crawler.

1. while the list of URLs to search is not empty
2. {
3.     get the first URL in the list.
4.     move the URL to the list of URLs already searched.
5.     check the URL to make sure its protocol is http; if not, break out of the loop, back to 1.
6.     see whether there is a robots.txt file at this site that includes a "disallow" statement; if so, break out of the loop, back to 1.
7.     try to fetch the URL.
8.     if it is not an html file, break out of the loop, back to 1.
9.     while the html text contains another link,
10.    {
11.        validate the link's URL (just as in the outer loop).
12.        if it is an html file,
13.        {
14.            if the URL is not present,
15.            {
16.                add it to the to-search list.
17.            }
18.            else if the type of the file is valid,
19.            {
20.                add it to the list of files found.
21.            }
22.        }
23.    }
24. }

International Journal of Modern Trends in Engineering and Research (IJMTER)

Semantic Matching

For URL optimization, we use link-score evaluation methods to find relevant pages. Re-ranking is based on the website's score, which is found using semantic matching.

Figure 3. Finding cluster terms

Figure 4. Semantic matching

Bisecting K-means

Bisecting k-means [4] is like a combination of k-means and hierarchical clustering. Instead of partitioning the data into k clusters in each iteration, bisecting k-means splits one cluster into two sub-clusters at each bisecting step (by using k-means) until k clusters are obtained. As bisecting k-means is based on k-means, it keeps the merits of k-means and also has some advantages over it. First, bisecting k-means is more efficient when k is large. For the k-means algorithm, the computation involves every data point of the data set and k centroids. In contrast, in each bisecting step of bisecting k-means, only the data points of one cluster and two centroids are involved in the computation, so the computation time is reduced. Secondly, bisecting k-means produces clusters of similar sizes, while k-means is known to produce clusters of widely different sizes.

Bisecting K-means algorithm for finding k clusters:
1. Pick a cluster to split.
2. Find 2 sub-clusters using the basic k-means algorithm (the bisecting step).
3. Repeat step 2, the bisecting step, ITER times and take the split that produces the clustering with the highest overall similarity.
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
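The four steps above can be sketched for one-dimensional data as follows. This simplified illustration performs a single bisecting trial per split instead of repeating it ITER times, and always picks the largest cluster to split; the names and data are illustrative.

```python
import random

def kmeans2(points, iters=20, seed=0):
    """Basic 2-means on 1-D points: the bisecting step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            # assign each point to the nearer of the two centroids
            i = 0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1
            groups[i].append(p)
        # move each centroid to the mean of its group (keep it if empty)
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return [list(g) for g in groups if g]

def bisecting_kmeans(points, k):
    """Repeatedly split one cluster in two until k clusters are obtained."""
    clusters = [list(points)]
    while len(clusters) < k:
        # pick the cluster to split: here, simply the largest one
        clusters.sort(key=len, reverse=True)
        target = clusters.pop(0)
        if len(target) < 2:           # nothing left to split
            clusters.append(target)
            break
        parts = kmeans2(target)
        if len(parts) < 2:            # degenerate split: keep cluster, stop
            clusters.append(target)
            break
        clusters.extend(parts)
    return clusters

data = [0.8, 1.0, 1.2, 9.5, 10.0, 10.5, 19.5, 20.0]
clusters = bisecting_kmeans(data, 3)
```

Each bisecting step touches only one cluster and two centroids, which is exactly the efficiency argument made above for large k.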
There are a number of different ways to choose which cluster to split. For example, we can choose the largest cluster at each step, the one with the least overall similarity, or use a criterion based on both size and overall similarity.

CHARM Algorithm

CHARM is an efficient algorithm for mining all the closed frequent itemsets. We first describe the algorithm in general terms, independent of implementation details, and then show how it can be implemented efficiently. CHARM simultaneously explores both the itemset space and the tidset space using the IT-tree, unlike previous methods, which typically exploit only the itemset space. CHARM uses a novel search method, based on the IT-pair properties, that skips many levels in the IT-tree to quickly converge on the itemset closures, rather than having to enumerate many possible subsets. The pseudo-code for CHARM appears in Figure 5. The algorithm starts by initializing the prefix class [P] of nodes to be examined to the frequent single items and their tidsets (line 1). We assume that the elements in [P] are ordered according to a suitable total order f. The main computation is performed in CHARM-Extend, which returns the set of closed frequent itemsets C.

Figure 5. The CHARM algorithm

4.3 System Design

In this paper we presented the working and design of the web crawler. To run the crawler we give one seed URL as input, as shown in Figure 6. When we enter the search text "JAVA" and send a request to the crawler, it searches for data related to that text. Figure 7 shows the results related to the search "JAVA"; instead of showing all the links, it displays only the de-duplicated (DUST-removed), i.e. optimized, links. These links are then stored in the database, as shown in Figure 8.

Figure 6. Design of the crawler
Figure 7. Output results

Figure 8. Links saved in the database

V. RESULT AND ANALYSIS

The crawler's efficiency increased with the experience gained from the number of pages crawled. SVM and k-means have a limitation of data crumbling, which results in a blurred hyperplane. This effect occurs when the number of pages crawled increases beyond a threshold value, which is decided by the dimensionality of the algorithm. It can be overcome by increasing the number of dimensions of the identified factors; however, while increasing the number of dimensions resolves the blurring effect, it demands higher processing power. The SVM algorithm provides a moderate harvest rate. It has the advantage of simplicity and should be preferred for small crawling activities because it requires fewer clock cycles to complete its operation. Compared on the basis of training data, SVM requires training data, which can be generated for small classification sizes. If the classification factors are unknown, SVM produces false outputs; in such circumstances k-means provides better operation. In conclusion, the bisecting k-means and CHARM algorithms provide the most optimal results.

VI. CONCLUSION

As proposed, we built a smart crawler to serve the needs of the search engine. The smart crawler successfully crawls in a breadth-first approach. We built the crawler and equipped it with data-processing as well as URL-processing capabilities. We filtered the data obtained from web pages on servers to get the text files needed by the search engine. We could also filter out unnecessary URLs before fetching data from the server. We compared the performance of the existing crawler with that of the smart crawler. With the filtered URLs, the Smart Crawler was able to identify concepts from the data quickly and much more efficiently. Thus we were able to improve the efficiency of the smart crawler.

REFERENCES

[1] Abhiraj Darshakar, "Crawler intelligence with Machine Learning and Data Mining integration", International Conference on Pervasive Computing (ICPC), 2015.
[2] C.C. Aggarwal, F. Al-Garawi, and P.S. Yu, "Intelligent crawling on the World Wide Web with arbitrary predicates", Proc. of 10th Intl. Conf. on WWW.
[3] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused Crawling: A New Approach for Topic-Specific Resource Discovery", WWW Conference.
[4] Ruchika R. Patil, Amreen Khan, "Bisecting K-means for clustering web log data", International Journal of Computer Applications, vol. 116, no. 19, April.
[5] G. Pant, P. Srinivasan, "Learning to Crawl: Comparing Classification Schemes", ACM Transactions on Information Systems, vol. 23, Oct.
[6] Gupta, P.; Johari, K., "Implementation of Web Crawler", Emerging Trends in Engineering and Technology (ICETET), 2nd International Conference on, pp. 838-843, Dec.
[7] Menczer, F., Pant, G. and Srinivasan, P., "Topical web crawlers", ACM Trans. Internet Technology, 4(4).
[8] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, et al., "Top 10 algorithms in data mining", Knowledge and Information Systems, 14:1-37 (2008).
[9] Menczer, F. and Belew, R.K. (1998), "Adaptive Information Agents in Distributed Textual Environments", in K. Sycara and M. Wooldridge (eds.), Proceedings of the 2nd International Conference on Autonomous Agents (Agents '98), ACM Press.
[10] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (2008), Introduction to Information Retrieval, Cambridge University Press.
[11] F. Yuan, C. Yin and Y., "PageRank in focused Crawler", Fuzzy Systems and Knowledge Discovery, August.
[12] Q. Cheng, W. Beizhan, W., "Crawling strategy using combination of", IT in Medicine & Education, December.
[13] S. Chakrabarti, M. van den Berg, and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery", International WWW Conference, May.
[14] P. De Bra, G.-J. Houben, Y. Kornatzky, and R. Post, "Information retrieval in distributed hypertexts", Intelligent Multimedia Information Retrieval Systems and Management, New York, NY, 1994.
[15] Kecman, Vojislav, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Systems, The MIT Press, Cambridge, MA, 2001.
[16] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
More informationTHE WEB SEARCH ENGINE
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com
More informationA novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems
A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics
More informationEXTRACTION OF RELEVANT WEB PAGES USING DATA MINING
Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,
More informationAn Improved Apriori Algorithm for Association Rules
Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan
More informationPlagiarism Detection Using FP-Growth Algorithm
Northeastern University NLP Project Report Plagiarism Detection Using FP-Growth Algorithm Varun Nandu (nandu.v@husky.neu.edu) Suraj Nair (nair.sur@husky.neu.edu) Supervised by Dr. Lu Wang December 10,
More informationSupervised Web Forum Crawling
Supervised Web Forum Crawling 1 Priyanka S. Bandagale, 2 Dr. Lata Ragha 1 Student, 2 Professor and HOD 1 Computer Department, 1 Terna college of Engineering, Navi Mumbai, India Abstract - In this paper,
More informationDATA MINING - 1DL105, 1DL111
1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database
More informationEnhancing K-means Clustering Algorithm with Improved Initial Center
Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of
More informationTitle: Artificial Intelligence: an illustration of one approach.
Name : Salleh Ahshim Student ID: Title: Artificial Intelligence: an illustration of one approach. Introduction This essay will examine how different Web Crawling algorithms and heuristics that are being
More informationUbiquitous Computing and Communication Journal (ISSN )
A STRATEGY TO COMPROMISE HANDWRITTEN DOCUMENTS PROCESSING AND RETRIEVING USING ASSOCIATION RULES MINING Prof. Dr. Alaa H. AL-Hamami, Amman Arab University for Graduate Studies, Amman, Jordan, 2011. Alaa_hamami@yahoo.com
More informationSimulation Study of Language Specific Web Crawling
DEWS25 4B-o1 Simulation Study of Language Specific Web Crawling Kulwadee SOMBOONVIWAT Takayuki TAMURA, and Masaru KITSUREGAWA Institute of Industrial Science, The University of Tokyo Information Technology
More informationA Survey on Postive and Unlabelled Learning
A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled
More informationA Review on Identifying the Main Content From Web Pages
A Review on Identifying the Main Content From Web Pages Madhura R. Kaddu 1, Dr. R. B. Kulkarni 2 1, 2 Department of Computer Scienece and Engineering, Walchand Institute of Technology, Solapur University,
More informationInferring User Search for Feedback Sessions
Inferring User Search for Feedback Sessions Sharayu Kakade 1, Prof. Ranjana Barde 2 PG Student, Department of Computer Science, MIT Academy of Engineering, Pune, MH, India 1 Assistant Professor, Department
More informationA genetic algorithm based focused Web crawler for automatic webpage classification
A genetic algorithm based focused Web crawler for automatic webpage classification Nancy Goyal, Rajesh Bhatia, Manish Kumar Computer Science and Engineering, PEC University of Technology, Chandigarh, India
More informationBing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer
Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web
More informationSearching the Web [Arasu 01]
Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web
More informationImplementation of Enhanced Web Crawler for Deep-Web Interfaces
Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,
More informationAutomated Online News Classification with Personalization
Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798
More informationWeb Structure Mining using Link Analysis Algorithms
Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.
More informationContent Based Smart Crawler For Efficiently Harvesting Deep Web Interface
Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Prof. T.P.Aher(ME), Ms.Rupal R.Boob, Ms.Saburi V.Dhole, Ms.Dipika B.Avhad, Ms.Suvarna S.Burkul 1 Assistant Professor, Computer
More informationA Novel Architecture of Ontology-based Semantic Web Crawler
A Novel Architecture of Ontology-based Semantic Web Crawler Ram Kumar Rana IIMT Institute of Engg. & Technology, Meerut, India Nidhi Tyagi Shobhit University, Meerut, India ABSTRACT Finding meaningful
More informationIMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK
IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK 1 Mount Steffi Varish.C, 2 Guru Rama SenthilVel Abstract - Image Mining is a recent trended approach enveloped in
More informationIteration Reduction K Means Clustering Algorithm
Iteration Reduction K Means Clustering Algorithm Kedar Sawant 1 and Snehal Bhogan 2 1 Department of Computer Engineering, Agnel Institute of Technology and Design, Assagao, Goa 403507, India 2 Department
More informationCreating a Classifier for a Focused Web Crawler
Creating a Classifier for a Focused Web Crawler Nathan Moeller December 16, 2015 1 Abstract With the increasing size of the web, it can be hard to find high quality content with traditional search engines.
More informationA Hierarchical Document Clustering Approach with Frequent Itemsets
A Hierarchical Document Clustering Approach with Frequent Itemsets Cheng-Jhe Lee, Chiun-Chieh Hsu, and Da-Ren Chen Abstract In order to effectively retrieve required information from the large amount of
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More informationUsing Text Learning to help Web browsing
Using Text Learning to help Web browsing Dunja Mladenić J.Stefan Institute, Ljubljana, Slovenia Carnegie Mellon University, Pittsburgh, PA, USA Dunja.Mladenic@{ijs.si, cs.cmu.edu} Abstract Web browsing
More informationijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System
ijade Reporter An Intelligent Multi-agent Based Context Aware Reporting System Eddie C.L. Chan and Raymond S.T. Lee The Department of Computing, The Hong Kong Polytechnic University, Hung Hong, Kowloon,
More informationPARALLEL CLASSIFICATION ALGORITHMS
PARALLEL CLASSIFICATION ALGORITHMS By: Faiz Quraishi Riti Sharma 9 th May, 2013 OVERVIEW Introduction Types of Classification Linear Classification Support Vector Machines Parallel SVM Approach Decision
More informationIn the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,
1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to
More informationOptimized Searching Algorithm Based On Page Ranking: (OSA PR)
Volume 2, No. 5, Sept-Oct 2011 International Journal of Advanced Research in Computer Science RESEARCH PAPER Available Online at www.ijarcs.info ISSN No. 0976-5697 Optimized Searching Algorithm Based On
More informationCHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER
CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but
More informationEnhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm
Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,
More informationA Supervised Method for Multi-keyword Web Crawling on Web Forums
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 2, February 2014,
More informationI. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80].
Focus: Accustom To Crawl Web-Based Forums M.Nikhil 1, Mrs. A.Phani Sheetal 2 1 Student, Department of Computer Science, GITAM University, Hyderabad. 2 Assistant Professor, Department of Computer Science,
More informationImproved Frequent Pattern Mining Algorithm with Indexing
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.
More informationCLASSIFICATION OF TEXT USING FUZZY BASED INCREMENTAL FEATURE CLUSTERING ALGORITHM
CLASSIFICATION OF TEXT USING FUZZY BASED INCREMENTAL FEATURE CLUSTERING ALGORITHM ANILKUMARREDDY TETALI B P N MADHUKUMAR K.CHANDRAKUMAR M.Tech Scholar Associate Professor Associate Professor Department
More informationAutomated Path Ascend Forum Crawling
Automated Path Ascend Forum Crawling Ms. Joycy Joy, PG Scholar Department of CSE, Saveetha Engineering College,Thandalam, Chennai-602105 Ms. Manju. A, Assistant Professor, Department of CSE, Saveetha Engineering
More informationA New Context Based Indexing in Search Engines Using Binary Search Tree
A New Context Based Indexing in Search Engines Using Binary Search Tree Aparna Humad Department of Computer science and Engineering Mangalayatan University, Aligarh, (U.P) Vikas Solanki Department of Computer
More informationInformation Retrieval. Lecture 10 - Web crawling
Information Retrieval Lecture 10 - Web crawling Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Crawling: gathering pages from the
More informationInternational Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine
International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains
More informationInternational Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X
Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,
More informationarxiv:cs/ v1 [cs.ir] 26 Apr 2002
Navigating the Small World Web by Textual Cues arxiv:cs/0204054v1 [cs.ir] 26 Apr 2002 Filippo Menczer Department of Management Sciences The University of Iowa Iowa City, IA 52242 Phone: (319) 335-0884
More informationTABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION
vi TABLE OF CONTENTS ABSTRACT LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION iii xii xiii xiv 1 INTRODUCTION 1 1.1 WEB MINING 2 1.1.1 Association Rules 2 1.1.2 Association Rule Mining 3 1.1.3 Clustering
More informationDynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering
Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Abstract Mrs. C. Poongodi 1, Ms. R. Kalaivani 2 1 PG Student, 2 Assistant Professor, Department of
More informationMining of Web Server Logs using Extended Apriori Algorithm
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational
More informationSathyamangalam, 2 ( PG Scholar,Department of Computer Science and Engineering,Bannari Amman Institute of Technology, Sathyamangalam,
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 8, Issue 5 (Jan. - Feb. 2013), PP 70-74 Performance Analysis Of Web Page Prediction With Markov Model, Association
More informationAnalyzing Outlier Detection Techniques with Hybrid Method
Analyzing Outlier Detection Techniques with Hybrid Method Shruti Aggarwal Assistant Professor Department of Computer Science and Engineering Sri Guru Granth Sahib World University. (SGGSWU) Fatehgarh Sahib,
More informationBipartite Graph Partitioning and Content-based Image Clustering
Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the
More informationSmart Crawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces
Smart Crawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces Rahul Shinde 1, Snehal Virkar 1, Shradha Kaphare 1, Prof. D. N. Wavhal 2 B. E Student, Department of Computer Engineering,
More informationConclusions. Chapter Summary of our contributions
Chapter 1 Conclusions During this thesis, We studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web
More informationA Comparative study of Clustering Algorithms using MapReduce in Hadoop
A Comparative study of Clustering Algorithms using MapReduce in Hadoop Dweepna Garg 1, Khushboo Trivedi 2, B.B.Panchal 3 1 Department of Computer Science and Engineering, Parul Institute of Engineering
More informationDynamic Clustering of Data with Modified K-Means Algorithm
2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq
More informationData Clustering Hierarchical Clustering, Density based clustering Grid based clustering
Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms
More informationBireshwar Ganguly 1, Rahila Sheikh 2
A Review of Focused Web Crawling Strategies Bireshwar Ganguly 1, Rahila Sheikh 2 Department of Computer Science &Engineering 1, Department of Computer Science &Engineering 2 RCERT, Chandrapur, RCERT, Chandrapur,
More informationImgSeek: Capturing User s Intent For Internet Image Search
ImgSeek: Capturing User s Intent For Internet Image Search Abstract - Internet image search engines (e.g. Bing Image Search) frequently lean on adjacent text features. It is difficult for them to illustrate
More informationDATA MINING II - 1DL460. Spring 2017
DATA MINING II - 1DL460 Spring 2017 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt17 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationInternational Journal of Innovative Research in Computer and Communication Engineering
Optimized Re-Ranking In Mobile Search Engine Using User Profiling A.VINCY 1, M.KALAIYARASI 2, C.KALAIYARASI 3 PG Student, Department of Computer Science, Arunai Engineering College, Tiruvannamalai, India
More information