Promotional Ranking of Search Engine Results: Giving New Web Pages a Chance to Prove Their Values


Yizhen Zhu 1, Mingda Wu 1, Yan Zhang 1,*, and Xiaoming Li 2

1 National Laboratory on Machine Perception, Peking Univ., Beijing, China {zhuyz,wumd,zhy}@cis.pku.edu.cn
2 Department of Computer Science and Technology, Peking Univ., Beijing, China lxm@pku.edu.cn

Abstract. Recent studies show that link-structure-based measures of Web page popularity prolong the time it takes new pages to reach the ranking they deserve. In this paper we propose a promotional ranking scheme that offers newly-created pages an opportunity to be recognized. We conduct a simulation to evaluate our method. Experimental results show that our method remarkably raises the probability that new pages gain user awareness.

1 Introduction

As the Web grows dramatically, search engines have become an indispensable tool for accessing online information. According to [9], as of April 2006 Google and Yahoo each received over 150 million search requests per day. Because of the enormous number of matching results, people are normally interested only in the few top URLs on the list. This preference makes currently-popular pages even more popular and entrenches newly-built ones more deeply. Ideally, search engines should present results ordered by quality. The quality of a web page, however, is too subjective to measure directly, which compels search engines to turn to substitutes for quality. Popularity is one of the most widely used substitutes; it can be measured directly by the number of in-links and clicks, by PageRank, and so on. Yet researchers have found that the accumulation of popularity depends on a page's current popularity [2], so search engines do inhibit the discovery of new pages.
On the other hand, the real web environment is highly dynamic, with high rates of birth, death and replacement of both web pages and the hyperlinks between them [4]. The query results of search engines, however, remain relatively stale under ranking schemes such as PageRank, in-degree, or number of visits. The primary motivation of our work is to solve this problem: to offer newly-created Web pages an opportunity to be noticed and visited. Section 2 is devoted to related work. Section 3 examines web evolution from the perspective of a search engine. Section 4 describes our method step by step. Section 5 presents the details of our experiments.

* Corresponding author. Supported by NSFC grants, a Natural Science Foundation of Beijing grant, and a Guangdong SCUT Key Lab Open Grant.

G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, © Springer-Verlag Berlin Heidelberg 2007

2 Related Work

A. Ntoulas et al. found that existing pages are removed from the Web and replaced by new ones at a very rapid rate [5,4]. In [1] the authors showed several relations between the macro structure of the web, the age of pages and sites, and their quality. J. Cho et al. studied the impact of search engines on page popularity by introducing two surfer models [2]. They estimated that when a search engine ranks pages by popularity, it takes several orders of magnitude more time for a new page to become popular, even if it is of high quality. In [3] J. Cho et al. further investigated the problem of page quality; they proposed a reasonable definition of page quality and derived a practical way of estimating it. The biggest difference between our method and [6] is that their scheme faces an inevitable trade-off between exploration and exploitation. We doubt whether it is safe to substitute a new page of unpredictable quality for a popular result.

3 The Evolution of the Web from the Perspective of a Search Engine

The web is very dynamic, with high rates of birth and death. According to [4], new pages are created at a rate of 8% per week and only 40% of pages survive one year. As for links, 25% new links are created per week. We conducted our own survey of the turnover rate by retrieving a historical data set of URLs collected two years earlier. We found that only about one third of the URLs are still available, which coincides with the conclusions in [4]. However, because of the limits of disk space, bandwidth and cost, it is impossible for a search engine to update its repository frequently enough to keep it consistent with the real web. According to [8], Google refreshes its repository and PageRank every 28 days, and other search engines refresh theirs on cycles of a week to a month.
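As a rough illustration (ours, not the paper's), the figures above suggest a quick estimate of how much of a repository is new between crawls, assuming the 8% weekly birth rate compounds between refreshes and ignoring page deaths:

```python
def new_page_fraction(weekly_birth_rate: float, weeks_between_crawls: float) -> float:
    """Fraction of the repository that is new since the last crawl,
    assuming births compound weekly and ignoring page deaths."""
    grown = (1.0 + weekly_birth_rate) ** weeks_between_crawls
    return 1.0 - 1.0 / grown

# Refresh cycles of two to four weeks under an 8% weekly birth rate:
for weeks in (2, 3, 4):
    print(f"{weeks} weeks: {new_page_fraction(0.08, weeks):.1%}")
```

For cycles of two to three weeks this crude model lands near the 15% to 20% fraction estimated below; a full 28-day cycle gives a somewhat higher figure, which page deaths would pull back down.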
According to the birth rate of web pages and the refresh cycle of search engines, we estimate that between two successive crawls, new pages make up 15% to 20% of a search engine's repository. Here we consider a page to be new if it was not crawled in the last repository update.

PageRank was first introduced as an index of the probability that a page is visited by a web user who randomly and repeatedly jumps to a new page and follows its links onward. Nowadays Web users rely more and more on search engines to find useful information. What's more, users tend to choose the results with a high rank. Under such circumstances, the probability of a page being visited is much more tied to its current ranking. Many high-quality pages are ignored simply because they are too new and low-ranked to be noticed. Recent studies [2,3] warn that search engines prolong the time it takes new pages to reach the ranking they deserve. Now, with a dynamic web environment and highly search-engine-driven browsing habits, it is urgent to develop a new ranking method that alleviates the problem and offers new pages a chance to prove their values.

4 Our Promotion Method

We assume that search results are listed in batches with a fixed number of results per batch. Our idea is to display some relatively new results together with the mature ones in a friendly way, keeping the standard ordering of the older pages while promoting the new ones as well. The scheme in [6] promotes new pages at the cost of discarding several old pages. We believe, however, that inserting promoted pages into a consistent ranking is feasible and acceptable. Here are the questions we need to answer: How do we evaluate the initial rank of the new pages? How many pages should be promoted each time? Which pages should be promoted? How should the promoted pages be displayed?

Before we start our discussion, a concept proposed in [6] needs to be explained. The concept of community models the relationship between user queries and search engine results. If P is a set of pages related to a topic T, and U is a set of users interested in topic T, we say that the users U and pages P corresponding to topic T constitute a Web community. Under the assumption that queries and topics have a one-to-one relationship, each query returns the same set of pages from the corresponding community; the only thing that can vary is the order of the results. The concept of community is crucial when studying the effect of a ranking method via simulation, because real web query results are basically unstable. In our case, we divide the pages into categories according to their topics. Thus, the users looking into a particular category and all the pages within that category make up a community.

4.1 Initial Ranking of the New Pages

The initial ranking of the new pages is very important in selecting the pages to be promoted, for it is the basic knowledge we have about these pages. An uneven distribution of initial ranks biases the promotion toward a few pages, while an over-even distribution makes new pages of high and low quality indistinguishable. We assume that the original PageRank of these new pages has been calculated beforehand, which is reasonable and feasible for most search engines.
Their original PageRank is an indicative index of the popularity of the pages pointing to them, or of the site they belong to. Since they are new to the web environment, it is unlikely that pages from other sites will already spare outlinks to them. We can therefore assume that the distribution of the new pages' PageRank is similar to that of their parent pages.

4.2 Selection of Candidates

In Section 3 we discussed the proportion of new pages in a search engine's repository. We use the ratio of new pages to old ones to decide how many new pages should be promoted each time. We narrow our attention from the whole repository to a specific topic and the corresponding community of pages. Assuming that within a community the ratio of new pages to all pages is r and the number of pages listed in a batch is N, we choose the promotion quota, denoted N_p, so that it preserves the proportion of new pages within one batch. Ideally we want r = N_p / (N + N_p), which gives N_p = r * N / (1 - r).

Before explaining our selection scheme, we model the selection process as follows. All the new pages in the community constitute a set known as the candidate pool. Let P denote the candidate pool, with size M. Suppose the quota of a promotion is n; then selection is conducted n times on P until we have collected all n pages to be recommended. Once a new page is selected, it is removed from P. A straightforward selection scheme is randomized selection: each time, we pick a new page from the candidate pool uniformly at random. Given the diversity of new pages' quality, however, it would be disturbing and confusing to choose candidates without taking their quality into account. To balance treating all new pages impartially against maintaining the overall quality of results, we propose a second scheme, probabilistic selection, inspired by the study in [2]. J. Cho and S. Roy present an algorithm for visit popularity that takes the impact of search engines into consideration. We transplant this algorithm into our second selection scheme to model page-visit behavior. First we order all pages in P by descending initial rank. Let r_i be the sequence number of page i in P; the probability of page i being selected is:

  P(i) = c * r_i^(-3/2),  where  c = ( sum_{i=1}^{M} i^(-3/2) )^(-1)

4.3 Combination of Promotional Ranking

We propose two schemes for combining the pages selected from the candidate pool with the original results. Under either scheme, the ranking of the original results remains intact, and the selected pages queue in the order in which they were picked from the candidate pool. The only difference is how the two rankings are mixed. The first scheme is called implicit promotion. As discussed above, we insert N_p promoted pages into the original N results in a batch. First we randomly choose N_p positions in a sequence of length N + N_p; then we fill those positions, in order, with the N_p promoted pages.
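The quota calculation, probabilistic selection and implicit insertion described above can be sketched as follows. This is our illustrative reading of the scheme, not the authors' code; all function names are ours, and ranks in the shrinking pool are re-indexed after each draw as a simplification.

```python
import random

def promotion_quota(r: float, batch_size: int) -> int:
    """Quota N_p = r*N/(1-r), rounded, so promoted pages keep the
    new-page proportion r within a batch of N ordinary results."""
    return round(r * batch_size / (1.0 - r))

def probabilistic_select(pool: list, quota: int, rng=random) -> list:
    """Draw `quota` pages without replacement; the page at rank position
    r_i is chosen with probability proportional to r_i**(-3/2), following
    the visit model of Cho and Roy [2]. `pool` must be sorted by
    descending initial rank."""
    pool = list(pool)
    picked = []
    while pool and len(picked) < quota:
        weights = [(i + 1) ** -1.5 for i in range(len(pool))]
        x = rng.random() * sum(weights)
        for i, w in enumerate(weights):
            x -= w
            if x <= 0:
                picked.append(pool.pop(i))
                break
    return picked

def implicit_merge(originals: list, promoted: list, rng=random) -> list:
    """Scatter the promoted pages into randomly chosen slots of the batch
    while preserving the relative order of both lists."""
    total = len(originals) + len(promoted)
    slots = set(rng.sample(range(total), len(promoted)))
    merged, old, new = [], iter(originals), iter(promoted)
    for pos in range(total):
        merged.append(next(new) if pos in slots else next(old))
    return merged
```

With N = 10 and r = 15%, `promotion_quota` returns 2, matching the batch layout used in the experiments below.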
We name it implicit simply because the promoted pages are mixed with the original results, so when a user clicks on a page he is unaware whether it is already popular or promoted by our scheme. The other scheme is explicit promotion, under which users are informed of the status of the results. We append the N_p promoted pages after the N original results in each batch and clearly label them as new, leaving users the opportunity to decide whether or not to click on a promoted result.

5 Experiments

5.1 Experimental Setup

As discussed in Section 4, simulation is an unavoidable step in evaluating our method; in [7,6] simulations proved effective. In our simulation, we first build a website where visitors can retrieve pages from different categories sorted under different promotion modes. We then learn users' awareness and fondness of the pages from log analysis. The combination of 2 selection schemes and 2 display schemes produces 4 promotion modes: random-implicit-promotion, random-explicit-promotion, probabilistic-implicit-promotion and probabilistic-explicit-promotion. Together with the no-promotion mode, there are 5 modes in total to evaluate. Nevertheless, the quality of the promoted pages is unstable under random-implicit-promotion, which may disappoint users and even have a negative effect; meanwhile, probabilistic-implicit-promotion does not make good use of prior effort such as estimating the quality of new pages. For these reasons, and because of the limited popularity of our temporary web site (more modes would mean fewer users assigned to each mode; see below), we evaluate 3 modes: no-promotion, random-implicit-promotion and probabilistic-explicit-promotion. No-promotion serves as a baseline for comparison with the other two modes. Random-implicit-promotion can be viewed as an approximation of the random-shuffling method [6], except that each batch holds 12 results with 2 new ones, which makes for a fair comparison with probabilistic-explicit-promotion, one of the methods we propose.

Table 1. Correspondence between our simulation and the real web

  Our website                               Search engine website
  Photograph work                           Web page
  Photo preview                             Abstract of web page
  Photo information (title, author, date)   Page information (title, site name, last update)
  Category of photography                   Community of web pages

We establish a website composed of 6,912 Web pages, each containing a photographic work. All the photos were downloaded in March from a popular photography website, each with a smaller preview, an original rank and a brief introduction. The original photos were uploaded by their owners into corresponding categories and graded by visitors of the site. First, we chose six categories: architecture, essay, people, photojournalism, vehicle, and places. Then we downloaded the 400 to 700 most popular photos and 600 newly-uploaded ones from each category. To study the effect of our promotion method, visitors of our website ("users") are randomly assigned to one of the three groups on their first visit.
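One minimal way to implement this first-visit group assignment, with the IP-based persistence the logging setup relies on, is the following sketch (our illustration; the class and its names are ours):

```python
import random

class GroupAssigner:
    """Assign each first-time visitor to one display mode at random, then
    remember the mapping by IP so returning visitors see the same interface."""

    def __init__(self, modes, seed=None):
        self.modes = list(modes)
        self.by_ip = {}                 # IP address -> assigned mode
        self.rng = random.Random(seed)

    def group_for(self, ip: str) -> str:
        if ip not in self.by_ip:
            self.by_ip[ip] = self.rng.choice(self.modes)
        return self.by_ip[ip]
```

A shared IP (e.g. a campus NAT) maps several users to one group under this design, which is a known limitation of IP-keyed persistence.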
The user interface of each group differs in the display of results. We record the IP address and group number of every user, so that when a user visits our site from the same IP again, the corresponding interface is presented. Over a period of 47 days we attracted 455 visitors and recorded over 4,000 actions of viewing and rating the pictures. Table 1 shows the correspondence between our simulation and the real web environment.

Definition (democell): We call the preview plus the brief introduction of a photo a democell. A democell corresponds to the web page with a full view of the photo. Democells are the basic elements we use to display results to users.

After dividing the photos into two groups, already-popular and newly-uploaded, we assign initial ranks to both. An already-popular page's initial rank is its original rank, while a new page's is 20% of its original rank. This step widens the gap between the new pages and the already-popular ones. As discussed in Section 3, new pages constitute 15% to 20% of a community, and by the quota calculation of Section 4, N_p = r * N / (1 - r). With N = 10 and 15% <= r <= 20%, we set the promotion quota to 2. Under each category, one batch of results contains 12 democells listed in 6 rows of 2. We display 12 results per batch because search engines normally present 10 URLs per page and we promote 2 pages each time.

no-promotion: All democells, both popular and unpopular, are ranked according to their initial ranking.

random-implicit-promotion: We randomly select 10 photos from the candidate pool of 600 photos. They are implicitly inserted into the popular results in the first 5 batches, 2 per batch. The locations are chosen randomly from positions 2 to 12; we avoid the first position to preserve the most popular result.

probabilistic-explicit-promotion: The 10 new pages are selected with probability P(i) = c * r_i^(-3/2). They are displayed in the last row of each batch, labeled "newly discovered photos".

5.2 Results Analysis

After analyzing the log information, we use user clicks as an index of users' awareness of a page. Let G_i denote user group i (including only users with at least one click on some photo), U_j denote user j, C_k denote photo category k and P_l denote photo l. We use Probability-Of-Hit (POH) to estimate the chance that a user from group i visits a new page:

  POH_i = (1 / |G_i|) * sum_{U_j in G_i} v_j,   where v_j = 1 if user j has visited a new photo, 0 otherwise.

The chance that a user from group i visits a new page from category k is:

  POH_{i,k} = (1 / |G_{i,k}|) * sum_{U_j in G_{i,k}} v_{j,k},   where v_{j,k} = 1 if user j has visited a new photo of C_k, 0 otherwise.

By calculating both POH_i and POH_{i,k}, we can see how new pages become more accessible under our promotional ranking. From Figure 1 we see that both random-implicit promotion and probabilistic-explicit promotion are effective in making new pages more accessible. Furthermore, we notice that probabilistic-explicit promotion outperforms random-implicit promotion.
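The POH metric above reduces to a short computation over the click log. The following is a sketch; the log format (user id mapped to `(page_id, is_new)` clicks) is our assumption:

```python
def poh(visits_by_user: dict) -> float:
    """Probability-Of-Hit for one user group: the fraction of active users
    (those with at least one click) who visited at least one new page.
    visits_by_user maps user id -> list of (page_id, is_new) clicks."""
    active = {u: clicks for u, clicks in visits_by_user.items() if clicks}
    if not active:
        return 0.0
    hits = sum(1 for clicks in active.values()
               if any(is_new for _, is_new in clicks))
    return hits / len(active)
```

POH_{i,k} follows by first filtering each user's clicks down to category C_k before calling the same routine.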
There might be two reasons for this. First, probabilistic-explicit promotion may stimulate the user to click on the labeled new page. Second, the probabilistic candidate selection scheme is more sensitive to the potential quality of new pages, so the pages promoted probabilistically are inherently more attractive than those promoted randomly. We have also used other evaluation metrics, but only our key results are shown in this paper due to the page limit.

5.3 Sorting New Pages

The results in Section 5.2 demonstrate the advantage of probabilistic selection based on the potential quality of new pages. In the simulation above, we used the original user-given grades of the photos to produce their initial ranks. To discover the hidden quality of new pages, we conducted a further experiment evaluating a method of sorting new pages by estimating their PageRank.

Fig. 1. POH in each category (POH per community, for no-promotion, random-implicit promotion and probabilistic-explicit promotion)

Our estimation method originates from the idea that good (non-spamming) pages tend to link to pages of comparable quality. A new page p is likely to be of high quality if its siblings (pages sharing a parent with p) are of high quality. We therefore adopt ASP (the Average Siblings' PageRank) as an index for estimating the quality of new pages. Meanwhile, to avoid one or two parent pages with too many outlinks biasing the estimated value, we assign ACP (the Average Children's PageRank) to each page and calculate ASP via ACP:

  ACP(q) = ( sum_{q -> p} PR(p) ) / outdegree(q)
  ASP(p) = ( sum_{q -> p} ACP(q) ) / indegree(p)

To evaluate this method, we take a snapshot of 1,631,483 web pages (PS2) and compare it with another snapshot of the same set of pages taken 22 months earlier (PS1). First, we calculate ordinary PageRank on both sets; the PageRank computed on PS2 is taken as a measure of the inherent quality of the pages. Then we randomly pick 5 sets of new pages, each containing 160,000 pages of low PageRank with only 1 or 2 inlinks each, calculate ASP for each set, and sort the pages by ordinary PageRank and by ASP respectively. Let R_PS2(p) be the rank position of p among all new pages ordered by PageRank calculated on PS2, R_naive(p) the rank position of p among all new pages ordered by PageRank calculated on PS1, and R_ASP(p) the rank position of p among all new pages ordered by ASP. The performance evaluation functions are:

  F_naive(N) = average( R_naive(p) ) / (number of new pages),  over p such that R_PS2(p) <= N
  F_ASP(N)   = average( R_ASP(p) )  / (number of new pages),  over p such that R_PS2(p) <= N

We run the calculation of F_naive(N) and F_ASP(N) on the five sets of pages. The results share a similar pattern; Figure 2 presents the average of the five values.
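The ACP/ASP computation above can be sketched over a link graph as follows (our illustration; the adjacency-dict representation is an assumption):

```python
from collections import defaultdict

def estimate_asp(pagerank: dict, out_links: dict) -> dict:
    """ASP(p): average over p's parents q of ACP(q), where ACP(q) is the
    sum of PR over q's children divided by outdegree(q). Pages missing
    from `pagerank` (e.g. new pages) contribute 0 to ACP."""
    # ACP for every page q that has outgoing links.
    acp = {q: sum(pagerank.get(p, 0.0) for p in kids) / len(kids)
           for q, kids in out_links.items() if kids}
    # Invert the graph to find each page's parents.
    parents = defaultdict(list)
    for q, kids in out_links.items():
        for p in kids:
            parents[p].append(q)
    # ASP = average of the parents' ACP values.
    return {p: sum(acp[q] for q in qs) / len(qs) for p, qs in parents.items()}
```

Because every parent divides its children's PageRank mass by its outdegree first, a single hub with thousands of outlinks cannot dominate a new page's estimate, which is exactly the bias ACP is introduced to avoid.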
Clearly, ordinary PageRank hardly conveys the potential of new pages, while ASP does upgrade the rank positions of high-quality pages. Owing to the coarseness of our selection of new pages in this experiment, however, the results may not be as good as we expected. We hope to carry out further work to evaluate the ASP algorithm.
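The evaluation functions F_naive and F_ASP defined above reduce to a single scoring routine; the following is a sketch under our assumed data layout (dicts mapping page to rank position), with lower scores meaning the method places the truly good pages nearer the top:

```python
def f_score(rank_by_method: dict, rank_by_ps2: dict, cutoff: int) -> float:
    """F(N): among pages that the later snapshot PS2 (taken as ground-truth
    quality) ranks in its top N, the average rank position assigned by the
    method, normalized by the number of new pages. Lower is better."""
    top = [p for p, r in rank_by_ps2.items() if r <= cutoff]
    if not top:
        return 0.0
    avg = sum(rank_by_method[p] for p in top) / len(top)
    return avg / len(rank_by_method)
```

For example, a method that exactly reproduces the PS2 ordering scores the minimum possible value at every cutoff, while one that reverses it scores near 1.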

Fig. 2. Performance of ASP (F(N) for F_naive and F_ASP; y-axis: F(N); x-axis: N)

6 Conclusion and Future Work

We propose a promotional ranking scheme that gives new pages a chance to prove their values. The experimental results show that our methods really improve both the result quality and the probability of new pages being noticed. Since testing on a real commercial search engine is infeasible, we conduct a simulation for the evaluation. Because a search engine is more like a recommendation system retrieving fuzzy answers than a QA system offering precise answers, we believe our simulation, though primitive, has merit in demonstrating our method. In Section 4.1 we assume that the distribution of the new pages' PageRank is similar to that of their parent pages; this eases the sorting of new pages but may cause popular sites to become still more popular. Our future work will focus on finding less biased ways of ranking new pages. We also look forward to upgrading our simulation by using real web pages instead of photographic works.

References

1. R. Baeza-Yates, F. Saint-Jean, and C. Castillo. Web structure, dynamics and page quality. In Proc. String Processing and Information Retrieval (SPIRE).
2. J. Cho and S. Roy. Impact of search engines on page popularity. In WWW 2004, May 2004.
3. J. Cho, S. Roy, and R. E. Adams. Page quality: In search of an unbiased web ranking. In SIGMOD 2005, June 2005.
4. A. Ntoulas, J. Cho, H. K. Cho, H. Cho, and Y.-J. Cho. A study on the evolution of the web. In the 2005 UKC Conference, August 2005.
5. A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In WWW 2004, May 2004.
6. S. Pandey, S. Roy, C. Olston, J. Cho, and S. Chakrabarti. Shuffling a stacked deck: The case for partially randomized ranking of search engine results. In VLDB 2005, August 2005.
7. F. Qiu and J. Cho. Automatic identification of user interest for personalized search. In WWW 2006, May 2006.
8. C. Sherman. Meet the search engines.
9. D. Sullivan. Searches per day.

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

THE REAL ROOT CAUSES OF BREACHES. Security and IT Pros at Odds Over AppSec

THE REAL ROOT CAUSES OF BREACHES. Security and IT Pros at Odds Over AppSec THE REAL ROOT CAUSES OF BREACHES Security and IT Pros at Odds Over AppSec EXECUTIVE SUMMARY Breaches still happen, even with today s intense focus on security. According to Verizon s 2016 Data Breach Investigation

More information

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho,

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho, Parallel Crawlers Junghoo Cho, Hector Garcia-Molina Stanford University {cho, hector}@cs.stanford.edu Abstract In this paper we study how we can design an effective parallel crawler. As the size of the

More information

Personalized Tour Planning System Based on User Interest Analysis

Personalized Tour Planning System Based on User Interest Analysis Personalized Tour Planning System Based on User Interest Analysis Benyu Zhang 1 Wenxin Li 1,2 and Zhuoqun Xu 1 1 Department of Computer Science & Technology Peking University, Beijing, China E-Mail: {zhangby,

More information

Discovering Information through Summon:

Discovering Information through Summon: Discovering Information through Summon: An Analysis of User Search Strategies and Search Success Ingrid Hsieh-Yee Professor, Dept. of Library and Information Science, Catholic University of America Shanyun

More information

Simulation Study of Language Specific Web Crawling

Simulation Study of Language Specific Web Crawling DEWS25 4B-o1 Simulation Study of Language Specific Web Crawling Kulwadee SOMBOONVIWAT Takayuki TAMURA, and Masaru KITSUREGAWA Institute of Industrial Science, The University of Tokyo Information Technology

More information

A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho Stanford InterLib Technologies Information Overload Service Heterogeneity Interoperability Economic Concerns Information

More information

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang Microsoft Research, Asia School of EECS, Peking University Ordering Policies for Web Crawling Ordering policy To prioritize the URLs in a crawling queue

More information

Deep Web Crawling and Mining for Building Advanced Search Application

Deep Web Crawling and Mining for Building Advanced Search Application Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Conclusions. Chapter Summary of our contributions

Conclusions. Chapter Summary of our contributions Chapter 1 Conclusions During this thesis, We studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web

More information

Inverted List Caching for Topical Index Shards

Inverted List Caching for Topical Index Shards Inverted List Caching for Topical Index Shards Zhuyun Dai and Jamie Callan Language Technologies Institute, Carnegie Mellon University {zhuyund, callan}@cs.cmu.edu Abstract. Selective search is a distributed

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

NeighborWatcher: A Content-Agnostic Comment Spam Inference System

NeighborWatcher: A Content-Agnostic Comment Spam Inference System NeighborWatcher: A Content-Agnostic Comment Spam Inference System Jialong Zhang and Guofei Gu Secure Communication and Computer Systems Lab Department of Computer Science & Engineering Texas A&M University

More information

LET:Towards More Precise Clustering of Search Results

LET:Towards More Precise Clustering of Search Results LET:Towards More Precise Clustering of Search Results Yi Zhang, Lidong Bing,Yexin Wang, Yan Zhang State Key Laboratory on Machine Perception Peking University,100871 Beijing, China {zhangyi, bingld,wangyx,zhy}@cis.pku.edu.cn

More information

Efficient Crawling Through Dynamic Priority of Web Page in Sitemap

Efficient Crawling Through Dynamic Priority of Web Page in Sitemap Efficient Through Dynamic Priority of Web Page in Sitemap Rahul kumar and Anurag Jain Department of CSE Radharaman Institute of Technology and Science, Bhopal, M.P, India ABSTRACT A web crawler or automatic

More information

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrative. Web crawlers. Web Crawlers and Link Analysis! Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt

More information

CS6200 Information Retreival. Crawling. June 10, 2015

CS6200 Information Retreival. Crawling. June 10, 2015 CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

ONLINE EVALUATION FOR: Company Name

ONLINE EVALUATION FOR: Company Name ONLINE EVALUATION FOR: Company Name Address Phone URL media advertising design P.O. Box 2430 Issaquah, WA 98027 (800) 597-1686 platypuslocal.com SUMMARY A Thank You From Platypus: Thank you for purchasing

More information

VisoLink: A User-Centric Social Relationship Mining

VisoLink: A User-Centric Social Relationship Mining VisoLink: A User-Centric Social Relationship Mining Lisa Fan and Botang Li Department of Computer Science, University of Regina Regina, Saskatchewan S4S 0A2 Canada {fan, li269}@cs.uregina.ca Abstract.

More information

NON-CENTRALIZED DISTINCT L-DIVERSITY

NON-CENTRALIZED DISTINCT L-DIVERSITY NON-CENTRALIZED DISTINCT L-DIVERSITY Chi Hong Cheong 1, Dan Wu 2, and Man Hon Wong 3 1,3 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong {chcheong, mhwong}@cse.cuhk.edu.hk

More information

A Cost-Aware Strategy for Query Result Caching in Web Search Engines

A Cost-Aware Strategy for Query Result Caching in Web Search Engines A Cost-Aware Strategy for Query Result Caching in Web Search Engines Ismail Sengor Altingovde, Rifat Ozcan, and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,rozcan,oulusoy}@cs.bilkent.edu.tr

More information

A P2P-based Incremental Web Ranking Algorithm

A P2P-based Incremental Web Ranking Algorithm A P2P-based Incremental Web Ranking Algorithm Sumalee Sangamuang Pruet Boonma Juggapong Natwichai Computer Engineering Department Faculty of Engineering, Chiang Mai University, Thailand sangamuang.s@gmail.com,

More information

ihits: Extending HITS for Personal Interests Profiling

ihits: Extending HITS for Personal Interests Profiling ihits: Extending HITS for Personal Interests Profiling Ziming Zhuang School of Information Sciences and Technology The Pennsylvania State University zzhuang@ist.psu.edu Abstract Ever since the boom of

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Introduction to List Building. Introduction to List Building

Introduction to  List Building. Introduction to  List Building Introduction to Email List Building Introduction to Email List Building 1 Table of Contents Introduction... 3 What is email list building?... 5 Permission-based email marketing vs. spam...6 How to build

More information

Web Crawling As Nonlinear Dynamics

Web Crawling As Nonlinear Dynamics Progress in Nonlinear Dynamics and Chaos Vol. 1, 2013, 1-7 ISSN: 2321 9238 (online) Published on 28 April 2013 www.researchmathsci.org Progress in Web Crawling As Nonlinear Dynamics Chaitanya Raveendra

More information

Ranking web pages using machine learning approaches

Ranking web pages using machine learning approaches University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 Ranking web pages using machine learning approaches Sweah Liang Yong

More information

Search Quality. Jan Pedersen 10 September 2007

Search Quality. Jan Pedersen 10 September 2007 Search Quality Jan Pedersen 10 September 2007 Outline The Search Landscape A Framework for Quality RCFP Search Engine Architecture Detailed Issues 2 Search Landscape 2007 Source: Search Engine Watch: US

More information

Finding a needle in Haystack: Facebook's photo storage

Finding a needle in Haystack: Facebook's photo storage Finding a needle in Haystack: Facebook's photo storage The paper is written at facebook and describes a object storage system called Haystack. Since facebook processes a lot of photos (20 petabytes total,

More information

Are people biased in their use of search engines? Keane, Mark T.; O'Brien, Maeve; Smyth, Barry. Communications of the ACM, 51 (2): 49-52

Are people biased in their use of search engines? Keane, Mark T.; O'Brien, Maeve; Smyth, Barry. Communications of the ACM, 51 (2): 49-52 Provided by the author(s) and University College Dublin Library in accordance with publisher policies. Please cite the published version when available. Title Are people biased in their use of search engines?

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL., NO., DEC User Action Interpretation for Online Content Optimization

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL., NO., DEC User Action Interpretation for Online Content Optimization IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL., NO., DEC 2011 1 User Action Interpretation for Online Content Optimization Jiang Bian, Anlei Dong, Xiaofeng He, Srihari Reddy, and Yi Chang Abstract

More information

Supervised Web Forum Crawling

Supervised Web Forum Crawling Supervised Web Forum Crawling 1 Priyanka S. Bandagale, 2 Dr. Lata Ragha 1 Student, 2 Professor and HOD 1 Computer Department, 1 Terna college of Engineering, Navi Mumbai, India Abstract - In this paper,

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

World Wide Web has specific challenges and opportunities

World Wide Web has specific challenges and opportunities 6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has

More information

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Mukesh Kumar and Renu Vig University Institute of Engineering and Technology, Panjab University, Chandigarh,

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Effective Performance of Information Retrieval by using Domain Based Crawler

Effective Performance of Information Retrieval by using Domain Based Crawler Effective Performance of Information Retrieval by using Domain Based Crawler Sk.Abdul Nabi 1 Department of CSE AVN Inst. Of Engg.& Tech. Hyderabad, India Dr. P. Premchand 2 Dean, Faculty of Engineering

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Learning the Three Factors of a Non-overlapping Multi-camera Network Topology

Learning the Three Factors of a Non-overlapping Multi-camera Network Topology Learning the Three Factors of a Non-overlapping Multi-camera Network Topology Xiaotang Chen, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

SEO: SEARCH ENGINE OPTIMISATION

SEO: SEARCH ENGINE OPTIMISATION SEO: SEARCH ENGINE OPTIMISATION SEO IN 11 BASIC STEPS EXPLAINED What is all the commotion about this SEO, why is it important? I have had a professional content writer produce my content to make sure that

More information

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann Search Engines Chapter 8 Evaluating Search Engines 9.7.2009 Felix Naumann Evaluation 2 Evaluation is key to building effective and efficient search engines. Drives advancement of search engines When intuition

More information

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

Ranking Web Pages by Associating Keywords with Locations

Ranking Web Pages by Associating Keywords with Locations Ranking Web Pages by Associating Keywords with Locations Peiquan Jin, Xiaoxiang Zhang, Qingqing Zhang, Sheng Lin, and Lihua Yue University of Science and Technology of China, 230027, Hefei, China jpq@ustc.edu.cn

More information

Yammer Product Manager Homework: LinkedІn Endorsements

Yammer Product Manager Homework: LinkedІn Endorsements BACKGROUND: Location: Mountain View, CA Industry: Social Networking Users: 300 Million PART 1 In September 2012, LinkedIn introduced the endorsements feature, which gives its users the ability to give

More information

An Algorithm of Parking Planning for Smart Parking System

An Algorithm of Parking Planning for Smart Parking System An Algorithm of Parking Planning for Smart Parking System Xuejian Zhao Wuhan University Hubei, China Email: xuejian zhao@sina.com Kui Zhao Zhejiang University Zhejiang, China Email: zhaokui@zju.edu.cn

More information

Privacy Protection in Personalized Web Search with User Profile

Privacy Protection in Personalized Web Search with User Profile Privacy Protection in Personalized Web Search with User Profile Prateek C. Shukla 1,Tekchand D. Patil 2, Yogeshwar J. Shirsath 3,Dnyaneshwar N. Rasal 4 1,2,3,4, (I.T. Dept.,B.V.C.O.E.&R.I. Anjaneri,university.Pune,

More information

Scale Free Network Growth By Ranking. Santo Fortunato, Alessandro Flammini, and Filippo Menczer

Scale Free Network Growth By Ranking. Santo Fortunato, Alessandro Flammini, and Filippo Menczer Scale Free Network Growth By Ranking Santo Fortunato, Alessandro Flammini, and Filippo Menczer Motivation Network growth is usually explained through mechanisms that rely on node prestige measures, such

More information

Fast and Effective Interpolation Using Median Filter

Fast and Effective Interpolation Using Median Filter Fast and Effective Interpolation Using Median Filter Jian Zhang 1, *, Siwei Ma 2, Yongbing Zhang 1, and Debin Zhao 1 1 Department of Computer Science, Harbin Institute of Technology, Harbin 150001, P.R.

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Value of YouTube to the music industry Paper V Direct value to the industry

Value of YouTube to the music industry Paper V Direct value to the industry Value of YouTube to the music industry Paper V Direct value to the industry June 2017 RBB Economics 1 1 Introduction The music industry has undergone significant change over the past few years, with declining

More information

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites An Empirical Evaluation of User Interfaces for Topic Management of Web Sites Brian Amento AT&T Labs - Research 180 Park Avenue, P.O. Box 971 Florham Park, NJ 07932 USA brian@research.att.com ABSTRACT Topic

More information

The data quality trends report

The data quality trends report Report The 2015 email data quality trends report How organizations today are managing and using email Table of contents: Summary...1 Research methodology...1 Key findings...2 Email collection and database

More information

The tracing tool in SQL-Hero tries to deal with the following weaknesses found in the out-of-the-box SQL Profiler tool:

The tracing tool in SQL-Hero tries to deal with the following weaknesses found in the out-of-the-box SQL Profiler tool: Revision Description 7/21/2010 Original SQL-Hero Tracing Introduction Let s start by asking why you might want to do SQL tracing in the first place. As it turns out, this can be an extremely useful activity

More information

International Journal of Advancements in Research & Technology, Volume 2, Issue 6, June ISSN

International Journal of Advancements in Research & Technology, Volume 2, Issue 6, June ISSN International Journal of Advancements in Research & Technology, Volume 2, Issue 6, June-2013 159 Re-ranking the Results Based on user profile. Email: anuradhakale20@yahoo.com Anuradha R. Kale, Prof. V.T.

More information

Image Classification Using Wavelet Coefficients in Low-pass Bands

Image Classification Using Wavelet Coefficients in Low-pass Bands Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August -7, 007 Image Classification Using Wavelet Coefficients in Low-pass Bands Weibao Zou, Member, IEEE, and Yan

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

An Efficient XML Index Structure with Bottom-Up Query Processing

An Efficient XML Index Structure with Bottom-Up Query Processing An Efficient XML Index Structure with Bottom-Up Query Processing Dong Min Seo, Jae Soo Yoo, and Ki Hyung Cho Department of Computer and Communication Engineering, Chungbuk National University, 48 Gaesin-dong,

More information

Top 3 Marketing Metrics You Should Measure in Google Analytics

Top 3 Marketing Metrics You Should Measure in Google Analytics Top 3 Marketing Metrics You Should Measure in Google Analytics Presented By Table of Contents Overview 3 How to Use This Knowledge Brief 3 Metric to Measure: Traffic 4 Direct (Acquisition > All Traffic

More information

Understanding User Operations on Web Page in WISE 1

Understanding User Operations on Web Page in WISE 1 Understanding User Operations on Web Page in WISE 1 Hongyan Li, Ming Xue, Jianjun Wang, Shiwei Tang, and Dongqing Yang National Laboratory on Machine Perception, School of Electronics Engineering and Computer

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 18 October 2017 Some slides courtesy Croft et al. Web Crawler Finds and downloads web pages automatically provides the collection

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information