Promotional Ranking of Search Engine Results: Giving New Web Pages a Chance to Prove Their Values


Yizhen Zhu 1, Mingda Wu 1, Yan Zhang 1,*, and Xiaoming Li 2

1 National Laboratory on Machine Perception, Peking Univ., Beijing, China {zhuyz,wumd,zhy}@cis.pku.edu.cn
2 Department of Computer Science and Technology, Peking Univ., Beijing, China lxm@pku.edu.cn

Abstract. Recent studies show that link-structure-based measures of Web page popularity prolong the time it takes new pages to reach the ranking they deserve. In this paper we propose a promotional ranking scheme that offers newly-created pages an opportunity to be recognized. We conduct a simulation to evaluate our method. Experimental results show that our method remarkably raises the probability that new pages gain user awareness.

1 Introduction

As the Web grows dramatically, search engines have become an indispensable tool for accessing online information. According to [9], as of April 2006 Google and Yahoo each received over 150 million search requests per day. Because of the enormous number of matching results, people are normally interested only in the few top URLs on the list. This preference makes currently-popular pages even more popular and entrenches newly-built ones more deeply. Ideally, search engines should present results ordered by quality. The quality of a web page, however, is too subjective to measure directly, which compels search engines to turn to substitutes for quality. Popularity is one of the most widely used substitutes; it can be measured directly by the number of in-links and clicks, by PageRank, and so on. Yet researchers have found that the accumulation of popularity depends on a page's current popularity [2], so search engines do inhibit the discovery of new pages.
On the other hand, the real web environment is highly dynamic, with high rates of birth, death and replacement of both web pages and the hyperlinks between them [4]. The query results of search engines, however, remain relatively stale under ranking schemes such as PageRank, in-degree, or number of visits. The primary motivation of our work is to solve this problem: to offer newly-created Web pages an opportunity to be noticed and visited. Section 2 is devoted to related work. Section 3 examines web evolution from the perspective of a search engine. Section 4 describes our method step by step. Section 5 presents the details of our experiments.

* Corresponding author. Supported by NSFC grants, a Natural Science Foundation of Beijing grant, and a Guangdong SCUT Key Lab Open Grant.

G. Dong et al. (Eds.): APWeb/WAIM 2007, LNCS 4505, © Springer-Verlag Berlin Heidelberg 2007

2 Related Work

A. Ntoulas et al. found that existing pages are removed from the Web and replaced by new ones at a very rapid rate [5,4]. In [1] the authors showed several relations between the macro structure of the web, the age of pages and sites, and their quality. J. Cho et al. studied the impact of search engines on page popularity by introducing two surfer models [2]. They estimated that when a search engine ranks pages by popularity, it takes several orders of magnitude more time for a new page to become popular, even if it is of high quality. In [3] J. Cho et al. further investigated the problem of page quality; they proposed a reasonable definition of page quality and derived a practical way of estimating it. The biggest difference between our method and [6] is that their scheme faces an inevitable trade-off between exploration and exploitation. We doubt whether it is safe to substitute a new page of unpredictable quality for a popular result.

3 The Evolution of the Web from the Perspective of a Search Engine

The web is very dynamic, with high rates of birth and death. According to [4], new pages are created at a rate of 8% per week and only 40% of pages survive one year. As for links, 25% new links are created per week. We conducted our own survey of the turnover rate by retrieving a historical data set of URLs collected two years earlier. We found that only about one third of the URLs are still available, which coincides with the conclusions in [4]. However, because of the limits of disk space, bandwidth and cost, it is impossible for a search engine to update its repository frequently enough to keep it consistent with the real web. According to [8], Google refreshes its repository and PageRank every 28 days, and other search engines refresh theirs on cycles of a week to a month.
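As a rough illustration (ours, not the paper's), the figures above suggest a quick estimate of how much of a repository is new between crawls, assuming the 8% weekly birth rate compounds between refreshes and ignoring page deaths:

```python
def new_page_fraction(weekly_birth_rate: float, weeks_between_crawls: float) -> float:
    """Fraction of the repository that is new since the last crawl,
    assuming births compound weekly and ignoring page deaths."""
    grown = (1.0 + weekly_birth_rate) ** weeks_between_crawls
    return 1.0 - 1.0 / grown

# Refresh cycles of two to four weeks under an 8% weekly birth rate:
for weeks in (2, 3, 4):
    print(f"{weeks} weeks: {new_page_fraction(0.08, weeks):.1%}")
```

For cycles of two to three weeks this crude model lands near the 15% to 20% fraction estimated below; a full 28-day cycle gives a somewhat higher figure, which page deaths would pull back down.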
According to the birth rate of web pages and the refresh cycle of search engines, we estimate that between two successive crawls, new pages make up 15% to 20% of a search engine's repository. Here we consider a page to be new if it was not crawled in the last repository update.

PageRank was first introduced as an index of the probability that a page is visited by a web user who randomly and repeatedly jumps to a new page and follows its links onward. Nowadays Web users rely more and more on search engines to find useful information. What's more, users tend to choose the results with a high rank. Under such circumstances, the probability of a page being visited is much more tied to its current ranking. Many high-quality pages are ignored simply because they are too new and low-ranked to be noticed. Recent studies [2,3] warn that search engines prolong the time it takes new pages to reach the ranking they deserve. Now, with a dynamic web environment and highly search-engine-driven browsing habits, it is urgent to develop a new ranking method that alleviates the problem and offers new pages a chance to prove their values.

4 Our Promotion Method

We assume that search results are listed in batches with a fixed number of results per batch. Our idea is to display some relatively new results together with the mature ones in a friendly way, keeping the standard ordering of the older pages while promoting the new ones as well. The scheme in [6] promotes new pages at the cost of discarding several old pages. We believe, however, that inserting promoted pages into a consistent ranking is feasible and acceptable. Here are the questions we need to answer: How do we evaluate the initial rank of the new pages? How many pages should be promoted each time? Which pages should be promoted? How should the promoted pages be displayed?

Before we start our discussion, a concept proposed in [6] needs to be explained. The concept of community models the relationship between user queries and search engine results. If P is a set of pages related to a topic T, and U is a set of users interested in topic T, we say that the users U and pages P corresponding to topic T constitute a Web community. Under the assumption that queries and topics have a one-to-one relationship, each query returns the same set of pages from the corresponding community; the only thing that can vary is the order of the results. The concept of community is crucial when studying the effect of a ranking method via simulation, because real web query results are basically unstable. In our case, we divide the pages into categories according to their topics. Thus, the users looking into a particular category and all the pages within that category make up a community.

4.1 Initial Ranking of the New Pages

The initial ranking of the new pages is very important in selecting the pages to be promoted, for it is the basic knowledge we have about these pages. An uneven distribution of initial ranks biases the promotion toward a few pages, while an over-even distribution makes new pages of high and low quality indistinguishable. We assume that the original PageRank of these new pages has been calculated beforehand, which is reasonable and feasible for most search engines.
Their original PageRank is an indicative index of the popularity of the pages pointing to them, or of the site they belong to. Since they are new to the web environment, it is unlikely that pages from other sites will already spare outlinks to them. We can therefore assume that the distribution of the new pages' PageRank is similar to that of their parent pages.

4.2 Selection of Candidates

In Section 3 we discussed the proportion of new pages in a search engine's repository. We use the ratio of new pages to old ones to decide how many new pages should be promoted each time. We narrow our attention from the whole repository to a specific topic and the corresponding community of pages. Assuming that within a community the ratio of new pages to all pages is r and the number of pages listed in a batch is N, we choose the promotion quota, denoted N_p, so that it preserves the proportion of new pages within one batch. Ideally we want r = N_p / (N + N_p), which gives N_p = r * N / (1 - r).

Before explaining our selection scheme, we model the selection process as follows. All the new pages in the community constitute a set known as the candidate pool. Let P denote the candidate pool, with size M. Suppose the quota of a promotion is n; then selection is conducted n times on P until we have collected all n pages to be recommended. Once a new page is selected, it is removed from P. A straightforward selection scheme is randomized selection: each time, we pick a new page from the candidate pool uniformly at random. Given the diversity of new pages' quality, however, it would be disturbing and confusing to choose candidates without taking their quality into account. To balance treating all new pages impartially against maintaining the overall quality of results, we propose a second scheme, probabilistic selection, inspired by the study in [2]. J. Cho and S. Roy present an algorithm for visit popularity that takes the impact of search engines into consideration. We transplant this algorithm into our second selection scheme to model page-visit behavior. First we order all pages in P by descending initial rank. Let r_i be the sequence number of page i in P; the probability of page i being selected is:

  P(i) = c * r_i^(-3/2),  where  c = ( sum_{i=1}^{M} i^(-3/2) )^(-1)

4.3 Combination of Promotional Ranking

We propose two schemes for combining the pages selected from the candidate pool with the original results. Under either scheme, the ranking of the original results remains intact, and the selected pages queue in the order in which they were picked from the candidate pool. The only difference is how the two rankings are mixed. The first scheme is called implicit promotion. As discussed above, we insert N_p promoted pages into the original N results in a batch. First we randomly choose N_p positions in a sequence of length N + N_p; then we fill those positions, in order, with the N_p promoted pages.
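The quota calculation, probabilistic selection and implicit insertion described above can be sketched as follows. This is our illustrative reading of the scheme, not the authors' code; all function names are ours, and ranks in the shrinking pool are re-indexed after each draw as a simplification.

```python
import random

def promotion_quota(r: float, batch_size: int) -> int:
    """Quota N_p = r*N/(1-r), rounded, so promoted pages keep the
    new-page proportion r within a batch of N ordinary results."""
    return round(r * batch_size / (1.0 - r))

def probabilistic_select(pool: list, quota: int, rng=random) -> list:
    """Draw `quota` pages without replacement; the page at rank position
    r_i is chosen with probability proportional to r_i**(-3/2), following
    the visit model of Cho and Roy [2]. `pool` must be sorted by
    descending initial rank."""
    pool = list(pool)
    picked = []
    while pool and len(picked) < quota:
        weights = [(i + 1) ** -1.5 for i in range(len(pool))]
        x = rng.random() * sum(weights)
        for i, w in enumerate(weights):
            x -= w
            if x <= 0:
                picked.append(pool.pop(i))
                break
    return picked

def implicit_merge(originals: list, promoted: list, rng=random) -> list:
    """Scatter the promoted pages into randomly chosen slots of the batch
    while preserving the relative order of both lists."""
    total = len(originals) + len(promoted)
    slots = set(rng.sample(range(total), len(promoted)))
    merged, old, new = [], iter(originals), iter(promoted)
    for pos in range(total):
        merged.append(next(new) if pos in slots else next(old))
    return merged
```

With N = 10 and r = 15%, `promotion_quota` returns 2, matching the batch layout used in the experiments below.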
We name it implicit simply because the promoted pages are mixed with the original results, so when a user clicks on a page he is unaware whether it is already popular or promoted by our scheme. The other scheme is explicit promotion, under which users are informed of the status of the results. We append the N_p promoted pages after the N original results in each batch and clearly label them as new, leaving users the opportunity to decide whether or not to click on a promoted result.

5 Experiments

5.1 Experimental Setup

As discussed in Section 4, simulation is an unavoidable step in evaluating our method; in [7,6] simulations proved effective. In our simulation, we first build a website where visitors can retrieve pages from different categories sorted under different promotion modes. We then learn users' awareness and fondness of the pages from log analysis. The combination of 2 selection schemes and 2 display schemes produces 4 promotion modes: random-implicit-promotion, random-explicit-promotion, probabilistic-implicit-promotion and probabilistic-explicit-promotion. Together with the no-promotion mode, there are 5 modes in total to evaluate. Nevertheless, the quality of the promoted pages is unstable under random-implicit-promotion, which may disappoint users and even have a negative effect; meanwhile, probabilistic-implicit-promotion does not make good use of prior effort such as estimating the quality of new pages. For these reasons, and because of the limited popularity of our temporary web site (more modes would mean fewer users assigned to each mode; see below), we evaluate 3 modes: no-promotion, random-implicit-promotion and probabilistic-explicit-promotion. No-promotion serves as a baseline for comparison with the other two modes. Random-implicit-promotion can be viewed as an approximation of the random-shuffling method [6], except that each batch holds 12 results with 2 new ones, which makes for a fair comparison with probabilistic-explicit-promotion, one of the methods we propose.

Table 1. Correspondence between our simulation and the real web

  Our website                               Search engine website
  Photograph work                           Web page
  Photo preview                             Abstract of web page
  Photo information (title, author, date)   Page information (title, site name, last update)
  Category of photography                   Community of web pages

We establish a website composed of 6,912 Web pages, each containing a photographic work. All the photos were downloaded in March from a popular photography website, each with a smaller preview, an original rank and a brief introduction. The original photos were uploaded by their owners into corresponding categories and graded by visitors of the site. First, we chose six categories: architecture, essay, people, photojournalism, vehicle, and places. Then we downloaded the 400 to 700 most popular photos and 600 newly-uploaded ones from each category. To study the effect of our promotion method, visitors of our website ("users") are randomly assigned to one of the three groups on their first visit.
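One minimal way to implement this first-visit group assignment, with the IP-based persistence the logging setup relies on, is the following sketch (our illustration; the class and its names are ours):

```python
import random

class GroupAssigner:
    """Assign each first-time visitor to one display mode at random, then
    remember the mapping by IP so returning visitors see the same interface."""

    def __init__(self, modes, seed=None):
        self.modes = list(modes)
        self.by_ip = {}                 # IP address -> assigned mode
        self.rng = random.Random(seed)

    def group_for(self, ip: str) -> str:
        if ip not in self.by_ip:
            self.by_ip[ip] = self.rng.choice(self.modes)
        return self.by_ip[ip]
```

A shared IP (e.g. a campus NAT) maps several users to one group under this design, which is a known limitation of IP-keyed persistence.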
The user interface of each group differs in the display of results. We record the IP address and group number of every user, so that when a user visits our site from the same IP again, the corresponding interface is presented. Over a period of 47 days we attracted 455 visitors and recorded over 4,000 actions of viewing and rating the pictures. Table 1 shows the correspondence between our simulation and the real web environment.

Definition (democell): We call the preview plus the brief introduction of a photo a democell. A democell corresponds to the web page with a full view of the photo. Democells are the basic elements we use to display results to users.

After dividing the photos into two groups, already-popular and newly-uploaded, we assign initial ranks to both. An already-popular page's initial rank is its original rank, while a new page's is 20% of its original rank. This step widens the gap between the new pages and the already-popular ones. As discussed in Section 3, new pages constitute 15% to 20% of a community, and by the quota calculation of Section 4, N_p = r * N / (1 - r). With N = 10 and 15% <= r <= 20%, we set the promotion quota to 2. Under each category, one batch of results contains 12 democells listed in 6 rows of 2. We display 12 results per batch because search engines normally present 10 URLs per page and we promote 2 pages each time.

no-promotion: All democells, both popular and unpopular, are ranked according to their initial ranking.

random-implicit-promotion: We randomly select 10 photos from the candidate pool of 600 photos. They are implicitly inserted into the popular results in the first 5 batches, 2 per batch. The locations are chosen randomly from positions 2 to 12; we avoid the first position to preserve the most popular result.

probabilistic-explicit-promotion: The 10 new pages are selected with probability P(i) = c * r_i^(-3/2). They are displayed in the last row of each batch, labeled "newly discovered photos".

5.2 Results Analysis

After analyzing the log information, we use user clicks as an index of users' awareness of a page. Let G_i denote user group i (including only users with at least one click on some photo), U_j denote user j, C_k denote photo category k and P_l denote photo l. We use Probability-Of-Hit (POH) to estimate the chance that a user from group i visits a new page:

  POH_i = (1 / |G_i|) * sum_{U_j in G_i} v_j,   where v_j = 1 if user j has visited a new photo, 0 otherwise.

The chance that a user from group i visits a new page from category k is:

  POH_{i,k} = (1 / |G_{i,k}|) * sum_{U_j in G_{i,k}} v_{j,k},   where v_{j,k} = 1 if user j has visited a new photo of C_k, 0 otherwise.

By calculating both POH_i and POH_{i,k}, we can see how new pages become more accessible under our promotional ranking. From Figure 1 we see that both random-implicit promotion and probabilistic-explicit promotion are effective in making new pages more accessible. Furthermore, we notice that probabilistic-explicit promotion outperforms random-implicit promotion.
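The POH metric above reduces to a short computation over the click log. The following is a sketch; the log format (user id mapped to `(page_id, is_new)` clicks) is our assumption:

```python
def poh(visits_by_user: dict) -> float:
    """Probability-Of-Hit for one user group: the fraction of active users
    (those with at least one click) who visited at least one new page.
    visits_by_user maps user id -> list of (page_id, is_new) clicks."""
    active = {u: clicks for u, clicks in visits_by_user.items() if clicks}
    if not active:
        return 0.0
    hits = sum(1 for clicks in active.values()
               if any(is_new for _, is_new in clicks))
    return hits / len(active)
```

POH_{i,k} follows by first filtering each user's clicks down to category C_k before calling the same routine.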
There might be two reasons for this. First, probabilistic-explicit promotion may stimulate the user to click on the labeled new page. Second, the probabilistic candidate selection scheme is more sensitive to the potential quality of new pages, so the pages promoted probabilistically are inherently more attractive than those promoted randomly. We have also used other evaluation metrics, but only our key results are shown in this paper due to the page limit.

5.3 Sorting New Pages

The results in Section 5.2 demonstrate the advantage of probabilistic selection based on the potential quality of new pages. In the simulation above, we used the original user-given grades of the photos to produce their initial ranks. To discover the hidden quality of new pages, we conducted a further experiment evaluating a method of sorting new pages by estimating their PageRank.

Fig. 1. POH in each category (POH per community, for no-promotion, random-implicit promotion and probabilistic-explicit promotion)

Our estimation method originates from the idea that good (non-spamming) pages tend to link to pages of comparable quality. A new page p is likely to be of high quality if its siblings (pages sharing a parent with p) are of high quality. We therefore adopt ASP (the Average Siblings' PageRank) as an index for estimating the quality of new pages. Meanwhile, to avoid one or two parent pages with too many outlinks biasing the estimated value, we assign ACP (the Average Children's PageRank) to each page and calculate ASP via ACP:

  ACP(q) = ( sum_{q -> p} PR(p) ) / outdegree(q)
  ASP(p) = ( sum_{q -> p} ACP(q) ) / indegree(p)

To evaluate this method, we take a snapshot of 1,631,483 web pages (PS2) and compare it with another snapshot of the same set of pages taken 22 months earlier (PS1). First, we calculate ordinary PageRank on both sets; the PageRank computed on PS2 is taken as a measure of the inherent quality of the pages. Then we randomly pick 5 sets of new pages, each containing 160,000 pages of low PageRank with only 1 or 2 inlinks each, calculate ASP for each set, and sort the pages by ordinary PageRank and by ASP respectively. Let R_PS2(p) be the rank position of p among all new pages ordered by PageRank calculated on PS2, R_naive(p) the rank position of p among all new pages ordered by PageRank calculated on PS1, and R_ASP(p) the rank position of p among all new pages ordered by ASP. The performance evaluation functions are:

  F_naive(N) = average( R_naive(p) ) / (number of new pages),  over p such that R_PS2(p) <= N
  F_ASP(N)   = average( R_ASP(p) )  / (number of new pages),  over p such that R_PS2(p) <= N

We run the calculation of F_naive(N) and F_ASP(N) on the five sets of pages. The results share a similar pattern; Figure 2 presents the average of the five values.
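The ACP/ASP computation above can be sketched over a link graph as follows (our illustration; the adjacency-dict representation is an assumption):

```python
from collections import defaultdict

def estimate_asp(pagerank: dict, out_links: dict) -> dict:
    """ASP(p): average over p's parents q of ACP(q), where ACP(q) is the
    sum of PR over q's children divided by outdegree(q). Pages missing
    from `pagerank` (e.g. new pages) contribute 0 to ACP."""
    # ACP for every page q that has outgoing links.
    acp = {q: sum(pagerank.get(p, 0.0) for p in kids) / len(kids)
           for q, kids in out_links.items() if kids}
    # Invert the graph to find each page's parents.
    parents = defaultdict(list)
    for q, kids in out_links.items():
        for p in kids:
            parents[p].append(q)
    # ASP = average of the parents' ACP values.
    return {p: sum(acp[q] for q in qs) / len(qs) for p, qs in parents.items()}
```

Because every parent divides its children's PageRank mass by its outdegree first, a single hub with thousands of outlinks cannot dominate a new page's estimate, which is exactly the bias ACP is introduced to avoid.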
Clearly, ordinary PageRank hardly conveys the potential of new pages, while ASP does upgrade the rank positions of high-quality pages. Owing to the coarseness of our selection of new pages in this experiment, however, the results may not be as good as we expected. We hope to carry out further work to evaluate the ASP algorithm.
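The evaluation functions F_naive and F_ASP defined above reduce to a single scoring routine; the following is a sketch under our assumed data layout (dicts mapping page to rank position), with lower scores meaning the method places the truly good pages nearer the top:

```python
def f_score(rank_by_method: dict, rank_by_ps2: dict, cutoff: int) -> float:
    """F(N): among pages that the later snapshot PS2 (taken as ground-truth
    quality) ranks in its top N, the average rank position assigned by the
    method, normalized by the number of new pages. Lower is better."""
    top = [p for p, r in rank_by_ps2.items() if r <= cutoff]
    if not top:
        return 0.0
    avg = sum(rank_by_method[p] for p in top) / len(top)
    return avg / len(rank_by_method)
```

For example, a method that exactly reproduces the PS2 ordering scores the minimum possible value at every cutoff, while one that reverses it scores near 1.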

Fig. 2. Performance of ASP (F(N) for F_naive and F_ASP; y-axis: F(N); x-axis: N)

6 Conclusion and Future Work

We propose a promotional ranking scheme that gives new pages a chance to prove their values. The experimental results show that our methods really improve both the result quality and the probability of new pages being noticed. Since testing on a real commercial search engine is infeasible, we conduct a simulation for the evaluation. Because a search engine is more like a recommendation system retrieving fuzzy answers than a QA system offering precise answers, we believe our simulation, though primitive, has merit in demonstrating our method. In Section 4.1 we assume that the distribution of the new pages' PageRank is similar to that of their parent pages; this eases the sorting of new pages but may cause popular sites to become still more popular. Our future work will focus on finding less biased ways of ranking new pages. We also look forward to upgrading our simulation by using real web pages instead of photographic works.

References

1. R. Baeza-Yates, F. Saint-Jean, and C. Castillo. Web structure, dynamics and page quality. In Proc. String Processing and Information Retrieval (SPIRE).
2. J. Cho and S. Roy. Impact of search engines on page popularity. In WWW 2004, May 2004.
3. J. Cho, S. Roy, and R. E. Adams. Page quality: In search of an unbiased web ranking. In SIGMOD 2005, June 2005.
4. A. Ntoulas, J. Cho, H. K. Cho, H. Cho, and Y.-J. Cho. A study on the evolution of the web. In the 2005 UKC Conference, August 2005.
5. A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In WWW 2004, May 2004.
6. S. Pandey, S. Roy, C. Olston, J. Cho, and S. Chakrabarti. Shuffling a stacked deck: The case for partially randomized ranking of search engine results. In VLDB 2005, August 2005.
7. F. Qiu and J. Cho. Automatic identification of user interest for personalized search. In WWW 2006, May 2006.
8. C. Sherman. Meet the search engines.
9. D. Sullivan. Searches per day.

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

THE REAL ROOT CAUSES OF BREACHES. Security and IT Pros at Odds Over AppSec

THE REAL ROOT CAUSES OF BREACHES. Security and IT Pros at Odds Over AppSec THE REAL ROOT CAUSES OF BREACHES Security and IT Pros at Odds Over AppSec EXECUTIVE SUMMARY Breaches still happen, even with today s intense focus on security. According to Verizon s 2016 Data Breach Investigation

More information

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho,

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho, Parallel Crawlers Junghoo Cho, Hector Garcia-Molina Stanford University {cho, hector}@cs.stanford.edu Abstract In this paper we study how we can design an effective parallel crawler. As the size of the

More information

Personalized Tour Planning System Based on User Interest Analysis

Personalized Tour Planning System Based on User Interest Analysis Personalized Tour Planning System Based on User Interest Analysis Benyu Zhang 1 Wenxin Li 1,2 and Zhuoqun Xu 1 1 Department of Computer Science & Technology Peking University, Beijing, China E-Mail: {zhangby,

More information

Discovering Information through Summon:

Discovering Information through Summon: Discovering Information through Summon: An Analysis of User Search Strategies and Search Success Ingrid Hsieh-Yee Professor, Dept. of Library and Information Science, Catholic University of America Shanyun

More information

Simulation Study of Language Specific Web Crawling

Simulation Study of Language Specific Web Crawling DEWS25 4B-o1 Simulation Study of Language Specific Web Crawling Kulwadee SOMBOONVIWAT Takayuki TAMURA, and Masaru KITSUREGAWA Institute of Industrial Science, The University of Tokyo Information Technology

More information

A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho Stanford InterLib Technologies Information Overload Service Heterogeneity Interoperability Economic Concerns Information

More information

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang Microsoft Research, Asia School of EECS, Peking University Ordering Policies for Web Crawling Ordering policy To prioritize the URLs in a crawling queue

More information

Deep Web Crawling and Mining for Building Advanced Search Application

Deep Web Crawling and Mining for Building Advanced Search Application Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Conclusions. Chapter Summary of our contributions

Conclusions. Chapter Summary of our contributions Chapter 1 Conclusions During this thesis, We studied Web crawling at many different levels. Our main objectives were to develop a model for Web crawling, to study crawling strategies and to build a Web

More information

Inverted List Caching for Topical Index Shards

Inverted List Caching for Topical Index Shards Inverted List Caching for Topical Index Shards Zhuyun Dai and Jamie Callan Language Technologies Institute, Carnegie Mellon University {zhuyund, callan}@cs.cmu.edu Abstract. Selective search is a distributed

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

NeighborWatcher: A Content-Agnostic Comment Spam Inference System

NeighborWatcher: A Content-Agnostic Comment Spam Inference System NeighborWatcher: A Content-Agnostic Comment Spam Inference System Jialong Zhang and Guofei Gu Secure Communication and Computer Systems Lab Department of Computer Science & Engineering Texas A&M University

More information

LET:Towards More Precise Clustering of Search Results

LET:Towards More Precise Clustering of Search Results LET:Towards More Precise Clustering of Search Results Yi Zhang, Lidong Bing,Yexin Wang, Yan Zhang State Key Laboratory on Machine Perception Peking University,100871 Beijing, China {zhangyi, bingld,wangyx,zhy}@cis.pku.edu.cn

More information

Efficient Crawling Through Dynamic Priority of Web Page in Sitemap

Efficient Crawling Through Dynamic Priority of Web Page in Sitemap Efficient Through Dynamic Priority of Web Page in Sitemap Rahul kumar and Anurag Jain Department of CSE Radharaman Institute of Technology and Science, Bhopal, M.P, India ABSTRACT A web crawler or automatic

More information

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrative. Web crawlers. Web Crawlers and Link Analysis! Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt

More information

CS6200 Information Retreival. Crawling. June 10, 2015

CS6200 Information Retreival. Crawling. June 10, 2015 CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on

More information

A priority based dynamic bandwidth scheduling in SDN networks 1

A priority based dynamic bandwidth scheduling in SDN networks 1 Acta Technica 62 No. 2A/2017, 445 454 c 2017 Institute of Thermomechanics CAS, v.v.i. A priority based dynamic bandwidth scheduling in SDN networks 1 Zun Wang 2 Abstract. In order to solve the problems

More information

ONLINE EVALUATION FOR: Company Name

ONLINE EVALUATION FOR: Company Name ONLINE EVALUATION FOR: Company Name Address Phone URL media advertising design P.O. Box 2430 Issaquah, WA 98027 (800) 597-1686 platypuslocal.com SUMMARY A Thank You From Platypus: Thank you for purchasing

More information

VisoLink: A User-Centric Social Relationship Mining

VisoLink: A User-Centric Social Relationship Mining VisoLink: A User-Centric Social Relationship Mining Lisa Fan and Botang Li Department of Computer Science, University of Regina Regina, Saskatchewan S4S 0A2 Canada {fan, li269}@cs.uregina.ca Abstract.

More information

NON-CENTRALIZED DISTINCT L-DIVERSITY

NON-CENTRALIZED DISTINCT L-DIVERSITY NON-CENTRALIZED DISTINCT L-DIVERSITY Chi Hong Cheong 1, Dan Wu 2, and Man Hon Wong 3 1,3 Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong {chcheong, mhwong}@cse.cuhk.edu.hk

More information

A Cost-Aware Strategy for Query Result Caching in Web Search Engines

A Cost-Aware Strategy for Query Result Caching in Web Search Engines A Cost-Aware Strategy for Query Result Caching in Web Search Engines Ismail Sengor Altingovde, Rifat Ozcan, and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,rozcan,oulusoy}@cs.bilkent.edu.tr

More information

A P2P-based Incremental Web Ranking Algorithm

A P2P-based Incremental Web Ranking Algorithm A P2P-based Incremental Web Ranking Algorithm Sumalee Sangamuang Pruet Boonma Juggapong Natwichai Computer Engineering Department Faculty of Engineering, Chiang Mai University, Thailand sangamuang.s@gmail.com,

More information

ihits: Extending HITS for Personal Interests Profiling

ihits: Extending HITS for Personal Interests Profiling ihits: Extending HITS for Personal Interests Profiling Ziming Zhuang School of Information Sciences and Technology The Pennsylvania State University zzhuang@ist.psu.edu Abstract Ever since the boom of

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Introduction to List Building. Introduction to List Building

Introduction to  List Building. Introduction to  List Building Introduction to Email List Building Introduction to Email List Building 1 Table of Contents Introduction... 3 What is email list building?... 5 Permission-based email marketing vs. spam...6 How to build

More information

Web Crawling As Nonlinear Dynamics

Web Crawling As Nonlinear Dynamics Progress in Nonlinear Dynamics and Chaos Vol. 1, 2013, 1-7 ISSN: 2321 9238 (online) Published on 28 April 2013 www.researchmathsci.org Progress in Web Crawling As Nonlinear Dynamics Chaitanya Raveendra

More information

Ranking web pages using machine learning approaches

Ranking web pages using machine learning approaches University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 Ranking web pages using machine learning approaches Sweah Liang Yong

More information

Search Quality. Jan Pedersen 10 September 2007

Search Quality. Jan Pedersen 10 September 2007 Search Quality Jan Pedersen 10 September 2007 Outline The Search Landscape A Framework for Quality RCFP Search Engine Architecture Detailed Issues 2 Search Landscape 2007 Source: Search Engine Watch: US

More information

Finding a needle in Haystack: Facebook's photo storage

Finding a needle in Haystack: Facebook's photo storage Finding a needle in Haystack: Facebook's photo storage The paper is written at facebook and describes a object storage system called Haystack. Since facebook processes a lot of photos (20 petabytes total,

More information

Are people biased in their use of search engines? Keane, Mark T.; O'Brien, Maeve; Smyth, Barry. Communications of the ACM, 51 (2): 49-52

Are people biased in their use of search engines? Keane, Mark T.; O'Brien, Maeve; Smyth, Barry. Communications of the ACM, 51 (2): 49-52 Provided by the author(s) and University College Dublin Library in accordance with publisher policies. Please cite the published version when available. Title Are people biased in their use of search engines?

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL., NO., DEC User Action Interpretation for Online Content Optimization

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL., NO., DEC User Action Interpretation for Online Content Optimization IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL., NO., DEC 2011 1 User Action Interpretation for Online Content Optimization Jiang Bian, Anlei Dong, Xiaofeng He, Srihari Reddy, and Yi Chang Abstract

More information

Supervised Web Forum Crawling

Supervised Web Forum Crawling Supervised Web Forum Crawling 1 Priyanka S. Bandagale, 2 Dr. Lata Ragha 1 Student, 2 Professor and HOD 1 Computer Department, 1 Terna college of Engineering, Navi Mumbai, India Abstract - In this paper,

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

World Wide Web has specific challenges and opportunities

World Wide Web has specific challenges and opportunities 6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has

More information

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Mukesh Kumar and Renu Vig University Institute of Engineering and Technology, Panjab University, Chandigarh,

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Effective Performance of Information Retrieval by using Domain Based Crawler

Effective Performance of Information Retrieval by using Domain Based Crawler Effective Performance of Information Retrieval by using Domain Based Crawler Sk.Abdul Nabi 1 Department of CSE AVN Inst. Of Engg.& Tech. Hyderabad, India Dr. P. Premchand 2 Dean, Faculty of Engineering

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Learning the Three Factors of a Non-overlapping Multi-camera Network Topology

Learning the Three Factors of a Non-overlapping Multi-camera Network Topology Learning the Three Factors of a Non-overlapping Multi-camera Network Topology Xiaotang Chen, Kaiqi Huang, and Tieniu Tan National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

SEO: SEARCH ENGINE OPTIMISATION

SEO: SEARCH ENGINE OPTIMISATION SEO: SEARCH ENGINE OPTIMISATION SEO IN 11 BASIC STEPS EXPLAINED What is all the commotion about this SEO, why is it important? I have had a professional content writer produce my content to make sure that

More information

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann Search Engines Chapter 8 Evaluating Search Engines 9.7.2009 Felix Naumann Evaluation 2 Evaluation is key to building effective and efficient search engines. Drives advancement of search engines When intuition

More information

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

Ranking Web Pages by Associating Keywords with Locations

Ranking Web Pages by Associating Keywords with Locations Ranking Web Pages by Associating Keywords with Locations Peiquan Jin, Xiaoxiang Zhang, Qingqing Zhang, Sheng Lin, and Lihua Yue University of Science and Technology of China, 230027, Hefei, China jpq@ustc.edu.cn

More information

Yammer Product Manager Homework: LinkedІn Endorsements

Yammer Product Manager Homework: LinkedІn Endorsements BACKGROUND: Location: Mountain View, CA Industry: Social Networking Users: 300 Million PART 1 In September 2012, LinkedIn introduced the endorsements feature, which gives its users the ability to give

More information

An Algorithm of Parking Planning for Smart Parking System

An Algorithm of Parking Planning for Smart Parking System An Algorithm of Parking Planning for Smart Parking System Xuejian Zhao Wuhan University Hubei, China Email: xuejian zhao@sina.com Kui Zhao Zhejiang University Zhejiang, China Email: zhaokui@zju.edu.cn

More information

Privacy Protection in Personalized Web Search with User Profile

Privacy Protection in Personalized Web Search with User Profile Privacy Protection in Personalized Web Search with User Profile Prateek C. Shukla 1,Tekchand D. Patil 2, Yogeshwar J. Shirsath 3,Dnyaneshwar N. Rasal 4 1,2,3,4, (I.T. Dept.,B.V.C.O.E.&R.I. Anjaneri,university.Pune,

More information

Scale Free Network Growth By Ranking. Santo Fortunato, Alessandro Flammini, and Filippo Menczer

Scale Free Network Growth By Ranking. Santo Fortunato, Alessandro Flammini, and Filippo Menczer Scale Free Network Growth By Ranking Santo Fortunato, Alessandro Flammini, and Filippo Menczer Motivation Network growth is usually explained through mechanisms that rely on node prestige measures, such

More information

Fast and Effective Interpolation Using Median Filter

Fast and Effective Interpolation Using Median Filter Fast and Effective Interpolation Using Median Filter Jian Zhang 1, *, Siwei Ma 2, Yongbing Zhang 1, and Debin Zhao 1 1 Department of Computer Science, Harbin Institute of Technology, Harbin 150001, P.R.

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Value of YouTube to the music industry Paper V Direct value to the industry

Value of YouTube to the music industry Paper V Direct value to the industry Value of YouTube to the music industry Paper V Direct value to the industry June 2017 RBB Economics 1 1 Introduction The music industry has undergone significant change over the past few years, with declining

More information

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites An Empirical Evaluation of User Interfaces for Topic Management of Web Sites Brian Amento AT&T Labs - Research 180 Park Avenue, P.O. Box 971 Florham Park, NJ 07932 USA brian@research.att.com ABSTRACT Topic

More information

The data quality trends report

The data quality trends report Report The 2015 email data quality trends report How organizations today are managing and using email Table of contents: Summary...1 Research methodology...1 Key findings...2 Email collection and database

More information

The tracing tool in SQL-Hero tries to deal with the following weaknesses found in the out-of-the-box SQL Profiler tool:

The tracing tool in SQL-Hero tries to deal with the following weaknesses found in the out-of-the-box SQL Profiler tool: Revision Description 7/21/2010 Original SQL-Hero Tracing Introduction Let s start by asking why you might want to do SQL tracing in the first place. As it turns out, this can be an extremely useful activity

More information

International Journal of Advancements in Research & Technology, Volume 2, Issue 6, June ISSN

International Journal of Advancements in Research & Technology, Volume 2, Issue 6, June ISSN International Journal of Advancements in Research & Technology, Volume 2, Issue 6, June-2013 159 Re-ranking the Results Based on user profile. Email: anuradhakale20@yahoo.com Anuradha R. Kale, Prof. V.T.

More information

Image Classification Using Wavelet Coefficients in Low-pass Bands

Image Classification Using Wavelet Coefficients in Low-pass Bands Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August -7, 007 Image Classification Using Wavelet Coefficients in Low-pass Bands Weibao Zou, Member, IEEE, and Yan

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

An Efficient XML Index Structure with Bottom-Up Query Processing

An Efficient XML Index Structure with Bottom-Up Query Processing An Efficient XML Index Structure with Bottom-Up Query Processing Dong Min Seo, Jae Soo Yoo, and Ki Hyung Cho Department of Computer and Communication Engineering, Chungbuk National University, 48 Gaesin-dong,

More information

Top 3 Marketing Metrics You Should Measure in Google Analytics

Top 3 Marketing Metrics You Should Measure in Google Analytics Top 3 Marketing Metrics You Should Measure in Google Analytics Presented By Table of Contents Overview 3 How to Use This Knowledge Brief 3 Metric to Measure: Traffic 4 Direct (Acquisition > All Traffic

More information

Understanding User Operations on Web Page in WISE 1

Understanding User Operations on Web Page in WISE 1 Understanding User Operations on Web Page in WISE 1 Hongyan Li, Ming Xue, Jianjun Wang, Shiwei Tang, and Dongqing Yang National Laboratory on Machine Perception, School of Electronics Engineering and Computer

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 18 October 2017 Some slides courtesy Croft et al. Web Crawler Finds and downloads web pages automatically provides the collection

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information