- Web Searching in Norwegian a Study of Sesam.no -


Ole Martin Kjørstad
Nils-Marius Sørli

BI Norwegian School of Management
Thesis

- Web Searching in Norwegian a Study of Sesam.no -

Date of Submission:
Campus: BI Oslo
Exam code and name: GRA Master Thesis
Supervisor: Ingunn Myrtveit
Program: Master of Science in Business and Economics
Strategy Major

This thesis is a part of the MSc programme at BI Norwegian School of Management. The school takes no responsibility for the methods used, results found and conclusions drawn.

Content

CONTENT
LIST OF FIGURES AND TABLES
SUMMARY
1 INTRODUCTION
BACKGROUND
THE SESAM STUDY
EXPECTED FINDINGS
RELATED LITERATURE
CHAU, FANG AND YANG (2007)
JANSEN, BOOTH & SPINK (2007)
SILVERSTEIN, MARAIS, HENZINGER & MORICZ (1999)
JANSEN, SPINK & PEDERSEN (2005)
SPINK, WOLFRAM, JANSEN & SARACEVIC (2001)
SPINK, KORICICH, JANSEN & COLE (2004)
DATA
DATA COLLECTION
DATA CLEANING
DATA DESCRIPTIVES
DATA ANALYSIS
RESULTS
RESULTS FOR
RESULTS FOR
COMPARISON
DISCUSSION
DIRECTIONS FOR FUTURE RESEARCH
CONCLUSION
REFERENCES
ATTACHMENT 1: PRELIMINARY THESIS

List of figures and tables

FIGURE 1-1: GENERIC STRATEGIES (PORTER, 1980)
TABLE 2-1: RESULTS SUMMARY CHAU ET AL. (2007)
TABLE 2-2: DISTRIBUTION BETWEEN CLASSIFICATIONS
TABLE 2-3: DATA FIGURES FROM SILVERSTEIN ET AL. (1999)
FIGURE 2-1: DEVELOPMENT IN ONLINE RETAIL SPENDING
TABLE 2-4: RESULTS SUMMARY JANSEN ET AL. (2005)
TABLE 2-5: SESSION LENGTH
TABLE 2-6: QUERY LENGTH
TABLE 2-7: GENERAL TOPIC CATEGORIES
TABLE 2-8: RESULTS PAGES VIEWED
TABLE 2-9: RESULTS SUMMARY SPINK ET AL. (2001)
FIGURE 3-1: EXAMPLE OF THE FLAT DATA FILE
FIGURE 3-2: THE SEARCH CATEGORY CHOICES IN SESAM
TABLE 3-1: SIZE OF THE CLEANED DATA SETS
TABLE 3-2: THE SEARCH CATEGORIES IN 2005 AND
TABLE 4-1: SEARCH CATEGORY DISTRIBUTION 2008 A
TABLE 4-2: SEARCH CATEGORY DISTRIBUTION 2008 B
TABLE 4-3: DISTRIBUTION BETWEEN NATIONS
TABLE 4-4: QUERY COMPLEXITY
TABLE 4-5: DISTRIBUTION BETWEEN NUMBER OF TERMS
TABLE 4-6: DISTRIBUTION BETWEEN NUMBER OF RESULT PAGES VIEWED
TABLE 4-7: MEAN AND MEDIAN RESULT PAGES
TABLE 4-8: SEARCH CATEGORY DISTRIBUTION
TABLE 4-9: MEAN AND MEDIAN QUERY LENGTH
TABLE 4-10: DISTRIBUTION BETWEEN NUMBER OF TERMS USED
TABLE 4-11: MEAN AND MEDIAN RESULT PAGES
TABLE 4-12: DISTRIBUTION BETWEEN NUMBER OF RESULT PAGES VIEWED
TABLE 4-13: DEVELOPMENT IN TERM USAGE
TABLE 4-14: DEVELOPMENT IN THE USE OF SEARCH CATEGORIES
TABLE 4-15: DEVELOPMENT IN RESULT PAGES
TABLE 4-16: DEVELOPMENT IN THE DISTRIBUTION BETWEEN RESULT PAGES VIEWED

Summary

Web searching has become an important part of the way people retrieve information. As there are different needs for information, there could also exist different segments in the market for web searching services. A potential difference in use lies between services that focus on one country in particular and those with an international focus. The way these services are used will also have implications for the providers, and for those who target their customers through the web.

Not much extensive research has been done on web searching services. The most frequently studied search engines are AltaVista and Excite, but smaller and less internationally oriented engines have also been studied.

The study of Sesam.no shows that there seem to be differences in how people use different search engines. These findings suggest that a search engine with a focus on a particular nation is used for more specific information retrieval, such as information on people, companies and news. In addition, they bring up some implications for how actors operate on, or target customers through, the web. They also imply that it is possible for providers of search services to take smaller positions in the market, i.e. to focus on providing good results for a narrower part of the information available on the World Wide Web.

1 Introduction

In our thesis we seek to explore whether there are differences between how search engines that focus on one particular country are used and how those with a general focus on the entire web are used. The motivation for doing so is to try to say something about the potentially different markets, or segments, that exist for web search engines and web searching technology.

To date, extensive research on web searching and web searching behavior has not been done. There are few published studies, and they were done at different times and on different search engines. This research is relevant for the development of search services, and it also has strategic relevance for the providers of these services. Our study, which focuses on differences between what we call national specific search engines and search engines with an international focus, can hopefully say something about the differences between these two markets. These differences, if they exist, show up in how the users handle the search engines differently and perform other types of queries on them (for instance more informational rather than explorative). If there are differences, this may indicate different segments among the users, which would open room for different strategic approaches to web searching services. We return to this in section 1.1.

If web searching services can be developed in a way that satisfies the users' needs better and more easily than competitors do, there is room for a disruptive development where these services can take over the market even if they might be technically inferior to other services (Christensen, 1997). Studying user behavior in web searching is therefore a highly relevant topic within the field of strategy and technology, as well as from a strictly technological point of view.

To date, not much academic research has been done on search engines. There is little doubt, however, that the topic has only become more relevant in recent years. This makes it an important area to study. To understand what the users of web searching tools want from the providers of these services, one needs to understand how they interact with this type of tool. Understanding how people utilize web searching services is also relevant for companies that address their customers over the internet, or plan on doing so.

Since web searching is an increasingly important channel for reaching out to potential customers, understanding the behavior of web searchers will improve the understanding of how one can target the right segments and position oneself on the internet.

To study this, we will analyze query logs from the web searching service Sesam.no from two different periods. In doing so, we will look at how user behavior in a Norwegian web searching service has developed over the last few years, and how it differs from behavior on search engines with a global focus. Studying this may tell us whether there are different general informational needs that drive the user when performing searches on a local search engine, and thus say something about potentially different markets and segments within the search engine industry.

Our thesis will first present the background on how we started working with the topic we chose, and then explain the premises for our study. We then turn to the work that has been done prior to ours, and give a review of the most relevant literature. After this we present the data we collected and the analysis, before we discuss our findings and give some directions for what could be interesting to study in the future.

1.1 Background

When we started our work with this topic we wanted to investigate the impact of the changes in the industry for search technology, both web searching and enterprise search. During the summer of 2008, we had meetings with Schibsted, including Sesam and Retriever, and with representatives from Fast Search & Transfer, in order both to gain insight into the business we were going to study and to map out what opportunities there were to investigate the issues we wanted to look at.

During this period we considered several different approaches to the subject. One that emerged as an interesting topic, as we went through the existing literature and discussed possibilities in different meetings, was to look at different markets and segments for web searching services.

As with other products and services, it is obvious that there exist customers or users who are motivated by different needs when using the service. Consequently, actors usually choose a position or strategy for which part of a market or segment they wish to target and serve. One could for instance choose a broad scope to try to serve as large a part of the market as possible, or aim at serving certain niche markets and thus use a focus strategy (Porter, 1980).

Figure 1-1: Generic strategies (Porter, 1980)

The same theory can be used to understand the market for web searching services. There is a vast number of different web searching services available, and to earn a spot in the market these need to deliver quality in a way that users appreciate. In recent years the market for search engines has been dominated by Google, which since the middle of 2008 has had over 70 % of the market for web searching (seoconsultants.com, 2009). Google obviously positions itself with a broad scope and does not focus on delivering the best quality within one certain area of informational searches, but rather focuses on the entire web. It has managed to establish itself as a dominating search engine in this field, which allows it to reap tremendous benefits in terms of revenues from advertisement.

Some would probably argue that this strategy is the only relevant one for providers of web searching services, and that the entire market for web searching consists of one rather homogeneous segment. However, what we wanted to look further into was whether there were differing needs among different users of web searching services, and thus whether there was room for positioning services differently. This is obviously a large task which one thesis will not be able to answer completely.

We found that the natural and feasible topic to investigate in order to contribute to this field of research was to look at differences between the usage of what we call national specific search engines and search engines with an international focus. By national specific search engines we refer to search sites that focus mainly on one particular language, thus aiming at returning results relevant for one particular country or nation.

1.2 The Sesam study

Compared to previous studies made on search engines, ours has both advantages and disadvantages. Sesam offers users the opportunity to choose different search categories. An important advantage of our dataset is thus the inclusion of a variable that lets us know which category, if any, the user chose. The feature is extensively used in Sesam, and represents an opportunity to better understand what the users have been looking for. When the user for instance has chosen maps and personal info as search category, we can say what the user intent was more precisely, and with more confidence, than with the more complicated methods used in some of the previous studies where such variables were not included.

A limitation when investigating the use of national specific search engines is, of course, that we are only looking at one particular search engine in this study. It would be very interesting to compare results with similar search engines in other countries as well, in order to generalize results better across different countries and search interfaces. This could be an interesting area for future research. However, as we do compare our results to rather similar studies, even though the metrics are not always the same, we expect to be able to see some interesting results.

1.3 Expected findings

Through working with this subject we formed some opinions about what we expected to find during the analysis. One aspect we expected to see was Sesam being used for searches on information about people and news. We argue this because the search engine was marketed as a service gathering all hits in one place, with a focus on retrieving results from many different sources. This could make it a natural choice when looking for telephone numbers combined with information about people, along with information about companies, services, and information in general which typically can be found in catalogues.

We also expected that most searches would be in Norwegian, as the search engine's layout is in Norwegian and it focuses on providing good results from Norwegian internet pages. Consequently, it is also likely that the majority of queries will be submitted from Norway.

Since the search engine has an option to specify what type of query the user wishes to perform by selecting a category, we believe there will be less use of Boolean operators in Sesam compared to what is found in studies on other search engines.

Another theory we have is that people often look for something more specific when using a search engine like Sesam than when using, let's say, Google. By specific we here mean concrete information relevant for a limited area, for instance news or personal information, as opposed to general information on a field of interest. This is due to the fact that Sesam is more focused on retrieving facts within Norway, such as news and information about people.

Furthermore, since we are operating with such large numbers of observations in our data, we expect to see statistically significant results wherever changes have occurred.

2 Related literature

In this section we present a review of the most relevant points from several related studies we have come across during our work with this thesis. The goal is to give a brief summary of important previous studies, so that it will be easier to grasp what these have found and compare it to our own findings. In this way we hope to increase understanding in this particular field of web search research.

2.1 Chau, Fang and Yang (2007)

The first article we look at is "Web Searching in Chinese: A Study of a Search Engine in Hong Kong". In this article the authors report their results from an analysis of the query logs from the Hong Kong based search engine Timway. These logs span a period of 3 months, and contain data on sessions, queries entered, search topics and character usage.

One interesting aspect of this article is that it analyzes a non-English search engine. Much of the early analysis of search engines was done on English-based search engines, and as such it is interesting to get a different view on how people use search engines in other parts of the world. This is especially so for us, considering that we are also analyzing a non-English search engine. Their results show that some of the characteristics they looked at, such as search topics and mean number of queries per session, are similar to those found in English studies, while other characteristics, such as the use of operators when formulating queries, differ significantly.

Researching non-English search engines is important due to the increasing amount of content on the web that is non-English, in addition to a large contingent of web users who are not native English speakers. An estimate from 2004, which they cite, puts the share of native English-speaking Internet users at 36.5 % (Global Reach, 2004). Research into the characteristics of the remaining group of internet users is important to improve the design of search engines. One example of how their needs differ comes from language differences, where a different structure leads to differences in how people search. The lingual differences are especially apparent in this study due to the huge disparities between English and Chinese. Cultural differences can also play a part here.

We think it is reasonable to believe that the differences between Sesam and the large English-language search engines studied might be smaller than what they find in this study on Timway, as the lingual difference between English and Norwegian is smaller than the difference between English and Chinese.

The authors have listed the following research questions:

a) What are the characteristics of the search queries submitted to a non-English search engine?
b) How do these queries compare to those of English search engines such as Excite and AltaVista?
c) What are the implications of these results on the design of non-English search engines?

In their study they used the following metrics, already developed by Spink et al. (2001):

- Queries are sets of one or more terms.
- Unique queries are all differing queries entered by one user in a session.
- Repeat queries are all multiple occurrences of the same query by one user or submitted by the system automatically to periodically update the list of results. Multiple occurrences of the same query also can be generated by the system when the user requests subsequent result pages.
- Empty queries are defined as queries without any terms.

They follow the definition of a session given in Silverstein et al. (1999), where a session is defined as a series of queries submitted by a single user within a small range of time.

One weakness of the Timway data that they point out is that Timway does not capture empty queries. Thus they are unable to report on this. Since empty queries constitute a substantial share of the queries in several of the other studies that have been done in this field, this information could have provided some interesting insights.

What they do find, though, is that the average number of queries per session is 2.03, while the median is 1. This average is much lower than that reported by Spink et al. (2001) in a study on the Excite search engine. It is, however, very similar to the 2.02 average that Silverstein et al. (1999) found when looking at data from AltaVista.

In their data they identified the top 50 characters that appeared in the search log, and observe that these constitute 25.25% of the total occurrences of characters in the search log. They suggest that this concurs with findings by Jansen et al. (1998) and Spink et al. (2001). However, we see this as somewhat of a self-contradiction, as they say in the next passage that their number is much higher than what Spink et al. (2001) found, which was that the top 75 terms constituted 9 % of the terms occurring in unique queries.

They have also reported the mean number of characters in Chinese queries, which is larger than the mean query lengths found for Excite and AltaVista, which were 2.16 and 2.35 respectively. However, this sort of comparison between Chinese queries and English queries makes little sense due to the large differences between the languages. Thus the numbers from the American studies that are quoted here are of most interest to us, due to the larger similarities between English and Norwegian, as mentioned above.

Some of their results are summarized in the following table:

Total queries                       (100%)
Unique queries                      (82.28%)
Repeat queries                      (17.72%)
Empty queries                       0 (as this is not captured in the Timway log)
Mean queries per session            2.03
Median queries per session          1
Mean unique queries per session     1.67
Median unique queries per session   1

Table 2-1: Results summary Chau et al. (2007)

In their concluding remarks they point out the importance of studying how non-English internet users use available resources when seeking information on the web, especially as a lot more non-English material is made available. In relation to this they point out the importance of studying search engines that accept queries in several languages. The Timway engine that is the basis for their study accepts both Chinese and English queries. This allows us to draw a parallel to our study of the Sesam engine, which also allows queries in multiple languages. Thus we might be able to make some interesting comparisons later on.

2.2 Jansen, Booth & Spink (2007)

Another interesting article is "Determining the intent of web search engine queries" by Jansen et al. (2007). Here they point to one of the big issues in research on user intent behind web searching, namely the lack of available data. Saying what people are really looking for just by examining their queries is not entirely simple. This is an area where our data could provide some valuable insight, as we have data on the search categories that users can choose to give the search engine some direction as to what kind of information they are looking for. This means that the data can provide a much better insight into what the users were really looking for when submitting their queries. Jansen et al. (2007), on the other hand, had to make qualified guesses to a much larger degree, based solely on the query terms entered, and then try to imagine what the users were really looking for.

They describe web search engines as tools that help people find the resources they are looking for by more clearly identifying the searcher's intent behind the query. Using an application that automatically classifies the queries from a query log, they classify results into the following three categories: Navigational Searching, Transactional Searching and Informational Searching. The following characteristics apply to each category (Jansen et al. 2007):

Navigational Searching
- queries containing company/business/organization/people names
- queries containing domains suffixes
- queries with web as the source
- queries length (i.e., number of terms in query) less than 3
- searcher viewing the first search engine results page

Transactional Searching
- queries containing terms related to movies, songs, lyrics, recipes, images, humor, and porn
- queries with obtaining terms (e.g., lyrics, recipes, etc.)
- queries with download terms (e.g., download, software, etc.)
- queries relating to image, audio, or video collections
- queries with audio, images, or video as the source
- queries with entertainment terms (pictures, games, etc.)
- queries with interact terms (e.g., buy, chat, etc.)
- queries with movies, songs, lyrics, images, and multimedia or compression file extensions (jpeg, zip, etc.)

Informational Searching
- uses question words (i.e., ways to, how to, what is, etc.)
- queries with natural language terms
- queries containing informational terms (e.g., list, playlist, etc.)
- queries that were beyond the first query submitted
- queries where the searcher viewed multiple results pages
- queries length (i.e., number of terms in a query) greater than 2
- queries that do not meet criteria for navigational or transactional

Some of the characteristics they noticed were that most of the navigational queries were short and occurred at the beginning of the session. Transactional queries were mainly identified through analyzing content and terms. Since identifying these two categories was fairly straightforward, they chose to classify those searches that did not fall into either of them as informational.

One weakness we recognize in these classifications is that the authors do not identify searches for news and other events of current interest. The results of their classifications are listed in table 2-2.

Classification    Occurrences    Percent
Informational                    ,6 %
Navigational                     ,2 %
Transactional                    ,2 %
                                 %

Table 2-2: Distribution between classifications
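To make the characteristics listed for each category more concrete, the sketch below shows how a strongly simplified rule-based classifier could be written. It is not the application used by Jansen et al. (2007). The keyword sets, the domain-suffix pattern, the `known_names` parameter and the function name `classify_intent` are all illustrative assumptions, and only query text is used, so rules that depend on session position or result pages viewed are left out. Like the authors, the sketch treats informational as the fallback class.

```python
import re

# Illustrative keyword sets only (assumption); the published classifier
# relied on its own, more extensive term lists.
TRANSACTIONAL_TERMS = frozenset({
    "lyrics", "recipes", "download", "software", "buy", "chat",
    "games", "pictures", "images", "video", "songs", "movies",
})
FILE_EXTENSIONS = frozenset({"jpeg", "jpg", "gif", "zip", "mp3"})
# Hypothetical list of domain suffixes, matched at the end of a term.
DOMAIN_SUFFIX = re.compile(r"\.(com|org|net|no)$")


def classify_intent(query, known_names=frozenset()):
    """Assign a query to one of the three intent classes (or 'empty').

    known_names is a hypothetical set of lowercase company, organization
    or people names supplied by the caller.
    """
    terms = query.lower().split()
    if not terms:
        return "empty"

    # Navigational: a domain suffix, or a short query containing a known name.
    has_domain = any(DOMAIN_SUFFIX.search(term) for term in terms)
    has_name = any(term in known_names for term in terms)
    if has_domain or (has_name and len(terms) < 3):
        return "navigational"

    # Transactional: obtain/download/interact terms or media file extensions.
    for term in terms:
        extension = term.rsplit(".", 1)[-1]
        if term in TRANSACTIONAL_TERMS or extension in FILE_EXTENSIONS:
            return "transactional"

    # Everything else falls through to informational.
    return "informational"
```

Called as `classify_intent("how to bake bread")`, the sketch falls through both rule sets and returns the informational class, which mirrors why that class ends up as the largest one when it is defined as the remainder.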

As we can see from the table, according to their classification, about 80 % of web searches are informational, while the two other categories represent roughly 10 % each.

In their conclusion they emphasize that an increased understanding of web search engines requires an increased understanding of user behavior. An important part of understanding user behavior is to understand the underlying intent behind a user's searches. This increased understanding can help to improve web search engines in the future.

2.3 Silverstein, Marais, Henzinger & Moricz (1999)

This paper presents an analysis performed on the AltaVista search engine, where they looked at query logs containing approximately 1 billion entries, collected over a period of six weeks. The queries were collected over 43 days, from 2nd August 1998 to 13th September 1998. They include all the queries submitted to the main (US) AltaVista search engine over this time period. These queries represent almost 285 million user sessions.

What the authors present is an analysis of individual queries, query duplication and query sessions, and a correlation analysis of the log entries which studies the interaction of terms within queries. They find support in their data for the speculation that web users differ from how users are assumed to be in standard information retrieval literature. Significant results include that web users use short queries when searching for information, that the users usually look only at the first 10 results, and that they seldom modify their queries. On this basis they speculate that traditional techniques used in information retrieval may be of little use for answering web search requests.

The reason they want to study a large sample of queries collected over several weeks is that these data are a lot less likely to be affected by short-lived trends that might affect a smaller dataset. Examples of such trends are the release of a major movie, or a major disaster that affects a lot of people. They also believe that the extended time frame will enable them to identify individuals' patterns of use a lot better. The last argument is that a large time span allows them to collect queries made both during the day and the night, and to make sure that the input covers users from the entire world, and not just those who happen to be awake during the period the log was taken from.

They compare their method to a similar study by Jansen et al. (1998), which used only a small sample of queries from the Excite search engine. This number does not even amount to the number of queries made on a commercial search engine on a single day, something Silverstein et al. see as a weakness.

This puts our study of data from Sesam somewhere in between these two studies. Our four days of data form a significantly larger set than that of Jansen et al. (1998), meaning our data may avoid some of the weaknesses that one might attribute to their study. At the same time our dataset is not as all-encompassing as that of Silverstein et al. (1999). Whether this means that Silverstein et al. have a significantly better dataset than ours is open for debate. We believe that four days should be more than sufficient to capture most of the important nuances of web searching on a general commercial search engine, while at the same time avoiding some of the most glaring limitations that a small dataset, collected over a limited time span, can entail.

What they want to look for in the data is fairly standard, such as which queries are most common, the average number of terms per query, and how many queries are included in the average user session.

The query logs themselves contain the following information (Silverstein et al. 1999):

- A timestamp indicating when the query was submitted. The timestamp is measured in milliseconds since 1 January
- A cookie, which can be used to say whether two queries come from the same user (this field is blank if the user has disabled cookies);
- The query terms, exactly as submitted;
- The result screen, that is, the requested range of search results;
- Other user-specified modifiers, such as a restriction on the result pages' language or date of last modification;
- Submission information, such as whether the query is a simple or advanced query; and
- Submitter information, such as the browser the submitter is using and the IP address of the submitting host.

They also provide their definition of a session: a session is a series of queries by a single user made within a small range of time. A session is meant to capture a single user's attempt to fill a single information need.

One of the first things they report is that 15% of all their queries were empty.

Total number of requests
Total number of non-empty requests
Total number of non-empty queries
Total number of unique, non-empty queries
Total number of sessions
Total number of exact-same-as-before requests

Table 2-3: Data figures from Silverstein et al. (1999)

They also find that the number of words per query is on average 2.35, which they note is the same as what Jansen et al. (1998) found. An interesting finding is that over the entire six-week period they studied, about two-thirds of all queries submitted were only asked once. This indicates that information needs on the internet are very diverse, or alternatively that people specify these needs in very diverse ways.

They also note that most sessions are very short, as 63.7% of sessions contain only one request, meaning only one query was entered and only one screen of results examined. They remark that their analysis unfortunately does not enable them to determine why so many people submit only one request.

They also reported the average number of queries per session, which was 2.02, and the average number of result pages viewed. Jansen et al. (1998) reported the average number of queries per session as 2.8, while the average number of result pages was reported as 2.21. They expect that these differences are due to the difference in the definition of a session. Another explanation is that there is a difference between AltaVista and Excite users. It could even be that Excite's short query log is not representative of its user base.
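The query and session metrics used in the studies reviewed so far (unique, repeat and empty queries, and sessions as a series of queries by a single user within a small range of time) can be illustrated with a short sketch. The code below is our own minimal illustration and not the procedure of any of the cited studies. The input format of `(user_id, timestamp_in_seconds, query)` tuples, the function names and the 300-second gap used to separate sessions are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median


def split_sessions(log, gap_seconds=300):
    """Group queries into sessions per user.

    log is an iterable of (user_id, timestamp_in_seconds, query) tuples.
    A new session starts when the same user is silent for more than
    gap_seconds; the 300-second default is an assumption for illustration.
    """
    by_user = defaultdict(list)
    for user, timestamp, query in log:
        by_user[user].append((timestamp, query))

    sessions = []
    for entries in by_user.values():
        entries.sort(key=lambda entry: entry[0])
        current, last_timestamp = [], None
        for timestamp, query in entries:
            if last_timestamp is not None and timestamp - last_timestamp > gap_seconds:
                sessions.append(current)
                current = []
            current.append(query)
            last_timestamp = timestamp
        sessions.append(current)
    return sessions


def query_metrics(sessions):
    """Count total, unique, repeat and empty queries for a non-empty session list."""
    total = empty = unique = 0
    for session in sessions:
        total += len(session)
        non_empty = [q.strip().lower() for q in session if q.strip()]
        empty += len(session) - len(non_empty)
        # Unique queries: differing queries entered within one session.
        unique += len(set(non_empty))
    sizes = [len(session) for session in sessions]
    return {
        "total": total,
        "unique": unique,
        "repeat": total - empty - unique,  # repeated occurrences within a session
        "empty": empty,
        "mean_per_session": mean(sizes),
        "median_per_session": median(sizes),
    }
```

The choice of gap matters: as noted above, differences in how a session is defined are one explanation offered for the differing queries-per-session figures between studies, so the same log can yield different averages under different thresholds.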

18 They also research the number of queries per session. What they find here is that 77, 6% of the sessions contained only one query, 13,5% contained two queries, 4,4% contained three queries, while 4,5% contained more than three queries. 2.4 Jansen, Spink & Pedersen (2005) In the article A Temporal Comparison of AltaVista Web Searching Jansen et al. (2005) look at three research questions: 1. What are the changes in AltaVista Web searching from 1998 to 2002? 2. What are the current characteristics of AltaVista searching, including the duration and frequency of search sessions? 3. What changes in the information needs of AltaVista users occurred between 1998 and 2002? They generalize their findings by suggesting that between 1998 and 2002 there was a slight change, with increasing session and query lengths. They also found that 70 % of sessions had a duration of 5 minutes or less, meaning the majority of people do not spend a lot of time trying to find what they are looking for. Another interesting development they noted between the data from 1998 and 2002 is that, seemingly, the range of topics that people search for has broadened. Apparently the most frequent terms account for less than 1 % of total term usage. This indicates a development where people use the web for retrieving a broader variety of activities and information retrieval, which then is reflected in the increasing number of topics that they search for. This could also be a reflection of the fact that the range of possibilities that one can use the internet for has broadened. For example, using the internet for shopping was probably a lot more common in 2002, compared to According to the site ecommerce-guide.com (Maguire, 2005) there is a clear tendency that spending on online retail is increasing, as we can see from figure 2-1: Page 14

Figure 2-1: Development in online retail spending

The figure shows the trend from 2001 onwards. The same trend is present for the period from 1997 to 1999 (ecommerce-guide.com, 1999).

Jansen et al. (2005) argue that it is important to see how people use web search engines like AltaVista because the web has attained an important role as many people's first choice for finding information. They see it as important to understand how web search engines perform, how they are used, and how users' search behavior changes over time. This reasoning is similar to the one employed in the other articles we have looked at. In particular, the fact that they examine changes from one time period to another makes this an interesting article for us to use as a comparison. The time span in their data (1998 to 2002) is roughly similar to the span between the two sets of observations we have from Sesam (2005 and 2008). Accurate research of this sort can contribute to improving future web search engine design.

They went about their work by analyzing general search characteristics such as session duration, query length, results pages viewed and term usage for the 2002 data, which they then compare to the results of a similar study by Silverstein, Henzinger, Marais and Moricz (1999), which looks at data from AltaVista from 1998.

Their data were collected over a 24-hour period on a single day (Sunday, September 8, 2002). Thus they differ somewhat from our data, which consist of queries submitted over a period of 4 days. We do not believe this will present any particular problems. If anything, it should mean that our data are a little more robust, as it might weaken the effect that noteworthy, extraordinary news stories on one particular day might have had on the data. Earlier studies cite things like the death of Princess Diana and the September 11th plane crashes as examples of events that can have a huge impact on what people search for. We have controlled for such events in the periods our data were collected, as we mention in section 3.1.

Regarding the data set, it is a log containing approximately 3 million records, each containing the following information:

1. Time of day: measured in hours, minutes, and seconds from midnight of each day as recorded by the AltaVista server
2. User identification: an anonymous user code assigned by the AltaVista server
3. Query terms: terms exactly as entered by the given user

These fields are similar to those in the data that Silverstein et al. (1999) looked at for their study, which is beneficial for achieving relevant comparisons. Compared to the AltaVista logs, our own dataset from Sesam is somewhat more complex, in that it logs more parameters for each query. We believe, however, that this should be beneficial for a comparison between Sesam and AltaVista, as our data are sure to be at least as good as the data used by Jansen et al. (2005).

However, the dataset from the 1998 study differs significantly from the 2002 set in terms of size. While the 2002 set was collected over a single day, the 1998 data were submitted between August 2 and September 13, 1998, and nearly half of the queries were empty. Silverstein et al. (1999) thus based their results on the queries that they reported as non-empty. In our data, too, we see a reduction in the number of queries submitted to Sesam in the most recent time period compared to the first.

Since the transaction logs contain searches performed by humans as well as automated searches, results can become skewed. One would therefore want to remove automated searches in order to get a better set of data (as the goal is to map the users' behavior and intent). This is also an issue in the Sesam data, where there were a lot of searches from the same ID, with the same query text and category, on the same day. Accurately distinguishing automated searches from non-automated searches can be difficult. According to Jansen et al. (2005) you must either choose to ignore the problem (Cacheda & Viña, 2001), or decide on some sort of cut-off (Montgomery & Faloutsos, 2001; Silverstein et al., 1999). Jansen et al. (2005) chose the latter, removing all sessions containing more than 100 queries. For the Sesam data we chose to remove all queries where the same user had repeated the same query on the same day with the same search category.

The results of their studies are summarized in Table 2-4 below. Some of the developments they note are that the percentage of queries using three or more terms increased from 28 % to 49 %, while the percentage of users who modified queries increased from 20 % to 52 %. In general this indicates a development where users interact with the search engine to a larger degree.

There are, however, a few things about this table we do not understand. Firstly, we find it a strange coincidence that they report the same number for sessions with three or more queries as for three or more results pages viewed. In addition, we find it illogical that the number of queries in the 2002 data is the same as the total number of terms. If this were the case, the mean number of terms per query would have been 1, and not 2,92 as they report (which seems more likely). We do not know whether this is down to a typo or a miscalculation.
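The reasoning behind the last point can be sketched in a few lines of Python. This is an illustration only: the function names, the tolerance and the example counts are hypothetical and are not taken from Jansen et al. (2005).

```python
def mean_terms_per_query(total_terms, total_queries):
    """Return the mean number of terms per query implied by two totals."""
    if total_queries <= 0:
        raise ValueError("total_queries must be positive")
    return total_terms / total_queries


def totals_match_mean(total_terms, total_queries, reported_mean, tolerance=0.01):
    """Return True if the reported mean agrees with the reported totals."""
    implied = mean_terms_per_query(total_terms, total_queries)
    return abs(implied - reported_mean) <= tolerance


# Hypothetical counts, for illustration only (not the published figures).
hypothetical_queries = 1000

# Equal totals imply a mean of exactly 1 term per query ...
equal_totals_mean = mean_terms_per_query(1000, hypothetical_queries)

# ... which cannot be reconciled with a reported mean of 2.92.
equal_totals_ok = totals_match_mean(1000, hypothetical_queries, 2.92)

# A total of 2920 terms would be consistent with that mean.
larger_total_ok = totals_match_mean(2920, hypothetical_queries, 2.92)
```

A check of this kind only shows that two reported figures cannot both be right; it cannot tell which of them is the misprint.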

AltaVista 1998 AltaVista 2002
Sessions
Queries
Terms
Unique ,5 %
Total ,0 %
Mean terms per query 2,35 (sd = 1,74) 2,92 (sd = 1,91)
Terms per query
1 term ,8 % ,4 %
2 terms ,0 % ,8 %
3+ terms ,6 % ,5 %
Mean queries per user 2,02 (sd = 123,4) 2,91 (sd = 4,77)
Users modifying queries 34, 416, ,4 % 193,468 52,4 %
Session length
1 query ,6 % ,6 %
2 queries ,5 % ,4 %
3+ queries ,9 % ,0 %
Results pages viewed
1 page ,2 % ,8 %
2 pages ,5 % ,0 %
3+ pages ,3 % ,1 %
Boolean queries and other operators ,4 % ,0 %
Terms not repeated in data set ,6 %
Use of 100 most frequently occurring terms ,9 %
Table 2-4: Results summary Jansen et al. (2005)

Another aspect Jansen et al. (2005) looked at is session length. They found that in 2002, 32 % of the users submitted three or more queries, which is a massive increase from the 6,9 % found in the Silverstein et al. (1999) study. A very interesting observation regarding session length is that the number of one-query sessions decreased from 1998 to 2002, while the occurrence of longer sessions increased. Interestingly, they claim that this development runs counter to other similar studies they have looked at. As examples they mention Spink, Jansen et al. (2002) and Jansen and Spink (2004), which both report a move towards greater simplicity in searching. In trying to explain this, they point out that the number of one-query sessions in AltaVista in 1998 was a lot higher than what was reported in other studies of similar contemporary search engines, while in 2002 this number was a lot more similar. They also speculate that perhaps AltaVista users are just more sophisticated. Finding support for such an argument would probably be difficult, however.

In addition they noticed an increase in the number of users who viewed more than one page of results, something they take as an indication that users are showing an increased persistence in finding the desired results. This is quite contrary to our results, which we comment on further in sections 4.3 and 5.

The table below, taken from Jansen et al. (2005), shows the numbers for session length, in the form of queries per session, from 1998 and 2002 in more detail. However, as we have not been able to identify and define session length in the same way in our dataset, we will not go too deep into this point.

Session length Occurrences Percentages Occurrences Percentages ,6% ,62 % ,5% ,40 % ,4% ,95 % * 4,5% ,35 % ,99 % ,63 % ,80 % ,28 % ,94 % >= ,03 %
*(number for 4 or more)
Table 2-5: Session length

They also look at session duration, i.e. how long a session lasts from start to finish. However, we have not found any good way to identify session duration in our data, and will thus not delve into this.

An interesting observation, which shares certain similarities with queries per session, concerns terms per query. Again they have compiled their results in a table, shown below in table 2-6. Here we can see that there is a slight shift towards using more terms per query in 2002 compared to 1998.

Query length Occurrences Percentages Occurrences Percentages ,6 % 301 0,0 % ,8 % ,4 % ,0 % ,8 % ,0 % ,8 % * 12,6 % ,0 % ,9 % ,5 % ,2 % ,5 % ,5 % >= ,4 %
*(number for 4 or more)
Table 2-6: Query length

This is a development that they claim is in line with two other studies that also examined temporal data, one by Jansen & Spink from 2004 and one by Jansen, Spink et al. As we can see, Jansen and Spink are involved in all three of these temporal studies. This underlines that a limited number of researchers have done most of the existing research in this area. We therefore think it will be interesting to compare these three studies with our study of Sesam. From what we have found, we can see a similar development among Sesam's users, as there is a notable increase in query length between 2005 and 2008.

Regarding the analysis of the terms themselves, the researchers note some patterns. One of their findings was that the terms that occurred most often represented a small percentage of the overall terms used. For example the word 'free', which was the most frequently occurring term, accounted for 0,6 % of all term usage. In addition, they found the variety of terms to be significant, which they took as an indication that AltaVista users had a wide variety of information needs.

They also tried to look at what kinds of topics the different searches belonged to. They did this by qualitatively analyzing a random sample of queries from the 2002 data and placing them in the non-mutually exclusive, general topic categories developed by Spink et al. (2002). Their results are listed in table 2-7 below.

Rank  General topic categories for 2002 (2,603 English queries)  Percentages
1     People, places or things                                    49,3 %
2     Commerce, travel, employment or economy                     12,5 %
3     Computers or Internet or technology items                   12,4 %
4     Health or sciences (physics, math)                          7,5 %
5     Education or humanities                                     5,1 %
6     Entertainment or recreation (music, TV, sports)             4,6 %
7     Sex or pornography                                          3,3 %
8     Society, culture, ethnicity or religion                     3,1 %
9     Government (or military)                                    1,6 %
10    Performing or fine arts (i.e., ballets, plays, etc.)        0,7 %
                                                                  100 %
Table 2-7: General topic categories

An interesting observation regards the number of results pages that each user views. As we can see, there is a slight change where people view more results pages in 2002 compared to 1998. One can speculate as to whether this means that users have become more patient over the years, taking more time to find the best results they can, or rather that the quality of the hits they get has decreased, so that people have to go through a lot more pages of results to be satisfied. Interestingly, we have seen the opposite development in our data, where there has been a significant decrease in the number of results pages viewed. We will get back to these results in section 4.3.

Number of results pages viewed
Results pages viewed Occurrences Percentages Occurrences Percentages ,2 % ,8 % ,5 % ,0 % ,0 % ,6 % * 4,3 % ,5 % ,6 % ,1 % ,6 % ,5 % ,3 % ,4 % > ,6 %
* These numbers and percentages are for results pages of 4 or more
Table 2-8: Results pages viewed

Returning to the authors' initial goals for their research, which we listed at the start of this section, they wanted to identify characteristics of searching on AltaVista, and then to compare their results to results found on other search engines. What they note is an apparent increase in interactivity between users and the search engine, which they see as good news for search engine developers.

They also found the average session duration to be 58 minutes and 10 seconds. However, 81 % of the sessions were shorter than 15 minutes, and almost 72 % shorter than 5 minutes. This shows that the majority of web searching sessions are short, but that some extreme observations have skewed the mean. Since it is not given that the user is at the computer between the submission of each query in the session, the numbers reported need not reflect the actual user interaction. We have not found a good way of identifying the actual time a person spends using the search engine either, and will therefore not dwell further on this point.

In answer to their third research question, regarding changes in the information needs of AltaVista users, they conclude that the range of information needs has apparently broadened, a conclusion based on the high percentage of unique terms and the broad range of topics that the queries are classified under. They say this result is comparable to other studies which also report a trend towards broadening information needs (Jansen & Spink, 2004; Spink, Jansen, et al., 2002).

They end by going through some of the implications and limitations of their study. Since the data consist of real-life queries, they provide a realistic insight into how web users search the internet. In addition, the sample is large and obtained from a popular search engine. They also emphasize that their study provides valuable insight due to the fact that they examine changing trends from one period to the next.
These are the reasons they think their study could make valuable contributions. Insight such as this can help in the design of future search engines.

Among the limitations they acknowledge is that the sample is taken from one day of searching on a single search engine, and thus it might not be representative of the broader population of search engine users. However, in their defense, they point to Jansen and Pooch (2001), who suggest that the characteristics of web searching are consistent across search engines. Another issue is the difference in the definition of a session between their own study and the study of the 1998 data with which they compare themselves. However, they emphasize that they have addressed this issue where necessary throughout their text. In our case, we have both an advantage and a disadvantage here. Since we analyze both sets of data ourselves, our measures will be similar across the data. We do, however, have to take into account that the data from 2005 differ somewhat from the 2008 data in terms of what has been recorded.

2.5 Spink, Wolfram, Jansen & Saracevic (2001)

In the article Searching the Web: The Public and Their Queries, Spink et al. (2001) analyzed over one million queries submitted by users of the Excite search engine on the 16th of September. Some of their findings were that most people use few search terms and few modified queries, and that when they search they visit few Web pages. Users also rarely use available advanced search features, such as Boolean operators. There are a few terms that repeat often, while the majority of search terms are unique. Queries related to entertainment and recreation rank highest.

The authors then compared their own findings to two other large studies of web queries: one based on a sample of queries from the 9th of March 1997, referred to as the 51K study, and an AltaVista study involving queries collected from the 2nd of August to the 13th of September 1998. An issue they point out is the difficulty of comparing different studies with each other, as data definitions and analyses differ to some extent between studies and the metrics are not standardized. Thus they want their comparisons to be taken as comparisons of emerging trends, and not as comparisons of raw numbers.

Of the queries that users posed, 51,8 % were defined as unique, while 38,5 % were repeat queries, and 9,7 % were empty. For the total queries they report the mean number of queries per user session to be 4,86, while the median is 8. Looking only at the unique queries, on the other hand, the mean number of queries is 2,52, while the median is 4. The two other studies report a mean of 2,8 for the 51K study, while for AltaVista the corresponding number is 2,02. Their results are summarized in the following table:

Summary
Number of users
Number of queries (including repeat queries)
Number of unique queries
Number of repeat queries
Number of zero term queries
Mean number of queries per user session 4,86
Median number of queries per user session 8
Mean number of unique queries per user session 2,52
Median number of unique queries per user session 4
Total number of terms (including terms in repeated queries)
Total number of terms (tokens) (excluding terms in repeat queries)
Number of unique terms (types)
Mean number of terms per query (including repeat queries) 2,16
Median number of terms per query (including repeat queries) 2
Mean number of terms per query (excluding repeat queries) 2,4
Median number of terms per query (excluding repeat queries) 2
Table 2-9: Results summary Spink et al. (2001)

Regarding queries per user, 48,4 % of users submitted only a single query, while 20,8 % and 31 % submitted two and three or more queries respectively. These numbers add up to 100,2 %, which we assume is due to a rounding error. In addition they mention that about 1,9 % of users submitted empty queries. They point out that the distribution is very skewed toward the lower end, and that there is a long tail of a small number of users who have submitted a large number of unique queries. The aforementioned 51K study also reports on queries per user, giving 67 %, 19 % and 7 % for the same three groups. As we can see, a larger share of users submitted only one query in that study. Still, the general pattern remains the same, namely that most users submit only one query.
In their study they also look at how users modify their subsequent queries when they submit more than one, most likely on the assumption that the first search did not yield the results the user expected. They check these modifications by examining whether the users used fewer, the same number of, or more terms in their subsequent queries compared to the previous one. They found that in 32,5 % of the modified queries the same number of terms had been used, 41,6 % had added at least one term to their modified search, while 25,9 % had subtracted a term. They also found that 99,2 % of the subsequent queries represented additions or subtractions of five terms or fewer. We will not replicate this in our study, because we lack the means to track specific users' queries. For single users we could manage this, but we are not able to extract this information from all our data. A limitation of the analysis Spink et al. have done is that they cannot know whether the next query submitted by a user was meant to refine a search, or to search for something completely different.

Another feature they look at is the number of results pages viewed (i.e. the number of pages with 10 search results each that users examine). According to the article the median number of pages viewed by each user is 8. However, they also find that 28,6 % of users examined only one page of results, while another 19 % looked at only two pages. This means that 47,6 % of the users looked at two or fewer pages of results. That the median is 8 means that there is an apparent jump in the number of pages viewed after two. Thus at least half the users view 8 or more pages of results. An important lesson from this result is that a lot of people are willing to look through several pages to find the results they want. If this tendency holds, it could imply that firms are less willing to pay in order to be ranked highly on, for example, Google.

They also look at the use of relevance feedback. In Excite, whenever someone pushes the button for More like this, it is registered as a query with zero terms.
They then assume that all the zero-term queries were created in this way, meaning that at most 9,7 % of all queries used this feature of the search engine. They believe this is low, indicating that users either did not find many relevant sites, did not care to pursue further searching for similar sites, or were unfamiliar with the capabilities of this feature. Alternatively, it could indicate that they were simply satisfied with the results. In the 51K study, to which they compare their results, about 5 % of users used relevance feedback.

They also looked at the number of terms per query, finding that the mean number of terms in unique queries was 2,4. The general impression is that web queries are short. They found that 26,6 % of queries had only one term, 31,5 % had two, while 18,2 % had three, meaning that close to 60 % of all submitted queries had two or fewer terms, while less than 1,8 % of queries had seven terms or more. By comparison, in the 51K study and the AltaVista study the mean number of terms per query was 2,32 and 2,35 respectively, while the percentages of searches containing one, two or three terms were 31 %, 31 % and 18 % for the 51K study, and 25,8 %, 26 % and 15 % for AltaVista. As we can see, the trend is roughly the same across these three studies.

The distribution of terms has also been looked at. They found that of the unique terms about 57,1 % were used only once, 14,5 % twice, while 6,1 % were used three times. In other words, the language of web queries shows a large degree of variation.

In their discussion they highlight some of the issues with doing these types of studies. One thing they point out is that, since the only available data are search logs, it is impossible to say anything about how good the results of these queries were, meaning one cannot conclude whether people were able to find what they were looking for. Drawing conclusions about the performance of search engines on this basis is thus difficult, if not impossible. However, the logs do provide a way to look at behavior, which can be beneficial in improving search engines in general. In general, users seem to submit short queries with little in the way of modification. Few of these queries include operators. When people do get results, they seldom bother to look through many of them, rarely opting to browse more than one or two pages of results.

2.6 Spink, Koricich, Jansen & Cole (2004)

An article by Spink, Koricich, Jansen and Cole (2004), called Sexual Information Seeking on Web Search Engines, compares query logs from AltaVista and AlltheWeb.com from 2001, in an attempt to identify differences in sexually related web searching between the users of these two search engines. In different query log studies, terms related to sexuality are always high on the lists of most frequently occurring query terms. Thus a study that looks more closely into this is a valid undertaking. The differences they found were related to session duration, query outcomes and search term choices.

They started out by analyzing sexual information seeking behavior on the web. One of the basic elements in internet searching is the characteristics of the user. One such characteristic is the query the user submits. What they do in this paper is explore sexually related human information searching on the internet through a statistical examination of queries containing sexual terms.

To begin with they mention that some recent studies have begun to examine sexually related information searching on the internet. For example, Goodrum and Spink (2001) found that 25 of the most frequently occurring terms in multimedia-related queries submitted to the Excite commercial Web search engine were clearly sexually related. In their study Searching the Web: The Public and Their Queries, which we have also looked at earlier, Spink et al. (2001) found that even though sexually related queries represent a small portion, less than 5 %, of total searches, about 1 in 4 terms on a list they compiled of the 63 most frequently occurring search terms can be classified as sexual. This indicates that there is a limited selection of terms that people use when looking for information of a sexual nature, while they are far more imaginative when it comes to other kinds of information. In a temporal study, Spink et al. (2002) found that sexually related queries decreased as a proportion of all web queries between data sets from 1997, 1999 and 2001, moving from the second largest category (16,8 %) to the fifth largest, with 8,5 % of searches being of a sexual nature.

The data Spink et al. (2004) look at in Sexual Information Seeking on Web Search Engines are a random sample of 6000 AlltheWeb.com and 5000 AltaVista web queries, extracted from two query logs containing queries from February 2001. Interestingly, while the majority of AltaVista users are from the US, the majority of AlltheWeb.com users are Europeans, from Germany and Norway. Thus these data could be very interesting to compare to our study of Sesam. However, since the article only analyzes queries that were classified as sexually related, the grounds for comparing the findings are not that good in our case. A study using data from different search engines in the same way, but focusing on all types of queries, would be very interesting to compare to our study.

Each of the queries they looked at contained three fields: user ID, time of day and query terms, the latter being the exact terms entered by the users. These data were then qualitatively examined by two researchers to determine the nature of each query.

Their analysis of the AlltheWeb.com query data showed:

- 273 queries in 143 AlltheWeb.com sessions contained searches for sexual material
- Sex was the most frequently occurring term in the AlltheWeb.com sessions
- 273/6000 = 4.5% of AlltheWeb.com queries were identified as sexually related
- Mean of 1.7 terms per sexually related query
- Mean of 1.9 queries per sexually related session
- 19/143 = 13.3% of the sessions included reformulated queries
- 68/143 = 47.6% of the sessions showed the user searching also for non-sexual material
- 9/273 = 3.3% of the sexually related queries were for child pornography. This only included searches that explicitly stated terms for child pornography. Because of the vague nature, queries including the word teen were excluded from this number.

For the AltaVista data they found that:

- Sex was the most frequently occurring term
- 178 queries in 160 sessions were for sexual material
- 178/5000 = 3.5% of the sampled queries were for sexual material
- 3.4 mean terms per sexually related query
- 1.1 mean queries per sexually related session
- 6/160 = 3.7% of the sessions included reformulated queries
- 24/160 = 15% of sexual sessions showed the user searching also for non-sexual material
- 10/178 = 5.6% of sexual queries searched for child pornography. This only included searches that explicitly stated terms for child pornography. Because of the vague nature, queries including the word teen were excluded from this number.

(Listings above are retrieved from Spink, Koricich, Jansen and Cole (2004))

In their discussion they point out that, for the European AlltheWeb.com users, the percentage of searches for sexual material was slightly higher than in the AltaVista data. AltaVista users, on the other hand, used more terms per sexually related query. They have also looked at query reformulation, and found that only 3,7 % of the sexually related search sessions in AltaVista contained any form of query reformulation, while the corresponding number for AlltheWeb.com was 13,3 %. They speculate that this indicates either that AltaVista provided better results, meaning the users found what they were looking for sooner, or that AltaVista users were too impatient and gave up when they could not find what they were looking for. An alternative explanation that we can think of is that, since a large part of AlltheWeb.com's users are Norwegians and Germans, they might have to reformulate their queries after misspelling the English words to begin with, while AltaVista users, being predominantly Americans, perhaps get it right the first time more often.

In their discussion they point out how their study contributes to improving the knowledge base regarding user behavior when using internet search engines. It differentiates itself by comparing results from two different search engines whose primary users differ by geographical location, whereas a lot of other studies have compared results from the same search engine taken from different points in time.

3 Data

In this section we will present how the data was collected and then treated before we did the analysis. We will also present the implications of our data and how we dealt with some of the challenges the data presented.

3.1 Data collection

After deciding on a specific topic for our thesis, we needed to consider how it could be studied. Through contact with Schibsted and Sesam, we were given access to data from the databases containing the query logs that they had collected and stored. We then met with representatives from both companies and discussed what sorts of data it was possible to extract from their databases, and what format these would be in. We decided to retrieve data containing query logs from one week in 2008, and a similar week from 2005 when sesam.no was first established. Using data from two periods would enable us to make a temporal study, and thus to say something about how the usage of Sesam has developed. The data we received were in a flat file format, as illustrated in figure 3-1.

Figure 3-1: Example of the flat data file

To ensure that the data we were going to analyze were not affected by special events that received unusual amounts of media attention, we investigated whether such events had occurred in the two periods our data were collected from. We found no evidence of such events in these periods. This was further supported by a quick search in the files for frequently repeated queries.

The data we collected reported on the following parameters (as we show in section 3.3, there are fewer parameters in the 2005 data):

- Date and time stamp
- Category choice
- Search text
- Offset
- Type of search
- Country, region and municipality

The date and time stamp were given down to seconds. The encrypted IP address consisted of both numbers and letters. Category choice refers to which type of search the user chose to make. This could be chosen from a rollup menu in the search interface (see figure 3-2), a feature that a large number of Sesam's users took advantage of. This represents an opportunity to better identify what the users were searching for. Earlier studies have examined search engines and data which do not offer this opportunity, and were thus forced to make qualified guesses as to what the users were looking for, based on the specific query submitted to the search engine. As one can see from the figure, the options the user could choose between are news, companies, persons, pictures, blogs, TV-guide, web-tv and maps. Not choosing any category would produce a default web search. In 2005 there were slightly fewer categories to choose from, which we will show in section 3.3.

Figure 3-2: The search category choices in Sesam

The offset variable refers to the number of results viewed in the set. These values can thus tell us how many result pages the user chose to look through. The way this parameter works is that if the user only displays the first page of results, the value is set to zero and thus does not appear in the dataset. For every extra result page the user views, the value of the offset parameter increases by ten, which is the number of results shown per page. Since the number refers to the number of results on every additional page viewed, the offset will not be divisible by ten if the user goes through all the result pages and the total number of results is not divisible by ten. For instance, if a search returns 48 hits, the offset parameter will be 0 if the user only views the first page of results, 20 if three pages are viewed, and 38 if the last (fifth) page is viewed, as it will only contain 8 results in this case. We solved this issue by rounding the values up to the closest number divisible by ten. The resulting values could then tell us the number of result pages each user viewed.

3.2 Data cleaning

In this section we will first explain how we treated the data from 2008 before we analyzed it. Then we will look at how we performed the same process for the 2005 data.

To make the data suitable for analysis we had to ensure that it was representative of what we wanted to investigate. Because of inconsistencies in the way the data were arranged in the flat files we received, there were problems converting data from all seven days into a format that would be easy to analyze. We therefore decided to use only data from the first four days of the week in the set from 2008.

Similarly to what was the case in Silverstein et al.'s (1999) study of the search engine AltaVista, about half of all searches submitted to Sesam were empty queries. Since the objective of our study is to analyze how people search and what they search for in a nation-specific search engine, we chose to remove all searches in the log that contained empty queries. Including these would be more interesting if we wanted to look at the technical side of how the search engine worked and performed, or if we were interested in user friendliness and thus wanted to include clicks that were made without any query being typed in.
However, for our purpose, including empty queries would only lower the quality of the data, as they say nothing about what the user intended to search for. In addition, many of the empty query lines are produced automatically when the search engine is instructed to, for instance, open a picture, map or similar. After removing the empty queries, the data was reduced from queries to.

Another challenge with the data was that many queries had been submitted multiple times by the same user. In a significant number of cases, an identical query had been submitted more than 100 times in a row within a very short time frame. This is probably caused by automated searches or by users hitting the search button several times for no obvious reason. It could also be caused by a computer error. To deal with this, we removed all repeated queries where the ID, category choice, date and search text were completely identical. That way we dealt with the problem of searches being repeated unintentionally, without removing queries where people had looked for different results by using different category choices for the same query. In this process, we found that in as many as cases over the course of the four days, people had submitted the same query with two or more different category choices.

After this final step of the data cleaning, we ended up with searches in the set for 2008. Thus searches were empty queries, or repeated queries with the same ID, date and category. This is consistent with what Silverstein et al. (1999) find in their study in terms of the share of empty queries.

The data from 2008 included RSS feed searches as well as normal queries. One could argue that these should not be treated as normal queries in the data analysis, and thus should be removed as well. However, since we have removed repeated queries (with the same ID, date and category choice), and the users are able to use the same category choices when submitting an RSS query, it seems reasonable to include them in the data analysis.

When a user opens an information page on a person or a company, the database produces a search line which is included in the dataset.
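The two cleaning steps for the 2008 data (dropping empty queries, then dropping exact repeats) can be illustrated with a minimal Python sketch. The field names `user_id`, `category`, `date` and `query` are hypothetical stand-ins for the columns of the flat files, the sample rows are invented, and we assume that "date" means the calendar day; the actual processing was done with other tools.

```python
def clean_log(rows):
    """Drop empty queries, then repeats with identical ID, category, date and text.

    `rows` is a list of dicts with the hypothetical keys
    'user_id', 'category', 'date' and 'query'.
    """
    seen = set()
    cleaned = []
    for row in rows:
        query = row["query"].strip()
        if not query:
            continue  # empty query: says nothing about user intent
        key = (row["user_id"], row["category"], row["date"], query)
        if key in seen:
            continue  # exact repeat, e.g. an automated or accidental resubmission
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Invented sample rows, for illustration only.
sample = [
    {"user_id": "a1", "category": "news", "date": "day1", "query": "oslo"},
    {"user_id": "a1", "category": "news", "date": "day1", "query": "oslo"},
    {"user_id": "a1", "category": "web", "date": "day1", "query": "oslo"},
    {"user_id": "b2", "category": "pictures", "date": "day1", "query": "  "},
]
print([row["category"] for row in clean_log(sample)])
```

Because the category is part of the key, the same query text submitted under a different category survives the cleaning, which is the behavior the text asks for.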
We have chosen to include these as searches in the analysis, meaning they are shown in the tables and included in the total percentages in the results section (section 4). The reason is that this makes it easier to see the relationship between queries on companies and persons and the number of information pages opened for the corresponding categories. It may also be that some of these search lines are produced when these pages are opened directly, and we did not want to remove these from our analysis.

The 2005 data had to be treated differently. This dataset does not contain all the columns of information that the 2008 set does. There is no ID in the 2005 data, which makes it impossible to remove repeated queries without also removing queries from many different users and simply ending up with a list of unique queries. However, the problem of queries being repeated an unnatural number of times is not present in the same way in the 2005 data. Nor is there a problem with empty queries, as the data were registered in a different manner for this period. For instance, when someone opened a picture in 2005, this was not registered as a search in the database, while in 2008 it would appear as a search with an empty query.

One concern could be that the data from 2005 is from a period very shortly after the search engine was launched, which could imply that the users had not yet familiarized themselves with the functionality. However, as the option to choose categories seems to have been used extensively, this does not seem to be a problem.

The number of searches included in our data for 2005 was. For this set we chose to extract data from three days. We limited the size of the dataset in this way for practical reasons during the analysis. Since we are dealing with means and percentages, this does not reduce the quality of our comparison of the two periods. Furthermore, the statistical tests we perform test for significant changes while correcting for the difference in the sizes of the two datasets.

Since the interface of Sesam has been altered in the period we are dealing with, one could argue that this would affect how users interact with the engine, thus making a comparison less valid.
However, the search engine has kept the same profile throughout the period, and has had the same functionality with the exception of some added options among the search categories one can choose from. Therefore, the changes made to the appearance of Sesam do not represent a problem for our analysis. As for the TV-guide, web-TV, map and blog search categories, these new categories probably contribute to part of the change in the use of the other categories, along with changes in user behavior and user intent. However, the extent to which these new categories are used does not weaken the basis for comparing the data from 2005 and 2008, as they do not represent a large percentage of the total choices.

Finally, one could say that since we cannot remove identical queries made on the same day, by the same person and with the same category choice, the 2005 data does not represent user intent as well as the 2008 data. We argue, though, that since the problem with massively repeated queries is not present in 2005, and the data otherwise give us the information we are looking for, it is reasonable to compare the query complexity and use of the search engine between these two periods.

3.3 Data descriptives

After cleaning the data we ended up with two data sets, one for 2005 and one for 2008. As table 3-1 shows, we have searches in our dataset for 2005, and in 2008.

Table 3-1: Size of the cleaned data sets

The queries in 2008 were submitted by unique users, which is the same number of users we had before we cleaned the dataset. Thus no users were lost by accident in the process. Of the total queries, were unique queries. By unique queries we mean the number of different queries made: if, for instance, two users have submitted the same query during the period in which the data were collected, it is only counted once as a unique query. Since we have no parameter containing a user ID in the data for 2005, we cannot provide this number for that period.

The data from 2005 and 2008 report on the following parameters:

The parameters report on what we explained in section 3.1. The different options for category choices are shown in table 3-2.

| Options | 2005 | 2008 |
| --- | --- | --- |
| Default web search | Yes | Yes |
| Company search | Yes | Yes |
| News search | Yes | Yes |
| Picture search | Yes | Yes |
| Personal information search | Yes | Yes |
| Open personal info | Yes | Yes |
| Open company info | Yes | Yes |
| TV-guide search | - | Yes |
| Map search | - | Yes |
| Blog search | - | Yes |
| Web-TV search | - | Yes |

Table 3-2: The search categories in 2005 and 2008

3.4 Data analysis

To be able to analyze the data we coded all variables into numerical formats. For the data analysis we used Excel and SPSS to retrieve descriptive statistics such as means and medians for the two sets of observations. To test whether the changes that occurred between 2005 and 2008 were significant and not caused by random variation, we used t-tests and a program called Zigne. Zigne is a program developed by Bernt Aardahl and Frode Berglund, and is used for testing whether changes between two sets of observations are significant or caused by random effects in the population. The main part of this program is based on percentages, as it was designed primarily with election research in mind. For the tests we conducted on the other values, we therefore used standard t-tests. What the program does in practice is to test whether changes between different populations are significant or caused by random events, while correcting for different population sizes.

4 Results

In this section we present the results of the data analysis. We first present the findings for 2008 and then for 2005. We have chosen this order as it fits best with how we explain our findings. In section 4.3 we present the development between the two periods we have investigated, and the results of the statistical tests.

4.1 Results for 2008

Out of all the queries in our dataset, 22,5 % were submitted as default web searches, that is, without choosing any specific search category. One could also refer to this as a standard open web search, as the user gives no specific guidelines on what kind of results he or she desires. This shows that the opportunity to choose a search category is being used extensively. See table 4-1 for the distribution between the different categories.

| Search category | Share of searches |
| --- | --- |
| Default web search | 22,56 % |
| Company search | 12,76 % |
| News search | 16,63 % |
| TV-guide search | 1,25 % |
| Picture search | 2,14 % |
| Map search | 2,74 % |
| Personal information search | 4,25 % |
| Web-TV search | 0,37 % |
| Blog search | 1,42 % |
| Other | 35,88 % |
| Total | 100,00 % |

Table 4-1: Search category distribution 2008 a

As we can see, the majority of searches are default web searches, searches for company information or news searches. The line with other searches includes partner searches, and searches registered when people have opened a page with information on a company, person or map.

| Search category | Share of searches |
| --- | --- |
| Open company info | 9,96 % |
| Open personal info | 2,13 % |
| Open map | 0,54 % |
| Partner searches | 23,25 % |
| Total "other" | 35,88 % |

Table 4-2: Search category distribution 2008 b

Partner searches are queries submitted to Sesam through a search box on other web pages. On aftenposten.no, for instance, there was a box in the upper corner where one could perform a web search in Sesam. Today Aftenposten's site offers the opportunity to choose between some of the available categories when doing a Sesam search, but to our knowledge most partner searches are default web searches. Therefore, these should be counted as default web searches, resulting in a total of 45,81 % (22,56 % + 23,25 %) default web searches in our dataset.

From table 4-1 and 4-2 we can also see that 78 % (9,96/12,76 %) of all company searches resulted in a company information page being viewed, while the number is only 50 % (2,13/4,25 %) for searches on personal information and personal information pages. This could be caused by the larger number of similar names for people compared to companies, where names are more distinctive. As a result, people searching for companies are able to find the correct result in fewer attempts, reducing the need for narrowing down the search or re-typing the name and submitting a new query.

The queries were submitted from 123 different countries. Table 4-3 shows the top ten countries and the distribution between them.
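The derived figures in this part of section 4.1 can be re-computed directly from the percentages in tables 4-1 and 4-2. The short Python sketch below is only a check of that arithmetic; the variable and function names are our own. Note that the sum of default web searches and partner searches equals the 45,81 % used for 2008 in table 4-14.

```python
# Percentages taken from tables 4-1 and 4-2 (2008).
default_web = 22.56
partner_searches = 23.25
company_search, open_company_info = 12.76, 9.96
person_search, open_person_info = 4.25, 2.13

# Partner searches counted as default web searches.
combined_default = default_web + partner_searches


def opened_per_search(opened_share, search_share):
    """Opened information pages as a percentage of searches in that category."""
    return 100 * opened_share / search_share


print(round(combined_default, 2))
print(round(opened_per_search(open_company_info, company_search)))
print(round(opened_per_search(open_person_info, person_search)))
```

The ratio compares two shares of the same total, so the total number of queries cancels out and is not needed for the calculation.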

| Country | Queries | Percent |
| --- | --- | --- |
| Norway | | ,69 % |
| Sweden | | ,16 % |
| United States | | ,15 % |
| Denmark | | ,74 % |
| Netherlands | | ,49 % |
| United Kingdom | | ,48 % |
| Germany | | ,45 % |
| Spain | 785 | 0,29 % |
| Italy | 533 | 0,20 % |
| China | 182 | 0,07 % |

Table 4-3: Distribution between nations

As we can see, the clear majority of queries are submitted from Norway. Looking at the queries submitted from other countries, almost all of them appear to be written in Norwegian, which makes it likely that they are submitted by Norwegians staying or living abroad. This indicates that the data are representative for investigating user behavior in a nation-specific search engine. If the queries had been scattered throughout the world and a significant share had been in English, the data would have been less appropriate for our research.

The query complexity shows both similarities to and differences from what previous studies have found. The average query length is slightly lower than what both Spink et al. (2000) and Jansen et al. (2005) found in their studies. The median lengths, however, are the same. Here, as in the other studies, more than half of all queries consist of one or two terms.

| Measure | 2008 |
| --- | --- |
| Average query length | 2,0739 |
| Median query length | 2 |
| Mean queries per user | 2,5083 |
| Mean queries per user per day | 0,6271 |

Table 4-4: Query complexity 2008

The average query length is just over 2, which is also the median. This indicates that most users submit rather simple queries. A significant share of the queries also consist of 8 consecutive digits, and thus appear to be Norwegian telephone numbers. These are counted as single-term queries. Over the four days we investigated, the mean number of queries per user was 2,5, which gives an average of 0,63 queries per day for each user. This indicates that most users do not use Sesam on a daily basis, especially since many of the users have submitted significantly more queries within one day. Table 4-5 shows the distribution between the different numbers of query terms.

| Number of terms | Number of cases | Percent |
| --- | --- | --- |
| 1 | | 35,14 % |
| 2 | | 36,42 % |
| 3 | | 19,65 % |
| 4 | | 5,50 % |
| 5 | | ,23 % |
| 6 | | ,61 % |
| 7 | | ,23 % |
| 8 | | ,07 % |
| 9 | | ,03 % |
| 10 | | ,02 % |
| >10 | | ,09 % |

Table 4-5: Distribution between number of terms 2008

As we can see, queries with one or two terms account for more than 70 % of all queries, while more than 90 % of all queries are covered if we include searches with up to three terms. The highest number of terms was 28. If we only count query lengths that occurred more than 100 times, the highest would be 12.

When it comes to Boolean operators, such as "and", "or", "+" and "-", there is very little use of these in Sesam. The occurrence of these words and signs in the queries is close to zero. In 2008, these terms and symbols occur no more than twice in total. The search engine's support for such operators has varied somewhat over time, but "and" and "or" have been supported and are still used only to a negligible extent. This also applies to the 2005 results. Spink et al. (2005) also report little use of advanced search features. Here, however, the use is negligible.

The number of result pages viewed is even more heavily concentrated on the first result page than in previous studies. We comment further on this in section 5. The highest number is 123 pages.
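The query complexity measures in this section (terms per query, their distribution, and queries per user per day) can be sketched as follows. This is a minimal illustration, assuming that terms are separated by whitespace so that an 8-digit telephone number counts as one term; the function names and the sample queries are invented for the example and are not taken from the Sesam log.

```python
from collections import Counter
from statistics import mean, median


def term_count(query):
    """Number of whitespace-separated terms in a query (assumed definition)."""
    return len(query.split())


def term_distribution(queries):
    """Percentage of queries for each query length."""
    counts = Counter(term_count(q) for q in queries)
    return {n: 100 * counts[n] / len(queries) for n in sorted(counts)}


def queries_per_user_per_day(mean_per_user, days):
    """Spread the mean number of queries per user evenly over the period."""
    return mean_per_user / days


# Invented sample queries, for illustration only.
sample_queries = ["vg", "været oslo", "22334455", "billige flybilletter til london"]
lengths = [term_count(q) for q in sample_queries]
print(mean(lengths), median(lengths))
print(term_distribution(sample_queries))

# Figures from table 4-4: 2,5083 queries per user over four days.
print(round(queries_per_user_per_day(2.5083, 4), 4))
```

Dividing the per-user mean by the number of days reproduces the 0,6271 reported in table 4-4, which shows that the figure is an average over the whole period and not a measure of daily activity for any individual user.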

| Number of pages | Cases | Percent |
| --- | --- | --- |
| 1 | | 98,00 % |
| 2 | | 1,61 % |
| 3 | | 0,22 % |
| 4 | | 0,02 % |
| 5 | | 0,01 % |
| 6 | | 0,01 % |
| 7 | | 0,01 % |
| 8 | | 0,01 % |
| 9 | | 0,01 % |
| 10 | | 0,04 % |
| >10 | | 0,03 % |

Table 4-6: Distribution between number of result pages viewed 2008

| Measure | 2008 |
| --- | --- |
| Mean result pages viewed | 1,0453 |
| Median result pages viewed | 1 |

Table 4-7: Mean and median result pages 2008

As the tables show, there is a clear tendency for users to view only the first result page and not move on. If they do view more than the first page, the probability is 80,5 % that they will only view one extra page (1,61 % out of the 2,00 % who view more than one page). In earlier studies, more than 70 % view only one page of results, and around half of those who view more than one page view two pages. This underlines the importance of achieving high page ranks if one is seeking to target customers or other groups on the internet.

4.2 Results for 2005

Table 4-8 shows how the queries submitted by Sesam's users were divided between the different category choices in 2005. In the table we see that there are fewer category choices available than in 2008.

| Search category | Share of searches |
| --- | --- |
| Default web search | 56,18 % |
| Company search | 3,27 % |
| News search | 14,48 % |
| Picture search | 17,55 % |
| Personal information search | 5,39 % |
| Open personal info | 1,76 % |
| Open company info | 1,38 % |

Table 4-8: Search category distribution 2005

As we can see, the percentage of default web searches is higher than in 2008. We do not know for certain whether this includes partner searches, or whether these were not registered, because the 2005 data did not carry any specific tag on partner searches. However, we find it likely that they are registered under default web searches, as the partner pages in general only offered this option. The share of personal information searches that resulted in a viewed information page was 32,65 % (1,76/5,39 %), which is lower than in 2008. The same applies to the company searches, where the number is 42,20 % (1,38/3,27 %).

The query complexity has also developed between 2005 and 2008.

| Measure | 2005 |
| --- | --- |
| Mean query length | 1,7969 |
| Median query length | 2 |

Table 4-9: Mean and median query length 2005

As we see from table 4-9, the mean query length is lower than in 2008. However, the median query length is the same, and as table 4-10 shows, one only needs to include queries with one or two terms in order to cover more than 80 % of all queries made. Queries with five or fewer terms account for as much as 99,31 % of the queries. The longest query in the 2005 set was 14 terms.

| Number of terms | Number of cases | Percent |
| --- | --- | --- |
| 1 | | 45,86 % |
| 2 | | 36,41 % |
| 3 | | 12,79 % |
| 4 | | 3,22 % |
| 5 | | 1,03 % |
| 6 | | 0,39 % |
| 7 | | 0,17 % |
| 8 | | 0,06 % |
| 9 | | 0,03 % |
| 10 | | 0,01 % |
| >10 | | 0,01 % |

Table 4-10: Distribution between number of terms used 2005

The number of result pages viewed was clearly higher in 2005 than in 2008. As table 4-11 shows, the mean number of result pages the user went through was 4,2819 in 2005, while in 2008 it was as low as 1,0453.

| Measure | 2005 |
| --- | --- |
| Mean result pages viewed | 4,2819 |
| Median result pages viewed | 1 |

Table 4-11: Mean and median result pages 2005

The vast majority only viewed one page of results in 2005 as well, but these only account for 73,77 % of the queries compared to 98,00 % in 2008. The highest number of result pages viewed in 2005 is 952. It is important to bear in mind, for both 2005 and 2008, that this number does not necessarily imply that the user looked through all pages; the user may have skipped some of the pages along the way.

| Number of pages | Cases | Percent |
| --- | --- | --- |
| 1 | | 73,77 % |
| 2 | | 5,47 % |
| 3 | | 4,50 % |
| 4 | | 2,83 % |
| 5 | | 2,05 % |
| 6 | | 1,51 % |
| 7 | | 1,19 % |
| 8 | | 0,43 % |
| 9 | | 0,91 % |
| 10 | | 0,91 % |
| >10 | | 6,31 % |

Table 4-12: Distribution between number of result pages viewed 2005

4.3 Comparison

In this section we compare the data to see whether there have been significant changes in the usage patterns among the Sesam users.

| | 2005 | 2008 | Change | Significant |
| --- | --- | --- | --- | --- |
| Average query length | 1,7969 | 2,0739 | 0,2770 | t=11,844 |
| 1 term | 45,86 % | 35,14 % | -10,72 % | 1 % conf |
| 2 terms | 36,41 % | 36,42 % | 0,01 % | Not sig. |
| 3 terms | 12,79 % | 19,65 % | 6,86 % | 1 % conf |
| 4 terms | 3,22 % | 5,50 % | 2,28 % | 1 % conf |

Table 4-13: Development in term usage

When comparing the use of the different category choices, we have included partner searches in the default searches for 2008. We do this as they have been submitted as default searches from the partner pages in the same way as if the user had typed the query at Sesam's homepage.

| | 2005 | 2008 | Change | Significant |
| --- | --- | --- | --- | --- |
| Default web search | 56,18 % | 45,81 % | -10,37 % | 1 % conf |
| Company search | 3,27 % | 12,76 % | 9,49 % | 1 % conf |
| News search | 14,48 % | 16,63 % | 2,15 % | 1 % conf |
| Picture search | 17,55 % | 2,14 % | -15,41 % | 1 % conf |
| Personal information search | 5,39 % | 4,25 % | -1,14 % | 1 % conf |
| Open personal info | 1,76 % | 2,13 % | 0,37 % | 1 % conf |
| Open company info | 1,38 % | 9,96 % | 8,58 % | 1 % conf |
| TV-guide search | | 1,25 % | | |
| Map search | | 2,74 % | | |
| Blog search | | 1,42 % | | |
| Web-TV search | | 0,37 % | | |

Table 4-14: Development in the use of search categories

A caveat here is that the 2005 data does not include TV-guide, map, blog or web-TV searches, as these options were not available. The test takes the sizes of the two datasets into account, so the changes in the use of the options included in table 4-14 are still significant, but they may be attributed to the introduction of new options as well as to users using the existing options in a different manner. In that regard all changes are significant, which is natural when we are operating with such large data sets.

Table 4-15 and 4-16 show the development in the number of result pages the users chose to view after performing a query.

| | 2005 | 2008 | Change | Significant |
| --- | --- | --- | --- | --- |
| Mean result pages viewed | 4,2819 | 1,0453 | -3,2366 | t=64,213 |

Table 4-15: Development in result pages
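To illustrate what a significance test on percentages that corrects for different sample sizes can look like, the sketch below uses a standard pooled two-proportion z-test. This is a stand-in chosen for illustration: we do not know how Zigne is implemented internally, and the sample sizes used here are hypothetical placeholders, not the sizes of the actual data sets.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / standard_error


def two_sided_p_value(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical sample sizes; the real ones are given in table 3-1.
n_2005, n_2008 = 100_000, 100_000

# Shares of one-term and two-term queries from table 4-13.
for label, p_2005, p_2008 in [("1 term", 0.4586, 0.3514), ("2 terms", 0.3641, 0.3642)]:
    z = two_proportion_z(p_2005, n_2005, p_2008, n_2008)
    print(label, round(z, 2), two_sided_p_value(z) < 0.01)
```

With samples of this order of magnitude, a change of ten percentage points is far beyond the 1 % level while a change of 0,01 percentage points is not, which matches the pattern in table 4-13. It also shows why almost any visible change becomes significant in very large data sets.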

| Number of result pages | 2005 | 2008 | Change | Significant |
| --- | --- | --- | --- | --- |
| 1 | 73,77 % | 98,00 % | 24,22 % | 1 % conf |
| 2 | 5,47 % | 1,61 % | -3,86 % | 1 % conf |
| 3 | 4,50 % | 0,22 % | -4,29 % | 1 % conf |
| 4 | 2,83 % | 0,02 % | -2,81 % | 1 % conf |
| 5 | 2,05 % | 0,01 % | -2,03 % | 1 % conf |
| 6 | 1,51 % | 0,01 % | -1,50 % | 1 % conf |
| 7 | 1,19 % | 0,01 % | -1,17 % | 1 % conf |
| 8 | 0,43 % | 0,01 % | -0,42 % | 1 % conf |
| 9 | 0,91 % | 0,01 % | -0,90 % | 1 % conf |
| 10 | 0,91 % | 0,04 % | -0,87 % | 1 % conf |
| >10 | 6,31 % | 0,03 % | -6,28 % | 1 % conf |

Table 4-16: Development in the distribution between result pages viewed

As we can see from the table, there has been a very significant shift towards fewer result pages viewed. The change is especially noticeable in the share that views only one result page and among those who view more than ten. Interestingly, this is quite contrary to what Jansen et al. (2005) report in their study, where the share of users that viewed more than one page of results increased.

5 Discussion

As the results from our analysis show, there have been significant changes in the usage pattern of Sesam. Not all of these changes are major, nor are all of them interesting to discuss, but some aspects deserve a closer look.

First of all, a clear tendency is that a smaller share of searches are default searches, or open web searches. This could indicate that today's users of Sesam appreciate the opportunity to choose a narrower search category, and it could also be a sign that those using Sesam are looking for something more specific than the average user of a search engine that only offers default web searches. There could be many reasons for this development, however. Firstly, fewer queries were submitted to Sesam in 2008 than in 2005. Since we do not have an ID variable for 2005, there is no way of checking whether this is caused by fewer users or simply by users now choosing other search engines for some of their web searching. One theory is thus that users have started choosing, or gone back to, other search engines for queries that would fit typical default searches, while using Sesam for more specific searches for Norwegian material, such as news. Another explanation could be that Sesam has lost many of its users, and that those who remain are those who appreciated and used the category options the most, or who have adopted them to a higher degree over time.

The clear increase in the share of news and company searches supports an increase in searches oriented towards retrieving information of current interest in Sesam. These are searches clearly targeted at finding results that are specifically relevant for Norway. Most of these queries would fall into the category of navigational searching according to Jansen et al. (2007).

We find the increase in company searches especially interesting because of the implications it can have for services like the yellow pages. In Norway, companies spend a great deal on marketing themselves through Gule Sider, a directory similar to the yellow pages that is available both on the internet and as a printed catalogue. If a trend develops where a web searching service can deliver the same quality of hits on different companies, there could be room for a disruptive development (Christensen, 1997) if the users move towards using this service. This requires, of course, that the users find the service just as satisfying to use. But given a situation where the search service can find relevant hits and sort them as satisfyingly as the internet catalogue can, it could be that this service would manage to gather more hits, as companies have lower expenses related to being present here than in the catalogue. If so, the search service could over time perform better than the catalogues.

The share of picture searches has been reduced very significantly.
We find it likely that a picture search is often done with a particular motif or theme in mind (for instance a person, destination or situation), and that the user would thus benefit from retrieving results from as many sources as possible, also outside Norway. A probable explanation for the reduction in picture searches could thus be that users choose other search engines, which cover the entire web, when performing that kind of search.

We were surprised to find a significant decrease in the share of searches on personal information. However, we note that there is an increase in the percentage for opened personal information. This means that the share of searches on personal information is reduced, while opened personal information pages account for a larger share. The finding indicates that users to a larger extent find the person they are looking for when submitting a query with this option. The reduction in personal information queries could therefore probably be explained by a smaller need for correcting the search or re-typing the name. We have estimated the average query length in this category specifically, and the numbers are 1,8952 in 2005 and 2,1344 in 2008. This fits with the general increase in average query length, and is at the same time an indication that people are submitting more detailed queries than in 2005, which could explain the decrease in queries combined with an increase in information pages opened.

The general increase in average query length could also be caused by Sesam losing the users who mostly perform simple exploratory searches with single-term queries. Such a theory would support an assumption that Sesam is used more by those seeking specific information and material like news and catalogue listings.

The users chose to view substantially fewer result pages on average in 2008 than in 2005. There could be several reasons for this. One explanation could be improved web searching skills, enabling the users to find what they search for on their first try more often. The increase in query length and in the use of search categories other than default web search supports such an explanation. We also find it worth mentioning that the very significant reduction in picture searches could be an important explanatory factor. When people search for pictures it is likely that they skim through more pages, as looking for a photo of interest takes less time than reading through other types of results.
The reduction in these searches will thus contribute to reducing the number of result pages viewed. Another notable fact in this regard is that AltaVista has throughout this time also had picture search as one of its few search categories. This could help explain why studies of that search engine report much higher numbers of result pages viewed per query (Spink et al., 2000; Jansen et al., 2005).

For those seeking to reach customers or other groups through the web, the development we just discussed strengthens the importance of high page ranks if one wishes to be visible in search engine results. By page rank we mean how high up in the result pages a web page is listed. When people usually do not view more than one page, companies and advertisers will not reach their targets through web searching services unless they find ways to achieve high page ranks. This not only entails a challenge for these actors, but also an opportunity for the search service providers, as their top-ranked spots increase in value.

An important question in this discussion is whether the results can say something about differences between the type of search engine that Sesam represents and international engines with a general and global focus, like Google. We believe that our results support the assumption that different segments exist for web searching services. This does not have to entail that different people will choose different search engines, but that search engines can choose a narrower scope and be attractive for certain uses, as well as aiming at becoming a contender in the general search engine market. One person might, for example, use different search engines depending on what kind of information he or she is looking for. In the case of Sesam, we see that the users are moving towards more specific web searching and use the search engine more for retrieving information relevant within Norway specifically, including news and information about people and companies.

As we have mentioned, there has been a decline in the number of queries submitted since the launch of the search engine. In addition to indicating a trend where people use other search engines for general web searching, this could also support the theory that Sesam has gained a position where it is used for a narrower part of people's general web searching.
In that case, Sesam has moved into a niche position (Porter, 1980) in the web searching market. As people become more advanced in their web searching behavior, and the amount of information available on the World Wide Web keeps increasing, it is not unlikely that search engines with a smaller and more specific focus could gain good strategic positions.

6 Directions for future research

There are several aspects that would be interesting to study within this field to improve the knowledge of how web searching services are used as a tool for different tasks and to cover different information needs. One area we find particularly interesting is people's use of different search engines. If different search engines are used for covering different information needs, it would be valuable to understand how individuals choose between them. It would also be interesting to see whether people tend to choose a search engine that covers their most frequent needs and then use it for all web searching, or whether they vary between the different service providers. Here too it could be beneficial to look at trends and how this develops over time.

Conducting a study similar to our own on a comparable search engine in a different nation would make it possible to compare results and see whether there are grounds for generalizing the findings. Another interesting project would be to conduct a lab test of how people act when confronted with certain online information retrieval tasks. That way one could control exactly what the users are seeking, and see how they go about retrieving the information.

7 Conclusion

In this thesis we have investigated the use, and development, of the Norwegian web searching service Sesam. Our findings show a significant change towards more complex use of the service, and more precise demands on the results. We see that the service seems to be used more for specific information retrieval than search engines with an international focus, and the results thus suggest that there is room for taking different positions when providing web searching services.

8 References

Cacheda & Viña. Experiences retrieving information in the World Wide Web. Proceedings of the 6th IEEE Symposium on Computers and Communications (pp ). Hammamet, Tunisia.

Chau, Fang and Yang. Web Searching in Chinese: A Study of a Search Engine in Hong Kong, in Journal of The American Society for Information Science and Technology, 58(7);

Christensen, Clayton M. The Innovator's Dilemma: When New Technology Cause Great Firms to Fail, The Management of Innovation and Change Series, Boston Mass: Harvard Business School Press

ecommerce-guide.com. Online Purchases from Home Top 56 Million. (retrieved )

Global Reach. (2004). Global Internet statistics. (Retrieved March 10, 2007)

Goodrum, A., & Spink, A. Image searching on the Excite web search engine, in Information Processing and Management. 37:

Jansen, B.J., & Pooch, U. Web user studies: A review and framework for future work, in Journal of the American Society of Information Science and Technology, 52(3),

Jansen, Bernard J., Booth, Danielle L. & Spink, Amanda. Determining the User Intent of Web Search Engine Queries. WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM /07/0005.

Jansen, Bernard J., Spink, Amanda & Pedersen, Jan. A Temporal Comparison of AltaVista Web Searching, in Journal of the American Society for Information Science and Technology, 56(6): ,

Jansen, B.J., & Spink, A. An analysis of Web searching by European AlltheWeb.Com Users, in Information Processing and Management, 41(6),

Jansen, B.J., Spink, Bateman, J., & Saracevic, T. Real life information retrieval: A study of user queries on the Web. ACM SIGIR Forum, 32(1),

Montgomery & Faloutsos. Identifying Web browsing trends and patterns. IEEE Computer, 34(7),

Porter, Michael E. Competitive Strategy, Free Press, New York 1980, p34-46, 13p

Seoconsultants. Top Ten Search Engines. (retrieved )

Silverstein, C., Henzinger, M., Marais, H., & Moricz, M. Analysis of a very large Web search engine query log. ACM SIGIR Forum, 33(1),

Spink, Koricich, Jansen and Cole. Sexual Information Seeking on Web Search Engines, in Cyber Psychology & Behavior, Volume 7, number 1, Mary Ann Liebert, Inc.

Spink, A., Jansen, B.J., Wolfram, D. et al. From e-sex to e-commerce: web search changes, IEEE Computer 35:

Spink, A., Ozmutlu, S., Ozmutlu, H.C., & Jansen, B.J. U.S. versus European Web searching trends. SIGIR Forum, 32(1),

Spink, A., Wolfram, D., Jansen, B.J., & Saracevic, T. Searching the Web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3),

Attachment 1: Preliminary Thesis

Preliminary thesis
Student IDs:
How do people interact with a national specific search engine? A study of Sesam.no
GRA Strategy Specialisation
Nils-Marius Sørli and Ole Martin Kjørstad
Supervisor: Ingunn Myrtveit
Hand in date:

An Analysis of Document Viewing Patterns of Web Search Engine Users

An Analysis of Document Viewing Patterns of Web Search Engine Users An Analysis of Document Viewing Patterns of Web Search Engine Users Bernard J. Jansen School of Information Sciences and Technology The Pennsylvania State University 2P Thomas Building University Park

More information

MONSTERS AT THE GATE: WHEN SOFTBOTS VISIT WEB SEARCH ENGINES

MONSTERS AT THE GATE: WHEN SOFTBOTS VISIT WEB SEARCH ENGINES MONSTERS AT THE GATE: WHEN SOFTBOTS VISIT WEB SEARCH ENGINES Bernard J. Jansen and Amanda S. Spink School of Information Sciences and Technology The Pennsylvania State University University Park, PA, 16801,

More information

Repeat Visits to Vivisimo.com: Implications for Successive Web Searching

Repeat Visits to Vivisimo.com: Implications for Successive Web Searching Repeat Visits to Vivisimo.com: Implications for Successive Web Searching Bernard J. Jansen School of Information Sciences and Technology, The Pennsylvania State University, 329F IST Building, University

More information

Query Modifications Patterns During Web Searching

Query Modifications Patterns During Web Searching Bernard J. Jansen The Pennsylvania State University jjansen@ist.psu.edu Query Modifications Patterns During Web Searching Amanda Spink Queensland University of Technology ah.spink@qut.edu.au Bhuva Narayan

More information

Mining the Query Logs of a Chinese Web Search Engine for Character Usage Analysis

Mining the Query Logs of a Chinese Web Search Engine for Character Usage Analysis Mining the Query Logs of a Chinese Web Search Engine for Character Usage Analysis Yan Lu School of Business The University of Hong Kong Pokfulam, Hong Kong isabellu@business.hku.hk Michael Chau School

More information

Review of. Amanda Spink. and her work in. Web Searching and Retrieval,

Review of. Amanda Spink. and her work in. Web Searching and Retrieval, Review of Amanda Spink and her work in Web Searching and Retrieval, 1997-2004 Larry Reeve for Dr. McCain INFO861, Winter 2004 Term Project Table of Contents Background of Spink 2 Web Search and Retrieval

More information

Web Searcher Interactions with Multiple Federate Content Collections

Web Searcher Interactions with Multiple Federate Content Collections Web Searcher Interactions with Multiple Federate Content Collections Amanda Spink Faculty of IT Queensland University of Technology QLD 4001 Australia ah.spink@qut.edu.au Bernard J. Jansen School of IST

More information

COVER SHEET. Accessed from Copyright 2003 Elsevier.

COVER SHEET. Accessed from   Copyright 2003 Elsevier. COVER SHEET Ozmutlu, Seda and Spink, Amanda and Ozmutlu, Huseyin C. (2003) Multimedia web searching trends: 1997-2001. Information Processing and Management 39(4):pp. 611-621. Accessed from http://eprints.qut.edu.au

More information

Using Clusters on the Vivisimo Web Search Engine

Using Clusters on the Vivisimo Web Search Engine Using Clusters on the Vivisimo Web Search Engine Sherry Koshman and Amanda Spink School of Information Sciences University of Pittsburgh 135 N. Bellefield Ave., Pittsburgh, PA 15237 skoshman@sis.pitt.edu,

More information

How App Ratings and Reviews Impact Rank on Google Play and the App Store

How App Ratings and Reviews Impact Rank on Google Play and the App Store APP STORE OPTIMIZATION MASTERCLASS How App Ratings and Reviews Impact Rank on Google Play and the App Store BIG APPS GET BIG RATINGS 13,927 AVERAGE NUMBER OF RATINGS FOR TOP-RATED IOS APPS 196,833 AVERAGE

More information

To scope this project, we selected three top-tier biomedical journals that publish systematic reviews, hoping that they had a higher standard of

To scope this project, we selected three top-tier biomedical journals that publish systematic reviews, hoping that they had a higher standard of 1 Here we aim to answer the question: Does searching more databases add value to the systematic review. Especially when considering the time it takes for the ENTIRE process, the resources available like

More information

A Query-Level Examination of End User Searching Behaviour on the Excite Search Engine. Dietmar Wolfram

A Query-Level Examination of End User Searching Behaviour on the Excite Search Engine. Dietmar Wolfram A Query-Level Examination of End User Searching Behaviour on the Excite Search Engine Dietmar Wolfram University of Wisconsin Milwaukee Abstract This study presents an analysis of selected characteristics

More information

Just as computer-based Web search has been a

Just as computer-based Web search has been a C O V E R F E A T U R E Deciphering Trends In Mobile Search Maryam Kamvar and Shumeet Baluja Google Understanding the needs of mobile search will help improve the user experience and increase the service

More information

Whitepaper Spain SEO Ranking Factors 2012

Whitepaper Spain SEO Ranking Factors 2012 Whitepaper Spain SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics GmbH Greifswalder Straße 212 10405 Berlin Phone: +49-30-3229535-0 Fax: +49-30-3229535-99 E-Mail: info@searchmetrics.com

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Good Technology State of BYOD Report

Good Technology State of BYOD Report Good Technology State of BYOD Report New data finds Finance and Healthcare industries dominate BYOD picture and that users are willing to pay device and service plan costs if they can use their own devices

More information

SOME TYPES AND USES OF DATA MODELS

SOME TYPES AND USES OF DATA MODELS 3 SOME TYPES AND USES OF DATA MODELS CHAPTER OUTLINE 3.1 Different Types of Data Models 23 3.1.1 Physical Data Model 24 3.1.2 Logical Data Model 24 3.1.3 Conceptual Data Model 25 3.1.4 Canonical Data Model

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

SEARCHMETRICS WHITEPAPER RANKING FACTORS Targeted Analysis for more Success on Google and in your Online Market

SEARCHMETRICS WHITEPAPER RANKING FACTORS Targeted Analysis for more Success on Google and in your Online Market 2018 SEARCHMETRICS WHITEPAPER RANKING FACTORS 2018 Targeted for more Success on Google and in your Online Market Table of Contents Introduction: Why ranking factors for niches?... 3 Methodology: Which

More information

Whitepaper US SEO Ranking Factors 2012

Whitepaper US SEO Ranking Factors 2012 Whitepaper US SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics Inc. 1115 Broadway 12th Floor, Room 1213 New York, NY 10010 Phone: 1 866-411-9494 E-Mail: sales-us@searchmetrics.com

More information

Memorandum Participants Method

Memorandum Participants Method Memorandum To: Elizabeth Pass, Associate Professor, School of Writing, Rhetoric and Technical Communication From: Andrew Carnes, WRTC 456 Section 1[ADC] Date: February 2, 2016 Re: Project 1 Competitor

More information

Finding Nutrition Information on the Web: Coverage vs. Authority

Finding Nutrition Information on the Web: Coverage vs. Authority Finding Nutrition Information on the Web: Coverage vs. Authority Susan G. Doran Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208.Sue_doran@yahoo.com Samuel

More information

Internet Usage Transaction Log Studies: The Next Generation

Internet Usage Transaction Log Studies: The Next Generation Internet Usage Transaction Log Studies: The Next Generation Sponsored by SIG USE Dietmar Wolfram, Moderator. School of Information Studies, University of Wisconsin-Milwaukee Milwaukee, WI 53201. dwolfram@uwm.edu

More information

Whitepaper Italy SEO Ranking Factors 2012

Whitepaper Italy SEO Ranking Factors 2012 Whitepaper Italy SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics GmbH Greifswalder Straße 212 10405 Berlin Phone: +49-30-3229535-0 Fax: +49-30-3229535-99 E-Mail: info@searchmetrics.com

More information

3 Graphical Displays of Data

3 Graphical Displays of Data 3 Graphical Displays of Data Reading: SW Chapter 2, Sections 1-6 Summarizing and Displaying Qualitative Data The data below are from a study of thyroid cancer, using NMTR data. The investigators looked

More information

2005 University of California Undergraduate Experience Survey

2005 University of California Undergraduate Experience Survey 2005 University of California Undergraduate Experience Survey This year's survey has three parts: I. Time use and overall satisfaction II. Your background III. Rotating modules PART I: TIME USE and OVERALL

More information

A Tale of Two Studies Establishing Google & Bing Click-Through Rates

A Tale of Two Studies Establishing Google & Bing Click-Through Rates A Tale of Two Studies Establishing Google & Bing Click-Through Rates Behavioral Study by Slingshot SEO, Inc. using client data from January 2011 to August 2011 What s Inside 1. Introduction 2. Definition

More information

Building an ASP.NET Website

Building an ASP.NET Website In this book we are going to build a content-based ASP.NET website. This website will consist of a number of modules, which will all fit together to produce the finished product. We will build each module

More information

GOOGLE S MOST-SEARCHED ONLINE PRODUCTS AND SERVICES JULY Perfect Search Media

GOOGLE S MOST-SEARCHED ONLINE PRODUCTS AND SERVICES JULY Perfect Search Media GOOGLE S MOST-SEARCHED ONLINE PRODUCTS AND SERVICES JULY 2013 Perfect Search Media INTRODUCTION This study began with the word online. This report exclusively focuses on keyword queries typed into the

More information

USERS INTERACTIONS WITH THE EXCITE WEB SEARCH ENGINE: A QUERY REFORMULATION AND RELEVANCE FEEDBACK ANALYSIS

USERS INTERACTIONS WITH THE EXCITE WEB SEARCH ENGINE: A QUERY REFORMULATION AND RELEVANCE FEEDBACK ANALYSIS USERS INTERACTIONS WITH THE EXCITE WEB SEARCH ENGINE: A QUERY REFORMULATION AND RELEVANCE FEEDBACK ANALYSIS Amanda Spink, Carol Chang & Agnes Goz School of Library and Information Sciences University of

More information

2013 Association Marketing Benchmark Report

2013 Association  Marketing Benchmark Report 2013 Association Email Marketing Benchmark Report Part I: Key Metrics 1 TABLE of CONTENTS About Informz.... 3 Introduction.... 4 Key Findings.... 5 Overall Association Metrics... 6 Results by Country of

More information

CICS insights from IT professionals revealed

CICS insights from IT professionals revealed CICS insights from IT professionals revealed A CICS survey analysis report from: IBM, CICS, and z/os are registered trademarks of International Business Machines Corporation in the United States, other

More information

Counting daily bridge users

Counting daily bridge users Counting daily bridge users Karsten Loesing karsten@torproject.org Tor Tech Report 212-1-1 October 24, 212 Abstract As part of the Tor Metrics Project, we want to learn how many people use the Tor network

More information

A large scale study of European mobile search behaviour. Church, Karen; Smyth, Barry; Bradley, Keith; Cotter, Paul

A large scale study of European mobile search behaviour. Church, Karen; Smyth, Barry; Bradley, Keith; Cotter, Paul Provided by the author(s) and University College Dublin Library in accordance with publisher policies. Please cite the published version when available. Title A large scale study of European mobile search

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Cultural Analysis of Video-Sharing Websites. Why are locally developed South Korean video-sharing sites used instead of

Cultural Analysis of Video-Sharing Websites. Why are locally developed South Korean video-sharing sites used instead of 1 Page Cultural Analysis of Video-Sharing Websites Jordan Caudill Vincent Rowold Shane Tang Samantha Merritt (Mentor) Research Question: Why are locally developed South Korean video-sharing sites used

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

Imperva Incapsula Survey: What DDoS Attacks Really Cost Businesses

Imperva Incapsula Survey: What DDoS Attacks Really Cost Businesses Survey Imperva Incapsula Survey: What DDoS Attacks Really Cost Businesses BY: TIM MATTHEWS 2016, Imperva, Inc. All rights reserved. Imperva and the Imperva logo are trademarks of Imperva, Inc. Contents

More information

An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database

An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database An Analysis of Image Retrieval Behavior for Metadata Type and Google Image Database Toru Fukumoto Canon Inc., JAPAN fukumoto.toru@canon.co.jp Abstract: A large number of digital images are stored on the

More information

Overview On Methods Of Searching The Web

Overview On Methods Of Searching The Web Overview On Methods Of Searching The Web Introduction World Wide Web (WWW) is the ultimate source of information. It has taken over the books, newspaper, and any other paper based material. It has become

More information

Physically-Based Laser Simulation

Physically-Based Laser Simulation Physically-Based Laser Simulation Greg Reshko Carnegie Mellon University reshko@cs.cmu.edu Dave Mowatt Carnegie Mellon University dmowatt@andrew.cmu.edu Abstract In this paper, we describe our work on

More information

Introduction to and calibration of a conceptual LUTI model based on neural networks

Introduction to and calibration of a conceptual LUTI model based on neural networks Urban Transport 591 Introduction to and calibration of a conceptual LUTI model based on neural networks F. Tillema & M. F. A. M. van Maarseveen Centre for transport studies, Civil Engineering, University

More information

CSC105, Introduction to Computer Science I. Introduction and Background. search service Web directories search engines Web Directories database

CSC105, Introduction to Computer Science I. Introduction and Background. search service Web directories search engines Web Directories database CSC105, Introduction to Computer Science Lab02: Web Searching and Search Services I. Introduction and Background. The World Wide Web is often likened to a global electronic library of information. Such

More information

File Size Distribution on UNIX Systems Then and Now

File Size Distribution on UNIX Systems Then and Now File Size Distribution on UNIX Systems Then and Now Andrew S. Tanenbaum, Jorrit N. Herder*, Herbert Bos Dept. of Computer Science Vrije Universiteit Amsterdam, The Netherlands {ast@cs.vu.nl, jnherder@cs.vu.nl,

More information

Identifying user behavior in domain-specific repositories

Identifying user behavior in domain-specific repositories Information Services & Use 34 (2014) 249 258 249 DOI 10.3233/ISU-140745 IOS Press Identifying user behavior in domain-specific repositories Wilko van Hoek, Wei Shen and Philipp Mayr GESIS Leibniz Institute

More information

The Mobile Movement Understanding Smartphone Users. Google/IPSOS OTX MediaCT U.S., April 2011

The Mobile Movement Understanding Smartphone Users. Google/IPSOS OTX MediaCT U.S., April 2011 The Mobile Movement Understanding Smartphone Users Google/IPSOS OTX MediaCT U.S., April 2011 Research Objectives Gain a deep understanding of smartphone consumer behavior, specifically with regard to:

More information

Browser Wars : Battles of Standards

Browser Wars : Battles of Standards Browser Wars : Battles of Standards CHARDONNEAU Innovation and Knowledge Management European Master in Business Studies 25/04/2009 Browser Wars Battles of Standards, Microsoft versus Netscape Topic : One

More information

INSPIRE and SPIRES Log File Analysis

INSPIRE and SPIRES Log File Analysis INSPIRE and SPIRES Log File Analysis Cole Adams Science Undergraduate Laboratory Internship Program Wheaton College SLAC National Accelerator Laboratory August 5, 2011 Prepared in partial fulfillment of

More information

Mobile Information Access: A Study of Emerging Search Behavior on the Mobile Internet

Mobile Information Access: A Study of Emerging Search Behavior on the Mobile Internet Mobile Information Access: A Study of Emerging Search Behavior on the Mobile Internet KAREN CHURCH and BARRY SMYTH University College Dublin and PAUL COTTER and KEITH BRADLEY ChangingWorlds It is likely

More information

Ovid Technologies, Inc. Databases

Ovid Technologies, Inc. Databases Physical Therapy Workshop. August 10, 2001, 10:00 a.m. 12:30 p.m. Guide No. 1. Search terms: Diabetes Mellitus and Skin. Ovid Technologies, Inc. Databases ACCESS TO THE OVID DATABASES You must first go

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

Chapter 3: Google Penguin, Panda, & Hummingbird

Chapter 3: Google Penguin, Panda, & Hummingbird Chapter 3: Google Penguin, Panda, & Hummingbird Search engine algorithms are based on a simple premise: searchers want an answer to their queries. For any search, there are hundreds or thousands of sites

More information

Extraction of Semantic Text Portion Related to Anchor Link

Extraction of Semantic Text Portion Related to Anchor Link 1834 IEICE TRANS. INF. & SYST., VOL.E89 D, NO.6 JUNE 2006 PAPER Special Section on Human Communication II Extraction of Semantic Text Portion Related to Anchor Link Bui Quang HUNG a), Masanori OTSUBO,

More information

This session will provide an overview of the research resources and strategies that can be used when conducting business research.

This session will provide an overview of the research resources and strategies that can be used when conducting business research. Welcome! This session will provide an overview of the research resources and strategies that can be used when conducting business research. Many of these research tips will also be applicable to courses

More information

Headings: Academic Libraries. Database Management. Database Searching. Electronic Information Resource Searching Evaluation. Web Portals.

Headings: Academic Libraries. Database Management. Database Searching. Electronic Information Resource Searching Evaluation. Web Portals. Erin R. Holmes. Reimagining the E-Research by Discipline Portal. A Master s Project for the M.S. in IS degree. April, 2014. 20 pages. Advisor: Emily King This project presents recommendations and wireframes

More information

CS155a: E-Commerce. Lecture 21: November 29, 2001 Portals

CS155a: E-Commerce. Lecture 21: November 29, 2001 Portals CS155a: E-Commerce Lecture 21: November 29, 2001 Portals Today s Class Course-evaluation forms Continue discussion of Google Portals End-of-term announcements Yahoo: An Internet Portal Full Name: Yahoo!,

More information

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. Database Management System Dr. S. Srinath Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 3 Relational Model Hello everyone, we have been looking into

More information

INTERNET PORTALS DEFINITION OF PORTAL

INTERNET PORTALS DEFINITION OF PORTAL INTERNET PORTALS In order to gain an understanding of Internet portals, it is important to understand the role they play in e-commerce. What value-added services do they offer the customer? To the supplier?

More information

CPSC 320 Sample Solution, Playing with Graphs!

CPSC 320 Sample Solution, Playing with Graphs! CPSC 320 Sample Solution, Playing with Graphs! September 23, 2017 Today we practice reasoning about graphs by playing with two new terms. These terms/concepts are useful in themselves but not tremendously

More information

Finding a needle in Haystack: Facebook's photo storage

Finding a needle in Haystack: Facebook's photo storage Finding a needle in Haystack: Facebook's photo storage The paper is written at facebook and describes a object storage system called Haystack. Since facebook processes a lot of photos (20 petabytes total,

More information

A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS

A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS Fidel Cacheda, Francisco Puentes, Victor Carneiro Department of Information and Communications Technologies, University of A

More information

In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the

In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the relationships between concepts. And we discussed common

More information

Q THE RISE OF MOBILE AND TABLET VIDEO GLOBAL VIDEO INDEX LONG-FORM VIDEO CONTINUES TO ENGAGE LIVE VIDEO DOMINATES ON-DEMAND MEDIA

Q THE RISE OF MOBILE AND TABLET VIDEO GLOBAL VIDEO INDEX LONG-FORM VIDEO CONTINUES TO ENGAGE LIVE VIDEO DOMINATES ON-DEMAND MEDIA THE RISE OF MOBILE AND TABLET VIDEO LONG-FORM VIDEO CONTINUES TO ENGAGE LIVE VIDEO DOMINATES ON-DEMAND MEDIA Q3 2013 GLOBAL VIDEO INDEX TABLE OF CONTENTS Executive Summary...3 The Rise of Mobile and Tablet

More information

Ananta: Cloud Scale Load Balancing. Nitish Paradkar, Zaina Hamid. EECS 589 Paper Review

Ananta: Cloud Scale Load Balancing. Nitish Paradkar, Zaina Hamid. EECS 589 Paper Review Ananta: Cloud Scale Load Balancing Nitish Paradkar, Zaina Hamid EECS 589 Paper Review 1 Full Reference Patel, P. et al., " Ananta: Cloud Scale Load Balancing," Proc. of ACM SIGCOMM '13, 43(4):207-218,

More information

Analyzing Spotify s Business

Analyzing Spotify s Business University of California, Berkeley INFO 234 Information Technology Economics, Policy, and Strategy Analyzing Spotify s Business Author: Snaheth Thumathy Supervisors: John Chuang Andy Brooks May 8, 2017

More information

Up in the Air: The state of cloud adoption in local government in 2016

Up in the Air: The state of cloud adoption in local government in 2016 Up in the Air: The state of cloud adoption in local government in 2016 Introduction When a Cloud First policy was announced by the Government Digital Service in 2013, the expectation was that from that

More information

The ebuilders Guide to selecting a Web Designer

The ebuilders Guide to selecting a Web Designer The ebuilders Guide to selecting a Web Designer With the following short guide we hope to give you and your business a better grasp of how to select a web designer. We also include a short explanation

More information

HOW TO USE THE INTERNET TO FIND THE PROSTATE CANCER INFORMATION YOU WANT

HOW TO USE THE INTERNET TO FIND THE PROSTATE CANCER INFORMATION YOU WANT 1 HOW TO USE THE INTERNET TO FIND THE PROSTATE CANCER INFORMATION YOU WANT (TIPS FOR EVERYONE EVEN IF YOU DON T OWN A COMPUTER ) by Robert Young Many feel they are unable to access prostate cancer information

More information

Computers and iphones and Mobile Phones, oh my!

Computers and iphones and Mobile Phones, oh my! Computers and iphones and Mobile Phones, oh my! A logs-based comparison of search users on different devices. Maryam Kamvar Melanie Kellar Rajan Patel Ya Xu Google, Inc Google, Inc Google, Inc Department

More information

Free Google Keyword Tool Alternatives

Free Google Keyword Tool Alternatives cloudincome.com http://www.cloudincome.com/google-keyword-tool-alternatives/ Free Google Keyword Tool Alternatives In August 2013 we saw the Google Keyword Tool as we know it, cease to exist. It s has

More information

Implications of Post-NCSC Project Scenarios for Future Test Development

Implications of Post-NCSC Project Scenarios for Future Test Development Implications of Post-NCSC Project Scenarios for Future Test Development Brian Gong Center for Assessment All rights reserved. Any or all portions of this document may be used to support additional study

More information

HOLIDAY HOT SHEET N O V E M B E R 6,

HOLIDAY HOT SHEET N O V E M B E R 6, HOLIDAY HOT SHEET NOVEMBER 6, 2013 2013 Holiday hot sheet: weekly insights for the holiday marketer As marketers seek to connect with their customers during the largest consumer spending season of the

More information

Wrapper: An Application for Evaluating Exploratory Searching Outside of the Lab

Wrapper: An Application for Evaluating Exploratory Searching Outside of the Lab Wrapper: An Application for Evaluating Exploratory Searching Outside of the Lab Bernard J Jansen College of Information Sciences and Technology The Pennsylvania State University University Park PA 16802

More information

A Preliminary Investigation into the Search Behaviour of Users in a Collection of Digitized Broadcast Audio

A Preliminary Investigation into the Search Behaviour of Users in a Collection of Digitized Broadcast Audio A Preliminary Investigation into the Search Behaviour of Users in a Collection of Digitized Broadcast Audio Haakon Lund 1, Mette Skov 2, Birger Larsen 2 and Marianne Lykke 2 1 Royal School of Library and

More information

Notebook Assignments

Notebook Assignments Notebook Assignments These six assignments are a notebook using techniques from class in the single concrete context of graph theory. This is supplemental to your usual assignments, and is designed for

More information

Today s Topics. Percentile ranks and percentiles. Standardized scores. Using standardized scores to estimate percentiles

Today s Topics. Percentile ranks and percentiles. Standardized scores. Using standardized scores to estimate percentiles Today s Topics Percentile ranks and percentiles Standardized scores Using standardized scores to estimate percentiles Using µ and σ x to learn about percentiles Percentiles, standardized scores, and the

More information

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites

An Empirical Evaluation of User Interfaces for Topic Management of Web Sites An Empirical Evaluation of User Interfaces for Topic Management of Web Sites Brian Amento AT&T Labs - Research 180 Park Avenue, P.O. Box 971 Florham Park, NJ 07932 USA brian@research.att.com ABSTRACT Topic

More information

The influence of caching on web usage mining

The influence of caching on web usage mining The influence of caching on web usage mining J. Huysmans 1, B. Baesens 1,2 & J. Vanthienen 1 1 Department of Applied Economic Sciences, K.U.Leuven, Belgium 2 School of Management, University of Southampton,

More information

Healthcare Information and Management Systems Society HIMSS. U.S. Healthcare Industry Quarterly HIPAA Compliance Survey Results: Summer 2002

Healthcare Information and Management Systems Society HIMSS. U.S. Healthcare Industry Quarterly HIPAA Compliance Survey Results: Summer 2002 Healthcare Information and Management Systems Society HIMSS U.S. Healthcare Industry Quarterly HIPAA Compliance Survey Results: Summer 2002 HIMSS / Phoenix Health Systems Healthcare Industry Quarterly

More information

Websites of different companies

Websites of different companies Websites of different companies In this presentation I aim to present two competing companies websites for the client. The client s company is Lightning games, and the two competing sites will also be

More information

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log

More information

Testing is a very big and important topic when it comes to software development. Testing has a number of aspects that need to be considered.

Testing is a very big and important topic when it comes to software development. Testing has a number of aspects that need to be considered. Testing Testing is a very big and important topic when it comes to software development. Testing has a number of aspects that need to be considered. System stability is the system going to crash or not?

More information

Selected Members of the CCL-EAR Committee Review of EBSCO S MASTERFILE PREMIER September, 2002

Selected Members of the CCL-EAR Committee Review of EBSCO S MASTERFILE PREMIER September, 2002 Selected Members of the CCL-EAR Committee Review of EBSCO S MASTERFILE PREMIER September, 2002 In April 2002, selected members of the Council of Chief Librarians Electronic Access and Resources Committee

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Business Process Outsourcing

Business Process Outsourcing Business Process Outsourcing Copyright 2012-2014, AdminBetter Inc. LIST BUILDING SERVICES Table of Contents Introduction To List Building Services... 3 A Note On Ballbark Pricing... 3 Types Of List Building

More information

Excel Basics Rice Digital Media Commons Guide Written for Microsoft Excel 2010 Windows Edition by Eric Miller

Excel Basics Rice Digital Media Commons Guide Written for Microsoft Excel 2010 Windows Edition by Eric Miller Excel Basics Rice Digital Media Commons Guide Written for Microsoft Excel 2010 Windows Edition by Eric Miller Table of Contents Introduction!... 1 Part 1: Entering Data!... 2 1.a: Typing!... 2 1.b: Editing

Lecture 3: Linear Classification

Roger Grosse. 1 Introduction. Last week, we saw an example of a learning task called regression. There, the goal was to predict a scalar-valued target from a set of features.

GUIDELINES FOR MASTER OF SCIENCE INTERNSHIP THESIS

Dear Participant of the MScIS Program, If you have chosen to follow an internship, one of the requirements is to write a Thesis. This document gives you

Information Retrieval CSCI

Information Retrieval CSCI 4141-6403. My name is Anwar Alhenshiri. My email is: anwar@cs.dal.ca. I prefer: aalhenshiri@gmail.com. The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012

Graph Structure Over Time

Observing how time alters the structure of the IEEE data set. Priti Kumar, Computer Science, Rensselaer Polytechnic Institute, Troy, NY, Kumarp3@rpi.edu. Abstract: This paper examines

Guide to Google Analytics: Admin Settings. Campaigns - Written by Sarah Stemen Account Manager. 5 things you need to know hanapinmarketing.

Guide to Google Analytics: Google's Enhanced Admin Settings. Written by Sarah Stemen, Account Manager. Campaigns - 5 things you need to know. INTRODUCTION: Google Analytics is vital to gaining business insights

Introduction. But what about some of the lesser known SEO techniques?

When it comes to determining what the best SEO techniques are for your inbound marketing campaign, the most basic strategies aren't that tough to figure out. If you've been blogging or marketing

(Refer Slide Time: 01:25)

Computer Architecture, Prof. Anshul Kumar, Department of Computer Science and Engineering, Indian Institute of Technology, Delhi. Lecture - 32, Memory Hierarchy: Virtual Memory (contd.). We have discussed virtual

CE4031 and CZ4031 Database System Principles

Course CE/CZ4031 Database System Principles CE/CZ21 Algorithms; CZ27 Introduction to Databases CZ433 Advanced Data Management (not offered currently) Lectures

Direct Variations DIRECT AND INVERSE VARIATIONS 19. Name

Of the many relationships that two variables can have, one category is called a direct variation. Use the description and example of direct variation

FACETs. Technical Report 05/19/2010

F3 FACETs Technical Report 05/19/2010. PROJECT OVERVIEW; BASIC REQUIREMENTS; CONSTRAINTS; DEVELOPMENT PROCESS; PLANNED/ACTUAL SCHEDULE; SYSTEM DESIGN; PRODUCT AND PROCESS METRICS

Categorizing Migrations

What to Migrate? A version control repository contains two distinct types of data. The first type of data is the actual content of the directories and files themselves which are

CSI5387: Data Mining Project

Terri Oda, April 14, 2008. 1 Introduction. Web pages have become more like applications than documents. Not only do they provide dynamic content, they also allow users to play

Two-dimensional Totalistic Code 52

Todd Rowland, Senior Research Associate, Wolfram Research, Inc., 100 Trade Center Drive, Champaign, IL. The totalistic two-dimensional cellular automaton code 52 is capable

Analysing Search Trends

Data Mining in Business Intelligence, 7 March 2013, Ben-Gurion University. Yair Shimshoni, Google R&D center, Tel-Aviv. shimsh@google.com. Outline: What are search trends? The Google
