Chapter-IV WEBOMETRICS

4.1-Introduction

Webometrics is the quantitative analysis of web phenomena, drawing upon informetric methods and typically addressing problems related to bibliometrics. Webometrics was triggered by the realization that the web is an enormous document repository, many of these documents being academic-related. Moreover, the web has its own citation indexes in the form of commercial search engines, and so it is ready for researchers to exploit. In fact, several major search engines can also deliver their results automatically to investigators' computer programs, allowing large-scale investigations. One of the most visible outputs of webometrics is the ranking of world universities based upon their web sites and online impact. Webometrics includes link analysis, web citation analysis, search engine evaluation and purely descriptive studies of the web. These are reviewed below, in addition to one recent application: the analysis of Web 2.0 phenomena. Note that there is also some research into developing web-based metrics for web sites to evaluate various aspects of their construction, such as usability and information content, but this will not be reviewed here. Since the mid-1990s, increasing efforts have been made to investigate the nature and properties of the World Wide Web, called simply the Web in this chapter, by applying modern informetric methodologies to its space of contents, link structures, and search engines. Studies of the Web have been named "webometrics" by Almind and Ingwersen (1997) or "cybermetrics", as in the electronic journal of that name (1997). This chapter attempts to point to selected areas of webometric research that demonstrate interesting progress and room for development, as well as to some currently less promising areas. The contribution is not an exhaustive review, but rather a view on the specialty.
Webometrics displays several similarities to informetric and scientometric studies and applies common bibliometric methods. For instance, simplistic counts and content analysis of web pages resemble traditional publication analysis; counts and analyses of outgoing links from web pages, here named outlinks, and of links pointing to web pages, called inlinks, can be

seen as reference and citation analyses, respectively. Outlinks and inlinks are then similar to references and citations, respectively, in scientific articles. However, due to its dynamic and distributed nature, the Web often demonstrates web pages simultaneously linking to each other, a case not possible in the traditional paper-based citation world. The coverage of search engines of the total Web can be investigated in the same way as the coverage of domain and citation databases in the total document landscape, and possible overlaps between engines detected. Since the Web consists of contributions from anyone who wishes to contribute, the quality of information or knowledge value is opaque due to the lack of peer reviewing; but citation-like link analyses may reveal clusters of sites worth reviewing. Patterns of Web search behavior can be investigated as in traditional information seeking studies. Issue tracking on the Web is carried out and knowledge discovery attempts are made, similar to common data or text mining in administrative or textual (bibliographic) databases. Since the Web is an information space quite different from the common scientific or professional databases, the similarities mentioned above may sometimes be superficial. For example, we do not know for sure why people on the Web link up to other pages. There exists no convention of citation as in the scientific world. Further, time plays a different role on the Web. On the other hand, because the Web is a highly complex conglomerate of all types of information carriers produced by all kinds of people and searched by all kinds of users, it is tempting to investigate; and informetrics indeed offers some methodologies to start from.
However, one must be aware that, as with online application of the ISI citation databases, for instance by means of the Dialog command language, data collection on the Web depends on the retrieval features of the various search engines and web robots. Prior to the appearance of the "set postings on" command feature in Dialog during the 1990s, online citation counts were not possible; one would have to download all the citing documents to be analyzed locally for the actual number of citations within the ISI-defined information space. At present this is exactly the case in most Web engines, as demonstrated by Rousseau (1997; 1999). The engines do not index the entire Web, their overlaps are not substantial (Lawrence & Giles, 1998), and their retrieval features are too simplistic for extensive webometric analyses online.

Webometrics, the quantitative study of web-related phenomena, originated in the realization that methods originally designed for bibliometric analysis of scientific journal article citation patterns could be applied to the Web, with commercial search engines providing the raw data. Almind and Ingwersen (1997) defined the discipline and gave it its name, although the basic issue had been identified simultaneously by Rodriguez Gairin (1997) and was pursued in Spain by Aguillo (1998). Larson (1996) is also a pioneer with his early exploratory link structure analysis, as is Rousseau (1997) with the first pure informetric analysis of the Web. We interpret webometrics in a broad sense encompassing research from disciplines outside of Information Science, such as Communication Studies, Statistical Physics and Computer Science. In this review we will concentrate on types of link analysis but also cover other webometric areas that Information Scientists have been involved with, including web log file analysis. One theme that runs through this chapter is the messiness of web data and the need for heuristics to cleanse it. This is a problem even at the most basic level of defining the Web. The uncontrolled Web creates numerous problems in the interpretation of results, for instance from the automatic creation or replication of links and deliberately misleading publishing. The loose connection between the apparent usage of top-level domains and their actual content is also a frustrating problem, for example with the extensive non-commercial content hosted on .com sites. Indeed, a skeptical researcher could claim that obstacles of this kind are so great that all web analyses have little value. As will be seen below, one response to this perspective - also a recurrent theme for critics of evaluative bibliometrics - is to demonstrate significant correlation statistics to prove that information is present.
A practical response has been to develop increasingly sophisticated data cleansing strategies and multiple data analysis techniques. The immense importance of the Web to scholars and the wider society means that it is essential to build an understanding of it, however difficult. This review is split into four parts: basic concepts and methods; scholarly communication on the Web; general and commercial web use; and topological modelling and mining of the Web. As a new field based around analyzing a new data source, methods of collecting and processing the data have been prominent in many studies. The second part, scholarly communication on the Web, is predominantly

concerned with using link analysis to identify patterns in academic or scholarly web spaces. Almost all of these studies have direct analogies in traditional bibliometrics, and have drawn from this area a concern with developing effective methods and validating results, the latter being an issue of particular concern on the Web. A key question that still does not have a satisfactory answer is how to interpret counts of links to academic web spaces. For example, if one university web site attracts double the links of another, what conclusions should be drawn? The general and commercial web use section reviews link analysis studies that have used techniques similar to those applied to academic web spaces. Some have origins in Social Network Analysis rather than Information Science, producing an interesting complementary perspective. The section also includes quantitative studies of the size of the 'whole' Web and web server log analysis. The final section, topological modelling and mining of the Web, covers mathematical approaches to modelling the growth of the Web or its internal link structure, mostly the product of Computer Science and Statistical Physics research. It culminates with an exciting new information science contribution to this area, providing detailed interpretations of small-world linking phenomena.

4.2-Definition

The origin of Webometrics can be found in the field of Information Science. Thelwall, Vaughan and Bjorneborn (2005) point out that the discipline "emerged from the realization that methods originally designed for bibliometric analysis of scientific journal article citation patterns could be applied to the Web, with commercial search engines providing the raw data". In fact, the idea that a link pointing to a webpage represents a 'vote' for that webpage or document is based on bibliometric methods to rank scientific production (Garfield, 1979).
The term Webometrics was first coined by Tomas Almind and Peter Ingwersen (1997) and seems to be widely accepted by the research community, together with the term Cybermetrics. Bjorneborn (2004) defined both terms by delimiting their research areas. Webometrics is "the study of the quantitative aspects of the construction and use of information resources, structures and technologies on the Web drawing on bibliometric and informetric approaches" (Bjorneborn & Ingwersen, 2004), while Cybermetrics does the same but for the whole Internet. Hence, Cybermetrics is more

focused on the study of non web-based Internet phenomena, e.g., e-mails, chat, newsgroup studies, etc. Recent developments within the field suggest a move in the scope of the definition towards a more general social science research approach instead of an approach that is mainly based on an informetric and bibliometric perspective. Thelwall (2009) defines Webometrics as "the study of web-based content with primarily quantitative methods for social science research goals using techniques that are not specific to one field of study". Interdisciplinary research is becoming more significant, enlarging both the types of subjects studied and the techniques used. This evolution aligns with the definition of Internet research given by Hine (2008): "Internet research itself is not a discipline but an interdiscipline, a field or a research network populated by heterogeneous perspectives."

4.3-Basic concepts

Bjorneborn and Ingwersen (2004) carried out the first attempt to develop a consistent terminology for the webometric field. Some years later, Thelwall and Wilkinson (2008) proposed a generic lexical framework that, building on the previous work, intended to unify and extend existing methods through abstract notions of link lists and URL lists.

4.4-The Entire Web

Letters in the diagram represent any type of document on the Web, whether a webpage or a website, for instance. The following basic webometric terms from Bjorneborn and Ingwersen are discussed below.
Inlink: B has an inlink from A.
Outlink: A has an outlink to B.
Self-link: C has a self-link.
Isolated page or site: K is isolated, as it has neither inlinks nor outlinks.
Reciprocal links: I and J have reciprocal links.
Transversal link: A has a transversal outlink to H. This type refers to a link that joins two different areas of the Web that are not well interconnected.
Co-inlinks: 1 and 4 have a co-inlink, as B links to both of them.
Co-outlinks: 1 and 3 have co-outlinks, as both link to G.
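These link relations can be made concrete with a small, hypothetical link graph in Python. The node names below mirror the terms above, but the graph itself is invented for illustration:

```python
from itertools import combinations

# Hypothetical web graph: each key is a page, each value the set of
# pages it links to (its outlinks).
outlinks = {
    "A": {"B", "H"},   # A outlinks to B, and transversally to H
    "B": {"1", "4"},   # B links to both 1 and 4
    "C": {"C"},        # C has a self-link
    "I": {"J"},
    "J": {"I"},        # I and J have reciprocal links
    "K": set(),        # K is isolated: no inlinks, no outlinks
    "1": {"G"},
    "3": {"G"},        # 1 and 3 have co-outlinks: both link to G
}

# Inlinks are found by inverting the outlink mapping.
inlinks = {}
for src, targets in outlinks.items():
    for t in targets:
        inlinks.setdefault(t, set()).add(src)

# Co-inlinked pairs: pages that share at least one linking source.
co_inlinked = {
    frozenset(pair)
    for targets in outlinks.values()
    for pair in combinations(sorted(targets), 2)
}

print(inlinks.get("B"))                      # {'A'}
print(frozenset({"1", "4"}) in co_inlinked)  # True: B links to both
```

The inversion step shows why inlink counts are harder to obtain in practice than outlink counts: outlinks can be read from a page itself, whereas inlinks require knowledge of every other page, which is why search engines or crawlers are needed.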

4.5-History of Webometrics

The information science field of webometrics is "the study of the quantitative aspects of the construction and use of information resources, structures and technologies on the web drawing on bibliometric and informetric approaches" or, more generally, "the study of web-based content with primarily quantitative methods for social science research goals using techniques that are not specific to one field of study". While the former definition emphasizes the informetric heritage of many bibliometric methods, the latter focuses on the value that webometrics could provide to the wider social sciences, reflecting a shift in webometrics over time from more theoretical studies to more applied studies, though retaining an emphasis on methods development. Webometrics currently provides a range of methods and software for various kinds of quantitative analyses of the web and, despite initial concerns that web data would always be easily manipulated because they are not quality-controlled, the advocates of webometrics claim that it is useful both for studies of aspects of the web itself, such as hyperlinking among academic websites, and for studies of offline phenomena that might be reflected online, such as political attitudes reflected in blogs. The term webometrics was coined in 1997 by Tomas Almind and Peter Ingwersen in recognition that informetric analyses could be applied to the web. The field really took off, however, with the introduction of the Web Impact Factor (WIF), a metric to assess the impact of a website or other area of the web based upon the number of hyperlinks pointing to it. WIFs seemed to make sense because more useful or important areas of the web would presumably attract more hyperlinks than average.
The logic of this metric was derived from the use of citations in journal impact factors, but WIFs had the advantage that they could be easily calculated using the new advanced search queries introduced by AltaVista, a leading commercial search engine at the time. Webometrics subsequently rose to become a large coherent field within information science, at least from a bibliometric perspective, encompassing link analysis, web citation analysis and a range of other web-based quantitative techniques. In addition, webometrics became useful in various applied contexts, such as constructing the world webometrics ranking of universities and for scientometric evaluations or investigations of bodies of research or research areas. This section reviews a few key areas of webometrics and summarizes its contribution to information science research.
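The WIF described above divides the number of hyperlinks pointing at a site by the size of the site, much as a journal impact factor divides citations by articles. A minimal sketch, with invented counts rather than real search engine data:

```python
# Illustrative Web Impact Factor (WIF) calculation: inlinks to a site
# divided by its own page count. The counts below are invented for the
# example, not real measurements.
def web_impact_factor(inlink_count: int, page_count: int) -> float:
    if page_count == 0:
        raise ValueError("site has no pages")
    return inlink_count / page_count

# A small site attracting many links scores higher than a large site
# attracting the same number, so size is normalized away.
print(web_impact_factor(1200, 400))   # 3.0
print(web_impact_factor(1200, 4000))  # 0.3
```

The normalization is the point of the metric: raw inlink counts reward sheer size, whereas the ratio is intended to reflect impact per page.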

4.6-Exploring the web

The first web pages emerged in the faraway era of the early 1990s. E-mail and the Internet were already becoming well known, but the web, which like e-mail uses the Internet's global computer network to share information in commonly agreed-upon ways, had its start among physicists. It moved into the mainstream in 1993 when the National Center for Supercomputing Applications (NCSA) at the University of Illinois released Mosaic, an easy-to-use graphical web browser that ran on most standard computers. Between mid-1993 and mid-1995 the number of servers - the computers that house web sites - jumped from 130 to 22,000. Even with the user-friendly Mosaic encouraging a major expansion of this new medium, only a few historians ventured out on the web frontier. Many of the pioneers already had some technical interests or background. In November 1994 Morris Pierce, an engineer who had recently earned a history Ph.D., created one of the first departmental websites for the University of Rochester. It "seemed like a natural thing to do," he recalls. George Welling already worked in a department of humanities computing at the University of Groningen (Netherlands). In the fall of 1994, Welling developed a course in computer skills for American history students and asked them to construct an American Revolution website. Other History Web pioneers came to the medium out of experience with earlier Internet applications, particularly e-mail. In the late 1980s, Joni Makivirta, a student at the University of Jyvaskyla, Finland, started an online history discussion list because he noticed lists on other topics and thought a history list would allow him "to get ideas from professional historians around the world" for his master's thesis.
The participants included George Welling; Thomas Zielke, who later took over the list; Richard Jensen, who went on to found H-Net in 1993; Don Mabry, a Latin American historian at Mississippi State University; and Lynn Nelson, a medievalist at the University of Kansas. In 1991, Mabry - responding to the difficulty of circulating large documents via e-mail - began to make available primary sources and other materials of interest to historians via "anonymous FTP", a "file transfer protocol" that allows anyone with an Internet connection to download the files to their own computers.

Nelson created his own site and then had the idea of linking together the emerging set of history FTP sites into HNSource using Gopher, a hierarchical, menu-driven system for navigating the Internet that was much more popular than the web in the early 1990s. In September 1993, just after Mosaic was released, Nelson made HNSource available through the new web protocols, and it became one of the first, if not the very first, historical sites on the web. In the 1980s and early 1990s, the most intense energy in digital history centered not on the possibilities of online networks but rather on fixed-media products like laser disks and CD-ROM. In 1982, the Library of Congress began its Optical Disk Pilot Project, which placed text and images from its massive collections on laser disks and later CD-ROM. With a large amount of material already in digital form, the library could quickly take advantage of the newly emerging web. In 1992, it started to offer its exhibits through FTP sites. Two years later, the library posted its first web-based collection, Selected Civil War Photographs. Around the time that these early settlers carved out primitive digital history homesteads, the first signs emerged that this new frontier might feature more than noncommercial exchange. In October 1994 Marc Andreessen and some of his colleagues who had developed Mosaic at the government-funded NCSA released the first version of a commercially funded browser they called Netscape. Within months, Mosaic was, as they say, history, and Netscape was king of the World Wide Web. The Netscape era saw the History Web come into its own. In mid-1995, the first published guide to the web for historians, in the American Historical Association's (AHA) Perspectives, announced, "the explosion in Web sites has brought with it an explosion in materials relevant to historians."
Earlier that year, the Center for History and New Media (CHNM) had helped the venerable AHA launch its website; by that summer forty-five history departments had posted home pages. The online presence of the AHA and the Library of Congress provided an official imprimatur to the History Web. But in those early years, amateurs, not professional historical organizations, provided the crucial energy for much of its growth. Starting in 1995, for example, Larry Stevens, a telephone company worker from Newark, Ohio, established a series of websites on Ohio in the Civil War. The

sites combined his two hobbies of history and computers, and, he explained, he "decided to carve a niche into the net before the big boys, aka Ohio Historical Society, Ohio State University, etc., entered the field." Since the mid-1990s, the History Web has spun its threads with astonishing speed. In 2004, the same search yields 640,000 hits. In the fall of 1996, additional history searches produced what were thought at the time to be even more remarkable results: 200 hits for the Civil War general George B. McClellan and 300 for the socialist Eugene V. Debs. Even by 1996 the "walking city" that was the History Web a year earlier had become a sprawling megalopolis that no one person could fully explore. Yahoo counted 873 U.S. history websites in an incomplete census that fall. But seven years later, an even less complete tally returned almost ten times as many American history websites.

4.7-Webometrics, Bibliometrics & Informetrics

Being a global document network initially developed for scholarly use (Berners-Lee & Cailliau, 1990) and now inhabited by a diversity of users, the Web constitutes an obvious research area for bibliometrics, scientometrics and informetrics. A range of new terms for the emerging research area has been proposed since the mid-1990s, for instance: net metrics (Bossy, 1995); web metrics (Abraham, 1996); internet metrics (Almind & Ingwersen, 1996); webometrics (Almind & Ingwersen, 1997); Cybermetrics, the journal started in 1997 by Isidro Aguillo; web bibliometrics (Chakrabarti et al., 2002); and web metrics, the term used in Computer Science (e.g., Dhyani, Keong & Bhowmick, 2002). Webometrics and cybermetrics are currently the two most widely adopted terms in Information Science, often used as synonyms. Bjorneborn & Ingwersen (in press) have proposed a differentiated terminology distinguishing between studies of the Web and studies of all Internet applications.
They used an Information Science related definition of webometrics as "the study of the quantitative aspects of the construction and use of information resources, structures and technologies on the WWW drawing on bibliometric and informetric approaches" (Bjorneborn & Ingwersen, 2004). This definition thus covers quantitative aspects of both the construction side and the usage side of the Web, embracing the four main

areas of present webometric research: web page content analysis; web link structure analysis; web usage analysis (e.g., exploiting log files of users' searching and browsing behavior); and web technology analysis (including search engine performance). This includes hybrid forms, for example, Pirolli et al. (1996), who explored web analysis techniques for automatic categorization utilizing link graph topology, text content and metadata similarity, as well as usage data. All four main research areas include longitudinal studies of changes on the dynamic Web, for example, of page contents, link structures and usage patterns. So-called web archaeology (Bjorneborn & Ingwersen, 2001) could in this webometric context be important for recovering historical web developments, for instance by means of the Internet Archive, an approach already used in webometrics (Bjorneborn, 2003; Vaughan & Thelwall, 2003; Thelwall & Vaughan, 2004). Furthermore, Bjorneborn & Ingwersen have proposed cybermetrics as a generic term for "the study of the quantitative aspects of the construction and use of information resources, structures and technologies on the whole Internet, drawing on bibliometric and informetric approaches". Cybermetrics thus encompasses statistical studies of discussion groups, mailing lists, and other computer-mediated communication on the Internet (e.g., Bar-Ilan, 1997; Hernandez-Borges et al., 1997; Matzat, 1998; Herring, 2002), including the Web. Besides covering all computer-mediated communication using Internet applications, this definition of cybermetrics also covers quantitative measures of the Internet backbone technology, topology and traffic (Molyneux & Williams, 1999). The breadth of coverage of cybermetrics and webometrics implies large overlaps with proliferating Computer-Science-based approaches to analyses of web contents, link structures, web usage and web technologies.
A range of such approaches has emerged since the mid-1990s with names like Cyber Geography / Cyber Cartography (Girardin, 1996; Dodge, 1999; Dodge & Kitchin, 2001), Web Ecology (e.g., Chi et al., 1998; Huberman, 2001), Web Mining (e.g., Etzioni, 1996; Kosala & Blockeel, 2000; Chen & Chau, 2004), Web Graph Analysis (e.g., Chakrabarti et al., 1999; Kleinberg et al., 1999; Broder et al., 2000), and Web Intelligence (e.g., Yao et al., 2001). The rationale for using the term webometrics in this context is to denote a heritage from bibliometrics and informetrics and to stress an Information Science perspective on Web studies.

There are different conceptions of informetrics, bibliometrics and scientometrics. The diagram in Fig. 1.1 (Bjorneborn & Ingwersen) shows the field of informetrics embracing the overlapping fields of bibliometrics and scientometrics, following widely adopted definitions by, e.g., Brookes (1990), Egghe & Rousseau (1990) and Tague-Sutcliffe (1992). According to Tague-Sutcliffe (1992), informetrics is "the study of the quantitative aspects of information in any form, not just records or bibliographies, and in any social group, not just scientists". Bibliometrics is defined as "the study of the quantitative aspects of the production, dissemination and use of recorded information" and scientometrics as "the study of the quantitative aspects of science as a discipline or economic activity" (Tague-Sutcliffe, 1992). In the figure, political-economical aspects of scientometrics are covered by the part of the scientometric ellipse lying outside the bibliometric one. In this context, the field of webometrics may be seen as entirely encompassed by bibliometrics, because web documents, whether text or multimedia, are recorded information (cf. Tague-Sutcliffe's above-mentioned definition of bibliometrics) stored on web servers. This recording may be temporary only, just as not all paper documents are properly archived. In the diagram, webometrics is partially covered by scientometrics, as many scholarly activities today are web-based. Furthermore, webometrics is totally included within the field of cybermetrics as defined above. In the diagram, the field of cybermetrics exceeds the boundaries of bibliometrics, because some activities in cyberspace are normally not recorded, but communicated synchronously, as in chat rooms. Cybermetric studies of such activities still fit in the generic field of informetrics as the study of the quantitative aspects of information in any form. Webometrics, then, is a scientific discipline that studies the quantitative aspects of information sources and their use.
In other words, webometrics tries to measure the World Wide Web, analyses technology usage and allows simple content analysis. As Figure 1.1 shows, webometrics is informed by several scientific disciplines:
Bibliometrics - the quantitative analysis of documents in scientific communication; the documents reflect the state of scientific knowledge.
Cybermetrics - the quantitative study of information sources, structures and technologies on the Internet, including studies of discussion groups and communication.

Informetrics - focused on information streams in networks, demonstrating on the basis of mathematical and statistical methods a variety of relations between them.
Scientometrics - focused on evaluating the efficiency of scientific research or individual researchers through citation counts.

4.8-Scholarly Communication on the Web

The hope that web links could be used to provide similar kinds of information to that extracted from journal citations has been a key factor in motivating much webometrics research (Larson, 1996; Rodriguez Gairin, 1997; Rousseau, 1997; Ingwersen, 1998; Davenport & Cronin, 2000; Cronin, 2001; Borgman & Furner, 2002; Thelwall, 2002). But can this hope be fulfilled? Although structurally very similar, journal citations appear in refereed documents, their production is subject to quality control, and they are part of the mainstream of academic endeavor, whereas hyperlinks are none of these things, causing problems for the early hyperlink-citation analogies, as also noted by, for instance, Meyer (2000), Egghe (2000), van Raan (2001), Bjorneborn & Ingwersen (2001) and Prime et al. (2002). In this section we will summarize the results of a series of studies, organized by the scale of units analyzed, before considering the fundamental issue of why links are created, which is essential for interpreting the results. Finally, we will conclude with a discussion of how far the early hopes have been realized. The goal underlying almost all of the research reported here is to validate links as a new information source, as a preliminary step to extracting useful information from them. Such a task entails several different strategies, as reported by Oppenheim (2000) in the related context of patent citations. One of the key tasks is to compare the link data with other related data in order to establish the degree of correlation and overlap between the two.
With links between university web sites, for instance, a positive correlation between link counts and a measure of research would provide some evidence that link creation was not completely random and could be useful for studying scholarly activities. An important methodological issue is that, given the typically skewed nature of web link data, nonparametric Spearman correlation tests are normally more appropriate than Pearson tests. Note also that many of the studies reported below have developed improved methods that were reported in the previous section.
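The point about skewed data can be made concrete with a small, self-contained sketch. Spearman's correlation is simply Pearson's correlation computed on ranks, so a single extreme "star" site drags down the Pearson value but leaves the Spearman value untouched. The data below are invented for illustration:

```python
from statistics import mean

def ranks(xs):
    """1-based average ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in order[i:j + 1]:
            r[k] = avg
        i = j + 1
    return r

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    # Spearman = Pearson on the rank-transformed data.
    return pearson(ranks(xs), ranks(ys))

# Invented data: inlink counts are heavily skewed by one "star" site,
# while the research measure grows steadily.
inlink_counts = [3, 5, 8, 12, 900]
research_scores = [1.0, 1.5, 2.0, 2.5, 3.0]

print(round(spearman(inlink_counts, research_scores), 3))  # 1.0
print(round(pearson(inlink_counts, research_scores), 3))   # well below 1
```

Because both variables increase together, the rank-based Spearman coefficient reports a perfect monotonic relationship, while Pearson is distorted by the outlier, which is why webometric studies of skewed link counts prefer the former.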

4.9-Link Analysis: Impact Measurements and Networks

Link analysis drove early webometrics research, primarily through a combination of the development of improved methods and applications to a range of different contexts. Two types of studies emerged: link impact analyses and link network analyses. Link impact studies essentially compare the numbers of hyperlinks pointing to each website within a pre-defined set, such as all universities in a country or all departments within a discipline in a country. Links to university websites and, in some cases, departmental websites were found to correlate significantly with measures of research productivity or prestige, giving evidence of the validity of using link impact metrics as a research-related indicator. They have been used in this role to provide an indication of the most important organizations or websites within specific groups. In addition, a breakdown of the sources of links used in the calculations has been used to identify the sources of the impact, such as the countries and organization types that host most of the links. Link network research created network diagrams of the links among specified collections of websites in order to identify connectivity patterns. In addition to networks based upon direct links between pairs of sites, co-inlinks have also been used to indicate connections between pairs of sites. A co-inlink to a pair of websites A and B is a third website C that contains a hyperlink to both A and B. This relation is similar to co-citation in bibliometrics and is particularly useful when investigating websites that are similar but do not necessarily hyperlink to each other. Figure 1 is an example of a co-inlink network diagram for ASIS&T. A direct link network diagram would be likely to exclude links between pairs of sites that were similar in some way but were not directly related to each other.
The nodes in the network are the websites most highly linked to from a set of 741 pages reported by Bing as containing a URL citation to "asis.org." Lines between websites indicate co-inlinks between them from the 741 pages. All the organizations represented should be in some way related to ASIS&T. Green nodes are general international sites and pink nodes are university sites in the United States. Two important components of link analysis are the software and the methods used to extract link data. Researchers were for many years able to gather hyperlink information from commercial search engines like Bing, AltaVista and Yahoo! via their advanced link search commands, but these tools were all eventually withdrawn. Link data can still be obtained by the use of specialist link analysis web crawlers, including free programs like SocSciBot
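The construction behind such a diagram can be sketched as follows: given, for each source page, the set of sites it links to, every unordered pair of co-linked sites counts as one co-inlink, and the accumulated pair counts become the edge weights of the network. The site names below are invented for illustration, not taken from the actual ASIS&T data:

```python
from itertools import combinations
from collections import Counter

# Hypothetical source pages (e.g., pages citing "asis.org") and the
# sites each one links to; all names are invented for the example.
pages = [
    {"asist.org", "alise.org", "ifla.org"},
    {"asist.org", "alise.org"},
    {"asist.org", "ifla.org"},
]

# Each unordered pair of sites linked from the same page is one
# co-inlink; the count becomes the weight of the network edge.
edges = Counter(
    frozenset(pair)
    for linked in pages
    for pair in combinations(sorted(linked), 2)
)

for pair, weight in edges.most_common():
    print(sorted(pair), weight)
```

Thicker (heavier) edges in a diagram such as Figure 1 correspond to site pairs that many source pages link to together, which is how similar but unconnected sites end up adjacent in the network.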

14 ( and Issue Crawler ( as well as a range of other crawlers developed by individual researchers. The Issue Crawler initiative from sociology seems to have been particularly successful at spreading link analysis methods to the wider areas of social sciences and the humanities.within information science, hyperlink-based network diagrams have been used to investigate the interconnections between large. Groups of organizations, such as universities in Europe and organizations within a specific knowledge sector. Some link analysis research has focused on the links themselves, investigating why they are created and why some sites or pages attract more links than others. These studies seem to have focused exclusively on links in academic contexts. Content analyses have shown that links between academic websites tend to be created for scholarly or educational reasons, a partial similarity with citation analysis. Statistical tests have also been used to see which attributes of the website owners (other than research productivity or production, which was already a known factor) tend to associate with higher inlink counts, for example finding that research group website owner gender is unimportant. A recent quite comprehensive study used the most advanced statistical modeling approach yet on a large dataset to gain significant insights into the factors behind academic website interlinking in Europe. Among the findings were that country, region, domain specialism and level (whether awarding doctoral degrees or not) were the most important factors predicting hyperlinks, while reputation was 4.10-Web Citation Analysis to Altmetrics The second type of webometrics to become popular was web citation analysis: counting online citations to published academic documents like refereed journal articles. 
The rationale behind early research was to assess whether the web could replace traditional citation databases for assessing the impact of articles in open access online journals, and subsequently for all journals. This early research found that although counts of web citations correlated with citation counts from traditional databases, many of the web citations derived from non-academic sources, such as online library catalogues. As a result, the web appeared to be an inferior source of citation impact evidence for journals or individual journal articles. This strand of webometric research gave way to more specialized investigations into particular types of web citations to academic publications, such as citations from PowerPoint presentations, online syllabi and Google Books, on the basis that within these restricted domains, web-based citation counts could reveal different types of impact from the scholarly impact reflected by traditional citation counts. For example, online syllabus citations could reflect the educational impact or value of articles. This line of research was subsequently overtaken by the altmetrics initiative, discussed elsewhere. A promising but relatively little studied type of webometrics is the analysis of mentions of keywords or phrases, not necessarily citations. This type of analysis was started by an investigation into the context of online mentions of academics, but the keyword approach has also been used to map concepts online, and interactions between concepts online, by tracking co-words in web pages.

4.11-Theoretical Perspectives and Information-Centered Research

Webometrics has been a methods-centered field, developing methods to gather and analyze data from the web. Perhaps as a result of this focus, the theoretical component of most webometric studies has typically been drawn from citation analysis rather than being created specifically for web data. For example, many early studies assessed whether web citation counts or web link counts correlated with traditional citation counts, drawing upon Robert Merton's theoretical discussion of citation norms in science. Hence, such studies assessed to some extent how well web data fitted Merton's theory. The lack of specialist theory for the most developed area of webometrics, link analysis, reflects the web being a far more varied and complex space than academic journal databases, with theory development in the latter being recognized as problematic and controversial. One partial exception to the lack of native theory for webometrics is information-centered research, a style of research theorized to be particularly appropriate to webometrics.
An information-centered research study focuses on a new information source, such as a type of web data, and attempts to identify the social science research problems that the data is most suited to address, rather than using a priori intuitions to match the data with a research problem and then assessing the value of the data for that problem. This theory was used to justify the development of a range of different methods to analyze web data and to match the methods to a variety of social science problem areas.

4.12-Web Data Analysis

Webometrics research has expanded from general or academic web analyses to investigations of social websites, often by automatically downloading data from those websites, either through a web crawler or through data requests sent through permitted routes (application programming interfaces). For example, exploiting the information-centered research approach, blogs and RSS feeds have been analyzed to detect public fears about science, while social network sites have been investigated to detect friendship patterns and language use. Twitter has been analyzed for the sentiment of public reactions to major media events, and YouTube for the factors associated with discussions attached to online videos. In all cases, the methods of the research have been webometric - large-scale data gathering and analysis for social science purposes - but the findings of the research have been targeted at disciplines outside information science, such as media studies, politics and science communication. Many of the programs used are now publicly available in the free software Webometric Analyst.

4.13-Link Analysis

Link analysis is the quantitative study of hyperlinks between web pages. The use of links in bibliometrics was triggered by Ingwersen's web impact factor (WIF), created by analogy to the journal Impact Factor (JIF), and by the potential that hyperlinks might be usable by bibliometricians in ways analogous to citations. The standard WIF measures the average number of links per page to a web space (e.g. a web site or a whole country) from external pages. The hypothesis underlying early link analysis was that the number of links targeting an academic web site might be proportional to the research productivity of the owning organization, at the level of universities, departments, research groups or individual scientists. Essentially the two are related because more productive researchers seem to produce more web content, on average, although this content does not attract more links per page.
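The standard WIF described above reduces to a single division. A minimal sketch, with invented figures rather than data from any real study:

```python
def web_impact_factor(external_inlinks, pages_in_space):
    """Standard WIF: the number of links pointing into a web space
    from external pages, divided by the number of pages in the space."""
    if pages_in_space == 0:
        raise ValueError("web space has no pages")
    return external_inlinks / pages_in_space

# Invented figures: 1,200 external inlinks to a 400-page site.
wif = web_impact_factor(1200, 400)
```

In early studies both numbers would typically have come from advanced search engine queries, which is what made the calculation practical at the time.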
Nevertheless, the pattern is likely to be obscured in all except large-scale studies because of the often indirect relationship between research productivity and web visibility. For example, some researchers produce highly visible web resources as the main output of their research, whilst others with equally high-quality offline research attract less online attention. Subsequent hyperlink research has introduced new metrics and applications as well as improved counting methods, such as the alternative document models. In most cases this research has focused on method development or case studies. The wide variety of reasons why links are created, and the fact that, unlike citing, linking is not central to any area of science, have led to hyperlinks rarely being used in an evaluative role.

Nevertheless, they can be useful in describing the evolution or connectivity of research groups within a field, especially in comparison with other sources of similar information, such as citations or patents. Links are also valuable for gaining insights into web use in a variety of contexts, such as by departments in different fields. A generic problem with link analysis is that the web is continually changing and seems to be constantly expanding, so that webometric findings might become rapidly obsolete. A series of longitudinal investigations into university web sites in Australia, New Zealand and the UK has addressed this issue. These university web sites seem to have stabilized in size from 2001, after several years of rapid growth. A comparison of links between the web sites from year to year found that this stabilization in site size concealed changes in the individual links, but concluded that typical quantitative studies could nevertheless have a shelf life of many years.
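Year-to-year comparisons of the kind described above can be expressed as a set overlap between two crawls. A sketch using invented link data (the Jaccard coefficient here is one reasonable choice of overlap measure, not necessarily the one used in those studies):

```python
def link_overlap(links_year1, links_year2):
    """Jaccard overlap between the sets of (source, target) hyperlinks
    observed in two crawls of the same group of web sites."""
    union = links_year1 | links_year2
    if not union:
        return 0.0
    return len(links_year1 & links_year2) / len(union)

# Invented crawl data: two of four distinct links persist between years.
year1 = {("a", "b"), ("a", "c"), ("b", "c")}
year2 = {("a", "b"), ("b", "c"), ("b", "d")}
overlap = link_overlap(year1, year2)
```

A stable site size with a low overlap value would indicate exactly the phenomenon reported: constant totals concealing turnover in the individual links.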

4.14-Web Citation Analysis

A number of webometric investigations have focused not on web sites but on academic publications, using the web to count how often journal articles are cited. The rationale behind this is partly to give a second opinion on the traditional ISI data, and partly to see whether the web can produce evidence of wider use of research, including informal scholarly communication and commercial applications. A number of studies have shown that web-based citation counts correlate significantly with ISI citation counts across a range of disciplines, with web citations typically being more numerous. Nevertheless, many of the online citations are relatively trivial, for example appearing in journal contents lists rather than in the reference sections of academic articles. If the filtering out of such trivial citations could be automated, then web citation counting would offer an interesting alternative to the ISI citation indexes.

4.15-Search Engines

A significant amount of webometrics research has evaluated commercial search engines. The two main investigation topics have been the extent of the coverage of the web and the accuracy of the reported results. Research into developing search engine algorithms (information retrieval) and into how search engines are used (information seeking) is not part of webometrics. The two audiences for webometric search engine research are researchers who use the engines for data gathering (e.g. the link counts above) and web searchers wanting to understand their results. Search engines have been a main portal to the web for most users since the early years. Hence, it has been logical to assess how much of the web they cover. In 1999, a survey of the main search engines estimated that none covered more than 17.5% of the 'indexable' web and that the overlap between search engines was surprisingly low.
Here the 'indexable' web is roughly the set of pages that a perfect search engine could be expected to find if it found all web site home pages and followed links to find the remainder of the pages in those sites. The absence of comparable figures after 1999 is due to three factors: first, an obscure Hypertext Transfer Protocol technology, the virtual server, has rendered the sampling method of Lawrence and Giles ineffective; second, the rise of dynamic pages means that it is no longer reasonable to talk in terms of the 'total number of web pages'; finally, given that search engine coverage of the web is only partial, the exact percentage is not particularly relevant unless it has changed substantially. One outcome of this research, however, was clear evidence that meta-search engines could give more results by combining multiple engines. Nevertheless, these have lost out to Google, presumably because the key task of a search engine is to deliver relevant results on the first results page rather than a comprehensive list of pages. Given that web coverage is partial, is it biased in any important ways? This question matters because the key role of search engines as intermediaries between web users and content gives them considerable economic power in the online economy. In fact, coverage is biased internationally in favor of countries that were early adopters of the web. This is a side effect of the way search engines find pages rather than a policy decision. The issue of the accuracy of search engine results is multifaceted, relating to the extent to which a search engine correctly reports its own knowledge of the web. Bar-Ilan and Peritz have shown that search engines are not internally consistent in the way they report results to users. Through a longitudinal analysis of the results of the query 'Informetric OR Informetrics' in Google, they showed that search engines reported only a fraction of the pages in their databases. Although some of the omitted pages duplicated other returned results, this was not always the case, and so some information would be lost to the user. A related analysis with Microsoft Live Search suggested that one reason for lost information could be the search engine policy of returning a maximum of two pages per site. Many webometric studies have used the hit count estimates provided by search engines on their results pages (e.g. the '50,000' in 'Results 1-10 of about 50,000') rather than the list of matching URLs. For example, Ingwersen used these to estimate the number of hyperlinks between pairs of countries.
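Studies of this kind typically scraped the estimate out of the results-page text. A minimal sketch, assuming the 'Results 1-10 of about 50,000' phrasing quoted above (real engines varied in wording, so the pattern would need adapting):

```python
import re

def parse_hit_count(results_line):
    """Pull the hit count estimate out of a results-page line such as
    'Results 1-10 of about 50,000' (the phrasing is an assumed format)."""
    match = re.search(r"of about ([\d,]+)", results_line)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))

estimate = parse_hit_count("Results 1-10 of about 50,000")
```

The fragility of this approach is one reason such estimates were treated with caution, as discussed next.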
The problem with using these estimates is that they can be unreliable and can even lead to inconsistencies, such as expanded queries giving fewer results. In the infancy of webometrics these estimates could be highly variable, and so techniques were proposed to smooth out the inconsistencies, although the estimates subsequently became much more stable. A recent analysis of the accuracy of hit count estimates for Live Search found a surprising pattern.

4.16-Measuring Web 2.0

Web 2.0 is a term coined by the publisher Tim O'Reilly mainly to refer to web sites that are driven by consumer content, such as blogs, Wikipedia and social network sites. The growth in the volume of web content created by ordinary users has spawned a market intelligence industry and much measurement research. The idea behind these is data mining: since so many people have recorded informal thoughts online in various formats, such as blogs, chat rooms, bulletin boards and social network sites, it should be possible to extract patterns such as consumer reactions to products or world events. In order to address issues like these, new software has been developed by large companies, such as IBM's WebFountain and Microsoft's Pulse. In addition, specialist web intelligence companies like Nielsen BuzzMetrics and Market Sentinel have been created or adapted. A good example of a research initiative to harness consumer-generated media (CGM) is an attempt to predict sales patterns for books based upon the volume of blog discussions of them. The predictions had only limited success, however, perhaps because people often blogged about books after reading them, when it would be too late to predict a purchase. Other similar research has had less commercial goals. Gruhl et al. analysed the volume of discussion for a selection of topics in blog space, finding several different patterns. For example, some topics were discussed for one short period of time only, whereas others were discussed continuously, with or without occasional bursts of extra debate. A social sciences-oriented study sought to build retrospective timelines for major events from blog and news discussions, finding this to be possible to a limited extent. Problems occurred, for example, when a long-running series of similar, relatively minor events received little individual discussion, but omitting them all from a timeline would omit an important aspect of the overall event. In addition to the data mining style of research, there have been many studies of Web 2.0 sites in order to describe their contents and explain user behavior in them.
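The discussion-volume analyses described above amount to a keyword time series over dated posts. A toy sketch with invented posts (real studies used far larger corpora and more sophisticated matching):

```python
from collections import Counter

# Invented blog posts as (date, text) pairs.
posts = [
    ("2006-03-01", "Reading a great new book about the web"),
    ("2006-03-01", "The web conference was fascinating"),
    ("2006-03-02", "Nothing much happened today"),
    ("2006-03-02", "More thoughts on the web and its growth"),
]

def discussion_volume(posts, keyword):
    """Count the posts per day that mention the keyword,
    case-insensitively."""
    volume = Counter()
    for date, text in posts:
        if keyword.lower() in text.lower():
            volume[date] += 1
    return volume

volume = discussion_volume(posts, "web")
```

Plotting such daily counts over time is what reveals the burst and continuous-discussion patterns reported by Gruhl et al.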
Here, research into social network sites is reviewed. A large-scale study of the early years of Facebook provides the most comprehensive overview of user activities. The data came from February 2004 to March 2006, when Facebook was a social network site exclusively for US college students. Users seemed to fit their Facebook use into their normal pattern of computer use whilst studying, rather than allocating separate times. In terms of the geography of friendship, members mainly used Facebook to communicate with other students at the same college rather than with school friends at distant universities. This suggests that social networking is an extension of offline communication rather than a promoter of radically new geographies of communication, although the latter is enabled by the technology of Facebook. This conclusion is supported by qualitative research into another popular site, MySpace. A webometric study of MySpace indirectly investigated activity levels but focused on member profiles. Amongst other findings, this showed that about a third of registered members accessed the site weekly and that the average reported age was 21. Although other research found that MySpace close friends tended to reflect offline friendships, both male and female users preferred to have a majority of female friends. Another study looked at the geography of friendship, finding that the majority of friends tended to live within a hundred miles, although only a minority lived in the same town or city. Finally, many statistics about Web 2.0 have been published by market research companies. Despite the uncertain provenance of this data, the results sometimes seem reasonable and, because of the cost of obtaining the data, seem unlikely to be duplicated by academic researchers. An example is the announcement by Hitwise that MySpace had supplanted Google as the most visited web site by US users by December. The data for this was reported to come from two million US web users via an agreement between Hitwise and the users' internet service providers. Making the results of overview analyses public gives useful publicity to Hitwise and valuable insights to web researchers.

4.17-The Development of Policy-Relevant Webometrics

As introduced above, early link analysis webometrics developed methods and indicators but no clear practical applications. Early studies began with the Web Impact Factor, a type of calculation based on counting links to a web site or other web space (Ingwersen, 1998). This calculation was practical because links to a web space could be easily counted and listed using an advanced query in the web search engine AltaVista.
It gave the promise that the impact of whole areas of the web, including entire countries, could be assessed, and was inspired by the journal Impact Factor (Garfield, 1999). Subsequent research found problems, including the unreliability of search engines (Bar-Ilan, 1999; Mettrop & Nieuwenhuysen, 2001) and the existence of links created for spam or recreational reasons (Smith, 1999). This may have prevented the early adoption of Web Impact Factors as policy-relevant indicators, and they subsequently attracted less interest. After the initial research there was a period of methodological development in which webometrics defined its key terminology (Bjorneborn & Ingwersen, 2004) and developed specialist data collection and analysis software (Cothey, 2004; Heimeriks


More information

Using the Internet and the World Wide Web

Using the Internet and the World Wide Web Using the Internet and the World Wide Web Computer Literacy BASICS: A Comprehensive Guide to IC 3, 3 rd Edition 1 Objectives Understand the difference between the Internet and the World Wide Web. Identify

More information

Enhanced retrieval using semantic technologies:

Enhanced retrieval using semantic technologies: Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008

More information

For Attribution: Developing Data Attribution and Citation Practices and Standards

For Attribution: Developing Data Attribution and Citation Practices and Standards For Attribution: Developing Data Attribution and Citation Practices and Standards Board on Research Data and Information Policy and Global Affairs Division National Research Council in collaboration with

More information

Ranking Web of Repositories Metrics, results and a plea for a change. Isidro F. Aguillo Cybermetrics Lab CCHS - CSIC

Ranking Web of Repositories Metrics, results and a plea for a change. Isidro F. Aguillo Cybermetrics Lab CCHS - CSIC Ranking Web of Repositories Metrics, results and a plea for a change Isidro F. Aguillo Cybermetrics Lab CCHS - CSIC Isidro.aguillo@cchs.csic.es 1 Agenda 00:00 A complex scenario: Open Access Initiatives

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Ambiguity Handling in Mobile-capable Social Networks

Ambiguity Handling in Mobile-capable Social Networks Ambiguity Handling in Mobile-capable Social Networks Péter Ekler Department of Automation and Applied Informatics Budapest University of Technology and Economics peter.ekler@aut.bme.hu Abstract. Today

More information

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm K.Parimala, Assistant Professor, MCA Department, NMS.S.Vellaichamy Nadar College, Madurai, Dr.V.Palanisamy,

More information

ICT-U CAMEROON, P.O. Box 526 Yaounde, Cameroon. Schools and Programs DETAILED ICT-U PROGRAMS AND CORRESPONDING CREDIT HOURS

ICT-U CAMEROON, P.O. Box 526 Yaounde, Cameroon. Schools and Programs DETAILED ICT-U PROGRAMS AND CORRESPONDING CREDIT HOURS Website: http:// ICT-U CAMEROON, P.O. Box 526 Yaounde, Cameroon Schools and Programs DETAILED ICT-U PROGRAMS AND CORRESPONDING CREDIT HOURS Important note on English as a Second Language (ESL) and International

More information

Motivations for URL citations to open access library and information science articles 1

Motivations for URL citations to open access library and information science articles 1 Motivations for URL citations to open access library and information science articles 1 KAYVAN KOUSHA PhD Student, Department of Library and Information Science, University of Tehran, Jalal-Al-e-Ahmed,

More information

Received: 15/04/2012 Reviewed: 26/04/2012 Accepted: 30/04/2012

Received: 15/04/2012 Reviewed: 26/04/2012 Accepted: 30/04/2012 Exploring Deep Web Devendra N. Vyas, Asst. Professor, Department of Commerce, G. S. Science Arts and Commerce College Khamgaon, Dist. Buldhana Received: 15/04/2012 Reviewed: 26/04/2012 Accepted: 30/04/2012

More information

= a hypertext system which is accessible via internet

= a hypertext system which is accessible via internet 10. The World Wide Web (WWW) = a hypertext system which is accessible via internet (WWW is only one sort of using the internet others are e-mail, ftp, telnet, internet telephone... ) Hypertext: Pages of

More information

The future of UC&C on mobile

The future of UC&C on mobile SURVEY REPORT The future of UC&C on mobile Published by 2018 Introduction The future of UC&C on mobile report gives us insight into how operators and manufacturers around the world rate their unified communication

More information

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Evaluating the Usefulness of Sentiment Information for Focused Crawlers Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,

More information

Running head: ASSESSING THE CHALLENGES OF LOCAL HISTORY DIGITIZATION 1. Assessing the Challenges of Local History Archive Digitization Projects

Running head: ASSESSING THE CHALLENGES OF LOCAL HISTORY DIGITIZATION 1. Assessing the Challenges of Local History Archive Digitization Projects Running head: ASSESSING THE CHALLENGES OF LOCAL HISTORY DIGITIZATION 1 Assessing the Challenges of Local History Archive Digitization Projects Alexander P. Merrill Wayne State University ASSESSING THE

More information

Evaluating Web Ranking Metrics for Saudi Universities

Evaluating Web Ranking Metrics for Saudi Universities Evaluating Web Ranking Metrics for Saudi Universities Ahmad Albhaishi 1, Heider A. Wahsheh 1, Tami Alghamdi 1 1 King Khalid University/ College of Computer Science, Computer Science Department Abha, Saudi

More information

1 von 5 13/10/2005 17:44 high graphics home search browse about whatsnew submit sitemap Factors affecting the quality of an information source The purpose of this document is to explain the factors affecting

More information

Consistent Measurement of Broadband Availability

Consistent Measurement of Broadband Availability Consistent Measurement of Broadband Availability By Advanced Analytical Consulting Group, Inc. September 2016 Abstract This paper provides several, consistent measures of broadband availability from 2009

More information

The ebuilders Guide to selecting a Web Designer

The ebuilders Guide to selecting a Web Designer The ebuilders Guide to selecting a Web Designer With the following short guide we hope to give you and your business a better grasp of how to select a web designer. We also include a short explanation

More information

The Tagging Tangle: Creating a librarian s guide to tagging. Gillian Hanlon, Information Officer Scottish Library & Information Council

The Tagging Tangle: Creating a librarian s guide to tagging. Gillian Hanlon, Information Officer Scottish Library & Information Council The Tagging Tangle: Creating a librarian s guide to tagging Gillian Hanlon, Information Officer Scottish Library & Information Council Introduction Scottish Library and Information Council (SLIC) advisory

More information

Economics of Information Networks

Economics of Information Networks Economics of Information Networks Stephen Turnbull Division of Policy and Planning Sciences Lecture 4: December 7, 2017 Abstract We continue discussion of the modern economics of networks, which considers

More information

Characteristics of Students in the Cisco Networking Academy: Attributes, Abilities, and Aspirations

Characteristics of Students in the Cisco Networking Academy: Attributes, Abilities, and Aspirations Cisco Networking Academy Evaluation Project White Paper WP 05-02 October 2005 Characteristics of Students in the Cisco Networking Academy: Attributes, Abilities, and Aspirations Alan Dennis Semiral Oncu

More information

1. The Best Practices Section < >

1. The Best Practices Section <   > DRAFT A Review of the Current Status of the Best Practices Project Website and a Proposal for Website Expansion August 25, 2009 Submitted by: ASTDD Best Practices Project I. Current Web Status A. The Front

More information

Meaning & Concepts of Databases

Meaning & Concepts of Databases 27 th August 2015 Unit 1 Objective Meaning & Concepts of Databases Learning outcome Students will appreciate conceptual development of Databases Section 1: What is a Database & Applications Section 2:

More information

You need to start your research and most people just start typing words into Google, but that s not the best way to start.

You need to start your research and most people just start typing words into Google, but that s not the best way to start. Academic Research Using Google Worksheet This worksheet is designed to have you examine using various Google search products for research. The exercise is not extensive but introduces you to things that

More information

Identifying user behavior in domain-specific repositories

Identifying user behavior in domain-specific repositories Information Services & Use 34 (2014) 249 258 249 DOI 10.3233/ISU-140745 IOS Press Identifying user behavior in domain-specific repositories Wilko van Hoek, Wei Shen and Philipp Mayr GESIS Leibniz Institute

More information

Digital Library on Societal Impacts Draft Requirements Document

Digital Library on Societal Impacts Draft Requirements Document Table of Contents Digital Library on Societal Impacts Draft Requirements Document Eric Scharff Introduction... 1 System Description... 1 User Interface... 3 Infrastructure... 3 Content... 4 Work Already

More information

A Study on Website Quality Models

A Study on Website Quality Models International Journal of Scientific and Research Publications, Volume 4, Issue 12, December 2014 1 A Study on Website Quality Models R.Anusha Department of Information Systems Management, M.O.P Vaishnav

More information

data elements (Delsey, 2003) and by providing empirical data on the actual use of the elements in the entire OCLC WorldCat database.

data elements (Delsey, 2003) and by providing empirical data on the actual use of the elements in the entire OCLC WorldCat database. Shawne D. Miksa, William E. Moen, Gregory Snyder, Serhiy Polyakov, Amy Eklund Texas Center for Digital Knowledge, University of North Texas Denton, Texas, U.S.A. Metadata Assistance of the Functional Requirements

More information

Research Report: Voice over Internet Protocol (VoIP)

Research Report: Voice over Internet Protocol (VoIP) Research Report: Voice over Internet Protocol (VoIP) Statement Publication date: 26 July 2007 Contents Section Page 1 Executive Summary 1 2 Background and research objectives 3 3 Awareness of VoIP 5 4

More information

WWW Hyperlink Networks

WWW Hyperlink Networks WWW Hyperlink Networks Robert Ackland Australian Demographic and Social Research Institute (ADSRI) The Australian National University robert.ackland@anu.edu.au http://voson.anu.edu.au Notes prepared for

More information

Data Curation Profile Human Genomics

Data Curation Profile Human Genomics Data Curation Profile Human Genomics Profile Author Profile Author Institution Name Contact J. Carlson N. Brown Purdue University J. Carlson, jrcarlso@purdue.edu Date of Creation October 27, 2009 Date

More information

Interim Report Technical Support for Integrated Library Systems Comparison of Open Source and Proprietary Software

Interim Report Technical Support for Integrated Library Systems Comparison of Open Source and Proprietary Software Interim Report Technical Support for Integrated Library Systems Comparison of Open Source and Proprietary Software Vandana Singh Assistant Professor, School of Information Science, University of Tennessee,

More information

Evaluating User Behavior on Data Collections in a Digital Library

Evaluating User Behavior on Data Collections in a Digital Library Evaluating User Behavior on Data Collections in a Digital Library Michalis Sfakakis 1 and Sarantos Kapidakis 2 1 National Documentation Centre / National Hellenic Research Foundation 48 Vas. Constantinou,

More information

The Knowledge Portal, or, the Vision of Easy Access to Information

The Knowledge Portal, or, the Vision of Easy Access to Information The Knowledge Portal, or, the Vision of Easy Access to Information Wolfram Neubauer and Arlette Piguet ETH Library and Collections, Swiss Federal Institute of Technology, Zurich, Switzerland Abstract:

More information

Scuola di dottorato in Scienze molecolari Information literacy in chemistry 2015 SCOPUS

Scuola di dottorato in Scienze molecolari Information literacy in chemistry 2015 SCOPUS SCOPUS ORIGINAL RESEARCH INFORMATION IN SCIENCE is published (stored) in PRIMARY LITERATURE it refers to the first place a scientist will communicate to the general audience in a publicly accessible document

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Using Scopus. Scopus. To access Scopus, go to the Article Databases tab on the library home page and browse by title.

Using Scopus. Scopus. To access Scopus, go to the Article Databases tab on the library home page and browse by title. Using Scopus Databases are the heart of academic research. We would all be lost without them. Google is a database, and it receives almost 6 billion searches every day. Believe it or not, however, there

More information

The Internet and the Web. recall: the Internet is a vast, international network of computers

The Internet and the Web. recall: the Internet is a vast, international network of computers The Internet and the Web 1 History of Internet recall: the Internet is a vast, international network of computers the Internet traces its roots back to the early 1960s MIT professor J.C.R. Licklider published

More information

CSC105, Introduction to Computer Science I. Introduction and Background. search service Web directories search engines Web Directories database

CSC105, Introduction to Computer Science I. Introduction and Background. search service Web directories search engines Web Directories database CSC105, Introduction to Computer Science Lab02: Web Searching and Search Services I. Introduction and Background. The World Wide Web is often likened to a global electronic library of information. Such

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

A Balanced Introduction to Computer Science, 3/E David Reed, Creighton University 2011 Pearson Prentice Hall ISBN

A Balanced Introduction to Computer Science, 3/E David Reed, Creighton University 2011 Pearson Prentice Hall ISBN A Balanced Introduction to Computer Science, 3/E David Reed, Creighton University 2011 Pearson Prentice Hall ISBN 978-0-13-216675-1 Chapter 3 The Internet and the Web 1 History of Internet recall: the

More information

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani LINK MINING PROCESS Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani Higher Colleges of Technology, United Arab Emirates ABSTRACT Many data mining and knowledge discovery methodologies and process models

More information

Comparison of Generalized Webometrics to the Institutional Webometrics Ranking

Comparison of Generalized Webometrics to the Institutional Webometrics Ranking Volume-5, Issue-3, June-2015 International Journal of Engineering and Management Research Page Number: 209-214 Comparison of Generalized Webometrics to the Institutional Webometrics Ranking Deepak Patidar

More information

Consistent Measurement of Broadband Availability

Consistent Measurement of Broadband Availability Consistent Measurement of Broadband Availability FCC Data through 12/2015 By Advanced Analytical Consulting Group, Inc. December 2016 Abstract This paper provides several, consistent measures of broadband

More information

Module 1: Internet Basics for Web Development (II)

Module 1: Internet Basics for Web Development (II) INTERNET & WEB APPLICATION DEVELOPMENT SWE 444 Fall Semester 2008-2009 (081) Module 1: Internet Basics for Web Development (II) Dr. El-Sayed El-Alfy Computer Science Department King Fahd University of

More information

Integrating Lecture Recordings with Social Networks

Integrating Lecture Recordings with Social Networks Integrating Lecture Recordings with Social Networks Patrick Fox, Johannes Emden, Nicolas Neubauer and Oliver Vornberger Institute of Computer Science University of Osnabru ck Germany, 49069 Osnabru ck

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

The Tangled Web We Weave Managing the Social Identity Crisis on Mobile Devices

The Tangled Web We Weave Managing the Social Identity Crisis on Mobile Devices W3C Workshop on the Future of Social Networking January 15-16, 2009 Barcelona, Spain The Tangled Web We Weave Managing the Social Identity Crisis on Mobile Devices Greg Howard

More information

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE 15 : CONCEPT AND SCOPE 15.1 INTRODUCTION Information is communicated or received knowledge concerning a particular fact or circumstance. Retrieval refers to searching through stored information to find

More information

Managing a large Academic CD-ROM Network

Managing a large Academic CD-ROM Network Managing a large Academic CD-ROM Network Wolfram Seidler and Otto Oberhauser The growing number of implementations shows that CD-ROM networking has finally come of age. The particular benefits of providing

More information

Information Push Service of University Library in Network and Information Age

Information Push Service of University Library in Network and Information Age 2013 International Conference on Advances in Social Science, Humanities, and Management (ASSHM 2013) Information Push Service of University Library in Network and Information Age Song Deng 1 and Jun Wang

More information

The explosion of computer and telecommunications technology in the. The Usability of On-line Archival Resources: The Polaris Project Finding Aid

The explosion of computer and telecommunications technology in the. The Usability of On-line Archival Resources: The Polaris Project Finding Aid The Usability of On-line Archival Resources: The Polaris Project Finding Aid Burt Altman and John R. Nemmers Abstract This case study examines how the Florida State University Libraries Claude Pepper Library

More information

Empirical Validation of Webometrics based Ranking of World Universities

Empirical Validation of Webometrics based Ranking of World Universities Empirical Validation of Webometrics based Ranking of World Universities R. K. Pandey University Institute of Computer Science and Applications (UICSA) R. D. University,Jabalpur Abstract Webometrics is

More information

2015 Search Ranking Factors

2015 Search Ranking Factors 2015 Search Ranking Factors Introduction Abstract Technical User Experience Content Social Signals Backlinks Big Picture Takeaway 2 2015 Search Ranking Factors Here, at ZED Digital, our primary concern

More information

The Complex Network Phenomena. and Their Origin

The Complex Network Phenomena. and Their Origin The Complex Network Phenomena and Their Origin An Annotated Bibliography ESL 33C 003180159 Instructor: Gerriet Janssen Match 18, 2004 Introduction A coupled system can be described as a complex network,

More information

Query Modifications Patterns During Web Searching

Query Modifications Patterns During Web Searching Bernard J. Jansen The Pennsylvania State University jjansen@ist.psu.edu Query Modifications Patterns During Web Searching Amanda Spink Queensland University of Technology ah.spink@qut.edu.au Bhuva Narayan

More information

IMPROVING CUSTOMER GENERATION BY INCREASING WEBSITE PERFORMANCE AND INTEGRATING IT SYSTEMS

IMPROVING CUSTOMER GENERATION BY INCREASING WEBSITE PERFORMANCE AND INTEGRATING IT SYSTEMS IMPROVING CUSTOMER GENERATION BY INCREASING WEBSITE PERFORMANCE AND INTEGRATING IT SYSTEMS S Ramlall*, DA Sanders**, H Powell* and D Ndzi ** * Motiontouch Ltd, Dunsfold Park, Cranleigh, Surrey GU6 8TB

More information

Cookies, fake news and single search boxes: the role of A&I services in a changing research landscape

Cookies, fake news and single search boxes: the role of A&I services in a changing research landscape IET White Paper Cookies, fake news and single search boxes: the role of A&I services in a changing research landscape November 2017 www.theiet.org/inspec 1 Introduction Searching for information on the

More information

Contractors Guide to Search Engine Optimization

Contractors Guide to Search Engine Optimization Contractors Guide to Search Engine Optimization CONTENTS What is Search Engine Optimization (SEO)? Why Do Businesses Need SEO (If They Want To Generate Business Online)? Which Search Engines Should You

More information