ArcLink: Optimization techniques to build and retrieve the Temporal Web Graph


A. A. and B. B.
Computer Science Department, One University, One City

Abstract. Archiving the web is socially and culturally critical, but it presents problems of scale. In this paper, we present ArcLink, an exemplary system that optimizes the construction, storage, and access of the temporal web graph extracted from large-scale web archives. We divide the construction into four stages (filtering, extraction, storage, and access) and explore optimizations for each stage. We were able to reduce the size of the corpus to 29% of the original, and we explored URI-based and content-based indexing approaches to support extraction of the web graph.

Keywords: Web Graph, Web Archive, Memento

1 Introduction

The web graph is a graph in which each vertex represents a URI and each edge represents a hyperlink between two URIs. The web graph has been the foundation of various applications: ranking, such as PageRank [1] and Kleinberg's HITS [2]; web spam detection [3, 4]; and finding related pages [5]. Bharat et al. [6] also defined the hostgraph, which captures linkage between sites rather than pages. The temporal web graph may have more than one memento for each URI, at various timestamps; this is the case for collections preserved by web archives. Web archives preserve web materials before they change or disappear forever. The International Internet Preservation Consortium (IIPC) was formed in 2003 to improve the tools, standards, and best practices of web archiving. The IIPC Access Working Group proposed different use cases for access to Internet archives [7], covering various types of users. The report highlighted the importance of the linking information of the archived web and suggested providing an API to query the link structure of a specific URI.

The time dimension in the web graph challenges the traditional web graph applications. For example, the temporal web graph could be used to rank full-text search results in web archives with respect to time. We cannot use the absolute number of inlinks because the density of mementos is not consistent over time [8]. Also, the results should differ based on the time filter (e.g., a query whose first result in 2004 was Yahoo! Mail may return Gmail as the first result today).

The temporal web graph could also be used by a web crawler to discover new URIs, and researchers have used the temporal web graph to study the evolution of the web. In this paper, we propose new optimization techniques for the creation, preservation, and retrieval of the web graph with the time dimension taken into account. The contributions include decreasing the size of the input data, a hashing technique for URIs that enables native distributed processing with linear and equal insertion and update overhead, an efficient schema to represent the time dimension, and an API for accessing the web graph. ArcLink is a complete system, built on the Hadoop framework, that implements these optimization techniques.

The paper uses the Memento protocol notation [9]. Memento is an extension to the HTTP protocol that allows the user to browse the past web in the same way as the current web. URI-R denotes an original resource that exists, or used to exist, on the live web; URI-M denotes a memento, a snapshot of the original resource as it appeared in the past and as preserved by the web archive. We used the collaborative IIPC Olympics 2010 collection; the goal of this project was to gather a collection of websites on the 2010 Winter Olympics for experimental use. The collection was crawled between 11/2009 and 03/2010, during the 2010 Winter Olympics. The corpus size is over 700 GB, and the crawler started with a seed list of 302 websites. The total number of mementos was 23.7M URI-Ms with 6.4M unique URI-Rs.

2 Related Work

The Link Database [10], part of the Connectivity Server [11] (used by AltaVista), provides fast-access storage for the web graph. The Link Database compressed the link id to 6 bits per link, so the graph could be loaded into main memory. The Scalable Hyperlink Store (SHS; used by Microsoft) [12] is a distributed in-memory database for storing the web graph. SHS provides fast access by keeping the web graph information in main memory and offers an API to facilitate interaction with SHS servers. Suel and Yuan [13] created a hostname list and a URI list, compressed each one with Huffman codes, then divided the links into global links between pages on different hosts and local links between pages on the same host. These systems were built to run on a single machine. Avcular and Suel [14] discussed distributed manipulation of archival web graphs using Hadoop. MapReduce [15] is a distributed programming framework for processing large-scale data; Hadoop is the open source implementation of MapReduce. Pregel [16] is a distributed system for efficient processing of large-scale graphs. PeGaSus [17] is a peta-scale graph mining library for processing large web graphs. Donato et al. [18] studied the properties of the web graph based on the Stanford WebBase collection [19]. Bordino et al. [20] provided a statistical analysis of the temporal characteristics of 100M pages from 12 monthly snapshots of the .uk domain captured between June 2006 and May 2007.

The captures have been preserved as a web graph for each month; the web graphs were built using the WebGraph framework [21].

3 ArcLink Stages

We divided the creation of the temporal web graph into four main stages: filtering, extraction, storage, and access. In this paper, we study the characteristics of each stage and propose suitable optimization approaches. Figure 1 illustrates the different stages and the relations between them.

Fig. 1. ArcLink Architecture.

3.1 Filtering Optimization

The goal of the filtering optimization is to reduce the size of the input corpus by focusing on the snapshots that carry link structure information. The input to the ArcLink system is a list of the URIs within a collection. For Heritrix-based web crawlers, the URI list can be found in the crawler log known as the CDX file. The CDX file is a space-delimited file in which each record belongs to one snapshot; the information includes the URI, timestamp, response code, mimetype, and page checksum. The ArcLink filtering optimization techniques use this information to create a unique list of the mementos that will contribute to the web graph. Reducing the input size improves the extraction time and the required storage space.

Filtering rules. Building the web graph starts with extracting the outlinks of each web page and deriving the inlink model from them. Based on this procedure, the filtering stage excludes any snapshot that does not have outlinks (e.g., images). The following set of filtering rules can be used:

HTTP Status: Include mementos with a successful HTTP status (e.g., 200).

Content Mimetype: Exclude mementos whose mimetype does not carry textual content (e.g., images, JavaScript, style sheets).

Resource extension: Exclude mementos whose URIs end with a non-textual content extension, even if the mimetype is text/html.

Content checksum: The CDX file records a SHA-1 checksum of each memento's content; mementos with a duplicate checksum are excluded.

Implementation. Apache Pig is an open source tool customized for parallel processing of large-scale data. The filtering rules are written as a Pig Latin script that loads the CDX file fields and then applies the INCLUDE/EXCLUDE filters. Pig Latin is customized to work with large amounts of data, uses Hadoop clusters for parallel processing, and makes it simple to add or remove rules from the script. The output of the filtering step is a list of the unique mementos that survive the filtering rules.

Experiment and Results. To quantify the efficiency of the filtering optimization techniques, we ran different filtering rules on the Olympics 2010 collection CDX files. The input file has 23.7M records. The success criterion was to include the mementos that may contribute to the temporal web graph and to exclude the mementos that we expect will not add any value to the web graph. The experiment calculates the reduction in the number of records achieved by each rule and the total gain of applying all the filters. The efficiency is the percentage of mementos in the output snapshot list relative to the original number of snapshots.

Table 1. Reduction Efficiency Experiment Results.

  Rule type | Rule parameter                 | Efficiency in size | Time
  INCLUDE   | HTTP Status                    | ...M (75%)         | 936 sec
  EXCLUDE   | Images, JS, and CSS            | ...M (67%)         | 807 sec
  INCLUDE   | text/* only                    | ...M (53%)         | 744 sec
  EXCLUDE   | Resources with image extension | ...M (69%)         | 845 sec
  EXCLUDE   | Duplicate checksum             | 9.191M (39%)       | 886 sec
  -         | All the rules                  | 6.789M (29%)       | 2098 sec

Analysis. The filtering stage uses pre-known information, recorded by the crawler at capture time, to avoid computation that would yield unuseful results (URIs that do not have links) or duplicate content. The experimental results show that the filtering was effective enough to reduce the input records to 29% of the original size, and this reduction in the number of records is reflected directly in the computation time. The filtering-rule approach is flexible enough to support several kinds of computation; for example, it could be used to extract only images or only videos by adapting the rules to filter by mimetype.
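In ArcLink the rules above are expressed as a Pig Latin script; purely as an illustration, the following plain Java sketch applies the same INCLUDE/EXCLUDE logic to a single CDX record. The field positions and the extension list are assumptions, not taken from the collection's actual CDX layout.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Minimal sketch of the ArcLink filtering rules applied to one CDX record.
 *  The field order and the extension list are illustrative assumptions. */
public class CdxFilter {
    // URI extensions assumed to carry no link structure.
    private static final Set<String> NON_TEXT_EXT =
        new HashSet<>(Arrays.asList("jpg", "jpeg", "png", "gif", "js", "css"));
    private final Set<String> seenChecksums = new HashSet<>();

    /** Returns true if the CDX record should be kept for web-graph extraction. */
    public boolean include(String cdxLine) {
        // Assumed field order: URI, timestamp, original URI, mimetype, status, checksum.
        String[] f = cdxLine.split("\\s+");
        String uri = f[0], mimetype = f[3], status = f[4], checksum = f[5];

        if (!status.startsWith("2")) return false;        // INCLUDE successful HTTP status only
        if (!mimetype.startsWith("text/")) return false;  // EXCLUDE non-textual mimetypes
        String path = uri.replaceAll("[?#].*$", "");
        int dot = path.lastIndexOf('.');
        if (dot >= 0 && NON_TEXT_EXT.contains(path.substring(dot + 1).toLowerCase()))
            return false;                                  // EXCLUDE non-textual extensions
        return seenChecksums.add(checksum);                // EXCLUDE duplicate checksums
    }
}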

Even though this crawling information does not contain any copyrighted material, it is still not available to public users. It is highly recommended that web archives find a mechanism to publish this information to the public to help third-party developers and researchers. The IIPC took an initiative in this direction by funding the IIPC Memento Aggregator Experiment, which aggregates the metadata of the distributed archives of IIPC members to provide Memento-based access to the holdings of open, restricted, and closed archives.

3.2 Extraction Optimization

The optimization techniques for the extraction stage focus on two things: the creation of the URI-ID, and the extraction mechanism (data source and tools).

URI-ID Generation. The creation of a unique ID for each URI is the core of web graph creation. Approaches that depend on ordering, whether lexicographical ordering [10, 12, 14, 21, 22], ordering by inlink degree [23] followed by Huffman coding, or storing the URIs in an array and using the index as the ID [11], prevent the system from processing the link structure in parallel or on different machines, and they complicate the update/re-indexing process. Bharat et al. [24] showed that pages tend to point to other pages from the same domain. To reach the best performance, ArcLink generates a unique ID for each URI and gives URIs from the same domain similar IDs:

1. First, the URI is canonicalized into SURT format (e.g., example.org and www1.example.org become org.example).
2. ArcLink then converts the canonicalized string into a 128-bit ID using SimHash [25].

Using this 1-1 mapping between the URI and the ID enables incremental and distributed processing of the same URI in different cycles or on different machines, because the ID generation depends only on the URI. Also, SimHash ensures that URIs from the same domain/host take similar IDs, and this ID distribution improves access to the URIs.
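The two-step ID generation can be sketched as follows. This is an illustration under stated assumptions: the canonicalization is a simplified SURT-style host reversal rather than the full SURT specification, and the 128-bit SimHash is computed over character trigrams, a feature choice the paper does not specify.

import java.math.BigInteger;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Sketch of ArcLink-style URI-ID generation: SURT-like canonicalization
 *  followed by a 128-bit SimHash over character trigrams (assumed features). */
public class UriId {

    /** Simplified SURT-style canonicalization: lower-case, drop "www" prefixes,
     *  and reverse the host labels (example.org -> org,example). */
    static String canonicalize(String uri) throws Exception {
        URI u = new URI(uri);
        String host = u.getHost().toLowerCase().replaceFirst("^www\\d*\\.", "");
        String[] labels = host.split("\\.");
        StringBuilder surt = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            if (surt.length() > 0) surt.append(',');
            surt.append(labels[i]);
        }
        String path = u.getRawPath();
        return surt + ")" + (path == null || path.isEmpty() ? "/" : path);
    }

    /** 128-bit SimHash over character trigrams of the canonicalized string. */
    static BigInteger simHash128(String s) throws Exception {
        int[] v = new int[128];
        MessageDigest md5 = MessageDigest.getInstance("MD5");   // 128-bit feature hash
        for (int i = 0; i + 3 <= s.length(); i++) {
            byte[] h = md5.digest(s.substring(i, i + 3).getBytes(StandardCharsets.UTF_8));
            for (int b = 0; b < 128; b++) {
                boolean bit = ((h[b / 8] >> (7 - b % 8)) & 1) == 1;
                v[b] += bit ? 1 : -1;                            // vote per bit position
            }
        }
        BigInteger id = BigInteger.ZERO;
        for (int b = 0; b < 128; b++)
            if (v[b] > 0) id = id.setBit(b);
        return id;
    }

    public static void main(String[] args) throws Exception {
        String canonical = canonicalize("http://www1.example.org/index.html");
        System.out.println(canonical + " -> " + simHash128(canonical).toString(16));
    }
}

Because URIs from the same host share most of their trigrams after canonicalization, their hashes differ in only a few bit positions, which is the locality property ArcLink relies on.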

Data sources. The extraction of the link structure from archived web data differs from extraction from the live web because of the format of the archived material. ArcLink can extract the link structure from three sources, depending on the availability of the input:

1. Web ARChive file (WARC): the standard file format for web archives, which offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file.
2. Web Archive Metadata file (WAT): a new metadata file format that carries metadata about the WARC file, including the outlinks.
3. Web Archive UI: the interface that displays the web page as it appeared in the past.

Implementation. Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model. We build the MapReduce jobs in Java; the mapper takes one memento at a time and extracts its outlinks from the available source. For extracting the link structure from the WARC files and from the web interface, we used HTML Parser, a Java library for parsing HTML that is primarily used for transformation or extraction and features filters, visitors, custom tags, and easy-to-use JavaBeans. ArcLink records the following items: the outlink (anchor link or embedded resource link), the type (href or image), and the associated text (the anchor text for an href or the alternate text for an image). The reduce job is responsible for creating an ID for each extracted link and for creating one record that contains (Document Checksum, Outlink URI, Outlink ID, type, text).

Hadoop provides a partitioning technique that aims to move the data to the processing node. The input file for the extraction contains pointers to the actual data (e.g., WARC files): each line in the input file has the URI plus the WARC file name and an offset to the content, so processing each record requires an additional access to the file. Simultaneous multi-access to the same file hurts the performance of the extraction stage [12]. ArcLink provides a novel partitioning technique that ensures a single access to each WARC file per task by splitting the input file manually: each WARC file name appears in only one split file, and a split file may contain one or more WARC files.
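A minimal sketch of this manual splitting is shown below; the input line layout and the naming of the split files are assumptions for illustration only.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Groups extraction input lines ("URI WARC-file offset") by WARC file name and
 *  writes one split file per group, so no WARC file is shared between tasks. */
public class WarcAwareSplitter {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);   // pointer file: one record per memento
        Path outDir = Paths.get(args[1]);  // directory for the split files
        Files.createDirectories(outDir);

        Map<String, List<String>> byWarc = new LinkedHashMap<>();
        for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
            String warcName = line.split("\\s+")[1];  // assumed field order: URI, WARC file, offset
            byWarc.computeIfAbsent(warcName, k -> new ArrayList<>()).add(line);
        }
        int split = 0;
        for (List<String> records : byWarc.values()) {
            // One WARC file per split here; several groups could be merged into one split
            // as long as a WARC name never appears in two splits.
            Files.write(outDir.resolve("split-" + (split++) + ".txt"), records,
                        StandardCharsets.UTF_8);
        }
    }
}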

Extraction from WAT files. The evaluation does not cover extraction from WAT files. Theoretically, extraction from a WAT file is faster than extraction from the actual content because the WAT file has already been processed. However, producing the WAT file is itself an expensive task, because it extracts a lot of information from the actual text, so a quantitative comparison would have to take into account the time needed to create the WAT file. Also, WAT files are not yet available for all collections, and the format is still at an early stage of being defined as an international standard. Finally, the WAT file does not carry all the information required by this system.

Experiment and Results. In this section, we present a quantitative study comparing the different extraction techniques. In this experiment, we extracted the outlinks from two sources (WARC files and the Wayback Machine) for the same collection, and we repeated the experiment with different numbers of MapReduce tasks.

We have two samples of data. For the two-task experiments, we used the first sample of 800k records, which we fed to the extractor in two modes: Normal, using the default Hadoop partitioning, and Partition, in which ArcLink splits the data before submission. For the five-task run, we used the second sample of 500k records. Each task was repeated to extract from the WARC files and from the Wayback Machine; the Wayback Machine was accessed without any politeness period between requests. Table 2 shows the results for the Map phase (total time across all mappers), the Reduce phase, and the total time, in seconds.

Table 2. Extraction Experiment Results (times in seconds).

  Run                 | Input           | Map    | Reduce | Total time
  2 Tasks (Partition) | WARC            | 13,327 | 2,770  | 16,098
  2 Tasks (Partition) | WayBack Machine | 21,422 | 4,194  | 25,616
  2 Tasks (Normal)    | WARC            | 15,324 | 2,940  | 18,265
  5 Tasks (Partition) | WARC            | 8,304  | 1,746  | 10,051
  5 Tasks (Partition) | WayBack Machine | 13,721 | 2,257  | 15,978

Analysis. Extraction from the Wayback Machine failed in the first round because the machine was not prepared for the resulting high load; after updating the memory configuration, we were able to access the machine with different numbers of tasks. This problem did not occur with WARC extraction because it relies on the Hadoop DFS, which was designed to handle a high load of requests. The results also show that increasing the number of tasks may hurt performance, especially if the web server cannot handle a large number of requests.

3.3 Storage Optimization

In the previous stages, the optimization techniques focused on time; in the storage stage, ArcLink optimizes the space required to store the web graph. ArcLink preserves the extracted link structure for future access in a database, which becomes the source of link structure information for any further experiments.

Schema. The outlinks and inlinks with the temporal dimension form a many-to-many relation with various properties. For example, a URI may have different mementos, and each memento may have the same or different outlinks with different anchor text. The main research question is how to represent the web graph together with the temporal information. Usually, the web graph is represented as a directed graph: each vertex represents a URI and each edge represents a hyperlink between URIs. With the temporal dimension, URI-Rx could point to URI-Ry at different timestamps with different properties (i.e., anchor text).

Adding this information to the graph requires a more sophisticated method that keeps the graph at a minimum size. We devised two schemas to represent the temporal web graph.

Fig. 2. Temporal Web Graph approaches: (a) web graph with temporal properties; (b) content-centric.

Web graph with temporal properties (figure 2(a)): In this schema, we expand the regular web graph to include attributes on the edges. For each edge (representing a link from URI-Rx to URI-Ry), we add two fields: the datetime of the memento and the anchor text of the reference. We use this schema for access (section 3.4) because it is more readable for users and applications.

Content-centric temporal web graph (figure 2(b)): In this schema, we replace the URI and datetime attributes with the checksum of the textual content of the memento. Duplicate mementos have the same content (and thus the same checksum), so they collapse into the same vertex. This schema is used for preservation, because focusing on the content removes the duplicate information.

The rest of this section explains the schema implementation in Cassandra.

Implementation. The ArcLink reference implementation uses the Apache Cassandra database, a highly scalable, distributed, structured key-value store. We used the Super Column Family structure to build the schema for saving the link structure information; its advantage is that it mirrors the temporal relation between the datetime and the list of mementos, each with its own attributes. Cassandra handles the update/insert operations, so even if we insert the same record twice it detects that the record was previously inserted and updates the content with the new information. The Heritrix crawler calculates the checksum of each memento to determine whether the memento has changed since the last visit. Each URI-R has a set of checksums, and each checksum has a list of observation datetimes (mementos). We used the SHA-1 checksum as a super column family; each one includes a list of link IDs, each describing one of the outlinks in this memento. For each link ID we add two fields: the type (e.g., href for a hyperlink or img for an image) and the text (e.g., the anchor text for an href or the alt text for an image). In addition to these two main super column families, the system provides some other indexing column families, such as CDX-CF, SHA-CF, and LinkID-CF. Listing 1.1 shows the schema.

Listing 1.1. Cassandra db schema.

OutLink = {
  SHA_CheckSum1: {
    linkid1: { type: href, atext: Anchor text },
    linkid2: { type: img, atext: ALT text }},
  SHA_CheckSum2: {
    linkid1: { type: href, atext: Anchor text },
    linkid4: { type: img, atext: ALT text }}
}
InLink = {
  linkid1: {
    SHA_CheckSum1: { type: href, atext: Anchor text },
    SHA_CheckSum2: { type: href, atext: Anchor text }}
}
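The following plain Java sketch is an illustration only; it is not the ArcLink database driver and does not use any Cassandra client API. It mirrors the nested structure of Listing 1.1 and the insert-or-update behavior described above.

import java.util.LinkedHashMap;
import java.util.Map;

/** In-memory illustration of the structures in Listing 1.1:
 *  OutLink: content checksum -> link ID -> {type, text};
 *  InLink:  link ID -> content checksum -> {type, text}.
 *  Re-inserting the same keys overwrites the attributes, mirroring the
 *  idempotent insert/update behavior described above. */
public class LinkStructureModel {
    public static final class LinkAttrs {
        final String type;   // "href" or "img"
        final String atext;  // anchor text or alt text
        LinkAttrs(String type, String atext) { this.type = type; this.atext = atext; }
    }

    private final Map<String, Map<String, LinkAttrs>> outLink = new LinkedHashMap<>();
    private final Map<String, Map<String, LinkAttrs>> inLink = new LinkedHashMap<>();

    /** Record one extracted link for the memento identified by its SHA-1 checksum. */
    public void put(String sha1Checksum, String linkId, String type, String atext) {
        LinkAttrs attrs = new LinkAttrs(type, atext);
        outLink.computeIfAbsent(sha1Checksum, k -> new LinkedHashMap<>()).put(linkId, attrs);
        inLink.computeIfAbsent(linkId, k -> new LinkedHashMap<>()).put(sha1Checksum, attrs);
    }
}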

Experiment and Results. In this section, we evaluate the efficiency of the new schema in both space (reduction in storage) and time (insertion/update). The database driver is the program responsible for inserting the extracted links into the database and creating the outlink and inlink tables. The Cassandra database driver measured the time required for the insertion; then we repeated the process to measure the time for the update. Figure 3 shows a linear relationship between the number of links and the time required for insertion. The same linear relationship appears for the update, which means there is no extra overhead for the update process.

Fig. 3. Insert/Update time: (a) insert time; (b) update time.

The content-centric schema focuses on the content more than on the URI. The same checksum may belong to one URI at one timestamp (a unique memento), to one URI at different timestamps (duplicate mementos), or to different URIs at different timestamps (duplicate content).

Our experiment found that the duplicate-content case is common; for example, the Olympics collection has more than 23k snapshots that are soft 404s, returning 200 instead of 404. In a regular web graph, each of these snapshots would have its own vertex with the same outlinks. The average number of mementos per SHA-1 checksum is 7.2 (standard deviation of 5011).

3.4 Access

ArcLink provides an API (Application Programming Interface) to enable users and third-party applications to access the link structure information. ArcLink delivers the link structure information to other systems instead of processing or analyzing the information by itself. This makes ArcLink a source of knowledge that can be expanded by incrementally processing more archived web material or by aggregating the ArcLink interface with other archives. The ArcLink delivery method relies on REST web services with the following signature:

Input: URI-R
Output: List of outlinks and inlinks for this URI-R.
Format: Text, XML, and JSON.
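As an illustration, a third-party client could request the JSON form of the link structure for one URI-R as sketched below; the endpoint path and query parameters are hypothetical, since the paper does not specify the URL pattern.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical ArcLink REST client: fetches outlinks/inlinks for one URI-R as JSON.
 *  The endpoint path and the "format"/"uri" parameters are illustrative assumptions. */
public class ArcLinkClient {
    public static void main(String[] args) throws Exception {
        String uriR = "http://example.org/page.html";
        String endpoint = "http://arclink.example.org/arclink/link?format=json&uri="
                + URLEncoder.encode(uriR, "UTF-8");

        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON document with outlink/inlink lists and timestamps
            }
        }
    }
}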

Response Schema. In this section, we explain the response schema. Listing 1.2 shows the schema of the link structure response, written in the XSD schema language.

Listing 1.2. Link Structure Response Schema.

<xs:element name="linkelement">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="uri" type="xs:string" minOccurs="1" />
      <xs:element name="outlink" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="href" type="xs:string" minOccurs="1" />
            <xs:element name="type" type="xs:string" minOccurs="1" />
            <xs:element name="atext" type="xs:string" minOccurs="0" />
            <xs:element ref="timestamp" minOccurs="1" maxOccurs="unbounded" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="inlink" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="type" type="xs:string" minOccurs="1" />
            <xs:element name="uri" type="xs:string" minOccurs="1" />
            <xs:element name="atext" type="xs:string" minOccurs="0" />
            <xs:element ref="timestamp" minOccurs="1" maxOccurs="unbounded" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

URI: The requested URI element in the canonicalized format.

Outlink: A sequence of outlink elements; each one carries four pieces of information: the href it points to, the type (href or embedded resource), the anchor text related to this href, and a list of timestamps for the mementos that carried this outlink.

Inlink: A sequence of inlink elements; each one carries four pieces of information: the URI that points to the requested URI, the type (href or embedded resource), the anchor text related to this URI, and a list of timestamps for the mementos.

The ArcLink interface can be aggregated with other ArcLink instances that may carry information about the requested URI. The aggregation can be done at the response level, which gives each ArcLink implementer the freedom to adjust the implementation details based on its requirements and capabilities. For example, the XML responses could be aggregated by a third-party application that merges two or more XML response trees into one unified tree based on the link ID.

4 Conclusion

This paper presented ArcLink, a distributed system that applies novel optimization techniques to construct, preserve, and deliver the temporal web graph for large-scale web archives. The experiments used the IIPC Olympics Collection; we reused the crawler log to reduce the input corpus to 29% of its original size. ArcLink supports extraction from different sources, with a preference for WARC files when available. We built two schema types: content-centric for preservation and URI-centric for retrieval. ArcLink provides an API to enable third parties to access the temporal web graph information. We plan to use ArcLink to facilitate future research projects.

5 Acknowledgments

This work is supported in part by the Library of Congress and NSF IIS. We would like to thank Kris Carpenter Negulescu, Aaron Binns, and Vinay Goel from the Internet Archive for allowing us to use the IA infrastructure for this research.

References

1. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30 (1998)
2. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46 (1999)
3. Fetterly, D., Manasse, M., Najork, M.: Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In: Proceedings of WebDB 04 (2004)
4. Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., Leonardi, S.: Link analysis for Web spam detection. ACM Transactions on the Web 2 (2008)
5. Dean, J., Henzinger, M.R.: Finding related pages in the World Wide Web. Computer Networks 31 (1999)

6. Bharat, K., Henzinger, M.R.: Improved algorithms for topic distillation in a hyperlinked environment. In: Proceedings of SIGIR 98 (1998)
7. IIPC Access Working Group: Use cases for Access to Internet Archives. Technical report, International Internet Preservation Consortium Publications (2006)
8. Ainsworth, S.G., Alsum, A., SalahEldeen, H., Weigle, M.C., Nelson, M.L.: How much of the web is archived? In: Proceedings of JCDL 11 (2011)
9. Van de Sompel, H., Nelson, M.L., Sanderson, R.: HTTP framework for time-based access to resource states. draft-vandesompel-memento/ (2011)
10. Randall, K.H., Stata, R., Wiener, J.L., Wickremesinghe, R.G.: The Link Database: Fast Access to Graphs of the Web (2002)
11. Bharat, K., Broder, A., Henzinger, M., Kumar, P., Venkatasubramanian, S.: The Connectivity Server: fast access to linkage information on the Web. Computer Networks and ISDN Systems 30 (1998)
12. Najork, M.: The scalable hyperlink store. In: Proceedings of the 20th ACM Conference on Hypertext and Hypermedia (HT 09) (2009)
13. Suel, T., Yuan, J.: Compressing the graph structure of the Web. In: Proceedings of the DCC Data Compression Conference, IEEE Computer Society (2001)
14. Avcular, Y., Suel, T.: Scalable Manipulation of Archival Web Graphs. In: Proceedings of LSDS-IR (2011)
15. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: 6th Symposium on Operating System Design and Implementation (OSDI 2004), USENIX Association (2004)
16. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A System for Large-Scale Graph Processing. In: Proceedings of SIGMOD 10 (2010)
17. Kang, U., Tsourakakis, C.E., Faloutsos, C.: PEGASUS: mining peta-scale graphs. Knowledge and Information Systems 27 (2010)
18. Donato, D., Laura, L., Leonardi, S., Millozzi, S.: Large scale properties of the Webgraph. The European Physical Journal B - Condensed Matter 38 (2004)
19. Cho, J., Garcia-Molina, H., Haveliwala, T., Lam, W., Paepcke, A., Raghavan, S., Wesley, G.: Stanford WebBase components and applications. ACM Transactions on Internet Technology 6 (2006)
20. Bordino, I., Boldi, P., Donato, D., Santini, M., Vigna, S.: Temporal Evolution of the UK Web. In: 2008 IEEE International Conference on Data Mining Workshops, IEEE (2008)
21. Boldi, P., Vigna, S.: The WebGraph framework I. In: Proceedings of WWW 04 (2004)
22. Guillaume, J.L., Latapy, M., Viennot, L.: Efficient and Simple Encodings for the Web Graph. In: Meng, X., Su, J., Wang, Y. (eds.): Advances in Web-Age Information Management. Volume 2419 of Lecture Notes in Computer Science. Springer, Berlin/Heidelberg (2002)
23. Adler, M., Mitzenmacher, M.: Towards compressing Web graphs. In: Proceedings of the DCC Data Compression Conference, IEEE Computer Society (2001)
24. Bharat, K., Chang, B.W., Henzinger, M., Ruhl, M.: Who links to whom: mining linkage between Web sites. In: Proceedings of the 2001 IEEE International Conference on Data Mining, IEEE Computer Society (2001)
25. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of STOC 02 (2002)
