ArcLink: Optimization techniques to build and retrieve the Temporal Web Graph


A. A. and B. B.
Computer Science Department, One University, One City

Abstract. Archiving the web is socially and culturally critical, but it presents problems of scale. In this paper, we present ArcLink, an exemplary system that optimizes the construction, storage, and access of the temporal web graph extracted from large-scale web archives. We divide the construction into four stages (filtering, extraction, storage, and access) and explore optimizations for each stage. We were able to reduce the size of the corpus to 29% of the original, and we explored URI-based and content-based indexing approaches to support extraction of the web graph.

Keywords: Web Graph, Web Archive, Memento

1 Introduction

The web graph is a graph in which each vertex represents a URI and each edge represents a hyperlink between two URIs. The web graph has been the foundation of various applications: ranking, such as PageRank [1] and Kleinberg's HITS [2]; web spam detection [3, 4]; and finding related pages [5]. Bharat et al. [6] also defined the hostgraph, which captures linkage between sites rather than pages. The temporal web graph may have more than one memento for each URI, at various timestamps; this is the case for collections preserved by web archives. Web archives preserve web materials before they change or disappear forever. The International Internet Preservation Consortium (IIPC) was formed in 2003 to improve the tools, standards, and best practices of web archiving. The IIPC Access Working Group proposed different use cases for access to Internet archives [7], covering various types of users. The report highlighted the importance of the linking information of the archived web and suggested providing an API to query the link structure of a specific URI.

The time dimension in the web graph challenges the traditional web graph applications. For example, the temporal web graph could be used to rank full-text search results in web archives with respect to time. We cannot use the absolute number of inlinks because the density of mementos is not consistent over time [8]. Also, the results should differ based on the time filter (e.g., a query whose first result in 2004 was Yahoo! Mail may return Gmail as the first result today).

The temporal web graph could also be used by a web crawler to discover new URIs, and researchers have used the temporal web graph to study the evolution of the web. In this paper, we propose new optimization techniques for the creation, preservation, and retrieval of the web graph with the time dimension taken into account. The contributions include decreasing the size of the input data, a hashing technique for URIs that enables native distributed processing with linear and equal insertion and update overhead, an efficient schema to represent the time dimension, and an API for accessing the web graph. ArcLink is a complete system, built on the Hadoop framework, that implements these optimization techniques.

The paper uses the Memento protocol notation [9]. Memento is an extension to the HTTP protocol that allows the user to browse the past web in the same way as the current web. URI-R denotes an original resource that exists, or used to exist, on the live web; URI-M denotes a memento, a snapshot of the original resource as it appeared in the past and as preserved by the web archive. We used the collaborative IIPC Olympics 2010 collection; the goal of this project was to gather a collection of websites on the 2010 Winter Olympics for experimental use. The collection was crawled between 11/2009 and 03/2010, during the 2010 Winter Olympics. The corpus size is over 700 GB, and the crawler started with a seed list of 302 websites. The total number of mementos was 23.7M URI-Ms with 6.4M unique URI-Rs.

2 Related Work

The Link Database [10], part of the Connectivity Server [11] (used by AltaVista), provides fast-access storage for the web graph. The Link Database compressed the link id to 6 bits per link, so the graph could be loaded into main memory. The Scalable Hyperlink Store (SHS; used by Microsoft) [12] is a distributed in-memory database for storing the web graph. SHS provides fast access by keeping the web graph information in main memory and offers an API to facilitate interaction with SHS servers. Suel and Yuan [13] created a hostname list and a URI list, compressed each one with Huffman codes, then divided the links into global links between pages on different hosts and local links between pages on the same host. These systems were built to run on a single machine. Avcular and Suel [14] discussed distributed manipulation of archival web graphs using Hadoop. MapReduce [15] is a distributed programming framework for processing large-scale data; Hadoop is the open source implementation of MapReduce. Pregel [16] is a distributed system for efficient processing of large-scale graphs. PeGaSus [17] is a peta-scale graph mining library for processing large web graphs. Donato et al. [18] studied the properties of the web graph based on the Stanford WebBase collection [19]. Bordino et al. [20] provided a statistical analysis of the temporal characteristics of 100M pages from 12 monthly snapshots of the .uk domain captured between June 2006 and May 2007.

The captures have been preserved as a web graph for each month; the web graphs were built using the WebGraph framework [21].

3 ArcLink Stages

We divided the creation of the temporal web graph into four main stages: filtering, extraction, storage, and access. In this paper, we study the characteristics of each stage and propose suitable optimization approaches. Figure 1 illustrates the different stages and the relations between them.

Fig. 1. ArcLink Architecture.

3.1 Filtering Optimization

The goal of the filtering optimization is to reduce the size of the input corpus by focusing on the snapshots that carry link structure information. The input to the ArcLink system is a list of the URIs within a collection. For Heritrix-based web crawlers, the URI list can be found in the crawler log known as the CDX file. The CDX file is a space-delimited file in which each record belongs to one snapshot; the information includes the URI, timestamp, response code, mimetype, and page checksum. The ArcLink filtering optimization techniques use this information to create a unique list of the mementos that will contribute to the web graph. Reducing the input size improves the extraction time and the required storage space.

Filtering rules. Building the web graph starts with extracting the outlinks of each web page and deriving the inlink model from them. Based on this procedure, the filtering stage excludes any snapshot that does not have outlinks (e.g., images). The following set of filtering rules can be used:

HTTP Status: Include mementos with a successful HTTP status (e.g., 200).

Content Mimetype: Exclude mementos whose mimetype does not carry textual content (e.g., images, JavaScript, style sheets).

Resource extension: Exclude mementos whose URIs end with a non-textual content extension, even if the mimetype is text/html.

Content checksum: The CDX file records a SHA-1 checksum of each memento's content; mementos with a duplicate checksum are excluded.

Implementation. Apache Pig is an open source tool customized for parallel processing of large-scale data. The filtering rules are written as a Pig Latin script that loads the CDX file fields and then applies the INCLUDE/EXCLUDE filters. Pig Latin is customized to work with large amounts of data, uses Hadoop clusters for parallel processing, and makes it simple to add or remove rules from the script. The output of the filtering step is a list of the unique mementos that survive the filtering rules.

Experiment and Results. To quantify the efficiency of the filtering optimization techniques, we ran different filtering rules on the Olympics 2010 collection CDX files. The input file has 23.7M records. The success criterion was to include the mementos that may contribute to the temporal web graph and to exclude the mementos that we expect will not add any value to the web graph. The experiment calculates the reduction in the number of records achieved by each rule and the total gain of applying all the filters. The efficiency is the percentage of mementos in the output snapshot list relative to the original number of snapshots.

Table 1. Reduction Efficiency Experiment Results.

  Rule type | Rule parameter                 | Efficiency in size | Time
  INCLUDE   | HTTP Status                    | ...M (75%)         | 936 sec
  EXCLUDE   | Images, JS, and CSS            | ...M (67%)         | 807 sec
  INCLUDE   | text/* only                    | ...M (53%)         | 744 sec
  EXCLUDE   | Resources with image extension | ...M (69%)         | 845 sec
  EXCLUDE   | Duplicate checksum             | 9.191M (39%)       | 886 sec
  -         | All the rules                  | 6.789M (29%)       | 2098 sec

Analysis. The filtering stage uses pre-known information, recorded by the crawler at capture time, to avoid computation that would yield unuseful results (URIs that do not have links) or duplicate content. The experimental results show that the filtering was effective enough to reduce the input records to 29% of the original size, and this reduction in the number of records is reflected directly in the computation time. The filtering-rule approach is flexible enough to support several kinds of computation; for example, it could be used to extract only images or only videos by adapting the rules to filter by mimetype.
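In ArcLink the rules above are expressed as a Pig Latin script; purely as an illustration, the following plain Java sketch applies the same INCLUDE/EXCLUDE logic to a single CDX record. The field positions and the extension list are assumptions, not taken from the collection's actual CDX layout.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Minimal sketch of the ArcLink filtering rules applied to one CDX record.
 *  The field order and the extension list are illustrative assumptions. */
public class CdxFilter {
    // URI extensions assumed to carry no link structure.
    private static final Set<String> NON_TEXT_EXT =
        new HashSet<>(Arrays.asList("jpg", "jpeg", "png", "gif", "js", "css"));
    private final Set<String> seenChecksums = new HashSet<>();

    /** Returns true if the CDX record should be kept for web-graph extraction. */
    public boolean include(String cdxLine) {
        // Assumed field order: URI, timestamp, original URI, mimetype, status, checksum.
        String[] f = cdxLine.split("\\s+");
        String uri = f[0], mimetype = f[3], status = f[4], checksum = f[5];

        if (!status.startsWith("2")) return false;        // INCLUDE successful HTTP status only
        if (!mimetype.startsWith("text/")) return false;  // EXCLUDE non-textual mimetypes
        String path = uri.replaceAll("[?#].*$", "");
        int dot = path.lastIndexOf('.');
        if (dot >= 0 && NON_TEXT_EXT.contains(path.substring(dot + 1).toLowerCase()))
            return false;                                  // EXCLUDE non-textual extensions
        return seenChecksums.add(checksum);                // EXCLUDE duplicate checksums
    }
}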

Even though this crawling information does not contain any copyrighted material, it is still not available to public users. It is highly recommended that web archives find a mechanism to publish this information to the public to help third-party developers and researchers. The IIPC took an initiative in this direction by funding the IIPC Memento Aggregator Experiment, which aggregates the metadata of the distributed archives of IIPC members to provide Memento-based access to the holdings of open, restricted, and closed archives.

3.2 Extraction Optimization

The optimization techniques for the extraction stage focus on two things: the creation of the URI-ID, and the extraction mechanism (data source and tools).

URI-ID Generation. The creation of a unique ID for each URI is the core of web graph creation. Approaches that depend on ordering, whether lexicographical ordering [10, 12, 14, 21, 22], ordering by inlink degree [23] followed by Huffman coding, or storing the URIs in an array and using the index as the ID [11], prevent the system from processing the link structure in parallel or on different machines, and they complicate the update/re-indexing process. Bharat et al. [24] showed that pages tend to point to other pages from the same domain. To reach the best performance, ArcLink generates a unique ID for each URI and gives URIs from the same domain similar IDs:

1. First, the URI is canonicalized into SURT format (e.g., example.org and www1.example.org become org.example).
2. ArcLink then converts the canonicalized string into a 128-bit ID using SimHash [25].

Using this 1-1 mapping between the URI and the ID enables incremental and distributed processing of the same URI in different cycles or on different machines, because the ID generation depends only on the URI. Also, SimHash ensures that URIs from the same domain/host take similar IDs, and this ID distribution improves access to the URIs.
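The two-step ID generation can be sketched as follows. This is an illustration under stated assumptions: the canonicalization is a simplified SURT-style host reversal rather than the full SURT specification, and the 128-bit SimHash is computed over character trigrams, a feature choice the paper does not specify.

import java.math.BigInteger;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Sketch of ArcLink-style URI-ID generation: SURT-like canonicalization
 *  followed by a 128-bit SimHash over character trigrams (assumed features). */
public class UriId {

    /** Simplified SURT-style canonicalization: lower-case, drop "www" prefixes,
     *  and reverse the host labels (example.org -> org,example). */
    static String canonicalize(String uri) throws Exception {
        URI u = new URI(uri);
        String host = u.getHost().toLowerCase().replaceFirst("^www\\d*\\.", "");
        String[] labels = host.split("\\.");
        StringBuilder surt = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            if (surt.length() > 0) surt.append(',');
            surt.append(labels[i]);
        }
        String path = u.getRawPath();
        return surt + ")" + (path == null || path.isEmpty() ? "/" : path);
    }

    /** 128-bit SimHash over character trigrams of the canonicalized string. */
    static BigInteger simHash128(String s) throws Exception {
        int[] v = new int[128];
        MessageDigest md5 = MessageDigest.getInstance("MD5");   // 128-bit feature hash
        for (int i = 0; i + 3 <= s.length(); i++) {
            byte[] h = md5.digest(s.substring(i, i + 3).getBytes(StandardCharsets.UTF_8));
            for (int b = 0; b < 128; b++) {
                boolean bit = ((h[b / 8] >> (7 - b % 8)) & 1) == 1;
                v[b] += bit ? 1 : -1;                            // vote per bit position
            }
        }
        BigInteger id = BigInteger.ZERO;
        for (int b = 0; b < 128; b++)
            if (v[b] > 0) id = id.setBit(b);
        return id;
    }

    public static void main(String[] args) throws Exception {
        String canonical = canonicalize("http://www1.example.org/index.html");
        System.out.println(canonical + " -> " + simHash128(canonical).toString(16));
    }
}

Because URIs from the same host share most of their trigrams after canonicalization, their hashes differ in only a few bit positions, which is the locality property ArcLink relies on.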

Data sources. The extraction of the link structure from archived web data differs from extraction from the live web because of the format of the archived material. ArcLink can extract the link structure from three sources, depending on the availability of the input:

1. Web ARChive file (WARC): the standard file format for web archives, which offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file.
2. Web Archive Metadata file (WAT): a new metadata file format that carries metadata about the WARC file, including the outlinks.
3. Web Archive UI: the interface that displays the web page as it appeared in the past.

Implementation. Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model. We build the MapReduce jobs in Java; the mapper takes one memento at a time and extracts its outlinks from the available source. For extracting the link structure from the WARC files and from the web interface, we used HTML Parser, a Java library for parsing HTML that is primarily used for transformation or extraction and features filters, visitors, custom tags, and easy-to-use JavaBeans. ArcLink records the following items: the outlink (anchor link or embedded resource link), the type (href or image), and the associated text (the anchor text for an href or the alternate text for an image). The reduce job is responsible for creating an ID for each extracted link and for creating one record that contains (Document Checksum, Outlink URI, Outlink ID, type, text).

Hadoop provides a partitioning technique that aims to move the data to the processing node. The input file for the extraction contains pointers to the actual data (e.g., WARC files): each line in the input file has the URI plus the WARC file name and an offset to the content, so processing each record requires an additional access to the file. Simultaneous multi-access to the same file hurts the performance of the extraction stage [12]. ArcLink provides a novel partitioning technique that ensures a single access to each WARC file per task by splitting the input file manually: each WARC file name appears in only one split file, and a split file may contain one or more WARC files.
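A minimal sketch of this manual splitting is shown below; the input line layout and the naming of the split files are assumptions for illustration only.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Groups extraction input lines ("URI WARC-file offset") by WARC file name and
 *  writes one split file per group, so no WARC file is shared between tasks. */
public class WarcAwareSplitter {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);   // pointer file: one record per memento
        Path outDir = Paths.get(args[1]);  // directory for the split files
        Files.createDirectories(outDir);

        Map<String, List<String>> byWarc = new LinkedHashMap<>();
        for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
            String warcName = line.split("\\s+")[1];  // assumed field order: URI, WARC file, offset
            byWarc.computeIfAbsent(warcName, k -> new ArrayList<>()).add(line);
        }
        int split = 0;
        for (List<String> records : byWarc.values()) {
            // One WARC file per split here; several groups could be merged into one split
            // as long as a WARC name never appears in two splits.
            Files.write(outDir.resolve("split-" + (split++) + ".txt"), records,
                        StandardCharsets.UTF_8);
        }
    }
}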

Extraction from WAT files. The evaluation does not cover extraction from WAT files. Theoretically, extraction from a WAT file is faster than extraction from the actual content because the WAT file has already been processed. However, producing the WAT file is itself an expensive task, because it extracts a lot of information from the actual text, so a quantitative comparison would have to take into account the time needed to create the WAT file. Also, WAT files are not yet available for all collections, and the format is still at an early stage of being defined as an international standard. Finally, the WAT file does not carry all the information required by this system.

Experiment and Results. In this section, we present a quantitative study comparing the different extraction techniques. In this experiment, we extracted the outlinks from two sources (WARC files and the Wayback Machine) for the same collection, and we repeated the experiment with different numbers of MapReduce tasks.

We have two samples of data. For the two-task experiments, we used the first sample of 800k records, which we fed to the extractor in two modes: Normal, using the default Hadoop partitioning, and Partition, in which ArcLink splits the data before submission. For the five-task run, we used the second sample of 500k records. Each task was repeated to extract from the WARC files and from the Wayback Machine; the Wayback Machine was accessed without any politeness period between requests. Table 2 shows the results for the Map phase (total time across all mappers), the Reduce phase, and the total time, in seconds.

Table 2. Extraction Experiment Results (times in seconds).

  Run                 | Input           | Map    | Reduce | Total time
  2 Tasks (Partition) | WARC            | 13,327 | 2,770  | 16,098
  2 Tasks (Partition) | WayBack Machine | 21,422 | 4,194  | 25,616
  2 Tasks (Normal)    | WARC            | 15,324 | 2,940  | 18,265
  5 Tasks (Partition) | WARC            | 8,304  | 1,746  | 10,051
  5 Tasks (Partition) | WayBack Machine | 13,721 | 2,257  | 15,978

Analysis. Extraction from the Wayback Machine failed in the first round because the machine was not prepared for the resulting high load; after updating the memory configuration, we were able to access the machine with different numbers of tasks. This problem did not occur with WARC extraction because it relies on the Hadoop DFS, which was designed to handle a high load of requests. The results also show that increasing the number of tasks may hurt performance, especially if the web server cannot handle a large number of requests.

3.3 Storage Optimization

In the previous stages, the optimization techniques focused on time; in the storage stage, ArcLink optimizes the space required to store the web graph. ArcLink preserves the extracted link structure for future access in a database, which becomes the source of link structure information for any further experiments.

Schema. The outlinks and inlinks with the temporal dimension form a many-to-many relation with various properties. For example, a URI may have different mementos, and each memento may have the same or different outlinks with different anchor text. The main research question is how to represent the web graph together with the temporal information. Usually, the web graph is represented as a directed graph: each vertex represents a URI and each edge represents a hyperlink between URIs. With the temporal dimension, URI-Rx could point to URI-Ry at different timestamps with different properties (i.e., anchor text).

Adding this information to the graph requires a more sophisticated method that keeps the graph at a minimum size. We devised two schemas to represent the temporal web graph.

Fig. 2. Temporal Web Graph approaches: (a) web graph with temporal properties; (b) content-centric.

Web graph with temporal properties (figure 2(a)): In this schema, we expand the regular web graph to include attributes on the edges. For each edge (representing a link from URI-Rx to URI-Ry), we add two fields: the datetime of the memento and the anchor text of the reference. We use this schema for access (section 3.4) because it is more readable for users and applications.

Content-centric temporal web graph (figure 2(b)): In this schema, we replace the URI and datetime attributes with the checksum of the textual content of the memento. Duplicate mementos have the same content (and thus the same checksum), so they collapse into the same vertex. This schema is used for preservation, because focusing on the content removes the duplicate information.

The rest of this section explains the schema implementation in Cassandra.

Implementation. The ArcLink reference implementation uses the Apache Cassandra database, a highly scalable, distributed, structured key-value store. We used the Super Column Family structure to build the schema for saving the link structure information; its advantage is that it mirrors the temporal relation between the datetime and the list of mementos, each with its own attributes. Cassandra handles the update/insert operations, so even if we insert the same record twice it detects that the record was previously inserted and updates the content with the new information. The Heritrix crawler calculates the checksum of each memento to determine whether the memento has changed since the last visit. Each URI-R has a set of checksums, and each checksum has a list of observation datetimes (mementos). We used the SHA-1 checksum as a super column family; each one includes a list of link IDs, each describing one of the outlinks in this memento. For each link ID we add two fields: the type (e.g., href for a hyperlink or img for an image) and the text (e.g., the anchor text for an href or the alt text for an image). In addition to these two main super column families, the system provides some other indexing column families, such as CDX-CF, SHA-CF, and LinkID-CF. Listing 1.1 shows the schema.

Listing 1.1. Cassandra db schema.

OutLink = {
  SHA_CheckSum1: {
    linkid1: { type: href, atext: Anchor text },
    linkid2: { type: img, atext: ALT text }},
  SHA_CheckSum2: {
    linkid1: { type: href, atext: Anchor text },
    linkid4: { type: img, atext: ALT text }}
}
InLink = {
  linkid1: {
    SHA_CheckSum1: { type: href, atext: Anchor text },
    SHA_CheckSum2: { type: href, atext: Anchor text }}
}
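The following plain Java sketch is an illustration only; it is not the ArcLink database driver and does not use any Cassandra client API. It mirrors the nested structure of Listing 1.1 and the insert-or-update behavior described above.

import java.util.LinkedHashMap;
import java.util.Map;

/** In-memory illustration of the structures in Listing 1.1:
 *  OutLink: content checksum -> link ID -> {type, text};
 *  InLink:  link ID -> content checksum -> {type, text}.
 *  Re-inserting the same keys overwrites the attributes, mirroring the
 *  idempotent insert/update behavior described above. */
public class LinkStructureModel {
    public static final class LinkAttrs {
        final String type;   // "href" or "img"
        final String atext;  // anchor text or alt text
        LinkAttrs(String type, String atext) { this.type = type; this.atext = atext; }
    }

    private final Map<String, Map<String, LinkAttrs>> outLink = new LinkedHashMap<>();
    private final Map<String, Map<String, LinkAttrs>> inLink = new LinkedHashMap<>();

    /** Record one extracted link for the memento identified by its SHA-1 checksum. */
    public void put(String sha1Checksum, String linkId, String type, String atext) {
        LinkAttrs attrs = new LinkAttrs(type, atext);
        outLink.computeIfAbsent(sha1Checksum, k -> new LinkedHashMap<>()).put(linkId, attrs);
        inLink.computeIfAbsent(linkId, k -> new LinkedHashMap<>()).put(sha1Checksum, attrs);
    }
}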

Experiment and Results. In this section, we evaluate the efficiency of the new schema in both space (reduction in storage) and time (insertion/update). The database driver is the program responsible for inserting the extracted links into the database and creating the outlink and inlink tables. The Cassandra database driver measured the time required for the insertion; then we repeated the process to measure the time for the update. Figure 3 shows a linear relationship between the number of links and the time required for insertion. The same linear relationship appears for the update, which means there is no extra overhead for the update process.

Fig. 3. Insert/Update time: (a) insert time; (b) update time.

The content-centric schema focuses on the content more than on the URI. The same checksum may belong to one URI at one timestamp (a unique memento), to one URI at different timestamps (duplicate mementos), or to different URIs at different timestamps (duplicate content).

Our experiment found that the duplicate-content case is common; for example, the Olympics collection has more than 23k snapshots that are soft 404s, returning 200 instead of 404. In a regular web graph, each of these snapshots would have its own vertex with the same outlinks. The average number of mementos per SHA-1 checksum is 7.2 (standard deviation of 5011).

3.4 Access

ArcLink provides an API (Application Programming Interface) to enable users and third-party applications to access the link structure information. ArcLink delivers the link structure information to other systems instead of processing or analyzing the information by itself. This makes ArcLink a source of knowledge that can be expanded by incrementally processing more archived web material or by aggregating the ArcLink interface with other archives. The ArcLink delivery method relies on REST web services with the following signature:

Input: URI-R
Output: List of outlinks and inlinks for this URI-R.
Format: Text, XML, and JSON.
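As an illustration, a third-party client could request the JSON form of the link structure for one URI-R as sketched below; the endpoint path and query parameters are hypothetical, since the paper does not specify the URL pattern.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical ArcLink REST client: fetches outlinks/inlinks for one URI-R as JSON.
 *  The endpoint path and the "format"/"uri" parameters are illustrative assumptions. */
public class ArcLinkClient {
    public static void main(String[] args) throws Exception {
        String uriR = "http://example.org/page.html";
        String endpoint = "http://arclink.example.org/arclink/link?format=json&uri="
                + URLEncoder.encode(uriR, "UTF-8");

        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON document with outlink/inlink lists and timestamps
            }
        }
    }
}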

Response Schema. In this section, we explain the response schema. Listing 1.2 shows the schema of the link structure response, written in the XSD schema language.

Listing 1.2. Link Structure Response Schema.

<xs:element name="linkelement">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="uri" type="xs:string" minOccurs="1" />
      <xs:element name="outlink" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="href" type="xs:string" minOccurs="1" />
            <xs:element name="type" type="xs:string" minOccurs="1" />
            <xs:element name="atext" type="xs:string" minOccurs="0" />
            <xs:element ref="timestamp" minOccurs="1" maxOccurs="unbounded" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="inlink" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="type" type="xs:string" minOccurs="1" />
            <xs:element name="uri" type="xs:string" minOccurs="1" />
            <xs:element name="atext" type="xs:string" minOccurs="0" />
            <xs:element ref="timestamp" minOccurs="1" maxOccurs="unbounded" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>

URI: The requested URI element in the canonicalized format.

Outlink: A sequence of outlink elements; each one carries four pieces of information: the href it points to, the type (href or embedded resource), the anchor text related to this href, and a list of timestamps for the mementos that carried this outlink.

Inlink: A sequence of inlink elements; each one carries four pieces of information: the URI that points to the requested URI, the type (href or embedded resource), the anchor text related to this URI, and a list of timestamps for the mementos.

The ArcLink interface can be aggregated with other ArcLink instances that may carry information about the requested URI. The aggregation can be done at the response level, which gives each ArcLink implementer the freedom to adjust the implementation details based on its requirements and capabilities. For example, the XML responses could be aggregated by a third-party application that merges two or more XML response trees into one unified tree based on the link ID.

4 Conclusion

This paper presented ArcLink, a distributed system that applies novel optimization techniques to construct, preserve, and deliver the temporal web graph for large-scale web archives. The experiments used the IIPC Olympics Collection; we reused the crawler log to reduce the input corpus to 29% of its original size. ArcLink supports extraction from different sources, with a preference for WARC files when available. We built two schema types: content-centric for preservation and URI-centric for retrieval. ArcLink provides an API to enable third parties to access the temporal web graph information. We plan to use ArcLink to facilitate future research projects.

5 Acknowledgments

This work is supported in part by the Library of Congress and NSF IIS. We would like to thank Kris Carpenter Negulescu, Aaron Binns, and Vinay Goel from the Internet Archive for allowing us to use the IA infrastructure for this research.

References

1. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30 (1998)
2. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46 (1999)
3. Fetterly, D., Manasse, M., Najork, M.: Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In: Proceedings of WebDB 04 (2004)
4. Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., Leonardi, S.: Link analysis for Web spam detection. ACM Transactions on the Web 2 (2008)
5. Dean, J., Henzinger, M.R.: Finding related pages in the World Wide Web. Computer Networks 31 (1999)

6. Bharat, K., Henzinger, M.R.: Improved algorithms for topic distillation in a hyperlinked environment. In: Proceedings of SIGIR 98 (1998)
7. IIPC Access Working Group: Use cases for Access to Internet Archives. Technical report, International Internet Preservation Consortium Publications (2006)
8. Ainsworth, S.G., Alsum, A., SalahEldeen, H., Weigle, M.C., Nelson, M.L.: How much of the web is archived? In: Proceedings of JCDL 11 (2011)
9. Van de Sompel, H., Nelson, M.L., Sanderson, R.: HTTP framework for time-based access to resource states. draft-vandesompel-memento/ (2011)
10. Randall, K.H., Stata, R., Wiener, J.L., Wickremesinghe, R.G.: The Link Database: Fast Access to Graphs of the Web (2002)
11. Bharat, K., Broder, A., Henzinger, M., Kumar, P., Venkatasubramanian, S.: The Connectivity Server: fast access to linkage information on the Web. Computer Networks and ISDN Systems 30 (1998)
12. Najork, M.: The scalable hyperlink store. In: Proceedings of the 20th ACM Conference on Hypertext and Hypermedia (HT 09) (2009)
13. Suel, T., Yuan, J.: Compressing the graph structure of the Web. In: Proceedings of the DCC Data Compression Conference, IEEE Computer Society (2001)
14. Avcular, Y., Suel, T.: Scalable Manipulation of Archival Web Graphs. In: Proceedings of LSDS-IR (2011)
15. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: 6th Symposium on Operating System Design and Implementation (OSDI 2004), USENIX Association (2004)
16. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A System for Large-Scale Graph Processing. In: Proceedings of SIGMOD 10 (2010)
17. Kang, U., Tsourakakis, C.E., Faloutsos, C.: PEGASUS: mining peta-scale graphs. Knowledge and Information Systems 27 (2010)
18. Donato, D., Laura, L., Leonardi, S., Millozzi, S.: Large scale properties of the Webgraph. The European Physical Journal B - Condensed Matter 38 (2004)
19. Cho, J., Garcia-Molina, H., Haveliwala, T., Lam, W., Paepcke, A., Raghavan, S., Wesley, G.: Stanford WebBase components and applications. ACM Transactions on Internet Technology 6 (2006)
20. Bordino, I., Boldi, P., Donato, D., Santini, M., Vigna, S.: Temporal Evolution of the UK Web. In: 2008 IEEE International Conference on Data Mining Workshops, IEEE (2008)
21. Boldi, P., Vigna, S.: The WebGraph framework I. In: Proceedings of WWW 04 (2004)
22. Guillaume, J.L., Latapy, M., Viennot, L.: Efficient and Simple Encodings for the Web Graph. In: Meng, X., Su, J., Wang, Y. (eds.): Advances in Web-Age Information Management. Volume 2419 of Lecture Notes in Computer Science. Springer, Berlin/Heidelberg (2002)
23. Adler, M., Mitzenmacher, M.: Towards compressing Web graphs. In: Proceedings of the DCC Data Compression Conference, IEEE Computer Society (2001)
24. Bharat, K., Chang, B.W., Henzinger, M., Ruhl, M.: Who links to whom: mining linkage between Web sites. In: Proceedings of the 2001 IEEE International Conference on Data Mining, IEEE Computer Society (2001)
25. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of STOC 02 (2002)
