Structure Properties of the Thai WWW: The 2003 Survey

Surasak Sanguanpong and Kasom Koth-arsa

Applied Network Research Laboratory, Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900, THAILAND
{Surasak.S,

Abstract. This paper presents quantitative measurements and analyses of the structural properties of the Thai World Wide Web, obtained with a high-performance web spider called KSpider. The paper briefly describes KSpider's system architecture and several design techniques, such as workload distribution, scheduling policy, in-memory URL compression, and an enhanced DNS resolver. Selected statistics on the Thai WWW, based on information gathered in March 2003, are presented. KSpider collected 8,170,005 URLs (HTML and images) in 7 days. All compressed web data consumes around 155 GB of disk space. In total, 3,277,988 HTML documents (around 54 GB) on 24,124 web servers with 9,167 unique IP addresses were found. Over 60 million hyperlinks were analyzed. Further statistics are reported, including documents and servers classified by domain name, the percentage of server types, the distribution of page sizes, the distribution of file extensions, and hyperlink connectivity between domains.

1 Introduction

The purpose of this paper is to present quantitative measurements and analyses of various issues related to web servers, together with some structural information on the Thai WWW, obtained with the parallel web spider called KSpider [1]. Earlier reports on Thai WWW statistics were published [2] by the same authors, and more facts and figures are available on the homepage of the Applied Network Research Laboratory. The earlier report [2] covered the survey period from March 2000 to March 2001 and included only web servers whose names are registered under the .th domain. However, the Internet has grown continuously in Thailand [3].
Moreover, country domain name registrations suffer from the widespread use of .com and other top-level domain names (TLDs), whose quantity and properties were unknown. To obtain up-to-date information on these characteristics, the survey was reactivated and extended to include two types of web servers: (1) web servers whose names end with the .th suffix and (2) web servers whose names end with any other gTLD or ccTLD (.com, .net, .org, etc.) but whose IP addresses belong to address blocks assigned to network organizations in Thailand. KSpider was therefore configured to download data from web servers whose names are registered under the .th domain or whose IP addresses are in Thailand. The list of such IP addresses was created from the route server of the national Internet exchange (THNIX), available at telnet://route-server.cat.net.th.
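The two inclusion criteria can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name `in_scope` and the address blocks in `THAI_BLOCKS` are hypothetical (the blocks are reserved documentation ranges, not prefixes taken from the THNIX route server), and the paper does not describe how KSpider itself implements the check.

```python
import ipaddress

# Hypothetical stand-ins for the prefixes learned from the THNIX route server.
# These are reserved documentation ranges, not real Thai address blocks.
THAI_BLOCKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_scope(hostname: str, ip: str) -> bool:
    """Return True if a web server falls inside the survey.

    Criterion (1): the host name ends with the .th suffix.
    Criterion (2): the resolved IP address lies in a Thai address block.
    """
    name = hostname.lower().rstrip(".")
    if name.endswith(".th"):
        return True
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in THAI_BLOCKS)
```

With these assumed blocks, `in_scope("www.example.com", "192.0.2.10")` is accepted by the second criterion even though the name is not under .th.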

In the next section, we briefly discuss KSpider's architecture. The following section presents the analyses of the Thai WWW. Finally, we conclude the paper.

2 KSpider's Design and Implementation

A web spider (also called a crawler or robot) is one of the key components of a search engine. Its main task is to automatically collect HTML documents (web pages) and other web data (images and other file types) from public web servers. Spiders "crawl" across the web, following hyperlinks from site to site and storing the pages they download so that a searchable index of web pages can be built.

KSpider is the second generation of our spider implementation. Earlier data collection was based on a multi-process web spider called NontriSpider, developed as part of the NontriSearch search engine [2]. KSpider was developed to overcome many limitations of NontriSpider and to improve performance.

2.1 Overview of System Components

KSpider runs on a Beowulf cluster [4]. The cluster running KSpider, KCluster, currently consists of four AMD Athlon XP 1500+ machines, each equipped with 768 MB of DDR RAM, six 35 GB Ultra-160 SCSI hard disks, and an Intel E1000 Gigabit Ethernet interface. The machines are connected through a 3Com Gigabit Ethernet switch. KSpider is implemented in C++ on top of the Linux operating system. It consists of five main components, as shown in Fig. 1. Each component is described below.

URL Manager. The URL Manager is responsible for all URL handling. The URL Manager on each node keeps track of a subset of URLs that is disjoint from the subsets on the other nodes. Its storage manager takes URLs from the Buffer Queue and stores them in compressed form for further processing. The Scheduler in the URL Manager selects and schedules URLs by sending lists of URLs to the Data Collector.

Data Collector. The Data Collector manages the collector threads that fetch data from the web servers.
The collector threads take a list of URLs from the queue and send requests to the web servers using HTTP/1.1.

Data Processor. The fetched data is passed from the Data Collector to the Data Processor for further processing, such as link extraction, statistics collection, and URL filtering.

Storage Manager. The Storage Manager contains two important components, the Compressor and the Decompressor. Multiple web objects are compressed and packed together by the Compressor, which uses the LZO compression library [5]. The Decompressor is responsible for data decompression.

Communicator. Whenever a new URL is extracted and the node finds that it is not responsible for that URL, the Communicator delivers the URL to the responsible node asynchronously over UDP.

Fig. 1. The System Architecture of KSpider

2.2 Design and Implementation Techniques

Many underlying concepts and techniques have been implemented in KSpider. The important ones are described in this section.

2.2.1 Data Distribution

Data (web pages, images, and other file types) downloaded from the web servers are distributed over the nodes in the cluster. For any given URL, exactly one node is responsible for fetching and keeping the data referenced by that URL. A simple hash function based on the sum of every character in the URL is used to distribute the URLs among the nodes in the cluster.

2.2.2 Phase Swapping

Each node has a list of URLs, and URLs on different nodes may belong to the same web server. Hence, it is likely that several nodes download web pages from the same web server at the same time. With several nodes running simultaneously, this would place a heavy load on the destination servers. To prevent this situation, a technique called phase swapping is proposed. The underlying concept of phase swapping is to group the URLs that belong to the same web server (by hashing the host name portion of the URL) and to let each node work on a different set of servers at any one time. After a predefined constant period, every node synchronously swaps its working phase to a new set of web servers. This technique not only prevents overloading of the web servers but also greatly reduces the number of URLs the spider has to manage at any given time.
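The two techniques above can be sketched together as follows. The paper states only that the URL hash is based on the sum of the characters and that servers are grouped by hashing the host name; reducing the sums modulo the node count, and rotating the groups by one position per phase, are assumptions made for this illustration, as are all function names.

```python
from urllib.parse import urlsplit

NUM_NODES = 4  # the KCluster described in Section 2.1 has four nodes


def owner_node(url: str, num_nodes: int = NUM_NODES) -> int:
    """Data distribution: sum of the URL's bytes, reduced modulo the node count.

    The modulo step is an assumption; the paper only mentions the summation.
    """
    return sum(url.encode("utf-8")) % num_nodes


def host_group(url: str, num_groups: int = NUM_NODES) -> int:
    """Phase swapping: group URLs by a hash of the host name portion only."""
    host = urlsplit(url).hostname or ""
    return sum(host.encode("utf-8")) % num_groups


def active_group(node_id: int, phase: int, num_groups: int = NUM_NODES) -> int:
    """Server group a node works on during a phase (simple rotation, assumed)."""
    return (node_id + phase) % num_groups


def urls_for_phase(urls, node_id: int, phase: int):
    """URLs owned by node_id whose server group is active in this phase."""
    group = active_group(node_id, phase)
    return [u for u in urls
            if owner_node(u) == node_id and host_group(u) == group]
```

Because the rotation assigns a different group to every node within a phase, no two nodes contact the same server group at the same time, and each URL becomes eligible on its owner node once every `NUM_NODES` phases.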

2.2.3 URL Compression

KSpider compresses URLs by keeping only the differences between URL tails. It uses a modified AVL tree with delta encoding [6]. Our compression technique reduces the average URL length from 59.5 bytes to 26.5 bytes (a size reduction of about 55%, including all data structure overhead). Therefore, all compressed URLs can be kept in main memory. The current configuration of a node in KCluster (see Fig. 1) is designed to handle up to 30 million URLs.

2.2.4 Enhanced DNS Resolver

KSpider has a DNS caching mechanism integrated into its resolver. This helps reduce the DNS server workload when several thousand hostnames must be resolved in a short period of time. Moreover, KSpider has a built-in mechanism for specifying the resolver time-out, and it can contact several DNS servers at the same time.

3 Crawling Results

In this section, selected statistics related to web servers in Thailand are presented. In this paper, a web server refers to a web site that provides HTTP service. It is counted by unique hostname, not by physical machine, since one machine can host multiple web servers (with different hostnames). To obtain the number of machines, the resolved names were used to count unique IP addresses. In addition to HTML pages, images and other file types were downloaded to build a database for another project. KSpider collected around eight million URLs, and the crawl took 7 days. The data retrieved for each request consisted of two parts, the HTTP header and the HTML body. The headers were analyzed by counting each field, while the HTML bodies were subjected to more extensive analysis. Counting and analysis were mostly performed automatically by the statistics collector integrated into KSpider (see Fig. 1). HTML parsing was performed with our own robust C++ parser for maximum performance. The following sections describe the crawling results in detail.
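Before turning to the individual results, the delta-encoding idea behind the URL compression of Section 2.2 can be illustrated with a small sketch. It uses plain front coding over a sorted list, a simpler stand-in for the modified AVL tree of [6], which is not reproduced here: each URL is stored as the length of the prefix it shares with its predecessor plus the differing tail. The function names are hypothetical.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def compress(urls):
    """Front-code a collection of URLs as (shared_prefix_length, tail) pairs."""
    entries, prev = [], ""
    for url in sorted(urls):
        k = common_prefix_len(prev, url)
        entries.append((k, url[k:]))  # keep only the differing tail
        prev = url
    return entries


def decompress(entries):
    """Rebuild the sorted URL list from (shared_prefix_length, tail) pairs."""
    urls, prev = [], ""
    for k, tail in entries:
        prev = prev[:k] + tail
        urls.append(prev)
    return urls
```

For two URLs on the same host, such as `http://a.th/x` and `http://a.th/y`, the second entry stores only the pair `(12, "y")`. The reported saving follows the same logic at scale: (59.5 - 26.5) / 59.5 is roughly 0.55.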
3.1 The Thai Domain Name

The detailed domain structure was obtained from statistics published by the Thailand Network Information Center (THNIC). There are 7 top level domain names in Thailand. We prepared a list of the available third-level domain names, comprising 10,504 subdomains, as shown in Table 1.

Table 1. The 7 top level domain names ranked by number of third-level subdomain names

Rank  Domain  #Domains  Percent
1     co.th   7,
2     in.th   1,
3     ac.th
4     or.th
5     go.th
6     net.th
7     mi.th
      Total   10,

KSpider started collecting web data from 10,504 seed URLs generated from this list. Each domain name in the list was prefixed with the standard host name www to form a complete URL. These seed URLs ensure that KSpider obtains data from each domain name without bias.

3.2 Size of Thai Web

We collected 8,170,005 URLs, of which 3,277,988 are HTML documents. The total web data consumed over 155 gigabytes of disk space (compressed), of which the HTML documents consume around 54 gigabytes. The survey found 24,124 web servers on 9,167 unique machines.

3.3 Documents and Servers Classified by Domain Name

Statistics on the HTML documents in each domain are shown in Table 2. Over 70% of all documents are in the academic and commercial domains (ac.th, com, and co.th).

Table 2. Documents and servers in each domain name, ranked by number of documents

Rank  Domain  #Documents  #Servers  #Machines
1     ac.th   1,093,388   2,979     1,973
2     com     977,478     10,385    2,249
3     go.th   313,
4     co.th   279,532     6,159     2,664
5     or.th   236,
6     org     107,
7     net     87,
8     others  56,
9     net.th  55,
10    in.th   36,
11    edu     20,
12    mi.th   14,
      Total   3,277,988   24,124    9,167

3.4 HTTP Return Codes

Table 3 shows the HTTP return codes. Code 200 (OK) indicates a successful request for a unique page.

Table 3. HTTP return codes for 3,961,227 total requests

Rank  Type                       Quantity  Percent
1     200 (OK)                   3,277,
2     404 (Not found)            536,
3     401 (Unauthorized)         74,
4     403 (Forbidden)            27,
5     302 (Moved temporarily)    18,
6     301 (Moved permanently)    17,
7     406 (Not Acceptable)       4,
8     503 (Service Unavailable)  2,
9     500 (Internal error)       1,
10    400 (Bad request)
      Others (6 more)
      Total                      3,961,

3.5 HTTP Headers

Across all completed HTTP GET requests, 70 different header types were found, as summarized in Table 4.

Table 4. Frequency of various HTTP header types

Rank  HTTP Header        Quantity  Percent
1     Content-Type       3,277,
2     Date               3,277,
3     Server             3,273,
4     Content-Length     2,401,
5     Last-Modified      2,338,
6     Accept-Ranges      2,291,
7     Etag               2,224,
8     Transfer-encoding  839,
9     Connection         665,
10    x-powered-by       486,
      Others (60 more)   706,
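The per-field counting behind a table such as Table 4 can be sketched in a few lines. This is not KSpider's statistics collector, which is written in C++ and not shown in the paper; it is a minimal illustration that assumes each response's headers are available as a dictionary. Header names are folded to lower case because HTTP field names are case-insensitive.

```python
from collections import Counter


def header_frequencies(responses):
    """Count, for each header name, how many responses carry it.

    `responses` is an iterable of dicts mapping header name to value
    (an assumed input format chosen for this sketch).
    """
    counts = Counter()
    for headers in responses:
        # A set ensures a header repeated within one response counts once.
        counts.update({name.lower() for name in headers})
    return counts


def with_percent(counts, total):
    """Rows of (header, quantity, percent of total), most frequent first."""
    return [(name, n, 100.0 * n / total) for name, n in counts.most_common()]
```

A small usage example with two made-up responses:

```python
responses = [
    {"Content-Type": "text/html", "Server": "Apache"},
    {"content-type": "text/html"},
]
rows = with_percent(header_frequencies(responses), total=len(responses))
```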

3.6 Percentage of Server Types

A total of 149 different types of HTTP servers were discovered. Table 5 summarizes them without version details. More than 88% of the servers run Apache or Microsoft-IIS.

Table 5. HTTP server types distribution

Rank  Type                  Quantity  Percent
1     Apache                13,
2     Microsoft-IIS         7,
3     unknown               1,
4     dozygroup WebServer
5     Netscape-Enterprise
6     TWH
7     Rapidsite
8     Ipswitch-IMail
9     IBM_HTTP_Server
10    Lotus-Domino
11    OmniHTTPd
12    Netscape-FastTrack
      Others (137 types)
      Total                 24,

3.7 Last Modification Distribution

Most documents with a known last-modification time are less than a year old, as shown in Figure 2. The label "errors" in the figure means that the server reported a time in the future. The label "current" means that the server reported a time within the crawling period. Moreover, many documents do not report a last-modification time at all; these are labeled "unknown".

Fig. 2. Distribution of Last Modification (categories: errors, current, 1 month, 6 months, 1 year, 2 years, 3 years, 4 years, 5 years, 6 years, > 10 years, unknown)

3.8 Distribution of File Extensions

Files were classified by the standard suffix used in their names, e.g., .html, .htm, .jpg, .gif, and others, as shown in Table 6. File names without a suffix are classified as unknown.

Table 6. Distribution of file extensions

Extension  Quantity  Percent
.jpg       2,474,
.gif       2,279,
.html      1,642,
.htm       1,260,
unknown    622,
.pdf       128,
.asp       102,
.php       86,
.shtml     71,
.doc       42,
.xml       27,
.png       22,
.jpeg      19,
.jsp       1,

3.9 Page Size Distribution

Figure 3 shows the distribution of HTML page sizes (in bytes) on a logarithmic scale. Note that only HTML documents are considered.

Fig. 3. Distribution of HTML page size (bytes)

3.10 General Properties of HTML Documents

Table 7 shows the typical structural properties of HTML documents.

Table 7. General properties of HTML documents

Items                          Min  Max         Mean  Std
Page Sizes (bytes)             0    26,240,704  15,   ,
Number of Internal Hyperlinks
Number of Local Hyperlinks     0    32,
Number of Remote Hyperlinks    0    17,
Number of Java Applets
Number of Embedded Images      0    27,

3.11 Link Connectivity

Nearly 61 million hyperlinks were found in the experiment. Table 8 shows the connectivity matrix between domain names.

Table 8. Link connectivity between domain names

Domain .ac.th .co.th .go.th .in.th .mi.th .net.th .or.th others Sum
.ac.th 6,972,473 47,938 32,050 2, ,884 34, ,220 7,223,697
.co.th 2,652 6,276,278 4, , ,254 6,648,824
.go.th 16,989 36,152 1,760,686 1,228 1,324 4,140 14,995 56,524 1,892,038
.in.th 1,085 12,381 1, , ,095 6, ,138
.mi.th 568 1,003 1, , ,462 54,677
.net.th 4,331 2,310 3, ,299,618 4,096 4,721 1,318,859
.or.th 7,807 11, , ,828 3,363,298 43,333 3,550,837
others 47, ,916 38,428 4,853 1,158 7,237 56,874 39,561,025 39,922,113
Sum 7,053,527 6,591,988 1,964, ,167 53,536 1,327,271 3,480,374 40,153,080 60,905,183

4 Conclusion

Quantitative measurements and statistics of the Thai WWW have been presented and analyzed. More extensive analyses are planned, and the full results of the survey will be made available online.

5 References

1. Koht-Arsa, K. and Sanguanpong, S.: High Performance Large Scale Web Spider Architecture. The 2002 International Symposium on Communications and Information Technology, Pattaya, Chonburi, Thailand, October
2. Sanguanpong, S., Piamsa-nga, P., Poovarawan, Y., and Warangrit, S.: Measuring Thai Web Using NontriSpider. Proceedings of the International Forum cum Conference on Information Technology and Communication, pp , Bangkok, June
3. NECTEC: Internet Information Research Center, Available Source: internet/index.html
4. Sterling, T., Becker, D. J., Savarese, D., Dorband, J. E., Ranawake, U. A., and Packer, C. E.: Beowulf: A Parallel Workstation for Scientific Computation. In Proc. of International Conference on Parallel Processing,
5. Oberhumer, M. F. X. J.: LZO data compression library, Available Source:
6. Koht-arsa, K., Sanguanpong, S.: In-memory URL Compression. National Computer Science and Engineering Conference, Chiang Mai, Thailand, 2001.


Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India Web Crawling Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India - 382 481. Abstract- A web crawler is a relatively simple automated program

More information

Efficient Hybrid Multicast Routing Protocol for Ad-Hoc Wireless Networks

Efficient Hybrid Multicast Routing Protocol for Ad-Hoc Wireless Networks Efficient Hybrid Multicast Routing Protocol for Ad-Hoc Wireless Networks Jayanta Biswas and Mukti Barai and S. K. Nandy CAD Lab, Indian Institute of Science Bangalore, 56, India {jayanta@cadl, mbarai@cadl,

More information

Characterizing Gnutella Network Properties for Peer-to-Peer Network Simulation

Characterizing Gnutella Network Properties for Peer-to-Peer Network Simulation Characterizing Gnutella Network Properties for Peer-to-Peer Network Simulation Selim Ciraci, Ibrahim Korpeoglu, and Özgür Ulusoy Department of Computer Engineering, Bilkent University, TR-06800 Ankara,

More information

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per Information Retrieval Web Search Algoritmi per IR Web Search Goal of a Search Engine Retrieve docs that are relevant for the user query Doc: file word or pdf, web page, email, blog, e-book,... Query: paradigm bag of words Relevant?!?

More information

Internet Architecture. Web Programming - 2 (Ref: Chapter 2) IP Software. IP Addressing. TCP/IP Basics. Client Server Basics. URL and MIME Types HTTP

Internet Architecture. Web Programming - 2 (Ref: Chapter 2) IP Software. IP Addressing. TCP/IP Basics. Client Server Basics. URL and MIME Types HTTP Web Programming - 2 (Ref: Chapter 2) TCP/IP Basics Internet Architecture Client Server Basics URL and MIME Types HTTP Routers interconnect the network TCP/IP software provides illusion of a single network

More information

EECS 395/495 Lecture 5: Web Crawlers. Doug Downey

EECS 395/495 Lecture 5: Web Crawlers. Doug Downey EECS 395/495 Lecture 5: Web Crawlers Doug Downey Interlude: US Searches per User Year Searches/month (mlns) Internet Users (mlns) Searches/user-month 2008 10800 220 49.1 2009 14300 227 63.0 2010 15400

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 17 September 2018 Some slides courtesy Manning, Raghavan, and Schütze Other characteristics Significant duplication Syntactic

More information

The latency of user-to-user, kernel-to-kernel and interrupt-to-interrupt level communication

The latency of user-to-user, kernel-to-kernel and interrupt-to-interrupt level communication The latency of user-to-user, kernel-to-kernel and interrupt-to-interrupt level communication John Markus Bjørndalen, Otto J. Anshus, Brian Vinter, Tore Larsen Department of Computer Science University

More information

CS 3640: Introduction to Networks and Their Applications

CS 3640: Introduction to Networks and Their Applications CS 3640: Introduction to Networks and Their Applications Fall 2018, Lecture 19: Application Layer III (Credit: Prof. Phillipa Gill @ University of Massachusetts) Instructor: Rishab Nithyanand Teaching

More information

SIP Compliance APPENDIX

SIP Compliance APPENDIX APPENDIX E This appendix describes Cisco SIP proxy server (Cisco SPS) compliance with the Internet Engineering Task Force (IETF) definition of Session Initiation Protocol (SIP) as described in the following

More information

Template Extraction from Heterogeneous Web Pages

Template Extraction from Heterogeneous Web Pages Template Extraction from Heterogeneous Web Pages 1 Mrs. Harshal H. Kulkarni, 2 Mrs. Manasi k. Kulkarni Asst. Professor, Pune University, (PESMCOE, Pune), Pune, India Abstract: Templates are used by many

More information

Crawling. CS6200: Information Retrieval. Slides by: Jesse Anderton

Crawling. CS6200: Information Retrieval. Slides by: Jesse Anderton Crawling CS6200: Information Retrieval Slides by: Jesse Anderton Motivating Problem Internet crawling is discovering web content and downloading it to add to your index. This is a technically complex,

More information

Servlet Performance and Apache JServ

Servlet Performance and Apache JServ Servlet Performance and Apache JServ ApacheCon 1998 By Stefano Mazzocchi and Pierpaolo Fumagalli Index 1 Performance Definition... 2 1.1 Absolute performance...2 1.2 Perceived performance...2 2 Dynamic

More information

18050 (2.48 pages/visit) Jul Sep May Jun Aug Number of visits

18050 (2.48 pages/visit) Jul Sep May Jun Aug Number of visits 30-12- 0:45 Last Update: 29 Dec - 03:05 Reported period: OK Summary Reported period Month Dec First visit 01 Dec - 00:07 Last visit 28 Dec - 23:59 Unique visitors Number of visits Pages Hits Bandwidth

More information

Session 2. Background. Lecture Objectives

Session 2. Background. Lecture Objectives Session 2 Background 1 Lecture Objectives Understand how an Internet resource is accessed Understand the high level structure of the Internet cloud Understand the high level structure of the TCP/IP protocols

More information

Web Technology. COMP476 Networked Computer Systems. Hypertext and Hypermedia. Document Representation. Client-Server Paradigm.

Web Technology. COMP476 Networked Computer Systems. Hypertext and Hypermedia. Document Representation. Client-Server Paradigm. Web Technology COMP476 Networked Computer Systems - Paradigm The method of interaction used when two application programs communicate over a network. A server application waits at a known address and a

More information

Information Network I: The Application Layer. Doudou Fall Internet Engineering Laboratory Nara Institute of Science and Technique

Information Network I: The Application Layer. Doudou Fall Internet Engineering Laboratory Nara Institute of Science and Technique Information Network I: The Application Layer Doudou Fall Internet Engineering Laboratory Nara Institute of Science and Technique Outline Domain Name System World Wide Web and HTTP Content Delivery Networks

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Evaluating Algorithms for Shared File Pointer Operations in MPI I/O

Evaluating Algorithms for Shared File Pointer Operations in MPI I/O Evaluating Algorithms for Shared File Pointer Operations in MPI I/O Ketan Kulkarni and Edgar Gabriel Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston {knkulkarni,gabriel}@cs.uh.edu

More information

0 0& Basic Background. Now let s get into how things really work!

0 0& Basic Background. Now let s get into how things really work! +,&&-# Department of Electrical Engineering and Computer Sciences University of California Berkeley Basic Background General Overview of different kinds of networks General Design Principles Architecture

More information

Executive Summary. Performance Report for: The web should be fast. Top 4 Priority Issues

Executive Summary. Performance Report for:   The web should be fast. Top 4 Priority Issues The web should be fast. Executive Summary Performance Report for: https://www.wpspeedupoptimisation.com/ Report generated: Test Server Region: Using: Tue,, 2018, 12:04 PM -0800 London, UK Chrome (Desktop)

More information

Traditional Web Based Systems

Traditional Web Based Systems Chapter 12 Distributed Web Based Systems 1 Traditional Web Based Systems The Web is a huge distributed system consisting of millions of clients and servers for accessing linked documents Servers maintain

More information

Review for Internet Introduction

Review for Internet Introduction Review for Internet Introduction What s the Internet: Two Views View 1: Nuts and Bolts View billions of connected hosts routers and switches protocols control sending, receiving of messages network of

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

Performance Evaluation of Tcpdump

Performance Evaluation of Tcpdump Performance Evaluation of Tcpdump Farhan Jiva University of Georgia Abstract With the onset of high-speed networks, using tcpdump in a reliable fashion can become problematic when facing the poor performance

More information

Stager. A Web Based Application for Presenting Network Statistics. Arne Øslebø

Stager. A Web Based Application for Presenting Network Statistics. Arne Øslebø Stager A Web Based Application for Presenting Network Statistics Arne Øslebø Keywords: Network monitoring, web application, NetFlow, network statistics Abstract Stager is a web based

More information

CCNA 1 v3.11 Module 11 TCP/IP Transport and Application Layers

CCNA 1 v3.11 Module 11 TCP/IP Transport and Application Layers CCNA 1 v3.11 Module 11 TCP/IP Transport and Application Layers 2007, Jae-sul Lee. All rights reserved. 1 Agenda 11.1 TCP/IP Transport Layer 11.2 The Application Layer What does the TCP/IP transport layer

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Buffer Management for XFS in Linux. William J. Earl SGI

Buffer Management for XFS in Linux. William J. Earl SGI Buffer Management for XFS in Linux William J. Earl SGI XFS Requirements for a Buffer Cache Delayed allocation of disk space for cached writes supports high write performance Delayed allocation main memory

More information

CS WEB TECHNOLOGY

CS WEB TECHNOLOGY CS1019 - WEB TECHNOLOGY UNIT 1 INTRODUCTION 9 Internet Principles Basic Web Concepts Client/Server model retrieving data from Internet HTM and Scripting Languages Standard Generalized Mark up languages

More information

CNIT 129S: Securing Web Applications. Ch 3: Web Application Technologies

CNIT 129S: Securing Web Applications. Ch 3: Web Application Technologies CNIT 129S: Securing Web Applications Ch 3: Web Application Technologies HTTP Hypertext Transfer Protocol (HTTP) Connectionless protocol Client sends an HTTP request to a Web server Gets an HTTP response

More information

Web Architecture and Technologies

Web Architecture and Technologies Web Architecture and Technologies Ambient intelligence Fulvio Corno Politecnico di Torino, 2015/2016 Goal Understanding Web technologies Adopted for User Interfaces Adopted for Distributed Application

More information

SEARCH ENGINE INSIDE OUT

SEARCH ENGINE INSIDE OUT SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing

More information

Performance Evaluation of a Regular Expression Crawler and Indexer

Performance Evaluation of a Regular Expression Crawler and Indexer Performance Evaluation of a Regular Expression Crawler and Sadi Evren SEKER Department of Computer Engineering, Istanbul University, Istanbul, Turkey academic@sadievrenseker.com Abstract. This study aims

More information

Introduction to Internet, Web, and TCP/IP Protocols SEEM

Introduction to Internet, Web, and TCP/IP Protocols SEEM Introduction to Internet, Web, and TCP/IP Protocols SEEM 3460 1 Local-Area Networks A Local-Area Network (LAN) covers a small distance and a small number of computers LAN A LAN often connects the machines

More information

Chapter 10: Application Layer CCENT Routing and Switching Introduction to Networks v6.0

Chapter 10: Application Layer CCENT Routing and Switching Introduction to Networks v6.0 Chapter 10: Application Layer CCENT Routing and Switching Introduction to Networks v6.0 CCNET v6 10 Chapter 10 - Sections & Objectives 10.1 Application Layer Protocols Explain the operation of the application

More information

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications Today s lecture Introduction to Information Retrieval Web Crawling (Near) duplicate detection CS276 Information Retrieval and Web Search Chris Manning, Pandu Nayak and Prabhakar Raghavan Crawling and Duplicates

More information

EEC-682/782 Computer Networks I

EEC-682/782 Computer Networks I EEC-682/782 Computer Networks I Lecture 20 Wenbing Zhao w.zhao1@csuohio.edu http://academic.csuohio.edu/zhao_w/teaching/eec682.htm (Lecture nodes are based on materials supplied by Dr. Louise Moser at

More information

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

IBM Lotus Domino 7 Performance Improvements

IBM Lotus Domino 7 Performance Improvements IBM Lotus Domino 7 Performance Improvements Razeyah Stephen, IBM Lotus Domino Performance Team Rob Ingram, IBM Lotus Domino Product Manager September 2005 Table of Contents Executive Summary...3 Impacts

More information

Crawling CE-324: Modern Information Retrieval Sharif University of Technology

Crawling CE-324: Modern Information Retrieval Sharif University of Technology Crawling CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Sec. 20.2 Basic

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Design and Implementation of A P2P Cooperative Proxy Cache System

Design and Implementation of A P2P Cooperative Proxy Cache System Design and Implementation of A PP Cooperative Proxy Cache System James Z. Wang Vipul Bhulawala Department of Computer Science Clemson University, Box 40974 Clemson, SC 94-0974, USA +1-84--778 {jzwang,

More information

Technical Specifications for Web-based A+LS Servers

Technical Specifications for Web-based A+LS Servers Technical Specifications for Web-based A+LS Servers General Requirements Network Requirements In order to configure Web-based A+LS to properly answer requests from both the Internet and the local area

More information

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrative. Web crawlers. Web Crawlers and Link Analysis! Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt

More information

ENG224 INFORMATION TECHNOLOGY Part I 3. The Internet. 3. The Internet

ENG224 INFORMATION TECHNOLOGY Part I 3. The Internet. 3. The Internet 1 Reference Peter Norton, Introduction to Computers, McGraw Hill, 5 th Ed, 2003 2 What is the Internet? A global network that allows one computer to connect with other computers in the world What can be

More information

Hyper Text Transfer Protocol Compression

Hyper Text Transfer Protocol Compression Hyper Text Transfer Protocol Compression Dr.Khalaf Khatatneh, Professor Dr. Ahmed Al-Jaber, and Asma a M. Khtoom Abstract This paper investigates HTTP post request compression approach. The most common

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

I. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80].

I. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80]. Focus: Accustom To Crawl Web-Based Forums M.Nikhil 1, Mrs. A.Phani Sheetal 2 1 Student, Department of Computer Science, GITAM University, Hyderabad. 2 Assistant Professor, Department of Computer Science,

More information