Structure Properties of the Thai WWW: The 2003 Survey
Surasak Sanguanpong and Kasom Koth-arsa
Applied Network Research Laboratory, Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900, THAILAND
{Surasak.S,

Abstract. This paper presents quantitative measurements and analyses of the structural properties of the Thai World Wide Web, obtained by a high-performance web spider called KSpider. The paper briefly describes KSpider's system architecture and several design techniques, such as workload distribution, scheduling policy, in-memory URL compression, and an enhanced DNS resolver. Selected statistics on the Thai WWW, based on information gathered in March 2003, are presented. KSpider collected 8,170,005 URLs (HTML and images) in 7 days. All compressed web data consumes around 155 GB of disk space. In total, 3,277,988 HTML documents (around 54 GB) on 24,124 web servers with 9,167 unique IP addresses were found, and over 60 million hyperlinks were analyzed. Further statistics are reported: documents and servers classified by domain name, percentage of server types, distribution of page sizes, distribution of file extensions, and hyperlink connectivity between domains.

1 Introduction

The purpose of this paper is to present quantitative measurements and analyses of various issues related to web servers, along with structural information on the Thai WWW, obtained by the parallel web spider called KSpider [1]. Prior reports on Thai WWW statistics were published [2] by the same authors, and more facts and figures are available on the homepage of the Applied Network Research Laboratory. The earlier report on Thai WWW statistics [2] covered the survey period from March 2000 to March 2001 and included only web servers whose names are registered under the .th domain. However, the Internet has continued to grow in Thailand [3].
Moreover, country-code domain registration suffers from the widespread use of .com and other top-level domains (TLDs), whose quantity and properties were previously unknown. To obtain up-to-date information covering these characteristics, the new survey was extended to include two types of web servers: (1) web servers whose names end with the .th suffix, and (2) web servers whose names end with any other gTLD or ccTLD (.com, .net, .org, etc.) but whose IP addresses belong to address blocks assigned to Thai network organizations. KSpider was configured to download data from web servers whose names are registered under the .th domain or whose IP addresses are in Thailand. The list of such IP addresses was created from the national exchange's route server (THNIX), available at telnet://route-server.cat.net.th.
In the next section, we briefly discuss KSpider's architecture. The following section presents the analyses of the Thai WWW. Finally, we conclude the paper.

2 KSpider's Design and Implementation

A web spider (also called a crawler or robot) is one of the key components of a search engine. The main task of a spider is to automatically collect HTML documents or web pages, along with other web data (images and other file types), from public web servers. These spiders "crawl" across the web, following hyperlinks from site to site and storing the downloaded pages to build a searchable web page index. KSpider is the second generation of our own spider implementation. Earlier data collection was based on a multi-process web spider called NontriSpider, developed as part of the NontriSearch search engine [2]. KSpider has been developed to overcome many limitations of NontriSpider, with additional performance enhancements.

2.1 Overview of System Components

KSpider is based on a Beowulf cluster [4]. The KCluster running KSpider currently consists of 4 AMD Athlon XP 1500+ machines, each equipped with 768 MB of DDR RAM, six 35-GB Ultra-160 SCSI hard disks, and an Intel E1000 Gigabit Ethernet interface. They are connected by a 3Com Gigabit Ethernet switch. KSpider is implemented in C++ on top of the Linux operating system. It consists of five main components, as shown in Fig. 1. Each component is described as follows:

URL Manager. The URL Manager is responsible for all URL handling. Each node's URL Manager keeps track of a subset of URLs disjoint from those of the other nodes. The URL Storage Manager gets URLs from the Buffer Queue and stores them in compressed form for further processing. The Scheduler on the URL Manager selects and schedules the URLs by sending a list of URLs to the Data Collector.

Data Collector. The Data Collector manages the collector threads that fetch data from the web servers.
The collector threads get a list of URLs from the queue and send requests to the web servers using HTTP/1.1.

Data Processor. The fetched data is passed from the Data Collector to the Data Processor for further processing, such as link extraction, statistics collection, and URL filtering.

Storage Manager. The Storage Manager contains two important components, the Compressor and the Decompressor. Downloaded web data is compressed and packed together by the Compressor, which uses the LZO algorithm [5] as the compression library. The Decompressor is responsible for data decompression.

Communicator. Whenever a new URL is extracted and the node finds that it is not responsible for that URL, the Communicator delivers the URL to the responsible node using UDP in an asynchronous fashion.
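The request a collector thread sends can be sketched as below. This is a minimal illustration, not KSpider's C++ implementation; the `User-Agent` value and the use of a persistent (keep-alive) connection are assumptions, though HTTP/1.1's persistent connections are what make fetching many URLs from one server over a single TCP connection possible.

```python
# Illustrative sketch of an HTTP/1.1 GET request as a collector thread
# might format it. The User-Agent string is hypothetical.

def build_request(host: str, path: str) -> str:
    """Format an HTTP/1.1 GET request for one URL on a given server."""
    return ("GET {} HTTP/1.1\r\n"
            "Host: {}\r\n"              # Host header is mandatory in HTTP/1.1
            "User-Agent: KSpider\r\n"   # assumed agent name
            "Connection: keep-alive\r\n"  # reuse the TCP connection
            "\r\n").format(path, host)

req = build_request("www.ku.ac.th", "/index.html")
assert req.startswith("GET /index.html HTTP/1.1\r\n")
```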
Fig. 1. The System Architecture of KSpider

2.2 Design and Implementation Techniques

Several underlying concepts and techniques have been implemented in KSpider. The most important are described in this section.

Data Distribution. Data (web pages, images, and other file types) downloaded from the web servers is distributed over the nodes in the cluster. For any given URL, exactly one node is responsible for fetching and keeping the data referenced by that URL. A simple hash function based on the summation of every character in the URL is used to distribute the URLs among the nodes in the cluster.

Phase Swapping. Each node has a list of URLs that may belong to the same web server, so it is likely that several nodes would download pages from the same web servers at the same time, generating heavy load on the destination servers. To prevent this situation, a technique called phase swapping is proposed. The underlying concept of phase swapping is to group URLs belonging to the same web server together (by hashing the host name portion of the URL) and to let each node work on a different set of servers at any one time. After a pre-defined constant period, every node synchronously swaps its working phase to a new set of web servers. This technique not only prevents overloading the web servers but also greatly reduces the number of URLs the spider has to manage at any given time.
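The two techniques above can be sketched together as follows. This is an illustrative Python sketch, not the authors' C++ code: the node and phase counts, the exact hash (beyond "summation of every character"), and the helper names are assumptions.

```python
# Sketch of KSpider-style URL distribution and phase swapping.
# NUM_NODES matches the 4-node KCluster; PHASE_COUNT is an assumed value.

NUM_NODES = 4
PHASE_COUNT = 8

def char_sum_hash(s: str) -> int:
    """Simple hash based on the summation of every character."""
    return sum(ord(c) for c in s)

def owner_node(url: str) -> int:
    """Data distribution: exactly one node is responsible for each URL."""
    return char_sum_hash(url) % NUM_NODES

def hostname(url: str) -> str:
    """Extract the host name portion of an http:// URL."""
    return url.split("//", 1)[-1].split("/", 1)[0]

def phase_of(url: str) -> int:
    """Phase swapping: group URLs of the same server by hashing the host."""
    return char_sum_hash(hostname(url)) % PHASE_COUNT

def crawlable_now(url: str, current_phase: int) -> bool:
    """A node only fetches from servers in the cluster-wide current phase;
    after a fixed period every node synchronously advances the phase."""
    return phase_of(url) == current_phase

# All URLs on one server share a phase, so the nodes never gang up on it.
a = "http://www.ku.ac.th/index.html"
b = "http://www.ku.ac.th/about.html"
assert phase_of(a) == phase_of(b)
```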
URL Compression. KSpider compresses URLs by keeping only the differences between URL tails, using a modified AVL tree with delta encoding [6]. Our compression technique reduces the average URL length from 59.5 bytes to 26.5 bytes (about a 55% size reduction, including all data-structure overhead), so all compressed URLs can be kept in main memory. The current configuration of a node in KCluster is designed to handle up to 30 million URLs.

Enhanced DNS Resolver. KSpider has a DNS caching mechanism integrated into the resolver. This helps reduce DNS server workload when several thousand hostnames must be resolved in a short period of time. Moreover, KSpider has a built-in mechanism to specify the resolver time-out and can contact several DNS servers at the same time.

3 Crawling Results

In this section, selected statistics related to web servers in Thailand are presented. In this paper, a web server refers to a web site that provides HTTP service; it is counted by unique hostname, not by physical machine, since one machine can host multiple web servers (with different hostnames). To obtain the number of machines, name resolution was used to count unique IP addresses. Instead of downloading only HTML pages, images and other file types were also collected to build a database for another project. KSpider collected around eight million URLs; the crawl took 7 days. The retrieved data for each request consisted of two parts, the HTTP header and the HTML body. The headers were analyzed by counting each field, while the HTML bodies were subjected to more extensive analysis. Counting and analysis were mostly performed automatically by the statistics collector integrated with KSpider (see Fig. 1). HTML parsing was performed by our own robust C++ parser for maximum performance. The following sections describe the crawling results in detail.
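The delta-encoding idea behind the URL compression can be sketched as front coding over sorted URLs: each stored URL keeps only the bytes that differ from its predecessor's tail. This is a simplification under stated assumptions; the actual scheme in [6] stores the deltas in a modified AVL tree, which is omitted here.

```python
# Sketch of delta (front) encoding of URLs: keep only the differing tail.
# KSpider stores these deltas in a modified AVL tree [6]; the flat sorted
# list used here is a simplification for illustration.

def compress(urls):
    out, prev = [], ""
    for u in sorted(urls):
        common = 0                      # length of prefix shared with prev
        for a, b in zip(prev, u):
            if a != b:
                break
            common += 1
        out.append((common, u[common:]))  # (shared-prefix length, tail)
        prev = u
    return out

def decompress(entries):
    urls, prev = [], ""
    for common, tail in entries:
        u = prev[:common] + tail        # rebuild from previous URL
        urls.append(u)
        prev = u
    return urls

urls = ["http://www.ku.ac.th/eng/index.html",
        "http://www.ku.ac.th/eng/staff.html",
        "http://www.ku.ac.th/th/index.html"]
assert decompress(compress(urls)) == sorted(urls)
```

Because consecutive URLs from one site share long prefixes, the stored tails are short, which is what makes an in-memory URL store feasible.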
3.1 The Thai Domain Name

The detailed domain structure was derived from statistics published by the Thailand Network Information Center (THNIC). There are 7 top-level domain names in Thailand. We prepared a list of third-level domain names available from THNIC, comprising 10,504 subdomains, as shown in Table 1.
Table 1. The 7 top-level domain names ranked by number of third-level sub-domain names

Rank  Domain  #Domains
1     co.th   7,…
2     in.th   1,…
3     ac.th   …
4     or.th   …
5     go.th   …
6     net.th  …
7     mi.th   …
      Total   10,504

KSpider started collecting web data using 10,504 seed URLs generated from this list. Each domain name from the list was prefixed with the standard hostname www to create a complete URL. These seed URLs ensured that KSpider could reach every domain name without bias.

3.2 Size of the Thai Web

We collected 8,170,005 URLs, of which 3,277,988 are HTML documents. The total web data consumed over 155 gigabytes of disk space (compressed), of which HTML documents account for around 54 gigabytes. The survey found 24,124 web servers on 9,167 unique machines.

3.3 Documents and Servers Classified by Domain Name

Statistics on the HTML documents in each domain are shown in Table 2. Over 70% of all documents are in the academic and commercial domains.

Table 2. Documents and servers in each domain name, ranked by number of documents

Rank  Domain  #Documents  #Servers  #Machines
1     ac.th    1,093,388     2,979      1,973
2     com        977,478    10,385      2,249
3     go.th      313,…
4     co.th      279,532     6,159      2,664
5     or.th      236,…
6     org        107,…
7     net         87,…
8     others      56,…
9     net.th      55,…
10    in.th       36,…
11    edu         20,…
12    mi.th       14,…
      Total    3,277,988    24,124      9,167
3.4 HTTP Return Codes

Table 3 shows the HTTP return codes. A 200 (OK) response indicates a successful request for a unique page.

Table 3. HTTP return codes for 3,961,227 total requests

Rank  Code                         Quantity
1     200 (OK)                     3,277,988
2     404 (Not Found)                536,…
3     401 (Unauthorized)              74,…
4     403 (Forbidden)                 27,…
5     302 (Moved Temporarily)         18,…
6     301 (Moved Permanently)         17,…
7     406 (Not Acceptable)             4,…
8     503 (Service Unavailable)        2,…
9     500 (Internal Server Error)      1,…
10    400 (Bad Request)                …
      Others (6 more)                  …
      Total                        3,961,227

3.5 HTTP Headers

Across all completed HTTP GET requests, 70 different header types were found, as summarized in Table 4.

Table 4. Frequency of various HTTP header types

Rank  HTTP Header        Quantity
1     Content-Type       3,277,…
2     Date               3,277,…
3     Server             3,273,…
4     Content-Length     2,401,…
5     Last-Modified      2,338,…
6     Accept-Ranges      2,291,…
7     Etag               2,224,…
8     Transfer-Encoding    839,…
9     Connection           665,…
10    X-Powered-By         486,…
      Others (60 more)     706,…
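The header-frequency statistic above can be sketched as a simple count over the response headers of each completed request. This is an illustrative sketch, not the statistics collector's actual C++ code; the sample header values are made up.

```python
# Sketch of the header-frequency statistic: count how often each HTTP
# response header type appears across all completed GET requests.
# Header names are compared case-insensitively, as HTTP requires.
from collections import Counter

def count_headers(responses):
    freq = Counter()
    for headers in responses:        # one dict of headers per request
        for name in headers:
            freq[name.lower()] += 1
    return freq

responses = [
    {"Content-Type": "text/html", "Date": "Sat, 01 Mar 2003 00:00:00 GMT",
     "Server": "Apache"},
    {"Content-Type": "image/gif", "Date": "Sat, 01 Mar 2003 00:00:01 GMT",
     "Content-Length": "42"},
]
freq = count_headers(responses)
assert freq["content-type"] == 2
assert freq["server"] == 1
```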
3.6 Percentage of Server Types

In total, 149 different types of HTTP server software were discovered. Table 5 summarizes them without listing version details. More than 88% of servers rely on Apache or Microsoft-IIS technology.

Table 5. HTTP server type distribution

Rank  Type                 Quantity
1     Apache               13,…
2     Microsoft-IIS         7,…
3     unknown               1,…
4     dozygroup WebServer   …
5     Netscape-Enterprise   …
6     TWH                   …
7     Rapidsite             …
8     Ipswitch-IMail        …
9     IBM_HTTP_Server       …
10    Lotus-Domino          …
11    OmniHTTPd             …
12    Netscape-FastTrack    …
      Others (137 types)    …
      Total                24,124

3.7 Last Modification Distribution

Most documents whose last-modification time is known are less than a year old, as shown in Fig. 2. The label "errors" in the figure means that the server answered with a time in the future; "current" means the server answered with a time in the same period as the crawl. Moreover, many documents do not report a last-modification time at all; these are labeled "unknown".

Fig. 2. Distribution of Last Modification (buckets: errors, current, 1 month, 6 months, 1 year, 2 years, 3 years, 4 years, 5 years, 6 years, > 10 years, unknown)
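The bucketing behind Fig. 2 can be sketched as below. The exact crawl date and the bucket boundaries are assumptions (the sketch uses coarser buckets than the figure), but the handling of future dates ("errors") and missing headers ("unknown") follows the description above.

```python
# Sketch of last-modification bucketing: compare each document's
# Last-Modified time with the crawl date. Future times count as "errors",
# missing headers as "unknown". Bucket boundaries are illustrative.
from datetime import datetime

CRAWL_DATE = datetime(2003, 3, 15)   # assumed crawl date (March 2003)

def age_bucket(last_modified):
    if last_modified is None:
        return "unknown"
    if last_modified > CRAWL_DATE:
        return "errors"              # server answered with a future time
    days = (CRAWL_DATE - last_modified).days
    for limit, label in [(7, "current"), (30, "1 month"), (182, "6 months"),
                         (365, "1 year"), (730, "2 years"),
                         (3650, "10 years")]:
        if days <= limit:
            return label
    return "> 10 years"

assert age_bucket(None) == "unknown"
assert age_bucket(datetime(2004, 1, 1)) == "errors"
assert age_bucket(datetime(2002, 9, 1)) == "1 year"
```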
3.8 Distribution of File Extensions

File extensions were classified by the standard suffix used in file names, e.g. .html, .htm, .jpg, .gif, and others, as shown in Table 6. Filenames without suffixes are classified as unknown.

Table 6. Distribution of file extensions

Extension  Quantity
.jpg       2,474,…
.gif       2,279,…
.html      1,642,…
.htm       1,260,…
unknown      622,…
.pdf         128,…
.asp         102,…
.php          86,…
.shtml        71,…
.doc          42,…
.xml          27,…
.png          22,…
.jpeg         19,…
.jsp           1,…

3.9 Page Size Distribution

Figure 3 shows the distribution of HTML page sizes (bytes) on a logarithmic scale. Note that only HTML documents are considered.

Fig. 3. Distribution of HTML page size (bytes)
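The logarithmic histogram of Fig. 3 can be sketched by assigning each page to a bucket whose boundaries grow geometrically. The base-2 bucketing here is an assumption; the paper does not state the exact bucket boundaries.

```python
# Sketch of logarithmic size bucketing for a page-size histogram:
# each HTML page lands in bucket floor(log2(size)), so bucket widths
# grow geometrically, matching a log-scale x axis.
import math
from collections import Counter

def size_bucket(size_bytes: int) -> int:
    """Bucket index: 0 for empty pages, otherwise floor(log2(size))."""
    if size_bytes <= 0:
        return 0
    return int(math.log2(size_bytes))

sizes = [0, 512, 1024, 1500, 1_000_000]
hist = Counter(size_bucket(s) for s in sizes)
assert hist[10] == 2      # 1024 and 1500 share the 1-2 KB bucket
```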
3.10 General Properties of HTML Documents

Table 7 shows typical structural properties of HTML documents.

Table 7. General properties of HTML documents

Item                           Min  Max         Mean  Std
Page size (bytes)              0    26,240,704  15,…  …
Number of internal hyperlinks  …    …           …     …
Number of local hyperlinks     0    32,…        …     …
Number of remote hyperlinks    0    17,…        …     …
Number of Java applets         …    …           …     …
Number of embedded images      0    27,…        …     …

3.11 Link Connectivity

Nearly 61 million hyperlinks were found in the experiment. Table 8 shows the connectivity matrix between domain names.

Table 8. Link connectivity between domain names

From \ To  .ac.th     .co.th     .go.th     .in.th  .mi.th  .net.th    .or.th     others      Sum
.ac.th     6,972,473  47,938     32,050     2,…     …       …,884      34,…       …,220       7,223,697
.co.th     2,652      6,276,278  4,…        …       …       …          …          …,254       6,648,824
.go.th     16,989     36,152     1,760,686  1,228   1,324   4,140      14,995     56,524      1,892,038
.in.th     1,085      12,381     1,…        …       …       …          …,095      6,…         …,138
.mi.th     568        1,003      1,…        …       …       …          …          …,462       54,677
.net.th    4,331      2,310      3,…        …       …       1,299,618  4,096      4,721       1,318,859
.or.th     7,807      11,…       …          …       …       …,828      3,363,298  43,333      3,550,837
others     47,…       …,916      38,428     4,853   1,158   7,237      56,874     39,561,025  39,922,113
Sum        7,053,527  6,591,988  1,964,…    …,167   53,536  1,327,271  3,480,374  40,153,080  60,905,183

4 Conclusion

Quantitative measurements and statistics of the Thai WWW have been presented and analyzed. More extensive analyses are planned, and the full results of the survey will be made available on-line.
5 References

1. Koht-Arsa, K. and Sanguanpong, S.: High Performance Large Scale Web Spider Architecture. The 2002 International Symposium on Communications and Information Technology, Pattaya, Chonburi, Thailand, October 2002.
2. Sanguanpong, S., Piamsa-nga, P., Poovarawan, Y., and Warangrit, S.: Measuring Thai Web Using NontriSpider. Proceedings of the International Forum cum Conference on Information Technology and Communication, Bangkok, June.
3. NECTEC: Internet Information Research Center. Available Source: internet/index.html
4. Sterling, T., Becker, D. J., Savarese, D., Dorband, J. E., Ranawake, U. A., and Packer, C. E.: Beowulf: A Parallel Workstation for Scientific Computation. In Proc. of the International Conference on Parallel Processing, 1995.
5. Oberhumer, M. F. X. J.: LZO Data Compression Library. Available Source:
6. Koht-arsa, K. and Sanguanpong, S.: In-memory URL Compression. National Computer Science and Engineering Conference, Chiang Mai, Thailand, 2001.
More information18050 (2.48 pages/visit) Jul Sep May Jun Aug Number of visits
30-12- 0:45 Last Update: 29 Dec - 03:05 Reported period: OK Summary Reported period Month Dec First visit 01 Dec - 00:07 Last visit 28 Dec - 23:59 Unique visitors Number of visits Pages Hits Bandwidth
More informationSession 2. Background. Lecture Objectives
Session 2 Background 1 Lecture Objectives Understand how an Internet resource is accessed Understand the high level structure of the Internet cloud Understand the high level structure of the TCP/IP protocols
More informationWeb Technology. COMP476 Networked Computer Systems. Hypertext and Hypermedia. Document Representation. Client-Server Paradigm.
Web Technology COMP476 Networked Computer Systems - Paradigm The method of interaction used when two application programs communicate over a network. A server application waits at a known address and a
More informationInformation Network I: The Application Layer. Doudou Fall Internet Engineering Laboratory Nara Institute of Science and Technique
Information Network I: The Application Layer Doudou Fall Internet Engineering Laboratory Nara Institute of Science and Technique Outline Domain Name System World Wide Web and HTTP Content Delivery Networks
More informationTHE WEB SEARCH ENGINE
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com
More informationEvaluating Algorithms for Shared File Pointer Operations in MPI I/O
Evaluating Algorithms for Shared File Pointer Operations in MPI I/O Ketan Kulkarni and Edgar Gabriel Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston {knkulkarni,gabriel}@cs.uh.edu
More information0 0& Basic Background. Now let s get into how things really work!
+,&&-# Department of Electrical Engineering and Computer Sciences University of California Berkeley Basic Background General Overview of different kinds of networks General Design Principles Architecture
More informationExecutive Summary. Performance Report for: The web should be fast. Top 4 Priority Issues
The web should be fast. Executive Summary Performance Report for: https://www.wpspeedupoptimisation.com/ Report generated: Test Server Region: Using: Tue,, 2018, 12:04 PM -0800 London, UK Chrome (Desktop)
More informationTraditional Web Based Systems
Chapter 12 Distributed Web Based Systems 1 Traditional Web Based Systems The Web is a huge distributed system consisting of millions of clients and servers for accessing linked documents Servers maintain
More informationReview for Internet Introduction
Review for Internet Introduction What s the Internet: Two Views View 1: Nuts and Bolts View billions of connected hosts routers and switches protocols control sending, receiving of messages network of
More informationINTRODUCTION. Chapter GENERAL
Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which
More informationPerformance Evaluation of Tcpdump
Performance Evaluation of Tcpdump Farhan Jiva University of Georgia Abstract With the onset of high-speed networks, using tcpdump in a reliable fashion can become problematic when facing the poor performance
More informationStager. A Web Based Application for Presenting Network Statistics. Arne Øslebø
Stager A Web Based Application for Presenting Network Statistics Arne Øslebø Keywords: Network monitoring, web application, NetFlow, network statistics Abstract Stager is a web based
More informationCCNA 1 v3.11 Module 11 TCP/IP Transport and Application Layers
CCNA 1 v3.11 Module 11 TCP/IP Transport and Application Layers 2007, Jae-sul Lee. All rights reserved. 1 Agenda 11.1 TCP/IP Transport Layer 11.2 The Application Layer What does the TCP/IP transport layer
More informationAN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES
Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes
More informationBuffer Management for XFS in Linux. William J. Earl SGI
Buffer Management for XFS in Linux William J. Earl SGI XFS Requirements for a Buffer Cache Delayed allocation of disk space for cached writes supports high write performance Delayed allocation main memory
More informationCS WEB TECHNOLOGY
CS1019 - WEB TECHNOLOGY UNIT 1 INTRODUCTION 9 Internet Principles Basic Web Concepts Client/Server model retrieving data from Internet HTM and Scripting Languages Standard Generalized Mark up languages
More informationCNIT 129S: Securing Web Applications. Ch 3: Web Application Technologies
CNIT 129S: Securing Web Applications Ch 3: Web Application Technologies HTTP Hypertext Transfer Protocol (HTTP) Connectionless protocol Client sends an HTTP request to a Web server Gets an HTTP response
More informationWeb Architecture and Technologies
Web Architecture and Technologies Ambient intelligence Fulvio Corno Politecnico di Torino, 2015/2016 Goal Understanding Web technologies Adopted for User Interfaces Adopted for Distributed Application
More informationSEARCH ENGINE INSIDE OUT
SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing
More informationPerformance Evaluation of a Regular Expression Crawler and Indexer
Performance Evaluation of a Regular Expression Crawler and Sadi Evren SEKER Department of Computer Engineering, Istanbul University, Istanbul, Turkey academic@sadievrenseker.com Abstract. This study aims
More informationIntroduction to Internet, Web, and TCP/IP Protocols SEEM
Introduction to Internet, Web, and TCP/IP Protocols SEEM 3460 1 Local-Area Networks A Local-Area Network (LAN) covers a small distance and a small number of computers LAN A LAN often connects the machines
More informationChapter 10: Application Layer CCENT Routing and Switching Introduction to Networks v6.0
Chapter 10: Application Layer CCENT Routing and Switching Introduction to Networks v6.0 CCNET v6 10 Chapter 10 - Sections & Objectives 10.1 Application Layer Protocols Explain the operation of the application
More informationToday s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications
Today s lecture Introduction to Information Retrieval Web Crawling (Near) duplicate detection CS276 Information Retrieval and Web Search Chris Manning, Pandu Nayak and Prabhakar Raghavan Crawling and Duplicates
More informationEEC-682/782 Computer Networks I
EEC-682/782 Computer Networks I Lecture 20 Wenbing Zhao w.zhao1@csuohio.edu http://academic.csuohio.edu/zhao_w/teaching/eec682.htm (Lecture nodes are based on materials supplied by Dr. Louise Moser at
More informationSelf Adjusting Refresh Time Based Architecture for Incremental Web Crawler
IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh
More informationImplementation and Evaluation of Prefetching in the Intel Paragon Parallel File System
Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:
More informationIBM Lotus Domino 7 Performance Improvements
IBM Lotus Domino 7 Performance Improvements Razeyah Stephen, IBM Lotus Domino Performance Team Rob Ingram, IBM Lotus Domino Product Manager September 2005 Table of Contents Executive Summary...3 Impacts
More informationCrawling CE-324: Modern Information Retrieval Sharif University of Technology
Crawling CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Sec. 20.2 Basic
More informationNetwork Design Considerations for Grid Computing
Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom
More informationDesign and Implementation of A P2P Cooperative Proxy Cache System
Design and Implementation of A PP Cooperative Proxy Cache System James Z. Wang Vipul Bhulawala Department of Computer Science Clemson University, Box 40974 Clemson, SC 94-0974, USA +1-84--778 {jzwang,
More informationTechnical Specifications for Web-based A+LS Servers
Technical Specifications for Web-based A+LS Servers General Requirements Network Requirements In order to configure Web-based A+LS to properly answer requests from both the Internet and the local area
More informationAdministrative. Web crawlers. Web Crawlers and Link Analysis!
Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt
More informationENG224 INFORMATION TECHNOLOGY Part I 3. The Internet. 3. The Internet
1 Reference Peter Norton, Introduction to Computers, McGraw Hill, 5 th Ed, 2003 2 What is the Internet? A global network that allows one computer to connect with other computers in the world What can be
More informationHyper Text Transfer Protocol Compression
Hyper Text Transfer Protocol Compression Dr.Khalaf Khatatneh, Professor Dr. Ahmed Al-Jaber, and Asma a M. Khtoom Abstract This paper investigates HTTP post request compression approach. The most common
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationI. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80].
Focus: Accustom To Crawl Web-Based Forums M.Nikhil 1, Mrs. A.Phani Sheetal 2 1 Student, Department of Computer Science, GITAM University, Hyderabad. 2 Assistant Professor, Department of Computer Science,
More information