Structure Properties of the Thai WWW: The 2003 Survey

Surasak Sanguanpong and Kasom Koth-arsa

Applied Network Research Laboratory, Department of Computer Engineering,
Faculty of Engineering, Kasetsart University, Bangkok 10900, THAILAND
{Surasak.S, g4205077}@ku.ac.th

Abstract. This paper presents quantitative measurements and analyses of the structural properties of the Thai World Wide Web, obtained with a high-performance web spider called KSpider. The paper briefly describes KSpider's system architecture and several of its design techniques, such as workload distribution, scheduling policy, in-memory URL compression, and an enhanced DNS resolver. Selected statistics on the Thai WWW, based on information gathered in March 2003, are presented. KSpider collected 8,170,005 URLs (HTML and images) in 7 days. The compressed web data consumes around 155 GB of disk space. In total, 3,277,988 HTML documents (around 54 GB) were found on 24,124 web servers with 9,167 unique IP addresses, and over 60 million hyperlinks were analyzed. Further statistics are reported: documents and servers classified by domain name, percentage of server types, distribution of page sizes, distribution of file extensions, and hyperlink connectivity between domains.

1 Introduction

The purpose of this paper is to present quantitative measurements and analyses of various issues related to web servers and some structural information on the Thai WWW, obtained with the parallel web spider KSpider [1]. Prior reports on Thai WWW statistics were published by the same authors [2], and further facts and figures are available on the homepage of the Applied Network Research Laboratory at http://anres.cpe.ku.ac.th.

The earlier report on Thai WWW statistics [2] covered the survey conducted from March 2000 to March 2001 and included only web servers whose names are registered under the .th domain. The Internet in Thailand, however, has continued to grow [3]. Moreover, country-code domain registration suffers from the widespread use of .com and other top-level domains (TLDs), whose quantity and properties were unknown. To capture the current situation, the survey was reactivated to include two types of web servers: (1) web servers whose names end with the .th suffix, and (2) web servers whose names end with any other gTLD or ccTLD (.com, .net, .org, etc.) but whose IP addresses belong to address blocks assigned to Thai network organizations. KSpider was accordingly configured to download data from web servers whose names are registered under the .th domain or whose IP addresses are in Thailand. The list of such IP addresses was built from the national exchange's route server (THNIX), available at telnet://route-server.cat.net.th.
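As a minimal sketch of this two-part scope rule, the function below accepts a server if its hostname ends in .th or if its IPv4 address falls inside one of the Thai address blocks derived from the route server. All names and types here are illustrative assumptions, not KSpider's actual code.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One CIDR-style address block assigned to a Thai network organization.
struct IpBlock {
    uint32_t base;  // network address, host byte order
    uint32_t mask;  // network mask
};

// True if the hostname ends with the ".th" suffix.
bool endsWithTh(const std::string& host) {
    return host.size() >= 3 && host.compare(host.size() - 3, 3, ".th") == 0;
}

// True if the IPv4 address lies inside any Thai address block.
bool inThaiBlocks(uint32_t ip, const std::vector<IpBlock>& blocks) {
    for (const auto& b : blocks)
        if ((ip & b.mask) == b.base) return true;
    return false;
}

// Crawl-scope test: a .th name, or any other TLD hosted on a Thai address.
bool inCrawlScope(const std::string& host, uint32_t ip,
                  const std::vector<IpBlock>& blocks) {
    return endsWithTh(host) || inThaiBlocks(ip, blocks);
}
```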

In the next section, we briefly discuss KSpider's architecture. The following section presents the analyses of the Thai WWW. Finally, we conclude the paper.

2 KSpider's Design and Implementation

A web spider (also called a crawler or robot) is one of the key components of a search engine. The main task of a spider is to automatically collect HTML documents, or web pages, along with other web data (images and other file types) from public web servers. Spiders "crawl" across the web, following hyperlinks from site to site and storing the pages they download in order to build a searchable index.

KSpider is the second generation of our spider implementation. Our earlier data collection was based on a multi-process web spider called NontriSpider, developed as part of the NontriSearch search engine [2]. KSpider was developed to overcome many of NontriSpider's limitations and adds further performance enhancements.

2.1 Overview of System Components

KSpider runs on a Beowulf cluster [4]. The KCluster running KSpider currently consists of four AMD Athlon XP 1500+ machines, each equipped with 768 MB of DDR RAM, six 35 GB Ultra-160 SCSI hard disks, and an Intel E1000 Gigabit Ethernet interface. The nodes are connected by a 3Com Gigabit Ethernet switch. KSpider is implemented in C++ on top of the Linux operating system. It consists of five main components, shown in Fig. 1 and described below.

URL Manager. The URL Manager is responsible for all URL handling. The URL Manager on each node keeps track of a URL subset that is disjoint from those of the other nodes. Its Storage Manager takes URLs from the Buffer Queue and stores them in compressed form for further processing, and its Scheduler selects and schedules URLs by sending lists of URLs to the Data Collector.

Data Collector. The Data Collector manages the collector threads that fetch data from the web servers. Each collector thread takes a list of URLs from the queue and sends requests to the web servers using HTTP/1.1.

Data Processor. Fetched data is passed from the Data Collector to the Data Processor for further processing, such as link extraction, statistics collection, and URL filtering.

Storage Manager. The Storage Manager contains two important components: the Compressor and the Decompressor. Web data is compressed and packed together by the Compressor, which uses the LZO algorithm [5] as its compression library. The Decompressor is responsible for data decompression.

Communicator. Whenever a new URL is extracted and the extracting node finds that it is not responsible for that URL, the Communicator delivers the URL to the responsible node over UDP in an asynchronous fashion.
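The skeleton below restates these five components and the flow between them in code form. It is an illustrative sketch only: the class and method names are invented here and the bodies are reduced to stubs; they are not KSpider's actual interfaces.

```cpp
#include <string>
#include <vector>

struct Page { std::string url, body; };

// Keeps this node's disjoint URL subset; the Scheduler picks batches.
class UrlManager {
    std::vector<std::string> queue_;
public:
    void enqueue(const std::string& url) { queue_.push_back(url); }
    std::vector<std::string> nextBatch() {
        std::vector<std::string> batch;
        batch.swap(queue_);
        return batch;
    }
};

// Collector threads fetch the given URLs over HTTP/1.1 (stubbed here).
class DataCollector {
public:
    std::vector<Page> fetch(const std::vector<std::string>& urls) {
        std::vector<Page> pages;
        for (const auto& u : urls) pages.push_back({u, ""});
        return pages;
    }
};

// Link extraction, statistics collection, URL filtering (stubbed).
class DataProcessor {
public:
    std::vector<std::string> extractLinks(const Page&) { return {}; }
};

// Compressor/Decompressor pair built on LZO [5] (stubbed).
class StorageManager {
public:
    void store(const Page&) {}
};

// Hands URLs owned by another node to that node over UDP (stubbed).
class Communicator {
public:
    void sendToOwner(const std::string&) {}
};
```

A node's main loop then simply wires these together: schedule a batch, fetch it, extract links, store the pages, and forward foreign URLs through the Communicator.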

[Fig. 1. The system architecture of KSpider. On each node: the URL Manager (Scheduler with in-memory and on-disk URL storage), the Data Collector (HTTP data collectors and an in-memory parallel DNS), the Data Processor (URL Extractor, URL Filter, Stats Collector, and Data Streamer feeding an online indexer and other processing), the Storage Manager (Data Compressor and Decompressor), and the Communicator connecting the node to the rest of the cluster, with URL buffer queues between the stages.]

2.2 Design and Implementation Techniques

Many underlying concepts and techniques are implemented in KSpider. The most important ones are described in this section.

2.2.1 Data Distribution

Data (web pages, images, and other file types) downloaded from the web servers is distributed over the nodes in the cluster. For any given URL, exactly one node is responsible for fetching and keeping the data referenced by that URL. A simple hash function based on the sum of every character in the URL is used to distribute URLs among the nodes in the cluster.

2.2.2 Phase Swapping

Each node has a list of URLs that may belong to the same web server, so several nodes could end up downloading pages from the same web server at the same time. With several nodes running simultaneously, this would put heavy load on the destination servers. To prevent this, a technique called phase swapping is used. The underlying idea is to group URLs belonging to the same web server together (by hashing the host-name portion of the URL) and to let each node work on a different set of servers at any one time. After a predefined constant period, every node synchronously swaps its working phase to a new set of web servers. This technique not only avoids overloading the web servers but also greatly reduces the number of URLs the spider has to manage at any given time.
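The sketch below makes the two hash-based decisions concrete: which node owns a URL, and which group of servers a node works on in the current phase. The byte-summation hash and the modulo grouping follow the description above; the function names and the exact phase formula are illustrative assumptions rather than KSpider's actual code.

```cpp
#include <cstdint>
#include <string>

// The paper's "simple hash": the sum of every character in the string.
uint32_t charSum(const std::string& s) {
    uint32_t h = 0;
    for (unsigned char c : s) h += c;
    return h;
}

// Data distribution: exactly one node owns each URL.
int ownerNode(const std::string& url, int numNodes) {
    return static_cast<int>(charSum(url) % numNodes);
}

// Phase swapping: hashing the host name puts all URLs of one server in
// the same group; shifting the group by the current phase lets every node
// work on a disjoint set of servers, swapped synchronously each period.
bool inCurrentPhase(const std::string& host, int nodeId,
                    int numNodes, int phase) {
    int group = static_cast<int>(charSum(host) % numNodes);
    return group == (nodeId + phase) % numNodes;
}
```

With numNodes = 4, for example, a host whose group is 2 is fetched by node 2 in phase 0 and by node 1 in phase 1, so no two nodes ever hit the same server during the same phase.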

2.2.3 URL Compression

KSpider compresses URLs by keeping only the differences between URL tails, using a modified AVL tree with delta encoding [6]. On average, this technique reduces the stored length of a URL from 59.5 bytes to 26.5 bytes (about a 55% size reduction, including all data-structure overhead), so the entire compressed URL set can be kept in main memory. The current configuration of a KCluster node (see Sect. 2.1) is designed to handle up to 30 million URLs.
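The following sketch illustrates the delta-encoding idea on a lexicographically sorted URL list: each entry stores only the suffix that differs from its predecessor. KSpider's real structure is a modified AVL tree [6]; the sorted vector used here is a simplification for illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One delta-encoded URL: the length of the prefix shared with the
// previous URL, plus the differing tail.
struct DeltaEntry {
    uint16_t shared;
    std::string tail;
};

// Encode a sorted list of URLs by front-coding.
std::vector<DeltaEntry> encode(const std::vector<std::string>& sorted) {
    std::vector<DeltaEntry> out;
    std::string prev;
    for (const auto& url : sorted) {
        size_t n = 0;
        while (n < prev.size() && n < url.size() && prev[n] == url[n]) ++n;
        out.push_back({static_cast<uint16_t>(n), url.substr(n)});
        prev = url;
    }
    return out;
}

// Decode back to the full URL list.
std::vector<std::string> decode(const std::vector<DeltaEntry>& entries) {
    std::vector<std::string> out;
    std::string prev;
    for (const auto& e : entries) {
        prev = prev.substr(0, e.shared) + e.tail;
        out.push_back(prev);
    }
    return out;
}
```

For example, stored after http://a.ku.ac.th/x/1.html, the URL http://a.ku.ac.th/x/2.html becomes the pair (20, "2.html").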

2.2.4 Enhanced DNS Resolver

KSpider integrates a DNS caching mechanism into its resolver. This reduces the workload on DNS servers when several thousand hostnames must be resolved in a short period of time. In addition, KSpider has a built-in mechanism to specify the resolver time-out and can contact several DNS servers at the same time.

3 Crawling Results

This section presents selected statistics about web servers in Thailand. In this paper, a web server refers to a web site that provides HTTP service; it is counted per unique hostname, not per physical machine, since one machine can host multiple web servers under different hostnames. To estimate the number of machines, resolved names were used to count unique IP addresses. Instead of downloading only HTML pages, images and other file types were also collected, to build a database for another project.

KSpider collected around eight million URLs, and the crawl took 7 days. The data retrieved for each request consists of two parts: the HTTP header and the HTML body. The headers were analyzed by counting each field, while the HTML bodies were subjected to more extensive analysis. Counting and analysis are mostly performed automatically by the statistics collector integrated into KSpider (see Fig. 1). HTML parsing was performed with our own robust C++ parser for maximum performance. The following sections describe the crawling results in detail.

3.1 The Thai Domain Name

The detailed domain structure was derived from statistics published by the Thailand Network Information Center (THNIC) at http://www.thnic.net/. There are 7 top-level domain names in Thailand. We prepared a list of third-level domain names, available from http://all.in.th, comprising 10,504 subdomains as shown in Table 1.

Table 1. The 7 top-level domain names, ranked by number of third-level sub-domain names

Rank   Domain   #Domains   Percent
1      co.th       7,839     74.63
2      in.th       1,128     10.74
3      ac.th         788      7.50
4      or.th         465      4.43
5      go.th         245      2.33
6      net.th         27      0.26
7      mi.th          12      0.11
Total            10,504    100.00

KSpider started collecting web data from 10,504 seed URLs generated from this list: each domain name was prefixed with the standard host name www to create a complete URL. These seed URLs ensure that KSpider reaches every domain name without bias.

3.2 Size of the Thai Web

We collected 8,170,005 URLs, of which 3,277,988 are HTML documents. The web data consumed over 155 GB of disk space (compressed), of which the HTML documents account for around 54 GB. The survey found 24,124 web servers on 9,167 unique machines.

3.3 Documents and Servers Classified by Domain Name

Statistics on the HTML documents for each domain are shown in Table 2. Over 70% of all documents are in the academic and commercial domains.

Table 2. Documents and servers in each domain name, ranked by number of documents

Rank   Domain   #Documents   #Servers   #Machines
1      ac.th     1,093,388      2,979       1,973
2      com         977,478     10,385       2,249
3      go.th       313,109        839         382
4      co.th       279,532      6,159       2,664
5      or.th       236,342        651         400
6      org         107,188        481         245
7      net          87,700        764         403
8      others       56,361        388         304
9      net.th       55,169        102          81
10     in.th        36,658        748         336
11     edu          20,500        561         109
12     mi.th        14,563         67          21
Total            3,277,988     24,124       9,167

3.4 HTTP Return Codes

Table 3 shows the distribution of HTTP return codes. Code 200 (OK) indicates a successful request for a unique page.

Table 3. HTTP return codes over 3,961,227 total requests

Rank   Code                        Quantity   Percent
1      200 (OK)                   3,277,988     82.75
2      404 (Not Found)              536,451     13.54
3      401 (Unauthorized)            74,334      1.88
4      403 (Forbidden)               27,227      0.69
5      302 (Moved Temporarily)       18,907      0.48
6      301 (Moved Permanently)       17,361      0.44
7      406 (Not Acceptable)           4,035      0.10
8      503 (Service Unavailable)      2,011      0.05
9      500 (Internal Error)           1,449      0.04
10     400 (Bad Request)                681      0.02
11     Others (6 more)                  783      0.01
Total                            3,961,227    100.00

3.5 HTTP Headers

Across the completed HTTP GET requests, 70 different response header types were found, summarized in Table 4.

Table 4. Frequency of various HTTP header types

Rank   HTTP Header         Quantity   Percent
1      Content-Type       3,277,988    100.00
2      Date               3,277,250     99.98
3      Server             3,273,134     99.85
4      Content-Length     2,401,286     73.25
5      Last-Modified      2,338,212     71.33
6      Accept-Ranges      2,291,211     69.90
7      Etag               2,224,593     67.86
8      Transfer-Encoding    839,251     25.30
9      Connection           665,307     20.30
10     X-Powered-By         486,940     14.85
11     Others (60 more)     706,253     21.54

3.6 Percentage of Server Types

In total, 149 different types of HTTP server were discovered. Table 5 summarizes them, omitting version details. More than 88% of servers rely on Apache or Microsoft-IIS technology.

Table 5. Distribution of HTTP server types

Rank   Type                   Quantity   Percent
1      Apache                   13,569     56.25
2      Microsoft-IIS             7,787     32.28
3      unknown                   1,604      6.65
4      dozygroup WebServer         227      0.94
5      Netscape-Enterprise         222      0.92
6      TWH                         116      0.48
7      Rapidsite                    60      0.25
8      Ipswitch-IMail               50      0.21
9      IBM_HTTP_Server              50      0.21
10     Lotus-Domino                 48      0.20
11     OmniHTTPd                    26      0.11
12     Netscape-FastTrack           16      0.07
13     Others (137 types)          349      1.45
Total                           24,124    100.00

3.7 Last Modification Distribution

Most documents whose last-modification time is known are less than a year old, as shown in Fig. 2. The label "errors" in the figure means that the server answered with a time in the future; the label "current" means that the server answered with a time in the same period as the crawl. In addition, many documents do not report a last-modification time at all; these are labeled "unknown".

[Fig. 2. Distribution of last modification times. Bar categories: errors, current, 1 month, 6 months, 1 year, 2 years, 3 years, 4 years, 5 years, 6 years, >10 years, unknown. Bar values visible in the chart: 1,337,330; 647,860; 538,143; 422,080; 332,280; 237,432; 220,161; 111,145; 78,177; 1,161; 20,813; 14,645.]

3.8 Distribution of File Extensions

Files were classified by the standard suffix used in their names, e.g. .html, .htm, .jpg, and .gif, as shown in Table 6. Filenames without a suffix are classified as unknown (a classification sketch follows the table).

Table 6. Distribution of file extensions

Extension   Quantity    Percent
.jpg        2,474,971     28.18
.gif        2,279,698     25.95
.html       1,642,336     18.70
.htm        1,260,005     14.34
unknown       622,548      7.09
.pdf          128,877      1.47
.asp          102,478      1.17
.php           86,915      0.99
.shtml         71,862      0.82
.doc           42,252      0.48
.xml           27,461      0.31
.png           22,767      0.26
.jpeg          19,752      0.22
.jsp            1,687      0.02
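As a minimal sketch of this suffix rule (our illustration, not KSpider's actual code), the classifier below lower-cases the substring after the last dot in the path and returns "unknown" when the last path component has no suffix:

```cpp
#include <cctype>
#include <string>

// Classify a URL path by its file-name suffix; paths whose last
// component contains no '.' fall into the "unknown" bucket.
std::string classifyExtension(const std::string& path) {
    size_t slash = path.find_last_of('/');
    size_t dot = path.find_last_of('.');
    if (dot == std::string::npos ||
        (slash != std::string::npos && dot < slash))
        return "unknown";
    std::string ext = path.substr(dot);  // e.g. ".html"
    for (char& c : ext)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return ext;
}
```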

3.9 Page Size Distribution

Figure 3 shows the distribution of HTML page sizes (in bytes) on a logarithmic scale. Note that only HTML documents are considered.

[Fig. 3. Distribution of HTML page size (bytes), with pages bucketed by powers of two from 1 byte up to about 7×10^7 bytes.]

3.10 General Properties of HTML Documents

Table 7 shows typical structural properties of the HTML documents.

Table 7. General properties of HTML documents

Item                            Min         Max        Mean        Std
Page size (bytes)                 0  26,240,704   15,269.87  47,091.42
Number of internal hyperlinks     0       6,346        0.47      12.11
Number of local hyperlinks        0      32,989       34.00     108.42
Number of remote hyperlinks       0      17,647        6.39      28.14
Number of Java applets            0          77       0.008       0.26
Number of embedded images         0      27,494       16.04      70.22

3.11 Link Connectivity

Nearly 61 million hyperlinks were found in the experiment. Table 8 shows the connectivity matrix between domain names: the entry in row R and column C is the number of hyperlinks from documents in domain R to documents in domain C.

Table 8. Link connectivity between domain names

From \ To     .ac.th     .co.th     .go.th   .in.th  .mi.th    .net.th     .or.th      others         Sum
.ac.th     6,972,473     47,938     32,050    2,986     679     12,884     34,467     120,220   7,223,697
.co.th         2,652  6,276,278      4,871      226      58        433      5,052     359,254   6,648,824
.go.th        16,989     36,152  1,760,686    1,228   1,324      4,140     14,995      56,524   1,892,038
.in.th         1,085     12,381      1,388  271,555      24         69      1,095       6,541     294,138
.mi.th           568      1,003      1,049       33  50,003         62        497       1,462      54,677
.net.th        4,331      2,310      3,628       64      91  1,299,618      4,096       4,721   1,318,859
.or.th         7,807     11,010    122,140      222     199      2,828  3,363,298      43,333   3,550,837
others        47,622    204,916     38,428    4,853   1,158      7,237     56,874  39,561,025  39,922,113
Sum        7,053,527  6,591,988  1,964,240  281,167  53,536  1,327,271  3,480,374  40,153,080  60,905,183

4 Conclusion

Quantitative measurements and statistics of the Thai WWW have been presented and analyzed. More extensive analyses are planned, and the full results of the survey will be made available online at http://anres.cpe.ku.ac.th/.

5 References

1. Koht-Arsa, K. and Sanguanpong, S.: High Performance Large Scale Web Spider Architecture. In: The 2002 International Symposium on Communications and Information Technology, Pattaya, Chonburi, Thailand, October 2002.
2. Sanguanpong, S., Piamsa-nga, P., Poovarawan, Y., and Warangrit, S.: Measuring Thai Web Using NontriSpider. In: Proceedings of the International Forum cum Conference on Information Technology and Communication, pp. 123-132, Bangkok, June 2000.
3. NECTEC: Internet Information Research Center, 2003. Available at: http://ntl.nectec.or.th/internet/index.html
4. Sterling, T., Becker, D. J., Savarese, D., Dorband, J. E., Ranawake, U. A., and Packer, C. E.: Beowulf: A Parallel Workstation for Scientific Computation. In: Proceedings of the International Conference on Parallel Processing, 1995.
5. Oberhumer, M. F. X. J.: LZO Data Compression Library, 1996. Available at: http://www.oberhumer.com/opensource/lzo/
6. Koht-Arsa, K. and Sanguanpong, S.: In-memory URL Compression. In: The National Computer Science and Engineering Conference, Chiang Mai, Thailand, 2001.