Extensible Web Pages Crawler Towards Multimedia Material Analysis

Wojciech Turek, Andrzej Opalinski, Marek Kisiel-Dorohinicki
AGH University of Science and Technology, Krakow, Poland
wojciech.turek@agh.edu.pl, opal@tempus.metal.agh.edu.pl, doroh@agh.edu.pl

Abstract. Methods of monitoring Web page content are of increasing interest to law enforcement services searching for Web pages containing symptoms of criminal activities. Such information can be hidden from indexing systems by embedding it in multimedia materials. Finding such materials is a major challenge of contemporary criminal analysis. This paper describes a concept of integrating a large-scale Web crawling system with multimedia analysis algorithms. The Web crawling system, which processes a few hundred pages per second, provides a mechanism for plugin inclusion. A plugin can analyze processed resources and detect references to multimedia materials. The references are passed to a component which implements an algorithm for image or video analysis. Several approaches to the integration are described and some exemplary implementation assumptions are presented.

Keywords: Web crawling, image analysis

1 Introduction

The World Wide Web is probably the greatest publicly available base of electronic data. It contains huge amounts of information on almost every subject. The last two decades brought a significant change in the source of this information: so-called Web 2.0 solutions allowed users of the Web to become authors of Web page content. This phenomenon has many side effects, including partial anonymity of the information source. As a result, the content of Web pages is of increasing interest to law enforcement services. It is relatively easy to find or analyze text on a Web page; many popular and powerful tools for that are available on the Internet.
Publicly available tools, like Google Search, have some limitations, but an experienced user can make good use of them. Because of that, people rarely place textual symptoms of criminal activities on publicly available pages. It is much easier to hide information by embedding it in a multimedia material, like an image or a movie. Analysis of such materials is much more complex and time consuming. Public search systems can only perform basic operations of this kind, leaving a lot of information unrecognized.

In order to perform large-scale tests in the domain of Web page content analysis, a Web crawling system is needed. It is relatively easy to create or adopt
a solution that is capable of processing tens of thousands of Web pages. Some examples of such tools written in Java, which can be easily downloaded from the Internet, are: WebSPHINX [2], which can use user-defined processing algorithms and provides a graphical interface for crawling process visualization; Nutch, which can be integrated with Lucene [3], a text indexing system; and Heritrix [4], used by The Internet Archive digital library, which stores changes of Web pages over recent years. However, if a crawling system is required to process tens of millions of Web pages, a complex scale problem arises. Some of the available solutions can handle this kind of problem, but deploying such a system is not as easy as running an off-the-shelf application.

The most obvious scale problem is the number of pages in the Web. In 2008 Google engineers announced that the Google Search engine had discovered one trillion unique URLs [6]. Assuming that one URL has over 20 characters, the amount of space required for storing the URLs alone is more than 18 terabytes. Surprisingly, a list of unique words found on Web pages should not cause scale problems: a system which indexed over 10^8 pages found only 1.7·10^6 words. This amount cannot be held in memory, but can be located on a hard drive of a single computer. However, searching Web pages by words requires the system to build an inverted index, which stores a list of URLs for each word. Assuming that each URL is identified by an 8-byte integer and an average Web page contains 1000 words, the index requires more than 7.2 petabytes.

Another issue related to the scale of the problem is page refresh time, which determines how often a crawler should visit a particular page. If each page is to be visited once a year (which is rather rare), the system must process more than 31000 pages per second. Assuming that an average page is bigger than 1 KB, a network bandwidth of 31 MB per second is a minimum requirement.
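The back-of-envelope estimates above can be reproduced with a few lines of arithmetic. The sketch below uses binary prefixes (TiB, PiB), so the rounding differs slightly from the figures quoted in the text; the constants are the ones stated above.

```java
/** Reproduces the scale estimates discussed in the text (binary prefixes). */
public class CrawlScale {
    static final long URLS = 1_000_000_000_000L;  // 10^12 unique URLs (Google, 2008)

    public static void main(String[] args) {
        // URL storage: ~20 bytes per URL
        double urlTerabytes = URLS * 20.0 / Math.pow(2, 40);
        // Inverted index: 1000 words/page, 8-byte URL identifiers per posting
        double indexPetabytes = URLS * 1000.0 * 8.0 / Math.pow(2, 50);
        // Refresh rate: every page visited once a year
        double pagesPerSecond = (double) URLS / (365L * 24 * 3600);
        // Bandwidth: ~1 KB per page
        double mbPerSecond = pagesPerSecond * 1024 / Math.pow(2, 20);

        System.out.printf("URL storage: %.1f TB%n", urlTerabytes);        // ~18.2
        System.out.printf("Index size:  %.1f PB%n", indexPetabytes);      // ~7.1
        System.out.printf("Throughput:  %.0f pages/s%n", pagesPerSecond); // ~31710
        System.out.printf("Bandwidth:   %.0f MB/s%n", mbPerSecond);       // ~31
    }
}
```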
It is obvious that such a huge amount of data cannot be processed by a single computer. It is necessary to design a distributed system for parallel processing of well-defined subtasks.

In this paper a concept of integrating a large-scale Web crawling system with multimedia analysis algorithms is described. The concept is being tested on a crawling system which has been developed at the Department of Computer Science, AGH UST.

2 Architecture of the crawling system

An exemplary crawling system, which has been created at the Department of Computer Science, AGH UST [1], is designed to handle hundreds of millions of Web pages. It runs on a cluster of 10 computers and handles up to a few hundred pages per second. The performance of the system can be improved easily by adding more computers to the cluster. There are two main subsystems of the crawling system:
1. the cluster management subsystem and
2. the node management subsystem.

The cluster management subsystem is responsible for the integration of all nodes into one uniform system. It provides an administrator Web interface and controls the process of Web page crawling on all nodes. The process is controlled by assigning Internet domains to particular nodes. Each domain (and all its subdomains) can be assigned to one node only, therefore there is no risk of processing the same URL on multiple nodes. The domain-based distribution of tasks is also used for load balancing: each node gets approximately the same number of domains to process.

The cluster management subsystem is also responsible for providing an API for search services. The services use the node management subsystem's search services to perform the search on all nodes, then aggregate and return the results. Two basic elements of the API allow users to search for:

1. URLs of Web pages containing specified words,
2. the content of an indexed Web page with a specified URL.

The node management subsystem manages the crawling process on a single node. It is controlled by an assigned set of domains to process. Each domain has a large set of URLs to be downloaded and analyzed each time the domain is refreshed. During the processing, new URLs can be found. One of three cases can occur:

1. the URL belongs to the domain being processed: it is added to the current queue of URLs,
2. the URL belongs to a domain assigned to this node: it is stored in a database of URLs,
3. the URL belongs to a different domain: it is sent to the cluster management subsystem.

Each domain is processed by a single crawler. The node management subsystem runs tens of crawlers simultaneously, starting new ones if sufficient system resources are present. Each crawler has a list of URLs to be processed. The URLs are processed sequentially, as the crawler has a single processing thread.
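The three-way routing rule above can be sketched as follows. The class and method names are illustrative, not taken from the actual system, and the domain matching is simplified to suffix comparison:

```java
import java.net.URI;
import java.util.Set;

// Where a newly discovered URL is routed (the three cases described above).
enum Route { CURRENT_QUEUE, LOCAL_DATABASE, CLUSTER_MANAGER }

class UrlRouter {
    private final String currentDomain;          // domain being crawled right now
    private final Set<String> assignedDomains;   // all domains owned by this node

    UrlRouter(String currentDomain, Set<String> assignedDomains) {
        this.currentDomain = currentDomain;
        this.assignedDomains = assignedDomains;
    }

    private static boolean inDomain(String host, String domain) {
        return host.equals(domain) || host.endsWith("." + domain); // subdomains too
    }

    /** Decide where a newly discovered URL should go. */
    Route route(URI url) {
        String host = url.getHost();
        if (host == null || inDomain(host, currentDomain))
            return Route.CURRENT_QUEUE;          // case 1: domain being processed
        for (String d : assignedDomains)
            if (inDomain(host, d))
                return Route.LOCAL_DATABASE;     // case 2: another domain of this node
        return Route.CLUSTER_MANAGER;            // case 3: foreign domain
    }
}
```

Because a domain and all its subdomains always route the same way, no two nodes can ever process the same URL, which matches the uniqueness guarantee stated above.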
Sequential processing of a single domain can take a long time (a few days) if the domain is large or the connection is slow. However, in this application a long processing time is not a drawback, as the exact time of indexing is not important. Moreover, this solution is less likely to trigger simple DoS filters, which detect the number of queries per second. It is also easier to implement, due to the reduced need for thread synchronization.

The sequence of actions performed by a crawler during the processing of a single URL is shown in figure 1. There are five main steps of resource processing:

1. Downloading the resource: performed by a specialized component, which limits the maximum download time and controls the size of the resource.
Fig. 1. Resource processing performed by a crawler

2. Lexical analysis: performed on the source of the resource, creating a unified resource model. The source is: (a) converted to Unicode, (b) split into segments during HTML structure analysis, (c) divided into sequences of tokens and words, (d) converted to token and word identifiers, which are stored in dynamically extended vocabularies.
3. Content indexing: performed on the resource model by an Indexer component. The inverted index of words and the stored content of the resource are updated.
4. Content analysis: performed by specialized plugins. Each plugin receives the resource model, performs particular operations and stores the results in its own database. One of the plugins, the URL detector, locates all URLs in the resource.
5. URL management: performed by the Processing controller. URLs are added to a local queue, stored in a database or sent to the cluster management subsystem, depending on the domain.

Currently the system indexes pages in the Polish language. Polish language detection is based on the count of Polish-specific characters detected in the Web page content. The system is implemented in Java and runs on a Linux OS in the JBoss application server [5]. Communication between computers in the cluster is based on EJB components. Most of the storage uses the MySQL DBMS, which has been carefully tuned for this application. Each of the servers has a single four-core CPU, 8 GB of RAM and 3 TB of storage.
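The Polish-language heuristic mentioned above can be sketched in a few lines: count Polish-specific characters and accept the page when their share of all letters exceeds a threshold. The character set is the standard Polish diacritic alphabet; the threshold value is an assumption for illustration, not the one used in the actual system.

```java
/** Minimal sketch of count-based Polish language detection. */
class PolishDetector {
    // Polish-specific (diacritic) letters, lower and upper case
    private static final String POLISH = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ";

    /** True if the share of Polish-specific letters among all letters
     *  is at least the given threshold. */
    static boolean looksPolish(String text, double threshold) {
        long letters = text.chars().filter(Character::isLetter).count();
        if (letters == 0) return false;              // nothing to classify
        long polish = text.chars().filter(c -> POLISH.indexOf(c) >= 0).count();
        return (double) polish / letters >= threshold;
    }
}
```

A frequency threshold rather than a raw count keeps the heuristic independent of page length; short snippets and long articles are judged by the same criterion.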
3 Integration with multimedia analysis algorithms

The crawling system presented in the previous section can be integrated with a multimedia analysis algorithm in several ways. In general, a plugin for the crawling system has to be developed. It has to detect references to multimedia materials in the resource model. After a reference is found, several options are available:

1. download the material and perform the analysis in the plugin,
2. use a remote service to download and analyze the material, and collect the results,
3. feed a remote, independent service with references.

Each of these solutions has some advantages and drawbacks. The integrated solution is limited to the Java technology and is therefore not universal. Many processing algorithms are written in different languages and cannot be directly integrated with the crawler. Moreover, this type of algorithm can have very large resource requirements, which can cause uncontrolled JVM errors and shutdowns.

A remote synchronous service, which downloads and processes a material autonomously, can be implemented in any technology and run on a remote computer. However, this type of solution can cause some synchronization problems. The crawling system has to wait for the results, not knowing how much time it is going to take. The crawling system uses several computers and hundreds of crawlers working simultaneously, so adjusting the performance of multimedia processing can be a hard task.

The last solution, which uses a remote asynchronous service, has the smallest impact on the crawling system. However, feeding the remote service with too many references to multimedia materials can easily result in service overload.

To reduce the number of queries to the multimedia analysis service, the plugin integrated into the crawling system can perform some filtering. An exemplary application of the approach described in this paper will be responsible for finding images which contain faces of people useful for identification.
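A pre-filter of the kind described above might look as follows: before an image reference is sent to the analysis service, images that are very unlikely to contain a usable face (icons, buttons, spacers) are discarded. The file-name patterns and the minimum size are illustrative assumptions, not criteria taken from the actual system.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-filter reducing the number of images sent for analysis. */
class ImageFilter {
    // Names that typically denote decorative page graphics, not photographs
    private static final Pattern DECORATIVE = Pattern.compile(
            ".*(icon|logo|button|spacer|banner|bullet).*", Pattern.CASE_INSENSITIVE);
    private static final int MIN_DIMENSION = 64;  // pixels; assumed lower bound

    /** True if the image is worth sending to the face-analysis service. */
    static boolean isCandidate(String src, int width, int height) {
        if (DECORATIVE.matcher(src).matches()) return false;  // decorative by name
        return width >= MIN_DIMENSION && height >= MIN_DIMENSION;
    }
}
```

Even a crude filter of this kind removes most per-page graphical chrome, which is what keeps the remote service from being overloaded by the hundreds of concurrent crawlers.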
Therefore the plugin will analyze the surroundings of an img tag, verifying specified criteria. This should reduce the number of analyzed images by removing most of the graphical elements of a Web page.

For the test purposes, the second solution will be adopted. A service running on a remote server will be asked to analyze images which are suspected of containing faces of people. The crawling system will wait for the results, using a specified timeout.

4 Applications

It is obvious that there are many practical applications of the solution described in this paper. Some of those are already available in publicly available Web search engines, like Google Images (http://images.google.com/). A user can find images of a specified size or type, or even images which contain faces.
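The synchronous variant with a timeout can be sketched with the standard java.net.http client. The service URI and the request format (image URL posted as plain text) are assumptions for illustration; the actual service interface is not specified in this paper.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

/** Sketch of the synchronous integration: ask a remote analysis service
 *  and wait for the result, but never longer than a fixed timeout. */
class FaceServiceClient {
    private final HttpClient client = HttpClient.newHttpClient();
    private final URI service;
    private final Duration timeout;

    FaceServiceClient(URI service, Duration timeout) {
        this.service = service;
        this.timeout = timeout;
    }

    /** Returns the service response, or empty on timeout or network failure,
     *  so a slow service can never stall the crawler. */
    Optional<String> analyze(String imageUrl) {
        HttpRequest request = HttpRequest.newBuilder(service)
                .timeout(timeout)  // upper bound on the crawler's wait
                .POST(HttpRequest.BodyPublishers.ofString(imageUrl))
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return Optional.of(response.body());
        } catch (Exception e) {   // HttpTimeoutException, IOException, ...
            return Optional.empty();
        }
    }
}
```

Returning an empty result instead of propagating the exception reflects the trade-off described above: a missed image is acceptable, a blocked crawler thread is not.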
However, many other applications, which may be useful in criminal analysis, are not available. Therefore the use of a dedicated system is required. Three different groups of applications have been identified:

1. text detection and recognition,
2. face detection and recognition,
3. crime identification.

Plain text located in the source of a Web page (HTML) can be found and analyzed easily. However, one could use images or movies to publish information on the Web which would not be indexed by a text-based crawler. An image analysis algorithm, which can detect such images and recognize the text, can be used to enrich the index of such a page. This information can be integrated with regular text-based indexing or searched separately.

Face detection and face recognition can be used by identity analysis systems. If an individual whose Internet activities are suspected is to be identified, such a system can help to locate and associate pages containing information about the individual. Moreover, face recognition can help to link a person active at different locations on the Web by finding her/his face on several images. People can also be associated with each other by finding distinctive elements present in multimedia materials; car plates are a good example of such an object.

In some cases, evidence of criminal activities is published on the Web. The most typical example of such publications are movies submitted to video-sharing services. This type of material can be analyzed by an advanced video processing component, searching for defined patterns or text elements. Another very important case of illegal multimedia material on the Web is child pornography. Despite the fact that publishing such material is a serious crime in most countries, such situations still occur. An algorithm for detecting child pornography in multimedia materials, integrated with a crawling system, could significantly improve the detectability of such crimes.
The crawling system presented in the previous section is being integrated with an image analysis component developed at the Department of Telecommunications, AGH UST [7]. The component is able to find the number of faces in a given image. This information will be gathered by the crawler and used for improving the quality of identity analysis systems. In the near future other features of the image analysis component, like fascist content identification or child pornography detection, will be used.

5 Conclusions

The concept described in this paper is based on the idea of integrating a Web crawling system with multimedia analysis algorithms. These two elements can create a system for detecting various kinds of content present in the Web, which are
impossible to find using text-based indexes. Such tools can be used by law enforcement services in many different situations.

The crawling system presented in this paper is being integrated with a multimedia analysis algorithm to prove the correctness of the approach. Many technical issues concerning the integration have already been solved. Further research and tests of the system are aimed at providing high quality tools useful in criminal analysis and crime detection.

Acknowledgments. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 218086.

References

1. Opalinski A., Turek W.: Information retrieval and identity analysis. In: Metody sztucznej inteligencji w dzialaniach na rzecz bezpieczenstwa publicznego, ISBN 978-83-7464-268-2, pp. 173-194, 2009.
2. Miller R. C., Bharat K.: SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers. In: Proceedings of WWW7, Brisbane, Australia, 1998.
3. Shoberg J.: Building Search Applications with Lucene and Nutch. ISBN 978-1590596876, APress, 2006.
4. Sigursson K.: Incremental crawling with Heritrix. In: Proceedings of the 5th International Web Archiving Workshop, 2005.
5. Marrs T., Davis S.: JBoss At Work: A Practical Guide. ISBN 0596007345, O'Reilly, 2005.
6. Alpert J., Hajaj N.: We knew the web was big... The Official Google Blog, 2008.
7. Korus P., Glowacz A.: A system for automatic face indexing. Przeglad Telekomunikacyjny, Wiadomosci Telekomunikacyjne 81(8-9), ISSN 1230-3496, pp. 1304-1312, 2008.