Extensible Web Pages Crawler Towards Multimedia Material Analysis


Wojciech Turek, Andrzej Opalinski, Marek Kisiel-Dorohinicki
AGH University of Science and Technology, Krakow, Poland

Abstract. Methods of monitoring Web page content are of increasing interest to law enforcement services searching for Web pages that contain symptoms of criminal activities. Such information can be hidden from indexing systems by embedding it in multimedia materials. Finding such materials is a major challenge of contemporary criminal analysis. This paper describes a concept of integrating a large-scale Web crawling system with multimedia material analysis algorithms. The Web crawling system, which processes a few hundred pages per second, provides a mechanism for plugin inclusion. A plugin can analyze processed resources and detect references to multimedia materials. The references are passed to a component that implements an algorithm for image or video analysis. Several approaches to the integration are described and some exemplary implementation assumptions are presented.

Keywords: Web crawling, image analysis

1 Introduction

The World Wide Web is probably the greatest publicly available base of electronic data. It contains huge amounts of information on almost every subject. The last two decades brought significant changes in the source of information: so-called Web 2.0 solutions allowed users of the Web to become authors of Web page content. This phenomenon has many side effects, including partial anonymity of the information source. As a result, the content of Web pages is of increasing interest to law enforcement services.

It is relatively easy to find or analyze text on a Web page. There are many popular and powerful tools for that available on the Internet. Publicly available tools, like Google Search, have some limitations, but an experienced user can make good use of their features.
Because of that, people rarely place textual symptoms of criminal activities on publicly available pages. It is much easier to hide information that is embedded in multimedia material, such as an image or a movie. Analysis of such materials is much more complex and time-consuming. Public search systems can perform only basic operations of this kind, leaving a lot of information unrecognized.

In order to perform large-scale tests in the domain of Web page content analysis, a Web crawling system is needed. It is relatively easy to create or adopt a solution capable of processing tens of thousands of Web pages. Some examples of such tools written in Java, which can be easily downloaded from the Internet, are:
- WebSPHINX [2], which can use user-defined processing algorithms and provides a graphical interface for visualizing the crawling process,
- Nutch, which can be integrated with Lucene [3], a text indexing system,
- Heritrix [4], used by The Internet Archive digital library, which stores changes of Web pages over recent years.

However, if a crawling system is required to process tens of millions of Web pages, a complex scale problem arises. Some of the available solutions can handle this kind of problem, but deploying such a system is not as easy as running an off-the-shelf application.

The most obvious scale problem is the number of pages on the Web. In 2008 Google engineers announced that the Google Search engine had discovered one trillion unique URLs [6]. Assuming that one URL has over 20 characters, the space required for storing the URLs alone is more than 18 terabytes. Surprisingly, a list of unique words found on Web pages should not cause scale problems. A system indexing over 10^8 pages found only [...] words. This amount cannot be held in memory, but can be stored on the hard drive of a single computer. However, searching Web pages by words requires the system to build an inverted index, which stores a list of URLs for each word. Assuming that each URL is identified by an 8-byte integer and an average Web page contains 1000 words, the index requires more than 7.2 petabytes.

Another issue related to the scale of the problem is page refresh time, which determines how often a crawler should visit a particular page. If each page is to be visited once a year (which is rather rare), the system must process more than 31,000 pages per second. Assuming that an average page is bigger than 1 KB, a network bandwidth of 31 MB per second is a minimum requirement.
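
The estimates above can be reproduced with a short calculation. The following Python sketch is only an illustration: the constant and function names are chosen here, binary prefixes (TiB, PiB, MiB) are assumed because the text does not say which convention it uses, and the average page size is taken as exactly 1 KB.

```python
# Back-of-the-envelope scale estimates for a Web-wide crawl.
# The constants restate assumptions from the text; the use of binary
# prefixes is an assumption of this sketch.

URL_COUNT = 10**12            # one trillion unique URLs [6]
AVG_URL_BYTES = 20            # "over 20 characters" per URL, one byte each
URL_ID_BYTES = 8              # each URL identified by an 8-byte integer
WORDS_PER_PAGE = 1000         # average number of words on a page
AVG_PAGE_BYTES = 1024         # average page taken as 1 KB (a lower bound)
SECONDS_PER_YEAR = 365 * 24 * 3600


def scale_estimates():
    """Return the four estimates as a dictionary of floats."""
    pages_per_second = URL_COUNT / SECONDS_PER_YEAR
    return {
        "url_list_tib": URL_COUNT * AVG_URL_BYTES / 2**40,
        "inverted_index_pib": URL_COUNT * WORDS_PER_PAGE * URL_ID_BYTES / 2**50,
        "pages_per_second": pages_per_second,
        "bandwidth_mib_per_s": pages_per_second * AVG_PAGE_BYTES / 2**20,
    }


if __name__ == "__main__":
    for name, value in scale_estimates().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the sketch gives roughly 18.2 TiB for the URL list, roughly 7.1 PiB for the inverted index (8 PB with decimal prefixes), and roughly 31,700 pages per second, which agrees with the orders of magnitude quoted above.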
It is obvious that such a huge amount of data cannot be processed by a single computer. It is necessary to design a distributed system for parallel processing of well-defined subtasks.

This paper describes a concept of integrating a large-scale Web crawling system with multimedia material analysis algorithms. The concept is being tested on a crawling system developed at the Department of Computer Science, AGH UST.

2 Architecture of the crawling system

An exemplary crawling system, created at the Department of Computer Science, AGH UST [1], is designed to handle hundreds of millions of Web pages. It runs on a cluster of 10 computers and handles up to a few hundred pages per second. Performance of the system can be improved easily by adding more computers to the cluster. There are two main subsystems of the crawling system:

1. cluster management subsystem and
2. node management subsystem.

The cluster management subsystem is responsible for integrating all nodes into one uniform system. It provides an administrator Web interface and controls the process of Web page crawling on all nodes. The process is controlled by assigning Internet domains to particular nodes. Each domain (and all its subdomains) can be assigned to one node only, so there is no risk of processing the same URL on multiple nodes. The domain-based distribution of tasks is also used for load balancing: each node gets approximately the same number of domains to process.

The cluster management subsystem is also responsible for providing an API for search services. These services use the search services of the node management subsystems to perform a search on all nodes, then aggregate and return the results. Two basic elements of the API allow users to search for:
1. URLs of Web pages containing specified words,
2. content of an indexed Web page with a specified URL.

The node management subsystem manages the crawling process on a single node. It is controlled by the set of domains assigned to it for processing. Each domain has a large set of URLs to be downloaded and analyzed each time the domain is refreshed. During processing, new URLs can be found. One of three cases can occur:
1. the URL belongs to the domain being processed: it is added to the current queue of URLs,
2. the URL belongs to another domain assigned to this node: it is stored in a database of URLs,
3. the URL belongs to any other domain: it is sent to the cluster management subsystem.

Each domain is processed by a single crawler. The node management subsystem runs tens of crawlers simultaneously, starting new ones if sufficient system resources are available. Each crawler has a list of URLs to be processed. URLs are processed sequentially: the crawler has a single processing thread.
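
The three cases for a newly discovered URL can be sketched as a small routing function. This is a minimal Python illustration, not the system's actual code (which is written in Java): the function names and the returned labels are hypothetical, and a domain is approximated by the last two labels of the host name, which is too coarse for multi-level suffixes such as .com.pl.

```python
from urllib.parse import urlsplit


def registered_domain(url, levels=2):
    """Approximate the domain of a URL by the last `levels` host labels.

    Assumption of this sketch: subdomains are folded into their parent
    domain by keeping only the trailing labels of the host name.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.lower().split(".")[-levels:])


def route_url(url, current_domain, node_domains):
    """Decide where a newly found URL should go.

    current_domain -- domain processed by the crawler that found the URL
    node_domains   -- set of all domains assigned to this node
    """
    domain = registered_domain(url)
    if domain == current_domain:
        return "current_queue"      # case 1: same domain, same crawler
    if domain in node_domains:
        return "node_database"      # case 2: other domain on this node
    return "cluster_manager"        # case 3: domain handled elsewhere


if __name__ == "__main__":
    assigned = {"example.pl", "other.pl"}
    for found in ("http://blog.example.pl/post",
                  "http://www.other.pl/",
                  "http://elsewhere.org/page"):
        print(found, "->", route_url(found, "example.pl", assigned))
```

Because a domain and all its subdomains map to the same key, a URL can never be routed to two different nodes, which mirrors the guarantee described above.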
Sequential processing of a single domain can take a long time (a few days) if the domain is large or the connection is slow. However, in this application a long processing time is not a drawback, as the particular time of indexing is not important. Moreover, this solution is less likely to trigger simple DoS filters, which detect the number of queries per second. It is also easier to implement, because less thread synchronization is required.

The sequence of actions performed by a crawler during the processing of a single URL is shown in Figure 1.

Fig. 1. Resource processing performed by a crawler. [Diagram components: Processing controller, Downloader, Lexer, Indexer, Plugins (including the URL detector); data stores: vocabularies, indexes, content data. Labeled flows: 1. URL, 2. page source, 3. resource model, 4. resource model, 5. URLs.]

There are five main steps of resource processing:
1. Downloading the resource, performed by a specialized component, which limits the maximum time and controls the size of the resource.
2. Lexical analysis, performed on the source of the resource, which creates a unified resource model. The source is:
   (a) converted to Unicode,
   (b) split into segments during HTML structure analysis,
   (c) divided into sequences of tokens and words,
   (d) converted to token and word identifiers, which are stored in dynamically extended vocabularies.
3. Content indexing, performed on the resource model by an Indexer component. An inverted index of words and the stored content of the resource are updated.
4. Content analysis, performed by specialized plugins. Each plugin receives a resource model, performs particular operations and stores the results in its own database. One of the plugins, the URL detector, locates all URLs in the resource.
5. URL management, performed by the Processing controller. URLs are added to a local queue stored in a database or sent to the cluster management subsystem, depending on the domain.

Currently the system indexes pages in Polish. Polish language detection is based on the count of Polish-specific characters detected in the Web page content. The system is implemented in Java and runs on a Linux OS in the JBoss application server [5]. Communication between computers in the cluster is based on EJB components. Most of the storage uses the MySQL DBMS, which has been carefully tuned for this application. Each of the servers has a single four-core CPU, 8 GB of RAM and 3 TB of storage.
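
Steps 2(d) and 3 of the processing sequence, that is, the conversion of words to identifiers kept in a dynamically extended vocabulary and the update of an inverted index, can be illustrated with a small in-memory sketch. It is written in Python for brevity and makes several assumptions: the class and method names are hypothetical, words are split with a simple regular expression instead of the HTML-aware lexer, and everything is held in memory rather than in a tuned database.

```python
import re
from collections import defaultdict


class Vocabulary:
    """Maps each distinct word to a small integer identifier."""

    def __init__(self):
        self._ids = {}

    def id_for(self, word):
        # A new word extends the vocabulary with the next free identifier.
        return self._ids.setdefault(word, len(self._ids))

    def lookup(self, word):
        # Returns None for words that were never indexed.
        return self._ids.get(word)


class InvertedIndex:
    """Stores, for each word identifier, the set of page identifiers."""

    def __init__(self):
        self.vocabulary = Vocabulary()
        self.postings = defaultdict(set)

    def add_page(self, page_id, text):
        for word in re.findall(r"\w+", text.lower()):
            self.postings[self.vocabulary.id_for(word)].add(page_id)

    def search(self, word):
        word_id = self.vocabulary.lookup(word.lower())
        if word_id is None:
            return []
        return sorted(self.postings[word_id])


if __name__ == "__main__":
    index = InvertedIndex()
    index.add_page(1, "Web crawler for multimedia analysis")
    index.add_page(2, "Inverted index of a Web page")
    print(index.search("web"))
```

Storing integer identifiers instead of strings keeps each posting small, which is the reason the text assumes an 8-byte integer per URL when estimating the index size.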

3 Integration with multimedia analysis algorithm

The crawling system presented in the previous section can be integrated with a multimedia analysis algorithm in several ways. In general, a plugin for the crawling system has to be developed. It has to detect references to multimedia materials in the resource model. After a reference is found, several options are available:
- download the material and perform the analysis in the plugin,
- use a remote service to download and analyze the material, and collect the results,
- feed a remote, independent service with references.

Each of these solutions has advantages and drawbacks. The integrated solution is limited to Java technology and is therefore not universal. Many processing algorithms are written in different languages and cannot be directly integrated with the crawler. Moreover, this type of algorithm can have very large resource requirements, which can cause uncontrolled JVM errors and shutdown.

A remote synchronous service, which downloads and processes a material autonomously, can be implemented in any technology and run on a remote computer. However, this type of solution can cause some synchronization problems. The crawling system has to wait for results, not knowing how much time it is going to take. The crawling system uses several computers and hundreds of crawlers working simultaneously, so adjusting the multimedia processing performance can be a hard task.

The last solution, which uses a remote asynchronous service, has the smallest impact on the crawling system. However, feeding the remote service with too many references to multimedia materials can easily overload the service.

To reduce the number of queries to the multimedia analysis service, the plugin integrated into the crawling system can perform some filtering. An exemplary application of the approach described in this paper will be responsible for finding images that contain faces of people useful for identification.
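
Plugin-side filtering of this kind could, for instance, look as follows. The sketch is in Python and is purely illustrative: the paper does not list the actual criteria, so the minimum side length, the skipped file extensions and all names used here are assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical criteria: images declared smaller than this, or using formats
# typical for icons and decorations, are unlikely to show identifiable faces.
MIN_SIDE_PX = 100
SKIPPED_EXTENSIONS = (".gif", ".svg", ".ico")


class ImageCandidateFilter(HTMLParser):
    """Collects src values of img tags that pass the simple criteria."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if not src or src.lower().endswith(SKIPPED_EXTENSIONS):
            return
        for side in ("width", "height"):
            value = attributes.get(side)
            # Reject only when a numeric size is declared and is too small.
            if value is not None and value.isdigit() and int(value) < MIN_SIDE_PX:
                return
        self.candidates.append(src)


def candidate_images(html):
    """Return references worth sending to the analysis service."""
    parser = ImageCandidateFilter()
    parser.feed(html)
    parser.close()
    return parser.candidates


if __name__ == "__main__":
    page = ('<img src="logo.gif"><img src="icon.png" width="16" height="16">'
            '<img src="photo.jpg" width="640" height="480">')
    print(candidate_images(page))
```

Only the references that survive the filter would be passed on, so the analysis service receives a fraction of the images found on a page.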
Therefore the plugin will analyze the surroundings of an img tag, verifying specified criteria. This should reduce the number of analyzed images by removing most of the graphical elements of a Web page.

For test purposes, the second solution will be adopted. The remote service, running on a remote server, will be asked to analyze images suspected of containing faces of people. The crawling system will wait for results, using a specified timeout.

4 Applications

It is obvious that there are many practical applications of the solution described in this paper. Some of them are already available in public Web search engines, like Google Images. A user can find images of a specified size or type, or even images which contain faces.

However, many different applications that may be useful in criminal analysis are not available, so a dedicated system is required. Three different groups of applications have been identified:
1. text detection and recognition,
2. face detection and recognition,
3. crime identification.

Plain text located in the source of a Web page (HTML) can be found and analyzed easily. However, one could use images or movies to publish information on the Web, which would not be indexed by a text-based crawler. An image analysis algorithm that can detect such images and recognize the text can be used to enrich the index of such a page. This information can be integrated with regular text-based indexing or searched separately.

Face detection and face recognition can be used by identity analysis systems. If an individual whose Internet activities are suspicious is to be identified, a system can help to locate and associate pages containing information about the individual. Moreover, face recognition can help to match a person active at different locations on the Web by finding her or his face in several images. People can be associated with each other by finding different elements present in multimedia materials. Car license plates are a good example of such objects.

In some cases, evidence of criminal activities is published on the Web. The most typical examples of such publications are movies submitted to video-sharing services. This type of material can be analyzed by an advanced video processing component, searching for defined patterns or text elements.

Another very important case of illegal multimedia material on the Web is child pornography. Despite the fact that publishing such material is a serious crime in most countries, such situations still occur. An algorithm for detecting child pornography in multimedia materials, integrated with a crawling system, could significantly improve the detectability of such crimes.
The crawling system presented in Section 2 is being integrated with an image analyzing component developed at the Department of Telecommunications, AGH UST [7]. The component is capable of finding the number of faces in a given image. The information will be gathered by the crawler and used for improving the quality of identity analysis systems. In the near future other features of the image analyzing component, like fascist content identification or child pornography detection, will be used.

5 Conclusions

The concept described in this paper is based on the idea of integrating a Web crawling system with a multimedia analysis algorithm. Together, these two elements can create a system for detecting various kinds of content present on the Web that are impossible to find using text-based indexes. Such tools can be used by law enforcement services in many different situations.

The crawling system presented in this paper is being integrated with a multimedia analysis algorithm to prove the correctness of the approach. Many technical issues concerning the integration have already been solved. Further research and tests of the system are aimed at providing high-quality tools useful in criminal analysis and crime detection.

Acknowledgments

The research leading to these results has received funding from the European Community's Seventh Framework Program (FP7/[...]) under grant agreement nr [...].

References

1. Opalinski A., Turek W. Information retrieval and identity analysis. In: Metody sztucznej inteligencji w dzialaniach na rzecz bezpieczenstwa publicznego.
2. Miller R. C., Bharat K. SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers. In: Proceedings of WWW7, Brisbane, Australia.
3. Shoberg J. Building Search Applications with Lucene and Nutch. APress.
4. Sigurðsson K. Incremental crawling with Heritrix. In: Proceedings of the 5th International Web Archiving Workshop.
5. Marrs T., Davis S. JBoss At Work: A Practical Guide. O'Reilly.
6. Alpert J., Hajaj N. We knew the web was big... The Official Google Blog.
7. Korus P., Glowacz A. A system for automatic face indexing. Przeglad Telekomunikacyjny, Wiadomosci Telekomunikacyjne 81(8-9), 2008.


MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

SCAM Portfolio Scalability

SCAM Portfolio Scalability SCAM Portfolio Scalability Henrik Eriksson Per-Olof Andersson Uppsala Learning Lab 2005-04-18 1 Contents 1 Abstract 3 2 Suggested Improvements Summary 4 3 Abbreviations 5 4 The SCAM Portfolio System 6

More information

Most real programs operate somewhere between task and data parallelism. Our solution also lies in this set.

Most real programs operate somewhere between task and data parallelism. Our solution also lies in this set. for Windows Azure and HPC Cluster 1. Introduction In parallel computing systems computations are executed simultaneously, wholly or in part. This approach is based on the partitioning of a big task into

More information

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh

More information

DISTRIBUTED DATABASE OPTIMIZATIONS WITH NoSQL MEMBERS

DISTRIBUTED DATABASE OPTIMIZATIONS WITH NoSQL MEMBERS U.P.B. Sci. Bull., Series C, Vol. 77, Iss. 2, 2015 ISSN 2286-3540 DISTRIBUTED DATABASE OPTIMIZATIONS WITH NoSQL MEMBERS George Dan POPA 1 Distributed database complexity, as well as wide usability area,

More information

Analysis of Parallelization Effects on Textual Data Compression

Analysis of Parallelization Effects on Textual Data Compression Analysis of Parallelization Effects on Textual Data GORAN MARTINOVIC, CASLAV LIVADA, DRAGO ZAGAR Faculty of Electrical Engineering Josip Juraj Strossmayer University of Osijek Kneza Trpimira 2b, 31000

More information

Firebird performance degradation: tests, myths and truth IBSurgeon, 2014

Firebird performance degradation: tests, myths and truth IBSurgeon, 2014 Firebird performance degradation: tests, myths and truth Check the original location of this article for updates of this article: http://ib-aid.com/en/articles/firebird-performance-degradation-tests-myths-and-truth/

More information

Istat s Pilot Use Case 1

Istat s Pilot Use Case 1 Istat s Pilot Use Case 1 Pilot identification 1 IT 1 Reference Use case X 1) URL Inventory of enterprises 2) E-commerce from enterprises websites 3) Job advertisements on enterprises websites 4) Social

More information

TIC: A Topic-based Intelligent Crawler

TIC: A Topic-based Intelligent Crawler 2011 International Conference on Information and Intelligent Computing IPCSIT vol.18 (2011) (2011) IACSIT Press, Singapore TIC: A Topic-based Intelligent Crawler Hossein Shahsavand Baghdadi and Bali Ranaivo-Malançon

More information

MG4J: Managing Gigabytes for Java. MG4J - intro 1

MG4J: Managing Gigabytes for Java. MG4J - intro 1 MG4J: Managing Gigabytes for Java MG4J - intro 1 Managing Gigabytes for Java Schedule: 1. Introduction to MG4J framework. 2. Exercitation: try to set up a search engine on a particular collection of documents.

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

Discovery services: next generation of searching scholarly information

Discovery services: next generation of searching scholarly information Discovery services: next generation of searching scholarly information Article (Unspecified) Keene, Chris (2011) Discovery services: next generation of searching scholarly information. Serials, 24 (2).

More information

Map-Reduce. John Hughes

Map-Reduce. John Hughes Map-Reduce John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days Hundreds of ad-hoc

More information

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but

More information

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,

More information

Design and Implementation of Agricultural Information Resources Vertical Search Engine Based on Nutch

Design and Implementation of Agricultural Information Resources Vertical Search Engine Based on Nutch 619 A publication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 51, 2016 Guest Editors: Tichun Wang, Hongyang Zhang, Lei Tian Copyright 2016, AIDIC Servizi S.r.l., ISBN 978-88-95608-43-3; ISSN 2283-9216 The

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

Hitachi Storage Command Portal Installation and Configuration Guide

Hitachi Storage Command Portal Installation and Configuration Guide Hitachi Storage Command Portal Installation and Configuration Guide FASTFIND LINKS Document Organization Product Version Getting Help Table of Contents # MK-98HSCP002-04 Copyright 2010 Hitachi Data Systems

More information

Red Hat Application Migration Toolkit 4.0

Red Hat Application Migration Toolkit 4.0 Red Hat Application Migration Toolkit 4.0 Getting Started Guide Simplify Migration of Java Applications Last Updated: 2018-04-04 Red Hat Application Migration Toolkit 4.0 Getting Started Guide Simplify

More information

Data Intensive Computing SUBTITLE WITH TWO LINES OF TEXT IF NECESSARY PASIG June, 2009

Data Intensive Computing SUBTITLE WITH TWO LINES OF TEXT IF NECESSARY PASIG June, 2009 Data Intensive Computing SUBTITLE WITH TWO LINES OF TEXT IF NECESSARY PASIG June, 2009 Presenter s Name Simon CW See Title & and Division HPC Cloud Computing Sun Microsystems Technology Center Sun Microsystems,

More information

BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis

BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis Motivation Lots of (semi-)structured data at Google URLs: Contents, crawl metadata, links, anchors, pagerank,

More information

The MOSIX Scalable Cluster Computing for Linux. mosix.org

The MOSIX Scalable Cluster Computing for Linux.  mosix.org The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What

More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

Map-Reduce (PFP Lecture 12) John Hughes

Map-Reduce (PFP Lecture 12) John Hughes Map-Reduce (PFP Lecture 12) John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days

More information

Scaling Without Sharding. Baron Schwartz Percona Inc Surge 2010

Scaling Without Sharding. Baron Schwartz Percona Inc Surge 2010 Scaling Without Sharding Baron Schwartz Percona Inc Surge 2010 Web Scale!!!! http://www.xtranormal.com/watch/6995033/ A Sharding Thought Experiment 64 shards per proxy [1] 1 TB of data storage per node

More information

Searching the Web for Information

Searching the Web for Information Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content

More information

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Evaluating the Usefulness of Sentiment Information for Focused Crawlers Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,

More information

Hadoop and Map-reduce computing

Hadoop and Map-reduce computing Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.

More information

This tutorial is designed for all Java enthusiasts who want to learn document type detection and content extraction using Apache Tika.

This tutorial is designed for all Java enthusiasts who want to learn document type detection and content extraction using Apache Tika. About the Tutorial This tutorial provides a basic understanding of Apache Tika library, the file formats it supports, as well as content and metadata extraction using Apache Tika. Audience This tutorial

More information

Day 3. Storage Devices + Types of Memory + Measuring Memory + Computer Performance

Day 3. Storage Devices + Types of Memory + Measuring Memory + Computer Performance Day 3 Storage Devices + Types of Memory + Measuring Memory + Computer Performance 11-10-2015 12-10-2015 Storage Devices Storage capacity uses several terms to define the increasing amounts of data that

More information

Lecture 1: January 22

Lecture 1: January 22 CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative

More information

Distributed Systems Principles and Paradigms. Chapter 12: Distributed Web-Based Systems

Distributed Systems Principles and Paradigms. Chapter 12: Distributed Web-Based Systems Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science steen@cs.vu.nl Chapter 12: Distributed -Based Systems Version: December 10, 2012 Distributed -Based Systems

More information

CSE 373: Data Structures and Algorithms. Memory and Locality. Autumn Shrirang (Shri) Mare

CSE 373: Data Structures and Algorithms. Memory and Locality. Autumn Shrirang (Shri) Mare CSE 373: Data Structures and Algorithms Memory and Locality Autumn 2018 Shrirang (Shri) Mare shri@cs.washington.edu Thanks to Kasey Champion, Ben Jones, Adam Blank, Michael Lee, Evan McCarty, Robbie Weber,

More information

Full-Text Indexing For Heritrix

Full-Text Indexing For Heritrix Full-Text Indexing For Heritrix Project Advisor: Dr. Chris Pollett Committee Members: Dr. Mark Stamp Dr. Jeffrey Smith Darshan Karia CS298 Master s Project Writing 1 2 Agenda Introduction Heritrix Design

More information

Dark Web. Ronald Bishof, MS Cybersecurity. This Photo by Unknown Author is licensed under CC BY-SA

Dark Web. Ronald Bishof, MS Cybersecurity. This Photo by Unknown Author is licensed under CC BY-SA Dark Web Ronald Bishof, MS Cybersecurity This Photo by Unknown Author is licensed under CC BY-SA Surface, Deep Web and Dark Web Differences of the Surface Web, Deep Web and Dark Web Surface Web - Web

More information

Worksheet - Storing Data

Worksheet - Storing Data Unit 1 Lesson 12 Name(s) Period Date Worksheet - Storing Data At the smallest scale in the computer, information is stored as bits and bytes. In this section, we'll look at how that works. Bit Bit, like

More information

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server Authors: Sergey Brin, Lawrence Page Google, word play on googol or 10 100 Centralized system, entire HTML text saved Focused on high precision, even at expense of high recall Relies heavily on document

More information

Efficient Web Crawling for Large Text Corpora

Efficient Web Crawling for Large Text Corpora Efficient Web Crawling for Large Text Corpora Vít Suchomel Natural Language Processing Centre Masaryk University, Brno, Czech Republic xsuchom2@fi.muni.cz Jan Pomikálek Lexical Computing Ltd. xpomikal@fi.muni.cz

More information

Chunyan Wang Electrical and Computer Engineering Dept. National University of Singapore

Chunyan Wang Electrical and Computer Engineering Dept. National University of Singapore Chunyan Wang Electrical and Computer Engineering Dept. engp9598@nus.edu.sg A Framework of Integrating Network QoS and End System QoS Chen Khong Tham Electrical and Computer Engineering Dept. eletck@nus.edu.sg

More information

FILTERING OF URLS USING WEBCRAWLER

FILTERING OF URLS USING WEBCRAWLER FILTERING OF URLS USING WEBCRAWLER Arya Babu1, Misha Ravi2 Scholar, Computer Science and engineering, Sree Buddha college of engineering for women, 2 Assistant professor, Computer Science and engineering,

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Towards the Performance Visualization of Web-Service Based Applications

Towards the Performance Visualization of Web-Service Based Applications Towards the Performance Visualization of Web-Service Based Applications Marian Bubak 1,2, Wlodzimierz Funika 1,MarcinKoch 1, Dominik Dziok 1, Allen D. Malony 3,MarcinSmetek 1, and Roland Wismüller 4 1

More information

New research on Key Technologies of unstructured data cloud storage

New research on Key Technologies of unstructured data cloud storage 2017 International Conference on Computing, Communications and Automation(I3CA 2017) New research on Key Technologies of unstructured data cloud storage Songqi Peng, Rengkui Liua, *, Futian Wang State

More information

Firepoint: Porting Application to Mobile Platforms

Firepoint: Porting Application to Mobile Platforms Firepoint: Porting Application to Mobile Platforms Artem Timonin, Artem Kalinin, Alexander Troshkov, Kirill Kulakov Petrozavodsk State University (PetrSU) Petrozavodsk, Republic Karelia, Russia (timonin,

More information

Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies

Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies Giulio Barcaroli 1 (barcarol@istat.it), Monica Scannapieco 1 (scannapi@istat.it), Donato Summa

More information

Around the Web in Six Weeks: Documenting a Large-Scale Crawl

Around the Web in Six Weeks: Documenting a Large-Scale Crawl Around the Web in Six Weeks: Documenting a Large-Scale Crawl Sarker Tanzir Ahmed, Clint Sparkman, Hsin- Tsang Lee, and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering

More information

Introduction. Table of Contents

Introduction. Table of Contents Introduction This is an informal manual on the gpu search engine 'gpuse'. There are some other documents available, this one tries to be a practical how-to-use manual. Table of Contents Introduction...

More information

THE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY

THE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY THE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY INTRODUCTION Driven by the need to remain competitive and differentiate themselves, organizations are undergoing digital transformations and becoming increasingly

More information