Extensible Web Pages Crawler Towards Multimedia Material Analysis
Wojciech Turek, Andrzej Opalinski, Marek Kisiel-Dorohinicki
AGH University of Science and Technology, Krakow, Poland

Abstract. Methods of monitoring Web page content are of increasing interest to law enforcement services searching for Web pages that contain symptoms of criminal activities. Such information can be hidden from indexing systems by embedding it in multimedia materials, and finding these materials is a major challenge of contemporary criminal analysis. This paper describes a concept of integrating a large-scale Web crawling system with multimedia analysis algorithms. The crawling system, which processes a few hundred pages per second, provides a mechanism for plugin inclusion. A plugin can analyze processed resources and detect references to multimedia materials. The references are passed to a component which implements an algorithm for image or video analysis. Several approaches to the integration are described and some exemplary implementation assumptions are presented.

Keywords: Web crawling, image analysis

1 Introduction

The World Wide Web is probably the greatest publicly available base of electronic data. It contains huge amounts of information on almost every subject. The last two decades brought significant changes in the sources of information: so-called Web 2.0 solutions allowed users of the Web to become authors of Web page content. This phenomenon has many side effects, including partial anonymity of the information source. As a result, the content of Web pages is of increasing interest to law enforcement services. It is relatively easy to find or analyze text on a Web page; many popular and powerful tools for that purpose are available on the Internet. Publicly available tools, like Google Search, have some limitations, but an experienced user can make good use of them.
Because of that, people rarely place textual symptoms of criminal activities on publicly available pages. It is much easier to hide information by embedding it in a multimedia material, like an image or a movie. Analysis of such materials is much more complex and time-consuming. Public search systems can only perform basic operations of this kind, leaving a lot of information unrecognized. In order to perform large-scale tests in the domain of Web page content analysis, a Web crawling system is needed. It is relatively easy to create or adopt
a solution that would be capable of processing tens of thousands of Web pages. Some examples of such tools written in Java, which can be easily downloaded from the Internet, are: WebSPHINX [2], which can use user-defined processing algorithms and provides a graphical interface for visualizing the crawling process; Nutch, which can be integrated with Lucene [3], a text indexing system; and Heritrix [4], used by the Internet Archive digital library, which stores changes to Web pages over recent years. However, if a crawling system is required to process tens of millions of Web pages, a complex scale problem arises. Some of the available solutions can handle this kind of problem, but deploying such a system is not as easy as running an off-the-shelf application. The most obvious scale problem is the number of pages in the Web. In 2008 Google engineers announced that the Google Search engine had discovered one trillion unique URLs [6]. Assuming that one URL takes over 20 characters, the amount of space required for storing the URLs alone is more than 18 terabytes. Surprisingly, a list of unique words found on Web pages should not cause scale problems: a system indexing over 10^8 pages found a vocabulary that cannot be held in memory, but can be located on a hard drive of a single computer. However, searching Web pages by words requires the system to build an inverted index, which stores a list of URLs for each word. Assuming that each URL is identified by an 8-byte integer and an average Web page contains 1000 words, the index requires more than 7.2 petabytes. Another issue related to the scale of the problem is page refresh time, which determines how often a crawler should visit a particular page. If each page is to be visited once a year (which is rather rare), the system must process more than 31,000 pages per second. Assuming that an average page is bigger than 1 KB, a network bandwidth of 31 MB per second is a minimum requirement.
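The back-of-envelope figures above can be reproduced in a few lines; the constants (20 bytes per URL, 1000 words per page, 8-byte identifiers, one revisit per page per year) are taken from the text, while the helper names are illustrative only:

```java
public class CrawlScale {
    // Space needed to store one URL string per page, in tebibytes.
    static double urlStoreTB(double urls, int bytesPerUrl) {
        return urls * bytesPerUrl / Math.pow(2, 40);
    }

    // Inverted index size in pebibytes: one fixed-size id per (word, page) pair.
    static double invertedIndexPB(double urls, int wordsPerPage, int idBytes) {
        return urls * wordsPerPage * idBytes / Math.pow(2, 50);
    }

    // Required throughput if every page is revisited once a year.
    static double pagesPerSecond(double urls) {
        return urls / (365.0 * 24 * 3600);
    }

    public static void main(String[] args) {
        double urls = 1e12; // one trillion unique URLs (Google, 2008)
        System.out.printf("URL store:      %.1f TB%n", urlStoreTB(urls, 20));
        System.out.printf("Inverted index: %.1f PB%n", invertedIndexPB(urls, 1000, 8));
        System.out.printf("Throughput:     %.0f pages/s%n", pagesPerSecond(urls));
    }
}
```

Note that the inverted-index estimate comes out at roughly 7.1 PiB in strict binary units, close to the 7.2 PB quoted; the exact value depends on whether decimal or binary prefixes are assumed.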
It is obvious that such a huge amount of data cannot be processed by a single computer. It is necessary to design a distributed system for parallel processing of well-defined subtasks. In this paper a concept of integrating a large-scale Web crawling system with multimedia analysis algorithms is described. The concept is being tested on a crawling system which has been developed at the Department of Computer Science, AGH UST.

2 Architecture of the crawling system

An exemplary crawling system, which has been created at the Department of Computer Science, AGH UST [1], is designed to handle hundreds of millions of Web pages. It runs on a cluster of 10 computers and handles up to a few hundred pages per second. The performance of the system can be improved easily by adding more computers to the cluster. There are two main subsystems of the crawling system:
1. the cluster management subsystem, and
2. the node management subsystem.

The cluster management subsystem is responsible for the integration of all nodes into one uniform system. It provides an administrator Web interface and controls the process of Web page crawling on all nodes. The process is controlled by assigning Internet domains to particular nodes. Each domain (and all its subdomains) can be assigned to one node only, so there is no risk of processing the same URL on multiple nodes. The domain-based distribution of tasks is also used for load balancing: each node gets approximately the same number of domains to process. The cluster management subsystem is also responsible for providing an API for search services. The services use the node management subsystems' search services to perform the search on all nodes, then aggregate and return results. Two basic elements of the API allow users to search for:

1. URLs of Web pages containing specified words,
2. the content of an indexed Web page with a specified URL.

The node management subsystem manages the crawling process on a single node. It is controlled by the assigned set of domains to process. Each domain has a large set of URLs to be downloaded and analyzed each time the domain is refreshed. During processing, new URLs can be found. One of three cases can occur:

1. the URL belongs to the domain being processed: it is added to the current queue of URLs,
2. the URL belongs to a domain assigned to this node: it is stored in a database of URLs,
3. the URL belongs to a different domain: it is sent to the cluster management subsystem.

Each domain is processed by a single crawler. The node management subsystem runs tens of crawlers simultaneously, starting new ones if sufficient system resources are available. Each crawler has a list of URLs to be processed. The URLs are processed sequentially: the crawler has a single processing thread.
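The three-way routing rule for newly discovered URLs can be sketched as follows. This is a hypothetical reconstruction: the class and field names are assumptions, and a real implementation would match domains on label boundaries rather than with a plain suffix test:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class UrlRouter {
    private final String currentDomain;        // domain this crawler is processing
    private final Set<String> assignedDomains; // all domains assigned to this node
    final Queue<String> currentQueue = new ArrayDeque<>(); // case 1 destination
    final List<String> nodeDatabase = new ArrayList<>();   // case 2: local URL DB
    final List<String> clusterOutbox = new ArrayList<>();  // case 3: cluster manager

    UrlRouter(String currentDomain, Set<String> assignedDomains) {
        this.currentDomain = currentDomain;
        this.assignedDomains = assignedDomains;
    }

    void route(String url) {
        String host = URI.create(url).getHost();
        if (host == null) return; // relative or malformed reference: skip in this sketch
        if (host.endsWith(currentDomain)) {
            currentQueue.add(url);            // same domain (or subdomain)
        } else if (assignedDomains.stream().anyMatch(host::endsWith)) {
            nodeDatabase.add(url);            // another domain owned by this node
        } else {
            clusterOutbox.add(url);           // foreign domain: hand to the cluster
        }
    }
}
```

Because each domain (with its subdomains) belongs to exactly one node, this rule alone guarantees that no URL is ever processed on two nodes.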
Sequential processing of a single domain can take a long time (a few days) if the domain is large or the connection is slow. However, in this application a long processing time is not a drawback, as the particular time of indexing is not important. Moreover, this solution is less prone to simple DoS filters that detect the number of queries per second. It is also easier to implement due to the reduced amount of thread synchronization required. The sequence of actions performed by a crawler during the processing of a single URL is shown in Figure 1. There are five main steps of resource processing:

1. Downloading the resource: performed by a specialized component, which limits the maximum download time and controls the size of the resource.
Fig. 1. Resource processing performed by a crawler (the Processing controller passes the page source to the Lexer, the resource model to the Indexer and plugins such as the URL detector, with vocabulary, index and content data stores)

2. Lexical analysis: performed on the source of the resource, producing a unified resource model. The source is: (a) converted to Unicode, (b) split into segments during HTML structure analysis, (c) divided into sequences of tokens and words, (d) converted to token and word identifiers, which are stored in dynamically extended vocabularies.
3. Content indexing: performed on a resource model by an Indexer component. An inverted index of words and the stored content of the resource are updated.
4. Content analysis: performed by specialized plugins. Each plugin receives a resource model, performs particular operations and stores the results in its own database. One of the plugins, the URL detector, locates all URLs in the resource.
5. URL management: performed by the Processing controller. URLs are added to a local queue, stored in a database or sent to the cluster management subsystem, depending on the domain.

Currently the system indexes pages in the Polish language. Polish language detection is based on the count of Polish-specific characters detected in the Web page content. The system is implemented in Java; it runs on a Linux OS in the JBoss application server [5]. Communication between computers in the cluster is based on EJB components. Most of the storage uses the MySQL DBMS, which has been carefully tuned for this application. Each of the servers has a single four-core CPU, 8 GB of RAM and 3 TB of storage.
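The character-count heuristic for Polish detection might look like the sketch below. The text does not give the decision rule, so the relative-frequency threshold here is purely an assumption made for illustration:

```java
public class PolishDetector {
    // Letters that occur in Polish but not in plain ASCII text.
    private static final String POLISH_CHARS = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ";

    // Classify a page as Polish when Polish-specific letters exceed a small
    // fraction of its length (the 1% threshold is an assumed value).
    public static boolean isPolish(String text) {
        if (text.isEmpty()) return false;
        long count = text.chars()
                         .filter(c -> POLISH_CHARS.indexOf(c) >= 0)
                         .count();
        return (double) count / text.length() > 0.01;
    }
}
```

A frequency ratio rather than an absolute count keeps the rule independent of page length.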
3 Integration with multimedia analysis algorithms

The crawling system presented in the previous section can be integrated with a multimedia analysis algorithm in several ways. In general, a plugin for the crawling system has to be developed. It has to detect references to multimedia materials in the resource model. After a reference is found, several options are available:

- download the material and perform the analysis within the plugin,
- use a remote service to download and analyze the material, and collect the results,
- feed a remote, independent service with references.

Each of these solutions has some advantages and drawbacks. The integrated solution is limited to the Java technology, and is therefore not universal. Many processing algorithms are written in different languages and cannot be directly integrated with the crawler. Moreover, this type of algorithm can have very large resource requirements, which can cause uncontrolled JVM errors and shutdowns. A remote synchronous service, which downloads and processes a material autonomously, can be implemented in any technology and run on a remote computer. However, this type of solution can cause synchronization problems: the crawling system has to wait for results, not knowing how much time this is going to take. The crawling system uses several computers and hundreds of crawlers working simultaneously, so adjusting the performance of multimedia processing can be a hard task. The last solution, which uses a remote asynchronous service, has the smallest impact on the crawling system. However, feeding the remote service with too many references to multimedia materials can easily result in service overload. To reduce the number of queries to the multimedia analysis service, the plugin integrated into the crawling system can perform some filtering. An exemplary application of the approach described in this paper will be responsible for finding images which contain faces of people useful for identification.
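A simple pre-filter of this kind might inspect the attributes of an img reference and skip images that are declared too small to contain an identifiable face. The sketch below is an assumption, not the system's actual filtering criteria; in particular the 100-pixel cut-off is chosen only for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgFilter {
    // Matches width/height attributes such as width="640" or height=480.
    private static final Pattern ATTR =
        Pattern.compile("(width|height)\\s*=\\s*\"?(\\d+)");

    // Reject images whose declared dimensions are too small to hold a
    // recognizable face (icons, bullets, layout graphics).
    public static boolean worthAnalyzing(String imgTag) {
        Matcher m = ATTR.matcher(imgTag);
        while (m.find()) {
            if (Integer.parseInt(m.group(2)) < 100) return false;
        }
        return true; // no size hints, or all declared dimensions large enough
    }
}
```

Filtering on tag attributes alone costs nothing extra, since the resource model already contains the parsed HTML; only the images that pass are forwarded to the analysis service.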
Therefore the plugin will analyze the surroundings of an img tag, verifying specified criteria. This should reduce the number of analyzed images by excluding most graphical elements of a Web page. For test purposes, the second solution will be adopted. A remote service running on a remote server will be asked to analyze images which are suspected of containing faces of people. The crawling system will wait for the results, using a specified timeout.

4 Applications

There are obviously many practical applications of the solution described in this paper. Some of them are already available in public Web search engines, like Google Images: a user can find images of a specified size or type, or even images which contain faces.
However, many other applications which may be useful in criminal analysis are not available there, so the use of a dedicated system is required. Three different groups of applications have been identified:

1. text detection and recognition,
2. face detection and recognition,
3. crime identification.

Plain text located in the source of a Web page (HTML) can be found and analyzed easily. However, one could use images or movies to publish information on the Web which would not be indexed by a text-based crawler. An image analysis algorithm, which can detect such images and recognize the text, can be used to enrich the index of such a page. This information can be integrated with regular text-based indexing or searched separately. Face detection and face recognition can be used by identity analysis systems. If an individual whose Internet activities are suspected is to be identified, a system can help to locate and associate pages containing information about the individual. Moreover, face recognition can help to link a person active at different locations on the Web by finding her/his face in several images. People can also be associated with each other by finding distinctive elements present in multimedia materials; car plates are a good example of such objects. In some cases, evidence of criminal activities is published on the Web. The most typical examples of such publications are movies submitted to video-sharing services. This type of material can be analyzed by an advanced video processing component, searching for defined patterns or text elements. Another very important case of illegal multimedia material on the Web is child pornography. Despite the fact that publishing such material is a serious crime in most countries, such situations still occur. An algorithm for detecting child pornography in multimedia materials, integrated with a crawling system, could significantly improve the detectability of such crimes.
The crawling system presented in the previous section is being integrated with an image analysis component developed at the Department of Telecommunications, AGH UST [7]. The component is able to find the number of faces in a given image. This information will be gathered by the crawler and used to improve the quality of identity analysis systems. In the near future other features of the image analysis component, like identification of fascist content or child pornography detection, will be used.

5 Conclusions

The concept described in this paper is based on the idea of integrating a Web crawling system with a multimedia analysis algorithm. These two elements can create a system for detecting various kinds of content present on the Web, which are
impossible to find using text-based indexes. Such tools can be used by law enforcement services in many different situations. The crawling system presented in this paper is being integrated with a multimedia analysis algorithm to prove the correctness of the approach. Many technical issues concerning the integration have already been solved. Further research and tests of the system are aimed at providing high-quality tools useful in criminal analysis and crime detection.

Acknowledgments. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/ ) under grant agreement nr .

References

1. Opalinski A., Turek W.: Information retrieval and identity analysis. In: Metody sztucznej inteligencji w dzialaniach na rzecz bezpieczenstwa publicznego.
2. Miller R. C., Bharat K.: SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers. In: Proceedings of WWW7, Brisbane, Australia, 1998.
3. Shoberg J.: Building Search Applications with Lucene and Nutch. Apress.
4. Sigurdsson K.: Incremental crawling with Heritrix. In: Proceedings of the 5th International Web Archiving Workshop, 2005.
5. Marrs T., Davis S.: JBoss at Work: A Practical Guide. O'Reilly.
6. Alpert J., Hajaj N.: We knew the web was big... The Official Google Blog, 2008.
7. Korus P., Glowacz A.: A system for automatic face indexing. Przeglad Telekomunikacyjny, Wiadomosci Telekomunikacyjne 81(8-9), 2008.
More informationMap-Reduce. John Hughes
Map-Reduce John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days Hundreds of ad-hoc
More informationCHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER
CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but
More informationA crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.
A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,
More informationDesign and Implementation of Agricultural Information Resources Vertical Search Engine Based on Nutch
619 A publication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 51, 2016 Guest Editors: Tichun Wang, Hongyang Zhang, Lei Tian Copyright 2016, AIDIC Servizi S.r.l., ISBN 978-88-95608-43-3; ISSN 2283-9216 The
More informationCSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA
CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers
More informationHitachi Storage Command Portal Installation and Configuration Guide
Hitachi Storage Command Portal Installation and Configuration Guide FASTFIND LINKS Document Organization Product Version Getting Help Table of Contents # MK-98HSCP002-04 Copyright 2010 Hitachi Data Systems
More informationRed Hat Application Migration Toolkit 4.0
Red Hat Application Migration Toolkit 4.0 Getting Started Guide Simplify Migration of Java Applications Last Updated: 2018-04-04 Red Hat Application Migration Toolkit 4.0 Getting Started Guide Simplify
More informationData Intensive Computing SUBTITLE WITH TWO LINES OF TEXT IF NECESSARY PASIG June, 2009
Data Intensive Computing SUBTITLE WITH TWO LINES OF TEXT IF NECESSARY PASIG June, 2009 Presenter s Name Simon CW See Title & and Division HPC Cloud Computing Sun Microsystems Technology Center Sun Microsystems,
More informationBigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis
BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis Motivation Lots of (semi-)structured data at Google URLs: Contents, crawl metadata, links, anchors, pagerank,
More informationThe MOSIX Scalable Cluster Computing for Linux. mosix.org
The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What
More informationComputer Caches. Lab 1. Caching
Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main
More informationMap-Reduce (PFP Lecture 12) John Hughes
Map-Reduce (PFP Lecture 12) John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days
More informationScaling Without Sharding. Baron Schwartz Percona Inc Surge 2010
Scaling Without Sharding Baron Schwartz Percona Inc Surge 2010 Web Scale!!!! http://www.xtranormal.com/watch/6995033/ A Sharding Thought Experiment 64 shards per proxy [1] 1 TB of data storage per node
More informationSearching the Web for Information
Search Xin Liu Searching the Web for Information How a Search Engine Works Basic parts: 1. Crawler: Visits sites on the Internet, discovering Web pages 2. Indexer: building an index to the Web's content
More informationEvaluating the Usefulness of Sentiment Information for Focused Crawlers
Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,
More informationHadoop and Map-reduce computing
Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.
More informationThis tutorial is designed for all Java enthusiasts who want to learn document type detection and content extraction using Apache Tika.
About the Tutorial This tutorial provides a basic understanding of Apache Tika library, the file formats it supports, as well as content and metadata extraction using Apache Tika. Audience This tutorial
More informationDay 3. Storage Devices + Types of Memory + Measuring Memory + Computer Performance
Day 3 Storage Devices + Types of Memory + Measuring Memory + Computer Performance 11-10-2015 12-10-2015 Storage Devices Storage capacity uses several terms to define the increasing amounts of data that
More informationLecture 1: January 22
CMPSCI 677 Distributed and Operating Systems Spring 2018 Lecture 1: January 22 Lecturer: Prashant Shenoy Scribe: Bin Wang 1.1 Introduction to the course The lecture started by outlining the administrative
More informationDistributed Systems Principles and Paradigms. Chapter 12: Distributed Web-Based Systems
Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science steen@cs.vu.nl Chapter 12: Distributed -Based Systems Version: December 10, 2012 Distributed -Based Systems
More informationCSE 373: Data Structures and Algorithms. Memory and Locality. Autumn Shrirang (Shri) Mare
CSE 373: Data Structures and Algorithms Memory and Locality Autumn 2018 Shrirang (Shri) Mare shri@cs.washington.edu Thanks to Kasey Champion, Ben Jones, Adam Blank, Michael Lee, Evan McCarty, Robbie Weber,
More informationFull-Text Indexing For Heritrix
Full-Text Indexing For Heritrix Project Advisor: Dr. Chris Pollett Committee Members: Dr. Mark Stamp Dr. Jeffrey Smith Darshan Karia CS298 Master s Project Writing 1 2 Agenda Introduction Heritrix Design
More informationDark Web. Ronald Bishof, MS Cybersecurity. This Photo by Unknown Author is licensed under CC BY-SA
Dark Web Ronald Bishof, MS Cybersecurity This Photo by Unknown Author is licensed under CC BY-SA Surface, Deep Web and Dark Web Differences of the Surface Web, Deep Web and Dark Web Surface Web - Web
More informationWorksheet - Storing Data
Unit 1 Lesson 12 Name(s) Period Date Worksheet - Storing Data At the smallest scale in the computer, information is stored as bits and bytes. In this section, we'll look at how that works. Bit Bit, like
More informationCrawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server
Authors: Sergey Brin, Lawrence Page Google, word play on googol or 10 100 Centralized system, entire HTML text saved Focused on high precision, even at expense of high recall Relies heavily on document
More informationEfficient Web Crawling for Large Text Corpora
Efficient Web Crawling for Large Text Corpora Vít Suchomel Natural Language Processing Centre Masaryk University, Brno, Czech Republic xsuchom2@fi.muni.cz Jan Pomikálek Lexical Computing Ltd. xpomikal@fi.muni.cz
More informationChunyan Wang Electrical and Computer Engineering Dept. National University of Singapore
Chunyan Wang Electrical and Computer Engineering Dept. engp9598@nus.edu.sg A Framework of Integrating Network QoS and End System QoS Chen Khong Tham Electrical and Computer Engineering Dept. eletck@nus.edu.sg
More informationFILTERING OF URLS USING WEBCRAWLER
FILTERING OF URLS USING WEBCRAWLER Arya Babu1, Misha Ravi2 Scholar, Computer Science and engineering, Sree Buddha college of engineering for women, 2 Assistant professor, Computer Science and engineering,
More informationACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
More informationResearch and implementation of search engine based on Lucene Wan Pu, Wang Lisha
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,
More informationDATA MINING II - 1DL460. Spring 2014"
DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationTowards the Performance Visualization of Web-Service Based Applications
Towards the Performance Visualization of Web-Service Based Applications Marian Bubak 1,2, Wlodzimierz Funika 1,MarcinKoch 1, Dominik Dziok 1, Allen D. Malony 3,MarcinSmetek 1, and Roland Wismüller 4 1
More informationNew research on Key Technologies of unstructured data cloud storage
2017 International Conference on Computing, Communications and Automation(I3CA 2017) New research on Key Technologies of unstructured data cloud storage Songqi Peng, Rengkui Liua, *, Futian Wang State
More informationFirepoint: Porting Application to Mobile Platforms
Firepoint: Porting Application to Mobile Platforms Artem Timonin, Artem Kalinin, Alexander Troshkov, Kirill Kulakov Petrozavodsk State University (PetrSU) Petrozavodsk, Republic Karelia, Russia (timonin,
More informationUsing Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies
Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies Giulio Barcaroli 1 (barcarol@istat.it), Monica Scannapieco 1 (scannapi@istat.it), Donato Summa
More informationAround the Web in Six Weeks: Documenting a Large-Scale Crawl
Around the Web in Six Weeks: Documenting a Large-Scale Crawl Sarker Tanzir Ahmed, Clint Sparkman, Hsin- Tsang Lee, and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering
More informationIntroduction. Table of Contents
Introduction This is an informal manual on the gpu search engine 'gpuse'. There are some other documents available, this one tries to be a practical how-to-use manual. Table of Contents Introduction...
More informationTHE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY
THE COMPLETE GUIDE COUCHBASE BACKUP & RECOVERY INTRODUCTION Driven by the need to remain competitive and differentiate themselves, organizations are undergoing digital transformations and becoming increasingly
More information