FCS Documentation. Release 1.0 AGH-GLK


FCS Documentation Release 1.0 AGH-GLK November 06, 2014

Contents

1 Quickstart
2 FCS basics
  2.1 Registration
  2.2 Main page
  2.3 List of tasks
  2.4 Create new task
  2.5 Edit existing task
  2.6 Send feedback
  2.7 Download crawling results
3 Management module (fcs.manager)
4 Crawling Unit module (fcs.crawler)
5 Task Server module (fcs.server)
6 Crawling results decoder (fcs.content_file_decoder)
7 Indices and tables


CHAPTER 1 Quickstart

A short guide to launching Focused Crawling Search (FCS).

Note: A Unix-based operating system and Vagrant (preferably 1.35 or higher) are required.

1. Download the project code from the GitHub repository.
2. Change directory into /fcs.
3. In the command line, type vagrant up. A virtual machine with all requirements will be provisioned. Its IP address is 192.168.0.2.
4. Start a second shell. In both shells:
   - connect to the machine with vagrant ssh,
   - activate the Python virtual environment: source ./fcs/bin/activate,
   - move to the main directory of the FCS Management web application: cd /vagrant/fcs.
5. In the first terminal:
   - create the database: python manage.py syncdb,
   - apply the database migrations: python manage.py migrate,
   - set the Userena permissions: python manage.py check_permissions,
   - start the web application server on local port 8000: python manage.py runserver 192.168.0.2:8000.
6. In the second terminal, start the Autoscaling module: python manage.py autoscaling 192.168.0.2.
7. Open a browser and go to http://192.168.0.2:8000.
8. Register a new user. The activation mail should be displayed in the console.
9. Log in.
10. Click Tasks, then Add new. Fill in the form and confirm with the Add button.
11. The crawling process will begin soon. You can monitor it in the terminal windows and in the crawler and server logs located in ./fcs/fcs.
12. In the task details (hyperlink in the tasks table) you can download the crawling results.


CHAPTER 2 FCS basics

2.1 Registration

1. Click the Register button on the main page.
2. Fill in the user name, a valid email address and the password (entered twice).
3. Confirm with the Register button.
4. Check your email. A registration message should be waiting for you. Click the link in the message.
5. Your account is now activated. You can log in with your email or login, and your password.

2.2 Main page

On the main page you can choose:

- API: see the REST API documentation,
- Tasks: display information about crawling tasks,
- Change password: change your password,
- Change your data: modify the details of your account,
- Show quota: check your limits for creating tasks,
- API keys: view the keys required for using the REST API,
- Logout: finish working with the system.

2.3 List of tasks

This page presents all tasks of the current user. Tasks can be active (yellow rows), paused (grey rows) or finished (green rows). To reduce the number of rows in the table, filter them with the two select lists and the Filter button.

2.4 Create new task

1. Click the Add button under the tasks list table.

2. Fill in the form. Unless stated otherwise, all fields are mandatory.
3. In the first row, specify the task's name.
4. Fill the priority field with a number from 1 to 10. The higher the number, the more important this task is compared with the same user's other tasks.
5. Give the start links, separated by whitespace.
6. In the whitelist field you can specify a comma-separated list of regular expressions describing the URLs that may be processed. If you leave this field empty, all URLs will be crawled.
7. The blacklist is a comma-separated list of regular expressions describing URLs that must not be crawled. Optional.
8. In the next field, set the maximum number of pages that can be crawled.
9. In the Expire field, select the date on which the task expires at the latest.
10. In the last field you can type a list of MIME types that the crawler should process.
11. Submit the form with Add.
12. If a success message appears, the task was created successfully.

2.5 Edit existing task

1. Click one of the rows in the tasks table.
2. If the task is finished, you cannot change anything.

3. If the task is running or paused, you can change some of its parameters, pause or resume it, stop it, or get the crawling results.
4. After modifying the task, click Save changes.

2.6 Send feedback

If a task is running or paused, you can rate pages on the task edit page. To send feedback to the Task Server, specify a URL and a rating. A rating higher than 3 means that the link is valuable; a rating lower than 3 means that the link is useless. Confirm with the Send feedback button.

2.7 Download crawling results

On the same page you can also download the crawling results. Click Get data. In the window that appears, set the size in MB of the file containing part of the results. Click OK and the download should begin.
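The whitelist and blacklist fields described in section 2.4 can be illustrated with a minimal Python sketch. This is not the FCS implementation: the function names split_patterns and is_allowed are hypothetical, and the sketch assumes that patterns are matched anywhere in the URL (re.search) and that a blacklist match takes precedence over a whitelist match. The manual does not state either rule.

```python
import re


def split_patterns(field):
    """Split a comma-separated form field into compiled regular expressions."""
    return [re.compile(part.strip()) for part in field.split(",") if part.strip()]


def is_allowed(url, whitelist, blacklist):
    """Decide whether a URL may be crawled under the given pattern lists."""
    # Assumption: a blacklist match always wins over a whitelist match.
    if any(pattern.search(url) for pattern in blacklist):
        return False
    # An empty whitelist means that every URL may be crawled.
    if not whitelist:
        return True
    return any(pattern.search(url) for pattern in whitelist)


# Example form values (hypothetical).
whitelist = split_patterns(r"^https?://example\.com/, ^https?://docs\.example\.com/")
blacklist = split_patterns(r"\.pdf$")
start_links = "http://example.com/ http://docs.example.com/index.html".split()

for link in start_links:
    print(link, is_allowed(link, whitelist, blacklist))
```

Because the form separates patterns with commas, a regular expression that itself contains a comma (for example a quantifier such as {2,3}) would be split in two by this sketch.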


CHAPTER 3 Management module (fcs.manager)

The Management module is a web application implemented in the Django framework. It is responsible for managing user accounts and handling crawling requests from clients. The Management module provides:

- admin account management,
- users' task management,
- an email notification system,
- a client REST API.

api_urls
api_views
autoscale_views
forms
middleware
models
tasks
urls
views
management/commands/autoscaling


CHAPTER 4 Crawling Unit module (fcs.crawler)

The fcs.crawler module contains classes that implement the Crawling Unit. Crawling Units execute clients' tasks. Each Crawling Unit receives a pool of URIs to fetch from a Task Server. A single Crawling Unit can perform several crawling tasks simultaneously. Crawling results and other information (such as errors) are returned to the Task Server.

content_parser
crawler
mime_content_type
thread_with_exc
web_interface


CHAPTER 5 Task Server module (fcs.server)

This module contains the implementation of the Task Server. Each Task Server handles only one task at a time. This does not mean that one physical machine corresponds to only one Task Server, since the model is logical. Each Task Server has its own database for storing links and crawled data.

content_db
crawling_depth_policy
data_base_policy_module
graph_db
link_db
task_server
url_processor
web_interface


CHAPTER 6 Crawling results decoder (fcs.content_file_decoder)

A script that unpacks *.dat files containing crawling results.

Usage: python script.py <file_location> <unpacked_directories_structure_location>

The script creates a tree of directories. Every leaf directory holds two files: url_links.txt (the page URL on the first line, the extracted links separated by whitespace on the second) and content.dat, which contains the resource decoded from Base64. Higher-level directories are named with integers. Additionally, the file index.txt maps each directory name to its page URL.
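The unpacked directory structure described above can be read back with a short Python sketch. This is an illustration only, not part of FCS: the helper names read_leaf and iter_pages are hypothetical, the sketch assumes Python 3 and UTF-8 encoded url_links.txt files, and it ignores index.txt because the manual does not specify that file's exact format.

```python
import os


def read_leaf(leaf_dir):
    """Return (url, links, content) for one unpacked leaf directory."""
    # Assumption: url_links.txt is UTF-8 text.
    with open(os.path.join(leaf_dir, "url_links.txt"), encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    # First line: the page URL. Second line: links separated by whitespace.
    url = lines[0].strip() if lines else ""
    links = lines[1].split() if len(lines) > 1 else []
    # content.dat holds the already decoded resource, so read it as raw bytes.
    with open(os.path.join(leaf_dir, "content.dat"), "rb") as handle:
        content = handle.read()
    return url, links, content


def iter_pages(root):
    """Yield (url, links, content) for every leaf directory under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "url_links.txt" in filenames and "content.dat" in filenames:
            yield read_leaf(dirpath)
```

A typical use would be to loop over iter_pages(path), where path is the directory passed to the decoder script as its second argument, and inspect each page's URL and number of extracted links.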


CHAPTER 7 Indices and tables

- genindex
- modindex
- search