FCS Documentation
Release 1.0
AGH-GLK
November 06, 2014
Contents

1 Quickstart
2 FCS basics
  2.1 Registration
  2.2 Main page
  2.3 List of tasks
  2.4 Create new task
  2.5 Edit existing task
  2.6 Send feedback
  2.7 Download crawling results
3 Management module (fcs.manager)
4 Crawling Unit module (fcs.crawler)
5 Task Server module (fcs.server)
6 Crawling results decoder (fcs.content_file_decoder)
7 Indices and tables
CHAPTER 1 Quickstart

A short guide to launching Focused Crawling Search.

Note: a Unix-based operating system and Vagrant (1.35 or higher preferred) are required.

1. Download the project code from the Github repository.
2. Change directory to /fcs.
3. On the command line type vagrant up. A virtual machine with all requirements will be provisioned; its IP address is 192.168.0.2.
4. Start a second shell. In both of them: connect to the machine with vagrant ssh, activate the Python virtual environment with source ./fcs/bin/activate, and move to the FCS Management web application main directory: cd /vagrant/fcs.
5. In the first terminal: create the database with python manage.py syncdb, apply database migrations with python manage.py migrate, set Userena permissions with python manage.py check_permissions, and start the web application server on local port 8000 with python manage.py runserver 192.168.0.2:8000.
6. In the second terminal window start the Autoscaling module: python manage.py autoscaling 192.168.0.2.
7. Open a browser and go to http://192.168.0.2:8000.
8. Register a new user. The activation mail should be displayed in the console.
9. Log in.
10. Click Tasks, then Add new. Fill in the form and confirm with the Add button.
11. The crawling process will begin soon. You can monitor it in the terminal windows and in the crawler and server logs located in ./fcs/fcs.
12. In the task details (hyperlink in the tasks table) you can download the crawling results.
CHAPTER 2 FCS basics

2.1 Registration

1. Click the Register button on the main page.
2. Fill in the fields: user name, a valid email, and password (the same one twice).
3. Confirm with the Register button.
4. Check your email; a registration message should be waiting for you. Click the link in the email content.
5. Your account is activated. You can now log in with your email or login, and your password.
2.2 Main page

From the main page you can use:

- API - see the REST API documentation,
- Tasks - display information about crawling tasks,
- Change password - change your password,
- Change your data - modify the details of your account,
- Show quota - check your task-creation limits,
- API keys - view the keys required for using the REST API,
- Logout - finish working with the system.

2.3 List of tasks

This page presents all tasks of the current user. They can be active (yellow rows), paused (grey rows), or finished (green rows). To reduce the number of rows in the table, filter them with the two select lists and the Filter button.

2.4 Create new task

1. Click the Add button under the tasks list table.
2. Fill in the form below. Unless stated otherwise, all fields are mandatory.
3. In the first row specify the task's name.
4. Fill the priority field with a number from 1 to 10. The higher it is, the more important this task is in comparison with the user's other tasks.
5. Give the start links, separated with whitespace.
6. In the whitelist field you can specify a list of regular expressions (separated with commas) describing URLs which may be processed. If you leave this input empty, all URLs will be crawled.
7. Blacklist is a list of regular expressions describing URLs which must not be crawled. Optional.
8. In the next field set the maximal number of pages which can be crawled.
9. Select the task's expiry date in the Expire field.
10. In the last input you can type a list of MIME types which should be processed by the crawler.
11. Send the form with Add.
12. If you see a message like the one below, the task was created successfully.

2.5 Edit existing task

1. Click one of the rows in the tasks table.
2. If the task is finished, you cannot change anything. The view should look like this:
3. If the task is running or paused, you can change some of its parameters, pause/resume it, stop it, or get the crawling results:
4. After modifying the task, click Save changes.

2.6 Send feedback

If the task is running or paused, you can rate pages on the task edition page. To send feedback to the Task Server, specify a URL and a rating. A rating higher than 3 means the link is valuable; lower means the link is useless. Confirm with the Send feedback button.

2.7 Download crawling results

On the same page you can also download the crawling results. Click Get data. In the window which appears, set the size in MB of the file holding a part of the results. Click OK and the download should begin.
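The whitelist and blacklist semantics of the task form (section 2.4) can be sketched as follows; url_allowed is a hypothetical helper written for illustration, not part of the FCS code base:

```python
import re

def url_allowed(url, whitelist, blacklist):
    """Decide whether a URL may be crawled.

    whitelist/blacklist are comma-separated lists of regular
    expressions, as entered in the task form. An empty whitelist
    means every URL is allowed; a blacklist match always wins.
    (Hypothetical helper; the real FCS matching code may differ.)
    """
    def patterns(field):
        return [p.strip() for p in field.split(',') if p.strip()]

    if any(re.search(p, url) for p in patterns(blacklist)):
        return False
    white = patterns(whitelist)
    if not white:
        return True
    return any(re.search(p, url) for p in white)
```

For example, with whitelist "example\.com" and blacklist "/private", http://example.com/index would be crawled while http://example.com/private would not.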
CHAPTER 3 Management module (fcs.manager)

The Management module is a web application implemented in the Django framework. It is responsible for managing user accounts and for handling crawling requests from clients. The Management module provides:

- admin account management,
- management of users' tasks,
- an email notification system,
- a client REST API.

Submodules: api_urls, api_views, autoscale_views, forms, middleware, models, tasks, urls, views, management/commands/autoscaling
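The fields of the task form in section 2.4 suggest roughly the following task record. This is a plain-Python sketch of the data a task carries, under the assumption that the form maps one-to-one onto stored fields; the actual Django model in fcs.manager.models is not reproduced here:

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class TaskSketch:
    """Hypothetical task record implied by the task-creation form."""
    name: str
    priority: int                       # 1..10, higher = more important
    start_links: List[str]              # whitespace-separated in the form
    whitelist: str = ""                 # comma-separated regexes; empty = allow all
    blacklist: str = ""                 # comma-separated regexes; optional
    max_pages: int = 0                  # maximal number of pages to crawl
    expire: Optional[date] = None       # task expiry date
    mime_types: str = ""                # MIME types the crawler should process

    def __post_init__(self):
        # The form constrains priority to 1..10.
        if not 1 <= self.priority <= 10:
            raise ValueError("priority must be between 1 and 10")
```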
CHAPTER 4 Crawling Unit module (fcs.crawler)

The fcs.crawler module contains the classes that implement the Crawling Unit. Crawling Units execute clients' tasks. Each Crawling Unit receives a pool of URIs to fetch from a Task Server, and a single Crawling Unit can perform several crawling tasks simultaneously. Crawling results and other information (such as errors) are returned to a Task Server.

Submodules: content_parser, crawler, mime_content_type, thread_with_exc, web_interface
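How a single Crawling Unit can interleave several tasks at once can be sketched as below. The round-robin scheduling and all names are assumptions for illustration; the real logic lives in the crawler submodule and is not shown in this documentation:

```python
from collections import deque

class CrawlingUnitSketch:
    """Interleaves URI pools received from Task Servers.

    Each task id maps to a queue of URIs; next_uri() serves the
    tasks in round-robin order, so one large pool cannot starve
    the others. (Hypothetical sketch, not the fcs.crawler code.)
    """
    def __init__(self):
        self.pools = {}           # task_id -> deque of pending URIs
        self.order = deque()      # round-robin order of task ids

    def receive_pool(self, task_id, uris):
        """Accept a pool of URIs to fetch for one task."""
        if task_id not in self.pools:
            self.pools[task_id] = deque()
            self.order.append(task_id)
        self.pools[task_id].extend(uris)

    def next_uri(self):
        """Return (task_id, uri) for the next fetch, or None if idle."""
        for _ in range(len(self.order)):
            task_id = self.order[0]
            self.order.rotate(-1)      # move this task to the back
            if self.pools[task_id]:
                return task_id, self.pools[task_id].popleft()
        return None
```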
CHAPTER 5 Task Server module (fcs.server)

This module contains the implementation of the Task Server. Each Task Server is responsible for handling exactly one task at a time. This does not mean that one physical machine corresponds to only one Task Server, since this model is logical. Each Task Server has its own database for storing links and crawled data.

Submodules: content_db, crawling_depth_policy, data_base_policy_module, graph_db, link_db, task_server, url_processor, web_interface
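A Task Server keeps its own store of links. A minimal sketch of the deduplicating behaviour such a store needs is shown below; this is an assumption about what link_db does, not its actual implementation:

```python
class LinkDBSketch:
    """Stores discovered links and hands out only unseen ones.

    Hypothetical sketch: an in-memory set for deduplication plus a
    FIFO frontier, standing in for the per-server link database.
    """
    def __init__(self):
        self.seen = set()      # every link ever added
        self.frontier = []     # links not yet handed to a crawler

    def add_links(self, links):
        """Record newly extracted links, skipping duplicates."""
        for link in links:
            if link not in self.seen:
                self.seen.add(link)
                self.frontier.append(link)

    def next_batch(self, size):
        """Hand out up to `size` pending links for crawling."""
        batch, self.frontier = self.frontier[:size], self.frontier[size:]
        return batch
```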
CHAPTER 6 Crawling results decoder (fcs.content_file_decoder)

A script that unpacks *.dat files, the results of crawling. Usage:

python script.py <file_location> <unpacked_directories_structure_location>

The script creates a tree of directories. Every leaf directory contains two files: url_links.txt (the page URL in the first line, the extracted links separated with whitespace in the second) and content.dat, containing the resource decoded from Base64. At the higher level, directories named with integers are stored. Additionally, the file index.txt maps each directory name to a page URL.
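The directory layout described above can be consumed with a short walker. iter_results is a hypothetical helper, assuming exactly the layout the decoder is documented to produce (integer-named directories with url_links.txt and content.dat leaves):

```python
import os

def iter_results(root):
    """Yield (url, links, content) for every leaf directory under root.

    Assumes the layout produced by the decoder script: each leaf
    holds url_links.txt (page URL on line 1, extracted links on
    line 2) and content.dat (the already-decoded resource).
    """
    for dirpath, dirnames, filenames in os.walk(root):
        if "url_links.txt" in filenames:
            with open(os.path.join(dirpath, "url_links.txt")) as f:
                lines = f.read().splitlines()
            url = lines[0]
            links = lines[1].split() if len(lines) > 1 else []
            with open(os.path.join(dirpath, "content.dat"), "rb") as f:
                content = f.read()
            yield url, links, content
```

Running it over an unpacked results directory yields one (url, links, content) triple per crawled page.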
CHAPTER 7 Indices and tables

- genindex
- modindex
- search