Scrapoxy Documentation


Release 3.0.1. Fabien Vauchelles. Jan 29, 2018

Get Started

1 What is Scrapoxy?
2 Documentation
3 Prerequisite
4 Contribute
5 License


CHAPTER 1: What is Scrapoxy?

http://scrapoxy.io

Scrapoxy hides your scraper behind a cloud. It starts a pool of proxies to send your requests, so you can crawl without worrying about blacklisting. It is written in JavaScript (ES6) with Node.js and AngularJS, and it is open source.

1.1 How does Scrapoxy work?

1. When Scrapoxy starts, it creates and manages a pool of proxies.
2. Your scraper uses Scrapoxy as a normal proxy.
3. Scrapoxy routes all requests through the pool of proxies.

1.2 What does Scrapoxy do?

- Create your own proxies
- Use multiple cloud providers (AWS, DigitalOcean, OVH, Vscale)
- Rotate IP addresses
- Impersonate known browsers
- Exclude blacklisted instances
- Monitor the requests
- Detect bottlenecks

- Optimize the scraping

1.3 Why doesn't Scrapoxy support anti-blacklisting?

Anti-blacklisting is a job for the scraper. When the scraper detects blacklisting, it asks Scrapoxy to remove the proxy from the pool (through a REST API). A minimal sketch of such a call is shown at the end of this chapter.

1.4 What is the best scraper framework to use with Scrapoxy?

You could use the open source Scrapy framework (Python).

1.5 Does Scrapoxy have a SaaS mode or a support plan?

Scrapoxy is an open source tool and its source code is actively maintained. You are very welcome to open an issue for features or bugs. If you are looking for a commercial product in SaaS mode or with a support plan, we recommend checking the ScrapingHub products (ScrapingHub is the company that maintains the Scrapy framework).
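To make section 1.3 concrete, here is a minimal Node.js sketch (using the request library) of the kind of call a scraper could make to remove a blacklisted proxy. It assumes the commander runs on the default port 8889 and that the proxy name was taken from the x-cache-proxyname response header; the full REST API is described later in the Documentation chapter.

const request = require('request');

// Commander password from the configuration file (commander.password).
const password = 'YOUR_COMMANDER_PASSWORD';

// proxyName is assumed to come from the 'x-cache-proxyname' header
// of the response that the scraper identified as blacklisted.
function removeBlacklistedInstance(proxyName, callback) {
    request({
        method: 'POST',
        url: 'http://localhost:8889/api/instances/stop',
        json: {name: proxyName},
        headers: {
            'Authorization': new Buffer(password).toString('base64'),
        },
    }, callback);
}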

CHAPTER 2: Documentation

You can begin with the Quick Start or look at the Changelog. Then continue with Standard usage, become an expert with Advanced usage, and finish with the Tutorials.

2.1 Quick Start

This tutorial works on AWS / EC2, in region eu-west-1. See AWS / EC2 - Copy an AMI from a region to another if you want to change region.

2.1.1 Step 1: Get AWS credentials

See Get AWS credentials.

2.1.2 Step 2: Create a security group

See Create a security group.

2.1.3 Step 3A: Run Scrapoxy with Docker

Run the container:

sudo docker run -e COMMANDER_PASSWORD='CHANGE_THIS_PASSWORD' \
    -e PROVIDERS_AWSEC2_ACCESSKEYID='YOUR ACCESS KEY ID' \
    -e PROVIDERS_AWSEC2_SECRETACCESSKEY='YOUR SECRET ACCESS KEY' \
    -it -p 8888:8888 -p 8889:8889 fabienvauchelles/scrapoxy

Warning: Replace PROVIDERS_AWSEC2_ACCESSKEYID and PROVIDERS_AWSEC2_SECRETACCESSKEY with your AWS credentials and parameters.

2.1.4 Step 3B: Run Scrapoxy without Docker

Install Node.js

See the Node Installation Manual. The minimum required version is 4.2.1.

Install Scrapoxy from NPM

Install make:

sudo apt-get install build-essential

And Scrapoxy:

sudo npm install -g scrapoxy

Generate configuration

scrapoxy init conf.json

Edit configuration

1. Edit conf.json
2. In the commander section, replace password with a password of your choice
3. In the providers/awsec2 section, replace accesskeyid, secretaccesskey and region with your AWS credentials and parameters

Start Scrapoxy

scrapoxy start conf.json -d

2.1.5 Step 4: Open Scrapoxy GUI

The Scrapoxy GUI is reachable at http://localhost:8889

2.1.6 Step 5: Connect Scrapoxy to your scraper

Scrapoxy is reachable at http://localhost:8888
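For example, a Node.js scraper using the request library could send its traffic through Scrapoxy like this (a minimal sketch; the target URL is only an example, and tunnel: false selects the "no tunnel" HTTPS mode described in the Understand Scrapoxy section):

const request = require('request');

request({
    method: 'GET',
    url: 'https://api.ipify.org/',     // example target: returns the caller's IP
    proxy: 'http://localhost:8888',    // Scrapoxy endpoint
    tunnel: false,                     // HTTPS over HTTP ("no tunnel" mode)
}, (err, response, body) => {
    if (err) return console.error(err);
    console.log('IP seen by the target:', body);
});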

2.1.7 Step 6: Test Scrapoxy

1. Wait 3 minutes
2. Test Scrapoxy in a new terminal with: scrapoxy test http://localhost:8888
3. Or with curl: curl --proxy http://127.0.0.1:8888 http://api.ipify.org

2.2 Changelog

2.2.1 3.0.1

Features
- digitalocean: support DigitalOcean tags on Droplets. Thanks to Ben Lavalley!
Bug fixes
- digitalocean: use new image size (s-1vcpu-1gb instead of 512mb)

2.2.2 3.0.0

Features
Warning: BREAKING CHANGE! The configuration of providers changes. See the documentation here.
- providers: use multiple providers at a time
- awsec2: provider removes instances in batch every second (and no longer makes thousands of queries)
- ovhcloud: provider creates instances in batch (new API route used)
Bug fixes
- maxrunninginstances: remove the blocking parameter maxrunninginstances

2.2.3 2.4.3

Bug fixes
- node: change minimum version of Node.js to 8
- dependencies: upgrade dependencies to latest version

2.2.4 2.4.2

Bug fixes
- useragent: set the user agent at instance creation, not at startup
- instance: force crashed instances to be removed

2.2.5 2.4.1

Bug fixes
- instance: correctly remove instance when instance is removed. Thanks to Étienne Corbillé!

2.2.6 2.4.0

Features
- provider: add VScale.io provider. Thanks to Hotrush!
Bug fixes
- proxy: use a valid startup script for init.d. Thanks to Hotrush!
- useragent: change useragents with a fresh list for 2017

2.2.7 2.3.10

Features
- docs: add ami-06220275 with type t2.nano
Bug fixes
- instance: remove listeners on instance alive status on instance removal. Thanks to Étienne Corbillé!

2.2.8 2.3.9

Features
- digitalocean: update DigitalOcean documentation
- digitalocean: view only instances from the selected region
- instances: remove random instances instead of the last ones
- pm2: add kill_timeout option for PM2 (thanks to cp2587)

Bug fixes
- digitalocean: limit the number of created instances at each API request
- digitalocean: don't remove locked instances

2.2.9 2.3.8

Features
- docker: create the Docker image fabienvauchelles/scrapoxy
Bug fixes
- template: limit max instances to 2

2.2.10 2.3.7

Features
- connect: Scrapoxy now accepts the full HTTPS CONNECT method. It is useful for browsers like PhantomJS. Thanks to Anis Gandoura!

2.2.11 2.3.6

Bug fixes
- template: replace old AWS AMI with ami-c74d0db4

2.2.12 2.3.5

Features
- instance: change Node.js version to 6.x
- ping: use an HTTP ping instead of a TCP ping. Please rebuild the instance image.

2.2.13 2.3.4

Features
- stats: monitor stop count history
- stats: add 3 more scales: 5m, 10m and 1h
- logs: normalize logs and add more information
- scaling: pop a message when the maximum number of instances is reached in a provider
- scaling: add quick scaling buttons

- docs: explain why Scrapoxy doesn't accept CONNECT mode
- docs: explain how the User Agent is overwritten
Bug fixes
- dependencies: upgrade dependencies
- ovh: monitor DELETED status
- docs: add example to test Scrapoxy with credentials
- commander: handle removing the same instance twice

2.2.14 2.3.3

Bug fixes
- master: sanitize bad request headers
- proxy: catch all socket errors in the proxy instance

2.2.15 2.3.2

Bug fixes
- docs: fall back to markdown for README (because npmjs doesn't like retext)

2.2.16 2.3.1

Features
- docs: add tutorials for Scrapy and Node.js
Bug fixes
- digitalocean: convert Droplet id to string

2.2.17 2.3.0

Features
- digitalocean: add support for DigitalOcean provider

2.2.18 2.2.1

Misc
- config: rename my-config.json to conf.json
- doc: migrate documentation to ReadTheDocs
- doc: link to the new website Scrapoxy.io

2.2.19 2.2.0

Breaking changes
- node: Node.js minimum version is now 4.2.1, to support JS classes
Features
- all: migrate core and gui to ES6, with all best practices
- api: replace Express with Koa
Bug fixes
- test: correct core e2e test

2.2.20 2.1.2

Bug fixes
- gui: correct token encoding for GUI

2.2.21 2.1.1

Bug fixes
- main: add message when all instances are stopped (at the end)
- doc: correct misc stuff in doc

2.2.22 2.1.0

Features
- ovh: add OVH provider with documentation
- security: add basic auth to Scrapoxy (RFC 2617)
- stats: add flow stats
- stats: add scales for stats (1m/1h/1d)

- stats: store stats on the server
- stats: add global stats
- doc: split the documentation in 3 parts: quick start, standard usage and advanced usage
- doc: add tutorials for AWS / EC2
- gui: add a scaling popup instead of direct edit (with integrity check)
- gui: add update popup when the status of an instance changes
- gui: add error popup when GUI cannot retrieve data
- logs: write logs to disk
- instance: add cloud name
- instance: show instance IP
- instance: always terminate an instance when stopping (prefer terminate over stop/start)
- test: allow more than 8 requests (max 1000)
- ec2: force terminate/recreate instance instead of stop/restart
Bug fixes
- gui: emit event when scaling is changed by the engine (before, the event was triggered by the GUI)
- stability: correct a lot of behavior to prevent instance cycling
- ec2: use status name instead of status code

2.2.23 2.0.1

Features
- test: specify the count of requests with the test command
- test: count the requests by IP in the test command
- doc: add GUI documentation
- doc: add API documentation
- doc: explain awake/asleep mode in the user manual
- log: add human readable message at startup

2.2.24 2.0.0

Breaking changes
- commander: API routes are prefixed with /api

Features
- gui: add GUI to control Scrapoxy
- gui: add statistics to the GUI (count of requests / minute, average delay of requests / minute)
- doc: add doc about HTTP headers

2.2.25 1.1.0

Features
- commander: stopping an instance returns the new count of instances
- commander: password is hashed with base64
- commander: read/write config with command (and live update of the scaling)
Misc
- chore: force global install with NPM

2.2.26 1.0.2

Features
- doc: add 2 AWS / EC2 tutorials
Bug fixes
- template: correct template mechanism
- config: correct absolute path for configuration

2.2.27 1.0.1

Misc
- doc: change author and misc information

2.2.28 1.0.0

Features
- init: start of the project

2.3 The MIT License (MIT)

Copyright (c) 2016 Fabien Vauchelles

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

2.4 Configure Scrapoxy

2.4.1 Create configuration

To create a new configuration, use:

scrapoxy init conf.json

2.4.2 Multi-Providers Example configuration

This command creates an example configuration, with 4 providers:

{
  "commander": {
    "password": "CHANGE_THIS_PASSWORD"
  },
  "instance": {
    "port": 3128,
    "scaling": {
      "min": 1,
      "max": 2
    }
  },
  "providers": [
    {
      "type": "awsec2",
      "accesskeyid": "YOUR ACCESS KEY ID",
      "secretaccesskey": "YOUR SECRET ACCESS KEY",
      "region": "YOUR REGION (could be: eu-west-1)",
      "instance": {
        "InstanceType": "t1.micro",
        "ImageId": "ami-c74d0db4",
        "SecurityGroups": [ "forward-proxy" ]
      }
    },
    {
      "type": "ovhcloud",
      "endpoint": "YOUR ENDPOINT (could be: ovh-eu)",
      "appkey": "YOUR APP KEY",
      "appsecret": "YOUR APP SECRET",
      "consumerkey": "YOUR CONSUMER KEY",
      "serviceid": "YOUR SERVICE ID",
      "region": "YOUR REGION (could be: BHS1, GRA1 or SBG1)",
      "sshkeyname": "YOUR SSH KEY (could be: mykey)",
      "flavorname": "vps-ssd-1",
      "snapshotname": "YOUR SNAPSHOT NAME (could be: forward-proxy)"
    },
    {
      "type": "digitalocean",
      "token": "YOUR PERSONAL TOKEN",
      "region": "YOUR REGION (could be: lon1)",
      "size": "s-1vcpu-1gb (previous: 512mb)",
      "sshkeyname": "YOUR SSH KEY (could be: mykey)",
      "imagename": "YOUR SNAPSHOT NAME (could be: forward-proxy)",
      "tags": "YOUR TAGS SEPARATED BY A COMMA (could be: proxy,instance)"
    },
    {
      "type": "vscale",
      "token": "YOUR PERSONAL TOKEN",
      "region": "YOUR REGION (could be: msk0, spb0)",
      "imagename": "YOUR SNAPSHOT NAME (could be: forward-proxy)",
      "sshkeyname": "YOUR SSH KEY (could be: mykey)",
      "plan": "YOUR PLAN (could be: small)"
    }
  ]
}

2.4.3 Options: commander

Option | Default value | Description
port | 8889 | TCP port of the REST API
password | none | Password to access the commander
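For instance, to set the commander password (and keep the default REST API port), the commander section of conf.json could look like the following sketch (the values are placeholders):

{
  "commander": {
    "port": 8889,
    "password": "CHANGE_THIS_PASSWORD"
  }
}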

2.4.4 Options: instance

Option | Default value | Description
port | none | TCP port of your instance (example: 3128)
username | none | Credentials if your proxy instance needs them (optional)
password | none | Credentials if your proxy instance needs them (optional)
scaling | none | see instance / scaling
checkdelay | 10000 (in ms) | Scrapoxy requests the status of instances from the provider every X ms
checkalivedelay | 20000 (in ms) | Scrapoxy pings instances every X ms
stopifcrasheddelay | 120000 (in ms) | Scrapoxy restarts an instance if it has been dead for X ms
autorestart | none | see instance / autorestart

2.4.5 Options: instance / autorestart

Scrapoxy randomly restarts instances to change their IP addresses. The delay is between mindelay and maxdelay. A combined example of the instance options is shown after the provider list below.

Option | Default value | Description
mindelay | 3600000 (in ms) | Minimum delay
maxdelay | 43200000 (in ms) | Maximum delay

2.4.6 Options: instance / scaling

Option | Default value | Description
min | none | The desired count of instances when Scrapoxy is asleep
max | none | The desired count of instances when Scrapoxy is awake
required | none | The count of actual instances
downscaledelay | 600000 (in ms) | Time to wait before removing unused instances when Scrapoxy is not in use

2.4.7 Options: logs

Option | Default value | Description
path | none | If specified, writes all logs in a dated file

2.4.8 Options: providers

providers is an array of providers. It can contain multiple providers:

- AWS EC2: see AWS EC2 - Configuration
- OVH Cloud: see OVH Cloud - Configuration
- DigitalOcean: see DigitalOcean - Configuration
- Vscale: see Vscale - Configuration
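As an illustration of the instance options above, here is a minimal sketch of an instance section (the values are assumptions; the delays simply repeat the defaults listed in the tables):

{
  "instance": {
    "port": 3128,
    "scaling": {
      "min": 1,
      "max": 10,
      "downscaledelay": 600000
    },
    "checkdelay": 10000,
    "checkalivedelay": 20000,
    "stopifcrasheddelay": 120000,
    "autorestart": {
      "mindelay": 3600000,
      "maxdelay": 43200000
    }
  }
}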

2.4.9 Options: proxy

Option | Default value | Description
port | 8888 | TCP port of Scrapoxy
auth | none | see proxy / auth (optional)

2.4.10 Options: proxy / auth

Option | Default value | Description
username | none | Credentials if your Scrapoxy needs them
password | none | Credentials if your Scrapoxy needs them

2.4.11 Options: stats

Option | Default value | Description
retention | 86400000 (in ms) | Duration of statistics retention
samplingdelay | 1000 (in ms) | Get stats every X ms

2.5 AWS / EC2

2.5.1 Get started

Step 1: Get your AWS credentials

See Get credentials. Remember your Access Key and Secret Access Key.

Step 2: Connect to your region

Step 3: Create a security group

See Create a security group.

Step 4: Choose an AMI

Public AMIs are available for these regions:

- eu-west-1 / t1.micro: ami-c74d0db4
- eu-west-1 / t2.nano: ami-06220275

If you cannot find your region, you can Copy an AMI from a region to another.

Step 5: Update configuration

Open conf.json:

{
  "providers": [
    {
      "type": "awsec2",
      "region": "eu-west-1",
      "instance": {
        "InstanceType": "t1.micro",
        "ImageId": "ami-c74d0db4",
        "SecurityGroups": [ "forward-proxy" ]
      }
    }
  ]
}

And update region and ImageId with your parameters.

2.5.2 Configure Scrapoxy

Options: awsec2

For credentials, there are 2 choices:

1. Add credentials in the configuration file;
2. Or use your own credentials (from profile, see the AWS documentation).

Option | Default value | Description
type | none | Must be awsec2
accesskeyid | none | Credentials for AWS (optional)
secretaccesskey | none | Credentials for AWS (optional)
region | none | AWS region (example: eu-west-1)
tag | Proxy | Name of the AWS / EC2 instance
instance | none | see awsec2 / instance
max | none | Maximum number of instances for this provider. If empty, there is no maximum.
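For example, a provider entry that relies on profile credentials (choice 2 above) and caps the provider at two instances might look like this sketch (the values are assumptions; accesskeyid and secretaccesskey are simply omitted):

{
  "providers": [
    {
      "type": "awsec2",
      "region": "eu-west-1",
      "tag": "Proxy",
      "max": 2,
      "instance": {
        "InstanceType": "t1.micro",
        "ImageId": "ami-c74d0db4",
        "SecurityGroups": [ "forward-proxy" ]
      }
    }
  ]
}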

Options: awsec2 / instance

These options are specific to AWS / EC2. Scrapoxy uses the runInstances method to create new instances. Standard options are InstanceType, ImageId, KeyName, and SecurityGroups.

2.5.3 Tutorials

Tutorial: AWS / EC2 - Get credentials

Step 1: Connect to your AWS console

Go to the AWS console.

Step 2: Go to credentials

Step 3: Create a new key

1. Click on Access Key
2. Click on Create New Access Key

Step 4: Get the credentials 1. Click on Show Access Key 2. Get the values of Access Key ID and Secret Access Key Tutorial: AWS / EC2 - Create a security group Security groups (and AMI) are restricted to a region. A security group in eu-west-1 is not available in eu-central-1. 20 Chapter 2. Documentation

Warning: You must create a security group per region.

Step 1: Connect to your AWS console

Go to the AWS console.

Step 2: Go to the EC2 dashboard

Step 3: Create a security group

1. Click on Security Groups
2. Click on Create Security Group

Step 4: Fill in the security group

1. Fill the name and the description with forward-proxy
2. Fill the Inbound rule with:
   - Type: Custom TCP Rule
   - Port Range: 3128
   - Source: Anywhere
3. Click on Create

Tutorial: AWS / EC2 - Copy an AMI from a region to another

AMIs (and security groups) are restricted to a region. An AMI in eu-west-1 is not available in eu-central-1.

Warning: You must create an AMI per region.

Step 1: Connect to your AWS console

Go to the AWS console.

Step 2: Connect to the Ireland region

Step 3: Go to EC2 dashboard Step 4: Find the public AMI 1. Click on AMIs 2. Search ami-c74d0db4 24 Chapter 2. Documentation

Step 5: Open the Copy AMI wizard

1. Right-click on the AMI
2. Click on Copy AMI

Step 6: Start the AMI copy

1. Choose the new destination region
2. Click on Copy AMI

Step 7: Connect to the new region Step 8: Find the new AMI ID The new AMI ID is in the column AMI ID. 26 Chapter 2. Documentation

2.6 DigitalOcean

2.6.1 Get started

Step 1: Get your DigitalOcean credentials

See Get DigitalOcean credentials. Remember your token.

Step 2: Create an SSH key for the project

See Create a SSH key. Remember your SSH key name (mykey).

Step 3: Create an image

See Create an image. Remember your image name (forward-proxy).

Step 4: Update configuration

Open conf.json:

{
  "providers": [
    {
      "type": "digitalocean",
      "token": "YOUR PERSONAL TOKEN",
      "region": "YOUR REGION (could be: lon1)",
      "size": "s-1vcpu-1gb (previous: 512mb)",
      "sshkeyname": "YOUR SSH KEY (could be: mykey)",
      "imagename": "YOUR SNAPSHOT NAME (could be: forward-proxy)",
      "tags": "YOUR TAGS SEPARATED BY A COMMA (could be: proxy,instance)"
    }
  ]
}

And update the config with your parameters.

2.6.2 Configure Scrapoxy

Options: digitalocean

Option | Default value | Description
type | none | Must be digitalocean
token | none | Credentials for DigitalOcean
region | none | DigitalOcean region (example: lon1)
sshkeyname | none | Name of the SSH key
size | none | Type of droplet
name | Proxy | Name of the droplet
imagename | none | Name of the image (for the proxy droplet)
tags | none | Tags separated by a comma (example: proxy,instance)
max | none | Maximum number of instances for this provider. If empty, there is no maximum.

2.6.3 Tutorials

Tutorial: DigitalOcean - Get credentials

Step 1: Connect to your DigitalOcean console

Go to the DigitalOcean console.

Step 2: Generate a new token

1. Click on API
2. Click on Generate New Token

Step 3: Create a new token 1. Enter scrapoxy in Token Name 2. Click on Generate Token Step 4: Get the credentials Remember the token 2.6. DigitalOcean 29

Tutorial: DigitalOcean - Create a SSH key Step 1: Connect to your DigitalOcean console Go to DigitalOcean console. Step 2: Go to settings 1. Click on the top right icon 2. Click on Settings Step 3: Add a new key 1. Click on Security 2. Click on Add SSH Key 30 Chapter 2. Documentation

Step 4: Create a new key 1. Paste your SSH key 2. Enter mykey for the name 3. Click on Add SSH Key You can generate your key with this tutorial on Github. And remember the name of the key! Tutorial: DigitalOcean - Create an image Step 1: Connect to your DigitalOcean console Go to DigitalOcean console. 2.6. DigitalOcean 31

Step 2: Create a new droplet Click on Create and Droplets Step 3: Change the configuration of droplet Choose an image Ubuntu 16.04.3 x64: Choose the smallest size on Standard Droplets: 32 Chapter 2. Documentation


Choose a datacenter (e.g.: London): Use the SSH key named mykey: Step 4: Start the droplet Click on Create 34 Chapter 2. Documentation

Step 5: Connect to the instance

Get the IP, and enter this command in a terminal:

ssh root@<replace by IP>

Step 6: Install the proxy

Install the proxy with:

curl --silent --location https://deb.nodesource.com/setup_8.x | sudo bash -

and:

sudo apt-get install --yes nodejs

and:

curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxy.js | sudo tee /root/proxy.js > /dev/null

and:

curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxyup.sh | sudo tee /etc/init.d/proxyup.sh > /dev/null

and:

sudo chmod a+x /etc/init.d/proxyup.sh

and:

sudo update-rc.d proxyup.sh defaults

and:

sudo /etc/init.d/proxyup.sh start

Step 7: Power off the droplet

1. Stop the last command (CTRL-C)
2. Power off the droplet: sudo poweroff

The green icon disappears when the droplet is off.

Step 8: Create a backup

1. Click on Images
2. Select your droplet
3. Enter forward-proxy in Image Name
4. Click on Take Snapshot

Wait a few minutes and the new image appears.

2.7 OVH Cloud

2.7.1 Get started

Step 1: Get your OVH credentials

See Get OVH credentials. Remember your Application Key, Application Secret and consumerkey.

Step 2: Create a project

See Create a project. Remember your serviceid.

Step 3: Create an SSH key for the project

See Create a SSH key. Remember your SSH key name (mykey).

Step 4: Create a proxy image

See Create a proxy image. Remember your image name (forward-proxy).

Step 5: Update configuration

Open conf.json:

{
  "providers": [
    {
      "type": "ovhcloud",
      "endpoint": "YOUR ENDPOINT (could be: ovh-eu)",
      "appkey": "YOUR APP KEY",
      "appsecret": "YOUR APP SECRET",
      "consumerkey": "YOUR CONSUMERKEY",
      "serviceid": "YOUR SERVICEID",
      "region": "YOUR REGION (could be: BHS1, GRA1 or SBG1)",
      "sshkeyname": "YOUR SSH KEY (could be: mykey)",
      "flavorname": "vps-ssd-1",
      "snapshotname": "YOUR SNAPSHOT NAME (could be: forward-proxy)"
    }
  ]
}

And update the config with your parameters.

2.7.2 Configure Scrapoxy

Options: ovhcloud

Option | Default value | Description
type | none | Must be ovhcloud
endpoint | none | OVH subdivision (ovh-eu or ovh-ca)
appkey | none | Credentials for OVH
appsecret | none | Credentials for OVH
consumerkey | none | Credentials for OVH
serviceid | none | Project ID
region | none | OVH region (example: GRA1)
sshkeyname | none | Name of the SSH key
flavorname | none | Type of instance
name | Proxy | Name of the instance
snapshotname | none | Name of the backup image (for the proxy instance)
max | none | Maximum number of instances for this provider. If empty, there is no maximum.

2.7.3 Tutorials

Tutorial: OVH Cloud - Get credentials

Step 1: Create an API Application

1. Go to https://eu.api.ovh.com/createapp/
2. Enter ID and password
3. Add a name (e.g.: scrapoxy-12)
4. Add a description (e.g.: scrapoxy)
5. Click on Create keys

Step 2: Save application credentials

Remember the Application Key and Application Secret.

Step 3: Get the consumerkey Use Scrapoxy to get your key: scrapoxy ovh-consumerkey <endpoint> <Application Key> <Application Secret> Endpoints are: ovh-eu for OVH Europe; ovh-ca for OVH North-America. Remember consumerkey and click on validation URL to validate key. Step 4: Add permission When you open validation URL: 1. Enter ID and password 2. Choose Unlimited validity 3. Click on Authorize Access 40 Chapter 2. Documentation

Tutorial: OVH Cloud - Create a project Step 1: Connect to your OVH dashboard Go to OVH Cloud. Step 2: Create a new project 1. Click on Create a new project 2. Fill project name 3. Click on Create the project 2.7. OVH Cloud 41

Step 3: Save serviceid Remember the ID in the URL: it is the project serviceid. Tutorial: OVH Cloud - Create a SSH key Step 1: Connect to your new project Go to OVH dashboard and select your new project. Step 2: Create a new key 1. Click on the tab SSH keys 2. Click on Add a key 42 Chapter 2. Documentation

Step 3: Enter the key 1. Set the name of the key mykey 2. Copy your key 3. Click on Add the key You can generate your key with this tutorial on Github. And remember the name of the key! Tutorial: OVH Cloud - Create a proxy image Step 1: Connect to your new project Go to OVH dashboard and select your new project. Step 2: Create a new server 1. Click on Infrastructure 2. Click on Add 3. Click on Add a server 2.7. OVH Cloud 43

Step 3: Change the configuration of server 1. Change the type of server to VPS-SSD-1 (cheapest) 2. Change the distribution to Ubuntu 3. Change region to GRA1 (or another if you want) 4. Change the SSH key to mykey 5. Click on Launch now 44 Chapter 2. Documentation


Step 4: Get login information When the instance is ready: 1. Click on the v on the top right corner 2. Click on Login information Step 5: Connect to the instance Remember the SSH command. Step 6: Install the proxy Connect to the instance and install proxy: 46 Chapter 2. Documentation

sudo apt-get install curl

and:

curl --silent --location https://deb.nodesource.com/setup_8.x | sudo bash -

and:

sudo apt-get install --yes nodejs

and:

curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxy.js | sudo tee /root/proxy.js > /dev/null

and:

curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxyup.sh | sudo tee /etc/init.d/proxyup.sh > /dev/null

and:

sudo chmod a+x /etc/init.d/proxyup.sh

and:

sudo update-rc.d proxyup.sh defaults

and:

sudo /etc/init.d/proxyup.sh start

Step 7: Create a backup

Go back to the OVH project dashboard:

1. Click on the v in the top right corner
2. Click on Create a backup

Step 8: Start the backup 1. Enter the snapshot name forward-proxy 2. Click on Take a snapshot You need to wait 10 minutes to 1 hour. 48 Chapter 2. Documentation

2.8 Vscale

Vscale is a Russian cloud platform, similar to DigitalOcean.

Note: IP addresses are updated every hour. If the instances are restarted too quickly, they will keep the same IP address.

2.8.1 Get started

Step 1: Get your Vscale credentials

See Get Vscale credentials. Remember your token.

Step 2: Create an SSH key for the project

See Create a SSH key. Remember your SSH key name (mykey).

Step 3: Create an image

See Create an image. Remember your image name (forward-proxy).

Step 4: Update configuration

Open conf.json:

{
  "providers": [
    {
      "type": "vscale",
      "token": "YOUR PERSONAL TOKEN",
      "region": "YOUR REGION (could be: msk0, spb0)",
      "name": "YOUR SERVER NAME",
      "sshkeyname": "YOUR SSH KEY (could be: mykey)",
      "plan": "YOUR PLAN (could be: small)"
    }
  ]
}

And update the config with your parameters.

2.8.2 Configure Scrapoxy

Options: vscale

Option | Default value | Description
type | none | Must be vscale
token | none | Credentials for Vscale
region | none | Vscale region (example: msk0, spb0)
sshkeyname | none | Name of the SSH key
plan | none | Type of plan (example: small)
name | Proxy | Name of the scalet
imagename | none | Name of the image (for the proxy scalet)
max | none | Maximum number of instances for this provider. If empty, there is no maximum.

2.8.3 Tutorials

Tutorial: Vscale - Get credentials

Step 1: Connect to your Vscale console

Go to the Vscale console.

Step 2: Generate a new token

1. Click on SETTINGS
2. Click on API tokens
3. Click on CREATE TOKEN

Step 3: Create a new token 1. Enter scrapoxy in Token description 2. Click on GENERATE TOKEN Step 4: Get the credentials 1. Click on COPY TO CLIPBOARD 2. Click on SAVE Remember the token 2.8. Vscale 51

Tutorial: Vscale - Create a SSH key Step 1: Connect to your Vscale console Go to Vscale console. Step 2: Add a new key 1. Click on SETTINGS 2. Click on SSH keys 3. Click on ADD SSH KEY 52 Chapter 2. Documentation

Step 3: Create a new key 1. Enter mykey for the name 2. Paste your SSH key 3. Click on ADD 2.8. Vscale 53

You can generate your key with this tutorial on Github. And remember the name of the key! Tutorial: Vscale - Create an image Step 1: Connect to your Vscale console Go to Vscale console. Step 2: Create a new scalet 1. Click on SERVERS 2. Click on CREATE SERVER Step 3: Change the configuration of scalet Choose an image Ubuntu 16.04 64bit: 54 Chapter 2. Documentation

Choose the smallest configuration: Choose a datacenter (e.g.: Moscow): 2.8. Vscale 55

Use the SSH key named mykey: Step 4: Create the scalet Click on CREATE SERVER Step 5: Connect to the instance Get the IP: 56 Chapter 2. Documentation

And enter this command in a terminal:

ssh root@<replace by IP>

Step 6: Install the proxy

Install the proxy with:

sudo apt-get update

and:

sudo apt-get install --yes curl

and:

curl --silent --location https://deb.nodesource.com/setup_8.x | sudo bash -

and:

sudo apt-get install --yes nodejs

and:

curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxy.js | sudo tee /root/proxy.js > /dev/null

and:

curl --silent --location https://raw.githubusercontent.com/fabienvauchelles/scrapoxy/master/tools/install/proxyup.sh | sudo tee /etc/init.d/proxyup.sh > /dev/null

and:

sudo chmod a+x /etc/init.d/proxyup.sh

and:

sudo update-rc.d proxyup.sh defaults

and:

sudo /etc/init.d/proxyup.sh start

Step 7: Power off the scalet

1. Stop the last command (CTRL-C)
2. Power off the scalet: sudo poweroff

The status is Offline when the scalet is off.

Step 8: Open the scalet

Click on the scalet:

Step 9: Prepare backup

Click on the scalet, then click on Create backup.

Step 10: Create backup

1. Enter forward-proxy in Name
2. Click on CREATE BACKUP

Wait a few minutes and the new image appears.

2.9 Manage Scrapoxy with a GUI

2.9.1 Connect

You can access the GUI at http://localhost:8889/

2.9.2 Login

Enter your password. The password is defined in the configuration file, key commander.password.

2.9.3 Dashboard

The Scrapoxy GUI has several pages:

- Instances. This page contains the list of instances managed by Scrapoxy;
- Stats. This page contains statistics on the use of Scrapoxy.

The login page redirects to the Instances page.

2.9.4 Page: Instances

Scaling

This panel shows the number of instances. Scrapoxy has 3 settings:

- Min. The desired count of instances when Scrapoxy is asleep;
- Max. The desired count of instances when Scrapoxy is awake;
- Required. The count of actual instances.

To add or remove an instance, click on the Scaling button and change the Required setting.

Status of an instance

Each instance is described in a panel. This panel contains the following information:

- Name of the instance;
- IP of the instance;
- Provider type;
- Instance status on the provider;
- Instance status in Scrapoxy.

Scrapoxy relays requests to instances which are started and alive (+).

Type of provider: AWS / EC2, OVH Cloud, DigitalOcean, VScale

Status in the provider: Starting, Started, Stopping, Stopped

Status in Scrapoxy: Alive, Dead

Remove an instance

Click on the instance to delete it. The instance stops and is replaced by another.

2.9.5 Page: Stats

There are 3 panels in stats:

- Global stats. This panel contains global stats;
- Requests. This panel contains the count of requests;
- Flow. This panel contains the flow of requests.

Global

This panel has 4 indicators:

- the total count of requests, to monitor performance;
- the total count of received and sent data, to control the volume of data;
- the total count of stop instance orders, to monitor anti-blacklisting;
- the count of requests received by an instance (minimum, average, maximum), to check anti-blacklisting performance.

Requests

This panel combines 2 statistics on 1 chart. It measures:

- the count of requests per minute;
- the average execution time of a request (round trip), per minute.

Flow

This panel combines 2 statistics on 1 chart. It measures:

- the flow received by Scrapoxy;
- the flow sent by Scrapoxy.

How do I increase the number of requests per minute?

Add new instances (or new scrapers). Does the number of requests per minute increase?
- Yes: Perfect!
- No: You are paying for instances for nothing.

Am I overloading the target website?

Add new instances (or new scrapers). Did the response time increase?
- Yes: The target website is overloaded.
- No: Perfect!

2.10 Understand Scrapoxy

2.10.1 Architecture

Scrapoxy consists of 4 parts:

- the master, which routes requests to proxies;
- the manager, which starts and stops proxies;
- the commander, which provides a REST API to receive orders;
- the gui, which connects to the REST API.

When Scrapoxy starts, the manager starts a new instance (if necessary) on the cloud. When the scraper sends an HTTP request, the manager starts all the other proxies.

2.10.2 Requests

Can Scrapoxy relay HTTPS requests?

Yes. There are 2 modes:

Mode A: HTTPS over HTTP (or "no tunnel" mode)

This mode is for scrapers only. It allows Scrapoxy to override HTTP headers (like User-Agent). The scraper must send an HTTP request with an HTTPS URL in the Location header. Example:

GET /index.html
Host: localhost:8888
Location: https://www.google.com/index.html
Accept: text/html

Scrapers accept a GET (or POST) method instead of CONNECT for the proxy.

With Scrapy (Python), add /?noconnect to the proxy URL:

PROXY = 'http://localhost:8888/?noconnect'

With Request (Node.js), add tunnel: false to the options:

request({
    method: 'GET',
    url: 'https://api.ipify.org/',
    tunnel: false,
    proxy: 'http://localhost:8888',
}, (err, response, body) => {...});

Mode B: HTTPS CONNECT (or "tunnel" mode)

This mode is for browsers only, like PhantomJS. It doesn't allow Scrapoxy to override HTTP headers (like User-Agent). You must manually set the User-Agent. The best solution is to use only 1 User-Agent (it would be strange to have multiple User-Agents coming from 1 IP, wouldn't it?).

Which proxy returned the response?

Scrapoxy adds an HTTP header x-cache-proxyname to the response. This header contains the name of the proxy.

Can the scraper force the request to go through a specific proxy?

Yes. The scraper adds the proxy name in the x-cache-proxyname header. When the scraper receives a response, this header is extracted. The scraper adds this header to the next request. A minimal sketch of this pattern is shown at the end of this subsection.

Does Scrapoxy override the User Agent?

Yes. When an instance starts (or restarts), it gets a random User Agent (from the User Agent list). When the instance receives a request, it overrides the User Agent.
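As an illustration of the x-cache-proxyname mechanism, here is a minimal Node.js sketch (assuming the request library and the default Scrapoxy port; the target URL is only an example) that extracts the proxy name from a response and pins the next request to the same instance:

const request = require('request');

const proxyUrl = 'http://localhost:8888';

// First request: let Scrapoxy pick any instance.
request({
    method: 'GET',
    url: 'https://api.ipify.org/',   // example target
    proxy: proxyUrl,
    tunnel: false,
}, (err, response) => {
    if (err) return console.error(err);

    // Scrapoxy adds the name of the instance that handled the request.
    const proxyName = response.headers['x-cache-proxyname'];
    console.log('served by', proxyName);

    // Next request: send the header back to go through the same instance.
    request({
        method: 'GET',
        url: 'https://api.ipify.org/',
        proxy: proxyUrl,
        tunnel: false,
        headers: {'x-cache-proxyname': proxyName},
    }, (err2, response2, body2) => {
        if (err2) return console.error(err2);
        console.log('IP:', body2);
    });
});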

2.10.3 Blacklisting

How can you manage a blacklisted response?

Remember, Scrapoxy cannot detect blacklisted responses because detection is too specific to each scraping use case. It can be a 503 HTTP response, a captcha, a longer response, etc.

Anti-blacklisting is a job for the scraper:

1. The scraper must detect a blacklisted response;
2. The scraper extracts the name of the instance from the HTTP response header (see here);
3. The scraper asks Scrapoxy to remove the instance with the API (see here).

When the instance is removed, Scrapoxy replaces it with a valid one (new IP address). There is a tutorial: Manage blacklisted request with Scrapy.

2.10.4 Instances management

How does multi-provider support work?

In the configuration file, you can specify multiple providers (the providers field is an array). You can also specify the maximum number of instances per provider, with the max parameter (for example: 2 instances maximum for AWS EC2 and unlimited for DigitalOcean). When several instances are requested, the algorithm randomly distributes the instances among the providers, within the specified capacities.

How does the monitoring mechanism work?

1. The manager asks the cloud how many instances are alive. This is the initial state;
2. The manager creates a target state, with the new count of instances;
3. The manager generates the commands to reach the target state from the initial state;
4. The manager sends the commands to the cloud.

These steps are very important because you cannot guess the initial state: an instance may be dead.

Scrapoxy can restart an instance if:

- the instance is dead (stop status or no ping);
- the living limit is reached: Scrapoxy regularly restarts instances to change their IP address.

Do you need to create a VM image?

By default, we provide an AMI proxy instance on AWS / EC2. This is a CONNECT proxy listening on TCP port 3128. But you can use any software which accepts the CONNECT method (Squid, Tinyproxy, etc.).

Can you leave Scrapoxy running?

Yes. Scrapoxy has 2 modes: an awake mode and an asleep mode. When Scrapoxy receives no requests for a while, it falls asleep and sets the count of instances to the minimum (instance.scaling.min). When Scrapoxy receives a request, it wakes up and sets the count of instances to the maximum (instance.scaling.max). Scrapoxy needs at least 1 instance to receive the awakening request.

2.11 Control Scrapoxy with a REST API

2.11.1 Endpoint

You can access the commander API at http://localhost:8889/api

2.11.2 Authentication

Every request must have an Authorization header. The value is the base64 encoding of the password set in the configuration (commander/password).
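For example, in Node.js the header value can be built like this (a minimal sketch, matching the examples below; the password value is a placeholder):

// Build the Authorization header value from the commander password.
const password = 'YOUR_COMMANDER_PASSWORD';
const authorization = new Buffer(password).toString('base64');

// Pass it with every API request, e.g.:
// headers: { 'Authorization': authorization }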

2.11.3 Instances

Get all instances

Request

GET http://localhost:8889/api/instances

Response (JSON)

Status: 200

The body contains all information about the instances.

Example:

const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD';

const opts = {
    method: 'GET',
    url: 'http://localhost:8889/api/instances',
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('error: ', err);

    console.log('status: %d\n\n', res.statusCode);

    const bodyParsed = JSON.parse(body);
    console.log(bodyParsed);
});

Stop an instance

Request

POST http://localhost:8889/api/instances/stop

JSON payload:

{
    "name": "<name of the proxy>"
}

Response (JSON)

Status: 200

The instance exists. Scrapoxy stops it, and the instance is restarted with a new IP address. The body contains the remaining count of alive instances.

{
    "alive": <count>
}

Status: 404

The instance does not exist.

Example:

const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD',
    instanceName = 'YOUR INSTANCE NAME';

const opts = {
    method: 'POST',
    url: 'http://localhost:8889/api/instances/stop',
    json: {
        name: instanceName,
    },
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('error: ', err);

    console.log('status: %d\n\n', res.statusCode);

    console.log(body);
});

2.11.4 Scaling

Get the scaling

Request

GET http://localhost:8889/api/scaling

Response (JSON)

Status: 200

The body contains the whole scaling configuration.

Example:

const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD';

const opts = {
    method: 'GET',
    url: 'http://localhost:8889/api/scaling',
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('error: ', err);

    console.log('status: %d\n\n', res.statusCode);

    const bodyParsed = JSON.parse(body);
    console.log(bodyParsed);
});

Update the scaling

Request

PATCH http://localhost:8889/api/scaling

JSON payload:

{
    "min": "min_scaling",
    "required": "required_scaling",
    "max": "max_scaling"
}

Response (JSON)

Status: 200

The scaling is updated.

Status: 204

The scaling is not updated.

Example:

const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD';

const opts = {
    method: 'PATCH',
    url: 'http://localhost:8889/api/scaling',
    json: {
        min: 1,
        required: 5,
        max: 10,
    },
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('error: ', err);

    console.log('status: %d\n\n', res.statusCode);

    console.log(body);
});

2.11.5 Configuration

Get the configuration

Request

GET http://localhost:8889/api/config

Response (JSON)

Status: 200

The body contains the whole configuration of Scrapoxy (including scaling).

Example:

const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD';

const opts = {
    method: 'GET',
    url: 'http://localhost:8889/api/config',
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('error: ', err);

    console.log('status: %d\n\n', res.statusCode);

    const bodyParsed = JSON.parse(body);
    console.log(bodyParsed);
});

Update the configuration

Request

PATCH http://localhost:8889/api/config

JSON payload:

{
    "key_to_override": "<new_value>",
    "section": {
        "key2_to_override": "<new value>"
    }
}

Response (JSON)

Status: 200

The configuration is updated.

Status: 204

The configuration is not updated.

Example:

const request = require('request');

const password = 'YOUR_COMMANDER_PASSWORD';

const opts = {
    method: 'PATCH',
    url: 'http://localhost:8889/api/config',
    json: {
        instance: {
            scaling: {
                max: 300,
            },
        },
    },
    headers: {
        'Authorization': new Buffer(password).toString('base64'),
    },
};

request(opts, (err, res, body) => {
    if (err) return console.log('error: ', err);

    console.log('status: %d\n\n', res.statusCode);

    console.log(body);
});

2.12 Secure Scrapoxy

2.12.1 Secure Scrapoxy with Basic auth

Scrapoxy supports standard HTTP Basic auth (RFC 2617).

Step 1: Add username and password to the configuration

Open conf.json and add an auth section inside the proxy section (see Configure Scrapoxy):

{
    "proxy": {
        "auth": {
            "username": "myuser",
            "password": "mypassword"
        }
    }
}

Step 2: Add username and password to the scraper

Configure your scraper to use the username and password. The URL is: http://myuser:mypassword@localhost:8888 (replace myuser and mypassword with your credentials).

Step 3: Test credentials

Open a new terminal and test Scrapoxy without credentials:

scrapoxy test http://localhost:8888

It doesn't work. Now, test Scrapoxy with your credentials:

scrapoxy test http://myuser:mypassword@localhost:8888

(replace myuser and mypassword with your credentials)

2.12.2 Secure Scrapoxy with a firewall on Ubuntu

UFW simplifies IPTables on Ubuntu (>14.04).

Step 1: Allow SSH

sudo ufw allow ssh

Step 2: Allow Scrapoxy

sudo ufw allow 8888/tcp
sudo ufw allow 8889/tcp

Step 3: Enable UFW

sudo ufw enable

Enter y.

Step 4: Check UFW status and rules

sudo ufw status

2.13 Launch Scrapoxy at startup

2.13.1 Prerequisite

Scrapoxy is installed with a valid configuration (see Quick Start).

2.13.2 Step 1: Install PM2

sudo npm install -g pm2

2.13.3 Step 2: Launch PM2 at instance startup

sudo pm2 startup ubuntu -u <YOUR USERNAME>

1. Replace ubuntu with your distribution name (see the PM2 documentation).
2. Replace YOUR USERNAME with your Linux username.

2.13.4 Step 3: Create a PM2 configuration

Create a PM2 configuration file scrapoxy.json5 for Scrapoxy:

{
    apps: [
        {
            name: "scrapoxy",
            script: "/usr/bin/scrapoxy",
            args: ["start", "<ABSOLUTE PATH TO CONFIG>/conf.json", "-d"],
            kill_timeout: 30000,
        },
    ],
}

2.13.5 Step 4: Start configuration

pm2 start scrapoxy.json5

2.13.6 Step 5: Save configuration

pm2 save

2.13.7 Step 6: Stop PM2 (optional)

If you need to stop Scrapoxy in PM2:

pm2 stop scrapoxy.json5

2.14 Integrate Scrapoxy with Scrapy

2.14.1 Goal

Is it easy to find a good Python developer in Paris? No! So, it's time to build a scraper with Scrapy to find our perfect profile. The site Scraping Challenge indexes a lot of profiles (fake, for demo purposes). We want to grab them and create a CSV file. However, the site is protected against scraping! We must use Scrapoxy to bypass the protection.

2.14.2 Step 1: Install Scrapy

Install Python 2.7

Scrapy works with Python 2.7.

Install dependencies

On Ubuntu:

apt-get install python-dev libxml2-dev libxslt1-dev libffi-dev

On Windows (with Babun):

wget https://bootstrap.pypa.io/ez_setup.py -O - | python
easy_install pip
pact install libffi-devel libxml2-devel libxslt-devel

Install Scrapy and the Scrapoxy connector

pip install scrapy scrapoxy

2.14.3 Step 2: Create the scraper myscraper

Create a new project

Bootstrap the skeleton of the project:

scrapy startproject myscraper
cd myscraper

Add a new spider

Add this content to myscraper/spiders/scraper.py:

# -*- coding: utf-8 -*-
from scrapy import Request, Spider


class Scraper(Spider):
    name = u'scraper'

    def start_requests(self):
        """This is our first request to grab all the urls of the profiles."""
        yield Request(
            url=u'http://scraping-challenge-2.herokuapp.com',
            callback=self.parse,
        )

    def parse(self, response):
        """We have all the urls of the profiles. Let's make a request for each profile."""
        urls = response.xpath(u'//a/@href').extract()
        for url in urls:
            yield Request(
                url=response.urljoin(url),
                callback=self.parse_profile,

            )

    def parse_profile(self, response):
        """We have a profile. Let's extract the name."""
        name_el = response.css(u'.profile-info-name::text').extract()
        if len(name_el) > 0:
            yield {
                'name': name_el[0]
            }

If you want to learn more about Scrapy, check out this Tutorial.

Run the spider

Let's try our new scraper! Run this command:

scrapy crawl scraper -o profiles.csv

Scrapy scrapes the site and extracts profiles to profiles.csv. However, Scraping Challenge is protected! profiles.csv is empty... We will integrate Scrapoxy to bypass the protection.

2.14.4 Step 3: Integrate Scrapoxy with Scrapy

Install Scrapoxy

See Quick Start to install Scrapoxy.

Start Scrapoxy

Set the maximum number of instances to 6, and start Scrapoxy (see Change scaling with GUI).

Warning: Don't forget to set the maximum number of instances!

Edit the settings of the scraper

Add this content to myscraper/settings.py:

CONCURRENT_REQUESTS_PER_DOMAIN = 1
RETRY_TIMES = 0

# PROXY
PROXY = 'http://127.0.0.1:8888/?noconnect'

# SCRAPOXY
API_SCRAPOXY = 'http://127.0.0.1:8889/api'
API_SCRAPOXY_PASSWORD = 'CHANGE_THIS_PASSWORD'

DOWNLOADER_MIDDLEWARES = {
    'scrapoxy.downloadmiddlewares.proxy.ProxyMiddleware': 100,
    'scrapoxy.downloadmiddlewares.wait.WaitMiddleware': 101,
    'scrapoxy.downloadmiddlewares.scale.ScaleMiddleware': 102,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
}

What are these middlewares?

- ProxyMiddleware relays requests to Scrapoxy. It is a helper to set the PROXY parameter.
- WaitMiddleware stops the scraper and waits for Scrapoxy to be ready.
- ScaleMiddleware asks Scrapoxy to maximize the number of instances at the beginning, and to stop them at the end.

Note: ScaleMiddleware stops the scraper like WaitMiddleware. After 2 minutes, all instances are ready and the scraper continues to scrape.

Warning: Don't forget to change the password!

Run the spider

Run this command:

scrapy crawl scraper -o profiles.csv

Now, all profiles are saved to profiles.csv!

2.15 Integrate Scrapoxy with Node.js

2.15.1 Goal

Is it easy to find a good JavaScript developer in Paris? No! So, it's time to build a scraper with Node.js, Request and Cheerio to find our perfect profile. The site Scraping Challenge indexes a lot of profiles (fake, for demo purposes). We want to list them. However, the site is protected against scraping! We must use Scrapoxy to bypass the protection.

2.15.2 Step 1: Create a Node.js project

Install Node.js

Install the latest Node.js version.

Create a new project

Create a directory for the project:

mkdir nodejs-request
cd nodejs-request

Create the package.json:

npm init --yes

Add dependencies:

npm install lodash bluebird cheerio request winston --save

What are these dependencies?

- lodash is a JavaScript helper,
- bluebird is a promise library,
- cheerio is a jQuery-like parser,
- request makes HTTP requests,
- winston is a logger.

Add a scraper

Add this content to index.js:

const _ = require('lodash'),
    Promise = require('bluebird'),
    cheerio = require('cheerio'),
    request = require('request'),
    winston = require('winston');

winston.level = 'debug';

const config = {
    // URL of the site
    source: 'http://scraping-challenge-2.herokuapp.com',

    opts: {
    },
};

// Get all URLs
getprofilesurls(config.source, config.opts)
    .then((urls) => {
        winston.info('found %d profiles', urls.length);
        winston.info('wait 120 seconds to scale instances');

        return urls;
    })

    // Wait 2 minutes to scale instances
    .delay(2 * 60 * 1000)
    .then((urls) => {
        // Get profiles one by one.
        return Promise.map(urls, (url) =>
            getprofile(url, config.opts)
                .then((profile) => {
                    winston.debug('found %s', profile.name);

                    return profile;
                })
                .catch(() => {
                    winston.debug('cannot retrieve %s', url);
                }),
            {concurrency: 1}
        )
            .then((profiles) => {
                const results = _.compact(profiles);

                winston.info('extract %d on %d profiles', results.length, urls.length);
            });
    })
    .catch((err) => winston.error('error: ', err));

////////////

/**
 * Get all the urls of the profiles
 * @param url Main URL
 * @param defaultopts options for the http request
 * @returns {Promise}
 */
function getprofilesurls(url, defaultopts) {
    return new Promise((resolve, reject) => {
        // Create options for the HTTP request
        // Add the URL to the default options
        const opts = _.merge({}, defaultopts, {url});

        request(opts, (err, res, body) => {
            if (err) {
                return reject(err);
            }

            if (res.statusCode !== 200) {
                return reject(body);
            }

            // Load content into a jQuery-like parser
            const $ = cheerio.load(body);

            // Extract all urls
            const urls = $('.profile a')
                .map((i, el) => $(el).attr('href'))
                .get()
                .map((url) => `${config.source}${url}`);

} }); }); resolve(urls); /** * Get the profile and extract the name * @param url URL of the profile * @param defaultopts options for http request * @returns {promise} */ function getprofile(url, defaultopts) { return new Promise((resolve, reject) => { // Create options for the HTTP request // Add the URL to the default options const opts = _.merge({}, defaultopts, {url}); request(opts, (err, res, body) => { if (err) { return reject(err); } if (res.statuscode!== 200) { return reject(body); } // Load content into a JQuery parser const $ = cheerio.load(body); // Extract the names const name = $('.profile-info-name').text(); } }); }); resolve({name}); Run the script Let s try our new scraper! Run this command: node index.js The script scraps the site and list profiles. However, Scraping Challenge is protected! All requests fail... We will integrate Scrapoxy to bypass the protection. 2.15. Integrate Scrapoxy to Node.js 83