An intelligent proxy pool for humanities only supports Python 3.6

scylla

Scylla is an intelligent proxy pool for humanities. It supports only Python 3.6. Key features:

  • Automatic proxy IP crawling and validation
  • Easy-to-use JSON API
  • Simple but beautiful web-based user interface (e.g. geographical distribution of proxies)
  • Get started with as little as one command
  • Simple HTTP forward proxy server
  • Scrapy and requests integration with as little as one line of code
  • Headless browser crawling

Get started

Installation

Install with Docker (highly recommended)

docker run -d -p 8899:8899 -p 8081:8081 -v /var/www/scylla:/var/www/scylla --name scylla wildcat/scylla:latest

Install directly via pip

pip install scylla
scylla --help
scylla # Run the crawler and web server for JSON API

Install from source

git clone https://github.com/imWildCat/scylla.git
cd scylla

pip install -r requirements.txt

npm install # or yarn install
make build-assets

python -m scylla

Usage

The examples below assume that Scylla is running locally (localhost)
and serving on port 8899.

Note: The first time you use Scylla, you might have to wait 1 to 2 minutes for proxy IPs to be populated in the database.

JSON API

Proxy IP List

http://localhost:8899/api/v1/proxies

Optional URL parameters:

| Parameter | Default value | Description |
| --- | --- | --- |
| page | 1 | The page number |
| limit | 20 | The number of proxies shown on each page |
| anonymous | any | Whether to show anonymous proxies. Possible values: true, only anonymous proxies; false, only transparent proxies |
| https | any | Whether to show HTTPS proxies. Possible values: true, only HTTPS proxies; false, only HTTP proxies |
| countries | None | Filter proxies by country. Format example: US, or for multiple countries: US,GB |
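The optional parameters above can be combined in a single query string. Below is a minimal sketch using only the Python standard library. The helper name `build_proxies_url` is hypothetical (it is not part of Scylla), and the base URL assumes the local setup used in this section.

```python
from urllib.parse import urlencode

# Base URL from the example above; adjust host and port to your deployment.
BASE_URL = "http://localhost:8899/api/v1/proxies"


def build_proxies_url(page=1, limit=20, anonymous=None, https=None, countries=None):
    """Hypothetical helper: build a proxy-list URL.

    None means "do not send this parameter", so the server default applies.
    """
    params = {"page": page, "limit": limit}
    if anonymous is not None:
        params["anonymous"] = "true" if anonymous else "false"
    if https is not None:
        params["https"] = "true" if https else "false"
    if countries:
        # Multiple countries are joined with commas, e.g. US,GB
        params["countries"] = ",".join(countries)
    # safe="," keeps the comma literal instead of percent-encoding it
    return BASE_URL + "?" + urlencode(params, safe=",")


print(build_proxies_url(limit=10, anonymous=True, countries=["US", "GB"]))
```

The resulting URL can be passed to any HTTP client, for example `requests.get(url).json()`.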

Sample result:

{
    "proxies": [{
        "id": 599,
        "ip": "91.229.222.163",
        "port": 53281,
        "is_valid": true,
        "created_at": 1527590947,
        "updated_at": 1527593751,
        "latency": 23.0,
        "stability": 0.1,
        "is_anonymous": true,
        "is_https": true,
        "attempts": 1,
        "https_attempts": 0,
        "location": "54.0451,-0.8053",
        "organization": "AS57099 Boundless Networks Limited",
        "region": "England",
        "country": "GB",
        "city": "Malton"
    }, {
        "id": 75,
        "ip": "75.151.213.85",
        "port": 8080,
        "is_valid": true,
        "created_at": 1527590676,
        "updated_at": 1527593702,
        "latency": 268.0,
        "stability": 0.3,
        "is_anonymous": true,
        "is_https": true,
        "attempts": 1,
        "https_attempts": 0,
        "location": "32.3706,-90.1755",
        "organization": "AS7922 Comcast Cable Communications, LLC",
        "region": "Mississippi",
        "country": "US",
        "city": "Jackson"
    },
    ...
    ],
    "count": 1025,
    "per_page": 20,
    "page": 1,
    "total_page": 52
}
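As an illustration of how a client might consume this response, the sketch below picks the valid proxy with the lowest latency and checks whether more pages remain. The function names are hypothetical, the embedded payload is a trimmed copy of the sample result, and treating a lower `latency` value as better is an assumption based on the field name.

```python
import json

# Trimmed copy of the sample result above (only the fields used here).
SAMPLE = """
{
    "proxies": [
        {"ip": "91.229.222.163", "port": 53281, "is_valid": true, "latency": 23.0},
        {"ip": "75.151.213.85", "port": 8080, "is_valid": true, "latency": 268.0}
    ],
    "count": 1025,
    "per_page": 20,
    "page": 1,
    "total_page": 52
}
"""


def fastest_proxy_url(payload):
    """Return "http://ip:port" for the valid proxy with the lowest latency, or None."""
    valid = [p for p in payload["proxies"] if p["is_valid"]]
    if not valid:
        return None
    best = min(valid, key=lambda p: p["latency"])
    return "http://{}:{}".format(best["ip"], best["port"])


def has_next_page(payload):
    """True if the current page is not the last one."""
    return payload["page"] < payload["total_page"]


payload = json.loads(SAMPLE)
print(fastest_proxy_url(payload))
print(has_next_page(payload))
```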

System Statistics

http://localhost:8899/api/v1/stats

Sample result:

{
    "median": 181.2566407083,
    "valid_count": 1780,
    "total_count": 9528,
    "mean": 174.3290085201
}
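One simple use of these statistics is to estimate what share of the crawled proxies is currently valid. The sketch below works on a copy of the sample result. `valid_ratio` is a hypothetical helper, not part of Scylla.

```python
import json

# Copy of the sample result above.
SAMPLE_STATS = """
{
    "median": 181.2566407083,
    "valid_count": 1780,
    "total_count": 9528,
    "mean": 174.3290085201
}
"""


def valid_ratio(stats):
    """Fraction of proxies in the database that are marked valid."""
    total = stats["total_count"]
    # Guard against an empty database right after the first start.
    return stats["valid_count"] / total if total else 0.0


stats = json.loads(SAMPLE_STATS)
print("{:.1%} of proxies are valid".format(valid_ratio(stats)))
```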

HTTP Forward Proxy Server

By default, Scylla starts an HTTP forward proxy server on port
8081. Whenever an HTTP request comes in, the server randomly selects
a recently updated proxy from the database and forwards the request
through it.

Note: HTTPS requests are not supported at present.

An example of using this proxy server with curl:

curl http://api.ipify.org -x http://127.0.0.1:8081

You can also use this feature with requests:

import requests

# Route plain HTTP traffic through Scylla's forward proxy
requests.get('http://api.ipify.org', proxies={'http': 'http://127.0.0.1:8081'})
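If requests is not available, the same forward proxy can be used from the standard library. This is a minimal sketch, not something the project documents. `make_opener` is a hypothetical helper, and the proxy address assumes the default port 8081 on the local machine.

```python
import urllib.request

# Default forward proxy address described above.
SCYLLA_FORWARD_PROXY = "http://127.0.0.1:8081"


def make_opener(proxy_url=SCYLLA_FORWARD_PROXY):
    """Hypothetical helper: build a urllib opener that routes HTTP through Scylla."""
    # Only the "http" scheme is mapped, because HTTPS requests
    # are not supported by the forward proxy at present.
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


opener = make_opener()

# With Scylla running, the following would print the IP address seen by the target site:
# print(opener.open("http://api.ipify.org", timeout=10).read().decode())
```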

Web UI

Open http://localhost:8899 in your browser to see the Web UI of this
project.

Proxy IP List

http://localhost:8899/

[Screenshot: proxy IP list]

Global Geographical Distribution Map

http://localhost:8899/#/geo

[Screenshot: geographical distribution map]

API Documentation

Please read the Module Index.

Roadmap

Please see Projects.

Development and Contribution

git clone https://github.com/imWildCat/scylla.git
cd scylla

pip install -r requirements.txt

npm install # or `yarn install`
make build-assets

Testing

If you wish to run tests locally, the commands are shown below:

pip install -r tests/requirements-test.txt
pytest tests/

You are welcome to add more test cases to make this project more
robust.
