transistor
The web is full of data. Transistor is a web scraping framework for collecting, storing, and using targeted data from structured web pages.
Transistor's current strengths are in being able to:
- provide an interface to the Splash headless browser / JavaScript rendering service.
- offer optional support for using the scrapinghub.com Crawlera 'smart' proxy service.
- ingest keyword search terms from a spreadsheet, or use RabbitMQ or Redis as a message broker, transforming the keywords into task queues.
- scale one Spider into an arbitrary number of workers combined into a WorkGroup.
- coordinate an arbitrary number of WorkGroups, searching an arbitrary number of websites, in one scrape job.
- send out all the WorkGroups concurrently, using gevent-based asynchronous I/O (see the sketch after this list).
- return data from each website for each search term 'task' in the list, for easy website-to-website comparison.
- export data to CSV, XML, JSON, pickle, file object, and/or your own custom exporter.
- save targeted scrape data to the database of your choice.
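To make the gevent fan-out concrete, here is a minimal sketch of the concurrency pattern described above. It is not Transistor's API: the `fetch` worker, the search terms, and the URL are hypothetical stand-ins, showing only how one greenlet per search-term/site pair can be spawned and joined.

```python
# Minimal sketch of gevent-based fan-out; not Transistor's API.
from gevent import monkey

monkey.patch_all()  # patch stdlib sockets so requests cooperates with gevent

import gevent
import requests

def fetch(term, url):
    """Hypothetical worker: run one search-term 'task' against one site."""
    resp = requests.get(url, params={"q": term}, timeout=30)
    return term, url, resp.status_code

search_terms = ["book title 1", "book title 2"]   # the task list
sites = ["https://example.com/search"]            # one site per WorkGroup

# one greenlet per (term, site) pair, all sent out concurrently
jobs = [gevent.spawn(fetch, t, s) for t in search_terms for s in sites]
gevent.joinall(jobs)

for job in jobs:
    print(job.value)  # (term, url, status) for easy cross-site comparison
```

In Transistor, this pattern is wrapped in the Spider, WorkGroup, and scrape-job abstractions listed above rather than spawned by hand.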
Suitable use cases include:
- comparing attributes like stock status and price for a list of book titles or part numbers across multiple websites (a CSV export sketch follows this list).
- concurrently processing a large list of search terms on a search engine and then scraping the results, or following links first and then scraping the results.
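As a hedged illustration of the comparison use case, the snippet below writes per-site rows to CSV with the standard library. The rows, field names, and file name are invented for the example; Transistor's own CSV exporter (listed among the strengths above) would normally handle this step.

```python
# Illustrative only: per-site comparison rows written to CSV.
# The data and field names here are invented for the example.
import csv

rows = [
    {"term": "Python Cookbook", "site": "site-a.example", "stock": "in stock", "price": "39.99"},
    {"term": "Python Cookbook", "site": "site-b.example", "stock": "sold out", "price": "42.50"},
]

with open("comparison.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["term", "site", "stock", "price"])
    writer.writeheader()
    writer.writerows(rows)
```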
Primary goals:
- Enable scraping targeted data from a wide range of websites, including sites rendered with JavaScript.
- Navigate websites which present logins, custom forms, and other blockers to data collection, like captchas.
- Provide asynchronous I/O for task execution, using gevent.
- Easily integrate within a web app built with Flask, Django, or another Python-based web framework.
- Provide spreadsheet-based data ingest and export options, such as importing a list of search terms from Excel, ODS, or CSV files, and exporting data to each of those formats as well.
- Utilize quick and easy integrated task work queues which can be automatically filled with search terms by a simple spreadsheet import (see the sketch after this list).
- Integrate with more robust task queues like Celery, while using RabbitMQ or Redis as a message broker, as desired.
- Provide hooks for users to persist data via any method they choose, while also supporting our own opinionated choice: a PostgreSQL database along with newt.db.
- Contain useful abstractions, classes, and interfaces for scraping and crawling with machine learning assistance (WIP, timeline TBD).
- Further support data science use cases for the persisted data, where convenient and useful for us to provide in this library (WIP, timeline TBD).
- Provide a command line interface (low-priority WIP, timeline TBD).
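To make the spreadsheet-to-task-queue flow concrete, here is a minimal sketch that reads search terms from a CSV file and publishes them to a RabbitMQ queue with pika. The broker address, file name, and queue name are assumptions for illustration, not Transistor defaults.

```python
# Sketch of turning a spreadsheet of keywords into a task queue.
# Assumes a RabbitMQ broker on localhost; 'keywords.csv' and the
# queue name 'keywords' are illustrative, not Transistor defaults.
import csv

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="keywords", durable=True)

with open("keywords.csv", newline="") as f:
    for row in csv.reader(f):
        term = row[0].strip() if row else ""
        if term:
            channel.basic_publish(
                exchange="",            # default exchange routes by queue name
                routing_key="keywords",
                body=term.encode("utf-8"),
            )

connection.close()
```

A consumer on the other end (a WorkGroup, or a Celery task if using Celery as described above) can then pull terms from the queue at its own pace.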