transistor

The web is full of data. Transistor is a web scraping framework for collecting, storing, and using targeted data from structured web pages.

Transistor's current strengths lie in being able to:

  • provide an interface to the Splash headless browser / JavaScript rendering service.
  • provide optional support for the scrapinghub.com Crawlera 'smart' proxy service.
  • ingest keyword search terms from a spreadsheet, or use RabbitMQ or Redis as a message broker, transforming keywords into task queues.
  • scale one Spider into an arbitrary number of workers combined into a WorkGroup.
  • coordinate an arbitrary number of WorkGroups searching an arbitrary number of websites into one scrape job (see the sketch after this list).
  • send out all the WorkGroups concurrently, using gevent-based asynchronous I/O.
  • return data from each website for each search term 'task' in our list, for easy website-to-website comparison.
  • export data to CSV, XML, JSON, pickle, file object, and/or your own custom exporter.
  • save targeted scrape data to the database of your choice.
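
To make the WorkGroup idea concrete, here is a minimal, self-contained sketch of the pattern using gevent directly. The function names, placeholder record fields, and example websites are illustrative assumptions rather than Transistor's actual API; in Transistor the per-site worker drives a Splash-backed spider and the library's WorkGroup classes handle the coordination.

```python
import gevent
from gevent.pool import Pool


def scrape(website, task):
    """Hypothetical worker: scrape one search-term 'task' from one website.

    A real Transistor worker drives a Splash-backed spider here; this stub
    only returns a placeholder record so the example stays self-contained.
    """
    return {"website": website, "task": task, "price": None, "stock": None}


def run_work_group(website, tasks, workers=5):
    """Scale one spider into `workers` concurrent greenlets for one website."""
    pool = Pool(workers)
    return pool.map(lambda task: scrape(website, task), tasks)


def run_scrape_job(websites, tasks):
    """Coordinate one WorkGroup per website into a single scrape job."""
    groups = [gevent.spawn(run_work_group, site, tasks) for site in websites]
    gevent.joinall(groups)
    # Flatten the per-website results for website-to-website comparison.
    return [record for group in groups for record in group.value]


if __name__ == "__main__":
    for record in run_scrape_job(
        websites=["books.example.com", "parts.example.com"],  # illustrative only
        tasks=["Python Testing", "Clean Code"],
    ):
        print(record)
```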

Suitable use cases include:

  • comparing attributes like stock status and price, for a list of book titles or part numbers, across multiple websites (see the ingest-and-compare sketch after this list).
  • concurrently processing a large list of search terms on a search engine and then scraping the results, or following links first and then scraping the results.
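
As a rough sketch of the first use case, the snippet below loads a list of titles or part numbers from a CSV export of a spreadsheet and pivots flat scrape records into a per-term, per-website table for comparison. The column name 'title' and the record keys ('task', 'website', 'price', 'stock') are assumptions made for illustration; Transistor's own ingest and exporter support covers this for real workflows.

```python
import csv
from collections import defaultdict


def load_search_terms(path, column="title"):
    """Read search terms from a spreadsheet exported as CSV.

    `column` is a hypothetical header name; change it to match your file.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [row[column].strip() for row in csv.DictReader(f) if row.get(column)]


def compare_by_term(results):
    """Pivot flat scrape records into {term: {website: {price, stock}}}
    so each search term can be compared website-to-website."""
    table = defaultdict(dict)
    for record in results:
        table[record["task"]][record["website"]] = {
            "price": record.get("price"),
            "stock": record.get("stock"),
        }
    return dict(table)
```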

Primary goals:

  1. Enable scraping targeted data from a wide range of websites, including sites rendered with JavaScript.
  2. Navigate websites which present logins, custom forms, and other blockers to data collection, like captchas.
  3. Provide asynchronous I/O for task execution, using gevent.
  4. Easily integrate with a web app built on Flask, Django, or another Python-based web framework.
  5. Provide spreadsheet-based data ingest and export options, such as importing a list of search terms from Excel, ODS, or CSV files and exporting data to each of those formats as well.
  6. Utilize quick and easy integrated task work queues which can be automatically filled with search terms by a simple spreadsheet import (see the broker-backed queue sketch after this list).
  7. Integrate with more robust task queues like Celery, while using RabbitMQ or Redis as a message broker, as desired.
  8. Provide hooks for users to persist data via any method they choose, while also supporting our own opinionated choice: a PostgreSQL database along with newt.db.
  9. Contain useful abstractions, classes, and interfaces for scraping and crawling with machine learning assistance (wip, timeline tbd).
  10. Further support data science use cases of the persisted data, where convenient and useful for us to provide in this library (wip, timeline tbd).
  11. Provide a command line interface (low priority wip, timeline tbd).
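
As a sketch of how goals 5 through 7 fit together, the snippet below fills a Redis list with search terms (for example, terms loaded from a spreadsheet) and pops them back off as tasks, using the redis-py client directly. The queue name is an illustrative assumption; Transistor's integrated work queues and broker support wrap this kind of plumbing, and a Celery worker could consume the same broker.

```python
import redis  # pip install redis


def fill_queue(terms, queue="transistor:keywords"):
    """Push search terms (e.g. loaded from a spreadsheet) onto a Redis list.

    The queue name is an illustrative assumption, not a Transistor default.
    """
    r = redis.Redis()  # assumes a local Redis broker on the default port
    if terms:
        r.lpush(queue, *terms)
    return len(terms)


def next_task(queue="transistor:keywords"):
    """Pop one keyword task off the queue; returns None when it is empty."""
    r = redis.Redis()
    raw = r.rpop(queue)
    return raw.decode("utf-8") if raw else None
```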
