Gerapy Playwright

This package adds Playwright support to Scrapy. It is also a module
of Gerapy.


pip3 install gerapy-playwright


You can use PlaywrightRequest to mark a request that should be rendered by Playwright.

For example:

yield PlaywrightRequest(detail_url, callback=self.parse_detail)

You also need to enable PlaywrightMiddleware in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}

Congratulations, you’ve finished all of the required configuration.

If you run the spider again, Playwright will start and render every
web page that is requested through a PlaywrightRequest.


GerapyPlaywright provides some optional settings.


Concurrency

You can use Scrapy’s own CONCURRENT_REQUESTS setting to control the
concurrency of Playwright.
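As a minimal sketch, the fragment below limits concurrency in a project's settings.py. CONCURRENT_REQUESTS is Scrapy's standard setting; the value 3 is only an illustrative assumption, not a recommendation from this package.

```python
# settings.py (fragment)
# Scrapy's standard limit on concurrent requests. Requests rendered by
# Playwright count towards this limit like any other request.
# The value 3 is an illustrative assumption.
CONCURRENT_REQUESTS = 3
```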


Pretend as Real Browser

Some websites detect WebDriver or headless mode. GerapyPlaywright can
make Chromium look like a regular browser by injecting scripts. This is
enabled by default.

If the website does not detect WebDriver, you can disable this feature to speed up rendering.


You can also use the pretend argument of PlaywrightRequest to override this setting for a single request.
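A hedged sketch of turning the feature off globally is shown below. The setting name GERAPY_PLAYWRIGHT_PRETEND is the one referenced in the PlaywrightRequest argument list of this document; that it takes a plain boolean is an assumption.

```python
# settings.py (fragment)
# Disable the browser-disguise scripts for every request.
# Assumption: the setting accepts a boolean, with True as the default.
GERAPY_PLAYWRIGHT_PRETEND = False
```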

Logging Level

By default, Playwright logs all debug messages, so GerapyPlaywright
sets the logging level of Playwright to WARNING.

If you want to see more logs from Playwright, you can change this setting:

import logging
# Setting name assumed from the GERAPY_PLAYWRIGHT_* naming used in this document
GERAPY_PLAYWRIGHT_LOGGING_LEVEL = logging.DEBUG

Download Timeout

Playwright may take some time to render a web page. You can change the download timeout; the default is 30 seconds:

# playwright timeout
GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT = 30


Headless

By default, Playwright runs in headless mode. You can change the
headless setting to False if you need a visible browser window; the default is True.
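A possible settings.py fragment is sketched below. The setting name GERAPY_PLAYWRIGHT_HEADLESS does not appear in this document; it is an assumption based on the GERAPY_PLAYWRIGHT_* naming used for the other settings, so check the package source before relying on it.

```python
# settings.py (fragment)
# Assumed setting name (not confirmed by this document).
# False opens a visible browser window, which helps when debugging.
GERAPY_PLAYWRIGHT_HEADLESS = False
```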


Window Size

You can also set the width and height of the Playwright window. The
default is 1400 x 700.
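The fragment below sketches how the window size might be configured. Both setting names are assumptions derived from the GERAPY_PLAYWRIGHT_* naming used elsewhere in this document; only the default values 1400 and 700 come from the text above.

```python
# settings.py (fragment)
# Assumed setting names (not confirmed by this document).
# The values shown are the documented defaults.
GERAPY_PLAYWRIGHT_WINDOW_WIDTH = 1400
GERAPY_PLAYWRIGHT_WINDOW_HEIGHT = 700
```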


Proxy

You can set a proxy via the settings below:

GERAPY_PLAYWRIGHT_PROXY = 'http://x.x.x.x:x'
GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL = {
  'username': 'xxx',
  'password': 'xxxx'
}


Screenshot

You can take a screenshot of the loaded page by passing screenshot args to PlaywrightRequest as a dict.

Below are the supported args:

  • type (str): Screenshot type, either jpeg or png. Defaults to png.
  • quality (int): The quality of the image, between 0 and 100. Not applicable to png images.
  • full_page (bool): When True, take a screenshot of the full scrollable page. Defaults to False.
  • clip (dict): An object which specifies the clipping region of the page. It should have the following fields:
    • x (int): x-coordinate of the top-left corner of the clip area.
    • y (int): y-coordinate of the top-left corner of the clip area.
    • width (int): width of the clipping area.
    • height (int): height of the clipping area.
  • omit_background (bool): Hide the default white background and allow capturing screenshots with transparency.
  • timeout (float): Maximum time in milliseconds. Defaults to 30 seconds; pass 0 to disable the timeout.

See the Playwright documentation for page.screenshot for more details.
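The constraints in the list above can be checked before a request is sent. The helper below is a minimal sketch for illustration only: check_screenshot_args is a hypothetical name, not part of gerapy-playwright or Playwright, and it only encodes the rules stated in this section.

```python
# Hypothetical helper, not part of gerapy-playwright.
ALLOWED_SCREENSHOT_ARGS = {
    'type', 'quality', 'full_page', 'clip', 'omit_background', 'timeout',
}
CLIP_FIELDS = {'x', 'y', 'width', 'height'}


def check_screenshot_args(args):
    """Return a list of problems found in a screenshot args dict (empty if none)."""
    problems = []
    unknown = set(args) - ALLOWED_SCREENSHOT_ARGS
    if unknown:
        problems.append('unknown args: ' + ', '.join(sorted(unknown)))
    image_type = args.get('type', 'png')  # png is the documented default
    if image_type not in ('jpeg', 'png'):
        problems.append("type must be 'jpeg' or 'png'")
    if 'quality' in args:
        if image_type == 'png':
            problems.append('quality is not applicable to png images')
        elif not 0 <= args['quality'] <= 100:
            problems.append('quality must be between 0 and 100')
    if 'clip' in args:
        missing = CLIP_FIELDS - set(args['clip'])
        if missing:
            problems.append('clip is missing: ' + ', '.join(sorted(missing)))
    return problems
```

Calling check_screenshot_args({'type': 'png', 'full_page': True}) returns an empty list, while passing quality together with type png reports a problem.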

For example:

yield PlaywrightRequest(start_url, callback=self.parse_index, wait_for='.item .name', screenshot={
            'type': 'png',
            'full_page': True
        })
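You can then get the screenshot result from response.meta['screenshot'].

The simplest approach is to save it to a file:

def parse_index(self, response):
    with open('screenshot.png', 'wb') as f:
        # Assumes the screenshot is stored as a BytesIO-like object
        f.write(response.meta['screenshot'].getbuffer())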
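If you are unsure whether the screenshot arrives as raw bytes or as an in-memory buffer, a small helper can handle both. This is a sketch under that assumption; save_screenshot is a hypothetical name, and the data used below is a stand-in, not a real image.

```python
import io
import os
import tempfile


def save_screenshot(screenshot, path):
    """Write screenshot data to `path` and return the number of bytes written."""
    # Accept either raw bytes or a BytesIO-like object exposing getvalue()
    if hasattr(screenshot, 'getvalue'):
        data = screenshot.getvalue()
    else:
        data = bytes(screenshot)
    with open(path, 'wb') as f:
        return f.write(data)


# Usage with stand-in data written to the system temp directory
path = os.path.join(tempfile.gettempdir(), 'screenshot_demo.png')
written = save_screenshot(io.BytesIO(b'fake-image-bytes'), path)
```

Inside a spider callback you would pass response.meta['screenshot'] as the first argument.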

If you want to enable screenshots for all requests, you can configure GERAPY_PLAYWRIGHT_SCREENSHOT.

For example:

GERAPY_PLAYWRIGHT_SCREENSHOT = {
    'type': 'png',
    'full_page': True
}


PlaywrightRequest

PlaywrightRequest provides arguments that can override the global settings above.

  • url: request url
  • callback: callback function
  • wait_until: one of “load”, “domcontentloaded”, “networkidle”;
    default is domcontentloaded
  • wait_for: wait for an element to load, also supports dict
  • script: script to execute
  • actions: actions defined for execution on the Page object
  • proxy: proxy to use for this request, like http://x.x.x.x:x
  • proxy_credential: the proxy credential, like {'username': 'xxxx', 'password': 'xxxx'}
  • sleep: time to sleep after the page is loaded, overrides GERAPY_PLAYWRIGHT_SLEEP
  • timeout: load timeout, overrides GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT
  • ignore_resource_types: ignored resource types, overrides GERAPY_PLAYWRIGHT_IGNORE_RESOURCE_TYPES
  • pretend: pretend to be a normal browser, overrides GERAPY_PLAYWRIGHT_PRETEND
  • screenshot: screenshot args as described in the screenshot section, overrides GERAPY_PLAYWRIGHT_SCREENSHOT
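The override behaviour described in the list can be pictured as a simple precedence rule: a value passed on the request wins, otherwise the global setting applies. The sketch below only illustrates that rule; resolve_option is a hypothetical helper, the fallback default of 0 for sleep is an assumption, and this is not the middleware's actual code.

```python
def resolve_option(request_value, settings, setting_name, default=None):
    """Pick the per-request value if given, else the global setting, else a default."""
    # Compare against None so that explicit falsy values such as False or 0
    # on the request still override the global setting.
    if request_value is not None:
        return request_value
    return settings.get(setting_name, default)


# Stand-in for project settings
settings = {
    'GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT': 30,
    'GERAPY_PLAYWRIGHT_PRETEND': True,
}

timeout = resolve_option(10, settings, 'GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT')
pretend = resolve_option(False, settings, 'GERAPY_PLAYWRIGHT_PRETEND')
sleep = resolve_option(None, settings, 'GERAPY_PLAYWRIGHT_SLEEP', default=0)
```

Here timeout and pretend take the per-request values, while sleep falls back to the default because neither the request nor the settings define it.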

For example, you can configure PlaywrightRequest as:

from gerapy_playwright import PlaywrightRequest

def parse(self, response):
    yield PlaywrightRequest(url,
        callback=self.parse_detail,
        wait_until='domcontentloaded',
        wait_for='title',
        script='() => { return {name: "Germey"} }',
        sleep=2)

Then Playwright will:

  • wait for the document to load
  • wait for the title element to load
  • execute the given script
  • sleep for 2s
  • return the rendered web page content as the response
  • return the script result, available in response.meta['script_result']

For a waiting mechanism controlled by JavaScript, you can use await in the script, for example:

js = '''async () => {
    await new Promise(resolve => setTimeout(resolve, 10000));
    return {
        'name': 'Germey'
    }
}'''
yield PlaywrightRequest(url, callback=self.parse, script=js)

Then you can get the script result from response.meta['script_result']; the result is {'name': 'Germey'}.

If you find JavaScript awkward to write, you can use the actions argument to pass a Python function that operates on the page, for example:

async def execute_actions(page):
    await page.evaluate('() => { document.title = "Hello World"; }')
    return 1
yield PlaywrightRequest(url, callback=self.parse, actions=execute_actions)

Then you can get the actions result from response.meta['actions_result']; the result is 1.
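An actions function can also combine several page operations and return structured data. The sketch below assumes the page argument is a Playwright async Page, whose evaluate and title methods are part of Playwright's API; the function name scroll_and_read_title and the returned dict shape are illustrative choices.

```python
import asyncio


async def scroll_and_read_title(page):
    """Scroll to the bottom of the page, then return its title."""
    # Runs a JavaScript snippet in the page context
    await page.evaluate('() => { window.scrollTo(0, document.body.scrollHeight); }')
    # Reads the document title through the Page API instead of JavaScript
    title = await page.title()
    return {'title': title}
```

Passing it as actions=scroll_and_read_title to PlaywrightRequest should make the returned dict available in response.meta['actions_result'], in the same way as the result 1 above.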

You can also define proxy and proxy_credential for each request, for example:

yield PlaywrightRequest(
    url,
    callback=self.parse,
    proxy='http://x.x.x.x:x',
    proxy_credential={
      'username': 'xxxx',
      'password': 'xxxx'
    })

proxy and proxy_credential will override the settings GERAPY_PLAYWRIGHT_PROXY and GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL.


For more details, please see the example.

You can also run it directly with Docker:

docker run germey/gerapy-playwright-example


2021-12-27 16:54:14 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2021-12-27 16:54:14 [scrapy.utils.log] INFO: Versions: lxml, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.7.9 (default, Aug 31 2020, 07:22:35) - [Clang 10.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform Darwin-21.1.0-x86_64-i386-64bit
2021-12-27 16:54:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2021-12-27 16:54:14 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
 'NEWSPIDER_MODULE': 'example.spiders',
 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504],
 'SPIDER_MODULES': ['example.spiders']}
2021-12-27 16:54:14 [scrapy.extensions.telnet] INFO: Telnet Password: e931b241390ad06a
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled extensions:
2021-12-27 16:54:14 [gerapy.playwright] INFO: playwright libraries already installed
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled spider middlewares:
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled item pipelines:
2021-12-27 16:54:14 [scrapy.core.engine] INFO: Spider opened
2021-12-27 16:54:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-27 16:54:14 [scrapy.extensions.telnet] INFO: Telnet console listening on
2021-12-27 16:54:14 [] DEBUG: start url
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: processing request <GET>
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: set options {'headless': False}
cookies []
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: PRETEND_SCRIPTS is run
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: timeout 10
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: crawling
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: request with options {'url': '', 'wait_until': 'domcontentloaded'}
2021-12-27 16:54:18 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:18 [gerapy.playwright] DEBUG: sleep for 1s
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: close playwright
2021-12-27 16:54:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET> (referer: None)
2021-12-27 16:54:20 [] DEBUG: start url
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: processing request <GET>
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: set options {'headless': False}
2021-12-27 16:54:20 [] INFO: detail url
2021-12-27 16:54:20 [] INFO: detail url
2021-12-27 16:54:20 [] INFO: detail url
2021-12-27 16:54:20 [] INFO: detail url
2021-12-27 16:54:20 [] INFO: detail url
2021-12-27 16:54:20 [] INFO: detail url
2021-12-27 16:54:20 [] INFO: detail url
2021-12-27 16:54:20 [] INFO: detail url
2021-12-27 16:54:20 [] INFO: detail url
2021-12-27 16:54:20 [] INFO: detail url
cookies []
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: PRETEND_SCRIPTS is run
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: timeout 10
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: crawling
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: request with options {'url': '', 'wait_until': 'domcontentloaded'}
2021-12-27 16:54:23 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:24 [gerapy.playwright] DEBUG: sleep for 1s
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: close playwright
2021-12-27 16:54:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET> (referer: None)
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: processing request <GET>
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: set options {'headless': False}

