A fast HTML5 parser with CSS selectors using Modest engine

Installation

From PyPI using pip:

pip install selectolax

Development version from GitHub:

git clone --recursive https://github.com/rushter/selectolax
cd selectolax
pip install -r requirements_dev.txt
python setup.py install

How to compile selectolax while developing:

make clean
make dev

Basic examples

In [1]: from selectolax.parser import HTMLParser
   ...:
   ...: html = """
   ...: <h1 id="title" data-updated="20201101">Hi there</h1>
   ...: <div class="post">Lorem Ipsum is simply dummy text of the printing and typesetting industry. </div>
   ...: <div class="post">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</div>
   ...: """
   ...: tree = HTMLParser(html)

In [2]: tree.css_first('h1#title').text()
Out[2]: 'Hi there'

In [3]: tree.css_first('h1#title').attributes
Out[3]: {'id': 'title', 'data-updated': '20201101'}

In [4]: [node.text() for node in tree.css('.post')]
Out[4]:
['Lorem Ipsum is simply dummy text of the printing and typesetting industry. ',
 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.']

In [1]: html = "<div><p id=p1><p id=p2><p id=p3><a>link</a><p id=p4><p id=p5>text<p id=p6></div>"
   ...: selector = "div > :nth-child(2n+1):not(:has(a))"

In [2]: for node in HTMLParser(html).css(selector):
   ...:     print(node.attributes, node.text(), node.tag)
   ...:     print(node.parent.tag)
   ...:     print(node.html)
   ...:
{'id': 'p1'}  p
div
<p id="p1"></p>
{'id': 'p5'} text p
div
<p id="p5">text</p>

Simple Benchmark

Average over 10 runs of parsing 800 Google SERP pages and retrieving the URLs from them.

Package      Time        Memory (peak)
selectolax   2.38 sec.   768.11 MB
lxml         18.67 sec.  769.21 MB

License

Modest engine: LGPL2.1
selectolax: MIT

GitHub

https://github.com/rushter/selectolax
