ChatNoir Resiliparse

A collection of robust and fast processing tools for parsing and analyzing web archive data.

Resiliparse is part of the ChatNoir web analytics toolkit. If you use ChatNoir or any of its tools for a publication, you can make us happy by citing our ECIR demo paper:

@InProceedings{bevendorff:2018,
  address =             {Berlin Heidelberg New York},
  author =              {Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast},
  booktitle =           {Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018)},
  editor =              {Leif Azzopardi and Allan Hanbury and Gabriella Pasi and Benjamin Piwowarski},
  ids =                 {potthast:2018c,stein:2018c},
  month =               mar,
  publisher =           {Springer},
  series =              {Lecture Notes in Computer Science},
  site =                {Grenoble, France},
  title =               {{Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl}},
  year =                2018
}

Usage Instructions

For detailed information about the build process, dependencies, APIs, or usage instructions, please read the Resiliparse Documentation.

Resiliparse Module Summary

The Resiliparse collection currently comprises the following two modules:

1. Resiliparse

The Resiliparse main module with the following subcomponents:

Parsing Utilities

The Resiliparse Parsing Utilities are the largest submodule and provide an extensive (and growing) collection of efficient tools for dealing with encodings and raw protocol payloads, parsing HTML web pages, and preparing them for further processing by extracting structural or semantic information.

For more information, see Resiliparse Parsing Tools.

Process Guards

The Resiliparse Process Guard module is a set of decorators and context managers for guarding a processing context to stay within pre-defined limits for execution time and memory usage. Process Guards help to ensure the (partially) successful completion of batch processing jobs in which individual tasks may time out or use abnormal amounts of memory, but in which the success of the whole job is not threatened by (a few) individual failures. A guarded processing context will be interrupted upon exceeding its resource limits so that the task can be skipped or rescheduled.
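The guarding pattern described above can be illustrated with a small standard-library sketch. This is not the Resiliparse API or its implementation: `TaskTimeout`, `simple_time_guard`, `run_batch`, and the demo tasks are hypothetical names made up for this illustration, and the sketch relies on `SIGALRM`, so it only works on Unix and in the main thread. It only shows the idea of interrupting a single task that exceeds its time budget while the rest of the batch carries on.

```python
import signal
import time
from contextlib import contextmanager


class TaskTimeout(Exception):
    """Raised inside a guarded block when its time budget is exhausted."""


@contextmanager
def simple_time_guard(seconds):
    """Interrupt the enclosed block after `seconds` (Unix, main thread only)."""
    def _on_alarm(signum, frame):
        raise TaskTimeout(f"exceeded {seconds}s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def run_batch(tasks, seconds):
    """Run each (name, callable) task under the guard.

    Returns the results of completed tasks and the names of skipped ones.
    """
    results, skipped = {}, []
    for name, task in tasks:
        try:
            with simple_time_guard(seconds):
                results[name] = task()
        except TaskTimeout:
            # One failing task does not abort the whole job.
            skipped.append(name)
    return results, skipped


def fast_task():
    return 42


def slow_task():
    time.sleep(5)  # Simulates a task that hangs far beyond its budget.
    return 0
```

Calling `run_batch([("a", fast_task), ("b", slow_task)], 0.2)` would complete `a` and report `b` as skipped. For production use, refer to the Process Guards documentation for the actual decorators and context managers, which also cover memory limits.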

For more information, see Resiliparse Process Guards.

Itertools

Resiliparse Itertools are a collection of convenient and robust helper functions for iterating over data from unreliable sources using other tools from the Resiliparse toolkit.
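To make "iterating over data from unreliable sources" more concrete, here is a minimal standard-library sketch of an exception-tolerant loop. It does not use the Resiliparse Itertools API: `FlakySource` and `tolerant_iter` are hypothetical names invented for this example. The wrapper catches errors raised while fetching the next item and hands them to the caller instead of aborting the loop.

```python
class FlakySource:
    """Hypothetical unreliable source: every third read raises an error."""

    def __init__(self, n):
        self._n = n
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= self._n:
            raise StopIteration
        self._pos += 1
        if self._pos % 3 == 0:
            raise ConnectionError(f"read {self._pos} failed")
        return self._pos


def tolerant_iter(source):
    """Yield (item, error) pairs; error is None for successful reads."""
    it = iter(source)
    while True:
        try:
            item = next(it)
        except StopIteration:
            return
        except Exception as exc:
            # Report the failure to the caller and keep iterating.
            yield None, exc
            continue
        yield item, None


def collect(source):
    """Split a tolerant iteration into successful items and errors."""
    items, errors = [], []
    for item, err in tolerant_iter(source):
        if err is None:
            items.append(item)
        else:
            errors.append(err)
    return items, errors
```

Note that this only helps with iterators that can resume after an error. A plain Python generator is finished once it has raised, so the loop would simply end after the first reported exception.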

For more information, see Resiliparse Itertools.

2. FastWARC

FastWARC is a high-performance WARC parsing library for Python written in C++/Cython. The API is largely inspired by WARCIO, but does not aim to be a drop-in replacement. FastWARC supports compressed and uncompressed WARC/1.0 and WARC/1.1 streams. Supported compression algorithms are GZip and LZ4.

For more information, see the FastWARC documentation.

Building and Installing Resiliparse

The main Resiliparse package can be installed from PyPI as follows:

pip install resiliparse

You can also build Resiliparse directly from this repository with all or just some of its modules:

# Create venv (recommended, but not required)
python3 -m venv venv && source venv/bin/activate

# Install build dependencies
sudo apt install build-essential python3-dev zlib1g-dev liblz4-dev libuchardet-dev
pip install cython setuptools

# Build only FastWARC
BUILD_PACKAGES=fastwarc python setup.py install

# Build all modules
python setup.py install

FastWARC is distributed as its own package and has to be installed separately. For optimal performance, it is recommended to build FastWARC from sources instead of relying on the pre-built binaries.

# Option 1: Install pre-built binaries:
pip install fastwarc

# Option 2: Install from sources (requires build-time dependencies to be installed,
#           but is recommended for better performance, see FastWARC docs):
pip install --no-binary fastwarc fastwarc