Metadata-Version: 2.4
Name: Resiliparse
Version: 1.0.1
Summary: A collection of robust and fast processing tools for parsing and analyzing (not only) web archive data.
Author: Janek Bevendorff
License-Expression: Apache-2.0
Project-URL: repository, https://github.com/chatnoir-eu/chatnoir-resiliparse
Project-URL: homepage, https://resiliparse.chatnoir.eu
Project-URL: documentation, https://resiliparse.chatnoir.eu
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: fastwarc==1.0.1
Provides-Extra: all
Requires-Dist: resiliparse[beam,cli]; extra == "all"
Provides-Extra: beam
Requires-Dist: apache_beam[aws]>=2.37.0; python_version < "3.14" and extra == "beam"
Requires-Dist: boto3>=1.9; extra == "beam"
Requires-Dist: elasticsearch>=7.0.0; extra == "beam"
Provides-Extra: cli
Requires-Dist: click; extra == "cli"
Requires-Dist: joblib; extra == "cli"
Requires-Dist: tqdm; extra == "cli"
Provides-Extra: cli-benchmark
Requires-Dist: beautifulsoup4; extra == "cli-benchmark"
Requires-Dist: langid; extra == "cli-benchmark"
Requires-Dist: selectolax; extra == "cli-benchmark"
Provides-Extra: test
Requires-Dist: resiliparse[all]; extra == "test"
Requires-Dist: pytest>=7.2.2; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"

# ChatNoir Resiliparse

A collection of robust and fast processing tools for parsing and analyzing web archive data written in Rust and
Cython/C++ with bindings for Python.

Resiliparse is a part of the [ChatNoir web analytics toolkit](https://github.com/chatnoir-eu/).

## Installing Resiliparse

Pre-built Resiliparse binaries can be installed from PyPI:

```bash
pip install resiliparse
```

## Building Resiliparse From Source

To build Resiliparse from source, you first need to install the build-time dependencies listed in
`vcpkg.json`. You can install them globally via your package manager, but the easiest and most consistent way
is to use [vcpkg](https://vcpkg.io/en/):

```bash
# Install vcpkg itself (skip if you have a working vcpkg installation already)
git clone https://github.com/Microsoft/vcpkg
./vcpkg/bootstrap-vcpkg.sh

# Install dependencies to vcpkg_installed (must be run from sources root)
./vcpkg/vcpkg install --triplet=x64-linux
```

Replace the triplet value with one suitable for your platform. Besides `x64-linux`, valid values are `x64-windows`,
`x64-osx`, `arm64-osx`, and `aarch64-linux` (or any of the vcpkg default triplets).
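If you script the build, the triplet can be derived from the host platform. The following is a minimal sketch, not part of Resiliparse or vcpkg: `guess_triplet` is a hypothetical helper that maps the output of `uname -s` and `uname -m` to the triplet names listed above. Windows is left out, since `uname` is not generally available there.

```bash
# Hypothetical helper (an assumption, not shipped with Resiliparse or vcpkg):
# map an OS name and CPU architecture to one of the triplets named above.
# Both arguments are optional and default to the values reported by uname.
guess_triplet() {
    _os="${1:-$(uname -s)}"
    _arch="${2:-$(uname -m)}"
    case "${_os}/${_arch}" in
        Linux/x86_64)              echo "x64-linux" ;;
        Linux/aarch64|Linux/arm64) echo "aarch64-linux" ;;
        Darwin/x86_64)             echo "x64-osx" ;;
        Darwin/arm64)              echo "arm64-osx" ;;
        *)
            echo "unsupported platform: ${_os}/${_arch}" >&2
            return 1
            ;;
    esac
}
```

With the function defined, the install step could be written as `./vcpkg/vcpkg install --triplet="$(guess_triplet)"`.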

After installing the dependencies, you can build the actual Python packages:

```bash
# Create a fresh venv first (recommended)
python3 -m venv venv && source venv/bin/activate

# Option 1: Build and install in editable mode (best for development)
python3 -m pip install -e ./resiliparse-py

# Option 2 (alternative): Build and install wheels in separate steps (best for redistribution)
python3 -m pip wheel -w build ./resiliparse-py
ls ./build/*.whl | xargs python3 -m pip install
```

In most cases, the build routine should be smart enough to detect the location of the installed vcpkg dependencies.
However, in some cases you may get errors about missing header files or undefined symbols. This can happen if you
don't build from the source repository, use the `build` package (`python3 -m build`), or run `pip wheel` with
`--isolated`. To work around that, set the `RESILIPARSE_VCPKG_PATH` environment variable to the absolute path of the
directory containing the installed vcpkg dependencies (`vcpkg_installed`):

```bash
export RESILIPARSE_VCPKG_PATH="$(pwd)/vcpkg_installed"
```
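Because the variable must hold an absolute path to an existing directory, it can be worth validating it before starting a long build. The sketch below assumes nothing about how the Resiliparse build reads the variable; `check_vcpkg_path` is a hypothetical helper that only checks that the value is absolute and points to a directory.

```bash
# Hypothetical sanity check (an assumption, not part of the Resiliparse build):
# succeed only if the given path (default: $RESILIPARSE_VCPKG_PATH) is an
# absolute path to an existing directory.
check_vcpkg_path() {
    _dir="${1:-${RESILIPARSE_VCPKG_PATH:-}}"
    case "${_dir}" in
        /*) ;;  # absolute path, continue
        *)
            echo "not an absolute path: '${_dir}'" >&2
            return 1
            ;;
    esac
    if [ ! -d "${_dir}" ]; then
        echo "no such directory: ${_dir}" >&2
        return 1
    fi
}
```

Running `check_vcpkg_path && python3 -m pip wheel -w build ./resiliparse-py` would then start the build only if the check passes.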

**NOTE:** Unless you fix up the wheels to embed the linked shared libraries (via
[auditwheel](https://github.com/pypa/auditwheel) on Linux, [delocate-wheel](https://github.com/matthew-brett/delocate)
on macOS, or [delvewheel](https://github.com/adang1345/delvewheel) on Windows), you will have to add the vcpkg library
directory (`vcpkg_installed/TRIPLET/lib`) to your library search path to use them. On Linux, add the directory to the
`LD_LIBRARY_PATH` environment variable; on macOS, add it to `DYLD_LIBRARY_PATH`; on Windows, add it to the `Path`
environment variable.
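On Linux and macOS, extending the search path could look like the following sketch. It assumes you run it from the sources root and that `TRIPLET` is set to the triplet you passed to `vcpkg install`; the variable names `TRIPLET` and `VCPKG_LIB_DIR` are illustrative only.

```bash
# Assumptions: the current directory is the sources root, and TRIPLET matches
# the triplet used for `vcpkg install` (x64-linux here, as in the example above).
TRIPLET="x64-linux"
VCPKG_LIB_DIR="$(pwd)/vcpkg_installed/${TRIPLET}/lib"

# Prepend the vcpkg library directory, keeping any existing entries.
case "$(uname -s)" in
    Darwin)
        export DYLD_LIBRARY_PATH="${VCPKG_LIB_DIR}${DYLD_LIBRARY_PATH:+:${DYLD_LIBRARY_PATH}}"
        ;;
    *)
        export LD_LIBRARY_PATH="${VCPKG_LIB_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
        ;;
esac
```

The setting only applies to the current shell session and the processes started from it.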

Here's an example of how to use `auditwheel` on Linux to fix up the built wheels:

```bash
LD_LIBRARY_PATH=$(pwd)/vcpkg_installed/x64-linux/lib \
  auditwheel repair --plat linux_x86_64 build/Resiliparse*.whl
```

(Please note that `linux_x86_64` platform wheels
are [not suitable for general redistribution](https://packaging.python.org/en/latest/specifications/platform-compatibility-tags/#platform-tag).)

## Usage Instructions

For detailed usage instructions, please consult
the [Resiliparse User Manual](https://resiliparse.chatnoir.eu/en/latest/).

## Cite Us

If you use ChatNoir or Resiliparse, please consider citing
our [ECIR 2018 demo paper](https://webis.de/downloads/publications/papers/bevendorff_2018.pdf):

```bibtex
@InProceedings{bevendorff:2018,
  address =             {Berlin Heidelberg New York},
  author =              {Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast},
  booktitle =           {Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018)},
  editor =              {Leif Azzopardi and Allan Hanbury and Gabriella Pasi and Benjamin Piwowarski},
  month =               mar,
  publisher =           {Springer},
  series =              {Lecture Notes in Computer Science},
  site =                {Grenoble, France},
  title =               {{Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl}},
  year =                2018
}
```
