Metadata-Version: 2.4
Name: bcql_py
Version: 0.3.2
Summary: A Python parser for BlackLab Corpus Query Language
Project-URL: Homepage, https://github.com/BramVanroy/bcql_py
Project-URL: Documentation, https://bramvanroy.github.io/bcql_py/
Project-URL: Issues, https://github.com/BramVanroy/bcql_py/issues
Project-URL: Repository, https://github.com/BramVanroy/bcql_py.git
Author-email: Bram Vanroy <2779410+BramVanroy@users.noreply.github.com>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: MkDocs
Classifier: Framework :: Pydantic :: 2
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development :: Build Tools
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: pydantic<3,>=2.12.0
Description-Content-Type: text/markdown

# A Python parser for BlackLab Corpus Query Language

[![Documentation](https://img.shields.io/badge/documentation-4051b5)](https://bramvanroy.github.io/bcql_py/)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/bcql_py)
[![codecov](https://codecov.io/gh/BramVanroy/bcql_py/branch/main/graph/badge.svg)](https://codecov.io/gh/BramVanroy/bcql_py)
[![Interrogate coverage](https://raw.githubusercontent.com/BramVanroy/bcql_py/main/.github/interrogate_badge.svg)](https://github.com/BramVanroy/bcql_py/actions/workflows/interrogate-badge.yml)
[![License](https://img.shields.io/github/license/BramVanroy/bcql_py)](LICENSE)

<!-- --8<-- [start:overview] -->

A full-coverage Python parser for the **[BlackLab](https://github.com/instituutnederlandsetaal/BlackLab/) Corpus
Query Language (BCQL)** that converts query strings into a
[Pydantic v2](https://docs.pydantic.dev/) AST (Abstract Syntax Tree) with lossless round-trip
reconstruction and structured error reporting.

To get started, you can check out:

- [A Quickstart guide][readme-quickstart]
- General [guides][readme-guides] on `bcql_py` and BCQL
- The full [API reference][readme-api-reference]
- [Python code examples](https://github.com/BramVanroy/bcql_py/tree/main/examples)
- A [Gradio demo](https://huggingface.co/spaces/BramVanroy/bcql_py_validation)

## Features

- **Complete BCQL coverage**: token queries, sequences, repetitions, spans, lookarounds, captures,
  global constraints, relations, alignments, and built-in functions.
- **Immutable Pydantic v2 AST**: every node is a frozen `BaseModel` subclass with a `node_type`
  discriminator, making inspection and pattern matching straightforward.
- **Lossless BCQL round-trip**: [`to_bcql()`][readme-to-bcql]
  reproduces the original query (preserving shorthand forms, quote characters, sensitivity flags, etc.).
- **Position-aware syntax errors**: [`BCQLSyntaxError`][readme-syntax-error]
  carries the original query, the 0-based offset, and a caret-annotated message, ready to forward to
  a user or an LLM.
- **Optional semantic validation**: a [`CorpusSpec`][readme-corpus-spec]
  describes which annotations, span tags, alignment fields, and dependency relations your corpus
  supports. Pass it as ``parse(query, spec=spec)`` to catch typos and unsupported features before
  they reach the corpus. See the [tagset validation guide][readme-tagset-validation].
- **Zero runtime dependencies** beyond Pydantic.
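
To illustrate what a caret-annotated message built from a query and a 0-based offset can look
like, here is a minimal standalone sketch. It does not use `bcql_py` and is not the library's
implementation: the helper name `annotate_with_caret` and the exact message layout are
assumptions made for illustration only.

```python
def annotate_with_caret(query: str, offset: int, message: str) -> str:
    """Build a three-line error message with a caret under ``offset``.

    Hypothetical helper for illustration; not part of the bcql_py API.
    """
    # An offset equal to len(query) is allowed so that "unexpected end of
    # input" errors can point just past the last character.
    if not 0 <= offset <= len(query):
        raise ValueError(f"offset {offset} is outside the query (length {len(query)})")

    # Pad with spaces so the caret lines up with the offending character.
    caret_line = " " * offset + "^"
    return f"{message} (at offset {offset})\n{query}\n{caret_line}"


# A token query with a missing closing bracket; the error sits at the end.
broken_query = '[word="man"'
print(annotate_with_caret(broken_query, len(broken_query), "expected ']'"))
```

Keeping the raw query and the offset alongside the rendered message means a caller can either
show the message as-is or re-render the position in its own UI.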

## Installation

```bash
pip install bcql_py
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv add bcql_py
```

## Try the demo

A small Gradio app under [`app/`](https://github.com/BramVanroy/bcql_py/tree/main/app)
lets you paste a BCQL query, pick or build a `CorpusSpec`, and inspect parse +
validation results. The hosted demo runs on Hugging Face Spaces at
[BramVanroy/bcql_py_validation](https://huggingface.co/spaces/BramVanroy/bcql_py_validation).

To run it locally:

```sh
uv sync --group app
uv run python app/app.py
```

## Development

Clone and set up the project:

```bash
git clone https://github.com/BramVanroy/bcql_py.git
cd bcql_py
uv sync --dev
```

Enable pre-commit hooks:

```bash
uv run pre-commit install
```

After installation, hooks run automatically on every `git commit`.
We do style checking with ruff and type checking with mypy.
You can also run them manually across the whole repo:

```bash
uv run pre-commit run --all-files
```

To work on documentation locally:

```bash
make serve-docs
```

This rebuilds a fresh local mike preview before serving it, so you do not end
up testing stale versioned docs. By default it serves a local `0.3.0` version
and `latest` alias from a temporary docs branch. You can override those values
when needed, for example:

```bash
DOCS_VERSION=0.4.0 DOCS_SOURCE_REF=v0.4.0 make serve-docs
```

Open both `/latest/` and `/<version>/` while testing. If you are checking the
GitHub source links as well, set `DOCS_SOURCE_REF` to the release tag you want
to emulate.

You should run the tests before pushing to the remote, although
a GitHub workflow will run them anyway on push. To run them locally:

```bash
make test
```

<!-- --8<-- [end:overview] -->

[readme-quickstart]: docs/quickstart.md
[readme-guides]: docs/guides/index.md
[readme-api-reference]: docs/api/top_level.md
[readme-to-bcql]: src/bcql_py/models/base.py
[readme-syntax-error]: src/bcql_py/exceptions.py
[readme-corpus-spec]: src/bcql_py/validation/spec.py
[readme-tagset-validation]: docs/guides/tagset-validation.md

## ANTLR to generate the needed tools

BlackLab uses ANTLR to generate the parser/lexer in Java based on a
[g4 file](https://github.com/instituutnederlandsetaal/BlackLab/blob/e248fc2acf2b8cf44deb2564e8b24138b140d4ca/query-parser/src/main/antlr4/nl/inl/blacklab/queryParser/corpusql/Bcql.g4#L1-L97).
We could similarly generate Python files. However, after trying it out, I found the generated files
hard to read, and I'm not fond of requiring an extra external (Java-based) library. That is not a slight against ANTLR;
I am simply not familiar with the tool, and I am sure it is incredibly powerful and useful if you know
how to use it. To keep a clearer view of this library, I therefore strive for a Python-native
implementation that is true to the spec. It's also just a fun project that I do not wish to "automate
away" (though I might regret that later). At a later time (TODO) I might implement functionality to
cross-validate our implementation against the generated ANTLR parser and lexer. For now I will be
satisfied with high-coverage testing. In case of doubt I have followed the Bcql.g4 file.

If you'd like to try the ANTLR route yourself, proceed as follows:

1. Install the requirements (these are not included in our `pyproject.toml` file, so you need to
   install them yourself):

   ```sh
   uv pip install requests antlr4-tools antlr4-python3-runtime
   ```

2. Download the BlackLab G4 definition from GitHub. You can optionally specify a `--branch` or
   `--tag`; the default is `--branch dev`.

   ```sh
   uv run python scripts/get_bcql_g4.py
   # Saved to parser/Bcql.g4
   cd parser/
   ```

3. Run ANTLR (you can update `-v` to [the latest version](https://github.com/antlr/antlr4/releases)
   if needed)

   ```sh
   antlr4 -v 4.13.2 -Dlanguage=Python3 Bcql.g4
   ```
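
If you prefer to fetch the grammar without the helper script, the sketch below shows one way to
build the raw GitHub URL for a given branch or tag and download it with the standard library.
It is not the code in `scripts/get_bcql_g4.py`: the function names are hypothetical, the file
path is taken from the grammar link above and is assumed to be the same on the ref you request,
and `urllib` is used here instead of `requests`.

```python
from pathlib import Path
from urllib.request import urlopen

RAW_BASE = "https://raw.githubusercontent.com"
REPO = "instituutnederlandsetaal/BlackLab"
# Path of the grammar inside the BlackLab repository, as in the link above.
# Assumption: the file lives at the same path on the requested ref.
G4_PATH = "query-parser/src/main/antlr4/nl/inl/blacklab/queryParser/corpusql/Bcql.g4"


def grammar_url(ref: str = "dev") -> str:
    """Return the raw-content URL of Bcql.g4 for a branch or tag name."""
    return f"{RAW_BASE}/{REPO}/{ref}/{G4_PATH}"


def download_grammar(ref: str = "dev", dest: str = "parser/Bcql.g4") -> Path:
    """Download the grammar for ``ref`` and write it to ``dest``."""
    target = Path(dest)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Network access happens only when this function is called.
    with urlopen(grammar_url(ref)) as response:
        target.write_bytes(response.read())
    return target
```

Calling `download_grammar()` would write the `dev` version to `parser/Bcql.g4`, after which the
`antlr4` command from step 3 can be run from that directory.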

## Acknowledgments

- [BlackLab](https://blacklab.ivdnt.org/)
- Robert Nystrom's guide on ["Crafting Interpreters"](https://craftinginterpreters.com/scanning.html),
  specifically the part on "Scanning". Token types and error handling in `bcql_py` are heavily
  inspired by his work.
- Jamis Buck's [blog post on recursive descent parsers](https://weblog.jamisbuck.org/2015/7/30/writing-a-simple-recursive-descent-parser.html)
- Berkeley [course notes on BNF](https://cs61a.org/study-guide/bnf/)
