Metadata-Version: 2.4
Name: pqfilt
Version: 0.1.2
Summary: Generic Parquet filtering tool (CLI + API)
Author: Yoonsoo P. Bach
License-Expression: MIT
License-File: LICENSE
Keywords: data,filter,parquet,predicate-pushdown,pyarrow
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.9
Requires-Dist: click>=8.0
Requires-Dist: pandas>=1.5
Requires-Dist: pyarrow>=10.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: docs
Requires-Dist: sphinx-rtd-theme>=2.0; extra == 'docs'
Requires-Dist: sphinx>=7.0; extra == 'docs'
Description-Content-Type: text/markdown

# pqfilt

Generic Parquet filtering tool (CLI and Python API).

Documentation is hosted on [Read the Docs](https://pqfilt.readthedocs.io/en/latest/).

Originally developed for handling large Parquet files in the [SPHEREx mission](https://spherex.caltech.edu/) ([GitHub](https://github.com/SPHEREx)).

`pqfilt` wraps `pyarrow.dataset` so that filters are applied **before** the data
is fully read into memory: with row-group-level filtering, row groups that
cannot contain matching rows are skipped.
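
To make "row-group-level filtering" concrete: a Parquet file typically stores min/max statistics for each column in each row group, and a reader can skip any row group whose range cannot satisfy the predicate. The sketch below models that decision in plain Python. `RowGroupStats` and `may_match_less_than` are hypothetical names used for illustration only; they are not part of `pqfilt` or `pyarrow`, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column's min/max statistics in one row group."""
    index: int
    min_value: float
    max_value: float


def may_match_less_than(stats: RowGroupStats, threshold: float) -> bool:
    """Return True if some row in the group could satisfy `column < threshold`."""
    # If even the smallest value is >= threshold, no row in the group can match.
    return stats.min_value < threshold


# Illustrative statistics for a column such as `vmag` (made-up values).
row_groups = [
    RowGroupStats(index=0, min_value=12.0, max_value=18.5),
    RowGroupStats(index=1, min_value=19.0, max_value=23.0),
    RowGroupStats(index=2, min_value=21.0, max_value=25.0),
]

# Only these row groups need to be read for the filter `vmag < 20`.
to_read = [rg.index for rg in row_groups if may_match_less_than(rg, 20.0)]
print(to_read)  # [0, 1]
```

Row group 1 is still read because some of its rows may match; the remaining rows are filtered out after reading.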

## Installation

```bash
pip install pqfilt
# or
uv add pqfilt
```

## Python API

```python
import pqfilt

# Simple filter
df = pqfilt.read("data.parquet", filters="vmag < 20")

# AND + OR with expression syntax
df = pqfilt.read("data.parquet", filters="(a < 30 & b > 50) | c == 1")

# Tuple syntax (flat AND)
df = pqfilt.read("data.parquet", filters=[("a", "<", 30), ("b", ">", 50)])

# DNF syntax (OR of ANDs)
df = pqfilt.read("data.parquet", filters=[
    [("a", "<", 30)],
    [("b", ">", 50)],
])

# Column selection + output
df = pqfilt.read("data/*.parquet", columns=["a", "b"], output="out.parquet")
```
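
The difference between the flat tuple syntax and the DNF syntax can be illustrated with a small evaluator over plain dictionaries. This is only a sketch of the semantics described in the comments above (flat list = AND, list of lists = OR of ANDs); `matches_and`, `matches_dnf`, and the `OPS` table are hypothetical helpers, and the set of operators `pqfilt` actually supports may differ.

```python
import operator

# Assumed operator table for illustration; not taken from pqfilt itself.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def matches_and(row, conditions):
    """True if the row satisfies every (column, op, value) tuple."""
    return all(OPS[op](row[col], value) for col, op, value in conditions)


def matches_dnf(row, groups):
    """True if the row satisfies at least one AND-group (OR of ANDs)."""
    return any(matches_and(row, group) for group in groups)


row = {"a": 40, "b": 60}

flat = [("a", "<", 30), ("b", ">", 50)]      # a < 30 AND b > 50
dnf = [[("a", "<", 30)], [("b", ">", 50)]]   # a < 30 OR b > 50

print(matches_and(row, flat))  # False: a is not < 30
print(matches_dnf(row, dnf))   # True: b > 50 holds
```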

## CLI

```bash
# Basic filter
pqfilt data/*.parquet -f "vmag < 20" -o filtered.parquet

# AND + OR expression
pqfilt data/*.parquet -f "(a < 30 & b > 50) | c == 1" -o filtered.parquet

# Multiple -f flags (AND-ed together)
pqfilt data/*.parquet -f "vmag < 20" -f "dec > 30" -o filtered.parquet

# Column selection
pqfilt data/*.parquet -f "vmag < 20" --columns vmag,ra,dec -o filtered.parquet

# Membership filter
pqfilt data/*.parquet -f "desig in 1,2,3" -o filtered.parquet
```
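
Since multiple `-f` flags are AND-ed together, the same condition can presumably be written as a single expression joined with `&`. The helper below is a minimal sketch of that idea; `and_filters` is a hypothetical name, and it assumes the expression parser accepts a parenthesized sub-expression on each side of `&`, which the examples above suggest but do not state outright.

```python
def and_filters(*expressions: str) -> str:
    """Combine filter strings into one expression joined by `&`."""
    if not expressions:
        raise ValueError("at least one filter expression is required")
    # Parenthesize each part so an inner `|` cannot change the grouping.
    return " & ".join(f"({expr})" for expr in expressions)


combined = and_filters("vmag < 20", "dec > 30")
print(combined)  # (vmag < 20) & (dec > 30)
```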

### Column names with special characters

Columns containing operator characters can be backtick-quoted:

```python
pqfilt.read("data.parquet", filters="`alpha*360` > 100")
```
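
When filter strings are built programmatically, quoting can be applied only where needed. The following is a sketch under stated assumptions: `quote_column` is a hypothetical helper, and `SPECIAL_CHARS` is a guess at which characters need quoting, since the exact set that `pqfilt` treats specially is not listed here.

```python
# Assumed set of characters that would otherwise be read as operators or
# delimiters; adjust to match the actual parser.
SPECIAL_CHARS = set("<>=!&|()*+-/, ")


def quote_column(name: str) -> str:
    """Backtick-quote a column name if it contains a special character."""
    if "`" in name:
        raise ValueError("column names containing a backtick are not handled here")
    if any(ch in SPECIAL_CHARS for ch in name):
        return f"`{name}`"
    return name


expr = f"{quote_column('alpha*360')} > 100"
print(expr)  # `alpha*360` > 100
```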

## License

MIT
