API Reference
Core
- pqfilt.read(source: str | Path | list[str | Path], *, filters: str | list | FilterExpr | AndExpr | OrExpr | None = None, columns: list[str] | None = None, per_file: bool = True, output: str | Path | None = None, overwrite: bool = False) -> DataFrame
Read Parquet file(s) with predicate-pushdown filtering.
Wraps pyarrow.dataset to apply row-group-level predicate pushdown, avoiding unnecessary I/O and memory usage.
- Parameters:
source (str, Path, or list) – File path, glob pattern (e.g., "data/*.parquet"), or explicit list of paths.
filters (str, list, ExprNode, or None, optional) – Filter specification. Accepts several formats:
Expression string – parsed via the built-in mini-language:
"vmag < 20"
"(a < 30 & b > 50) | c == 1"
"desig in 1,2,3"
List of 3-tuples (flat AND):
[("a", ">", 5), ("b", "<", 10)]
List of lists (DNF – OR of AND-groups):
[[("a", ">", 5)], [("b", "<", 10)]]
Pre-parsed AST node (FilterExpr, AndExpr, OrExpr).
columns (list of str, optional) – Columns to load (projection pushdown). None loads all columns.
per_file (bool, optional) – If True (default), apply the filter to each file independently and concatenate the results. This is more memory-efficient when reading many large files. If False, concatenate first, then apply pandas-level filtering (useful when the filter cannot be pushed down).
output (str or Path, optional) – Save the result to this path (.parquet or .csv).
overwrite (bool, optional) – Allow overwriting output if it already exists.
- Returns:
Filtered (and optionally column-selected) DataFrame.
- Return type:
DataFrame
- Raises:
FileNotFoundError – No files matched source.
FileExistsError – output exists and overwrite is False.
ValueError – Invalid filter syntax.
TypeError – filters is not a supported type.
Examples
Simple filter:
df = pqfilt.read("data.parquet", filters="vmag < 20")
AND + OR expression:
df = pqfilt.read("data.parquet", filters="(a < 30 & b > 50) | c == 1")
Tuple syntax:
df = pqfilt.read("data.parquet", filters=[("a", ">", 5), ("b", "<", 10)])
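The difference between the flat list of tuples (AND) and the list of lists (OR of AND-groups) can be illustrated without any Parquet file. The following is a minimal pure-Python sketch of those semantics applied to plain dict rows; it is not pqfilt's implementation, and the names `OPS`, `matches_and`, and `matches_dnf` are hypothetical helpers introduced only for this illustration.

```python
import operator

# Hypothetical operator table for this sketch; pqfilt's own supported set may differ.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def matches_and(row, conditions):
    """Return True if the row satisfies every (column, op, value) tuple."""
    return all(OPS[op](row[col], val) for col, op, val in conditions)


def matches_dnf(row, groups):
    """Return True if the row satisfies at least one AND-group."""
    return any(matches_and(row, group) for group in groups)


rows = [
    {"a": 7, "b": 3},
    {"a": 7, "b": 20},
    {"a": 1, "b": 20},
]

flat_and = [("a", ">", 5), ("b", "<", 10)]   # a > 5 AND b < 10
dnf = [[("a", ">", 5)], [("b", "<", 10)]]    # a > 5 OR b < 10

and_hits = [r for r in rows if matches_and(r, flat_and)]
or_hits = [r for r in rows if matches_dnf(r, dnf)]

print(and_hits)
print(or_hits)
```

With the same three tuples, the flat form keeps only rows that pass both conditions, while the nested form keeps rows that pass either one.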
Expression Parser
- pqfilt.parse_expression(expr: str) -> FilterExpr | AndExpr | OrExpr
Parse a filter expression string into an AST.
- Parameters:
expr (str) – Expression such as "a > 5 & b < 10 | c == 1".
- Returns:
A FilterExpr, AndExpr, or OrExpr tree.
- Return type:
ExprNode
- Raises:
ValueError – On parse errors (unmatched parentheses, missing operator, etc.).
Examples
>>> parse_expression("vmag < 20")
FilterExpr(col='vmag', op='<', val=20)
>>> parse_expression("a > 5 & b < 10")
AndExpr(children=(FilterExpr(col='a', op='>', val=5), FilterExpr(col='b', op='<', val=10)))
- pqfilt._parser.to_pyarrow_expr(node: FilterExpr | AndExpr | OrExpr) -> Expression
Convert a parsed AST into a pyarrow.compute.Expression.
- Parameters:
node (ExprNode) – Output of parse_expression().
- Returns:
Expression suitable for pyarrow.dataset.Dataset.to_table(filter=...).
- Return type:
pyarrow.compute.Expression
AST Nodes
- class pqfilt.FilterExpr(col: str, op: str, val: Any)
Single comparison: col op val.
- val
Comparison value (scalar or list for in/not in).
- Type:
Any
- class pqfilt.AndExpr(children: tuple[FilterExpr | OrExpr | AndExpr, ...] = <factory>)
Conjunction (AND) of child nodes.
- children: tuple[FilterExpr | OrExpr | AndExpr, ...]
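To make the tree structure concrete, here is a minimal sketch of how nodes of this shape could be walked recursively. It uses stand-in dataclasses (`Cmp`, `And`, `Or`) that mirror the documented fields rather than importing pqfilt, so every name below is hypothetical; in particular, the assumption that the OR node carries a `children` tuple like `AndExpr` is not confirmed by this reference.

```python
import operator
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Cmp:
    """Stand-in for a single comparison node: col op val."""
    col: str
    op: str
    val: Any


@dataclass(frozen=True)
class And:
    """Stand-in for a conjunction node."""
    children: tuple = ()


@dataclass(frozen=True)
class Or:
    """Stand-in for a disjunction node (children field is an assumption)."""
    children: tuple = ()


# Hypothetical operator table for this sketch only.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate(node, row):
    """Recursively evaluate a stand-in AST against one dict row."""
    if isinstance(node, Cmp):
        return OPS[node.op](row[node.col], node.val)
    if isinstance(node, And):
        return all(evaluate(child, row) for child in node.children)
    if isinstance(node, Or):
        return any(evaluate(child, row) for child in node.children)
    raise TypeError(f"unsupported node type: {type(node).__name__}")


# Hand-built tree corresponding to "(a < 30 & b > 50) | c == 1".
tree = Or((And((Cmp("a", "<", 30), Cmp("b", ">", 50))), Cmp("c", "==", 1)))

print(evaluate(tree, {"a": 10, "b": 60, "c": 0}))
print(evaluate(tree, {"a": 40, "b": 60, "c": 0}))
```

A converter to another representation would follow the same three-way dispatch, building an expression at each node instead of returning a boolean.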
Operators
Operator validation and application utilities.
- pqfilt._operators.apply_filter_operator(op: str, left: Any, right: Any) -> Any
Apply op to left and right operands.
Works with both pyarrow.compute.Expression (via ds.field) and pandas.Series / NumPy arrays.
- Parameters:
op (str) – One of SUPPORTED_OPERATORS.
left (pyarrow.Expression, pandas.Series, or array-like) – Left operand.
right (scalar or array-like) – Right operand.
- Returns:
Boolean expression or mask.
- Return type:
result
- Raises:
ValueError – If op is unsupported.
TypeError – If in/not in is used with an operand lacking isin().
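The documented behaviour (a lookup by operator string, a ValueError for unknown operators, and a TypeError when in/not in meets an operand without isin()) can be sketched as a small dispatch function. This is an illustrative stand-in, not the library's code: `COMPARISONS` and `apply_operator_sketch` are hypothetical names, and the actual contents of SUPPORTED_OPERATORS are not listed in this reference.

```python
import operator

# Hypothetical comparison table; the real SUPPORTED_OPERATORS may differ.
COMPARISONS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def apply_operator_sketch(op, left, right):
    """Apply op to left and right, mirroring the documented error cases."""
    if op in ("in", "not in"):
        # Membership relies on the operand providing an isin() method.
        if not hasattr(left, "isin"):
            raise TypeError(f"{op!r} requires a left operand with isin()")
        mask = left.isin(right)
        return mask if op == "in" else ~mask
    try:
        func = COMPARISONS[op]
    except KeyError:
        raise ValueError(f"unsupported operator: {op!r}") from None
    return func(left, right)


scalar_result = apply_operator_sketch("<", 3, 5)

try:
    apply_operator_sketch("~=", 3, 5)
except ValueError as exc:
    unsupported_error = str(exc)

try:
    apply_operator_sketch("in", 3, [1, 2, 3])  # a plain int has no isin()
except TypeError as exc:
    isin_error = str(exc)

print(scalar_result, unsupported_error, isin_error, sep="\n")
```

Because the comparison functions come from the operator module, the same call works on any operands that overload those operators, which is what lets one function serve both lazy expressions and in-memory arrays.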
- pqfilt._operators.to_numeric_if_possible(value_str: str) -> int | float | str
Convert value_str to int or float if possible.
Prefers int when the float value is integer-like.
Examples
>>> to_numeric_if_possible("42")
42
>>> to_numeric_if_possible("3.14")
3.14
>>> to_numeric_if_possible("foo")
'foo'
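One way to obtain this behaviour is a try-int, then try-float cascade. The sketch below is a hypothetical re-implementation named `to_numeric_sketch`, not the library's source; it assumes that "integer-like" means a float whose value has no fractional part (for example "5.0"), which this reference does not spell out.

```python
def to_numeric_sketch(value_str):
    """Return an int or float parsed from value_str, or value_str itself."""
    try:
        return int(value_str)
    except ValueError:
        pass
    try:
        as_float = float(value_str)
    except ValueError:
        # Not numeric at all: hand the original string back unchanged.
        return value_str
    # Assumption: integer-valued floats such as "5.0" collapse to int.
    return int(as_float) if as_float.is_integer() else as_float


print(to_numeric_sketch("42"))
print(to_numeric_sketch("3.14"))
print(to_numeric_sketch("5.0"))
print(to_numeric_sketch("foo"))
```

Returning the original string on failure keeps string-valued comparisons such as name == foo usable without quoting rules in the caller.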