API Reference

Core

pqfilt.read(source: str | Path | list[str | Path], *, filters: str | list | FilterExpr | AndExpr | OrExpr | None = None, columns: list[str] | None = None, per_file: bool = True, output: str | Path | None = None, overwrite: bool = False) -> DataFrame

Read Parquet file(s) with predicate-pushdown filtering.

Wraps pyarrow.dataset to apply row-group-level predicate pushdown, avoiding unnecessary I/O and memory usage.

Parameters:
  • source (str, Path, or list) – File path, glob pattern (e.g., "data/*.parquet"), or explicit list of paths.

  • filters (str, list, ExprNode, or None, optional) –

    Filter specification. Accepts several formats:

    Expression string – parsed via the built-in mini-language:

    "vmag < 20"
    "(a < 30 & b > 50) | c == 1"
    "desig in 1,2,3"
    

    List of 3-tuples (flat AND):

    [("a", ">", 5), ("b", "<", 10)]
    

    List of lists (DNF – OR of AND-groups):

    [[("a", ">", 5)], [("b", "<", 10)]]
    

    Pre-parsed AST node (FilterExpr, AndExpr, OrExpr).

  • columns (list of str, optional) – Columns to load (projection pushdown). None loads all columns.

  • per_file (bool, optional) – If True (default), apply the filter to each file independently and concatenate the results, which is more memory-efficient when reading many large files. If False, concatenate all files first, then filter at the pandas level (useful when the filter cannot be pushed down).

  • output (str or Path, optional) – Save the result to this path (.parquet or .csv).

  • overwrite (bool, optional) – Allow overwriting output if it already exists.

Returns:

Filtered (and optionally column-selected) DataFrame.

Return type:

pandas.DataFrame

Examples

Simple filter:

df = pqfilt.read("data.parquet", filters="vmag < 20")

AND + OR expression:

df = pqfilt.read("data.parquet", filters="(a < 30 & b > 50) | c == 1")

Tuple syntax:

df = pqfilt.read("data.parquet", filters=[("a", ">", 5), ("b", "<", 10)])
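The difference between the flat-AND and DNF tuple forms can be illustrated with a small pure-Python sketch. It does not use pqfilt or pyarrow; the helper `matches` and the `OPS` table are hypothetical names introduced only for this illustration, and the operator set shown is an assumption rather than the library's actual list.

```python
import operator

# Comparison operators assumed for this sketch; pqfilt's own set may differ.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def matches(row, filters):
    """Return True if `row` (a dict) satisfies a tuple-style filter spec."""
    # A flat list of 3-tuples is treated as a single AND-group.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    # DNF: the row passes if every condition in at least one group holds.
    return any(
        all(OPS[op](row[col], val) for col, op, val in group)
        for group in filters
    )


row = {"a": 7, "b": 12}
flat_and = matches(row, [("a", ">", 5), ("b", "<", 10)])    # a > 5 AND b < 10
dnf_or = matches(row, [[("a", ">", 5)], [("b", "<", 10)]])  # a > 5 OR b < 10
print(flat_and, dnf_or)  # False True
```

The same row fails the flat list (both conditions must hold) but passes the list of lists (either group is enough).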

Expression Parser

pqfilt.parse_expression(expr: str) -> FilterExpr | AndExpr | OrExpr

Parse a filter expression string into an AST.

Parameters:

expr (str) – Expression such as "a > 5 & b < 10 | c == 1".

Returns:

A FilterExpr, AndExpr, or OrExpr tree.

Return type:

ExprNode

Raises:

ValueError – On parse errors (unmatched parentheses, missing operator, etc.).

Examples

>>> parse_expression("vmag < 20")
FilterExpr(col='vmag', op='<', val=20)
>>> parse_expression("a > 5 & b < 10")
AndExpr(children=(FilterExpr(col='a', op='>', val=5), FilterExpr(col='b', op='<', val=10)))

pqfilt._parser.to_pyarrow_expr(node: FilterExpr | AndExpr | OrExpr) -> Expression

Convert a parsed AST into a pyarrow.compute.Expression.

Parameters:

node (ExprNode) – Output of parse_expression().

Returns:

Expression suitable for pyarrow.dataset.Dataset.to_table(filter=...).

Return type:

pyarrow.compute.Expression

AST Nodes

class pqfilt.FilterExpr(col: str, op: str, val: Any)

Single comparison: col op val.

col

Column name.

Type:

str

op

Comparison operator (one of SUPPORTED_OPERATORS).

Type:

str

val

Comparison value (scalar or list for in / not in).

Type:

Any

class pqfilt.AndExpr(children: tuple[FilterExpr | OrExpr | AndExpr, ...] = <factory>)

Conjunction (AND) of child nodes.

children

All children must evaluate to True.

Type:

tuple of ExprNode

class pqfilt.OrExpr(children: tuple[FilterExpr | AndExpr | OrExpr, ...] = <factory>)

Disjunction (OR) of child nodes.

children

At least one child must evaluate to True.

Type:

tuple of ExprNode
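To show how the three node types compose, here is a minimal sketch that evaluates a tree against a plain dict. The classes `Cmp`, `And`, and `Or` are hypothetical stand-ins that mirror only the documented fields (`col`, `op`, `val`, `children`); they are not the pqfilt classes, and `evaluate` is not part of the library.

```python
import operator
from dataclasses import dataclass
from typing import Any

# Only the operators needed for this illustration.
OPS = {"==": operator.eq, "<": operator.lt, ">": operator.gt}


@dataclass(frozen=True)
class Cmp:  # stand-in for FilterExpr
    col: str
    op: str
    val: Any


@dataclass(frozen=True)
class And:  # stand-in for AndExpr
    children: tuple = ()


@dataclass(frozen=True)
class Or:  # stand-in for OrExpr
    children: tuple = ()


def evaluate(node, row):
    """Recursively evaluate a stand-in AST against a dict `row`."""
    if isinstance(node, Cmp):
        return OPS[node.op](row[node.col], node.val)
    if isinstance(node, And):
        return all(evaluate(child, row) for child in node.children)
    if isinstance(node, Or):
        return any(evaluate(child, row) for child in node.children)
    raise TypeError(f"unexpected node type: {type(node).__name__}")


# Hand-built tree for "(a < 30 & b > 50) | c == 1".
tree = Or((And((Cmp("a", "<", 30), Cmp("b", ">", 50))), Cmp("c", "==", 1)))
print(evaluate(tree, {"a": 10, "b": 60, "c": 0}))  # True
```

A real conversion would presumably build a pyarrow expression instead of a Python bool, but the recursive shape (leaf comparison, all-of, any-of) is the same.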


Operators

Operator validation and application utilities.

pqfilt._operators.apply_filter_operator(op: str, left: Any, right: Any) -> Any

Apply op to left and right operands.

Works with both pyarrow.compute.Expression operands (such as those created with pyarrow.dataset.field) and pandas.Series or NumPy arrays.

Parameters:
  • op (str) – One of SUPPORTED_OPERATORS.

  • left (pyarrow.compute.Expression, pandas.Series, or array-like) – Left operand.

  • right (scalar or array-like) – Right operand.

Returns:

Boolean expression or mask.

Return type:

Any

Raises:
  • ValueError – If op is unsupported.

  • TypeError – If in / not in is used with an operand lacking isin().
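The dispatch and error behaviour described above could look roughly like the following sketch. It is not the library's implementation: `apply_op_sketch` is a hypothetical name, and the operator set is an assumption, since the contents of SUPPORTED_OPERATORS are not listed here.

```python
import operator

# Assumed comparison operators; the real SUPPORTED_OPERATORS may differ.
_COMPARISONS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def apply_op_sketch(op, left, right):
    """Apply `op` to the operands, mirroring the documented error cases."""
    if op in _COMPARISONS:
        # Works for scalars and for objects that overload comparisons.
        return _COMPARISONS[op](left, right)
    if op in ("in", "not in"):
        if not hasattr(left, "isin"):
            raise TypeError(f"operator {op!r} needs an operand with isin()")
        mask = left.isin(right)
        # Negate the membership mask for "not in".
        return ~mask if op == "not in" else mask
    raise ValueError(f"unsupported operator: {op!r}")


print(apply_op_sketch(">=", 7, 5))  # True
```

Because the comparison branch relies only on operator overloading, the same function body would work for any left operand that defines rich comparisons, while membership tests need an `isin()` method and support for `~` on its result.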

pqfilt._operators.to_numeric_if_possible(value_str: str) -> int | float | str

Convert value_str to int or float if possible; otherwise return it unchanged as a str.

Prefers int when the float value is integer-like.
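One way to read this rule is sketched below. `to_numeric_sketch` is a hypothetical re-implementation for illustration only; in particular, treating an input like "5.0" as the int 5 is an assumption based on the "integer-like" wording, not confirmed behaviour of the library function.

```python
def to_numeric_sketch(value_str):
    """Return an int or float parsed from `value_str`, else the string itself."""
    # Try a direct int parse first so large integers keep full precision.
    try:
        return int(value_str)
    except ValueError:
        pass
    try:
        number = float(value_str)
    except ValueError:
        # Not numeric at all: hand the original string back.
        return value_str
    # Assumption: integer-valued floats such as "5.0" collapse to int.
    return int(number) if number.is_integer() else number


print(to_numeric_sketch("42"), to_numeric_sketch("3.14"), to_numeric_sketch("foo"))
```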

Examples

>>> to_numeric_if_possible("42")
42
>>> to_numeric_if_possible("3.14")
3.14
>>> to_numeric_if_possible("foo")
'foo'

pqfilt._operators.validate_operator(op: str, col: str | None = None) -> None

Validate that op is a supported filter operator.

Parameters:
  • op (str) – Operator string.

  • col (str, optional) – Column name for error-message context.

Raises:

ValueError – If op is not in SUPPORTED_OPERATORS.