Metadata-Version: 2.3
Name: polarstation
Version: 0.2.0
Summary: Tidy helper functions for polars that can peek at the full data.
Author: const-ae
Author-email: const-ae <artjom31415@googlemail.com>
Requires-Dist: polars>=1.0
Requires-Python: >=3.10
Project-URL: Documentation, https://const-ae.github.io/polarstation/
Project-URL: Repository, https://github.com/const-ae/polarstation
Description-Content-Type: text/markdown

# polarstation


<!-- DO NOT EDIT README.md — it is generated from README.qmd.
     To update: quarto render README.qmd --to gfm -->

Tidy helper functions for [Polars](https://pola.rs), inspired by the R
[tidyverse](https://tidyverse.org/).

## Installation

``` bash
pip install polarstation
```

or with uv:

``` bash
uv add polarstation
```

## Quick start

``` python
import polars as pl
import polarstation   # registers extension functions for polars

df = pl.DataFrame({
    "animal": ["dog", "dog", None, "bird", "cow", "bird", "bird"],
    "weight": [12.2, 8.1, 7.5, 0.5, 460, 0.4, None],
}).ps.with_columns(
    pl.col("animal").ps_enum.make().ps_enum.reorder(by="weight")
)
print(df)
print(df['animal'].dtype)
```

    shape: (7, 2)
    ┌────────┬────────┐
    │ animal ┆ weight │
    │ ---    ┆ ---    │
    │ enum   ┆ f64    │
    ╞════════╪════════╡
    │ dog    ┆ 12.2   │
    │ dog    ┆ 8.1    │
    │ null   ┆ 7.5    │
    │ bird   ┆ 0.5    │
    │ cow    ┆ 460.0  │
    │ bird   ┆ 0.4    │
    │ bird   ┆ null   │
    └────────┴────────┘
    Enum(categories=['bird', 'dog', 'cow'])

`ps.with_columns` is a drop-in replacement for Polars’ `with_columns`
that also handles expressions which need to peek at the full data before
they can be evaluated. It works efficiently on both `DataFrame` and
`LazyFrame`.

## Details

The key idea is `FrameExpr` — an expression that needs a peek at the
data (schema or a small aggregation) before it resolves into a regular
Polars expression. This unlocks operations like deriving Enum categories
from the data, lumping rare levels, or reordering factor levels by a
summary statistic, while keeping the rest of your pipeline lazy.

### How FrameExpr stays efficient

`ps.with_columns` resolves each `FrameExpr` in two phases. First it runs
a small aggregation (e.g. `unique().sort()` to discover categories)
against the *current* lazy plan — so any preceding `.filter()` or
`.select()` is already embedded and Polars’ predicate/projection
pushdown keeps the peek cheap. Then it uses the result to build a
concrete `pl.Expr` (e.g. `.cast(pl.Enum(["a", "b", "c"]))`) that goes
back into the lazy plan and executes normally.

``` python
# Only the filtered rows are scanned for category discovery;
# the cast itself remains lazy.
lf = pl.scan_parquet("events.parquet")
result = (
    lf.filter(pl.col("country") == "DE")
      .ps.with_columns(pl.col("status").ps_enum.make())
      .filter(pl.col("status") == "active")
      .collect()
)
```

See the `FrameExpr` docstring for the full explanation, including when
the peek is larger and notes on parallel evaluation.

### Dev Notes

To build the documentation run:

``` bash
uv run quarto render
```

and then, in a separate terminal:

``` bash
uv run quarto preview
```

To update the documentation at https://const-ae.github.io/polarstation/,
run `uv run quarto publish gh-pages`.

To re-render the README.md, run:

    quarto render README.qmd --to gfm

To upload to PyPI, run:

    uv build
    uv publish

## Acknowledgements

This package stands on the shoulders of several excellent projects:

- The [tidyverse](https://tidyverse.org/) team for establishing the tidy
  data philosophy and the vocabulary that shapes this package’s design.
- Hadley Wickham and the [forcats](https://forcats.tidyverse.org/)
  authors for the factor-manipulation functions that directly inspired
  the `ps_enum` namespace.
- David Hugh-Jones for [santoku](https://hughjonesd.github.io/santoku/),
  which inspired the `ps_chop` functions.
- Allison Horst, Alison Hill, and Kristen Gorman for the
  [palmerpenguins](https://allisonhorst.github.io/palmerpenguins/)
  dataset used in the examples and walkthrough.

## License

MIT
