Helpers

Various helper/utility functions.

API Documentation

class oi_tools.helpers.ParquetCache(
location: str | Path | None = None,
verbose: int = 1,
**kwargs,
)

A thin wrapper around joblib.Memory that caches polars.DataFrame results as Parquet files.

See the examples section below or the joblib documentation for more.

Parameters:
  • location (str | Path | None) – Path to the cache directory. If None, caching is disabled.

  • verbose (int) – Verbosity level passed to joblib.

  • **kwargs – Additional keyword arguments passed to polars.DataFrame.write_parquet() as backend_options.
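
The effect of location=None can be seen with plain joblib.Memory. This is a minimal sketch that assumes ParquetCache forwards location to joblib unchanged; the square function and the calls list are illustrative names, not part of oi_tools.

```python
from joblib import Memory

# With location=None, joblib stores nothing and the decorator is a pass-through.
memory = Memory(location=None, verbose=0)

calls = []  # records every real execution of the function body


@memory.cache
def square(x: int) -> int:
    calls.append(x)
    return x * x


first = square(3)
second = square(3)  # recomputed, because nothing was stored on disk
```

Both calls run the function body, so calls ends up with two entries. With a real directory as location, the second call would be served from the cache instead.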

Examples

>>> import time
>>> import polars as pl
>>> from oi_tools.helpers import ParquetCache
>>>
>>> CACHE = ParquetCache("/tmp/example/cache/path/", verbose=0)
>>>
>>> @CACHE.cache()
... def slow_query() -> pl.DataFrame:
...     print("Computing expensive function...")
...     time.sleep(2)
...     return pl.DataFrame({"x": [1, 2, 3]})
>>>
>>> slow_query()  # slow on first call
Computing expensive function...
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
>>> slow_query()  # fast on subsequent calls
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

In addition to using this wrapper, it’s also possible to use the parquet backend (ParquetStoreBackend) directly from joblib.Memory by passing backend="parquet":

>>> from joblib import Memory
>>> CACHE = Memory("/tmp/another/cache/path/", backend="parquet")

oi_tools.helpers.clean_col_name(
name: str,
) -> str

Normalize a column name to snake_case.

Parameters:

name (str) – Raw column name string.

Returns:

Cleaned column name with whitespace replaced by underscores, camelCase converted to snake_case, and non-alphanumeric characters replaced with underscores.

Return type:

str
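
The three rules in the description can be sketched with regular expressions. This is a hypothetical re-implementation, not the oi_tools source: the name clean_col_name_sketch, the collapsing of repeated separators, and the trimming of leading and trailing underscores are all assumptions.

```python
import re


def clean_col_name_sketch(name: str) -> str:
    """Hypothetical approximation of clean_col_name; behaviour may differ."""
    # camelCase: insert an underscore where a lowercase letter or digit
    # is immediately followed by an uppercase letter.
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Whitespace and every other non-alphanumeric run become one underscore.
    s = re.sub(r"[^0-9A-Za-z]+", "_", s)
    # Lowercase and drop stray underscores at either end (an assumption).
    return s.lower().strip("_")


examples = {
    raw: clean_col_name_sketch(raw)
    for raw in ["Household Income", "totalPop2020", "pct-change (%)"]
}
```

The real function may treat edge cases such as consecutive capitals or leading digits differently, so check its output before relying on a particular spelling.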

oi_tools.helpers.inflation_adjust(
col: str | Collection[str] | Selector | Expr | int | float,
*,
from_year: str | Collection[str] | Selector | Expr | int | float,
to_year: str | Collection[str] | Selector | Expr | int | float,
series: str = 'CUUR0000SA0',
) -> Expr

Adjust for inflation using the Consumer Price Index.

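Parameters:
  • col (str | Collection[str] | Selector | Expr | int | float) – The amounts to adjust.

  • from_year (str | Collection[str] | Selector | Expr | int | float) – The year in which the amounts in col are denominated.

  • to_year (str | Collection[str] | Selector | Expr | int | float) – The year whose price level the result is expressed in.

  • series (str) – Identifier of the CPI series to use. Defaults to 'CUUR0000SA0'.

Return type: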

Expr
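
A CPI adjustment is conventionally a rescaling by the ratio of two index levels, value × CPI(to_year) / CPI(from_year). The sketch below assumes inflation_adjust follows that convention with annual averages. The index levels are hard-coded for illustration and quoted from memory, so verify them against the published series before reuse; the adjust helper and CPI table are hypothetical names.

```python
# Annual-average CPI-U index levels, hard-coded for illustration only.
# Assumption: quoted from memory; check against the published series.
CPI = {2010: 218.056, 2015: 237.017, 2023: 304.702}


def adjust(value: float, from_year: int, to_year: int) -> float:
    """Rescale an amount by the ratio of the two years' index levels."""
    return value * CPI[to_year] / CPI[from_year]


income_2010 = adjust(50_000, 2010, 2023)
income_2015 = adjust(75_000, 2015, 2023)
```

With these inputs the results come out at roughly 69,867.83 and 96,417.77, in line with the example that follows.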

Examples

>>> import polars as pl
>>> from oi_tools.helpers import inflation_adjust
>>> df = pl.DataFrame({"income": [50000, 75000], "year": [2010, 2015]})
>>> df.with_columns(
...     income_2023=inflation_adjust("income", from_year="year", to_year=2023)
... )
shape: (2, 3)
┌────────┬──────┬──────────────┐
│ income ┆ year ┆ income_2023  │
│ ---    ┆ ---  ┆ ---          │
│ i64    ┆ i64  ┆ f64          │
╞════════╪══════╪══════════════╡
│ 50000  ┆ 2010 ┆ 69867.832117 │
│ 75000  ┆ 2015 ┆ 96417.767502 │
└────────┴──────┴──────────────┘

oi_tools.helpers.regress(
lhs: str | Collection[str] | Selector | Expr | int | float,
*rhs: str | Collection[str] | Selector | Expr | int | float,
include_intercept: bool = True,
weights: str | Collection[str] | Selector | Expr | int | float | None = None,
output: Literal['predictions', 'residuals', 'coefficients', 'statistics'] = 'predictions',
**kwargs,
) -> Expr

Regress an expression on some covariates.

At the moment, this is just a thin wrapper around polars_ols.compute_least_squares().

Parameters:
  • lhs (str | Collection[str] | Selector | Expr | int | float) – The dependent variable (outcome) to regress.

  • *rhs (str | Collection[str] | Selector | Expr | int | float) – One or more independent variables (predictors).

  • include_intercept (bool) – Whether to add an intercept term.

  • weights (str | Collection[str] | Selector | Expr | int | float | None) – Optional observation weights for weighted least squares.

  • output (Literal['predictions', 'residuals', 'coefficients', 'statistics']) – What to return from the regression. See the “Returns” section for how this affects the output.

  • **kwargs – Additional keyword arguments forwarded to polars_ols.OLSKwargs.

Returns:

The result depends on output:

  • "predictions": fitted values from the regression (same length as lhs).

  • "residuals": difference between observed and fitted values (same length as lhs).

  • "coefficients": estimated coefficients for each predictor (same length as the number of columns that rhs expands to).

  • "statistics": model fit statistics (r2, mae, mse) together with per-predictor coefficients, standard errors, t-values, and p-values.

Return type:

pl.Expr

Examples

>>> import polars as pl
>>> from oi_tools.helpers import regress
>>> df = pl.DataFrame({"y": [1.0, 2.0, 3.0, 4.0], "x": [2.0, 4.0, 5.0, 7.0]})

Fitted values (default):

>>> df.select(regress("y", "x"))
shape: (4, 1)
┌──────────┐
│ y        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.961538 │
│ 2.192308 │
│ 2.807692 │
│ 4.038462 │
└──────────┘

Residuals:

>>> df.select(regress("y", "x", output="residuals"))
shape: (4, 1)
┌───────────┐
│ y         │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.038462  │
│ -0.192308 │
│ 0.192308  │
│ -0.038462 │
└───────────┘

Coefficients as a one-row struct, one field per predictor:

>>> df.select(regress("y", "x", output="coefficients")).unnest("coefficients")
shape: (1, 2)
┌──────────┬───────────┐
│ x        ┆ const     │
│ ---      ┆ ---       │
│ f64      ┆ f64       │
╞══════════╪═══════════╡
│ 0.615385 ┆ -0.269231 │
└──────────┴───────────┘
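
The coefficients above can be checked by hand. With a single predictor and an intercept, ordinary least squares has a closed form: the slope is the ratio of the x-y covariation to the variation in x, and the intercept makes the fitted line pass through the means. This plain-Python sketch uses no library code and only illustrates the arithmetic; it is not how regress computes its result.

```python
y = [1.0, 2.0, 3.0, 4.0]
x = [2.0, 4.0, 5.0, 7.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of cross-products and squares around the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

slope = s_xy / s_xx                # 8 / 13
intercept = y_bar - slope * x_bar  # line passes through (x_bar, y_bar)

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
```

Here slope and intercept correspond to the x and const fields, and fitted and residuals match the "predictions" and "residuals" outputs shown earlier.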

Model statistics (r2, mae, mse, and per-predictor coefficients, standard errors, t-values, and p-values):

>>> stats = df.select(regress("y", "x", output="statistics")).unnest("statistics")
>>> stats.select(["r2", "feature_names", "coefficients", "p_values"])
shape: (1, 4)
┌──────────┬────────────────┬──────────────────┬──────────────────┐
│ r2       ┆ feature_names  ┆ coefficients     ┆ p_values         │
│ ---      ┆ ---            ┆ ---              ┆ ---              │
│ f64      ┆ list[str]      ┆ list[f64]        ┆ list[f64]        │
╞══════════╪════════════════╪══════════════════╪══════════════════╡
│ 0.984615 ┆ ["x", "const"] ┆ [0.615385, -0.2… ┆ [0.007722, 0.41… │
└──────────┴────────────────┴──────────────────┴──────────────────┘

oi_tools.helpers.to_expr(
x: str | Collection[str] | Selector | Expr | int | float,
) -> Expr

Convert the input to a Polars expression.

Parameters:

x (str | Collection[str] | Selector | Expr | int | float) – Something that can be coerced to an expression.

Returns:

A Polars expression.

Return type:

pl.Expr

oi_tools.helpers.to_masked_expr(
*xs: str | Collection[str] | Selector | Expr | int | float,
) -> Sequence[Expr]

Create a set of expressions with a standardized null mask.

All output expressions evaluate to null wherever any input expression is null.

Parameters:

*xs (str | Collection[str] | Selector | Expr | int | float) – Expressions to mask.

Returns:

Expressions that evaluate to null when any input is null.

Return type:

Sequence[pl.Expr]

oi_tools.helpers.to_selector(
x: str | Collection[str] | Selector,
) -> Selector

Convert the input to a Polars selector.

Parameters:

x (str | Collection[str] | Selector) – Something that can be coerced to a column selector.

Returns:

A Polars column selector.

Return type:

cs.Selector