Helpers¶
Various helper/utility functions.
API Documentation¶
- class oi_tools.helpers.ParquetCache( )¶
A thin wrapper around joblib.Memory that caches polars.DataFrame results to parquet files. See the examples section below or the joblib documentation for more.
- Parameters:
Examples
>>> import time
>>> import polars as pl
>>> from oi_tools.helpers import ParquetCache
>>>
>>> CACHE = ParquetCache("/tmp/example/cache/path/", verbose=0)
>>>
>>> @CACHE.cache()
... def slow_query() -> pl.DataFrame:
...     print("Computing expensive function...")
...     time.sleep(2)
...     return pl.DataFrame({"x": [1, 2, 3]})
>>>
>>> slow_query()  # slow on first call
Computing expensive function...
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
>>> slow_query()  # fast on subsequent calls
shape: (3, 1)
┌─────┐
│ x   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘
In addition to using this wrapper, it’s also possible to use the parquet backend (ParquetStoreBackend) directly from joblib.Memory by passing backend="parquet":

>>> from joblib import Memory
>>> CACHE = Memory("/tmp/another/cache/path/", backend="parquet")
- oi_tools.helpers.inflation_adjust(
- col: str | Collection[str] | Selector | Expr | int | float,
- *,
- from_year: str | Collection[str] | Selector | Expr | int | float,
- to_year: str | Collection[str] | Selector | Expr | int | float,
- series: str = 'CUUR0000SA0',
)¶
Adjust for inflation using the Consumer Price Index.
Useful references:
- Parameters:
col (str | Collection[str] | Selector | Expr | int | float) – The column (or columns) to adjust.
from_year (str | Collection[str] | Selector | Expr | int | float) – The year in which the dollar value is currently measured.
to_year (str | Collection[str] | Selector | Expr | int | float) – The year to which you would like to inflation adjust.
series (str) – The CPI series used for inflation adjustment.
- Return type:
Examples
>>> df = pl.DataFrame({"income": [50000, 75000], "year": [2010, 2015]})
>>> df.with_columns(
...     income_2023=inflation_adjust("income", from_year="year", to_year=2023)
... )
shape: (2, 3)
┌────────┬──────┬──────────────┐
│ income ┆ year ┆ income_2023  │
│ ---    ┆ ---  ┆ ---          │
│ i64    ┆ i64  ┆ f64          │
╞════════╪══════╪══════════════╡
│ 50000  ┆ 2010 ┆ 69867.832117 │
│ 75000  ┆ 2015 ┆ 96417.767502 │
└────────┴──────┴──────────────┘
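Conceptually, CPI-based adjustment rescales a dollar amount by the ratio of the index in the target year to the index in the source year: value × CPI(to_year) / CPI(from_year). A minimal plain-Python sketch of that arithmetic, using made-up index levels for illustration (these are not real CPI figures, and this is not the function's internals, which pull the series named by the series parameter):

```python
# Sketch of CPI-ratio inflation adjustment. The index levels below are
# illustrative placeholders, not actual CPI data.
CPI = {2010: 100.0, 2015: 108.0, 2023: 130.0}  # hypothetical index levels


def adjust(value: float, from_year: int, to_year: int) -> float:
    """Re-express `value` (measured in from_year dollars) in to_year dollars."""
    return value * CPI[to_year] / CPI[from_year]


print(adjust(50_000, 2010, 2023))  # 50_000 * 130.0 / 100.0 = 65000.0
```

Note that adjusting a value to its own year is a no-op, since the ratio is 1.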
- oi_tools.helpers.regress(
- lhs: str | Collection[str] | Selector | Expr | int | float,
- *rhs: str | Collection[str] | Selector | Expr | int | float,
- include_intercept: bool = True,
- weights: str | Collection[str] | Selector | Expr | int | float | None = None,
- output: Literal['predictions', 'residuals', 'coefficients', 'statistics'] = 'predictions',
- **kwargs,
)¶
Regress an expression on some covariates.
At the moment, this is just a thin wrapper around polars_ols.compute_least_squares().
- Parameters:
lhs (str | Collection[str] | Selector | Expr | int | float) – The dependent variable (outcome) to regress.
*rhs (str | Collection[str] | Selector | Expr | int | float) – One or more independent variables (predictors).
include_intercept (bool) – Whether to add an intercept term.
weights (str | Collection[str] | Selector | Expr | int | float | None) – Optional observation weights for weighted least squares.
output (Literal['predictions', 'residuals', 'coefficients', 'statistics']) – What to return from the regression. See the “Returns” section for how this will affect the output.
**kwargs – Additional keyword arguments forwarded to polars_ols.OLSKwargs.
- Returns:
The result depends on output:
- "predictions": fitted values from the regression (same length as lhs).
- "residuals": difference between observed and fitted values (same length as lhs).
- "coefficients": estimated coefficients for each predictor (same length as the number of columns that rhs expands to).
- "statistics": coefficient estimates with standard errors and p-values.
- Return type:
pl.Expr
Examples
>>> import polars as pl
>>> from oi_tools.helpers import regress
>>> df = pl.DataFrame({"y": [1.0, 2.0, 3.0, 4.0], "x": [2.0, 4.0, 5.0, 7.0]})
Fitted values (default):
>>> df.select(regress("y", "x"))
shape: (4, 1)
┌──────────┐
│ y        │
│ ---      │
│ f64      │
╞══════════╡
│ 0.961538 │
│ 2.192308 │
│ 2.807692 │
│ 4.038462 │
└──────────┘
Residuals:
>>> df.select(regress("y", "x", output="residuals"))
shape: (4, 1)
┌───────────┐
│ y         │
│ ---       │
│ f64       │
╞═══════════╡
│ 0.038462  │
│ -0.192308 │
│ 0.192308  │
│ -0.038462 │
└───────────┘
Coefficients as a one-row struct, one field per predictor:
>>> df.select(regress("y", "x", output="coefficients")).unnest("coefficients")
shape: (1, 2)
┌──────────┬───────────┐
│ x        ┆ const     │
│ ---      ┆ ---       │
│ f64      ┆ f64       │
╞══════════╪═══════════╡
│ 0.615385 ┆ -0.269231 │
└──────────┴───────────┘
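For a single predictor with an intercept, these coefficients can be checked against the closed-form OLS solution, slope = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and intercept = ȳ − slope · x̄. A quick plain-Python check on the same data (independent of polars_ols):

```python
# Verify the simple-regression coefficients by hand using the closed form:
# slope = S_xy / S_xx, intercept = ȳ - slope * x̄.
x = [2.0, 4.0, 5.0, 7.0]
y = [1.0, 2.0, 3.0, 4.0]

x_bar = sum(x) / len(x)  # 4.5
y_bar = sum(y) / len(y)  # 2.5

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 8.0
s_xx = sum((xi - x_bar) ** 2 for xi in x)                        # 13.0

slope = s_xy / s_xx                # 8 / 13 ≈ 0.615385
intercept = y_bar - slope * x_bar  # ≈ -0.269231
print(round(slope, 6), round(intercept, 6))  # 0.615385 -0.269231
```

These match the `x` and `const` fields in the coefficients struct above.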
Model statistics (r2, mae, mse, and per-predictor coefficients, standard errors, t-values, and p-values):
>>> stats = df.select(regress("y", "x", output="statistics")).unnest("statistics")
>>> stats.select(["r2", "feature_names", "coefficients", "p_values"])
shape: (1, 4)
┌──────────┬────────────────┬──────────────────┬──────────────────┐
│ r2       ┆ feature_names  ┆ coefficients     ┆ p_values         │
│ ---      ┆ ---            ┆ ---              ┆ ---              │
│ f64      ┆ list[str]      ┆ list[f64]        ┆ list[f64]        │
╞══════════╪════════════════╪══════════════════╪══════════════════╡
│ 0.984615 ┆ ["x", "const"] ┆ [0.615385, -0.2… ┆ [0.007722, 0.41… │
└──────────┴────────────────┴──────────────────┴──────────────────┘
- oi_tools.helpers.to_masked_expr( ) → Sequence[Expr]¶
Create a set of expressions with a standardized null mask.
All output expressions evaluate to null wherever any input expression is null.
- oi_tools.helpers.to_selector(
- x: str | Collection[str] | Selector,
)¶
Convert the input to a Polars selector.
- Parameters:
x (str | Collection[str] | Selector) – Something that can be coerced to a column selector.
- Returns:
A Polars column selector.
- Return type:
cs.Selector