Statistics

Statistics-related helper functions, including weighted means, variances, covariances, and quantile ranks and bins.

oi_tools.stats.weighted_bin(
x: str | Expr | int | float,
w: str | Expr | int | float,
n_bins: int = 10,
*,
ties: Literal['arbitrary', 'average'] = 'average',
) → Expr

Bin values into weighted quantile bins.

Parameters:
  • x (str | Expr | int | float) – The values to bin.

  • w (str | Expr | int | float) – The weights associated with each value.

  • n_bins (int) – The number of bins. Default is 10.

  • ties (Literal['arbitrary', 'average']) – How to handle ties when assigning ranks. Default is "average".

Returns:

A Polars expression producing integer bin labels.

Return type:

pl.Expr
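As a rough illustration of weighted quantile binning, the plain-Python sketch below ranks each value by the midpoint of its cumulative weight and then cuts the unit interval into n_bins equal parts. The helper name weighted_bin_ref, the 0-based labels, and the midpoint rank convention are assumptions made for this sketch; the source does not state how oi_tools numbers its bins or treats bin edges.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_bin_ref(
    x: Sequence[float], w: Sequence[float], n_bins: int = 10
) -> List[int]:
    """Assign 0-based weighted quantile bins (hypothetical helper, not oi_tools code)."""
    total = sum(w)
    # Pool the weight of tied values so that ties share one rank.
    group_weight: Dict[float, float] = defaultdict(float)
    for xi, wi in zip(x, w):
        group_weight[xi] += wi
    # Rank of each distinct value: midpoint of its span of cumulative weight.
    rank: Dict[float, float] = {}
    below = 0.0
    for value in sorted(group_weight):
        rank[value] = (below + group_weight[value] / 2.0) / total
        below += group_weight[value]
    # Scale the rank to [0, n_bins) and truncate; the clamp guards against
    # floating-point rounding at the upper edge.
    return [min(int(rank[xi] * n_bins), n_bins - 1) for xi in x]


# The heavy last row holds most of the weight, so it alone lands in the top half.
bins = weighted_bin_ref([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 5.0], n_bins=2)
```

With equal weights the same call splits the four rows evenly, two per bin, which is the unweighted quantile result.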

oi_tools.stats.weighted_covariance(
x: str | Expr | int | float,
y: str | Expr | int | float,
w: str | Expr | int | float,
*,
weight_type: Literal['frequency', 'precision'],
ddof: int = 1,
) → Expr

Compute the weighted covariance between two expressions.

Rows where any of x, y, or w is null are omitted. The two weight types use the same normalizations as the fweights and aweights arguments of np.cov; see its documentation for details.

Parameters:
  • x (str | Expr | int | float) – First expression.

  • y (str | Expr | int | float) – Second expression.

  • w (str | Expr | int | float) – Weights.

  • weight_type (Literal['frequency', 'precision']) –

    The type of weight.

    • "frequency" weights treat each weight as a repeat count, giving normalization 1 / (sum(w) - ddof). Like fweights in Stata.

    • "precision" (analytic/reliability) weights treat each weight as an inverse variance, giving normalization sum(w) / (sum(w)**2 - ddof * sum(w**2)). Like aweights in Stata.

  • ddof (int) – Delta degrees of freedom. Default is 1.

Returns:

The weighted covariance.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame(
...     {"x": [0.0, 1.0, 2.0], "y": [2.0, 1.0, 0.0], "w": [1.0, 3.0, 1.0]}
... )
>>> df.select(weighted_covariance("x", "y", "w", weight_type="frequency")).item()
-0.5
>>> df = pl.DataFrame(
...     {"x": [0.0, 1.0, 2.0], "y": [2.0, 1.0, 0.0], "w": [1.0, 2.0, 1.0]}
... )
>>> df.select(weighted_covariance("x", "y", "w", weight_type="precision")).item()
-0.8
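The two normalizations can be checked by hand with a small plain-Python sketch. The helper weighted_covariance_ref below is hypothetical and written only to mirror the formulas given for weight_type; it is not the oi_tools implementation, which returns a Polars expression.

```python
from typing import Optional, Sequence


def weighted_covariance_ref(
    x: Sequence[Optional[float]],
    y: Sequence[Optional[float]],
    w: Sequence[Optional[float]],
    weight_type: str,
    ddof: int = 1,
) -> float:
    """Plain-Python reference for the documented normalizations (hypothetical)."""
    # Omit rows where any of x, y, or w is null, as the text describes.
    rows = [
        (xi, yi, wi)
        for xi, yi, wi in zip(x, y, w)
        if xi is not None and yi is not None and wi is not None
    ]
    sum_w = sum(wi for _, _, wi in rows)
    mean_x = sum(wi * xi for xi, _, wi in rows) / sum_w
    mean_y = sum(wi * yi for _, yi, wi in rows) / sum_w
    # Weighted sum of cross-products of deviations from the weighted means.
    cross = sum(wi * (xi - mean_x) * (yi - mean_y) for xi, yi, wi in rows)
    if weight_type == "frequency":
        norm = 1.0 / (sum_w - ddof)
    elif weight_type == "precision":
        sum_w2 = sum(wi * wi for _, _, wi in rows)
        norm = sum_w / (sum_w**2 - ddof * sum_w2)
    else:
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    return cross * norm
```

For the first example the weighted cross-product sum is -2 and sum(w) is 5, so the frequency result is -2 / (5 - 1) = -0.5. For the second, sum(w) is 4 and sum(w**2) is 6, so the precision result is -2 * 4 / (16 - 6) = -0.8. With ddof=0 both normalizations reduce to 1 / sum(w).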
oi_tools.stats.weighted_mean(
x: str | Expr | int | float,
w: str | Expr | int | float,
) → Expr

Compute the weighted mean of an expression.

Rows where either x or w is null are omitted.

Parameters:
  • x (str | Expr | int | float) – The expression.

  • w (str | Expr | int | float) – Weights.

Returns:

The weighted mean.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [0.0, 1.0], "w": [1.0, 3.0]})
>>> df.select(weighted_mean("x", "w")).item()
0.75

Null values in either x or w are omitted:

>>> df = pl.DataFrame({"x": [0.0, None, 1.0], "w": [1.0, 1.0, 3.0]})
>>> df.select(weighted_mean("x", "w")).item()
0.75
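The arithmetic behind these results can be reproduced with a short plain-Python sketch. The helper weighted_mean_ref is hypothetical and only illustrates the null handling described above; it is not the oi_tools implementation.

```python
from typing import Optional, Sequence


def weighted_mean_ref(
    x: Sequence[Optional[float]], w: Sequence[Optional[float]]
) -> float:
    """Plain-Python weighted mean that skips null rows (hypothetical helper)."""
    # Keep only rows where both the value and its weight are present.
    pairs = [
        (xi, wi) for xi, wi in zip(x, w) if xi is not None and wi is not None
    ]
    total_weight = sum(wi for _, wi in pairs)
    return sum(xi * wi for xi, wi in pairs) / total_weight
```

In the second example the middle row is dropped along with its weight, so the result is still (0 * 1 + 1 * 3) / (1 + 3) = 0.75 rather than 3 / 5.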
oi_tools.stats.weighted_rank(
x: str | Expr | int | float,
w: str | Expr | int | float,
*,
ties: Literal['arbitrary', 'average'] = 'average',
) → Expr

Compute the weighted quantile rank of an expression.

Parameters:
  • x (str | Expr | int | float) – The values to rank.

  • w (str | Expr | int | float) – The weights associated with each value.

  • ties (Literal['arbitrary', 'average']) –

    How to handle assigning quantiles in the case of ties:

    • "arbitrary": break ties arbitrarily,

    • "average": assign each unit the average rank of all units with the same x value.

Returns:

A Polars expression producing ranks in (0, 1).

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [1.0, 2.0], "w": [1.0, 3.0]})
>>> df.select(weighted_rank("x", "w")).to_series().to_list()
[0.125, 0.625]

Notes

Behavior is undefined if w contains null values.
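The documented example is consistent with a midpoint convention: each unit's rank is the weight strictly below it plus half its own weight, divided by the total weight. The sketch below applies that convention in plain Python and gives tied values the midpoint of their pooled weight, which equals the weight-averaged rank of the tied units. Both the formula and the tie handling are inferences from the example and the ties="average" description, and the helper weighted_rank_ref is hypothetical rather than oi_tools code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rank_ref(x: Sequence[float], w: Sequence[float]) -> List[float]:
    """Midpoint weighted quantile ranks with tied values pooled (hypothetical)."""
    total = sum(w)
    # Total weight carried by each distinct x value.
    group_weight: Dict[float, float] = defaultdict(float)
    for xi, wi in zip(x, w):
        group_weight[xi] += wi
    # Midpoint of each group's span on the cumulative-weight axis.
    midpoint: Dict[float, float] = {}
    below = 0.0
    for value in sorted(group_weight):
        midpoint[value] = (below + group_weight[value] / 2.0) / total
        below += group_weight[value]
    return [midpoint[xi] for xi in x]
```

For x = [1.0, 2.0] and w = [1.0, 3.0] this gives (0 + 0.5) / 4 = 0.125 and (1 + 1.5) / 4 = 0.625, matching the example output.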

oi_tools.stats.weighted_variance(
x: str | Expr | int | float,
w: str | Expr | int | float,
*,
weight_type: Literal['frequency', 'precision'],
ddof: int = 1,
) → Expr

Compute the weighted variance of an expression.

Rows where either x or w is null are omitted. The two weight types use the same normalizations as the fweights and aweights arguments of np.cov; see its documentation for details.

Parameters:
  • x (str | Expr | int | float) – The expression.

  • w (str | Expr | int | float) – Weights.

  • weight_type (Literal['frequency', 'precision']) – The type of weight. See weighted_covariance.

  • ddof (int) – Delta degrees of freedom. Default is 1.

Returns:

The weighted variance.

Return type:

pl.Expr

Examples

>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 3.0, 1.0]})
>>> df.select(weighted_variance("x", "w", weight_type="frequency")).item()
0.5
>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 2.0, 1.0]})
>>> df.select(weighted_variance("x", "w", weight_type="precision")).item()
0.8
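The weighted variance is the weighted covariance of x with itself, so the same two normalizations apply. The plain-Python sketch below reproduces the example values; the helper weighted_variance_ref is hypothetical and is not the oi_tools implementation.

```python
from typing import Optional, Sequence


def weighted_variance_ref(
    x: Sequence[Optional[float]],
    w: Sequence[Optional[float]],
    weight_type: str,
    ddof: int = 1,
) -> float:
    """Plain-Python weighted variance for both weight types (hypothetical)."""
    # Omit rows where either x or w is null.
    pairs = [
        (xi, wi) for xi, wi in zip(x, w) if xi is not None and wi is not None
    ]
    sum_w = sum(wi for _, wi in pairs)
    mean = sum(wi * xi for xi, wi in pairs) / sum_w
    # Weighted sum of squared deviations from the weighted mean.
    sum_sq = sum(wi * (xi - mean) ** 2 for xi, wi in pairs)
    if weight_type == "frequency":
        return sum_sq / (sum_w - ddof)
    if weight_type == "precision":
        sum_w2 = sum(wi * wi for _, wi in pairs)
        return sum_sq * sum_w / (sum_w**2 - ddof * sum_w2)
    raise ValueError(f"unknown weight_type: {weight_type!r}")
```

In both examples the weighted sum of squared deviations is 2, giving 2 / (5 - 1) = 0.5 for frequency weights and 2 * 4 / (16 - 6) = 0.8 for precision weights.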