Statistics
Statistics-related helper functions including weighted means, variances, and correlations.
- oi_tools.stats.weighted_bin(
- x: str | Expr | int | float,
- w: str | Expr | int | float,
- n_bins: int = 10,
- *,
- ties: Literal['arbitrary', 'average'] = 'average',
Bin values into weighted quantile bins.
- Parameters:
x (str | Expr | int | float) – The values to bin.
w (str | Expr | int | float) – The weights associated with each value.
n_bins (int) – The number of bins. Default is 10.
ties (Literal['arbitrary', 'average']) – How to handle ties when assigning ranks. Default is
"average".
- Returns:
A Polars expression producing integer bin labels.
- Return type:
pl.Expr
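The binning idea can be sketched in plain Python. This is not the oi_tools implementation: the helper name `weighted_bin_sketch`, the 0-based labels, the midpoint rank convention, and the arbitrary tie handling are all assumptions made for illustration.

```python
def weighted_bin_sketch(x, w, n_bins=10):
    """Assign each value a 0-based weighted-quantile bin label (illustrative only)."""
    total = sum(w)
    # Visit observations in ascending order of x; ties are broken arbitrarily.
    order = sorted(range(len(x)), key=lambda i: x[i])
    bins = [0] * len(x)
    cum = 0.0
    for i in order:
        # Midpoint of the cumulative weight interval covered by observation i.
        rank = (cum + w[i] / 2) / total
        cum += w[i]
        # Map the rank in (0, 1) to a bin label, clamping to the last bin.
        bins[i] = min(int(rank * n_bins), n_bins - 1)
    return bins


# Equal weights split four values into two bins of two.
print(weighted_bin_sketch([1, 2, 3, 4], [1, 1, 1, 1], n_bins=2))
# A heavy last observation fills the upper bin on its own.
print(weighted_bin_sketch([1, 2, 3, 4], [1, 1, 1, 5], n_bins=2))
```

Under these assumptions, a bin boundary falls where the cumulative weight crosses a multiple of sum(w) / n_bins, so bins hold roughly equal weight rather than equal row counts.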
- oi_tools.stats.weighted_covariance(
- x: str | Expr | int | float,
- y: str | Expr | int | float,
- w: str | Expr | int | float,
- *,
- weight_type: Literal['frequency', 'precision'],
- ddof: int = 1,
Compute the weighted covariance between two expressions.
Rows where any of x, y, or w is null are omitted. See the documentation for np.cov for more.
- Parameters:
x (str | Expr | int | float) – First expression.
y (str | Expr | int | float) – Second expression.
w (str | Expr | int | float) – Weights.
weight_type (Literal['frequency', 'precision']) –
The type of weight.
"frequency" weights treat each weight as a repeat count, giving normalization 1 / (sum(w) - ddof). Like fweights in Stata.
"precision" (analytic/reliability) weights treat each weight as an inverse variance, giving normalization sum(w) / (sum(w)**2 - ddof * sum(w**2)). Like aweights in Stata.
ddof (int) – Delta degrees of freedom.
- Returns:
The weighted covariance.
- Return type:
pl.Expr
Examples
>>> df = pl.DataFrame(
...     {"x": [0.0, 1.0, 2.0], "y": [2.0, 1.0, 0.0], "w": [1.0, 3.0, 1.0]}
... )
>>> df.select(weighted_covariance("x", "y", "w", weight_type="frequency")).item()
-0.5
>>> df = pl.DataFrame(
...     {"x": [0.0, 1.0, 2.0], "y": [2.0, 1.0, 0.0], "w": [1.0, 2.0, 1.0]}
... )
>>> df.select(weighted_covariance("x", "y", "w", weight_type="precision")).item()
-0.8
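The two normalizations documented for weight_type can be checked by hand with a plain-Python sketch. The helper `weighted_covariance_sketch` is hypothetical and only mirrors the formulas stated above; it is not the library's implementation.

```python
def weighted_covariance_sketch(x, y, w, weight_type, ddof=1):
    """Weighted covariance using the documented normalizations (illustrative only)."""
    # Omit rows where x, y, or w is null (None here).
    rows = [
        (a, b, c)
        for a, b, c in zip(x, y, w)
        if a is not None and b is not None and c is not None
    ]
    sw = sum(c for _, _, c in rows)
    mean_x = sum(a * c for a, _, c in rows) / sw
    mean_y = sum(b * c for _, b, c in rows) / sw
    # Weighted sum of cross-deviations from the weighted means.
    cross = sum(c * (a - mean_x) * (b - mean_y) for a, b, c in rows)
    if weight_type == "frequency":
        norm = 1 / (sw - ddof)
    elif weight_type == "precision":
        sw2 = sum(c * c for _, _, c in rows)
        norm = sw / (sw**2 - ddof * sw2)
    else:
        raise ValueError(f"unknown weight_type: {weight_type!r}")
    return norm * cross


x = [0.0, 1.0, 2.0]
y = [2.0, 1.0, 0.0]
print(weighted_covariance_sketch(x, y, [1.0, 3.0, 1.0], "frequency"))
print(weighted_covariance_sketch(x, y, [1.0, 2.0, 1.0], "precision"))
```

In both examples the weighted means are 1 and the weighted cross-deviation sum is -2. The frequency normalization is 1 / (5 - 1) = 0.25 and the precision normalization is 4 / (16 - 6) = 0.4, which gives the -0.5 and -0.8 shown in the examples.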
- oi_tools.stats.weighted_mean(
- x: str | Expr | int | float,
- w: str | Expr | int | float,
Compute the weighted mean of an expression.
Rows where either x or w is null are omitted.
- Parameters:
x (str | Expr | int | float) – The expression.
w (str | Expr | int | float) – Weights.
- Returns:
The weighted mean.
- Return type:
pl.Expr
Examples
>>> df = pl.DataFrame({"x": [0.0, 1.0], "w": [1.0, 3.0]})
>>> df.select(weighted_mean("x", "w")).item()
0.75
Null values in either x or w are omitted:
>>> df = pl.DataFrame({"x": [0.0, None, 1.0], "w": [1.0, 1.0, 3.0]})
>>> df.select(weighted_mean("x", "w")).item()
0.75
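The null-omission rule can be illustrated with a small plain-Python sketch. The helper `weighted_mean_sketch` is hypothetical and uses None to stand in for a Polars null; it is not how the library computes the expression.

```python
def weighted_mean_sketch(x, w):
    """Weighted mean that drops any row where x or w is None (illustrative only)."""
    pairs = [(a, b) for a, b in zip(x, w) if a is not None and b is not None]
    # The dropped rows contribute to neither the numerator nor the denominator.
    return sum(a * b for a, b in pairs) / sum(b for _, b in pairs)


print(weighted_mean_sketch([0.0, 1.0], [1.0, 3.0]))
print(weighted_mean_sketch([0.0, None, 1.0], [1.0, 1.0, 3.0]))
```

Both calls evaluate (0 * 1 + 1 * 3) / (1 + 3), because the weight paired with the null x is excluded from the denominator as well.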
- oi_tools.stats.weighted_rank(
- x: str | Expr | int | float,
- w: str | Expr | int | float,
- *,
- ties: Literal['arbitrary', 'average'] = 'average',
Compute the weighted quantile rank of an expression.
- Parameters:
x (str | Expr | int | float) – The values to rank.
w (str | Expr | int | float) – The weights associated with each value.
ties (Literal['arbitrary', 'average']) –
How to handle assigning quantiles in the case of ties:
"arbitrary": break ties arbitrarily,
"average": assign each unit the average rank of all units with the same x value.
- Returns:
A Polars expression producing ranks in (0, 1).
- Return type:
pl.Expr
Examples
>>> df = pl.DataFrame({"x": [1.0, 2.0], "w": [1.0, 3.0]})
>>> df.select(weighted_rank("x", "w")).to_series().to_list()
[0.125, 0.625]
Notes
Behavior is undefined if w contains null values.
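The example output [0.125, 0.625] is consistent with a midpoint convention, where each unit's rank is the centre of the cumulative weight interval it occupies, divided by the total weight. The sketch below assumes that convention and assumes that "average" means a simple unweighted mean over tied units; neither is confirmed by this page, and `weighted_rank_sketch` is a hypothetical helper.

```python
from collections import defaultdict


def weighted_rank_sketch(x, w, ties="average"):
    """Midpoint weighted quantile rank in (0, 1) (illustrative only)."""
    total = sum(w)
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    cum = 0.0
    for i in order:
        # Centre of the cumulative weight interval covered by unit i.
        ranks[i] = (cum + w[i] / 2) / total
        cum += w[i]
    if ties == "average":
        # Assumption: tied units share the plain average of their ranks.
        groups = defaultdict(list)
        for i, value in enumerate(x):
            groups[value].append(i)
        for members in groups.values():
            avg = sum(ranks[i] for i in members) / len(members)
            for i in members:
                ranks[i] = avg
    return ranks


print(weighted_rank_sketch([1.0, 2.0], [1.0, 3.0]))
print(weighted_rank_sketch([1.0, 1.0], [1.0, 1.0], ties="average"))
```

For the documented example the total weight is 4, so the first unit covers [0, 1] with midpoint 0.5 and the second covers [1, 4] with midpoint 2.5, giving 0.125 and 0.625.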
- oi_tools.stats.weighted_variance(
- x: str | Expr | int | float,
- w: str | Expr | int | float,
- *,
- weight_type: Literal['frequency', 'precision'],
- ddof: int = 1,
Compute the weighted variance of an expression.
Rows where either x or w is null are omitted. See the documentation for np.cov for more.
- Parameters:
x (str | Expr | int | float) – The expression.
w (str | Expr | int | float) – Weights.
weight_type (Literal['frequency', 'precision']) – See weighted_covariance.
ddof (int) – Delta degrees of freedom.
- Returns:
The weighted variance.
- Return type:
pl.Expr
Examples
>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 3.0, 1.0]})
>>> df.select(weighted_variance("x", "w", weight_type="frequency")).item()
0.5
>>> df = pl.DataFrame({"x": [0.0, 1.0, 2.0], "w": [1.0, 2.0, 1.0]})
>>> df.select(weighted_variance("x", "w", weight_type="precision")).item()
0.8
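The "repeat count" reading of frequency weights can be checked against the standard library: with integer weights, the frequency-weighted variance should match the ordinary sample variance of the data with each value repeated w times. The helper `frequency_weighted_variance_sketch` is hypothetical and follows the 1 / (sum(w) - ddof) normalization described under weighted_covariance.

```python
import statistics


def frequency_weighted_variance_sketch(x, w, ddof=1):
    """Frequency-weighted variance with 1 / (sum(w) - ddof) normalization (illustrative only)."""
    sw = sum(w)
    mean = sum(a * c for a, c in zip(x, w)) / sw
    return sum(c * (a - mean) ** 2 for a, c in zip(x, w)) / (sw - ddof)


x = [0.0, 1.0, 2.0]
w = [1, 3, 1]

# Expand each value into w copies: [0.0, 1.0, 1.0, 1.0, 2.0].
expanded = [value for value, count in zip(x, w) for _ in range(count)]

weighted = frequency_weighted_variance_sketch(x, w)
unweighted = statistics.variance(expanded)  # sample variance, ddof = 1
print(weighted, unweighted)
```

Both values equal 2 / 4 = 0.5, matching the frequency example above. The same equivalence does not hold for "precision" weights, whose normalization depends on sum(w**2).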