Statistical context for determining which tests are applicable.
This dataclass captures all the information needed to decide which
statistical tests can be applied to the current data/figure context.
It is used by check_applicable() to filter the test registry.
Parameters:
n_groups (int) – Number of groups/levels to compare (e.g., 2 for A vs B).
sample_sizes (list of int) – Sample sizes per group in the same order as the groups.
outcome_type (OutcomeType) – Type of outcome variable:
- “continuous”: numeric, interval/ratio scale
- “ordinal”: ordered categories, ranks
- “binary”: 0/1 or yes/no
- “categorical”: nominal with >= 2 categories
Applicability rule for a specific statistical test.
Each TestRule defines the conditions under which a test is applicable.
The check_applicable() function uses these rules to filter tests
for a given StatContext.
Parameters:
name (str) – Internal name of the test (e.g., “ttest_ind”, “brunner_munzel”).
family (TestFamily) – High-level family of the test:
- “parametric”: t-test, ANOVA, etc.
- “nonparametric”: Mann-Whitney, Kruskal-Wallis, etc.
- “categorical”: Chi-square, Fisher’s exact, etc.
- “correlation”: Pearson, Spearman, etc.
- “normality”: Shapiro-Wilk, etc.
- “effect_size”: Cohen’s d, eta-squared, etc.
- “posthoc”: Tukey, Dunnett, etc.
- “other”: Other tests (Levene, etc.)
min_groups (int) – Minimum required number of groups.
max_groups (int or None) – Maximum allowed number of groups. None means no upper bound.
outcome_types (set of str) – Allowed outcome types for this test.
supports_paired (bool) – Whether the test supports paired/repeated measures.
supports_unpaired (bool) – Whether the test supports independent groups.
design_allowed (set of str) – Allowed designs, e.g., {“between”, “within”}.
requires_control_group (bool) – Whether a dedicated control group is required (e.g., Dunnett).
min_n_total (int or None) – Minimum total sample size. None means no constraint.
min_n_per_group (int or None) – Minimum sample size per group.
needs_normality (bool) – Whether test assumes normality (check normality_ok).
needs_equal_variance (bool) – Whether test assumes equal variances (check variance_homogeneity_ok).
min_factors (int or None) – Minimum number of factors.
max_factors (int or None) – Maximum number of factors.
priority (int) – Priority score for recommendation. Higher = more recommended.
Brunner-Munzel has priority 110 as the recommended default for 2 groups.
description (str) – Human-readable description for tooltips.
families (list of TestFamily or None) – Families to consider. If None, uses standard test families
(parametric, nonparametric, categorical, correlation).
Returns:
test_names – Internal names of recommended tests, sorted by priority.
Check whether a given statistical test is applicable to the context.
This function evaluates all conditions in the TestRule against the
StatContext and returns both the result and human-readable reasons
for any failures (suitable for tooltips).
Parameters:
rule (TestRule) – The rule definition for a specific test.
ctx (StatContext) – The context inferred from the figure and data.
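The rule-against-context evaluation described above can be sketched as follows. This is only an illustration: `SketchRule`, `SketchContext`, and `sketch_check` are hypothetical, simplified stand-ins, not the library's `TestRule`, `StatContext`, or `check_applicable()`, and the `min_n_per_group=3` value is an assumed example.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class SketchContext:
    """Hypothetical, simplified stand-in for StatContext."""
    n_groups: int
    sample_sizes: List[int]
    outcome_type: str


@dataclass
class SketchRule:
    """Hypothetical, simplified stand-in for TestRule."""
    name: str
    min_groups: int
    max_groups: Optional[int]
    outcome_types: Set[str]
    min_n_per_group: Optional[int] = None
    priority: int = 0


def sketch_check(rule: SketchRule, ctx: SketchContext) -> Tuple[bool, List[str]]:
    """Return (applicable, reasons); reasons lists every failed condition."""
    reasons = []
    if ctx.n_groups < rule.min_groups:
        reasons.append(f"needs at least {rule.min_groups} groups")
    if rule.max_groups is not None and ctx.n_groups > rule.max_groups:
        reasons.append(f"allows at most {rule.max_groups} groups")
    if ctx.outcome_type not in rule.outcome_types:
        reasons.append(f"outcome type '{ctx.outcome_type}' not supported")
    if rule.min_n_per_group is not None and min(ctx.sample_sizes) < rule.min_n_per_group:
        reasons.append(f"needs n >= {rule.min_n_per_group} per group")
    return (not reasons, reasons)


# Assumed example rule; only the name and priority come from the text above.
rule = SketchRule("brunner_munzel", 2, 2, {"continuous", "ordinal"},
                  min_n_per_group=3, priority=110)
ok, why = sketch_check(rule, SketchContext(2, [12, 15], "continuous"))
bad, why_bad = sketch_check(rule, SketchContext(3, [12, 15, 2], "binary"))
print(ok, why)
print(bad, why_bad)
```

Collecting every failed condition, rather than stopping at the first, is what makes the reasons usable as tooltip text.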
var_y (str, default 'y') – Label for second sample
alternative ({'two-sided', 'greater', 'less'}, default 'two-sided') – Alternative hypothesis:
- ‘two-sided’: means are different
- ‘greater’: mean of x is greater than y
- ‘less’: mean of x is less than y
equal_var (bool, default True) – If True, assume equal population variances (Student’s t-test).
If False, use Welch’s t-test.
alternative ({'two-sided', 'greater', 'less'}, default 'two-sided') – Alternative hypothesis:
- ‘two-sided’: mean ≠ popmean
- ‘greater’: mean > popmean
- ‘less’: mean < popmean
The one-sample t-test compares sample mean to a known population mean.
When to use:
- Test if sample mean differs from theoretical/known value
- Compare observed data to standard/reference value
- Test if mean differs from zero (common in difference scores)
Assumptions:
- Data are normally distributed
- Observations are independent
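As a rough illustration of what the one-sample t-test computes, the statistic is the difference between the sample mean and `popmean`, divided by the standard error of the mean. The helper name `one_sample_t` is hypothetical and the sketch stops at the statistic and degrees of freedom; it does not compute a p-value.

```python
import math
import statistics


def one_sample_t(sample, popmean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error uses the sample standard deviation (n - 1 denominator).
    se = statistics.stdev(sample) / math.sqrt(n)
    return (mean - popmean) / se, n - 1


t_stat, dof = one_sample_t([5.1, 4.9, 5.3, 5.0, 5.2], popmean=5.0)
print(f"t = {t_stat:.4f}, df = {dof}")
```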
check_assumptions (bool, default True) – Whether to check normality and homogeneity assumptions
plot (bool, default False) – Whether to generate visualization
ax (matplotlib.axes.Axes, optional) – Axes object to plot on. If None and plot=True, creates new figure.
If provided, automatically enables plotting.
data (DataFrame, str, or None, optional) – DataFrame or CSV path. When provided with value_col and group_col,
groups are extracted automatically (seaborn-style).
group_col (str, optional) – Column containing group labels (used with data=).
return_as ({'dict', 'dataframe'}, default 'dict') – Output format
decimals (int, default 3) – Number of decimal places for rounding
verbose (bool, default False) – Whether to print test results
Returns:
results – Test results including:
- test_method: ‘One-way ANOVA’
- statistic: F-statistic value
- pvalue: p-value
- stars: Significance stars
- significant: Whether null hypothesis is rejected
- effect_size: Eta-squared (η²)
- effect_size_metric: ‘eta-squared’
- effect_size_interpretation: Interpretation of eta-squared
- n_groups: Number of groups
- n_samples: Sample sizes for each group
- df_between: Degrees of freedom between groups
- df_within: Degrees of freedom within groups
- var_names: Group labels
- assumptions_met: Whether assumptions are satisfied
- H0: Null hypothesis description
One-way ANOVA (Analysis of Variance) tests whether samples from different
groups have the same population mean.
Null Hypothesis (H0): All groups have equal population means
Alternative Hypothesis (H1): At least one group mean differs
Assumptions:
1. Independence: Observations within and between groups are independent
2. Normality: Data in each group are normally distributed
- Can be checked with test_shapiro()
- Robust to moderate violations with large samples (n > 30 per group)
3. Homogeneity of variance: Groups have equal population variances
- Can be checked with Levene’s test
- If violated, consider Welch’s ANOVA or non-parametric alternative
When assumptions are violated:
- Non-normality: Use test_kruskal() (Kruskal-Wallis test)
- Unequal variances: Use Welch’s ANOVA (not yet implemented)
- Outliers present: Use test_kruskal() or remove outliers
Interpretation:
- η² < 0.01: negligible
- η² < 0.06: small
- η² < 0.14: medium
- η² ≥ 0.14: large
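The F-statistic and eta-squared reported above can be reproduced by hand from the between-group and within-group sums of squares. The sketch below uses hypothetical helper names (`one_way_anova`, `interpret_eta_squared`) and the same three groups as the documented example; it is not the library's implementation.

```python
import statistics


def one_way_anova(groups):
    """Return (F, eta_squared, df_between, df_within) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - statistics.fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared, df_between, df_within


def interpret_eta_squared(eta2):
    """Apply the thresholds listed in the text."""
    if eta2 < 0.01:
        return "negligible"
    if eta2 < 0.06:
        return "small"
    if eta2 < 0.14:
        return "medium"
    return "large"


f_stat, eta2, df_b, df_w = one_way_anova([[1, 2, 3, 4, 5], [3, 4, 5, 6, 7], [5, 6, 7, 8, 9]])
label = interpret_eta_squared(eta2)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}, eta^2 = {eta2:.3f} ({label})")
```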
Post-hoc tests:
If significant, perform pairwise comparisons with correction:
- test_ttest_ind() for all pairs (if assumptions met)
- test_brunner_munzel() for all pairs (robust alternative)
- correct_bonferroni() or correct_fdr() for multiple comparisons
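To make the multiple-comparison step concrete, a Bonferroni adjustment multiplies each pairwise p-value by the number of comparisons and caps the result at 1. The function below is a hypothetical stand-alone sketch and is not the library's `correct_bonferroni()`; the p-values are made up for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni-adjust a list of p-values: p_adj = min(1, p * m)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Illustrative p-values from three pairwise comparisons.
adjusted = bonferroni([0.01, 0.04, 0.30])
print([round(p, 3) for p in adjusted])
```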
Examples
>>> # Three groups with different means
>>> group1 = np.array([1, 2, 3, 4, 5])
>>> group2 = np.array([3, 4, 5, 6, 7])
>>> group3 = np.array([5, 6, 7, 8, 9])
>>> result = test_anova([group1, group2, group3])
>>> result['rejected']
True
result (dict or DataFrame) – Test results including:
- statistic: F-statistic
- pvalue: p-value (possibly corrected)
- df_effect: Degrees of freedom for effect
- df_error: Degrees of freedom for error
- effect_size: Partial eta-squared
- sphericity_W: Mauchly’s W (if checked)
- sphericity_pvalue: Sphericity test p-value
- sphericity_met: Whether sphericity assumption met
- epsilon_gg: Greenhouse-Geisser epsilon
- correction_applied: Which correction was applied
- significant: Whether to reject null hypothesis
If plot=True, returns tuple of (result, figure)
Notes
Repeated measures ANOVA tests whether the means differ across multiple
conditions measured on the same subjects (within-subjects factor).
Null Hypothesis (H0): All condition means are equal
Assumptions:
1. Independence of subjects: Different subjects are independent
2. Normality: Differences between conditions are normally distributed
3. Sphericity: Variances of differences between all pairs of conditions
are equal (tested with Mauchly’s test)
Sphericity:
The sphericity assumption is unique to repeated measures ANOVA. If violated:
- Greenhouse-Geisser correction: More conservative, use when ε < 0.75
- Huynh-Feldt correction: Less conservative (not implemented)
- Multivariate approach: MANOVA (not implemented)
Greenhouse-Geisser Correction:
Adjusts degrees of freedom by multiplying by epsilon (ε):
- df_effect_adj = ε × df_effect
- df_error_adj = ε × df_error
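The degrees-of-freedom adjustment above amounts to two multiplications. The sketch below assumes a single within-subjects factor, so that df_effect = k - 1 and df_error = (n - 1)(k - 1); the function name is hypothetical.

```python
def greenhouse_geisser_df(n_subjects, n_conditions, epsilon):
    """Return (df_effect_adj, df_error_adj) for a one-factor repeated measures design."""
    lower_bound = 1.0 / (n_conditions - 1)  # smallest possible epsilon
    if not lower_bound <= epsilon <= 1.0:
        raise ValueError(f"epsilon must lie in [{lower_bound:.3f}, 1]")
    df_effect = n_conditions - 1
    df_error = (n_subjects - 1) * (n_conditions - 1)
    return epsilon * df_effect, epsilon * df_error


df_effect_adj, df_error_adj = greenhouse_geisser_df(n_subjects=10, n_conditions=4, epsilon=0.6)
print(f"adjusted df: ({df_effect_adj:.1f}, {df_error_adj:.1f})")
```

With ε = 1 (sphericity holds) the degrees of freedom are unchanged; smaller ε shrinks both, which makes the F-test more conservative.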
Effect size (partial eta-squared) is interpreted with the same thresholds as regular eta-squared:
- < 0.01: negligible
- < 0.06: small
- < 0.14: medium
- ≥ 0.14: large
Post-hoc tests:
If significant, use pairwise t-tests with correction:
- test_ttest_rel() for all pairs
- correct_bonferroni() or correct_holm() for multiple comparisons
result (dict or DataFrame) – Test results including for each effect (A, B, interaction):
- effect: Name of effect
- statistic: F-statistic
- pvalue: p-value
- df_effect: Degrees of freedom for effect
- df_error: Degrees of freedom for error
- effect_size: Partial eta-squared
- rejected: Whether to reject null hypothesis
- significant: Same as rejected
If plot=True, returns tuple of (result, figure)
Notes
Two-way ANOVA tests the effects of two independent categorical variables
(factors) on a continuous dependent variable, including their interaction.
Three Hypotheses Tested:
1. Main effect of Factor A: Marginal means of A levels differ
2. Main effect of Factor B: Marginal means of B levels differ
3. Interaction A×B: Effect of A depends on level of B (and vice versa)
Null Hypotheses:
- H0_A: All marginal means of Factor A are equal
- H0_B: All marginal means of Factor B are equal
- H0_AB: No interaction between Factors A and B
Assumptions:
1. Independence: Observations are independent
2. Normality: Residuals are normally distributed within each cell
3. Homogeneity of variance: Equal variances across all cells
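Effect size (partial eta-squared), computed separately for each effect (A, B, A×B):
\[\eta_p^2 = \frac{SS_{effect}}{SS_{effect} + SS_{error}}\]
Where: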
- SS_A: Sum of squares for main effect A
- SS_B: Sum of squares for main effect B
- SS_AB: Sum of squares for interaction A×B
- SS_error: Sum of squares for error (within cells)
Interpreting Results:
- Significant interaction: Main effects should be interpreted cautiously.
Use simple effects analysis or interaction plots.
- Non-significant interaction: Main effects can be interpreted directly.
Post-hoc tests:
If main effects are significant:
- Pairwise comparisons with test_ttest_ind()
- Apply corrections: correct_bonferroni(), correct_holm()
If interaction is significant:
- Simple effects: test effect of A at each level of B
- Pairwise comparisons within each level
var_y (str, default 'y') – Label for second sample
alternative ({'two-sided', 'greater', 'less'}, default 'two-sided') – Alternative hypothesis:
- ‘two-sided’: distributions differ
- ‘greater’: x tends to be greater than y
- ‘less’: x tends to be less than y
The Brunner-Munzel test is a non-parametric test for comparing two independent
samples. It is more robust than the t-test when:
- Distributions are non-normal
- Variances are unequal
- Sample sizes differ
- Data contain outliers
Unlike Mann-Whitney U test, Brunner-Munzel does not assume equal variances
and provides better control of Type I error rate.
The test statistic W is approximately t-distributed under the null hypothesis.
P(X > Y): Probability that a random value from X exceeds a random
value from Y. Interpretation:
- 0.50: No effect (chance)
- 0.56: Small effect
- 0.64: Medium effect
- 0.71: Large effect
Cliff’s delta (δ): Ranges from -1 to 1, related to P(X>Y) by:
δ = 2×P(X>Y) - 1. Interpretation:
- |δ| < 0.147: Negligible
- |δ| < 0.33: Small
- |δ| < 0.474: Medium
- |δ| ≥ 0.474: Large
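Both effect sizes can be computed directly by comparing every pair of observations. The sketch below counts ties as half a "win", which is the convention under which δ = 2×P(X>Y) - 1 holds exactly; the function names are hypothetical.

```python
def prob_superiority(x, y):
    """Estimate P(X > Y), counting ties as one half."""
    greater = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * ties) / (len(x) * len(y))


def cliffs_delta(x, y):
    """Cliff's delta via the relation delta = 2 * P(X > Y) - 1."""
    return 2.0 * prob_superiority(x, y) - 1.0


x = [1, 2, 3, 4, 5]
y = [3, 4, 5, 6, 7]
p_xy = prob_superiority(x, y)
delta = cliffs_delta(x, y)
print(f"P(X>Y) = {p_xy:.2f}, delta = {delta:.2f}")
```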
Perform Wilcoxon signed-rank test (non-parametric paired test).
Parameters:
x (array or Series) – First sample (e.g., pre-test, baseline)
y (array or Series) – Second sample (e.g., post-test, follow-up)
Must have same length as x
var_x (str, default 'before') – Label for first sample
var_y (str, default 'after') – Label for second sample
alternative ({'two-sided', 'greater', 'less'}, default 'two-sided') – Alternative hypothesis:
- ‘two-sided’: distributions differ
- ‘greater’: x tends to be greater than y
- ‘less’: x tends to be less than y
The Wilcoxon signed-rank test is the non-parametric alternative to
the paired t-test. It tests whether the median of differences is zero.
When to use:
- Paired samples (before-after, matched pairs)
- Data are not normally distributed
- Ordinal data or continuous data with outliers
- Robust alternative to paired t-test
Assumptions:
- Paired observations
- Differences are symmetric around the median
- Ordinal or continuous data
How it works:
1. Compute differences: d = x - y
2. Remove zero differences
3. Rank absolute differences
4. Sum ranks of positive differences (W+)
5. Sum ranks of negative differences (W-)
6. Test statistic: W = min(W+, W-)
Effect size (rank-biserial correlation):
\[r = \frac{W_+ - W_-}{n(n+1)/2}\]
Ranges from -1 to 1:
- r close to 1: x > y (large positive effect)
- r close to 0: no difference
- r close to -1: x < y (large negative effect)
Interpretation:
- |r| < 0.1: negligible
- |r| < 0.3: small
- |r| < 0.5: medium
- |r| ≥ 0.5: large
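The six steps and the rank-biserial formula above can be traced in a few lines. This sketch assumes that n in the effect-size formula is the number of non-zero differences and that tied absolute differences receive average ranks; `average_ranks` and `wilcoxon_sketch` are hypothetical helpers, and no p-value is computed.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    sorted_vals = sorted(values)
    rank_of = {}
    i = 0
    while i < len(sorted_vals):
        j = i
        while j < len(sorted_vals) and sorted_vals[j] == sorted_vals[i]:
            j += 1
        rank_of[sorted_vals[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j
    return [rank_of[v] for v in values]


def wilcoxon_sketch(x, y):
    """Return (W, W_plus, W_minus, rank_biserial_r)."""
    diffs = [a - b for a, b in zip(x, y)]                    # step 1
    diffs = [d for d in diffs if d != 0]                     # step 2
    ranks = average_ranks([abs(d) for d in diffs])           # step 3
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)   # step 4
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)  # step 5
    n = len(diffs)
    r = (w_plus - w_minus) / (n * (n + 1) / 2)
    return min(w_plus, w_minus), w_plus, w_minus, r          # step 6


w, w_plus, w_minus, r = wilcoxon_sketch([10, 12, 9, 14, 11], [8, 13, 6, 10, 11])
print(f"W = {w}, W+ = {w_plus}, W- = {w_minus}, r = {r:.2f}")
```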
plot (bool, default False) – Whether to generate box plots
ax (matplotlib.axes.Axes, optional) – Axes object to plot on. If None and plot=True, creates new figure.
If provided, automatically enables plotting.
data (DataFrame, str, or None, optional) – DataFrame or CSV path. When provided with value_col and group_col,
groups are extracted automatically (seaborn-style).
The Kruskal-Wallis H test is a non-parametric alternative to one-way ANOVA.
It tests whether samples originate from the same distribution by comparing
the ranks of observations across groups.
Null Hypothesis (H0): All groups have the same population median
(more precisely: all groups have identical distribution functions)
Assumptions:
- Independent observations within and between groups
- Ordinal or continuous data
- Similar distribution shapes across groups (for median interpretation)
Advantages over ANOVA:
- No normality assumption required
- Robust to outliers
- Works with ordinal data
- More powerful than ANOVA for heavy-tailed distributions
When to use:
- Comparing 3+ independent groups
- Data violate normality (use test_shapiro to check)
- Presence of outliers
- Ordinal data (e.g., Likert scales)
Test statistic:
\[H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)\]
Where:
- k: Number of groups
- N: Total sample size
- R_i: Sum of ranks for group i
- n_i: Sample size of group i
Effect Size (Epsilon-squared):
\[\epsilon^2 = \frac{H - k + 1}{N - k}\]
Interpretation (similar to eta-squared):
- ε² < 0.01: negligible
- ε² < 0.06: small
- ε² < 0.14: medium
- ε² ≥ 0.14: large
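A minimal sketch of the H statistic and the effect-size formula given above, written with the standard library only. It applies the usual tie correction (dividing H by 1 - Σ(t³ - t)/(N³ - N)); the helper names are hypothetical and no p-value is computed.

```python
from collections import Counter


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    sorted_vals = sorted(values)
    rank_of = {}
    i = 0
    while i < len(sorted_vals):
        j = i
        while j < len(sorted_vals) and sorted_vals[j] == sorted_vals[i]:
            j += 1
        rank_of[sorted_vals[i]] = (i + 1 + j) / 2.0
        i = j
    return [rank_of[v] for v in values]


def kruskal_sketch(groups):
    """Return (H, epsilon_squared) using the formulas in the text."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total, k = len(pooled), len(groups)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])  # R_i
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    # Tie correction; the divisor is 1 when there are no ties.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1.0 - tie_term / (n_total ** 3 - n_total)
    eps2 = (h - k + 1) / (n_total - k)
    return h, eps2


h, eps2 = kruskal_sketch([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"H = {h:.2f}, epsilon^2 = {eps2:.3f}")
```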
Post-hoc tests:
If significant, use pairwise comparisons with correction:
- test_brunner_munzel() for all pairs
- correct_bonferroni() or correct_fdr() for multiple comparisons
Tied ranks: Handled automatically by scipy.stats.kruskal()
Examples
>>> # Three groups with different medians
>>> group1 = np.array([1, 2, 3, 4, 5])
>>> group2 = np.array([3, 4, 5, 6, 7])
>>> group3 = np.array([5, 6, 7, 8, 9])
>>> result = test_kruskal([group1, group2, group3])
>>> result['rejected']
True
>>> # With custom names and plot
>>> result, fig = test_kruskal(
...     [group1, group2, group3],
...     var_names=['Control', 'Treatment 1', 'Treatment 2'],
...     plot=True
... )
Interpretation:
- |r| < 0.1: negligible
- |r| < 0.3: small
- |r| < 0.5: medium
- |r| ≥ 0.5: large
Advantages:
- No normality assumption required
- Robust to outliers
- Works with ordinal data
- More powerful than t-test for non-normal data
When to use:
- Comparing two independent groups
- Data violate normality
- Presence of outliers
- Ordinal data (e.g., Likert scales)
- Small sample sizes
Comparison with other tests:
- vs t-test: More robust, less powerful when assumptions met
- vs Brunner-Munzel: MWU assumes identical shape, BM does not
- vs KS test: MWU tests location, KS tests entire distribution
Note on relationship to Brunner-Munzel:
Mann-Whitney U assumes samples have the same distribution shape
(differing only in location). For more robust analysis without this
assumption, use test_brunner_munzel() instead.
plot (bool, default False) – Whether to generate visualization
ax (matplotlib.axes.Axes, optional) – Axes object to plot on. If None and plot=True, creates new figure.
If provided, automatically enables plotting.
return_as ({'dict', 'dataframe'}, default 'dict') – Output format
decimals (int, default 3) – Number of decimal places for rounding
verbose (bool, default False) – Whether to print test results
Returns:
result – Test results including:
- statistic: Chi-square statistic (Friedman’s χ²)
- pvalue: p-value
- df: Degrees of freedom (k - 1)
- kendall_w: Kendall’s W (coefficient of concordance)
- effect_size: Kendall’s W
- effect_size_interpretation: interpretation
- n_subjects: Number of subjects
- n_conditions: Number of conditions
- mean_ranks: Mean rank for each condition
- significant: Whether to reject null hypothesis
The Friedman test is the non-parametric alternative to repeated measures
ANOVA. It is used when:
- Normality assumption is violated
- Data are ordinal (e.g., Likert scales)
- Sample sizes are small
Null Hypothesis (H0): All conditions have the same distribution
Alternative Hypothesis (H1): At least one condition differs
Procedure:
1. Rank observations within each subject (across conditions)
2. Compute sum of ranks for each condition
3. Calculate test statistic based on rank sums
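The three-step procedure can be sketched as below, using the textbook Friedman statistic χ² = 12/(nk(k+1)) · ΣR_j² - 3n(k+1) and Kendall's W = χ²/(n(k-1)). This is an assumption about the exact formulas, since they are not spelled out here, and the sketch omits the tie correction; the helper names and data are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    sorted_vals = sorted(values)
    rank_of = {}
    i = 0
    while i < len(sorted_vals):
        j = i
        while j < len(sorted_vals) and sorted_vals[j] == sorted_vals[i]:
            j += 1
        rank_of[sorted_vals[i]] = (i + 1 + j) / 2.0
        i = j
    return [rank_of[v] for v in values]


def friedman_sketch(rows):
    """rows[i][j] = measurement of subject i under condition j."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    for row in rows:                       # step 1: rank within each subject
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank           # step 2: rank sum per condition
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)
    kendall_w = chi2 / (n * (k - 1))       # step 3: statistic and effect size
    mean_ranks = [r / n for r in rank_sums]
    return chi2, kendall_w, mean_ranks


chi2, kendall_w, mean_ranks = friedman_sketch(
    [[10, 12, 15], [9, 11, 14], [8, 13, 12], [11, 12, 16]]
)
print(f"chi2 = {chi2:.2f}, W = {kendall_w:.4f}, mean ranks = {mean_ranks}")
```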
Range: -1 ≤ r ≤ 1
- r = 1: Perfect positive linear relationship
- r = 0: No linear relationship
- r = -1: Perfect negative linear relationship
Coefficient of determination (R²):
\[R^2 = r^2\]
R² represents the proportion of variance in y explained by x.
Interpretation (Cohen, 1988):
- |r| < 0.1: negligible
- |r| < 0.3: small
- |r| < 0.5: medium
- |r| ≥ 0.5: large
Assumptions:
1. Linearity: Relationship between variables is linear
2. Normality: Both variables are normally distributed (for hypothesis testing)
3. Homoscedasticity: Variance is constant across the range
4. Independence: Observations are independent
When to use:
- Assessing linear relationship between two continuous variables
- Both variables approximately normally distributed
- No major outliers present
- Relationship appears linear on scatter plot
When NOT to use:
- Non-linear relationships (consider transformation or Spearman)
- Ordinal data (use Spearman)
- Severe outliers present (use Spearman)
- Non-normal distributions (use Spearman)
Confidence Interval:
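Computed using Fisher’s z-transformation:
\[z = \operatorname{arctanh}(r), \quad SE = \frac{1}{\sqrt{n - 3}}, \quad CI = \tanh\left(z \pm z_{1-\alpha/2} \cdot SE\right)\]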
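A short sketch of the Fisher z interval for a Pearson correlation: transform r with arctanh, build a normal-theory interval with standard error 1/sqrt(n - 3), and transform back with tanh. The function name `pearson_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, alpha=0.05):
    """Confidence interval for Pearson's r via Fisher's z-transformation."""
    if n <= 3:
        raise ValueError("need n > 3 for the Fisher z interval")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


ci_lower, ci_upper = pearson_ci(r=0.5, n=30)
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
```

The interval is asymmetric around r because the back-transformation compresses values near ±1.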
plot (bool, default False) – If True, create visualization with scatter plot of ranks
data (DataFrame, str, or None, optional) – DataFrame or CSV path. When provided, string values for x/y
are resolved as column names (seaborn-style).
return_as ({'dict', 'dataframe'}, default 'dict') – Return format
decimals (int, default 3) – Number of decimal places for rounding
Returns:
result – Test results with:
- test_method: Name of test
- statistic: Spearman’s rho (ρ)
- pvalue: p-value
- alternative: Alternative hypothesis
- alpha: Significance level
- significant: Whether result is significant
- stars: Significance stars
- effect_size: Same as statistic (ρ)
- effect_size_metric: ‘rho’
- effect_size_interpretation: Interpretation
- rho_squared: Proportion of variance explained
- n: Sample size
- var_x: First variable name
- var_y: Second variable name
# Example 5: One-tailed test
>>> x = np.arange(20)
>>> y = x + np.random.normal(0, 2, size=20)
>>> result = test_spearman(x, y, alternative='greater')
>>> print(f"One-tailed p-value: {result['pvalue']:.4f}")
# Example 6: Non-linear monotonic relationship
>>> x = np.linspace(0, 10, 50)
>>> y = np.log(x + 1) + np.random.normal(0, 0.1, size=50)
>>> result = test_spearman(x, y, var_x='x', var_y='log(x+1)', plot=True)
# Example 7: Export to various formats
>>> result = test_spearman(x, y, return_as='dataframe')
>>> convert_results(result, return_as='latex', path='spearman.tex')
>>> convert_results(result, return_as='csv', path='spearman.csv')
plot (bool, default False) – Whether to generate scatter plot
ax (matplotlib.axes.Axes, optional) – Axes to plot on. If provided, plot is set to True
data (DataFrame, str, or None, optional) – DataFrame or CSV path. When provided, string values for x/y
are resolved as column names (seaborn-style).
return_as ({'dict', 'dataframe'}, default 'dict') – Output format
decimals (int, default 3) – Number of decimal places for rounding
verbose (bool, default False) – If True, print test results to logger
Returns:
result – Test results including:
- test_method: Name of test
- statistic: Kendall’s tau coefficient
- pvalue: p-value
- tau_squared: tau² (proportion of variance explained)
- effect_size: tau (same as statistic)
- effect_size_interpretation: interpretation
- n: Sample size
- n_concordant: Number of concordant pairs
- n_discordant: Number of discordant pairs
- n_ties: Number of tied pairs
- significant: Whether to reject null hypothesis
- stars: Significance stars
Kendall’s tau is a non-parametric measure of monotonic association between
two variables. It is based on concordant and discordant pairs.
Null Hypothesis (H0): No monotonic association (tau = 0)
Alternative Hypothesis (H1): Monotonic association exists
Concordant vs Discordant Pairs:
For pairs (x_i, y_i) and (x_j, y_j):
- Concordant: (x_i < x_j and y_i < y_j) or (x_i > x_j and y_i > y_j)
- Discordant: (x_i < x_j and y_i > y_j) or (x_i > x_j and y_i < y_j)
Kendall’s tau-b:
\[\tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}\]
Where:
- n_c: Number of concordant pairs
- n_d: Number of discordant pairs
- n_0: n(n-1)/2 (total possible pairs)
- n_1: Sum of t_i(t_i-1)/2 for ties in x
- n_2: Sum of u_j(u_j-1)/2 for ties in y
Interpretation:
- tau = 1: Perfect positive association
- tau = 0: No association
- tau = -1: Perfect negative association
Effect size interpretation (same as correlation):
- |tau| < 0.1: negligible
- |tau| < 0.3: small
- |tau| < 0.5: medium
- |tau| ≥ 0.5: large
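The concordant/discordant counting can be written as a direct O(n²) loop over all pairs. The sketch below computes tau-b from the quantities defined above (n_c, n_d, n_0, and the tie counts n_1 and n_2); the function name is hypothetical and no p-value is computed.

```python
import math


def kendall_tau_b(x, y):
    """Return (tau_b, n_concordant, n_discordant) by checking every pair."""
    n = len(x)
    n_c = n_d = n_1 = n_2 = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                n_1 += 1  # pair tied in x
            if dy == 0:
                n_2 += 1  # pair tied in y
            if dx * dy > 0:
                n_c += 1
            elif dx * dy < 0:
                n_d += 1
    n_0 = n * (n - 1) // 2
    tau = (n_c - n_d) / math.sqrt((n_0 - n_1) * (n_0 - n_2))
    return tau, n_c, n_d


tau, n_c, n_d = kendall_tau_b([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
print(f"tau = {tau:.2f}, concordant = {n_c}, discordant = {n_d}")
```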
Advantages over Spearman:
- More robust to outliers
- Better for small samples
- Better interpretation (probability of concordance)
- More accurate p-values with ties
Disadvantages:
- Computationally more expensive (O(n²))
- Generally smaller magnitude than Spearman’s rho
- Less intuitive interpretation than Pearson
When to use Kendall’s tau:
- Small sample sizes (n < 30)
- Data with many ties
- Ordinal data
- Non-normal data with outliers
scitex_stats.test_theilsen(x, y, var_x='x', var_y='y', data=None, return_as='dict', verbose=True)
Theil-Sen robust regression estimator.
A robust non-parametric regression method that estimates the slope as the
median of all pairwise slopes. Highly resistant to outliers (up to 29.3%
breakdown point).
Parameters:
x (array-like) – Independent variable
y (array-like) – Dependent variable
var_x (str, default="x") – Name of independent variable (for reporting)
var_y (str, default="y") – Name of dependent variable (for reporting)
data (DataFrame, str, or None, optional) – DataFrame or CSV path. When provided, string values for x/y
are resolved as column names (seaborn-style).
return_as (str, default="dict") – Format of return value: “dict” or “dataframe”
verbose (bool, default=True) – Whether to print results
Returns:
Dictionary or DataFrame containing:
- slope : float
Theil-Sen slope estimate (median of pairwise slopes)
The Theil-Sen estimator:
- Is robust to outliers (up to ~29% outliers)
- Has no distributional assumptions
- Is asymptotically normal
- Has ~64% efficiency compared to OLS for normal data
- Computational complexity: O(n²)
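The median-of-pairwise-slopes idea fits in a few lines. In this sketch the intercept is taken as the median of y - slope·x, which is one common convention and an assumption here, since the intercept rule is not stated above; `theil_sen` is a hypothetical name.

```python
import statistics
from itertools import combinations


def theil_sen(x, y):
    """Return (slope, intercept) with slope = median of all pairwise slopes."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]  # skip pairs with identical x (undefined slope)
    ]
    slope = statistics.median(slopes)
    # Assumed intercept convention: median of the residual offsets.
    intercept = statistics.median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# y = 2x + 1 with the last point replaced by a gross outlier.
slope, intercept = theil_sen([0, 1, 2, 3, 4], [1, 3, 5, 7, 100])
print(f"slope = {slope}, intercept = {intercept}")
```

The single outlier affects only 4 of the 10 pairwise slopes, so the median still recovers the underlying line.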
# Example 2: Using DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame([[12, 8, 5], [15, 20, 10]],
...                   index=['Group A', 'Group B'],
...                   columns=['Low', 'Med', 'High'])
>>> result = test_chi2(df, plot=True)
# Example 3: Test gender × preference association
>>> observed = np.array([
...     [20, 30, 15],  # Male: product A, B, C
...     [25, 20, 40]   # Female: product A, B, C
... ])
>>> result = test_chi2(observed, var_row='Gender', var_col='Product', plot=True)
>>> print(f"χ² = {result['statistic']:.2f}, p = {result['pvalue']:.4f}")
>>> print(f"Cramér's V = {result['effect_size']:.3f} ({result['effect_size_interpretation']})")
# Example 4: Small expected frequencies warning
>>> observed = np.array([[2, 8], [3, 7]])  # Small counts
>>> result = test_chi2(observed)
# Example 5: Export to various formats
>>> result = test_chi2(observed, return_as='dataframe')
>>> convert_results(result, return_as='latex', path='chi2_test.tex')
Tests association between two binary categorical variables.
Exact test (no large-sample approximation required).
Parameters:
observed (array-like or DataFrame) – 2×2 contingency table as [[a, b], [c, d]]
If DataFrame, row/column names used as variable names
var_row (str, optional) – Name of row variable (default: ‘row_variable’)
var_col (str, optional) – Name of column variable (default: ‘col_variable’)
alternative ({'two-sided', 'less', 'greater'}, default 'two-sided') – Alternative hypothesis:
- ‘two-sided’: odds ratio ≠ 1
- ‘less’: odds ratio < 1
- ‘greater’: odds ratio > 1
alpha (float, default 0.05) – Significance level for confidence interval
plot (bool, default False) – If True, create visualization
ax (matplotlib.axes.Axes, optional) – Axes to plot on. If provided, plot is set to True
return_as ({'dict', 'dataframe'}, default 'dict') – Return format
decimals (int, default 3) – Number of decimal places for rounding
verbose (bool, default False) – If True, print test results to logger
Returns:
result – Test results with:
- test_method: Name of test
- statistic: Odds ratio
- pvalue: Exact p-value
- alternative: Alternative hypothesis
- alpha: Significance level
- significant: Whether result is significant
- stars: Significance stars
- effect_size: Odds ratio
- effect_size_metric: ‘Odds ratio’
- effect_size_interpretation: Interpretation
- ci_lower: Lower CI bound for odds ratio
- ci_upper: Upper CI bound for odds ratio
- n: Total sample size
- var_row: Row variable name
- var_col: Column variable name
Fisher’s exact test computes exact probability of observed table
(and more extreme tables) under independence assumption.
H₀: Two binary variables are independent (OR = 1)
H₁: Variables are associated (OR ≠ 1)
Odds Ratio (OR):
For table [[a, b], [c, d]]:
OR = (a × d) / (b × c)
Interpretation:
- OR = 1: No association
- OR > 1: Positive association
- OR < 1: Negative association
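For a concrete picture of the odds ratio and the exact p-value, the sketch below enumerates every 2×2 table with the observed margins and sums the hypergeometric probabilities of tables no more likely than the observed one. That two-sided rule is a common convention and an assumption about this library; the function name and the example table are illustrative.

```python
from math import comb


def fisher_exact_sketch(table):
    """Return (odds_ratio, two_sided_p) for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    row1, col1, n = a + b, a + c, a + b + c + d

    def weight(k):
        # Unnormalised hypergeometric probability that the top-left cell equals k.
        return comb(row1, k) * comb(n - row1, col1 - k)

    k_min, k_max = max(0, col1 - (n - row1)), min(row1, col1)
    observed = weight(a)
    # Integer comparison avoids floating-point issues in the "<=" test.
    extreme = sum(weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed)
    return odds_ratio, extreme / comb(n, col1)


odds_ratio, pvalue = fisher_exact_sketch([[8, 2], [1, 5]])
print(f"OR = {odds_ratio:.1f}, p = {pvalue:.4f}")
```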
When to use:
- 2×2 contingency tables
- Small sample sizes (any expected cell count < 5)
- Need exact p-value (not approximation)
Advantages over chi-square:
- Exact test (valid for any sample size)
- No minimum expected frequency requirement
- More powerful for small samples
References
Fisher, R. A. (1922). On the interpretation of χ² from contingency
tables, and the calculation of P. Journal of the Royal Statistical
Society, 85(1), 87-94.
Perform McNemar’s test for paired categorical data.
Tests whether there is a significant change in proportions for paired binary data.
Appropriate for before-after studies with binary outcomes.
Parameters:
observed (array-like, shape (2, 2)) –
2×2 contingency table:
[[a, b],
[c, d]]
where:
- a: both conditions negative (0,0)
- b: before negative, after positive (0,1)
- c: before positive, after negative (1,0)
- d: both conditions positive (1,1)
var_before (str, optional) – Name for before condition
var_after (str, optional) – Name for after condition
correction (bool, default True) – Whether to apply continuity correction (recommended for small samples)
plot (bool, default False) – Whether to generate visualization
ax (matplotlib.axes.Axes, optional) – Axes to plot on. If provided, plot is set to True
return_as ({'dict', 'dataframe'}, default 'dict') – Output format
decimals (int, default 3) – Number of decimal places for rounding
verbose (bool, default False) – If True, print test results to logger
Returns:
result – Test results including:
- test_method: Name of test
- statistic: χ² statistic
- pvalue: p-value
- df: degrees of freedom (always 1)
- b: count of (before=0, after=1)
- c: count of (before=1, after=0)
- odds_ratio: b / c
- effect_size: odds ratio
- effect_size_interpretation: interpretation
- significant: whether to reject null hypothesis
- stars: significance stars
Null hypothesis: The marginal proportions are equal (no change)
Alternative: The marginal proportions differ (significant change)
Assumptions:
- Paired data (matched observations)
- Binary outcomes for both conditions
- Large enough sample (b + c ≥ 10 recommended for chi-square approximation)
Effect size (Odds Ratio):
OR = b / c
- OR = 1: no change
- OR > 1: increase (more transitions from 0→1 than 1→0)
- OR < 1: decrease (more transitions from 1→0 than 0→1)
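Only the discordant cells b and c enter the test. The sketch below assumes the standard chi-square form, (b - c)²/(b + c), with the usual continuity correction (|b - c| - 1)²/(b + c) when `correction=True`; whether the library uses exactly these forms is an assumption. For one degree of freedom the chi-square tail probability equals erfc(sqrt(χ²/2)), so no external library is needed.

```python
import math


def mcnemar_sketch(table, correction=True):
    """Return (chi2, pvalue, odds_ratio) for a 2x2 table [[a, b], [c, d]]."""
    b, c = table[0][1], table[1][0]
    if b + c == 0:
        raise ValueError("no discordant pairs: the statistic is undefined")
    if correction:
        chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    else:
        chi2 = (b - c) ** 2 / (b + c)
    pvalue = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, df = 1
    odds_ratio = b / c if c else float("inf")
    return chi2, pvalue, odds_ratio


# Illustrative table: 15 subjects changed 0 -> 1, 5 changed 1 -> 0.
chi2, pvalue, odds_ratio = mcnemar_sketch([[30, 15], [5, 50]])
print(f"chi2 = {chi2:.2f}, p = {pvalue:.4f}, OR = {odds_ratio:.1f}")
```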
result (dict or DataFrame) – Test results including:
- statistic: Cochran’s Q statistic
- pvalue: p-value
- df: Degrees of freedom (k - 1)
- effect_size: Kendall’s W
- effect_size_interpretation: interpretation
- n_subjects: Number of subjects
- n_conditions: Number of conditions
- proportions: Success proportion for each condition
- n_successes: Number of successes per condition
- significant: Whether to reject null hypothesis
- stars: Significance stars
If plot=True, returns tuple of (result, figure)
Notes
Cochran’s Q test is used for repeated binary measurements (dichotomous data)
on the same subjects across 3+ conditions.
Null Hypothesis (H0): Proportions of successes are equal across conditions
Alternative Hypothesis (H1): At least one proportion differs
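Test statistic:
\[Q = \frac{(k-1)\left[k \sum_{j=1}^{k} G_j^2 - N^2\right]}{kN - \sum_{i=1}^{n} L_i^2}\]
Where: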
- k: Number of conditions
- n: Number of subjects
- G_j: Number of successes in condition j
- L_i: Number of successes for subject i (across conditions)
- N: Total number of successes
Q follows chi-square distribution with k-1 degrees of freedom.
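Using the quantities defined above (G_j, L_i, N, k), the standard form of Cochran's Q is Q = (k - 1)(k·ΣG_j² - N²) / (k·N - ΣL_i²). The sketch below computes it from a subjects × conditions matrix of 0/1 values; the function name and data are illustrative.

```python
def cochran_q(rows):
    """rows[i][j] is 0 or 1 for subject i under condition j. Returns (Q, df)."""
    k = len(rows[0])
    col_totals = [sum(row[j] for row in rows) for j in range(k)]  # G_j
    row_totals = [sum(row) for row in rows]                       # L_i
    total = sum(row_totals)                                       # N
    denom = k * total - sum(l ** 2 for l in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined when every subject has all-0 or all-1 responses")
    q = (k - 1) * (k * sum(g ** 2 for g in col_totals) - total ** 2) / denom
    return q, k - 1


q, df = cochran_q([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
])
print(f"Q = {q:.2f}, df = {df}")
```

Subjects with identical responses in every condition (all 0 or all 1) contribute nothing to the denominator, which is why they do not influence Q.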
Interpretation:
- W < 0.1: negligible
- W < 0.3: small
- W < 0.5: medium
- W ≥ 0.5: large
Assumptions:
- Binary outcomes (0/1, success/failure, yes/no)
- Repeated measurements on same subjects
- At least 3 conditions (for 2 conditions, use McNemar’s test)
Relation to other tests:
- Extension of McNemar’s test (2 conditions → 3+ conditions)
- Binary version of Friedman test
- Can use Friedman test on same data (Q ≈ Friedman χ²)
The Shapiro-Wilk test tests the null hypothesis that data come from a
normal distribution.
Null Hypothesis (H0): Data are normally distributed
Test Statistic W: Ranges from 0 to 1
- W close to 1: Data appear normal
- W much less than 1: Data deviate from normality
p-value interpretation:
- p > α (typically 0.05): Fail to reject H0, data appear normal
- p ≤ α: Reject H0, data significantly deviate from normality
Important considerations:
- Sensitive to sample size: with n > 50, may detect trivial deviations
- Works best for 3 ≤ n ≤ 5000
- Should be combined with visual inspection (Q-Q plots)
- Large samples: focus on Q-Q plots over p-values
- Small samples: test may lack power to detect non-normality
Recommendations based on results:
- Normal (p > 0.05): Use parametric tests (t-test, ANOVA, Pearson)
- Non-normal (p ≤ 0.05): Use non-parametric tests (Brunner-Munzel, Wilcoxon, Spearman)
- Borderline: Check Q-Q plot and consider robustness
Examples
>>> # Normal data
>>> x = np.random.normal(0, 1, 100)
>>> result = test_shapiro(x)
>>> result['normal']
True
warn (bool, default True) – Whether to log warnings for non-normal data
Returns:
Dictionary with results for each sample:
- ‘all_normal’: bool, True if all samples are normal
- ‘results’: list of individual test results
- ‘recommendation’: str, overall recommendation
Advantages:
- Distribution-free (no assumptions about data)
- Can test against any continuous distribution
- More general than Shapiro-Wilk (not limited to normality)
Disadvantages:
- Less powerful than Shapiro-Wilk for normality testing
- Sensitive to sample size (large n → high power, may detect trivial deviations)
- Assumes continuous distribution (not suitable for discrete data)
When to use:
- Testing goodness-of-fit to any continuous distribution
- Comparing sample to theoretical distribution
- When Shapiro-Wilk is not applicable (the reference distribution is not normal)
- Large sample sizes (n > 50)
Examples
>>> # Test if data are normally distributed
>>> x = np.random.normal(0, 1, 100)
>>> result = test_ks_1samp(x, cdf='norm', args=(0, 1))
>>> result['rejected']
False
>>> # Test if data are uniformly distributed
>>> x = np.random.uniform(0, 1, 100)
>>> result = test_ks_1samp(x, cdf='uniform', args=(0, 1))
Advantages:
- Distribution-free (non-parametric)
- Tests entire distribution, not just location
- Can detect differences in location, scale, or shape
Disadvantages:
- Less powerful than t-test when assumptions are met
- Most sensitive to differences near the center of distributions
- Less sensitive to tail differences
When to use:
- Comparing two independent samples
- No assumptions about distribution shape
- Want to test overall distribution equality (not just means)
- Alternative to t-test when normality violated
Comparison with other tests:
- vs t-test: More robust, less powerful
- vs Mann-Whitney U: Tests different hypotheses (distribution vs median)
- vs Brunner-Munzel: KS tests full distribution, BM tests P(X>Y)
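To show what "tests the entire distribution" means in practice, the standard two-sample KS statistic D is the largest vertical gap between the two empirical CDFs. The sketch below computes D only (no p-value); the function name is hypothetical.

```python
from bisect import bisect_right


def ks_2samp_statistic(x, y):
    """Return D = max |F_x(v) - F_y(v)| over all observed values v."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        f_x = bisect_right(xs, v) / len(xs)  # empirical CDF of x at v
        f_y = bisect_right(ys, v) / len(ys)  # empirical CDF of y at v
        d = max(d, abs(f_x - f_y))
    return d


d_stat = ks_2samp_statistic([1, 2, 3, 4], [3, 4, 5, 6])
print(f"D = {d_stat}")
```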
Examples
>>> # Two samples from same distribution
>>> x = np.random.normal(0, 1, 100)
>>> y = np.random.normal(0, 1, 100)
>>> result = test_ks_2samp(x, y)
>>> result['rejected']
False
>>> # Two samples from different distributions
>>> x = np.random.normal(0, 1, 100)
>>> y = np.random.normal(2, 1, 100)
>>> result = test_ks_2samp(x, y)
>>> result['rejected']
True