| Audit ID: | {{ audit_id | default("N/A") }} | GlassAlpha Version: | {{ version | default("1.0.0") }} | Python Version: | {{ manifest.environment.python_version.split()[0] if manifest and manifest.environment and manifest.environment.python_version else "N/A" }} |
| Report Generated: | {{ generation_date }} | Execution Time: | {{ "%.2f seconds" | format(manifest.get('execution', {}).get('duration_seconds', 0)) if manifest and manifest.get('execution', {}).get('duration_seconds') else "N/A" }} |
This model shows potential issues that may require regulatory review before deployment:
| Metric | Value | Description |
|---|---|---|
| Total Samples | {{ "{:,}".format(data_summary.shape[0]) if data_summary.shape else "N/A" }} | Number of observations in the dataset |
| Total Features | {{ data_summary.shape[1] if data_summary.shape else "N/A" }} | Number of input variables |
| Missing Values | {{ data_summary.missing_count | default(0) }} | Count of missing data points |
| Data Quality Score | {{ "%.1f%%" | format(((data_summary.shape[0] * data_summary.shape[1] - (data_summary.missing_count | default(0))) / (data_summary.shape[0] * data_summary.shape[1]) * 100) if data_summary.shape else 0) }} | Percentage of complete data points |
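The Data Quality Score row above is the share of non-missing cells. A minimal Python sketch of the same arithmetic follows; the function name `completeness_pct` is hypothetical and not part of the template. Unlike the template expression, which only checks that `data_summary.shape` is present, the sketch also rejects a dataset with zero cells so that it cannot divide by zero.

```python
def completeness_pct(n_rows: int, n_cols: int, missing_count: int = 0) -> float:
    """Return the percentage of non-missing cells in an n_rows x n_cols table."""
    total_cells = n_rows * n_cols
    if total_cells == 0:
        # An empty dataset has no meaningful completeness figure.
        raise ValueError("dataset has no cells")
    return (total_cells - missing_count) / total_cells * 100


# Illustrative numbers: 1,000 rows x 20 features with 200 missing cells.
score = completeness_pct(1000, 20, 200)
print("%.1f%%" % score)
```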
Dataset bias occurs when training data systematically differs from real-world distributions, contains proxy features correlated with protected attributes, or has sampling imbalances. Detecting bias at the data level is critical because models trained on biased data tend to reproduce that bias, and algorithmic fairness interventions applied afterward cannot fully correct it.
Features that strongly correlate with protected attributes may act as proxies, allowing the model to discriminate indirectly even when protected attributes are excluded.
| Feature | Protected Attribute | Correlation | P-Value | Status |
|---|---|---|---|---|
| {{ feature | replace("_", " ") | title }} | {{ protected_attr | replace("_", " ") | title }} | {{ "%.3f"|format(corr_data.correlation) }} | {{ "%.4f"|format(corr_data.p_value) }} | {% if corr_data.correlation|abs > 0.7 %} High Proxy Risk {% elif corr_data.correlation|abs > 0.5 %} Moderate Proxy Risk {% else %} Low Risk {% endif %} |
| No proxy correlations detected | ||||
High proxy risk (|r| > 0.7): Feature may enable indirect discrimination. Consider removing or monitoring closely.
Moderate proxy risk (|r| > 0.5): Evaluate whether the feature is necessary for model performance.
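The thresholds above can be illustrated with a short sketch. It assumes the correlation is a plain Pearson coefficient, which this report does not state, and the helper names `pearson_r` and `proxy_risk` are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def proxy_risk(r):
    """Map a correlation to the status labels used in the table above."""
    if abs(r) > 0.7:
        return "High Proxy Risk"
    if abs(r) > 0.5:
        return "Moderate Proxy Risk"
    return "Low Risk"


# Made-up data: a feature that rises with a binary protected attribute.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
protected = [0, 0, 0, 1, 1, 1]
r = pearson_r(feature, protected)
print("%.3f" % r, proxy_risk(r))
```

Because the label depends on |r|, a strong negative correlation is flagged the same way as a strong positive one.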
These statistical tests compare feature distributions across protected groups. Significant drift may indicate sampling bias or systematic differences in data collection.
| Feature | Test Statistic | P-Value | Result |
|---|---|---|---|
| {{ feature | replace("_", " ") | title }} | {{ "%.4f"|format(drift_data.statistic) }} | {{ "%.4f"|format(drift_data.p_value) }} | {% if drift_data.p_value < 0.01 %} Significant Drift {% elif drift_data.p_value < 0.05 %} Moderate Drift {% else %} No Significant Drift {% endif %} |
| No distribution drift tests available | |||
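The report does not say which statistical test produces the statistic and p-value. As one possible illustration, the sketch below uses a permutation test on the difference in group means (an assumption, not the audit's actual test) and then applies the p-value thresholds from the table above. The names `permutation_test` and `drift_label` are hypothetical.

```python
import random


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


def drift_label(p_value):
    """Apply the p-value thresholds used in the drift table."""
    if p_value < 0.01:
        return "Significant Drift"
    if p_value < 0.05:
        return "Moderate Drift"
    return "No Significant Drift"


# Made-up feature values for two protected groups.
group_a = [float(v) for v in range(10, 20)]
group_b = [float(v) for v in range(30, 40)]
statistic, p = permutation_test(group_a, group_b)
print(statistic, drift_label(p))
```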
Statistical power is the probability of detecting sampling bias (e.g., underrepresentation of protected groups) when it is actually present. Low power indicates insufficient sample size to reliably detect bias.
| Protected Attribute | Statistical Power | Min Sample Size | Assessment |
|---|---|---|---|
| {{ attr | replace("_", " ") | title }} ({{ group_name }}) | {{ "%.2f"|format(power_info.power) }} | {{ power_info.min_sample_size }} | {% if power_info.power >= 0.80 %} Adequate Power {% elif power_info.power >= 0.60 %} Marginal Power {% else %} Insufficient Power {% endif %} |
| No sampling bias power analysis available | |||
Statistical power ≥ 0.80: Sample size is adequate for bias detection.
Power < 0.80: Consider collecting more data or interpreting fairness metrics with caution.
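To make the power figures concrete, the sketch below approximates the power of a two-sided one-sample z-test that compares a group's observed share of the data with its expected population share. This choice of test is an assumption, since the report does not specify how power is computed, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def underrepresentation_power(p_expected, p_actual, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test for a proportion.

    p_expected: group share under the null hypothesis (reference population)
    p_actual:   true group share in the sampling process
    n:          total number of samples
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    sd_null = math.sqrt(p_expected * (1 - p_expected) / n)
    sd_alt = math.sqrt(p_actual * (1 - p_actual) / n)
    shift = abs(p_actual - p_expected)
    # Probability of landing in either rejection region under the alternative.
    upper = z.cdf((shift - z_crit * sd_null) / sd_alt)
    lower = z.cdf((-shift - z_crit * sd_null) / sd_alt)
    return upper + lower


def power_label(power):
    """Apply the assessment thresholds used in the power table."""
    if power >= 0.80:
        return "Adequate Power"
    if power >= 0.60:
        return "Marginal Power"
    return "Insufficient Power"


# A group expected to be 30% of the data but sampled at 25%.
small = underrepresentation_power(0.30, 0.25, n=100)
large = underrepresentation_power(0.30, 0.25, n=1000)
print(power_label(small), power_label(large))
```

The same five-point shortfall that is hard to detect with 100 samples becomes detectable with 1,000, which is why low power calls for more data before drawing conclusions.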
This audit used AUTO preprocessing mode, which is NOT suitable for regulatory compliance. Auto mode dynamically fits preprocessing transformers to the audit data, creating a different preprocessing pipeline than production.
For compliance-grade audits:
Set mode: artifact in the preprocessing config
Provide expected_file_hash and expected_params_hash
See documentation: docs/preprocessing.md
This audit used a verified preprocessing artifact from production, ensuring the model was evaluated with the exact same transformations used in deployment.
| Property | Value | Status |
|---|---|---|
| Mode | {{ preprocessing_info.mode }} | {% if preprocessing_info.mode == 'artifact' %} Compliant {% else %} Non-Compliant {% endif %} |
| Artifact Path | {{ preprocessing_info.artifact_path }} | — |
| File Hash (SHA256) | {% set full_file_hash = preprocessing_info.file_hash | replace('sha256:', '') %} {{ full_file_hash[:12] }}...{{ full_file_hash[-12:] }} | Verified |
| Params Hash (SHA256) | {% set full_params_hash = preprocessing_info.params_hash | replace('sha256:', '') %} {{ full_params_hash[:12] }}...{{ full_params_hash[-12:] }} | Verified |
| Artifact Created | {{ preprocessing_info.manifest.created_at }} | — |
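The table above reports two hashes, one for the artifact file and one for its parameters. The sketch below shows how such a pair could be computed and shortened for display. The canonical-JSON approach to the params hash and all function names are assumptions for illustration; the actual hashing scheme is not described in this report.

```python
import hashlib
import json


def sha256_tag(data: bytes) -> str:
    """Hash raw bytes and prefix the digest the way the report displays it."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def params_hash(params: dict) -> str:
    """Hash parameters via canonical JSON so that key order does not matter."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return sha256_tag(canonical.encode("utf-8"))


def short_hash(tagged: str) -> str:
    """Mirror the table's display: first 12 and last 12 hex characters."""
    digest = tagged.replace("sha256:", "")
    return digest[:12] + "..." + digest[-12:]


def verify(actual: str, expected: str) -> bool:
    """A hash is verified only when it matches the expected value exactly."""
    return actual == expected


# Placeholder artifact bytes and parameters, for illustration only.
file_hash = sha256_tag(b"serialized preprocessing artifact")
p_hash = params_hash({"strategy": "median", "with_mean": True})
print(short_hash(file_hash), verify(file_hash, file_hash))
```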
The following transformation steps are applied to data in sequence.
Learned parameters extracted from the preprocessing artifact. These values were fitted on production training data and applied to audit data.
{% for component in preprocessing_info.manifest.components %}{{ component.class }}
Purpose:
{% if 'SimpleImputer' in component.class %} Fills in missing
values using the {{ component.strategy if component.strategy else
'configured' }} strategy {% elif 'StandardScaler' in
component.class %} Standardizes features by removing the mean and
scaling to unit variance {% elif 'OneHotEncoder' in
component.class %} Converts categorical variables into binary
indicator features {% elif 'RobustScaler' in component.class %}
Scales features using statistics robust to outliers {% elif
'MinMaxScaler' in component.class %} Scales features to a fixed
range (typically 0 to 1) {% else %} Data transformation component
{% endif %}
{% if component.handle_unknown %}
Handle Unknown:
{{ component.handle_unknown }}
{% endif %} {% if component.drop %}
Drop:
{{ component.drop }}
{% endif %} {% if component.sparse_output is not none %}
Sparse Output:
{{ component.sparse_output }}
{% endif %}
| Column | Value |
|---|---|
| {{ component.columns[i] if component.columns and i < (component.columns | length) else ("Column " + (i | string)) }} | {% set val = component.learned_stats[i] %} {% if val is number %} {{ "%.4f" | format(val) }} {% else %} {{ val }} {% endif %} |
| Column | {% if component.mean %}Mean | {% endif %} {% if component.scale %}Scale | {% endif %}
|---|---|---|
| {{ component.columns[i] if component.columns and i < (component.columns | length) else ("Column " + (i | string)) }} | {% if component.mean %}{{ "%.4f" | format(component.mean[i]) }} | {% endif %} {% if component.scale %}{{ "%.4f" | format(component.scale[i]) }} | {% endif %}
| Column | Categories | Count |
|---|---|---|
| {{ component.columns[i] if component.columns and i < (component.columns | length) else ("Column " + (i | string)) }} | {% set cats = component.categories[i] %} {% if cats | length > 10 %} {{ cats[:10] | join(", ") }}, ... ({{ cats | length - 10 }} more) {% else %} {{ cats | join(", ") }} {% endif %} | {{ component.categories[i] | length }} |
The preprocessing artifact was created with different library versions than the current audit environment. While the artifact has been successfully loaded, version differences may affect reproducibility.
Recommendation: For maximum reproducibility, use the same library versions that were used to create the preprocessing artifact.
| Library | Artifact Version | Audit Version | Status |
|---|---|---|---|
| {{ lib }} | {{ artifact_ver }} | {{ audit_ver }} | {% if artifact_ver == audit_ver %} Match {% else %} Mismatch {% endif %} |
Unknown categories are values in the audit data that were not seen during the training of the preprocessing artifact. High unknown rates may indicate distribution shift or data quality issues.
{% set high_unknown_cols = [] %} {% for col, rate in preprocessing_unknown_rates.items() %} {% if rate > 0.10 %} {% set _ = high_unknown_cols.append(col) %} {% endif %} {% endfor %} {% if high_unknown_cols %}The following columns have high unknown category rates (>10%): {{ high_unknown_cols | join(', ') }}
Potential Causes:
Recommended Actions:
| Column | Unknown Rate | Assessment |
|---|---|---|
| {{ col }} | {{ "%.2f%%" | format(rate * 100) }} | {% if rate > 0.10 %} High {% elif rate > 0.01 %} Moderate {% else %} Low {% endif %} |
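An unknown rate is the fraction of audit values that fall outside the categories the encoder learned during training. A minimal sketch, with hypothetical function names and made-up data:

```python
def unknown_rate(values, known_categories):
    """Fraction of values not present in the categories seen during fitting."""
    known = set(known_categories)
    if not values:
        return 0.0
    unknown = sum(1 for v in values if v not in known)
    return unknown / len(values)


def unknown_label(rate):
    """Apply the assessment thresholds used in the unknown-rate table."""
    if rate > 0.10:
        return "High"
    if rate > 0.01:
        return "Moderate"
    return "Low"


# The encoder learned two categories; the audit data contains two others.
audit_values = ["a", "b", "c", "a", "d", "a", "b", "a", "a", "a"]
rate = unknown_rate(audit_values, known_categories=["a", "b"])
print("%.2f%%" % (rate * 100), unknown_label(rate))
```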
| Metric | Value | Assessment | Description |
|---|---|---|---|
| Overall Accuracy | {{ "%.3f" | format(metric_data.accuracy) }} | {{ 'Excellent' if metric_data.accuracy > 0.9 else ('Good' if metric_data.accuracy > 0.8 else ('Fair' if metric_data.accuracy > 0.6 else 'Poor')) }} | Overall classification accuracy |
| Macro Precision | {{ "%.3f" | format(metric_data.macro_precision) }} | {{ 'Good' if metric_data.macro_precision > 0.8 else ('Fair' if metric_data.macro_precision > 0.6 else 'Poor') }} | Average precision across all classes |
| {{ metric_name | replace("_", " ") | title }} | {{ "%.3f" | format(value) }} | {{ 'Good' if value > 0.8 else ('Fair' if value > 0.6 else 'Poor') }} | {{ metric_descriptions.get(metric_name, "Performance metric") }} |
| {{ metric_name | replace("_", " ") | title }} | {{ "%.3f" | format(metric_data) }} | {{ 'Good' if metric_data > 0.8 else ('Fair' if metric_data > 0.6 else 'Poor') }} | {{ metric_descriptions.get(metric_name, "Performance metric") }} |
Calibration measures whether predicted probabilities match observed outcomes. Among the cases where a well-calibrated model predicts a 70% probability, about 70% should turn out positive. Poor calibration can mislead decision-makers even if classification accuracy is high.
| Metric | Value | 95% Confidence Interval | Interpretation |
|---|---|---|---|
| Expected Calibration Error (ECE) | {% if calibration_ci.get('ece') is not none %} {{ "%.4f"|format(calibration_ci['ece']) }} {% else %} N/A {% endif %} | {% if calibration_ci.get('ece_ci') %} [{{ "%.4f"|format(calibration_ci['ece_ci']['ci_lower']) }}, {{ "%.4f"|format(calibration_ci['ece_ci']['ci_upper']) }}] {% else %} N/A {% endif %} | {% if calibration_ci.get('ece') is not none %} {% if calibration_ci['ece'] < 0.05 %} Well Calibrated {% elif calibration_ci['ece'] < 0.10 %} Acceptable {% else %} Poorly Calibrated {% endif %} {% endif %} |
| Brier Score | {{ "%.4f"|format(calibration_ci['brier_score']) }} | {% if calibration_ci.get('brier_ci') %} [{{ "%.4f"|format(calibration_ci['brier_ci']['ci_lower']) }}, {{ "%.4f"|format(calibration_ci['brier_ci']['ci_upper']) }}] {% else %} N/A {% endif %} | Combined measure of calibration and accuracy (lower is better) |
Sample size: {{ "{:,}".format(calibration_ci.get('n_samples', 0)) }} | Bins: {{ calibration_ci.get('n_bins', 10) }} | Bootstrap samples: {{ calibration_ci.get('ece_ci', {}).get('n_bootstrap', 'N/A') if calibration_ci.get('ece_ci') else 'N/A' }}
{% if calibration_ci.get('bin_calibration') %}Calibration error by predicted probability bin. Wide confidence intervals indicate uncertainty due to small sample sizes.
| Bin Range | Count | Mean Predicted | Mean Observed | Error | 95% CI |
|---|---|---|---|---|---|
| [{{ "%.2f"|format(bin['bin_lower']) }}, {{ "%.2f"|format(bin['bin_upper']) }}) | {{ bin['count'] }} | {{ "%.3f"|format(bin['mean_predicted']) }} | {{ "%.3f"|format(bin['mean_observed']) }} | {{ "%.3f"|format(bin['error']|abs) }} | {% if bin.get('error_ci') %} [{{ "%.3f"|format(bin['error_ci']['ci_lower']) }}, {{ "%.3f"|format(bin['error_ci']['ci_upper']) }}] {% else %} N/A {% endif %} |
ECE < 0.05: Model probabilities are well-calibrated
ECE 0.05-0.10: Acceptable calibration for most applications
ECE > 0.10: Consider recalibration (e.g., Platt scaling, isotonic regression)
Confidence intervals quantify uncertainty in calibration estimates. Wide intervals suggest small sample sizes or high variability.
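The two calibration metrics can be sketched in a few lines. The version below uses equal-width bins, which is the common textbook definition of ECE; whether the audit bins the same way is an assumption, and the function names are hypothetical.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted and mean observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        mean_obs = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_pred - mean_obs)
    return ece


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Ten predictions of 0.9: nine positives is calibrated, zero positives is not.
probs = [0.9] * 10
calibrated = expected_calibration_error(probs, [1] * 9 + [0])
miscalibrated = expected_calibration_error(probs, [0] * 10)
brier = brier_score(probs, [1] * 9 + [0])
print("%.4f %.4f %.4f" % (calibrated, miscalibrated, brier))
```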
The following features have the greatest impact on model predictions:
| Feature | SHAP Value | Impact | Interpretation |
|---|---|---|---|
| {{ feature | replace("_", " ") | title }} | {{ "%.3f" | format(importance) }} | {{ 'Positive' if importance > 0 else 'Negative' }} | {{ 'Increases' if importance > 0 else 'Decreases' }} prediction probability (magnitude: {{ 'High' if (importance|abs) > 0.1 else ('Medium' if (importance|abs) > 0.05 else 'Low') }}) |
SHAP (SHapley Additive exPlanations) values explain individual predictions by quantifying the contribution of each feature:
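The idea behind these values can be shown with an exact Shapley computation on a toy model. This sketch does not use the SHAP library and is not how the audit computes its values. It replaces "absent" features with a fixed baseline (an assumption), and the model and names are made up for illustration.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    Features not yet added take their baseline value. Cost grows
    factorially with the number of features, so this only suits toy models.
    """
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            updated = predict(current)
            contributions[i] += updated - previous
            previous = updated
    return [c / len(orderings) for c in contributions]


def toy_model(features):
    """Made-up scoring function with one interaction term."""
    income, debt, tenure = features
    return 0.5 * income - 0.3 * debt + 0.1 * income * tenure


x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(["%.3f" % v for v in phi])
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that lets each value be read as that feature's share of the prediction.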
Analysis of potential bias across protected demographic groups:
{% for attr_name, attr_metrics in fairness_analysis.items() %}| Fairness Metric | Result | Status | Interpretation |
|---|---|---|---|
| {{ metric_name | replace("_", " ") | title }} | {% if metric_result is mapping %} {% if metric_result.error is defined %} Error: {{ metric_result.error[:50] }}... {% elif metric_result.ratio is defined %} {{ "%.3f" | format(metric_result.ratio) }} {% elif metric_result.difference is defined %} {{ "%.3f" | format(metric_result.difference) }} {% else %} {{ metric_result }} {% endif %} {% else %} {{ metric_result }} {% endif %} | {% if metric_result is mapping and metric_result.error is defined %} Error {% elif metric_result is mapping and metric_result.is_fair is defined %} {{ 'Fair' if metric_result.is_fair else 'Biased' }} {% else %} Unknown {% endif %} | {% if metric_result is mapping and metric_result.error is defined %} Metric computation failed {% elif metric_result is mapping and metric_result.is_fair is defined %} {% if metric_result.is_fair %} No significant bias detected {% else %} Potential bias detected - requires review {% endif %} {% else %} Unable to assess fairness {% endif %} |
Key points for regulatory compliance:
Any "Biased" status may require model adjustment or additional oversight before deployment.
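As an illustration of how a ratio-based fairness result might be produced, the sketch below computes per-group selection rates and a demographic parity ratio. The result keys `ratio`, `difference`, and `is_fair` match the fields the table reads; the 0.8 cutoff (the four-fifths rule) is an assumption, not a threshold stated in this report, and the function names are hypothetical.

```python
def selection_rates(predictions, groups):
    """Share of positive predictions within each group."""
    totals = {}
    positives = {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred == 1 else 0)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity(predictions, groups, min_ratio=0.8):
    """Compare the lowest and highest group selection rates."""
    rates = selection_rates(predictions, groups)
    highest = max(rates.values())
    lowest = min(rates.values())
    ratio = lowest / highest if highest > 0 else 1.0
    return {
        "ratio": ratio,
        "difference": highest - lowest,
        "is_fair": ratio >= min_ratio,  # assumed four-fifths threshold
    }


# Made-up predictions: group "a" selected 6 of 10, group "b" 3 of 10.
predictions = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7
groups = ["a"] * 10 + ["b"] * 10
result = demographic_parity(predictions, groups)
print(result, "Fair" if result["is_fair"] else "Biased")
```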
Intersectional fairness examines how multiple protected attributes interact (e.g., race × gender). Bias hidden in overall metrics can emerge at intersections. For example, a model fair for women overall may discriminate against Black women specifically.
| Group | Selection Rate | TPR | FPR | Sample Size | Warning |
|---|---|---|---|---|---|
| {{ group_name | replace("_", " ") | title }} | {% if metrics.selection_rate is defined %} {{ "%.3f"|format(metrics.selection_rate) }} {% if metrics.selection_rate_ci %} (95% CI: [{{ "%.3f"|format(metrics.selection_rate_ci.ci_lower) }}, {{ "%.3f"|format(metrics.selection_rate_ci.ci_upper) }}]) {% endif %} {% else %}N/A{% endif %} | {% if metrics.tpr is defined %} {{ "%.3f"|format(metrics.tpr) }} {% else %}N/A{% endif %} | {% if metrics.fpr is defined %} {{ "%.3f"|format(metrics.fpr) }} {% else %}N/A{% endif %} | {{ metrics.n }} | {% if metrics.n < 10 %} n < 10 {% elif metrics.n < 30 %} n < 30 {% else %} Adequate {% endif %} |
| Metric | Max-Min Difference | Max/Min Ratio | Assessment |
|---|---|---|---|
| Selection Rate | {{ "%.3f"|format(intersection_data.disparity_metrics.selection_rate_disparity.max_min_diff) }} | {{ "%.3f"|format(intersection_data.disparity_metrics.selection_rate_disparity.max_min_ratio) }} | {% if intersection_data.disparity_metrics.selection_rate_disparity.max_min_diff > 0.2 %} High Disparity {% elif intersection_data.disparity_metrics.selection_rate_disparity.max_min_diff > 0.1 %} Moderate Disparity {% else %} Low Disparity {% endif %} |
Sample size warnings: Groups with n < 30 have unreliable metrics (wide confidence intervals).
High disparity (difference > 0.2): Investigate for potential discrimination at intersections.
Action: Prioritize interventions for intersectional groups with worst outcomes.
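The sketch below shows how a disparity can stay hidden in single-attribute metrics and appear only at an intersection. The data is made up, the function names are hypothetical, and the thresholds are the ones listed above.

```python
from collections import defaultdict


def intersectional_rates(predictions, attr_a, attr_b):
    """Selection rate and sample size for every (attr_a, attr_b) combination."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n, positives]
    for pred, a, b in zip(predictions, attr_a, attr_b):
        counts[(a, b)][0] += 1
        counts[(a, b)][1] += pred
    return {g: {"n": n, "selection_rate": pos / n} for g, (n, pos) in counts.items()}


def disparity(rates):
    """Max-min difference and max/min ratio of selection rates across groups."""
    values = [m["selection_rate"] for m in rates.values()]
    highest, lowest = max(values), min(values)
    ratio = highest / lowest if lowest > 0 else float("inf")
    return {"max_min_diff": highest - lowest, "max_min_ratio": ratio}


def disparity_label(diff):
    if diff > 0.2:
        return "High Disparity"
    if diff > 0.1:
        return "Moderate Disparity"
    return "Low Disparity"


def size_warning(n):
    if n < 10:
        return "n < 10"
    if n < 30:
        return "n < 30"
    return "Adequate"


# (race, gender, group size, positive predictions), illustrative only.
spec = [("white", "male", 40, 20), ("white", "female", 40, 22),
        ("black", "male", 40, 20), ("black", "female", 8, 1)]
preds, race, gender = [], [], []
for r, g, n, pos in spec:
    preds += [1] * pos + [0] * (n - pos)
    race += [r] * n
    gender += [g] * n

rates = intersectional_rates(preds, race, gender)
gap = disparity(rates)
print(disparity_label(gap["max_min_diff"]), size_warning(rates[("black", "female")]["n"]))
```

In this data women overall are selected at roughly the same rate as men (23 of 48 versus 40 of 80), yet the Black female group sits at 0.125, and its small size also triggers the sample size warning.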
Individual fairness requires that similar individuals receive similar treatment. While group fairness examines aggregate metrics, individual fairness catches disparate treatment of specific people—a critical legal concern under anti-discrimination law.
| Metric | Value | Interpretation |
|---|---|---|
| Consistency Score | {{ "%.3f"|format(individual_fairness['consistency_score']) }} | {% if individual_fairness['consistency_score'] >= 0.90 %} High Consistency {% elif individual_fairness['consistency_score'] >= 0.75 %} Moderate Consistency {% else %} Low Consistency {% endif %} |
Measures whether similar individuals (based on non-protected features) receive similar predictions. Score of 1.0 means perfect consistency; lower scores indicate disparate treatment.
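One common way to define such a score is one minus the average gap between each individual's prediction and the mean prediction of its k nearest neighbours. The sketch below assumes that definition and Euclidean distance over non-protected features; the audit's exact formula is not given here, and the function name is hypothetical.

```python
import math


def consistency_score(features, predictions, k=3):
    """1 minus the mean gap between each prediction and its k nearest neighbours' mean."""
    n = len(features)
    total_gap = 0.0
    for i in range(n):
        # Rank all other individuals by distance in non-protected feature space.
        distances = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )
        neighbours = [j for _, j in distances[:k]]
        neighbour_mean = sum(predictions[j] for j in neighbours) / len(neighbours)
        total_gap += abs(predictions[i] - neighbour_mean)
    return 1.0 - total_gap / n


# Two tight clusters of similar individuals (non-protected features only).
features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
consistent = consistency_score(features, [0.2, 0.2, 0.2, 0.8, 0.8, 0.8], k=2)
inconsistent = consistency_score(features, [0.2, 0.9, 0.2, 0.8, 0.8, 0.8], k=2)
print("%.3f %.3f" % (consistent, inconsistent))
```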
Identifies pairs of individuals with similar non-protected features but different protected attributes. Large prediction differences for matched pairs suggest potential discrimination.
| Statistic | Value |
|---|---|
| Matched Pairs Found | {{ individual_fairness['matched_pairs_count'] }} |
| Avg Prediction Difference | {{ "%.3f"|format(individual_fairness['avg_prediction_diff']) }} |
| Max Prediction Difference | {{ "%.3f"|format(individual_fairness['max_prediction_diff']) }} |
Tests whether changing only a protected attribute (e.g., gender) significantly changes the prediction. Violations indicate the model directly uses protected information.
| Result | Count | Assessment |
|---|---|---|
| Flip Test Violations | {{ individual_fairness['flip_test_violations'] }} | {% if individual_fairness['flip_test_violations'] == 0 %} No Violations {% elif individual_fairness['flip_test_violations'] < 10 %} Minor Violations {% else %} Significant Violations {% endif %} |
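A flip test can be sketched as follows: for each record, set the protected attribute to each possible value in turn, keep everything else fixed, and count the records whose prediction moves by more than a tolerance. The tolerance of 0.05, the function names, and both toy models are assumptions for illustration.

```python
def flip_test(predict, rows, protected_key, values, tolerance=0.05):
    """Count rows whose prediction changes when only the protected attribute changes."""
    violations = 0
    for row in rows:
        preds = []
        for value in values:
            flipped = dict(row)  # copy so the original record is untouched
            flipped[protected_key] = value
            preds.append(predict(flipped))
        if max(preds) - min(preds) > tolerance:
            violations += 1
    return violations


def blind_model(row):
    """Toy model that ignores the protected attribute."""
    return min(1.0, row["income"] / 100.0)


def biased_model(row):
    """Toy model that penalises one group directly."""
    penalty = 0.2 if row["gender"] == "female" else 0.0
    return max(0.0, min(1.0, row["income"] / 100.0) - penalty)


rows = [
    {"income": 50, "gender": "male"},
    {"income": 70, "gender": "female"},
    {"income": 90, "gender": "male"},
]
print(flip_test(blind_model, rows, "gender", ["male", "female"]))
print(flip_test(biased_model, rows, "gender", ["male", "female"]))
```

A model can pass this test and still discriminate through proxy features, which is why the flip test complements the proxy correlation analysis and does not replace it.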
Consistency score < 0.75: Model may treat similar individuals inconsistently; investigate for disparate treatment.
Flip test violations: Protected attributes directly influence predictions, a potential ECOA/Title VII violation.
Action: Review matched pairs with large prediction differences for justification.
Robustness testing measures model stability under small input perturbations. Fragile models produce wildly different outputs for similar inputs, indicating potential manipulation vulnerabilities or unreliable decisions.
| Metric | Value | Gate Status | Interpretation |
|---|---|---|---|
| Robustness Score | {{ "%.4f"|format(stability_analysis.robustness_score) }} | {% if stability_analysis.gate_status == "PASS" %} PASS {% elif stability_analysis.gate_status == "WARNING" %} WARNING {% else %} FAIL {% endif %} | Maximum prediction change across all perturbations (L∞ norm) |
Model inputs are perturbed with Gaussian noise at different intensities, and the resulting change in predictions is measured. Protected attributes are never perturbed (gender, race, etc. remain fixed).
| Perturbation Level (ε) | Max Prediction Delta | Mean Delta | Assessment |
|---|---|---|---|
| {{ epsilon }} | {{ "%.4f"|format(result.max_delta) }} | {% if result.mean_delta is defined %} {{ "%.4f"|format(result.mean_delta) }} {% else %}N/A{% endif %} | {% if result.max_delta < 0.05 %} Robust {% elif result.max_delta < 0.15 %} Moderate {% else %} Fragile {% endif %} |
Robustness score < 0.05: Model is highly stable to input perturbations
Score 0.05-0.15: Acceptable stability for most applications
Score > 0.15: Model is fragile; consider regularization or ensemble methods
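A perturbation sweep of this kind can be sketched as below. It assumes additive Gaussian noise with standard deviation ε on every non-protected feature and takes the robustness score as the largest prediction change seen at any ε. The noise model, the trial count, and all names are assumptions, not the audit's implementation.

```python
import math
import random


def robustness_sweep(predict, rows, protected_idx, epsilons, n_trials=20, seed=0):
    """Max and mean absolute prediction change per perturbation level."""
    rng = random.Random(seed)
    results = {}
    for eps in epsilons:
        deltas = []
        for row in rows:
            base = predict(row)
            for _ in range(n_trials):
                # Protected columns stay fixed; all others receive Gaussian noise.
                noisy = [v if i in protected_idx else v + rng.gauss(0.0, eps)
                         for i, v in enumerate(row)]
                deltas.append(abs(predict(noisy) - base))
        results[eps] = {"max_delta": max(deltas), "mean_delta": sum(deltas) / len(deltas)}
    return results


def stability_label(max_delta):
    if max_delta < 0.05:
        return "Robust"
    if max_delta < 0.15:
        return "Moderate"
    return "Fragile"


def steep_model(row):
    """Toy model with a sharp decision boundary at row[1] == 1.0."""
    return 1.0 / (1.0 + math.exp(-20.0 * (row[1] - 1.0)))


def flat_model(row):
    """Toy model that returns the same score for every input."""
    return 0.5


rows = [[1, 1.0, 2.0], [0, 1.0, 3.0]]  # column 0 is the protected attribute
for model in (flat_model, steep_model):
    sweep = robustness_sweep(model, rows, protected_idx={0}, epsilons=[0.01, 0.1])
    score = max(r["max_delta"] for r in sweep.values())
    print(model.__name__, stability_label(score))
```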
EU AI Act compliance: High-risk AI systems must demonstrate robustness to adversarial perturbations. This test provides evidence of model stability under input variations.
{{ manifest.audit_id | default("N/A") }}
Creation Time:
{{ manifest.creation_time | default("N/A") }}
Duration:
{{ "%.2f seconds" | format(manifest.get('execution', {}).get('duration_seconds', 0)) if manifest.get('execution', {}).get('duration_seconds') else "N/A" }}
Status:
{{ manifest.get('execution', {}).get('status', 'Unknown') | title }}
{% endif %}
| Component | Seed Value | Purpose |
|---|---|---|
| Master Seed | {{ manifest.seeds.master_seed }} | Global randomness control |
| {{ component | replace("_", " ") | title }} | {{ seed }} | Component-specific randomness |
| Component Type | Selected Implementation | Version |
|---|---|---|
| {{ component_info.type | title }} | {{ component_info.name | title }} | {{ component_info.version | default("N/A") }} |
{{ manifest.environment.python_version.split()[0] if manifest.environment.python_version else "N/A" }}
Platform:
{{ manifest.environment.platform | default("N/A") }}
Hostname:
{{ manifest.environment.hostname | default("N/A") }}
Working Directory:
{{ manifest.environment.working_directory | default("N/A") }}
{{ manifest.git.commit_hash[:12] if manifest.git.commit_hash else "N/A" }}
Branch:
{{ manifest.git.branch | default("N/A") }}
Working Directory Status:
{{ 'Clean' if not manifest.git.is_dirty else 'Modified' }}
This audit is fully reproducible. Running the same configuration with identical data and environment will produce byte-identical results.
{% if manifest and manifest.config_hash %}
Configuration Hash:
{{ manifest.config_hash[:16] }}...
| Requirement | Status | Evidence | Notes |
|---|---|---|---|
| Model Documentation | Complete | Full audit report generated | Comprehensive model documentation provided |
| Performance Validation | {{ 'Complete' if model_performance else 'Missing' }} | {{ "Performance metrics computed" if model_performance else "No performance data" }} | {{ "Model performance assessed across standard metrics" if model_performance else "Performance validation required" }} |
| Bias Testing | {{ 'Complete' if fairness_analysis else 'Missing' }} | {{ "Fairness analysis performed" if fairness_analysis else "No bias testing" }} | {{ "Demographic fairness assessed" if fairness_analysis else "Bias testing required for protected attributes" }} |
| Explainability | {{ 'Complete' if explanations else 'Missing' }} | {{ "SHAP explanations provided" if explanations else "No explanations available" }} | {{ "Model decisions can be explained to stakeholders" if explanations else "Explainability features required" }} |
| Reproducibility | {{ 'Complete' if (manifest and manifest.seeds) else 'Partial' }} | {{ "Full audit trail with seeds" if (manifest and manifest.seeds) else "Limited reproducibility" }} | {{ "Results can be exactly reproduced" if (manifest and manifest.seeds) else "Some randomness not controlled" }} |
| Data Governance | {{ 'Complete' if schema_info else 'Partial' }} | {{ "Data schema documented" if schema_info else "Basic data info only" }} | {{ "Data sources and processing documented" if schema_info else "Enhanced data governance recommended" }} |
| Preprocessing Verification | {{ 'Complete' if (preprocessing_info and preprocessing_info.mode == 'artifact') else 'Non-Compliant' }} | {{ "Production artifact verified with dual hash system" if (preprocessing_info and preprocessing_info.mode == 'artifact') else "Auto mode used - not suitable for regulatory compliance" }} | {{ "Preprocessing transformations match production deployment" if (preprocessing_info and preprocessing_info.mode == 'artifact') else "Requires artifact mode with hash verification for compliance" }} |
Assessment against major regulatory frameworks as of their latest revisions.
Effective: May 25, 2018 | Article 22 - Automated Decision-Making
Enacted: 1974 | Regulation B (12 CFR Part 1002)
The following issues must be addressed before regulatory submission:
This model audit meets standard regulatory requirements for:
This model card provides essential information about the ML model's intended use, limitations, and ethical considerations following the Model Cards framework (Mitchell et al., 2019).
Key terms and concepts used in this audit report, defined for non-technical reviewers.