{{ report_title | default("ML Model Audit Report") }}

Generated: {{ generation_date }}
Audit ID: {{ audit_id | default("N/A") }}
{% if audit_profile %} Profile: {{ audit_profile }} {% endif %} {% if strict_mode %} ✓ Strict Mode Enabled {% else %} ⚠ Strict Mode Disabled {% endif %}

1. Executive Summary

Model Overview

Model Type: {{ model_info.type | default("N/A") | title }}
Dataset: {{ data_summary.shape[0] if data_summary.shape else "N/A" }} samples, {{ data_summary.shape[1] if data_summary.shape else "N/A" }} features
Target: {{ schema_info.target | default("N/A") }}
Protected Attributes: {% if schema_info.sensitive_features %}{{ schema_info.sensitive_features | join(", ") }}{% else %}None specified{% endif %}

Key Findings

    {% if model_performance.accuracy %}
  • Overall Accuracy: {% if model_performance.accuracy is mapping %} {% if model_performance.accuracy.value is defined %} {% set acc = model_performance.accuracy.value %} {% elif model_performance.accuracy.accuracy is defined %} {% set acc = model_performance.accuracy.accuracy %} {% else %} {% set acc = model_performance.accuracy %} {% endif %} {% else %} {% set acc = model_performance.accuracy %} {% endif %} {{ "%.1f%%" | format(acc * 100) }}
    {% endif %} {% set bias_detected = [] %} {% for attr, metrics in fairness_analysis.items() %} {% for metric, result in metrics.items() %} {% if result.is_fair is defined and not result.is_fair %} {% set _ = bias_detected.append(attr) %} {% endif %} {% endfor %} {% endfor %}
  • Bias Detection: {% if bias_detected %} {{ bias_detected | length }} Issue(s) Found
    Bias detected in: {{ bias_detected | unique | join(", ") }} {% else %} No Issues Detected {% endif %}
  • Explainability: {% if explanations and explanations.global_importance %} Available {% else %} Not Available {% endif %}
  • Reproducibility: {% if manifest.seeds %} Fully Reproducible {% else %} Limited {% endif %}
  • Preprocessing Verification: {% if preprocessing_info and preprocessing_info.mode == 'artifact' %} Production Artifact Verified {% elif preprocessing_info and preprocessing_info.mode == 'auto' %} Non-Compliant (Auto Mode) {% else %} Not Configured {% endif %}
{% if model_performance.accuracy %} {% if model_performance.accuracy is mapping %} {% if model_performance.accuracy.value is defined %} {% set acc_val = model_performance.accuracy.value %} {% elif model_performance.accuracy.accuracy is defined %} {% set acc_val = model_performance.accuracy.accuracy %} {% else %} {% set acc_val = 0 %} {% endif %} {% else %} {% set acc_val = model_performance.accuracy %} {% endif %} {% else %} {% set acc_val = 0 %} {% endif %} {% if bias_detected or (model_performance.accuracy and acc_val < 0.7) %}

WARNING: Regulatory Attention Required

This model shows potential issues that may require regulatory review before deployment:

    {% if bias_detected %}
  • Potential bias detected in protected attributes: {{ bias_detected | unique | join(", ") }}
    {% endif %} {% if model_performance.accuracy and acc_val < 0.7 %}
  • Model accuracy below recommended threshold ({{ "%.1f%%" | format(acc_val * 100) }} < 70%)
    {% endif %}
{% endif %}

2. Data Overview

Dataset Statistics

{% if data_summary %}
Metric | Value | Description
Total Samples | {{ "{:,}".format(data_summary.shape[0]) if data_summary.shape else "N/A" }} | Number of observations in the dataset
Total Features | {{ data_summary.shape[1] if data_summary.shape else "N/A" }} | Number of input variables
Missing Values | {{ data_summary.missing_count | default(0) }} | Count of missing data points
Data Quality Score | {{ "%.1f%%" | format(((data_summary.shape[0] * data_summary.shape[1] - (data_summary.missing_count | default(0))) / (data_summary.shape[0] * data_summary.shape[1]) * 100) if data_summary.shape else 0) }} | Percentage of complete data points
{% endif %}
{% if schema_info %}

Feature Schema

Categorical Features

    {% if schema_info and schema_info.categorical_features %} {% for feature in schema_info.categorical_features %}
  • {{ feature | replace("_", " ") | title }}
    {% endfor %} {% else %}
  • None specified
    {% endif %}

Numerical Features

    {% if schema_info and schema_info.numeric_features %} {% for feature in schema_info.numeric_features %}
  • {{ feature | replace("_", " ") | title }}
    {% endfor %} {% else %}
  • None specified
    {% endif %}
{% endif %}
{% if dataset_bias %}

3. Dataset-Level Bias Analysis

What is Dataset Bias?

Dataset bias occurs when training data differs systematically from the real-world population, contains proxy features correlated with protected attributes, or has sampling imbalances. Detecting bias at the data level is critical because a model trained on biased data tends to reproduce that bias, and algorithmic fairness interventions may not fully correct it.

{% if dataset_bias.proxy_correlations %}

Proxy Feature Correlations

Features that strongly correlate with protected attributes may act as proxies, allowing the model to discriminate indirectly even when protected attributes are excluded.

Feature | Protected Attribute | Correlation | P-Value | Status
{% if dataset_bias.proxy_correlations and dataset_bias.proxy_correlations.correlations %}
{% for protected_attr, features in dataset_bias.proxy_correlations.correlations.items() %}
{% for feature, corr_data in features.items() %}
{{ feature | replace("_", " ") | title }} | {{ protected_attr | replace("_", " ") | title }} | {{ "%.3f"|format(corr_data.correlation) }} | {{ "%.4f"|format(corr_data.p_value) }} | {% if corr_data.correlation|abs > 0.7 %}High Proxy Risk{% elif corr_data.correlation|abs > 0.5 %}Moderate Proxy Risk{% else %}Low Risk{% endif %}
{% endfor %}
{% endfor %}
{% else %}
No proxy correlations detected
{% endif %}

Interpretation

High proxy risk (|r| > 0.7): The feature may enable indirect discrimination. Consider removing it or monitoring it closely.
Moderate proxy risk (0.5 < |r| ≤ 0.7): Evaluate whether the feature is necessary for model performance.
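
The thresholds above can be illustrated with a minimal sketch. It assumes the correlation is a plain Pearson coefficient, which this report does not confirm, and the feature name and data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def proxy_risk(r):
    """Map a correlation to the status labels used in the table above."""
    if abs(r) > 0.7:
        return "High Proxy Risk"
    if abs(r) > 0.5:
        return "Moderate Proxy Risk"
    return "Low Risk"

# Hypothetical data: a binary protected attribute and a candidate proxy feature.
protected = [0, 0, 0, 0, 1, 1, 1, 1]
neighbourhood_income = [30, 32, 35, 31, 60, 58, 62, 65]
r = pearson_r(protected, neighbourhood_income)
print(round(r, 3), proxy_risk(r))
```

A correlation of exactly 0.5 or 0.7 falls into the lower band because both comparisons are strict.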

{% endif %} {% if dataset_bias.distribution_drift %}

Distribution Drift Analysis

Statistical tests comparing feature distributions across protected groups. Significant drift may indicate sampling bias or systematic differences in data collection.

Feature | Test Statistic | P-Value | Result
{% if dataset_bias.distribution_drift and dataset_bias.distribution_drift.drift_tests %}
{% for feature, drift_data in dataset_bias.distribution_drift.drift_tests.items() %}
{{ feature | replace("_", " ") | title }} | {{ "%.4f"|format(drift_data.statistic) }} | {{ "%.4f"|format(drift_data.p_value) }} | {% if drift_data.p_value < 0.01 %}Significant Drift{% elif drift_data.p_value < 0.05 %}Moderate Drift{% else %}No Significant Drift{% endif %}
{% endfor %}
{% else %}
No distribution drift tests available
{% endif %}
{% endif %} {% if dataset_bias.sampling_bias_power %}

Sampling Bias Detection Power

Statistical power to detect sampling bias (e.g., underrepresentation of protected groups). Low power indicates insufficient sample size to reliably detect bias.

Protected Attribute | Statistical Power | Min Sample Size | Assessment
{% if dataset_bias.sampling_bias_power and dataset_bias.sampling_bias_power.power_by_group %}
{% for attr, groups in dataset_bias.sampling_bias_power.power_by_group.items() %}
{% for group_name, power_info in groups.items() %}
{{ attr | replace("_", " ") | title }} ({{ group_name }}) | {{ "%.2f"|format(power_info.power) }} | {{ power_info.min_sample_size }} | {% if power_info.power >= 0.80 %}Adequate Power{% elif power_info.power >= 0.60 %}Marginal Power{% else %}Insufficient Power{% endif %}
{% endfor %}
{% endfor %}
{% else %}
No sampling bias power analysis available
{% endif %}

Recommended Action

Statistical power ≥ 0.80: Sample size is adequate for bias detection.
Power < 0.80: Consider collecting more data or interpreting fairness metrics with caution.

{% endif %}
{% endif %} {% if preprocessing_info %}

{{ '4.' if dataset_bias else '3.' }} Preprocessing Verification

{% if preprocessing_info.mode == 'auto' %}

⚠️ WARNING: Non-Compliant Preprocessing Mode

This audit used AUTO preprocessing mode, which is NOT suitable for regulatory compliance. Auto mode dynamically fits preprocessing transformers to the audit data, creating a different preprocessing pipeline than production.

For compliance-grade audits:

  • Use mode: artifact in preprocessing config
  • Provide the exact preprocessing artifact used in production
  • Include both expected_file_hash and expected_params_hash

See documentation: docs/preprocessing.md
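
As a rough illustration of the file-hash check, the sketch below computes a SHA-256 digest of an artifact file and adds the `sha256:` prefix that this report strips when displaying hashes. The helper name and the temporary demo file are hypothetical. How the params hash is derived is not described here, so it is not shown.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=65536):
    """Return 'sha256:<hex digest>' for the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

# Demo on a throwaway file standing in for a real preprocessing artifact.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_hash = file_sha256(tmp.name)
os.remove(tmp.name)
print(demo_hash)
```

The resulting string is the kind of value that would be compared against expected_file_hash.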

{% elif preprocessing_info.mode == 'artifact' %}

✓ Production Artifact Verified

This audit used a verified preprocessing artifact from production, ensuring the model was evaluated with the exact same transformations used in deployment.

{% endif %}

Preprocessing Summary

Property | Value | Status
Mode | {{ preprocessing_info.mode }} | {% if preprocessing_info.mode == 'artifact' %}Compliant{% else %}Non-Compliant{% endif %}
{% if preprocessing_info.artifact_path %}
Artifact Path | {{ preprocessing_info.artifact_path }}
{% endif %}
{% if preprocessing_info.file_hash %}
File Hash (SHA256) | {% set full_file_hash = preprocessing_info.file_hash | replace('sha256:', '') %}{{ full_file_hash[:12] }}...{{ full_file_hash[-12:] }} | Verified
{% endif %}
{% if preprocessing_info.params_hash %}
Params Hash (SHA256) | {% set full_params_hash = preprocessing_info.params_hash | replace('sha256:', '') %}{{ full_params_hash[:12] }}...{{ full_params_hash[-12:] }} | Verified
{% endif %}
{% if preprocessing_info.manifest and preprocessing_info.manifest.created_at %}
Artifact Created | {{ preprocessing_info.manifest.created_at }}
{% endif %}
{% if preprocessing_info.manifest and preprocessing_info.manifest.components %}

Preprocessing Pipeline

The following transformation steps are applied to data in sequence.

Component Details

Learned parameters extracted from the preprocessing artifact. These values were fitted on production training data and applied to audit data.

{% for component in preprocessing_info.manifest.components %}

{{ component.name }}

Class: {{ component.class }}
Purpose: {% if 'SimpleImputer' in component.class %}Fills in missing values using the {{ component.strategy if component.strategy else 'configured' }} strategy{% elif 'StandardScaler' in component.class %}Standardizes features by removing the mean and scaling to unit variance{% elif 'OneHotEncoder' in component.class %}Converts categorical variables into binary indicator features{% elif 'RobustScaler' in component.class %}Scales features using statistics robust to outliers{% elif 'MinMaxScaler' in component.class %}Scales features to a fixed range (typically 0 to 1){% else %}Data transformation component{% endif %}
{% if component.handle_unknown %}Handle Unknown: {{ component.handle_unknown }}{% endif %}
{% if component.drop %}Drop: {{ component.drop }}{% endif %}
{% if component.sparse_output is defined and component.sparse_output is not none %}Sparse Output: {{ component.sparse_output }}{% endif %}
{% if component.columns %}
Applied to columns ({{ component.columns | length }}):
{{ component.columns | join(", ") }}
{% endif %} {% if component.learned_stats %}
Learned Statistics:
Column | Value
{% for i in range(component.learned_stats | length) %}
{{ component.columns[i] if component.columns and i < (component.columns | length) else ("Column " + (i | string)) }} | {% set val = component.learned_stats[i] %}{% if val is number %}{{ "%.4f" | format(val) }}{% else %}{{ val }}{% endif %}
{% endfor %}
{% endif %} {% if component.mean or component.scale %}
Scaling Parameters:
Column{% if component.mean %} | Mean{% endif %}{% if component.scale %} | Scale{% endif %}
{% for i in range((component.mean | length) if component.mean else (component.scale | length)) %}
{{ component.columns[i] if component.columns and i < (component.columns | length) else ("Column " + (i | string)) }}{% if component.mean %} | {{ "%.4f" | format(component.mean[i]) }}{% endif %}{% if component.scale %} | {{ "%.4f" | format(component.scale[i]) }}{% endif %}
{% endfor %}
{% endif %} {% if component.categories %}
Encoder Categories (showing up to 10 per column):
Column | Categories | Count
{% for i in range(component.categories | length) %}
{{ component.columns[i] if component.columns and i < (component.columns | length) else ("Column " + (i | string)) }} | {% set cats = component.categories[i] %}{% if cats | length > 10 %}{{ cats[:10] | join(", ") }}, ... ({{ cats | length - 10 }} more){% else %}{{ cats | join(", ") }}{% endif %} | {{ component.categories[i] | length }}
{% endfor %}
{% endif %}
{% endfor %} {% endif %} {% if preprocessing_info.manifest and preprocessing_info.manifest.artifact_runtime_versions %}

Runtime Version Information

{% set has_mismatch = namespace(value=false) %} {% for lib, artifact_ver in preprocessing_info.manifest.artifact_runtime_versions.items() %} {% set audit_ver = preprocessing_info.manifest.audit_runtime_versions[lib] if preprocessing_info.manifest.audit_runtime_versions else "N/A" %} {% if artifact_ver != audit_ver %} {% set has_mismatch.value = true %} {% endif %} {% endfor %} {% if has_mismatch.value %}

⚠️ Version Mismatch Detected

The preprocessing artifact was created with different library versions than the current audit environment. While the artifact has been successfully loaded, version differences may affect reproducibility.

Recommendation: For maximum reproducibility, use the same library versions that were used to create the preprocessing artifact.

{% endif %}
Library | Artifact Version | Audit Version | Status
{% for lib, artifact_ver in preprocessing_info.manifest.artifact_runtime_versions.items() %}
{% set audit_ver = preprocessing_info.manifest.audit_runtime_versions[lib] if preprocessing_info.manifest.audit_runtime_versions else "N/A" %}
{{ lib }} | {{ artifact_ver }} | {{ audit_ver }} | {% if artifact_ver == audit_ver %}Match{% else %}Mismatch{% endif %}
{% endfor %}
{% endif %} {% if preprocessing_unknown_rates %}

Unknown Category Detection

Unknown categories are values in the audit data that were not seen during the training of the preprocessing artifact. High unknown rates may indicate distribution shift or data quality issues.
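
A minimal sketch of how an unknown rate could be computed for one column, using the same cut-offs as the assessment in this section. It assumes missing values are excluded from the denominator, which is an illustrative choice rather than a documented one, and the column values are hypothetical.

```python
def unknown_rate(values, known_categories):
    """Fraction of non-missing values that the fitted encoder never saw."""
    known = set(known_categories)
    observed = [v for v in values if v is not None]
    if not observed:
        return 0.0
    unknown = sum(1 for v in observed if v not in known)
    return unknown / len(observed)

def assess_unknown_rate(rate):
    """Labels follow the thresholds used in this section (>10%, >1%)."""
    if rate > 0.10:
        return "High"
    if rate > 0.01:
        return "Moderate"
    return "Low"

# Hypothetical column: categories seen at fit time vs. values in the audit data.
fitted_categories = ["rent", "own", "mortgage"]
audit_values = ["rent", "own", "lodger", "rent", "own",
                "mortgage", "rent", "own", "rent", "own"]
rate = unknown_rate(audit_values, fitted_categories)
print(rate, assess_unknown_rate(rate))
```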

{% set high_unknown_cols = [] %} {% for col, rate in preprocessing_unknown_rates.items() %} {% if rate > 0.10 %} {% set _ = high_unknown_cols.append(col) %} {% endif %} {% endfor %} {% if high_unknown_cols %}

⚠️ High Unknown Category Rates Detected

The following columns have high unknown category rates (>10%): {{ high_unknown_cols | join(', ') }}

Potential Causes:

  • Data distribution has shifted since preprocessing artifact was created
  • Audit data comes from a different population than training data
  • New categories have emerged that didn't exist during training

Recommended Actions:

  • Investigate the source of unknown categories
  • Consider retraining the preprocessing artifact with updated training data
  • Verify that audit data is from the same distribution as production data
  • Review the encoder's handle_unknown strategy for appropriateness
{% endif %}
Column | Unknown Rate | Assessment
{% for col, rate in preprocessing_unknown_rates.items() %}
{{ col }} | {{ "%.2f%%" | format(rate * 100) }} | {% if rate > 0.10 %}High{% elif rate > 0.01 %}Moderate{% else %}Low{% endif %}
{% endfor %}
{% endif %}
{% endif %}

{% if preprocessing_info and dataset_bias %}5.{% elif preprocessing_info or dataset_bias %}4.{% else %}3.{% endif %} Model Performance Analysis

{% if model_performance %}

Performance Metrics

Metric | Value | Assessment | Description
{% for metric_name, metric_data in model_performance.items() %}
{% if metric_data is mapping %}
{% if metric_name == "classification_report" %}
Overall Accuracy | {{ "%.3f" | format(metric_data.accuracy) }} | {{ 'Excellent' if metric_data.accuracy > 0.9 else ('Good' if metric_data.accuracy > 0.8 else ('Fair' if metric_data.accuracy > 0.6 else 'Poor')) }} | Overall classification accuracy
{% if metric_data.macro_precision is defined %}
Macro Precision | {{ "%.3f" | format(metric_data.macro_precision) }} | {{ 'Good' if metric_data.macro_precision > 0.8 else ('Fair' if metric_data.macro_precision > 0.6 else 'Poor') }} | Average precision across all classes
{% endif %}
{% else %}
{% set value = metric_data.value if metric_data.value is defined else metric_data.accuracy if metric_data.accuracy is defined else metric_data %}
{% if value is number %}
{{ metric_name | replace("_", " ") | title }} | {{ "%.3f" | format(value) }} | {{ 'Good' if value > 0.8 else ('Fair' if value > 0.6 else 'Poor') }} | {{ metric_descriptions.get(metric_name, "Performance metric") }}
{% endif %}
{% endif %}
{% elif metric_data is number %}
{{ metric_name | replace("_", " ") | title }} | {{ "%.3f" | format(metric_data) }} | {{ 'Good' if metric_data > 0.8 else ('Fair' if metric_data > 0.6 else 'Poor') }} | {{ metric_descriptions.get(metric_name, "Performance metric") }}
{% endif %}
{% endfor %}
{% endif %} {% if performance_plots %}

Performance Visualizations

{% for plot_name, plot_path in performance_plots.items() %}
{{ plot_name | replace('_', ' ') | title }} Plot
{{ plot_name | replace('_', ' ') | title }} - {{ plot_descriptions.get(plot_name, "Model performance visualization") }}
{% endfor %} {% endif %}
{% if calibration_ci %}

{% if preprocessing_info and dataset_bias %}6.{% elif preprocessing_info or dataset_bias %}5.{% else %}4.{% endif %} Calibration Analysis with Confidence Intervals

What is Calibration?

Calibration measures whether predicted probabilities match observed outcomes. For a well-calibrated model, about 70% of the cases assigned a predicted probability of 70% should turn out positive. Poor calibration can mislead decision-makers even when classification accuracy is high.

Calibration Metrics

Metric | Value | 95% Confidence Interval | Interpretation
Expected Calibration Error (ECE) | {% if calibration_ci.get('ece') is not none %}{{ "%.4f"|format(calibration_ci['ece']) }}{% else %}N/A{% endif %} | {% if calibration_ci.get('ece_ci') %}[{{ "%.4f"|format(calibration_ci['ece_ci']['ci_lower']) }}, {{ "%.4f"|format(calibration_ci['ece_ci']['ci_upper']) }}]{% else %}N/A{% endif %} | {% if calibration_ci.get('ece') is not none %}{% if calibration_ci['ece'] < 0.05 %}Well Calibrated{% elif calibration_ci['ece'] < 0.10 %}Acceptable{% else %}Poorly Calibrated{% endif %}{% endif %}
Brier Score | {{ "%.4f"|format(calibration_ci['brier_score']) }} | {% if calibration_ci.get('brier_ci') %}[{{ "%.4f"|format(calibration_ci['brier_ci']['ci_lower']) }}, {{ "%.4f"|format(calibration_ci['brier_ci']['ci_upper']) }}]{% else %}N/A{% endif %} | Combined measure of calibration and accuracy (lower is better)

Sample size: {{ "{:,}".format(calibration_ci.get('n_samples', 0)) }} | Bins: {{ calibration_ci.get('n_bins', 10) }} | Bootstrap samples: {{ calibration_ci.get('ece_ci', {}).get('n_bootstrap', 'N/A') if calibration_ci.get('ece_ci') else 'N/A' }}

{% if calibration_ci.get('bin_calibration') %}

Bin-Wise Calibration Error

Calibration error by predicted probability bin. Wide confidence intervals indicate uncertainty due to small sample sizes.

Bin Range | Count | Mean Predicted | Mean Observed | Error | 95% CI
{% for bin in calibration_ci['bin_calibration'] %}
[{{ "%.2f"|format(bin['bin_lower']) }}, {{ "%.2f"|format(bin['bin_upper']) }}) | {{ bin['count'] }} | {{ "%.3f"|format(bin['mean_predicted']) }} | {{ "%.3f"|format(bin['mean_observed']) }} | {{ "%.3f"|format(bin['error']|abs) }} | {% if bin.get('error_ci') %}[{{ "%.3f"|format(bin['error_ci']['ci_lower']) }}, {{ "%.3f"|format(bin['error_ci']['ci_upper']) }}]{% else %}N/A{% endif %}
{% endfor %}
{% endif %}

Understanding Calibration

ECE < 0.05: Model probabilities are well-calibrated
ECE 0.05-0.10: Acceptable calibration for most applications
ECE > 0.10: Consider recalibration (e.g., Platt scaling, isotonic regression)

Confidence intervals quantify uncertainty in calibration estimates. Wide intervals suggest small sample sizes or high variability.
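
For reference, the two metrics can be sketched as follows. This assumes equal-width bins that are closed on the left, matching the bin ranges shown above. The binning and bootstrap procedure this audit actually uses may differ, and the sample data is hypothetical.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean of |mean predicted - mean observed| over probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_predicted = sum(p for p, _ in members) / len(members)
        mean_observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_predicted - mean_observed)
    return ece

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# Hypothetical predictions: confident and correct, but slightly under-confident.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 1]
print(expected_calibration_error(probs, labels), brier_score(probs, labels))
```

In this example every prediction is on the right side of 0.5, yet the ECE is about 0.10, which shows that accuracy and calibration are separate properties.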

{% endif %}
{% if explanations %}

{% if preprocessing_info and dataset_bias and calibration_ci %}7.{% elif (preprocessing_info and dataset_bias) or (preprocessing_info and calibration_ci) or (dataset_bias and calibration_ci) %}6.{% elif preprocessing_info or dataset_bias or calibration_ci %}5.{% else %}4.{% endif %} Model Explainability (SHAP Analysis)

{% if explanations.global_importance %}

Feature Importance

The following features have the greatest impact on model predictions:

Feature | SHAP Value | Impact | Interpretation
{% if explanations and explanations.global_importance %}
{% for feature, importance in (explanations.global_importance.items() | list)[:10] %}
{{ feature | replace("_", " ") | title }} | {{ "%.3f" | format(importance) }} | {{ 'Positive' if importance > 0 else 'Negative' }} | {{ 'Increases' if importance > 0 else 'Decreases' }} prediction probability (magnitude: {{ 'High' if (importance|abs) > 0.1 else ('Medium' if (importance|abs) > 0.05 else 'Low') }})
{% endfor %}
{% endif %}
{% endif %} {% if shap_plots %}

SHAP Visualizations

{% for plot_name, plot_path in shap_plots.items() %}
{{ plot_name | replace('_', ' ') | title }} Plot
{{ plot_name | replace('_', ' ') | title }} - {{ shap_descriptions.get(plot_name, "SHAP explanation visualization") }}
{% endfor %} {% endif %}

Understanding SHAP Values

SHAP (SHapley Additive exPlanations) values explain individual predictions by quantifying the contribution of each feature:

  • Positive values push the prediction toward the positive class
  • Negative values push the prediction toward the negative class
  • Magnitude indicates the strength of the feature's influence
  • For a single prediction, the SHAP values sum to the difference between that prediction and the baseline (expected) prediction
{% endif %}
{% if fairness_analysis %}

{{ '6.' if preprocessing_info else '5.' }} Fairness & Bias Analysis

Analysis of potential bias across protected demographic groups:

{% for attr_name, attr_metrics in fairness_analysis.items() %}

{{ attr_name | replace("_", " ") | title }} Analysis

Fairness Metric | Result | Status | Interpretation
{% for metric_name, metric_result in attr_metrics.items() %}
{{ metric_name | replace("_", " ") | title }} | {% if metric_result is mapping %}{% if metric_result.error is defined %}Error: {{ metric_result.error[:50] }}...{% elif metric_result.ratio is defined %}{{ "%.3f" | format(metric_result.ratio) }}{% elif metric_result.difference is defined %}{{ "%.3f" | format(metric_result.difference) }}{% else %}{{ metric_result }}{% endif %}{% else %}{{ metric_result }}{% endif %} | {% if metric_result is mapping and metric_result.error is defined %}Error{% elif metric_result is mapping and metric_result.is_fair is defined %}{{ 'Fair' if metric_result.is_fair else 'Biased' }}{% else %}Unknown{% endif %} | {% if metric_result is mapping and metric_result.error is defined %}Metric computation failed{% elif metric_result is mapping and metric_result.is_fair is defined %}{% if metric_result.is_fair %}No significant bias detected{% else %}Potential bias detected - requires review{% endif %}{% else %}Unable to assess fairness{% endif %}
{% endfor %}
{% endfor %} {% if fairness_plots %}

Fairness Visualizations

{% for plot_name, plot_path in fairness_plots.items() %}
{{ plot_name | replace('_', ' ') | title }} Plot
{{ plot_name | replace('_', ' ') | title }} - Bias detection across demographic groups
{% endfor %} {% endif %}

Regulatory Considerations

Key points for regulatory compliance:

  • Demographic Parity: Equal positive prediction rates across groups
  • Equal Opportunity: Equal true positive rates for qualified individuals
  • Equalized Odds: Equal true positive and false positive rates
  • Predictive Parity: Equal precision across demographic groups

Any "Biased" status may require model adjustment or additional oversight before deployment.
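
The four criteria above compare per-group rates. The sketch below is a hedged illustration for binary labels and predictions. The function names and sample data are hypothetical, and the thresholds this audit uses to decide "Fair" versus "Biased" are not reproduced here.

```python
from collections import defaultdict

def safe_rate(numerator, denominator):
    """Return numerator / denominator, or None when the rate is undefined."""
    return numerator / denominator if denominator else None

def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, TPR, FPR and precision."""
    pairs_by_group = defaultdict(list)
    for truth, pred, group in zip(y_true, y_pred, groups):
        pairs_by_group[group].append((truth, pred))
    rates = dict()
    for group, pairs in pairs_by_group.items():
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        positives = sum(1 for t, _ in pairs if t == 1)
        negatives = len(pairs) - positives
        rates[group] = dict(
            selection_rate=safe_rate(tp + fp, len(pairs)),  # demographic parity
            tpr=safe_rate(tp, positives),                   # equal opportunity
            fpr=safe_rate(fp, negatives),                   # equalized odds (with TPR)
            precision=safe_rate(tp, tp + fp),               # predictive parity
        )
    return rates

def max_gap(rates, metric):
    """Largest between-group difference for one metric, ignoring undefined rates."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values)

# Hypothetical audit sample with two groups of four records each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4
rates = group_rates(y_true, y_pred, groups)
for metric in ("selection_rate", "tpr", "fpr", "precision"):
    print(metric, round(max_gap(rates, metric), 3))
```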

{% endif %} {% if intersectional_fairness %}

{% set base_num = 8 %} {% if not preprocessing_info %}{% set base_num = base_num - 1 %}{% endif %} {% if not dataset_bias %}{% set base_num = base_num - 1 %}{% endif %} {% if not calibration_ci %}{% set base_num = base_num - 1 %}{% endif %} {{ base_num }}. Intersectional Fairness Analysis

What is Intersectional Fairness?

Intersectional fairness examines how multiple protected attributes interact (e.g., race × gender). Bias hidden in overall metrics can emerge at intersections. For example, a model fair for women overall may discriminate against Black women specifically.

{% for intersection_name, intersection_data in intersectional_fairness.items() %}

{{ intersection_name | replace("*", " × ") | replace("_", " ") | title }}

{% if intersection_data.groups %}
Group | Selection Rate | TPR | FPR | Sample Size | Warning
{% for group_name, metrics in intersection_data.groups.items() %}
{{ group_name | replace("_", " ") | title }} | {% if metrics.selection_rate is defined %}{{ "%.3f"|format(metrics.selection_rate) }}{% if metrics.selection_rate_ci %} (95% CI: [{{ "%.3f"|format(metrics.selection_rate_ci.ci_lower) }}, {{ "%.3f"|format(metrics.selection_rate_ci.ci_upper) }}]){% endif %}{% else %}N/A{% endif %} | {% if metrics.tpr is defined %}{{ "%.3f"|format(metrics.tpr) }}{% else %}N/A{% endif %} | {% if metrics.fpr is defined %}{{ "%.3f"|format(metrics.fpr) }}{% else %}N/A{% endif %} | {{ metrics.n }} | {% if metrics.n < 10 %}n < 10{% elif metrics.n < 30 %}n < 30{% else %}Adequate{% endif %}
{% endfor %}
{% endif %} {% if intersection_data.disparity_metrics %}

Disparity Metrics

Metric | Max-Min Difference | Max/Min Ratio | Assessment
{% if intersection_data.disparity_metrics.selection_rate_disparity %}
Selection Rate | {{ "%.3f"|format(intersection_data.disparity_metrics.selection_rate_disparity.max_min_diff) }} | {{ "%.3f"|format(intersection_data.disparity_metrics.selection_rate_disparity.max_min_ratio) }} | {% if intersection_data.disparity_metrics.selection_rate_disparity.max_min_diff > 0.2 %}High Disparity{% elif intersection_data.disparity_metrics.selection_rate_disparity.max_min_diff > 0.1 %}Moderate Disparity{% else %}Low Disparity{% endif %}
{% endif %}
{% endif %} {% endfor %}

Interpreting Intersectional Results

Sample size warnings: Groups with n < 30 have unreliable metrics (wide confidence intervals).
High disparity (difference > 0.2): Investigate for potential discrimination at intersections.
Action: Prioritize interventions for intersectional groups with worst outcomes.
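
A small sketch of how intersectional selection rates and the two disparity figures could be derived. The record layout, attribute names and helper names are hypothetical. Only the output keys max_min_diff and max_min_ratio mirror the fields shown in this report.

```python
from collections import defaultdict

def intersectional_selection_rates(records, attrs, pred_key="prediction"):
    """Selection rate and sample size for every combination of the given attributes."""
    buckets = defaultdict(list)
    for record in records:
        key = tuple(record[a] for a in attrs)
        buckets[key].append(record[pred_key])
    rates = dict()
    for key, preds in buckets.items():
        rates[key] = dict(n=len(preds), selection_rate=sum(preds) / len(preds))
    return rates

def selection_rate_disparity(rates):
    """Max-min difference and max/min ratio of selection rates across groups."""
    values = [g["selection_rate"] for g in rates.values()]
    high, low = max(values), min(values)
    ratio = high / low if low > 0 else float("inf")
    return dict(max_min_diff=high - low, max_min_ratio=ratio)

# Hypothetical records; real audits need far more than n = 2 per group.
records = [
    dict(gender="f", race="x", prediction=1),
    dict(gender="f", race="x", prediction=1),
    dict(gender="f", race="y", prediction=0),
    dict(gender="f", race="y", prediction=1),
    dict(gender="m", race="x", prediction=1),
    dict(gender="m", race="y", prediction=1),
]
rates = intersectional_selection_rates(records, ["gender", "race"])
disparity = selection_rate_disparity(rates)
small_groups = [key for key, g in rates.items() if g["n"] < 30]
print(disparity, small_groups)
```

Every group in this toy sample would carry the n < 10 warning, so the disparity figures here are illustrative only.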

{% endif %} {% if individual_fairness %}

{% set base_num = 9 %} {% if not preprocessing_info %}{% set base_num = base_num - 1 %}{% endif %} {% if not dataset_bias %}{% set base_num = base_num - 1 %}{% endif %} {% if not calibration_ci %}{% set base_num = base_num - 1 %}{% endif %} {% if not intersectional_fairness %}{% set base_num = base_num - 1 %}{% endif %} {{ base_num }}. Individual Fairness Analysis

What is Individual Fairness?

Individual fairness requires that similar individuals receive similar treatment. While group fairness examines aggregate metrics, individual fairness catches disparate treatment of specific people—a critical legal concern under anti-discrimination law.

Consistency Score

Metric | Value | Interpretation
Consistency Score | {{ "%.3f"|format(individual_fairness['consistency_score']) }} | {% if individual_fairness['consistency_score'] >= 0.90 %}High Consistency{% elif individual_fairness['consistency_score'] >= 0.75 %}Moderate Consistency{% else %}Low Consistency{% endif %}

Measures whether similar individuals (based on non-protected features) receive similar predictions. Score of 1.0 means perfect consistency; lower scores indicate disparate treatment.

Matched Pairs Analysis

Identifies pairs of individuals with similar non-protected features but different protected attributes. Large prediction differences for matched pairs suggest potential discrimination.

Statistic | Value
Matched Pairs Found | {{ individual_fairness['matched_pairs_count'] }}
{% if individual_fairness.get('avg_prediction_diff') is not none %}
Avg Prediction Difference | {{ "%.3f"|format(individual_fairness['avg_prediction_diff']) }}
{% endif %}
{% if individual_fairness.get('max_prediction_diff') is not none %}
Max Prediction Difference | {{ "%.3f"|format(individual_fairness['max_prediction_diff']) }}
{% endif %}

Counterfactual Flip Test

Tests whether changing only a protected attribute (e.g., gender) significantly changes the prediction. Violations indicate the model directly uses protected information.

Result | Count | Assessment
Flip Test Violations | {{ individual_fairness['flip_test_violations'] }} | {% if individual_fairness['flip_test_violations'] == 0 %}No Violations{% elif individual_fairness['flip_test_violations'] < 10 %}Minor Violations{% else %}Significant Violations{% endif %}

Legal Implications

Consistency score < 0.75: Model may treat similar individuals inconsistently—investigate for disparate treatment.
Flip test violations: Protected attributes directly influence predictions—potential ECOA/Title VII violation.
Action: Review matched pairs with large prediction differences for justification.
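
The sketch below illustrates two of the checks in this section. It assumes one common definition of consistency (one minus the mean gap between each prediction and the mean prediction of its k nearest neighbours on non-protected features). The flip-test threshold of 0.1 and the toy models are hypothetical, and the audit's exact definitions may differ.

```python
def consistency_score(features, predictions, k=2):
    """1 - mean |prediction - mean prediction of the k nearest neighbours|."""
    n = len(features)
    total_gap = 0.0
    for i in range(n):
        distances = sorted(
            (sum((a - b) ** 2 for a, b in zip(features[i], features[j])), j)
            for j in range(n) if j != i
        )
        neighbours = [j for _, j in distances[:k]]
        neighbour_mean = sum(predictions[j] for j in neighbours) / len(neighbours)
        total_gap += abs(predictions[i] - neighbour_mean)
    return 1.0 - total_gap / n

def flip_test_violations(model, rows, protected_index, values=(0, 1), threshold=0.1):
    """Count rows whose prediction moves by more than threshold when only
    the protected attribute is changed."""
    violations = 0
    for row in rows:
        outputs = []
        for value in values:
            flipped = list(row)
            flipped[protected_index] = value
            outputs.append(model(flipped))
        if max(outputs) - min(outputs) > threshold:
            violations += 1
    return violations

# Hypothetical non-protected features and predictions: two tight clusters.
features = [[0.0], [0.1], [1.0], [1.1]]
predictions = [0.2, 0.2, 0.8, 0.8]
print(consistency_score(features, predictions, k=1))

# Toy models: one reads the protected attribute (index 0), one ignores it.
rows = [[0, 1.0], [1, 2.0]]
biased_model = lambda row: 0.5 + 0.3 * row[0]
blind_model = lambda row: min(1.0, 0.2 * row[1])
print(flip_test_violations(biased_model, rows, 0),
      flip_test_violations(blind_model, rows, 0))
```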

{% endif %} {% if stability_analysis and stability_analysis.robustness_score is defined %}

{% set base_num = 10 %} {% if not preprocessing_info %}{% set base_num = base_num - 1 %}{% endif %} {% if not dataset_bias %}{% set base_num = base_num - 1 %}{% endif %} {% if not calibration_ci %}{% set base_num = base_num - 1 %}{% endif %} {% if not intersectional_fairness %}{% set base_num = base_num - 1 %}{% endif %} {% if not individual_fairness %}{% set base_num = base_num - 1 %}{% endif %} {{ base_num }}. Model Robustness Testing

What is Robustness Testing?

Robustness testing measures model stability under small input perturbations. Fragile models produce wildly different outputs for similar inputs, indicating potential manipulation vulnerabilities or unreliable decisions.

Adversarial Perturbation Analysis

Metric | Value | Gate Status | Interpretation
Robustness Score | {{ "%.4f"|format(stability_analysis.robustness_score) }} | {% if stability_analysis.gate_status == "PASS" %}PASS{% elif stability_analysis.gate_status == "WARNING" %}WARNING{% else %}FAIL{% endif %} | Maximum prediction change across all perturbations (L∞ norm)
{% if stability_analysis.results_by_epsilon %}

Perturbation Sweep Results

Model inputs are perturbed with Gaussian noise at different intensities (ε), and the resulting change in predictions is recorded. Protected attributes (gender, race, etc.) are never perturbed.

Perturbation Level (ε) | Max Prediction Delta | Mean Delta | Assessment
{% for epsilon, result in stability_analysis.results_by_epsilon.items() %}
{{ epsilon }} | {{ "%.4f"|format(result.max_delta) }} | {% if result.mean_delta is defined %}{{ "%.4f"|format(result.mean_delta) }}{% else %}N/A{% endif %} | {% if result.max_delta < 0.05 %}Robust{% elif result.max_delta < 0.15 %}Moderate{% else %}Fragile{% endif %}
{% endfor %}
{% endif %}

Understanding Robustness

Robustness score < 0.05: Model is highly stable to input perturbations
Score 0.05-0.15: Acceptable stability for most applications
Score > 0.15: Model is fragile—consider regularization or ensemble methods

EU AI Act compliance: High-risk AI systems must demonstrate robustness to adversarial perturbations. This test provides evidence of model stability under input variations.
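
A minimal sketch of a perturbation sweep of this kind: Gaussian noise is added to every non-protected feature and the largest change in the prediction is recorded. The number of trials, the seed, the epsilon values and the toy model are hypothetical choices for illustration only.

```python
import math
import random

def max_prediction_delta(model, rows, epsilon, protected_indices=(), n_trials=20, seed=0):
    """Largest |f(x + noise) - f(x)| over all rows and trials.
    Features listed in protected_indices are left untouched."""
    rng = random.Random(seed)
    worst = 0.0
    for row in rows:
        baseline = model(row)
        for _ in range(n_trials):
            noisy = [
                value if index in protected_indices else value + rng.gauss(0.0, epsilon)
                for index, value in enumerate(row)
            ]
            worst = max(worst, abs(model(noisy) - baseline))
    return worst

def demo_model(row):
    # Hypothetical scorer: logistic function of the second feature only.
    return 1.0 / (1.0 + math.exp(-row[1]))

# Column 0 plays the role of a protected attribute and is never perturbed.
rows = [[1, 0.2], [0, -0.5], [1, 1.5]]
sweep = dict()
for eps in (0.0, 0.01, 0.1):
    sweep[eps] = max_prediction_delta(demo_model, rows, eps, protected_indices=(0,))
print(sweep)
```

Fixing the seed keeps the sweep repeatable, in line with the seed tracking described in the audit trail.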

{% endif %}

{% set base_num = 11 %} {% if not preprocessing_info %}{% set base_num = base_num - 1 %}{% endif %} {% if not dataset_bias %}{% set base_num = base_num - 1 %}{% endif %} {% if not calibration_ci %}{% set base_num = base_num - 1 %}{% endif %} {% if not intersectional_fairness %}{% set base_num = base_num - 1 %}{% endif %} {% if not individual_fairness %}{% set base_num = base_num - 1 %}{% endif %} {% if not (stability_analysis and stability_analysis.robustness_score is defined) %}{% set base_num = base_num - 1 %}{% endif %} {{ base_num }}. Audit Trail & Reproducibility

Execution Information

{% if manifest %}
Audit ID: {{ manifest.audit_id | default("N/A") }}
Creation Time: {{ manifest.creation_time | default("N/A") }}
Duration: {{ "%.2f seconds" | format(manifest.get('execution', {}).get('duration_seconds', 0)) if manifest.get('execution', {}).get('duration_seconds') else "N/A" }}
Status: {{ manifest.get('execution', {}).get('status', 'Unknown') | title }}
{% endif %}

Reproducibility Information

{% if manifest and manifest.seeds %}
Component | Seed Value | Purpose
Master Seed | {{ manifest.seeds.master_seed }} | Global randomness control
{% if manifest and manifest.seeds and manifest.seeds.component_seeds %}{% for component, seed in manifest.seeds.component_seeds.items() %}
{{ component | replace("_", " ") | title }} | {{ seed }} | Component-specific randomness
{% endfor %}{% endif %}
{% endif %}
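
One common way to obtain component-specific seeds from a single master seed is to hash the master seed together with the component name. The Python sketch below illustrates only that idea; the `derive_component_seed` helper and the component names are hypothetical, and the audit tool may derive its seeds differently.

```python
import hashlib
import random


def derive_component_seed(master_seed, component):
    """Derive a stable 32-bit seed from a master seed and a component name."""
    material = (str(master_seed) + ":" + component).encode("utf-8")
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest[:4], "big")


# Hypothetical component names, for illustration only.
component_seeds = {
    name: derive_component_seed(42, name)
    for name in ("explainer", "bootstrap", "perturbation")
}

# Each component then owns an independent, repeatable random stream.
bootstrap_rng = random.Random(component_seeds["bootstrap"])
```

Because each seed depends only on the master seed and the component name, adding or removing one component leaves the other components' random streams unchanged.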

Selected Components

{% if manifest and manifest.selected_components %}
Component Type | Selected Implementation | Version
{% for component_key, component_info in manifest.selected_components.items() %}
{{ component_info.type | title }} | {{ component_info.name | title }} | {{ component_info.version | default("N/A") }}
{% endfor %}
{% endif %}

Environment Information

{% if manifest and manifest.environment %}
Python Version: {{ manifest.environment.python_version.split()[0] if manifest.environment.python_version else "N/A" }}
Platform: {{ manifest.environment.platform | default("N/A") }}
Hostname: {{ manifest.environment.hostname | default("N/A") }}
Working Directory: {{ manifest.environment.working_directory | default("N/A") }}
{% endif %} {% if manifest and manifest.git %}

Version Control

Git Commit: {{ manifest.git.commit_hash[:12] if manifest.git.commit_hash else "N/A" }}
Branch: {{ manifest.git.branch | default("N/A") }}
Working Directory Status: {{ 'Clean' if not manifest.git.is_dirty else 'Modified' }}
{% endif %}

Reproducibility Guarantee

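{% if manifest and manifest.seeds %}This audit recorded its random seeds. Running the same configuration with identical data and environment is expected to produce byte-identical results.{% else %}Reproducibility is limited: random seeds were not recorded, so rerunning the audit may not produce identical results.{% endif %}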

{% if manifest and manifest.config_hash %}

Configuration Hash: {{ manifest.config_hash[:16] }}...

{% endif %}
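
A configuration hash of this kind is usually computed over a canonical serialization, so that key order does not change the result. The following Python sketch assumes SHA-256 over sorted-key JSON; the actual algorithm behind `manifest.config_hash` is not stated in this report, and `config_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def config_fingerprint(config):
    """Hash a JSON-serializable config dict independently of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = config_fingerprint({"model": "xgboost", "seed": 42})
b = config_fingerprint({"seed": 42, "model": "xgboost"})  # same content, reordered
c = config_fingerprint({"seed": 43, "model": "xgboost"})  # one value changed

short = a[:16] + "..."  # shortened display form, 16 hex characters
```

Here `a` and `b` are equal while `c` differs, which is the property that lets a reviewer confirm two audits ran with the same configuration.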

{% set base_num = 12 %} {% if not preprocessing_info %}{% set base_num = base_num - 1 %}{% endif %} {% if not dataset_bias %}{% set base_num = base_num - 1 %}{% endif %} {% if not calibration_ci %}{% set base_num = base_num - 1 %}{% endif %} {% if not intersectional_fairness %}{% set base_num = base_num - 1 %}{% endif %} {% if not individual_fairness %}{% set base_num = base_num - 1 %}{% endif %} {% if not (stability_analysis and stability_analysis.robustness_score is defined) %}{% set base_num = base_num - 1 %}{% endif %} {{ base_num }}. Regulatory Compliance Assessment

Compliance Checklist

Requirement | Status | Evidence | Notes
Model Documentation | Complete | Full audit report generated | Comprehensive model documentation provided
Performance Validation | {{ 'Complete' if model_performance else 'Missing' }} | {{ "Performance metrics computed" if model_performance else "No performance data" }} | {{ "Model performance assessed across standard metrics" if model_performance else "Performance validation required" }}
Bias Testing | {{ 'Complete' if fairness_analysis else 'Missing' }} | {{ "Fairness analysis performed" if fairness_analysis else "No bias testing" }} | {{ "Demographic fairness assessed" if fairness_analysis else "Bias testing required for protected attributes" }}
Explainability | {{ 'Complete' if explanations else 'Missing' }} | {{ "SHAP explanations provided" if explanations else "No explanations available" }} | {{ "Model decisions can be explained to stakeholders" if explanations else "Explainability features required" }}
Reproducibility | {{ 'Complete' if (manifest and manifest.seeds) else 'Partial' }} | {{ "Full audit trail with seeds" if (manifest and manifest.seeds) else "Limited reproducibility" }} | {{ "Results can be exactly reproduced" if (manifest and manifest.seeds) else "Some randomness not controlled" }}
Data Governance | {{ 'Complete' if schema_info else 'Partial' }} | {{ "Data schema documented" if schema_info else "Basic data info only" }} | {{ "Data sources and processing documented" if schema_info else "Enhanced data governance recommended" }}
Preprocessing Verification | {% if preprocessing_info and preprocessing_info.mode == 'artifact' %}Complete | Production artifact verified with dual hash system | Preprocessing transformations match production deployment{% elif preprocessing_info and preprocessing_info.mode == 'auto' %}Non-Compliant | Auto mode used; not suitable for regulatory compliance | Requires artifact mode with hash verification for compliance{% else %}Not Configured | No preprocessing information provided | Requires artifact mode with hash verification for compliance{% endif %}

Regulatory Framework Assessment

Assessment against major regulatory frameworks as of their latest revisions.

GDPR (EU) - Regulation (EU) 2016/679

Effective: May 25, 2018 | Article 22 - Automated Decision-Making

  • [{{ 'PASS' if explanations else 'FAIL' }}] Right to explanation supported (SHAP analysis)
  • [{{ 'PASS' if (manifest and manifest.seeds) else 'WARN' }}] Automated decision-making documented
  • [{{ 'PASS' if fairness_analysis else 'FAIL' }}] Bias assessment performed
  • [{{ 'PASS' if schema_info and schema_info.sensitive_features else 'WARN' }}] Sensitive data handling documented

Equal Credit Opportunity Act (US) - 15 U.S.C. § 1691

Enacted: 1974 | Regulation B (12 CFR Part 1002)

  • [{{ 'PASS' if fairness_analysis else 'FAIL' }}] Disparate impact testing
  • [{{ 'PASS' if explanations else 'FAIL' }}] Adverse action explanations available
  • [{{ 'PASS' if (schema_info and schema_info.sensitive_features) else 'FAIL' }}] Protected class monitoring
  • [{{ 'PASS' if model_performance else 'FAIL' }}] Model performance documentation
{% set compliance_issues = [] %} {% if not fairness_analysis %}{% set _ = compliance_issues.append("Bias testing required") %}{% endif %} {% if not explanations %}{% set _ = compliance_issues.append("Model explainability required") %}{% endif %} {% if not (manifest and manifest.seeds) %}{% set _ = compliance_issues.append("Full reproducibility required") %}{% endif %} {% if compliance_issues %}

ALERT: Compliance Issues Detected

The following issues must be addressed before regulatory submission:

{% for issue in compliance_issues %}
  • {{ issue }}
{% endfor %}
{% else %}

Regulatory Compliance Status: PASS

This model audit meets standard regulatory requirements for:

  • Explainable AI mandates
  • Bias testing and fairness assessment
  • Reproducibility and audit trail requirements
  • Performance validation standards
{% endif %}
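
The pass/alert decision above reduces to checking which pieces of audit evidence are present. A minimal Python sketch of that logic, mirroring the three checks made in this section; the function name `collect_compliance_issues` and the dictionary layout of the inputs are illustrative assumptions:

```python
def collect_compliance_issues(fairness_analysis, explanations, manifest):
    """Return the list of blocking issues; an empty list means PASS."""
    issues = []
    if not fairness_analysis:
        issues.append("Bias testing required")
    if not explanations:
        issues.append("Model explainability required")
    if not (manifest and manifest.get("seeds")):
        issues.append("Full reproducibility required")
    return issues


# Example inputs (made up): fairness results and seeds exist, explanations do not.
seeds = dict(master_seed=42)
issues = collect_compliance_issues(
    fairness_analysis=dict(gender="analyzed"),
    explanations=None,
    manifest=dict(seeds=seeds),
)
status = "PASS" if not issues else "ALERT"
```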

{% set base_num = 13 %} {% if not preprocessing_info %}{% set base_num = base_num - 1 %}{% endif %} {% if not dataset_bias %}{% set base_num = base_num - 1 %}{% endif %} {% if not calibration_ci %}{% set base_num = base_num - 1 %}{% endif %} {% if not intersectional_fairness %}{% set base_num = base_num - 1 %}{% endif %} {% if not individual_fairness %}{% set base_num = base_num - 1 %}{% endif %} {% if not (stability_analysis and stability_analysis.robustness_score is defined) %}{% set base_num = base_num - 1 %}{% endif %} {{ base_num }}. Model Card

This model card provides essential information about the ML model's intended use, limitations, and ethical considerations following the Model Cards framework (Mitchell et al., 2019).

Model Details

  • Model Type: {{ model_info.type | default("N/A") | title }}
  • Model Version: {{ version | default("1.0.0") }}
  • Training Date: {{ preprocessing_info.manifest.created_at if (preprocessing_info and preprocessing_info.manifest and preprocessing_info.manifest.created_at) else generation_date }}
  • Model Owner: [Organization Name - To Be Completed]

Intended Use

  • Primary Use Case: {{ schema_info.target | default("Classification/Prediction") }} on tabular data
  • Intended Users: Data scientists, compliance officers, and decision-makers requiring transparent ML predictions
  • Out-of-Scope Uses: This model should not be used for decisions outside the training data distribution or for populations not represented in the training data

Training Data

  • Dataset Size: {{ data_summary.shape[0] if data_summary.shape else "N/A" }} samples
  • Features: {{ data_summary.shape[1] if data_summary.shape else "N/A" }} input variables
  • Protected Attributes: {{ schema_info.sensitive_features | join(", ") if schema_info.sensitive_features else "None specified" }}
  • Data Quality: {{ ("%.1f%% complete" | format((data_summary.shape[0] * data_summary.shape[1] - (data_summary.missing_count | default(0))) / (data_summary.shape[0] * data_summary.shape[1]) * 100)) if data_summary.shape else "N/A" }}
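
The data quality figure above is the share of non-missing cells in the dataset. As a worked example in Python under assumed numbers (1,000 rows, 20 features, 300 missing cells; these values are illustrative and do not come from any audited dataset, and `completeness_pct` is a hypothetical helper):

```python
def completeness_pct(n_rows, n_cols, missing_count=0):
    """Percentage of non-missing cells, or None when the table is empty."""
    total_cells = n_rows * n_cols
    if total_cells == 0:
        return None  # avoid dividing by zero for an empty dataset
    return (total_cells - missing_count) / total_cells * 100


pct = completeness_pct(1000, 20, 300)  # 19,700 of 20,000 cells are present
label = "N/A" if pct is None else "%.1f%% complete" % pct
```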

Performance Metrics

    {% set acc = None %} {% if model_performance.accuracy %} {% if model_performance.accuracy is mapping %} {% if model_performance.accuracy.value is defined %} {% set acc = model_performance.accuracy.value %} {% elif model_performance.accuracy.accuracy is defined %} {% set acc = model_performance.accuracy.accuracy %} {% endif %} {% else %} {% set acc = model_performance.accuracy %} {% endif %} {% endif %}
  • Accuracy: {{ "%.1f%%" | format(acc * 100) if acc else "N/A" }}
  • Overall Assessment: {{ 'Excellent' if (acc and acc > 0.9) else ('Good' if (acc and acc > 0.8) else ('Fair' if (acc and acc > 0.6) else 'Needs Improvement')) if acc else 'Not Available' }}
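
The overall assessment above is a threshold mapping on accuracy. A small Python sketch of the same cut-offs (0.9, 0.8, 0.6); the function name `assess_accuracy` is hypothetical, and as a simplifying assumption the sketch treats only a missing value as unavailable:

```python
def assess_accuracy(acc):
    """Map an accuracy in [0, 1] to a qualitative label."""
    if acc is None:
        return "Not Available"
    if acc > 0.9:
        return "Excellent"
    if acc > 0.8:
        return "Good"
    if acc > 0.6:
        return "Fair"
    return "Needs Improvement"


labels = [assess_accuracy(a) for a in (0.95, 0.85, 0.70, 0.50, None)]
```

The comparisons are strict, so an accuracy of exactly 0.9 is rated "Good" rather than "Excellent".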

Ethical Considerations

  • Fairness: {{ "Bias detected in " + (bias_detected | unique | join(", ")) if bias_detected else "No significant bias detected in protected attributes" }}
  • Privacy: Model operates on aggregated data without individual identification
  • Transparency: SHAP explanations provide interpretable feature importance
  • Accountability: Full audit trail maintained with reproducible seeds

Limitations & Risks

  • Model performance may degrade on data from different distributions than training data
  • Protected attribute fairness assessed at time of audit; ongoing monitoring recommended
{% if preprocessing_info and preprocessing_info.mode == 'auto' %}
  • CRITICAL: Auto preprocessing mode used - not suitable for production deployment
{% endif %}
  • Model interpretability relies on SHAP approximations which may not capture all nonlinear interactions
  • Regulatory compliance assessment is not legal advice; consult with legal counsel before deployment

Maintenance & Monitoring

  • Model should be re-audited periodically or when significant data distribution changes occur
  • Monitor for data drift, concept drift, and fairness degradation in production
  • Update preprocessing artifacts when training data is refreshed
  • Maintain version control for models, preprocessing artifacts, and audit reports

{% set base_num = 14 %} {% if not preprocessing_info %}{% set base_num = base_num - 1 %}{% endif %} {% if not dataset_bias %}{% set base_num = base_num - 1 %}{% endif %} {% if not calibration_ci %}{% set base_num = base_num - 1 %}{% endif %} {% if not intersectional_fairness %}{% set base_num = base_num - 1 %}{% endif %} {% if not individual_fairness %}{% set base_num = base_num - 1 %}{% endif %} {% if not (stability_analysis and stability_analysis.robustness_score is defined) %}{% set base_num = base_num - 1 %}{% endif %} {{ base_num }}. Glossary

Key terms and concepts used in this audit report, defined for non-technical reviewers.

Accuracy
The proportion of correct predictions out of all predictions made. A model with 85% accuracy makes correct predictions 85% of the time.
Artifact (Preprocessing)
A saved file containing the exact preprocessing transformations (imputation, scaling, encoding) used on training data. Ensures audit data is transformed identically to production data.
Bias (Statistical)
Systematic unfairness in model predictions across different demographic groups. Detected through fairness metrics that compare outcomes between protected groups.
Bootstrap Confidence Interval
A statistical method for estimating uncertainty by repeatedly resampling the data with replacement. A 95% CI is built so that, across repeated samples, about 95% of such intervals would contain the true value.
Calibration (Model)
The degree to which predicted probabilities match observed outcomes. A well-calibrated model predicting 70% confidence should be correct approximately 70% of the time.
Consistency Score
A measure of individual fairness indicating whether similar individuals receive similar predictions. Score of 1.0 means perfect consistency; lower scores suggest disparate treatment.
Demographic Parity
A fairness metric requiring that positive predictions occur at equal rates across all demographic groups. For example, loan approval rates should be similar across all age groups.
Equal Opportunity
A fairness metric requiring that qualified individuals have equal chances of positive outcomes regardless of protected attributes. Focuses on true positive rates.
Expected Calibration Error (ECE)
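A metric quantifying calibration quality. Predictions are grouped into bins by predicted probability, and ECE is the sample-weighted average gap between each bin's mean predicted probability and its observed outcome rate. Lower ECE indicates better calibration (as a rule of thumb in this report, ECE < 0.05 is excellent and 0.05-0.10 is acceptable).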
Feature
An input variable used by the model to make predictions. Features can be demographic information, financial data, behavioral patterns, etc.
Feature Importance
A measure of how much each feature contributes to model predictions. Higher importance means the feature has greater influence on outcomes.
Hash (Cryptographic)
A fixed-length digital fingerprint computed from a file or data. Any change to the file produces, for all practical purposes, a completely different hash, enabling verification that files haven't been tampered with.
Imputation
The process of filling in missing values in data. Common strategies include using the median, mean, or most frequent value from training data.
Individual Fairness
A fairness principle requiring that similar individuals receive similar treatment. Complements group fairness by catching disparate treatment of specific people, a critical legal concern under anti-discrimination law.
Intersectional Fairness
Analysis of fairness at the intersection of multiple protected attributes (e.g., race × gender). Bias hidden in overall metrics can emerge at intersections; for example, a model fair for women overall may discriminate against Black women specifically.
Model Card
Standardized documentation describing a model's intended use, limitations, training data, and ethical considerations. Promotes transparency and responsible AI deployment.
OneHotEncoder
A transformation that converts categorical variables (like "red", "blue", "green") into binary indicator columns (is_red: yes/no, is_blue: yes/no, is_green: yes/no).
Protected Attribute
A characteristic protected by anti-discrimination laws, such as age, gender, race, or disability status. Models must be evaluated for fair treatment across these attributes.
Proxy Feature
A feature that strongly correlates with a protected attribute, enabling indirect discrimination. For example, ZIP code may serve as a proxy for race. High proxy correlations (|r| > 0.7) require investigation.
Reproducibility
The ability to generate identical results when running the same analysis multiple times. Critical for regulatory audits to verify that results haven't been cherry-picked.
Robustness Score
A measure of model stability under small input perturbations. Quantifies the maximum prediction change when features are perturbed with noise. Low scores indicate stable, reliable models; high scores suggest fragility or manipulation vulnerabilities.
Scaling (Feature)
Transforming numerical features to a common range or distribution. StandardScaler removes the mean and scales to unit variance, so that features measured on different scales contribute comparably to the model.
SHAP (SHapley Additive exPlanations)
A method for explaining individual predictions by quantifying each feature's contribution. It is based on Shapley values from cooperative game theory, which give the attributions a sound theoretical footing.
Strict Mode
An audit configuration that enforces additional regulatory requirements, such as explicit random seeds, production preprocessing artifacts, and comprehensive documentation.
Unknown Category
A categorical value in audit data that was not seen when the preprocessing artifact was fitted. High unknown rates may indicate a data distribution shift between the training data and the audit data.
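
To make the Expected Calibration Error entry above concrete for technical readers, here is a small Python sketch of the common equal-width binning estimate for binary predictions. The bin count of 10 and the function name `expected_calibration_error` are assumptions for illustration; the audit may use a different binning scheme.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between observed positive rate and mean predicted probability."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(positive_rate - mean_prob)
    return ece


# Toy data (made up): two confident negatives, three confident positives, one wrong.
probs = [0.1, 0.1, 0.9, 0.9, 0.9]
labels = [0, 0, 1, 1, 0]
ece = expected_calibration_error(probs, labels)
```

On this toy data the low bin contributes 0.4 × 0.1 and the high bin 0.6 × (0.9 − 2/3), for an ECE of about 0.18, which falls well outside the "acceptable" range quoted in the glossary.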