Cohort Quality Control Report

Cohort Summary
{% for key, value in summary.items() %}{% if key != "Samples by Covariate Combination" %}
  • {{ key }}: {{ value }}
{% endif %}{% endfor %}
{% if samples_by_comb %}

Samples by Covariate Combination

{% for cov in covariates %}{{ cov }} | {% endfor %}Number of Samples
{% for row in samples_by_comb %}
{% for cov in covariates %}{{ row[cov] }} | {% endfor %}{{ row["Number of Samples"] }}
{% endfor %}
{% endif %}
PCA Analysis and Variance Explained

Principal Component Analysis (PCA) identifies the main sources of variation in the data. We perform PCA before and after batch effect correction to check whether batch effects dominate the first few principal components (PCs). The variance explained by each PC is the share of the data's total variation captured by that PC. After correction, the variance should ideally be more evenly distributed across PCs.
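
As a rough illustration of "variance explained", the sketch below turns per-PC variances into fractions of the total. It assumes the per-PC variances (the eigenvalues of the covariance matrix) are already available from whichever PCA implementation is used; the function names and the numbers are hypothetical and chosen only for illustration.

```python
def explained_variance_ratio(pc_variances):
    """Return the fraction of total variance captured by each PC."""
    total = sum(pc_variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return [v / total for v in pc_variances]


def cumulative(ratios):
    """Running total of the explained-variance fractions."""
    out, running = [], 0.0
    for r in ratios:
        running += r
        out.append(running)
    return out


# Hypothetical per-PC variances, made up for illustration only.
before = [8.0, 1.0, 0.5, 0.5]  # PC1 dominates, as a strong batch effect might
after = [4.0, 3.0, 2.0, 1.0]   # variance spread more evenly across PCs

ratios_before = explained_variance_ratio(before)
ratios_after = explained_variance_ratio(after)

print("PC1 share before:", ratios_before[0])
print("PC1 share after: ", ratios_after[0])
```

In this toy case the share of PC1 drops from 0.8 to 0.4, which is the kind of flattening the plots below are meant to reveal.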

PCA Plots (PC1 vs PC2) Before and After Batch Correction

{% for plot in pca_plots %}
{% if plot.before %}

PCA Plot Before Correction - Colored by {{ plot.covariate }}

[Figure: PCA before correction, colored by {{ plot.covariate }}]
{% else %}

Data before correction not available for {{ plot.covariate }}

{% endif %}

PCA Plot After Correction - Colored by {{ plot.covariate }}

[Figure: PCA after correction, colored by {{ plot.covariate }}]
{% endfor %}

PC Variance Explained Before and After Batch Correction

These plots show the variance explained by each principal component before and after batch effect correction. A smaller share of variance in the first few PCs after correction suggests that the batch effect was reduced.

{% if pca_variance_before %}

Variance explained by PCs before correction

[Figure: PCA variance explained before correction]
{% endif %}

Variance explained by PCs after correction

[Figure: PCA variance explained after correction]

Association between PCs and Clinical Annotations

Interpretation

- High correlation between PCs and batch information before correction indicates batch effects. Lower correlation after correction suggests successful batch effect removal.

- Even after batch correction, a strong association between datasets and PCs may persist, especially in the following cases:

  • Diverse cohort: In a highly diverse cohort, the first few PCs may capture that diversity (biological variation, differences in sample types, or varying conditions across datasets), producing a natural association between PCs and datasets.
  • Specific sample types per dataset: If datasets consist of biologically distinct samples (e.g., tumor vs normal), the PCs may reflect those inherent differences. The association between datasets and PCs is then due to the underlying biology captured by PCA, not to batch effects.

In these scenarios, the observed associations are expected and reflect meaningful biological or experimental differences rather than technical artifacts.

{% if association_matrix_before %}

Before Batch Effect Correction:

Results format: test statistic, p-value, test performed, and number of samples used.

{{ association_matrix_before | safe }} {% endif %}

After Batch Effect Correction:

{{ association_matrix_after | safe }}
Correction Effect Metric

This metric quantifies the impact of the batch effect correction on the variability of gene expression data. The metric is calculated as follows:

  • Step 1: Calculate the Median Absolute Deviation (MAD) for each gene before correction: compute the gene's median expression across all samples, take the absolute deviation of each sample from that median, then take the median of those deviations. {% if mad_before %}
    • Median absolute deviation before correction = {{ mad_before|format_float(4) }}
    {% else %}
    • Data before correction not available
    {% endif %}
  • Step 2: Repeat the same process to calculate the MAD for each gene after the batch effect correction.
    • Median absolute deviation after correction = {{ mad_after|format_float(4) }}
  • Step 3: Compute the correction effect metric as the ratio of the mean MAD after correction to the mean MAD before correction. {% if mad_before %}
    • Correction effect metric = {{ effect_metric|format_float(4) }}
    {% else %}
    • Data before correction not available
    {% endif %}
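
The three steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not necessarily how the report's values are computed; the data layout (a dict mapping each gene to its expression values across samples), the function names, and the toy numbers are all assumptions.

```python
from statistics import mean, median


def gene_mad(values):
    """Median absolute deviation of one gene across samples."""
    med = median(values)
    return median(abs(v - med) for v in values)


def mean_mad(expression):
    """Mean of the per-gene MADs; expression maps gene -> list of values."""
    return mean(gene_mad(vals) for vals in expression.values())


def correction_effect_metric(before, after):
    """Ratio of mean MAD after correction to mean MAD before correction."""
    mad_before = mean_mad(before)
    if mad_before == 0:
        raise ValueError("mean MAD before correction is zero; ratio undefined")
    return mean_mad(after) / mad_before


# Toy expression values, made up for illustration only.
before = {
    "geneA": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "geneB": [5.0, 5.0, 6.0, 9.0, 9.0, 10.0],
}
after = {
    "geneA": [5.0, 6.0, 7.0, 6.0, 7.0, 8.0],
    "geneB": [7.0, 7.0, 8.0, 7.0, 7.0, 8.0],
}

metric = correction_effect_metric(before, after)
print(f"mean MAD before = {mean_mad(before):.4f}")  # 3.2500
print(f"mean MAD after  = {mean_mad(after):.4f}")   # 0.2500
print(f"effect metric   = {metric:.4f}")            # 0.0769
```

Using the median at the gene level keeps each per-gene estimate robust to outlier samples, while the mean across genes summarizes the cohort in a single number.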

Interpretation:

This ratio helps quantify how much variability remains after correction.

  • A ratio close to 1: the correction changed overall variability little, so it may not have reduced batch effects much.
  • A ratio significantly less than 1: the correction lowered variability, which is consistent with a reduced batch effect. A ratio around 0.5 suggests a strong correction. This reduction may reflect effective correction, but it also raises the concern that biological variability was removed along with the batch effects.
  • A ratio much higher than 1: the correction may have introduced additional variability. This points to a problematic correction that may have distorted important biological signals.
  • Dependence on the number of batches: the metric can depend on the number of datasets or batches. With more datasets, batch effects might be stronger and require more aggressive correction, potentially leading to a lower ratio. With fewer datasets, the ratio might stay closer to 1 because batch effects contribute less variability.

Sample Distribution by Covariate Combination

The following boxplots compare gene expression across datasets (batches) before and after correction for different covariate combinations.

  • Before Correction: Look for variability across batches, which may indicate batch effects.
  • After Correction: Ideally, the distributions should become more consistent across batches, suggesting effective correction.

These plots help assess whether the correction process has successfully reduced batch-related variability without masking important biological differences.

{% for plot in distribution_plots %}

Distribution for covariate combination: {{ plot.combination }}

[Figure: distribution plot for {{ plot.combination }}]
{% endfor %}

Total Sample Distribution

[Figure: global distribution]
Silhouette Score

The Silhouette Score measures how similar each sample is to its own batch compared to other batches. It ranges from -1 to 1. A high score before correction indicates that samples cluster by batch, that is, strong batch effects. A lower score after correction means these effects were reduced.
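
For readers who want to see the computation, here is a minimal pure-Python sketch of the standard silhouette definition with batch labels used as the clusters. Euclidean distance, the function name, and the toy coordinates are assumptions for illustration; the report itself may use a library implementation and a different feature space (for example, the leading PCs).

```python
from math import dist
from statistics import mean


def silhouette_score(points, batches):
    """Mean silhouette over all samples, treating each batch as a cluster."""
    labels = set(batches)
    if len(labels) < 2:
        raise ValueError("at least two batches are required")
    scores = []
    for i, (p, own) in enumerate(zip(points, batches)):
        # a: mean distance to the other samples of the same batch
        same = [dist(p, q) for j, (q, lab) in enumerate(zip(points, batches))
                if lab == own and j != i]
        if not same:  # a batch with a single sample scores 0 by convention
            scores.append(0.0)
            continue
        a = mean(same)
        # b: smallest mean distance to the samples of any other batch
        b = min(
            mean(dist(p, q) for q, lab in zip(points, batches) if lab == other)
            for other in labels if other != own
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Toy 2-D coordinates, made up for illustration only.
batches = ["A", "A", "B", "B"]
points_before = [(0, 0), (0, 1), (10, 0), (10, 1)]  # batches far apart
points_after = [(0, 0), (1, 1), (0, 1), (1, 0)]     # batches interleaved

score_before = silhouette_score(points_before, batches)
score_after = silhouette_score(points_after, batches)
print(f"before: {score_before:.4f}, after: {score_after:.4f}")
```

In this toy example the score is close to 1 when the two batches form separate clusters and drops below 0 once their samples are interleaved.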

  • {% if silhouette_before is not none %} Silhouette Score Before Correction: {{ silhouette_before|format_float(4) }} {% else %} Data before correction not available. {% endif %}
  • Silhouette Score After Correction: {{ silhouette_after|format_float(4) }}

Interpretation:

- A decrease in the Silhouette Score after correction suggests successful batch effect mitigation.

Entropy of Batch Mixing (EBM)

The Entropy of Batch Mixing (EBM) measures how well samples from different batches are mixed after correction. Higher entropy indicates better mixing, meaning batch effects have been reduced.
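
One common way to compute a mixing entropy is to look at the batch labels among each sample's nearest neighbours. The sketch below follows that idea; the neighbourhood size `k`, the Euclidean distance, the averaging over all samples, and the function name are assumptions, and the report's own EBM computation may differ in these details.

```python
from collections import Counter
from math import dist, log
from statistics import mean


def batch_mixing_entropy(points, batches, k=3):
    """Mean Shannon entropy of batch labels among each sample's k neighbours."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")
    entropies = []
    for i, p in enumerate(points):
        # indices of the k nearest other samples
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        counts = Counter(batches[j] for j in neighbours)
        entropies.append(sum(-(c / k) * log(c / k) for c in counts.values()))
    return mean(entropies)


# Toy 2-D coordinates, made up for illustration only.
batches = ["A", "A", "A", "A", "B", "B", "B", "B"]
points_before = [(0, 0), (0, 1), (1, 0), (1, 1),       # batch A in one cluster
                 (10, 0), (10, 1), (11, 0), (11, 1)]   # batch B in another
points_after = [(0, 0), (1, 1), (10, 0), (11, 1),      # A spread over both
                (0, 1), (1, 0), (10, 1), (11, 0)]      # B spread over both

entropy_before = batch_mixing_entropy(points_before, batches, k=3)
entropy_after = batch_mixing_entropy(points_after, batches, k=3)
print(f"before: {entropy_before:.4f}, after: {entropy_after:.4f}")
```

When every neighbourhood contains a single batch the entropy is 0; it grows as neighbourhoods contain a more balanced mix of batches.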

  • {% if entropy_before is not none %} Entropy Before Correction: {{ entropy_before|format_float(4) }} {% else %} Data before correction not available. {% endif %}
  • Entropy After Correction: {{ entropy_after|format_float(4) }}

Interpretation:

- An increase in entropy after correction indicates improved mixing of batches, suggesting successful batch effect correction.

Mixed Dataset Summary Report

Confidence in Batch Effect Correction regarding Mixed Datasets and Samples

  • Total Mixed Datasets: {{ summary_mixed["total_mixed_datasets"] }}
  • Total Non-Mixed Datasets: {{ summary_mixed["total_non_mixed_datasets"] }}
  • Total Mixed Samples: {{ summary_mixed["total_mixed_samples"] }}
  • Total Non-Mixed Samples: {{ summary_mixed["total_non_mixed_samples"] }}

Proportion of mixed datasets in the cohort: {{ mixed_dataset_ratio|percentage }}

Proportion of mixed samples in the cohort: {{ mixed_sample_ratio|percentage }}
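
The counts and proportions above can be derived from a per-sample table. The sketch below assumes that a dataset counts as "mixed" when it contains samples from more than one covariate combination; that definition, the input layout, and the dataset and covariate names are assumptions for illustration. The dictionary keys mirror the names used in this report.

```python
from collections import defaultdict


def mixed_summary(samples):
    """samples: iterable of (dataset, covariate_combination), one per sample."""
    combos = defaultdict(set)
    sizes = defaultdict(int)
    for dataset, combination in samples:
        combos[dataset].add(combination)
        sizes[dataset] += 1
    if not sizes:
        raise ValueError("no samples provided")
    # Assumption: a dataset is mixed if it spans several covariate combinations.
    mixed = {d for d, c in combos.items() if len(c) > 1}
    total_samples = sum(sizes.values())
    mixed_samples = sum(sizes[d] for d in mixed)
    return {
        "total_mixed_datasets": len(mixed),
        "total_non_mixed_datasets": len(combos) - len(mixed),
        "total_mixed_samples": mixed_samples,
        "total_non_mixed_samples": total_samples - mixed_samples,
        "mixed_dataset_ratio": len(mixed) / len(combos),
        "mixed_sample_ratio": mixed_samples / total_samples,
    }


# Hypothetical sample table, made up for illustration only.
samples = [
    ("dataset_1", "tumor"), ("dataset_1", "normal"), ("dataset_1", "tumor"),
    ("dataset_2", "tumor"), ("dataset_2", "tumor"),
    ("dataset_3", "normal"),
]
summary = mixed_summary(samples)
print(summary)
```

Here only `dataset_1` is mixed, so one of three datasets and three of six samples are counted as mixed.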

Mixed Samples by Covariate

Mixed Samples

{% for cov, count in summary_mixed["mixed_samples_by_covariate"].items() %}
  • {{ cov }}: {{ count }} samples
{% endfor %}

Non-Mixed Samples

{% for cov, count in summary_mixed["non_mixed_samples_by_covariate"].items() %}
  • {{ cov }}: {{ count }} samples
{% endfor %}

Overall Proportion of Mixed Samples by Covariate Combination

{% for cov in covariate_names %}{{ cov }} | {% endfor %}Proportion of Mixed Samples
{% for row in mixed_proportions_table %}
{% for cov in covariate_names %}{{ row["cov" ~ loop.index] }} | {% endfor %}{{ row.proportion | percentage }}
{% endfor %}

Interpretation

Very High Confidence: >50% mixed datasets and >50% mixed samples

Confidence is very high in the batch effect correction due to a substantial proportion of mixed datasets and samples. This suggests that the correction algorithm was applied across a highly diverse set of conditions, minimizing the risk that batch effects confound the biological signals. The variability across different conditions was well-represented, leading to more reliable results.

High Confidence: 30-50% mixed datasets or 30-50% mixed samples

Confidence is high in the batch effect correction because a sizeable proportion of the datasets or samples is mixed. This indicates that the correction algorithm was applied across a diverse range of conditions, reducing the likelihood that batch effects are confounded with biological signals. The variability across different conditions was well represented during the correction, leading to more reliable and robust results.

Moderate Confidence: 15-30% mixed datasets or 15-30% mixed samples

Confidence is moderate in the batch effect correction. There is a reasonable proportion of mixed datasets and samples, suggesting that the correction was performed on a fairly diverse dataset. However, there's still a possibility that some batch effects might not have been fully corrected if certain covariate combinations were underrepresented. While the results are likely to be reliable, some caution is advised in interpreting the findings.

Low Confidence: <15% mixed datasets and <15% mixed samples

Confidence is low in the batch effect correction. The mixed datasets and samples form a small proportion of the cohort, which indicates that the correction may have been applied under limited conditions. This can lead to insufficient representation of the variability across different conditions, increasing the risk that batch effects may still confound the biological signals. In such cases, the reliability of the corrected data could be compromised, and further validation might be necessary.
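
The four tiers can be expressed as a small decision function. The thresholds come from the headings above, but the tiers as written leave some cases open (for example, more than 50% mixed datasets with only 20% mixed samples). The sketch resolves them by checking tiers from highest to lowest, which is an assumption, as is the function name.

```python
def confidence_level(mixed_dataset_ratio, mixed_sample_ratio):
    """Map the two mixed proportions (0..1) to a confidence tier."""
    d, s = mixed_dataset_ratio, mixed_sample_ratio
    if d > 0.5 and s > 0.5:
        return "Very High"
    # Assumption: one proportion at or above 30% is enough for "High".
    if d >= 0.3 or s >= 0.3:
        return "High"
    if d >= 0.15 or s >= 0.15:
        return "Moderate"
    return "Low"  # reached only when both proportions are below 15%


print(confidence_level(0.6, 0.7))    # Very High
print(confidence_level(0.4, 0.1))    # High
print(confidence_level(0.2, 0.05))   # Moderate
print(confidence_level(0.1, 0.1))    # Low
```

Checking from the top tier down guarantees that each pair of proportions maps to exactly one tier, even where the written ranges overlap.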

Detailed Summary for Each Mixed Dataset

{% for dataset, details in summary_mixed["mixed_dataset_details"].items() %}

Dataset: {{ dataset }}

  • Total Samples: {{ details["total_samples"] }}
  • Samples by Covariate Combination:
      {% for comb, count in details["samples_by_covariate_combination"].items() %}
    • {{ comb }}: {{ count }} samples ({{ details["proportion_by_covariate_combination"][comb]|percentage }})
      {% endfor %}
{% endfor %}