Accuracy Report

Synthetic Data Quality

Median accuracy: {{accuracy_value}}

Mean cluster homogeneity: {{clustering_value}}

Bad: < 0.7
Low: 0.7 - 0.8
Medium: 0.8 - 0.9
High: > 0.9

The accuracy heatmap is calculated as follows: for each pair of columns in the original dataset, the similarity between them is measured as the Jensen-Shannon distance. The same operation is performed on the same pair of columns in the synthetic dataset. The resulting score is calculated as 1 - abs(original_score - synthetic_score), so a score close to 1 means the synthetic data preserves the relationship between the two columns.
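The score for a single pair of columns can be sketched as follows. This is a minimal illustration, not the report's actual implementation: it assumes each column is treated as a discrete distribution over the union of the values seen in both columns, and the names `js_distance` and `pair_accuracy` are hypothetical.

```python
import math
from collections import Counter


def js_distance(col_a, col_b):
    """Jensen-Shannon distance with base-2 logs, so the result lies in [0, 1].

    Assumption: both columns are discrete and share one value space.
    """
    support = list(set(col_a) | set(col_b))
    count_a, count_b = Counter(col_a), Counter(col_b)
    p = [count_a[v] / len(col_a) for v in support]
    q = [count_b[v] / len(col_b) for v in support]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def pair_accuracy(orig_a, orig_b, synth_a, synth_b):
    """1 - abs(original_score - synthetic_score) for one pair of columns."""
    original_score = js_distance(orig_a, orig_b)
    synthetic_score = js_distance(synth_a, synth_b)
    return 1 - abs(original_score - synthetic_score)
```

Repeating `pair_accuracy` for every pair of columns fills one cell of the heatmap per pair.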

Content

Correlations
Clustering
Utility
Univariate distributions
Bivariate distributions

Correlations

GOOD
BAD

The correlation metric is calculated as follows: for each pair of numeric columns in the original dataset, the pairwise correlation is measured, which yields a correlation matrix. The same operation is performed on the synthetic dataset. The resulting matrix is the difference between the original correlation matrix and the synthetic one.
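A minimal sketch of this difference matrix is shown below. The text does not name the correlation coefficient, so Pearson correlation is an assumption here, as are the function names and the dict-of-lists input format. The difference is kept signed, as the description states.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every (column, column) pair to its correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


def correlation_difference(original, synthetic):
    """Original correlation matrix minus the synthetic one, cell by cell."""
    corr_o = correlation_matrix(original)
    corr_s = correlation_matrix(synthetic)
    return {pair: corr_o[pair] - corr_s[pair] for pair in corr_o}
```

Cells near 0 indicate that the synthetic data reproduces the original correlation for that pair of columns.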

Clustering metric

For the clustering metric, the original and synthetic datasets are concatenated, and K-means clustering is performed on the concatenated dataset. The optimal number of clusters is chosen using the elbow rule. For good synthetic data, the proportion of original to synthetic records in each cluster should be close to 1:1. The mean cluster homogeneity is calculated as the mean of the ratios of original to synthetic records across clusters.

Mean cluster homogeneity: {{clustering_value}}
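The homogeneity step can be sketched as below, taking the cluster labels as given (for example from K-means run on the concatenated data). The text does not say how the ratio is oriented, so this sketch assumes the smaller count is divided by the larger one, which keeps every ratio in [0, 1] with 1 meaning a perfect 1:1 mix. The function name is hypothetical.

```python
from collections import Counter


def mean_cluster_homogeneity(cluster_labels, is_synthetic):
    """Mean over clusters of min(original, synthetic) / max(original, synthetic).

    cluster_labels: cluster id for each record of the concatenated dataset.
    is_synthetic:   True for records that came from the synthetic dataset.
    """
    original, synthetic = Counter(), Counter()
    for label, synth in zip(cluster_labels, is_synthetic):
        (synthetic if synth else original)[label] += 1

    ratios = []
    for label in set(cluster_labels):
        smaller, larger = sorted((original[label], synthetic[label]))
        # larger >= 1 because every label occurs at least once.
        ratios.append(smaller / larger)
    return sum(ratios) / len(ratios)


labels = [0, 0, 0, 0, 1, 1, 1]
synthetic_flags = [False, False, True, True, False, True, True]
print(mean_cluster_homogeneity(labels, synthetic_flags))  # 0.75
```

In the example, cluster 0 holds two original and two synthetic records (ratio 1.0) and cluster 1 holds one original and two synthetic records (ratio 0.5), giving a mean of 0.75.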

Utility metric

For the utility calculation, an automatic modeling process is used. For every binary, categorical and continuous column, it builds a model using that column as the target and the other columns as predictors. The best target in terms of accuracy is then chosen for each modeling task. For each of these best targets, a model is trained on the original dataset and then tested on the synthetic one. For good synthetic data, the model's score on the synthetic data should be close to its score on the original data.

{{table_name}}

{{utility_table}}
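The train-on-original, test-on-synthetic comparison can be illustrated with a toy example. The nearest-centroid classifier below is only a stand-in chosen to keep the sketch self-contained. It is not the model used by the report, and all function names are hypothetical. For brevity the original score is computed on the training rows, whereas a real pipeline would use a held-out split.

```python
def fit_centroids(rows, labels):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Predict the class whose centroid is closest to the row."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def utility_scores(orig_X, orig_y, synth_X, synth_y):
    """Train on the original data, then score on original and synthetic data."""
    model = fit_centroids(orig_X, orig_y)
    return accuracy(model, orig_X, orig_y), accuracy(model, synth_X, synth_y)
```

The closer the two returned scores are, the better the synthetic data preserves the predictive structure of the original.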

Univariate distributions

The univariate distributions are shown either as overlaid density plots of original vs synthetic values for continuous columns, or as bar plots of original vs synthetic counts (in percent) for categorical columns.
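For a categorical column, the percentages behind such a bar plot could be computed as in this sketch. The function names are hypothetical and plotting is left out.

```python
from collections import Counter


def percent_counts(values):
    """Share of each level in a categorical column, in percent."""
    counts = Counter(values)
    return {level: 100 * n / len(values) for level, n in counts.items()}


def compare_percentages(original, synthetic):
    """Pair the original and synthetic percentages for every level."""
    pct_o, pct_s = percent_counts(original), percent_counts(synthetic)
    levels = sorted(set(pct_o) | set(pct_s))
    return {lvl: (pct_o.get(lvl, 0.0), pct_s.get(lvl, 0.0)) for lvl in levels}


print(compare_percentages(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
# {'a': (50.0, 25.0), 'b': (50.0, 75.0)}
```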

{% for col, uni_img in uni_imgs.items() -%}

Column Name: {{col}}

{% endfor %}

Bivariate distributions

Bivariate distributions are shown as two-dimensional histograms. For each pair of columns, every combination of their value levels is laid out as a grid, and each cell holds the number of records that belong to both levels simultaneously. For instance, for a column with the values "Monday", "Tuesday" and "Sunday" and a column with the values "Working", "Museum visiting" and "Walking", the count of each activity on each day is calculated and plotted in the corresponding cell. For good synthetic data, the pattern of the bivariate distribution should match that of the original data.
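The grid of counts from the day and activity example can be sketched as below. The function name and the sample records are illustrative only.

```python
from collections import Counter


def bivariate_counts(col_a, col_b):
    """Count records for every combination of levels of two columns.

    Returns the row levels, the column levels and the grid of counts.
    """
    pair_counts = Counter(zip(col_a, col_b))
    rows = sorted(set(col_a))
    cols = sorted(set(col_b))
    grid = [[pair_counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, grid


days = ["Monday", "Monday", "Tuesday", "Sunday", "Sunday"]
activities = ["Working", "Working", "Working", "Walking", "Museum visiting"]
rows, cols, grid = bivariate_counts(days, activities)
print(rows)  # ['Monday', 'Sunday', 'Tuesday']
print(cols)  # ['Museum visiting', 'Walking', 'Working']
print(grid)  # [[0, 0, 2], [1, 1, 0], [0, 0, 1]]
```

Computing the same grid for the synthetic columns and comparing the two cell by cell shows whether the joint pattern is preserved.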

{% for title, bi_img in bi_imgs.items() -%}

Original: {{title}}

Synthetic: {{title}}

{% endfor %}