Median accuracy: {{accuracy_value}}
The accuracy heatmap is calculated in the following way: for each pair of original columns, a similarity between them (the Jensen-Shannon distance) is measured. The same operation is performed on the same pair of columns in the synthetic dataset. The resulting score is calculated as 1 - abs(original_score - synthetic_score).
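The cell score above can be sketched as follows. This is a minimal illustration, not the report's exact implementation: it assumes numeric columns, histograms them on a shared grid, and computes the base-2 Jensen-Shannon distance by hand; the function names (`js_distance`, `pair_accuracy`) and the bin count are illustrative.

```python
import numpy as np

def js_distance(p, q):
    # Jensen-Shannon distance (base 2) between two discrete distributions.
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(x, y):
        mask = x > 0  # 0 * log(0) terms contribute nothing
        return np.sum(x[mask] * np.log2(x[mask] / y[mask]))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def pair_accuracy(orig_a, orig_b, synth_a, synth_b, bins=20):
    # One heatmap cell: 1 - |JS(original pair) - JS(synthetic pair)|,
    # with all four columns binned on a shared range.
    lo = min(np.min(orig_a), np.min(orig_b), np.min(synth_a), np.min(synth_b))
    hi = max(np.max(orig_a), np.max(orig_b), np.max(synth_a), np.max(synth_b))
    hist = lambda x: np.histogram(x, bins=bins, range=(lo, hi))[0].astype(float)
    orig_score = js_distance(hist(orig_a), hist(orig_b))
    synth_score = js_distance(hist(synth_a), hist(synth_b))
    return 1 - abs(orig_score - synth_score)
```

When the synthetic pair reproduces the original pair's relationship exactly, the two distances coincide and the cell score is 1.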
Median of differences of correlations {% if correlation_median < 0.0001 %} is less than 0.0001 {% else %} : {{round(correlation_median, 4)}} {% endif %}
The correlation metric is calculated in the following manner: for each pair of numeric and categorical columns in the original dataset, a pairwise correlation is measured, resulting in a correlation matrix. The same operation is performed for the synthetic dataset. The resulting matrix is the difference between the original correlation matrix and the synthetic one. {% if correlation_median != correlation_median %} NaN values, shown in gray, indicate that correlations could be computed in one dataset but not in the other. This suggests significant differences in data characteristics between the original and synthetic datasets. {% endif %}
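A minimal sketch of the difference matrix and its median, assuming numeric columns and using Pearson correlation as a stand-in for whichever pairwise measure the report applies to mixed column types; the function names are hypothetical.

```python
import numpy as np
import pandas as pd

def correlation_difference(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    # Element-wise difference between the two pairwise correlation matrices.
    return original.corr(numeric_only=True) - synthetic.corr(numeric_only=True)

def median_correlation_difference(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    # Median of the absolute off-diagonal differences (the diagonal is always 0).
    vals = correlation_difference(original, synthetic).abs().to_numpy()
    off_diag = ~np.eye(len(vals), dtype=bool)
    return float(np.nanmedian(vals[off_diag]))
```

`np.nanmedian` skips cells where a correlation could be computed in only one dataset, mirroring the NaN handling described above.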
For the clustering metric, the original and synthetic datasets are concatenated, and K-means clustering is performed on the combined dataset. The optimal number of clusters is chosen using the Davies-Bouldin index. For good synthetic data, the proportion of original to synthetic records in each cluster should be close to 1:1. The mean cluster homogeneity is calculated as the mean of the ratios of original to synthetic records across clusters.
Mean cluster homogeneity: {{clustering_value}}
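The procedure above can be sketched with scikit-learn. This is an illustrative reimplementation, not the report's code: the candidate range of k and the minority/majority form of the per-cluster ratio (which makes a perfect 1:1 mix score 1.0) are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def mean_cluster_homogeneity(original, synthetic, k_candidates=range(2, 6), seed=0):
    # Concatenate both datasets and remember which rows are original.
    data = np.vstack([original, synthetic])
    is_original = np.array([True] * len(original) + [False] * len(synthetic))

    # Pick the number of clusters by the Davies-Bouldin index (lower is better).
    best_labels, best_db = None, np.inf
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
        db = davies_bouldin_score(data, labels)
        if db < best_db:
            best_labels, best_db = labels, db

    # Per cluster: minority share over majority share, so 1.0 means a 1:1 mix.
    ratios = []
    for c in np.unique(best_labels):
        mask = best_labels == c
        n_orig = int(is_original[mask].sum())
        n_synth = int((~is_original[mask]).sum())
        if n_orig == 0 or n_synth == 0:
            ratios.append(0.0)  # cluster contains only one of the two datasets
        else:
            ratios.append(min(n_orig, n_synth) / max(n_orig, n_synth))
    return float(np.mean(ratios))
```

When original and synthetic records are drawn from the same distribution, every cluster contains roughly equal shares of both, and the score approaches 1.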
For the utility calculation, an automatic modeling process is used. For every binary, categorical, and continuous column, it builds a model using the given column as the target and the other columns as predictors. The best target in terms of accuracy is then chosen for each modeling task. For each of these best targets, a model is trained on the original dataset and tested on the synthetic one. For good synthetic data, the model's score on the generated data should be close to its score on the original data.
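The train-on-original, test-on-synthetic comparison can be sketched as follows. Logistic regression here stands in for whatever model the automatic modeling step actually selects, and `utility_scores` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def utility_scores(X_orig, y_orig, X_synth, y_synth):
    # Train on the original dataset only.
    model = LogisticRegression(max_iter=1000).fit(X_orig, y_orig)
    # Score on both datasets: close values suggest the synthetic data
    # preserves the predictive signal of the original.
    return model.score(X_orig, y_orig), model.score(X_synth, y_synth)
```

The report's utility value summarizes how small the gap between the two scores is across the chosen targets.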
No data available
{% endif %}The univariate distributions are either density plots of overlaid original vs. synthetic values for continuous columns, or bar plots of original vs. synthetic counts (in percent) for categorical columns.
{% for col, uni_img in uni_imgs.items() -%}Bivariate distributions are two-dimensional histograms. For each pair of columns, each combination of their corresponding values is plotted as a matrix, resulting in a grid. Each cell within the grid contains a ratio: the number of records belonging to both values simultaneously divided by the total number of records in the corresponding dataset. For example, consider a column with values 'Monday', 'Tuesday', 'Sunday', and another column with values 'Working', 'Museum visiting', 'Walking'. In this case, the ratio representing the occurrence of each activity on each day is calculated and displayed in the appropriate grid cell. For good synthetic data, the pattern of the bivariate distribution should match the original distribution.