In our most recent release, we made an important change to the way we measure synthetic data quality. From now on, the Quality Report will evaluate whether the synthetic data matches the strong correlations found in the real data. Previously, the report considered all correlations, even weak or non-existent ones. We expect the updated report to more accurately reflect the quality of synthetic data.
Synthetic data should capture important patterns
When we evaluate synthetic data quality, we’re typically asking one key question: Does the synthetic data capture important patterns from the real data?
When looking at pairwise data, we hope the synthetic data captures prominent correlations. The graph below shows an example of high quality vs. low quality synthetic data in these terms. The synthetic data is high quality when it shows the same strong correlation as the real data (left). It's low quality when it is more scattered and doesn't show any correlation (right).
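To make this concrete, here is a minimal, self-contained sketch (with made-up data, not the report's internals) of what high vs. low quality looks like numerically. High-quality synthetic data reproduces the real data's strong Pearson correlation; low-quality synthetic data does not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Real data: two strongly correlated columns.
x = rng.normal(size=1000)
real_y = x + rng.normal(scale=0.3, size=1000)

# High-quality synthetic data preserves the correlation;
# low-quality synthetic data is just scattered noise.
synth_x = rng.normal(size=1000)
good_y = synth_x + rng.normal(scale=0.3, size=1000)
bad_y = rng.normal(size=1000)

real_r = np.corrcoef(x, real_y)[0, 1]
good_r = np.corrcoef(synth_x, good_y)[0, 1]
bad_r = np.corrcoef(synth_x, bad_y)[0, 1]

print(f"real: {real_r:.2f}, good synthetic: {good_r:.2f}, bad synthetic: {bad_r:.2f}")
```

Here the "good" synthetic data lands near the real correlation (~0.96), while the "bad" synthetic data hovers near 0.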
But what if the real data has no correlation to begin with, like the case below? The real data is too scattered to form any distinguishable pattern.
In this case, there are two options for evaluating synthetic data:
- (a) We can consider the synthetic data high quality, since it is not expected to have a pattern, OR
- (b) We should ignore these columns, since there is no significant pattern to measure.
Previously, the report went with option (a), but we've since decided that (b) makes more sense for synthetic data evaluation.
The quality report should not give credit for non-existent patterns
Previously, the quality report would give credit to the synthetic data every time it did not show a pattern. This manifested as high sub-scores that ultimately inflated the overall quality score.
But is it really fair to give that credit? When we looked into the data, we realized a few things:
- Tabular synthesizers do not typically invent correlations on their own. If there is no significant correlation in the real data, it is extremely unlikely that the synthetic data would somehow contain a strong correlation.
- In fact, we’re more concerned with the opposite case: A synthesizer could fail to learn a correlation from the real data. This matters more because we’re counting on synthetic data to match the patterns.
- A typical dataset contains only a few significant correlations. Rather than focus on these, the quality report was being influenced by all non-existent patterns.
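To see how this inflation happens, here's a toy illustration. It assumes, purely for the sake of example, that the trend score is a plain average of per-pair sub-scores; the numbers are made up:

```python
# Illustrative only: assume the overall trend score is a plain average of
# per-pair sub-scores. Weak/no-correlation pairs are easy to "match", so
# they tend to score near 1.0 and pull the average up.
strong_pair_scores = [0.70, 0.65, 0.80]   # the patterns that actually matter
weak_pair_scores = [0.98] * 7             # near-free credit on no-pattern pairs

old_score = sum(strong_pair_scores + weak_pair_scores) / 10
new_score = sum(strong_pair_scores) / 3   # weak pairs are now ignored

print(f"old: {old_score:.2f}, new: {new_score:.2f}")  # old: 0.90, new: 0.72
```

With 7 out of 10 pairs earning near-free credit, the old average hides the fact that the synthesizer captured the real patterns only moderately well.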
So starting from this release, we made the decision to only evaluate strong correlations from the real data. For the mathematically inclined:
- We consider a correlation between two continuous columns strong if the Pearson correlation coefficient is > 0.5 or < -0.5.
- We consider a correlation between two discrete columns strong if the Cramer's V coefficient is > 0.3.
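These checks are straightforward to implement. The sketch below is an illustration of the two thresholds, not SDMetrics' internal code; it classifies a column pair as strong or not, computing Cramer's V from the chi-squared statistic:

```python
import numpy as np

def is_strong_pearson(x, y, threshold=0.5):
    """Continuous-continuous pair: strong if |Pearson r| exceeds the threshold."""
    r = np.corrcoef(x, y)[0, 1]
    return abs(r) > threshold

def cramers_v(col_a, col_b):
    """Cramer's V for two discrete columns, derived from the chi-squared statistic."""
    a_idx = {v: i for i, v in enumerate(sorted(set(col_a)))}
    b_idx = {v: i for i, v in enumerate(sorted(set(col_b)))}
    table = np.zeros((len(a_idx), len(b_idx)))
    for a, b in zip(col_a, col_b):
        table[a_idx[a], b_idx[b]] += 1
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

def is_strong_cramers_v(col_a, col_b, threshold=0.3):
    """Discrete-discrete pair: strong if Cramer's V exceeds the threshold."""
    return cramers_v(col_a, col_b) > threshold
```

For example, two identical categorical columns give a Cramer's V of 1.0 (strong), while two independent ones give a value near 0 (ignored).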
This allows us to focus on the patterns that matter.
Expect your quality scores to be more accurate
As a result of this change, the quality score more accurately reflects whether the synthetic data captures patterns from the real data. We re-ran the quality report using our own demo datasets, and discovered some interesting stats along the way:
- After making the change, the quality score slightly decreased for most of the datasets. The average change was -3.6%. This is because prior to our updates, the non-existent patterns were inflating the score. The new, lower score represents a more accurate measure of capturing the patterns that exist.
- Within individual tables, typically about 28% of column pairs had strong correlations. The remaining 72% of pairs are now ignored, which explains the change in quality scores.
- Between tables (inter-table trends in multi-table data), only about 19% of pairs had a strong correlation. This indicates that correlations are rarer between two tables than within a single table.
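If you're curious about this ratio in your own tables, a rough sketch for numeric columns (toy data and hypothetical column names, using the same > 0.5 Pearson threshold) might look like:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(42)

# Toy table: 'price' and 'tax' are strongly related; 'age' is independent.
price = rng.normal(100, 20, size=500)
columns = {
    'price': price,
    'tax': price * 0.08 + rng.normal(0, 1, size=500),
    'age': rng.integers(18, 90, size=500).astype(float),
}

strong, total = 0, 0
for a, b in combinations(columns, 2):
    r = np.corrcoef(columns[a], columns[b])[0, 1]
    total += 1
    if abs(r) > 0.5:
        strong += 1

print(f"{strong}/{total} column pairs have a strong correlation")
```

In this toy table, only the price/tax pair clears the threshold, so 1 of 3 pairs would count toward the score.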
Try it out on your own data and let us know what you think!
```python
from sdmetrics.reports.multi_table import QualityReport

# data and synthetic_data map table names to pandas DataFrames
report = QualityReport()
report.generate(data, synthetic_data, metadata)
```
You’ll see that any weak/non-existent correlations are not reported. For example, when visualizing the scores, you’ll see gray boxes whenever the correlation is not strong enough to be reported.
```python
report.get_visualization('Column Pair Trends')
```
Resources
- SDMetrics Quality Report. SDMetrics is our standalone, model-agnostic library for evaluating real vs. synthetic data.
- SDV single-table and multi-table evaluation. This is a handy wrapper around SDMetrics for anyone using SDV to generate the synthetic data.