I'm noticing that the quality report takes a very long time (upwards of several days). Do you have any suggestions to speed this up, or could this be improved?
Hi @pranav.rupireddy, I’m sorry to hear that.
Are you running the report on a multi-table dataset? And would you mind providing more info about the dataset size for both real and synthetic data – e.g. how many MB (or GB) the full dataset is, and how many rows/columns each table contains?
As a reference, our analysis shows that for a single table of data (around 98 MB each for real and synthetic), generating a quality report takes around 10 minutes.
So I'm wondering whether you are trying this on a much larger dataset, or one that has more complexity in terms of a multi-table schema?
I can dig in further based on your responses.
It’s multi-table.
Table 1:
- Dimensions: 1,000,000 rows × 15 columns
- Real: 69,206 KB
- Synthetic: 76,018 KB

Table 2:
- Dimensions: 23,064,914 rows × 91 columns
- Real: ~5 GB
- Synthetic: ~6 GB

Table 3:
- Dimensions: 8,142,945 rows × 22 columns
- Real: ~1 GB
- Synthetic: ~1 GB
Also, manually specifying the ID columns with the sdtype of id in the metadata has helped speed things up, though it appears it will still take a day or two (or more).
Hi @pranav.rupireddy thanks for the details.
> manually specifying the ID columns with the sdtype of id in the metadata has helped speed things up
Got it. The quality report comprehensively computes marginals and correlations for every statistical column (e.g. numerical, datetime, categorical). Other concepts, such as IDs or PII values, should not be marked with these statistical sdtypes. It may be best to do another sweep of your metadata to double-check this. In particular, if a column is accidentally marked as categorical but has extremely high cardinality (i.e. a large number of distinct values), that can cause a serious slowdown – so be especially sure to check for this.
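As a quick way to do that sweep, here is a minimal sketch (using plain pandas, with a hypothetical helper name and threshold) that flags object/categorical columns whose cardinality is suspiciously high relative to the row count – good candidates to re-mark as `id` or PII in the metadata:

```python
import pandas as pd

def flag_high_cardinality(df, threshold=0.5):
    """Flag object/categorical columns with suspiciously high cardinality.

    Columns where the number of unique values exceeds `threshold` (as a
    fraction of the row count) are likely IDs or free text, and should
    probably be given the 'id' (or a PII) sdtype in the metadata rather
    than 'categorical'. The 0.5 threshold is an illustrative choice.
    """
    suspects = []
    for col in df.select_dtypes(include=['object', 'category']).columns:
        ratio = df[col].nunique() / len(df)
        if ratio > threshold:
            suspects.append((col, round(ratio, 2)))
    return suspects

# Example: 'txn_id' is unique per row, so it gets flagged; 'status' is not.
data = pd.DataFrame({
    'txn_id': [f'T{i}' for i in range(100)],
    'status': ['open', 'closed'] * 50,
})
print(flag_high_cardinality(data))  # [('txn_id', 1.0)]
```

Any column this flags is worth re-checking in your metadata before rerunning the report.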
Additionally, your dataset size is quite large. To get you unblocked ASAP, you can subset the multi-table data (real and synthetic) using the get_random_subset function, then run the quality report on the subsetted data. If you repeat this process a few times, the scores will give you a good approximation of the full-data result.
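To illustrate the idea behind that subsetting step, here is a simplified sketch in plain pandas (the function name, column names, and two-table schema are hypothetical; get_random_subset handles the full multi-table schema for you). The key point is that when you sample the parent table, you keep only the child rows whose foreign keys still resolve:

```python
import pandas as pd

def subset_tables(parent, child, pk, fk, n_parent_rows, seed=0):
    """Sample a subset of a parent table, then keep only the child rows
    whose foreign key points at a sampled parent row, so referential
    integrity is preserved in the subset."""
    parent_subset = parent.sample(n=n_parent_rows, random_state=seed)
    child_subset = child[child[fk].isin(parent_subset[pk])]
    return parent_subset, child_subset

# Hypothetical two-table example: orders reference customers via 'customer_id'.
customers = pd.DataFrame({'id': range(10)})
orders = pd.DataFrame({
    'order_id': range(30),
    'customer_id': [i % 10 for i in range(30)],  # 3 orders per customer
})
cust_sub, ord_sub = subset_tables(customers, orders, 'id', 'customer_id', n_parent_rows=3)

# Every order in the subset still references a customer in the subset.
assert ord_sub['customer_id'].isin(cust_sub['id']).all()
```

You would run this (or get_random_subset) on both the real and synthetic data with the same settings, then pass the subsets to the quality report; repeating with a few different seeds and averaging the scores gives a more stable estimate.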
In the meantime, I will file and track a new feature request for natively improving the performance of the quality report.