I'm noticing that the quality report takes a very long time (upwards of several days). Do you have any suggestions to speed this up, or could this be improved?
Hi @pranav.rupireddy, I’m sorry to hear that.
Are you running the report on a multi-table dataset? And would you mind providing more info about the dataset size for both real and synthetic data – e.g. how many MB (or GB) the full dataset is, and how many rows/columns each table contains?
As a reference, our analysis shows that for a single table of data (around 98 MB each for real and synthetic), generating a quality report takes around 10 minutes.
So I'm wondering whether you are trying this on a much larger dataset, or one that has more complexity in terms of a multi-table schema?
I can dig in further based on your responses.
It’s multi-table.
Table 1:
- Dimensions: 1,000,000 rows × 15 columns
- Real: 69,206 KB
- Synthetic: 76,018 KB

Table 2:
- Dimensions: 23,064,914 rows × 91 columns
- Real: ~5 GB
- Synthetic: ~6 GB

Table 3:
- Dimensions: 8,142,945 rows × 22 columns
- Real: ~1 GB
- Synthetic: ~1 GB
Also, manually specifying the ID columns with the sdtype of id in the metadata has helped speed things up, though it appears it will still take a day or two (or more).
Hi @pranav.rupireddy thanks for the details.
> manually specifying the ID columns with the sdtype of id in the metadata has helped speed things up
Got it. The quality report comprehensively computes marginals and correlations for every statistical column (e.g. numerical, datetime, categorical). Other concepts, such as IDs or PII values, should not be marked with these statistical sdtypes. It may be best to do another sweep of your metadata to double-check this. In particular, if a column is accidentally marked as categorical but has extremely high cardinality (i.e. a large number of distinct values), that can cause a serious slowdown – so be especially sure to check for this.
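As a quick way to do that sweep, here is a minimal sketch (using plain pandas, with a hypothetical helper name and threshold) that flags object/categorical columns whose cardinality is suspiciously high relative to the row count – good candidates to re-mark as `id` or PII in the metadata:

```python
import pandas as pd

def flag_high_cardinality(df, threshold=0.5):
    """Flag object/categorical columns with suspiciously high cardinality.

    Columns where the number of unique values exceeds `threshold` (as a
    fraction of the row count) are likely IDs or free text, and should
    probably be given the 'id' (or a PII) sdtype in the metadata rather
    than 'categorical'. The 0.5 threshold is an illustrative choice.
    """
    suspects = []
    for col in df.select_dtypes(include=['object', 'category']).columns:
        ratio = df[col].nunique() / len(df)
        if ratio > threshold:
            suspects.append((col, round(ratio, 2)))
    return suspects

# Example: 'txn_id' is unique per row, so it gets flagged; 'status' is not.
data = pd.DataFrame({
    'txn_id': [f'T{i}' for i in range(100)],
    'status': ['open', 'closed'] * 50,
})
print(flag_high_cardinality(data))  # [('txn_id', 1.0)]
```

Any column this flags is worth re-checking in your metadata before rerunning the report.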
Additionally, your dataset size is quite large. To get you unblocked ASAP, you can subset the multi-table data (real and synthetic) using the get_random_subset function, then run the quality report on the subsetted data. If you repeat this process a few times, the scores will give you a good approximation of the full-data result.
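To illustrate the idea behind that subsetting step, here is a simplified sketch in plain pandas (the function name, column names, and two-table schema are hypothetical; get_random_subset handles the full multi-table schema for you). The key point is that when you sample the parent table, you keep only the child rows whose foreign keys still resolve:

```python
import pandas as pd

def subset_tables(parent, child, pk, fk, n_parent_rows, seed=0):
    """Sample a subset of a parent table, then keep only the child rows
    whose foreign key points at a sampled parent row, so referential
    integrity is preserved in the subset."""
    parent_subset = parent.sample(n=n_parent_rows, random_state=seed)
    child_subset = child[child[fk].isin(parent_subset[pk])]
    return parent_subset, child_subset

# Hypothetical two-table example: orders reference customers via 'customer_id'.
customers = pd.DataFrame({'id': range(10)})
orders = pd.DataFrame({
    'order_id': range(30),
    'customer_id': [i % 10 for i in range(30)],  # 3 orders per customer
})
cust_sub, ord_sub = subset_tables(customers, orders, 'id', 'customer_id', n_parent_rows=3)

# Every order in the subset still references a customer in the subset.
assert ord_sub['customer_id'].isin(cust_sub['id']).all()
```

You would run this (or get_random_subset) on both the real and synthetic data with the same settings, then pass the subsets to the quality report; repeating with a few different seeds and averaging the scores gives a more stable estimate.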
In the meantime, I will file and track a new feature request for natively improving the performance of the quality report.