Which software are you using? (SDV Community or SDV Enterprise?): SDV Community
Software Details (What is your SDV version? Python version?): SDV 1.28.0, Python 3.11.6
Description
I am working with a multi-table dataset consisting of 5 tables and am seeking advice on efficient training and scaling strategies.
The Dataset Challenge: The sizes of our tables are significantly unbalanced:

- Table A: 107M+ rows
- Table B: 37M+ rows
- Tables C, D, E: ~8 rows each (reference/lookup tables)
Current Approach & Pain Points: Due to hardware constraints (RAM and training time), training on the full 144M+ rows isn’t feasible. To mitigate this, I’ve been sampling 10,000 rows from the two large tables for the training phase.
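For reference, my subsampling step looks roughly like the following. This is a minimal sketch; the table and column names (`table_a`, `table_b`, `a_id`, etc.) are placeholders, and I'm assuming Table B is a child of Table A via a foreign key, so I filter B down to the sampled parent keys to keep referential integrity in the training subset:

```python
import pandas as pd

def subsample_for_training(tables, parent="table_a", child="table_b",
                           pk="a_id", fk="a_id", n=10_000, seed=42):
    """Subsample the large parent table, then keep only child rows that
    reference the sampled parents, so foreign keys stay valid.
    Small reference tables pass through unchanged."""
    sampled = dict(tables)  # shallow copy; untouched tables kept as-is
    parent_sample = tables[parent].sample(
        n=min(n, len(tables[parent])), random_state=seed
    )
    sampled[parent] = parent_sample
    # Drop child rows whose foreign key points outside the parent sample.
    child_df = tables[child]
    sampled[child] = child_df[child_df[fk].isin(parent_sample[pk])]
    return sampled
```

I then fit the synthesizer on the output of this function instead of the full tables.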
However, I’ve run into an issue during the generation phase:

- We need the final synthetic output to reflect the original scale (millions of rows) for the large tables.
- While the `scale` parameter can increase row counts, it applies a single multiplier across the entire schema.
- If I scale up to reach millions of rows for Tables A and B, it also scales the small 8-row tables, which I need to remain at their original size.
My Question: Is there a recommended way to selectively scale specific tables during generation? Specifically, how can I generate millions of rows for the two large tables while keeping the three small reference tables at their original counts (or at a much lower scale)?
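In case it helps frame the question, one post-processing workaround I've been sketching is to sample at the scale the large tables need, then collapse each inflated reference table back down to its unique rows and remap the child foreign keys. `shrink_reference_table` below is a hypothetical helper of my own, not part of the SDV API, and the column names are made up:

```python
import pandas as pd

def shrink_reference_table(synthetic, ref_table, child_table, ref_pk, child_fk):
    """Collapse a scaled-up reference table back to its distinct rows and
    remap the child table's foreign keys to the surviving primary keys.
    (Hypothetical post-processing helper, not part of SDV.)"""
    ref = synthetic[ref_table]
    value_cols = [c for c in ref.columns if c != ref_pk]
    # Keep one representative row per distinct combination of non-key values.
    deduped = ref.drop_duplicates(subset=value_cols, keep="first")
    # Map every original pk to the pk of its kept representative.
    keep_map = (ref.merge(deduped, on=value_cols, suffixes=("", "_kept"))
                   .set_index(ref_pk)[f"{ref_pk}_kept"])
    out = dict(synthetic)
    out[ref_table] = deduped.reset_index(drop=True)
    child = synthetic[child_table].copy()
    child[child_fk] = child[child_fk].map(keep_map)
    out[child_table] = child
    return out
```

This feels fragile (it has to be repeated for every reference table and every child relationship), so I'd prefer a supported per-table scaling option if one exists.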