Scaling specific tables in a multi-table Synthetic Data Vault (SDV) model

Which software are you using? (SDV Community or SDV Enterprise?): SDV Community

Software Details (What is your SDV version? Python version?): SDV 1.28.0, Python 3.11.6

Description

I am working with a multi-table dataset consisting of 5 tables and am seeking advice on efficient training and scaling strategies.

The Dataset Challenge: Our table sizes are highly imbalanced:

  • Table A: 107M+ rows

  • Table B: 37M+ rows

  • Tables C, D, E: ~8 rows each (reference/lookup tables)

Current Approach & Pain Points: Because of hardware constraints (RAM and training time), training on the full 144M+ rows isn't feasible. As a workaround, I've been sampling 10,000 rows from each of the two large tables for the training phase.
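For context, my subsampling step looks roughly like this (a minimal stdlib sketch with toy, illustrative table and column names, not my actual pipeline; the key point is that the lookup tables are kept whole so foreign keys in the sample still resolve):

```python
import random

random.seed(0)  # for reproducibility of this sketch

# Toy stand-ins: rows of large Table A reference 8-row lookup Table C by "c_id".
table_c = [{"c_id": i, "label": f"cat_{i}"} for i in range(8)]       # small lookup table
table_a = [{"a_id": i, "c_id": random.randrange(8)} for i in range(100_000)]  # "large" table

SAMPLE_SIZE = 10_000

# Subsample only the large table; the lookup table stays at full size,
# so every sampled row's foreign key still points at an existing parent.
sample_a = random.sample(table_a, SAMPLE_SIZE)

valid_c_ids = {row["c_id"] for row in table_c}
assert all(row["c_id"] in valid_c_ids for row in sample_a)
```

The training data passed to the synthesizer is then the 10,000-row samples of Tables A and B plus the three small tables at their original 8 rows each.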

However, I’ve run into an issue during the generation phase:

  1. We need the final synthetic output to reflect the original scale (millions of rows) for the large tables.

  2. While the scale parameter can increase row counts, it applies a multiplier across the entire schema.

  3. If I scale up to reach millions of rows for Tables A and B, it also scales the small 8-row tables, which I need to remain at their original size.
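To make the problem concrete, here is the back-of-the-envelope arithmetic (hypothetical round numbers based on my setup) showing why a single global multiplier doesn't work for this schema:

```python
# To grow a 10,000-row training sample of Table A back to ~107M rows,
# the multiplier would have to be:
trained_rows_a = 10_000
target_rows_a = 107_000_000
scale = target_rows_a / trained_rows_a  # 10,700x

# But the same multiplier is applied schema-wide, so an 8-row lookup
# table would be inflated far beyond its intended size:
lookup_rows = 8
inflated_lookup_rows = lookup_rows * scale  # ~85,600 rows instead of 8
```

What I want instead is effectively a per-table scale: ~10,700x for Tables A and B, and 1x for Tables C, D, and E.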

My Question: Is there a recommended way to selectively scale specific tables during generation? Specifically, how can I generate millions of rows for the two large tables while keeping the three small reference tables at their original counts (or at a much lower scale)?

Hi @neha, can you check this out?