Scaling specific tables in a multi-table Synthetic Data Vault (SDV) model

Which software are you using? (SDV Community or SDV Enterprise?): SDV Community

Software Details (What is your SDV version? Python version?): SDV 1.28.0, Python 3.11.6

Description

I am working with a multi-table dataset consisting of 5 tables and am seeking advice on efficient training and scaling strategies.

The Dataset Challenge: The sizes of our tables are significantly unbalanced:

  • Table A: 107M+ rows

  • Table B: 37M+ rows

  • Tables C, D, E: ~8 rows each (reference/lookup tables)

Current Approach & Pain Points: Due to hardware constraints (RAM and training time), training on the full 144M+ rows isn’t feasible. To mitigate this, I’ve been sampling 10,000 rows from the two large tables for the training phase.
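For context, here is a minimal sketch of one way to draw a training subsample that keeps foreign keys valid, rather than sampling each large table independently. It is an illustration only: the parent/child relationship, the `a_id`/`b_id` column names, and the helper `subsample_parent_child` are all hypothetical and not taken from the actual schema. Note that it samples a fixed number of parent rows and keeps whatever child rows reference them, so the child table will not land on exactly 10,000 rows.

```python
import random


def subsample_parent_child(parent_rows, child_rows, parent_key, foreign_key,
                           n_parents, seed=0):
    """Sample parent rows, then keep only the child rows that reference them.

    Hypothetical helper: rows are plain dicts, keys are column names.
    """
    rng = random.Random(seed)
    n = min(n_parents, len(parent_rows))
    sampled_parents = rng.sample(parent_rows, n)
    kept_ids = {row[parent_key] for row in sampled_parents}
    # Dropping children whose parent was not sampled avoids orphaned keys.
    sampled_children = [row for row in child_rows if row[foreign_key] in kept_ids]
    return sampled_parents, sampled_children


# Toy stand-ins for the two large tables (assumed relationship: child -> parent).
parent_table = [{"b_id": i} for i in range(100)]
child_table = [{"a_id": i, "b_id": i % 100} for i in range(300)]

sub_parent, sub_child = subsample_parent_child(
    parent_table, child_table, parent_key="b_id", foreign_key="b_id", n_parents=10
)
print(len(sub_parent), len(sub_child))
```

The small lookup tables would be passed to training unchanged, since they are already tiny.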

However, I’ve run into an issue during the generation phase:

  1. We need the final synthetic output to reflect the original scale (millions of rows) for the large tables.

  2. While the scale parameter can increase row counts, it applies a multiplier across the entire schema.

  3. If I scale up to reach millions of rows for Tables A and B, it also scales the small 8-row tables, which I need to remain at their original size.
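To make the problem concrete, here is a rough back-of-the-envelope sketch. It assumes, as described above, that scale behaves like a single multiplier applied to every table's training row count; the row counts SDV actually produces may differ. The figures are the lower bounds quoted in this post, and the helper names are made up for illustration.

```python
# Row counts from the post (lower bounds) and the training subsample sizes.
original_rows = {"A": 107_000_000, "B": 37_000_000, "C": 8, "D": 8, "E": 8}
training_rows = {"A": 10_000, "B": 10_000, "C": 8, "D": 8, "E": 8}


def scale_to_match(table, original, training):
    """Multiplier needed for one table to return to its original size."""
    return original[table] / training[table]


def rows_after_scaling(training, scale):
    """Projected row counts if the same multiplier hits every table."""
    return {name: round(count * scale) for name, count in training.items()}


scale = scale_to_match("A", original_rows, training_rows)
projected = rows_after_scaling(training_rows, scale)

for name, count in projected.items():
    print(f"Table {name}: {count:,} rows (original: {original_rows[name]:,})")
```

Under that assumption, the multiplier that restores Table A is 10,700, which would push each 8-row lookup table to 85,600 rows and would also overshoot Table B, because both large tables were cut to the same 10,000 rows despite having different original sizes.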

My Question: Is there a recommended way to selectively scale specific tables during generation? Specifically, how can I generate millions of rows for the two large tables while keeping the three small reference tables at their original counts (or at a much lower scale)?

Hi @neha, can you check this out?

Hello @singhe,

If I understand correctly, you’d like tables C, D, and E to exist as-is (they are reference/lookup tables), and you only really need to create synthetic data for tables A and B (and scale them up from the original training data). The recommended approach is to use the ReferenceTable constraint, marking tables C, D, and E as reference tables. These tables will then always remain at their original size (~8 rows each) while the other tables scale up.

Note that the ReferenceTable constraint is available in the CAG bundle, which is an add-on to SDV Enterprise. If you’re interested in SDV Enterprise, you can Contact Us.

BTW, memory usage and performance are better (by orders of magnitude) with the HSASynthesizer. So if you upgrade to SDV Enterprise, you may actually be able to train on the entire dataset, with no need to subsample the larger tables during the training phase.

Hi @neha,
If we are moving forward with the Enterprise version, how can we arrange a call to discuss? I haven’t yet discussed this with the organization, but let me know the procedure.

Hi @singhe, If you’re interested in SDV Enterprise, you can Contact Us. Someone from our team will reach out to you.

How does that normally work, Neha?
After I reach out, do you schedule a call to discuss, set up a POC, or something else?