Which software are you using? (SDV Community or SDV Enterprise?): SDV Community
Software Details (What is your SDV version? Python version?): SDV 1.28.0, Python 3.11.6
Description
I am working with a multi-table dataset consisting of 5 tables and am seeking advice on efficient training and scaling strategies.
The Dataset Challenge: The sizes of our tables are significantly unbalanced:

- Table A: 107M+ rows
- Table B: 37M+ rows
- Tables C, D, E: ~8 rows each (reference/lookup tables)
Current Approach & Pain Points: Due to hardware constraints (RAM and training time), training on the full 144M+ rows isn’t feasible. To mitigate this, I’ve been sampling 10,000 rows from the two large tables for the training phase.
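For reference, my subsampling step looks roughly like the following. This is a minimal sketch; the table and column names (`table_a`, `table_b`, `a_id`, etc.) are placeholders, and I'm assuming Table B is a child of Table A via a foreign key, so I filter B down to the sampled parent keys to keep referential integrity in the training subset:

```python
import pandas as pd

def subsample_for_training(tables, parent="table_a", child="table_b",
                           pk="a_id", fk="a_id", n=10_000, seed=42):
    """Subsample the large parent table, then keep only child rows that
    reference the sampled parents, so foreign keys stay valid.
    Small reference tables pass through unchanged."""
    sampled = dict(tables)  # shallow copy; untouched tables kept as-is
    parent_sample = tables[parent].sample(
        n=min(n, len(tables[parent])), random_state=seed
    )
    sampled[parent] = parent_sample
    # Drop child rows whose foreign key points outside the parent sample.
    child_df = tables[child]
    sampled[child] = child_df[child_df[fk].isin(parent_sample[pk])]
    return sampled
```

I then fit the synthesizer on the output of this function instead of the full tables.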
However, I’ve run into an issue during the generation phase:

- We need the final synthetic output to reflect the original scale (millions of rows) for the large tables.
- While the `scale` parameter can increase row counts, it applies a single multiplier across the entire schema.
- If I scale up to reach millions of rows for Tables A and B, it also scales the small 8-row tables, which I need to remain at their original size.
My Question: Is there a recommended way to selectively scale specific tables during generation? Specifically, how can I generate millions of rows for the two large tables while keeping the three small reference tables at their original counts (or at a much lower scale)?
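In case it helps frame the question, one post-processing workaround I've been sketching is to sample at the scale the large tables need, then collapse each inflated reference table back down to its unique rows and remap the child foreign keys. `shrink_reference_table` below is a hypothetical helper of my own, not part of the SDV API, and the column names are made up:

```python
import pandas as pd

def shrink_reference_table(synthetic, ref_table, child_table, ref_pk, child_fk):
    """Collapse a scaled-up reference table back to its distinct rows and
    remap the child table's foreign keys to the surviving primary keys.
    (Hypothetical post-processing helper, not part of SDV.)"""
    ref = synthetic[ref_table]
    value_cols = [c for c in ref.columns if c != ref_pk]
    # Keep one representative row per distinct combination of non-key values.
    deduped = ref.drop_duplicates(subset=value_cols, keep="first")
    # Map every original pk to the pk of its kept representative.
    keep_map = (ref.merge(deduped, on=value_cols, suffixes=("", "_kept"))
                   .set_index(ref_pk)[f"{ref_pk}_kept"])
    out = dict(synthetic)
    out[ref_table] = deduped.reset_index(drop=True)
    child = synthetic[child_table].copy()
    child[child_fk] = child[child_fk].map(keep_map)
    out[child_table] = child
    return out
```

This feels fragile (it has to be repeated for every reference table and every child relationship), so I'd prefer a supported per-table scaling option if one exists.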