This question was originally filed here by @rizwan. I’m separating it out into a new thread so that we can discuss it specifically.
Which software are you using? SDV Enterprise
Software Details SDV 0.44, Python 3.13
Description
How can I synthesize only a subset of a table’s columns? For example, the Employee table has the columns EmployeeID, FirstName, Surname, Address, and StartDate. How can I synthesize only FirstName and Surname and merge the synthetic data back into the table?
The general premise of SDV is to create brand-new synthetic data for each of the tables in your database. The synthetic rows that you receive are completely new entities that do not correspond to any one original entity.
For example, if you are synthesizing an Employees table, then each row of the synthetic Employees table corresponds to a brand-new employee who doesn’t map to any one original employee. This is what allows SDV to scale up the synthetic data – for example, creating 100x or even 1000x the number of original employees.
Clarifying your use case
If you want to synthesize only the FirstName and Surname columns of the table, I’m not sure whether the goal is to create brand-new employees or just to anonymize the names of existing ones.
Anonymizing existing information can be achieved through other functionality that SDV provides (like RDTs or targeted sampling). To better help you, could you clarify the use case? Do you want to keep the original employees exactly as-is and just create new names for them? What about other tables that may be connected to the Employees table?
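To make the "keep the employees, replace only the names" option concrete, here is a minimal sketch using only the Python standard library. It does not use SDV or RDT. The name pools, the sample rows, and the `anonymize_names` helper are all hypothetical and exist only for illustration. Every other column, including EmployeeID, is left untouched, so the rows still line up with any connected tables.

```python
import random

# Hypothetical name pools (assumption); a real project would likely use a faker-style generator.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan"]
SURNAMES = ["Lee", "Patel", "Garcia", "Novak", "Okafor"]


def anonymize_names(rows, seed=None):
    """Return copies of the rows with FirstName and Surname replaced by random picks."""
    rng = random.Random(seed)
    anonymized = []
    for row in rows:
        new_row = dict(row)  # copy, so the original table is not modified
        new_row["FirstName"] = rng.choice(FIRST_NAMES)
        new_row["Surname"] = rng.choice(SURNAMES)
        anonymized.append(new_row)
    return anonymized


# Made-up sample rows with the columns from the question.
employees = [
    {"EmployeeID": 1, "FirstName": "Ada", "Surname": "Smith",
     "Address": "1 Main St", "StartDate": "2020-01-15"},
    {"EmployeeID": 2, "FirstName": "Ben", "Surname": "Jones",
     "Address": "2 Oak Ave", "StartDate": "2021-06-01"},
]

masked = anonymize_names(employees, seed=0)
```

Note that this is anonymization rather than synthesis: each output row still describes one specific original employee, just under a different name.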
(Related to the concept of synthesizing brand-new entities, I’d recommend this blog post.)
Hi @neha, thanks for the reply. As you mentioned, anonymizing just the names won’t create synthetic data. I guess I would need to synthesize the entire table.
Right, I think for most cases you’d probably want to synthesize entire tables (unless they are reference tables that need to remain static).
That said, even with synthetic data you can “fix” a few values by using the conditional sampling feature. Usually, these are statistical values. The rest of the synthetic data is then created with those fixed values in mind. For example, you could create exactly 50 male and 50 female employees.
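As a rough illustration of the idea behind fixing values, here is a toy sketch in plain Python. It is not SDV's conditional sampling API and is not how SDV models data. It simply fixes one column and draws each remaining column from real rows that share that value. The `Gender`, `Department`, and `StartYear` columns, the sample rows, and the `sample_with_conditions` helper are assumptions made up for this example.

```python
import random
from collections import defaultdict


def sample_with_conditions(real_rows, conditions, column, seed=None):
    """Create rows with a fixed value in `column`.

    `conditions` maps each fixed value (e.g. "M") to the number of rows to create.
    Other columns are drawn independently from real rows that share the fixed value.
    """
    rng = random.Random(seed)

    # Group the real rows by the value of the conditioned column.
    by_value = defaultdict(list)
    for row in real_rows:
        by_value[row[column]].append(row)

    other_columns = [c for c in real_rows[0] if c != column]
    synthetic = []
    for value, num_rows in conditions.items():
        matches = by_value.get(value)
        if not matches:
            raise ValueError(f"No real rows with {column}={value!r}")
        for _ in range(num_rows):
            new_row = {column: value}  # the "fixed" part of the row
            for col in other_columns:
                new_row[col] = rng.choice(matches)[col]
            synthetic.append(new_row)
    return synthetic


# Made-up real data (assumption).
real = [
    {"Gender": "M", "Department": "Sales", "StartYear": 2019},
    {"Gender": "M", "Department": "IT", "StartYear": 2021},
    {"Gender": "F", "Department": "IT", "StartYear": 2018},
    {"Gender": "F", "Department": "HR", "StartYear": 2022},
]

# Exactly 50 male and 50 female synthetic employees.
synthetic = sample_with_conditions(real, {"M": 50, "F": 50}, column="Gender", seed=0)
```

The point is only that the fixed column is guaranteed while the remaining columns are generated around it. A real synthesizer would learn the relationships between columns instead of resampling observed values.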
Let me know if there are any follow-ups or if you have a use case that requires partial anonymization.