Distributions of features in synthetic data

Hi @yuntien.lee, I'm answering a few of your questions below.

Enterprise Version

May we confirm the latest version of sdv enterprise is 0.13.0?

To see the latest version of SDV Enterprise, you can always click here to check the Release Notes. There is at least one new release every month, and sometimes several. At the time of this writing, version 0.14.1 is the latest (released on July 11, 2024), but I know that version 0.15.0 will be available soon. Let us know if you have trouble accessing the link.

Default Distribution

Also the default distribution is truncnorm for numerical fields?

For HSASynthesizer, the default distribution is truncnorm for the numerical fields of every table. To confirm this yourself, call get_table_parameters() for a table and check the printout.

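from sdv.multi_table import HSASynthesizer

# `metadata` is the metadata object you already created for your dataset
synth = HSASynthesizer(metadata)

# Look up the parameters set for the 'mem' table
synth.get_table_parameters(table_name='mem')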

Printout:

{'synthesizer_name': 'GaussianCopulaSynthesizer',
 'synthesizer_parameters': {'enforce_min_max_values': True,
  'enforce_rounding': True,
  'locales': None,
  'numerical_distributions': {},
  'default_distribution': 'truncnorm'}}
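
In case it helps anyone reading along: truncnorm refers to a truncated normal distribution, i.e. a normal distribution limited to a lower and an upper bound. Below is a minimal, standard-library-only sketch of that idea. It is not how SDV fits or samples the distribution internally; the function name sample_truncated_normal and the numbers used are made up purely for illustration.

```python
import random


def sample_truncated_normal(mean, std, low, high, rng):
    """Draw one value from normal(mean, std) restricted to [low, high].

    Illustration only: uses simple rejection sampling, redrawing until
    the value lands inside the bounds.
    """
    if low >= high:
        raise ValueError("low must be smaller than high")
    while True:
        value = rng.gauss(mean, std)
        if low <= value <= high:
            return value


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [
    sample_truncated_normal(mean=50, std=20, low=0, high=100, rng=rng)
    for _ in range(1000)
]

# Every sample stays inside the bounds, unlike a plain normal distribution
print(min(samples) >= 0, max(samples) <= 100)
```

The point of the sketch is only that values never fall outside the chosen bounds, which is what distinguishes a truncated normal from a regular one.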

Outliers

Besides can you advise how to deal with outlier values properly, or use a special encoder for that purpose? Thank you.

We will have some follow-up questions about this shortly that will help us give you the best solution. Stay tuned!

Custom Constraints

Can you also advise how to set constraints e.g. if AdmitType is null then AdmitDate is null?

Let’s discuss this in a new thread since this one is getting a bit long and is focused on the distributions. I started a topic here where we can reply with follow-up questions. Thanks.