Distributions of features in synthetic data

Hi Datacebo team,
After generating synthetic data based on the metadata below, we inspected the result using synthesizer.get_learned_distributions, and all columns show a 'truncnorm' distribution. We used the default settings. Could you confirm whether the default should be 'beta', or whether there are conditions that switch it to 'truncnorm'?

metadata
{
“tables”: {
“mem”: {
“columns”: {
“MemberID”: {
“sdtype”: “id”
},
“DOB”: {
“sdtype”: “datetime”
},
“Gender”: {
“sdtype”: “categorical”
},
“Product”: {
“sdtype”: “categorical”
},
“GroupID”: {
“sdtype”: “categorical”
},
“PolicyType”: {
“sdtype”: “categorical”
},
“GroupType”: {
“sdtype”: “categorical”
},
“Metal”: {
“sdtype”: “categorical”
},
“Exchange”: {
“sdtype”: “categorical”
},
“ContractType”: {
“sdtype”: “categorical”
},
“MedicalCOB”: {
“sdtype”: “categorical”
},
“RxCOB”: {
“sdtype”: “categorical”
},
“MSA”: {
“sdtype”: “categorical”
},
“DemographicGroup”: {
“sdtype”: “categorical”
},
“Exposure_Months”: {
“sdtype”: “categorical”
}
},
“primary_key”: “MemberID”
},
“med”: {
“columns”: {
“ClaimID”: {
“sdtype”: “categorical”
},
“LineNum”: {
“sdtype”: “numerical”
},
“MemberID”: {
“sdtype”: “id”
},
“FromDate”: {
“sdtype”: “datetime”
},
“ToDate”: {
“sdtype”: “datetime”
},
“AdmitDate”: {
“sdtype”: “datetime”
},
“DischDate”: {
“sdtype”: “datetime”
},
“PaidDate”: {
“sdtype”: “datetime”
},
“RevCode”: {
“sdtype”: “categorical”
},
“HCPCS”: {
“sdtype”: “categorical”
},
“Modifier”: {
“sdtype”: “categorical”
},
“Modifier2”: {
“sdtype”: “categorical”
},
“POS”: {
“sdtype”: “categorical”
},
“Specialty”: {
“sdtype”: “categorical”
},
“EncounterFlag”: {
“sdtype”: “categorical”
},
“ProviderID”: {
“sdtype”: “categorical”
},
“ProviderZIP”: {
“sdtype”: “categorical”
},
“BillType”: {
“sdtype”: “categorical”
},
“AdmitSource”: {
“sdtype”: “categorical”
},
“AdmitType”: {
“sdtype”: “categorical”
},
“Allowed”: {
“sdtype”: “numerical”
},
“Paid”: {
“sdtype”: “numerical”
},
“COB”: {
“sdtype”: “numerical”
},
“Copay”: {
“sdtype”: “numerical”
},
“Coinsurance”: {
“sdtype”: “numerical”
},
“Deductible”: {
“sdtype”: “numerical”
},
“PatientPay”: {
“sdtype”: “numerical”
},
“Days”: {
“sdtype”: “numerical”
},
“Units”: {
“sdtype”: “numerical”
},
“DischargeStatus”: {
“sdtype”: “categorical”
},
“AdmitDiag”: {
“sdtype”: “categorical”
},
“ICDDiag1”: {
“sdtype”: “categorical”
},
“ICDDiag2”: {
“sdtype”: “categorical”
},
“ICDDiag3”: {
“sdtype”: “categorical”
},
“ICDDiag4”: {
“sdtype”: “categorical”
},
“ICDDiag5”: {
“sdtype”: “categorical”
},
“ICDDiag6”: {
“sdtype”: “categorical”
},
“ICDDiag7”: {
“sdtype”: “categorical”
},
“ICDDiag8”: {
“sdtype”: “categorical”
},
“ICDDiag9”: {
“sdtype”: “categorical”
},
“ICDDiag10”: {
“sdtype”: “categorical”
},
“ICDDiag11”: {
“sdtype”: “categorical”
},
“ICDDiag12”: {
“sdtype”: “categorical”
},
“ICDDiag13”: {
“sdtype”: “categorical”
},
“ICDDiag14”: {
“sdtype”: “categorical”
},
“ICDDiag15”: {
“sdtype”: “categorical”
},
“ICDDiag16”: {
“sdtype”: “categorical”
},
“ICDDiag17”: {
“sdtype”: “categorical”
},
“ICDDiag18”: {
“sdtype”: “categorical”
},
“ICDDiag19”: {
“sdtype”: “categorical”
},
“ICDDiag20”: {
“sdtype”: “categorical”
},
“ICDDiag21”: {
“sdtype”: “categorical”
},
“ICDDiag22”: {
“sdtype”: “categorical”
},
“ICDDiag23”: {
“sdtype”: “categorical”
},
“ICDDiag24”: {
“sdtype”: “categorical”
},
“ICDDiag25”: {
“sdtype”: “categorical”
},
“ICDDiag26”: {
“sdtype”: “categorical”
},
“ICDDiag27”: {
“sdtype”: “categorical”
},
“ICDDiag28”: {
“sdtype”: “categorical”
},
“ICDDiag29”: {
“sdtype”: “categorical”
},
“ICDDiag30”: {
“sdtype”: “categorical”
},
“ICDProc1”: {
“sdtype”: “categorical”
},
“ICDProc2”: {
“sdtype”: “categorical”
},
“ICDProc3”: {
“sdtype”: “categorical”
},
“ICDProc4”: {
“sdtype”: “categorical”
},
“ICDProc5”: {
“sdtype”: “categorical”
},
“ICDProc6”: {
“sdtype”: “categorical”
},
“ICDProc7”: {
“sdtype”: “categorical”
},
“ICDProc8”: {
“sdtype”: “categorical”
},
“ICDProc9”: {
“sdtype”: “categorical”
},
“ICDProc10”: {
“sdtype”: “categorical”
},
“ICDProc11”: {
“sdtype”: “categorical”
},
“ICDProc12”: {
“sdtype”: “categorical”
},
“ICDProc13”: {
“sdtype”: “categorical”
},
“ICDProc14”: {
“sdtype”: “categorical”
},
“ICDProc15”: {
“sdtype”: “categorical”
},
“ICDProc16”: {
“sdtype”: “categorical”
},
“ICDProc17”: {
“sdtype”: “categorical”
},
“ICDProc18”: {
“sdtype”: “categorical”
},
“ICDProc19”: {
“sdtype”: “categorical”
},
“ICDProc20”: {
“sdtype”: “categorical”
},
“ICDProc21”: {
“sdtype”: “categorical”
},
“ICDProc22”: {
“sdtype”: “categorical”
},
“ICDProc23”: {
“sdtype”: “categorical”
},
“ICDProc24”: {
“sdtype”: “categorical”
},
“ICDProc25”: {
“sdtype”: “categorical”
},
“ICDProc26”: {
“sdtype”: “categorical”
},
“ICDProc27”: {
“sdtype”: “categorical”
},
“ICDProc28”: {
“sdtype”: “categorical”
},
“ICDProc29”: {
“sdtype”: “categorical”
},
“ICDProc30”: {
“sdtype”: “categorical”
},
“OON”: {
“sdtype”: “categorical”
},
“ClaimLineStatus”: {
“sdtype”: “categorical”
}
}
},
“pharm”: {
“columns”: {
“ClaimID”: {
“sdtype”: “categorical”
},
“MemberID”: {
“sdtype”: “id”
},
“FromDate”: {
“sdtype”: “datetime”
},
“PaidDate”: {
“sdtype”: “datetime”
},
“NDC”: {
“sdtype”: “categorical”
},
“EncounterFlag”: {
“sdtype”: “categorical”
},
“MedicalCovered”: {
“sdtype”: “categorical”
},
“ProviderID”: {
“sdtype”: “categorical”
},
“ProviderZIP”: {
“sdtype”: “categorical”
},
“MailOrder”: {
“sdtype”: “categorical”
},
“Allowed”: {
“sdtype”: “numerical”
},
“Paid”: {
“sdtype”: “numerical”
},
“COB”: {
“sdtype”: “numerical”
},
“Copay”: {
“sdtype”: “numerical”
},
“Coinsurance”: {
“sdtype”: “numerical”
},
“Deductible”: {
“sdtype”: “numerical”
},
“PatientPay”: {
“sdtype”: “numerical”
},
“Units”: {
“sdtype”: “categorical”
},
“DaysSupply”: {
“sdtype”: “categorical”
},
“QuantityDispensed”: {
“sdtype”: “categorical”
},
“OON”: {
“sdtype”: “categorical”
},
“ClaimLineStatus”: {
“sdtype”: “categorical”
}
}
}
},
“relationships”: [
{
“parent_table_name”: “mem”,
“child_table_name”: “pharm”,
“parent_primary_key”: “MemberID”,
“child_foreign_key”: “MemberID”
},
{
“parent_table_name”: “mem”,
“child_table_name”: “med”,
“parent_primary_key”: “MemberID”,
“child_foreign_key”: “MemberID”
}
],
“METADATA_SPEC_VERSION”: “MULTI_TABLE_V1”
}

Hi @yuntien.lee, we’ll take a look and follow up here.

In addition to the metadata, could you provide more details about what kind of synthesizer you are creating? Is it HSASynthesizer (available in SDV Enterprise only)? And are you updating its settings, adding constraints, preprocessing etc. before you do fit? If so, it would be great if you could share a code snippet about how you’re setting it up.

Yes, it was HSASynthesizer. Before synthesizer.fit(), the only setup was synthesizer = HSASynthesizer(metadata), with the metadata shown previously.

Hi @yuntien.lee, thanks for your response.

I can confirm that the default distribution for HSASynthesizer is truncnorm. So if you are just using the default settings, this is working as expected. Here are a few functions that you may find useful:

  1. Run get_table_parameters to see what values have been set. By default, I see that the HSASynthesizer will print out 'default_distribution': 'truncnorm'
  2. Update the distribution using the set_table_parameters function. For example the code below will set it to 'beta':
synthesizer.set_table_parameters(
    table_name='mem',
    table_synthesizer='GaussianCopulaSynthesizer',
    table_parameters={
        'default_distribution': 'beta'
    }
)

Out of curiosity, is there a reason why you would like to use (or are expecting) 'beta' for your dataset?


Datacebo has suggested that default distributions would be beta, and also suggested beta is a better distribution to use in most cases.

Thanks for pointing that out @yuntien.lee.

Indeed, the default for the single-table GaussianCopulaSynthesizer right now is 'beta'. It’s just that our multi-table HSASynthesizer uses 'truncnorm' instead. I will check in with the team to better understand why this decision was made, whether it should be updated, and how best to communicate this to you in the future.

Datacebo has suggested that default distributions would be beta, and also suggested beta is a better distribution to use in most cases.

This may be true for single-table (though I will check in with the team). For multi-table cases specifically, I have seen many scenarios where 'truncnorm' actually ended up being a much better option. Usually, it was because 'beta' led to worse performance, as it takes much longer to fit. Plus any supposed quality improvements with beta were minimal (if any at all). Of course, your mileage may vary depending on your exact dataset, but 'truncnorm' is an excellent option to start with for multi-table.

Some follow up issues.

  • What metadata type should we set ClaimID and GroupID to? They are ID-like: they should not be foreign keys between tables, but they should not be categorical either.
  • Running the overall score took (1) 11 mins for a shape score of 92.9%, (2) 29.5 hours for a column pair score of 80.5%, (3) < 1 min for a cardinality score of 99.7%, and (4) 5 hours for an intertable trends score of 82.7%. The total score was 88.9%.
  • We found that with either truncnorm or beta, the amount-related fields do not match well in distribution between the source and synthetic data. Any ideas on how this can be improved? We will send numerical values via email.
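The 88.9% total in the second bullet is consistent with a plain average of the four property scores. The sketch below reproduces that arithmetic, assuming the overall score is the unweighted mean of the properties. The variable names are illustrative, and the sub-minute cardinality run is counted as one minute.

```python
# Property scores (percent) and run times (hours) as reported above.
scores = {
    "column_shapes": 92.9,
    "column_pair_trends": 80.5,
    "cardinality": 99.7,
    "intertable_trends": 82.7,
}
hours = {
    "column_shapes": 11 / 60,
    "column_pair_trends": 29.5,
    "cardinality": 1 / 60,  # reported as "< 1 min"; upper bound used
    "intertable_trends": 5.0,
}

# Unweighted mean of the four property scores.
overall = sum(scores.values()) / len(scores)

# Total evaluation time and the property that dominates it.
total_hours = sum(hours.values())
slowest = max(hours, key=hours.get)

print(f"overall score: {overall:.2f}%")
print(f"total time: {total_hours:.1f} h, slowest: {slowest}")
```

The mean works out to about 88.95%, which agrees with the reported 88.9% given that the inputs are rounded. The column pair step accounts for roughly 85% of the total run time.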

Some supplemental information.

  • All truncnorm took around 16 hours to fit and all beta took around 30 hours.
  • Datasets that are used are the ones described in the presentation slides, membership 1m, medical claims 23m and drug claims 8m records.
  • You mentioned that outliers in the source data might be skewing the fit and that you would like to check the source data. Unfortunately, based on our contracts with our data sources, we’re not allowed to share that data with you. However, below is the five-number summary of the actual allowed cost and the synthetic allowed cost that underlie the medical claims data we used in our case study. Based on this, do you have any suggestions for the distribution we should use for this field when using the HSA synthesizer?
    – Source min = -450k max = 3.1m Q1 = 8.3 Q2 = 34 Q3 = 109
    – Synthetic min = -8210 max = 9093 Q1 = -877 Q2 = 166 Q3 = 1209
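One way to see how far the source tails sit from the bulk of the data is Tukey's 1.5 x IQR rule applied to the five-number summaries above. This is a minimal sketch using only the figures quoted in this post. The 1.5 multiplier is the conventional choice rather than anything SDV prescribes, and the function name is illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Five-number summaries quoted above (allowed cost, medical claims).
source = {"min": -450_000, "q1": 8.3, "median": 34, "q3": 109, "max": 3_100_000}
synthetic = {"min": -8210, "q1": -877, "median": 166, "q3": 1209, "max": 9093}

for name, s in (("source", source), ("synthetic", synthetic)):
    lower, upper = tukey_fences(s["q1"], s["q3"])
    iqr = s["q3"] - s["q1"]
    # How many IQRs the maximum lies beyond the upper fence.
    overshoot = (s["max"] - upper) / iqr
    print(
        f"{name}: IQR={iqr:.1f}, fences=({lower:.2f}, {upper:.2f}), "
        f"max is {overshoot:.0f} IQRs past the upper fence"
    )
```

For the source summary the fences land at roughly -143 and 260, so both the minimum and the maximum sit far outside them, while the synthetic IQR (2086) is roughly twenty times the source IQR (100.7).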

Thank you @yuntien.lee. This may be in the metadata, but could you also let us know the number of columns in each table?

Yes, you can see those in the metadata:
mem 15 fields
med 93 fields
pharm 22 fields

May we confirm the latest version of sdv enterprise is 0.13.0? Also the default distribution is truncnorm for numerical fields?

Can you also advise how to set constraints e.g. if AdmitType is null then AdmitDate is null?

Besides can you advise how to deal with outlier values properly, or use a special encoder for that purpose? Thank you.

Hi @yuntien.lee, answering a few of your Qs down below.

Enterprise Version

May we confirm the latest version of sdv enterprise is 0.13.0?

To see the latest version of SDV Enterprise, you can always click here to check the Release Notes. There is always at least 1 new release every month, and sometimes there are several. At the time of this writing, version 0.14.1 is the latest version (released on July 11, 2024), but I know that a new version, 0.15.0, will be available soon. Let us know if you are having trouble accessing the link.

Default Distribution

Also the default distribution is truncnorm for numerical fields?

For HSASynthesizer, the default distribution is truncnorm for the numerical fields of every table. To confirm this yourself, use the get_table_parameters() function to see what is printed out.

from sdv.multi_table import HSASynthesizer

synth = HSASynthesizer(metadata)
synth.get_table_parameters(table_name='mem')

Printout:

{'synthesizer_name': 'GaussianCopulaSynthesizer',
 'synthesizer_parameters': {'enforce_min_max_values': True,
  'enforce_rounding': True,
  'locales': None,
  'numerical_distributions': {},
  'default_distribution': 'truncnorm'}}
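
The printout also shows an empty numerical_distributions entry next to default_distribution. The sketch below builds a table_parameters dictionary that keeps a default and overrides individual columns. It assumes that numerical_distributions accepts a column-to-distribution-name mapping, as it does in the public GaussianCopulaSynthesizer, and that set_table_parameters passes it through for HSASynthesizer. The helper name and the choice of 'gamma' for the Days column are hypothetical.

```python
def build_table_parameters(default_distribution, column_overrides=None):
    """Build a table_parameters dict with a default and per-column overrides."""
    params = {"default_distribution": default_distribution}
    if column_overrides:
        # Copy so later edits to the caller's dict do not leak in.
        params["numerical_distributions"] = dict(column_overrides)
    return params


med_params = build_table_parameters(
    "truncnorm",
    column_overrides={"Days": "gamma"},  # hypothetical choice for illustration
)
print(med_params)

# With a synthesizer in hand, the dict would be passed along in the same way
# as the set_table_parameters example earlier in this thread (not run here):
# synthesizer.set_table_parameters(
#     table_name="med",
#     table_synthesizer="GaussianCopulaSynthesizer",
#     table_parameters=med_params,
# )
```

Calling get_table_parameters again afterwards would be the way to confirm whether the override was registered.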

Outliers

Besides can you advise how to deal with outlier values properly, or use a special encoder for that purpose? Thank you.

We will have some follow up Qs about this shortly that will help us give you the best solution. Stay tuned!

Custom Constraints

Can you also advise how to set constraints e.g. if AdmitType is null then AdmitDate is null?

Let’s discuss this in a new thread since this one is getting a bit long and is focused on the distributions. I started a topic here where we can reply with follow-up questions. Thanks.

2 posts were split to a new topic: Error when upgrading SDV Enterprise

Hi, is this still the syntax for switching to beta? We are noticing that the resulting fits are still normal.

for table_name in all_data.keys():
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_synthesizer='GaussianCopulaSynthesizer',
        table_parameters={
            'default_distribution': 'beta'})

-Pranav

It appears that the metadata are registering the beta distribution, but for whatever reason the fits are still normal (both pasted below):

synth.get_table_parameters(table_name='pharmclaims')
{'synthesizer_name': 'GaussianCopulaSynthesizer', 'synthesizer_parameters': {'enforce_min_max_values': True, 'enforce_rounding': True, 'locales': ['en_US'], 'numerical_distributions': {}, 'default_distribution': 'beta'}}

synth.get_table_parameters(table_name='medclaims')
{'synthesizer_name': 'GaussianCopulaSynthesizer', 'synthesizer_parameters': {'enforce_min_max_values': True, 'enforce_rounding': True, 'locales': ['en_US'], 'numerical_distributions': {}, 'default_distribution': 'beta'}}

synth.get_table_parameters(table_name='membership_exposure')
{'synthesizer_name': 'GaussianCopulaSynthesizer', 'synthesizer_parameters': {'enforce_min_max_values': True, 'enforce_rounding': True, 'locales': ['en_US'], 'numerical_distributions': {}, 'default_distribution': 'beta'}}

Hi @pranav.rupireddy, I can confirm that the API that you are using to set the 'beta' distribution looks to be correct.

Are you using the HSA synthesizer? If so, you can see what exactly the HSA learned by using the get_learned_distributions function after fitting. Click here to see the docs.

This will return the learned distribution shape, and its parameters for all columns of a given table. You can then inspect the ones you are interested in.

>>> synthesizer.get_learned_distributions(table_name='my_table_name')
{
    'column_1': {
        'distribution': 'beta',
        'learned_parameters': { 'a': 2.22, 'b': 3.17, 'loc': 0.07, 'scale': 48.5 }
    },
    'column_2': { 
        ...
    },
}
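
To check every column at once rather than reading the dictionary by eye, the returned structure can be tallied by distribution name. This is a minimal sketch, assuming the return value has the shape shown above. The `learned` dictionary is sample data standing in for a real call, the values for column_2 are made up, and the helper names are illustrative.

```python
from collections import Counter

# Stand-in for synthesizer.get_learned_distributions(table_name=...).
learned = {
    "column_1": {
        "distribution": "beta",
        "learned_parameters": {"a": 2.22, "b": 3.17, "loc": 0.07, "scale": 48.5},
    },
    "column_2": {  # hypothetical values for illustration
        "distribution": "beta",
        "learned_parameters": {"a": 5.0, "b": 5.1, "loc": 0.0, "scale": 10.0},
    },
}


def count_distributions(learned_distributions):
    """Count how many columns were fitted with each distribution name."""
    return Counter(info["distribution"] for info in learned_distributions.values())


def columns_not_using(learned_distributions, expected):
    """List columns whose fitted distribution differs from the expected one."""
    return [
        column
        for column, info in learned_distributions.items()
        if info["distribution"] != expected
    ]


print(count_distributions(learned))
print(columns_not_using(learned, "beta"))
```

An empty list from columns_not_using means every column in the table was fitted with the expected distribution.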

I would guess that your HSA synthesizer is correctly applying the 'beta' distribution … just that the learned parameters of that distribution are making it appear normal (even though it’s not). It would be helpful to see what the real distribution looks like, if you are able to share it.

See below for more info about the Beta distribution.

Beta Distribution: Additional Info

When you set the distribution to 'beta' you are telling the SDV to estimate the column’s shape using the beta distribution’s formula. SDV tries to compute the 2 shape parameters (alpha and beta) that best match the shape of your data.

The beta distribution can take on a lot of different shapes depending on which exact parameters are learned. For example, if the alpha parameter < beta parameter, the resulting distribution will skew right.

If the alpha parameter is approximately equal to the beta parameter, the resulting distribution will appear, to the eye, as a normal distribution. This doesn’t mean that it is truly normal – it is still computed using beta’s formula.

So it’s important to verify what the learned parameters are using get_learned_distributions.
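
The shape claims above can be checked from the learned parameters alone. The skewness of a beta distribution depends only on its two shape parameters (reported as a and b in the learned parameters), through the standard formula 2(b - a)sqrt(a + b + 1) / ((a + b + 2)sqrt(ab)). A minimal sketch follows; the function name is illustrative.

```python
import math


def beta_skewness(a, b):
    """Skewness of a Beta(a, b) distribution (loc and scale do not affect it)."""
    return 2 * (b - a) * math.sqrt(a + b + 1) / ((a + b + 2) * math.sqrt(a * b))


# Shape parameters from the example output earlier in this post:
# a < b, so the skewness is positive (longer right tail).
print(beta_skewness(2.22, 3.17))

# Equal shape parameters give a symmetric distribution (skewness 0).
print(beta_skewness(4.0, 4.0))

# a > b flips the sign (longer left tail).
print(beta_skewness(3.17, 2.22))
```

A skewness near zero only says the fitted shape is symmetric. It looks bell-like when both parameters are well above 1, whereas equal parameters at 1 give a flat shape and equal parameters below 1 give a U shape.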

Hi @neha,

You’re right. It looks like everything is getting fitted to beta. I have attached the distribution parameters for all three tables.

I have attached a comparison of the synthetic and real marginal distributions. Even though I didn’t divide by the number of members for either, it doesn’t make a difference in this case, since the number of members is the same for both; the graph looks the same when I do.

med_dists_unconstrained_20250130.json (15.2 KB)
mem_dists_unconstrained_20250130.json (2.2 KB)
pharm_dists_unconstrained_20250130.json (3.3 KB)