Maintaining Cardinality Between Tables

We have a database of claims, procedures, and diagnoses, along with a header table. The header table uniquely defines records by MBR_ID, CLM_ID, and CLM_HASH_KEY. The expected output is that every claim in the clm_details table will have multiple CLM_HASH_KEY values per CLM_ID.

We define the metadata such that claims, procedures, and diagnoses are children and linked to the header, based on the CLM_HASH_KEY:

Here is the (expected) distribution of values (number of unique hash keys, per claim id):

n
1      1584
2       531
3       139
4       121
5        70
6        57
7        57
8        24
9        12
10       22

and here is what the model generates:

n
1    9762

So clearly something is wrong with the model such that the relationship is not being captured properly. I have done a few tests and am not certain the issue here. I’m fairly certain it is probably a way I am specifying the metadata. Any advice on how to diagnose this would be appreciated.

Thanks very much sdv team for the support so far!

Hello @giovanni.circo ,

I believe that since CLM_HASH_KEY is being labeled as an id but not a foreign key, the column is just being generated as a random id. To confirm, after fitting your synthesizer, you can run

synthesizer.get_transformers('clm_details')['CLM_HASH_KEY']

If you see something like

AnonymizedFaker(function_name='bothify', function_kwargs={'text': 'sdv-id-??????'}

then it is just generating ids independently of the other tables for that column.

A work around might be to add CLM_HASH_KEY as a second foreign key for clm_details, clm_diag and clm_proc. Please let me know if that works!

Hi @andrew

Yes, going through some tests it looks like the primary issue is that the newly synthesized header table doesn’t maintain the correct cardinality (e.g. 1 MBR_ID should have 2-5 CLM_IDs, which should have 5-6 CLM_HASH_KEYS). However it does appear that splitting the tables up and adding HASH_KEYS as a new foreign key does work (albeit with a bit of extra work).

Thanks.

Hi @giovanni.circo,

Glad you were able to get it to work! For more reference, the reason I suggested that is because the model will not learn cardinality for columns that are labeled with an sdtype of id unless they are a primary or foreign keys. This is because by default, id columns are dropped before modeling.

Can I ask what extra work you needed to do?
Thanks!