Generating complex relationships in multi-table data

The multi-table synthesizer does a good job out-of-the-box creating simple flat tables that are linked. However, most of our use cases involve complex many-to-one relationships. For example: A claims header table has a single claim id, but links to a diagnosis table that has between 1 to 20 diagnoses (1 per row).

Maintaining the proper cardinality seems difficult without doing some post-processing after generation.

I think this problem is related to a similar feature request here. I am not sure if you have other suggestions.

Hi @giovanni.circo, thanks for starting this topic. I’d love to understand a bit more about the scenario you’re mentioning so we don’t miss anything.

Maintaining the proper cardinality seems difficult without doing some post-processing after generation.

Is this something you have already observed in the synthetic data you’re creating?

Our multi-table synthesizers are designed to learn the min/max cardinality as well as the most and least frequent cardinalities. In your example, a claim (parent table) may be connected to many diagnoses (child table). The synthesizer should learn:

  • Each claim must be connected to at least 1 diagnosis (min cardinality))
  • Each claim must be connected to at most 20 diagnoses (max cardinality)
  • The typical # of diagnoses for each claim (eg. 50% of claims are connected 1 diagnosis each, 10% have 2 diagnoses, etc,)

The synthetic data should then follow these patterns. Of course, I am assuming that the claims table has 1 row per claim and the diagnosis table has 1 row per diagnosis (linked to a claim).

Is there something I’m missing? Or did you try it out and observe issues with the synthetic data? Perhaps an example may help.