Duplicated values within a group

metadata.xlsx (18.1 KB)

Hi Datacebo Team,
After finishing a run with the attached metadata, we noticed that the YearMo column in the synthetic mem_yearmo table contains duplicated values for a given MemberID. YearMo values should fall within the range 201901-201912 and cannot repeat for a given MemberID; e.g. a member can have one 201901 record and one 201902 record, but not two 201903 records. We understand the metadata should be adjusted to enforce this. Can you advise how?

Hi @yuntien.lee, thanks for reaching out and attaching the file. I have a quick workaround that’ll get you unblocked, as well as some follow-up questions that will help with a longer-term solution.

Quick Workaround

To get you unblocked ASAP, I would recommend sampling more synthetic data than you need and then dropping the duplicates in mem_yearmo. Since you will be dropping rows in a multi-table context, I would also recommend running our clean-up function at the end to enforce referential integrity.

Assuming your synthesizer is already fit, this is what you can do:

from sdv.utils import drop_unknown_references

# synthesize more data than is necessary, here I am doing 5x
synthetic_data = synthesizer.sample(scale=5.0)

# drop the duplicates in the mem_yearmo 
mem_yearmo = synthetic_data['mem_yearmo']
mem_yearmo_cleaned = mem_yearmo.drop_duplicates(subset=['MemberID', 'YearMo'])

synthetic_data['mem_yearmo'] = mem_yearmo_cleaned

# now clean up the data to ensure referential integrity
cleaned_data = drop_unknown_references(synthetic_data, metadata)

And you should be good to go!
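As a sanity check after the drop, you can confirm that no (MemberID, YearMo) pair repeats. A minimal sketch with a toy table standing in for the synthesizer output:

```python
import pandas as pd

# toy stand-in for synthetic_data['mem_yearmo']; note the duplicated 201902 row
mem_yearmo = pd.DataFrame({
    'MemberID': [1, 1, 1, 2],
    'YearMo':   [201901, 201902, 201902, 201901],
})

cleaned = mem_yearmo.drop_duplicates(subset=['MemberID', 'YearMo'])

# after cleaning, no (MemberID, YearMo) pair should appear twice
assert not cleaned.duplicated(subset=['MemberID', 'YearMo']).any()
print(len(cleaned))  # 3 rows remain: the duplicate 201902 record was dropped
```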

Longer Term Qs

Single table vs. sequential: I am wondering if it is fair to categorize mem_yearmo as sequential data rather than single-table data? One key differentiator is that single-table data generally has rows that are independent of each other. Your data seems to have some dependency between rows (a unique date for each MemberID) as well as an implied order (YearMo is actually a date). You can read more about sequential data here and single table data here.
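To illustrate the "YearMo is actually a date" point: the YYYYMM integers parse directly into real monthly timestamps, which is exactly the kind of natural ordering a sequential model would exploit. A minimal sketch with toy values, not your data:

```python
import pandas as pd

# a few YearMo values, out of order
yearmo = pd.Series([201903, 201901, 201902])

# parse the YYYYMM integers into real timestamps
dates = pd.to_datetime(yearmo.astype(str), format='%Y%m')

# sorting them recovers the chronological month order
print(dates.sort_values().dt.strftime('%Y-%m').tolist())
# ['2019-01', '2019-02', '2019-03']
```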

Note that right now, our multi-table models do not support sequential modeling. So this may be a new feature request to consider.

Limited scalability: Generally speaking, SDV synthesizers assume that the data can be scaled almost indefinitely (10x, 100x, etc.). However, your requirement below would ultimately limit the synthetic data size:

YearMo values should be within the range of 201901-201912 and cannot be repeated

I’m curious whether, for your model, it would be OK for the values to forecast past 201912 into 202001, 202002, etc.? Or are you not looking to scale this part?
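For reference, if forecasting past 201912 did turn out to be acceptable, the continuation months are easy to enumerate with pandas period arithmetic. A sketch (the 3-month horizon is arbitrary):

```python
import pandas as pd

# the last in-range month from the requirement
last = pd.Period('2019-12', freq='M')

# the next 3 YearMo values after 201912, in the same YYYYMM format
future = [(last + i).strftime('%Y%m') for i in range(1, 4)]
print(future)  # ['202001', '202002', '202003']
```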

Since we have a limited amount of memory, we got OOM errors even when we set scale=2.0, so we have to seek other workarounds. And you are right, mem_yearmo should be thought of as sequential data because a record in a certain YearMo may depend on a record of a YearMo in the past. For this dataset we think it would be better to keep the data within the 201901-201912 boundary, but yes, we are looking to extend the dates if we have more data.

Thanks for confirming. We can create and track a new feature request for accommodating sequential data inside of a multi-table case.

Since we have a limited amount of memory, we got OOM errors even when we set scale=2.0, so we have to seek other workarounds.

In this case, I would suggest keeping scale=1.0 (the default) but repeating that process a few times to get enough data. Every time you call sample, the synthesizer creates new IDs anyway, so you can append the results of each iteration to a file.

for i in range(5):  # run for 5 iterations
    synthetic_data = synthesizer.sample(scale=1.0)

    # drop the duplicates in the mem_yearmo table
    mem_yearmo = synthetic_data['mem_yearmo']
    mem_yearmo_cleaned = mem_yearmo.drop_duplicates(subset=['MemberID', 'YearMo'])

    synthetic_data['mem_yearmo'] = mem_yearmo_cleaned

    # now clean up the data to ensure referential integrity
    cleaned_data = drop_unknown_references(synthetic_data, metadata)

    # append the results of cleaned_data to CSV files
    # (write the header only on the first iteration, and skip the index column)
    cleaned_data['mem_yearmo'].to_csv('my_data.csv', mode='a', header=(i == 0), index=False)
    ... # repeat for all tables
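One pitfall with append mode: by default pandas writes a header row on every to_csv call, so repeated MemberID,YearMo header lines end up mixed into the data. A self-contained sketch of the append-and-read-back pattern (the file name is illustrative):

```python
import os
import pandas as pd

path = 'my_data.csv'
if os.path.exists(path):
    os.remove(path)  # start fresh

for i in range(3):
    # stand-in for one iteration's cleaned synthetic table
    chunk = pd.DataFrame({'MemberID': [i], 'YearMo': [201901 + i]})
    # header only on the first write; no index column
    chunk.to_csv(path, mode='a', header=(i == 0), index=False)

combined = pd.read_csv(path)
print(len(combined))  # 3 rows, one per iteration
```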