Hi @yuntien.lee, thanks for reaching out and attaching the file. I have a quick workaround that should get you unblocked, as well as some follow-up questions that will help with the longer-term solution.
Quick Workaround
To get you unblocked ASAP, I would recommend sampling more synthetic data than you need and then dropping the duplicates in mem_yearmo. Since you will be dropping rows in a multi-table context, I would also recommend running our clean-up function at the end to enforce referential integrity.
Assuming your synthesizer is already fit, this is what you can do:
```python
from sdv.utils import drop_unknown_references

# synthesize more data than is necessary, here I am doing 5x
synthetic_data = synthesizer.sample(scale=5.0)

# drop the duplicates in the mem_yearmo table
mem_yearmo = synthetic_data['mem_yearmo']
mem_yearmo_cleaned = mem_yearmo.drop_duplicates(subset=['MemberID', 'YearMo'])
synthetic_data['mem_yearmo'] = mem_yearmo_cleaned

# now clean up the data to ensure referential integrity
cleaned_data = drop_unknown_references(synthetic_data, metadata)
```
And you should be good to go!
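If you want to double-check the result, here is a minimal sketch of a uniqueness check you could run on the cleaned table. The helper name `has_duplicate_pairs` and the toy values are mine, purely for illustration; in practice you would pass in your own mem_yearmo table.

```python
import pandas as pd


def has_duplicate_pairs(table, keys=("MemberID", "YearMo")):
    """Return True if any combination of the key columns appears more than once."""
    return bool(table.duplicated(subset=list(keys)).any())


# Toy stand-in for a sampled mem_yearmo table (made-up values, not real output)
toy = pd.DataFrame({
    "MemberID": [1, 1, 1, 2, 2],
    "YearMo": [201901, 201901, 201902, 201901, 201901],
})

# Same de-duplication step as in the workaround
deduped = toy.drop_duplicates(subset=["MemberID", "YearMo"])

print(has_duplicate_pairs(toy))      # True: the raw sample repeats some pairs
print(has_duplicate_pairs(deduped))  # False: each (MemberID, YearMo) is now unique
print(len(deduped))                  # 3 rows survive out of 5
```

As the toy example shows, dropping duplicates shrinks the table, which is why it helps to oversample first.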
Longer Term Qs
Single table vs. sequential: I am wondering whether mem_yearmo is better categorized as sequential data rather than single table data. One key differentiator is that single table data generally has rows that are independent of each other. Your data seems to have some dependency between rows (a unique date for each member ID) as well as an implied order (YearMo is actually a date). You can read more about sequential data here and single table data here.
Note that right now, our multi-table models do not support sequential modeling. So this may be a new feature request to consider.
Limited scalability: Generally speaking, SDV synthesizers assume that the data can be scaled almost indefinitely (10x, 100x, etc.). However, your requirement below would ultimately limit the synthetic data size:
YearMo values should be within the range of 201901-201912 and cannot be repeated
I’m curious whether, for your model, it would be OK for the values to continue past 201912 into 202001, 202002, etc. Or are you not looking to scale this part?
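To make the size ceiling concrete, here is a rough back-of-the-envelope sketch in plain Python. The helper `yearmo_range` and the member count are hypothetical and only there to illustrate the arithmetic: with 12 allowed YearMo values and no repeats per member, mem_yearmo can never hold more than 12 rows per member.

```python
def yearmo_range(start, end):
    """List YYYYMM integers from start to end, inclusive."""
    values = []
    year, month = divmod(start, 100)
    while year * 100 + month <= end:
        values.append(year * 100 + month)
        month += 1
        if month > 12:
            # roll over into January of the next year
            year, month = year + 1, 1
    return values


allowed = yearmo_range(201901, 201912)
n_members = 1000  # hypothetical number of synthetic members

# Each member can use each allowed YearMo at most once
max_rows = n_members * len(allowed)

print(len(allowed))  # 12
print(max_rows)      # 12000

# Allowing values to run into 2020 raises the ceiling
extended = yearmo_range(201901, 202002)
print(len(extended))  # 14
```

So the only ways to grow this table are to add more members or to widen the allowed date range, which is why I am asking about the range.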