Background
I created a very simple data model and generated test data with a Python script to evaluate whether SDV (HSA Synthesizer) is able to generate synthetic data that preserves the correlations and data patterns found in the original data.

Data Model
The data model is about prescriptions: a drug is prescribed by a doctor for a patient on a date. Patients have a birthday (age), doctors have a doctor type, and drugs belong to a drug type and have a cost.

Problem
The original test data contains 7 hardcoded correlations, of which only 2 were preserved in the synthetic data generated by the SDV HSA Synthesizer.

Conclusion so far
At this point, synthetic data produced by the SDV HSA Synthesizer offers only very limited value to researchers aiming to derive meaningful insights from it.

Further Questions
Without knowing the data patterns in the original data a priori, is there a way to configure the SDV HSA Synthesizer such that it actually reproduces these patterns in the synthetic data?
I would also like to know more about the inherent risk of sensitive data leaking into the synthetic data. I do not mean PII, but a combination of non-PII attributes that correlates with one individual. What are the chances of the HSA Synthesizer actually leaking the original data, in the sense that a person can be re-identified because a combination of attribute values in the synthetic data matches the original data?
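One common way to probe exactly this risk is an exact-match check over quasi-identifiers: count how many synthetic rows reproduce a full attribute combination that occurs for exactly one individual in the real data. Below is a minimal, self-contained sketch of that idea; the column names and toy values are hypothetical and not taken from the attached test data:

```python
# Sketch of an exact-match leakage check over quasi-identifiers.
# Column names and data below are hypothetical illustrations.

def exact_match_leakage(real_rows, synthetic_rows, quasi_identifiers):
    """Fraction of synthetic rows whose quasi-identifier combination
    matches a combination that is unique to one real individual."""
    def key(row):
        return tuple(row[c] for c in quasi_identifiers)

    # Count how often each combination appears in the real data.
    counts = {}
    for row in real_rows:
        k = key(row)
        counts[k] = counts.get(k, 0) + 1

    # Combinations that single out exactly one real individual.
    unique_real = {k for k, n in counts.items() if n == 1}

    hits = sum(1 for row in synthetic_rows if key(row) in unique_real)
    return hits / len(synthetic_rows) if synthetic_rows else 0.0

# Toy data: birth year + doctor type + drug type as quasi-identifiers.
real = [
    {"birth_year": 1950, "doctor_type": "cardiologist", "drug_type": "statin"},
    {"birth_year": 1950, "doctor_type": "cardiologist", "drug_type": "statin"},
    {"birth_year": 1984, "doctor_type": "dermatologist", "drug_type": "retinoid"},
]
synthetic = [
    # Matches the unique 1984 individual: a potential re-identification.
    {"birth_year": 1984, "doctor_type": "dermatologist", "drug_type": "retinoid"},
    # Matches a combination shared by two real rows: not uniquely identifying.
    {"birth_year": 1950, "doctor_type": "cardiologist", "drug_type": "statin"},
]

rate = exact_match_leakage(real, synthetic, ["birth_year", "doctor_type", "drug_type"])
print(rate)  # 0.5
```

A nonzero rate does not prove leakage by itself (a combination can reappear by chance), but a rate well above what a holdout sample of real data would produce is a red flag.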
Attached you will find more details on the analysis, the Python script I used for generating the model / test data, the script I used for generating the synthetic data, and the metadata file:

- Data Synthetization with SDV HSA Synthesizer.pdf
- synthetic_data.py (10.0 KB)
- dummy_data.py (9.2 KB)
- sdv_metadata.json (1.9 KB)
Hi @joachim.zaspel, I wanted to thank you for the very informative writeup and detailed code for your issue. Regarding your first question: our team will take a look at your materials and respond here with some further thoughts and analysis.
Thanks again for providing the material. Our team was able to use it to create some test data and try it out using HSA. We were able to replicate the patterns you were seeing, and wanted to provide more color. Just so we’re all looking at the same data, I’m attaching a zip folder with the test data that we’re using (created using portions of your script in dummy_data.py). post_covid_synthetic_data.zip (149.8 KB)
We’ve also quantified the similarities and differences between the real and synthetic data using appropriate metrics from our SDMetrics library. We find that it generally makes for a more productive discussion if we can compute scores rather than eyeball the patterns.
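To illustrate the kind of score meant here, the sketch below is a minimal, self-contained version of a correlation-style metric (mirroring the idea behind SDMetrics' correlation-similarity scores, not the SDMetrics API itself): compare the Pearson correlation of a numeric column pair in the real vs. the synthetic table and map the difference to a 0-1 score. The column names and numbers are toy values, not the actual test data:

```python
# Compare how well a pairwise correlation survives synthesis.
# 1.0 = correlation perfectly preserved, 0.0 = maximally different.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_similarity(real_pair, synth_pair):
    r_real = pearson(*real_pair)
    r_synth = pearson(*synth_pair)
    # Correlations span [-1, 1], so the largest possible gap is 2.
    return 1 - abs(r_real - r_synth) / 2

# Toy example: age vs. drug cost perfectly correlated in the real data,
# but weakly (negatively) correlated in the synthetic data.
real_age,  real_cost  = [30, 40, 50, 60], [10, 20, 30, 40]
synth_age, synth_cost = [30, 40, 50, 60], [40, 10, 30, 20]

score = correlation_similarity((real_age, real_cost), (synth_age, synth_cost))
print(round(score, 3))  # 0.3
```

Scores like this can be computed per column pair, which makes it easy to see exactly which of the hardcoded correlations were preserved and which were lost.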
Please note that the team is out for the holidays right now, so we can send a few more notes once we regroup after the break. But I wanted to get this analysis out to you ASAP in case you’re able to take a look sooner.