Background
I created a very simple data model and generated test data with a Python script to evaluate whether SDV (HSA Synthesizer) is able to generate synthetic data that preserves the correlations and data patterns found in the original data.

Data Model
The data model is about prescriptions: a drug is prescribed by a doctor for a patient on a date. Patients have a birthday (age), doctors have a doctor type, and drugs belong to a drug type and have a cost.

Problem
The original test data contains 7 hardcoded correlations, of which only 2 were preserved in the synthetic data generated by the SDV HSA Synthesizer.

Conclusion so far
At this point, synthetic data produced by the SDV HSA Synthesizer offers only very limited value to researchers aiming to derive meaningful insights from it.

Further Questions
Without knowing the data patterns in the original data a priori, is there a way to configure the SDV HSA Synthesizer such that it actually reproduces these patterns in the synthetic data?
I would also like to know more about the inherent risk of sensitive data leaking into the synthetic data. I do not mean PII, but a combination of non-PII attributes that correlates with one individual. What are the chances of the HSA Synthesizer actually leaking the original data, in the sense that a person can be re-identified because a combination of attribute values in the synthetic data matches the original data?
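One common way to probe exactly this risk is an exact-match check over quasi-identifiers: count how many synthetic rows reproduce a full attribute combination that occurs for exactly one individual in the real data. Below is a minimal, self-contained sketch of that idea; the column names and toy values are hypothetical and not taken from the attached test data:

```python
# Sketch of an exact-match leakage check over quasi-identifiers.
# Column names and data below are hypothetical illustrations.

def exact_match_leakage(real_rows, synthetic_rows, quasi_identifiers):
    """Fraction of synthetic rows whose quasi-identifier combination
    matches a combination that is unique to one real individual."""
    def key(row):
        return tuple(row[c] for c in quasi_identifiers)

    # Count how often each combination appears in the real data.
    counts = {}
    for row in real_rows:
        k = key(row)
        counts[k] = counts.get(k, 0) + 1

    # Combinations that single out exactly one real individual.
    unique_real = {k for k, n in counts.items() if n == 1}

    hits = sum(1 for row in synthetic_rows if key(row) in unique_real)
    return hits / len(synthetic_rows) if synthetic_rows else 0.0

# Toy data: birth year + doctor type + drug type as quasi-identifiers.
real = [
    {"birth_year": 1950, "doctor_type": "cardiologist", "drug_type": "statin"},
    {"birth_year": 1950, "doctor_type": "cardiologist", "drug_type": "statin"},
    {"birth_year": 1984, "doctor_type": "dermatologist", "drug_type": "retinoid"},
]
synthetic = [
    # Matches the unique 1984 individual: a potential re-identification.
    {"birth_year": 1984, "doctor_type": "dermatologist", "drug_type": "retinoid"},
    # Matches a combination shared by two real rows: not uniquely identifying.
    {"birth_year": 1950, "doctor_type": "cardiologist", "drug_type": "statin"},
]

rate = exact_match_leakage(real, synthetic, ["birth_year", "doctor_type", "drug_type"])
print(rate)  # 0.5
```

A nonzero rate does not prove leakage by itself (a combination can reappear by chance), but a rate well above what a holdout sample of real data would produce is a red flag.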
Attached you will find more details on the analysis, the Python script I used for generating the model / test data, the script I used for generating the synthetic data, and the metadata file:

- Data Synthetization with SDV HSA Synthesizer.pdf
- synthetic_data.py (10.0 KB)
- dummy_data.py (9.2 KB)
- sdv_metadata.json (1.9 KB)
Hi @joachim.zaspel, I wanted to thank you for the very informative writeup and detailed code for your issue. Regarding your first question: our team will take a look at your materials and respond here with some further thoughts and analysis.
Thanks again for providing the material. Our team was able to use it to create some test data and try it out using HSA. We were able to replicate the patterns you were seeing, and wanted to provide more color. Just so we’re all looking at the same data, I’m attaching a zip folder with the test data that we’re using (created using portions of your script in dummy_data.py). post_covid_synthetic_data.zip (149.8 KB)
We’ve also quantified the similarities and differences between the real and synthetic data using appropriate metrics from our SDMetrics library. We find that it generally makes for a more productive discussion if we can compute scores rather than eyeball the patterns.
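To illustrate the kind of score meant here, the sketch below is a minimal, self-contained version of a correlation-style metric (mirroring the idea behind SDMetrics' correlation-similarity scores, not the SDMetrics API itself): compare the Pearson correlation of a numeric column pair in the real vs. the synthetic table and map the difference to a 0-1 score. The column names and numbers are toy values, not the actual test data:

```python
# Compare how well a pairwise correlation survives synthesis.
# 1.0 = correlation perfectly preserved, 0.0 = maximally different.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_similarity(real_pair, synth_pair):
    r_real = pearson(*real_pair)
    r_synth = pearson(*synth_pair)
    # Correlations span [-1, 1], so the largest possible gap is 2.
    return 1 - abs(r_real - r_synth) / 2

# Toy example: age vs. drug cost perfectly correlated in the real data,
# but weakly (negatively) correlated in the synthetic data.
real_age,  real_cost  = [30, 40, 50, 60], [10, 20, 30, 40]
synth_age, synth_cost = [30, 40, 50, 60], [40, 10, 30, 20]

score = correlation_similarity((real_age, real_cost), (synth_age, synth_cost))
print(round(score, 3))  # 0.3
```

Scores like this can be computed per column pair, which makes it easy to see exactly which of the hardcoded correlations were preserved and which were lost.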
Please note that the team is out for the holidays right now, so we can send a few more notes once we regroup after the break. But I wanted to get this analysis out to you ASAP in case you’re able to take a look sooner.