How do you measure the risk of sensitive data leaking out in the synthetic data?

I’m starting this topic based on a question asked by @joachim.zaspel in this other thread. The original text is copy/pasted below:

I would like to know more about the inherent risks of sensitive data leaking out into synthetic data. I do not mean PII, but a combination of non-PII attributes that correlate with one individual. What are the risks or chances that the HSA Synthesizer actually leaks the original data, in the sense that a person is re-identified by the appearance of a combination of attribute values that match the original data?

Hi @joachim.zaspel, let’s use this thread for the discussion about data privacy around synthetic data.

I believe that your question is closely aligned with the concept of synthetic data disclosure risk. This concept is meant to evaluate the risk associated with sharing the synthetic data, since the synthetic data can encapsulate data patterns that make it easier to re-identify individuals or infer sensitive values in the real data.

To that end, I would recommend looking into the DisclosureProtection metric that we released recently with SDV Enterprise 0.21.0. This metric simulates what an attacker may be able to do with the synthetic data. It requires you to specify:

  • Which column(s) may an attacker already know? For example, this could be information that is publicly available, and
  • Which column(s) may an attacker want to guess? This is your sensitive data.

Based on this, the metric provides you with an estimate of how well the data patterns are protected.
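To make the idea concrete, here is a minimal sketch of the kind of attack the metric simulates. This is an illustration of the general concept, not the actual DisclosureProtection implementation: the function name, the toy data, and the column names (`age`, `zip`, `diagnosis`) are all made up for the example. The simulated attacker matches records on the "known" columns in the synthetic data and uses them to guess the sensitive value of real records:

```python
from collections import Counter, defaultdict

def simulated_attack_accuracy(real_rows, synthetic_rows, known_cols, sensitive_col):
    """Fraction of real records whose sensitive value an attacker can
    guess by matching the known attributes against the synthetic data."""
    # Attacker's view: for each combination of known values seen in the
    # synthetic data, collect the sensitive values that co-occur with it.
    lookup = defaultdict(Counter)
    for row in synthetic_rows:
        key = tuple(row[c] for c in known_cols)
        lookup[key][row[sensitive_col]] += 1

    correct = 0
    for row in real_rows:
        key = tuple(row[c] for c in known_cols)
        if key in lookup:
            # The attacker guesses the most common sensitive value
            # among the matching synthetic records.
            guess = lookup[key].most_common(1)[0][0]
            if guess == row[sensitive_col]:
                correct += 1
    return correct / len(real_rows)

# Toy data: ages and zip codes are "publicly known", diagnosis is sensitive.
real = [
    {"age": 34, "zip": "10001", "diagnosis": "A"},
    {"age": 34, "zip": "10001", "diagnosis": "B"},
    {"age": 51, "zip": "94110", "diagnosis": "C"},
]
synthetic = [
    {"age": 34, "zip": "10001", "diagnosis": "A"},
    {"age": 51, "zip": "94110", "diagnosis": "C"},
]

risk = simulated_attack_accuracy(real, synthetic, ["age", "zip"], "diagnosis")
protection = 1.0 - risk  # higher means the patterns are better protected
```

In this toy setup the attacker correctly guesses 2 of the 3 real records, so the "protection" score is low; with synthetic data that does not memorize rare attribute combinations, the attacker's accuracy drops and the score rises.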

Let us know if this is aligned with what you were hoping to measure.

Resources: