How do you measure the risk of sensitive data leaking out in the synthetic data?

I’m starting this topic based on a question asked by @joachim.zaspel in this other thread. The original text is copy/pasted below:

I would like to know more about the inherent risks of sensitive data leaking out into synthetic data. I do not mean PII, but a combination of non-PII attributes that correlate with one individual. What are the risks or chances that the HSA Synthesizer actually leaks the original data, in the sense that a person is re-identified by the appearance of a combination of attribute values that match the original data?

Hi @joachim.zaspel, let’s use this thread for the discussion about data privacy around synthetic data.

I believe that your question is closely aligned with the concept of synthetic data disclosure risk. This concept is meant to evaluate the risk associated with sharing the synthetic data, since the synthetic data can encapsulate data patterns that make it easier to re-identify individuals or infer sensitive values in the real data.

To that end, I would recommend looking into the DisclosureProtection metric that we released recently with SDV Enterprise 0.21.0. This metric simulates what an attacker may be able to do with the synthetic data. It requires you to specify:

  • Which column(s) may an attacker already know? For example, this could be information that is publicly available, and
  • Which column(s) may an attacker want to guess? This is your sensitive data.

Based on this, the metric provides you with an estimate of how well the data patterns are protected.
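To make the idea concrete, here is a minimal sketch of the kind of attack the metric simulates. This is an illustration of the general concept, not the actual DisclosureProtection implementation: the function name, the toy data, and the column names (`age`, `zip`, `diagnosis`) are all made up for the example. The simulated attacker matches records on the "known" columns in the synthetic data and uses them to guess the sensitive value of real records:

```python
from collections import Counter, defaultdict

def simulated_attack_accuracy(real_rows, synthetic_rows, known_cols, sensitive_col):
    """Fraction of real records whose sensitive value an attacker can
    guess by matching the known attributes against the synthetic data."""
    # Attacker's view: for each combination of known values seen in the
    # synthetic data, collect the sensitive values that co-occur with it.
    lookup = defaultdict(Counter)
    for row in synthetic_rows:
        key = tuple(row[c] for c in known_cols)
        lookup[key][row[sensitive_col]] += 1

    correct = 0
    for row in real_rows:
        key = tuple(row[c] for c in known_cols)
        if key in lookup:
            # The attacker guesses the most common sensitive value
            # among the matching synthetic records.
            guess = lookup[key].most_common(1)[0][0]
            if guess == row[sensitive_col]:
                correct += 1
    return correct / len(real_rows)

# Toy data: ages and zip codes are "publicly known", diagnosis is sensitive.
real = [
    {"age": 34, "zip": "10001", "diagnosis": "A"},
    {"age": 34, "zip": "10001", "diagnosis": "B"},
    {"age": 51, "zip": "94110", "diagnosis": "C"},
]
synthetic = [
    {"age": 34, "zip": "10001", "diagnosis": "A"},
    {"age": 51, "zip": "94110", "diagnosis": "C"},
]

risk = simulated_attack_accuracy(real, synthetic, ["age", "zip"], "diagnosis")
protection = 1.0 - risk  # higher means the patterns are better protected
```

In this toy setup the attacker correctly guesses 2 of the 3 real records, so the "protection" score is low; with synthetic data that does not memorize rare attribute combinations, the attacker's accuracy drops and the score rises.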

Let us know if this is aligned with what you were hoping to measure.

Resources: