Imagine that your transactions data has a column holding the name of the bank with which each transaction took place. This column could contain names such as Citibank, Bank of America, and Key Bank. Would it be possible to generate completely different bank names in the synthetic data, such as ABC Bank, XYZ Bank, and others?
Reasoning in support of this:
- We may not want the recipient of the synthetic data to know which banks we transact with, so anonymizing the names would be good.
Reasoning against this:
- If the downstream application has logic that depends on the original bank names, that logic would not be tested.
Why this is interesting
- These categorical columns, like bank name, are usually not considered PII or confidential, so they don't need to be protected the same way as phone numbers or addresses. However, it is understandable that a business may want to keep this information private: it may not want whoever receives the synthetic data to know which banks it transacts with.
- By extension, this applies to any categorical column. For example, say you have a column holding the US state; you may NOT want anyone receiving the synthetic data to know which states you transact in. So in the end, perhaps one feature here is to let users supply a mapping set of replacement values for each categorical column they want to anonymize. A warning, though: if downstream applications rely on the original values for testing, those tests may not work.
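To make the mapping idea concrete, here is a minimal sketch of what such a feature might look like. It is not taken from any existing tool: the function names (build_category_mapping, anonymize_column), the bank_name column, and the "QRS Bank" placeholder are all hypothetical and used only for illustration. The sketch assigns each distinct original value exactly one replacement value, so the distribution of the column is kept while the real names are hidden.

```python
import random


def build_category_mapping(values, replacements=None, prefix="Bank", seed=0):
    """Map each distinct original value to one replacement value.

    Hypothetical helper: if no replacements are supplied, generic labels
    such as "Bank 1", "Bank 2" are generated from the prefix.
    """
    distinct = sorted(set(values))
    if replacements is None:
        replacements = [f"{prefix} {i + 1}" for i in range(len(distinct))]
    # Drop duplicate replacements while keeping their order.
    unique = list(dict.fromkeys(replacements))
    if len(unique) < len(distinct):
        raise ValueError("need one unique replacement per distinct value")
    # Shuffle so the assignment does not follow the alphabetical order
    # of the original names; the seed keeps the mapping reproducible.
    random.Random(seed).shuffle(unique)
    return dict(zip(distinct, unique))


def anonymize_column(rows, column, mapping):
    """Return copies of the rows with the given column replaced."""
    return [{**row, column: mapping[row[column]]} for row in rows]


transactions = [
    {"amount": 120.0, "bank_name": "Citibank"},
    {"amount": 75.5, "bank_name": "Bank of America"},
    {"amount": 310.0, "bank_name": "Citibank"},
    {"amount": 42.0, "bank_name": "Key Bank"},
]

mapping = build_category_mapping(
    (row["bank_name"] for row in transactions),
    replacements=["ABC Bank", "XYZ Bank", "QRS Bank"],
)
synthetic = anonymize_column(transactions, "bank_name", mapping)
```

Because the mapping is one-to-one, rows that shared a bank in the original data still share a bank in the synthetic data. The warning above still applies: any downstream logic that checks for a literal name such as "Citibank" will never match the replaced values.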