Anonymize a specific categorical column, for example Bank name

Imagine that your transactions data has a column holding the name of the bank with which each transaction happened. This column could have names like Citibank, Bank of America, Key Bank, etc. Would it be possible to create completely different bank names in the synthetic data, like ABC bank, XYZ bank and others?

Reasoning in support of this:

  • We may not want the receiver of the synthetic data to know which banks we transact with, so anonymizing would be good.

Reasoning against this:

  • If the downstream application has logic that uses the original bank names, that logic would not be tested.

Why this is interesting

  • Categorical columns like bank name are usually not considered PII or confidential, so they don’t need to be protected the same way as phone numbers or addresses. However, it is understandable that a business may want to keep this information private: it may not want whoever receives the synthetic data to know which banks it transacts with.

  • By extension, this applies to any categorical column. For example, say you have a column for US state; you may NOT want anyone receiving the synthetic data to know which states you transact with. So in the end, perhaps the feature here is to let the user supply a mapping set of values for the categorical columns they want to anonymize. A warning, though: if the downstream applications rely on those values for testing, it may not work.
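To make the idea of a "mapping set of values" concrete, here is a minimal sketch in plain Python. It is not an existing feature of any library; the helper names `build_category_mapping` and `anonymize_column`, the `Bank N` placeholder format, and the sample rows are all assumptions made up for illustration.

```python
from typing import Dict, Iterable, List


def build_category_mapping(values: Iterable[str], prefix: str = "Bank") -> Dict[str, str]:
    """Map each distinct original value to a placeholder such as 'Bank 1'.

    Hypothetical helper. Values are sorted only to make the result
    deterministic; a real implementation might shuffle them so the
    placeholder numbers do not hint at the alphabetical order of the originals.
    """
    mapping: Dict[str, str] = {}
    for value in sorted(set(values)):
        mapping[value] = f"{prefix} {len(mapping) + 1}"
    return mapping


def anonymize_column(rows: List[dict], column: str, mapping: Dict[str, str]) -> List[dict]:
    """Return copies of the rows with `column` replaced through `mapping`."""
    return [{**row, column: mapping[row[column]]} for row in rows]


# Made-up sample data, standing in for the (synthetic) transactions table.
transactions = [
    {"bank_name": "Citibank", "amount": 120.0},
    {"bank_name": "Bank of America", "amount": 75.5},
    {"bank_name": "Citibank", "amount": 10.0},
]

mapping = build_category_mapping(row["bank_name"] for row in transactions)
anonymized = anonymize_column(transactions, "bank_name", mapping)
```

Because the mapping is one-to-one, the category frequencies are unchanged and only the labels differ. Whoever holds the mapping can translate back, so it would need to stay with the data owner.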

@ashok.kumar.muthimen @neha

Hi @ashok.kumar.muthimen, I have marked this issue as a Feature Request.

To help us understand the use case a bit better, could you explain how this will work with your downstream application (software testing) where you plan to use the synthetic data?

Generally, we have seen that software testing suites may be testing specific rules with the synthetic data. For example, they may be testing specific logic for Citibank or Bank of America:

# software testing suite
if data['bank_name'] == 'Citibank':
    ...  # perform some test
elif data['bank_name'] == 'Bank of America':
    ...  # perform some other test

We are concerned that if the synthetic data contained random, new bank names (such as ABC bank or XYZ bank), you would not be able to use the synthetic data effectively for the software testing suite, because branches like the ones above would never match.

Please advise on this matter for your software testing suite. Thanks.