[Resolved] Specifying regex format for ID columns

Hi @ashok.kumar.muthimen for ID columns you can specify regex format so what is generated follows the format. This is in the metadata. You can read the documentation here: Sdtypes | Synthetic Data Vault

You can edit the metadata through python itself by following these guidelines in docs here Creating Metadata | Synthetic Data Vault

Can you try updating this for all ID columns in the single table case except for the primary key and let us know ?

Hi Kalyan,

Could you please share the code

For the column that starts with USMMBI followed by a number of 12 digits, this would be the regex that will generate them:

metadata.update_column('<column_name>', sdtype='id', regex_format='USMMBI[0-9]{12}')

The output for this column would look like this:

USMMBI000000000000 
USMMBI000000000001 
USMMBI000000000002

metadata.update_column(table_name=‘table’,column_name=‘ADMIN_CLIENT_ID_HUB’, sdtype=‘id’, regex_format=‘USMMBI[0-9]{9}’)

metadata.update_column(table_name=‘table’,column_name=‘CONT_ID’, sdtype=‘id’, regex_format=‘[0-9]{18}’)

I see CONT_ID is generated same as without Regex
0
1
2

USMMBI000000000
USMMBI000000001

CONT_ID is primary key is different so we can see how to handle. We will put in a new thread.

For the rest of the IDs is it generating fine? Also you can do bulk updates for all columns. That is you can specify for multiple columns at once. On this page in docs: Single Table Metadata API | Synthetic Data Vault

There is a function called update_columns_metadata that you can use to bulk update.

One question is, after USMMBI the numbers are in sequence USMMBI00000001 and then USMMBI00000002, is that ok?

Hi @ashok.kumar.muthimen if multiple columns regex format need to specified, you can use the metadata.update_columns functionality and pass a dictionary. The documentation is here and we can provide a code snippet if necessary.

Summary for this issue so far

  • For ID columns (that are not primary key), for security reasons, SDV enterprise automatically identifies them as possible PII and creates “sdv-xxxx” type IDs.
  • Team desired that the ID format be preserved, for example one column has: USMMBIxxxxxxxxxxxx. This can be done by updating the metadata with regex format.
  • If multiple columns need to be updated, that could be done via using the metadata.update_columns functionality supported here.
  • This will preserve format, but would generate in sequence still, that is,
USMMBI000000000000 
USMMBI000000000001 
USMMBI000000000002

It is still meaningful data though it is in sequence. Scrambling this would be available natively in SDV

Hi everyone, it looks like we have provided an answer for how to provide a regex format in the metadata. I will mark this topic as resolved.