Generate IDs using complex Regexes

epicvu · March 26, 2025, 3:11pm

Hi!
I’m currently trying to generate some data using the DayZSynthesizer, and I want to give an id column a specific regex. Until now I have not had any problems with my regexes, but this one crashes when I try to generate the data.

(?:22|01|29|13|02|20|14|12|07|05|17|18|11|32)[0-9]{8}

It should select a two-digit prefix from a list of numbers and append 8 random digits.

I also tried adding brackets, (?:[22|01|29|13|02|20|14|12|07|05|17|18|11|32])[0-9]{8}, but that doesn’t select from the list.
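For readers wondering why the bracketed version does not select from the list: in standard regex syntax, square brackets define a character class, so everything inside them is treated as a set of single characters (here some digits plus a literal |) rather than as two-digit alternatives. The short check below uses Python's built-in re module, not the SDV generator, to illustrate the difference. The names PREFIXES, alternation and char_class are only for this sketch.

```python
import re

# The two-digit prefixes from the original pattern.
PREFIXES = "22|01|29|13|02|20|14|12|07|05|17|18|11|32"

# Alternation: one of the listed two-digit prefixes, then 8 digits (10 characters).
alternation = re.compile(rf"(?:{PREFIXES})[0-9]{{8}}")

# Character class: ONE character drawn from the set of characters inside the
# brackets (digits 0,1,2,3,4,5,7,8,9 and a literal '|'), then 8 digits (9 characters).
char_class = re.compile(rf"(?:[{PREFIXES}])[0-9]{{8}}")

# The alternation accepts a listed prefix and rejects an unlisted one.
alternation_accepts_listed = alternation.fullmatch("2212345678") is not None
alternation_accepts_unlisted = alternation.fullmatch("9912345678") is not None

# The character class matches 9-character strings, including one starting with '|',
# and no longer matches the intended 10-character value.
class_accepts_nine_chars = char_class.fullmatch("212345678") is not None
class_accepts_pipe = char_class.fullmatch("|12345678") is not None
class_accepts_intended = char_class.fullmatch("2212345678") is not None
```

So the bracketed pattern is valid, but it describes a different set of strings than the intended one.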

My error looks like this:

File "sdv_enterprise/sdv/multi_table/dayz/day_zero.pyx", line 90, in sdv_enterprise.sdv.multi_table.dayz.day_zero.expirable.wrapper
File "sdv_enterprise/sdv/multi_table/dayz/day_zero.pyx", line 602, in sdv_enterprise.sdv.multi_table.dayz.day_zero.DayZSynthesizer.load
File "sdv_enterprise/sdv/multi_table/dayz/day_zero.pyx", line 603, in sdv_enterprise.sdv.multi_table.dayz.day_zero.DayZSynthesizer.load
File "sdv_enterprise/rdt/transformers/id/id.pyx", line 60, in sdv_enterprise.rdt.transformers.id.id.RandomRegexGenerator.setstate
File "sdv_enterprise/rdt/transformers/id/utils.pyx", line 203, in sdv_enterprise.rdt.transformers.id.utils.random_strings_from_regex
KeyError: BRANCH

This is the regex that I want to apply. Updating the metadata seems to work, but the values are not generated when it comes to sampling.

Thanks,

Charles -

epicvu · March 27, 2025, 8:57am

Hi,

I also wanted to add that I found a workaround: I generate only the 8 digits and add the prefix afterwards in my Jupyter Notebook. I would still like to use the regex though…

I am using the DayZSynthesizer.
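The workaround described above could look roughly like the following sketch. It assumes the synthesizer is given the simpler regex [0-9]{8} and that the prefix is attached in plain Python afterwards. The function name add_region_prefix and the stand-in list of suffixes are hypothetical, and the random suffixes here only simulate what the sampled column might contain.

```python
import random
import re

# The allowed two-digit prefixes, taken from the original regex.
PREFIXES = ["22", "01", "29", "13", "02", "20", "14",
            "12", "07", "05", "17", "18", "11", "32"]

# The full pattern the final IDs are expected to satisfy.
TARGET = re.compile(r"(?:22|01|29|13|02|20|14|12|07|05|17|18|11|32)[0-9]{8}")


def add_region_prefix(suffixes, prefixes=PREFIXES, seed=None):
    """Prepend a randomly chosen prefix to each 8-digit suffix (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.choice(prefixes) + suffix for suffix in suffixes]


# Stand-in for the 8-digit column produced with the simple regex [0-9]{8}.
suffix_rng = random.Random(0)
suffixes = [f"{suffix_rng.randrange(10**8):08d}" for _ in range(1000)]

special_ids = add_region_prefix(suffixes, seed=1)

# Every resulting ID should match the pattern that could not be used directly.
all_match = all(TARGET.fullmatch(value) for value in special_ids)
```

Validating the result against the original pattern with re.fullmatch is a cheap way to confirm that the post-processed column has the intended format.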

neha · March 27, 2025, 1:34pm

Hi @epicvu, I’m glad you were able to find a workaround.

Currently, our Regex generation tool does not support complex regexes with OR logic. We’ve described this in our RDT FAQs. Click here to read. I recognize that this information is not easy to find. We’re working ASAP to provide this info upfront in both the docs and the code.

In the meantime, I will also file a feature request internally to be able to support more complex Regexes. Out of curiosity, what does this column represent in your data? I see that it’s overall a 10-digit value where the first two digits have to be from a specific list of values (22, 01, 29, 13, …). Do the first two digits have any specific meaning to your downstream application?

epicvu · March 27, 2025, 2:08pm

Hi @neha, thanks for the information.
It was quite tricky to find, so thanks for the link.

The numbers represent regions (two-digit French postcodes, to be exact). The value is not derived from a specific column in the data; it just has to be informative.

neha · March 27, 2025, 3:07pm

Appreciate the information @epicvu. Happy to help find the right info for you.

The numbers represent regions (two-digit French postcodes, to be exact).

I see online that French postcodes are generally 5 digits long. However, your prefix is 2 digits and the total Regex is for a 10-digit value. So is the overall value a postcode or some other address-related concept?

I am asking because if your column represents a real world concept (such as a postcode), then the recommended approach is to specify that real-world concept as the sdtype in your metadata. For example, the following code should generate 5-digit French postcodes:

# update the column's sdtype to a real-world concept
metadata.update_column(
    table_name='my_table_name',
    column_name='my_column_name',
    sdtype='postcode')

metadata.save_to_json(filepath='my_updated_metadata.json')

# specify the French locale when creating the synthesizer
synthesizer = DayZSynthesizer(metadata, locales=['fr_FR'])
synthesizer.sample(num_rows=10)

This should produce postcodes such as 52462, 97640, 38852, ...

For more information on the possible real-world concepts, you can click here to see the docs.

epicvu · March 27, 2025, 4:14pm

Yeah, I would use the sdtype postcode if I had to. In this case I wanted to generate an “id” made of the 2 postcode digits followed by 8 random digits. Ultimately I generated 50k unique ids, and when I added the special prefix and counted the unique ids using the command

len(list(set(df["special_id"])))

I would find either 49999 or 50000 unique ids, so it suited my needs. The data isn’t really used at the moment; it is an experiment to replicate data we have encountered and new id types in the company.
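As a rough sanity check on the 49999 or 50000 figure: if the 10-digit values behaved like independent, uniform draws over all 14 prefixes and all 8-digit suffixes (an assumption made for this sketch, not something stated in the thread), the expected number of colliding pairs among 50,000 values can be estimated with the usual birthday-problem approximation.

```python
from math import comb

n_ids = 50_000                    # number of generated ids
n_prefixes = 14                   # prefixes listed in the original regex
id_space = n_prefixes * 10**8     # distinct 10-digit values the pattern allows

# Expected number of pairs sharing the same value, assuming independent,
# uniform draws from id_space (assumption for illustration only).
expected_colliding_pairs = comb(n_ids, 2) / id_space

print(f"{expected_colliding_pairs:.2f}")  # about 0.89
```

Under that assumption, roughly one duplicate per 50,000 values is about what one would expect, which is in line with occasionally seeing 49999 unique ids. If the 8-digit part were guaranteed unique on its own, adding a prefix could not introduce duplicates.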
