Generate IDs using complex Regexes

Hi!
I’m currently trying to generate some data using the DayZSynthesizer, and I want to give an ID column a specific regex. Until now I have not encountered any problems with my regexes, but this one seems to crash when I generate the data.

(?:22|01|29|13|02|20|14|12|07|05|17|18|11|32)[0-9]{8}

It should select a prefix from a list of two-digit numbers and add 8 random digits afterwards.

I also tried adding brackets, (?:[22|01|29|13|02|20|14|12|07|05|17|18|11|32])[0-9]{8}, but then it doesn’t select from the list.
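For reference, here is a small sketch using Python's built-in re module (not the SDV regex generator, which may behave differently) that shows why the bracketed version does not select from the list. Square brackets create a character class, so the bracketed pattern describes a single character from the set of digits and "|" that appear in the list, followed by 8 digits. The variable names are only illustrative.

```python
import re

PREFIXES = "22|01|29|13|02|20|14|12|07|05|17|18|11|32"

# Non-capturing group: alternation between whole two-digit prefixes.
group_pattern = re.compile(rf"(?:{PREFIXES})[0-9]{{8}}")

# Square brackets: a character class, i.e. ONE character out of the
# characters listed inside it (some digits and "|"), then 8 digits.
class_pattern = re.compile(rf"(?:[{PREFIXES}])[0-9]{{8}}")

ten_digit_id = "2212345678"   # prefix 22 (in the list) + 8 digits
bad_prefix_id = "9912345678"  # prefix 99 is not in the list

group_accepts_valid = group_pattern.fullmatch(ten_digit_id) is not None
group_rejects_bad = group_pattern.fullmatch(bad_prefix_id) is None

# The bracketed pattern only describes 9 characters, so a 10-digit ID
# cannot match it, while a string starting with "|" can.
class_rejects_ten_digits = class_pattern.fullmatch(ten_digit_id) is None
class_accepts_pipe = class_pattern.fullmatch("|12345678") is not None

print(group_accepts_valid, group_rejects_bad)
print(class_rejects_ten_digits, class_accepts_pipe)
```

So the first pattern is the one that expresses the intent; the bracketed one is a different, shorter pattern.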

My error looks like this:

File "sdv_enterprise/sdv/multi_table/dayz/day_zero.pyx", line 90, in sdv_enterprise.sdv.multi_table.dayz.day_zero.expirable.wrapper
File "sdv_enterprise/sdv/multi_table/dayz/day_zero.pyx", line 602, in sdv_enterprise.sdv.multi_table.dayz.day_zero.DayZSynthesizer.load
File "sdv_enterprise/sdv/multi_table/dayz/day_zero.pyx", line 603, in sdv_enterprise.sdv.multi_table.dayz.day_zero.DayZSynthesizer.load
File "sdv_enterprise/rdt/transformers/id/id.pyx", line 60, in sdv_enterprise.rdt.transformers.id.id.RandomRegexGenerator.setstate
File "sdv_enterprise/rdt/transformers/id/utils.pyx", line 203, in sdv_enterprise.rdt.transformers.id.utils.random_strings_from_regex
KeyError: BRANCH

This is the regex that I want to apply. Updating the metadata seems to work, but at sampling time it fails to generate the values.

Thanks,

Charles -

Hi,

I also wanted to add that I found a workaround: I generate just the 8 digits and add the prefix afterwards in my Jupyter Notebook. I would still like to use the regex though…
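A minimal sketch of that workaround, using only the Python standard library. The helper name add_region_prefix and the randomly generated suffixes are hypothetical stand-ins; in practice the 8-digit values would come from the synthesizer output (for example, from a column generated with the simpler regex [0-9]{8}).

```python
import random

# The allowed two-digit prefixes from the original regex.
REGION_PREFIXES = ["22", "01", "29", "13", "02", "20", "14",
                   "12", "07", "05", "17", "18", "11", "32"]


def add_region_prefix(suffixes, prefixes=REGION_PREFIXES, seed=None):
    """Prepend a randomly chosen prefix to each 8-digit suffix.

    Hypothetical helper for illustration; not part of SDV.
    """
    rng = random.Random(seed)
    return [rng.choice(prefixes) + suffix for suffix in suffixes]


# Stand-in for the synthesizer's 8-digit column.
suffix_rng = random.Random(0)
suffixes = [f"{suffix_rng.randrange(10**8):08d}" for _ in range(5)]

special_ids = add_region_prefix(suffixes, seed=1)
print(special_ids)
```

Note that two different suffixes always give different IDs, so uniqueness of the final column depends on the uniqueness of the generated 8-digit part.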

I am using the DayZSynthesizer.

Hi @epicvu, I’m glad you were able to find a workaround.

Currently, our Regex generation tool does not support complex regexes with OR logic. We’ve described this in our RDT FAQs. I recognize that this information is not easy to find. We’re working to provide this info upfront in both the docs and the code as soon as possible.

In the meantime, I will also file a feature request internally to support more complex Regexes. Out of curiosity, what does this column represent in your data? I see that it’s overall a 10-digit value where the first two digits have to be from a specific list of values (22, 01, 29, 13, …). Do the first two digits have any specific meaning to your downstream application?

Hi @neha, thanks for the information.
It is quite tricky to find, so thanks for the link.

The numbers represent regions (two-digit French postcodes, to be exact). The value is not derived from a specific column in the data; it just has to be informative.

Appreciate the information @epicvu. Happy to help find the right info for you.

The numbers represent regions (two-digit French postcodes, to be exact).

I see online that French postcodes are generally 5 digits long. However, your prefix is 2 digits and the total Regex is for a 10-digit value. So does the overall value represent a postcode or some other address-related concept?

I am asking because if your column represents a real world concept (such as a postcode), then the recommended approach is to specify that real-world concept as the sdtype in your metadata. For example, the following code should generate 5-digit French postcodes:

# update the column's sdtype to a real-world concept
metadata.update_column(
    table_name='my_table_name',
    column_name='my_column_name',
    sdtype='postcode')

metadata.save_to_json(filepath='my_updated_metadata.json')

# specify the French locale when creating the synthesizer
synthesizer = DayZSynthesizer(metadata, locales=['fr_FR'])
synthesizer.sample(num_rows=10)

This should produce postcodes such as 52462, 97640, 38852, ...

For more information on the possible real-world concepts, see the docs.

Yeah, I would use the postcode sdtype if I had to. In this case I wanted to generate an “id” made of the 2 postcode digits followed by 8 random digits. Ultimately I generated 50k ids, added the special prefix, and counted the unique ids using the command

len(list(set(df["special_id"])))

I would find either 49999 or 50000 unique ids, so it suited my needs. The data isn’t really used at the moment; it is an experiment to replicate data and new id types encountered in the company.
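As a rough check on those numbers: if the prefix and the 8 digits were both drawn independently and uniformly (an assumption, since the thread does not say how the values were drawn), there would be 14 × 10^8 possible ids, and 50,000 draws would give about n²/(2N) ≈ 0.9 colliding pairs on average, which is consistent with seeing 49999 or 50000. If strict uniqueness is ever needed, a sketch like the following could count duplicates and redraw until every id is distinct. Both helper names are hypothetical and use only the standard library.

```python
import random

REGION_PREFIXES = ["22", "01", "29", "13", "02", "20", "14",
                   "12", "07", "05", "17", "18", "11", "32"]


def count_duplicates(ids):
    """Number of entries that repeat an earlier value."""
    return len(ids) - len(set(ids))


def make_unique_ids(n, prefixes=REGION_PREFIXES, seed=None):
    """Draw prefix + 8 random digits until n distinct ids exist.

    Hypothetical helper for illustration; not part of SDV.
    """
    rng = random.Random(seed)
    seen = set()
    ids = []
    while len(ids) < n:
        candidate = rng.choice(prefixes) + f"{rng.randrange(10**8):08d}"
        if candidate not in seen:  # skip the rare collision and redraw
            seen.add(candidate)
            ids.append(candidate)
    return ids


unique_ids = make_unique_ids(50_000, seed=42)
print(len(unique_ids), count_duplicates(unique_ids))
```

For an experiment where one duplicate in 50k rows is acceptable, the simpler approach described above is of course enough.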