I’m working on our Synthetic Data Vault (SDV) implementation and need some guidance.
Current situation:
We’re generating synthetic data from our PostgreSQL database using SDV. The generated data looks good, but the ID columns are coming out in SDV’s default format (like sdv-xxxxx-12345).
What we need:
Our real data uses ULID format for IDs (e.g., 01KG201W4FG8E2S3T306RBDA65). We need the synthetic data to match this format so it’s realistic and can be used for testing.
Question:
Does SDV support customizing how specific columns are generated? Specifically, can we tell SDV to generate IDs in ULID format instead of its default?
I’ve looked at metadata.update_column(), but I’m not sure it’s the right approach for generating ID patterns like this. The patterns I tried did not work. Are there any other workarounds?
Has anyone dealt with custom ID formats in SDV before? Any pointers would be appreciated.
Context:
Using SDV 1.34.2
Single table synthesis with GaussianCopulaSynthesizer
Hi @Mariam, SDV supports adding regexes to your metadata in order to synthesize ID values with a specific format. I would make sure:
The column is listed as sdtype id in the metadata (sounds like it is)
The column is listed as a primary key for the table (this will maintain uniqueness)
You provide a regex_format string that describes the exact format you want. I’m not sure exactly what the rules are for your ULID format. To create a random alphanumeric string with 26 characters, you could use Regex format [0-9A-Z]{26}.
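To make the regex choice concrete, here is a small sketch that checks the sample ID from the question against two candidate patterns using only Python's re module. The stricter pattern is based on the ULID specification (Crockford Base32, which leaves out I, L, O and U, with a first character of 0 to 7). The column name 'id' and the exact update_column call shown in the comment are assumptions for illustration, and whether SDV accepts the stricter pattern as a regex_format should be verified against your installed version.

```python
import re

# Loose pattern from the reply above: any 26 digits or uppercase letters.
LOOSE_PATTERN = r"[0-9A-Z]{26}"

# Stricter pattern following the ULID spec: Crockford Base32 alphabet
# (no I, L, O, U) and a first character limited to 0-7.
STRICT_ULID_PATTERN = r"[0-7][0-9A-HJKMNP-TV-Z]{25}"


def matches(pattern: str, value: str) -> bool:
    """Return True when the whole value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


sample = "01KG201W4FG8E2S3T306RBDA65"  # example ID from the question
print(matches(LOOSE_PATTERN, sample))
print(matches(STRICT_ULID_PATTERN, sample))

# Assumed call shape (column name 'id' is hypothetical):
# metadata.update_column(column_name='id', sdtype='id',
#                        regex_format=STRICT_ULID_PATTERN)
```

Either pattern only constrains which characters are allowed. It does not by itself control the order in which values are produced.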
I tried using the regex patterns [0-9A-Z]{26} and 0[1-9A-Z]{25} for the ID column. While these patterns technically allow mixed digits and letters, in practice the generated IDs are mostly sequences of zeros followed by a few characters (as shown in the attached screenshot). What I’m aiming for is IDs that consistently mix digits and letters throughout, such as 01KG201W4FG8E2S3T306RBDA65.
Is there a way to guarantee that the synthetic IDs always contain both digits and letters, rather than just one type?
Any advice or examples from your experience would be greatly appreciated. Thanks again for your help!
One point of clarification: Since the regex pattern is [0-9A-Z]{26}, is it technically correct that a value of '0000...' is valid? I understand it is not very likely to occur in real data, as most ULIDs will contain a mix of digits and letters just probability-wise, but is a ULID made up of only 0s still valid?
How SDV works:
SDV Community is designed to create values that are valid according to the regex that you provide. It creates these IDs in alphanumeric order, such as 000, 001, ..., 009, 00A, 00B, and so on. If you sample enough synthetic data points, you will eventually see all the possible combinations, such as 5F9, PL8, etc.
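The ordering described above can be illustrated with a short standard-library sketch. It is not SDV's actual implementation, only an enumeration in the same alphanumeric order, and the helper name ordered_ids is made up for this example. It shows why the first rows of a 26-character ID column come out as long runs of zeros.

```python
from itertools import islice, product

# Characters allowed by [0-9A-Z], in alphanumeric order.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def ordered_ids(length: int):
    """Yield every ID of the given length in alphanumeric order."""
    for combo in product(ALPHABET, repeat=length):
        yield "".join(combo)


# First few 3-character IDs: 000, 001, ..., 009, 00A, 00B
first_short = list(islice(ordered_ids(3), 12))
print(first_short)

# With 26 characters, the first 1000 IDs differ only in the last two positions.
first_long = list(islice(ordered_ids(26), 1000))
print(first_long[0], first_long[-1])
```

Only the rightmost positions change until 36 values have been used up per position, so a sample of a few thousand rows never gets past the last three characters.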
If your goal is to create Regex-based IDs in a completely random order (meaning that you will most likely see a mix of letters and digits), you would need to upgrade to SDV Enterprise. SDV Enterprise users will see their Regex values created in a random order by default. (The underlying mechanism here comes from the RegexGenerator in our RDT library; generation_order='random' is available for SDV Enterprise.)
I’m curious how this is impacting your use case. As I understand it, IDs like this are meant to uniquely identify each row, though they don’t really have a statistical meaning of their own. So is the desire for random, mixed characters in your IDs more of an aesthetic concern (i.e. the synthetic data just looks different from the real data)? Or is it also impacting the downstream usage of your synthetic data (for example, if you’re using synthetic data for software testing and can’t use it for some reason)? Any more explanation would be helpful!
Hi @neha, thank you for your response!
Yes, I agree: the value ‘0000…’ is valid for the regex pattern [0-9A-Z]{26}.
IDs like those look different from the real data, and of course for our testing we need the IDs to contain a mix of digits and letters.
For now, we post-process the data generated by SDV and replace the ID column using the Python ulid library, on the assumption that there is no other workaround.
Hi @Mariam, no problem! Yes, as a workaround you can generate some IDs on your own and use them to replace the synthetic data column. This is easy enough to do for single-table usage.
(For multi-table, this may become harder, as SDV’s synthetic data has referential integrity which makes sure that the IDs are consistent between tables. If ever you’re looking at multi-table, I would recommend looking into SDV Enterprise!)
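For the single-table workaround discussed above, a minimal post-processing sketch could look like the following. It uses only the standard library instead of the ulid package mentioned earlier, and builds IDs according to the ULID layout (a 48-bit millisecond timestamp followed by 80 random bits, encoded in Crockford Base32). The function names new_ulid and replace_ids, the column name "id", and the list-of-dicts rows standing in for the synthetic table are all assumptions for illustration.

```python
import secrets
import time

# Crockford Base32 alphabet used by ULIDs (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Build a 26-character ULID-style string from a timestamp and random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")
    value = (timestamp_ms << 80) | secrets.randbits(80)
    chars = []
    for _ in range(26):  # 26 characters of 5 bits each cover the 128-bit value
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def replace_ids(rows, column="id"):
    """Return copies of the rows with the ID column replaced by unique ULIDs."""
    seen = set()
    result = []
    for row in rows:
        candidate = new_ulid()
        while candidate in seen:  # extremely unlikely, but keeps IDs unique
            candidate = new_ulid()
        seen.add(candidate)
        result.append({**row, column: candidate})
    return result


# Hypothetical synthetic rows with sequential-looking IDs.
synthetic_rows = [{"id": "0" * 25 + str(i), "amount": 10 * i} for i in range(5)]
fixed_rows = replace_ids(synthetic_rows)
print(fixed_rows[0]["id"])
```

With a pandas DataFrame the same idea applies: assign a list of len(synthetic_data) freshly generated IDs to the ID column. Because the values are replaced after sampling, this only stays consistent when no other table or column refers to the original IDs.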