Enhancing realism between user's name and email

Hello! I have a small suggestion about a relationship I ran into in my workflow. Since we are trying to achieve some realism in the synthetic data, would it be useful to add a column relationship called “email addresses”? If we pass in the columns that have sdtype “first_name” and “last_name”, it would create an email that looks like “first_name.last_name@random_mail.com” or something similar. I understand this is doable in Python with a simple command, but what if we want to implement it inside SDV?

Hi @epicvu, thanks for the suggestion. I definitely agree that we can improve the realism of the data across a row – so a user’s first and last name should make sense with the email. We can take this as a feature request. Let me update the title of this topic to be more specific about this.

The good news is that you can write a Custom Constraint to address this. Custom constraints are included inside of SDV, and in the future the team can also make this a predefined constraint or column relationship. Let us know if you’d like to try this and we can provide some suggestions.
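As a rough illustration of the kind of rule such a constraint would enforce, here is a minimal sketch in plain Python. It deliberately does not use SDV's constraint API; the function name `email_matches_name` and the strict `first_name.last_name` pattern are assumptions made for the example, and wiring the check into an SDV custom constraint would follow the SDV documentation.

```python
def email_matches_name(first_name, last_name, email):
    """Return True if the email's local part is 'first_name.last_name'.

    Hypothetical helper for illustration only; not part of SDV.
    """
    first = first_name.strip().lower()
    last = last_name.strip().lower()
    # Everything before the first '@' is the local part of the address.
    local_part = email.split("@", 1)[0].lower()
    return local_part == f"{first}.{last}"


# A row is "valid" when the email is consistent with the two name columns.
row = {"first_name": "Ada", "last_name": "Lovelace",
       "email": "ada.lovelace@example.com"}
is_valid = email_matches_name(row["first_name"], row["last_name"], row["email"])
```

A check like this only covers the validity side; a full constraint would also need a way to produce consistent emails when sampling.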

A few factors to consider:

  1. The formula we’re looking at is: email = first_name.last_name@domain.com. That said, I think it would be unrealistic if every single row followed this formula perfectly. Real data generally shows some variation, e.g. first initial + last name, or last name only. Does this match your understanding?
  2. What is your use case? Many of our users create synthetic data for testing purposes, which doesn’t require this type of realism. However, there may be other cases (such as demos) where this would be useful. Do you have any thoughts about this?
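To make the first point concrete, here is a minimal post-processing sketch in plain Python that derives an email from the name columns while varying the pattern from row to row. The pattern list, the `make_email` helper, and the `example.com` domain are all assumptions for illustration; this is not SDV functionality.

```python
import random

# Hypothetical local-part patterns, mirroring the variations mentioned above.
PATTERNS = [
    lambda first, last: f"{first}.{last}",      # first_name.last_name
    lambda first, last: f"{first[:1]}{last}",   # first initial + last name
    lambda first, last: last,                   # last name only
]


def make_email(first_name, last_name, rng, domain="example.com"):
    """Build an email from a name, picking one pattern at random."""
    first = first_name.strip().lower()
    last = last_name.strip().lower()
    local_part = rng.choice(PATTERNS)(first, last)
    return f"{local_part}@{domain}"


# Seeded generator so repeated runs give the same choices.
rng = random.Random(0)
rows = [("Ada", "Lovelace"), ("Alan", "Turing")]
emails = [make_email(first, last, rng) for first, last in rows]
```

Applied to synthetic output, a function like this would overwrite the generated email column so that each address stays consistent with the names in the same row.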

Hi!
I tend to forget about custom constraints; I’ll have to take a look at them at some point.
I totally agree that a perfect “first_name.last_name@domain.com” in every row is not realistic, and that occasionally having random “fun” emails would be more realistic.
Honestly, this type of realism is mainly a personal preference, but it would be a plus for demo use cases. At the moment, my goal is mainly to test the limits of SDV and help improve it, rather than developing the solution on our own, and to give you some ideas for perfecting the product.


Yes, constraints will definitely be useful for predefined, formulaic rules. Typically, we see that customers need 2-3 constraints for targeted business logic that must always be valid.

I hear you that name/email realism will be useful for demo purposes. I am marking this as a feature request. We prioritize based on demand and the importance to your use case, so if this is becoming something important to your projects, please let us know ASAP. Thanks for this feedback and keep the ideas coming. :slight_smile:
