[Resolved] Synthetic Data with Standardized Predictor Variables (Constant values)

@kalyan :

I am working on a simulation to evaluate the performance of the synthetic data generation process for a DOE. The original predictor variables are -1, 0, or 1 (low, mid-point, and high); however, when I synthesize using your available algorithms, the outcomes are only -1, 0, and 1, which does not cover the design space. In particular, with “GaussianCopulaSynthesizer,” everything is synthesized as 0. See an example below:

Original Data

Synthesized Data (GaussianCopulaSynthesizer)

Thanks for your attention and help in this matter.

Thanks,
Lochana

Hi @lpalayangoda, appreciate you sharing the details here.

Just to make sure we are on the same page about the problem: Even though the real data contains values [-1.0, 0.0, +1.0], the synthetic data only contains 0.0, and you expect the synthetic data to cover the same 3 values as the real data. Is that correct?

For us to diagnose this, it would be helpful if you are able to provide the following information:

  1. For these columns X1, X2... X8: What sdtype is specified in the metadata? I’m particularly curious whether you have specified "categorical" or "numerical".
  2. When you used GaussianCopulaSynthesizer, did you change any of the default settings, add any constraints or customization? If so, it would be helpful to see your code for instantiating and setting up the synthesizer.
  3. As for the values in these columns, are they skewed in any direction? For example, is a vast majority of the values 0.0, with only a few being -1.0 and +1.0? It would help us to understand how skewed the data is.
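If it helps, a quick way to quantify that skew with plain pandas (the column values below are made up for illustration):

```python
import pandas as pd

# hypothetical coded DOE column, mostly zeros, to illustrate a skew check
x = pd.Series([-1, 0, 1, 0, 0, 0, 0, 1, -1, 0])

# proportion of rows taking each level
proportions = x.value_counts(normalize=True).to_dict()
print(proportions)
```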

@neha :
Thanks for your response. Let me clarify them again.
There are two issues here:
(1) The data in the DOE contains predictor variables X = (-1, 0, 1) and a response variable (y), which can be any number. The reason we use -1 is that it is the lowest setting, and 1 is the highest setting; this covers the full design space of the experiment. In the synthesis process, we expect it to provide interpolated values for X and y within the design space (e.g., X = 0.12). However, all the algorithms return X as either -1, 0, or 1 in the synthesized data.

(2) The second problem is specific to GaussianCopulaSynthesizer, which gave me only 0 for the synthesized predictor variables.

To clarify your questions:

  • My metadata settings are all numerical.
  • I did not change any settings. Let me know if I need to add any specific function argument settings here.
  • Most DOE data are balanced (unless they are optimized), which means there are equal numbers of -1, 0, and 1 in the predictors.

Got it, thanks for clarifying.

Issue (1)

SDV is currently designed to replicate what it sees in the real data. Since SDV sees a range of [-1, 1] with whole numbers only, it recreates this in the synthetic data.
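You can verify that the coded columns really are whole numbers only with an illustrative pandas check (this is not SDV code; the data frame is made up):

```python
import pandas as pd

# hypothetical coded DOE column: every value is a whole number
df = pd.DataFrame({'X1': [-1.0, 0.0, 1.0, -1.0, 1.0]})

# True when the column contains no fractional values, which is why the
# synthesizer reproduces the integer levels rather than interpolating
only_whole = bool((df['X1'] % 1 == 0).all())
print(only_whole)
```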

One simple workaround would be to multiply your data by 100 so it is in the range [-100, 100]. For example:

# multiply the real values by 100 before fitting your synthesizer
for column_name in ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']:
    real_data[column_name] = real_data[column_name] * 100

synthesizer.fit(real_data)

# the synthetic data will also have multiplied values in [-100, +100];
# divide to return to the original scale if you want
synthetic_data = synthesizer.sample(num_rows=1000)
for column_name in ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']:
    synthetic_data[column_name] = synthetic_data[column_name] / 100.0
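The rescaling round trip can also be checked in isolation with pure pandas (no synthesizer involved; the column names and values are illustrative):

```python
import pandas as pd

# hypothetical coded DOE columns
real = pd.DataFrame({'X1': [-1.0, 0.0, 1.0], 'X2': [1.0, -1.0, 0.0]})

scaled = real * 100          # widen the range to [-100, 100] before fitting
recovered = scaled / 100.0   # divide afterwards to restore the original coding
print(recovered.equals(real))  # True: multiplying and dividing by 100 is lossless here
```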

Issue (2)

No need to change any settings yet. We will help you debug this, as the synthesizer should be able to faithfully recreate the balance of -1, 0, and 1.

I wonder if this issue could lie in how the data is loaded into Python and stored. Could you share the storage types of your columns? The following command should do it:

print(real_data.dtypes)
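For reference, here is the kind of thing to look for in that output, on a made-up example frame (an 'object' dtype usually means the values were read as strings):

```python
import pandas as pd

# hypothetical frame: X1 read as numbers, X2 accidentally read as strings
df = pd.DataFrame({
    'X1': [-1, 0, 1],
    'X2': ['-1', '0', '1'],
})
print(df.dtypes)  # X1 -> int64, X2 -> object

# an object-typed column can be converted explicitly before fitting
df['X2'] = pd.to_numeric(df['X2'])
print(df['X2'].dtype)  # int64
```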

@neha:

  • The solution you suggested made some improvements: I now see that the synthetic data has values within the design space.

  • However, the GaussianCopulaSynthesizer is still synthesizing 0 for the predictor variables. I have attached sample simulated data for you to check (these are not multiplied by 100; I also checked with your previous suggestion, but it did not work with this method either). (PS: these are simulated data, so they are OK to share.)
    DOE_simulated_data.csv (6.0 KB)

  • Please see the storage types of the simulated real data
    Screenshot 2024-05-13 at 1.43.35 PM

Thanks again for your help!
Lochana

Hi @lpalayangoda, no problem. Always here to help.

Really appreciate you sharing the data – I was able to replicate the issue with it.

Root Cause

The GaussianCopulaSynthesizer tries to estimate the shape of each column using a beta distribution. We rely on a popular data science library (scipy) for this step, but occasionally we find that it fails to converge, producing constant values like the ones you’re seeing.

Workaround

A simple workaround is to use the truncnorm distribution instead of the default beta:

synthesizer = GaussianCopulaSynthesizer(
    metadata,
    default_distribution='truncnorm')
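For intuition, truncnorm fits a normal curve clipped to the column's observed bounds. A standalone scipy sketch of such a distribution on [-1, 1] (illustrative only, not SDV's internal code; the loc/scale values are made up):

```python
from scipy import stats

# truncated normal on [-1, 1]; loc/scale chosen arbitrarily for illustration
loc, scale = 0.0, 0.5
a, b = (-1 - loc) / scale, (1 - loc) / scale   # bounds in standard deviations
dist = stats.truncnorm(a, b, loc=loc, scale=scale)

samples = dist.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())  # every draw stays inside [-1, 1]
```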

On your simulated data, this produced good results:

In the meantime, I will surface this example to the team. We will investigate what’s going wrong with the beta distribution and if we can do anything to make the default settings better.


@neha :
Thanks for the explanation and advice. Yes, it helps. 🙂

Awesome. I will mark this overall topic as Resolved but feel free to reply if there is more to investigate or discuss.

Otherwise, if you need help troubleshooting anything else, please feel free to start a new thread.
