[Resolved] Inequality constraint for 2 datetime columns

Hello,
I’m trying to apply the constraint below to 2 datetime columns:

subtr_close_date_ineq_constr = {
    'constraint_class': 'Inequality',
    'table_name': 'SubtrajectData',
    'constraint_parameters': {
        'low_column_name': 'openingsdatum',  # opening_date
        'high_column_name': 'afsluitdatum',  # close_date
        'strict_boundaries': False
    }
}

Some of the values in the “afsluitdatum” column are missing (NA).
Example of those 2 columns:

openingsdatum afsluitdatum
0     2022-08-30   2022-09-09
1     2020-07-28   2021-02-16
2     2022-03-05   2022-04-26
3     2020-06-16   2020-06-16
4     2021-06-05   2021-06-07
5     2020-07-26   2020-08-10
6     2023-01-24         <NA>
7     2021-07-28   2021-08-09
8     2020-05-13   2020-07-13
9     2021-09-28         <NA>
10    2020-01-27   2021-02-21
11    2022-12-19   2023-01-16
12    2019-07-24   2019-10-09
13    2020-06-27         <NA>
14    2020-11-04   2021-01-15
15    2022-06-08   2022-06-21
16    2022-11-28   2022-11-28
17    2019-10-20   2022-02-14
18    2022-04-20   2022-05-05
19    2022-12-24   2023-01-23 

The relevant part of the metadata:

"columns": {
    "openingsdatum": {
        "sdtype": "datetime",
        "datetime_format": "%Y-%m-%d"
    },
    "afsluitdatum": {
        "sdtype": "datetime",
        "datetime_format": "%Y-%m-%d"
    },

As a result, I get this error:

Step 2: Train synthesizer ...
Traceback (most recent call last):
  File "C:\Users\anna.popovychenko\project\sdv_trial\fit.py", line 159, in <module>
    synthesizer.fit(real_data)
  File "C:\Users\anna.popovychenko\.pyenv\pyenv-win\versions\3.9.13\lib\site-packages\sdv\multi_table\base.py", line 364, in fit
    processed_data = self.preprocess(data)
  File "packaging\\sdv_enterprise\\sdv\\multi_table\\hsa\\hsa.pyx", line 75, in sdv_enterprise.sdv.multi_table.hsa.hsa.expirable.wrapper
  File "packaging\\sdv_enterprise\\sdv\\multi_table\\hsa\\hsa.pyx", line 161, in sdv_enterprise.sdv.multi_table.hsa.hsa.HSASynthesizer.preprocess
  File "C:\Users\anna.popovychenko\.pyenv\pyenv-win\versions\3.9.13\lib\site-packages\sdv\multi_table\base.py", line 308, in preprocess
    self.validate(data)
  File "C:\Users\anna.popovychenko\.pyenv\pyenv-win\versions\3.9.13\lib\site-packages\sdv\multi_table\base.py", line 222, in validate
    raise InvalidDataError(errors)
sdv.errors.InvalidDataError: The provided data does not match the metadata:

boolean value of NA is ambiguous

But the docs say that the constraint ignores missing values (Inequality - Synthetic Data Vault (sdv.dev)).

Could you help with that? Or is this the expected behavior?
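
For reference, here is a minimal sketch of what "ignoring missing values" could mean for this check, written in plain pandas. It is not SDV's implementation, only an illustration under the assumption that a row passes when either date is missing or when the opening date is not after the close date. The three sample rows are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "openingsdatum": pd.to_datetime(["2022-08-30", "2023-01-24", "2021-06-07"]),
    "afsluitdatum": pd.to_datetime(["2022-09-09", None, "2021-06-05"]),
})

# Rows with a missing value on either side are treated as valid (ignored).
has_missing = df["openingsdatum"].isna() | df["afsluitdatum"].isna()

# Non-strict inequality, matching 'strict_boundaries': False.
in_order = df["openingsdatum"] <= df["afsluitdatum"]

is_valid = has_missing | in_order
print(is_valid.tolist())
```

Only the third row fails here, because its opening date is later than its close date.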

Thanks

Hi @anna.popovychenko, thanks for providing the details. The Inequality constraint is supposed to ignore missing values, so this is not working as expected.

Unfortunately, I am unable to replicate the problem on my end. From the screenshots you shared, I suspect it has something to do with how the data is represented in Python. For some reason your missing values are showing as <NA> whereas usually they show up as NaN.

Short Term Workaround

As a short-term workaround, I’d recommend converting the missing values:

import numpy as np

real_data['SubtrajectData']['afsluitdatum'] = real_data['SubtrajectData']['afsluitdatum'].fillna(np.nan)

Let us know if that works.

Long Term

To understand the root cause of this and fix the underlying bugs, it would be helpful to know how you loaded the data into Python. Was it from a CSV, or some other format? If you could share the code snippet that loads real_data, that would be very helpful for replication. Thanks.

Hi @neha ,
I tried .fillna(np.nan), but got the same result. What helped for me was the following:

real_data[['afsluitdatum', 'openingsdatum']] = \
    real_data[['afsluitdatum', 'openingsdatum']].apply(pd.to_datetime)
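
A small self-contained sketch of the same idea, assuming the columns start out as nullable strings with pd.NA (as convert_dtypes() produces). It uses an explicit loop instead of .apply, and the two sample rows are made up. After the conversion, the missing entry becomes NaT and the columns have a datetime dtype.

```python
import pandas as pd

# String-dtype columns with pd.NA, similar to what convert_dtypes() produces.
df = pd.DataFrame({
    "openingsdatum": pd.array(["2023-01-24", "2020-06-16"], dtype="string"),
    "afsluitdatum": pd.array([pd.NA, "2020-06-16"], dtype="string"),
})

# Parse each date column using the format declared in the metadata.
for column in ["openingsdatum", "afsluitdatum"]:
    df[column] = pd.to_datetime(df[column], format="%Y-%m-%d")

print(df.dtypes)
print(df["afsluitdatum"].isna().tolist())
```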

I load the data from CSV files:

import pandas as pd

real_data = pd.read_csv("file.csv")
real_data = real_data.convert_dtypes()

In real_data.dtypes I get:

openingsdatum            string[python]
afsluitdatum             string[python]
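
The sketch below illustrates why these dtypes matter. It assumes the error comes from a missing value being used in a boolean context somewhere during validation; the traceback does not show the exact line, so treat that as a guess. What pandas itself guarantees is that convert_dtypes() switches text columns to the nullable string dtype, where missing entries are pd.NA rather than NaN, and that bool(pd.NA) raises a TypeError.

```python
import pandas as pd

# Date strings held as plain Python objects, with one missing value.
raw = pd.Series(["2022-09-09", None], dtype=object, name="afsluitdatum")

# convert_dtypes() moves the column to the nullable string dtype,
# where the missing entry is represented by pd.NA.
converted = raw.convert_dtypes()
missing = converted.iloc[1]

# Using pd.NA in a boolean context raises a TypeError.
try:
    bool(missing)
    message = None
except TypeError as err:
    message = str(err)

print(converted.dtype)
print(repr(missing))
print(message)
```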

Hi @anna.popovychenko, thanks for the info. I was able to replicate it now.

The issue is with the convert_dtypes command. What is the intended reason for adding this command?

If I delete the line below, then everything works fine. Could you try this?

# delete this line
# real_data = real_data.convert_dtypes()

Resources

- For more information on how to load the data, see: Loading Data | Synthetic Data Vault.

We have a built-in command called load_csvs that should load multiple files at once. No other conversion or modification should be needed.

from sdv.datasets.local import load_csvs

# load multiple CSV files from within a folder
data = load_csvs(folder_name='my_folder/')

# the data is ready for training
# no modifications should be needed unless we otherwise need a workaround
synthesizer.fit(data)

Hi @neha , thanks for the recommendations.

I’ve commented out the convert_dtypes line and removed all column conversions.
For some columns I get the object type, and if such a column is a primary key, I get an error during fit(). I added conversions only for those columns, and it does work.

However, I noticed an increase in run time:
fit() went from 260 to 614.9 seconds, and
sample() went from 53 to 300 seconds.
It’s still very fast; this is just an observation.

Why I used convert_dtypes: I thought that columns with the object type would slow down fit and sample.
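
If the concern is mainly about the date columns arriving as object, one option is to let pandas parse them while reading the CSV. This was not suggested in the thread and its effect on fit() or sample() timing is untested here, so treat it as a sketch. The in-memory CSV text below is a stand-in for file.csv.

```python
import io

import pandas as pd

# Stand-in for file.csv; the second row has an empty close date.
csv_text = "openingsdatum,afsluitdatum\n2022-08-30,2022-09-09\n2023-01-24,\n"

# parse_dates converts the listed columns to a datetime dtype on load,
# and empty fields become NaT instead of pd.NA.
data = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["openingsdatum", "afsluitdatum"],
)
print(data.dtypes)
```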

As for load_csvs: thanks, it’s a good option.