[Resolved] Error when fitting HSASynthesizer

Hi @neha, I am facing an issue while creating a dataframe from synthesized data.

hsas = HSASynthesizer(metadata)
hsas.fit(data)
Traceback (most recent call last):
File “”, line 1, in
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/multi_table/base.py”, line 385, in fit
processed_data = self.preprocess(data)
File “packaging/sdv_enterprise/sdv/multi_table/hsa/hsa.pyx”, line 29, in sdv_enterprise.sdv.multi_table.hsa.hsa.expirable.wrapper
File “packaging/sdv_enterprise/sdv/multi_table/hsa/hsa.pyx”, line 89, in sdv_enterprise.sdv.multi_table.hsa.hsa.HSASynthesizer.preprocess
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/multi_table/base.py”, line 338, in preprocess
processed_data[table_name] = synthesizer._preprocess(table_data)
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/single_table/base.py”, line 347, in _preprocess
self._data_processor.fit(data)
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/data_processing/data_processor.py”, line 754, in fit
self._fit_hyper_transformer(constrained)
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/data_processing/data_processor.py”, line 669, in _fit_hyper_transformer
self._hyper_transformer.fit(data)
File “/sasdata/python3.8/lib/python3.8/site-packages/rdt/hyper_transformer.py”, line 749, in fit
data = self._fit_field_transformer(data, field, self.field_transformers[field])
File “/sasdata/python3.8/lib/python3.8/site-packages/rdt/hyper_transformer.py”, line 671, in _fit_field_transformer
transformer.fit(data, field)
File “/sasdata/python3.8/lib/python3.8/site-packages/rdt/transformers/base.py”, line 55, in wrapper
return function(self, *args, **kwargs)
File “/sasdata/python3.8/lib/python3.8/site-packages/rdt/transformers/base.py”, line 390, in fit
self._fit(columns_data)
File “packaging/sdv_enterprise/rdt/transformers/email/email.pyx”, line 189, in sdv_enterprise.rdt.transformers.email.email.DomainBasedAnonymizer._fit
File “packaging/sdv_enterprise/rdt/transformers/email/utils.pyx”, line 50, in sdv_enterprise.rdt.transformers.email.utils.validate_email_address
rdt.errors.InvalidDataError: The input data must be email addresses. Data contains (‘Y’, ‘Y’, + 887 more).
hsas.save('/saswork/sample_single.pkl')
synthesizer = HSASynthesizer.load('/saswork/sample_single.pkl')
synthetic_data = synthesizer.sample(scale=1)
Traceback (most recent call last):
File “”, line 1, in
File “packaging/sdv_enterprise/sdv/multi_table/hsa/hsa.pyx”, line 29, in sdv_enterprise.sdv.multi_table.hsa.hsa.expirable.wrapper
File “packaging/sdv_enterprise/sdv/multi_table/hsa/hsa.pyx”, line 113, in sdv_enterprise.sdv.multi_table.hsa.hsa.HSASynthesizer.sample
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/multi_table/base.py”, line 414, in sample
sampled_data = self._sample(scale=scale)
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/sampling/independent_sampler.py”, line 148, in _sample
num_rows = int(self._table_sizes[table] * scale)
KeyError: ‘master’

Hi @ashok.kumar.muthimen, thank you for providing the detailed information. In the error message, this particular line stands out to me:

The input data must be email addresses. Data contains (‘Y’, ‘Y’, + 887 more).

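This indicates a mismatch between the data and the metadata. The metadata says that a column contains emails (e.g. "johndoe@gmail.com"), but the data in that column actually contains other values such as "Y". The later KeyError: 'master' during sampling is most likely a follow-on effect: because fit did not complete, the synthesizer that was saved and reloaded had not learned anything about the tables.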

From looking at your metadata, I see that there are 5 columns currently specified as sdtype "email". Could you double-check to ensure that this is correct? If any of these columns do not contain email addresses, it is best to change them to the proper sdtype such as "categorical".
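One way to find the mismatched columns is to scan every column declared as "email" for values that do not look like email addresses. The following is a minimal sketch, not the check SDV itself runs: the regular expression is a deliberately loose assumption, and the table and column names (master, contact_email, opt_in_flag) are hypothetical stand-ins for your own.

```python
import re

# Deliberately loose pattern (something@something.tld). This is an assumption
# for illustration, not the validation rule SDV applies internally.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_non_email_columns(table, email_columns, max_examples=3):
    """Return {column: [example bad values]} for columns holding non-email values.

    `table` can be anything indexable by column name that yields an iterable
    of values, such as a dict of lists or a pandas DataFrame.
    """
    suspects = {}
    for column in email_columns:
        bad_values = []
        for value in table[column]:
            # Skip missing values: None, or NaN (the only value unequal to itself).
            if value is None or value != value:
                continue
            if not EMAIL_PATTERN.match(str(value)):
                bad_values.append(value)
        if bad_values:
            suspects[column] = bad_values[:max_examples]
    return suspects


# Hypothetical sample: 'opt_in_flag' is declared as email but holds Y flags.
master = {
    "contact_email": ["johndoe@gmail.com", "jane@example.org"],
    "opt_in_flag": ["Y", "Y"],
}
declared_email_columns = ["contact_email", "opt_in_flag"]

suspects = find_non_email_columns(master, declared_email_columns)
print(suspects)
```

For each column this reports, the sdtype in the metadata should be changed before fitting again. Assuming your metadata object is a MultiTableMetadata from SDV 1.x, that would be a call along the lines of metadata.update_column(table_name='master', column_name='opt_in_flag', sdtype='categorical'); please check it against the version you have installed. After updating, create a new synthesizer from the corrected metadata and make sure fit finishes without an error before calling save and sample.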

Do note that the SDV uses the metadata as the source of truth about your data. To prevent bugs during fitting, sampling, and evaluation, it is crucial to ensure that the metadata is accurate.