[Resolved] Metadata detect

Hi @neha, as you suggested, I applied the fix above, but I am now facing a different issue.

cleaned_data = poc.drop_unknown_references(data, metadata)
Traceback (most recent call last):
  File "", line 1, in
  File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/utils/poc.py", line 33, in drop_unknown_references
    table_names = sorted(metadata.tables)
AttributeError: 'dict' object has no attribute 'tables'

Hello @ashok.kumar.muthimen, can you share all of your code up to the point where you ran this snippet? We would like to see how you loaded your data and how you loaded or created your metadata.

import sdv

print(sdv.version.enterprise)

import pandas as pd

from sdv.metadata.multi_table import MultiTableMetadata
from sdv.multi_table import HSASynthesizer

df_master = pd.read_csv('/saswork/MDM_CONTACT_MASTER.csv')
df_person = pd.read_csv('/saswork/MDM_CONTACT_PERSON.csv')

data = {'master': df_master, 'person': df_person, 'org': df_org}

metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data)
metadata.remove_primary_key('person')
metadata.add_relationship(parent_table_name='master', child_table_name='person', parent_primary_key='CONT_ID', child_foreign_key='CONT_ID')
metadata.save_to_json('/saswork/metadata_6.json')

hsas = HSASynthesizer(metadata)
hsas.fit(data)

Hi @ashok.kumar.muthimen, you can try this instead:

cleaned_data = poc.drop_unknown_references(metadata=metadata, data=data)

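It helps to pass the arguments by keyword (metadata= and data=). In your first call they were passed positionally as (data, metadata), so the data dictionary ended up where the metadata was expected, which is what the AttributeError on metadata.tables points to. Apologies for the confusion.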
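To illustrate why the keyword form helps, here is a minimal stand-in, not SDV's actual code. `StubMetadata` and the stub `drop_unknown_references` below are hypothetical; the only detail taken from the traceback above is that the function reads `metadata.tables`, and the `(metadata, data)` parameter order is an assumption inferred from that traceback.

```python
class StubMetadata:
    """Hypothetical stand-in for a metadata object exposing `.tables`."""

    def __init__(self, table_names):
        self.tables = {name: {} for name in table_names}


def drop_unknown_references(metadata, data):
    """Stub that mimics only the first step seen in the traceback."""
    table_names = sorted(metadata.tables)  # fails if `metadata` is a plain dict
    return {name: data[name] for name in table_names}


metadata = StubMetadata(["master", "person"])
data = {"master": [1, 2], "person": [1, 3]}

# Positional call in the wrong order: the dict lands in the `metadata` slot.
swapped_error = None
try:
    drop_unknown_references(data, metadata)
except AttributeError as error:
    swapped_error = str(error)

# Keyword call: the order in which the arguments are written no longer matters.
cleaned = drop_unknown_references(data=data, metadata=metadata)
```

With keywords, each object is bound to the parameter it was meant for regardless of position, so this class of mix-up cannot occur.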

Also, just an FYI: poc.drop_unknown_references should be called before creating and fitting the synthesizer.

  1. Call drop_unknown_references. This returns the cleaned data.
  2. Create an HSASynthesizer using the metadata.
  3. Fit the HSASynthesizer on the new, cleaned data from step #1.

metadata.remove_primary_key('person')
metadata.add_relationship(parent_table_name='master', child_table_name='person', parent_primary_key='CONT_ID', child_foreign_key='CONT_ID')
cleaned_data = poc.drop_unknown_references(metadata=metadata, data=data)
Traceback (most recent call last):
  File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/utils/poc.py", line 42, in drop_unknown_references
    metadata.validate_data(data)
  File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py", line 826, in validate_data
    raise InvalidDataError(errors)
sdv.errors.InvalidDataError: The provided data does not match the metadata:
Relationships:
Error: foreign key column 'CONT_ID' contains unknown references: (769162789078583302, 273153625548603402, 373450055630721902, 669058145948247803, 819253797098187502, + more). Please use the utility method 'drop_unknown_references' to clean the data.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in
  File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/utils/poc.py", line 65, in drop_unknown_references
    raise InvalidDataError([
sdv.errors.InvalidDataError: The provided data does not match the metadata:
All references in table 'person' are unknown and must be dropped.Try providing different data for this table.
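
Before retrying, it can help to measure how many foreign-key values actually lack a matching parent. The following is a minimal sketch with pandas, using tiny made-up frames in place of the real CSV files. `CONT_ID` is the column from the thread; the helper name `unknown_reference_report` and the sample values are hypothetical.

```python
import pandas as pd


def unknown_reference_report(parent, child, parent_key, foreign_key):
    """Summarize child rows whose foreign key has no match in the parent."""
    is_unknown = ~child[foreign_key].isin(parent[parent_key])
    return {
        "child_rows": len(child),
        "unknown_rows": int(is_unknown.sum()),
        "parent_key_dtype": str(parent[parent_key].dtype),
        "foreign_key_dtype": str(child[foreign_key].dtype),
    }


# Made-up sample data; the real tables would come from the CSV files.
master = pd.DataFrame({"CONT_ID": [101, 102, 103]})
person = pd.DataFrame({"CONT_ID": [101, 999, 998]})

report = unknown_reference_report(master, person, "CONT_ID", "CONT_ID")
```

If `unknown_rows` equals `child_rows`, as the error reports for `person`, comparing the two dtypes is a reasonable first check, since keys stored as different types (for example integers in one table and strings in the other) will not match.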

Hi @ashok.kumar.muthimen, hope you are doing great. Allow me to pitch in.

My name is Wim Blommaert, and for the past 4 years I have been leading the synthetic data rollout at ING Bank in Belgium using SDV. We have done 18 projects so far. Looking at the case here, I would recommend the following.

In SDV, anything related to multi-table modeling only really makes sense when there are parents and children connected by PK-FK relationships. In this case there are 2 tables in a one-to-one (PK-PK) relationship, so these features don’t really apply. The easy solution is to join the 2 “brother” tables into one single table on the PK. From there on it is handled as a single table, and you can split it back into the two tables at the end.

You might want to take a random sample from one of the tables (say 10k random rows) and then join those with the corresponding rows from the other table on the PK.
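
The join-and-split approach described above can be sketched with pandas as follows. The tiny frames and every column name other than `CONT_ID` are made up, and the sample size is shrunk from 10k to 2 rows so the example stays small.

```python
import pandas as pd

# Made-up stand-ins for the two tables that share the same primary key.
master = pd.DataFrame({"CONT_ID": [1, 2, 3, 4], "STATUS": ["A", "B", "A", "C"]})
person = pd.DataFrame({"CONT_ID": [1, 2, 3, 4], "AGE": [30, 41, 25, 57]})

# Take a random sample of one table, then pull the matching rows of the other.
sample = master.sample(n=2, random_state=0)
combined = sample.merge(person, on="CONT_ID", how="inner", validate="one_to_one")

# ... model `combined` as a single table here ...

# Split back into the two original layouts, each keeping the shared key.
master_out = combined[list(master.columns)]
person_out = combined[list(person.columns)]
```

The `validate="one_to_one"` argument makes pandas raise an error if the key is duplicated on either side, which is a cheap way to confirm that the tables really are in a one-to-one relationship.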

If you want to try a real multi-table case, I would recommend selecting parent and child tables connected in a one-to-many relationship (PK-FK). Hope this makes sense. Kind regards, Wim

A post was merged into an existing topic: [Bug] InvalidDataError: All references in table are unknown and must be dropped

Hi @ashok.kumar.muthimen and @kalyan, since we have resolved the original issue for metadata detection, I will mark this thread as closed (resolved).

I have started a new topic specifically for the InvalidDataError (All references in table ‘person’ are unknown and must be dropped.Try providing different data for this table.) that you mention.