[Resolved] Metadata detect

ashok.kumar.muthimen · April 22, 2024, 10:31am

Hi @neha, as you suggested, I applied the above, but I am now facing a different issue.

cleaned_data = poc.drop_unknown_references(data, metadata)
Traceback (most recent call last):
File "", line 1, in
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/utils/poc.py", line 33, in drop_unknown_references
table_names = sorted(metadata.tables)
AttributeError: 'dict' object has no attribute 'tables'

Plamen · April 22, 2024, 10:48am

Hello @ashok.kumar.muthimen, can you share all the code you ran before this snippet? We would like to see how you loaded your data and how you loaded or created your metadata.

ashok.kumar.muthimen · April 22, 2024, 12:13pm

import sdv

print(sdv.version.enterprise)

import pandas as pd

from sdv.metadata.multi_table import MultiTableMetadata
from sdv.multi_table import HSASynthesizer

df_master = pd.read_csv('/saswork/MDM_CONTACT_MASTER.csv')
df_person = pd.read_csv('/saswork/MDM_CONTACT_PERSON.csv')

data = {'master': df_master, 'person': df_person, 'org': df_org}

metadata = MultiTableMetadata()

metadata.detect_from_dataframes(data)

metadata.remove_primary_key('person')

metadata.add_relationship(parent_table_name='master', child_table_name='person', parent_primary_key='CONT_ID', child_foreign_key='CONT_ID')

metadata.save_to_json('/saswork/metadata_6.json')

hsas = HSASynthesizer(metadata)

hsas.fit(data)

neha · April 22, 2024, 1:25pm

Hi @ashok.kumar.muthimen You can try this instead:

cleaned_data = poc.drop_unknown_references(metadata=metadata, data=data)

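It helps to pass the arguments by keyword (metadata= and data=). In your earlier call they were passed positionally with data first, so the data dict was treated as the metadata. Apologies for the confusion.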
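
To illustrate why the argument order matters, here is a minimal sketch that uses a stand-in function, not SDV itself. `FakeMetadata` and the function body are hypothetical; only the parameter order (metadata first, then data) and the `sorted(metadata.tables)` line are taken from the traceback earlier in this thread.

```python
class FakeMetadata:
    """Hypothetical stand-in for MultiTableMetadata; it only exposes `.tables`."""

    def __init__(self, tables):
        self.tables = tables


def drop_unknown_references(metadata, data):
    """Stand-in with the parameter order implied by the traceback (metadata first)."""
    table_names = sorted(metadata.tables)  # the line the traceback points at
    return {name: data[name] for name in table_names}


metadata = FakeMetadata({"master": {}, "person": {}})
data = {"master": [1, 2], "person": [3]}

# Positional call with the arguments swapped: the dict lands in `metadata`.
try:
    drop_unknown_references(data, metadata)
except AttributeError as exc:
    swapped_error = str(exc)
    print(swapped_error)

# Keyword call: the order in which the arguments are written no longer matters.
cleaned = drop_unknown_references(data=data, metadata=metadata)
```

With keyword arguments the call stays correct even if the two names are written in a different order.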

neha · April 22, 2024, 1:37pm

Also just an FYI that the poc.drop_unknown_references method should be used before creating and fitting the synthesizer.

1. Use drop_unknown_references. This will return the cleaned data.
2. Create an HSASynthesizer using the metadata.
3. Then fit the HSASynthesizer using the new, cleaned data from step #1.
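
The three steps above can be sketched as follows. This is only an outline built from the calls already shown in this thread; the helper name `clean_then_fit` is hypothetical, and the imports sit inside the function so the sketch can be loaded on a machine without SDV Enterprise installed.

```python
def clean_then_fit(metadata, data):
    """Clean the data first, then create and fit the synthesizer on the cleaned data."""
    # Lazy imports: HSASynthesizer is only available with SDV Enterprise.
    from sdv.multi_table import HSASynthesizer
    from sdv.utils import poc

    # Step 1: drop child rows whose foreign keys have no matching parent row.
    cleaned_data = poc.drop_unknown_references(metadata=metadata, data=data)

    # Step 2: create the synthesizer from the metadata.
    synthesizer = HSASynthesizer(metadata)

    # Step 3: fit on the cleaned data, not on the original data.
    synthesizer.fit(cleaned_data)
    return synthesizer, cleaned_data
```

It would be called as `synthesizer, cleaned_data = clean_then_fit(metadata, data)` once the metadata and relationships have been set up.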

ashok.kumar.muthimen · April 23, 2024, 7:14am

metadata.remove_primary_key('person')
metadata.add_relationship(parent_table_name='master', child_table_name='person', parent_primary_key='CONT_ID', child_foreign_key='CONT_ID')
cleaned_data = poc.drop_unknown_references(metadata=metadata, data=data)
Traceback (most recent call last):
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/utils/poc.py", line 42, in drop_unknown_references
metadata.validate_data(data)
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py", line 826, in validate_data
raise InvalidDataError(errors)
sdv.errors.InvalidDataError: The provided data does not match the metadata:
Relationships:
Error: foreign key column 'CONT_ID' contains unknown references: (769162789078583302, 273153625548603402, 373450055630721902, 669058145948247803, 819253797098187502, + more). Please use the utility method 'drop_unknown_references' to clean the data.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/utils/poc.py", line 65, in drop_unknown_references
raise InvalidDataError([
sdv.errors.InvalidDataError: The provided data does not match the metadata:
All references in table 'person' are unknown and must be dropped.Try providing different data for this table.

Wim · April 23, 2024, 12:08pm

Hi @ashok.kumar.muthimen , hope you are doing great. Allow me to pitch in.

My name is Wim Blommaert, and for the past 4 years I have been leading the synthetic data rollout at ING Bank in Belgium using SDV. We have done 18 projects so far. Looking at this case, I would recommend the following.

In SDV, anything related to multi-table really only makes sense when there are parents and children connected by PK-FK relations. In this case the 2 tables are in a one-to-one relation (PK-PK), so these things don't really apply. In such a case the easy solution is to join the 2 "brother" tables into one single table on the PK. From there on it is handled as a single table, and you can split it back into two tables at the end.

You might want to take a random sample from one of the tables (say 10k random rows) and then join those with the corresponding rows from the other table on the PK.

If you want to try a real multi-table case, I would recommend selecting parent-child tables connected in a one-to-many relationship (PK-FK). Hope this makes sense. Kind regards, Wim
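
A rough sketch of the join-then-split idea above, using pandas. The two small DataFrames and their non-key columns (`STATUS`, `FIRST_NAME`) are made up for illustration; only the `CONT_ID` key name comes from this thread, and the real tables would be read from the CSV files instead.

```python
import pandas as pd

# Tiny stand-in tables (assumed data, for illustration only).
master = pd.DataFrame({"CONT_ID": [1, 2, 3, 4], "STATUS": ["A", "B", "A", "C"]})
person = pd.DataFrame({"CONT_ID": [2, 3, 4, 5], "FIRST_NAME": ["Ann", "Bo", "Cy", "Di"]})

# Take a random sample from one table (10k rows in the post, capped by table size here).
sample = master.sample(n=min(10_000, len(master)), random_state=0)

# Inner join on the shared key keeps only CONT_ID values present in both tables.
# validate="one_to_one" raises an error if the key is duplicated on either side.
combined = sample.merge(person, on="CONT_ID", how="inner", validate="one_to_one")

# `combined` can now be treated as a single table. Afterwards, split it back
# into the two original layouts by selecting each table's columns.
master_part = combined[list(master.columns)]
person_part = combined[list(person.columns)]
```

The inner join also drops any keys that exist in only one of the two tables, so it is worth comparing `len(combined)` with `len(sample)` to see how many rows were lost.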

neha · April 29, 2024, 9:58pm

A post was merged into an existing topic: [Bug] InvalidDataError: All references in table are unknown and must be dropped

neha · April 29, 2024, 10:00pm

Hi @ashok.kumar.muthimen and @kalyan, since we have resolved the original issue for metadata detection, I will mark this thread as closed (resolved).

I have started a new topic specifically for the InvalidDataError (All references in table 'person' are unknown and must be dropped.Try providing different data for this table.) that you mention.
