[Resolved] Metadata detect

I just started synthetic data generation for one of our tables by importing test data from a CSV file. I am facing an issue with metadata detection.

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Welcome @ashok.kumar.muthimen ,

Adding some more context from your email, since this helps us provide more accurate assistance:

import sdv

import pandas as pd
from sdv.metadata.multi_table import MultiTableMetadata

df = pd.read_csv('/saswork/MDM_CONTACT_MASTER.csv')  # This is my table data with 10000 records (test data)
df.columns
df.head()
df['CONT_ID'].head()


metadata = MultiTableMetadata()
metadata.detect_from_dataframes(df)


ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Note: I am just trying to generate synthetic data for single table so no relationship involved here.

The reason why this is happening is that you are using MultiTableMetadata instead of a SingleTableMetadata object.

In this case, this is the snippet of code to run, as described in our documentation:

import pandas as pd
from sdv.metadata import SingleTableMetadata

df = pd.read_csv('/saswork/MDM_CONTACT_MASTER.csv')
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)

Once the metadata is successfully detected, feel free to use any of our SingleTable synthesizers to generate synthetic data.

Yes, I tried from sdv.metadata import SingleTableMetadata but I am still facing an issue with the synthesizer.

from sdv.metadata import SingleTableMetadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)
metadata.get_column_names(sdtype='unknown')
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'SingleTableMetadata' object has no attribute 'get_column_names'
metadata.validate()
metadata.save_to_json('metadata.json')
from sdv.multi_table import HSASynthesizer
synthesizer = HSASynthesizer(metadata)
Traceback (most recent call last):
File "", line 1, in
File "packaging/sdv_enterprise/sdv/multi_table/hsa/hsa.pyx", line 26, in sdv_enterprise.sdv.multi_table.hsa.hsa.expirable.wrapper
File "packaging/sdv_enterprise/sdv/multi_table/hsa/hsa.pyx", line 51, in sdv_enterprise.sdv.multi_table.hsa.hsa.HSASynthesizer.__init__
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/multi_table/base.py", line 86, in __init__
self._check_metadata_updated()
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/multi_table/base.py", line 73, in _check_metadata_updated
if self.metadata._check_updated_flag():
AttributeError: 'SingleTableMetadata' object has no attribute '_check_updated_flag'

@ashok.kumar.muthimen The reason is that HSASynthesizer is a MultiTable model. Instead of using HSASynthesizer for a single table, we encourage using any of our SingleTable models – more information in our documentation.

If you still want to use HSASynthesizer for this task, you can do so by using the MultiTableMetadata as you did before, but instead of passing a single dataframe as input, provide a dictionary mapping the name of each table to the pandas.DataFrame that contains its data.

Here is a code snippet that will help you run your example.

import pandas as pd
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HSASynthesizer

df = pd.read_csv('/saswork/MDM_CONTACT_MASTER.csv')
data = {'table': df}
metadata = MultiTableMetadata()
metadata.detect_from_dataframes(data)

hsas = HSASynthesizer(metadata)
hsas.fit(data)
sample = hsas.sample()  # This will return a dictionary!

Thanks, I am now able to generate synthetic data and am trying to print some sample data.

hsas.save('/saswork/sample.pkl')
synthesizer = HSASynthesizer.load('/saswork/sample.pkl')
synthetic_data = synthesizer.sample(scale=1)
synthetic_data.head()
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'dict' object has no attribute 'head'

@ashok.kumar.muthimen As I mentioned earlier, if you are using the HSASynthesizer you will get a dictionary in return, not a dataframe.

Therefore you will have to get the synthetic data by the name of the table, which, if you followed my instructions, should be 'table'. You should do:

synthetic_df = synthetic_data['table']
synthetic_df.head()

This is why we encourage using SingleTable synthesizers for single-table datasets instead of directly using HSASynthesizer.

Here is an example that should get you the same results with SingleTable instead:

import pandas as pd
from sdv.metadata import SingleTableMetadata

df = pd.read_csv('/saswork/MDM_CONTACT_MASTER.csv')
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)

from sdv.single_table import GaussianCopulaSynthesizer
gcs = GaussianCopulaSynthesizer(metadata)
gcs.fit(df)
synthetic_data = gcs.sample(len(df))  # This will return a DataFrame!

Yes, for single-table datasets it is recommended that you use the metadata and synthesizers that are meant for a single table.

To avoid any confusion, please do not use multi-table synthesizers or metadata if you have a single table use case. Let us know if that helps.

I am working with multiple tables and trying to create relationships between them, and I am facing the issue below. I understand the error, but I thought the relationship should connect the primary key of the parent table to a non-primary key of the child table.

metadata.add_relationship(parent_table_name='master', child_table_name='org', parent_primary_key='CONT_ID', child_foreign_key='CONT_ID')
Traceback (most recent call last):
File "", line 1, in
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py", line 262, in add_relationship
self._validate_relationship(
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py", line 190, in _validate_relationship
self._validate_foreign_child_key(child_table_name, parent_table_name, child_foreign_key)
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py", line 154, in _validate_foreign_child_key
raise InvalidMetadataError(
sdv.metadata.errors.InvalidMetadataError: Invalid relationship between table 'master' and table 'org'. A relationship must connect a primary key with a non-primary key.

Hi @ashok.kumar.muthimen ,

Generally, the relationships we have in relational structures are:

  • Parent — Primary key.
  • Child — Foreign key (that is not the primary key of this table).
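The rule above can be sketched in plain Python (the table and column names are hypothetical examples mirroring this thread, not real SDV output or internals):

```python
# Sketch of the valid shape: the parent's primary key is referenced by a
# child column that is NOT the child's own primary key.
parent = {'name': 'master', 'primary_key': 'CONT_ID'}
child = {'name': 'org', 'primary_key': 'ORG_ID', 'foreign_key': 'CONT_ID'}

def relationship_is_valid(parent, child):
    """A relationship must connect a primary key with a non-primary key."""
    return child['foreign_key'] != child['primary_key']

print(relationship_is_valid(parent, child))  # True: FK 'CONT_ID' is not the child's PK

# The failing case from this thread: the child's primary key is also 'CONT_ID',
# so the foreign key and the primary key are the same column.
bad_child = {'name': 'org', 'primary_key': 'CONT_ID', 'foreign_key': 'CONT_ID'}
print(relationship_is_valid(parent, bad_child))  # False: SDV raises InvalidMetadataError here
```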

Are you trying to connect the following:

  • Parent — Primary Key.
  • Child — Primary Key.

Would this just mean that the rows of the two tables have a 1-1 relationship?

Yes, I am trying to create a relationship from the parent PK to the child PK, and it is a 1-1 relationship.

With that, we have the following two questions:

  1. Is the child table connected to any other table?
  2. Can you create the metadata without connecting this one table and share a visual here? To do so, you can run the metadata visualization command: metadata.visualize(). Find the documentation on how to use this functionality here: https://docs.sdv.dev/sdv/multi-table-data/data-preparation/multi-table-metadata-api#visualize

You can also export it directly to a png file using:

metadata.visualize(
    show_table_details='full',
    output_filepath='my_metadata.png'
)

How can I create metadata without the connection? I just ran the code below.

metadata.detect_from_dataframes(data={'master': df_master, 'person': df_person, 'org': df_org})

metadata.detect_from_dataframes(data)
/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py:490: UserWarning: Could not automatically add relationships for all tables. The relationships in the dataset are disjointed. Tables ['master', 'person', 'org'] are not connected to any of the other tables.
warnings.warn(warning_msg)

metadata.visualize(show_table_details='full', output_filepath='/saswork/my_metadata.png')
Traceback (most recent call last):
File "/sasdata/python3.8/lib/python3.8/site-packages/graphviz/backend/execute.py", line 81, in run_check
proc = subprocess.run(cmd, **kwargs)
File "/sasdata/python3.8/lib/python3.8/subprocess.py", line 489, in run
with Popen(*popenargs, **kwargs) as process:
File "/sasdata/python3.8/lib/python3.8/subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/sasdata/python3.8/lib/python3.8/subprocess.py", line 1702, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: PosixPath('dot')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 1, in
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py", line 913, in visualize
return visualize_graph(nodes, edges, output_filepath)
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/visualization.py", line 118, in visualize_graph
digraph.render(filename=filename, cleanup=True, format=graphviz_extension)
File "/sasdata/python3.8/lib/python3.8/site-packages/graphviz/_tools.py", line 171, in wrapper
return func(*args, **kwargs)
File "/sasdata/python3.8/lib/python3.8/site-packages/graphviz/rendering.py", line 122, in render
rendered = self._render(*args, **kwargs)
File "/sasdata/python3.8/lib/python3.8/site-packages/graphviz/_tools.py", line 171, in wrapper
return func(*args, **kwargs)
File "/sasdata/python3.8/lib/python3.8/site-packages/graphviz/backend/rendering.py", line 324, in render
execute.run_check(cmd,
File "/sasdata/python3.8/lib/python3.8/site-packages/graphviz/backend/execute.py", line 84, in run_check
raise ExecutableNotFound(cmd) from e
graphviz.backend.execute.ExecutableNotFound: failed to execute PosixPath('dot'), make sure the Graphviz executables are on your system's PATH

Hi @ashok.kumar.muthimen I think what Plamen is saying is: take that one table out, and follow the steps for the rest as in the docs. So you would detect the metadata, then add relationships, and then visualize.
If you don't have Graphviz, you can post the JSON here and we can visualize it.

metadata.detect_from_dataframes(data)
/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py:490: UserWarning: Could not automatically add relationships for all tables. The relationships in the dataset are disjointed. Tables ['master', 'person'] are not connected to any of the other tables.
warnings.warn(warning_msg)
metadata.add_relationship(parent_table_name='master', child_table_name='person', parent_primary_key='CONT_ID', child_foreign_key='CONT_ID')
Traceback (most recent call last):
File "", line 1, in
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py", line 262, in add_relationship
self._validate_relationship(
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py", line 190, in _validate_relationship
self._validate_foreign_child_key(child_table_name, parent_table_name, child_foreign_key)
File "/sasdata/python3.8/lib/python3.8/site-packages/sdv/metadata/multi_table.py", line 154, in _validate_foreign_child_key
raise InvalidMetadataError(
sdv.metadata.errors.InvalidMetadataError: Invalid relationship between table 'master' and table 'person'. A relationship must connect a primary key with a non-primary key.

Hi Kalyan, as you suggested I tried to detect the metadata with 2 tables only, but I am facing the same issue with the relationship.

Can we update the metadata for the primary key columns?
cat metadata_2.json | grep 'primary'
"primary_key": "CONT_ID",
"primary_key": "CONT_ID",

Hi @ashok.kumar.muthimen ,

Does your child table have any other children, or is it just connected to the parent?
If there are no other tables connected to the child table, you can drop the primary key that was auto-detected for that table and use that column to establish the relationship as you intend.

Here is how you can achieve that; again, this will only work if there are no other tables depending on the child (person table):

metadata.remove_primary_key('person')  # This will remove the primary key.

metadata.add_relationship(
    parent_table_name='master', 
    child_table_name='person', 
    parent_primary_key='CONT_ID', 
    child_foreign_key='CONT_ID'
)

Feel free to share a .json representation of the auto-detected metadata so we can visualize it and better understand the schema you are working with. You can copy the content printed out by this command:

metadata.to_dict()
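To illustrate why dropping the child's primary key unblocks the relationship, here is a plain-Python sketch. The dict layout loosely imitates a metadata dictionary and the validation rule from the error above; it is a simplified illustration, not real SDV internals:

```python
# Hypothetical, simplified metadata dict mirroring the two tables in this thread.
metadata = {
    'tables': {
        'master': {'primary_key': 'CONT_ID'},
        'person': {'primary_key': 'CONT_ID'},  # auto-detected PK that clashes with the FK
    },
    'relationships': [],
}

def remove_primary_key(metadata, table):
    """Mimics dropping a table's primary key: the column stays, it just
    stops being a primary key."""
    metadata['tables'][table].pop('primary_key', None)

def add_relationship(metadata, parent, child, parent_pk, child_fk):
    """Mimics the validation rule: refuse to connect two primary keys."""
    if metadata['tables'][child].get('primary_key') == child_fk:
        raise ValueError('A relationship must connect a primary key with a non-primary key.')
    metadata['relationships'].append(
        {'parent': parent, 'child': child, 'parent_pk': parent_pk, 'child_fk': child_fk}
    )

remove_primary_key(metadata, 'person')
add_relationship(metadata, 'master', 'person', 'CONT_ID', 'CONT_ID')  # now succeeds
print(metadata['relationships'])
```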

Thanks Plamen.

Here is an observation while fitting the data.

hsas.fit(data)
Traceback (most recent call last):
File “”, line 1, in
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/multi_table/base.py”, line 378, in fit
processed_data = self.preprocess(data)
File “packaging/sdv_enterprise/sdv/multi_table/hsa/hsa.pyx”, line 26, in sdv_enterprise.sdv.multi_table.hsa.hsa.expirable.wrapper
File “packaging/sdv_enterprise/sdv/multi_table/hsa/hsa.pyx”, line 80, in sdv_enterprise.sdv.multi_table.hsa.hsa.HSASynthesizer.preprocess
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/multi_table/base.py”, line 321, in preprocess
self.validate(data)
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/multi_table/base.py”, line 235, in validate
raise InvalidDataError(errors)
sdv.errors.InvalidDataError: The provided data does not match the metadata:
Relationships:
Error: foreign key column ‘CONT_ID’ contains unknown references: (769162789078583302, 273153625548603402, 373450055630721902, 669058145948247803, 819253797098187502, + more). All the values in this column must reference a primary key.
hsas.save(‘/saswork/sample_multi.pkl’)
synthesizer = HSASynthesizer.load(‘/saswork/sample_multi.pkl’)
synthetic_data = synthesizer.sample(scale=1)
Traceback (most recent call last):
File “”, line 1, in
File “packaging/sdv_enterprise/sdv/multi_table/hsa/hsa.pyx”, line 26, in sdv_enterprise.sdv.multi_table.hsa.hsa.expirable.wrapper
File “packaging/sdv_enterprise/sdv/multi_table/hsa/hsa.pyx”, line 104, in sdv_enterprise.sdv.multi_table.hsa.hsa.HSASynthesizer.sample
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/multi_table/base.py”, line 407, in sample
sampled_data = self._sample(scale=scale)
File “/sasdata/python3.8/lib/python3.8/site-packages/sdv/sampling/independent_sampler.py”, line 148, in _sample
num_rows = int(self._table_sizes[table] * scale)
KeyError: ‘master’

Hi @ashok.kumar.muthimen,

I believe this section of the SDV docs will be useful to help you debug. It explains what kind of data schemas are compatible with SDV. The two final requirements are what’s causing your issues.

Item #4: Relationships should be one-to-many

You must connect a primary key of one table with a non-primary key of another table. It is not possible to connect 2 primary keys. To make progress, we suggested dropping this table; we can add it back after we go end-to-end with the rest of the tables. Is this resolved now?

Item #3: There should be no missing references

This is the cause of your latest Error. All the foreign key values must refer to an existing primary key. If the reference cannot be found, then the connection is not valid.

[Note: This is probably happening because you are working with data downloaded from databases as CSVs. With database connectors this would not be the case.]
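If you want to see which rows are affected before upgrading, the missing-reference rule is easy to check by hand. A minimal plain-Python sketch (the key values are made up for illustration; in practice they would come from the parent's 'CONT_ID' column and the child's foreign key column):

```python
# "No missing references": every foreign key value in the child table must
# appear among the parent table's primary key values.
parent_primary_keys = {1001, 1002, 1003}
child_foreign_keys = [1001, 1001, 1003, 9999]  # 9999 has no matching parent row

# Rows whose foreign key does not resolve trigger the InvalidDataError above.
unknown = [fk for fk in child_foreign_keys if fk not in parent_primary_keys]
print(unknown)  # [9999]

# One manual way to clean the child data is to keep only resolvable references.
cleaned = [fk for fk in child_foreign_keys if fk in parent_primary_keys]
print(cleaned)  # [1001, 1001, 1003]
```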

To resolve this, we have provided some functionality in SDV Enterprise 0.11.0. Could you upgrade SDV Enterprise to use this feature? I have sent you an email with info on how to do this.

After you upgrade, use this function to clean your data. Example below

from sdv.utils import poc

cleaned_data = poc.drop_unknown_references(data, metadata)