[Resolved] CSVHandler and large file reading

Hi,

I am currently testing the CSVHandler on a really large database of mine. I have a dump of each table in CSV files (50 or so tables), but some of the tables are either empty or really full (around 10 GB). I know that it is still in beta testing, but I wish it accepted the same options as load_csvs, where we can pass information about the data and pandas specifics such as “nrows” or “escapechar”… So the first issue is the loading time. It takes a long time for a large CSV file (which is understandable, but I wish we could load only 1M lines or fewer??)

So as of now I’m using load_csvs() to load my files:

datasets = load_csvs("../data/dump", read_csv_parameters={"encoding": "latin-1", "nrows": 500000, "escapechar": "\\", "quotechar": '"'})

The second issue is when detecting metadata from dataframes; it seems to run into problems. I encountered this error:

AttributeError: 'set' object has no attribute 'dtype'

I was wondering if there is a fix for this issue. Unfortunately I am not allowed to share the data with you, but I hope that you can still help me.

Bests,

Charles

Hi @epicvu, thanks for starting this thread. Tackling each of your questions in separate sections below.

CSVHandler loading time & truncation

We created the CSVHandler object as a convenience wrapper for reading and writing multiple CSV files at once. I agree that it would be good to offer all the read options that are available in the underlying pandas.read_csv function. I will raise a feature request for this so that the CSVHandler has the same functionality as load_csvs. In the meantime, you can continue to use load_csvs.

By the way, I noticed that you are using the nrows parameter with this function to read only the first 500K rows from each file. Is this working out for you?

datasets = load_csvs(
    "../data/dump",
    read_csv_parameters={
        "encoding": "latin-1",
        "nrows": 500000,  # read only the first 500K lines from each file
        "escapechar": "\\",
        "quotechar": '"',
    })
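As a quick sanity check (plain pandas here, assuming it is installed), nrows caps the number of data rows parsed per file, not counting the header line:

```python
import io

import pandas as pd

# nrows limits the number of data rows read; the header line is not counted
csv_text = "a,b\n1,2\n3,4\n5,6\n"
df = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(len(df))  # 2
```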

A potential issue is that when you have multiple tables, loading only the first 500K lines from each table could result in broken references (a lack of referential integrity). For example, if one of the CSVs has a foreign key reference to another, that reference could be broken. To solve this, you’d need to use the drop_unknown_references utility function after creating and validating your metadata:

from sdv.metadata import Metadata
from sdv.utils import drop_unknown_references

metadata = Metadata.detect_from_dataframes(datasets)

# TODO inspect, update, and validate your metadata
metadata.update_column(...)

# drop any unknown references to ensure referential integrity
cleaned_data = drop_unknown_references(datasets, metadata)
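To see why truncation breaks referential integrity, here is a purely illustrative sketch (plain Python with made-up values, not SDV code) of how cutting a parent table orphans foreign keys in a child table — which is exactly the situation drop_unknown_references cleans up:

```python
# Purely illustrative -- the values are made up.
parent_ids = [1, 2, 3, 4, 5]        # primary keys in the parent table
truncated = set(parent_ids[:3])     # simulate reading only the first rows
child_fks = [1, 3, 5]               # foreign key values in the child table

# Any child row whose foreign key was truncated away is now orphaned
orphaned = [fk for fk in child_fks if fk not in truncated]
print(orphaned)  # [5] -> this child row points at a dropped parent
```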

Error when detecting metadata from dataframes

I am wondering if this is related to something going wrong when loading your data into Python. I haven’t observed this error before, so the following info would be very useful to help us debug:

  1. Which version of SDV Enterprise are you using? If you are not on the latest version (0.25.0 as described here), I’d recommend upgrading to get the latest fixes. You can run the commands below to find out.
import sdv
print(sdv.version.enterprise)
  2. Are you able to provide us with the full stack trace? (i.e. everything that gets printed out when the error occurs) I believe the AttributeError you’ve shared above would be the final line of the stack trace. This would help us see where it’s coming from.

In the meantime, are you noticing anything strange about the data that is being loaded into Python? For example, if you print out a few rows of each table, does anything look odd?

print(datasets['MY_TABLE_NAME'].head(10))

Hello @neha thanks for the quick response.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[9], line 1
----> 1 metadata = Metadata.detect_from_dataframes(datasets)

File sdv_enterprise\\sdv\\metadata\\metadata.pyx:115, in sdv_enterprise.sdv.metadata.metadata.MetadataExtension.detect_from_dataframes()

File c:\Users\Charles VU\AppData\Local\Programs\Python\Python311\Lib\site-packages\sdv\metadata\metadata.py:98, in Metadata._detect_from_dataframes(cls, data, infer_sdtypes, infer_keys, foreign_key_inference_algorithm)
     93     metadata.detect_table_from_dataframe(
     94         table_name, dataframe, infer_sdtypes, None if infer_keys is None else 'primary_only'
     95     )
     97 if infer_keys == 'primary_and_foreign':
---> 98     metadata._detect_relationships(data, foreign_key_inference_algorithm)
    100 return metadata

File sdv_enterprise\\sdv\\metadata\\multi_table.pyx:90, in sdv_enterprise.sdv.metadata.multi_table.MultiTableMetadataExtension._detect_relationships()

AttributeError: 'set' object has no attribute 'dtype'

This is the full AttributeError. Maybe it is because one of my CSVs is empty (I just noticed it). Other than that, I don’t see any problems…

In the meantime, yes, I understand that taking only the first 500K lines can later cause problems in my data, but it’s fine: in the end I want to generate some sort of mockup database for load tests, just with some kind of “realness” in case I have to give a little demo.

Hi @epicvu, always happy to help.

I was actually able to replicate your AttributeError on a fake dataset. It seems there is currently a bug in our detect_from_dataframes method whenever the auto-detection logic runs into multiple possible relationships (and needs to make a choice). This bug is new, but the good news is that the team is already aware and fixing it; the fix will be available in the next release.
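For intuition, here is a simplified sketch (made-up table names, not SDV’s actual inference code) of the kind of ambiguity involved: a child column name that matches primary keys in more than one candidate parent table forces the detection to make a choice.

```python
# Simplified sketch with hypothetical table names -- not SDV internals.
primary_keys = {'hotels': 'hotel_id', 'partner_hotels': 'hotel_id'}
child_columns = ['guest_id', 'hotel_id']

# Every parent whose primary key appears among the child's columns is a
# candidate match for a foreign key relationship
candidates = [table for table, pk in primary_keys.items() if pk in child_columns]
print(candidates)  # two candidates -> the detection must pick one
```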

Workaround

Until then, you can still auto-detect metadata by opting out of the foreign key detection logic that is causing the issue.

My recommended solution is to fall back to simpler logic that just uses column names:

metadata = Metadata.detect_from_dataframes(
    data=datasets,
    foreign_key_inference_algorithm='column_name_match'
)

Alternatively, you can turn off the foreign key auto-detection altogether by asking the SDV to detect primary keys only.

metadata = Metadata.detect_from_dataframes(
    data=datasets,
    infer_keys='primary_only'
)

Keep in mind that you can always add the foreign keys later using the add_relationship function.

metadata.add_relationship(
    parent_table_name='hotels',
    child_table_name='guests',
    parent_primary_key='hotel_id',
    child_foreign_key='hotel_id'
)

Apologies for the inconvenience!

For more info:

  • Click here to see the auto-detection API options
  • Click here to see the API for adding relationships (and updating metadata in general)

Hello @neha,

This is really great news! I’m glad that you were able to replicate the bug. Happy to find new bugs in the process!

So I tried:

metadata = Metadata.detect_from_dataframes(
  data=datasets,
  foreign_key_inference_algorithm='column_name_match')

And it got fixed! I am used to changing a lot of relationships, since the detection is not always the most accurate, so it’s fine :slight_smile:

Thank you very much for the help!

Bests,

@epicvu glad the workaround is useful for you :slight_smile: The bug fix will be available in the upcoming release. I will mark this current thread as resolved.

I am used to changing a lot of relationships, since the detection is not always the most accurate, so it’s fine

Would you like to start a new thread about this? We’re always looking for feedback about our auto-detection logic, as it is a work in progress. If you’re able to share any more details (table names, column names, examples of data, etc.), we can definitely use this feedback to improve our capabilities.

@neha I would love to contribute! Let’s start the new topic.

Great to hear @epicvu.

I created a new topic to discuss metadata auto-detection for primary/foreign key relationships. Click here to see the thread; feel free to continue the conversation there. Appreciate the feedback!