Hi @epicvu, thanks for starting this thread. Tackling each of your questions in separate sections below.
CSVHandler loading time & truncation
We created the CSVHandler object as a convenience wrapper for reading and writing multiple CSV files at once. I agree that it would be good to offer all of the read options available in the underlying pandas.read_csv function. I will raise a feature request so that CSVHandler supports the same functionality as load_csvs. In the meantime, you can continue to use load_csvs.
BTW I noticed that you are using the nrows parameter when using this function to read only the first 500K rows from each file. Is this working out for you?
datasets = load_csvs(
    "../data/dump",
    read_csv_parameters={
        "encoding": "latin-1",
        "nrows": 500000,  # read only the first 500K lines from each file
        "escapechar": "\\",
        "quotechar": '"',
    })
A potential issue is that when you have multiple tables, loading only the first 500K rows of each table can break references between them (a loss of referential integrity). For example, if one of the CSVs has a foreign key reference to another, the referenced row may no longer be present. To solve this, you'd need to use the drop_unknown_references utility function after creating (and validating) your metadata:
from sdv.metadata import Metadata
from sdv.utils import drop_unknown_references

metadata = Metadata.detect_from_dataframes(datasets)

# TODO: inspect, update, and validate your metadata
metadata.update_column(...)

# drop any unknown references to ensure referential integrity
cleaned_data = drop_unknown_references(datasets, metadata)
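To see why truncation breaks references, here is a pure-pandas sketch (the table and column names are hypothetical) that counts the orphaned foreign keys left behind after truncating a parent table:

```python
import pandas as pd

# hypothetical parent/child tables; truncating the parent to its first
# 2 rows leaves child rows pointing at ids that no longer exist
users = pd.DataFrame({"user_id": [1, 2, 3, 4]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "user_id": [1, 3, 4]})

users_truncated = users.head(2)  # simulates a small nrows limit

# child rows whose user_id is missing from the truncated parent are orphans
orphans = orders[~orders["user_id"].isin(users_truncated["user_id"])]
print(len(orphans))
```

These orphaned rows are exactly what drop_unknown_references removes so that the remaining data stays referentially intact.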
Error when detecting metadata from dataframes
I am wondering if this is related to something going wrong while loading your data into Python. I haven't observed this error before, so the following info would be very useful to help us debug:
- Which version of SDV Enterprise are you using? If you are not on the latest version (0.25.0 as described here), I’d recommend upgrading to get the latest fixes. You can use the command below to find out.
import sdv
print(sdv.version.enterprise)
- Are you able to provide us with the full stack trace (i.e. everything that gets printed out when the error occurs)? I believe the AttributeError you've shared above is just the final line of the stack trace. The full trace would help us see where it's coming from.
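If it's easier, you can also capture the full trace programmatically and paste it here. A minimal, generic sketch using the standard library (failing_call is just a stand-in; in your case it would be the Metadata.detect_from_dataframes call):

```python
import traceback

def failing_call():
    # stand-in for whatever call raises the AttributeError
    raise AttributeError("example error")

try:
    failing_call()
except AttributeError:
    trace = traceback.format_exc()  # the complete stack trace as a string
    print(trace)
```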
In the meantime, are you noticing anything strange about the data being loaded into Python? For example, if you print out a few rows of each of the tables, is there anything odd about how they look?
print(datasets['MY_TABLE_NAME'].head(10))
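You could also print each table's shape and inferred dtypes, since a wrong encoding or escape character often shows up as columns being mashed together or parsed as plain object columns. A quick sketch (the table names and contents below are hypothetical stand-ins for what load_csvs returns):

```python
import pandas as pd

# hypothetical stand-in for the dict of DataFrames returned by load_csvs
datasets = {
    "users": pd.DataFrame({"user_id": [1, 2], "name": ["a", "b"]}),
    "orders": pd.DataFrame({"order_id": [10], "user_id": [1]}),
}

for name, df in datasets.items():
    print(name, df.shape)  # row/column counts per table
    print(df.dtypes)       # a mis-parsed file often loads as a single 'object' column
```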