Improving metadata auto-detection of primary/foreign key connections

I’m starting this topic to discuss the overall issue of improving the metadata auto-detection for primary and foreign key connections. It will be helpful to reply with examples of cases where metadata auto-detection did not work as expected – any/all details would be much appreciated including:

  • What was the expected primary/foreign key relationship? Details can include: table names, column names, column position (e.g., leftmost column, rightmost, etc.), and examples of the data they contain.
  • Was the primary key detected correctly? If not, what was detected instead?
  • Assuming the foreign key was detected incorrectly: Was a different foreign key column detected instead? What was it?
  • Were any extraneous primary/foreign key relationships detected? Are there any patterns you are noticing?

The current algorithm

As of SDV Enterprise version 0.25.0, we offer 2 different algorithms. They can be controlled via the foreign_key_inference_algorithm parameter, as shown below:

metadata = Metadata.detect_from_dataframes(
  data=my_dataframes,
  foreign_key_inference_algorithm='data_match'  # choose between 'data_match' and 'column_name_match'
)

The 'data_match' algorithm checks whether the actual data matches between a foreign key and a primary key. Meanwhile, the 'column_name_match' algorithm only checks whether the column name matches between a foreign key and a primary key.
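To make the distinction concrete, here is a toy pandas sketch of the two criteria. This is not the actual SDV implementation; the helper functions and the 'users'/'orders' tables are made up for illustration:

```python
import pandas as pd

# Toy tables: 'users' (primary key user_id) and 'orders' (candidate foreign key user_id)
users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['a', 'b', 'c']})
orders = pd.DataFrame({'user_id': [1, 1, 3], 'total': [9.99, 5.00, 12.50]})

def column_name_match(pk_name, child_columns):
    """Name-based proxy: the child table has a column sharing the primary key's name."""
    return pk_name in child_columns

def data_match(pk_values, fk_values):
    """Data-based check: every candidate foreign key value exists among the primary key values."""
    return set(fk_values).issubset(set(pk_values))

print(column_name_match('user_id', orders.columns))    # True
print(data_match(users['user_id'], orders['user_id'])) # True
```

Here both criteria agree; the thread below discusses the cases where they diverge (e.g. matching data under different column names).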

Issues we are aware of

  • Both algorithms may create metadata that reuses the same foreign key column in multiple relationships, which is not valid (for more information, click here). The team is actively fixing this issue for both algorithms.
  • Composite keys (composite primary keys/composite foreign keys) are not currently auto-detected by either of the algorithms.

@neha : 2 upfront questions

  • Does this include auto-detecting relations with composite keys?
  • Does the algorithm consider both data and column names in any way, or are the methods mutually exclusive?

@anon78540062 , @rameshkannan , @bhavya.kundra : please follow and contribute if needed

Hi @Wim, both great questions.

  • Does this include auto-detecting relations with composite keys?

At this time, neither of the algorithms is set up to auto-detect composite keys. I will add this to the list of issues that we are aware of.

  • Does the algorithm consider both data and column names in any way, or are the methods mutually exclusive?

The algorithms are mutually exclusive right now. We consider 'data_match' to be superior because the actual data matching (between a foreign and primary key) is the ultimate proof that the connection actually exists – more so than just the column names aligning. The 'column_name_match' is a proxy that the columns might be related. Definitely open for feedback here!

Hello,

I will try my best to share my latest experiences with the metadata auto-detection. My database consists of about 40 tables. My main table is called “terrain” and its primary key is “code”.

The table “terrain” does not contain any columns that have “id” in their names, and the primary key was still predicted correctly. (The column is positioned in the middle of the table.)

The issues start when trying to find the relationships. All the other tables have the same foreign key, “codeterrain” (in the 1st to 5th position depending on the table), yet the only table that could be linked to “terrain” was “auditlog”, and no primary key was detected for it. The only column with “id” in its name is “id_auditlog”. The primary key is a combination of multiple columns, mainly “id_tablename” + “codeterrain”, which probably caused problems for the primary key detection.

The only links I wanted were from “terrain” to all the other tables. It was easily fixable using Python, but there were no links at all between tables. After the metadata auto-detect, the system only detected 1 relationship and didn’t link the rest of the tables.

So far, the primary key detection has been almost always right. In my case, all the tables except “terrain” didn’t have any primary_key detected (which is good).

I’ll try to give an update when working with another database but this is my experience so far.

Best,

Charles

Thank you very much @epicvu for the detailed description. Really appreciate it!

Summarizing below, just to make sure I have the right understanding and adding some of my thoughts.

Schema setup:

  • You have 1 table called "terrain" which is a parent table with primary key "code".
  • There are a bunch of other tables (~40) that are children of "terrain". Each of these tables has a foreign key called codeterrain which links to code.
    • Most of these tables do not have a primary key of their own.
    • One of the tables has a composite primary key – the primary key is a combination of "id_tablename" and "codeterrain".

Metadata auto-detection results:

  • The primary keys are generally accurate: the primary key of table "terrain" is detected correctly, as is the fact that the remaining tables do not have primary keys. (The composite primary key is not detected – this is listed in the first comment as a known issue.)
  • The foreign keys are not detected at all. Of course, the 'column_name_match' algorithm wouldn’t work here, as the foreign key’s name (codeterrain) is not exactly the same as the primary key’s name (code).

Questions:

  • You mention whether the columns do or do not contain "id" in their names. In your experience, does this affect the quality of the auto-detection? For example, if the column were called code_id, would that change what is detected?
  • Are there any unknown references between the tables? I.e. is it ever the case that a value in codeterrain may not be found in the primary key code? This is one reason why the data_match algorithm may not be working.
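For anyone who wants to check this on their own data, here is a quick pandas sketch of the question being asked above: do all child values appear in the parent key? The table and column names follow the example in this thread; this is not an SDV API:

```python
import pandas as pd

# Parent table 'terrain' with primary key 'code', and one child table with
# foreign key 'codeterrain'. Any child value with no parent is a broken link
# that would make a strict data-based match fail.
terrain = pd.DataFrame({'code': ['T1', 'T2', 'T3']})
child = pd.DataFrame({'codeterrain': ['T1', 'T2', 'T9']})  # 'T9' has no parent

unknown = child.loc[~child['codeterrain'].isin(terrain['code']), 'codeterrain']
print(unknown.tolist())  # ['T9']
```

An empty result means every reference resolves; any leftover values are the "unknown references" being discussed here.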

@neha Does it mean that even if just one PK-FK link is broken in a relation, the (data-based) auto-detection won’t work? Perhaps some tolerance should be put in place? Or should the user have that option? And if the column names match, would a “name-based” match work even if some links are broken?

Hi @Wim,

Does it mean that even if just one PK-FK link is broken in a relation, the (data-based) auto-detection won’t work? Perhaps some tolerance should be put in place? Or should the user have that option?

This is mostly true. For a large dataset, the algorithm validates a subset of the full column, but it does expect that 100% of the values within that subset have a match. I think some tolerance is an excellent idea (perhaps exposed as a parameter); I will note this down internally as a feature request.
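As a sketch of what such a tolerance could look like, the check could compute a match ratio instead of requiring strict subset containment. This is purely illustrative: match_ratio and the 75% threshold are hypothetical, not an existing SDV parameter:

```python
import pandas as pd

def match_ratio(pk_values, fk_values):
    """Fraction of non-null candidate foreign key values found among the primary key values."""
    fk = pd.Series(fk_values).dropna()
    if fk.empty:
        return 0.0
    return float(fk.isin(set(pk_values)).mean())

# One broken link ('Z') out of five foreign key values:
pk = ['A', 'B', 'C', 'D']
fk = ['A', 'A', 'B', 'C', 'Z']

ratio = match_ratio(pk, fk)
print(ratio)          # 0.8
print(ratio >= 0.75)  # True: a 75% tolerance threshold would accept the match
```

Under a strict 100% requirement, the single broken link vetoes the relationship; with a threshold, the relationship can still be inferred and the broken rows handled separately.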

I expect this will predominantly be an issue when you have already saved your multi-table data locally (in CSV files or similar), since in that case there may be broken links. If instead you use our AI Connectors feature (click here), we will automatically connect to your database and import a data sample with referential integrity.

And if the column names match, would a “name-based” match work even if some links are broken?

Yes. The 'column_name_match' algorithm only checks for matching names and does not look at the actual data, so it would work even if links are broken.

Of course, this would only find relationships if your database design enforces that related columns have the same name. I’m not sure how frequently that occurs in your systems?

I have seen both cases (names matching at both ends of the relation, names not matching).


Hi @neha

Questions:

  • You mention whether the columns do or do not contain "id" in their names. In your experience, does this affect the quality of the auto-detection? For example, if the column were called code_id, would that change what is detected?

Well, in my experience I didn’t feel it was affecting the auto-detection.

  • Are there any unknown references between the tables? I.e. is it ever the case that a value in codeterrain may not be found in the primary key code? This is one reason why the data_match algorithm may not be working.

Yes, in some cases not all the foreign key values are present among the primary key values.

Thanks for confirming @epicvu. Appreciate the details.

It seems the algorithm here is failing you for the same reasons we discussed above: right now, the data_match algorithm does not have any tolerance for missing references (broken links); it expects a 100% match. I have noted this down internally for the team to fix.

Why are there missing references? I am assuming this may be happening because you are downloading the data separately for these different tables. After fixing/updating the metadata, I assume you later run drop_unknown_references, since SDV does expect referential integrity before ultimately fitting on the data. Please correct me if my understanding is wrong.
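For context, here is a minimal pandas sketch of the kind of cleanup drop_unknown_references performs, as I understand it: removing child rows whose foreign key value has no matching parent. The table names follow the example in this thread; the real SDV utility operates on the metadata and all tables together, not on two DataFrames like this:

```python
import pandas as pd

# Parent 'terrain' with primary key 'code'; child rows referencing 'T9' are
# unknown references, since 'T9' does not exist in the parent.
terrain = pd.DataFrame({'code': ['T1', 'T2']})
child = pd.DataFrame({'codeterrain': ['T1', 'T2', 'T9'], 'value': [10, 20, 30]})

# Keep only child rows whose foreign key resolves to a parent primary key,
# restoring referential integrity before fitting.
clean = child[child['codeterrain'].isin(terrain['code'])].reset_index(drop=True)
print(len(clean))  # 2 (the row referencing the unknown 'T9' was dropped)
```

This is why running the cleanup right before fitting works: the synthesizer then only ever sees rows with valid references.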

@neha Thanks for the reply.

Well, yes, I do use drop_unknown_references later, right before fitting. As for the missing data, I think it’s because my CSV files are broken, so I chose to only load 1M lines instead of the whole dataset (because of parsing issues).
