This is a more general question regarding the scalability of the multi-table synthesizers to very large training and synthesized datasets that are too large to fit into memory.
If I have a very large database that cannot be read into memory at training time, what is the best way to use as much of the original data as possible for training? Is there a way to take samples of the original database and iteratively update the model with new samples? As of now, it does not appear that `sdv` supports this.
Currently our approach is to train on a stratified sample of the original database, but we have had mixed results with this.
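As a point of reference for the discussion, a stratified sample like the one described above can be drawn with plain pandas before fitting a synthesizer. This is an illustrative sketch, not code from the original poster: the table and the `state` stratification column are hypothetical.

```python
import pandas as pd

# Hypothetical parent table; "state" stands in for whatever column
# the stratification is based on.
patients = pd.DataFrame({
    "patient_id": range(10),
    "state": ["NY"] * 6 + ["VT"] * 4,
})

# Sample the same fraction from every stratum so that rare groups
# are still represented in the training subset.
sample = patients.groupby("state", group_keys=False).sample(
    frac=0.5, random_state=0
)
```

Because the fraction is applied per group, a stratum with few rows still contributes rows to the sample, which a plain random sample might miss.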
Currently SDV does not support batch training. In large part, this is because our customers have typically seen reasonable results after sub-sampling their training data. You mention that in your case, stratified sampling is not producing good results. Could you elaborate on this? In particular, I am curious about what your schema looks like (metadata), what kind of stratification method you are using, and how large a training dataset you are using. Also, when you say mixed results, is there a specific metric or quality measure you are looking at?
For more context: Some time ago, we performed an experiment that suggested a random subsample is sufficient for training SDV. The intuition is that SDV uses the training data to learn patterns. You need some data to establish those patterns, but after a certain point adding more data does not give SDV any new information; it merely reinforces the same pattern it already knows. So from a theoretical perspective, there wouldn't be any benefit to training on all the data. (To help our users extract a random training set for multi-table, we have provided some options in our database connectors feature.)
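One subtlety worth noting when subsampling multi-table data (and not shown in the thread, so this is a hedged sketch rather than the SDV connectors' actual behavior): if you sample each table independently, foreign keys in child tables can end up pointing at parents that were dropped. A simple fix is to sample the parent table first and then filter the children. The table and column names below are hypothetical.

```python
import pandas as pd

def subsample_tables(parent, child, pk_col, fk_col, frac, seed=0):
    """Random subsample that preserves referential integrity:
    sample parents, then keep only child rows whose foreign key
    references a sampled parent."""
    parent_sample = parent.sample(frac=frac, random_state=seed)
    child_sample = child[child[fk_col].isin(parent_sample[pk_col])]
    return parent_sample, child_sample

# Illustrative two-table dataset: 100 users, 3 orders per user.
users = pd.DataFrame({"user_id": range(100)})
orders = pd.DataFrame({
    "order_id": range(300),
    "user_id": [i % 100 for i in range(300)],
})

u, o = subsample_tables(users, orders, "user_id", "user_id", frac=0.1)
```

Every order kept by this sketch references a sampled user, so the subsample remains a valid relational dataset to fit on.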
Of course, our understanding of this phenomenon may change based on what we learn from you. I would like to figure out whether training on more data will help SDV produce better data for you, or whether there are certain patterns in your data that are inherently hard for SDV to learn. In the first case, iterative batch training would be a good feature for us to add. In the latter case, perhaps there are customizations and parameters we can try instead.
Thank you for your response. The specific issue I was referring to here is for some data that populates test queries for HEDIS Measures. One of the challenges is that many of the measures have very specific rules (e.g. women aged 18-45 who were admitted twice in the last 100 days for a metabolic disease).
Due to size constraints, we have found that we must draw very specific samples to ensure these rules are met: we select an initial population that contains at least one individual matching some of the criteria for a measure.
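The "seeded" sampling described above might be sketched as follows. This is a minimal illustration, not the poster's actual pipeline: the columns, the criterion (a simplified stand-in for a HEDIS-style rule), and the function name are all hypothetical.

```python
import pandas as pd

def sample_with_seed_rows(df, criterion, n, seed=0):
    """Draw n rows, guaranteeing at least one row that satisfies
    the given criterion (if any exists), then fill the rest randomly."""
    matches = df[criterion(df)]
    rest = df.drop(matches.index)
    seeded = matches.sample(n=min(1, len(matches)), random_state=seed)
    filler = rest.sample(n=max(0, n - len(seeded)), random_state=seed)
    return pd.concat([seeded, filler])

# Illustrative population table.
people = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F"],
    "age": [30, 70, 40, 25, 22],
    "admissions_100d": [2, 1, 3, 0, 0],
})

# Simplified stand-in for a measure rule: women aged 18-45 with
# at least two admissions in the last 100 days.
crit = lambda d: (d["sex"] == "F") & d["age"].between(18, 45) & (d["admissions_100d"] >= 2)

sample = sample_with_seed_rows(people, crit, n=3)
```

Guaranteeing at least one qualifying individual per measure gives the synthesizer a chance to see the rule's pattern at all, which a uniform random subsample of a large population might omit entirely.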