I was able to get it to run by bucketing each of the columns (except ID, sequence index, and age, since age is a context column) and then concatenating all the bucket symbols (digits) for a given row, effectively leaving a single column (plus those three) and saving space. I also made sure to use UniformEncoder to avoid a potentially memory-costly one-hot encoder. After generating synthetic data, I sampled uniformly within each bucket to recover the original data format. I just wanted to:
- let you know what we did,
- see if you had any thoughts on the merits of this approach, and
- check whether you have any other general tips for dealing with memory issues in PAR.
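For concreteness, here is a minimal sketch of the bucketing-and-concatenation step described above, using pandas. The column names, bucket count, and decode helper are all hypothetical illustrations, not the actual code used:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical multi-sequence data: id, sequence index, age (context), two numeric columns
df = pd.DataFrame({
    'id': np.repeat(['A', 'B'], 5),
    'seq_idx': list(range(5)) * 2,
    'age': np.repeat([34, 51], 5),
    'heart_rate': rng.uniform(55, 100, 10),
    'steps': rng.uniform(0, 5000, 10),
})

N_BUCKETS = 10  # one digit (0-9) per column

bucketed, edges = {}, {}
for col in ['heart_rate', 'steps']:
    # pd.cut with labels=False returns the bucket index of each value,
    # and retbins=True returns the bucket edges needed for decoding later
    bucketed[col], edges[col] = pd.cut(df[col], bins=N_BUCKETS, labels=False, retbins=True)

# Concatenate the bucket digits into one string column to treat as categorical
df['combined'] = bucketed['heart_rate'].astype(str) + bucketed['steps'].astype(str)

# To invert after synthesis: read a digit back out and sample uniformly within its bucket
def decode(symbol, col):
    b = int(symbol)
    lo, hi = edges[col][b], edges[col][b + 1]
    return rng.uniform(lo, hi)

df['heart_rate_decoded'] = df['combined'].str[0].map(lambda s: decode(s, 'heart_rate'))
```

The decoded values land in the same bucket as the originals, which is the lossy-but-bounded trade-off this approach makes.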
From your description, your approach sounds very similar to how our FixedCombinations constraint works: it essentially concatenates the columns into a single column that is treated as categorical (and used with UniformEncoder).
This certainly saves space, but an unintended side effect is that you limit the permutations of buckets that can be synthesized. That can be a positive side effect (or even an intended one, as in the case of FixedCombinations). Or it can be negative, because limiting the permutations limits the diversity of synthetic data that can be generated. Whether that matters depends on your use case.
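To make that trade-off concrete, here is a toy illustration (plain pandas, hypothetical bucket columns): once the concatenated column is modeled as categorical, only combinations that were observed in the real data can ever be sampled.

```python
import pandas as pd

# Hypothetical data with two already-bucketed columns (digits 0-2 each)
real = pd.DataFrame({'a_bucket': [0, 0, 1, 2], 'b_bucket': [1, 1, 2, 0]})

# Concatenate into one categorical column, as in the approach above
combined = real['a_bucket'].astype(str) + real['b_bucket'].astype(str)
observed = set(combined)  # only 3 of the 9 possible pairs appear

# A categorical model over `combined` can only emit observed symbols,
# so a pair like (a_bucket=2, b_bucket=2) -> '22' can never be synthesized
all_pairs = {f'{a}{b}' for a in range(3) for b in range(3)}
unreachable = all_pairs - observed
```

Here two thirds of the bucket combinations are unreachable; whether that restriction helps (enforcing realistic combinations) or hurts (reducing diversity) depends on the data.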
May I ask how fast the PARSynthesizer ran for you after making these changes?
To improve performance, you may also try to sub-sample your data. For multi-sequence data, the best way to do this would be to pick out entire sequences in your data. Here is a helper method you can use:
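The helper itself did not survive in this copy of the thread; based on the call `get_sequence_subset(data, 'Symbol', 10)` in the example below, a sketch of what it could look like (this is an assumed implementation, not verbatim SDV code):

```python
import pandas as pd

def get_sequence_subset(data, sequence_key, num_sequences, random_state=None):
    """Return the rows for `num_sequences` randomly chosen sequences.

    Every row of each selected sequence is kept, so sequences stay intact.
    """
    # Sample sequence identifiers without replacement
    selected = (
        data[sequence_key]
        .drop_duplicates()
        .sample(n=num_sequences, random_state=random_state)
    )
    # Keep all rows whose sequence key was selected
    return data[data[sequence_key].isin(selected)].reset_index(drop=True)
```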
Here is an example of applying this to the demo dataset of NASDAQ stocks, where the sequence_key is the ticker Symbol:
```python
from sdv.datasets.demo import download_demo

# This data contains stock ticker info for 100 different companies
data, metadata = download_demo(modality='sequential', dataset_name='nasdaq100_2019')

# Randomly select the full sequences for 10 companies
subsetted_data = get_sequence_subset(data, 'Symbol', 10)
```