PAR Out-of-Memory

Error:

PAR throws an out-of-memory error

Environment Details:

SDV Version: 1.11.0
Python Version: 3.11.7
Operating System: Windows 10 Enterprise

Data Context:

rows: 4242950
unique sequences: 115742
columns: 17

Code Context:

from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df_input)
metadata.update_column('Member_ID', sdtype='id')
metadata.set_sequence_key('Member_ID')
metadata.set_sequence_index('DateIndex')
synthesizer = PARSynthesizer(metadata=metadata, epochs=1, context_columns=['Gender', 'Age'],
                             verbose=True, segment_size=10)
synthesizer.fit(df_input)

I was able to get it to run by bucketing each of the columns (except the ID, the sequence index, and Age, since Age is a context column) and then concatenating the bucket symbols (digits) for each row into a single column, so the model effectively sees only one column (plus those three), which saves space. I also made sure to use UniformEncoder to avoid a potentially memory-costly one-hot encoder. After generating synthetic data, I uniformly sampled within each bucket to recover the original data format. I wanted to:

- let you know what we did,
- see if you had any thoughts on the merits of this approach, and
- check whether you have any other general suggestions for dealing with memory issues in PAR.
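For readers who want to see the idea concretely, here is a minimal, hypothetical sketch of the bucket-and-concatenate round trip. The helper names (`bucketize`/`debucketize`) and the 10-bucket choice are ours for illustration, not SDV API, and the sketch assumes at most 10 buckets so each column contributes exactly one digit:

```python
import numpy as np
import pandas as pd

def bucketize(df, columns, n_buckets=10):
    """Quantize each column into n_buckets and concatenate the per-column
    bucket digits into one string column. Bin edges are kept so original-scale
    values can be sampled back later. Assumes n_buckets <= 10 (one digit each).
    """
    edges = {}
    digit_codes = []
    for col in columns:
        binned, bins = pd.cut(df[col], bins=n_buckets, retbins=True, labels=False)
        edges[col] = bins
        digit_codes.append(binned.astype(int).astype(str))
    combined = digit_codes[0]
    for code in digit_codes[1:]:
        combined = combined + code
    return combined.rename('bucket_code'), edges

def debucketize(codes, columns, edges, rng=None):
    """Invert bucketize: split the digit string back into per-column bucket
    indices and sample uniformly within each bucket's edges."""
    rng = np.random.default_rng() if rng is None else rng
    out = {}
    for i, col in enumerate(columns):
        idx = codes.str[i].astype(int).to_numpy()
        lo = edges[col][idx]       # lower edge of each row's bucket
        hi = edges[col][idx + 1]   # upper edge of each row's bucket
        out[col] = rng.uniform(lo, hi)
    return pd.DataFrame(out)
```

The single `bucket_code` column would then be what the synthesizer models (with UniformEncoder), and `debucketize` would run on the sampled output.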

Thanks for following up @pranav.rupireddy

From your description, it seems that your approach is very similar to the way our FixedCombinations constraint works. It essentially concatenates the columns into a single column that is treated as categorical (and can be used with UniformEncoder).

I think this certainly saves space, but an unintended side effect is that you are limiting the permutations of buckets that can be synthesized. That can be positive (or even intended, as in the case of FixedCombinations), or negative, because limiting the permutations limits the diversity of synthetic data that can be generated. Whether the trade-off is acceptable depends on your use case.
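To make that trade-off concrete, here is a toy illustration (entirely made-up data) of how concatenation caps the reachable combinations:

```python
import pandas as pd

# Two 3-valued bucket columns: 9 pairs are possible in principle,
# but this observed data only contains 4 distinct combinations.
df = pd.DataFrame({
    'bucket_x': ['0', '0', '1', '2', '2'],
    'bucket_y': ['0', '1', '1', '2', '2'],
})

# Concatenating into one categorical column (the FixedCombinations idea)
# means a synthesizer can only ever emit the 4 observed combinations.
combined = df['bucket_x'] + df['bucket_y']
print(sorted(combined.unique()))                             # ['00', '01', '11', '22']
print(df['bucket_x'].nunique() * df['bucket_y'].nunique())   # 9 pairs possible in principle
```

Modeling the two columns independently could generate any of the 9 pairs; the concatenated version never will.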

May I ask how fast the PARSynthesizer ran for you after making these changes?

To improve performance, you may also try to sub-sample your data. For multi-sequence data, the best way to do this would be to pick out entire sequences in your data. Here is a helper method you can use:

import numpy as np

def get_sequence_subset(data, sequence_key, num_sequences):
    """Randomly select num_sequences entire sequences from multi-sequence data."""
    unique_sequences = data[sequence_key].unique()
    # replace=False guarantees num_sequences *distinct* sequences are chosen
    sequence_subset = np.random.choice(unique_sequences, size=num_sequences, replace=False)
    return data[data[sequence_key].isin(sequence_subset)].reset_index(drop=True)

Here is an example of applying this to the demo dataset of NASDAQ stocks, where the sequence_key is the ticker Symbol:

from sdv.datasets.demo import download_demo

# this data contains stock ticker info for 100 different companies
data, metadata = download_demo(modality='sequential', dataset_name='nasdaq100_2019')

# randomly select the full sequences for 10 companies
subsetted_data = get_sequence_subset(data, 'Symbol', 10)
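If you'd like to sanity-check the helper without downloading the demo data, a toy multi-sequence frame works too (the helper is repeated here so the snippet runs on its own, and the tickers are arbitrary):

```python
import numpy as np
import pandas as pd

# repeated from above so this snippet is self-contained
def get_sequence_subset(data, sequence_key, num_sequences):
    unique_sequences = data[sequence_key].unique()
    chosen = np.random.choice(unique_sequences, size=num_sequences, replace=False)
    return data[data[sequence_key].isin(chosen)].reset_index(drop=True)

# toy multi-sequence data: 5 sequences of 3 rows each
toy = pd.DataFrame({
    'Symbol': np.repeat(['AAPL', 'MSFT', 'GOOG', 'AMZN', 'META'], 3),
    'Close': np.arange(15, dtype=float),
})

subset = get_sequence_subset(toy, 'Symbol', 2)

# every selected sequence comes back whole: 2 sequences x 3 rows each
print(subset['Symbol'].nunique())  # 2
print(len(subset))                 # 6
```

After subsetting, you would fit as before with `synthesizer.fit(subsetted_data)`; memory and runtime scale with the number of sequences kept.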