Provide your segments to the SegmentSynthesizer

Our SegmentSynthesizer is a specialty synthesizer designed to model highly segmented data with higher accuracy. If your data table has high segmentation, it contains groups that behave in different ways. We’ve recently updated the features that are available when you use the SegmentSynthesizer.

If you know the segments already, you can input them

We originally designed the SegmentSynthesizer to determine the segments algorithmically. This is useful when you suspect that there are different groups present in your data, but you don’t have any explicit knowledge about how to separate them. In this case, SegmentSynthesizer finds the groups from the data itself and learns each segment’s patterns separately.

However, we’ve since come across cases where you might already know the segments. For example, if you’re dealing with a dataset of real and fraudulent transactions, you know the different groups – real and fraud. There’s already a column that contains this information.

In this case, you can now set up your SegmentSynthesizer to directly use the values in that “is_fraud” column as your segments.

from sdv.single_table import SegmentSynthesizer

synthesizer = SegmentSynthesizer(
    metadata,
    segmentation_params={
        'method': 'exact_values',
        'column_name': 'is_fraud' })

And since these segments are pre-determined, you can also adjust the settings for modeling each segment.

# use a GAN-based synthesizer for non-fraud rows
synthesizer.set_synthesizer_for_segment(
    segment_name=False, # non-fraud
    synthesizer_name='CTGANSynthesizer')

SegmentSynthesizer is optimized to learn patterns for each segment

Under the hood, SegmentSynthesizer models each segment separately so that it can learn the patterns unique to each one. You don’t have to manually split the table into real vs. fraud datasets. Everything is managed inside a single synthesizer.
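To make the idea of "exact value" segments concrete, here is a minimal sketch of the manual split that the synthesizer spares you from. It uses only the Python standard library. The toy rows, the `amount` column, and the `split_by_exact_values` helper are hypothetical and are not part of SDV; only the `is_fraud` column name comes from the example above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows; only the 'is_fraud' column name comes from the post.
rows = [
    {'amount': 20.0, 'is_fraud': False},
    {'amount': 35.0, 'is_fraud': False},
    {'amount': 50.0, 'is_fraud': False},
    {'amount': 900.0, 'is_fraud': True},
]


def split_by_exact_values(rows, column_name):
    """Group rows by the exact value found in column_name (illustrative helper)."""
    segments = defaultdict(list)
    for row in rows:
        segments[row[column_name]].append(row)
    return dict(segments)


segments = split_by_exact_values(rows, 'is_fraud')

# Each segment can now be summarized (or modeled) on its own.
segment_means = {
    name: mean(row['amount'] for row in segment_rows)
    for name, segment_rows in segments.items()
}
```

Doing this by hand means keeping track of one dataset (and one model) per segment, which is the bookkeeping a single synthesizer handles for you.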

But there’s more to SegmentSynthesizer than this.

The approach of modeling segments separately doesn’t work well if some segments are small. In our example, fraudulent transactions may occur very rarely, whereas normal, non-fraud transactions are plentiful. So the dataset might be skewed, with 1000 normal transactions and only 3 fraudulent ones. It’s hard to learn patterns from just 3 rows of data.

The SegmentSynthesizer does something smart for this case:

  • It learns any idiosyncratic patterns within the fraud segment
  • … but it also pulls more general information from the other segments to supplement what it learns. These general trends are learned from all transactions (fraud or not).

This represents the best of both worlds: learning each segment individually, while also using the entire dataset to enrich what is learned.
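One way to build intuition for this trade-off is partial pooling, where a small segment's estimate is blended with the overall estimate. The sketch below is only an analogy under that assumption. It is not the SegmentSynthesizer's actual algorithm, which this post does not spell out. The `shrunken_mean` helper, its `prior_strength` parameter, and the transaction amounts are all hypothetical.

```python
from statistics import mean


def shrunken_mean(segment_values, all_values, prior_strength=2.0):
    """Blend a segment's mean with the overall mean (illustrative only).

    The weight on the segment's own mean grows with its row count, so a
    3-row segment leans more on the overall trend, while a 1000-row
    segment is dominated by its own data.
    """
    n = len(segment_values)
    weight = n / (n + prior_strength)
    return weight * mean(segment_values) + (1 - weight) * mean(all_values)


# Hypothetical amounts: 1000 normal transactions and only 3 fraudulent ones.
normal_amounts = [50.0] * 1000
fraud_amounts = [800.0, 900.0, 1000.0]
all_amounts = normal_amounts + fraud_amounts

normal_estimate = shrunken_mean(normal_amounts, all_amounts)
fraud_estimate = shrunken_mean(fraud_amounts, all_amounts)
```

In this toy setup the large segment's estimate stays very close to its own mean, while the small segment's estimate lands between its own mean and the overall mean. A larger `prior_strength` pulls small segments further toward the overall trend.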

SegmentSynthesizer is compatible with the SDV platform

All SDV synthesizers are compatible with the overall platform. This means that you can use all your favorite SDV features with any synthesizer – including SegmentSynthesizer. For example:

Conditionally sample segments. This can be useful for re-balancing imbalanced data – like sampling 1000 fraudulent transactions to balance out the data.

from sdv.sampling import Condition

fraud_condition = Condition(
    num_rows=1000,
    column_values={'is_fraud': True})

synthetic_fraud_data = synthesizer.sample_from_conditions([fraud_condition])

Add constraints to capture your business logic. SegmentSynthesizer is compatible with any constraints that you have access to – including constraints in the CAG bundle like FixedNullCombinations.

Use the SegmentSynthesizer within a multi-table schema. If you’re modeling a multi-table database and you know that one of your tables is highly segmented, you can choose to model that particular table with the SegmentSynthesizer.

from sdv.multi_table import HSASynthesizer

# create a synthesizer for a larger, multi-table schema
synthesizer = HSASynthesizer(metadata)

# update one of the tables to use SegmentSynthesizer
synthesizer.set_table_parameters(
    table_name='transactions',
    table_synthesizer='SegmentSynthesizer',
    table_parameters={
        'segmentation_params': {
            'method': 'exact_values',
            'column_name': 'is_fraud' }})

That’s the power of our comprehensive SDV platform!

Would the SegmentSynthesizer be useful for your use case? Read about the XSynthesizers bundle, which gives you access to many useful synthesizers including SegmentSynthesizer.