The importance of leading zeros

We’ve recently made an important enhancement to the way SDV imports data: We now retain leading zeros from the original CSV file when loading it into Python, which can dramatically improve your usage of SDV.

Leading zeros are largely ignored in data science libraries

A leading zero occurs in a cell whenever a number starts with 0 – for example “012325”, “02116”, or “091”. This is not generally the way that numbers are written or processed, so most software packages ignore the leading zeros when loading data. For example, when loading “012325” into Python, the pandas library defaults to stripping out the “0” and presents it as an integer: 12,325.

For general data science purposes, this is usually helpful; many statistical datasets just contain numbers. However, when creating synthetic data, we are typically dealing with enterprise-grade, “messy” data that is collected in the real world. The goal is to capture the formats as-is, and create synthetic data that matches the patterns. For SDV, it’s especially problematic to ignore leading zeros.

When creating synthetic data, leading zeros are important clues about what the data means

The datasets that our users have don’t just represent numbers. We find that in most cases, a leading zero is an important clue that the column actually isn’t a number. Usually, it’s a higher-level concept instead. Looking at our examples:

  • “012325” could actually represent the date in MMDDYYY format. Any month before October (10) is padded with a 0. Our value represents January 23, 2025.
  • “02116” could represent a postal code, which can start with “0” for certain regions. This particular postal code belongs to a region of Boston.
  • “091” could represent ID values, error codes, or other information that is unique to an enterprise. For example, every product could be associated with a 3-digit code.

This means that the leading zeros are important for all of these columns.

Tip: Use SDV’s CSVHandler and double-check your metadata

Use our CSVHandler feature to load in your data, rather than other data science tools. Our handler knows to keep leading zeros if they exist in the CSV. In this case, it loads in the columns as strings. It only converts columns to floats/ints if there are no leading zeros.

connector = CSVConnector()

data = connector.read(
    folder_name='project/data/',
    file_names=['users.csv', 'transactions.csv', 'sessions.csv'],
    keep_leading_zeros=True
)

Once the data is loaded, make sure that your metadata includes the correct sdtype for each column, updating it if necessary. For example, this is what the metadata may look like for 3 columns with leading zeros – a product ID column, a date column, and a postal code column.

"columns": {
    "product_id": {
        "sdtype": "id",
        "regex_format": "[0-9]{3}"
    },
    "date_available": {
        "sdtype": "datetime", 
        "datetime_format": "%m%d%y"
    },
    "area": {
        "sdtype": "postcode" 
    }
}

With our latest updates, SDV can load this type of data correctly. As a result, your synthetic data will contain valid IDs, datetime values, and postal code – even if they have leading zeros!

Resources

  • CSVHandler. Use this to load in CSV files into Python for modeling, and to write synthetic data back into a new file.
  • Database Integrations. SDV can also natively integrate with databases to understand the schema and import subsamples of data for modeling.
  • Metadata reference. Metadata gives the SDV more information about your data schema. This includes concepts like the sdtype, which tells SDV about the type of data in each column.
2 Likes