How to generate synthetic data for csv files to test data pipelines

Introduction

If you're working in healthcare data processing, you know the challenge: you need to test your pipelines with realistic medical claims data, but you can't use real patient information in your development or testing environments. This is especially true for medical claims software companies that receive CSV files from customers and need to process them through various pipeline stages.

At Neosync, we've been focused on solving this problem by providing tools to generate high-quality synthetic data. In this post, I'll share how you can use Neosync's APIs to generate synthetic medical claims CSV files that match your production data format without exposing any sensitive patient information.

The Medical Claims CSV Challenge

Medical claims software companies face a specific set of challenges:

  1. They receive CSV files from healthcare providers and insurance companies
  2. These files contain highly sensitive PHI (Protected Health Information)
  3. They need to test their ETL pipelines across development, QA, and staging environments
  4. The test data needs to maintain the exact format and schema of production files
  5. HIPAA compliance requirements prevent using real patient data in non-production environments

Traditional approaches fall short. Anonymizing production data is one option, but it works downstream of the real data, which still has to be pulled from production first. Manually creating test data is time-consuming and often doesn't represent real-world scenarios.

Generating Synthetic Medical Claims CSVs with Neosync

Let's look at a PySpark script that uses Neosync's APIs to generate synthetic medical claims data and output it as a CSV file – exactly mimicking the format your production pipeline would process.

Setting up the Claims Data Schema

First, we define a schema that matches the structure of the claims CSV files you'd receive from customers:

from pyspark.sql.types import (
    StructType, StructField, StringType, DateType, DoubleType
)

schema = StructType([
    StructField("patient_name", StringType(), False),
    StructField("patient_dob", DateType(), False),
    StructField("patient_ssn", StringType(), False),
    StructField("patient_email", StringType(), True),
    StructField("patient_phone", StringType(), True),
    StructField("patient_address", StringType(), True),
    StructField("provider_name", StringType(), False),
    StructField("charge_amount", DoubleType(), False)
])

This schema captures the essential fields typically found in medical claims data, but you can easily modify it to match your exact file format.
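The generation step later in this post passes a `template` record to Neosync, but the template itself is not shown. A minimal sketch of what it could look like follows: one placeholder value per schema field, serialized to JSON. The `CLAIM_FIELDS` name and all placeholder values are assumptions for illustration; each value would be overwritten by the transformer mapped to that field.

```python
import json

# Field names copied from the schema above, in CSV column order.
CLAIM_FIELDS = [
    "patient_name", "patient_dob", "patient_ssn", "patient_email",
    "patient_phone", "patient_address", "provider_name", "charge_amount",
]

# Hypothetical placeholder record: every value is a dummy, not real data.
template = {
    "patient_name": "placeholder",
    "patient_dob": "1970-01-01",
    "patient_ssn": "000-00-0000",
    "patient_email": "placeholder@example.com",
    "patient_phone": "0000000000",
    "patient_address": "placeholder",
    "provider_name": "placeholder",
    "charge_amount": 0.0,
}

# Fail early if the template drifts out of sync with the schema fields.
missing = [f for f in CLAIM_FIELDS if f not in template]
extra = [f for f in template if f not in CLAIM_FIELDS]
if missing or extra:
    raise ValueError(f"template mismatch: missing={missing}, extra={extra}")

# JSON string in the shape the generation step expects as input.
input_data = json.dumps(template)
```

Keeping the field list in one place makes it easier to notice when the schema and the template stop agreeing.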

Configuring Transformers for Medical Data

For each field in our CSV, we define specialized transformers that generate realistic medical claims data:

transformers = [
    TransformerMapping(
        expression=".patient_name",
        transformer=TransformerConfig(
            generate_full_name_config=GenerateFullName()
        )
    ),
    TransformerMapping(
        expression=".provider_name",
        transformer=TransformerConfig(
            generate_full_name_config=GenerateFullName()
        )
    ),
    TransformerMapping(
        expression=".charge_amount",
        transformer=TransformerConfig(
            generate_float64_config=GenerateFloat64(
                min=50.0,
                max=1000.0,
                precision=2
            )
        )
    ),
    # Other fields...
]

Notice how each field gets its own transformer, so you control exactly how every value is generated.

Custom Date Generator for Patient DOBs

Medical claims require valid dates of birth that follow specific patterns. Here's a custom JavaScript transformer that generates realistic DOBs:

custom_dob = """
function generateCustomDOB() {
  // Get current date
  const today = new Date();
 
  // Define age range for patients (18-90 years)
  const minAge = 18;
  const maxAge = 90;
 
  // Calculate date ranges based on ages
  const minYear = today.getFullYear() - maxAge;
  const maxYear = today.getFullYear() - minAge;
 
  // Generate random year within range
  const year = Math.floor(Math.random() * (maxYear - minYear + 1)) + minYear;
 
  // Generate random month (0-11)
  const month = Math.floor(Math.random() * 12);
 
  // Get number of days in the generated month
  const daysInMonth = new Date(year, month + 1, 0).getDate();
 
  // Generate random day (1 to days in month)
  const day = Math.floor(Math.random() * daysInMonth) + 1;
 
  // Format components with leading zeros if needed
  const formattedYear = year.toString();
  const formattedMonth = (month + 1).toString().padStart(2, '0');
  const formattedDay = day.toString().padStart(2, '0');
 
  // Format as YYYY-MM-DD for ISO date format
  return `${formattedYear}-${formattedMonth}-${formattedDay}`;
}
"""

This produces valid calendar dates, with correct month lengths and leap days, and birth years spread uniformly across the 18 to 90 age range.
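To check the generator's output on the Python side, a small validator can be useful. The sketch below is an assumption-laden illustration, not part of Neosync: `is_plausible_dob`, `MIN_AGE`, and `MAX_AGE` are hypothetical names that mirror the constants in the JavaScript above. Because the generator bounds the birth year rather than the exact age, a patient born late in the most recent allowed year can still be 17 on the day the test runs, so the check compares years only.

```python
from datetime import date

# Mirror the age bounds used by the JavaScript generator (assumed values).
MIN_AGE = 18
MAX_AGE = 90


def is_plausible_dob(value, today):
    """Return True if value is a real YYYY-MM-DD date in the generator's birth-year range."""
    try:
        dob = date.fromisoformat(value)
    except (ValueError, TypeError):
        return False  # not a date at all, e.g. "2001-02-30" or None
    if dob.isoformat() != value:
        return False  # parsed, but not in strict YYYY-MM-DD form
    return today.year - MAX_AGE <= dob.year <= today.year - MIN_AGE


print(is_plausible_dob("1985-07-04", date(2025, 6, 15)))
```

Passing `today` explicitly keeps the check deterministic in tests instead of depending on the clock.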

And we can add this function to our transformers array:

 
# ...
TransformerMapping(
    expression=".patient_dob",
    transformer=TransformerConfig(
        generate_javascript_config=GenerateJavascript(code=custom_dob)
    )
),
# ...

Generating and Saving the CSV

The key part of the process is generating each record through Neosync's API and then saving the entire dataset as a CSV:

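import json
from datetime import date

# Process the template through Neosync (repeat this call once per record you need)
response = client.anonymization.AnonymizeSingle(
    AnonymizeSingleRequest(
        input_data=json.dumps(template),
        transformer_mappings=transformers,
        account_id=ACCOUNT_ID
    )
)

# Parse the generated record. The DOB arrives as a "YYYY-MM-DD" string,
# but a DateType column requires a datetime.date value.
record = json.loads(response.output_data)
record["patient_dob"] = date.fromisoformat(record["patient_dob"])

# Add the processed record to our collection
all_records.append(record)

# Convert to DataFrame
df = spark.createDataFrame(all_records, schema=schema)

# Save to CSV (Spark writes a directory with this name containing one part file)
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("medical_claims_data.csv")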

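Spark writes the output as a directory named medical_claims_data.csv that, because of coalesce(1), contains a single part file. That part file has the same header and columns as the files your customers send, but contains only synthetic data.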
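Since Spark's CSV writer emits a directory of part files rather than one named file, pipelines that expect a single file path usually need a small post-processing step. The following is a minimal sketch of one way to do that with the standard library; `collapse_spark_csv` is a hypothetical helper name, and it assumes the output was written with `coalesce(1)` to a local filesystem.

```python
import glob
import os
import shutil


def collapse_spark_csv(output_dir, target_file):
    """Move the single part file out of a Spark CSV output directory."""
    if os.path.abspath(output_dir) == os.path.abspath(target_file):
        raise ValueError("target_file must differ from output_dir")
    # Data files are named part-*.csv; checksum files start with a dot
    # and are not matched by this pattern.
    parts = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError(
            f"expected exactly one part file in {output_dir}, found {len(parts)}"
        )
    shutil.move(parts[0], target_file)
    # Remove the now-empty directory along with marker and checksum files.
    shutil.rmtree(output_dir)
    return target_file
```

For example, `collapse_spark_csv("medical_claims_data.csv", "claims.csv")` would leave a single `claims.csv` file and remove the Spark output directory.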

Practical Use Cases in Medical Claims Processing

This approach solves several real-world problems for medical claims processors:

1. Pipeline Testing with Multiple File Formats

While we've focused on CSV files, the same approach works for other formats you might encounter in healthcare data interchange:

# Save as CSV
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("claims_data.csv")
 
# Save as pipe-delimited file
df.coalesce(1).write.mode("overwrite").option("header", "true").option("delimiter", "|").csv("claims_data.pipe")
 
# Save as fixed-width file
# ...
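Spark has no built-in fixed-width writer, so a fixed-width export usually means formatting each record into a padded line yourself. The sketch below shows the idea in plain Python. The `FIXED_WIDTHS` layout, its column widths, and the `to_fixed_width` name are all assumptions for illustration; a real layout would come from your trading partner's file specification.

```python
# Hypothetical layout: (field name, column width). Real widths come from
# the file specification you need to match.
FIXED_WIDTHS = [
    ("patient_name", 30),
    ("provider_name", 30),
    ("charge_amount", 10),
]


def to_fixed_width(record, layout=FIXED_WIDTHS):
    """Format one record as a single fixed-width line."""
    parts = []
    for field, width in layout:
        value = record.get(field, "")
        if isinstance(value, float):
            # Numbers: two decimals, right-aligned.
            text = f"{value:.2f}"[:width].rjust(width)
        else:
            # Text: truncated to the column width, left-aligned.
            text = str(value)[:width].ljust(width)
        parts.append(text)
    return "".join(parts)


line = to_fixed_width(
    {"patient_name": "Jane Doe", "provider_name": "Dr. Smith", "charge_amount": 125.5}
)
print(repr(line))
```

Every line has the same length, so downstream parsers can slice fields by position.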

2. Testing Claims Validation Rules

Generate edge cases to test your validation logic:

# Add some invalid claim records with extreme values
invalid_transformers = [
    # Same as before but with extreme charge amounts
    TransformerMapping(
        expression=".charge_amount",
        transformer=TransformerConfig(
            generate_float64_config=GenerateFloat64(
                min=10000.0,  # Unusually high amount
                max=50000.0,
                precision=2
            )
        )
    ),
    # Other fields...
]
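On the receiving side, the rule these edge cases are meant to exercise might look like the following. This is a minimal sketch of a hypothetical validation function, not part of Neosync or of any real claims system; the expected range of 50 to 1000 is an assumption borrowed from the normal generator settings earlier in this post.

```python
# Assumed business rule: charges outside this range need review.
MIN_EXPECTED_CHARGE = 50.0
MAX_EXPECTED_CHARGE = 1000.0


def validate_charge(record):
    """Return a list of validation errors for the record's charge amount."""
    errors = []
    amount = record.get("charge_amount")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        errors.append("charge_amount is missing or not numeric")
    elif not MIN_EXPECTED_CHARGE <= amount <= MAX_EXPECTED_CHARGE:
        errors.append(
            f"charge_amount {amount:.2f} is outside the expected range "
            f"{MIN_EXPECTED_CHARGE:.2f} to {MAX_EXPECTED_CHARGE:.2f}"
        )
    return errors


print(validate_charge({"charge_amount": 25000.0}))
```

Records generated with the extreme transformer settings should all produce an error here, while records from the normal settings should produce none.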

3. Automating Test Data Refresh

By integrating this script into your CI/CD pipeline, you can automatically refresh test data for each test run:

# In your CI/CD pipeline
def refresh_test_data():
    # Generate different sizes of test files
    generate_medical_claims_data(100)  # Small test file
    generate_medical_claims_data(10000, filename="large_claims_test.csv")  # Large test file
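After a refresh, a quick smoke check can confirm that each generated file has the expected header and row count before the rest of the test suite runs. The sketch below uses only the standard library; `check_claims_csv` and `EXPECTED_HEADER` are hypothetical names, and the header is assumed to match the schema fields defined earlier.

```python
import csv

# Assumed header: the schema field names in column order.
EXPECTED_HEADER = [
    "patient_name", "patient_dob", "patient_ssn", "patient_email",
    "patient_phone", "patient_address", "provider_name", "charge_amount",
]


def check_claims_csv(path, expected_rows):
    """Raise ValueError if the file's header or data row count is wrong."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != EXPECTED_HEADER:
            raise ValueError(f"unexpected header in {path}: {header}")
        # Count the remaining lines as data rows.
        row_count = sum(1 for _ in reader)
    if row_count != expected_rows:
        raise ValueError(
            f"{path} has {row_count} data rows, expected {expected_rows}"
        )
```

A CI job could call this once per generated file and fail fast when generation silently produces a short or malformed file.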

Conclusion

For medical claims software companies, having realistic test data is critical for building reliable processing pipelines. Using Neosync's APIs to generate synthetic CSV files offers several advantages:

  1. HIPAA Compliance: You completely eliminate the risk of exposing PHI in non-production environments
  2. Realistic Data: Your test files perfectly mimic the format and content of real claims data
  3. Flexibility: You can generate different variations of data to test edge cases and validation rules
  4. Automation: The process can be integrated into your testing pipeline for consistent data refresh

By implementing this approach, you can confidently test your claims processing pipelines with data that looks real but carries none of the compliance risks.

If you're looking to improve your medical claims testing process, give this approach a try. You can sign up for a free Neosync account at neosync.dev to get started. Also check out our SDK docs.

Happy testing!

