
Top Open Source Alternatives to Tonic AI for Data Anonymization and Synthetic Data
March 31st, 2025
If you're working in healthcare data processing, you know the challenge: you need to test your pipelines with realistic medical claims data, but you can't use real patient information in your development or testing environments. This is especially true for medical claims software companies that receive CSV files from customers and need to process them through various pipeline stages.
At Neosync, we've been focused on solving this problem by providing tools to generate high-quality synthetic data. In this blog, I'll share how you can use Neosync's APIs to generate synthetic medical claims CSV files that perfectly mimic your production data format without exposing any sensitive patient information.
Medical claims software companies face a specific set of challenges:
Traditional approaches fall short. Anonymizing production data is an option, but it only works downstream, and manually creating test data is time-consuming and often doesn't represent real-world scenarios.
Let's look at a PySpark script that uses Neosync's APIs to generate synthetic medical claims data and output it as a CSV file – exactly mimicking the format your production pipeline would process.
First, we define a schema that matches the structure of the claims CSV files you'd receive from customers:
from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

schema = StructType([
    StructField("patient_name", StringType(), False),
    StructField("patient_dob", DateType(), False),
    StructField("patient_ssn", StringType(), False),
    StructField("patient_email", StringType(), True),
    StructField("patient_phone", StringType(), True),
    StructField("patient_address", StringType(), True),
    StructField("provider_name", StringType(), False),
    StructField("charge_amount", DoubleType(), False)
])
This schema captures the essential fields typically found in medical claims data, but you can easily modify it to match your exact file format.
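Before a record reaches Spark, it can help to check it against the same column list. The following is a minimal sketch in plain Python; `CLAIM_COLUMNS` and `check_record` are hypothetical names introduced here for illustration and are not part of Neosync or PySpark.

```python
from datetime import date

# Hypothetical plain-Python mirror of the schema above:
# column name -> (expected Python type, nullable)
CLAIM_COLUMNS = {
    "patient_name": (str, False),
    "patient_dob": (date, False),
    "patient_ssn": (str, False),
    "patient_email": (str, True),
    "patient_phone": (str, True),
    "patient_address": (str, True),
    "provider_name": (str, False),
    "charge_amount": (float, False),
}


def check_record(record):
    """Return a list of problems found in one claim record (empty if none)."""
    problems = []
    for name, (expected_type, nullable) in CLAIM_COLUMNS.items():
        value = record.get(name)
        if value is None:
            if not nullable:
                problems.append(f"{name}: required field is missing")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Flag columns that the schema does not define
    for name in record:
        if name not in CLAIM_COLUMNS:
            problems.append(f"{name}: unexpected field")
    return problems
```

Running a check like this on each generated record gives a readable error message per field, which is usually easier to act on than a schema failure raised for the whole batch.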
For each field in our CSV, we define specialized transformers that generate realistic medical claims data:
transformers = [
    # Synthetic full name for the patient
    TransformerMapping(
        expression=".patient_name",
        transformer=TransformerConfig(
            generate_full_name_config=GenerateFullName()
        )
    ),
    # Synthetic full name for the provider (same generator as above)
    TransformerMapping(
        expression=".provider_name",
        transformer=TransformerConfig(
            generate_full_name_config=GenerateFullName()
        )
    ),
    # Charge amount between 50.0 and 1000.0
    TransformerMapping(
        expression=".charge_amount",
        transformer=TransformerConfig(
            generate_float64_config=GenerateFloat64(
                min=50.0,
                max=1000.0,
                precision=2
            )
        )
    ),
    # Other fields...
]
Notice how we can control each generated field individually to fit our exact needs.
Medical claims require valid dates of birth that follow specific patterns. Here's a custom JavaScript transformer that generates realistic DOBs:
custom_dob = """
function generateCustomDOB() {
  // Get current date
  const today = new Date();
  // Define age range for patients (18-90 years)
  const minAge = 18;
  const maxAge = 90;
  // Calculate date ranges based on ages
  const minYear = today.getFullYear() - maxAge;
  const maxYear = today.getFullYear() - minAge;
  // Generate random year within range
  const year = Math.floor(Math.random() * (maxYear - minYear + 1)) + minYear;
  // Generate random month (0-11)
  const month = Math.floor(Math.random() * 12);
  // Get number of days in the generated month
  const daysInMonth = new Date(year, month + 1, 0).getDate();
  // Generate random day (1 to days in month)
  const day = Math.floor(Math.random() * daysInMonth) + 1;
  // Format components with leading zeros if needed
  const formattedYear = year.toString();
  const formattedMonth = (month + 1).toString().padStart(2, '0');
  const formattedDay = day.toString().padStart(2, '0');
  // Format as YYYY-MM-DD for ISO date format
  return `${formattedYear}-${formattedMonth}-${formattedDay}`;
}
"""
This ensures we get valid calendar dates, with birth years spread evenly across patients aged roughly 18 to 90.
And we can add this function to our transformers array:
...
TransformerMapping(
    expression=".patient_dob",
    transformer=TransformerConfig(
        generate_javascript_config=GenerateJavascript(code=custom_dob)
    )
),
...
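To sanity-check the date logic outside Neosync, the JavaScript generator above can be mirrored in plain Python. This is only a sketch for local testing: `generate_custom_dob`, the injected `rng`, and the fixed `today` value are assumptions made here so that the output is repeatable, and nothing in it calls the Neosync API.

```python
import calendar
import random
from datetime import date


def generate_custom_dob(rng, today, min_age=18, max_age=90):
    """Mirror of the JavaScript generator: pick a year, then a month, then a valid day."""
    min_year = today.year - max_age
    max_year = today.year - min_age
    year = rng.randint(min_year, max_year)  # inclusive on both ends
    month = rng.randint(1, 12)
    days_in_month = calendar.monthrange(year, month)[1]  # accounts for leap years
    day = rng.randint(1, days_in_month)
    return date(year, month, day).isoformat()  # YYYY-MM-DD


# Fixed seed and fixed "today" so repeated runs give the same sample
rng = random.Random(7)
dobs = [generate_custom_dob(rng, today=date(2025, 3, 31)) for _ in range(1000)]
```

Like the JavaScript version, this bounds only the birth year, so a patient born late in the most recent allowed year can still be 17 on the day the data is generated. If the age limits must be exact, compare full dates instead of years.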
The key part of the process is generating each record through Neosync's API and then saving the entire dataset as a CSV:
import json
from datetime import date

# Process the template through Neosync
response = client.anonymization.AnonymizeSingle(
    AnonymizeSingleRequest(
        input_data=json.dumps(template),
        transformer_mappings=transformers,
        account_id=ACCOUNT_ID
    )
)
# Parse the processed record
record = json.loads(response.output_data)
# The schema declares patient_dob as DateType, so convert the ISO string to a date
record["patient_dob"] = date.fromisoformat(record["patient_dob"])
# Add the processed record to our collection
all_records.append(record)
# Convert to DataFrame
df = spark.createDataFrame(all_records, schema=schema)
# Save to CSV
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("medical_claims_data.csv")
This produces a CSV file that looks identical to what your customers would send, but contains completely synthetic data.
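One practical detail: Spark's DataFrame writer treats the path passed to `.csv(...)` as a directory and writes one or more `part-*.csv` files inside it, even after `coalesce(1)`. If your pipeline expects a single file at a fixed path, a small helper can move the part file into place. The sketch below uses only the Python standard library; `collect_single_csv` is a hypothetical helper name, and it assumes the output lives on a local filesystem rather than object storage.

```python
import glob
import os
import shutil


def collect_single_csv(spark_output_dir, target_path):
    """Move the lone part-*.csv file out of a Spark output directory."""
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file in {spark_output_dir}, "
            f"found {len(part_files)}"
        )
    shutil.move(part_files[0], target_path)
    # Remove the now-empty output directory, including _SUCCESS and checksum files
    shutil.rmtree(spark_output_dir)
    return target_path
```

The target path should sit outside the Spark output directory, since that directory is deleted once the part file has been moved.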
This approach solves several real-world problems for medical claims processors:
While we've focused on CSV files, the same approach works for the other flat-file formats you might encounter in healthcare data interchange:
# Save as CSV
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("claims_data.csv")
# Save as pipe-delimited file
df.coalesce(1).write.mode("overwrite").option("header", "true").option("delimiter", "|").csv("claims_data.pipe")
# Save as fixed-width file
# ...
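Fixed-width layouts vary by trading partner, so there is no single correct writer. As one possible shape, the sketch below renders a record as a fixed-width line in plain Python. The `LAYOUT` column widths and the `to_fixed_width` helper are hypothetical and chosen only for illustration; substitute the widths from your own file specification.

```python
# Hypothetical layout: (column name, width in characters)
LAYOUT = [
    ("patient_name", 30),
    ("patient_dob", 10),
    ("provider_name", 30),
    ("charge_amount", 10),
]


def to_fixed_width(record, layout=LAYOUT):
    """Render one record as a fixed-width line, padding each field to its width."""
    parts = []
    for name, width in layout:
        value = record.get(name)
        text = "" if value is None else str(value)
        if len(text) > width:
            # Fail loudly instead of silently truncating a value
            raise ValueError(f"{name} value {text!r} is wider than {width} characters")
        if isinstance(value, (int, float)):
            parts.append(text.rjust(width))  # numbers are right-aligned
        else:
            parts.append(text.ljust(width))  # text is left-aligned
    return "".join(parts)
```

Raising on oversized values is a deliberate choice here: a truncated name or amount would still parse downstream and could hide a real formatting problem.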
Generate edge cases to test your validation logic:
# Add some invalid claim records with extreme values
invalid_transformers = [
    # Same as before but with extreme charge amounts
    TransformerMapping(
        expression=".charge_amount",
        transformer=TransformerConfig(
            generate_float64_config=GenerateFloat64(
                min=10000.0,  # Unusually high amount
                max=50000.0,
                precision=2
            )
        )
    ),
    # Other fields...
]
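On the consuming side, these extreme records are only useful if a validation rule is expected to reject them. The sketch below shows what such a check might look like. The `MIN_CHARGE` and `MAX_CHARGE` thresholds and the `validate_charge` function are hypothetical stand-ins for your own business rules.

```python
import math

# Hypothetical business rule: charges outside this range are rejected
MIN_CHARGE = 0.01
MAX_CHARGE = 5000.0


def validate_charge(record):
    """Return None if the charge is acceptable, else a short reason string."""
    amount = record.get("charge_amount")
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or not math.isfinite(amount):
        return "charge_amount is missing or not numeric"
    if amount < MIN_CHARGE:
        return "charge_amount is below the minimum"
    if amount > MAX_CHARGE:
        return "charge_amount exceeds the maximum"
    return None


# Records shaped like the normal and the extreme generator outputs
normal = {"charge_amount": 245.10}
extreme = {"charge_amount": 27500.00}
```

A test suite can then assert that `validate_charge(normal)` returns `None` while `validate_charge(extreme)` returns a rejection reason.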
By integrating this script into your CI/CD pipeline, you can automatically refresh test data for each test run:
# In your CI/CD pipeline
def refresh_test_data():
    # Generate different sizes of test files
    generate_medical_claims_data(100)  # Small test file
    generate_medical_claims_data(10000, filename="large_claims_test.csv")  # Large test file
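A CI job may also want to confirm that each refreshed file has the expected shape before the tests consume it. The following sketch does that with the standard library's `csv` module. `check_claims_file` and `EXPECTED_HEADER` are hypothetical names, and the header list assumes the columns from the schema earlier in this post.

```python
import csv

# Column order assumed to match the schema defined earlier in this post
EXPECTED_HEADER = [
    "patient_name", "patient_dob", "patient_ssn", "patient_email",
    "patient_phone", "patient_address", "provider_name", "charge_amount",
]


def check_claims_file(path, expected_rows, expected_header=EXPECTED_HEADER):
    """Raise ValueError if the CSV header or data row count is not as expected."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # None when the file is empty
        if header != expected_header:
            raise ValueError(f"unexpected header: {header}")
        row_count = sum(1 for _ in reader)
    if row_count != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {row_count}")
    return row_count
```

Failing the pipeline at this step keeps a truncated or malformed test file from producing confusing failures later in the run.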
For medical claims software companies, having realistic test data is critical for building reliable processing pipelines. Using Neosync's APIs to generate synthetic CSV files offers several advantages:
By implementing this approach, you can confidently test your claims processing pipelines with data that looks real but carries none of the compliance risks.
If you're looking to improve your medical claims testing process, give this approach a try. You can sign up for a free Neosync account at neosync.dev to get started, and check out our SDK docs.
Happy testing!
Nucleus Cloud Corp. 2025