How to Build and Test Data Pipelines with Neosync and dbt

Evis Drenova

@evisdrenova

November 12th, 2024

Introduction

One of the biggest challenges that data engineering teams face is testing their data pipelines without exposing sensitive production data. Most teams either create mock data that doesn't represent production well, or they copy production data into lower environments, which creates security and privacy concerns. Neither approach is ideal.

This is where combining Neosync with dbt can create a powerful workflow that allows data engineers to build and test their pipelines with anonymized production-like data.

Let's dive into how this works.

The Problem with Testing Data Pipelines

Testing data pipelines is complex because you need representative data that matches your production schema and data distributions. If you're working with sensitive data like PII or financial records, you can't just copy that data into your development environment. But if you use mock data, you might miss edge cases that only appear with real-world data patterns.

Here's a common scenario:

1. Data engineer writes new dbt models locally
2. Tests pass locally with limited mock data
3. Models are deployed to production
4. Pipeline fails because production data has edge cases not covered in testing

Let's see how we can solve this using Neosync and dbt together.

Architecture Overview

[Architecture diagram]

In this architecture:

1. Neosync connects to your production data warehouse
2. It anonymizes sensitive data while maintaining referential integrity
3. The anonymized data is synced to a development/staging warehouse
4. dbt runs against the anonymized data for testing

This gives data engineers a safe way to test their transformations against production-like data without security risks.
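
To make "maintaining referential integrity" concrete, here's a minimal Python sketch of one common technique: deterministic pseudonymization, where the same input key always maps to the same replacement value, so foreign keys still join after anonymization. This is an illustration only, not how Neosync implements its transformers; the namespace value, helper names, and sample tables are all hypothetical.

```python
import uuid

# Hypothetical namespace for this sketch. Any fixed UUID works, but it must stay
# constant across tables and runs so the mapping is repeatable.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def pseudonymize_id(original_id: str) -> str:
    """Map an original key to a stable replacement UUID."""
    return str(uuid.uuid5(NAMESPACE, original_id))


def anonymize_rows(rows: list[dict], key_columns: list[str]) -> list[dict]:
    """Return copies of the rows with the given key columns replaced."""
    return [
        {
            col: pseudonymize_id(str(val)) if col in key_columns else val
            for col, val in row.items()
        }
        for row in rows
    ]


# Made-up sample tables: orders reference customers through customer_id.
customers = [
    {"customer_id": "c-1", "plan": "pro"},
    {"customer_id": "c-2", "plan": "free"},
]
orders = [
    {"order_id": 10, "customer_id": "c-1"},
    {"order_id": 11, "customer_id": "c-1"},
]

anon_customers = anonymize_rows(customers, ["customer_id"])
anon_orders = anonymize_rows(orders, ["customer_id"])

# Every anonymized order should still point at an anonymized customer.
customer_ids = {c["customer_id"] for c in anon_customers}
orphans = [o for o in anon_orders if o["customer_id"] not in customer_ids]
print(len(orphans))
```

Because `uuid5` is a plain hash of its input, a real setup would typically mix in a secret so that the mapping can't be rebuilt by hashing guessed IDs.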

Setting Up the Pipeline

Let's walk through how to set this up step by step.

Configure Neosync

First, create a Neosync job to sync and anonymize your production data. Here's what that might look like:

# Example Neosync job configuration
job:
  name: 'prod-to-dev-sync'
  source:
    connection: 'prod-warehouse'
  destination:
    connection: 'dev-warehouse'
transformers:
  - column: 'email'
    type: 'email_anonymize'
  - column: 'customer_id'
    type: 'generate_uuid'
  - column: 'address'
    type: 'address_anonymize'

Integrate with dbt

Next, update your dbt project to work with the anonymized data. Your profiles.yml might look something like:

my_dbt_project: # Profile name (assumed here); must match `profile` in dbt_project.yml
  target: dev # Default target when --target is not passed
  outputs:
    dev:
      type: snowflake
      account: your-account
      database: DEV_DB # Contains anonymized data
      schema: dbt_dev
    prod:
      type: snowflake
      account: your-account
      database: PROD_DB
      schema: dbt_prod

Build Testing Workflow

Now you can create a comprehensive testing workflow:

# 1. Sync anonymized data
neosync jobs trigger prod-to-dev-sync

# 2. Build the models in dev so the tests run against fresh tables
dbt run --target dev

# 3. Run dbt tests against anonymized data
dbt test --target dev

# 4. If tests pass, deploy to production
dbt run --target prod
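
Run as a plain script, these commands don't stop when a step fails, so the "if tests pass" gate has to be enforced somewhere. Here's a minimal Python sketch of one way to do that: run the steps in order and stop at the first non-zero exit code. The `run_gated` and `dry_runner` helpers are hypothetical names for this example, and the dev `dbt run` step is an assumption of the sketch so that tests run against freshly built models. The demo uses a stand-in runner, so nothing is actually executed.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_gated(steps: Sequence[Sequence[str]], runner: Runner = run_command) -> bool:
    """Run the steps in order and stop at the first non-zero exit code."""
    for cmd in steps:
        if runner(cmd) != 0:
            print(f"stopped: {' '.join(cmd)} failed")
            return False
    return True


STEPS = [
    ["neosync", "jobs", "trigger", "prod-to-dev-sync"],
    ["dbt", "run", "--target", "dev"],  # assumption: build dev models before testing
    ["dbt", "test", "--target", "dev"],
    ["dbt", "run", "--target", "prod"],
]

executed: list[str] = []


def dry_runner(cmd: Sequence[str]) -> int:
    """Stand-in runner for illustration: records the command, fails on `dbt test`."""
    executed.append(" ".join(cmd))
    return 1 if list(cmd[:2]) == ["dbt", "test"] else 0


deployed = run_gated(STEPS, runner=dry_runner)
print(deployed)
print(executed)
```

In the dry run the simulated test failure stops the sequence, so the production `dbt run` is never reached. Calling `run_gated(STEPS)` without a runner would execute the real commands.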

Best Practices

Here are some key practices we've seen teams implement successfully:

Consistent Scheduling - Schedule Neosync jobs to run nightly to ensure dev environments have fresh anonymized data
Version Control - Keep your Neosync transformer configurations in version control alongside your dbt models
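Data Validation - Use dbt tests (for example not_null, unique, accepted_values, and relationships) to verify that anonymized data keeps the properties your models depend on in production, such as uniqueness, null rates, and valid references between tables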
Pipeline Testing - Test entire pipelines end-to-end with anonymized data before deploying to production
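
For the Data Validation practice, here's a rough Python sketch of what comparing statistical properties could look like. It assumes you can export aggregate column profiles (never raw rows) from production and compare them against the same profile computed on the anonymized copy. The `profile` and `drifted` helpers, the 5% tolerance, and the sample values are all hypothetical.

```python
def profile(values: list) -> dict:
    """Summarize a column: row count, null rate, and distinct ratio."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    return {
        "rows": total,
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "distinct_ratio": len(set(non_null)) / len(non_null) if non_null else 0.0,
    }


def drifted(prod: dict, anon: dict, tolerance: float = 0.05) -> list[str]:
    """Return the names of metrics that differ by more than the tolerance."""
    return [
        name
        for name in ("null_rate", "distinct_ratio")
        if abs(prod[name] - anon[name]) > tolerance
    ]


# Made-up sample columns: same shape, different values.
prod_emails = ["a@x.com", "b@x.com", None, "a@x.com"]
anon_emails = ["u1@example.com", "u2@example.com", None, "u1@example.com"]

# An anonymized column that lost a value and collapsed its duplicates.
bad_anon_emails = ["u1@example.com", None, None, "u1@example.com"]

print(drifted(profile(prod_emails), profile(anon_emails)))
print(drifted(profile(prod_emails), profile(bad_anon_emails)))
```

The first comparison reports no drift because the anonymized column keeps the same null rate and duplicate pattern. The second flags both metrics.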

Benefits

This approach has several benefits:

Better Testing - Test against realistic data distributions and edge cases
Improved Security - No sensitive data in development environments
Faster Development - Developers can quickly test changes against production-like data
Reduced Risk - Catch issues before they hit production

Wrapping up

Combining Neosync and dbt creates a powerful workflow for testing data pipelines with anonymized production data. This approach gives data engineers the confidence that their transformations will work in production while maintaining data privacy and security.

If you're interested in trying this workflow, you can get started with Neosync today. We'd love to hear how you're using it with your dbt workflows!
