How to Build and Test Data Pipelines with Neosync and dbt

Introduction

One of the biggest challenges that data engineering teams face is testing their data pipelines without exposing sensitive production data. Most teams either create mock data that represents production poorly, or copy production data into lower environments, which creates security and privacy concerns. Neither approach is ideal.

This is where combining Neosync with dbt can create a powerful workflow that allows data engineers to build and test their pipelines with anonymized production-like data.

Let's dive into how this works.

The Problem with Testing Data Pipelines

Testing data pipelines is complex because you need representative data that matches your production schema and data distributions. If you're working with sensitive data like PII or financial records, you can't just copy that data into your development environment. But if you use mock data, you might miss edge cases that only appear with real-world data patterns.

Here's a common scenario:

  1. Data engineer writes new dbt models locally
  2. Tests pass locally with limited mock data
  3. Models are deployed to production
  4. Pipeline fails because production data has edge cases not covered in testing

Let's see how we can solve this using Neosync and dbt together.

Architecture Overview

[Architecture diagram]

In this architecture:

  1. Neosync connects to your production data warehouse
  2. It anonymizes sensitive data while maintaining referential integrity
  3. The anonymized data is synced to a development/staging warehouse
  4. dbt runs against the anonymized data for testing

This gives data engineers a safe way to test their transformations against production-like data without exposing sensitive values in lower environments.

Setting Up the Pipeline

Let's walk through how to set this up step by step.

  1. Configure Neosync

First, create a Neosync job to sync and anonymize your production data. Here's what that might look like:

# Example Neosync job configuration
job:
  name: 'prod-to-dev-sync'
  source:
    connection: 'prod-warehouse'
  destination:
    connection: 'dev-warehouse'
transformers:
  - column: 'email'
    type: 'email_anonymize'
  - column: 'customer_id'
    type: 'generate_uuid'
  - column: 'address'
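    type: 'address_anonymize'

  2. Integrate with dbt

Next, update your dbt project to work with the anonymized data. The commands in the next step switch environments with --target, so both targets need to live under a single profile. Your profiles.yml might look something like:

# 'my_dbt_project' is a placeholder; it must match the `profile:` value in dbt_project.yml
my_dbt_project:
  target: dev # Default target when --target is not passed
  outputs:
    dev:
      type: snowflake
      account: your-account
      database: DEV_DB # Contains anonymized data
      schema: dbt_dev
    prod:
      type: snowflake
      account: your-account
      database: PROD_DB
      schema: dbt_prod

  3. Build Testing Workflow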

Now you can create a comprehensive testing workflow:

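# Stop at the first failing command so a failed test blocks the deploy
set -e

# 1. Sync anonymized data
neosync jobs trigger prod-to-dev-sync

# 2. Build the models and run dbt tests against anonymized data
dbt build --target dev

# 3. If tests pass, deploy to production
dbt run --target prod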

Best Practices

Here are some key practices we've seen teams implement successfully:

  1. Consistent Scheduling - Schedule Neosync jobs to run nightly to ensure dev environments have fresh anonymized data
  2. Version Control - Keep your Neosync transformer configurations in version control alongside your dbt models
  3. Data Validation - Use dbt tests to verify that anonymized data preserves the properties of production data that your models depend on, such as unique keys, non-null columns, and valid foreign keys
  4. Pipeline Testing - Test entire pipelines end-to-end with anonymized data before deploying to production
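
As a rough illustration of the Data Validation practice above, here is a minimal sketch of a dbt schema file that uses dbt's built-in generic tests. The file path, the model names (customers, orders), and the assumption that orders carries a customer_id foreign key are hypothetical; only the email and customer_id column names come from the transformer example earlier. Newer dbt versions also accept `data_tests:` in place of `tests:`.

```yaml
# models/schema.yml (hypothetical path and model names)
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          # Generated UUIDs should still be usable as a primary key
          - unique
          - not_null
      - name: email
        tests:
          - not_null

  - name: orders
    columns:
      - name: customer_id
        tests:
          - not_null
          # Every order should still point at an existing customer
          # after anonymization (referential integrity check)
          - relationships:
              to: ref('customers')
              field: customer_id
```

If the relationships test fails in dev after a sync, that is a hint that a key column was anonymized inconsistently across tables, which is worth catching before any model logic is blamed.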

Benefits

This approach has several benefits:

  1. Better Testing - Test against realistic data distributions and edge cases
  2. Improved Security - No sensitive data in development environments
  3. Faster Development - Developers can quickly test changes against production-like data
  4. Reduced Risk - Catch issues before they hit production

Wrapping up

Combining Neosync and dbt creates a powerful workflow for testing data pipelines with anonymized production data. This approach gives data engineers the confidence that their transformations will work in production while maintaining data privacy and security.

If you're interested in trying this workflow, you can get started with Neosync today. We'd love to hear how you're using it with your dbt workflows!

