Introducing Free-Form Text Anonymization for AI and Machine Learning Workflows
Use Neosync to detect and redact PII in free-form text such as LLM prompts and other workflows
December 13th, 2024
We often see customers using Neosync to anonymize and orchestrate data across their databases. For example, taking production data, anonymizing it and syncing it down to staging and development. Lately, we've started to see more companies use Neosync to test their data warehouses as well. As data warehouses grow in complexity and scale, testing them effectively becomes increasingly challenging.
Whether you're using Snowflake, BigQuery, Redshift or another DWH solution, you need a reliable way to validate your transformations and data quality. In this blog, we'll walk through how to use Neosync to effectively test your data warehouse.
Data warehouse testing presents several unique challenges:
Traditional data testing tools often fall short in addressing these challenges. While tools like dbt data tests are great for testing transformations, they don't solve the fundamental problem of getting high-quality test data into your warehouse in the first place.
Neosync can help solve these challenges in a few different ways:
Instead of copying production data, you can use Neosync as a test data generation tool to create synthetic data that matches your production schema:
-- Example schema in your warehouse
CREATE TABLE users (
id INT,
email VARCHAR,
first_name VARCHAR,
last_name VARCHAR,
created_at TIMESTAMP
);
CREATE TABLE orders (
id INT,
user_id INT,
amount DECIMAL,
status VARCHAR
);
Neosync can generate millions or even billions of rows of realistic-looking data while maintaining referential integrity between tables. This is essential for big data testing scenarios where you need to validate warehouse performance at scale.
For cases where you do want to use production data, Neosync's subsetting capabilities let you extract a representative sample while maintaining referential integrity:
-- Example subsetting in Neosync
SELECT * FROM users
WHERE created_at >= '2024-01-01'
This gives you a smaller, manageable dataset for data validation testing while ensuring all related records come along for the ride.
When working with sensitive production data, Neosync can anonymize it before it reaches your test warehouse:
-- Example anonymization mapping
users.email -> generate_email()
users.first_name -> generate_first_name()
users.last_name -> generate_last_name()
This lets you maintain data realism for testing while complying with security requirements.
Here's a practical example of how to integrate Neosync into your DWH testing workflow:
You can automate this entire workflow using Neosync's APIs or Terraform provider.
When using Neosync as a TDM tool for warehouse testing, keep these tips in mind:
Effective data warehouse testing requires both the right tools and the right approach. Neosync provides the capabilities needed to generate, manage and sync test data at scale while maintaining data quality and security. By combining Neosync with other data testing tools in your stack, you can build a comprehensive DWH testing strategy that grows with your needs.
Whether you're doing basic data validation testing or complex big data testing scenarios, having reliable test data is crucial. Neosync helps solve this fundamental challenge while providing the flexibility to adapt to different testing requirements.
Want to learn more? Check out our documentation or give Neosync a try with our free tier.
Use Neosync to detect and redact PII in free-form text such as LLM prompts and other workflows
December 13th, 2024
You can now run pre and post job hooks to further customize your workflows
December 9th, 2024
Nucleus Cloud Corp. 2024