Introducing Free-Form Text Anonymization for AI and Machine Learning Workflows
Use Neosync to detect and redact PII in free-form text such as LLM prompts and other workflows
December 13th, 2024
PlanetScale is a managed MYSQL database built on Vitess. It comes out of the box with branching, sharding, insights and more.
In this guide, we're going to walk through how you can sync your PlanetScale database branch with anonymized production data from a main branch using Neosync. This is really helpful to get representative data locally for better testing, debugging and developer experience.
Neosync is an open source data anonymization platform that can create anonymized or synthetic data and sync it across all of your PlanetScale environments/branches for better security, privacy and development.
If you haven't already done so, follow the How to seed your PlanetScale DB with Synthetic Data blog to get set up.
Let's jump in.
We're going to need a PlanetScale account and a Neosync account. If you don't already have those, we can get those here:
Now that we have our accounts, we can get this ball rolling. First, let's log into PlanetScale. If you already have a PlanetScale account then you can either create a new database or use an existing project. We're going to use the source-db
from the How to seed your PlanetScale DB with Synthetic Data blog as our source.
In this case, this source-db
is acting like our production branch or database that we want to anonymize and sync to a lower-level environment such as staging/qa/dev etc.
We'll need two separate databases or branches to show data syncing from a source to a destination. We're going to create another branch but the same process works across databases as well.
Now we can click on Create Branch in order to create our new branch.
Planetscale creates the branch quickly and you should see an empty branch page like this:
If you're using an existing branch that already has a schema then it will look slightly different.
Next, we'll need to connect to our database and define our database schema.
Click on Connect on the top right hand corner and then make sure the connection type is 'Primary' and click Create Password. When you do that, a username and password will be created and displayed.
Scroll down to the language and framework section and select the last option, Other. Select the Optimized configuration and you'll see the Connection credentials section with your database params.
Connect to your database using your favorite client or the mysql CLI.
Once you're connected, here is the SQL script I ran to create our table:
CREATE TABLE users (
id CHAR(36) PRIMARY KEY,
first_name VARCHAR(255) NOT NULL,
last_name VARCHAR(255) NOT NULL,
email VARCHAR(255) NOT NULL,
age INT NOT NULL
);
We can do a quick sanity check on our database to make sure our columns are there.
Since we used our source-db
as our main, it will also contain data that we don't need. You can either truncate this data using:
truncate sourcedb.users
Or configure Neosync to truncate the data for you.
Nice! Let's move to setting up Neosync.
Now that we're in Neosync, we'll want to first create a connection to our PlanetScale branch and then create a job to sync data. Let's get started.
Navigate over to Neosync and login. Once you're logged in, go to to Connections -> New Connection then click on Mysql.
Let's use the Url tab to enter in our PlanetScale connection string to connect to our branch.
This is the format for the connection string mysql://<username>:<password>@host:port/<database>?tls=true
. Use the credentials from the PlanetScale dashboard to construct your connection string.
Once you've completed filling out the form, you can click on Test Connection to test that you're connected. You should see this if it passes:
Let's click Submit to save the connection. You should now have two connections that you can use as a source and destination, like so.
Nice! Let's move onto the last set up step.
In order to sync data, we need to create a Job in Neosync. Let's click on Job and then click on New Job. We're now presented with three options:
Since we're syncing data across databases or branches, we can select the Data Synchronization job and click Next.
Let's give our job a name and then set Initiate Job Run to Yes. We can leave the schedule and advanced options alone for now.
Click Next to move onto the Connect page. Here we want to select the connection we previously connected from the dropdown.
There are some other options here that can be useful such as Truncate before insert. I mentioned earlier that you can manually truncate data in your destination or have Neosync do it for you, and this is where you can configure Neosync to do it. We'll enable this and click Next.
Now for the fun part. Select the tables that you want to sync from source -> destination. Since we're only working with one table right now, We'll select the sourcedb.users
table. If there are multiple tables that you want to sync, you can bring over as many tables as you'd like.
Next, we'll configure out transformers. We need to determine what kind of anonymization we want to do and/or what type of synthetic data we want to create and map that to our schema. Neosync has Transformers which are ways of creating synthetic data or anonymizing existing data. Click on the Transformer and then select the right Transformer that maps to the right column.
I've set the id
column to passthrough which will allow us to validate our output data. The first_name
column to generate a brand new first name, the last_name
column to generate a new last name, the email
column to transform the existing email address by generating a new username and preserving the domain, and lastly the age
column to transform it by anonymizing the current value.
You can set the email transformer options by clicking on the pencil icon to the right of the transformer and then configuring your settings like this:
Now that we've configured everything, we can click on Next to go to the Subset page. We don't have to subset our data but if we did we can enter in a SQL filter to subset our data. Neosync will take care of the rest.
Now let's click Save to create our job and kick off our first run! We'll get routed to the Job page and see something like this:
Now we can head back over to Planetscale and check on our data. Let's check the users
table. We'll want to first ensure that we generated 10,000 rows since that's how many were in our source (neosync-test) database and then check that the data was anonymized and generated correctly according to our transformers.
Great, count looks good. Next, let's check the data. Let's take an ID from our source and check it against the destination. Note: remember that relational databases don't guarantee order when selecting data unless you pass an ORDER BY
clause so using a SELECT * FROM users
query will be misleading. This is our source:
This is our destination:
We can see that we generated new first and last names, we anonymized the email address username but preserved the domain and anonymized and our age. Nice!
In this guide, we walked through how to anonymize sensitive data and generate synthetic data from one Planetscale database branch to another. You can also do it across databases as well! This is just a small test and you can expand this to anonymize millions or more rows of data across any relational database. Neosync handles all of the referential integrity. If you're working with sensitive data and want a better way to protect that data, then consider Neosync to take on the heavy lifting.
Use Neosync to detect and redact PII in free-form text such as LLM prompts and other workflows
December 13th, 2024
A guide on how to test your data warehouse using Neosync
December 11th, 2024
Nucleus Cloud Corp. 2024