How to sync your PlanetScale DB with anonymized production data

How to sync your PlanetScale DB with anonymized production data

Introduction

PlanetScale is a managed MYSQL database built on Vitess. It comes out of the box with branching, sharding, insights and more.

In this guide, we're going to walk through how you can sync your PlanetScale database branch with anonymized production data from a main branch using Neosync. This is really helpful to get representative data locally for better testing, debugging and developer experience.

Neosync is an open source data anonymization platform that can create anonymized or synthetic data and sync it across all of your PlanetScale environments/branches for better security, privacy and development.

If you haven't already done so, follow the How to seed your PlanetScale DB with Synthetic Data blog to get set up.

Let's jump in.

Prerequisites

We're going to need a PlanetScale account and a Neosync account. If you don't already have those, we can get those here:

Setting up PlanetScale

Now that we have our accounts, we can get this ball rolling. First, let's log into PlanetScale. If you already have a PlanetScale account then you can either create a new database or use an existing project. We're going to use the source-db from the How to seed your PlanetScale DB with Synthetic Data blog as our source.

In this case, this source-db is acting like our production branch or database that we want to anonymize and sync to a lower-level environment such as staging/qa/dev etc.

We'll need two separate databases or branches to show data syncing from a source to a destination. We're going to create another branch but the same process works across databases as well.

Now we can click on Create Branch in order to create our new branch.

feat1

Planetscale creates the branch quickly and you should see an empty branch page like this:

feat1

If you're using an existing branch that already has a schema then it will look slightly different.

Next, we'll need to connect to our database and define our database schema.

Connecting to PlanetScale

Click on Connect on the top right hand corner and then make sure the connection type is 'Primary' and click Create Password. When you do that, a username and password will be created and displayed.

sb-create-project

Scroll down to the language and framework section and select the last option, Other. Select the Optimized configuration and you'll see the Connection credentials section with your database params.

Connect to your database using your favorite client or the mysql CLI.

Once you're connected, here is the SQL script I ran to create our table:

 
CREATE TABLE users (
    id CHAR(36) PRIMARY KEY,
    first_name VARCHAR(255) NOT NULL,
    last_name VARCHAR(255) NOT NULL,
    email VARCHAR(255) NOT NULL,
    age INT NOT NULL
);
 

We can do a quick sanity check on our database to make sure our columns are there.

sb-created-tables

Since we used our source-db as our main, it will also contain data that we don't need. You can either truncate this data using:

truncate sourcedb.users

Or configure Neosync to truncate the data for you.

Nice! Let's move to setting up Neosync.

Setting up Neosync

Now that we're in Neosync, we'll want to first create a connection to our PlanetScale branch and then create a job to sync data. Let's get started.

Creating a Connection

Navigate over to Neosync and login. Once you're logged in, go to to Connections -> New Connection then click on Mysql.

neosync-connect-form

Let's use the Url tab to enter in our PlanetScale connection string to connect to our branch.

This is the format for the connection string mysql://<username>:<password>@host:port/<database>?tls=true. Use the credentials from the PlanetScale dashboard to construct your connection string.

Once you've completed filling out the form, you can click on Test Connection to test that you're connected. You should see this if it passes:

neosync-test

Let's click Submit to save the connection. You should now have two connections that you can use as a source and destination, like so.

neosync-test

Nice! Let's move onto the last set up step.

Creating a Job

In order to sync data, we need to create a Job in Neosync. Let's click on Job and then click on New Job. We're now presented with three options:

neosync-test

  • Data Synchronization - Synchronize and anonymize data between a source and destination.
  • Data Generation - Generate synthetic data from scratch for a chosen destination.
  • AI Data Generation - Generate synthetic data using a prompt

Since we're syncing data across databases or branches, we can select the Data Synchronization job and click Next.

Let's give our job a name and then set Initiate Job Run to Yes. We can leave the schedule and advanced options alone for now.

neosync-test

Click Next to move onto the Connect page. Here we want to select the connection we previously connected from the dropdown.

neosync-test

There are some other options here that can be useful such as Truncate before insert. I mentioned earlier that you can manually truncate data in your destination or have Neosync do it for you, and this is where you can configure Neosync to do it. We'll enable this and click Next.

Now for the fun part. Select the tables that you want to sync from source -> destination. Since we're only working with one table right now, We'll select the sourcedb.users table. If there are multiple tables that you want to sync, you can bring over as many tables as you'd like.

Next, we'll configure out transformers. We need to determine what kind of anonymization we want to do and/or what type of synthetic data we want to create and map that to our schema. Neosync has Transformers which are ways of creating synthetic data or anonymizing existing data. Click on the Transformer and then select the right Transformer that maps to the right column.

neosync-test

I've set the id column to passthrough which will allow us to validate our output data. The first_name column to generate a brand new first name, the last_name column to generate a new last name, the email column to transform the existing email address by generating a new username and preserving the domain, and lastly the age column to transform it by anonymizing the current value.

You can set the email transformer options by clicking on the pencil icon to the right of the transformer and then configuring your settings like this:

neosync-test

Now that we've configured everything, we can click on Next to go to the Subset page. We don't have to subset our data but if we did we can enter in a SQL filter to subset our data. Neosync will take care of the rest.

Now let's click Save to create our job and kick off our first run! We'll get routed to the Job page and see something like this:

neosync-test

Now we can head back over to Planetscale and check on our data. Let's check the users table. We'll want to first ensure that we generated 10,000 rows since that's how many were in our source (neosync-test) database and then check that the data was anonymized and generated correctly according to our transformers.

neosync-test

Great, count looks good. Next, let's check the data. Let's take an ID from our source and check it against the destination. Note: remember that relational databases don't guarantee order when selecting data unless you pass an ORDER BY clause so using a SELECT * FROM users query will be misleading. This is our source:

neosync-test

This is our destination:

neosync-test

We can see that we generated new first and last names, we anonymized the email address username but preserved the domain and anonymized and our age. Nice!

Conclusion

In this guide, we walked through how to anonymize sensitive data and generate synthetic data from one Planetscale database branch to another. You can also do it across databases as well! This is just a small test and you can expand this to anonymize millions or more rows of data across any relational database. Neosync handles all of the referential integrity. If you're working with sensitive data and want a better way to protect that data, then consider Neosync to take on the heavy lifting.


How Are Named Entity Recognition Models (NER) Used in Data Anonymization for LLMs?

How Are Named Entity Recognition Models (NER) Used in Data Anonymization for LLMs?

An overview of Named Entity Recognition (NER) models and how they're used in Data anonymization

October 14th, 2024

View Article
Join our Community
Have questions about Neosync? Come chat with us on Discord!
dev
NeosyncLogo
soc2
Nucleus Cloud Corp. 2024
Privacy Policy
Terms of Service