Securing Sensitive Data for AI Agents
A guide on how to protect your sensitive data when using AI agents
January 9th, 2025
Neosync is a data anonymization and synthetic data platform that customers use to anonymize sensitive production data and sync it across environments. One of the most common questions we get from customers is, "How can we move data across our cloud accounts, since our production and staging systems are in different accounts?" It makes sense, given that most customers segregate their environments into different AWS/GCP/Azure accounts. So I wanted to take a minute and talk through how we solve this using Neosync.
In this blog we'll walk through a best-practice architecture for securely moving data across AWS accounts.
First, let's expand on our requirements. Generally, when we talk to customers about this use-case, the same requirements come up: developers shouldn't be given extra privileges to access production infrastructure, sensitive production data must be anonymized before it reaches lower-level environments, and inter-network traffic between environments should be minimized.
These are all reasonable requirements that most mature, cloud-native companies will have. And as you're about to see, using this architecture we can meet all of these requirements.
Let's diagram out the traditional customer cloud environment using segregated accounts. We're going to use AWS in this example but you can pretty much substitute it with any other cloud provider.
In this diagram, we have two environments, each in its own VPC. In each environment, we have a database that is in a private subnet and then a bunch of other resources. Since we really only care about the database in this blog, I've left out everything else for brevity.
This is pretty standard and generally what most segregated environments look like.
Now, back to our original question. How do we securely move data across environments? Let's build on this diagram.
The first thing we'll need is another AWS account that we can use as a shared account. In this account we'll deploy S3 and use it as our staging ground. Let's update our diagram.
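To make the shared account idea concrete, here is a minimal sketch of a bucket policy that lets the production account write to the staging bucket and the staging account read from it. The account IDs, bucket name and statement names are hypothetical, and in a real setup you would most likely scope the principals down to specific IAM roles rather than whole accounts.

```python
import json

# Hypothetical account IDs and bucket name, used only for illustration.
PROD_ACCOUNT_ID = "111111111111"
STAGE_ACCOUNT_ID = "222222222222"
BUCKET = "example-anonymized-data"


def build_bucket_policy(bucket: str, writer_account: str, reader_account: str) -> dict:
    """Build a bucket policy where one account may write objects and another may read them."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Production only needs to put anonymized objects into the bucket.
                "Sid": "AllowProdWrite",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{writer_account}:root"},
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # Staging only needs to read the objects back out.
                "Sid": "AllowStageRead",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{reader_account}:root"},
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # Listing is a bucket-level action, so it targets the bucket ARN itself.
                "Sid": "AllowStageList",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{reader_account}:root"},
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


policy = build_bucket_policy(BUCKET, PROD_ACCOUNT_ID, STAGE_ACCOUNT_ID)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Notice that neither account gets any access to the other account's VPC or database here. The only thing they share is the bucket.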
Why are we using S3? Why don't we just use a couple of bastion hosts to tunnel into each environment? Ah good question.
First, S3 is cheaper than using an EC2 instance as a bastion host and, depending on the amount of data you're moving, may have cheaper data transfer costs if you're moving data across regions. Second, configuring, managing and scaling bastion hosts is a pain. You're likely not going to want those running all of the time, so you'll want to spin them up, use them and bring them back down. That means you'll need to terraform the entire solution, which isn't the biggest deal but it's still a pain.
Third, if you use S3 as a staging ground, you can have different syncing schedules from production -> stage. That means you can sync from production to S3 once a month, and folks can then pull from S3 as many times as they want since the data will still be there (if you don't delete it). A developer can do what they need with the data, change it and mess it up, and then just blow away their environment and re-sync a fresh copy without having to trigger the entire pipeline again.
Fourth, while we haven't talked about Neosync yet, another reason why S3 is great is that you limit the direct connections across environments through Neosync. We'll talk about this more in a minute.
Let's add in some data flow lines so we can see how data is going to flow across the system.
Now that we have our data flow diagrammed out, let's update our diagram to include Neosync and talk through what's happening.
We've deployed Neosync and attached it to our private subnet in both environments. From a deployment perspective, you can deploy Neosync into Kubernetes or on an EC2 machine. You'll just need to set up the networking.
Let's start from our production environment. The entire Neosync product is deployed here including the Frontend, Backend, Worker, Orchestrator and more. This is where developers create their jobs and define the schema transformations. Here, Neosync talks to the production database and retrieves, anonymizes and streams the anonymized data to S3.
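To give a feel for the "retrieve, anonymize, stream" step, here is a small illustrative sketch. This is not how Neosync implements its transformers. The column names, the salt and the helper functions are all assumptions made up for this example, and it produces newline-delimited JSON in memory where a real job would write to S3.

```python
import hashlib
import json


def anonymize_email(email: str, salt: str) -> str:
    """Replace an email with a stable pseudonym so joins on email still line up."""
    digest = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()
    return f"user_{digest[:12]}@example.com"


def anonymize_row(row: dict, salt: str) -> dict:
    """Return a copy of the row with the sensitive columns replaced."""
    out = dict(row)
    out["email"] = anonymize_email(row["email"], salt)
    out["full_name"] = "REDACTED"
    return out


def to_json_lines(rows: list, salt: str) -> str:
    """Serialize anonymized rows as newline-delimited JSON, ready to be staged."""
    return "\n".join(json.dumps(anonymize_row(r, salt), sort_keys=True) for r in rows)


# Hypothetical production rows.
rows = [
    {"id": 1, "email": "Ada@Corp.test", "full_name": "Ada Lovelace", "plan": "pro"},
    {"id": 2, "email": "alan@corp.test", "full_name": "Alan Turing", "plan": "free"},
]

staged = to_json_lines(rows, salt="demo-salt")
print(staged)
```

The important property is that the raw values never leave the production side. Only the anonymized output is written to the staging ground.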
Now S3 has anonymized production data that any other environment that can talk to S3 can retrieve.
Let's move to our staging environment.
The unique thing here is that only the CLI and Neosync API server are deployed in staging. We don't need to deploy all of the resources because we're just using Neosync as a way to retrieve the data from S3 and stream it to another database. Which is exactly what's happening here.
The Neosync CLI retrieves configurations from the Neosync API Server and then streams the anonymized data from S3 directly to a staging server.
This allows us to fulfill our requirements. No developers were given extra privileges. Sensitive production data was anonymized before it went to lower-level environments. Inter-network traffic was minimized by using S3 as a staging ground instead of bastion hosts.
And we get the added benefit of being able to hydrate other databases (development, local, CI) from S3 without having to run a production sync again, since the data is already in S3!
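Here is a rough sketch of that hydration pattern, assuming the staged snapshot is newline-delimited JSON. It uses in-memory SQLite databases as stand-ins for the staging and CI databases, and the table, columns and snapshot contents are hypothetical. It only illustrates the idea and is not the Neosync CLI.

```python
import json
import sqlite3

# Hypothetical staged snapshot: one already-anonymized JSON object per line.
STAGED_SNAPSHOT = "\n".join([
    json.dumps({"id": 1, "email": "user_a@example.com", "plan": "pro"}),
    json.dumps({"id": 2, "email": "user_b@example.com", "plan": "free"}),
])


def hydrate(conn: sqlite3.Connection, snapshot: str) -> int:
    """Recreate the users table from the snapshot and return the number of rows loaded."""
    conn.execute("DROP TABLE IF EXISTS users")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, plan TEXT)")
    rows = [json.loads(line) for line in snapshot.splitlines() if line.strip()]
    conn.executemany(
        "INSERT INTO users (id, email, plan) VALUES (:id, :email, :plan)", rows
    )
    conn.commit()
    return len(rows)


# Two independent destinations fed from the same staged snapshot.
staging = sqlite3.connect(":memory:")
ci = sqlite3.connect(":memory:")
staging_count = hydrate(staging, STAGED_SNAPSHOT)
ci_count = hydrate(ci, STAGED_SNAPSHOT)

# A developer can wreck their copy and then reset it from the same snapshot,
# without anything running against production again.
staging.execute("DELETE FROM users")
staging_count_after_reset = hydrate(staging, STAGED_SNAPSHOT)
```

Each destination reads only from the snapshot, so resetting one environment has no effect on production or on any other environment.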
In this blog, we looked at how to securely move data across AWS accounts using Neosync and S3. This is a common architecture that we see customers adopt because it's fast, reliable and secure. Using S3 as our staging ground, we can flexibly sync data down to lower level environments without giving developers extra permissions to access production infrastructure.
Nucleus Cloud Corp. 2025