Securing Sensitive Data for AI Agents
A guide on how to protect your sensitive data when using AI agents
January 9th, 2025
We spend a lot of time talking to engineering teams at companies that have sensitive data. Whether it's sensitive consumer data such as PII or sensitive business data such as financial records, engineering teams are constantly trying to strike a balance between protecting their data and enabling their developers with a great local developer experience.
For some teams in highly regulated industries such as financial services, insurance and healthcare, the need for data security usually wins over the need for a great local developer experience. But it doesn't have to be this way. In this blog post, we're going to explore how teams in any industry, especially highly regulated ones, can have both strong data security and a great local developer experience.
First, let's define what a great developer experience looks like. I think there are three main parts:
Arguably the most important part of a great developer experience is having access to the tools and resources you need to develop locally. This means having access to your own isolated database that other people aren't constantly changing, your own resources and infrastructure so that you're not competing with other developers for compute/storage, and your own copy of data that you can manipulate and work with.
As cloud infrastructure becomes cheaper and easier to manage with tools such as Pulumi and Terraform, a lot of teams are giving developers their own isolated environments. And depending on the tooling that you need, you might be able to have an entire stack that is local on your machine as opposed to something that is hosted in a big cloud provider. Tools like LocalStack are a great way to get started with creating local developer environments including resources that you would typically find in the cloud such as S3 buckets and more.
But the point remains that having access to all of the tooling and resources you need is key for a great developer experience.
The best local developer experiences we've seen can be set up with a single command. With this level of automation, setting up and tearing down local environments becomes easy and fast. At Neosync, we use Docker Compose scripts to create all of our local infrastructure and can set it up using a single command: docker compose -f compose.dev.yml watch
It's so important that your local infrastructure 'looks like production'. One of the biggest causes of bugs is when you're building something that works locally but doesn't work in production. This causes a lot of confusion and frustration because you don't know if it's an infrastructure problem, a data problem or a code problem. If you can get your local developer experience to replicate your production environment as closely as possible, it's a huge step forward towards a great developer experience.
There are two areas to focus on here. The first is the data. You have to make sure that your local developer data resembles your production data; otherwise, data errors will crop up and it will be difficult to understand where they're coming from. You can use a tool like Neosync to anonymize your production data so that you can safely use it locally. The second is the infrastructure. While you may not have the same level of compute and storage locally, you at least need the same tools (back to our first point).
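To make the idea of anonymized but production-like data concrete, here is a minimal Python sketch. It is not how Neosync works internally; the function names, the salt, and the sample row are all hypothetical. It only illustrates one common approach: replace sensitive values with stand-ins that keep the same shape, so code that parses emails or phone numbers still behaves locally as it would in production.

```python
import hashlib


def pseudonymize_email(email: str, salt: str = "local-dev-salt") -> str:
    """Replace an email with a stable fake that keeps the user@domain shape.

    The same input always maps to the same output, so joins and
    uniqueness constraints keep working. The salt is a hypothetical value.
    """
    digest = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()
    return f"user_{digest[:10]}@example.com"


def mask_digits(value: str, keep_last: int = 4) -> str:
    """Replace all but the last `keep_last` digits with '0', keeping separators."""
    total_digits = sum(ch.isdigit() for ch in value)
    to_mask = max(total_digits - keep_last, 0)
    out = []
    for ch in value:
        if ch.isdigit() and to_mask > 0:
            out.append("0")
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


def anonymize_row(row: dict) -> dict:
    """Return a copy of the row with sensitive fields replaced."""
    clean = dict(row)
    clean["email"] = pseudonymize_email(row["email"])
    clean["phone"] = mask_digits(row["phone"])
    return clean


# Hypothetical production row; non-sensitive fields pass through unchanged.
row = {"id": 7, "email": "Jane.Doe@corp.com", "phone": "415-555-0199", "plan": "pro"}
anon = anonymize_row(row)
print(anon)
```

Keeping the output deterministic is a design choice: the same customer maps to the same fake email on every sync, which makes bugs reproducible across refreshes of the local database.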
We really don't have to make much of a case for data security. Protecting your data and infrastructure is critically important for every business, regardless of industry.
But the main question is: how do we balance these two things, developer experience and data security, that are seemingly at odds with each other?
First, it's important to define the data that developers should have access to and the format of that data. This is usually informed by your security, privacy and compliance teams and requires that you scan through databases, data warehouses, data lakes and other data stores and catalogue the data as sensitive or insensitive.
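As a rough illustration of the cataloguing step, the Python sketch below labels columns as sensitive or insensitive based on their names. The patterns, labels, and schema are assumptions made up for this example; a real catalogue would be driven by your security, privacy and compliance teams and would usually inspect the data itself, not just column names.

```python
import re

# Hypothetical name-based rules; real rules come from your compliance policies.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"email", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "government_id": re.compile(r"ssn|passport|tax_id", re.IGNORECASE),
    "financial": re.compile(r"card|iban|account_number|salary", re.IGNORECASE),
    "name": re.compile(r"(first|last|full)_?name", re.IGNORECASE),
}


def classify_column(column_name: str) -> str:
    """Return the first matching sensitivity label, or 'insensitive'."""
    for label, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(column_name):
            return label
    return "insensitive"


def catalogue(schema: dict) -> dict:
    """Map every table and column in the schema to a sensitivity label."""
    return {
        table: {column: classify_column(column) for column in columns}
        for table, columns in schema.items()
    }


# Hypothetical schema: table name -> list of column names.
schema = {
    "users": ["id", "first_name", "email", "created_at"],
    "payments": ["id", "user_id", "card_number", "amount"],
}
report = catalogue(schema)
for table, columns in report.items():
    print(table, columns)
```

Name-based matching is only a first pass. It will miss sensitive values stored in generically named columns (for example a free-text notes field), so treat the output as a starting point for review.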
Once you have a good idea of the type of data you have, you can start to implement technology that gives developers access to the data they are allowed to see. Ideally, this is self-service, so that developers can get the data they need quickly without going through a data request process. Whether it's a portal where developers can 'check out' data or an automated pipeline that hydrates a database or data warehouse, self-service access matters for both developer experience and efficiency.
Let's look at an example. How do we secure our production data while still giving developers production-like data? One process is that no one has direct access to the production database. We can then use technology to sync and anonymize data from production into a developer's local database. Now we have a good balance between data security (no one has access to production data) and developer experience (developers have access to realistic test data to build and test against).
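The sync-and-anonymize flow in this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Neosync's implementation: two in-memory SQLite databases stand in for the production and local databases, and the table layout, the fake_email helper, and the sync_users function are all hypothetical.

```python
import hashlib
import sqlite3


def fake_email(email: str) -> str:
    """Deterministically replace a real email with a stand-in address."""
    digest = hashlib.sha256(email.encode("utf-8")).hexdigest()[:10]
    return f"user_{digest}@example.com"


def sync_users(source: sqlite3.Connection, dest: sqlite3.Connection) -> int:
    """Copy users from source to dest, anonymizing name and email in transit."""
    dest.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(id INTEGER PRIMARY KEY, name TEXT, email TEXT, plan TEXT)"
    )
    # The real name is never selected, so it cannot leak into the local copy.
    rows = source.execute("SELECT id, email, plan FROM users ORDER BY id").fetchall()
    anonymized = [
        (user_id, f"User {user_id}", fake_email(email), plan)
        for user_id, email, plan in rows
    ]
    dest.executemany(
        "INSERT OR REPLACE INTO users (id, name, email, plan) VALUES (?, ?, ?, ?)",
        anonymized,
    )
    dest.commit()
    return len(anonymized)


# Stand-in for production: an in-memory database with hypothetical rows.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, plan TEXT)")
prod.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [(1, "Jane Doe", "jane@corp.com", "pro"), (2, "Raj Patel", "raj@corp.com", "free")],
)
prod.commit()

# Stand-in for a developer's local database.
local = sqlite3.connect(":memory:")
copied = sync_users(prod, local)
local_rows = local.execute("SELECT id, name, email, plan FROM users ORDER BY id").fetchall()
print(copied, local_rows)
```

In practice the job would run with a service credential that developers never hold, so the only data they can reach is the anonymized copy.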
Depending on your own data security requirements and policies, it's worth the time to understand what developers need and how those requirements map to your policies. Once you've done that, you can implement technology to automate the rest of the process.
It's entirely possible to create a great local developer experience without sacrificing data security. The right approach combines process and technology to protect your most sensitive assets (data and infrastructure) while giving your developers access to the tools they need, an easy way to set up and automate their environments, and production-like data. Teams have to be willing to do the hard work of understanding the trade-offs and searching for solutions that fit their data security posture but also work for developers.
Nucleus Cloud Corp. 2025