Introducing Free-Form Text Anonymization for AI and Machine Learning Workflows
Use Neosync to detect and redact PII in free-form text such as LLM prompts and other workflows
December 13th, 2024
At Neosync, we work with many customers in the EU who have sensitive data and need to comply with GDPR. One of their biggest problems is figuring out how to handle sensitive data while still keeping it usable for machine learning.
This is an interesting problem: we need a way to anonymize sensitive data so that it no longer counts as Personally Identifiable Information (PII) under GDPR's definition, while the data remains usable for things like machine learning.
In this blog, we're going to look at two solutions to this problem. We'll discuss the trade-offs and use-cases for both and hopefully build a framework that you can use to guide your decision-making.
Let's jump in.
Let's start by clearly defining the problem statement, using an example.
Let's imagine that we're a company based in France that records and transcribes customer interviews for medical insurance companies. That is, medical insurance companies use our software to record and transcribe the calls they have with customers. On these calls, customers share a lot of sensitive information about themselves, such as their prescriptions, medical conditions, and other PII. We would like to use that data to fine-tune and train our machine learning models so that we can improve our transcription capabilities.
Since we're in the EU, customers have the right to ask us to delete their data. However, if we've used their data, including sensitive data, to train our models, then we would need to scrub their data from our data set and retrain our models. This is expensive and time-consuming.
Our problem statement is as follows: we need a way to handle sensitive customer data such that if we use it to train our models and a customer asks us to delete their data, we don't have to retrain our models.
In an ideal world, we could run a single job that deletes one or more records from the production database, and potentially any production backups, and the data subject access request to delete that customer's data would be fulfilled. Ideally, we wouldn't have to worry about touching any development or staging environments or retraining any models.
In our example transcription company, we're working with two types of data: structured and unstructured. Structured data, such as customer data collected from an online form, is stored in a well-defined schema, while unstructured data, such as transcripts, is stored as long strings of free-form text. So if we only want to delete data from our production database, we need a way to isolate the sensitive data to our production environment without proliferating it downstream to any development or staging environments, and we need a way to handle this within our machine learning models.
Anonymization comes in handy here. The structured data is fairly easy to anonymize: because it lives in a well-defined schema, we have a pretty good idea of what is sensitive and what isn't, and we can apply Neosync Transformers to those columns. For example, a transformer can anonymize a user's current age by randomly selecting an integer within +/- 3 years of the real value. The result can no longer be used to identify that individual, but it's still an integer that, for all intents and purposes, is representative of an age.
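To make that concrete, here's a minimal sketch of what such an age transformer does. This is illustrative Python, not Neosync's actual implementation, and the function name and jitter parameter are our own:

```python
import random

def anonymize_age(age: int, jitter: int = 3) -> int:
    """Return an age shifted by a random offset in [-jitter, +jitter].

    Illustrative sketch only; Neosync's built-in transformers handle
    this kind of column-level anonymization for you.
    """
    # Clamp at zero so the anonymized value is still a plausible age.
    return max(0, age + random.randint(-jitter, jitter))

print(anonymize_age(42))  # e.g. 39, 42, or 45
```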
We can do this for all of the structured data columns, which gets us to a dataset that is anonymized but still retains the context of a customer's record. We can then freely use that data for machine learning since it's no longer sensitive. And if a customer asks us to delete their data, we can simply delete the source record in production without having to scrub any of our lower environments or retrain our machine learning models.
The trickier part is handling the free-form text. We need a way to identify what is sensitive in the text, parse it out, anonymize it, and then insert it back into place.
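As a first pass, you could imagine a simple detect-and-replace step like the sketch below. The patterns here are deliberately simplified stand-ins; real PII detection relies on trained models rather than handwritten regexes, which is exactly where the challenge discussed next comes from:

```python
import re

# Simplified, illustrative patterns; production PII detection uses
# trained NER models, not a handful of regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace each detected entity with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact("Reach me at jane.doe@example.com or +33 6 12 34 56 78."))
# -> "Reach me at <EMAIL> or <PHONE>."
```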
The challenge here is that data that isn't traditionally sensitive in isolation can become sensitive in context. For example, if a transcript has a customer talking about an extremely rare medical condition, that detail might be considered PII: combined with something as broad as the city or even state the customer lives in, it could identify one of the few people who have it.
This is the challenge with probabilistic PII detection models: they can usually capture a large portion of the sensitive data, but you can never be 100% sure they caught all of it.
There is another way, though, and this is where large language models (LLMs) come into play. LLMs are great at generating synthetic data that closely resembles real data, such as a customer call transcript with a medical insurance company.

This is a great way to generate synthetic data that you can feed into your machine learning model without having to worry about sensitive data falling through the cracks. In a way, this is deterministic PII anonymization, because no real PII enters the data set to begin with.
Here is an example of how you can use Neosync to generate synthetic data with LLMs.
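In Neosync, this is configured as part of a job. As a rough illustration of what such a step does under the hood, here's a sketch using the OpenAI Python client; the model name, prompt, and client setup are assumptions for illustration, not Neosync's actual configuration:

```python
from openai import OpenAI

# Illustrative sketch of LLM-driven synthetic data generation.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write a realistic but entirely fictional transcript of a call "
    "between a medical insurance agent and a customer about a "
    "prescription refill. Invent every name, condition, and detail."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any capable LLM works
    messages=[{"role": "user", "content": PROMPT}],
)

synthetic_transcript = response.choices[0].message.content
print(synthetic_transcript)
```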
The great thing is that you can combine this synthetic data generation approach with the more traditional, column-based anonymization in a single job. That gives you full control over the structure, shape, and sensitivity of your data.
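Conceptually, a combined job treats each record like the sketch below, reusing the illustrative anonymize_age helper from earlier; generate_synthetic_transcript is a hypothetical stub standing in for the LLM call above:

```python
def generate_synthetic_transcript() -> str:
    # Hypothetical stub standing in for the LLM call shown earlier.
    return "AGENT: Thanks for calling. CUSTOMER: I'd like to refill my prescription."

def process_record(record: dict) -> dict:
    """Run one record through a combined job: column-level anonymization
    for structured fields, fully synthetic text for the transcript."""
    return {
        "customer_id": record["customer_id"],           # non-sensitive key, kept as-is
        "age": anonymize_age(record["age"]),            # structured: jittered in place
        "transcript": generate_synthetic_transcript(),  # unstructured: fully synthetic
    }

record = {"customer_id": "c_123", "age": 42, "transcript": "…real call with PII…"}
print(process_record(record))
```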
With this, we've solved our problem statement: a deletion request only touches production, and our models are never trained on real PII.
In this blog, we presented an approach that safely handles both structured and unstructured data, making it usable for machine learning use-cases. As more countries and regions pass data security and data privacy regulations, we'll continue to see more companies dealing with this problem. If you're interested in trying out Neosync, you can sign up for a free account.