Introducing Free-Form Text Anonymization for AI and Machine Learning Workflows
Use Neosync to detect and redact PII in free-form text such as LLM prompts and other workflows
December 13th, 2024
Data anonymization is a powerful tool to protect sensitive data and give developers a way to use production data locally for a better developer experience. One of the most common questions that customers ask us is how to generate consistent data. That is, if a name, city, address, or any other field appears 7 times and we want to anonymize it, how can we make sure that the output is the same across all 7 instances?
In this blog, we're going to talk about data consistency in anonymization, why it's important and how you can use Neosync to consistently anonymize data.
There are a lot of use cases where data consistency is important when you want to anonymize data. Let's take a look at an example.
Imagine that you're an ecommerce provider collecting user information and transaction data. That data is sensitive, so you want to anonymize it, but you also want to retain its transactional nature after anonymization. For example, there might be several orders by the same customer, and you want to preserve that relationship so you can test a new Order History feature. Those orders are tied together by an email address, which is the primary key for the orders. If every email address in that table were anonymized to a new, unrelated email address, it would be really difficult to find orders that belong to the same user.
This is where data consistency in anonymization comes into play. When we consistently anonymize this data to produce the same output given the same input, we can easily see which orders belong to which users without ever knowing who those users are.
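To make the Order History scenario concrete, here is a minimal sketch that contrasts consistent and inconsistent anonymization on a few order rows. The rows, the helper names, and the `@anon.test` addresses are all made up for illustration; this is not how Neosync generates values internally.

```javascript
// Hypothetical order rows; the emails and totals are invented for this sketch.
const orders = [
  { id: 1, email: "john.doe@example.com", total: 25 },
  { id: 2, email: "jane.roe@example.com", total: 40 },
  { id: 3, email: "john.doe@example.com", total: 15 },
];

// Consistent anonymization: remember the replacement chosen for each input
// so that a repeated input always receives the same output.
function makeConsistentAnonymizer() {
  const seen = new Map();
  return (email) => {
    if (!seen.has(email)) {
      seen.set(email, `user${seen.size + 1}@anon.test`);
    }
    return seen.get(email);
  };
}

// Inconsistent anonymization: every call produces a fresh value,
// even when the input has been seen before.
function makeInconsistentAnonymizer() {
  let counter = 0;
  return () => `user${++counter}@anon.test`;
}

// Anonymize the email column, then count how many distinct customers remain.
function countCustomers(rows, anonymize) {
  const anonymized = rows.map((row) => ({ ...row, email: anonymize(row.email) }));
  return new Set(anonymized.map((row) => row.email)).size;
}

const consistentCount = countCustomers(orders, makeConsistentAnonymizer());
const inconsistentCount = countCustomers(orders, makeInconsistentAnonymizer());

console.log(consistentCount);   // 2: both of John's orders still share one email
console.log(inconsistentCount); // 3: the link between John's orders is lost
```

With the consistent version, the two orders placed by the same customer can still be grouped together after anonymization, which is exactly what an Order History test needs.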
A question that we get asked all of the time is how this differs from simply hashing the data. A hash function, such as MD5 or SHA-256, takes an input value and always produces the same fixed-length alphanumeric string for the same input.
For example, using a sha256 hash, the input value evis drenova will always return 890e64f0d9e7c43b0058467f81f65e9f6fd6dc343de18129e74e3823c6c9dbbc. So in a way you get data consistency using a hash, but that return value is pretty ugly and definitely doesn't look like your production data. It would be a lot nicer if evis drenova could be randomly assigned another human-sounding name, consistently.
This is what Neosync does. We combine the ability to generate consistent data, like a hash, with contextually appropriate output, such as replacing names with names, cities with cities, emails with emails and so on.
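The difference between a raw hash and a consistent, contextually appropriate replacement can be sketched in a few lines of Node.js. The `NAMES` pool and the `consistentName` helper are assumptions made for this illustration; the sketch shows one simple way to get the behavior described above and should not be read as Neosync's actual implementation.

```javascript
const { createHash } = require("node:crypto");

// A small, made-up pool of replacement names (an assumption for this sketch).
const NAMES = ["Alice Martin", "Bruno Keller", "Carla Ortiz", "Deepak Rao"];

// Plain hashing: deterministic, but the output is a 64-character hex string
// that looks nothing like a name.
function sha256Hex(value) {
  return createHash("sha256").update(value).digest("hex");
}

// Hash-then-pick: use the first 4 bytes of the digest as an index into NAMES,
// so the same input always selects the same replacement name.
function consistentName(value) {
  const digest = createHash("sha256").update(value).digest();
  return NAMES[digest.readUInt32BE(0) % NAMES.length];
}

console.log(sha256Hex("evis drenova"));      // same hex string on every run
console.log(consistentName("evis drenova")); // same name from NAMES on every run
```

Note that with a pool this small, different inputs will often land on the same name. A realistic generator would need a much larger pool or a way to compose values.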
Neosync is our data anonymization platform that gives developers tools to powerfully anonymize their sensitive data so they can use it locally for a better developer experience. A core part of that is data consistency.
Neosync allows you to set seed values within custom transformers in order to produce consistent output as you anonymize your data. Let's take a look at an example.
Let's say that we have email addresses in our database that we want to anonymize and we want to ensure that the emails are anonymized consistently. We can create a custom transformer to do this like so:
const newEmail = neosync.transformEmail(value, {
  preserveLength: false,
  preserveDomain: false,
  excludedDomains: [],
  maxLength: 10000,
  seed: 1,
  emailType: 'uuidv4',
  invalidEmailAction: 'reject',
});
return newEmail;
This custom email transformer takes in the sensitive email address and generates a new email address using a seed value of 1. Given a list of 5 email addresses, here is what we can expect:
In our example, we can see that john.doe@example.com maps to frank.smith@domain.com in both instances. Back to our example from above, if I'm a developer building out an Order History feature, then I want to make sure that I can see all of the orders for john.doe@example.com without knowing the actual user behind it. Data consistency with seed values gives us that power.
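To illustrate the role of the seed, here is a small stand-in written in plain Node.js. The `pseudonymizeEmail` function is hypothetical and is not the Neosync API; it uses a keyed hash (HMAC) with the seed as the key, which is just one way to get seed-dependent, repeatable output.

```javascript
const { createHmac } = require("node:crypto");

// Hypothetical stand-in for a seeded email transformer; NOT the Neosync API.
// The seed is used as the HMAC key, so the output depends on both the seed
// and the input email.
function pseudonymizeEmail(email, seed) {
  const digest = createHmac("sha256", String(seed)).update(email).digest("hex");
  return `user-${digest.slice(0, 12)}@example.test`;
}

// Made-up input list in which one address appears twice.
const emails = [
  "john.doe@example.com",
  "jane.roe@example.com",
  "john.doe@example.com",
];

const withSeed1 = emails.map((email) => pseudonymizeEmail(email, 1));
const withSeed2 = emails.map((email) => pseudonymizeEmail(email, 2));

// Within one seed, both occurrences of john.doe@example.com map to the same value.
console.log(withSeed1[0] === withSeed1[2]);
// In this sketch, changing the seed changes the whole mapping.
console.log(withSeed1[0] === withSeed2[0]);
```

In this sketch, running the same input with the same seed on another machine yields the same result, because nothing in the function depends on local state.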
In this blog, we explored how to generate consistent anonymized data for a better developer experience using Neosync. This is just the tip of the iceberg. There is a lot more to explore, such as how to set seed values for an entire Job so that you can reproduce an entire data set from one machine or Job to another, as well as many other use cases such as analytics. But we'll leave those for another blog!
Nucleus Cloud Corp. 2024