Testing Your Data Warehouse with Neosync
A guide on how to test your data warehouse using Neosync
December 11th, 2024
As AI and ML become more integrated into our workflows, protecting sensitive data has become increasingly critical. Today, we're excited to launch support for free-form text anonymization in Neosync.
The feature is pretty straightforward - it takes in free-form text as input, detects sensitive information like names, dates, SSNs and other PII, and redacts them. Let's look at an example:
Input:
Dear Mr. John Chang, your physical therapy for your rotator cuff injury is approved for 12 sessions. Your first appointment with therapist Jake is on 8/1/2024 at 11 AM. Please bring a photo ID. We have your SSN on file as 246-80-1357. Is this correct?
Output:
Dear Mr. <REDACTED>, your physical therapy for your rotator cuff injury is approved for 12 sessions. Your first appointment with therapist <REDACTED> is on <REDACTED> at <REDACTED>. Please bring a photo ID. We have your SSN on file as <REDACTED>. Is this correct?
Under the hood, we're using a Named Entity Recognition (NER) model that identifies and redacts sensitive information while preserving the context and structure of the text. You can configure how aggressive the redaction is by adjusting the scoreThreshold: lower values mean more aggressive redaction with more potential false positives, while higher values mean less redaction but more potential false negatives.
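To make the threshold behavior concrete, here's a minimal sketch of how a score threshold gates redaction. The entity labels, scores, and the redact helper below are illustrative assumptions for this example, not actual Neosync model output or API:

// Hypothetical sketch: a score threshold decides which detected entities get redacted.
interface DetectedEntity {
  text: string;
  label: string;
  score: number; // model confidence, between 0 and 1
}

function redact(input: string, entities: DetectedEntity[], scoreThreshold: number): string {
  let output = input;
  for (const e of entities) {
    // Only redact entities the model is at least `scoreThreshold` confident about.
    if (e.score >= scoreThreshold) {
      output = output.split(e.text).join('<REDACTED>');
    }
  }
  return output;
}

const text = 'Your appointment with Jake is on 8/1/2024.';
const entities: DetectedEntity[] = [
  { text: 'Jake', label: 'PERSON', score: 0.85 },
  { text: '8/1/2024', label: 'DATE', score: 0.4 },
];

console.log(redact(text, entities, 0.1)); // low threshold: both entities redacted
console.log(redact(text, entities, 0.6)); // higher threshold: the low-confidence date is kept

With a threshold of 0.1, both entities clear the bar and are redacted; at 0.6, only the high-confidence "Jake" is redacted and the date survives.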
There are three main use cases we're seeing:
As more companies build AI agents that pass prompts between each other, protecting sensitive data becomes crucial. You don't want customer PII getting accidentally embedded in model weights or leaked through prompt chains. This feature lets you automatically sanitize text before it hits any LLM.
ML teams working with text data often need to anonymize large datasets for training. Whether it's medical records, customer service transcripts, or other text data - our API makes it easy to process entire datasets while maintaining data utility for training.
Beyond AI/ML, many teams just need to anonymize text data for testing, development, or compliance reasons. The API integrates easily into existing workflows to automatically catch and redact sensitive information.
Using the feature is simple. Here's a basic example using our TypeScript SDK:
// Imports from the Neosync TypeScript SDK
import { getNeosyncClient, TransformerMapping, TransformerConfig, TransformPiiText } from '@neosync/sdk';

const neosyncClient = getNeosyncClient({
  getAccessToken: () => 'your_api_key',
  // Setup client config
});

const data = {
  text: 'your text here',
};

const transformers = [
  new TransformerMapping({
    expression: '.text',
    transformer: new TransformerConfig({
      config: {
        case: 'transformPiiTextConfig',
        value: new TransformPiiText({
          scoreThreshold: 0.1,
        }),
      },
    }),
  }),
];

const result = await neosyncClient.anonymization.anonymizeSingle({
  inputData: JSON.stringify(data),
  transformerMappings: transformers,
  accountId: 'your_account_id',
});
Additionally, you can configure rules to implement custom recognizers, add allow lists and deny lists, and much more.
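To illustrate what allow and deny lists do, here's a small sketch of the semantics layered on top of NER output. The function and its behavior are assumptions about how such lists typically work, not Neosync's exact implementation:

// Hypothetical sketch: allow lists exempt flagged spans from redaction,
// deny lists force redaction of custom terms the model would otherwise miss.
function applyLists(
  input: string,
  detected: string[],  // spans the NER model flagged as PII
  allowList: string[], // never redact these, even if flagged
  denyList: string[],  // always redact these, even if not flagged
): string {
  const toRedact = new Set([
    ...detected.filter((span) => !allowList.includes(span)),
    ...denyList,
  ]);
  let output = input;
  for (const span of toRedact) {
    output = output.split(span).join('<REDACTED>');
  }
  return output;
}

const msg = 'Contact Acme Corp or John Chang at EXT-1234.';
// "Acme Corp" is flagged but allowed through; "EXT-1234" is a custom term we always hide.
console.log(applyLists(msg, ['Acme Corp', 'John Chang'], ['Acme Corp'], ['EXT-1234']));

Here the allow list keeps the company name intact while the deny list catches an internal identifier the model didn't recognize.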
The API is available now and you can find full documentation on our docs site.
This is just the start - we're working on expanding the types of entities we can detect and adding more configuration options.
We're excited to see how teams use this to build more secure AI and ML workflows. If you want to try it out, sign up for a free Neosync account. And as always, let us know what you think!
Until next time, Evis