Introducing Free-Form Text Anonymization for AI and Machine Learning Workflows
Use Neosync to detect and redact PII in free-form text such as LLM prompts and other workflows
December 13th, 2024
AI agents are front and center right now, and as more companies adopt them for various workflows and tasks, an important question comes up - how do we ensure that AI agents aren't leaking sensitive data to model providers?
The problem is pretty straightforward. AI agents need data to be useful. The more context they have, the better they can help. But that creates an obvious tension - how much data is too much? And how do we ensure that sensitive data isn't accidentally sent to a model provider like OpenAI or Anthropic?
Think about a customer service workflow. An AI agent might need access to order history to help resolve issues. But if that agent accidentally includes credit card numbers or other PII in its prompts to the LLM, you've just leaked sensitive data that you can't get back.
There are a few key risks that engineering teams need to think about: PII or payment details slipping into prompts sent to model providers, agents being granted access to more data than they actually need, and the fact that once sensitive data leaves your systems, you can't get it back.
Let's talk about how to actually solve this. You need three main components: data classification, strict access controls for agents, and PII detection and anonymization for free-form text.
First, classify your data based on sensitivity. This is really the foundation and it defines the access controls that follow.
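As a concrete sketch - the tier names, registry, and helper below are illustrative assumptions, not a Neosync API - you might keep a simple registry that maps each data source to a sensitivity tier and decides which tiers may ever leave your network:

from enum import Enum

class Sensitivity(Enum):
    PUBLIC = 1        # safe to include in prompts as-is
    INTERNAL = 2      # allowed in prompts for internal tooling
    CONFIDENTIAL = 3  # must be anonymized before leaving your systems
    RESTRICTED = 4    # never sent to an external model provider

# Illustrative registry mapping data sources to sensitivity tiers
DATA_CLASSIFICATION = {
    "product_docs": Sensitivity.PUBLIC,
    "order_history": Sensitivity.CONFIDENTIAL,
    "payment_records": Sensitivity.RESTRICTED,
}

def may_leave_network(data_source):
    # Unclassified sources are treated as restricted by default
    tier = DATA_CLASSIFICATION.get(data_source, Sensitivity.RESTRICTED)
    return tier in (Sensitivity.PUBLIC, Sensitivity.INTERNAL)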
Second, implement strict controls on what agents can access. For example, here's what a Python function might look like that validates an agent has the appropriate permissions to access a data source:
def validate_agent_access(agent_id, data_source):
    # Check if the agent has permission to read from this data source
    permissions = get_agent_permissions(agent_id)
    if data_source not in permissions['allowed_sources']:
        raise AccessDeniedException()

    # Validate that the access grant hasn't expired
    if permissions['expiry'] < current_time():
        raise ExpiredAccessException()
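A usage sketch, continuing with the hypothetical helpers above - the point is that a denial fails closed, so the agent simply proceeds without the sensitive context:

try:
    validate_agent_access(agent_id="support-agent-1", data_source="order_history")
    context = fetch_order_history(customer_id)  # hypothetical data-access helper
except (AccessDeniedException, ExpiredAccessException):
    context = None  # fail closed: don't hand the agent data it isn't cleared for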
Third, it's important to ensure that users and AI agents aren't accidentally leaking any sensitive data to models through their prompts. Here is an example, using the Neosync Python SDK, of how you can detect and anonymize PII in free-form text; the score_threshold sets the minimum confidence score at which detected text is treated as PII and transformed.
import grpc
import json
import os

from neosync.mgmt.v1alpha1.anonymization_pb2_grpc import AnonymizationServiceStub
from neosync.mgmt.v1alpha1.anonymization_pb2 import AnonymizeSingleRequest, TransformerMapping
from neosync.mgmt.v1alpha1.transformer_pb2 import TransformerConfig, TransformPiiText

def main():
    channel_credentials = grpc.ssl_channel_credentials()
    channel = grpc.secure_channel(os.getenv('NEOSYNC_API_URL'), channel_credentials)
    stub = AnonymizationServiceStub(channel)

    # Apply the PII text transformer to the ".text" field of the input JSON
    transformer_mapping = TransformerMapping(
        expression=".text",
        transformer=TransformerConfig(
            transform_pii_text_config=TransformPiiText(
                score_threshold=0.5,
            )
        )
    )

    data = {"text": "Dear Mr. John Chang, your physical therapy for your rotator cuff injury is approved for 12 sessions. Your first appointment with therapist Jake is on 8/1/2024 at 11 AM. Please bring a photo ID. We have your SSN on file as 246-80-1357. Is this correct?"}
    input_data = json.dumps(data)

    access_token = '<neosync-api-token>'
    metadata = [('authorization', f'Bearer {access_token}')] if access_token else None

    try:
        # Make the RPC call
        response = stub.AnonymizeSingle(
            AnonymizeSingleRequest(
                input_data=input_data,
                transformer_mappings=[transformer_mapping],
                account_id="<neosync-account-id>"
            ),
            metadata=metadata
        )

        # Parse and print response
        try:
            output_data = json.loads(response.output_data)
            print("Anonymized data:", output_data)
        except json.JSONDecodeError:
            print("Raw response:", response.output_data)
    except grpc.RpcError as e:
        if e.code() == grpc.StatusCode.UNAUTHENTICATED:
            print("Authentication failed - please check your access token")
        elif e.code() == grpc.StatusCode.UNAVAILABLE:
            print("Failed to connect to server. Please check the hostname and port.")
        else:
            print(f"RPC failed: {e.code()}, {e.details()}")
    finally:
        channel.close()

if __name__ == "__main__":
    main()
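Putting the three components together, a guarded prompt flow might look like the sketch below. Here anonymize_text and call_llm are hypothetical wrappers - the former around the AnonymizeSingle call above, the latter around your model provider's client - not Neosync APIs:

def send_guarded_prompt(agent_id, data_source, prompt):
    # 1. Gate the data access: raises if the agent lacks unexpired permission
    validate_agent_access(agent_id, data_source)

    # 2. Scrub PII from the free-form prompt before it leaves your network
    anonymized_prompt = anonymize_text(prompt)  # e.g., wraps stub.AnonymizeSingle

    # 3. Only the anonymized prompt is ever sent to the model provider
    return call_llm(anonymized_prompt)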
As AI agents become more deeply integrated into business processes, these security controls become even more critical. The companies that build these guardrails now will be able to adopt AI agents more confidently and effectively than those that treat security as an afterthought.
I expect we'll see more tools emerge specifically for securing AI agent data access. Remember that unless you're running your own AI agents, you're renting them from someone else. And if you don't want that other party to see your sensitive data, then you need to take the right steps to protect it.
Securing AI agent data access isn't optional - it's a critical requirement for safely deploying these systems in production.
The goal isn't to block AI agent usage entirely, but to enable it with appropriate safeguards. The companies that get this balance right will be much better positioned as AI agents become an even bigger part of how we work.