How to Personalize AI Without Storing PII
Before You Start
The distinction between personalization data and personal data is critical. Personalization data describes how a user interacts with your system: their preferred programming language, explanation style, detail level, and rejected approaches. Personal data describes who the user is: their name, email, location, employer, and demographic characteristics. Effective personalization needs the first category, not the second. A coding assistant does not need to know your name to remember that you prefer TypeScript and dislike verbose explanations.
This guide assumes you have a basic preference storage system in place. If not, start with the preference engine guide. The techniques here modify how you capture and store preferences to avoid PII, not how preferences are structured or retrieved.
Step-by-Step Implementation
PII definitions vary by regulation and context. Under GDPR, any information that can directly or indirectly identify a natural person is personal data. Under CCPA, the definition is similar but also includes household-level identifiers. Start by listing every data point your application touches and classifying each as PII or non-PII.
Common PII categories to watch for: names, email addresses, phone numbers, physical addresses, IP addresses (PII under GDPR), device identifiers that can be linked to individuals, employment details, financial information, and any combination of non-PII fields that together could identify someone (quasi-identifiers). For AI personalization, the most common PII leakage happens when conversation content containing personal details gets stored as preference evidence without scrubbing.
A user might say "I'm the lead engineer at Stripe working on the payments API." Your preference engine should extract "expertise: distributed payments systems, role: senior engineer" and discard the company name and specific product. The behavioral preference is useful for personalization. The identifying details are not needed and create compliance obligations.
Store preferences as abstract behavioral patterns, not as facts about the person. Instead of "works at a Fortune 500 company," store "prefers enterprise-scale solutions." Instead of "based in Germany," store "requires GDPR compliance context." Instead of "has been coding for 15 years," store "expertise level: expert." The abstract version is just as useful for personalization while being far harder to trace back to a specific individual.
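For illustration, a stored record under this approach might look something like the following; the field names and values are hypothetical, not a required schema:

// Illustrative abstract preference record: behavioral patterns only, no identity facts
const abstractPreferences = {
  expertise_level: 'expert',
  compliance_context: 'GDPR',
  scale_preference: 'enterprise',
  explanation_style: 'concise',
  rejected_approaches: ['verbose explanations'],
};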
The abstraction should happen at extraction time, not after storage. Never store the raw personal detail with the intent to abstract it later: even a temporary copy of the raw detail in your database creates a compliance surface. Your extraction pipeline should output abstract preferences directly.
// PII-safe preference extraction prompt
const PII_SAFE_EXTRACTION = `Extract user preferences from this conversation.
CRITICAL: Do NOT include any personally identifiable information:
- No names, emails, or contact information
- No company names or specific employers
- No locations or geographic details
- No project names that could identify the user
- No specific dates that could narrow identity
Instead, abstract personal details into behavioral preferences:
- "works at Google on search" -> "expertise: search infrastructure, scale: large"
- "I'm a junior dev in Berlin" -> "expertise_level: junior, compliance_context: EU"
- "our team of 50 engineers" -> "team_size: large, prefers: scalable solutions"
Return only abstract, non-identifying preference observations.`;

Even with careful extraction prompts, PII can leak through. Add a scrubbing layer between extraction and storage that catches common PII patterns. Use regex patterns for structured PII (emails, phone numbers, IP addresses) and named entity recognition for unstructured PII (names, organizations, locations).
const PII_PATTERNS = [
  /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z]{2,}\b/gi, // emails
  /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, // phone numbers
  /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g, // IP addresses
  /\b\d{3}-\d{2}-\d{4}\b/g, // SSN pattern
];

function scrubPII(text) {
  let cleaned = text;
  for (const pattern of PII_PATTERNS) {
    cleaned = cleaned.replace(pattern, '[REDACTED]');
  }
  // If any redaction happened, the preference likely contains PII
  // and should be re-extracted with stricter instructions
  if (cleaned !== text) {
    return { text: cleaned, hadPII: true };
  }
  return { text: cleaned, hadPII: false };
}

When the scrubber detects PII, do not simply redact and store the mangled preference. Instead, flag it for re-extraction with stricter instructions, or discard it entirely. A preference that says "prefers [REDACTED] framework" is useless. Better to lose that observation than store corrupted data.
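One way to wire that decision into the pipeline is sketched below; extractWithStricterPrompt and prefStore are hypothetical placeholders for your own extraction call and storage layer:

// Hypothetical storage gate: store clean observations, retry flagged ones once, then drop
async function storeObservation(pseudonym, observation) {
  if (!scrubPII(observation).hadPII) {
    await prefStore.set(pseudonym, observation); // clean: store the original text
    return;
  }
  // Re-extract with stricter instructions rather than storing redacted text
  const retried = await extractWithStricterPrompt(observation);
  if (retried && !scrubPII(retried).hadPII) {
    await prefStore.set(pseudonym, retried);
  }
  // Still contains PII after the retry: discard the observation entirely
}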
Replace real user identifiers (email, username, database ID) with opaque pseudonymous tokens in your preference store. The mapping between real identity and pseudonym should live in a separate, access-controlled system that your preference store never queries directly.
Generate pseudonyms using a one-way hash of the real identifier combined with a secret salt. This produces a consistent token for the same user (so you can associate preferences across sessions) without being reversible by anyone who does not have the salt. Store the salt in a separate security boundary from the preference data.
import hashlib
import os

# Salt stored separately from preference data
PSEUDONYM_SALT = os.environ['PSEUDONYM_SALT']

def get_pseudonym(real_user_id):
    """Generate a consistent, non-reversible pseudonym."""
    combined = f"{PSEUDONYM_SALT}:{real_user_id}"
    return hashlib.sha256(combined.encode()).hexdigest()[:24]

# In your preference store, only the pseudonym appears
# pref_store.set(pseudonym, preferences)
# The mapping real_user_id -> pseudonym is only needed
# at the application boundary, not in the preference store

If you analyze preferences across users (for cold start initialization, cohort analysis, or product improvement), add differential privacy noise to prevent individual re-identification. Even abstract preferences can become identifying when combined: "expert Python developer who prefers async, works on ML pipelines, and uses a specific rare framework" might narrow the population to one person.
The simplest differential privacy approach for preference aggregation is the randomized response technique. When aggregating a categorical preference, each user's actual value is kept with probability p and replaced with a value drawn uniformly at random from the category set with probability (1 - p). The aggregate statistics are then adjusted to compensate for the known noise rate. This guarantees that any individual's preference cannot be determined from their reported value with confidence greater than a known bound.
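A minimal sketch of randomized response for a single categorical preference follows; the category set and p value are illustrative assumptions, and the second function applies the standard debiasing to the observed proportions:

// Randomized response for one categorical preference (illustrative values)
const CATEGORIES = ['concise', 'balanced', 'detailed'];
const P_KEEP = 0.75; // probability of reporting the true value

function randomizedResponse(trueValue) {
  if (Math.random() < P_KEEP) return trueValue;
  // Otherwise report a uniformly random category
  return CATEGORIES[Math.floor(Math.random() * CATEGORIES.length)];
}

// Observed proportion q relates to the true proportion t by
// q = P_KEEP * t + (1 - P_KEEP) / k, so t = (q - (1 - P_KEEP) / k) / P_KEEP
function estimateTrueProportion(observedProportion) {
  const k = CATEGORIES.length;
  return (observedProportion - (1 - P_KEEP) / k) / P_KEEP;
}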
Regardless of whether your preference data contains PII, users should be able to view what the system has stored, export their data, and delete it on demand. Under GDPR, these are legal requirements. Under any reasonable product design, they are trust requirements.
Expose three API endpoints: one that returns all preferences associated with a user's pseudonym (with the pseudonym resolved to real identity at the application boundary), one that exports preferences in a portable format, and one that deletes all preference data and resets the user to a new-user state. Adaptive Recall provides these through its standard memory API: query to view, export through the REST endpoint, and forget to delete.
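If you are building these endpoints yourself rather than using Adaptive Recall's API, the shape is roughly as follows; this Express-style sketch assumes hypothetical resolveUser, getPseudonym, and prefStore helpers:

// Hypothetical user-rights endpoints: view, export, delete
app.get('/me/preferences', async (req, res) => {
  const pseudonym = getPseudonym(resolveUser(req)); // identity resolved only at the boundary
  res.json(await prefStore.getAll(pseudonym));
});

app.get('/me/preferences/export', async (req, res) => {
  const pseudonym = getPseudonym(resolveUser(req));
  const prefs = await prefStore.getAll(pseudonym);
  res.setHeader('Content-Disposition', 'attachment; filename="preferences.json"');
  res.json(prefs); // portable JSON export
});

app.delete('/me/preferences', async (req, res) => {
  const pseudonym = getPseudonym(resolveUser(req));
  await prefStore.deleteAll(pseudonym); // resets the user to a new-user state
  res.status(204).end();
});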
What You Can and Cannot Personalize Without PII
PII-free personalization works well for behavioral adaptation (response style, detail level, format preferences), domain expertise matching (technical depth, framework choices, architecture scale), interaction pattern optimization (step-by-step vs complete solutions, iteration style), and negative preference enforcement (avoiding rejected approaches). It works less well for geographic or cultural personalization (time zones, language variants, cultural norms), company-specific context (internal tools, team conventions, product-specific knowledge), and personal relationship continuity (remembering personal details the user shared for rapport). If your application needs these deeper personalization dimensions, you will need to store some personal data with appropriate consent and security rather than trying to deliver these features through behavioral abstraction alone.
Adaptive Recall stores preferences as memories with built-in lifecycle management. Use entity tags for behavioral abstraction and the forget tool for complete data deletion.