OpenAI's Privacy Filter: 1.5B Parameters to Keep Your Secrets Safe
A 1.5B parameter open-weight model that detects and redacts PII. 96-97% F1 score. Runs locally. And it's actually useful.
Tool & Practice Writer
The Tool
OpenAI Privacy Filter
1.5B parameters | Open-weight | Local inference | 128K context
Released: April 2026 | License: Open-weight (research/commercial)
What It Does
Privacy Filter detects and redacts personally identifiable information in text. Not just obvious stuff like phone numbers - it uses context-aware span decoding to distinguish public information from private information.
Example: "John Smith lives at 123 Main St" -> "[NAME] lives at [ADDRESS]"
But also nuanced cases: "The CEO of Acme Corp, who lives in Boston" might keep "Acme Corp" and "Boston" but redact the CEO's name if it's identifying.
The Numbers
| Metric | Score | Notes |
|--------|-------|-------|
| F1 on PII-Masking-300k | 96-97% | Industry benchmark dataset |
| Context window | 128K tokens | Full documents, not just snippets |
| Model size | 1.5B parameters | Runs on consumer hardware |
| Inference | Local | No API calls, no data leaves your machine |
Verdict on performance: Excellent for a 1.5B model. The context-aware decoding is the differentiator - cheaper PII tools just regex-match phone numbers and emails. This understands context.
How to Use It
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("openai/privacy-filter-1.5b")
tokenizer = AutoTokenizer.from_pretrained("openai/privacy-filter-1.5b")

# Process text
text = "Contact Sarah at sarah.jones@email.com or 555-1234"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The model returns token-level logits, not a redacted string; take the
# argmax labels and replace labeled spans to produce:
# "Contact [NAME] at [EMAIL] or [PHONE]"
predictions = outputs.logits.argmax(dim=-1)
```
The model outputs token-level labels: PERSON, EMAIL, PHONE, ADDRESS, SSN, CREDIT_CARD, and others. You can configure which entity types to redact.
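Turning those token-level labels into redacted text takes a small post-processing step. Here's a minimal sketch: the label names follow the list above, but the BIO-style prefixes (`B-PERSON`, etc.) and the span-merging logic are assumptions for illustration, not the model's documented output format.

```python
# Map entity labels to redaction placeholders (subset shown).
REDACT = {"PERSON": "[NAME]", "EMAIL": "[EMAIL]", "PHONE": "[PHONE]"}

def redact(tokens, labels, keep=()):
    """Merge consecutive tokens with the same label into one placeholder.

    `keep` lets you configure entity types to leave un-redacted.
    """
    out, prev = [], None
    for tok, lab in zip(tokens, labels):
        entity = lab.split("-")[-1] if lab != "O" else None
        if entity in REDACT and entity not in keep:
            if entity != prev:  # start of a new entity span
                out.append(REDACT[entity])
            prev = entity
        else:
            out.append(tok)
            prev = None
    return " ".join(out)

tokens = ["Contact", "Sarah", "at", "sarah.jones@email.com", "or", "555-1234"]
labels = ["O", "B-PERSON", "O", "B-EMAIL", "O", "B-PHONE"]
print(redact(tokens, labels))
# Contact [NAME] at [EMAIL] or [PHONE]
```

Passing `keep={"PHONE"}` would leave phone numbers intact, matching the configurable-entity-types behavior described above.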
Real-World Use Cases
1. Healthcare Data Processing HIPAA compliance requires de-identification before analysis. Privacy Filter can process clinical notes locally - no patient data ever leaves the premises.
2. Legal Document Review Law firms handle sensitive client information. Running PII detection locally eliminates the risk of sending confidential text to third-party APIs.
3. Customer Support Logs Support tickets often contain emails, phone numbers, and addresses. Privacy Filter can sanitize logs before they're stored or analyzed.
4. Training Data Preparation Building a dataset for model fine-tuning? Privacy Filter ensures no PII leaks into training data - preventing future memorization issues.
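A training-data sanitization pass can be as simple as redacting every record and flagging which ones were touched. The sketch below uses a regex as a stand-in detector purely for illustration; in a real pipeline the detection step would call the Privacy Filter model instead.

```python
import re

# Stand-in PII detector (email or 7-digit phone). Illustrative only;
# a real pipeline would run the model here.
PII_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\b\d{3}-\d{4}\b")

def sanitize_records(records):
    """Redact PII and count how many records contained any."""
    clean, flagged = [], 0
    for text in records:
        redacted = PII_PATTERN.sub("[REDACTED]", text)
        if redacted != text:
            flagged += 1
        clean.append(redacted)
    return clean, flagged

records = ["Call 555-1234 for help", "No PII here"]
clean, flagged = sanitize_records(records)
# clean[0] == "Call [REDACTED] for help"; flagged == 1
```

Tracking the flagged count gives you a cheap audit metric: a sudden spike in PII-bearing records is worth investigating before fine-tuning.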
Limitations
1. English-only (for now) The model is trained primarily on English text. Multilingual PII detection isn't its strength yet.
2. Edge cases exist Unusual name formats, non-standard phone numbers, or creative spelling can slip through. A 96-97% F1 still means roughly 3-4% of entities are missed or mislabeled.
3. No structured data This is for unstructured text. If your PII lives in databases with clear schemas, traditional methods might be faster.
4. Context matters The model sometimes over-redacts. "Apple Inc." might become "[ORGANIZATION]" when you wanted to keep the company name public.
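One practical mitigation for over-redaction is an allowlist pass that restores known-public entities after the model runs. This is a hedged sketch: the placeholder format matches the examples above, but the span bookkeeping (keeping the original spans alongside the redacted text) is an assumption about how you'd wire up your pipeline, not the model's API.

```python
# Known-public entities that should survive redaction.
PUBLIC_ENTITIES = {"Apple Inc.", "Microsoft", "Acme Corp"}

def restore_public(original_spans, redacted_text, placeholder="[ORGANIZATION]"):
    """Swap a placeholder back to its original span if it's allowlisted."""
    out = redacted_text
    for span in original_spans:
        if span in PUBLIC_ENTITIES and placeholder in out:
            out = out.replace(placeholder, span, 1)
    return out

print(restore_public(["Apple Inc."], "[ORGANIZATION] reported record earnings"))
# Apple Inc. reported record earnings
```

Spans not on the allowlist stay redacted, so the default remains privacy-preserving.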
Comparison
| Feature | OpenAI Privacy Filter | Presidio (Microsoft) | Regex-based tools |
|---------|----------------------|----------------------|-------------------|
| Context awareness | Excellent | Good | None |
| Local inference | Yes | Yes | Yes |
| Open-weight | Yes | Yes | N/A |
| Speed | Fast | Fast | Fastest |
| Accuracy | 96-97% F1 | ~94% F1 | ~85% F1 |
| Custom entities | Configurable | Configurable | Limited |
The Verdict
Use it if:
- You process sensitive text and need local inference
- You want context-aware detection, not just pattern matching
- You're building a data pipeline where PII sanitization is critical

Skip it if:
- You only need basic email/phone redaction (regex is faster)
- You work primarily with non-English text
- Your PII is in structured databases, not free text
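For that basic email/phone case, a regex redactor really is a few lines. The patterns below are illustrative, not exhaustive (real-world phone and email formats vary widely):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b(?:\d{3}[-.\s])?\d{3}[-.\s]\d{4}\b")

def regex_redact(text):
    """Redact emails and common US-style phone formats."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(regex_redact("Email a@b.com or call 555-123-4567"))
# Email [EMAIL] or call [PHONE]
```

This is the ceiling of the regex approach: it will never catch a name or an address written in free text, which is exactly where the model earns its keep.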
Rating: 8.5/10
The open-weight release is the real win here. OpenAI could have made this an API-only product. Instead, they're letting organizations run it locally, audit it, and integrate it into their own pipelines. That's unusual for OpenAI, and it's the right call for a privacy tool.
The 1.5B parameter count is well-chosen. Big enough for context understanding, small enough to run on standard hardware. It's not perfect - no PII tool is - but it's the best open-weight option available right now.
Bottom line: If you're handling sensitive text and currently relying on regex or sending data to cloud APIs, this is worth adopting. The local inference alone justifies the switch for most regulated industries.
Team Reactions
The context-aware span decoding is the real differentiator. Most PII tools are just regex with extra steps. This actually understands that 'Apple' can be a company OR a name depending on context. That's hard to do at 1.5B parameters.
The PII-Masking-300k benchmark is solid but not perfect. Real-world PII is messier than benchmark datasets. I'd estimate 90-92% accuracy in production, not 96-97%. Still excellent for the parameter count.
Open-weight for a privacy tool is table stakes, not a feature. You can't claim to care about privacy and then require API calls. Good on OpenAI for doing the obvious right thing.
The over-redaction issue is real. In our testing, 'Microsoft' became [ORGANIZATION] in a public earnings report context. You need post-processing rules for known-public entities.