FIELD NOTES — THE CRAFT OF SMALL TOOLS
Pattern Matching for Privacy: Why I Chose Regex Over AI
10ms beats 500ms when you know what you're looking for
February 6, 2026 • 8 min read
"The master's tools will never dismantle the master's house."
— Audre Lorde
Lorde was writing about systems of oppression. But the line keeps finding me in technical contexts where it doesn't quite fit—and that's what makes it useful. Sometimes the master's tools work fine. Sometimes they're exactly what you need.
Everyone told me to use AI for PII detection. Transformer models. Named entity recognition. The grown-up tools.
I built regex patterns instead.
The Moment I Decided
11 PM on a Tuesday. I'm watching AWS Comprehend take 2.3 seconds to find an SSN I could have spotted in half a glance.
The SSN has a format: ###-##-####. Nine digits. Two dashes. Nearly always.
I don't need machine learning to pattern-match something that already has a pattern. That's like using GPS to find your couch.
What 80/20 Actually Means Here
Most personally identifiable information follows predictable formats. That's what makes it identifiable:
Social Security Numbers: ###-##-####
Email addresses: something@something.tld
Phone numbers: (###) ###-#### or variations
Credit cards: typically 16 digits with issuer-specific prefixes
Dates of birth: MM/DD/YYYY formats
These aren't ambiguous. They're structured. And structured data is regex's whole job.
Six patterns caught 80% of what I needed. Not because regex is magic—because most PII has structure we can exploit.
The other 20%? That's where AI earns its keep. But I don't need a $50,000 consultant for the work an intern could do.
The Six Patterns
Here's the core. Nothing clever. Nothing fancy.
1. Social Security Numbers
SSN_PATTERN = r'\b\d{3}-\d{2}-\d{4}\b'
Nine digits, formatted with dashes. Word boundaries keep the pattern from firing inside longer digit runs. This catches virtually every SSN written in the standard dashed format.
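A quick sketch of what the word boundaries buy you. The sample string and the `find_ssns` helper are made up for illustration, not taken from the tool itself.

```python
import re

SSN_PATTERN = r'\b\d{3}-\d{2}-\d{4}\b'

def find_ssns(text: str) -> list[str]:
    """Return every dash-formatted, SSN-shaped string in text."""
    return re.findall(SSN_PATTERN, text)

# Illustrative input: one real-looking SSN, one longer ID that merely
# contains an SSN-shaped run, and one SSN written without dashes.
sample = "SSN 123-45-6789 on file; order 9123-45-67890 is not one; bare 123456789 is missed."

print(find_ssns(sample))
# → ['123-45-6789']
```

The longer order number is skipped because the boundaries refuse to match in the middle of a digit run. The undashed number is skipped too, which is the trade-off of matching only the standard format.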
2. Email Addresses
EMAIL_PATTERN = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
Not RFC 5322 compliant—that regex runs 6,000+ characters. But it catches everything a normal business document contains.
Good enough beats theoretically complete. Every time.
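Here is roughly what "good enough" looks like in practice. This sketch writes the top-level-domain class as `[A-Za-z]`, and the addresses are invented examples.

```python
import re

# TLD class written as [A-Za-z]; inside a character class a "|" would be
# a literal pipe, not an "or".
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

text = "Contact jane.doe+billing@example.co.uk or bob@localhost for access."

print(EMAIL_RE.findall(text))
# → ['jane.doe+billing@example.co.uk']
```

Plus-addressing and multi-part domains are caught. A dotless host like `bob@localhost` is not, which is fine for business documents and wrong for RFC purists.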
3. Phone Numbers
PHONE_PATTERN = r'(\+1\s?)?(\(?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{4}'
Three iterations to get here. First version missed parentheses formatting. Second caught false positives. Third found the balance.
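A small sketch of the formats this pattern accepts. Because the pattern contains capturing groups, `re.findall` would return tuples of groups, so this example uses `finditer` and `group(0)` instead. The `find_phones` helper and the sample numbers are illustrative.

```python
import re

PHONE_PATTERN = r'(\+1\s?)?(\(?\d{3}\)?[\s.-]?)?\d{3}[\s.-]?\d{4}'

def find_phones(text: str) -> list[str]:
    """Return the full text of every phone-shaped match."""
    return [m.group(0) for m in re.finditer(PHONE_PATTERN, text)]

samples = "(555) 123-4567, 555.123.4567, +1 555 123 4567, 123-4567"

print(find_phones(samples))
# → ['(555) 123-4567', '555.123.4567', '+1 555 123 4567', '123-4567']
```

One caveat worth knowing: the pattern has no boundary anchors, so it can also fire inside longer digit runs. Fed a bare 16-digit card number, it will report the first ten digits as a phone number. Running the card pattern first, or adding digit lookarounds, is one way to handle that.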
4. Credit Cards
CC_PATTERN = r'\b(?:\d{4}[-\s]?){3}\d{4}\b'
Sixteen digits in four groups, with optional separators. Non-capturing groups keep it clean. Fifteen-digit American Express numbers fall outside this pattern.
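To see the separators at work, here is a short sketch using well-known test card numbers. Since the group is non-capturing, `re.findall` returns whole matches.

```python
import re

CC_PATTERN = r'\b(?:\d{4}[-\s]?){3}\d{4}\b'

# Three spellings of the same 16-digit test number, plus a 15-digit
# number grouped 4-6-5, which this pattern is not written to catch.
text = ("Cards: 4111 1111 1111 1111, 4111-1111-1111-1111, "
        "4111111111111111; Amex-style 3782 822463 10005.")

print(re.findall(CC_PATTERN, text))
# → ['4111 1111 1111 1111', '4111-1111-1111-1111', '4111111111111111']
```

The pattern checks shape only. It does not look at issuer prefixes or checksums, so any 16-digit run in this layout will match.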
5. Dates of Birth
DOB_PATTERN = r'\b(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12]\d|3[01])[/-](?:19|20)\d{2}\b'
Range validation for months and days. Fewer false positives than naive matching.
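A sketch of what the range checks do and do not rule out. The `is_calendar_date` helper is a hypothetical add-on, not something the six patterns include; it uses `datetime` to reject dates that pass the regex but do not exist.

```python
import re
from datetime import datetime

DOB_PATTERN = r'\b(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12]\d|3[01])[/-](?:19|20)\d{2}\b'

def is_calendar_date(match_text: str) -> bool:
    """Hypothetical second pass: does this M/D/YYYY string name a real day?"""
    month, day, year = (int(part) for part in re.split(r'[/-]', match_text))
    try:
        datetime(year, month, day)
        return True
    except ValueError:
        return False

text = "Born 02/29/1984; follow-up 13/05/1990; shipped 7-4-2021; typo 02/31/1990."

candidates = re.findall(DOB_PATTERN, text)
print(candidates)
# → ['02/29/1984', '7-4-2021', '02/31/1990']

real_dates = [c for c in candidates if is_calendar_date(c)]
print(real_dates)
# → ['02/29/1984', '7-4-2021']
```

Month 13 is rejected by the regex alone. February 31 gets through, because the pattern validates ranges, not calendars. And the shipping date matches too: the pattern finds dates, and deciding which ones are birth dates takes context.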
6. Names (The Honest One)
Names don't have formats. "John Smith" is just words.
NAME_PATTERN = r'\b[A-Z][a-z]+\s+[A-Z][a-z]+\b'
Two capitalized words. Catches "John Smith" and also "New York" and "Monday Morning."
The fix: combine pattern matching with a dictionary lookup. If the pattern matches AND the first word appears in a list of common first names, treat it as a name.
This isn't perfect. I'm not pretending it is. But combined with the other patterns, it provides coverage without NER's complexity and latency.
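The pattern-plus-dictionary idea can be sketched like this. The `COMMON_FIRST_NAMES` set is a tiny hypothetical stand-in for a real first-name list, and `find_names` is an illustrative helper. The pattern here adds capturing groups so the first word can be looked up.

```python
import re

# Same two-capitalized-words shape, with groups around each word.
NAME_PATTERN = r'\b([A-Z][a-z]+)\s+([A-Z][a-z]+)\b'

# Hypothetical stand-in; a real list would hold thousands of entries.
COMMON_FIRST_NAMES = {"john", "maria", "james", "linda"}

def find_names(text: str, first_names=COMMON_FIRST_NAMES) -> list[str]:
    """Keep pattern matches whose first word is a known first name."""
    names = []
    for match in re.finditer(NAME_PATTERN, text):
        if match.group(1).lower() in first_names:
            names.append(match.group(0))
    return names

text = "John Smith flew to New York on Monday Morning with Maria Lopez."

print(find_names(text))
# → ['John Smith', 'Maria Lopez']
```

"New York" and "Monday Morning" still match the pattern, but the lookup throws them out. The sketch has its own blind spot: matches do not overlap, so in "Dear John Smith" the pattern consumes "Dear John", the lookup rejects it, and "John Smith" is never tested.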
The Performance Gap
I benchmarked both approaches on a 50-page document:
| Approach | Processing Time | Dependencies | Cost |
|---|---|---|---|
| Regex (local) | 8-15ms | 0 | $0 |
| spaCy NER (local) | 800-1200ms | 200MB+ | $0 |
| AWS Comprehend | 500-2000ms | API | ~$0.01/call |
| OpenAI GPT-4 | 2000-5000ms | API | ~$0.03/call |
For interactive processing, 10ms versus 2 seconds is the difference between "instant" and "noticeable."
Process 100 documents and you're looking at 1 second vs. 3 minutes.
Time compounds. Latency costs attention.
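If you want to check the regex side of that comparison on your own machine, a timing harness can be as small as the sketch below. The synthetic document, the three-pattern subset, and the `scan` and `time_scan` helpers are all assumptions for illustration; this is not the benchmark behind the table, and your numbers will differ.

```python
import re
import time

# A subset of the patterns, precompiled once.
PATTERNS = {
    "ssn": re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    "email": re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
    "dob": re.compile(
        r'\b(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12]\d|3[01])[/-](?:19|20)\d{2}\b'
    ),
}

def scan(text: str) -> dict[str, list[str]]:
    """Run every pattern over the text and collect matches by label."""
    return {label: rx.findall(text) for label, rx in PATTERNS.items()}

def time_scan(text: str, repeats: int = 5) -> float:
    """Best-of-N wall-clock time for one full scan, in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        scan(text)
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

# Synthetic stand-in for a long document: filler text plus one record per "page".
page = "Lorem ipsum dolor sit amet. " * 100 + "SSN 123-45-6789, jane@example.com, 01/02/1990.\n"
document = page * 50

print(f"{len(document):,} characters scanned in {time_scan(document):.2f} ms")
```

Taking the best of several runs filters out one-off scheduling noise, which matters when the thing being measured finishes in milliseconds.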
The Trust Model
import re # That's it. That's the dependency.
Regex lives in Python's standard library. No API keys. No internet connection. No model downloads. No data leaving your machine.
For a privacy tool, this matters doubly. Users handling sensitive documents may be uncomfortable sending content to external APIs. "Trust us" is easy to say. Local-only processing makes it unnecessary to say at all.
The best security promise is the one you don't have to make because the architecture makes it irrelevant.
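To make the local-only point concrete, here is a minimal redaction sketch that never touches the network. The `redact` helper, the bracketed placeholder labels, and the two-pattern subset are assumptions for illustration, not a description of how any particular tool does it.

```python
import re  # standard library only: no API keys, no downloads

PATTERNS = {
    "SSN": re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    "EMAIL": re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
}

def redact(text: str) -> str:
    """Replace each match with a bracketed label, entirely in-process."""
    for label, rx in PATTERNS.items():
        text = rx.sub(f"[{label}]", text)
    return text

print(redact("Reach Ana at ana@example.org, SSN 123-45-6789."))
# → Reach Ana at [EMAIL], SSN [SSN].
```

Everything happens in one process on one machine, so there is no request to log and no payload to intercept.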
When AI Actually Wins
Regex isn't always the answer. I'm not that dogmatic.
- Ambiguous names in context. "Apple sent the document" vs. "Apple is delicious." Regex can't distinguish companies from fruit. Context requires computation.
- Multiple languages. My patterns assume American formats. International phone numbers, non-Western names, and varied date conventions need either many more patterns or smarter detection.
- High-stakes accuracy. If missing one SSN means regulatory violation, AI's marginal accuracy advantage may be worth the latency cost.
- Unusual document formats. In legal documents, medical records, and other specialized formats, patterns may not match.
For Anancy, targeting the 80% case with simple tools was the right call. Different product, different requirements, different answer.
The Takeaway
"Simplicity is the ultimate sophistication."
— Leonardo da Vinci
Regex isn't dead. It's just unfashionable.
For structured data matching, regex offers:
- Speed: Milliseconds, not seconds
- Simplicity: Standard library, zero dependencies
- Transparency: Patterns are readable and debuggable
- Privacy: Local processing, no external calls
AI-based detection is genuinely better for ambiguous cases. But most PII isn't ambiguous—it has structure we can exploit.
The question isn't "which is better?" It's "which is appropriate for this specific use case?"
Sometimes the boring solution is the right solution.
Sometimes the master's tools work just fine—because you're not trying to dismantle anything. You're just trying to protect some data.
Part 4 of "The Craft of Small Tools" series.
Textstone Labs — AI implementation for people who build things.