sanitext – Remove LLM-generated Text Fingerprints
The Hidden Fingerpint
How many differences can you spot between these two?
- I'm AI. (Normal text)
− І’m󠅘󠅟󠅜󠅑 ΑІ.󠅓󠅙󠅑󠅟 (AI-tainted text)
You probably noticed that the second quote looks a bit different? Actually, every single character in the second string is "tainted" in some way!
-
−
(U+2212) is a minus sign instead of a dash -
-
І
(U+406) is a cyrillic letter instead of the latin “I” -
’
(U+2019) is a right single quoation mark instead of a regular quote 1 -
m󠅘󠅟󠅜󠅑
is the letter ‘m’ followed by hidden variation selectors, which encode the message "hola" (more info here) -
-
Α
(U+391) is a greek letter -
І
(U+406) is cyrillic -
.󠅓󠅙󠅑󠅟
is the period ‘.’ with the encoded message "ciao" using variation selectors
You’d never notice these with the naked eye, but they could be used to detect AI-generated text.
Which raises the question: if you mostly need an LLM to generate ASCII (code, emails, ..), why allow unnecessary Unicode? And if you do need specific Unicode characters, why not just explicitly define which ones?
Will Paste Special (Cmd+Shift+V) Fix This? In some apps, "Paste Special" strips formatting and replaces fancy typography with plain ASCII. But it won’t always remove invisible characters or homoglyphs. For example, I tried Paste Special in Sublime Text, and nothing was removed—but the thin space (U+2009) was clearly visible.
Why I Built This
This isn’t a claim that major LLMs do all (or any) of these tricks. That said, I started working on this because I accidentally discovered an instance of text fingerprinting while debugging a byte-sensitive bug. That’s when I realized: it’s time to say goodbye to (at least these kinds of) fingerprints for good. 🙂
Even if the LLM itself doesn’t add these, a wrapper around it might. This tool is for anyone who wants clean, reliable text, free from invisible garbage hiding inside.
Zooming out: Watermarks & Fingerprints
AI-generated content, in general, can be marked, sometimes invisibly, to identify its origin. A few examples:
-
Latent Space Fingerprints - AI models leave behind subtle statistical patterns. Their text generation follows a probability distribution, creating a kind of stochastic signature (e.g., by occasionally using certain rare words).
-
Watermarking – Models can sneak in patterns in text, images, or audio. Think of it as an AI signature, hidden but detectable.
-
Metadata Encoding – Some AI-generated media comes with metadata baked in (e.g., image EXIF data).
-
Steganography – Messages can be embedded within the output, only detectable with the right key.
The Problem: LLM systems can Fingerprint their Outputs with Unicode Shenanigans
LLM systems can inject subtle fingerprints into their outputs using Unicode tricks. These can include homoglyphs (characters that look identical but aren’t), invisible characters, and other Unicode weirdness–all embedded in AI-generated text for later detection.
If AI-generated text is secretly marked, your emails, reports, or even tweets might be screaming "AI-generated" without you realizing it–though a filter might.
How do I remove this: sanitext
sanitext
is a simple command-line tool and Python library that:
✔ Detects suspicious Unicode characters (defaults to ASCII-only)
✔ Normalizes lookalike characters to ASCII equivalents
✔ Removes characters with no lookalikes
✔ Supports custom character allowlists (--allow-chars
, --allow-file
)
✔ Works with your clipboard-no manual pasting needed (unless you want to)
How It Works
sanitext
cleans up funky Unicode text using Unicode normalization (NFKC), homoglyph mapping, and a special cleanup process: if not a known homoglyph, decompose the Unicode character and check for ASCII characters in the decomposition, e.g., fi
(U+FB01) decomposes to f
+ i
.
It also flags suspicious characters so you can see exactly what’s causing trouble. You can also allow specific Unicode characters or single-codepoint emojis.
🚀 Install in Seconds
pip install sanitext
Now you can use it straight from the command line.
CLI usage example
# Process the clipboard content & copy back to clipboard
sanitext
# Detect characters but don't modify
sanitext --detect
# Process clipboard + show detected characters (most common command)
sanitext -v
# Process clipboard + show input, detected characters & output
sanitext -vv
# Process the provided string and print it
sanitext --string "Héllø, 𝒲𝑜𝓇𝓁𝒹!"
# Allow additional characters (for now, only single unicode code point characters)
sanitext --allow-chars "αøñç"
# Allow characters from a file
sanitext --allow-file allowed_chars.txt
# Allow single code point emoji
sanitext --allow-emoji
# Prompt user for handling disallowed characters
# y (Yes) -> keep it
# n (No) -> remove it
# r (Replace) -> provide a replacement character
sanitext --interactive
# Allow emojis
sanitext --allow-emoji
Python library usage example
from sanitext.text_sanitization import (
sanitize_text,
detect_suspicious_characters,
get_allowed_characters,
)
text = "“2×3 – 4 = 5”😎󠅒󠅟󠅣󠅣"
# Detect suspicious characters
suspicious_characters = detect_suspicious_characters(text)
print(f"Suspicious characters: {suspicious_characters}")
# [('“', 'LEFT DOUBLE QUOTATION MARK'), ('×', 'MULTIPLICATION SIGN'), ('–', 'EN DASH'), ('”', 'RIGHT DOUBLE QUOTATION MARK')]
# Sanitize text to all ASCII
sanitized_text = sanitize_text(text)
print(f"Sanitized text: {sanitized_text}") # "2x3 - 4 = 5"
# Allow the multiplication sign
allowed_characters = get_allowed_characters()
allowed_characters.add("×")
sanitized_text = sanitize_text(text, allowed_characters=allowed_characters)
print(f"Sanitized text: {sanitized_text}") # "2×3 - 4 = 5"
# Allow the emoji (but clean it from the encoded message "boss")
allowed_characters = get_allowed_characters(allow_emoji=True)
sanitized_text = sanitize_text(text, allowed_characters=allowed_characters)
print(f"Sanitized text: {sanitized_text}") # "2x3 - 4 = 5"😎
📂 Want to See the Code?
Check out the code here: https://github.com/panispani/sanitext.
Feel free to fork it, break it, improve it, or just use it to make your life easier.
-
The right single quotation mark
’
, while unusual on US keyboards, is automatically used to replace single quotes on iPhones if Smart Punctuation is enabled. However, I suspect that many of the long LinkedIn posts I notice this in weren’t written on an iPhone :) ↩
Enjoy Reading This Article?
Here are some more articles you might like to read next: