In the most recent NFL draft, Dallas Cowboys team owner Jerry Jones was very pleased that the team had gotten the picks they wanted. When asked by the press about a player, he proudly held up the team’s confidential draft priority board to show how clever they’d been. Staff were horrified and motioned for him to put the sheet down lest he reveal more of their strategies.
There’s a certain class of people – largely egotistical blowhards – who can’t resist showing people how smart they are. It’s not enough to quietly win the victory and reap the rewards. They want the win and the ego buff of dazzling everyone with their intelligence. “Not only did I defeat you, but here is how I did it. You tried A, but I knew you would, so I tried B, and then I also had C, D, and E prepared…”
This sort of crowing is nearly always bad strategy. It also indicates all sorts of psychological weaknesses in the braggart, but the fundamental problem is that once you reveal your methods, they’re revealed. In the world of intelligence, coups and triumphs are never celebrated. Spy rings that obtain vital secrets don’t write tell-all books about how clever they are. If the CIA plants a bug in a foreign embassy, the President doesn’t nudge that country’s leader in the ribs at the next summit and say “you know, we snuck into your embassy, you wouldn’t believe how much we know!”
That is quite an interesting story. We sent what appeared to be identical emails to all, but each was actually coded with either one or two spaces between sentences, forming a binary signature that identified the leaker.
— Elon Musk (@elonmusk) October 9, 2022
Perturbing Text
There are plenty of boring, “uninteresting” methods to find who authored a document. For example, the BTK Killer was identified because he sent a floppy disk to the police taunting them, which contained a Word document with both his employer and first name in the metadata.
But as mentioned in the tweet above, Musk trumpets how they used an “interesting” technique to find a leaker. We’re just talking about the text itself, not its metadata, and we’re identifying who distributed a document, not just authored it.
I hate to break this news to Mr. Musk, but this technique he trumpets goes back well before he was born. Before computers. Before typewriters.
And the way he went about it here is not clever. In fact, there are much more clever methods he’s apparently ignorant about.
The idea is so old that it’s hard to know which historical examples to search for. The basic idea is that you have text like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
These two paragraphs are identical except that in the first one, there’s an extra space after the first sentence. No human – unless carefully looking for it – will see that when reading. Imagine this is, say, 20 paragraphs long. You want to send copies to a hundred people but you want to know if any of them share it. So you go through the text and perturb the text slightly. You move a period here and there. You “accidentally” change a comma to a semi-colon. There are many other ways, and you don’t need typesetting to accomplish this. You could do it with pen and paper.
Instead of 100 identical copies, you have 100 different copies that are hard to tell apart. Once you find a leaked copy in the wild, it’s easy to say “oh, it was Bill Gates who got the one that was missing a comma in the 14th sentence”.
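To make the one-space-or-two trick concrete, here is a minimal sketch of how a recipient number could be hidden in the gaps between sentences. It is an illustration only, not the scheme anyone actually used: the function names `embed_id` and `extract_id`, the bit order, and the naive sentence splitter are all assumptions.

```python
import re

# Naive sentence boundary: '.', '!' or '?' followed by one or more spaces.
# This is an assumption; text with abbreviations needs a smarter splitter.
BOUNDARY = re.compile(r"(?<=[.!?]) +")


def embed_id(text: str, recipient_id: int) -> str:
    """Hide recipient_id in the sentence gaps: one space = 0, two spaces = 1."""
    sentences = BOUNDARY.split(text.strip())
    n_gaps = len(sentences) - 1
    if not 0 <= recipient_id < 2 ** n_gaps:
        raise ValueError(f"{n_gaps} gaps cannot encode id {recipient_id}")
    out = [sentences[0]]
    for i, sentence in enumerate(sentences[1:]):
        bit = (recipient_id >> i) & 1  # least significant bit goes first
        out.append(" " * (1 + bit) + sentence)
    return "".join(out)


def extract_id(text: str) -> int:
    """Read the id back from a leaked copy by measuring each gap."""
    gaps = BOUNDARY.findall(text.strip())
    return sum((1 if len(gap) >= 2 else 0) << i for i, gap in enumerate(gaps))


memo = "One. Two. Three. Four. Five."
marked = embed_id(memo, 5)
print(repr(marked))
print(extract_id(marked))
```

Each gap carries one bit, so seven sentence gaps are enough to give a hundred recipients a unique copy (2^7 = 128).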
Of course, now that Musk has shot his mouth off, every SpaceX employee will know this technique is used. Sources and methods, Elon.
Musk is Not a Pro
There are ways to defeat this fingerprinting, which shows how amateurish Musk’s team is being here.
The most obvious is to “select all” and then copy/paste into a program that will reformat, such as Microsoft Word. You make the spacing and punctuation uniform, fix the grammar and spelling errors, etc. Then when that “clean” copy is leaked, it can’t be fingerprinted. One can also use OCR.
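The reformatting step does not even need a word processor. As a rough sketch (the helper name `scrub_spacing` is hypothetical), a few lines are enough to flatten any spacing-based mark:

```python
import re


def scrub_spacing(text: str) -> str:
    """Collapse runs of spaces and tabs to one space and trim each line."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(lines)


copy_a = "One.  Two. Three."
copy_b = "One. Two.  Three."
print(scrub_spacing(copy_a) == scrub_spacing(copy_b))
```

This only handles spacing. Altered punctuation or swapped characters would need their own cleanup passes.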
Here, Musk apparently benefitted because the leaker was unsophisticated. If he or she had been more sophisticated, Musk would have been powerless.
He could have used better techniques. For example, you can use homoglyphs. Consider this example (from StackExchange):
cοnfidential confidential confᎥdential confiԁential confidentᎥal
Those look like five identical words, but they’re not. You can visually see (since I’m drawing it to your attention) that the ‘d’ in the fourth word is different. It’s very unlikely your eye would see that if you’re just reading along. And even if it did, hasn’t everyone gotten the occasional PDF with a glitch in it? If your radar isn’t up, you’d easily, maybe even subconsciously, dismiss it.
But all of those “confidential” words are different, as this hex dump will show:
$ for word in cοnfidential confidential confᎥdential confiԁential confidentᎥal ; do echo -n $word | hexdump -C ; done
00000000 63 ce bf 6e 66 69 64 65 6e 74 69 61 6c |c..nfidential|
0000000d
00000000 63 6f 6e 66 69 64 65 6e 74 69 61 6c |confidential|
0000000c
00000000 63 6f 6e 66 e1 8e a5 64 65 6e 74 69 61 6c |conf...dential|
0000000e
00000000 63 6f 6e 66 69 d4 81 65 6e 74 69 61 6c |confi..ential|
0000000d
00000000 63 6f 6e 66 69 64 65 6e 74 e1 8e a5 61 6c |confident...al|
0000000e
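A recipient who suspects homoglyphs does not have to read hex dumps by eye. Here is a small sketch, assuming the document is supposed to be plain English, that flags every non-ASCII character along with its Unicode name. The function name `find_non_ascii` is made up for this example, and the look-alike characters are written as escapes so they survive copy and paste.

```python
import unicodedata


def find_non_ascii(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# The escapes match the bytes in the hex dump: ce bf, e1 8e a5 and d4 81.
samples = [
    "c\u03bfnfidential",
    "confidential",
    "conf\u13a5dential",
    "confi\u0501ential",
]
for word in samples:
    print(repr(word), find_non_ascii(word))
```

Text that legitimately contains curly quotes or accented letters would trip this check too, so treat a hit as a reason to look closer rather than as proof of fingerprinting.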
There’s also the fonts themselves. You can tell Courier from Arial at sight, but what if the author uses a variety of very similar fonts, subtly changing them?
Is There a Foolproof Way to Fingerprint?
Yes…and no.
If you can change the content of the text, then fingerprinting is foolproof. For example, you can change some synonyms (e.g., in one document use “big” and another “large”), or you can modify insignificant data (for example, one report cites a figure of “1.03948” and another “1.03949”). That would survive even if someone decided to retype your entire message.
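Here is a minimal sketch of the synonym approach. The word pairs, the template sentence, and the function names are all invented for illustration. Each pair of synonyms carries one bit of the recipient number, and the mark is read back from the words alone, so spacing and capitalization no longer matter.

```python
import re

# Hypothetical synonym slots: the first word means bit 0, the second bit 1.
SLOTS = [("big", "large"), ("quickly", "rapidly"), ("begin", "start")]
TEMPLATE = "The {0} rollout will {2} soon and proceed {1}."


def make_copy(recipient_id: int) -> str:
    """Build the copy whose word choices spell out recipient_id in binary."""
    if not 0 <= recipient_id < 2 ** len(SLOTS):
        raise ValueError("not enough synonym slots for this id")
    words = [pair[(recipient_id >> i) & 1] for i, pair in enumerate(SLOTS)]
    return TEMPLATE.format(*words)


def identify(leaked: str) -> int:
    """Recover the id from a leaked copy, even one that was retyped."""
    tokens = set(re.findall(r"[a-z]+", leaked.lower()))
    recipient_id = 0
    for i, (_zero_word, one_word) in enumerate(SLOTS):
        if one_word in tokens:
            recipient_id |= 1 << i
    return recipient_id


original = make_copy(5)
retyped = original.upper().replace(" ", "  ")  # simulate a sloppy retype
print(original)
print(identify(retyped))
```

Three slots only distinguish eight recipients. A real document would need more slots, chosen so that either word reads naturally in context.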
Barring that, you are always relying on metadata such as fonts, presentation choices, etc. If the leaker can eliminate metadata, then they can escape fingerprinting. Then again, much of the value in leaking something is the metadata. If I have a document on White House letterhead with graphs and figures from the government, that is going to seem more authentic than a Word document where you have to take my word (no pun intended) that I saw and faithfully retyped something.
Keep in mind that if a leaker can get ahold of two copies of the document – say, his copy and a friend’s – then identifying if fingerprinting is being used is trivial. All you have to do is run a SHA checksum on both files and if they’re different, you will want to know why.
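The two-copy check can be sketched in a few lines. The example below uses in-memory strings so it is self-contained; with real files you would hash the bytes read from disk. The helper names and the sample sentences are assumptions for illustration.

```python
import difflib
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def differing_spans(a: str, b: str) -> list[tuple[str, str]]:
    """Return (text_in_a, text_in_b) for every region where the copies differ."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


mine = "Revenue rose.  Costs fell. Margins held."
friends = "Revenue rose. Costs fell.  Margins held."

if sha256_hex(mine.encode("utf-8")) != sha256_hex(friends.encode("utf-8")):
    # The checksums differ, so find out where and why.
    for left, right in differing_spans(mine, friends):
        print(repr(left), "vs", repr(right))
```

The checksum only tells you that the copies differ. The diff shows where, which is usually enough to work out what kind of fingerprint is in play.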
It’s an arms race, and Musk has committed two errors. First he revealed that he was using fingerprinting, then he revealed he was using a rather lame version of it.
You’re making an assumption: That Musk leaked the actual method, and wasn’t misdirecting.
Also, multiple methods are commonly employed. So while Musk may have revealed one method, the method they caught the leaker on, a Sith Master never reveals everything to his apprentice.