I am about to give you a nice bundle of loot or an excellent example of the Dunning-Kruger Effect.
Let’s start with the loot part. Back in 2006, Marcus Hutter created the Hutter Prize (Wikipedia, home page), which is compression competition. Participants seek to compress a 1GB file smaller and smaller, and each 1% gain in compression efficiency results in 1% of the prize money. The most recent winner took the file to 110,793,128 bytes, down from 112,578,322 bytes, and was paid 7,950€.
Naturally, this is not just gzip -9, or even recursive gzip. Hutter feels that this kind of compression research will make tangible contributions to AI research, as they claim “the better you can compress, the better you can predict”.
So what’s needed to win?
Minimal programming skills are required to write programs that compress files somewhat, but to win the contest, you first need to understand the most important existing data compression ideas, concepts, and basic algorithms, and then the state-of-the-art compressors which build on these (by refining, adapting, and combining them). Maybe you have a new great idea of your own, but by itself without combining it with existing ideas, you have no chance of winning.
Their FAQ has some excellent links.
The Dunning-Kruger Effect
The Dunning-Kruger effect “is a cognitive bias in which people with limited competence in a particular domain overestimate their abilities.”
In my case, I’m a reasonably good programmer who has read a lot of computer science over year. However, I do not have a degree in CS or any other scientific discipline, nor have I ever written any data compression code. So naturally, the moment I have an idea, I think it’s revoutionary.
This sort of D-K effect is very common in the world. A famous example is Stan Pons and Martin Fleischmann’s cold fusion saga. Both men were excellent, maybe even brilliant electrochemists, but they were completely out of their depth in nuclear fusion. Actual physicists spent very little time looking at their work because it was immediately obvious to them that they were wrong.
My Idea
In my case, I recognize I’m entering the D-K zone. Here’s how I might approach the Hutter prize.
Let me start by pointing out that the compressor can be purpose-built. It can work fantastically on the Wikipedia excerpt used in the contest and be poor on general text.
The ultimate solution would be to analyze the data and perfect calculate which patterns are repeated the most. So you’d go through and say “here’s a 200-character pattern that appears 20 times, and we’ll replace that with this symbol that doesn’t appear at all”. The replacement symbols can be 1-byte, 2-byte, etc. – whatever is not in the text now.
So start by finding the longest repeated substring, for which there are already existing algorithms. Copilot or ChatGPT can write code that will do this for you. Let’s say (I have no idea) that the longest repeated substring is 800 characters. So you build a database of 800-character repeated substrings, 799-character, etc. all the way down.
Next, you analyze the combinations to say which strings can I replace and not disturb other substrings. So that 800-character string that’s repeated 3 times ultimately only costs you 800 characters (the decompression dictionary entry) plus a one-character symbol. If it appears 3 times, that’s 2400 characters replaced with 803 characters. Then continue to work your down.
Of course, the tricky part is that a 50-character substring that’s repeated 50 times might be in a 500-character substring that’s repeated once, etc. so you’d need to be constantly recalculating your database after every replacement.
And I’m sure there are a lot more complications, but…hey, I’m admitting I’m a Dunning-Kruger candidate. You might be a legit computer scientist.
Related Posts:
- Perfectionists Welcome!The PQ Hosting Interview (with a Special Discount Code!) - January 21, 2025
- Dropbear in 2025: Still the LowEnd SSH Server of Choice? - January 20, 2025
- “OMG! I Never Knew That!”: The Simply Linux Tip That Has Got Me More Thanks Than Anything I’ve Ever Shared in 30+ Years - January 19, 2025
Leave a Reply