I Don't Have Time to Win the Hutter Prize, So Maybe You'd Like to Snag 500'000€ With My Idea

Dec 15, 2024 @ 1:56 pm

Data Compression I am about to give you a nice bundle of loot or an excellent example of the Dunning-Kruger Effect.

Let’s start with the loot part. Back in 2006, Marcus Hutter created the Hutter Prize (Wikipedia, home page), which is compression competition. Participants seek to compress a 1GB file smaller and smaller, and each 1% gain in compression efficiency results in 1% of the prize money. The most recent winner took the file to 110,793,128 bytes, down from 112,578,322 bytes, and was paid 7,950€.

Naturally, this is not just gzip -9, or even recursive gzip. Hutter feels that this kind of compression research will make tangible contributions to AI research, as they claim “the better you can compress, the better you can predict”.

So what’s needed to win?

Minimal programming skills are required to write programs that compress files somewhat, but to win the contest, you first need to understand the most important existing data compression ideas, concepts, and basic algorithms, and then the state-of-the-art compressors which build on these (by refining, adapting, and combining them). Maybe you have a new great idea of your own, but by itself without combining it with existing ideas, you have no chance of winning.

Their FAQ has some excellent links.

The Dunning-Kruger Effect

The Dunning-Kruger effect “is a cognitive bias in which people with limited competence in a particular domain overestimate their abilities.”

In my case, I’m a reasonably good programmer who has read a lot of computer science over year. However, I do not have a degree in CS or any other scientific discipline, nor have I ever written any data compression code. So naturally, the moment I have an idea, I think it’s revoutionary.

This sort of D-K effect is very common in the world. A famous example is Stan Pons and Martin Fleischmann’s cold fusion saga. Both men were excellent, maybe even brilliant electrochemists, but they were completely out of their depth in nuclear fusion. Actual physicists spent very little time looking at their work because it was immediately obvious to them that they were wrong.

My Idea

In my case, I recognize I’m entering the D-K zone. Here’s how I might approach the Hutter prize.

Let me start by pointing out that the compressor can be purpose-built. It can work fantastically on the Wikipedia excerpt used in the contest and be poor on general text.

The ultimate solution would be to analyze the data and perfect calculate which patterns are repeated the most. So you’d go through and say “here’s a 200-character pattern that appears 20 times, and we’ll replace that with this symbol that doesn’t appear at all”. The replacement symbols can be 1-byte, 2-byte, etc. – whatever is not in the text now.

So start by finding the longest repeated substring, for which there are already existing algorithms. Copilot or ChatGPT can write code that will do this for you. Let’s say (I have no idea) that the longest repeated substring is 800 characters. So you build a database of 800-character repeated substrings, 799-character, etc. all the way down.

Next, you analyze the combinations to say which strings can I replace and not disturb other substrings. So that 800-character string that’s repeated 3 times ultimately only costs you 800 characters (the decompression dictionary entry) plus a one-character symbol. If it appears 3 times, that’s 2400 characters replaced with 803 characters. Then continue to work your down.

Of course, the tricky part is that a 50-character substring that’s repeated 50 times might be in a 500-character substring that’s repeated once, etc. so you’d need to be constantly recalculating your database after every replacement.

And I’m sure there are a lot more complications, but…hey, I’m admitting I’m a Dunning-Kruger candidate. You might be a legit computer scientist.

The Man Who Was Paid €113,000 For His Code Which Compressed Entire Movies in 8KB of Disk...And Then ...

How Much Faster Is Making A Tar Archive Without Gzip?

raindog308

Raindog308 is a longtime LowEndTalk community administrator, technical writer, and self-described techno polymath. With deep roots in the *nix world, he has a passion for systems both modern and vintage, ranging from Unix, Perl, Python, and Golang to shell scripting and mainframe-era operating systems like MVS. He’s equally comfortable with relational database systems, having spent years working with Oracle, PostgreSQL, and MySQL.

As an avid user of LowEndBox providers, Raindog runs an empire of LEBs, from tiny boxes for VPNs, to mid-sized instances for application hosting, and heavyweight servers for data storage and complex databases. He brings both technical rigor and real-world experience to every piece he writes.

Beyond the command line, Raindog is a lover of German Shepherds, high-quality knives, target shooting, theology, tabletop RPGs, and hiking in deep, quiet forests.

His goal with every article is to help users, from beginners to seasoned sysadmins, get more value, performance, and enjoyment out of their infrastructure.

You can find him daily in the forums at LowEndTalk under the handle @raindog308.