Compressing the white whale: how to learn by re-inventing.

Short Form


"So this is where I made my biggest mistake, and I still chuckle about it to this day. I noticed that sometimes, when I tried to convert a word into a number, the number was actually longer than the original word. So I got clever and said, 'Hey, if the word is shorter, just use the word.' Hey hey, awesome, another small efficiency. What was wrong with that? Bingo: that number was already in use. So now Moby Dick compresses beautifully into an even smaller file, and even looks okay at first glance, until you try to read it and see that about half the sentences are now gibberish."

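The mistake in the quote can be reproduced in a few lines. This is only a minimal sketch, not the script from the talk: it assumes each word's code is its frequency rank written in base 36 (so a code can look like a short word), and every name in it (to_base36, build_codebook, encode, decode) is made up for illustration.

```python
from collections import Counter

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n):
    """Write a non-negative integer using the digits 0-9 and a-z."""
    if n == 0:
        return "0"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out

def build_codebook(words):
    """Assumed scheme: the most frequent words get the shortest codes."""
    ranked = [w for w, _ in Counter(words).most_common()]
    return {w: to_base36(i) for i, w in enumerate(ranked)}

def encode(words, codebook, keep_short_words):
    out = []
    for w in words:
        code = codebook[w]
        if keep_short_words and len(w) < len(code):
            out.append(w)      # the "clever" shortcut: emit the raw word
        else:
            out.append(code)
    return out

def decode(tokens, codebook):
    reverse = {code: w for w, code in codebook.items()}
    # Anything that is not a known code is assumed to be a raw word.
    return [reverse.get(t, t) for t in tokens]

# Toy input: 36 filler words that each appear twice, then one rare "a".
fillers = [f"word{i}" for i in range(36)]
words = fillers * 2 + ["a"]
book = build_codebook(words)

safe = decode(encode(words, book, keep_short_words=False), book)
broken = decode(encode(words, book, keep_short_words=True), book)
print(safe == words)   # True
print(broken[-1])      # word10
```

In this toy input the rare word "a" gets the two-character code "10", so the shortcut writes the raw "a" instead. But "a" is also the code already assigned to word10, so the decoder quietly swaps one word for another, which is how readable text turns into gibberish.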

I’ll show you how I tried to design a more efficient compression system for large English texts. In doing so, I discovered how much you can learn by trying to solve a problem that other, smarter people have already solved, and far more effectively than you ever will.

It started with a Twitter argument: I thought it was possible to find some efficiencies in text compression.

I built a little script that did okay, and within a few steps my output was already smaller than straight .zip compression.

I hit some pitfalls, made some mistakes, and asked some silly questions about what constitutes ‘unchanged’ text (are capital letters really necessary?).

In the end, of course, there are far better compression schemes available, with mathematical proofs of why they can’t be improved upon. But I still learned a ton, and it made me far less afraid to try building something from the ground up that others have already built.


compression, node, python, Moby Dick

Speaking experience

I've presented to large classrooms of over 65 students on technical and design topics. I've also worked as a classroom teacher and given technical training to internal teams.