Ben Collver
2025-01-23 15:16:38 UTC
How Unix Spell Ran in 64kB RAM
==============================
by Abhinav Upadhyay

How do you fit a 250kB dictionary in 64kB of RAM and still perform
fast lookups? For reference, even with modern compression techniques
like gzip -9, you can't compress this file below 85kB.

In the 1970s, Douglas McIlroy faced this exact challenge while
implementing the spell checker for Unix at AT&T. The constraints of
the PDP-11 computer meant the entire dictionary needed to fit in just
64kB of RAM. A seemingly impossible task.

Instead of relying on generic compression techniques, he took
advantage of the properties of the data and developed a compression
algorithm that came within 0.03 bits of the theoretical limit of
possible compression. To this day, it remains unbeaten.

The story of Unix spell is more than just historical curiosity. It's
a masterclass in engineering under constraints: how to analyze a
problem from first principles, leverage mathematical insights, and
design elegant solutions that work within strict resource limits.

If you're short on time, here's the key engineering story:
* The Unix spell checker started in the 1970s as an afternoon
prototype by Steve Johnson at AT&T, before Douglas McIlroy rewrote it
to improve its performance and accuracy.
* McIlroy's first innovation was a clever linguistics-based stemming
algorithm that reduced the dictionary to just 25,000 words while
improving accuracy (a toy affix-stripping sketch appears after this
list).
* For fast lookups, he initially used a Bloom filter--perhaps one of
its first production uses. Interestingly, Dennis Ritchie provided
the implementation. They tuned it to have such a low false positive
rate that they could skip actual dictionary lookups (see the Bloom
filter sketch below).
* When the dictionary grew to 30,000 words, the Bloom filter approach
became impractical, leading to innovative hash compression
techniques.
* They computed that 27-bit hash codes would keep collision
probability acceptably low, but needed compression (the numbers are
worked out in a short calculation below).
* McIlroy's solution was to store differences between sorted hash
codes, after discovering these differences followed a geometric
distribution.
* Using Golomb's code, a compression scheme designed for geometric
distributions, he achieved 13.60 bits per word--remarkably close to
the theoretical minimum of 13.57 bits (a toy Golomb coder follows
this list).
* Finally, he partitioned the compressed data to speed up lookups,
trading a small memory increase (final size ~14 bits per word) for
significantly faster performance (see the partitioning sketch
below).
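
To make these points a bit more concrete, the sketches below
illustrate each technique in Python. They are illustrative
approximations, not McIlroy's code: every size, parameter, suffix
rule, and hash scheme in them is an assumption made for the example.
First, the stemming idea: rather than storing every inflected form,
store stems and strip common affixes from the query word before
looking it up.

# Toy affix-stripping lookup (illustrative rules, not McIlroy's).
# Try the word as typed, then with common suffixes removed, and
# accept it if any candidate stem is in the stored word list.
SUFFIXES = ("ies", "ing", "ed", "es", "s", "ly")

def candidates(word):
    yield word
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            stem = word[:-len(suf)]
            yield stem
            if suf == "ies":
                yield stem + "y"          # e.g. "babies" -> "baby"

def is_spelled_ok(word, stored_words):
    return any(c in stored_words for c in candidates(word))

stems = {"walk", "spell", "baby"}
print(is_spelled_ok("walked", stems))     # True: "ed" stripped
print(is_spelled_ok("babies", stems))     # True: "ies" -> "y"
print(is_spelled_ok("wlaked", stems))     # False

McIlroy's actual rules handled prefixes, spelling changes, and long
exception lists; the point here is only that affix stripping shrinks
the stored word list.
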
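
Next, the Bloom filter: every dictionary word sets k bits in an m-bit
array, and a query is accepted when all of its k bits are set, so
false negatives are impossible and the false-positive rate is tuned
by the choice of m and k. The bit-array size and hashing scheme below
are made up for the sketch; they are not the values or hash functions
Ritchie implemented.

import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, word):
        # Derive k bit positions from one digest (an illustrative
        # scheme, not the hashing used in Unix spell).
        digest = hashlib.sha256(word.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word):
        # False means "definitely not stored"; True means "probably
        # stored" (false positives possible, false negatives not).
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(word))

bf = BloomFilter(m_bits=400_000, k_hashes=4)   # sizes are made up
for w in ("hello", "world", "spell"):
    bf.add(w)
print(bf.might_contain("spell"))   # True
print(bf.might_contain("spel"))    # almost certainly False
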
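
The 27-bit hash size and the 13.57-bit lower bound both come from
short calculations. The version below assumes 30,000 words and
uniformly distributed hash codes; the exact figures in McIlroy's
analysis may differ slightly.

from math import comb, log2

n = 30_000        # dictionary size after stemming
space = 2 ** 27   # number of possible 27-bit hash codes

# Chance that a random misspelling hashes onto one of the n stored
# codes, i.e. the rate at which wrong words get wrongly accepted.
print(f"false-accept rate ~ 1 in {space // n}")    # ~ 1 in 4473

# Information-theoretic minimum for storing an arbitrary n-element
# subset of a 2^27 universe: log2(C(2^27, n)) / n bits per word.
bound = log2(comb(space, n)) / n
print(f"lower bound ~ {bound:.2f} bits per word")  # ~ 13.57
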
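
The compression step: sort the hash codes, keep only the gaps between
consecutive codes (for uniformly random codes the gaps are
approximately geometrically distributed, with mean about
2^27 / 30,000 ~ 4474), and Golomb-encode each gap. Below is a
textbook Golomb coder; the parameter choice of roughly 0.69 times the
mean gap is the standard rule of thumb, not necessarily McIlroy's
exact value, and the measured cost comes out near the bound.

import random
from math import ceil, log2

def golomb_encode(x, m):
    """Golomb code for a nonnegative integer x with parameter m:
    unary quotient, then truncated-binary remainder."""
    q, r = divmod(x, m)
    bits = "1" * q + "0"                # q in unary, terminated by 0
    b = ceil(log2(m))
    cutoff = (1 << b) - m               # truncated-binary threshold
    if r < cutoff:
        bits += format(r, f"0{b - 1}b")
    else:
        bits += format(r + cutoff, f"0{b}b")
    return bits

def delta_golomb(sorted_codes, m):
    """Encode the gaps between consecutive sorted hash codes."""
    out, prev = [], 0
    for h in sorted_codes:
        out.append(golomb_encode(h - prev, m))
        prev = h
    return "".join(out)

random.seed(1)
codes = sorted(random.sample(range(1 << 27), 30_000))  # stand-in hashes
mean_gap = (1 << 27) // 30_000                         # ~ 4473
m = round(0.69 * mean_gap)                             # rule-of-thumb parameter
stream = delta_golomb(codes, m)
print(f"{len(stream) / len(codes):.2f} bits per word") # ~ 13.6-13.7

Looking up a word in this stream means decoding gaps one by one until
you reach or pass the query's hash, which is the problem the
partitioning step addresses.
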
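
Finally, partitioning: keep a small uncompressed index holding each
block's first hash code, binary-search that index, and examine only
the one block that could contain the query. In the real layout each
block would be a Golomb-compressed gap stream; plain lists are used
here so the sketch stays self-contained, and the block size is made
up. The index is the small overhead that pushes the cost from 13.60
to roughly 14 bits per word.

import bisect
import random

def build_partitions(sorted_codes, block_size=64):
    """Split sorted hash codes into fixed-size blocks and index each
    block by its first code.  (In the real layout the blocks would be
    stored compressed; plain lists keep this sketch runnable.)"""
    blocks = [sorted_codes[i:i + block_size]
              for i in range(0, len(sorted_codes), block_size)]
    index = [blk[0] for blk in blocks]
    return index, blocks

def contains(index, blocks, h):
    # Binary-search the small index, then look only inside the one
    # block whose range could include h.
    i = bisect.bisect_right(index, h) - 1
    return i >= 0 and h in blocks[i]

random.seed(1)
codes = sorted(random.sample(range(1 << 27), 30_000))  # stand-in hashes
index, blocks = build_partitions(codes)
print(contains(index, blocks, codes[12_345]))      # True
print(contains(index, blocks, codes[12_345] + 1))  # almost certainly False
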
The rest of the article expands each of these points and gives a
detailed explanation with all the math and logic behind them.
From: <https://blog.codingconfessions.com/p/how-unix-spell-ran-in-64kb-ram>