Ben Collver
2025-01-23 15:16:38 UTC
How Unix Spell Ran in 64kB RAM
==============================
by Abhinav Upadhyay

How do you fit a 250kB dictionary in 64kB of RAM and still perform
fast lookups? For reference, even with modern compression techniques
like gzip -9, you can't compress this file below 85kB.

In the 1970s, Douglas McIlroy faced this exact challenge while
implementing the spell checker for Unix at AT&T. The constraints of
the PDP-11 computer meant the entire dictionary needed to fit in just
64kB of RAM. A seemingly impossible task.

Instead of relying on generic compression techniques, he took
advantage of the properties of the data and developed a compression
algorithm that came within 0.03 bits of the theoretical limit of
possible compression. To this day, it remains unbeaten.

The story of Unix spell is more than just historical curiosity. It's
a masterclass in engineering under constraints: how to analyze a
problem from first principles, leverage mathematical insights, and
design elegant solutions that work within strict resource limits.

If you're short on time, here's the key engineering story:
* The Unix spell checker started in the 1970s as an afternoon
prototype by Steve Johnson at AT&T, before Douglas McIlroy rewrote it
to improve its performance and accuracy.
* McIlroy's first innovation was a clever linguistics-based stemming
algorithm that reduced the dictionary to just 25,000 words while
improving accuracy (a toy affix-stripping sketch appears after this
list).
* For fast lookups, he initially used a Bloom filter--perhaps one of
its first production uses. Interestingly, Dennis Ritchie provided
the implementation. They tuned it to have such a low false positive
rate that they could skip actual dictionary lookups (see the Bloom
filter sketch below).
* When the dictionary grew to 30,000 words, the Bloom filter approach
became impractical, leading to innovative hash compression
techniques.
* They computed that 27-bit hash codes would keep collision
probability acceptably low, but needed compression (the numbers are
worked out in a short calculation below).
* McIlroy's solution was to store differences between sorted hash
codes, after discovering these differences followed a geometric
distribution.
* Using Golomb's code, a compression scheme designed for geometric
distributions, he achieved 13.60 bits per word--remarkably close to
the theoretical minimum of 13.57 bits (a toy Golomb coder follows
this list).
* Finally, he partitioned the compressed data to speed up lookups,
trading a small memory increase (final size ~14 bits per word) for
significantly faster performance (see the partitioning sketch
below).
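
To make these points a bit more concrete, the sketches below
illustrate each technique in Python. They are illustrative
approximations, not McIlroy's code: every size, parameter, suffix
rule, and hash scheme in them is an assumption made for the example.
First, the stemming idea: rather than storing every inflected form,
store stems and strip common affixes from the query word before
looking it up.

# Toy affix-stripping lookup (illustrative rules, not McIlroy's).
# Try the word as typed, then with common suffixes removed, and
# accept it if any candidate stem is in the stored word list.
SUFFIXES = ("ies", "ing", "ed", "es", "s", "ly")

def candidates(word):
    yield word
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            stem = word[:-len(suf)]
            yield stem
            if suf == "ies":
                yield stem + "y"          # e.g. "babies" -> "baby"

def is_spelled_ok(word, stored_words):
    return any(c in stored_words for c in candidates(word))

stems = {"walk", "spell", "baby"}
print(is_spelled_ok("walked", stems))     # True: "ed" stripped
print(is_spelled_ok("babies", stems))     # True: "ies" -> "y"
print(is_spelled_ok("wlaked", stems))     # False

McIlroy's actual rules handled prefixes, spelling changes, and long
exception lists; the point here is only that affix stripping shrinks
the stored word list.
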
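
Next, the Bloom filter: every dictionary word sets k bits in an m-bit
array, and a query is accepted when all of its k bits are set, so
false negatives are impossible and the false-positive rate is tuned
by the choice of m and k. The bit-array size and hashing scheme below
are made up for the sketch; they are not the values or hash functions
Ritchie implemented.

import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, word):
        # Derive k bit positions from one digest (an illustrative
        # scheme, not the hashing used in Unix spell).
        digest = hashlib.sha256(word.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word):
        # False means "definitely not stored"; True means "probably
        # stored" (false positives possible, false negatives not).
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(word))

bf = BloomFilter(m_bits=400_000, k_hashes=4)   # sizes are made up
for w in ("hello", "world", "spell"):
    bf.add(w)
print(bf.might_contain("spell"))   # True
print(bf.might_contain("spel"))    # almost certainly False
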
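
The 27-bit hash size and the 13.57-bit lower bound both come from
short calculations. The version below assumes 30,000 words and
uniformly distributed hash codes; the exact figures in McIlroy's
analysis may differ slightly.

from math import comb, log2

n = 30_000        # dictionary size after stemming
space = 2 ** 27   # number of possible 27-bit hash codes

# Chance that a random misspelling hashes onto one of the n stored
# codes, i.e. the rate at which wrong words get wrongly accepted.
print(f"false-accept rate ~ 1 in {space // n}")    # ~ 1 in 4473

# Information-theoretic minimum for storing an arbitrary n-element
# subset of a 2^27 universe: log2(C(2^27, n)) / n bits per word.
bound = log2(comb(space, n)) / n
print(f"lower bound ~ {bound:.2f} bits per word")  # ~ 13.57
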
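
The compression step: sort the hash codes, keep only the gaps between
consecutive codes (for uniformly random codes the gaps are
approximately geometrically distributed, with mean about
2^27 / 30,000 ~ 4474), and Golomb-encode each gap. Below is a
textbook Golomb coder; the parameter choice of roughly 0.69 times the
mean gap is the standard rule of thumb, not necessarily McIlroy's
exact value, and the measured cost comes out near the bound.

import random
from math import ceil, log2

def golomb_encode(x, m):
    """Golomb code for a nonnegative integer x with parameter m:
    unary quotient, then truncated-binary remainder."""
    q, r = divmod(x, m)
    bits = "1" * q + "0"                # q in unary, terminated by 0
    b = ceil(log2(m))
    cutoff = (1 << b) - m               # truncated-binary threshold
    if r < cutoff:
        bits += format(r, f"0{b - 1}b")
    else:
        bits += format(r + cutoff, f"0{b}b")
    return bits

def delta_golomb(sorted_codes, m):
    """Encode the gaps between consecutive sorted hash codes."""
    out, prev = [], 0
    for h in sorted_codes:
        out.append(golomb_encode(h - prev, m))
        prev = h
    return "".join(out)

random.seed(1)
codes = sorted(random.sample(range(1 << 27), 30_000))  # stand-in hashes
mean_gap = (1 << 27) // 30_000                         # ~ 4473
m = round(0.69 * mean_gap)                             # rule-of-thumb parameter
stream = delta_golomb(codes, m)
print(f"{len(stream) / len(codes):.2f} bits per word") # ~ 13.6-13.7

Looking up a word in this stream means decoding gaps one by one until
you reach or pass the query's hash, which is the problem the
partitioning step addresses.
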
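
Finally, partitioning: keep a small uncompressed index holding each
block's first hash code, binary-search that index, and examine only
the one block that could contain the query. In the real layout each
block would be a Golomb-compressed gap stream; plain lists are used
here so the sketch stays self-contained, and the block size is made
up. The index is the small overhead that pushes the cost from 13.60
to roughly 14 bits per word.

import bisect
import random

def build_partitions(sorted_codes, block_size=64):
    """Split sorted hash codes into fixed-size blocks and index each
    block by its first code.  (In the real layout the blocks would be
    stored compressed; plain lists keep this sketch runnable.)"""
    blocks = [sorted_codes[i:i + block_size]
              for i in range(0, len(sorted_codes), block_size)]
    index = [blk[0] for blk in blocks]
    return index, blocks

def contains(index, blocks, h):
    # Binary-search the small index, then look only inside the one
    # block whose range could include h.
    i = bisect.bisect_right(index, h) - 1
    return i >= 0 and h in blocks[i]

random.seed(1)
codes = sorted(random.sample(range(1 << 27), 30_000))  # stand-in hashes
index, blocks = build_partitions(codes)
print(contains(index, blocks, codes[12_345]))      # True
print(contains(index, blocks, codes[12_345] + 1))  # almost certainly False
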
The rest of the article expands each of these points and gives a
detailed explanation with all the math and logic behind them.
From: <https://blog.codingconfessions.com/p/how-unix-spell-ran-in-64kb-ram>