From http://ask.metafilter.com/255675/Decoding-cancer-addled-ramblings:
In my grandmother's final days battling brain cancer, she became unable to speak and she filled dozens of index cards with random letters of the alphabet. I'm beginning to think that they are the first letters in the words of song lyrics, and would love to know what song this was. This is a crazy long shot, but I've seen Mefites pull off some pretty impressive code-breaking before!
This program guesses sentences from initial letters of each word using the unreasonable effectiveness of data. For example, if given the right seed texts, it can decode the input "OFWAIHHBTNTKCTWBDOEAIIIHFUTDODBAFUOT" into "our father who ascended into heaven hallowed be thy name thy kingdom come thy will be done on earth as it is in heaven for us this day our daily bread and forgive us of the"
Inspired by http://norvig.com/ngrams/ch14.pdf.
It's probably easiest to use virtualenv:
$ virtualenv env
Then install nltk and python-gflags:
$ env/bin/pip install nltk python-gflags
Start the program, feeding it some prayer texts:
$ env/bin/python decode.py apostles-creed.txt athanasian-creed.txt nicene-creed.txt order-of-morning.txt
Once it says "Enter initials:", type "OFWAIHHBTN" and press Enter. It will output something like this:
5.99453913576e-12 our father who ascended into heaven hallowed be thy name
This means that its best guess for "OFWAIHHBTN" is "our father who ascended into heaven hallowed be thy name". The number at the start of the line is the estimated probability of that guess, here about 6e-12.
The input is case-insensitive: "OFWAIHHBTN" is treated the same as "ofwaihhbtn".
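The probability is so small because a bigram model scores a sentence by multiplying one conditional probability per word, and every factor is at most 1. Here is a minimal Python 3 sketch of that scoring on a toy corpus. It is not the code in decode.py: the names `bigram_model` and `sentence_probability` are made up for illustration, and it uses raw counts with no smoothing, which the real program may well do differently.

```python
from collections import Counter


def bigram_model(words):
    """Count adjacent word pairs; "$" stands for the start of the text."""
    padded = ["$"] + words
    pair_counts = Counter(zip(padded, padded[1:]))
    # How often each word appears as the first half of a pair.
    context_counts = Counter(padded[:-1])
    return pair_counts, context_counts


def sentence_probability(sentence, pair_counts, context_counts):
    """Multiply P(word | previous word) over the sentence (no smoothing)."""
    prob = 1.0
    prev = "$"
    for word in sentence:
        if context_counts[prev] == 0:
            return 0.0  # the previous word never occurs as a context
        prob *= pair_counts[(prev, word)] / context_counts[prev]
        prev = word
    return prob


# Toy corpus (an assumption for this example, not one of the shipped files).
corpus = "our father who art in heaven our father in heaven".split()
pairs, contexts = bigram_model(corpus)

# P(our | $) * P(father | our) * P(who | father) = 1 * 1 * 1/2
p = sentence_probability("our father who".split(), pairs, contexts)
print(p)  # prints 0.5
```

With a real corpus each factor is far below 1, so a ten-word guess quickly ends up around 1e-12.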
You can use "$" to indicate the start of a sentence, for example "$OFWAIHHBTN".
The program makes its guesses based on text you feed it. I've included 8 pieces of text, all in the corpora subdirectory:
| Filename | Description |
|---|---|
| apostles-creed.txt | The Apostles Creed |
| athanasian-creed.txt | The Athanasian Creed |
| bible-kjv.txt | The King James Bible |
| hymnprayerbo00kunz_djvu.txt | Hymn and prayer book: for the use of such Lutheran churches as use the English language (1795) |
| nicene-creed.txt | The Nicene Creed |
| order-of-morning.txt | The Order of Morning Service |
| prayerbookreligi00lasauoft_djvu.txt | Prayer-book for religious: a complete manual of prayers and devotions for the use of the members of all religious communities : a practical guide to the particular examen and to the methods of meditation (1914, c1904) |
| tlh.txt | The Lutheran Hymnal |
To use corpora, supply them as arguments on the command line. For example:
$ env/bin/python decode.py bible-kjv.txt nicene-creed.txt
If you want to use other texts, put them in the corpora subdirectory. Then you can specify their filenames on the command line.
More and larger corpora slow the program down tremendously. For example, using just the King James Bible, trying to decode just three letters, like "ofw", can take 10-20 seconds. Trying to decode 4 or 5 or more can take minutes--or hours.
The program uses Viterbi decoding and assumes a "noisy channel"--meaning that it assumes there's a chance the letters you give it as input are wrong. By default it assumes there's a 0.1% chance of an error. If you want to change that, use the --error_prob flag. For example, this tells it there's a 50% chance of an error per letter:
$ env/bin/python decode.py --error_prob=0.5 bible-kjv.txt
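To make the idea concrete, here is a small Python 3 sketch of Viterbi decoding over a bigram model with a noisy channel. It is an illustration under stated assumptions, not the code in decode.py: the names `train_bigrams`, `emission` and `decode` are hypothetical, errors are assumed to be spread evenly over the 25 other letters, there is no smoothing, and probabilities are multiplied directly rather than summed as logs (which would underflow on long inputs).

```python
from collections import Counter


def train_bigrams(words):
    """Estimate P(word | previous word) from adjacent pairs; "$" is the start."""
    padded = ["$"] + words
    pair_counts = Counter(zip(padded, padded[1:]))
    context_counts = Counter(padded[:-1])
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}


def emission(word, letter, error_prob):
    """Noisy channel: the letter is the word's initial unless an error occurred.

    Assumption: an error is equally likely to produce any of the 25 other letters.
    """
    return 1.0 - error_prob if word[0] == letter else error_prob / 25.0


def decode(initials, bigrams, vocab, error_prob=0.001):
    """Return (probability, words) for the most probable sentence."""
    # best maps a word to the best (probability, path) of sentences ending in it.
    best = {"$": (1.0, [])}
    for letter in initials.lower():
        step = {}
        for word in vocab:
            emit = emission(word, letter, error_prob)
            for prev, (prob, path) in best.items():
                p = prob * bigrams.get((prev, word), 0.0) * emit
                if p > step.get(word, (0.0, None))[0]:
                    step[word] = (p, path + [word])
        if not step:
            return 0.0, []  # no sentence with non-zero probability
        best = step
    return max(best.values(), key=lambda entry: entry[0])


# Toy corpus (an assumption for this example, not one of the shipped files).
prayer = "our father who art in heaven hallowed be thy name"
words = prayer.split()
bigrams = train_bigrams(words)
vocab = sorted(set(words))

prob, guess = decode("OFWAIHHBTN", bigrams, vocab)
print(prob, " ".join(guess))

# The last letter is wrong ("x" instead of "n"), but the channel tolerates it
# and pays for it with a much lower probability.
noisy_prob, noisy_guess = decode("ofwaihhbtx", bigrams, vocab)
print(noisy_prob, " ".join(noisy_guess))
```

In this sketch a higher error probability makes the decoder trust the corpus more and the typed letters less, which is the trade-off the --error_prob flag is described as controlling.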
Of course there's no reason this code is limited to interpreting religious codes. It is limited only by its corpora (and its bigram model, and its slowness, and...).