Preparing the Text

Attention conservation notice: Preparing a text for automated poetry detection is half spell-checking. That part may be kind of dull; I’m writing it down anyway because this blog is how I remember things now.

Cleanup

To do automated text processing, you need a text. Some people, like the indefatigable Sparrow, type the whole text by hand. I am more defatigable than in-, so I needed another way. The Internet Archive has a text version of The Lord of the Rings, left over from the days of freedom before the Enclosure of the Internet. I suspect that the lawyers have let it survive because the quality of the scan is so poor. I started with that.

I did most of my work on the Unix command line. My constant companion is a program called “aspell”. Feed it a text file, and it returns a list of all the words it didn’t recognize. A scanner makes predictable errors, such as reading a “u” as “ii” or an “h” as “li”. Those were the most common misspellings. They are easy enough to encode in a stream-editor input file. So I cleaned up all of those that I could. Some were harder, like distinguishing between “ore” and “orc”. The computer will never get that straight, so I did those by hand. All ambiguous cases got changed to “orc”, and then I went to the Moria chapters and turned back the two or three actual mentions of “ore”. (I have no idea how hard this task would be with a book I didn’t know so well, but I may soon find out.)
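A minimal sketch of that substitution step, in Python rather than a stream-editor file. It is an illustration, not the script used for this project: the names `KNOWN_WORDS`, `SCAN_FIXES` and `fix_scan_errors` are hypothetical, and the tiny word set stands in for the spell-checker’s verdict.

```python
import re

# Hypothetical stand-in for the spell-checker: a tiny set of accepted words.
KNOWN_WORDS = {"the", "house", "burned", "hill", "liar"}

# Scanner confusions named in the text: "u" read as "ii", "h" read as "li".
SCAN_FIXES = [("ii", "u"), ("li", "h")]


def fix_scan_errors(word, known=KNOWN_WORDS):
    """Return a corrected spelling if exactly one substitution yields a known word."""
    if word.lower() in known:
        return word  # already fine; leave real words like "liar" alone
    candidates = set()
    for wrong, right in SCAN_FIXES:
        # Try undoing the misreading at each place it could have happened.
        for match in re.finditer(wrong, word):
            fixed = word[:match.start()] + right + word[match.end():]
            if fixed.lower() in known:
                candidates.add(fixed)
    # Ambiguous or hopeless cases are left for a human, as with "ore" and "orc".
    return candidates.pop() if len(candidates) == 1 else word


for w in ["Tlie", "liouse", "biirned", "liar", "xyzzy"]:
    print(w, "->", fix_scan_errors(w))
```

Checking each candidate against a word list is what keeps the rule from mangling legitimate words that happen to contain “li”.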

[ETA: The Guardian has a hilarious article saying that I didn’t get the really fun scanner error.]

The next thing to do was restore the hyphenated words at the ends of lines. That’s easy to do with a little Perl script — check lines that end in hyphens; delete the hyphen and newline, and see if the combination passes spell-check.
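The same idea in a few lines of Python, for readers who would rather not write Perl. This is a sketch under assumptions: `rejoin_hyphenated` is a hypothetical name, and `is_known` is whatever spell-check test you have on hand (here, membership in a small set).

```python
import re


def rejoin_hyphenated(text, is_known):
    """Delete a line-end hyphen and newline when the joined word passes is_known."""

    def repl(match):
        joined = match.group(1) + match.group(2)
        if is_known(joined.lower()):
            return joined  # hyphen and newline removed
        return match.group(0)  # leave genuine hyphens (or unknown words) untouched

    # A run of letters, a hyphen at the end of the line, then a run of letters.
    return re.sub(r"([A-Za-z]+)-\n([A-Za-z]+)", repl, text)


known = {"hobbits", "uphill"}
sample = "the hob-\nbits went up-\nhill to Bag-\nEnd"
print(rejoin_hyphenated(sample, lambda w: w in known))
```

Words that fail the check keep their hyphen and line break, so compounds like “Bag-End” survive for a later manual pass.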

Now comes the fun part. LotR is full of proper names, archaic words, British spellings, and invented languages. The spell-checker will never get those right. So I had to work my way through the aspell output, picking out by hand the Briticisms and other things that were correct but unknown to the software. This took up my spare time for about a week. The result is a long list of words that are correctly spelled, despite what the spell-checker thinks. So the process was:

1. Run aspell on the text.
2. Use the comm command to compare the output to the list of correct words. Remove all the words in the output that are correct in this context.
3. Figure out what the corrections might be on the remainder. If they’re not wrong, add them to the “correct” list.
4. Repeat until the remainder is empty.
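One pass of that loop can be sketched in Python. This is an illustration rather than the commands actually used: `triage` and its arguments are hypothetical names, and the set difference plays the role that comm plays on sorted files (roughly what `comm -23` prints).

```python
def triage(flagged, accepted, corrections):
    """One pass: drop accepted words, then split the rest into fixable and unresolved.

    flagged     -- words the spell-checker did not recognize
    accepted    -- words known to be correct in this book
    corrections -- misspellings already matched to their fixes
    """
    remainder = sorted(set(flagged) - set(accepted))
    fixable = {w: corrections[w] for w in remainder if w in corrections}
    unresolved = [w for w in remainder if w not in corrections]
    return fixable, unresolved


flagged = ["Gandalf", "mathom", "tlie", "colour", "Frodo", "biirned"]
accepted = {"Gandalf", "mathom", "colour", "Frodo"}
corrections = {"tlie": "the"}

fixable, unresolved = triage(flagged, accepted, corrections)
print(fixable)     # words with a known fix
print(unresolved)  # words a human still has to look at
```

Each unresolved word then goes either into `accepted` or into `corrections`, and the pass is repeated until nothing is left unresolved.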

The list of correct words is an interesting document in its own right. You never know when that might come in handy, so I’ve published it on the Humanities Commons Core.

We now have a text of the book. Text is not all we need, though. We need sounds.

Phonetics

Next, we have to find the poetry. Alliteration is easy to identify, in theory. We’re going to look for words that begin with the same sound. Or, rather, words that have the same sound at the beginning of the accented syllable. This is done with a “pronouncing dictionary”. I used the free one from Carnegie Mellon University. (These were invented so voice-mail systems could read you a printed text over the phone, but they can be used for Good, too.) For each word it covers, it has a line with the spelling of the word and its phonetic expansion:

VOLUNTARY  V AA1 L AH0 N T EH2 R IY0
ASSISTANT  AH0 S IH1 S T AH0 N T
POSTMEN  P OW1 S T M EH N

The list of phonemes is on the welcome page of the dictionary, but they’re easy to figure out. Vowels have stresses; the numbers are the amount of stress each vowel gets. Zero means unstressed. One means stressed. Subsidiary stresses have higher numbers. Straight out of the box, the dictionary has over 130,000 words in it, but of course that’s not enough. I spent the next couple of weeks making dictionary entries for all the LotR-words I had collected in the previous step. The supplementary dictionary is also online at the Humanities Commons. Surprisingly, the CMU dictionary contains all of the phonemes I needed except one. The supplement includes “KH” for words like “Erech” and “Grishnakh”.
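To make the dictionary format concrete, here is a rough Python sketch that reads entries like the ones above and picks out a sound to compare for alliteration. It rests on a loudly stated assumption: the “beginning of the accented syllable” is approximated as the consonant immediately before the vowel marked 1, or the vowel itself if nothing consonantal precedes it. Consonant clusters would need real syllabification, which this does not attempt. The function names are hypothetical.

```python
def parse_entry(line):
    """Split 'WORD  PH1 PH2 ...' into (word, [phonemes]); any whitespace separates."""
    word, *phonemes = line.split()
    return word, phonemes


def is_vowel(phoneme):
    # In this format, vowels are the phonemes that carry a stress digit.
    return phoneme[-1].isdigit()


def alliterating_sound(phonemes):
    """Approximate the sound that opens the primary-stressed syllable."""
    for i, p in enumerate(phonemes):
        if p.endswith("1"):  # primary stress
            if i > 0 and not is_vowel(phonemes[i - 1]):
                return phonemes[i - 1]  # consonant just before the stressed vowel
            return p[:-1]  # stressed syllable starts with the vowel itself
    return None  # no primary stress marked


for line in [
    "VOLUNTARY  V AA1 L AH0 N T EH2 R IY0",
    "ASSISTANT  AH0 S IH1 S T AH0 N T",
    "POSTMEN  P OW1 S T M EH N",
]:
    word, phonemes = parse_entry(line)
    print(word, alliterating_sound(phonemes))
```

On this approximation “assistant” is filed under S rather than under its unstressed first vowel, which is the point of looking at the accented syllable instead of the first letter.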

This concludes the boring part. Next, I plunge into the quagmire of Anglo-Saxon scansion.

Idiosophy

A physicist loose among the liberal arts
