Abstract: This article discusses the use of computer software (from word processing to specialized applications) as an aid in analyzing literature by finding, counting, graphing, and analyzing texts available in electronic format.
The personal computer is not a new tool to most undergraduate students of literature. Most students probably use word processors, together with their built-in grammar and spelling checkers, to write and edit papers. And many campuses provide access to on-line catalogs for research. But the computer can do much more than that for the student: The ready availability of English and other literary texts in electronic format now makes possible the incorporation of computing tools to aid students in their reading and study of literature. Professors can now make available to students copies of novels, poems, and plays on disk for computer-assisted searching and analysis. Once the student is familiar with a text, the computer offers several features that can help the student to analyze, test theories, and make significant discoveries. Even if the only software available is a word processing program, the built-in capabilities provide a surprisingly powerful tool to aid the student in literary research. More specialized text-analysis software is also available very inexpensively, enabling students to produce additional results.
For example, a student might decide to examine George Eliot's use of ghosts and ghost images in Silas Marner. The book can be brought to the screen and the student can search for ghost. Nearly all find commands in modern word processing programs allow the user to specify whether or not to look for whole words only (thus finding only ghost, or not only ghost, but also ghostly, ghostlike, and so on). When each occurrence is found, the student can either read the contextual area or print the appropriate section or save it to a research file for later examination.
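The whole-word choice described above can be sketched outside a word processor as well. The following is a minimal Python illustration, not a description of any particular program: the function name find_in_context, the context width, and the sample sentence (invented here, not taken from Silas Marner) are all assumptions.

```python
import re

def find_in_context(text, word, whole_word=True, width=40):
    """Return (position, snippet) pairs for each occurrence of `word`."""
    # \b marks a word boundary. With whole_word=False the pattern also
    # accepts longer words that begin with `word` (ghostly, ghostlike, ...).
    if whole_word:
        pattern = rf"\b{re.escape(word)}\b"
    else:
        pattern = rf"\b{re.escape(word)}\w*"
    hits = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        # Keep a little surrounding text so the hit can be read in context.
        hits.append((m.start(), text[start:end].replace("\n", " ")))
    return hits

# Invented sample sentence, used only to exercise the function.
sample = ("He looked like a ghost in the doorway. "
          "The ghostly light made the ghosts of memory stir.")
whole = find_in_context(sample, "ghost")
loose = find_in_context(sample, "ghost", whole_word=False)
print(len(whole), len(loose))
for position, snippet in loose:
    print(position, snippet)
```

The snippets could just as easily be written to a research file for later examination, as the paragraph above suggests.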
When searching for a theme, the student can develop a search list of related words that the author uses (or might use) when treating the theme and then search sequentially for the occurrences of each word. The location, context, frequency, and clustering of occurrence can all play a significant role in the working out of the theme. Equipped with this information, the student can construct an evidence-rich argument about a thematic feature of interest. It is important to note that a major benefit of computer-aided analysis of this sort is that the student often discovers a much fuller set of empirical data for supporting (or rebutting) interpretive claims than would be found by looking back through the work at notes and underlinings.
For initial assignments, the professor may wish to provide some guidance and ideas to the student. Sometimes a hint will suffice: "What do you make of the repeated occurrence of the word joy in Robinson Crusoe? And what about those words closely associated with it?" Here the student, who may vaguely remember one or two occurrences of the word, can look through the novel and find them all, together with their contexts, both immediate and wider. From this starting point, some good inductive thinking can proceed.
To give students several ideas about the kinds of things they might look for, as well as to supply them with topics should their own creativity fail, a longer assignment might be helpful. Here, for example, is a computer analysis homework assignment from a course in The Novel, where students were given electronic copies of The Red Badge of Courage:
Red Badge of Courage Computer Homework
Use your computer's search capability to locate something of thematic significance in Red Badge. If some image or word or phrase seems to you to have occurred several times in an interesting way, look for that and see what you find. If you are at a loss, here are some suggestions:
1. Search for a color and determine how it is used. For example, in what context does red seem to be most often used? What about yellow?
Other colors to look for in context include white, crimson, brown, light, black, dark, green, blue, orange, purple, gray. Do different colors signal different events or moods? Discuss.
2. Search for the words like and as to find similes. The book is filled with similes, so you will find many. Perhaps you could make a list to copy for the class. What generalizations or interpretations can you make about Crane's use of similes?
3. The army is described as a snake. Look for snake, serpent, serpentine, and analyze what you get.
4. Look for the use of religious words as metaphors. For example, prophet, mystic, devotee, cathedral, sacred. Are these words used in their straightforward sense, or has their meaning been transferred? Explain.
5. Look for a word that occurs repeatedly in significant passages. Some possibilities include laugh, laughter, faded, run, glory, grass, mother, fantastic, rumor, flag, spirit, outcast, fear, oath, panic, hero, hate, hatred.
6. Be creative and look for whole phrases of interest or other words you are curious about. Look for synonyms and antonyms of the term(s) you are interested in.
Plan to report to the class on the significance of what you find. Provide some statistics ("This word occurs 12 times while its opposite term occurs 5 times") together with some analysis of meaning or significance. Write up a one-page report to pass out to the class.
It is interesting to note that for a computer analysis assignment like this, students will often search for words or phrases that did not necessarily appear to them to be significant when they initially read the book; instead the students become curious and conjectural when the ability to search becomes available. For the assignment above, one student looked for the role of women in the novel, and eventually reported on the relative occurrence of words referring to men and women, having discovered that Red Badge is an overwhelmingly male-focused novel. Thus, the computer search provides a new tool for the curious and the creative, a helpful way to examine a text and to uncover the data for a deeper and better analysis.
Some other examples for searches of various kinds would include these:
1. Search for a subset. Example: Study the epic similes in Milton's Paradise Lost. Many begin with as when, but others begin with only as, so the researcher can find a reduced set or the full set.
2. Search for an entire phrase. Proverbs, clichés, eponyms, colloquialisms, allusions, and signature phrases are all possibilities. Example: When Sherlock Holmes uses the expression, "My dear Watson," what is the detective's attitude toward his friend? Is it consistent (that is, always patronizing, always kindly, or what)?
3. Search for a punctuation mark. Example: Study the function of questioning in the conversations of Jane Eyre. Which characters ask the questions and which answer them? By searching for the question mark in the first four chapters of the novel, we can see that Jane is the object of questions 89 percent of the time; of all the questions asked, her questions about others amount to only one in ten. Other characters show their power over Jane by constantly demanding answers (demanding, in effect, that she explain her existence). She even questions herself a substantial amount. The number of questions each character asks could be graphed in a pie chart to create a visual result.
An effective way to count the number of occurrences of a feature is to use the word processor's find and replace mode. While the find command in most word processors does not count occurrences, most replace commands do. Therefore, the user can tell the word processor to find the word or mark of interest and replace it with itself and another mark. (See "Inserting Fireworks" under Specialized Searches, below.)
4. Distinguish between uppercase and lowercase. Most search functions offer the user the ability to choose whether or not to match case. Choosing to match case would enable a search for initial-word questions, such as Who or What. The case match will find Who but not who.
5. Search for an exact match. Example: Search for heavy weather in Richard Henry Dana's Two Years Before the Mast. Search set: storm, rain, gale, wind. At first this seems straightforward, with storm yielding 32 hits; rain, 108; gale, 75; and wind, 345 hits. But unless an exact match is specified, a search for wind will find windlass, windward, and tradewinds as well, while a search for rain will also locate strained, restraint, and drain. An exact match on rain now shows 62 hits, while wind shows 201.
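The counting, case-matching, and exact-match options in the list above can be combined in one short routine. This is a sketch only: count_hits and its parameters are hypothetical names, and the sample sentence is invented rather than quoted from Dana. It borrows the find-and-replace trick by replacing each match with itself and reading off the number of replacements.

```python
import re

def count_hits(text, word, exact=True, match_case=False):
    """Count occurrences of `word`, optionally as a whole word only."""
    pattern = re.escape(word)
    if exact:
        # Word boundaries keep "wind" from matching inside "windlass".
        pattern = rf"\b{pattern}\b"
    flags = 0 if match_case else re.IGNORECASE
    # re.subn replaces each match with itself and also reports how many
    # replacements were made, which is the count we want.
    _, n = re.subn(pattern, lambda m: m.group(0), text, flags=flags)
    return n

# Invented sample text, used only to exercise the function.
sample = ("The wind rose, and the men at the windlass strained in the rain. "
          "Who called? The mate, who stood to windward.")

print(count_hits(sample, "wind", exact=False))   # wind, windlass, windward
print(count_hits(sample, "wind"))                # the word wind alone
print(count_hits(sample, "rain", exact=False))   # strained and rain
print(count_hits(sample, "Who", match_case=True))
# Punctuation marks are not words, so search for them with exact=False.
print(count_hits(sample, "?", exact=False))
```

Comparing the exact and inexact counts for each word in a search set shows at a glance how many of the raw hits are false matches.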
Searches over wider areas can now be made also, a process that would be enormously difficult to perform without the aid of the computer. Here are some examples:
1. Search for Shakespeare's use of conceive through all the plays, and discuss his play upon literal and intellectual conception.
2. Search selected novels of Fielding, Defoe, and Swift for thief and thieves to see the varied focus the writers have toward those people. A fuller search set might include rob and steal as well.
3. Search all the Sherlock Holmes stories and novels to prove that the expression, "Elementary, my dear Watson," appears nowhere in Doyle, even though elementary and my dear Watson do occur several times.
4. Edgar Allan Poe's fondness for phantasm has been often remarked. Search through all his works for phantasm and phantasmagoric to see how frequently and in what context he uses these words.
For example, in his well-known article, "Fenimore Cooper's Literary Offences," Mark Twain alleges the following:
Another stage-property that he [Cooper] pulled out of his box pretty frequently was his broken twig. He prized his broken twig above all the rest of his effects, and worked it the hardest. It is a restful chapter in any book of his when somebody doesn't step on a dry twig and alarm all the reds and whites for two hundred yards around. Every time a Cooper person is in peril, and absolute silence is worth four dollars a minute, he is sure to step on a dry twig. (Twain 634)
To see whether Twain was accurate or exaggerating, students can search for twig or even branch or stick in Deerslayer or Last of the Mohicans and examine both the number and context of the occurrences.
Once again, to help students get started, the instructor may wish to provide some theories for testing. Here is an example assignment for Gulliver's Travels:
Computer search for Gulliver's Travels
Directions: Test one of the following theories by developing a search list of appropriate words or phrases and then by using your search program to locate them. Write an analytic argument supporting or rebutting the theory with the evidence you have found.
1. Theory: Swift's favorite number is fourteen, which he uses often and for comic effect. Test this theory by finding the number of occurrences of fourteen, compared with other numbers, especially the other teen numbers: thirteen, fifteen, sixteen, etc. Also, attend to the context of the occurrence of fourteen and the others to see if fourteen is used in comic contexts more than the others.
2. Theory: Gulliver is so servile that he always calls his host in the land he enters, "my master." Find the occurrences of my master to see whether this theory has adequate supporting evidence. Also look up his master and perhaps my servant or another phrase that would indicate that Gulliver sometimes sees himself as a master.
3. Theory: Images of violence are more apparent than images of peace. Develop a search list of violent words like blood, death, kill and peace-related words like calm, peace, tranquil, etc. Search for and tabulate the occurrences of these words. Be sure to attend to the context of occurrence. How do you interpret the results?
4. Theory: Horses are viewed negatively in the first three books. (Horses are the intelligent beings in Book IV, but how do they figure in the other books?) Look for number of occurrences, context, attitude toward, and so forth.
5. Theory: Exaggeration is an integral part of Swift's satire, and Swift uses exaggeration through his word choice. Look up the number of occurrences of words that might be used for purposes of exaggeration, such as million, thousand (check the word before it to see if it's a number also, as in six thousand), huge, enormous, vast, large, extreme. Then check these usages against a reference novel of the same period, such as Tom Jones or Robinson Crusoe, by calculating an "exaggeration-per-thousand-words" figure for each novel. Interpret your findings.
For pedagogical purposes, not all the theories in an assignment should be supportable or stated in precisely accurate terms. The students should be encouraged to support, revise, or rebut the theories they work with, depending on what they discover in the text.
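The "exaggeration-per-thousand-words" figure in theory 5 is simple arithmetic: hits divided by total words, times one thousand. A minimal sketch follows, assuming a crude definition of a word (runs of letters and apostrophes) and whole-word matching, so that thousands would not count as thousand. The function name and the sample text are invented for illustration.

```python
import re

def rate_per_thousand(text, search_list):
    """Occurrences of the search-list words per 1,000 words of text."""
    # Crude tokenizer: a word is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    targets = {w.lower() for w in search_list}
    hits = sum(1 for w in words if w in targets)
    return 1000 * hits / len(words) if words else 0.0

# Search list taken from the assignment above.
exaggeration = ["million", "thousand", "huge", "enormous",
                "vast", "large", "extreme"]

# Invented sample: 22 words, 6 of them on the search list.
sample = "a vast plain and a huge host of six thousand men " * 2
print(round(rate_per_thousand(sample, exaggeration), 1))
```

Running the same function over the full text of each novel gives figures that can be compared directly, whatever the novels' lengths.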
Two or more themes can be measured and plotted simultaneously to reveal the relationship, if any, between their ebbing and flowing. Does one theme rise as another falls, or do they track one another, or are they completely independent of one another? The rising and falling of theme-related words could also be traced against the perceived rising and falling of the action of a work.
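One way to ask whether two themes rise and fall together is to count each theme's words in consecutive chunks of text and then compare the two series. The sketch below is an illustration under stated assumptions, not a method prescribed by any of the programs discussed here: the chunk size, the function names, and the use of a Pearson correlation as the summary measure are all choices made for the example, and the sample text is an invented word string.

```python
import re
from math import sqrt

def theme_counts(text, search_list, chunk_size=1000):
    """Count theme words in consecutive chunks of `chunk_size` words."""
    words = re.findall(r"[a-z']+", text.lower())
    targets = {w.lower() for w in search_list}
    return [sum(w in targets for w in words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def correlation(xs, ys):
    """Pearson correlation of two equal-length series.

    Near +1 the themes track one another; near -1 one rises as the
    other falls; near 0 they look independent.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Invented word string; a tiny chunk size keeps the example readable.
sample = "blood kill death war calm peace tranquil rest blood calm sea sky"
violence = theme_counts(sample, ["blood", "death", "kill"], chunk_size=4)
peace = theme_counts(sample, ["calm", "peace", "tranquil"], chunk_size=4)
print(violence, peace, round(correlation(violence, peace), 2))
```

The per-chunk counts can also be plotted side by side, or against the student's own chart of the rising and falling action.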
1. Inserting Sectional Markers. The beginning and ending of speeches, passages of description, crucial scenes, subplots, or other text portions can be marked by unusual characters (such as angle brackets: <<section 6>>) and then the words between the markers can be studied: counted, searched for thematic nuances, and so on. For example, the number of words spoken by each of several characters could be counted quickly (spelling checkers usually count words as part of their activity) and the results analyzed. Is one character dominating the conversation of the work? Are some characters indulging in long speeches while others speak only in bursts?
2. Inserting Fireworks. To enable the student to locate and compare the proximity of two or more words or phrases of interest, the latter can be marked prominently for subsequent location by glancing through the text. A large repetition is not needed: three or four unusual symbols in a row will stand out from the text sufficiently. For example, in the issue mentioned above about passages relating to heavy weather in Two Years Before the Mast, the search words could be marked for easy location near each other. Here is a sample paragraph thus marked:
Giving up all attempts to collect my things together, I lay down upon the sails, expecting every moment to hear the cry of "all hands ahoy," which the approaching ###storm### would soon make necessary. I shortly heard the $$$rain$$$-drops falling on deck, thick and fast, and the watch evidently had their hands full of work, for I could hear the loud and repeated orders of the mate, the trampling of feet, the creaking of blocks, and all the accompaniments of a coming ###storm###. In a few minutes the slide of the hatch was thrown back, which let down the noise and tumult of the deck still louder, the loud cry of "All hands, ahoy! tumble up here and take in sail," saluted our ears, and the hatch was quickly shut again. When I got upon deck, a new scene and a new experience was before me. The little brig was close hauled upon the @@@wind@@@, and lying over, as it then seemed to me, nearly upon her beam ends.
3. Marking Individual Features. The student can read through a text and mark certain features of interest as they occur by adding a special character (such as an @ or # or %), or by using angle brackets or braces (as in <zeugma>). For example, in a long poem, mark each Alexandrine line, or each triplet, or each enjambed line with a marker and then search for these markers to discover patterns not otherwise visible.
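The first two techniques in this list, sectional markers and fireworks, lend themselves to a short sketch. The code below assumes a hypothetical marking convention in which each speech begins with a <<label>> marker and runs until the next marker; the function names and the sample strings are invented for illustration.

```python
import re
from collections import Counter

def words_per_section(marked_text):
    """Total the words that follow each <<label>> marker."""
    # Splitting on a capturing group yields
    # [preamble, label1, body1, label2, body2, ...].
    parts = re.split(r"<<(.+?)>>", marked_text)
    totals = Counter()
    for label, body in zip(parts[1::2], parts[2::2]):
        totals[label.strip()] += len(body.split())
    return totals

def add_fireworks(text, marks):
    """Wrap each search word in its symbol run, e.g. storm -> ###storm###."""
    for word, symbol in marks.items():
        text = re.sub(rf"\b{re.escape(word)}\b",
                      lambda m, s=symbol: f"{s}{m.group(0)}{s}",
                      text)
    return text

# Invented samples: A and B stand for two speakers.
speeches = "<<A>> one two three <<B>> four <<A>> five six"
print(words_per_section(speeches))
print(add_fireworks("the storm brought rain",
                    {"storm": "###", "rain": "$$$"}))
```

The per-speaker totals answer the question of who dominates the conversation, and the marked text can be skimmed for places where different symbols cluster together.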
Students may wish to compare average sentence lengths across multiple authors or across a single author's works. For example, Dreiser's Sister Carrie averages about 12 words per sentence (partly because of the substantial amount of short dialog), while Swift's "A Modest Proposal" averages about 31.
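An average sentence length can be approximated with a few lines of code. The sketch below makes a naive assumption, namely that a sentence ends at a period, exclamation point, or question mark followed by white space, so abbreviations such as Mr. and punctuation inside quoted dialog will distort the result. The function name and the sample sentences are invented.

```python
import re

def average_sentence_length(text):
    """Mean number of words per sentence, using a naive sentence splitter."""
    # Split after ., !, or ? when white space follows.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    word_counts = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return sum(word_counts) / len(word_counts) if word_counts else 0.0

# Invented sample: sentences of 2, 5, and 6 words.
sample = "She waited. The train was late again! Would he come at all today?"
print(round(average_sentence_length(sample), 2))
```

Because the splitter is crude, figures produced this way are best used for comparison between texts processed the same way rather than as exact measurements.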
Frequency plot of "fear", "dread", "fright", "terror", "feared", "fearful", "fears", "afraid", "dreadful", "frightened", "scared", "alarming", "alarmed", representing the motif "fear", which occurred 128 times. Each row represents 500 lines.
Row Count
  1     1 |*
  2     4 |****
  3     8 |********
  4     5 |*****
  5     5 |*****
  6     1 |*
  7     4 |****
  8     1 |*
  9     5 |*****
 10     2 |**
 11     6 |******
 12     2 |**
 13     1 |*
 14     8 |********
 15     4 |****
 16     3 |***
 17     2 |**
 18     3 |***
 19    10 |**********
 20     7 |*******
 21    10 |**********
 22     0 |
 23    10 |**********
 24     5 |*****
 25    15 |***************
 26     6 |******
----------+----+----+----+----+
               5   10   15   20
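A plot of this kind can be approximated in a few lines. The sketch below is not the program that produced the plot above; it is an illustration, with invented function names and sample lines, of counting motif words in successive blocks of lines and drawing one asterisk per occurrence.

```python
import re

def frequency_rows(lines, motif_words, lines_per_row=500):
    """Count motif words in successive blocks of `lines_per_row` lines."""
    targets = {w.lower() for w in motif_words}
    rows = []
    for start in range(0, len(lines), lines_per_row):
        block = " ".join(lines[start:start + lines_per_row]).lower()
        rows.append(sum(1 for w in re.findall(r"[a-z']+", block)
                        if w in targets))
    return rows

def render(rows):
    """Draw one text row per block, with one asterisk per occurrence."""
    out = ["Row Count"]
    for number, count in enumerate(rows, start=1):
        out.append(f"{number:>3} {count:>5} |{'*' * count}")
    return "\n".join(out)

# Invented sample lines; two lines per row keeps the example short.
motif = ["fear", "dread", "fright", "terror"]
lines = ["I fear the dark", "no dread here",
         "calm seas", "terror and fear and fright"]
rows = frequency_rows(lines, motif, lines_per_row=2)
print(render(rows))
print("Total occurrences:", sum(rows))
```

Changing lines_per_row trades detail for smoothness: smaller blocks show sharper local clusters, larger ones show the broad movement of the motif.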
While some modules of MTAS have file size limitations (mentioned below), the Word Distribution Graph function had no trouble searching for references to Miss Western and Sophia in the 340,000-word Tom Jones (a file size of just under two megabytes), so the program should handle most novels without a problem.
Another program, BE.EXE, calculates the percentage of words in Basic English, the 850-word vocabulary defined by C. K. Ogden. Readability Plus provides a display of "mortar and bricks," the relationship between the number of words in the text that match one of the most commonly used words (a list of 2450) and the number that do not match. Since most prose draws about 80 percent of its words from the most commonly used list, comparisons can be made to determine whether the text under study departs from this figure. Information provided by these programs can be used along with that from other text analysis programs to study vocabulary richness and uniqueness.
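The percentage these programs report is, at bottom, the share of a text's words that appear on a reference list. The sketch below shows only that arithmetic and does not reproduce how BE.EXE or Readability Plus work internally. The eight-word list is a tiny hypothetical stand-in for the 850-word or 2450-word lists, and the sample sentence is invented.

```python
import re

def common_word_share(text, common_words):
    """Percentage of the words in `text` that appear on the reference list."""
    words = re.findall(r"[a-z']+", text.lower())
    common = {w.lower() for w in common_words}
    matched = sum(1 for w in words if w in common)
    return 100 * matched / len(words) if words else 0.0

# Hypothetical stand-in for a real common-word list.
tiny_list = ["the", "of", "and", "a", "to", "in", "is", "was"]

# Invented sample: 10 words, 6 of them on the list.
sample = "The phantasm of the moor was a harbinger of ruin"
print(common_word_share(sample, tiny_list))
```

With a full reference list loaded from a file, the result for a text under study can be set against the 80 percent figure mentioned above.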
The program Micro-Eyeball, while designed for use on relatively small text samples, produces a large set of statistical descriptions useful for linguistic analysis (Ross 2). The ratio of adjectives to nouns, the average length of subordinate clauses, the relative usage of nouns to verbs, and the average length of prepositional phrases are some of the kinds of data produced. These calculations allow the student to compare various texts with each other, to discover significant differences between texts. Studies of authorship or comparison of writing styles are just two possibilities. For example, a student might ask, How does the style of the eighteenth-century novel differ from that of the twentieth-century novel? Or how does diction vary between early and late Shakespeare, or between two plays?
Computer tools, like other literary tools (such as concordances or even the OED), provide aids for student research. Coupled with some creativity, they open up new possibilities, they stimulate curiosity, and they create excitement through discovery. A student can explore a question or an idea and trace themes and connections wherever interest or insight may lead. The search or the graph or the printout is never the end result. Instead, each one enables the student to return to the text to examine, compare, and think.
Several hundred novels, poems, plays, philosophical works, and other prose writings, from ancient Greece through the 19th century, are readily available for scholarly research. Texts of 20th century works are more problematic because of copyright restrictions. Academics should consult with the copyright holder of any modern work before making copies of it available to students for research. For works in the public domain, the most efficient and least expensive way to obtain a large number of texts is with a CD ROM disc. The professor can choose one or more works from the disc and copy them onto floppy disks for distribution to students.
Several dozen novels, poems, and plays are available on the disc, Desktop Bookshop, from WeMake CD's, Indianapolis, Indiana. These files are in ASCII format and can be searched by virtually any word processor or search program on Macintoshes and IBM PC's.
Another source of ASCII-format texts is Project Gutenberg, whose 1991, 1992, 1993, and 1994 collections are available on a single CD-ROM from Walnut Creek CD-ROM, 1547 Palos Verdes Mall, #260, Walnut Creek, CA 94596. The publisher intends to update the disc every six months to add the new titles issued by the Project. As of this writing, the most recent disk is November, 1994. Files are also available directly from the Project over the Internet.
Andromeda Interactive (1050 Marina Village Pky. Ste. 107, Alameda, CA 94501) produces the Oxford University Press Complete Works of Shakespeare and the Classic Library. This latter title contains 60 novels, 40 plays, 600 poems, and 144 short stories, together with a graphical interface and search capability.
World Library (2809 Main St., Irvine, CA 92714) has several CD-ROM products available for Windows, in its Library of the Future series. While the files are not stored in ASCII format, searching software accompanies the disc and individual works can be saved to ASCII files on hard or floppy disk.
Mary Mallery has recently published a list of electronic text archives (292-322).
Finally, scanners with optical character recognition software are now affordable for many departments, so it is possible to scan one's own texts for research and teaching. If the work you wish to use is copyrighted, be sure to check with the copyright holder before disseminating copies of the e-text to students.
Sources of Specialized Software
MTAS and TACT are available from The Centre for Computing in the Humanities, Robarts Library, 14th Floor, 130 St. George Street, University of Toronto, Toronto, Ontario, Canada M5S 1A5.
Micro-Eyeball is available from Donald Ross, English Department, University of Minnesota, Minneapolis, MN 55455.
Readability Plus is available from Scandinavian PC Systems, Inc., P.O. Box 3156, Baton Rouge, LA 70821-3156.
Mallery, Mary. "Directory of Electronic Text Centers." Text Technology 4 (1994): 292-322.
Ogden, C. K. The System of Basic English. New York: Harcourt, 1934.
Ross, Donald, and David Hunter. "Micro-EYEBALL: An Interactive System for Producing Stylistic Descriptions and Comparisons." Computers and the Humanities 28 (1994): 1-11.
Smith, John B. "Computer Criticism." In Rosanne G. Potter, ed. Literary Computing and Literary Criticism: Theoretical and Practical Essays on Theme and Rhetoric. Philadelphia: University of Pennsylvania Press, 1989. 13-44.
Twain, Mark. "Fenimore Cooper's Literary Offences." In Charles Neider, ed. The Complete Humorous Sketches and Tales of Mark Twain. New York: Doubleday, 1961. 631-642.