“No competent scholar needs to be convinced of the utility of indices,” declared the University of Illinois “Czar of classics” and prodigious concordance producer William Abbot Oldfather in 1937, with only a little defensiveness. But then where have all the concordances gone? Before the rise of a certain ubiquitous search engine, the humble index verborum (an alphabetical list of the dictionary headwords used in a text, with a full list of citations for each instance) and the concordance (the same, but with a few words of context for each instance) were respected genres of scholarship. The work was tedious, though hardly easy or rote, given the many homographs and homonyms, especially in Latin. Concordances, dull though they may seem, helped classical scholars study the characteristic vocabulary of an author. They let readers find passages quickly. They helped translators and commentators by providing a full list of the instances of a particular lemma, something dictionaries did not. They revealed which words did not appear in an author. And, a key factor for many classical concordance makers, they could support efforts to establish a more authoritative text. In 1980, Henri Quellet produced a comprehensive bibliography of them that runs to 262 pages, and a list made at Dartmouth adds dozens more examples, up to the early 2000s.
Now the print concordance is well and truly defunct, digital road-kill. Bruce McMenomy recently prepared a full online concordance of Vergil’s Aeneid for his own use, based on the Latin Library text, with programming he himself describes as “trivial.” Thanks to the Packard Humanities Institute, a full concordance of 362 Latin authors is instantly available for free. The TLG offers a similar, somewhat more sophisticated service for a vast corpus of ancient Greek behind a paywall. A delightful thing, but vaguely tragic when you consider the untold years of collective labor by scholars (including many graduate students) that yielded scores of classical concordances, and even, in some cases, several competing ones for the same author.
Existing online concordances like the PHI Latin Texts concordance and McMenomy’s Aeneid are actually lists of character strings, not, as with most of the older print concordances, lists of dictionary headwords—a crucial distinction. To take an English example, an un-lemmatized concordance of Shakespeare’s works jumbles into one entry on “light” no fewer than 270 quotations which variously refer to the noun, two adjectives, two verbs, and one adverb (Hart 1943, 128). In Latin, latus could be the adjective “broad,” the participle “having been carried,” or the noun “flank,” depending on the lemma from which it derives. Ferias could mean “you might strike” (from the lemma ferio) or “holidays” (from feriae), depending on the context. If what you want is to understand an author’s vocabulary, an unlemmatized concordance will not do. Computers can lemmatize unambiguous words easily, but not the ambiguous ones, which still require human inspection. The percentage of ambiguous forms in Latin and Greek texts is hard to gauge precisely, but it is probably around 30% on average.
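The distinction is easy to make concrete in code. Here is a minimal Python sketch (the tiny lexicon and the citations are invented for illustration; a real system would draw candidate lemmas from a morphological analyzer such as Morpheus) contrasting a string-keyed concordance with a lemma-keyed one:

```python
from collections import defaultdict

# Toy lexicon: surface form -> possible lemmas. Invented for illustration;
# a real morphological analyzer would return candidate sets like these.
LEXICON = {
    "latus": ["latus (adj., 'broad')", "fero (pf. part., 'carried')", "latus (n., 'flank')"],
    "ferias": ["ferio ('you might strike')", "feriae ('holidays')"],
    "fabulas": ["fabula ('story')"],
}

# (citation, surface form) pairs -- invented sample data.
tokens = [("1.1", "latus"), ("2.3", "latus"), ("5.9", "ferias"), ("3.1", "fabulas")]

# A string concordance: one undifferentiated entry per character string.
string_conc = defaultdict(list)
for loc, form in tokens:
    string_conc[form].append(loc)

# A lemmatized concordance: unambiguous forms are filed automatically,
# while ambiguous forms are flagged for human inspection.
lemma_conc = defaultdict(list)
for loc, form in tokens:
    candidates = LEXICON.get(form, [form])
    if len(candidates) > 1:
        lemma_conc["AMBIGUOUS: " + form].append(loc)  # queue for hand-checking
    else:
        lemma_conc[candidates[0]].append(loc)

print(dict(string_conc))
print(dict(lemma_conc))
```

The string concordance files both occurrences of latus under one heading; the lemmatized one refuses to guess and sets them aside, which is exactly the part of the work that, for roughly a third of the forms in a Latin text, still falls to a human.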
A fully lemmatized text, on the other hand, with each form tied to a dictionary headword, has many potential uses. Scholars at Brown University developed the famous Brown Corpus of contemporary English, tagged for part of speech (POS), in the 1960s. This analyzed corpus of one million word forms ultimately led to major advances in computational linguistics. The Belgian research group LASLA began in the early 1960s to hand-lemmatize and POS-tag Latin and Greek texts for its extensive database, and scholars have used it for stylistic analysis of classical authors. The very act of lemmatizing can be good pedagogy. Thanks to the Perseids project, many teams of teachers and students are now treebanking, an activity in which readers parse each word in a sentence and represent the syntactical relationships between the words graphically. This activity has now produced millions of words of crowdsourced lemmatized Latin and Greek. The Classical Language Toolkit has its own lemmatizer, actually a suite of tools developed with support from Google’s Summer of Code. Methods of lexicon-assisted machine lemmatization are an active subject of research in digital classics (vor der Brück et al. 2015; Kestemont and De Gussem 2017).
But in some ways this digital work is reproducing work that had already been done in earlier generations by the concordance makers, only published in a different format. Whereas the treebanker diagrams a sentence and displays it, the concordance maker dismembers the sentence and files the words in little boxes. What if the texts that were atomized in print concordances could be reassembled, each word put back in its place and still tied to its correctly analyzed dictionary form?
Bret Mulligan and I are pleased to have been awarded a small grant from the Society for Classical Studies (the Pedagogy Award) to take a first, experimental step in this direction. The Index Apuleianus by Oldfather et al. was published in 1934 by the American Philological Association. With the permission of the SCS as copyright holder, we will have the book professionally digitized and, using a program developed by Dickinson College computer scientist Michael Skalak, convert it into a fully lemmatized text. Further processing will tie the data to the set of Latin lemmas used in the Morpheus code, which underlies Perseus and other datasets of parsed Latin. It will then be married with a lexicon via The Bridge, an application for the creation of correct vocabulary lists for Greek and Latin authors. This will allow students and teachers to create accurate running vocabulary lists for all the works of Apuleius, and so increase the readability of his texts. Such lists already exist for the Aeneid, Caesar’s Gallic War, and others, thanks to LASLA’s lemmatized texts of those authors combined with special lexica.
Practical questions arise: how exactly is concordance data best preserved and stored digitally? How should it be displayed to the user? How much information should be captured from the original print work? Should translations be added, as in Strong’s Exhaustive Concordance of the Bible? Some of the choices to be made in digitization are mentioned in the sample below.
The scholarly effort that went into the creation of the Index Apuleianus can be liberated for re-use in at least three ways:
- Vocabulary lists for easier reading (as in Dickinson College Commentaries editions)
- Data for word frequency and stylistic analyses
- The development and better training of automatic parsers based on a larger corpus of hand parsed Latin
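The second of these uses falls out almost immediately once the data is lemma-keyed. A sketch with the standard library’s `collections.Counter` (the token stream is invented sample data):

```python
from collections import Counter

# Lemmatized token stream: (citation, lemma) pairs -- invented sample data;
# the real stream would come from the digitized Index Apuleianus.
lemmatized_tokens = [
    ("1.1", "dico"), ("1.2", "fabula"), ("1.2", "dico"),
    ("1.3", "varius"), ("1.4", "dico"),
]

# Lemma frequencies: the raw material of vocabulary and stylistic analysis.
freq = Counter(lemma for _, lemma in lemmatized_tokens)
print(freq.most_common(2))
```

Counting surface strings instead of lemmas would, as with the “light” example above, silently merge distinct words; counting lemmas gives figures a stylistic study can actually use.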
Indirectly and eventually, this project will, we hope, make machine parsers better, so that students of all Latin authors, not just Apuleius, can read them more easily. We plan to release the data on GitHub (i.e., to make it open access), under a Creative Commons ShareAlike license, as:
- a .txt file of the professionally digitized book
- a lemmatized text of the works of Apuleius
- customizable running lists with dictionary forms and generic definitions provided by The Bridge
Hippocrates would likely agree that life is short and resources are scarce, but that the art of the concordance may still have a long life. It remains to be seen if this genre of classical scholarship, once honored and now dead, is worthy of fuller digital resuscitation, and if so, how. Just as classicists in earlier eras came together to make concordances, digital humanists and classicists are now creating the digital resources that they and their students need. It would be a shame if the huge amount of work that went into print concordances could not be turned to some kind of productive use in the digital age.
Hart, Alfred. “Vocabularies of Shakespeare's Plays.” The Review of English Studies 19, no. 74 (1943): 128–40.
Kestemont, Mike and Jeroen De Gussem. “Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning.” Journal of Data Mining & Digital Humanities, August 6, 2017. https://jdmdh.episciences.org/3835
Oldfather, William A., H. V. Canter, Kenneth Morgan Abbott, and B. E. Perry. Index Apuleianus. Middletown, Conn.: American Philological Association, 1934.
Oldfather, William A. “Suggestions for Guidance in the Preparation of a Critical Index Verborum for Latin and Greek Authors.” Transactions and Proceedings of the American Philological Association 68 (1937): 1–10. doi:10.2307/283249.
Quellet, Henri. Bibliographia indicum, lexicorum et concordantiarum auctorum Latinorum. Hildesheim and New York: G. Olms, 1980.
vor der Brück, Tim, Steffen Eger, and Alexander Mehler. “Lexicon-assisted tagging and lemmatization in Latin: A comparison of six taggers and two lemmatization models.” Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Beijing, China, July 30, 2015, pages 105–113. http://www.aclweb.org/anthology/W15-3716
If you have ideas on this topic, please do not hesitate to contact the authors of this post to discuss them.
(Header Image Caption: Cover of Franciszek Meninski's Complementum thesauri linguarum orientalium seu Onomasticum latino-turcico-arabico-persicum, simul idem index verborum lexici turcico-arabico-persici, quod latina, Germanica, aliarumque linguarum adjecta nomenclatione nuper in luce editum (1687). Image is in the Public Domain and available via Wikimedia).