Review: Anderson on Winge, A Latin Macronizer

Albertus Magnus, De Bono. Folium 1r. Cologne, Library of the Dome, Codex 1024 (detail). From Wikimedia Commons. Public Domain.

Johan Winge’s macronizer is a very welcome and well-designed tool for the automatic macronization of Latin texts. Appearing in the same year as Felipe Vogel’s macronizer, it has quickly become the best available, and it works equally well on multiple versions of Windows and on MacOS X in Internet Explorer, Firefox, Chrome, and Safari. The interface is clean and easy to use. Text can be pasted or typed in (20,000 characters max.), and return times seem very quick; the whole of Catullus 64 required less than 10 seconds. There are some output options, such as scanning a text as dactylic or elegiac meter (which increases the accuracy of the macronization) and v/u and i/j conversion. There is also a nice function that automatically copies all of the macronized text.

Unfortunately, one of the best features of the output (ambiguous forms are highlighted in yellow and unknown forms in orange) is not always preserved in transfer, nor is the stichometry for poetry. When importing into Word and TextEdit, I had the best results using Chrome and the worst using Firefox. Users will need to experiment, and may need to strip all formatting from a macronized text in order to remove the highlighting. This inconsistency may make it quite frustrating to use the tool to develop a fully and correctly macronized text. Ideally, one could consistently copy the output with the highlights preserved, concentrate on the problematic words, and then remove the highlights.

I tried the macronizer using the following short collection of words with ambiguous phonology and Catullus 1:

comedo amburo atramenta republica miseris bardo crastino furtivis undeviginti deabus lucubus

Cui dono lepidum novum libellum arida modo pumice expolitum? Corneli, tibi: namque tu solebas meas esse aliquid putare nugas. Iam tum, cum ausus es unus Italorum omne aevum tribus explicare cartis Doctis, Iuppiter, et laboriosis! Quare habe tibi quidquid hoc libelli, qualecumque, quod, o patrona virgo, plus uno maneat perenne saeclo!

I received the following output.

Some quirks of the system are a bit strange: republica is missing the macron on re; crastino is not marked as ambiguous, although the length of the initial vowel is debated and both Lewis & Short and the Oxford Latin Dictionary give it as short; lepidum might be ambiguous as adjective or noun but is not ambiguous in phonology; arida is marked as ambiguous, probably because of the possible inflections of the final -a, but the unambiguous stem is not macronized; tribus has a macron on the ultima, where I expected it, like arida, to be marked ambiguous and left without any macrons. The system is trainable, however, and many of these quirks might be remedied by dedicated users and collaborators.

This tool is pretty clearly an enormous advance on other macronizers, but it has limitations. The key to any successful macronizer lies in the PoS (part-of-speech) tagging system it uses. A PoS tagger identifies the likely lemma(ta) for any given word, as well as its syntactic information. The accuracy of a PoS tagger depends, in turn, on the treebank it uses. Two of the most commonly used and reliable treebanks for Latin are the Latin Dependency Treebank (LDT), developed under the Perseus Digital Library, and PROIEL, a dependency treebank for Indo-European languages based at the University of Oslo. The LDT (v. 1.3) is based on about 100,000 words (selected from Augustus, Caesar, Cicero, Jerome, Ovid, Petronius, Phaedrus, Propertius, Sallust, Suetonius, Tacitus, and Vergil), while PROIEL is based on about 150,000 words (from Caesar, Cicero, and Jerome).
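To illustrate why the tagging step matters so much, here is a minimal sketch in Python of a tag-driven lookup. It is not Winge's code: the tiny lexicon, the tag labels (e.g. "adj.abl.sg.f"), and the function name are all assumptions made up for this example. It only shows how a tag chosen by a tagger can select one macronization among several, and how ambiguous and unknown forms might be flagged.

```python
from collections import defaultdict

# Hypothetical toy lexicon: (surface form, invented tag) -> macronized form.
# The tag labels are made up for this sketch; they are not the LDT tagset.
LEXICON = {
    ("arida", "adj.nom.sg.f"): "ārida",
    ("arida", "adj.abl.sg.f"): "āridā",
    ("tribus", "num.abl.pl"): "tribus",    # from tres
    ("tribus", "noun.gen.sg"): "tribūs",   # from tribus, -ūs
    ("pumice", "noun.abl.sg"): "pūmice",
}

# Collect every macronization known for each surface form.
VARIANTS = defaultdict(set)
for (form, _tag), marked in LEXICON.items():
    VARIANTS[form].add(marked)


def macronize(tagged_words):
    """Return (word, status) pairs for a list of (word, tag) pairs.

    status is "ok", "ambiguous" (several macronizations exist for the
    form, so the result rests on the tag), or "unknown" (form not listed).
    """
    result = []
    for word, tag in tagged_words:
        key = word.lower()
        if key not in VARIANTS:
            result.append((word, "unknown"))
        elif (key, tag) in LEXICON:
            status = "ambiguous" if len(VARIANTS[key]) > 1 else "ok"
            result.append((LEXICON[(key, tag)], status))
        else:
            # Known form, but no entry for this tag: leave it unmarked.
            result.append((word, "ambiguous"))
    return result


print(macronize([("arida", "adj.abl.sg.f"), ("pumice", "noun.abl.sg")]))
```

In this sketch a wrong tag for arida silently yields ārida instead of āridā, which is why the quality of the tagger, and of the treebank behind it, bounds the quality of the macronization.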

In his BA thesis, Winge tested three different taggers: HunPos, RFTagger, and MATE. He had the most success with RFTagger using the LDT, cross-checked against PROIEL; this combination macronized texts with 95–98% accuracy. As far as I can tell, the public version of the macronizer does not cross-check against the PROIEL treebank. It is also hard to say how well that percentage would hold up with atypical texts outside of the LDT and PROIEL treebanks. Using a treebank with a larger and more varied corpus could alleviate this issue. Two possibilities are the Index Thomisticus, which is based on upwards of 11 million words from the writings of Thomas Aquinas, and the treebank of the Laboratoire d’Analyse Statistique des Langues Anciennes (LASLA), which contains over 1.5 million words from texts in the traditional canon of authors. For those interested in diving down this rabbit hole, the Universal Dependencies project presents a great deal of background information on dependency treebanks in many languages. Likewise, those interested in dependency treebanks for Latin and Greek should check out the Alpheios Project, an offshoot of the Perseus Project, which offers online tagging tools and training.
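Readers who want to see how a macronizer holds up on their own texts can check its output against a hand-corrected version. The Python sketch below assumes a simple word-level measure (a word counts as correct only if every macron matches); this is an assumption for illustration and not necessarily the measure behind the figures in the thesis. The function names and the sample "predicted" line are invented for the example.

```python
import unicodedata

MACRON = "\u0304"  # Unicode combining macron


def strip_macrons(text):
    """Remove macrons so two versions of a word can be aligned."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(MACRON, ""))


def word_accuracy(predicted, gold):
    """Share of words whose macrons exactly match the gold text."""
    pred_words = unicodedata.normalize("NFC", predicted).split()
    gold_words = unicodedata.normalize("NFC", gold).split()
    if not gold_words:
        raise ValueError("gold text is empty")
    if len(pred_words) != len(gold_words):
        raise ValueError("texts have different numbers of words")
    correct = 0
    for pred, ref in zip(pred_words, gold_words):
        # Both versions must be the same word once macrons are removed.
        if strip_macrons(pred) != strip_macrons(ref):
            raise ValueError(f"mismatched words: {pred!r} vs {ref!r}")
        if pred == ref:
            correct += 1
    return correct / len(gold_words)


# Made-up prediction for Catullus 1.2, with the ablative ending left short.
predicted = "ārida modo pūmice expolītum"
gold = "āridā modo pūmice expolītum"
print(word_accuracy(predicted, gold))  # 0.75
```

A per-vowel measure would give a higher figure for the same line, so any comparison of accuracy claims should first establish which unit is being counted.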

As with all such tools, the success and accuracy of a project depend as much (or more) on the level of engagement and continuing effort from the user community as on the original developer. Winge has offered us an exceptional bit of code and a very interesting application of digital infrastructure upon which he and we can build. Anyone developing a text that should be macronized can check their efforts with the macronizer, or use it first in order to get a good jump on some very painstaking work. It cannot, however, be used without careful editing and attention.

All source code may be found under a Free Software Foundation license on GitHub.

Metadata:

TITLE: The Latin Macronizer

DESCRIPTION: An automatic macronizer for classical Latin texts.

URL: http://stp.lingfil.uu.se/~winge/macronizer/index.py

NAME: Winge, Johan

PUBLISHER: [none]

PLACE: Uppsala, Sweden

COLLECTION TITLE: [none]

DATE CREATED: copyright 2015

DATE ACCESSED: August 2016

AVAILABILITY: Free

RIGHTS: copyright 2007 Free Software Foundation http://fsf.org

CLASSIFICATION: databases, digitization, language learning tools, language processing, Latin, linked open data.

(Header image: Detail from fol. 1r, De Bono, by Albertus Magnus. Codex 1024, Cologne, Library of the Dome. Image via Wikimedia Commons. Public Domain, {{PD-1996}}.)

Peter Anderson is a Professor of Classics at Grand Valley State University in Michigan. After a BA and MA in Ottawa, Canada, he moved to Cincinnati, where he completed a PhD in Greek and Latin Philology at the University of Cincinnati. His work focuses on language pedagogy and early Imperial Latin authors. Recent publications include translations of Seneca’s dialogues and consolations with Hackett Publishing. His current projects focus on Roman Stoicism and Latin translations of Marcus Aurelius. anderspe@gvsu.edu

© 2018, Society for Classical Studies