Digital Rescue: Transkribus as a tool saving Wüst’s Lexicon Aristophaneum (ca. 1910) from oblivion

Jeff Rusten and Ethan Della Rocca, Cornell University

This paper demonstrates the utility of Transkribus for creating machine learning models that can transcribe handwritten documents written in multiple languages, both ancient and modern, and in multiple scripts. Our recent success in digitizing Wüst’s Lexicon Aristophaneum (handwritten in German, Latin, and Ancient Greek) serves as the case study.

Aristophanes' language is unique in preserved classical Greek: it combines high poetry and intellectual terminology with colloquialisms, terms from daily life, and obscenities, a lexical corpus already studied in antiquity. There are 17th-century word-lists and a good index, but there seemed to be no adequate lexicon until we discovered the microfiche images of Ernst Wüst's unpublished handwritten manuscript, composed circa 1910 (Wüst, E., Lexicon Aristophaneum: ein handschriftliches Spezialwörterbuch zu den Komödien des Aristophanes, Berlin: K.G. Saur, 1984). It was never reviewed or even cited, and in the 1980s it was difficult to access, much less to read. With the help of local library equipment and staff, we created high-resolution images of the microfiches and began the process of digitizing the text.

A particular challenge for Americans in transcribing the images into a text format was the German Kurrent script. However, we discovered that the Transkribus project, which has existed in its current form only since 2019, is a superior tool for optical character recognition (OCR) and handwritten text recognition (HTR), offering several advantages over Tesseract and other software. Transkribus:

-- Provides OCR/HTR tools for both handwritten and printed text;

-- Applies OCR/HTR accurately to documents written in multiple languages and alphabets (ancient and modern, including right-to-left scripts);

-- Allows users to train custom character recognition models for specific scripts and hands with neural networks;

-- Publishes past users’ models for new users to use and improve;

-- Is accessible to non-specialists;

-- Maintains two feature-rich collaborative platforms for transcribing the training pages.

Over the course of six weeks, drawing on 50 pages of training data (first run through a pre-existing German Kurrent transcription model, then hand-transcribed by a team that included undergraduates), we were able to create our own model specific to Wüst’s hand using Transkribus’ machine learning tools. This new model initially had a character error rate of 11.20%, which has now been reduced to 7.21%, a relative reduction of about 36%.
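
For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate is commonly computed: the edit (Levenshtein) distance between a reference transcription and the model output, divided by the length of the reference. It is not Transkribus' own implementation, and the function names and the choice of NFC normalization (so that polytonic Greek letters with diacritics are compared consistently) are assumptions made for illustration. The last function reproduces the arithmetic behind the relative reduction from 11.20% to 7.21%.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a fraction of the reference length.
    Assumption: both strings are NFC-normalized before comparison."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement between two error rates, as a fraction."""
    return (before - after) / before


# Figures reported in the abstract: 11.20% initially, 7.21% now.
print(f"{relative_reduction(11.20, 7.21):.1%}")
```

Under this definition a single wrong character in a four-character reference gives a CER of 0.25, so a rate of 7.21% corresponds to roughly seven character errors per hundred characters of reference text.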

Our project will make available a detailed, comprehensive, and erudite lexicon that would otherwise be consigned to oblivion. Furthermore, online publication of this 1500-page text will allow us to include updates encompassing the past century of scholarly work, and link each citation in every entry with its Aristophanic context.

The presentation will display samples showing the Lexicon’s high quality and screenshots of the transcription process, and will conclude with a list of future steps, such as: creating a TEI-compliant XML version of the transcription; matching lemmas to LSJ; adding English translations, word-families, semantic categories, and an updated bibliography; and linking the lexicon to our online lemmatized text of Aristophanes (from Perseus).
