Jerome McGann (2014, 210) has referred to the Sachphilologie of 19th-century scholars like August Boeckh as “object-oriented philology.” In the world of computer programming, though, “object-oriented” has its own technical meaning, namely the practice of formalizing sequences of code in such a way as to promote the separation of functionality within programs and ensure that code is modular and reusable. In this paper, I discuss the Classical Language Toolkit (CLTK), an open-source platform dedicated to natural language processing support for historical languages. I consider how recent development strategies at the CLTK aim to reimagine comparative philology in terms of software design and, in particular, how the project works to reconnect threads of philological research that are now often separated at an institutional level by disciplinary boundaries (e.g., the study of classical Greek and Latin philology in a Classics department but classical Chinese philology in Asian Studies).

The CLTK began in practice as a series of measures designed to fill the resource gaps that commonly used NLP platforms like the Natural Language Toolkit (NLTK) present for historical-language researchers. For example, out-of-the-box solutions for extracting sentences from plaintext (i.e., sentence tokenization) often rely on models tuned for modern languages, and English in particular, whereas the CLTK offers its audience language-specific sentence tokenizers for Latin, Greek, and a growing number of other historical languages. But as the platform has matured and become more complex, especially because of the increasing number and growing diversity of the languages supported, it has become inefficient to maintain individual and isolated development tracks for each language. In response, the development team has been shifting toward an object-oriented approach to language. Abstract, language-independent blocks of code (that is, classes) are written at the module level and can in turn be inherited by language-specific subclasses. For example, the Tokenize module contains the class “SentenceTokenizer,” while the Latin submodule of Tokenize defines a class called “LatinSentenceTokenizer,” which inherits as much code as is useful from its parent class and supplements it with Latin-specific customizations as necessary.
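
The inheritance pattern described above can be illustrated with a minimal Python sketch. Only the two class names come from the text; the method names, the regular expression, and the short list of Latin praenomen abbreviations are assumptions made for illustration, and the code should not be read as the CLTK's actual implementation.

```python
import re


class SentenceTokenizer:
    """Language-independent base class (illustrative sketch only)."""

    # Sentence-final punctuation; a subclass may override this.
    terminators = ".?!"
    # Abbreviations after which a period does not end a sentence.
    abbreviations = frozenset()

    def tokenize(self, text):
        """Split text at a terminator followed by whitespace."""
        text = text.strip()
        if not text:
            return []
        pattern = r"(?<=[{}])\s+".format(re.escape(self.terminators))
        sentences, buffer = [], ""
        for piece in re.split(pattern, text):
            # Re-join a piece with what precedes it if the split was spurious.
            buffer = f"{buffer} {piece}" if buffer else piece
            if not self._ends_with_abbreviation(buffer):
                sentences.append(buffer)
                buffer = ""
        if buffer:
            sentences.append(buffer)
        return sentences

    def _ends_with_abbreviation(self, chunk):
        last_word = chunk.split()[-1].rstrip(".")
        return chunk.endswith(".") and last_word in self.abbreviations


class LatinSentenceTokenizer(SentenceTokenizer):
    """Inherits the splitting logic; adds only Latin-specific data."""

    # A few praenomen abbreviations, chosen for illustration.
    abbreviations = frozenset({"C", "Cn", "L", "M", "P", "Q", "T"})


text = "Gallia est omnis divisa in partes tres. M. Tullius Cicero consul fuit. Quid agis?"
generic = SentenceTokenizer().tokenize(text)
latin = LatinSentenceTokenizer().tokenize(text)
```

In this sketch the generic tokenizer wrongly breaks after “M.”, while the Latin subclass keeps “M. Tullius Cicero consul fuit.” together without redefining any of the splitting logic. The language-specific knowledge lives entirely in the subclass.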

For this paper, I look at two CLTK modules in particular: (1) Stop, for the creation and distribution of stoplists; and (2) Tokenize, for splitting strings of text into sentences and words. Stop development shows how defining a consistent workflow can be leveraged to create more uniform resources across a large number of languages, while Tokenize development demonstrates the benefits of writing reusable code and provides an example of how the “single responsibility principle” (Martin 2003), a core idea of object-oriented programming, can be adapted to philological customization by language.
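
One way to picture a consistent stoplist workflow is the sketch below. It assumes a simple frequency-based method (normalize, tokenize, count, keep the most frequent words); the text does not say which method the Stop module uses, and all class and method names here are hypothetical. Each method has a single job, so a language-specific subclass needs to change only the step that differs.

```python
from collections import Counter


class StoplistBuilder:
    """Shared workflow: normalize, tokenize, count, keep the top words."""

    def normalize(self, text):
        return text.lower()

    def tokenize(self, text):
        return text.split()

    def build(self, texts, size=10):
        counts = Counter()
        for text in texts:
            counts.update(self.tokenize(self.normalize(text)))
        # Descending frequency, then alphabetical order for stable output.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return [word for word, _ in ranked[:size]]


class LatinStoplistBuilder(StoplistBuilder):
    """Overrides only normalization; the rest of the workflow is inherited."""

    def normalize(self, text):
        # Assumed Latin-specific step: merge u/v and i/j, drop punctuation.
        text = super().normalize(text)
        text = text.replace("v", "u").replace("j", "i")
        return "".join(ch for ch in text if ch.isalpha() or ch.isspace())


texts = ["et in Arcadia ego", "Et tu, Brute", "in vino veritas et in aqua sanitas"]
stoplist = LatinStoplistBuilder().build(texts, size=2)
```

Because every language passes through the same `build` step, the resulting lists are produced and ranked in the same way, which is the kind of uniformity a shared workflow is meant to provide.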

Through the use of module-level classes and language-specific inheritance (that is, through an object-oriented approach to philological work), the CLTK is laying the groundwork for an integrated platform for comparative philological work on historical languages. This is in keeping with related trends in philological infrastructure. Federico Boschetti and Angelo Marco del Grosso (2014/2015, 3.3), for example, have shown how object-oriented design can be used in the collaborative context of creating digital scholarly editions to “provide complex but recurrent functionality [and...] to encapsulate functionality and data inside an efficient and flexible collection of classes.” For the CLTK, these design choices reflect not only functional separation but also linguistic separation, aspiring to a new kind of “procedural originality” (Turner 2014, 99) in the area of comparative philology. In addition, these choices follow the guidance of Crane et al. (2009) that classicists “design their cyberinfrastructure...to be as portable as possible across multiple languages.” In so doing, the CLTK aims to build design patterns that reconnect it to comparative philological work and open up new directions within it.