Patrick Burns
Jerome McGann (2014, 210) has referred to the Sachphilologie of 19th-century scholars like August
Boeckh as “object-oriented philology.” In the world of computer programming, though, “object-oriented”
has its own technical meaning, namely the practice of organizing code into classes and objects in such
a way as to promote the separation of functionality within programs and ensure that code is modular
and reusable. In this paper, I discuss the Classical Language Toolkit (CLTK), an open-source platform dedicated
to natural language processing (NLP) support for historical languages, and the way in which recent
development strategies at the CLTK aim to reimagine comparative philology in terms of software
design. In particular, I show how the project works to reconnect threads of philological research that are now often
separated at an institutional level by disciplinary boundaries (e.g. the study of classical Greek and Latin
philology in a Classics department but classical Chinese philology in Asian Studies).
The CLTK began in practice as a series of measures designed to fill in resource gaps that commonly
used NLP platforms like the Natural Language Toolkit present for historical-language researchers. For
example, out-of-the-box solutions for extracting sentences from plaintext (i.e. sentence tokenization)
often rely on models tuned for modern languages, and English in particular, whereas the CLTK offers
its audience language-specific sentence tokenizers for Latin, Greek, and a growing number of other
historical languages. But as the platform has matured, and as it has become more complex (especially
because of the increasing number and growing diversity of the languages supported), it has become
inefficient to support individual and isolated development tracks for each language. In response, the
development team has been shifting toward an object-oriented approach to language. Abstract,
language-independent blocks of code, or classes, are written at the module level and can
then be inherited by language-specific subclasses. For example, the Tokenize module contains the class
“SentenceTokenizer,” while the Latin submodule of Tokenize defines a class called
“LatinSentenceTokenizer,” which inherits as much code as is useful from its parent class and
supplements it with Latin-specific customizations as necessary.
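The inheritance pattern described above can be sketched in Python as follows. This is a minimal illustration and not the CLTK source code: only the two class names come from the text, while the attributes `sentence_end` and `abbreviations`, the `tokenize` method, and the short list of Latin praenomen abbreviations are assumptions made for the example.

```python
import re


class SentenceTokenizer:
    """Language-independent parent class (illustrative sketch, not CLTK code)."""

    # Hypothetical attributes: sentence-final punctuation and abbreviations
    # that end in a period without ending a sentence. Subclasses override them.
    sentence_end = ".?!"
    abbreviations = frozenset()

    def tokenize(self, text):
        """Split text after sentence-final punctuation followed by whitespace."""
        pattern = r"(?<=[{}])\s+".format(re.escape(self.sentence_end))
        sentences, buffer = [], ""
        for piece in re.split(pattern, text.strip()):
            if not piece:
                continue
            # Rejoin with a single space if the previous piece ended in an abbreviation.
            candidate = f"{buffer} {piece}" if buffer else piece
            if candidate.split()[-1] in self.abbreviations:
                buffer = candidate  # not a real boundary; keep accumulating
            else:
                sentences.append(candidate)
                buffer = ""
        if buffer:
            sentences.append(buffer)
        return sentences


class LatinSentenceTokenizer(SentenceTokenizer):
    """Inherits all splitting logic; supplies only Latin-specific data."""

    # A few praenomen abbreviations, chosen for illustration only.
    abbreviations = frozenset({"M.", "C.", "L.", "Q.", "P.", "Cn.", "Ti."})


text = "M. Tullius Cicero consul fuit. Quis hoc nescit?"
print(SentenceTokenizer().tokenize(text))
# ['M.', 'Tullius Cicero consul fuit.', 'Quis hoc nescit?']
print(LatinSentenceTokenizer().tokenize(text))
# ['M. Tullius Cicero consul fuit.', 'Quis hoc nescit?']
```

In this sketch the subclass contains no logic of its own. The splitting procedure lives once in the parent class, and the Latin class contributes only the linguistic knowledge that distinguishes it, so support for a further language would in principle require only another small subclass.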
For this paper, I look at two CLTK modules in particular: (1) Stop, for the creation and distribution of
stoplists; and (2) Tokenize, for splitting strings of text into sentences and words. Stop development
shows how defining a consistent workflow can be leveraged to create more uniform resources across
a large number of languages, while Tokenize development demonstrates the benefits of writing reusable
code and provides an example of how the “single responsibility principle” (Martin 2003), a core idea
of object-oriented programming, can be adapted to philological customization by language.
Through the use of module-level classes and language-specific inheritance (that is, through an object-oriented
approach to philological work), the CLTK is laying the groundwork for building an integrated
platform for comparative philological work on historical languages. This is in keeping with related
trends in philological infrastructure. Federico Boschetti and Angelo Marco del Grosso (2014/2015,
3.3), for example, have shown how object-oriented design can be used in the collaborative context of
creating digital scholarly editions to “provide complex but recurrent functionality [and...] to
encapsulate functionality and data inside an efficient and flexible collection of classes.” For the CLTK,
these design choices reflect not only functional separation but also linguistic separation, aspiring to a
new kind of “procedural originality” (Turner 2014, 99) in the area of comparative philology. In
addition, these choices follow the guidance of Crane et al. (2009) that classicists “design their
cyberinfrastructure...to be as portable as possible across multiple languages.” In so doing, the CLTK
aims at building design patterns that reconnect it to comparative philological work and open up new
directions within it.