A Stylometric Analysis of Latin Literary Genre | Society for Classical Studies

Thomas J. Bolt, Pramit Chaudhuri, and Joseph Dexter

This paper introduces a quantitative method for analyzing genre in Latin literature. Using computational techniques drawn from machine learning, we show how traditional generic categories, such as epic or oratory, possess distinctive stylistic signatures reflected in grammatical and syntactic preferences. The paper describes the set of twenty-six stylometric features used in the study, which encompass pronouns, superlatives, and markers of subordination, among others. We report methods for (1) computing these features for the extant classical Latin literary corpus, (2) distinguishing genres based on the stylometric data, and (3) identifying the salient features most characteristic of each genre. The resulting profiles offer a multidimensional portrait of the stylistic tendencies typical of the major Latin genres (drama, elegy, and epic for verse; epistolography, historiography, oratory, philosophy, and technical treatise for prose).

Recent work in computational literary studies has shown that taking the coherence of genre as a working assumption can enable productive lines of interpretation and investigation of cross-temporal trends (Moretti 2013, Jockers 2013, Wilkens 2016, Underwood 2019). Yet while genre is a central concern of literary criticism, especially within Classics, there is little consensus regarding specific generic definitions, or even regarding how much interpretative weight the concept should bear, given the fluidity of generic boundaries (Harrison 2007, Derrida 1980). To the extent that concepts of genre cohere, they do so through relationships of meter, diction, theme, and reception. Small-scale features that occur frequently provide potentially new evidence about generic style but are prohibitively difficult to count without computation. We incorporate stylometric features long considered important to Latin literary style, such as atque followed by a consonant (Adams 1972, Adams et al. 2005), as well as markers drawn from computational studies of English genres (Jockers 2013).

Our Latin corpus was originally digitized by the Perseus Project and comprises 206 works (Crane 1996). We calculate the frequency of each of our twenty-six features across the entire corpus, resulting in a high-dimensional stylometric profile. The feature calculation methods vary from exact (e.g., counts of prepositions) to partial or selective (e.g., superlative adjectives). In describing these methods, we clarify the importance of keying data quality to the type of research question: relatively coarse corpus-wide questions allow for variations in the quality of data that may not be acceptable for a fine-grained study of one text.
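
The contrast between exact and selective feature calculation can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not give the rules used in the study, so the preposition list, the treatment of every non-vowel letter as a consonant, the "-issim-" test for superlatives, and the names `feature_frequencies` and `sample` are all hypothetical.

```python
import re

# Illustrative, incomplete list; not the authors' actual inventory.
PREPOSITIONS = {"ab", "ad", "de", "ex", "in", "per", "pro", "sub"}
VOWELS = "aeiouy"


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def feature_frequencies(text):
    """Return the per-token relative frequency of three toy features."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {"preposition": 0.0, "atque_consonant": 0.0, "superlative": 0.0}

    # Exact: a closed word list can be matched token by token.
    prepositions = sum(1 for t in tokens if t in PREPOSITIONS)

    # atque followed by a word that does not begin with a vowel.
    # Simplification: consonantal i/u and initial h are not treated specially.
    atque_consonant = sum(
        1
        for current, following in zip(tokens, tokens[1:])
        if current == "atque" and following[0] not in VOWELS
    )

    # Selective: catches regular -issim- forms only and misses
    # irregular superlatives such as "optimus" or "maximus".
    superlatives = sum(1 for t in tokens if "issim" in t)

    return {
        "preposition": prepositions / total,
        "atque_consonant": atque_consonant / total,
        "superlative": superlatives / total,
    }


# Invented toy sentence, not a quotation from the corpus.
sample = "Atque tandem in urbem venit atque omnes fortissimi ad forum"
profile = feature_frequencies(sample)
print(profile)
```

The superlative rule shows why a selective count may be adequate for coarse corpus-wide comparison yet too noisy for a close study of one text: it systematically undercounts, but it does so in the same way for every work.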

We then use supervised machine learning to predict genre based on style. In our experiment, an algorithm is given a subset of texts labeled by genre and learns to identify associations between our quantitative stylistic data and the genre labels. Afterwards, the algorithm predicts the genre of the remaining texts (i.e., those not used for training) based solely on the twenty-six stylometric features. With five-fold cross-validation, we achieve greater than 85% accuracy in genre identification across the whole corpus. By considering all features in combination, this approach enables construction of multifaceted stylistic profiles of the major Latin genres. To aid in the interpretation of this data, we use statistical feature ranking to identify the characteristics that best distinguish each genre from the rest of the corpus. To take one example, the top feature for historiography is a high frequency of conditional sentences, which speaks to the genre’s tolerance for hypotaxis, while the second feature (frequency of iste) perhaps highlights the importance of internal speeches.
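
The train-and-predict procedure and the feature ranking can be sketched in Python under stated assumptions. The abstract names neither the learning algorithm nor the ranking statistic, so a nearest-centroid classifier and a simple difference of means stand in for both here. The data are synthetic, and the feature names, labels, and function names are hypothetical.

```python
import random
from collections import defaultdict


def centroid(rows):
    """Mean feature vector of a list of equal-length rows."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def train(X, y):
    """Compute one centroid per genre label from the training texts."""
    by_label = defaultdict(list)
    for row, label in zip(X, y):
        by_label[label].append(row)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, row):
    """Return the label whose centroid is nearest (squared Euclidean distance)."""
    def distance(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(model, key=lambda label: distance(model[label]))


def cross_validate(X, y, k=5, seed=0):
    """Estimate accuracy: each text is predicted by a model that never saw it."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)


def rank_features(X, y, genre, names):
    """Rank features by how far the genre's mean lies from the rest of the corpus."""
    inside = centroid([row for row, label in zip(X, y) if label == genre])
    outside = centroid([row for row, label in zip(X, y) if label != genre])
    differences = [(name, a - b) for name, a, b in zip(names, inside, outside)]
    return sorted(differences, key=lambda pair: abs(pair[1]), reverse=True)


# Synthetic, well-separated toy data: 10 "epic" and 10 "oratory" texts.
names = ["conditionals", "iste", "superlatives"]
X = [[0.2 + 0.001 * i, 0.8, 0.5] for i in range(10)]
X += [[0.8, 0.3 + 0.001 * i, 0.5] for i in range(10)]
y = ["epic"] * 10 + ["oratory"] * 10

accuracy = cross_validate(X, y)
ranking = rank_features(X, y, "epic", names)
print(accuracy, [name for name, _ in ranking])
```

Because every text is held out exactly once, the accuracy reflects predictions on unseen texts only. A real application would also need to scale features to comparable ranges before measuring distances or differences, a step omitted here for brevity.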

The paper thus demonstrates computation’s power to extract new, interpretively useful data from an intensively studied corpus of Latin literature. Although classicists have long used computation for authorship attribution and linguistic analysis (Marriott 1979, Stover et al. 2015, McGillivray and Jenset 2018), large-scale literary studies of the sort undertaken for English and other modern languages have not been attempted for many premodern traditions. The methods introduced in this paper are extensible to Ancient Greek and other languages. Within the study of Latin literature, the research paves the way for additional fine-grained analyses of works or of trends over time.
