Genre is a bit of a puzzle: it is generally agreed that certain authors and works can be grouped together, and the groupings of ancient texts into genres are rarely contentious, but a single, authoritative definition of genre remains elusive. This paper presents a computational model of genre based on shared, co-occurring vocabulary as a complement to existing approaches. Using Greek historiography as an example, this paper argues that there is a strong, direct link between vocabulary and genre, and that a model based on vocabulary is useful for investigating issues of genre.
This paper comprises four parts. First, I review literary and audience-based models of genre, focusing on features of historiography. Most models of genre operate in a literary framework, recognizing a set of internal expectations and external rationalizations that shape and separate a genre. For example, Sancisi-Weerdenburg (1999) distinguishes Greek historiography from other historical approaches by its focus on causality and its narrative structure. While literary models typically operate on the scale of entire works, some recent scholars (e.g. Kraus 2013) discuss the genre and subgenres of historiography at the section level. This finer-grained analysis is useful, but as a way of building a model of genre it does not scale efficiently.
Second, I discuss my development of a computational model of genre built to distinguish Greek historiography from other Greek prose genres. Because it is built on vocabulary, this model is able to operate at the section level, independent of the larger-scale features and patterns that characterize literary models of genre. The model produces c. 95% agreement with the standard classification. In other words, it can predict the genre of a single section of text (on average under 50 words) from vocabulary usage alone, with no knowledge of the section's larger literary context.
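To make the idea of section-level classification from vocabulary concrete, a minimal sketch follows. It assumes a multinomial naive Bayes classifier with add-one smoothing, which is an illustrative choice and not necessarily the algorithm behind the model described here. The function names `train` and `predict`, the two labels, and the transliterated toy tokens are all invented for the example and are not drawn from the corpus.

```python
import math
from collections import Counter, defaultdict

def train(sections):
    """Count words per genre. `sections` is a list of (tokens, genre) pairs."""
    word_counts = defaultdict(Counter)   # genre -> word -> frequency
    genre_counts = Counter()             # genre -> number of training sections
    for tokens, genre in sections:
        genre_counts[genre] += 1
        word_counts[genre].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, genre_counts, vocab

def predict(tokens, model):
    """Return the genre with the highest smoothed log-probability."""
    word_counts, genre_counts, vocab = model
    total_sections = sum(genre_counts.values())
    best_genre, best_score = None, -math.inf
    for genre in genre_counts:
        # Prior: share of training sections carrying this label.
        score = math.log(genre_counts[genre] / total_sections)
        total_words = sum(word_counts[genre].values())
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log(
                    (word_counts[genre][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre

# Toy, hypothetical training sections (invented for illustration only).
TRAIN = [
    (["polemos", "strategos", "naus"], "historiography"),
    (["polemos", "polis", "stratia"], "historiography"),
    (["psyche", "arete", "eidos"], "other"),
    (["arete", "logos", "psyche"], "other"),
]

model = train(TRAIN)
prediction = predict(["strategos", "naus", "polis"], model)
print(prediction)
```

The classifier sees only the bag of words in a section, which mirrors the claim that no knowledge of the wider literary context is required; a real model would of course be trained on a large set of labeled sections.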
Third, I examine in detail the disagreements between the computational model and the literary models. I argue that some of the disagreements are due to authorial genre-bending. (For example, many sections of Thucydides that the computational model does not classify as historiography are speeches, e.g. 1.141.7 or 5.87.1.) In addition, I show that the level of agreement between the literary and computational models varies by author, even though these same authors were used to train the computational model.
Finally, I argue that this vocabulary-based, bottom-up method is highly complementary to traditional top-down definitions of genre. The computational model provides greater granularity in its classification, an opportunity for comparison between authors, and a measurement of the degree of a passage’s coherence with generic vocabulary usage. Literary models, in turn, provide the basis for the computational model; its predictions would not be possible without training on sections that have already been labeled. In addition, literary models supply an interpretive framework for understanding the computational model; they offer explanations for differences in vocabulary usage. I examine vocabulary that is positively and negatively correlated with Greek historiography to show the utility of this complementary method in approaching the puzzle of genre.
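One simple way to identify vocabulary that is positively or negatively associated with a genre is sketched below. It assumes a smoothed log-odds ratio as the measure of association, which is an illustrative stand-in and not necessarily the measure used in this paper. The function name `log_odds` and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def log_odds(target_tokens, other_tokens):
    """Smoothed log-odds of each word in the target genre versus the rest.

    Positive scores mark words relatively more frequent in the target genre,
    negative scores mark words relatively more frequent elsewhere.
    """
    target, other = Counter(target_tokens), Counter(other_tokens)
    vocab = set(target) | set(other)
    target_total, other_total = sum(target.values()), sum(other.values())
    v = len(vocab)
    scores = {}
    for w in vocab:
        # Add-one smoothing keeps both probabilities nonzero.
        p_target = (target[w] + 1) / (target_total + v)
        p_other = (other[w] + 1) / (other_total + v)
        scores[w] = math.log(p_target / p_other)
    return scores

# Toy, hypothetical token lists (invented for illustration only).
historiography = ["polemos", "polemos", "naus", "logos"]
other_prose = ["psyche", "arete", "logos", "psyche"]

scores = log_odds(historiography, other_prose)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Words at the top of `ranked` lean toward the target genre and words at the bottom lean away from it; interpreting why a given word falls where it does is the point at which the literary models come back in.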