Semantic Intertextual Search with Latin Word-Embedding Models

Joseph P. Dexter and Pramit Chaudhuri

This paper describes optimization of a computational method for representing semantic information in Latin texts and application of the method to identifying intertextual relationships of literary significance. The distributional hypothesis in linguistics holds that the meaning of a word can be inferred from the contexts in which it is used (Firth); the development of effective methods for computing distributional representations known as word embeddings has revolutionized natural language processing research over the past decade (Mikolov et al., Devlin et al.). We optimize a word embedding model for Latin and use that model to improve existing methods for intertextual search through incorporation of semantic matching.

The word2vec algorithm is among the most widely used approaches for computing word embeddings for English and other modern languages (Mikolov et al.). A key advantage of word2vec embeddings is that they encode semantic information and can be manipulated to deduce meaningful relationships, such as analogies. Although several word embedding models have been released for Latin, they were developed as part of large-scale or multi-language projects and were not subjected to further optimization (Bamman and Crane, Bojanowski et al., Devlin et al.). We find that these models struggle to identify Latin synonyms, a standard evaluation task (cf. Mikolov et al.). In light of these limitations, we train a new word2vec model on a large corpus of Latin literature and perform a number of language-specific optimizations, the most important of which is lemmatization of inflected forms prior to training. Together, these improvements increase the accuracy of synonym detection more than fivefold.
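The two ideas in this paragraph, lemmatizing before training and scoring a model on synonym detection, can be illustrated with a minimal Python sketch. It does not train word2vec; the three-dimensional vectors, the lemma table, the synonym pairs, and all function names are invented for illustration, and the nearest-neighbor scoring rule is an assumption about how such an evaluation might be set up rather than the procedure used in the paper.

```python
import math

# Hypothetical lookup from inflected forms to lemmata (illustrative only).
TOY_LEMMAS = {"ensem": "ensis", "gladio": "gladius", "maria": "mare"}

# Made-up 3-dimensional "embeddings"; a trained model would supply real vectors.
TOY_EMB = {
    "ensis": [0.9, 0.1, 0.0],
    "gladius": [0.8, 0.2, 0.1],
    "mare": [0.0, 0.9, 0.3],
    "pontus": [0.1, 0.8, 0.4],
    "amor": [0.2, 0.1, 0.9],
}

# Hypothetical evaluation pairs; the last one is deliberately not a synonym pair.
TOY_PAIRS = [("ensis", "gladius"), ("mare", "pontus"), ("amor", "ensis")]


def lemmatize(tokens, table):
    """Lowercase each token and map it to its lemma when the table knows it."""
    return [table.get(t.lower(), t.lower()) for t in tokens]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(word, emb, k):
    """Return the k words whose vectors are most similar to that of `word`."""
    scored = [(cosine(emb[word], vec), w) for w, vec in emb.items() if w != word]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [w for _, w in scored[:k]]


def synonym_accuracy(pairs, emb, k=1):
    """Fraction of pairs (a, b) for which b is among the k nearest neighbors of a."""
    usable = [(a, b) for a, b in pairs if a in emb and b in emb]
    if not usable:
        return 0.0
    hits = sum(1 for a, b in usable if b in nearest(a, emb, k))
    return hits / len(usable)


if __name__ == "__main__":
    print(lemmatize(["Ensem", "maria", "amor"], TOY_LEMMAS))
    print(synonym_accuracy(TOY_PAIRS, TOY_EMB, k=1))
```

Because every inflected form of a word is mapped to one lemma before training, the contexts of all its forms contribute to a single vector instead of being split across many sparse ones, which is one plausible reason lemmatization matters for a highly inflected language.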

In recent years, digital methods have had a major impact on the study of Latin intertextuality (Forstall and Scheirer), and computational tools for intertextual search have been developed by the Tesserae Project, Musisque Deoque, and the Quantitative Criticism Lab (Coffee et al., Chaudhuri and Dexter). Existing search tools rely on exact or inexact lexical matching of related phrases and are not sensitive to similarities in meaning alone (cf. Forstall and Scheirer: 79-104). We devise an alternative method of semantic search that ranks intertexts according to the similarity of the embeddings of the constituent words.
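A minimal sketch of what ranking by embedding similarity could look like follows. The abstract does not specify the scoring function, so the rule used here (for each query lemma, take its best cosine match in the candidate phrase, then average) is an assumption, and the vectors, phrases, and function names are invented for illustration.

```python
import math

# Made-up 3-dimensional vectors standing in for a trained Latin embedding model.
TOY_EMB = {
    "saevus": [0.9, 0.1, 0.1],
    "ferus": [0.8, 0.2, 0.1],
    "mare": [0.1, 0.9, 0.2],
    "pontus": [0.2, 0.8, 0.3],
    "dulcis": [0.1, 0.2, 0.9],
    "amor": [0.2, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def phrase_similarity(query, candidate, emb):
    """Average, over query lemmata, of the best cosine match in the candidate.

    This scoring rule is an illustrative assumption, not the paper's method.
    Lemmata missing from the embedding table are ignored.
    """
    q = [w for w in query if w in emb]
    c = [w for w in candidate if w in emb]
    if not q or not c:
        return 0.0
    return sum(max(cosine(emb[a], emb[b]) for b in c) for a in q) / len(q)


def rank_candidates(query, candidates, emb):
    """Return (score, candidate) pairs sorted from most to least similar."""
    scored = [(phrase_similarity(query, c, emb), c) for c in candidates]
    return sorted(scored, key=lambda t: t[0], reverse=True)


if __name__ == "__main__":
    query = ["saevus", "mare"]
    candidates = [["dulcis", "amor"], ["ferus", "pontus"]]
    for score, phrase in rank_candidates(query, candidates, TOY_EMB):
        print(round(score, 3), phrase)
```

In this toy example the top-ranked candidate shares no lemma with the query, which is the kind of match that purely lexical search cannot surface.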

To evaluate the effectiveness of our approach, we assemble a database of more than 1,200 intertextual parallels between Book 1 of Valerius Flaccus’ Argonautica and earlier Latin verse that were noted in the substantial commentaries of Kleywegt, Spaltenstein, and Zissos. We find that our method in combination with Tesserae and Fīlum performs better at identifying parallels in the dataset than any single tool, suggesting that integration of lexical and semantic information improves intertext detection; combined methods recover more than 80% of known parallels with reasonable specificity. In addition, machine learning classification with embeddings identifies sections of concentrated intertextuality in Book 1, such as the divine council described at 1.503-73, which is modeled closely on the scene of Venus and Jupiter at Aeneid 1.223-96 (Zissos: 305). Overall, our results highlight the promise of integrating powerful methods from natural language processing into research on Latin intertextuality.
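The comparison of single tools against their combination can be sketched as a recall calculation over sets of parallels. The identifiers and tool outputs below are invented placeholders, not data from the study, and combining tools by set union is an assumption about what "in combination" means here. Specificity is not modeled, since it would also require a set of candidate pairs known not to be parallels.

```python
# Hypothetical gold standard: ten commentary-attested parallels, identified by ID.
GOLD = set(range(1, 11))

# Hypothetical outputs of three search tools (IDs 11 and 12 are false positives).
TOOL_RESULTS = {
    "lexical_a": {1, 2, 3, 4, 11},
    "lexical_b": {3, 4, 5, 6},
    "semantic": {6, 7, 8, 9, 12},
}


def recall(found, gold):
    """Fraction of gold-standard parallels that appear in `found`."""
    if not gold:
        return 0.0
    return len(found & gold) / len(gold)


def combined(results):
    """Union of all tools' outputs (one simple way to combine them)."""
    merged = set()
    for found in results.values():
        merged |= found
    return merged


if __name__ == "__main__":
    for name, found in TOOL_RESULTS.items():
        print(name, recall(found, GOLD))
    print("combined", recall(combined(TOOL_RESULTS), GOLD))
```

With these placeholder sets each tool alone recovers a minority of the gold parallels, while the union recovers most of them because the tools miss different items.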
