
Many fragments of Greek historians and other prose writers survive only through reuse by later sources. Athenaeus, for example, is the source of thousands of otherwise lost passages from hundreds of authors. Yet a casual reading cannot determine what is directly quoted, what is paraphrased, and what has been significantly altered by the quoting writer. My ultimate aim is to use stylometric investigations and macro data analyses to distinguish reused text from cover text, by developing methods to generate and exploit corpora of syntactic data on ancient Greek. Syntactic annotation opens an approach that is robust against shared vocabulary: the algorithm must discriminate among fragmentary treatments by different authors who write on the same topic and therefore use many of the same words. I am using computer analysis of syntax, first, to distinguish patterns of usage by known authors such as Herodotus and Polybius. Once I can discriminate between known authors, I want to apply similar techniques to compare directly transmitted texts with epitomes and abridgements attributed to the same author (such as Polybius or Diodorus). Ultimately I hope to differentiate between cover text and fragments of lost authors, and to develop new evidence about the degree to which such fragments have been refashioned by the tradition that preserved them.

In this paper, I wish to discuss the first steps.  To begin, I have had to build a database of syntactically analyzed Greek prose.  The Prague tagset is an internationally recognized means of annotating a corpus of writings using morphological and syntactic labels.  This tagset uses dependency syntax rather than constituency syntax, both because dependency grammar is far simpler and because it easily allows for natural language processing in periodic sentence structure. I will discuss the process briefly and give an overview of the c. 150,000 tokens that I have personally treebanked to date (April 2015), using the new platform, Arethusa, and the Perseids/Perseus database of texts.

The second step is to extract syntactic information from this corpus in a usable form and identify computational techniques that will accurately distinguish between different known authors, thus establishing a proof of concept. One straightforward approach is to convert dependency relationships into “syntactic words” (sWords). To do this, one traces the dependency path from each token back to the sentence root and records the dependency label for each edge, so that every token in the sentence yields one sWord. As an example, sentence 1 of Athenaeus Book 12:

Ἄνθρωπος εἶναί μοι Κυρηναῖος δοκεῖς, κατὰ τὸν Ἀλέξιδος Τυνδάρεων (II 384 K), ἑταῖρε Τιμόκρατες· (“According to Alexis’ Tyndareus, you seem to me a man of Cyrene, friend Timocrates.”)

Using an XQuery transformation, the dependency tree of the sentence is translated into the following sWords (the labels are those of the standard Prague tagset):

<sentence document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1">

<sword>PNOM-OBJ-PRED#</sword>

<sword>OBJ-PRED#</sword>

<sword>OBJ-PRED#</sword>

<sword>ATR-PNOM-OBJ-PRED#</sword>

<sword>PRED#</sword>

<sword>AuxX-AuxP-PRED#</sword>

<sword>AuxP-PRED#</sword>

<sword>ATR-ADV-AuxP-PRED#</sword>

<sword>ATR-ADV-AuxP-PRED#</sword>

<sword>ADV-AuxP-PRED#</sword>

<sword>AuxX-AuxP-PRED#</sword>

<sword>ATR-ExD-PRED#</sword>

<sword>ExD-PRED#</sword>

<sword>AuxK#</sword>

</sentence>

(To explain: the first word, Ἄνθρωπος, is a predicate nominal (PNOM) of a word, εἶναί, which is itself the object (OBJ) of the sentence’s main verb, δοκεῖς (PRED). And so on for the dependency relationship of each word.)
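The path-tracing step can be illustrated with a short Python sketch. This is not the XQuery transformation used for the corpus; it is a minimal re-statement of the same idea under several assumptions: each token is represented as a dictionary with hypothetical keys `id`, `head`, and `relation`, a `head` of 0 marks the sentence root, and each sWord ends in `#` as in the output above. The head values for the five tokens shown are inferred from the sWords listed above, not taken from the treebank file.

```python
def build_swords(tokens):
    """Return one sWord per token: its dependency labels from the token up to the root."""
    by_id = {tok["id"]: tok for tok in tokens}
    swords = []
    for tok in tokens:
        labels = []
        seen = set()
        current = tok
        while current is not None:
            if current["id"] in seen:
                # A well-formed dependency tree has no cycles; fail loudly if one appears.
                raise ValueError(f"cycle detected at token {current['id']}")
            seen.add(current["id"])
            labels.append(current["relation"])
            # Assumption: head 0 (absent from by_id) marks the sentence root.
            current = by_id.get(current["head"])
        swords.append("-".join(labels) + "#")
    return swords


# First five tokens of the example sentence (heads inferred from the sWords above).
tokens = [
    {"id": 1, "form": "Ἄνθρωπος", "head": 2, "relation": "PNOM"},
    {"id": 2, "form": "εἶναί", "head": 5, "relation": "OBJ"},
    {"id": 3, "form": "μοι", "head": 5, "relation": "OBJ"},
    {"id": 4, "form": "Κυρηναῖος", "head": 1, "relation": "ATR"},
    {"id": 5, "form": "δοκεῖς", "head": 0, "relation": "PRED"},
]

swords = build_swords(tokens)
for tok, sword in zip(tokens, swords):
    print(tok["form"], sword)
```

Because every token contributes its own path, interior nodes such as εἶναί receive an sWord just as leaf nodes do, which is why the sentence above produces one sWord per token.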

A chief advantage of recasting dependencies as syntax words is that they are both human readable and suitable for immediate computational analysis: with trivial modifications such files can be put into standard text-processing software (for example, various packages for R) to produce type-token ratios, word frequency histograms, and other output that provides detailed syntactic information about individual authors.
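To make the idea concrete, here is a small sketch of how a type-token ratio and a frequency table might be computed from an sWord file. It uses Python's standard library rather than the R packages mentioned above, and the function name `sword_profile` is purely illustrative. The input is the single example sentence from Athenaeus, so the figures describe that one sentence only and say nothing about any author's style.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# The sWords of the example sentence, in the same XML shape as the output above.
xml_text = """<sentence document_id="urn:cts:greekLit:tlg0008.tlg001.perseus-grc1">
<sword>PNOM-OBJ-PRED#</sword>
<sword>OBJ-PRED#</sword>
<sword>OBJ-PRED#</sword>
<sword>ATR-PNOM-OBJ-PRED#</sword>
<sword>PRED#</sword>
<sword>AuxX-AuxP-PRED#</sword>
<sword>AuxP-PRED#</sword>
<sword>ATR-ADV-AuxP-PRED#</sword>
<sword>ATR-ADV-AuxP-PRED#</sword>
<sword>ADV-AuxP-PRED#</sword>
<sword>AuxX-AuxP-PRED#</sword>
<sword>ATR-ExD-PRED#</sword>
<sword>ExD-PRED#</sword>
<sword>AuxK#</sword>
</sentence>"""


def sword_profile(xml_text):
    """Return (frequency counts, type-token ratio) for the sWords in an XML string."""
    root = ET.fromstring(xml_text)
    swords = [el.text.strip() for el in root.iter("sword") if el.text]
    counts = Counter(swords)
    # Type-token ratio: distinct sWords divided by total sWords.
    ttr = len(counts) / len(swords) if swords else 0.0
    return counts, ttr


counts, ttr = sword_profile(xml_text)
print(f"tokens={sum(counts.values())} types={len(counts)} ttr={ttr:.3f}")
for sword, n in counts.most_common():
    print(n, sword)
```

The same counts could feed a histogram or serve as features for the classification experiments; treating each sWord as an opaque string is what lets ordinary word-frequency tools work on syntactic data without modification.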

When I analyze the corpus of syntax words using both supervised and unsupervised machine learning models of pattern recognition and prediction, I can show not only that different authors can be distinguished computationally on this basis with great accuracy, but also that the most characteristically favored and avoided syntactic constructions of each author can be identified. I will finish by suggesting some of the many and varied research opportunities that this new historiographical approach to Greek texts presents.