Open Greek and Latin: corpora, editions, and libraries

Gregory Crane

The Open Greek and Latin (OGL) Project addresses the need to develop open textual corpora that provide increasingly comprehensive coverage for Greek and Latin and support new forms of born-digital annotation, new practices of reading, new audiences for Greek and Latin, and new avenues of research.

First, OGL at Leipzig, Tufts and Mount Allison, and the First Thousand Years of Greek Project, which Harvard has funded and in which the University of Virginia has also participated, have produced open Greek and Latin corpora available under the Creative Commons licenses that are now standard in Digital Classics. As of February 2017, c. 30 million words of Ancient Greek and 37 million words of Classical Latin (almost two thirds of all Greek and Latin produced through c. 600 CE, together with a number of core later works such as Scholia and the Suda) have been digitized as initial TEI XML and are being reviewed and formatted to conform to the Canonical Text Services (CTS) Protocol. OGL has also set out to align each initial TEI XML transcription of one edition with as many other digitized editions as possible, including both TEI XML and uncorrected OCR-generated text, so that readers can assess the variation among the reconstructed texts of different editions. At the same time, the Perseus Digital Library has added to its collection of modern-language translations (more than 70 million words in English, German, French, and Italian), not only for human readers but also for automated alignment and new forms of semantic search (e.g., searching in English and retrieving words in Greek and Latin).

Second, OGL works within a framework, based upon mechanisms such as TEI XML and CTS, that can also include manually curated diplomatic transcriptions of particular manuscripts, together with critically chosen machine-actionable annotations that classify and select readings from those manuscripts. Here we build specifically upon the work of projects such as the Homer Multitext Project, the Digital Corpus of Literary Papyri and the Digital Latin Library.

Third, OGL has also developed a new generation of born-digital annotations that support new forms of reading and of large-scale analysis. These include the morpho-syntactic annotations published in the Ancient Greek and Latin Dependency Treebank; named entity annotations developed in conjunction with Pelagios Commons; annotations identifying the extent, source and nature of text reuse, where one text quotes, paraphrases or cites another; and born-digital translations that are aligned at the word and phrase level.

