Corpora release 2.6

We are pleased to announce release 2.6 of our corpora! Some exciting new things:

  • Expanded Coptic Old Testament
  • More gold-standard treebanked texts
  • Updated files of Shenoute’s Abraham Our Father and Acephalous Work 22
  • New metadata fields to indicate whether documents have been machine annotated or if an editor has reviewed the machine annotations

Expanded Coptic Old Testament

Our Coptic Old Testament corpus is updated and expanded, with digital text from the our partners at the Digital Edition of the Coptic Old Testament project in Goettingen.  All annotations in this corpora are fully machine-processed (no human editing, because it’s BIG). You can read through all the text in two different visualizations online and search it in the ANNIS database:

  1. analytic: the normalized text segmented into words aligned with part of speech tags; each verse is aligned with Brenton’s English translation of the Septuagint
  2. chapter: the normalized text presented as chapters and verses; each word links to the online Coptic dictionary
  3. ANNIS search: full search of text, lemmas, parts of speech, syntactic annotations, etc. (see our ANNIS tips if you’re new to ANNIS)

Please keep in mind this corpus is fully machine-annotated, and we currently do not have the capacity to make manual changes to a corpus of this size.  If you notice systemic errors (the same thing tagged incorrectly often, for example) please let us know.  Otherwise, please be patient: as the tools improve, we will update this corpus.

We’ve also machine-aligned the text with Brenton’s English translation of the Septuagint. It’s possible there will be some misalignments.  Thanks for your understanding!

Treebanks

We’ve added more documents to our separate gold-standard treebank corpus.  (Want to learn more about treebanks?) In this corpus, the treebank/syntactic annotations have been manually corrected; the documents are part of the Universal Dependencies project for cross-language linguistics research.  New treebanked documents include selections from 1 Corinthians, the Gospel of Mark, Shenoute’s Abraham Our Father, Shenoute’s Acephalous Work 22, and the Martyrdom of Victor.  This means the self-standing treebank corpus is expanded, and any documents we’ve treebanked have updated word segmentation, part of speech tagging, etc., in their regular corpora.

Updated Shenoute Documents

Documents in the corpora for Shenoute’s Abraham Our Father and Acephalous Work 22 have several updates.

First, some documents are in our treebank corpus and are now significantly more accurate in terms of word segmentation, tagging, etc.

Second, we’ve added chapter and verse segmentation to these works.  Since there are no comprehensive print editions of these works with versification, we’ve applied our own chapter and verse numbers.  We recognize that versification is arbitrary, but nonetheless useful for citation.  For texts transcribed from manuscripts, chapter divisions typically occur when an ekthesis occurs in the manuscript. (Ekthesis describes a letter hanging in the margin.)  They do not necessarily occur with each ekthesis (if ekthesis is very frequent), but we try to make the divisions occur only with ekthesis.  Verses typically equal one sentence, sometimes more than one sentence per verse for very short sentences or more than one verse per sentence for very long Shenoutean sentences.

Third, we’ve added “Order” metadata to make it easier to read a work in order if it’s broken into multiple documents.  Check out Abraham Our Father, for example: the first document in the list is the beginning of the work.

Screen shot of list of documents in Abraham Our Father Corpus

Screen shot of list of documents in Abraham Our Father Corpus

When you’re reading through a document, click on “Next” to get the next document in reading order.  (If there are multiple manuscript witnesses to a work, we’ll send you to the next document in order with the fewest lacunae, or missing segments.)

Screen shot of beginning of Abraham Our Father

Screen shot of beginning of Abraham Our Father

Of course, you can always click on documents in any order you want to read however you like!

And everything is fully searchable across all documents in ANNIS.

New Metadata Fields Documenting Annotation

We sometimes get asked: which corpora do scholars annotate and which corpora are machine-annotated?  The answer is complicated — almost everything is machine annotated, with different levels of scholarly review.  So we’re adding three new metadata fields to help show users what kinds of annotation each document get:

  • Segmentation refers to word segmentation (or “tokenization”) performed by the tokenizer tool.
  • Tagging refers to part of speech, language of origin, and lemma tagging performed by our tagger
  • Parsing refers to dependency syntax annotations (which are part of our treebanking)

Each of these fields contains one of the following values:

  • automatic: fully machine annotated; no manual review or corrections to the tool output
  • checked: the tool has annotated the text, and a scholar has reviewed the annotations before publication
  • gold: the tools have been run and the annotations have received thorough review; this value usually applies only to documents that have been treebanked by a scholar (requiring rigorous review of word segmentation and tagging along the way)

For example, in the first image of document metadata visible in ANNIS, the document has automatic parsing; a scholar has checked the word segmentation and tagging.

 

Screenshot of document metadata showing checked word segmentation and tagging

Screenshot of document metadata showing checked word segmentation and tagging

In the next image of document metadata, a scholar has treebanked the text, making segmentation, tagging, and parsing all gold.

Screenshot of document metadata showing gold level annotations

Screenshot of document metadata showing gold level annotations

 

We are rolling out these annotations with each new corpus and newly edited corpus; not every corpus has them, yet — only the ones in this release.  Our New Testament and Old Testament corpora are machine-annotated (automatic) in all annotations.

 

We hope you enjoy!

 

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.