Summer 2020 Corpora Release 4.0.0

Place name index on data.copticscriptorium.org

It is our great pleasure to announce the latest release of data from Coptic Scriptorium, version 4.0.0. This release contains both new Coptic material and extensive additions to our suite of tools and annotations, with a focus on new support for entity annotation and named-entity linking across our new and old datasets. The new material includes more digitized data courtesy of the Marcion project and other scholars.

With this new release, the semi-automatically annotated data (excluding automatically processed Bible materials) in the project covers close to 260,000 words of Sahidic Coptic annotated for entities, including 50,000 words of gold-standard treebanked data with manual syntactic analyses.

In addition to new texts, new tools and analyses have been added to the project:

  • Complete entity annotation, classifying all non-pronominal references to people, places and other entities into 10 entity categories
  • Entity linking:
    • Linking of all named entities which have corresponding Wikipedia articles to their respective Wikipedia entries, including geo-location information where available
    • A browseable index of people and places mentioned in the texts, also linked to Wikipedia and Google Maps and including both real and fictional entities
  • Search and visualization:
    • Search by entity type and named entity in ANNIS
    • New configurable analytic visualization which displays nested entity types, highlights named entities and links them to Wikipedia
  • Natural Language Processing
    • Automatic entity recognition is now available (work by Amir Zeldes, Lance Martin and Sichang Tu)
    • A new neural parser adapted for Coptic with higher accuracy syntactic analyses, which are deployed in ANNIS (work by Luke Gessler)

The new configurable Analytic Visualization with toggleable entity types and links
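
For readers who think in data structures, the entity linking and browseable index described in the list above can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the project's actual data format or code: the class `EntityMention`, the function `build_index`, the document IDs and the sample records are all hypothetical, and only the entity types "person" and "place" are taken from the announcement itself.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class EntityMention:
    """One annotated entity mention (hypothetical model, for illustration only)."""
    text: str                                       # surface form as it appears in the document
    entity_type: str                                # e.g. "person" or "place"
    doc_id: str                                     # hypothetical document identifier
    wikipedia_url: Optional[str] = None             # set only for linked named entities
    coords: Optional[Tuple[float, float]] = None    # (lat, lon), where available


def build_index(mentions: List[EntityMention]) -> Dict[Tuple[str, str], List[str]]:
    """Group linked mentions by (entity type, Wikipedia URL) and list the documents mentioning each."""
    index = defaultdict(set)
    for m in mentions:
        if m.wikipedia_url is not None:  # unlinked mentions stay out of the index
            index[(m.entity_type, m.wikipedia_url)].add(m.doc_id)
    return {key: sorted(docs) for key, docs in index.items()}


# Illustrative records only; the coordinates are approximate.
MENTIONS = [
    EntityMention("Alexandria", "place", "doc_a",
                  "https://en.wikipedia.org/wiki/Alexandria", (31.2, 29.9)),
    EntityMention("Alexandria", "place", "doc_b",
                  "https://en.wikipedia.org/wiki/Alexandria", (31.2, 29.9)),
    EntityMention("the monk", "person", "doc_a"),  # non-named reference, no link
]

if __name__ == "__main__":
    for (etype, url), docs in build_index(MENTIONS).items():
        print(etype, url, docs)
```

In this sketch, non-named references such as "the monk" still carry an entity type but have no link, so they are excluded from the index.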

This release represents a tremendous amount of work over the past few months by the entire Coptic Scriptorium team. We would also like to thank the individual contributors (whom you can always find in the ‘annotation’ metadata for each document), the Marcion and PAThs projects, which shared their data with us, and the National Endowment for the Humanities for supporting us. We are continuing to work on more data, links to other resources, and new kinds of annotations and tools. Please let us know if you have any feedback!
