We’ve updated our tokenizer and part-of-speech tagger. The tokenizer breaks a Coptic bound group into its constituent words and/or morphemes. The tagger consists of a set of fine-grained and a set of coarse-grained Coptic models for use with the open-source natural language processing program TreeTagger. They use a Sahidic Coptic lexicon based in part on data provided by Prof. Tito Orlandi and the Corpus dei Manoscritti Copti Letterari (CMCL). These tools are written in Perl, and they can be downloaded along with their documentation under the Tools section of the Coptic SCRIPTORIUM website at http://coptic.pacific.edu.
We have also provided a new visualization of our corpora, which we’re calling the “analytic” visualization. The HTML version is currently available only for the letters of Besa, but you can access the visualization in ANNIS for other corpora, and we will expand HTML access to the rest of the texts in the future. The analytic visualization presents the normalized Coptic text aligned with part-of-speech tags and an English translation. It is best viewed in Safari or Chrome rather than Firefox.