Updates to Automated Annotation Tools

We’ve updated our tokenizer (which breaks Coptic bound groups into their constituent morphemes) and our normalizer (which normalizes spelling and orthography to facilitate further automatic annotation).

Version 2.0.1 of the tokenizer includes more patterns to handle a broader variety of bound groups.  It also adds a parameter (-l) to accommodate bound groups that are split by line breaks, as you might find in a transcription of a manuscript.  The tokenizer now annotates a bound group that runs across two lines as a single tagged bound group, and it also adds tags for the line breaks.
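To give a feel for the line-break case, here is a toy sketch in Python.  It is not the tokenizer's own code and does not reproduce its output format.  The function name `join_broken_groups`, the `-` continuation marker in the input, and the `<lb/>` tag are all assumptions made up for this illustration.  It only shows the general idea: a bound group split across two lines is rejoined into one unit, with a tag left at the point where the line broke.

```python
# Toy illustration only: NOT the Coptic SCRIPTORIUM tokenizer.
# Assumptions (all hypothetical): a trailing "-" marks a bound group that
# continues on the next line, and "<lb/>" is the tag recording the break.
LINE_BREAK_TAG = "<lb/>"
CONTINUATION = "-"


def join_broken_groups(lines):
    """Return the bound groups in `lines`, rejoining any group split by a line break."""
    groups = []
    carry = ""  # first part of a bound group still waiting for its continuation
    for line in lines:
        words = line.split()
        if not words:
            continue
        if carry:
            # Reattach the pending fragment, marking where the line broke.
            words[0] = carry + LINE_BREAK_TAG + words[0]
            carry = ""
        if words[-1].endswith(CONTINUATION):
            # The last group on this line continues on the next line.
            carry = words.pop()[: -len(CONTINUATION)]
        groups.extend(words)
    if carry:
        # A dangling fragment at the end of the text is kept as it stands.
        groups.append(carry)
    return groups


if __name__ == "__main__":
    transcription = ["ⲁⲩⲱ ⲁϥ-", "ⲥⲱⲧⲙ ⲉⲣⲟϥ"]
    for group in join_broken_groups(transcription):
        print(group)
```

In this sketch the second bound group comes back as one item with the break recorded inside it, so later steps can still treat it as a single bound group.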

Version 2.0 of the normalizer adds some vocabulary and also provides a parameter (-s) for normalizing the orthography particular to the Sahidica New Testament texts.

Check out the project at www.copticscriptorium.org, and fork us on GitHub.  Let us know what you think!
