New release of Natural Language Processing Tools

Amir Zeldes and Luke Gessler have spent much of the past summer improving Coptic Scriptorium’s Natural Language Processing tools, and are now happy to announce the release of Coptic-NLP V3.0.0. You can read more about what we’ve been doing and the impact on performance in our three-part blog post (part 1, part 2, part 3). Some of the new improvements include:

  • A new three-step normalization framework, which allows us to hypothetically normalize bound groups before deciding how to segment them, then normalize each segment again
  • A smart rebinding module which can decide, based on context, whether to merge split bound groups (useful for processing messy texts with line breaks mid-word, or other segmentation anomalies)
  • A re-implemented segmentation algorithm which is notably better at handling spelling variation and ambiguous groups in context (e.g. “nau” in “peja|f na|u” vs. “nau ero|f”)
  • A brand new, more accurate part-of-speech tagger
  • Higher accuracy across tools thanks to hyperparameter optimization
  • A more robust test suite to ensure new errors don’t creep in
  • Various data/lexicon/ruleset improvements and bugfixes
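
To give a rough feel for the three-step normalization idea in the list above, here is a minimal toy sketch in Python. It is not the Coptic-NLP code or API: the function names, the lookup tables, and the variant spellings ("pejaaf", "eroof", "eroo") are all invented for illustration, and the real tools decide segmentation with trained models rather than dictionary lookups.

```python
# Toy illustration of "normalize group -> segment -> normalize segments".
# All tables and spellings below are hypothetical stand-ins, not real lexicon data.

# Step 1 table: group-level spelling variants mapped to a hypothesized normal form.
GROUP_VARIANTS = {"pejaaf": "pejaf"}

# Step 2 table: how a (hypothetically normalized) bound group is split into segments.
SEGMENTATIONS = {
    "pejaf": ["peja", "f"],
    "eroof": ["eroo", "f"],
}

# Step 3 table: segment-level spelling variants mapped to their normal form.
SEGMENT_VARIANTS = {"eroo": "ero"}


def normalize_group(group):
    """Step 1: hypothesize a normalized form for the whole bound group."""
    return GROUP_VARIANTS.get(group, group)


def segment(group):
    """Step 2: split the group into segments; unknown groups stay whole."""
    return SEGMENTATIONS.get(group, [group])


def normalize_segment(seg):
    """Step 3: normalize each segment on its own."""
    return SEGMENT_VARIANTS.get(seg, seg)


def process(group):
    """Run the three steps in order and return the normalized segments."""
    hypothesis = normalize_group(group)
    return [normalize_segment(s) for s in segment(hypothesis)]


if __name__ == "__main__":
    print("|".join(process("pejaaf")))  # prints: peja|f
    print("|".join(process("eroof")))   # prints: ero|f
```

The point of the ordering is that the segmenter sees a cleaned-up hypothesis of the group, while spelling variation that only becomes visible once the segments are known is handled in the final pass.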

You can download the latest version of the tools here:

https://github.com/CopticScriptorium/coptic-nlp/

Or use our web interface, which has been updated with the latest version:

https://corpling.uis.georgetown.edu/coptic-nlp/

We appreciate your feedback and comments, and hope to release more data processed with these tools very soon!
