Dealing with Heterogeneous Low Resource Data – Part I

Image from Budge’s (1914) Coptic Martyrdoms in the Dialect of Upper Egypt (scan made available by archive.org)

(This post is part of a series on our work in summer 2019 on improving processing for non-standardized Coptic resources.)

A major challenge for Coptic Scriptorium as we expand to cover texts from other genres, with different authors, styles and transcription practices, is how to make everything uniform. For example, our previously released data follows very specific conventions: which words to spell together, based on Layton’s (2011:22-27) concept of bound groups; how to normalize spellings; which base forms to lemmatize words to; and how to segment and analyze groups of words internally.

An example of our standard is shown in (1), with segments inside groups separated by ‘|’:

Coptic original:         ⲉⲃⲟⲗ ϩⲙ̅|ⲡ|ⲣⲟ    (Genesis 18:2)

Romanized:                 ebol hm|p|ro

Translation:                 out of the door

The words hm ‘in’, p ‘the’ and ro ‘door’ are spelled together, since they are phonologically bound: similarly to words spelled together in Arabic or Hebrew, the entire phrase carries one stress (on the word ‘door’) and no words may be inserted between them. Assimilation processes specific to the environment inside bound groups also occur: hm ‘in’ is normally hn, but its ‘n’ becomes ‘m’ before the labial ‘p’, a process which does not occur across adjacent bound groups.
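The assimilation described above can be illustrated with a minimal sketch on the romanized forms. This is not Coptic Scriptorium's actual tooling: the function names are hypothetical, and the rule covers only the single case mentioned in this post (hn before ‘p’).

```python
# Illustrative sketch on romanized forms; all names here are hypothetical.
LABIALS = ("p",)  # assumption: only the labial mentioned in the post


def assimilate_group(group: str) -> str:
    """Apply hn -> hm before a labial, inside one bound group only."""
    segments = group.split("|")
    for i in range(len(segments) - 1):
        if segments[i] == "hn" and segments[i + 1].startswith(LABIALS):
            segments[i] = "hm"
    return "|".join(segments)


def assimilate_phrase(phrase: str) -> str:
    """Bound groups are separated by spaces and processed independently."""
    return " ".join(assimilate_group(g) for g in phrase.split())


print(assimilate_phrase("ebol hn|p|ro"))  # ebol hm|p|ro
```

Because each bound group is processed on its own, an hn that stands in a separate group from a following ‘p’ is left unchanged, mirroring the observation that the process does not cross group boundaries.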

But many texts which we would like to make available online are transcribed using very different conventions, such as (2), from the Life of Cyrus, previously transcribed by the Marcion project following the conventions of E.A.W. Budge’s (1914) edition:

 

Coptic original:    ⲁ    ⲡⲥ̅ⲏ̅ⲣ̅               ⲉⲓ  ⲉ ⲃⲟⲗ    ϩⲙ̅ ⲡⲣⲟ  (Life of Cyrus, BritMusOriental6783)

Romanized:           a     p|sēr              ei   e bol   hm p|ro

Gloss:                        did the|savior go to-out in the|door

Translation:          The savior went out of the door

 

Budge’s edition usually (but not always) spells prepositions apart, articles together, and the word ebol in two parts, e + bol. These specific cases are not hard to list, but others are more difficult. The past auxiliary is just a, and is usually spelled together with the subject, here ‘savior’. However, not all cases of a should be bound, and ‘savior’ is written here as an abbreviation, sēr for sōtēr, which makes it harder to recognize that a is followed by a noun and is therefore likely to be the past tense marker. This is further complicated by the fact that words in the edition also break across lines, so we sometimes need to decide whether to fuse parts of words that are arbitrarily broken across typesetting boundaries as well.
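The easy, listable cases can be sketched as a few hand-written rules over romanized tokens. This is only an illustration under stated assumptions: the preposition list contains just the forms mentioned in this post, the function name is hypothetical, and the harder cases (the auxiliary a, abbreviations, words broken across lines) are deliberately left untouched.

```python
# Illustrative rules only; not the project's actual normalization pipeline.
PREPOSITIONS = {"hn", "hm"}  # assumption: only the forms cited in this post


def normalize_spacing(tokens: list[str]) -> list[str]:
    """Fuse 'e' + 'bol' and bind listed prepositions to the next token."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "e" and nxt == "bol":
            out.append("ebol")  # rejoin the word split in two parts
            i += 2
        elif tok in PREPOSITIONS and nxt is not None:
            out.append(tok + "|" + nxt)  # preposition joins the next group
            i += 2
        else:
            out.append(tok)  # includes 'a', which needs context to resolve
            i += 1
    return out


budge = "a p|sēr ei e bol hm p|ro".split()
print(" ".join(normalize_spacing(budge)))  # a p|sēr ei ebol hm|p|ro
```

The output still leaves a unbound and sēr unexpanded, which is exactly where a fixed list of rules stops being sufficient.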

The amount of material available in varying standards is too large to manually normalize each instance to a single form, raising the question of how we can deal with this variation automatically. In the next posts we will look at how white space can be normalized using training data, rule-based morphology and machine learning tools, and how we can recover standard spellings to ensure uniform searchability and online dictionary linking.

 

References

Layton, B. (2011). A Coptic Grammar. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.

Budge, E.A.W. (1914). Coptic Martyrdoms in the Dialect of Upper Egypt. London: Oxford University Press.
