Make sure to read the previous blog post here for an introduction to wordnets!

The Unified Scottish Gaelic Wordnet

A Celtic language with roots in Middle or Common Irish (900-1200 AD), Scottish Gaelic was once the most widely spoken language in Scotland but is today considered endangered, with fewer than 60,000 speakers. As mentioned previously, online visibility is crucial for a minority language to stay alive and kicking in today’s highly digitised society. Without adequate digital corpora, it is extremely difficult to develop and apply data-centric language processing approaches such as computer-assisted translation.

In an effort to bridge this digital divide, the Unified Scottish Gaelic Wordnet or USGW, a lexico-semantic database for Scottish Gaelic, was created. USGW is free to download and contains over 10,000 words and 13,000 synsets. The developers behind USGW considered the needs of Gaelic as an endangered language and decided that the database should prioritise common vocabulary over domain-specific words. Another motivation was to create an online lexical resource for Gaelic that was computer-oriented as well as human-oriented, as existing resources for Gaelic had so far only been made accessible to humans.

How to create a new wordnet

There are two primary methods for creating a new wordnet.

METHOD 1:

The first method is to translate a specially selected set of synsets from an existing wordnet. The Princeton WordNet (PWN) is usually the source wordnet for this method. This method is very convenient because the new wordnet is automatically aligned with the other languages that follow the same model, which provides ‘free’ translations to them.
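To make the ‘free’ translations concrete, here is a minimal Python sketch. It assumes each wordnet is just a mapping from a shared synset ID to a list of words; the IDs (`syn-001`, `syn-002`), the tiny word lists and the `translate` helper are all made up for illustration and are not taken from PWN or USGW.

```python
# Hypothetical miniature wordnets: synset ID -> words (lemmas).
# The IDs and entries are illustrative only, not real PWN or USGW data.
english = {
    "syn-001": ["dog"],
    "syn-002": ["water"],
}

# Another wordnet that was already aligned to the same synset IDs.
french = {
    "syn-001": ["chien"],
    "syn-002": ["eau"],
}

# The new wordnet reuses the source IDs, so no extra alignment step is needed.
gaelic = {
    "syn-001": ["cù"],
    "syn-002": ["uisge"],
}


def translate(word, source, target):
    """Return the target-language words for every synset that contains `word`."""
    results = []
    for synset_id, lemmas in source.items():
        if word in lemmas:
            results.extend(target.get(synset_id, []))
    return results


print(translate("cù", gaelic, french))      # Gaelic -> French via the shared ID
print(translate("uisge", gaelic, english))  # Gaelic -> English via the shared ID
```

Because every language hangs its words on the same IDs, adding Gaelic entries is enough to connect Gaelic to every other aligned language.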

METHOD 2:

The second method, while a lot more manual work, is much more customisable to the needs of the language. This is extremely useful for languages whose semantic hierarchy may differ a lot from the English of the PWN, or for minority languages whose needs may vary based on their circumstances. This approach starts from existing language resources, such as a thesaurus or other wordbank, and converts them into the PWN model while also developing a new synset hierarchy for that language. As one can imagine, this is a lot more time-consuming than the previous method and requires not only a lot of expertise but also first-hand familiarity with the language. A benefit of this approach is that it avoids one of the problems of the first method: modelling a new wordnet on the PWN means that the language will be seen through an English prism. The second method reduces this linguistic bias, which can be especially relevant to minority languages that want to stay away from the influence of a colonial language.

Developing the Unified Scottish Gaelic Wordnet

The developers of USGW mainly followed the first approach of translating into the PWN model, but they tackled questions of linguistic bias by explicitly handling lexical gaps, that is, things in English that you can’t say in a single word in Gaelic (and vice versa).

Numerous online resources for Gaelic were already available, but most of them were designed for human readers rather than being machine-readable. Through the creation of USGW the developers sought to build a lexico-semantic resource that could be used both by computers and by humans. Another goal was for this new database to be sense-aligned with multiple existing wordnets.

One of the existing online resources for Scottish Gaelic was the Extended Open Multilingual Wordnet, a project that produced wordnets for over 150 languages, including Gaelic. It worked automatically, obtaining data from the Gaelic Wiktionary site and integrating it into the PWN model. Since it was machine-generated, the quality wasn’t fantastic. It consisted of 4,674 words and 5,498 synsets, and had no Gaelic glosses. Despite that, it was a useful data source that was later used in the creation of USGW.

Developing the Unified Scottish Gaelic Wordnet – Methodology

Step 1: Selection of Relevant Synsets to Translate

USGW prioritised common language over domain-specific terminology, but an effort was also made to fill in much of the missing middle ground between the foundation word classes and domain-related classes, with the aim of expanding the Gaelic lexicon with neologisms. Nouns and verbs were especially favoured. Thirteen subgroups of noun synsets, such as natural objects, body parts, feelings, events, food and locations, were chosen, as well as a 14th group consisting of 1,100 commonly used verb synsets.

Step 2: Translation

The synsets in these groups were then translated by a translator, who was instructed:

  1. To omit synsets, or stop translating a subgroup completely, when the terminology seemed too technical and therefore of less interest to regular users;
  2. To recognise lexical gaps, where no lexicalisation exists in Gaelic, and in this case to either:
    1. Mark the synset as a lexical gap but still provide a Gaelic gloss for it; or
    2. Come up with a neologism, either by direct translation or by derivation.
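The instructions above can be pictured as a small record kept for each synset. The following Python sketch is only an assumption about how such a record might look: the `Status` values, the `SynsetTranslation` class and the `problems` check are hypothetical names invented here, not the format used by the USGW developers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    TRANSLATED = "translated"    # an existing Gaelic word was found
    SKIPPED = "skipped"          # judged too technical (instruction 1)
    LEXICAL_GAP = "lexical gap"  # no Gaelic word, gloss only (instruction 2.1)
    NEOLOGISM = "neologism"      # a new word was coined (instruction 2.2)


@dataclass
class SynsetTranslation:
    """Hypothetical record of one translation decision."""
    synset_id: str
    status: Status
    lemmas: List[str] = field(default_factory=list)
    gloss: str = ""


def problems(entry):
    """List the ways an entry contradicts the translation instructions."""
    issues = []
    has_lemmas = bool(entry.lemmas)
    if entry.status in (Status.TRANSLATED, Status.NEOLOGISM) and not has_lemmas:
        issues.append(f"{entry.status.value} entry needs at least one lemma")
    if entry.status is Status.LEXICAL_GAP:
        if has_lemmas:
            issues.append("lexical gap should not list lemmas")
        if not entry.gloss:
            issues.append("lexical gap still needs a Gaelic gloss")
    return issues


gap = SynsetTranslation("syn-010", Status.LEXICAL_GAP, gloss="placeholder gloss")
coined = SynsetTranslation("syn-011", Status.NEOLOGISM)  # lemma forgotten
print(problems(gap))
print(problems(coined))
```

Keeping the decision (`status`) separate from the words themselves means lexical gaps and neologisms stay visible later, for example when an expert reviews them.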

Step 3: Merging

For this step the two wordnets (the new one created for USGW and the existing machine-generated one) were merged. This was done within a large-scale multilingual lexico-semantic resource called the Universal Knowledge Core.
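As a rough illustration of what merging two wordnets that share synset IDs involves, here is a minimal Python sketch. It is not how the Universal Knowledge Core works internally; the `merge_wordnets` function, the synset IDs and the sample entries are assumptions made up for this example.

```python
def merge_wordnets(manual, automatic):
    """Union two wordnets keyed by synset ID, listing manual words first."""
    merged = {}
    for synset_id in sorted(manual.keys() | automatic.keys()):
        lemmas = list(manual.get(synset_id, []))
        for lemma in automatic.get(synset_id, []):
            if lemma not in lemmas:  # avoid duplicating shared words
                lemmas.append(lemma)
        merged[synset_id] = lemmas
    return merged


# Illustrative data only, not real USGW entries.
manual = {"syn-001": ["cù"], "syn-002": ["uisge"]}
automatic = {"syn-002": ["uisge", "bùrn"], "syn-003": ["taigh"]}

merged = merge_wordnets(manual, automatic)
print(merged)
```

Synsets found in only one wordnet are carried over unchanged, while synsets found in both end up with the combined word list.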

Step 4: Validation

Two methods of validation were used in this step. The first was a machine-led cross-validation between the two wordnets. The second was an informal evaluation by an external language expert who is a scholar and native speaker of Gaelic. The validation process showed that neologisms were both the most interesting entries and the ones most often disputed between the expert carrying out the evaluation and the translator involved in the first two steps. The contrasting opinions of the two point to a lack of research on the lexical aspect of Scottish Gaelic, and could also reflect how Gaelic is used in different communities, where a lexical gap in one community may already have been filled in another.
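The post does not describe how the machine-led cross-validation was carried out, so the following Python sketch is purely an assumption about what such a check could look like: for every synset present in both wordnets, it tests whether they share at least one word, and flags the rest for human review. The `cross_check` function and the data are hypothetical.

```python
def cross_check(manual, automatic):
    """Split shared synsets into those that agree and those needing review."""
    agree, review = [], []
    for synset_id in sorted(manual.keys() & automatic.keys()):
        if set(manual[synset_id]) & set(automatic[synset_id]):
            agree.append(synset_id)   # at least one word in common
        else:
            review.append(synset_id)  # no overlap: ask a human
    return agree, review


# Illustrative data only, not real USGW entries.
manual_wn = {"syn-001": ["cù"], "syn-002": ["uisge"], "syn-004": ["leabhar"]}
automatic_wn = {"syn-001": ["cù", "madadh"], "syn-002": ["loch"]}

agree, review = cross_check(manual_wn, automatic_wn)
print("agree:", agree)
print("review:", review)
```

A check like this cannot say which wordnet is right; it only narrows down where an expert's attention is most useful.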

Results

USGW is considered a great success for Scottish Gaelic language technology. Out of 1,000 wordnets it ranks 30th for the number of words it contains and 25th for the number of synsets. The mean level of polysemy in USGW is higher than that of PWN. This likely boils down to the fact that the subgroups included in USGW relate more to the common language, whose words naturally tend to have more meanings, whereas PWN includes more domain-specific words.
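Mean polysemy is commonly computed as the average number of synsets (senses) that each word belongs to. Assuming that definition, and again treating a wordnet as a mapping from synset ID to words, a minimal Python sketch looks like this; the `mean_polysemy` function and the placeholder data are illustrative only.

```python
from collections import defaultdict


def mean_polysemy(wordnet):
    """Average number of synsets per distinct word (0.0 for an empty wordnet)."""
    senses = defaultdict(set)
    for synset_id, lemmas in wordnet.items():
        for lemma in lemmas:
            senses[lemma].add(synset_id)
    if not senses:
        return 0.0
    return sum(len(ids) for ids in senses.values()) / len(senses)


# Placeholder data: word "a" has two senses, "b" and "c" have one each.
toy = {"s1": ["a", "b"], "s2": ["a"], "s3": ["c"]}
print(mean_polysemy(toy))  # (2 + 1 + 1) / 3 words
```

A wordnet dominated by everyday words, which often carry several senses, pushes this average up, while long lists of single-sense technical terms pull it down.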

References

Bella, Gábor, et al. “A Major Wordnet for a Minority Language: Scottish Gaelic.” Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), European Language Resources Association (ELRA), 2020, pp. 2812–18, iris.unitn.it/retrieve/e3835199-66d9-72ef-e053-3705fe0ad821/2020.lrec-1.342.pdf. Accessed 4 Jan. 2024.

Kornai, András. “Digital Language Death.” PLoS ONE, edited by Eduardo G. Altmann, vol. 8, no. 10, Oct. 2013, https://doi.org/10.1371/journal.pone.0077056.

Miller, George A., and Christiane Fellbaum. “WordNet Then and Now.” Language Resources and Evaluation, vol. 41, no. 2, 2007, pp. 209–14, www.jstor.org/stable/30200582?seq=1. Accessed 4 Jan. 2024.

Princeton University. “About WordNet.” Princeton.edu, 2010, wordnet.princeton.edu/.
