What digital technologies can do for the endangered language preservation effort – ABAIR.ie

The year is 2023, and the Digital Age is in full swing. With digital technology arguably the main driver of the last two waves of globalisation, it is easy to say that we have reached a point where technology is indispensable in even the most trivial aspects of our daily lives. In this rapidly evolving society, every day a new high-tech novelty is advertised, yet another mundane chore is digitised for convenience, or a medical advancement previously thought impossible is achieved. While it is irrefutable that this Third Industrial Revolution, or ‘Digital Revolution’, has enhanced our quality of life in most respects, from medicine to education, like any wave of globalisation it leaves many communities behind.

To the average speaker of a majority language, language may not even seem relevant to technological development and innovation, let alone one of its most fundamental elements. But the fact is that minority languages and the communities native to them are often excluded from these conversations and left behind while the rest of the world advances. The digital environment in which we live today, which has also been described as a “digital timebomb”, poses a threat to the 2,473 already fragile languages that are at risk of becoming extinct if these technologies are not harnessed and utilised properly.

As English establishes itself as the lingua franca of the internet and Anglocentric culture becomes even more globalised, the representation of smaller languages is more important than ever. In an effort to prevent endangered languages from being left behind and subsequently lost in this digital rush, it is crucial that we realise the opportunities that digital technologies can provide us with in the efforts to archive, preserve, sustain and revive vulnerable languages.

How can we use technology to aid us in the revitalisation of endangered languages?

Digital Language Technology Equality and Inclusion

In a world that is becoming digital at a rapid pace, it is important that endangered languages are not left behind in the physical world. Lack of access to digital technologies, and the low digital literacy that follows from it in marginalised groups, holds these communities back from fully partaking in modern society. The research of mathematical linguist András Kornai shows that there is a “massive die-off” of minority and endangered languages attributable to digital inequality and that, despite language conservation efforts, for over 95% of the world’s languages it is practically impossible to bridge the digital gap. According to Kornai’s account of the role of technology in the preservation of minority languages, a language’s survival relies on its “online visibility”. If so, online and digital technologies not only need to be made accessible to minority language communities, but also need to reflect the needs of those languages and communities so that potential speakers can use them as they require.

Preservation and Archival Efforts

Speech technologies mean that there are more opportunities than ever for minority language data collectors, archivists and field linguists. Speech recording is considered the most important opening that digital technology provides for language preservation. Its value is especially clear in cases where the only remaining speakers of a critically endangered language are documented through speech recordings. The speech corpora built from those recordings offer a multitude of opportunities for analysis in language revitalisation and preservation efforts. Thanks to today’s advanced speech technology, many research initiatives are able to build data collections for a wide range of vulnerable languages.

One example is a research group based in India that collects speech data for endangered and under-resourced Indian languages, specifically from the Tibeto-Burman and Indo-Aryan branches. Their motivation is that only the major Indian languages, such as Hindi and Tamil, are supported by leading speech-based products from companies such as Amazon (Alexa) and Microsoft, which leaves the speakers of smaller Indian languages underrepresented and unable to make full use of digital services.

Case Study: The Irish Language Synthesiser – ABAIR.ie

Let’s look at the Irish language as an example. Irish, or Gaeilge, is a Celtic language indigenous to the island of Ireland, and was the predominant native tongue until the 19th century. Between the efforts at linguistic and cultural genocide by the ruling British hand in Ireland and the death of a large percentage of native Irish speakers in the Great Famine of 1845-52, English became the dominant language in Ireland and unfortunately remains so to this day. Even so, the language still prospers as a minority language, and initiatives to restore Irish to its former glory and numbers are thriving.

ABAIR.ie is a project of the Phonetics and Speech Laboratory in Trinity College Dublin, which has been developing dialect-inclusive synthetic voices and other speech technologies for Irish for over a decade. The name comes from abair, the Irish verb for ‘to say’ or ‘to speak’. The extensive data collection ABAIR.ie has gathered over the years, from dialectal written corpora to speech data collected from native speakers, provides endless opportunities for Irish language technology. The work of ABAIR.ie provides an invaluable foundation for multiple digital services for the Irish language, as efforts are made to establish a strong digital presence for Irish that won’t render it secondary to any majority language. For under-resourced languages like Irish, initiatives like ABAIR.ie are vital in bridging the digital divide between majority and minority languages.

Let’s take a look at some of the various technologies developed by ABAIR.ie!

1. ABAIR – Speech Synthesis

ABAIR’s speech synthesis feature opens the Irish language up to the bottomless world of opportunities that is the technological sphere. First launched 11 years ago, it currently provides synthetic voices for three of the main dialects of Irish, with work still being done to provide for the subdialects. When you type any sentence into the textbox, dialectal features included, a synthetic voice will read it out in whichever dialect you choose. Apart from dialect, the synthesiser has several controls for customising the audio output, including pitch, speed and the speaker’s gender. You can also choose which synthesis system is used to create the audio, since the output may vary between them; the options include DNN (deep neural networks), HTS (HMM-based text-to-speech), NEMO (conversational AI toolkit) and PIPER. The audio can then be copied, exported or downloaded for personal use.

ABAIR.ie’s speech synthesiser
The downloadable audio output

Speech synthesis is crucial to the digitisation of any language. It provides the grounds for a wide range of technologies, such as TTS (text-to-speech) software to assist those with dyslexia or other learning differences, AAC (Augmentative and Alternative Communication) systems, screen readers for people with visual disabilities, voice assistant software such as Siri, language learning tools, and public uses such as train announcements. Developing these sorts of technologies in minority languages ensures that marginalised linguistic communities aren’t left behind as the world undergoes this wave of digitisation. Because the Gaeltacht regions are extremely isolated from each other, the main dialects of Irish are very distinct from each other and from the standardised language. ABAIR.ie’s prioritisation of dialectal Irish over the standardised language, which has no real native speakers, is groundbreaking: it means that the needs of native speakers are attended to, which makes sure that the technology is actually relevant to its target audience.

2. ÉIST – Speech Recognition

ÉIST, coming from the Irish éist (to listen), is another project of the ABAIR.ie initiative. ÉIST is an ASR (Automatic Speech Recognition) technology that transcribes spoken Irish into text. Just like ABAIR, ÉIST is inclusive of all major dialects of Irish, and is extremely effective and well developed.

3. Míle Glór (‘A Thousand Voices’)

The sentence prompt for users to record

Another scheme of ABAIR.ie’s is Míle Glór, which means ‘a thousand voices’ in Irish. Míle Glór is a project that invites Irish speakers to contribute their voices to ABAIR.ie’s corpus in a low-pressure environment, with an initial goal of gathering 1000 voices. Míle Glór can be accessed through the ABAIR.ie website and completed online by the speakers themselves. Speakers consent to their voices being used as data and then record themselves reading out short prompts provided by the site. They can do as much or as little recording as they like, and it is an effective, open and easy-to-navigate way for speakers who are passionate about Irish to contribute to the initiative.

4. An Bat Mírialta

An Bat Mírialta (meaning ‘the irregular bot’) is a speech chat system built by ABAIR.ie developers to help learners of Irish practise irregular verbs. This educational but fun chat-based game lets learners practise the conjugation of irregular verbs through natural-seeming conversations, supported by the extensive database originally built for the synthesis and recognition systems mentioned above.

An Bat Mírialta’s welcome page

When creating a profile for An Bat Mírialta, users are given the option to choose which dialect they already speak or are interested in learning. This aids the promotion and growth of native dialects rather than the standard Irish that is generally used in Irish education. The usage of the standard in the education system is deemed controversial by most scholars and speakers of the language because of how little it represents the Irish spoken by natives, and so ABAIR.ie’s determination to push the Gaeltacht dialects to the forefront in the technological world is applauded by the community.

Customisable lesson plans
Interactive chat bot style learning

Once logged in, the user can choose exactly which verbs they would like to practise, and in which tense and form. Once these are selected, the chat bot, Bat, loads and presents natural questions with blanks for the user to fill in with the missing verbs. The user’s learning statistics and previous lessons are logged on their profile, which makes it easy to track their progress as they practise. As Irish is notorious for complex irregular verbs that are daunting to fluent speakers, let alone learners, this resource is invaluable to anyone looking to refine their grammar.

References

UNESCO. “Atlas of the World’s Languages in Danger.” Unesco.org, 2010, unesdoc.unesco.org/ark:/48223/pf0000187026. Accessed 3 Nov. 2023.

Bel, Bernard, and Médéric Gasquet-Cyrus. “7 Digital Curation and Event-driven Methods at the Service of Endangered Languages.” Endangered Languages and New Technologies. Edited by Mari C. Jones, Cambridge University Press, 2014, pp. 113–26, www-cambridge-org.ucc.idm.oclc.org/core/books/endangered-languages-and-new-technologies/digital-curation-andeventdriven-methods-at-the-service-of-endangered-languages/78CEE74BE67F9EA7EA86780A350AA17E#c04959-657

Carew, Margaret, et al. “Getting in Touch: Language and Digital Inclusion in Australian Indigenous Communities.” Language Documentation & Conservation, vol. 9, 2015, pp. 307–23, scholarspace.manoa.hawaii.edu/items/5a23dc82-b846-4188-a8fe-92754ac91c56. University of Hawaii Press.

Kornai, András. “Digital Language Death.” PLoS ONE, edited by Eduardo G. Altmann, vol. 8, no. 10, Oct. 2013, https://doi.org/10.1371/journal.pone.0077056.

Rialtas na hÉireann. Digital Plan for the Irish Language (Speech and Language Technologies 2023-2027). Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media, 2022.

Besacier, Laurent, et al. “Automatic Speech Recognition for Under-Resourced Languages: A Survey.” Speech Communication, vol. 56, Jan. 2014, pp. 85–100, https://doi.org/10.1016/j.specom.2013.07.008.

Kumar, Ritesh, et al. “Collecting Speech Data for Endangered and Under-Resourced Indian Languages.” 2nd Annual Meeting of the ELRA/ISCA SIG on Under-Resourced Languages (SIGUL 2023), 18 Aug. 2023, pp. 14–18, www.semanticscholar.org/paper/Collecting-Speech-Data-for-Endangered-and-Indian-Kumar-Takhellambam/083b0a200614ba8c1dc3488bf3e6132642ef2cf1, https://doi.org/10.21437/SIGUL.2023-4. Accessed 10 Jan. 2024.
