To train AI, algorithms learn from huge quantities of data. But the internet is 52.1 per cent English. What happens to the languages and speakers left behind?
Natural language processing (NLP) is a form of applied machine learning. While NLP is booming, there are also concerns about bias in artificial intelligence.
NLP works by training computers on massive sets of data, which enables them to “understand” text and speech by finding patterns. This means that any model is only as good as the data that it is trained on, and models can only be trained on data that exists in large quantities. That privileges high-resource languages like English, contributing to the marginalization of already under-represented languages.
Some computational linguists at UBC are addressing this gap.
Dr. Ife Adebara, who recently completed her PhD in linguistics, focuses her research on Afro-centric NLP, developing tools that directly address NLP resource scarcity in African communities. She spoke about the importance of linguistic inclusivity in an increasingly globalized and digitized world.
She said that an estimated 2,000 languages are spoken in Africa, most of which are not supported by technologies.
“What this means is that more than one billion African people are excluded from global conversations ... That’s a huge proportion of the world’s population,” said Adebara.
She has worked with researchers to build a language generation model that supports 517 African languages and varieties.
Adebara loves discovering new linguistic features and seeing positive reactions from the communities that she works with.
“There’s a lot of bias where people feel like foreign languages are more prestigious than their [Indigenous African] language,” she said. “Now, seeing their language being used within the technology space, makes them feel excited and feel like, yeah, our language is actually developed. It’s not inferior to any other language. That is something I’m really excited about.”
Adebara hopes her work can broadly promote linguistic diversity and representation in AI.
The rapid developments in NLP may benefit language preservation and learning, but there is also a lot we don’t know about how computer models treat linguistic data.
Both Adebara and Dr. Garrett Nicolai, assistant professor in linguistics and program director of the Master of Data Science in Computational Linguistics program, stressed the importance of using language technologies responsibly.
“Computational models are really, really good at picking up patterns. So if there is a signal in there biased in a certain direction, it will pick it up,” said Nicolai. “It’s not just a linguistic representation issue … they tend to produce negative [social] stereotypes.”
Nicolai said that means we should never trust AI blindly.
Nicolai works with NLP and Indigenous languages in Canada.
The challenge of working with under-represented languages, he said, is that there are “just no resources. You’re often going in blind when you’re trying to build a model.”
On the bright side, revitalization efforts for some Indigenous languages have led to linguists documenting them. Nicolai and colleagues have developed a computational system for Gitksan, a northern BC First Nations language, using data from previous efforts to document the language.
Nicolai also emphasized that computational tools for low-resource languages need to be developed in ways that meet the needs of that language community.
“We don’t want to just create tools and say ‘here are these tools that we developed and they’ll help you learn your language better,’” he said. “It’s very important that we work with the communities to find out what sorts of tools are actually useful to help encourage new speakers of the language but also … make it easier for them to learn with.”
This can be as simple as developing a dictionary for a language, something often taken for granted by speakers of high-resource languages.
Adebara’s African language models are publicly available “to ensure that all people can have access,” so they can be used in more applied ways.
“Working on Indigenous languages is so important. It’s really important for preserving culture … Each language encodes wisdom in it that oftentimes, you can’t translate into another language.”