Update #68: Whispering Indigenous Languages and Neural Net Training Dynamics
Papa Reo explains issues with Whisper's ability to transcribe the Māori language, and researchers find that neural networks learn statistics of increasing complexity throughout training.
Welcome to the 68th update from the Gradient! If you’re new and like what you see, subscribe and follow us on Twitter :)
We’re recruiting editors! If you’re interested in helping us edit essays for our magazine, reach out to editor@thegradient.pub.
Want to write with us? Send a pitch using this form.
News Highlight: Whisper and Indigenous Languages
Summary
Papa Reo, an organization dedicated to instilling, nurturing, and proliferating the Māori language, has raised concerns about how Whisper was trained on te reo Māori. While the ability to transcribe Māori would be a major step forward for organizations like Papa Reo that seek to preserve endangered languages, Whisper performs poorly on Māori, and its open-sourcing enables use cases that could be dangerous for the language’s speakers.
Overview
Māori is a language that has suffered a great deal of harm—speaking te reo Māori was once forbidden, and despite a growing appetite to learn the language, many New Zealanders make life difficult for Māori speakers.
When Whisper was first released, data scientists at Papa Reo were excited to see that Whisper could transcribe te reo Māori videos from YouTube. But, on closer inspection, they found its transcriptions were very faulty.
This raises important questions about Whisper, like: where did the data that enabled OpenAI’s model to transcribe te reo Māori come from? As it turns out, Whisper was trained with 1381 hours of te reo Māori and 338 hours of ‘ōlelo Hawai’i—the Whisper paper doesn’t specify where these data came from, but they were more than likely scraped from the web. This possibility has some important implications:
Scraping data from the web is particularly alarming when they don’t have the right to use that data or to create derived works from it… Most, if not all, “free” services offered by Big Tech require you to give them exclusive rights to create derived works from your data, derived works such as models like Whisper and GPT-3.
Rights issues and faulty translations, such as those produced by Google Translate, are especially concerning for languages like te reo Māori. If a language like Urdu is translated incorrectly by Google Translate, there are enough native speakers to preserve the language in its correct form. It’s another story for languages that have struggled, and continue to struggle, for survival—mispronunciations and other faults in translation can actually damage a language. Creating correct datasets and quality-checking translations requires experts in a language like Māori.
The team at Papa Reo wanted to concretely evaluate their concerns, so they spent 6 weeks building a bilingual te reo Māori and New Zealand English model by fine-tuning Whisper, then quantified just how poorly the model transcribed te reo Māori speech. Papa Reo’s current ASR, a DeepSpeech-based model, achieves a 53% Word Error Rate (WER) on a “golden dataset,” a hand-curated corpus of te reo Māori speech classified as essential by its creators. The large Whisper model from OpenAI, when set to Māori, performs with a 73% WER.
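For readers unfamiliar with the metric: Word Error Rate is the word-level edit distance between a hypothesis transcript and a reference transcript, divided by the number of words in the reference. Below is a minimal, self-contained Python sketch of the computation (our own illustration with made-up example strings, not Papa Reo’s evaluation code).

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion, insertion = d[i - 1][j] + 1, d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion against a four-word reference -> 50% WER
print(wer("kei te pēhea koe", "kei te pai"))
```

By this measure, both the 53% and 73% figures imply that roughly half or more of the reference words are substituted, dropped, or spuriously inserted in the transcripts.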
The Papa Reo team found that, after fine-tuning with carefully curated training data, Whisper can perform substantially better than their DeepSpeech-based model. This demonstrates the importance of involving native speakers in curating language corpora and evaluating models.
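For context on what “set to Māori” means in practice, the open-source whisper package lets you force the decoding language rather than relying on automatic language detection. A rough sketch of that usage is below (the audio filename is hypothetical, and this is the off-the-shelf model, not Papa Reo’s fine-tuned one).

```python
import whisper  # pip install openai-whisper

# Load the large multilingual checkpoint and force decoding in Māori ("mi"),
# rather than letting the model auto-detect the language.
model = whisper.load_model("large")
result = model.transcribe("korero.mp3", language="mi")  # hypothetical audio file
print(result["text"])
```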
Why does it matter?
Māori, and other indigenous languages, could benefit substantially from the capabilities of natural language processing systems like Whisper. However, these systems do not work well for such languages, in large part because the organizations who create these systems don’t have a stake in the language.
Furthermore, however well-intentioned their creators, models like Whisper can create more opportunities for non-speakers of a language like Māori than for the minority communities themselves. China’s use of technology to surveil the Uyghur population is well known, and technologies like Whisper, even when they are not perfect, can enable these troubling uses. A recent update to Whisper does mention surveillance issues, but the authors expect that the model’s other limitations might prevent such misuse.
The final point that Papa Reo makes is about populations’ ability to participate in the future of their languages. Organizations without a stake in a language are not accountable to these populations, and their incentives will often work against the best interests of that language and its people.
Editor Comments
Daniel: The quotation presented at the beginning of this article is really evocative—I can’t help but make note of the inevitability rhetoric we see about “progress” and the advancement of technology everywhere. I do think there are a lot of cases of harms from advanced technologies being overblown, but this case does demonstrate concrete scenarios (and motivations) for unsavory uses.
Research Highlight: Neural Networks Learn Statistics of Increasing Complexity
Summary
Researchers from EleutherAI and Oregon State University released a new paper in which they present new theoretical and empirical evidence for distributional simplicity bias (DSB). DSB posits that neural networks first learn low-order moments (mean and variance) of a data distribution before moving on to higher-order correlations (such as skewness and kurtosis). The researchers demonstrate this by training models on real datasets and evaluating them (throughout training) on synthetic data designed to probe the models’ reliance on statistics of different orders. They demonstrate this behavior across a variety of image and language architectures.
Overview
The paper’s theoretical contributions begin with a Taylor series expansion of the model’s loss (1) and of its expected (average) loss (2).
This expression allows the authors to write an arbitrary model’s expected loss (2) as a sum over the data distribution’s central moments. The connection between the expected loss and the k-th central moment provides the motivation for the DSB: the authors show that if a loss can be well approximated by the first k terms of its Taylor expansion, then the model should be sensitive to statistical moments up to order k. They also empirically demonstrate that the lower-order terms of the expansion contribute significantly to the loss earlier in training than the higher-order terms do.
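To make the flavor of this argument concrete, here is a hedged sketch (our notation, not necessarily the paper’s exact formulation): Taylor-expanding the composed loss around the data mean and taking expectations expresses the expected loss as a series whose k-th term is paired with the k-th central moment of the inputs,

\[
\mathbb{E}_{x}\big[\mathcal{L}(f(x))\big] \;\approx\; \sum_{k=0}^{K} \frac{1}{k!}\,\Big\langle \nabla^{k}\,(\mathcal{L}\circ f)(\mu),\; \mathbb{E}_{x}\big[(x-\mu)^{\otimes k}\big] \Big\rangle, \qquad \mu = \mathbb{E}_{x}[x],
\]

where \(\mathbb{E}_{x}[(x-\mu)^{\otimes k}]\) is the k-th central moment tensor. If the first K terms dominate, then the expected loss is effectively determined by statistics up to order K, which is exactly the sensitivity the DSB predicts.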
To demonstrate that a model learns simple moments earlier in training, the authors introduce two criteria that a model sensitive to statistics up to order k should satisfy:
Changing the first k statistics of data from class A to match class B should cause the model to classify the modified data as class B
Models should be unaffected by “deleting” higher-order data statistics
To test (1), the researchers relied on optimal transport (OT) theory. They used both coordinatewise quantile normalization and Gaussian optimal transport to transform samples from one probability distribution into another; these transformations are designed to minimize the average distance samples move between the distributions. OT lets the researchers craft synthetic datasets in which the first k order statistics of samples from one class are matched to those of another class, probing what the models rely on to classify. To test (2), the researchers generated synthetic samples that match a target class’s low-order statistics while maximizing entropy, which effectively “deletes” higher-order structure. A visually intuitive example of how maximizing entropy represents deletion can be seen below.
On the left, we have an unmodified sample from the dog class. In the middle, the example is transformed using coordinatewise quantile normalization to project a different class’s (goldfish) first K statistics onto the original image. The image on the right similarly matches the goldfish’s first K statistics but also maximizes the entropy of the image.
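To illustrate the Gaussian optimal transport idea concretely, here is a small NumPy sketch of our own (not the authors’ code): the closed-form map between two Gaussians moves samples so that their mean and covariance match a target class, and the max-entropy variant simply replaces samples with Gaussian draws that share the target’s first two moments, discarding all higher-order structure.

```python
import numpy as np

def _sqrtm(m):
    """Symmetric PSD matrix square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def gaussian_ot_map(x, source_mean, source_cov, target_mean, target_cov):
    """Closed-form OT map between Gaussians: T(x) = m2 + A (x - m1), with
    A = S^{-1/2} (S^{1/2} C S^{1/2})^{1/2} S^{-1/2}, S = source_cov, C = target_cov."""
    s_half = _sqrtm(source_cov)
    s_half_inv = np.linalg.inv(s_half)
    a_mat = s_half_inv @ _sqrtm(s_half @ target_cov @ s_half) @ s_half_inv
    return target_mean + (x - source_mean) @ a_mat.T

def max_entropy_samples(target_mean, target_cov, n, seed=0):
    """The Gaussian is the max-entropy distribution for a fixed mean and covariance,
    so sampling from it "deletes" every statistic above second order."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(target_mean, target_cov, size=n)

# Toy usage: transport "class A" samples so their first two moments match "class B"
rng = np.random.default_rng(0)
a = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=1000)
b_mean = np.array([3.0, -1.0])
b_cov = np.array([[2.0, -0.3], [-0.3, 0.5]])
a_moved = gaussian_ot_map(a, a.mean(axis=0), np.cov(a.T), b_mean, b_cov)
print(a_moved.mean(axis=0), np.cov(a_moved.T))  # ≈ b_mean and ≈ b_cov
```

In the paper’s setup, an early-training model that relies only on low-order statistics should classify such moment-matched samples as the target class, while a later checkpoint that has picked up higher-order structure should not; this is the behavior the loss curves described below track.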
The researchers found a common pattern across models and datasets: early in training, classifications relied heavily on the means and covariances (low-order statistics) of the sampled distributions. As training progressed, the networks became more sensitive to higher-order statistics, as demonstrated by the increasing loss on the moment-matched synthetic data. Across both text and image datasets, this manifested as a U-shaped pattern in the loss curves. An interesting finding in the loss curves of the natural language models was a “double descent” behavior: after the emergence of the U-shaped pattern, there was a second stage where the loss decreased monotonically. The authors attributed this to in-context learning (ICL) emerging at late training stages, leading to even further loss reductions.
Why does it matter?
Deep learning’s role and influence in our society has only continued to grow since AI exploded into the mass public consciousness with the rollout of products like ChatGPT and Midjourney. As these kinds of models become more intertwined with our lives, it becomes increasingly important for the scientific community to further our understanding of models’ performance and behavior. Understanding how DSB influences early learning dynamics is particularly important, given the existing tendencies of many machine learning models to reproduce bias and generate harmful content. The research presented here provides insight into how low-order statistical moments dominate the early learning process and how DSB could be an influential factor in existing models replicating the biases of their training distributions.
New from the Gradient
Subbarao Kambhampati: Planning, Reasoning, and Interpretability in the Age of LLMs
Russ Maschmeyer: Spatial Commerce and AI in Retail
Other Things That Caught Our Eyes
News
Huawei just retasked a factory to prioritize AI over its bestselling phone “Huawei makes both its Ascend AI chip and the Kirin chip, which powers the Mate 60, in one facility. However, production in the plant has been low, people familiar with the matter told Reuters, so the company now plans to prioritize the AI chip.”
Microsoft is teaming up with Semafor on AI-assisted news stories “Microsoft is teaming up with media website Semafor on a new project that uses ChatGPT to aid in the creation of news stories.”
Roblox breaks language barriers with AI-based real-time chat translation “Roblox is breaking language barriers today as it launches AI-powered real-time chat translation.”
UK gov’t touts $100M+ plan to fire up ‘responsible’ AI R&D “The UK government is finally publishing its response to an AI regulation consultation it kicked off last March, when it put out a white paper setting out a preference for relying on existing laws and regulators, combined with “context-specific” guidance, to lightly supervise the disruptive high tech sector.”
How Tech Giants Turned Ukraine Into an AI War Lab “Early on the morning of June 1, 2022, Alex Karp, the CEO of the data-analytics firm Palantir Technologies, crossed the border between Poland and Ukraine on foot, with five colleagues in tow. A pair of beaten-up Toyota Land Cruisers awaited on the other side.”
Europe eyes fix for Taylor Swift deepfakes “The issue has taken on greater urgency after fake AI-generated graphic images of Swift were seen more than 45 million times in January on social media platform X (formerly Twitter). United States lawmakers issued new calls for legislation, and the incident sparked alarm in the White House.”
In Big Tech’s backyard, California lawmaker unveils landmark AI bill “California’s landmark AI proposal could inspire regulation around the country, as more than 44 U.S. states take up the swiftly evolving technology.”
AI safeguards can easily be broken, UK Safety Institute finds “The UK’s new artificial intelligence safety body has found the technology can deceive human users, produce biased outcomes and has inadequate safeguards against giving out harmful information.”
Stability, Midjourney, Runway hit back in AI art lawsuit “The class-action copyright lawsuit filed by artists against companies providing AI image and video generators and their underlying machine learning (ML) models has taken a new turn, and it seems like the AI companies have some compelling arguments as to why they are not liable, and why the artists case should be dropped (caveats below).”
Papers
Daniel: This is a good overview of recent protein language models applied to adaptive immune receptors—a fast-moving space!
Daniel: This paper presents a simple baseline for instruction fine-tuning, showing that simply selecting the 1,000 instructions with the longest responses from standard datasets can consistently outperform more sophisticated methods for fine-tuning.
Daniel: This is also a really neat paper on emergence in LLMs, considering the phenomenon from the perspective of phase transitions as they’re understood in physics.
Closing Thoughts
Have something to say about this edition’s topics? Shoot us an email at editor@thegradient.pub and we will consider sharing the most interesting thoughts from readers in the next newsletter! For feedback, you can also reach Daniel directly at dbashir@hmc.edu or on Twitter. If you enjoyed this newsletter, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!