Gradient Update #29: No Language Left Behind and Democratized Transformer Training
In which we discuss Meta's new AI translation model and two papers on transformers--one on a larger Swin Transformer and the other on democratizing transformer training.
Welcome to the 29th update from the Gradient! If you were referred by a friend, subscribe and follow us on Twitter!
News Highlight: Meta open sources early-stage AI translation tool that works across 200 languages
Summary
Meta AI has trained and released a new AI model that is capable of translating between over 200 languages, including many low-resource languages that previous systems do not support. They also developed evaluation procedures, collected data for low-resource languages, and conducted interviews with speakers of those languages. The crucial parts of this project are open sourced, including the final trained model, code for processing and creating language datasets, and an evaluation dataset.
Background
Machine translation technologies allow people to communicate, consume content, and learn in languages that they do not speak well. However, most progress in machine translation has focused on high-resource languages like English, Russian, and French, which have both large amounts of available data and sustained investment in their development. Past work by Facebook AI produced a model that could translate between 100 languages, but many low-resource languages remained unsupported by that model.
In this recent work from Meta AI, termed No Language Left Behind, a new open-source model breaks the 200-language barrier: it can translate between any pair of more than 200 languages. The model is a sparsely gated mixture of experts, meaning that it contains submodules ("experts") that are selectively activated only for certain inputs. In particular, parts of the network are split into several independent feed-forward networks, of which only two are activated per input token, as in previous work. A gating network decides which feed-forward networks process which tokens, which keeps the per-token computational cost low even as the total parameter count grows. The researchers have also released new data, including low-resource language data mined from the web and seed data translated by humans.
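To make the sparse gating idea concrete, here is a minimal PyTorch sketch of a mixture-of-experts feed-forward layer with top-2 routing. The layer sizes, expert count, and module names are illustrative assumptions for exposition, not the released NLLB-200 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Illustrative sparsely gated mixture-of-experts feed-forward layer.

    Each token is routed to the two experts with the highest gate scores,
    so only a small fraction of the parameters are active per token.
    """

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)  # routing network

    def forward(self, x):                                  # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)           # (n_tokens, n_experts)
        top_w, top_idx = scores.topk(2, dim=-1)            # keep 2 experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)    # renormalize gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(2):
                mask = top_idx[:, slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: route 10 token embeddings through the layer.
tokens = torch.randn(10, 512)
layer = Top2MoELayer()
print(layer(tokens).shape)  # torch.Size([10, 512])
```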
Why does it matter?
The internet holds copious information that is disproportionately tailored to high-resource languages; according to one source, about 64% of websites have content in English, whereas only about 26% of internet users speak English. Better machine translation that supports more languages could allow many more people to interact with others and explore all that the internet has to offer.
Moreover, future technologies may be able to make even better use of machine translation. The No Language Left Behind project is part of a longer-term project by Meta to produce a “universal speech translator”, which would allow for new AR and VR experiences that are not limited by language barriers.
While machine translation has many benefits for society, the development and use of these technologies also has the potential for harm. Some of the low-resource-language speakers interviewed in this work note that the availability of high-resource-language content may reduce incentives to create content in low-resource languages. Others note that machine translation would primarily benefit those with technological savvy and access to technology.
Editor Comments
Daniel: This is clearly exciting; as already indicated, plenty of languages are underrepresented in translation datasets. The compelling potential for me here is a democratization of knowledge. People who only speak languages with few translation resources may not be able to take advantage of much of the information on the internet the way English speakers can, for instance. Models like NLLB may be a step towards changing that.
Paper Highlight: How to Train Your Transformer: Insights to democratize and scale LLM training
Image Source: Training Transformers Together
Summary
Large Language Models (LLMs) have recently driven huge successes in text-based image generation (such as DALL-E), natural language (such as GPT-3), and even robotics. However, such models have billions of parameters and require vast amounts of compute to train, making them accessible mainly to large corporations with the financial means. Recent work from Hugging Face, HSE University, and the University of Washington demonstrates how collaborative training can be used to effectively train an LLM, while another paper from Microsoft AI discusses techniques for scaling vision transformers to billions of parameters using only a fraction of the compute such training would normally require. These research avenues not only show how transformers can be adapted to multiple tasks and beat current state-of-the-art models, but also provide resources for doing so in a decentralized, community-led way.
Overview
While the transformer architecture has been shown to perform well on various tasks in the computer vision and language modeling domains, many challenges stymie the democratization of LLM training. An organization must not only provide enough compute power to train a billion-parameter model, but also overcome training instability as the model is scaled up.
A recent demonstration from Borzunov et al. builds an end-to-end framework for using decentralized compute to train a transformer with an architecture similar to DALL-E. The authors created a webpage where users can learn how to volunteer their compute (such as cloud-based GPUs) to an ongoing training run by running a Jupyter notebook. While the training algorithms come from prior work by various researchers, the infrastructure itself required substantial engineering effort to train the model reliably and safely. The key takeaways from the authors' work are:
To overcome slow internet speeds and reduce communication costs, use large-batch training, gradient compression, and parameter sharing, and overlap computation with communication.
To accommodate slow devices that can stall a batch's completion, allow devices to process different numbers of samples per batch while maintaining synchronous training.
To make the most of the available memory, use 8-bit optimizers, offload optimizer statistics to the CPU, and enable gradient checkpointing or parameter sharing (see the sketch after this list).
Instead of storing the full dataset on each device, use dataset streaming tools such as the ‘datasets’ library.
To ensure a malicious user does not send incorrect tensors, authenticate all users and/or use gradient aggregation techniques that are robust to outliers.
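To illustrate a few of the memory- and bandwidth-saving pieces above, here is a rough sketch of a training step that combines an 8-bit optimizer, gradient checkpointing, and dataset streaming. It assumes the bitsandbytes and Hugging Face datasets libraries, and the toy model and dataset name are our own stand-ins; this is not the authors' actual training code.

```python
import torch
from torch.utils.checkpoint import checkpoint
import bitsandbytes as bnb            # 8-bit optimizers
from datasets import load_dataset     # streaming datasets

# Stream the corpus instead of downloading it to every volunteer's machine.
# (The dataset name here is just an example; iterate with `for sample in stream: ...`.)
stream = load_dataset("c4", "en", split="train", streaming=True)

# A toy stand-in for a transformer block.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 512)
)

# 8-bit Adam stores optimizer statistics in int8, cutting optimizer memory roughly 4x.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

def forward_with_checkpointing(x):
    # Gradient checkpointing trades compute for memory: activations are
    # recomputed in the backward pass instead of being stored.
    return checkpoint(model, x, use_reentrant=False)

# One toy optimization step on random features standing in for tokenized text.
batch = torch.randn(8, 512, requires_grad=True)
loss = forward_with_checkpointing(batch).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```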
Second, recent research from Microsoft AI lays the groundwork for the first billion-parameter vision transformer model that performs competitively on multiple vision benchmarks, building on the success of the Swin Transformer and scaling it up to roughly 3 billion parameters. The authors' findings and solutions are as follows:
Training Instability: As the transformer was scaled up, training became increasingly prone to instability and crashes. The authors traced this instability to a large discrepancy in feature variance between layers, which in turn stemmed from non-normalized residual streams. A new ‘residual post-normalization’ method moves the normalization step to the end of each residual branch. The authors pair this with a ‘scaled cosine attention’ mechanism, which computes attention from the cosine similarity of queries and keys so that the logits stay bounded regardless of activation magnitudes. Together, these changes stabilized the training dynamics and improved accuracy.
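A minimal sketch of the cosine attention idea follows: attention logits come from the cosine similarity of queries and keys divided by a learnable temperature, so they cannot blow up when activation magnitudes differ across layers. The tensor shapes, variable names, and the omission of the relative position bias are our simplifications, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cosine_attention(q, k, v, tau):
    """Scaled cosine attention (sketch).

    q, k, v: (batch, heads, seq_len, head_dim)
    tau:     per-head temperature, shape (heads, 1, 1); learnable in practice.

    Because cosine similarity is bounded in [-1, 1], the attention logits
    stay well behaved even when activation magnitudes vary between layers.
    (The paper also adds a relative position bias to the logits, omitted here.)
    """
    q = F.normalize(q, dim=-1)                   # unit-length queries
    k = F.normalize(k, dim=-1)                   # unit-length keys
    logits = (q @ k.transpose(-2, -1)) / tau     # cosine similarity / temperature
    attn = logits.softmax(dim=-1)
    return attn @ v

# Toy usage: batch of 2, 4 heads, 16 tokens, 32-dim heads.
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)
tau = torch.full((4, 1, 1), 0.1)                 # would be a learnable parameter
print(cosine_attention(q, k, v, tau).shape)      # torch.Size([2, 4, 16, 32])
```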
Image resolution gaps between pre-training and inference: Images used for pre-training are typically low resolution while those used at inference time are high resolution, a gap that the original Swin Transformer bridged with a handcrafted bicubic interpolation of its position biases. In Swin v2, the authors instead train a small meta-network that generates the position biases from log-scaled relative coordinates. This approach, named the log-spaced continuous position bias (Log-spaced CPB), transfers smoothly between different image resolutions and allows pre-training at a smaller resolution (192x192 instead of the standard 224x224) with no accuracy loss and a roughly 50% speedup in training.
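Below is a toy sketch of the log-spaced continuous position bias idea: a small MLP maps log-scaled relative (dy, dx) offsets between window positions to per-head attention biases, so the same network can serve window sizes it never saw during pre-training. The MLP size, variable names, and usage are assumptions for illustration rather than the paper's code.

```python
import torch
import torch.nn as nn

class LogSpacedPositionBias(nn.Module):
    """Sketch of a log-spaced continuous position bias.

    A tiny MLP maps log-scaled relative (dy, dx) offsets between window
    positions to a per-head bias that is added to the attention logits.
    Because the MLP is continuous, it can produce biases for window sizes
    that were never seen during pre-training.
    """

    def __init__(self, n_heads=4, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_heads))

    def forward(self, window_size):
        ys, xs = torch.meshgrid(torch.arange(window_size),
                                torch.arange(window_size), indexing="ij")
        coords = torch.stack([ys.flatten(), xs.flatten()], -1).float()  # (W*W, 2)
        rel = coords[:, None, :] - coords[None, :, :]                   # (W*W, W*W, 2)
        rel = torch.sign(rel) * torch.log1p(rel.abs())                  # log-spaced offsets
        bias = self.mlp(rel)                                            # (W*W, W*W, heads)
        return bias.permute(2, 0, 1)     # (heads, W*W, W*W), added to attention logits

# The same module can generate biases for an 8x8 pre-training window
# and a larger 12x12 window at inference time.
cpb = LogSpacedPositionBias()
print(cpb(8).shape, cpb(12).shape)  # torch.Size([4, 64, 64]) torch.Size([4, 144, 144])
```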
Lack of data for large models: Current large vision models require billions of images to train, which poses a significant hurdle when curating datasets for downstream tasks. To address this, the authors propose a new self-supervised pre-training approach called SimMIM (Simple Framework for Masked Image Modeling). SimMIM trains the network to reconstruct masked portions of the input image, forcing it to better exploit the information in the visible pixels. Using this approach, the authors trained Swin v2 on 70 million images, a dataset roughly 40 times smaller than those used to train other large vision models.
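The following toy sketch illustrates a SimMIM-style objective: mask a random subset of image patches, encode the result, and regress the raw pixels of the masked patches with an L1 loss. The stand-in encoder, prediction head, and helper names are our own simplifications, not the actual SimMIM code (which uses Swin blocks and a lightweight prediction head).

```python
import torch
import torch.nn as nn

def masked_image_modeling_loss(encoder, head, mask_token, images,
                               patch_size=16, mask_ratio=0.6):
    """Toy SimMIM-style objective: predict raw pixels of masked patches."""
    B, C, H, W = images.shape
    # Split the image into non-overlapping patches: (B, n_patches, patch_dim).
    patches = images.unfold(2, patch_size, patch_size) \
                    .unfold(3, patch_size, patch_size) \
                    .reshape(B, C, -1, patch_size, patch_size) \
                    .permute(0, 2, 1, 3, 4) \
                    .reshape(B, -1, C * patch_size * patch_size)
    n_patches = patches.shape[1]

    # Randomly mask a fraction of patches and replace them with a mask token.
    mask = torch.rand(B, n_patches) < mask_ratio                 # (B, n_patches)
    tokens = torch.where(mask.unsqueeze(-1), mask_token, patches)

    # Encode all tokens, then regress raw pixels for the masked positions only.
    pred = head(encoder(tokens))                                 # (B, n_patches, patch_dim)
    return (pred[mask] - patches[mask]).abs().mean()             # L1 loss on masked patches

# Toy usage with a stand-in encoder (a real setup would use Swin v2 blocks).
patch_dim = 3 * 16 * 16
encoder = nn.Sequential(nn.Linear(patch_dim, 256), nn.GELU(), nn.Linear(256, 256))
head = nn.Linear(256, patch_dim)
mask_token = nn.Parameter(torch.zeros(patch_dim))
loss = masked_image_modeling_loss(encoder, head, mask_token,
                                  torch.randn(2, 3, 64, 64))
loss.backward()
```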
As of November 2021, when the Swin v2 architecture was introduced, the model set new performance records on four vision benchmarks: ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification.
Why does it matter?
Democratizing the training of large models and making compute resources accessible is an integral part of ensuring equal research opportunities between industry and academia. Both of these papers take a step in this direction, one by laying out a process to pool GPUs and collaboratively train a model, the other by scaling an existing model while using only a fraction of the compute. As large neural networks – especially transformers – continue to drive fascinating results in AI, it is increasingly important to create research opportunities and avenues that are accessible to most researchers. We hope the open-source algorithms discussed above emerge as the first steps in this process.
New from the Gradient
Sara Hooker: Cohere for AI, the Hardware Lottery, and DL Tradeoffs
Lukas Biewald: Crowdsourcing at CrowdFlower and ML Tooling at Weights & Biases
Other Things That Caught Our Eyes
News
Meta develops AI system for reviewing Wikipedia citations "Meta Platforms Inc. has developed an artificial intelligence system that can scan a Wikipedia article, analyze the sources cited by the article and identify if some of them may need to be changed. Meta detailed the AI system today."
Inside a radical new project to democratize AI "This is as close as you can get to a rock concert in AI research. Inside the supercomputing center of the French National Center for Scientific Research, on the outskirts of Paris, rows and rows of what look like black fridges hum at a deafening 100 decibels."
Amnon Shashua’s AI21 Labs raises $64 million for natural language processing platform "AI21 Labs, an Israeli startup aiming to change the way people read and write, announced on Tuesday the completion of its $64 million Series B funding round, bringing the company’s valuation to $664 million."
Papers
Language Models (Mostly) Know What They Know We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks… We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
Data Distributional Properties Drive Emergent In-Context Learning in Transformers Large transformer-based models are able to perform in-context few-shot learning, without being explicitly trained for it. This observation raises the question: what aspects of the training regime lead to this emergent behavior? Here, we show that this behavior is driven by the distributions of the training data itself… our findings indicate how the transformer architecture works together with particular properties of the training data to drive the intriguing emergent in-context learning behaviour of large language models, and how future work might encourage both in-context and in-weights learning in domains beyond language.
Closing Thoughts
Have something to say about this edition’s topics? Shoot us an email at gradientpub@gmail.com and we will consider sharing the most interesting thoughts from readers in the next newsletter! If you enjoyed this piece, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!