Gradient Update #11: Deep Learning for Astronomy and The World's Largest Language Model

In which we discuss mining for strong gravitational lenses with self-supervised learning and Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model

Welcome to the eleventh update from the Gradient! If you were referred by a friend, subscribe and follow us on Twitter!

Paper Highlight: Mining for strong gravitational lenses with self-supervised learning

Astronomers Using NASA's Hubble Discover Quasars Acting as Gravitational Lenses


Of the many cool predictions stemming from Einstein’s theory of General Relativity, gravitational lensing, a phenomenon in which an object is so massive and gravitationally dominant that light from more distant sources is bent around it as if through a lens, has remained mostly elusive to identify despite its well-studied theoretical origins. Recently, University of California, Berkeley researchers used “self-supervised representation learning to distill information from 76 million galaxy images from the Dark Energy Spectroscopic Instrument (DESI)” to identify over 1,000 new candidate images of strong gravitational lensing. Additionally, the researchers released an “interactive web-based similarity search tool” to extend their methodology to future datasets and help identify other rare sources for which labeled data is scarce.


Recently, self-supervised learning has shown great success in a variety of computer vision tasks, particularly excelling in situations where the volume of manually labeled data is incredibly small. In this instance, the researchers used a series of random augmentations to create additional “views” of each image, which were then fed into an encoder network that learns representations by optimizing a contrastive loss function. This self-supervised pre-training step was performed on a subset of the total dataset, after which the encoder was applied to predict embeddings for the entire dataset (74 million images). Finally, a linear classifier was trained on the embeddings in a supervised manner, using a few thousand manually labeled lenses, to predict whether or not an image contains lensing.
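The contrastive objective at the heart of this kind of pre-training is easy to sketch. Below is a minimal NumPy illustration of an InfoNCE/NT-Xent-style loss over two batches of “view” embeddings; the batch size, embedding dimension, and temperature are illustrative assumptions, not the authors’ exact setup, and the encoder itself is omitted:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """InfoNCE / NT-Xent contrastive loss over two batches of view embeddings.

    z1[i] and z2[i] are embeddings of two random augmentations ("views") of
    the same image; all other pairs in the batch are treated as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # L2-normalize
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)                  # (2N, d)
    sim = z @ z.T / temperature                           # cosine similarities
    np.fill_diagonal(sim, -np.inf)                        # a view is not its own negative
    n = len(z1)
    # the positive for row i is its counterpart view in the other batch
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
# nearly identical views should yield a much lower loss than unrelated ones
loss = nt_xent_loss(z1, z1 + 0.01 * rng.normal(size=(8, 32)))
```

Minimizing this loss pulls embeddings of augmented views of the same galaxy together while pushing apart embeddings of different galaxies, which is what makes the resulting representations useful for the downstream linear probe.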

Why it matters

In both the natural world and our contrived one, some of the most daunting tasks for automation and classification lack high-quality, abundant labels. Whether we are looking at compact X-ray objects or predicting which shelf in the grocery store a consumer will reach for, data will always be a limiting factor in the success of automation or classification. Seeing such a great demonstration of self-supervised learning in a low-resource domain governed by some of the most extreme physical conditions in our universe fills us with optimism that other domains, especially those constrained by the quality and quantity of labeled data, will continue to find great success with these methods.

Editor Comments

Justin: While it’s always great to see new state-of-the-art (SOTA) results on common datasets in computer vision and other ML tasks, one of the true measures of our success as researchers is the adoption and application of our methodologies across disciplines. I love seeing the unification of machine learning and physics, and how its adoption is helping answer some previously hard-to-answer questions.

Andrey: Agree with Justin, it’s been very cool to see AI being applied outside the traditional Computer Vision and NLP problems lately. DeepMind has been doing a lot of great work like this recently with AlphaFold and its more recent work on rain ‘nowcasting’. While I’m not very familiar with astronomy research, it sounds like sorting through tens of millions of images to find specific phenomena is a perfect fit for AI. It’s exciting to see that deep learning techniques are mature and powerful enough that they can now be applied to such varied scientific disciplines.

News Highlight: Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model


Just over a year after OpenAI unveiled GPT-3 to the public, NVIDIA and Microsoft announced that their latest language model, Megatron-Turing NLG, has dethroned GPT-3 as the world’s largest and most powerful generative language model. Where GPT-3 has 175 billion parameters, Megatron-Turing NLG has 530 billion. Naturally, it demonstrates state-of-the-art accuracy on a broad set of natural language tasks, including sentence completion, reading comprehension, reasoning, natural language inference, and word sense disambiguation.


The recent trend in natural language processing revolves around the observation that the bigger the model, the better it learns, and the better its output. However, bigger models need vast computational resources, a significant challenge that Microsoft and NVIDIA had to overcome. The researchers trained Megatron-Turing NLG on the NVIDIA DGX SuperPOD-based Selene supercomputer, which houses 560 DGX A100 servers. At the heart of their system, tensor slicing from NVIDIA’s Megatron-LM and pipeline parallelism from Microsoft’s DeepSpeed were combined to scale the model across nodes. The training corpus comprised 15 datasets totaling 339 billion tokens.
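To get a feel for why both forms of parallelism are needed, a back-of-the-envelope sketch helps: tensor slicing splits each layer’s weight matrices across GPUs within a node, while pipeline parallelism splits the stack of layers across nodes, so one model replica spans (tensor degree × pipeline stages) GPUs. The degrees below (8-way tensor, 35-way pipeline) are our reading of the announcement and should be treated as illustrative; the memory figure ignores optimizer states, gradients, and activations:

```python
# Rough arithmetic for sharding a 530B-parameter model across GPUs.
TOTAL_PARAMS = 530e9
BYTES_PER_PARAM = 2  # fp16 weights

def params_per_gpu(tensor_parallel, pipeline_stages):
    """Weights per GPU when each layer is sliced tensor_parallel ways
    (Megatron-LM-style) and layers are split into pipeline_stages groups
    (DeepSpeed-style); one replica spans tensor_parallel * pipeline_stages GPUs."""
    return TOTAL_PARAMS / (tensor_parallel * pipeline_stages)

shard = params_per_gpu(tensor_parallel=8, pipeline_stages=35)  # 280-GPU replica
print(f"{shard / 1e9:.1f}B params/GPU, "
      f"~{shard * BYTES_PER_PARAM / 2**30:.1f} GiB of fp16 weights")
```

Even sliced 280 ways, each GPU still holds on the order of two billion parameters of weights alone, which is why no single accelerator, or even a single node, could hold the model without this kind of partitioning.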

Why does it matter?

It’s almost a certainty that the bigger the model, the better the results, and that was the case here too. Notably, in a few-shot setting, the researchers found that Megatron-Turing NLG performed substantially better on tasks where the model had to find relations between two sentences. This is important because, as the researchers note, previous models have struggled in such scenarios. Similar improvements were seen in zero-shot and one-shot evaluations as well. It’s worth noting that, compared to its contemporaries, the new language model was trained on fewer tokens, further cementing the observation that larger models learn faster and cover a broader spectrum of the training dataset. Impressively, the team behind Megatron-Turing NLG also observed that the model can infer basic mathematical operations from context, even when the symbols involved in the expressions were obfuscated. While far from claiming numeracy, the model seems to go beyond mere memorization of arithmetic.

Regardless of the advancements and improvements highlighted by NVIDIA and Microsoft’s latest model, the researchers note that it still exhibits bias and toxicity. The model picks up stereotypes and biases from the data on which it is trained and reflects them in its outputs. These issues are rooted in the training data and were apparent in other language models, including GPT-3. Bias and discrimination in language models are issues of contemporary interest, and researchers are only beginning to figure out ways to mitigate them within AI models.

Editor Comments

Andrey: It’s interesting to see the breakneck speed of research on large language models over the past year. GPT-3 now seems like a turning point after which the importance of such models became clear, and there are now at least a dozen such models developed by different companies and in various countries. It’s also interesting that this came at about the same time as Google’s Exploring the Limits of Large-Scale Model Pretraining, which ran a large number of experiments to characterize how much further benefit we might expect from making these models larger and larger. While it’s exciting to see so much effort going toward scaling, I also hope more research comes out soon addressing some of the inherent limitations of these models, such as only handling text, being unable to process long inputs, and lacking long-term memory.

New from the Gradient

Chelsea Finn on Meta Learning & Model Based Reinforcement Learning

Subscribe to The Gradient Podcast: Apple Podcasts | Spotify | Pocket Casts | RSS

Other Things That Caught Our Eyes


In Favor of More Science Communication by AI Researchers “There's too much misunderstanding, hype, and misinformation surrounding AI, and those developing it should do more to change that”

Duke computer scientist wins ‘Nobel Prize’ worth $1M for artificial intelligence work “Whether preventing explosions on electrical grids, spotting patterns among past crimes, or optimizing resources in the care of critically ill patients, Duke University computer scientist Cynthia Rudin wants artificial intelligence (AI) to show its work.”

New robots patrolling for 'anti-social behaviour' causing unease in Singapore streets “There are new sheriffs in town in Singapore, and they are unnerving many who live there.”

How AI can fight human trafficking “Marinus implements its mission primarily through its set of AI-based tools called Traffic Jam, which has a goal ‘to find missing persons, stop human trafficking and fight organized crime.’”

Alphabet’s DeepMind A.I. lab turns a profit for the first time ever "DeepMind, one of the world's premier artificial intelligence labs, has turned a profit for the first time ever, according to a filing with the U.K. company registry published Tuesday. The London-based research firm recorded a profit of £43.8 million."


Patches Are All You Need? “Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings … we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet.”

Truthful AI: Developing and governing AI that does not lie “While lying has traditionally been a human affair, AI systems that make sophisticated verbal statements are becoming increasingly prevalent … Establishing norms or laws of AI truthfulness will require significant work to: (1) identify clear truthfulness standards; (2) create institutions that can judge adherence to those standards; and (3) develop AI systems that are robustly truthful. Our initial proposals for these areas include: (1) a standard of avoiding "negligent falsehoods" (a generalisation of lies that is easier to assess); (2) institutions to evaluate AI systems before and after real-world deployment; and (3) explicitly training AI systems to be truthful via curated datasets and human interaction.”

Skillful precipitation nowcasting using deep generative models of radar “Precipitation nowcasting, the high-resolution forecasting of precipitation up to two hours ahead, supports the real-world socioeconomic needs of many sectors reliant on weather-dependent decision-making ... Here we present a deep generative model for the probabilistic nowcasting of precipitation from radar that addresses these challenges. Using statistical, economic and cognitive measures, we show that our method provides improved forecast quality, forecast consistency and forecast value. Our model produces realistic and spatiotemporally consistent predictions over regions up to 1,536 km × 1,280 km and with lead times from 5–90 min ahead. Using a systematic evaluation by more than 50 expert meteorologists, we show that our generative model ranked first for its accuracy and usefulness in 89% of cases against two competitive methods.”

Bottom-Up Skill Discovery from Unsegmented Demonstrations for Long-Horizon Robot Manipulation “We tackle real-world long-horizon robot manipulation tasks through skill discovery. We present a bottom-up approach to learning a library of reusable skills from unsegmented demonstrations and use these skills to synthesize prolonged robot behaviors … The entire model can be trained on a small set of human demonstrations collected within 30 minutes without further annotations, making it amendable to real-world deployment. We systematically evaluated our method in simulation environments and on a real robot. Our method has shown superior performance over state-of-the-art imitation learning methods in multi-stage manipulation tasks.”

Machine-learning-based evidence and attribution mapping of 100,000 climate impact studies “Increasing evidence suggests that climate change impacts are already observed around the world. Global environmental assessments face challenges to appraise the growing literature. Here we use the language model BERT to identify and classify studies on observed climate impacts, producing a comprehensive machine-learning-assisted evidence map. We estimate that 102,160 (64,958–164,274) publications document a broad range of observed impacts. By combining our spatially resolved database with grid-cell-level human-attributable changes in temperature and precipitation, we infer that attributable anthropogenic impacts may be occurring across 80% of the world’s land area, where 85% of the population reside. Our results reveal a substantial ‘attribution gap’ as robust levels of evidence for potentially attributable impacts are twice as prevalent in high-income than in low-income countries.”


Devi Parikh (@deviparikh): “This Climate Does Not Exist is an AI-driven experience based on empathy, allowing users to imagine the environmental impacts of the current climate crisis, one address at a time.” #Thisclimatedoesnotexist

Closing Thoughts

Have something to say about this edition’s topics? Shoot us an email and we will consider sharing the most interesting thoughts from readers in the next newsletter! If you enjoyed this piece, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student and volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!