Update #34: Text-to-Video and ICLR Highlights
In which we discuss recent text-to-video models and selected submissions to ICLR 2023.
Welcome to the 34th update from the Gradient! If you were referred by a friend, subscribe and follow us on Twitter!
Want to write with us? Send a pitch using this form :)
This newsletter is a little long because of the ICLR highlights—you can expand the email or head to the Substack post to view in full!
News Highlight: Generating Videos from Text Prompts
Summary
New models from Meta AI and Google Brain are capable of generating high-resolution videos from input text prompts. Key technical components present in some of the architectures include diffusion models, Transformers, factorized convolutions, and super-resolution. These models build on the success of text-to-image generation models such as DALL-E 2, Imagen, and Stable Diffusion, and demonstrate another massive leap forward in the capabilities of generative models.
Background
This year, several text-to-image models took the AI community and general public by storm, including DALL-E 2, Imagen, and Stable Diffusion. Inspired by the success of these and other generative models, researchers have been pushing the limits of generating all sorts of outputs, such as video. Text-to-video models are more difficult to develop than text-to-image models: paired text+video datasets contain far fewer examples than paired text+image datasets, and processing and generating videos is significantly more computationally demanding than dealing with images. Nevertheless, just months after those text-to-image releases, outstanding new text-to-video models have arrived.
Meta AI’s Make-A-Video and Google Brain’s Imagen Video are new general purpose text-to-video generators, which take in a text input and output a video related to that input. They are capable of generating videos for a variety of prompts, ranging from “Incredibly detailed science fiction scene set on an alien planet, view of a marketplace. Pixel art.” for Imagen Video to “An emoji of a baby panda wearing a red hat, blue gloves, green shirt, and blue pants” for Make-A-Video. Both models are based on combining diffusion models, factorized convolutions, and super-resolution.
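For readers curious about the “factorized convolutions” mentioned above: the idea is to split an expensive full 3D (space-time) convolution into a cheaper spatial pass over each frame followed by a temporal pass across frames. Below is a minimal PyTorch sketch of the concept; it is our own illustration of the general technique, not the actual layer used by Make-A-Video or Imagen Video.

```python
import torch
import torch.nn as nn

class FactorizedConv3d(nn.Module):
    """Pseudo-3D convolution: spatial conv per frame, then temporal conv."""
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # Kernel (1, k, k): mixes pixels within each frame only.
        self.spatial = nn.Conv3d(channels, channels, (1, k, k),
                                 padding=(0, k // 2, k // 2))
        # Kernel (k, 1, 1): mixes information across frames only.
        self.temporal = nn.Conv3d(channels, channels, (k, 1, 1),
                                  padding=(k // 2, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time, height, width).
        return self.temporal(self.spatial(x))

video = torch.randn(1, 16, 8, 64, 64)   # 8 frames of 64x64 features
out = FactorizedConv3d(16)(video)        # same shape as the input
```

The payoff is that the two smaller kernels touch far fewer weights and activations than one (k, k, k) kernel, which matters when every denoising step has to process an entire clip.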
Besides generating videos from text prompts, these models have other capabilities as well. For instance, Make-A-Video can take a still image and generate a video that continues it; it can also generate variations of an existing video. Moreover, the new Phenaki model from Google Brain can take a series of prompts and generate a single continuous video that tells a story in which each prompt occurs sequentially. For example, Phenaki can generate a video in which a teddy bear first swims in the ocean, then dives underwater, continues swimming underwater, and finally morphs into a panda bear. On the demo website, there are longer videos, spanning over two minutes, that tell a continuous story generated from a longer sequence of prompts.
Why does it matter?
Text-to-video models are the latest creative AI models capable of generating massive amounts of media from human text input. They have the potential to further human creative production and to make videos that people enjoy, learn from, or use in their work. However, there is also potential for several types of misuse, such as generating grotesque content, misinformation, and variations on copyrighted content. These models are trained on data that is known to be biased and to contain violent and pornographic content, such as LAION-400M. According to the authors, this is part of the reason why none of these three new text-to-video models has been released publicly.
Editor Comments
Daniel: I think the current take is “look at how mindblowingly fast AI is moving right now” (see this tweet from Kevin Roose and this reply from Jack Clark). I don’t know how useful it is to prognosticate and say “we’re now X years away from an AI-generated (short) film winning an Oscar.” But there is something truly astounding, beautiful, and disconcerting all at once in watching things progress the way they are. I suspect this sort of feeling isn’t unique to those of us living right now–technological progress has caused rapid changes in the world before, and history often repeats itself in many ways. But to narrow ourselves back to the topic at hand–I think some interesting trends to continue to watch here are (a) how “copyright” and “ownership” are thought about as AI systems develop even more capabilities and (b) how risks associated with these systems (pornographic content, misleading content, etc.) are exploited and mitigated.
Research Highlight: ICLR Overview
Summary
The International Conference on Learning Representations (ICLR) is a leading annual conference in machine learning that publishes work on a variety of topics, ranging from climate and sustainability to representation learning and kernel learning. In this article, we present an overview of a curated set of submissions to ICLR 2023, including algorithms for generating 3D objects and audio, a new optimizer for training large language models, self-programming AI, and advances in protein structure generation and Atari gameplay with Reinforcement Learning.
Overview
ICLR submissions are out, so let’s get right to it!
Please note that the papers discussed below have not yet been peer-reviewed.
Generative Modeling:
DreamFusion: This paper presents a novel ‘text-to-3D’ generative algorithm that can generate 3D objects using only a pre-trained ‘text-to-2D’ diffusion model. As a result, the method does not require a dataset of 3D objects, which would otherwise have been highly taxing to curate.
DreamFusion trains a NeRF from scratch for each text description, using it to predict the queried entity’s volumetric density and color. This information is used to construct surface normals and shading (from a randomly placed light source) for the object’s entire surface. Noise is then added to the shaded NeRF rendering, a frozen Imagen model predicts that noise, and the residual is used as a gradient to update the NeRF MLP’s parameters, a technique the authors call score distillation; a rough sketch of this loop follows below. The DreamFusion model is used to generate a gallery of 3D models, videos, and meshes that you can view here.
Text-to-video: Multiple groups submitted “text-to-video” generation algorithms, which are covered in our news section today!
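To make DreamFusion’s optimization loop concrete, here is a heavily simplified sketch of the score-distillation idea in PyTorch. Helper names like `nerf.render`, `random_camera_pose`, and `frozen_eps_model` are our own placeholders, and real implementations add timestep weighting, classifier-free guidance, and more.

```python
import torch

def score_distillation_step(nerf, frozen_eps_model, text_emb,
                            optimizer, alphas_cumprod):
    # Render an image from the current NeRF at a random camera pose
    # (`nerf` and `random_camera_pose` are hypothetical helpers).
    image = nerf.render(random_camera_pose())       # (1, 3, H, W)

    # Sample a diffusion timestep and noise the rendering accordingly.
    t = torch.randint(0, len(alphas_cumprod), (1,))
    a_t = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(image)
    noisy = a_t.sqrt() * image + (1 - a_t).sqrt() * noise

    # The frozen text-to-image model predicts the noise; it gets no gradients.
    with torch.no_grad():
        pred_noise = frozen_eps_model(noisy, t, text_emb)

    # Score distillation: treat (pred_noise - noise) as the gradient of the
    # loss w.r.t. the rendered image and chain-rule it into the NeRF weights.
    optimizer.zero_grad()
    image.backward(gradient=pred_noise - noise)
    optimizer.step()
```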
Text-to-audio: This paper presents an auto-regressive audio generation model conditioned on text inputs. The task is challenging due to a dearth of high-quality audio data with associated text captions, as well as the difficulty of singling out one object’s sound when a recording may contain sounds from multiple sources. To address these issues, the authors assemble 10 datasets containing various audio recordings and their text captions. They also propose mixing audio from different sources before feeding it to the model, thereby forcing the model to learn to separate audio sources.
Their training methodology involves a GAN-like loss, combining adversarial and reconstruction losses obtained while using a Transformer-based autoencoder to embed the input audio. The method outperforms the “DiffSound” baseline, achieving a score of 72 (out of 100) for the overall quality of the generated audio and 69 for text relevance (subjective metrics where users rated perceptual quality and audio/text alignment on a scale of 1 to 100).
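The source-mixing augmentation lends itself to a short sketch. The version below is our own simplified take on the idea (mix two clips at a random signal-to-noise ratio and join their captions), not the paper’s code:

```python
import torch

def mix_sources(wave_a, wave_b, caption_a, caption_b,
                snr_db_range=(-5.0, 5.0)):
    """Overlay two audio clips so the model must learn to separate sources."""
    n = min(wave_a.shape[-1], wave_b.shape[-1])
    a, b = wave_a[..., :n], wave_b[..., :n]

    # Sample a random signal-to-noise ratio and scale clip b to match it.
    snr_db = float(torch.empty(1).uniform_(*snr_db_range))
    rms_a = a.pow(2).mean().sqrt()
    rms_b = b.pow(2).mean().sqrt()
    gain = rms_a / (rms_b * 10.0 ** (snr_db / 20.0) + 1e-8)

    mixture = a + gain * b
    peak = mixture.abs().max()
    if peak > 1.0:                      # rescale only if the mix would clip
        mixture = mixture / peak
    return mixture, f"{caption_a} and {caption_b}"
```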
Theory:
Amos: Amos is a stochastic gradient-based optimizer that uses model-specific information to enable faster convergence when training large language models. Specifically, for each variable in the model’s trainable parameters, the authors derive a new hyperparameter corresponding to the scale that variable is expected to converge to. This per-variable scale also influences the learning rate: the global learning rate is multiplied by the model-specific scale of each variable. The authors also derive the expected scales and hyperparameter values for Transformer-based networks in the paper. By introducing such parameters to make optimization highly model-specific, the authors found that when pre-training BERT and T5 language models, the Amos optimizer outperformed AdamW, achieving better validation loss in 70% fewer training steps while using less than half the memory.
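The core trick, a per-variable expected scale multiplying the shared global learning rate, can be illustrated in a few lines. This is a deliberately stripped-down sketch of that one idea; the real Amos update also includes per-variable regularization and decay schedules.

```python
import torch

def scaled_update(params, expected_scales, global_lr=0.01):
    """Toy update: each tensor's step size is global_lr times its own scale."""
    with torch.no_grad():
        for p, scale in zip(params, expected_scales):
            if p.grad is not None:
                # `scale` is the magnitude this variable is expected to
                # converge to (derived per architecture in the paper).
                p -= global_lr * scale * p.grad
```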
Self-programming AI: In what the authors assert to be the first practical implementation of a self-programming system, a T5-based autoencoder is trained on code samples that train convolutional neural networks. A genetic algorithm then asks the model to generate N ‘refined’ code samples and trains each for one epoch; the resulting losses determine the best candidate for the next step of the algorithm. The model is mostly restricted to making hyperparameter changes: the number of layers, the number of neurons in a layer, and the number of attention heads. Note that this differs from simple hyperparameter tuning, as a language model is rewriting code rather than merely making numerical changes. Generated candidates with syntactic or compilation errors are discarded by the genetic algorithm. The authors find that the network’s accuracy increases from 20% to 70% over four epochs on CIFAR-10, from 92% to 98% within one epoch on MNIST, and from 70% to 87% over four epochs on EMNIST.
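A rough sketch of the outer loop might look like the following; `generate_refinements`, `compiles`, and `train_one_epoch` are hypothetical helpers standing in for the paper’s machinery.

```python
def self_programming_loop(code_model, seed_source, steps=4, n_candidates=8):
    """Genetic loop: a language model rewrites its own training code,
    and short training runs pick the seed for the next generation."""
    best_source = seed_source
    for _ in range(steps):
        # The code model proposes N 'refined' variants of the current program.
        candidates = generate_refinements(code_model, best_source, n_candidates)
        # Variants with syntax or compilation errors are discarded.
        candidates = [c for c in candidates if compiles(c)] or [best_source]
        # Train each candidate for one epoch; keep the lowest-loss program.
        losses = [train_one_epoch(c) for c in candidates]
        best_source = candidates[losses.index(min(losses))]
    return best_source
```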
Large Language Models:
Human-Level Prompt Engineering: Large language models don’t always produce optimal results out of the box; they often must be prompted with carefully crafted queries to respond correctly. While humans currently discover these prompts by trial and error, the authors explore using LLMs to generate meaningful prompts automatically. In this paper, a secondary LLM is given a set of input-output demonstrations for the target LLM and outputs a large number of candidate instructions, which are then scored and refined using a Monte Carlo search method: the secondary LLM is asked to generate prompts that “have the same semantic meaning” as the highest-scoring candidates. The experiments reveal that the method achieves human-level performance on 19 of 24 tasks.
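In sketch form, the search loop might look like this (our own simplification; `propose`, `score`, and `paraphrase` are hypothetical helpers around LLM calls):

```python
def prompt_search(llm, demos, n_initial=50, n_rounds=3, keep_top=5):
    """Propose candidate instructions, score them on demonstrations,
    then resample semantically similar variants of the best ones."""
    candidates = propose(llm, demos, n_initial)
    for _ in range(n_rounds):
        # Rank candidates by how well the target model performs with them.
        ranked = sorted(candidates, key=lambda c: score(c, demos), reverse=True)
        top = ranked[:keep_top]
        # Monte Carlo step: ask the LLM for prompts with the same meaning.
        variants = [paraphrase(llm, c) for c in top for _ in range(5)]
        candidates = top + variants
    return max(candidates, key=lambda c: score(c, demos))
```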
Science:
Protein Structure Generation: Discovering new protein structures is an important problem that could lead to novel treatments for currently incurable diseases. This work targets the challenge of generating protein structures directly from neural networks by training a diffusion model to design protein backbone structures. Specifically, diffusion is performed on the six angles that constitute the backbone (three torsion angles and three bond angles). The process starts from an experimentally observed backbone and incrementally adds noise to it; the network is then trained to remove this noise and recover the protein structure. By sampling from the noise space and asking the model to remove the added noise, the method can generate protein structures that were not present in the training dataset. The authors find that the method correctly estimates the marginal distributions of the torsion and bond angles, and generates protein structures that closely resemble those in the test set.
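Because the six angles are periodic, the diffusion must respect wraparound. Here is a minimal sketch of one forward noising step on an angle tensor, in our own simplified notation rather than the paper’s exact formulation:

```python
import torch

def noise_angles(angles, t, betas):
    """One forward-diffusion step on backbone angles, wrapped to [-pi, pi)."""
    beta_t = betas[t]
    noisy = (1.0 - beta_t).sqrt() * angles \
            + beta_t.sqrt() * torch.randn_like(angles)
    # Angles live on a circle, so wrap the result back into [-pi, pi).
    return (noisy + torch.pi) % (2 * torch.pi) - torch.pi
```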
Reinforcement Learning:
SpeedyZero: Advances in reinforcement learning have enabled agents to achieve human-level performance on tasks such as Atari gameplay, but training such models can be very expensive in terms of computational resources, time, and data. In this paper, the authors develop a distributed, parallel training system for the EfficientZero model-based reinforcement learning method. The optimizations they propose include using PyTorch’s distributed data-parallel training in RL, compressing data with the lz4 algorithm, replicating the replay buffer on each node in the distributed setting, and actively refreshing the priorities of all data points in the replay buffer. The results show that SpeedyZero maintains sample efficiency similar to EfficientZero’s while achieving a 20x speedup in wall-clock time. Their agent achieves human-level performance on the Atari benchmark within 30 minutes of training on 300K samples.
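One of the simpler systems tricks, compressing replay-buffer transitions with lz4 before they cross the network, can be sketched directly with the `lz4` Python package. How SpeedyZero wires this into its pipeline is more involved; this only illustrates the compression step itself.

```python
import pickle
import lz4.frame  # pip install lz4

def pack_transition(transition) -> bytes:
    """Serialize and lz4-compress a transition before shipping or storing it."""
    return lz4.frame.compress(pickle.dumps(transition))

def unpack_transition(blob: bytes):
    """Inverse of pack_transition."""
    return pickle.loads(lz4.frame.decompress(blob))

# Example: a (state, action, reward, next_state) tuple shrinks considerably
# when observations contain redundant structure (e.g., Atari frames).
```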
New from the Gradient
Stuart Russell: The foundations of Artificial Intelligence
Varun Ganapathi: AKASA, AI and Healthcare
Other Things That Caught Our Eyes
News
US DoD Extends AI, Machine Learning Contract With Palantir Technologies “The US Army Research Laboratory has extended its contract with Palantir Technologies to continue advanced artificial intelligence (AI) and machine learning (ML) services for the US Department of Defense.”
White House unveils ‘AI bill of rights’ as ‘call to action’ to rein in tools “The non-binding guidelines call on companies to curb uses that lead to harm.”
Google’s new AI can hear a snippet of song—and then keep on playing “AudioLM, developed by Google researchers, generates audio that fits the style of the prompt, including complex sounds like piano music, or people speaking, in a way that is almost indistinguishable from the original recording.”
Exclusive: Boston Dynamics pledges not to weaponize its robots “The company joined other robotics firms in pledging to curtail projects that turn robots into war machines.”
Pizza robot truck startup Stellar snags $16.5M from Jay-Z “Stellar Pizza, a company that makes trucks equipped with robotic pizza makers, has raised $16.5 million in a funding round backed by Jay-Z’s Marcy Venture Partners, multiple outlets reported.”
Papers
VIMA: General Robot Manipulation with Multimodal Prompts “This work shows that we can express a wide spectrum of robot manipulation tasks with multimodal prompts, interleaving textual and visual tokens. We design a transformer-based generalist robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. To train and evaluate VIMA, we develop a new simulation benchmark with thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and four levels of evaluation protocol for systematic generalization. VIMA achieves strong scalability in both model capacity and data size. It outperforms prior SOTA methods in the hardest zero-shot generalization setting by up to 2.9x task success rate given the same training data. With 10x less training data, VIMA still performs 2.7x better than the top competing approach.”
Downstream Datasets Make Surprisingly Good Pretraining Corpora “For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus (despite using around 10x-500x less data), outperforming the latter on 7 and 5 datasets, respectively. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Our results suggest that in many scenarios, performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the incorporation of massive datasets. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data.”
Closing Thoughts
Have something to say about this edition’s topics? Shoot us an email at editor@thegradient.pub and we will consider sharing the most interesting thoughts from readers in the next newsletter! For feedback, you can also reach Daniel directly at dbashir@hmc.edu or on Twitter. If you enjoyed this piece, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!