Gradient Update #18: DeepMind's AlphaCode is an Average Programmer, Bellman Error is bad for RL
In which we cover DeepMind's new paper on AlphaCode, a paper showing the Bellman error is not a good surrogate for value error, and more!
News Highlight: Competitive programming with AlphaCode | DeepMind
DeepMind recently made headlines when it unveiled a new language model that achieved unprecedented success in programming competitions. Dubbed AlphaCode, the transformer-based language model ranks within the top 54% of participants in programming contests on Codeforces. This marks the first time that an AI code generation system has reached a competitive level of performance in programming competitions.
Thanks to the groundbreaking transformer architecture, the field of natural language processing and generation has turned over a new leaf. In the past couple of years, we've seen massive language models like BERT, GPT-2/3, and MT-NLG used for tasks like creative writing, conversational bots, and songwriting. The models have performed well in most settings and sometimes even produced deceptively realistic writing. But despite their numerous successes, these language models still underperform in specialized settings, particularly in writing code. For example, last year researchers from UC Berkeley, UChicago, UIUC, and Cornell found that understanding and solving coding problems is still a notoriously challenging task for even the best language models we have today.
This is where AlphaCode steps in. In the preprint Competition-Level Code Generation with AlphaCode, researchers at DeepMind trained a language model that achieved coding performance comparable to that of the ‘average’ contestant in a programming competition.
To achieve this unprecedented success, the researchers combined advances in large-scale transformer models with large-scale sampling and filtering. AlphaCode was first pre-trained on selected public GitHub code and then fine-tuned on a relatively small competitive programming dataset. At evaluation time, a very large number of C++ and Python programs were generated for each problem. These were then filtered and clustered, and the resulting solutions were re-ranked to select 10 candidate programs that were submitted for external assessment.
This pipeline, which mimics a typical competitor’s trial-and-error process of debugging, compiling, passing tests, and eventually submitting, was put through its paces in 10 recent programming contests hosted on Codeforces. By solving various coding problems that involved a combination of critical thinking, logic, algorithms, coding, and natural language understanding, AlphaCode ranked amongst the top 54% of competitors.
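The filter, cluster, and re-rank steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DeepMind's implementation: candidate programs are stood in for by plain Python callables, and the names `select_candidates`, `safe_run`, `example_tests`, and `probe_inputs` are hypothetical. It assumes that filtering means passing the example tests given with a problem, and that clustering means grouping programs that behave identically on some extra inputs.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a generated program: maps one input to one output.
Program = Callable[[int], int]


def safe_run(program: Program, x: int):
    """Run a candidate; any exception is mapped to the marker value "error"."""
    try:
        return program(x)
    except Exception:
        return "error"


def select_candidates(
    programs: Sequence[Program],
    example_tests: Sequence[Tuple[int, int]],
    probe_inputs: Sequence[int],
    k: int = 10,
) -> List[Program]:
    # 1. Filter: keep only programs that pass every example test.
    survivors = [
        p for p in programs
        if all(safe_run(p, x) == y for x, y in example_tests)
    ]

    # 2. Cluster: group survivors by their outputs on the probe inputs,
    #    so programs that behave the same land in the same cluster.
    clusters = defaultdict(list)
    for p in survivors:
        signature = tuple(safe_run(p, x) for x in probe_inputs)
        clusters[signature].append(p)

    # 3. Re-rank: largest clusters first, one representative per cluster,
    #    up to k submissions.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:k]]


def crashes(x: int) -> int:
    raise ValueError("simulated runtime error")


# Toy "problem": square the input. The example tests alone cannot tell
# x * x apart from x + x, but the probe inputs can.
candidates = [
    lambda x: x * x,   # correct
    lambda x: x ** 2,  # correct, same behaviour as above
    lambda x: x + x,   # passes the example tests, wrong in general
    lambda x: x + 1,   # fails the example tests
    crashes,           # raises, so it fails the example tests
]
picks = select_candidates(
    candidates, example_tests=[(2, 4), (0, 0)], probe_inputs=[3, 5]
)
print(len(picks), [p(3) for p in picks])
```

In this toy run, two behaviourally distinct candidates survive filtering, and the representative of the larger cluster is ranked first. Picking one program per cluster is one plausible way to avoid spending the limited submission budget on near-duplicates.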
Why Does it Matter?
AlphaCode’s impressive performance points towards two important ideas. First, language models are becoming increasingly adept at understanding and responding to specific language problems. Second, tailored pre- and post-processing of inputs and outputs, such as the training and inference pipeline discussed above, is a linchpin in improving the performance and specificity of large language models.
However, AlphaCode’s successes should be taken with a pinch of salt. Current figures don’t make a strong case for deploying it in commercial settings as a standalone, self-contained system capable of writing industry-level code. A more fitting use case would probably be similar to GitHub Copilot, where the AI system acts as a pair programmer, filling in the blanks and suggesting potential approaches. “You should think of it as something that could be an assistant to a programmer in the way that a calculator might once have helped an accountant. It’s not one-stop shopping that would replace an actual human programmer. We are decades away from that,” commented Gary Marcus, an AI professor at New York University.
In spite of this, the future ahead is certainly interesting. To help research in this area, DeepMind released its dataset of competitive programming problems and solutions on GitHub. “We hope this benchmark will lead to further innovations in problem solving and code generation,” the researchers wrote in the original blog post announcing the news.
Justin: I am really looking forward to the day when I can stop doing leetcode for job interviews and use an open sourced model. Keep up the good work everybody.
Tanmay: Indubitably a huge milestone for language models. I am most surprised at the model’s ability to pull apart the setup from the prompt and develop a quantitative understanding of the problem.
Andrey: This generated a lot of excitement in the media, but to be honest I found it fairly underwhelming. The approach is not that interesting (mostly just fine-tuning a large language model on a dataset of competitive programming), and the task of solving programming problems is not all that useful. Still, having AI to help programmers is going to be a huge deal, so it’s nice to see more work towards that goal.
Daniel: I agree with Andrey that the approach is not that interesting, but I also think we should give credit where it’s due–getting this sort of system to work is really, really hard. I also think Scott Aaronson’s comparison of AlphaCode to a dog speaking mediocre English is a really great way to think of it. It’s probably easy for me to look at something like this and act unimpressed, but folks like Aaronson saw “AI” for the first time before I was alive. When you put things in a longer-term perspective, it’s pretty neat.
Paper Highlight: Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error
The authors “study the use of the Bellman equation as a surrogate objective for value prediction accuracy… [they] find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function… This means that the Bellman error can be minimized without improving the accuracy of the value function. [They] demonstrate these phenomena through a series of propositions, illustrative toy examples, and empirical analysis in standard benchmark domains.”
Why Does it Matter?
In reinforcement learning (RL), “many modern RL algorithms rely on a value function in some capacity,” and that value function is often learned by way of the Bellman equation. The authors showed that “there exists infinitely many suboptimal value functions which satisfy the Bellman equation. This means the Bellman error is not a viable objective over incomplete datasets, such as the off-policy setting.” Since many studies use the Bellman error as a proxy for the value error, this has significant implications for the many domains that apply RL algorithms to incomplete datasets (autonomous vehicles, diagnostic treatments, and gaming immediately come to mind). While some previous studies have also explored the drawbacks of using the Bellman error in policy evaluation, this work provides a thorough theoretical and experimental analysis of the issue. Their results not only support conclusions from previous work but also help explain the results of many past experiments in the domain of policy evaluation.
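A tiny example can make the gap between Bellman error and value error concrete. The sketch below is our own toy construction, not an experiment from the paper: a two-state chain under a fixed policy, where state 0 moves to state 1 with reward 0 and state 1 loops on itself with reward 1. The discount `GAMMA`, the datasets, and all function names are assumptions chosen for illustration. When the dataset is missing the self-loop transition, any value function with V(0) = GAMMA * V(1) has zero Bellman error on the data, however far it is from the true values.

```python
from typing import Dict, List, Tuple

GAMMA = 0.5  # discount factor, chosen so the arithmetic is exact in floats

# Transitions (state, reward, next_state) under a fixed policy.
FULL_DATA: List[Tuple[int, float, int]] = [(0, 0.0, 1), (1, 1.0, 1)]
PARTIAL_DATA: List[Tuple[int, float, int]] = [(0, 0.0, 1)]  # self-loop missing

# True values: V(1) = 1 + GAMMA * V(1), so V(1) = 1 / (1 - GAMMA) = 2,
# and V(0) = 0 + GAMMA * V(1) = 1.
TRUE_V: Dict[int, float] = {1: 1.0 / (1.0 - GAMMA)}
TRUE_V[0] = GAMMA * TRUE_V[1]


def mean_abs_bellman_error(v: Dict[int, float], data) -> float:
    """Average |V(s) - (r + GAMMA * V(s'))| over the transitions in data."""
    return sum(abs(v[s] - (r + GAMMA * v[s2])) for s, r, s2 in data) / len(data)


def mean_abs_value_error(v: Dict[int, float]) -> float:
    """Average |V(s) - V_true(s)| over all states."""
    return sum(abs(v[s] - TRUE_V[s]) for s in TRUE_V) / len(TRUE_V)


def consistent_but_wrong(c: float) -> Dict[int, float]:
    """A value function that satisfies the Bellman equation on PARTIAL_DATA
    for any choice of c, because V(0) = GAMMA * V(1) by construction."""
    return {0: GAMMA * c, 1: c}


wrong_v = consistent_but_wrong(100.0)
print("Bellman error on partial data:", mean_abs_bellman_error(wrong_v, PARTIAL_DATA))
print("Bellman error on full data:   ", mean_abs_bellman_error(wrong_v, FULL_DATA))
print("Value error:                  ", mean_abs_value_error(wrong_v))
```

Every choice of `c` gives another value function that looks perfect on the partial dataset, which is the toy version of the "infinitely many" solutions the authors describe. Only the full dataset exposes the mistake.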
Tanmay: The paper does a great job of walking people through the theoretical foundations of RL and provides a multi-faceted analysis of why a long-used methodology is not optimal in this domain. Also appreciate the quirkiness in the title.
Andrey: This is a really cool paper, since the Bellman error is widely used in RL optimization. I found the result highly interesting and think it may have major implications for how RL algorithms should be implemented. I am curious to see what impact these results will have.
New from the Gradient
Other Things That Caught Our Eyes
GPT-NeoX-20B: A Large Open-Source Language Model “GPT-NeoX-20B is a 20 billion parameter autoregressive language model whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights.”
Sony trains AI to leave world’s best Gran Turismo drivers in the dust “Researchers at Sony trained an AI called GT Sophy to play the PlayStation game Gran Turismo and found that it could outrace 95% of human players after two days and continued to shave tenths of a second off its lap times over the following week.”
The IRS Drops Facial Recognition Verification After Uproar “The Internal Revenue Service is dropping a controversial facial recognition system that requires people to upload video selfies when creating new IRS online accounts.”
Message Passing Neural PDE Solvers “The numerical solution of partial differential equations (PDEs) is difficult, having led to a century of research so far. Recently, there have been pushes to build neural-numerical hybrid solvers, which piggy-backs the modern trend towards fully end-to-end learned systems. In this work, we build a solver, satisfying these properties, where all the components are based on neural message passing, replacing all heuristically designed components in the computation graph with backprop-optimized neural function approximators… Our model outperforms state-of-the-art numerical solvers in the low resolution regime in terms of speed and accuracy.”
Block-NeRF: Scalable Large Scene Neural View Synthesis “We present Block-NeRF, a variant of Neural Radiance Fields that can represent large-scale environments. Specifically, we demonstrate that when scaling NeRF to render city-scale scenes spanning multiple blocks, it is vital to decompose the scene into individually trained NeRFs. This decomposition decouples rendering time from scene size, enables rendering to scale to arbitrarily large environments, and allows per-block updates of the environment.”
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation “Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks... In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones.”
CLIPasso: Semantically-Aware Object Sketching “Abstraction is at the heart of sketching due to the simple and minimal nature of line drawings. Abstraction entails identifying the essential visual properties of an object or scene, which requires semantic understanding and prior knowledge of high-level concepts. Abstract depictions are therefore challenging for artists, and even more so for machines. We present an object sketching method that can achieve different levels of abstraction, guided by geometric and semantic simplifications. … The generated sketches demonstrate multiple levels of abstraction while maintaining recognizability, underlying structure, and essential visual components of the subject drawn.”
Have something to say about this edition’s topics? Shoot us an email at email@example.com and we will consider sharing the most interesting thoughts from readers in the next newsletter! If you enjoyed this piece, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!