Gradient Update #4: OpenAI's Copilot and DeepMind on Reward and AGI

as well as other news, papers, and tweets in the AI world!

Welcome to the fourth Update from the Gradient! If you were referred by a friend, we’d love to see you subscribe and follow us on Twitter!

News Highlight: OpenAI and Github Copilot

This edition’s news story is GitHub and OpenAI launch a new AI tool that generates its own code.

Summary The AI and CS communities alike were taken aback by the announcement of “GitHub CoPilot” at the end of June. CoPilot provides AI-powered code completion: it can finish lines of code or even write entire functions given only a documentation comment. While code completion is a standard feature of programming tools, it is typically limited to filling in function or variable names, whereas CoPilot can complete many lines or whole functions based on the preceding context.
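To make the workflow concrete, here is an illustrative sketch of the interaction: the programmer writes only a signature and docstring, and a CoPilot-style tool supplies the body. (The function and its completion below are our own example, not actual CoPilot output.)

```python
# The programmer writes only the signature and docstring...
def days_between(start: str, end: str) -> int:
    """Return the number of days between two ISO dates, e.g. '2021-06-29'."""
    # ...and a CoPilot-style tool might complete the body like this:
    from datetime import date
    return abs((date.fromisoformat(end) - date.fromisoformat(start)).days)
```

The programmer's job shifts from writing this boilerplate to reviewing it — checking, for instance, that the completion handles order of arguments the way they intended.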

CoPilot is powered by an OpenAI GPT-3 model fine-tuned on billions of lines of public code from GitHub, which OpenAI named Codex. Codex is almost exactly the same as GPT-3, with the main distinction being that it was trained (either from scratch or by fine-tuning from a pre-trained model) on code, as later described by OpenAI in the paper Evaluating Large Language Models Trained on Code. As with GPT-3, a key aspect of their approach is to train the model with huge amounts of data:

“Our training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters. After filtering, our final dataset totaled 159 GB.”
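The line-length and alphanumeric filters quoted above are simple enough to sketch. The snippet below is our own illustrative reconstruction, not OpenAI's code: the length thresholds come from the quote, but the alphanumeric cutoff is an assumption, since the paper says only "a small percentage."

```python
def keep_file(text: str, max_avg_line=100, max_line=1000, min_alnum_frac=0.25):
    """Apply Codex-style heuristic filters to one source file's text.

    Thresholds for line length are from the paper's description; the
    alphanumeric-fraction cutoff (0.25) is a guessed placeholder.
    """
    lines = text.splitlines()
    if not lines:
        return False
    avg_len = sum(len(l) for l in lines) / len(lines)
    longest = max(len(l) for l in lines)
    alnum_frac = sum(c.isalnum() for c in text) / max(len(text), 1)
    return avg_len <= max_avg_line and longest <= max_line and alnum_frac >= min_alnum_frac
```

Filters like these are cheap proxies for "was this file written by a human?": auto-generated and minified files tend to have very long lines or to be dominated by symbols.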

Their largest model (12B parameters) significantly outperforms GPT-3, an alternative large language model called GPT-J, and a commercial code-completion offering. Qualitatively, this corresponds to far more intelligent code completion than has previously been demonstrated. While impressive, it also led to controversy over whether it was valid to train on any publicly available code regardless of its license (given that the model may memorize and replicate portions of code from commercially licensed programs), and over the possibility of GPT models producing sexist or biased text.

Background CoPilot builds on OpenAI’s efforts to commercialize its GPT-3 model, which Microsoft (the owner of GitHub) licensed for exclusive use in 2020. When GPT-3 was released, people noticed it could complete portions of code, since its huge training corpus of internet text also included code. However, GPT-3’s code completions were not reliable and often produced syntactically or logically incorrect code. Codex largely fixes that by being trained specifically on code completion.

Why does it matter? CoPilot produces code completions good enough to be of real benefit to programmers, and so it appears likely to alter their typical workflow in a significant way. Beyond the practical considerations, it garnered strong reactions from the CS community for psychological reasons as well. Like paintings generated by GANs, diagnoses made by CNNs trained on X-rays, or the many things GPT-3 can write, CoPilot represents another area in which AI has been shown to be approaching the skill level of humans, this time for the limited task of completing short sections of code based on nearby context. As with those prior examples, CoPilot has many limitations and is nowhere near capable of doing the full job of a programmer, but people's reactions to it were nevertheless strong because it partially automates a complex skill set. Like professionals in many other fields, programmers may have to get comfortable using AI-powered tools to keep up with their co-workers. Also, less need to write boilerplate code!

What Do Our Editors Think?

Hugh: I was also quite excited to see the latest work in language modelling extend to programming languages as well as human ones! Nevertheless, I think it might be a little alarmist to claim that AI is “approaching the skill level of humans.” For writing, for example, it is unclear whether scaling a large LM will ever produce something worthy of standing next to Gödel, Escher, Bach or The Grapes of Wrath. And I consider myself quite excited about the potential of scaling! I just don’t see how training on terabytes and terabytes of Internet sludge will teach an AI to write like Toni Morrison rather than like a random Redditor.

In software engineering, the criticism of CoPilot taking over human jobs is perhaps even more acute. In the 1950s, programmers wrote code on punch cards. A decade later, the first widely used assembly languages came along, largely rendering the skillset of the punch-card programmers obsolete. By the 1980s, even assembly language had been largely replaced by higher-level languages like C. Though each of these changes caused specific skills to wax and wane, demand for programmers as a general class only increased, in large part because a huge part of software engineering is figuring out how to automate what you did manually last week. I suspect that CoPilot and its descendants will have similar effects, at least for a while. It’s a tool that I believe will become essential to programmer productivity and will automate a lot of the boilerplate that no one wants to write (not even software engineers). I don’t think programmers as a general class are in any particular danger right now, though.

Andrey: This result is not particularly surprising, given we've already seen that GPT-3 and GPT-J can do something similar. The paper itself is fairly underwhelming, with the main interesting aspects being how evaluation is done and the amount of data the model is trained on; Codex itself is really just GPT-3 trained on this code data. The paper does not even attempt to evaluate prior approaches to code completion trained on the same data, which I found disappointing. What is more interesting is that this was done in collaboration with GitHub as an attempt to monetize GPT-3 beyond OpenAI's API for it. Based on qualitative examples, it's clear CoPilot can provide fairly intelligent code completion far beyond what has been shown in the past. So I am excited to try it out myself and for it to reduce the annoyance of writing boilerplate code.

Paper Highlight: Reward Is Enough

This edition’s paper is Reward is Enough.

Summary Researchers from DeepMind, including David Silver (one of the first authors of AlphaGo and its follow-up papers AlphaZero and MuZero) and Richard Sutton (coauthor of the classic textbook on reinforcement learning), posit that reinforcement learning is the best way forward for creating artificial general intelligence.

What are the claims? Unlike other approaches to building intelligence, such as supervised learning (learning from labelled datasets) or unsupervised learning (learning to mimic or compactly represent unlabeled datasets), the authors believe that the best approach to creating general intelligence is to focus on reinforcement learning methods, which learn via interaction with an environment to maximize a reward signal. The authors argue that the process of learning to maximize a reward signal is sufficient to induce abilities such as language, social intelligence, or perception as side effects of maximizing the overall reward, without specific optimization for these subgoals. This is not a view widely held by the AI community, but given DeepMind’s track record of success, it would be unwise to ignore it.
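The flavor of the claim — that an agent optimizing only a scalar reward can acquire subsidiary behaviors it was never directly told about — can be illustrated with the simplest reward-maximizing learner, tabular Q-learning. This is our own minimal sketch, not anything from the paper; the corridor environment and hyperparameters are arbitrary choices.

```python
import random

def q_learn(n_states=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a 1-D corridor: reward 1 only at the rightmost state.

    The agent is told nothing about the corridor's layout; the behavior
    "walk right" emerges purely from maximizing the scalar reward signal.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # per-state values for actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy: mostly exploit the current value estimates.
            a = rng.randrange(2) if rng.random() < eps else max((0, 1), key=lambda a: q[s][a])
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Standard Q-learning update toward the bootstrapped target.
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

# The learned greedy policy: action 1 (right) everywhere along the corridor.
policy = [max((0, 1), key=lambda a: row[a]) for row in q_learn()]
```

The paper's argument is essentially that this dynamic, scaled up to rich enough environments and rewards, could induce far more sophisticated abilities than walking right; the editors' disagreement below is about whether that scaling-up is the hard part.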

Why does it matter? There are roughly two lines of thought in the AI community about this issue. On one side are researchers who believe that scaling up existing methods (e.g. reinforcement learning) is sufficient to achieve general intelligence. Others are more skeptical and believe that we will need a breakthrough (or several) on par with the deep learning revolution before we see human-level intelligence or beyond. This paper suggests that DeepMind leans towards the camp that believes reinforcement learning is a sufficient general paradigm for learning intelligence, and it hints at the research directions the lab is likely to pursue in the future. It also explains why DeepMind has produced far more papers on reinforcement learning than on, say, language, which is not even a listed research theme on its website.

What Do Our Editors Think?

Hugh: I’m largely bullish on DeepMind’s general hypothesis that learning to maximize reward is one of the most important, if not the most important, paths forward for improving AI. Nevertheless, I think that the current paradigm of reinforcement learning will not get us there. Many reinforcement learning techniques take for granted several key assumptions (e.g. that the learning environment does not change over the course of learning) which are critical to their success but rarely true in the real world. In order to scale to more complex environments, we will need to rethink these tools from the ground up. I still count myself on the optimistic side of this debate, though: even if I don’t believe existing methods can be scaled up much further exactly as they are, I believe they can (and soon will) be modified in ways that allow them to be.

Andrey: Like many, I found this paper's hypothesis both obvious and meaningless. Assuming it's possible to have unlimited interaction with an environment, a sufficiently rich action/observation space, and a sufficiently expressive reward function, it's unsurprising that reward would be enough to train intelligent agents. So the point being made is obvious. But it's also meaningless, because it's not at all clear how these conditions can be met, or even whether they can be. Perhaps this paper is meant for a non-AI audience, as I don't see how it contributes any useful ideas to the AI research community.

Guest thoughts

Herbert Roitblat (author of Algorithms are Not Enough): Silver, Sutton, and their DeepMind colleagues claim 1) that reinforcement learning is sufficient to produce all known forms of (narrow) intelligence and 2) that it will be sufficient to eventually produce artificial general intelligence. They do not critically assess the second claim, but defend it by offering examples in which they presume that because the learning was successful, reinforcement learning must have been responsible for it. Their reasoning is circular: the reward defines success, so it cannot explain success. It notes that the agent achieved its goal but cannot explain how.

Aside from the circularity, their approach relies on the idea that all forms of intelligence, including general intelligence, are the product of the same kind of learning that leads to current forms of narrow intelligence. But the ability to solve an equation (narrow intelligence) is very different from the ability to create a novel equation to model a problem. Narrow forms of artificial intelligence are capable of solving only a limited range of problems, and even within this range, humans still have to choose how to represent the problem, specify the goal, and select the relevant data. AlphaZero, for example, can solve a narrow range of board games, but only when the problem space is consistent with its model: the go-playing system can learn to play only those board games that can be solved by a model that takes the board position as input and produces a vector of move probabilities. Board games that do not conform to this pattern (e.g., Parcheesi) cannot be learned by this system without modification. AlphaZero did not start from nothing (tabula rasa), as the DeepMind team suggests. Its model was not learned through reward but provided by the designers; all it had to learn was the move probabilities relative to board positions. The number of alternative moves, the model, and many other aspects of the problem were given to it.

For an agent to be generally intelligent, it will have to define its own problem space, select its own goals, and find the means of navigating that space to achieve those goals without human assistance. The need for human intervention is another demonstration that reward is not enough.

Other News That Caught Our Eyes

This Agency Wants to Figure Out Exactly How Much You Trust AI - "An AI system used by doctors to diagnose disease should be more accurate than one recommending music. The National Institute of Standards and Technology (NIST) is a federal agency best known for measuring things like time or the number of photons that pass through a chicken."

A New Approach To Mitigating AI’s Negative Impact - "Stanford launches an Ethics and Society Review Board that asks researchers to take an early look at the impact of their work."

How Twitter hired tech's biggest critics to build ethical AI - "Machine learning engineer Ari Font was worried about the future of Twitter's algorithms. It was mid-2020, and the leader of the team researching ethics and accountability for the company's ML had just left Twitter. For Font, the future of the ethics research was unclear."

Voice AI is scary good now. Video game actors hate it. - "A new 'Witcher 3' mod uses tech that's ethically questionable and what one actor calls 'utterly soulless.' But can anything be done about it?"

Amazon is using algorithms with little human intervention to fire Flex workers - "Locked gates, inclement weather, and bad selfies - all reasons drivers report that they were fired by the bots that apparently run human resources for Amazon's Flex delivery program."


The Values Encoded in Machine Learning Research - "Machine learning (ML) currently exerts an outsized influence on the world, increasingly affecting communities and institutional practices. It is therefore critical that we question vague conceptions of the field as value-neutral or universally beneficial, and investigate what specific values the field is advancing. In this paper, we present a rigorous examination of the values of the field by quantitatively and qualitatively analyzing 100 highly cited ML papers published at premier ML conferences, ICML and NeurIPS. … We find that societal needs are typically very loosely connected to the choice of project, if mentioned at all, and that consideration of negative consequences is extremely rare. … [W]e find that papers most frequently justify and assess themselves based on performance, generalization, efficiency, researcher understanding, novelty, and building on previous work. ... Finally, we find increasingly close ties between these highly cited papers and tech companies and elite universities."

Perceiver: General Perception with Iterative Attention - "Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. … In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. … We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio."

Using AntiPatterns to avoid MLOps Mistakes - "We describe lessons learned from developing and deploying machine learning models at scale across the enterprise in a range of financial analytics applications. These lessons are presented in the form of antipatterns ... Some antipatterns are due to technical errors, while others are due to not having sufficient knowledge of the surrounding context in which ML results are used ... In addition to cataloging antipatterns, we describe solutions, best practices, and future directions toward MLOps maturity."

Visual Conceptual Blending with Large-scale Language and Vision Models - "We ask the question: to what extent can recent large-scale language and image generation models blend visual concepts? Given an arbitrary object, we identify a relevant object and generate a single-sentence description of the blend of the two using a language model. We then generate a visual depiction of the blend using a text-based image generation model. Quantitative and qualitative evaluations demonstrate the superiority of language models over classical methods for conceptual blending, and of recent large-scale image generation models over prior models for the visual depiction."


Closing Thoughts
If you enjoyed this piece, give us a shoutout on Twitter. Have something to say about OpenAI’s Copilot or reinforcement learning for AGI? Shoot us an email and we’ll select the most interesting thoughts from readers to share in the next newsletter! Finally, the Gradient is an entirely volunteer-run nonprofit, so we would appreciate any way you can support us!


For the next few months, the Update will be free for everyone. Afterwards, we will consider posting these updates as a perk for paid supporters of the Gradient! Of course, guest articles published on the Gradient will always be available for all readers, regardless of subscription status.