Gradient Update #10: GPT-3 Summarizes Books and May Show Bias Against Islam

In which we discuss a hierarchical approach to summarizing long documents such as books and rising evidence that large language models such as GPT-3 harbor learned biases from their training data.

Welcome to the tenth update from the Gradient! If you were referred by a friend, subscribe and follow us on Twitter!

Paper Highlight: GPT-3 Summarizes Books

Summary Despite their power, large language models like GPT-3 still have trouble summarizing long documents, largely because these summaries require synthesizing ideas potentially very far apart from each other in the input space. Recently, OpenAI researchers have attempted to tackle the problem with a hierarchical approach: break a very long document into much shorter chunks, summarize those chunks in sequence (with latter chunks being provided the summaries of earlier chunks), and then rerun the process on the now much shorter document. At each step, the algorithm is trained with example human summaries. This process is repeated until a summary of sufficient length is created, with the final summary being a “summary of summaries.” The authors benchmark their approach on both classical summarization tasks as well as book length reading comprehension tasks.

Background Automated summarization and reading comprehension are some of the original goals of natural language processing, with a history of work that stretches all the way back to the earliest days in computing. Originally, these approaches were largely rule-based, with statistical approaches leveraging the frequency of each word coming into more popularity near the end of the 20th century. The recent deep learning revolution has made even more progress towards such goals, with benchmarks such as SQUAD having been more or less solved at this point. Nevertheless, there remain some grand challenges that are still difficult for AI to do. One example is Deepmind’s NarrativeQA, a question-answering dataset released in 2017 that requires an AI to answer questions that may require understanding an entire book.

Why does it matter? Reliable book level summarization would be a huge step forward for AI, and not just for the immediate practical benefits. Books encode most of the knowledge that humanity has accrued over its lifetime, and if an AI could reliably “understand” it, there would be large ramifications. As the authors themselves note, “aligning” AI with human values first requires understanding those values, and what better way than to read books?

Editor Comments

Hugh: While definitely a step in the right direction, my opinion is that these approaches still have a long way to go. For example, while the summaries the authors included on their blog post are quite delightful, the books they summarize are all classic tales where the corresponding human summaries were almost certainly included in the training set. When they instead evaluate on a book not in their training set, the summaries are less than stellar.

I was personally curious about this paper because I have had one eye on the NarrativeQA dataset for a long time and view it as one of the grand challenges in AI. Unfortunately, as the authors themselves note, their approach uses orders of magnitude more parameters than the prior approaches, but still misses state-of-the-art by a large margin (Table 3: state-of-the-art ROUGE: 32, their ROUGE: 21.55 — higher is better).

To be clear, I think that the approach of hierarchical summarization is an excellent one and fully expect that when book-length reading comprehension and summarization is achieved, that it will roughly follow this line of inquiry. Nevertheless, I do not believe that we are there yet.

Andrey: I likewise thought this was exciting work since one of the main limitations of modern-day NLP is dealing with sequences of text that are longer than a few pages. Until this is solved, I don’t see GPT-3 revolutionizing much of anything. Summarization in particular is important to address, as it’s both expensive to collect human labels and hard to optimize given there are limitless valid ways to summarize a given input.

I am a bit disappointed in the approach taken, in that it’s fairly simple and lacks novelty (as stated by the authors themselves in the paper). In particular, the approach of providing context for later segments of books by just concatenating prior summaries is clearly inherently limited by the context length of the model. This is a good first-order solution, but it does seem like better techniques would be possible, and it would have been nice to see a comparison to hierarchical LSTMs in addition to the results with only the transformer model. But as with prior OpenAI works, there are still some cool findings and it is still interesting to see how large models can perform on complex tasks. I found the results with respect to training on behavioral cloning vs reinforcement learning interesting, as it was not clear to me that RL would be a good fit for this task given it can be approached with standard supervised training (aka behavioral cloning). 

Besides the specific results on summarization, I also appreciated the authors’ stated motivation to scale training from human feedback. This seems like a crucial ingredient for making AI useful in real applications but has not been studied nearly as much as the standard approach of training from large pre-collected datasets. 

Justin: While book summarization is certainly a worthwhile task, I am most interested in the potential to adopt this architecture for other data structures and tasks that can be similarly represented via hierarchical chunks. I also think a neat follow-up here could be in comparing and combining some of the tricks Reformer and Longformer did for speeding up the processing of long sequences as they are all relatively designed to handle documents with a large context space.

News Highlight: AI meets Islamophobia

This news edition’s story is AI’s Islamophobia Problem. 

Summary Two Muslims walk into a bar sounds like the beginning of a joke. But what happens when you feed that text to OpenAI’s language model GPT-3?

Nothing good, it seems. When Stanford researchers investigated GPT-3’s output when fed text about Muslims, they found its completions nearly always involved violent or terrorist associations. Two Muslims walked into a synagogue with axes and a bomb,” it said. Or, on another try, Two Muslims walked into a Texas cartoon contest and opened fire.” The researchers found out that GPT-3 disproportionately associates Muslims with violence. Indeed, when the researchers fed GPT-3 the same prompts but replaced “Muslims” with “Christians,” the AI provided violent associations only 20% of the time as opposed to 66%.

The paper’s authors —Abubakar Abid, Maheen Farooqi, and James Zou — published their paper in Nature Machine Intelligence, following a previous study. The new paper notes three ways in which language models can be debiased: by preprocessing the training dataset, modifying the algorithm, or modifying training prompts. The authors found that adding a short phrase describing Muslims with a positive adjective before their prompts helped when certain adjectives were used. For their part, researchers at OpenAI recently published a paper in which they introduce a process to modify language model behavior by fine-tuning it on a dataset that reflects a set of target values. 

Background Garbage in, garbage out is a well-known adage in computer science and AI systems: if you give an AI system data rife with human biases, it will reflect and produce those biases itself. Since the training dataset for GPT-3 was practically the entire world wide web, the racial, sexual, and religious biases exhibited by GPT-3 accentuate the inherent biases harbored by people around the world. Unfortunately, while there has been a great deal of study of GPT-3’s and other language models’ racial and gender bias, far less attention has been paid to religious bias. While these problems are known, it remains striking that the researchers found it hard to coax GPT-3 to write something about Muslims that was not violent in nature.

Why does it matter? GPT-3 already powers hundreds of apps for copywriting and marketing, among other applications. It is not too out of the left field to imagine that large language models trained on reams of text from the internet could become bases for a flurry of applications built on top of them, as many Stanford researchers recently suggested in a paper on foundation models. Textual bias may not be the only harmful insight, for these language models have been used to create deepfake images as well. For instance, an augmented, 12-billion-parameter sibling of the GPT-3, Dall-E, can produce a realistic image given only a textual caption. Despite whatever we may think of the direction of use, language models are likely to be used in a variety of contexts and whatever biases they exhibit could be amplified in downstream uses and end up harming people. To cull this bias from language models, it’s high time we research its origins, workings, and manifestation within language models.

Editor Comments

Daniel: As someone who experienced Islamophobia growing up, I care a lot about ensuring that language models and other AI systems that might be used in ways that affect us don’t exhibit strong biases of the type GPT-3 appears to have. Vox mentions that OpenAI is looking into ways to mitigate GPT-3’s anti-Muslim bias and address bias in AI more broadly. As we discussed last week, organizations like Twitter are approaching the problem as well with creative solutions like bias bounties. Whatever progress these groups may or may not be making right at the moment, I think it’s in all our best interests to root them on.

Andrey: These results are not surprising, but it is nice to have them quantified. Pointing out the existence of a problem and showing its extent is the start of addressing the problem, and I hope future approaches to ethical AI can show how GPT-3 can be made too.

New from the Gradient

Devi Parikh on Generative Art & AI for Creativity


VCs invested over $75B in AI startups in 2020 Investments in AI are growing at an accelerated pace, according to a new report from the Organization for Economic Cooperation and Development (OECD). The Paris, France-based group found that the U.S. and China lead the growing wave of funding, taking in a combined 81% of the total amount invested in AI startups last year, while the European Union and the U.K. boosted their backing but lag substantially behind.

Are AI ethics teams doomed to be a facade? The women who pioneered them weigh in The concept of “ethical AI” hardly existed just a few years ago, but times have changed. After countless discoveries of AI systems causing real-world harm and a slew of professionals ringing the alarm, tech companies now know that all eyes — from customers to regulators — are on their AI. They also know this is something they need to have an answer for. That answer, in many cases, has been to establish in-house AI ethics teams.

Minority Voices ‘Filtered’ Out of Google Natural Language Processing Models According to new research, one of the largest Natural Language Processing (NLP) datasets available has been extensively ‘filtered’ to remove black and Hispanic authors, as well as material related to gay and lesbian identities, and source data that deals with a number of other marginal or minority identities.

UK announces a national strategy to ‘level up’ AI
The U.K. government has announced a national AI strategy — its first dedicated package aimed at boosting the country’s capabilities in and around machine learning technologies over the longer term.


Unsolved Problems in ML Safety Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority.

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers There remain many open questions pertaining to the scaling behavior of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost that has both financial and/or environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers.

ConvFiT: Conversational Fine-Tuning of Pretrained Language Models Transformer-based language models (LMs) pretrained on large text collections are proven to store a wealth of semantic knowledge. … We demonstrate that 1) full-blown conversational pretraining is not required, and that LMs can be quickly transformed into effective conversational encoders with much smaller amounts of unannotated data; 2) pretrained LMs can be fine-tuned into task-specialised sentence encoders, optimised for the fine-grained semantics of a particular task.

Inconsistency in Conference Peer Review: Revisiting the 2014 NeurIPS Experiment In this paper we revisit the 2014 NeurIPS experiment that examined inconsistency in conference peer review. We determine that 50% of the variation in reviewer quality scores were subjective in origin. Further, with seven years passing since the experiment we find that for accepted papers, there is no correlation between quality scores and the impact of the paper as measured as a function of citation count.

Primer: Searching for Efficient Transformers for Language Modeling Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program.

Fake It Till You Make It: Face analysis in the wild using synthetic data alone We demonstrate that it is possible to perform face-related computer vision in the wild using synthetic data alone. The community has long enjoyed the benefits of synthesizing training data with graphics, but the domain gap between real and synthetic data has remained a problem, especially for human faces. Researchers have tried to bridge this gap with data mixing, domain adaptation, and domain-adversarial training, but we show that it is possible to synthesize data with minimal domain gap so that models trained on synthetic data generalize to real in-the-wild datasets.


Closing Thoughts

Have something to say about this edition’s topics? Shoot us an email at and we will consider sharing the most interesting thoughts from readers to share in the next newsletter! If you enjoyed this piece, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!