Update #37: ICLR Reviews Get Spicy and How Much Attention Actually Attends
In which we discuss interesting developments in ICLR 2023 reviews and how much pretrained Transformers rely on input-dependent attention.
This Update is longer than usual, so you may need to open it on Substack or expand the email!
Welcome to the 37th update from the Gradient! If you were referred by a friend, subscribe and follow us on Twitter!
Want to write with us? Send a pitch using this form :)
News Highlight: ICLR Reviews Get Spicy
Summary
Activity on OpenReview for the 2023 International Conference on Learning Representations (ICLR) is ongoing, and we’re seeing some interesting drama. In particular:
One commenter questions the originality of Git Re-Basin, one of the highest-rated papers for this year’s conference.
One author complains that their paper was given a low score due to a reviewer’s issue with their terminology.
A paper entitled “Quantum reinforcement learning” receives scores of 1 from every reviewer and is suspected of being generated by a language model.
Authors of two submissions (one, two) are not very happy with their reviewers.
Background
Since its inception in 2013, ICLR has employed an open peer review process to referee submissions. Being an important and incredibly influential conference, these reviews are bound to have an impact on the ML ecosystem. As described in the summary, there have been a number of interesting occurrences in the reviews this year.
First, a commenter named Sidak Pal Singh claimed that Git Re-Basin, one of this year’s highest-rated papers, made claims that were “exaggerated, invalid, or deceptive.” The paper itself introduces three algorithms to “merge” two neural network models in weight space at no cost to the loss; this would allow one to train models on two different datasets and combine them, achieving the same performance on the combined dataset. Singh claims that Git Re-Basin’s first method, activation matching, is provably identical to the activations-based alignment in an earlier work named OTFusion. Singh also argues that the paper’s second and third methods, “weight matching” and “straight-through estimator,” are similar to existing methods. Singh says the authors do not compare against these baselines, cite some of the works with mere “lip service,” and present established results and observations as new.
In their response, the authors acknowledge the “not obvious” connection between their activation matching method and the work Singh cites, noting that they have edited the paper to make this connection clear. They dispute the claimed connections to their weight matching and straight-through estimator methods (as well as other claims about their results) and conclude by “gently call[ing] into question whether accusations of misinformation are conducive to an impartial, factual scientific discourse.”
Second, Peter Richtarik tweeted that a reviewer ignored the science of his paper to criticize the use of a potentially offensive piece of terminology. In his words:
Kosta Derpanis tweeted about a similar review, which criticizes the paper “Variance Reduction is an Antidote to Byzantines” for “Ambiguous and undefined terminology relying on previous works that use a lexicographically documented ethnoreligious prejudice as a technical term.” While the review notes a few other weaknesses in the paper, nearly the entire content of the review is focused on the use of the term “Byzantines.” The authors responded with four arguments along with a willingness to change the term–they replaced “Byzantines” with “Byz workers” and made similar substitutions elsewhere. After further exchange, the reviewer remained unsatisfied with the change and seemed to want the authors to go further:
I think that the paper should contain a proper definition of what the term in question is meant to capture, and raise the problem that the term is highly unsatisfactory for both ethical and scientific reasons. The very idea of using any ethnoreligious term to capture all the subtle subset attributes that the term is supposed to have is, in my view, a prime example of ethically questionable use of terminology.
In their final comment, the authors attempted to understand the reviewer’s point but noted that the reviewer’s comments felt inappropriate for multiple reasons; they concluded with this:
In conclusion, we will not be further discussing the issue of the terminology "Byzantine robustness" in this forum. We are, however, happy to have a technical debate about our scientific contributions.
Eventually, the Ethics Chair and PC Committee themselves weighed in:
Third, the six-page (including references) paper “Quantum reinforcement learning” is… interesting. Beyond numerous grammatical and formatting errors, the “Author Contributions” and “Acknowledgements” sections were not even filled out. The authors ultimately withdrew the paper.
Finally, some fun!
After fairly weak reviews, the authors of “Token Turing Machines” decided to withdraw their paper:
We are withdrawing this paper to give it an opportunity to be reviewed at a conference that better understands the paper (and Transformer models).
Meanwhile, the authors of “In-the-wild Pretrained Models Are Good Feature Extractors for Video Quality Assessment” are very unhappy with the quality of feedback from one of their reviewers:
Why does it matter?
We’d like to take the opportunity to ask you a few questions!
How do you feel about this activity in the review process for an academic conference?
What do you feel the merits and drawbacks of open review are?
Do you think that, on balance, the ICLR (open) review process is a good one?
What changes would you want to make to the review process or ML conferences more broadly?
Research Highlight: How Much Does Attention Actually Attend?
Summary
This work studies the extent to which pretrained Transformers rely on their input-dependent attention mechanisms. The authors propose a method to replace input-dependent attention matrices in Transformers with constant matrices. In several settings, the authors find little to no performance drop when replacing up to half of the attention mechanisms with constant matrices, and an average relative performance decrease of 8% when replacing all of the attention mechanisms.
Overview
Transformers — introduced in the seminal paper “Attention is All You Need” — have become the workhorse of deep learning in various tasks across many domains. As suggested by the title, the so-called attention mechanism is a fundamental part of the Transformer architecture, and is thought to be crucial to its success. The attention mechanism uses an attention matrix to “mix” the embeddings of each token in a sequence; this attention matrix is input-dependent, meaning it generally changes if the inputs change. If the attention is completely removed from the Transformer, then the architecture is a special case of a Multi-Layer Perceptron (MLP) acting independently on tokens and positional encodings.
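To make the distinction concrete, here is a minimal single-head sketch in NumPy (our illustration, not the authors’ code): standard self-attention recomputes its mixing matrix from every input, while the constant variant mixes tokens with a fixed matrix that ignores the input.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Standard single-head self-attention: the mixing matrix A is recomputed
    from the input X, so it changes whenever the input changes."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (seq_len, seq_len), input-dependent
    return A @ V                                 # each output token is a mix of all value vectors

def constant_mixing(X, A_const, Wv):
    """The same mixing step, but with a fixed matrix A_const that ignores the input."""
    return A_const @ (X @ Wv)
```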
Instead of removing the attention outright, one can try to replace the input-dependent attention matrices with constant, input-independent matrices. Works like FNet, MLP-Mixer, and gMLP do exactly this, defining architectures with input-independent token mixing matrices. Rather than defining and training new architectures from scratch, the new work by Hassid et al. takes pretrained Transformers and replaces their attention mechanisms with constant matrices.
To replace attention mechanisms in a pretrained Transformer, the authors take a training set of inputs, compute the attention matrix for each input, and then average these attention matrices (with a few modifications to ignore padding and handle special tokens properly) to get a constant, input-independent matrix. To study replacing only a subset of attention heads, the authors also define a measure of importance for each attention head and replace the heads that are least important by this metric.
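A rough sketch of that averaging step (our simplification, not the paper’s implementation: it assumes every input in the sample shares the same sequence length and omits the padding and special-token handling mentioned above):

```python
import numpy as np

def constant_attention_matrix(attention_fn, inputs):
    """Average one head's input-dependent attention matrices over a sample of
    training inputs to get a single constant replacement matrix.
    `attention_fn(x)` is assumed to return that head's (seq_len, seq_len) matrix."""
    mats = [attention_fn(x) for x in inputs]
    return np.mean(np.stack(mats), axis=0)  # input-independent; reused for every input
```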
This attention replacement strategy is tested with pretrained BERT, RoBERTa, and DeBERTa models across a variety of text classification and structured prediction tasks. Usually, half of the attention matrices can be replaced without harming performance — sometimes this even improves performance. When all input-dependent attention matrices are replaced with constant matrices, there is an average relative performance drop of 8%, and the largest drop is no more than 20%. However, better-performing pretrained models suffer a larger performance decrease when their attention mechanisms are replaced, which suggests that better models may make better use of the attention mechanism. Ablations show that the authors’ methods for choosing constant matrices and selecting which heads to replace generally outperform other baselines.
Why does it matter?
This work shows that pretrained Transformers are not strongly dependent on input-dependent attention, which has several implications for improving and better understanding Transformers. Making better use of input-dependent attention might lead to better Transformers, as suggested by the observation that better models degrade more when their attention is replaced with constant matrices. Moreover, replacing attention matrices with input-independent constant matrices reduces the number of parameters and can speed up inference. If this can be done without performance degradation, as this work shows to be the case in various settings, it could be used to build more efficient models. Finally, this work focuses on pretrained Transformers, but exploring the role of attention during training could likewise lead to better and more efficient training strategies.
Author Q&A:
Q: What led you to start working on these topics?
A: Roy (my supervisor) and I started out exploring efficient alternatives to the attention mechanism. Most of these focus on convolutions and faster computations; specifically, we worked with input-independent attention variants.
After works like FNet and gMLP, which presented relatively good performance for input-independent attention-based models (compared to fully attentive models), we wondered about existing models (like BERT): maybe they don't use the input-dependent capability either, even though they were trained with it?
Q: Do your empirical findings have any implications on architectures that are trained with input-independent token mixing / attention matrices? Or do they suggest any different training schemes for Transformers?
A: This is a great question. I think our work can motivate more research on architectures based on input-independent attention mechanisms. For pretraining this is an ongoing line of work (like FNet, gMLP, and FLASH), but a more promising direction (in my opinion) is knowledge distillation. Our results show that replacing half of the attention heads with constant ones results in minimal or no performance drop, which should motivate us to distill these heads for more efficient computation without any loss in performance. Moreover, no model showed a severe drop in performance when all the input-dependent attention matrices were replaced. Maybe distilling the whole Transformer model into a more efficient one can preserve the same results while being much cheaper to run.
I should also note that the PAPA method works in a probing setup, which means we don't let the model adapt to the new mechanism; fine-tuning would probably let the model adjust to the architectural change and perform even better.
Q: What future directions do you find promising related to this line of work?
A: I think promising future directions are developing new architectures based on efficient input-independent attention mechanisms, both for pretraining and for knowledge distillation.
Another interesting finding is that performant models utilize their attention mechanism better. I think it will be interesting to find new training schemes that will try to leverage this finding and train models that better utilize the input-dependent capability.
Q: Any other interesting things about this work that you would like people to know?
A: It's more a fun fact than something interesting, but the title was suggested by my fiance :)
New from the Gradient
Yoshua Bengio: The Past, Present, and Future of Deep Learning
Kanjun Qiu and Josh Albrecht: Generally Intelligent
Other Things That Caught Our Eyes
Events
AI Helps Ukraine An event from Mila - Quebec AI that aims to raise funds to support Ukraine with medical and humanitarian aid. You can register and see a series of talks on AI For Good leading up to the main event (talks that have already happened are available on YouTube; see their schedule).
News
Why Meta’s latest large language model survived only three days online “On November 15 Meta unveiled a new large language model called Galactica, designed to assist scientists. But instead of landing with the big bang Meta hoped for, Galactica has died with a whimper after three days of intense criticism.”
We’re getting a better idea of AI’s true carbon footprint “Large language models (LLMs) have a dirty secret: they require vast amounts of energy to train and run. What’s more, it’s still a bit of a mystery exactly how big these models’ carbon footprints really are.”
Ubisoft and Riot are going to use AI to stop you from being horrible online “Toxic gamers who enjoy harassing others via in-game comms are set for an equally rude awakening. As creators and publishers of some of the best co-op games out there, Ubisoft and Riot Games are no strangers to having to deal with toxic players.”
Tesla reports two new fatal crashes involving driver assistance systems “Tesla Inc (TSLA.O) told U.S. auto safety regulators it has reports of two new crash fatalities in Model 3 cars tied to advanced driver assistance systems in the month ending October 15, data released Tuesday by the government shows.”
Papers
Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding In this work, we present MinD-Vis: Sparse Masked Brain Modeling with Double-Conditioned Diffusion Model for Vision Decoding. Specifically, by boosting the information capacity of representations learned in a large-scale resting-state fMRI dataset, we show that our MinD-Vis framework reconstructed highly plausible images with semantically matching details from brain recordings with very few training pairs. We benchmarked our model and our method outperformed state-of-the-arts in both semantic mapping (100-way semantic classification) and generation quality (FID) by 66% and 41%, respectively. Exhaustive ablation studies are conducted to analyze our framework.
To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons… we conducted surveys of reviewers in two top-tier double-blind computer science conferences—ICML 2021 (5361 submissions and 4699 reviewers) and EC 2021 (498 submissions and 190 reviewers). Our two main findings are as follows. First, more than a third of the reviewers self-report searching online for a paper they are assigned to review. Second, outside the review process, we find that preprints from better-ranked affiliations see a weakly higher visibility, with a correlation of 0.06 in ICML and 0.05 in EC.
Harmonizing the object recognition strategies of deep neural networks with humans we explore if [the successes of deep neural networks] have also carried concomitant improvements in explaining the visual strategies humans rely on for object recognition. We do this by comparing two related but distinct properties of visual strategies in humans and DNNs: where they believe important visual features are in images and how they use those features to categorize objects… we find a systematic trade-off between DNN categorization accuracy and alignment with human visual strategies for object recognition. State-of-the-art DNNs are progressively becoming less aligned with humans as their accuracy improves. We rectify this growing issue with our neural harmonizer: a general-purpose training routine that both aligns DNN and human visual strategies and improves categorization accuracy. Our work represents the first demonstration that the scaling laws that are guiding the design of DNNs today have also produced worse models of human vision.
Closing Thoughts
Have something to say about this edition’s topics? Shoot us an email at editor@thegradient.pub and we will consider sharing the most interesting thoughts from readers in the next newsletter! For feedback, you can also reach Daniel directly at dbashir@hmc.edu or on Twitter. If you enjoyed this newsletter, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!