In-Context Learning, In Context + Author Q&As
On the phenomenon of in-context learning in large language models and what researchers have learned about it so far. Plus, Q&As with Hattie Zhou and Sewon Min.
Much recent work on large language models (LLMs) has explored the phenomenon of in-context learning (ICL). In this paradigm, an LLM learns to solve a new task at inference time (without any change to its weights) by being fed a prompt with examples of that task. For example, a prompt might give an LLM examples of translations, word corrections, or arithmetic, then ask it to translate a new sentence, correct a new word, or solve a new arithmetic problem.
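To make the setup concrete, here is a minimal sketch of what such a prompt might look like (the examples below are my own illustration, not drawn from any particular paper):

```python
# A minimal sketch of an in-context learning prompt: the model's weights are
# never updated; the task is specified entirely through examples in the prompt.
few_shot_prompt = """Translate English to French.

English: cheese
French: fromage

English: good morning
French: bonjour

English: thank you
French:"""

# The model is expected to continue the pattern (e.g. "merci"),
# having "learned" the task purely from the in-context examples.
```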
Note: I spoke to two authors of papers mentioned in this article about in-context learning and their perspectives. You can read my Q&As with Hattie Zhou and Sewon Min below!
Q&A with Hattie Zhou on “Teaching Algorithmic Reasoning via In-context Learning”
DB: What led you to start studying in-context learning?
HZ: I think that in-context learning is an exciting emergent capability because it seems close to the ideal way that we want to be interacting with these models: simply communicate to the model what you want and get the desired behaviour out of it. An LLM is a general store of information, and in-context learning allows us to access certain parts within this store without having to specialize the entire model towards each task.
DB: You introduced this paper with a tweet echoing the (true) claim “LLMs can’t even do addition” (until now). This seems really counter-intuitive, especially given papers like this that demonstrated transformers learning function classes that would presumably be more complex than addition (there is indeed a difference in data distribution and evaluation between your work and that particular paper). For someone seeing this claim and being thoroughly confused, why did it take so long for us to make this work? As a side note, do you have any broader thoughts on the differences between this and the function classes work?
HZ: In that work, the models are specifically trained to extrapolate a particular functional relationship from the in-context examples. In contrast, a pretrained LM does not have such a strong bias that aligns with an individual task. This means that we require the model to have the machinery to interpret any abstract pattern and apply it flexibly to new inputs. We see a hint of this ability in our paper. Another difference is that instead of learning a function that maps from x to y, the addition task requires learning an algorithm whose complexity scales with the number of digits. This would be extremely hard for a single function to represent.
One reason it has taken a while might be that we needed a powerful enough model to observe this capability. Another reason might be that as humans, we are used to thinking that the individual steps of addition are straightforward because we know them so well. We miss the fact that LLMs do not have the same bias and prior knowledge as us, and they can have many different yet locally valid interpretations of the example observations. This leads us to underspecify the algorithm that we want the model to learn.
DB: You told me some time ago, presumably as you were working on this paper, that you felt like your job was essentially that of a prompt engineer. Tell me about some of the difficulties involved in coming up with the algorithmic prompting technique you introduce.
HZ: One good thing about algorithmic prompting is that it’s not so much about the exact phrasing that is used, but rather about observing the parts where the model is not interpreting the prompting information correctly and finding ways to disambiguate the information. So looking for misinterpretations gives a pretty clear signal for how to improve the prompts. This is difficult sometimes because the model can behave in unintuitive ways, which at times feel like they are an adversarial agent trying to poke holes in your explanation. Thus there is trial and error involved in aligning your explanation to the model’s interpretations.
DB: How non-ambiguous is a “non-ambiguous explanation”?
HZ: This part is really determined by the model. It’s about aligning the model’s behaviour with the one you want, and this work suggests that one way of doing that, at least in these algorithmic task settings, is by being more explicit about what you want. But how much detail is enough is up to the quirks of the particular model.
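For a sense of what “being more explicit” can mean in practice, here is an illustrative sketch in the spirit of algorithmic prompting. This is my own simplified example, not a prompt from the paper: every intermediate step, including the carries, is spelled out so the model has little room for a locally valid misreading.

```python
# Illustrative algorithmic-style prompt for multi-digit addition (not verbatim
# from the paper): each step of the algorithm is written out explicitly.
algorithmic_addition_prompt = """Problem: 128 + 367.
Explanation:
The digits of 128 are [1, 2, 8] and the digits of 367 are [3, 6, 7].
Step 1: ones digits are 8 and 7. 8 + 7 = 15. Write 5, carry 1.
Step 2: tens digits are 2 and 6, plus carry 1. 2 + 6 + 1 = 9. Write 9, carry 0.
Step 3: hundreds digits are 1 and 3, plus carry 0. 1 + 3 + 0 = 4. Write 4, carry 0.
Reading the written digits from last to first gives 495.
Answer: 495.

Problem: 582 + 949.
Explanation:"""

# The model is expected to reproduce the step-by-step procedure on the new
# problem, rather than guess the answer in one shot.
```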
DB: There are areas taking a different tack on the intelligence problem–François Chollet seems to be a fan of program synthesis, which is really hard but seems to share the aspect of writing a program/algorithm. Did you see any relation to that work as you were working on this paper?
HZ: In our paper, we achieve something that is much easier than program synthesis: given an algorithm, can the model learn to apply it to OOD situations? But the holy grail of this line of work would be something like algorithm discovery. For example, can the model figure out the best way to interpret what you want by itself without you having to specify everything? Can it generate a set of candidate *rules* that can explain the relationships within your in-context examples, and then evaluate which of those rules seem most likely? That would be very exciting.
DB: In our conversations you’ve seemed pretty open to a bold version of the scaling hypothesis. How has working on this paper impacted your views?
HZ: I think that LLMs can already output anything you want if you just condition them in the right way, and this conditioning will become easier and easier as we scale up these models. Working on this paper reinforces that view for me. But ultimately we would want the model to give us the right output without us knowing what it is. When the output possibilities are no longer sufficiently constrained by our existing knowledge, we will likely need the model to have its own self-critiquing process in order to reason properly.
DB: I found it interesting that subtraction has the worst performance (worse than multiplication!) with your algorithmic prompting. I would guess that your prompt for subtraction was basically analogous to that of addition–what do you think is going on here?
HZ: Multiplication is a harder problem and our evaluation dataset is more restrictive for that task, which probably explains the better performance. Compared to addition, subtraction requires 2x the number of steps, with the added complexity of manipulating the sign of the digits. The subtle differences between the two passes through the digits, together with the sign manipulation, seem to be more error-prone. I think it is possible to get subtraction to similar performance as addition with more prompt engineering, but we didn’t focus too much on that since it was already enough to validate our insights.
DB: Do you have any lingering questions about in-context learning you didn’t get to investigate in this work?
HZ: Loads! How do we discover algorithmic prompts more automatically? How do we teach the model to selectively attend to only the parts of the context that are relevant to the current reasoning process? We have provided an instance of what OOD generalization looks like in LLMs; can we now imitate this behavior with far fewer tokens? How can we teach skills to the model in a modular way such that they can be readily combined with any other skill? If anyone has ideas and wants to chat, feel free to contact me!
Q&A with Sewon Min on “Rethinking the Role of Demonstrations: What makes In-context Learning Work?”
DB: What led you to start investigating in-context learning?
SM: I think the idea of performing a new task at test time with no gradient updates is really cool and interesting from a scientific point of view. It is also practically useful – removing the need for gradient updates makes applications of language models significantly easier, since one does not need to bother to run fine-tuning with a careful choice of hyperparameters.
DB: I found the idea of in-context “learning” odd–yes, you have the phenomenon of something new happening because you give an LLM some context like (recently) examples of addition and it spits out correct answers with the right prompt, but “learning” still seems like a weird moniker for something happening at inference time. What are your intuitions about what’s going on in an LLM at different phases (pre-training time, testing time) beyond your investigations in this paper?
SM: I think “learning” is not a well-defined term in general. However, if we define “learning” as obtaining a new intrinsic ability it has not had previously, then I think learning has to happen with gradient updates. I believe whatever is happening at test time is a consequence of “learning” that happened at pre-training time. That is related to our claim in the paper that “in-context learning is mainly for better location (activation) of the intrinsic abilities LMs have learned during training”, which has been claimed by other papers as well (Xie et al. 2021, Reynolds & McDonell 2021).
I do want to note though that it doesn’t mean models cannot exhibit a new behavior at all at test time, but I believe “a new behavior” can mostly (if not always) be explained by a composition of abilities obtained during pre-training. For instance, suppose the model can assign “0” to a positive review and “1” to a negative review, even though it has never seen such text during pre-training. This can be seen as a composition of two abilities—“identifying sentiment of a review” and “mapping” (here, 0–positive, 1–negative), which I believe the model has seen plenty of times during pre-training. I do think this is a really interesting behavior! But it is still explained as a consequence of training.
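To make that composition argument concrete, here is a toy version of the setup Sewon describes (my own illustrative example, following the 0–positive, 1–negative convention above):

```python
# A minimal sketch of ability composition at test time: the 0/1 labeling
# convention is unlikely to appear verbatim in pre-training data, but the model
# can combine "identify the sentiment" with "map sentiment to this label",
# both of which it plausibly learned during pre-training.
label_mapping_prompt = """Review: The plot was dull and the acting was worse.
Label: 1

Review: A gorgeous, moving film from start to finish.
Label: 0

Review: I would happily watch this again tomorrow.
Label:"""

# Expected continuation: "0" (positive), even though the 0/1 convention itself
# is specified only inside the prompt.
```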
DB: You co-wrote a nice blog post drawing on this work to observe “all the components of the prompt (input distribution, the output space and format) are providing ‘evidence’ to enable the model to better infer concepts that are learned during pretraining.” Do you think there are other components of a prompt that could help interrogate these concepts?
SM: I haven’t thought about other components, but I do think each of these components (input distribution, the output space and the format) can be nailed down in a more fine-grained manner. In fact, Madaan & Yazdanbakhsh (2022) did an in-depth study of chain-of-thought prompting, where they break down the output (the “rationale” for the answer) into a more fine-grained set of components and investigate what matters and what does not. I recommend checking it out!
DB: We’ve already seen a case (Hattie Zhou’s recent paper) that introduces a new prompting technique that carefully examines whether a model is taking away what is intended from a prompt. What do you hope future “prompt engineers” and others using LLMs will take away from your work?
SM: I think the main points that I hope future scientists/engineers can take away are (1) the factuality of the prompt may not matter as much as you thought, (2) having in-distribution text (either the input or the output) is important, and (3) having the right format is almost always necessary. (Maybe this is repeating what we’ve written in the paper/blog post already 😅)
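As a rough illustration of those three points, here is a small sketch of how one might build demonstrations with correct versus randomly assigned labels while keeping the inputs and format fixed. The helper and example reviews are my own, not from the paper:

```python
import random

# Hedged illustration of the takeaways above: demonstrations with randomly
# assigned labels (point 1) can still be useful, provided the inputs are
# in-distribution (point 2) and the input/label format is preserved (point 3).
demos = [
    ("A gorgeous, moving film from start to finish.", "positive"),
    ("The plot was dull and the acting was worse.", "negative"),
    ("Two hours of my life I will never get back.", "negative"),
]
label_space = ["positive", "negative"]

def build_prompt(demos, test_input, randomize_labels=False):
    """Format demonstrations as Review/Sentiment pairs, optionally with random labels."""
    lines = []
    for text, label in demos:
        shown = random.choice(label_space) if randomize_labels else label
        lines.append(f"Review: {text}\nSentiment: {shown}\n")
    lines.append(f"Review: {test_input}\nSentiment:")
    return "\n".join(lines)

# Same inputs and format, but the demonstration labels are random.
print(build_prompt(demos, "I would happily watch this again.", randomize_labels=True))
```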
DB: Do you have any lingering questions about in-context learning you didn’t get to investigate in this work?
SM: Well, these are not specifically related to in-context learning, but I’ll take this opportunity to talk about two directions that I find really interesting and that are loosely related to our paper.
The first direction is to identify which training data contributes to the test-time behavior of the LM, which is related to my earlier point that any test-time behavior of the model can be explained by the model’s training data. There has been cool work on this topic such as Bohnet et al. (2022), Han & Tsvetkov (2022) and Akyürek et al. (2022), and I am excited to see more progress in this direction.
The second direction is whether we can mimic language models’ behavior solely by retrieving a part of the training data, with no large language models. We actually have a recent paper that performs zero-shot inference solely by retrieving from a large-scale text corpus—here, we do not have a large language model; instead, we have a small encoder to retrieve a part of the data. It is not yet at a stage where it can mimic everything language models can do. However, in the future, with some multi-step retrieval for a composition of different parts of the data, I believe this may be able to explain a lot of what language models do.