Is Attention all you need? Mamba, a novel AI model based on State Space Models (SSMs), emerges as a formidable alternative to the widely used Transformer models, addressing their inefficiencies
Thanks for this post. I really like the physics-based approach to learning, and these state-space models remind me of control theory, dynamical systems, and other engineering and science-based methods.
When you said "models learn from their training data... This is a kind of lossy compression of the input data into the weights. We can think of the effect of pre-training data on the transformer as being like the effect of your ancestors' experiences on your genetics - you can't remember their experiences, you just have vague instincts about them", have people tried doing something like RLHF before or during pre-training to make the data more compositional or hierarchical, or to optimize the training data a priori?
A second, more general (and taxonomic) question. Do you like the term "mechanistic interpretability"? To me, "ML mechanisms" or "AI mechanisms" describes a better and more general line of research. "Mechanistic interpretability" sounds a bit pedantic and seems to inadvertently limit its own scope due to semantics (in the sense of "we will study the interpretability of mechanisms" as opposed to "we will study mechanisms"). I come from chemistry, where "organic mechanisms" is a narrow approach that aims to understand the mechanisms of organic chemistry reactions (and when we say "organic chemistry mechanistic interpretability" we are even narrower in scope), but if we are trying to understand things as a whole, "chemical mechanisms" or just "mechanisms" seems a more appropriate name. Couldn't the same be said for ML (or AI in general)? Maybe I am the one being pedantic, and I apologize if that is the case.
An obvious question is how we can get the best of both SSMs and Transformers, to achieve both high effectiveness and high efficiency. Could you briefly explain whether "Jamba" is a true hybrid that gets the best of both worlds (and how it differs from traditional GPT and Mamba models), or just some sort of interleaved system? (I think there is a slight difference between a true hybrid and an interleaving of two methods, if that makes sense; I guess the former implies a dual selectivity-and-self-attention type of approach, while the latter implies concatenating the two methods in series or sequence, or something like that.)
Hey Daniel! 👋 Thanks for reading and thanks for your questions! 😄
Yeah totally agree that there's a lot of research in control theory and dynamical systems which Mamba-style models take advantage of and can possibly lean further into 🧪
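For anyone coming from that background, here's a minimal sketch (my own illustration, not code from the post) of the discrete linear state-space recurrence these models build on - the A, B, C names follow the usual control-theory convention, and the toy dimensions and random values are arbitrary:

```python
import numpy as np

# Toy discrete linear state-space model: h_t = A h_{t-1} + B x_t, y_t = C h_t
# (illustrative only - real SSM layers like Mamba use learned, structured A/B/C
#  and input-dependent "selective" parameters, not random matrices)
state_dim, input_dim, seq_len = 4, 2, 10
rng = np.random.default_rng(0)
A = rng.normal(scale=0.3, size=(state_dim, state_dim))  # state transition
B = rng.normal(size=(state_dim, input_dim))             # input projection
C = rng.normal(size=(1, state_dim))                     # readout

h = np.zeros(state_dim)
xs = rng.normal(size=(seq_len, input_dim))
ys = []
for x in xs:
    h = A @ h + B @ x   # hidden state carries a compressed summary of the past
    ys.append(C @ h)    # output depends only on the current (fixed-size) state

print(np.array(ys).shape)  # (10, 1) - one output per timestep
```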
To your questions:
1. This is a great question!
* So right now, a lot of people are mixing instruction tuning data (e.g. conversations and other more assistant-like data) into the pretraining mix. Lots of groups also mix in example problems, programming puzzles and tool-use data during pretraining. This is pretty similar to what you might hope to get from RLHF (if the instruction tuning data has been chosen and verified by humans, for example).
* In terms of doing actual RLHF, it tends not to work very well until you have a pretty solid model, so people typically do RLHF at the end of training. It's unclear whether that's just our current approach or whether there's a fundamental reason that earlier RLHF couldn't work (Yann LeCun's cake analogy gives a principled reason we might expect RL to work better at the end, but it's more of an intuition than a proof).
* I'm not quite sure what you mean by "making the data hierarchical", but how you arrange and order data during training is a problem known as "Curriculum Learning", on which there is some ML literature (toy sketch below). Curriculum Learning is starting to become a more important subfield (since data is often a good moat in foundation model training) 📚
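* To make the Curriculum Learning idea concrete, here's a toy sketch (not anyone's production pipeline - the length-based `difficulty` proxy is a made-up placeholder):

```python
# Toy curriculum: order pretraining examples from "easy" to "hard".
# The difficulty proxy (word count) is a stand-in - real pipelines use richer
# signals such as perplexity under a smaller model or human curation.
def difficulty(example: str) -> int:
    return len(example.split())  # longer documents treated as harder

corpus = [
    "the cat sat",
    "state space models compress history into a fixed-size hidden state",
    "attention compares every token with every other token",
]

curriculum = sorted(corpus, key=difficulty)
for step, example in enumerate(curriculum):
    print(step, example)  # a trainer would consume batches in this order
```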
2. I'm not as familiar with chemical mechanisms as you are so I can't comment as much there honestly.
* My understanding is that the term was coined to advocate for a new approach to interpretability - a lot of prior interpretability methods (like saliency maps and other more primitive techniques) weren't very principled or effective.
* Mechanistic Interpretability was coined by Chris Olah[1]. I think the name is fairly reasonable but likely not perfect - naming is pretty path-dependent, and that's the name we have, so I guess that's what we're going with!
* Regardless of the name, I'm a big fan of the field and I should hopefully be sharing some research in this area soon.
3. Yep, agree - how we can combine the two architectures is definitely a key question!
* I touch on that a little bit in the piece and I've been glad to see some papers in this direction recently. As you mention, Jamba from AI21 is a good contender here. The Jamba architecture diagram[2] shows what's going on within the Jamba block - as you say, they're interleaving Mamba blocks, MLPs, MoE layers and Attention layers (rough sketch below).
* I'm not fully sure I understand the distinction you have in mind between hybrids and interleaving architectures, but the authors refer to it as a "hybrid", and empirically the evals have been pretty impressive 🙌
* I'd love to see more research in this area focusing on playing to each architecture's strengths (long context and high effectiveness respectively) - I'm sure that's coming soon!
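* And to make the "interleaving" concrete, here's a rough structural sketch of what a hybrid stack can look like. This is not the actual Jamba code - `MambaBlock`, `AttentionBlock` and `MoEMLP` below are simplified stand-ins, and the layer counts/ratios are arbitrary - it just shows occasional attention layers slotted in among SSM layers, each followed by an MLP/MoE:

```python
import torch
import torch.nn as nn

# Placeholder modules - they only mark where each piece sits in the stack.
class MambaBlock(nn.Module):      # stand-in for a selective SSM layer
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, x):
        return x + self.proj(x)

class AttentionBlock(nn.Module):  # stand-in for a self-attention layer
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, x):
        return x + self.attn(x, x, x)[0]

class MoEMLP(nn.Module):          # stand-in for a mixture-of-experts MLP
    def __init__(self, d):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.mlp(x)

def hybrid_stack(d_model=64, n_layers=8, attn_every=4):
    layers = []
    for i in range(n_layers):
        # one attention layer every `attn_every` layers, Mamba everywhere else
        mixer = AttentionBlock(d_model) if (i + 1) % attn_every == 0 else MambaBlock(d_model)
        layers += [mixer, MoEMLP(d_model)]  # each token-mixing layer is followed by an MLP/MoE
    return nn.Sequential(*layers)

model = hybrid_stack()
out = model(torch.randn(2, 16, 64))  # (batch, seq_len, d_model)
print(out.shape)
```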
Thanks again for your comment! 🙏
[1] - https://transformer-circuits.pub/2023/interpretability-dreams/index.html
[2] - https://twitter.com/swyx/status/1773500332628492375
Thank you very much for your comprehensive response and for pointing me toward the right directions and research topics! Would it be alright if I sent a short private message to introduce myself and give an overview of my transition, including my immediate goals as I venture into AI research (or other relevant roles within the AI industry)? I have a PhD in a technical field and foundational knowledge in CS and ML, but since my PhD wasn't specifically in Machine Learning or CS, I don't have extensive projects or practical experience in these areas. I have to admit that I feel a bit lost, and networking isn't exactly my strongest suit.
Sure, drop me a DM.
It still boils down to the lowly perceptron and its y = ax + b.
Yep, it does. I guess to really move forward we have to find terms with more expressivity - like in the case of Liquid Time-Constant Networks. I really hope someone writes a blog post like this on LTCNs; they are difficult to understand.