Is Attention all you need? Mamba, a novel AI model based on State Space Models (SSMs), emerges as a formidable alternative to the widely used Transformer models, addressing their inefficiencies.
It still boils down to the lowly perceptron and its y = ax + b.
Yep it does. I guess to really move forward we have to find terms with more expressivity, as in the case of Liquid Time-Constant Networks. I really hope someone writes a blog post like this on LTCNs; they are difficult to understand.
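To make the contrast concrete, here's a toy sketch of the difference as I understand it. A plain perceptron is a static y = ax + b plus a nonlinearity; the LTC idea (very roughly, and this is my simplification, not the paper's exact equations) is to make the state's decay rate depend on the input, so the effective "a" changes at every step. All names and constants below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)

# The "lowly perceptron": a static affine map plus a nonlinearity.
def perceptron(x):
    return np.tanh(W @ x + b)

# Toy sketch of the LTC intuition (simplified, not the paper's exact
# equations): the state decays at a rate that depends on the input,
# so the "a" in y = ax + b effectively changes at every step.
def ltc_step(state, u, dt=0.1, tau=1.0):
    f = np.tanh(W @ u + b)          # input-dependent drive / decay term
    A = np.ones_like(state)         # toy equilibrium target
    # one semi-implicit Euler step of dx/dt = -(1/tau + f) * x + f * A
    return (state + dt * f * A) / (1.0 + dt * (1.0 / tau + f))

print(perceptron(rng.normal(size=3)))   # static map: same input, same output
x = np.zeros(4)
for u in rng.normal(size=(5, 3)):       # tiny made-up input sequence
    x = ltc_step(x, u)                  # state evolves with input-dependent dynamics
print(x)
```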
Hey Daniel! 👋 Thanks for reading and thanks for your questions! 😄
Yeah, totally agree that there's a lot of research in control theory and dynamical systems that Mamba-style models take advantage of and could lean further into 🧪
To your questions:
1. This is a great question!
* So right now, a lot of people are mixing instruction tuning data (e.g. conversations and other more assistant-like data) into the pretraining mix. And lots of groups sometimes mix example problems, programming puzzles and tool use into pretraining too. This is pretty similar to what you might hope to get from RLHF (if the instruction tuning data has been chosen and verified by humans, for example).
* In terms of doing actual RLHF, it tends not to work so well until you have a pretty solid model, so people typically do RLHF at the end of training. It's unclear if that's just our current approach or whether there's a fundamental reason that earlier RLHF couldn't work (Yann LeCun's cake analogy gives a principled reason we might expect RL to work better at the end, but it's more of an intuition than a proof).
* I'm not quite sure what you mean by "making the data hierarchical", but how you arrange and order data in training is a problem known as "Curriculum Learning", on which there's some ML literature (there's a toy sketch of the idea after this list). Curriculum Learning is starting to become a more important subfield (since data is often a good moat in foundation model training) 📚
2. I'm not as familiar with chemical mechanisms as you are, so I honestly can't comment as much there.
* My understanding is that the term was advocating for a new approach to Interpretability. A lot of prior interpretability methods (like saliency maps and other primitive techniques) weren't very principled or effective.
* Mechanistic Interpretability was coined by Chris Olah[1]. I think the name is fairly reasonable but likely not perfect - naming is pretty path-dependent, and that's the name we have, so I guess that's what we're going with!
* Regardless of the name, I'm a big fan of the field and I should hopefully be sharing some research in this area soon.
3. Yep, agree - how we can combine the two architectures is definitely a key question!
* I touch on that a little bit in the piece and I've been glad to see some papers in this direction recently. As you mention, Jamba from AI21 is a good contender here. The Jamba architecture diagram[2] shows what's going on within the Jamba block - as you say, they're interleaving Mamba blocks, MLPs, MoE layers and Attention layers (there's a rough sketch of this after the list).
* I'm not fully sure I understand the distinction you have in mind between hybrids and interleaving architectures, but the authors refer to it as a "hybrid" and empirically the evals have been pretty impressive 🙌
* I'd love to see more research in this area focusing on using each architecture's strengths (long context and high effectiveness, respectively) - I'm sure that's coming soon!
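On the Curriculum Learning point in question 1, here's a toy sketch of what "arranging and ordering data" can look like. The data, the difficulty proxy and the phase schedule here are all made up for illustration; real pipelines use much more careful scoring and mixing:

```python
import random

# Made-up corpus: strings of varying length stand in for documents.
examples = [f"doc_{i} " * random.randint(1, 20) for i in range(1000)]

def difficulty(example: str) -> int:
    # Stand-in difficulty proxy; real setups might use loss or perplexity.
    return len(example)

curriculum = sorted(examples, key=difficulty)  # easy -> hard

# Train in phases: early phases only see the easier slice of the data.
n_phases = 4
for phase in range(n_phases):
    cutoff = len(curriculum) * (phase + 1) // n_phases
    batch_pool = curriculum[:cutoff]   # slice copies, so curriculum stays sorted
    random.shuffle(batch_pool)         # still shuffle within a phase
    # ... feed batch_pool to the usual training loop ...
```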
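And on question 3, since "interleaving" is doing a lot of work in that bullet, here's a minimal sketch of the idea. These are placeholder modules with made-up names, not AI21's actual Jamba code (their real block pattern, ratios and MoE details are in the paper); the point is just that token mixing alternates between SSM-style and attention layers, with MoE MLPs in between:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMixer(nn.Module):
    """Standard self-attention token mixer."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

class MambaMixer(nn.Module):
    """Placeholder standing in for a real Mamba/SSM layer."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return self.proj(F.silu(x))

class MoEMLP(nn.Module):
    """Toy mixture-of-experts MLP with hard top-1 routing."""
    def __init__(self, d_model, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):
        idx = self.router(x).argmax(dim=-1)   # (batch, seq) expert choice
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

class ResidualBlock(nn.Module):
    def __init__(self, d_model, mixer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = mixer

    def forward(self, x):
        return x + self.mixer(self.norm(x))

def hybrid_stack(d_model=256, n_heads=8, depth=8):
    # Interleave: mostly SSM-style mixers, attention every 4th block,
    # each followed by an MoE MLP (pattern is illustrative only).
    layers = []
    for i in range(depth):
        mixer = (AttentionMixer(d_model, n_heads) if i % 4 == 3
                 else MambaMixer(d_model))
        layers += [ResidualBlock(d_model, mixer),
                   ResidualBlock(d_model, MoEMLP(d_model))]
    return nn.Sequential(*layers)

x = torch.randn(2, 16, 256)          # (batch, seq, d_model)
print(hybrid_stack()(x).shape)       # torch.Size([2, 16, 256])
```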
Thanks again for your comment! 🙏
[1] - https://transformer-circuits.pub/2023/interpretability-dreams/index.html
[2] - https://twitter.com/swyx/status/1773500332628492375
Sure, drop me a DM.