Mamba Explained
Is Attention all you need? Mamba, a novel AI model based on State Space Models (SSMs), emerges as a formidable alternative to the widely used Transformer models, addressing their inefficiency at long sequence lengths.
A guest post by the creator of this substack!
Article preview:
Right now, AI is eating the world.
And by AI, I mean Transformers. Practically all the big breakthroughs in AI over the last few years are due to Transformers.
Mamba, however, belongs to an alternative class of models called State Space Models (SSMs). Importantly, for the first time, Mamba promises performance (and, crucially, scaling laws) similar to the Transformer's whilst remaining feasible at long sequence lengths (say, 1 million tokens). To achieve this long context, the Mamba authors remove the "quadratic bottleneck" in the Attention Mechanism. Mamba also runs fast - like "up to 5x faster than Transformer" fast.
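To make the scaling contrast concrete, here is a deliberately minimal sketch (plain NumPy; a linear time-invariant SSM recurrence, not Mamba's actual selective scan, and all names are illustrative): attention has to compute a score for every pair of tokens, while an SSM carries a fixed-size hidden state through a single pass over the sequence.

```python
import numpy as np

# Why attention is quadratic: every token scores against every other token.
def attention_score_count(seq_len: int) -> int:
    return seq_len * seq_len          # e.g. 1M tokens -> 10^12 pairwise scores

# Why an SSM is linear: a fixed-size hidden state is updated once per token.
#   h_t = A h_{t-1} + B x_t,   y_t = C h_t
def ssm_scan(x, A, B, C):
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                     # single pass over the sequence: O(seq_len)
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)
```

At a 1-million-token context, the attention score matrix alone would have 10^12 entries, whereas the recurrent view does 1 million fixed-cost state updates.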
Here we’ll discuss:
The advantages (and disadvantages) of Mamba (🐍) vs Transformers (🤖),
Analogies and intuitions for thinking about Mamba, and
What Mamba means for Interpretability, AI Safety and Applications
Thanks for this post. I really like the physics-based approach to learning, and these state-space models remind me of control theory, dynamical systems, and other engineering and science-based methods.
When you said "models learn from their training data... This is a kind of lossy compression of the input data into the weights. We can think of the effect of pre-training data on the transformer as being like the effect of your ancestors' experiences on your genetics - you can't remember their experiences, you just have vague instincts about them.". Have people tried doing something like RLHF before or as pre-training to make the data more compositional or hierarchical, or to optimize the training data a priori?
A second, more general (and taxonomic) question. Do you like the term "mechanistic interpretability"? For me, "ML mechanisms" or "AI mechanisms" would be a better and more general name for this line of research. "Mechanistic interpretability" sounds a bit pedantic and seems to inadvertently limit its own scope due to semantics (in the sense of "we will study the interpretability of mechanisms" as opposed to "we will study mechanisms"). I come from chemistry, where "organic mechanisms" is a narrow approach that aims to understand the mechanisms of organic chemistry reactions (and when we say "organic chemistry mechanistic interpretability" we are even narrower in scope), but if we are trying to understand things as a whole, "chemical mechanisms" or just "mechanisms" seems a more appropriate name. Couldn't the same be said for ML (or AI in general)? Maybe I am the one being pedantic, and I apologize if that is the case.
An obvious question is: how can we get the best of both SSMs and Transformers, combining high effectiveness with high efficiency? Could you briefly explain whether "Jamba" is a true hybrid that gets the best of both worlds, and how it differs from traditional GPT and Mamba models, or whether it is just some sort of interleaved system? (I think there is a slight difference between a true hybrid and an interleaving of two methods, if that makes sense; I guess the former implies a dual selectivity-and-self-attention type of approach, while the latter implies concatenating the two methods in series or sequence.)
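For reference, Jamba (as AI21 describe it) stacks Mamba layers and attention layers in a fixed ratio rather than fusing the two mechanisms inside a single layer, so it is closer to the "interleaved in series" reading. A rough, hypothetical sketch of that kind of stack follows; the layer classes below are placeholder stubs for illustration, not Jamba's real implementation, and the ratio is just an example.

```python
# Hypothetical sketch of an interleaved Mamba/attention stack.
# MambaLayer and AttentionLayer are stand-in stubs, not real library classes.

class MambaLayer:
    def __call__(self, x):
        return x   # stand-in for a selective SSM block

class AttentionLayer:
    def __call__(self, x):
        return x   # stand-in for a self-attention block

def build_interleaved_stack(n_layers: int, attn_every: int = 8):
    # e.g. one attention layer per `attn_every` layers, the rest Mamba;
    # the exact ratio is a design choice, not fixed by the method.
    return [
        AttentionLayer() if i % attn_every == attn_every - 1 else MambaLayer()
        for i in range(n_layers)
    ]

def forward(layers, x):
    for layer in layers:
        x = layer(x)   # serial composition: "interleaving", not a fused mechanism
    return x
```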
It still boils down to the lowly perceptron and its y = ax + b.