Apr 4 · edited Apr 4 · Liked by Kola 🔑

Thanks for this post. I really like the physics-based approach to learning, and these state-space models remind me of control theory, dynamical systems, and other methods from engineering and the physical sciences.

When you said "models learn from their training data... This is a kind of lossy compression of the input data into the weights. We can think of the effect of pre-training data on the transformer as being like the effect of your ancestors' experiences on your genetics - you can't remember their experiences, you just have vague instincts about them," it raised a question: have people tried applying something like RLHF before or during pre-training to make the data more compositional or hierarchical, or to optimize the training data a priori?

A second, more general (and taxonomic) question: do you like the term "mechanistic interpretability"? To me, "ML mechanisms" or "AI mechanisms" would be a better and more general name for the line of research. "Mechanistic interpretability" sounds a bit pedantic and seems to inadvertently limit its own scope semantically (in the sense of "we will study the interpretability of mechanisms" as opposed to "we will study mechanisms"). I come from chemistry, where "organic mechanisms" is a narrow approach that aims to understand the mechanisms of organic chemistry reactions (and "organic chemistry mechanistic interpretability" would be narrower still), but if we are trying to understand things as a whole, "chemical mechanisms" or just "mechanisms" seems the more appropriate name. Couldn't the same be said for ML, or AI in general? Maybe I am the one being pedantic, and I apologize if that is the case.

An obvious question is how we can get the best of both SSMs and transformers, combining high effectiveness with high efficiency. Could you briefly explain whether "Jamba" is a true hybrid that gets the best of both worlds, and how it differs from traditional GPT and Mamba models, or whether it is just some sort of interleaved system? (I think there is a slight difference between a true hybrid and an interleaving of two methods: the former implies a single layer combining selectivity and self-attention, while the latter implies concatenating the two methods in series.)
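To make the "interleaved" reading concrete, here is a minimal PyTorch sketch of stacking the two layer types in series. This is illustrative only, not Jamba's actual architecture or code: ToySSMBlock, AttentionBlock, and InterleavedHybrid are hypothetical names, and the SSM block is a deliberately simplified gated linear recurrence standing in for a real selective SSM layer.

```python
# Sketch of an "interleaved" hybrid: SSM-style blocks and attention blocks
# alternate in series. A "true hybrid" in the sense above would instead
# fuse selectivity and self-attention inside a single layer.
import torch
import torch.nn as nn

class ToySSMBlock(nn.Module):
    """Hypothetical stand-in for a selective SSM (e.g. Mamba) layer:
    a gated linear recurrence over the sequence. Illustrative only."""
    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))

    def forward(self, x):                        # x: (batch, seq, d_model)
        u = self.in_proj(x)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):                # h_t = a * h_{t-1} + u_t
            h = self.decay * h + u[:, t]
            outs.append(h)
        y = torch.stack(outs, dim=1)
        # input-dependent gating plays the role of "selectivity"
        return y * torch.sigmoid(self.gate(x))

class AttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        y, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + y)                   # residual + norm

class InterleavedHybrid(nn.Module):
    """Alternate SSM and attention blocks in series: the 'interleaved
    system' described above, not a fused dual mechanism."""
    def __init__(self, d_model: int, n_pairs: int = 2):
        super().__init__()
        layers = []
        for _ in range(n_pairs):
            layers += [ToySSMBlock(d_model), AttentionBlock(d_model)]
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

x = torch.randn(1, 16, 32)                        # (batch, seq, d_model)
print(InterleavedHybrid(32)(x).shape)             # torch.Size([1, 16, 32])
```

Under this framing, the efficiency gain comes from the SSM blocks (linear in sequence length) while the attention blocks preserve exact token-to-token lookups at a subset of depths; whether that counts as a "true hybrid" or just interleaving is exactly the question.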


It still boils down to the lowly perceptron and its y = ax + b.
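For what it's worth, here is that quip as code: a single perceptron unit really is just an affine map y = wx + b pushed through a nonlinearity. A minimal NumPy sketch, illustrative only:

```python
# A single perceptron unit: affine map (w @ x + b) followed by a
# nonlinearity, the building block alluded to above.
import numpy as np

def perceptron(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))   # sigmoid(w.x + b)

x = np.array([0.5, -1.2, 3.0])   # input features
w = np.array([0.8, 0.1, -0.4])   # weights (the "a" in y = ax + b)
b = 0.2                          # bias (the "b")
print(perceptron(x, w, b))       # a single scalar activation
```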
