Software²
A new generation of AIs that become increasingly general by producing their own training data
We are at the cusp of transitioning from “learning from data” to “learning what data to learn from” as the central focus of AI research. State-of-the-art deep learning models, like GPT‑[X] and Stable Diffusion, have been described as data sponges,[1] capable of modeling immense amounts of data.[2,3] These large generative models, many based on the transformer architecture, learn to produce images, video, audio, code, and data in many other domains at a quality that begins to rival that of samples authored by human experts. Growing evidence suggests that the generality of such large models is significantly limited by the quality of their training data. Yet despite this outsized impact of training data on model performance, mainstream training practices are not inherently data-seeking: they largely ignore the quality of information within the training data in favor of maximizing data quantity. This discrepancy hints at a likely shift in research trends toward a stronger focus on data collection and generation as a principal means of improving model performance.