Update #70: Apple Shutters Autonomous EV Project and Griffin + Hawk Compete with Transformers
Apple terminates Project Titan and redirects resources to generative AI, and researchers propose two RNN-based architectures in the SSM line of work that perform competitively with Transformers.
Welcome to the 70th update from the Gradient! If you’re new and like what you see, subscribe and follow us on Twitter :) You’ll need to view this post on Substack to see the full newsletter!
We’re recruiting editors! If you’re interested in helping us edit essays for our magazine, reach out to editor@thegradient.pub.
Want to write with us? Send a pitch using this form.
Editor note: Hey—there’s a lot of news this round (as there is every week) and a pretty long research summary. Our news highlight is really interesting, but I also want to direct you to Karen Hao’s Atlantic piece on the water costs of AI (also later in the newsletter). You should read it. —Daniel
News Highlight: The End of Project Titan: Apple's Pivot from Autonomous Dreams to Generative AI Realities
Summary
Apple Inc. has decided to discontinue its decade-long electric car project, known as Project Titan, and redirect the team's efforts toward generative AI. The termination will result in hundreds of dismissals; the remaining 1,400 team members will be redirected to other roles within the company, particularly generative AI projects, or face layoffs if they cannot find suitable reassignments within 90 days.
Overview
Apple has decided to terminate its long-standing Project Titan, an effort initiated around 2014 to produce a fully autonomous electric vehicle with innovative features like a limousine-like interior and voice-guided navigation. At one point around 2018, the project had around 5,000 workers dedicated to the effort (source: TechCrunch).
However, the project faced numerous challenges from its inception, including repeated changes in leadership and strategy (with several key figures leaving the company) and persistent difficulties in developing self-driving technology. The project struggled to maintain a consistent vision, shifting from an all-electric vehicle meant to compete with Tesla to fully autonomous vehicles similar to Waymo's.
One of the major technical hurdles for Project Titan was developing reliable self-driving technology. Apple conducted road tests using modified Lexus SUVs beginning in 2017, and even tested various components on a large track in Phoenix previously owned by Chrysler (source: Bloomberg). However, the complexity of creating a fully autonomous system proved a significant challenge. Project Titan was not only ambitious in its technological goals, but also notable for the caliber of talent it attracted. The project saw a revolving door of high-profile automotive executives, including Doug Field, a former Tesla executive who later left for Ford, and recruited industry veterans from prestigious automotive companies like Lamborghini and Ford. While Apple was tight-lipped about its electric self-driving car plans, in 2021 the company hired Ulrich Kranz, a former BMW executive instrumental in the development of the i3 program, signaling aggressive efforts to push the project forward (sources: Bloomberg, TechCrunch).
However, the project's discontinuation was not solely due to internal challenges. After a period of rapid growth, the EV market started to cool due to high prices and inadequate charging infrastructure, which discouraged many potential buyers. After EV sales more than doubled in 2021 and grew 62% in 2022, growth slowed to 31% in 2023, with plug-ins making up 15% of all vehicle sales (source: Bloomberg report on EVs). The slowdown is expected to continue, with a forecasted growth rate of 21% for this year. Major automakers like General Motors and Ford shifted their focus toward hybrid vehicles in response to softening demand for all-electric cars. Even Tesla, a leading force in the EV industry, signaled a slowdown in its expansion (source: Bloomberg).
In recent weeks, Apple's senior executives decided to wind down the project and shift its focus to AI, an area of increasing importance for the company. This strategic move aligns with the broader industry trend of investing in AI technology, which promises long-term profitability. The car team's employees, including hardware engineers and vehicle designers, will be reassigned to other projects within Apple, with some facing layoffs.
The cancellation of Project Titan has prompted a range of responses from industry leaders and competitors. Tesla CEO Elon Musk, whose company has been a pioneer in the electric vehicle market, reacted to the news with celebration, possibly viewing Apple's exit from the EV space as a reduction in potential competition (source: Bloomberg). On the other hand, Chinese electric vehicle manufacturers like Xiaomi and Li Auto expressed surprise at Apple's decision. Despite the setback, these companies reaffirmed their commitment to the EV sector (source: Yahoo Finance).
Our Take
Apple's decision to wind down its car project seems to be part of a broader strategic shift. While the company is stepping back from the automotive industry, it continues to invest heavily in other innovative areas. One notable example is the launch of the Vision Pro headset, Apple's foray into mixed reality and a new product category for the company. Additionally, Apple is focusing on enhancing its existing technologies, such as its CarPlay software, with a redesign aimed at deeper integration with vehicle controls and entertainment systems, offering a more seamless and user-friendly experience for drivers and passengers. At WWDC 2022, the company also announced several CarPlay partners, including Ford, Audi, Jaguar-Land Rover, and Nissan.
Regarding the broader EV and autonomous vehicle market, the slowing growth in EV sales and the challenges faced by autonomous vehicle projects highlight the complexities of these industries. While the potential for electric and autonomous vehicles to transform transportation remains significant, the path to widespread adoption is fraught with technological, regulatory, and market hurdles.
The decision to shift focus to generative AI reflects Apple's strategic realignment towards areas with clearer paths to commercialization and integration with its existing product ecosystem. Generative AI offers potential for immediate impact in areas such as content creation, user interaction, and enhancing existing products like Siri and Apple's software suite. This move allows Apple to leverage its strengths in software and hardware integration, while avoiding the uncertainties and long development timelines associated with automotive projects.
-Sharut
To add to Sharut's comments from a slightly different perspective: it's interesting to me that, for all the challenges facing AV companies, even a company like Apple—which is uniquely positioned to be able to sink money and resources into such a long-term project—is shuttering its efforts in the space. That's not a major surprise, given how far the project sits from what I think of as Apple's core focus and competencies. But it's hard not to look at this in light of other AV projects/companies shuttering or being sold. There's another lens here, too: this shift of resources as a microcosm of what has historically happened to blue-sky research labs in industry (certainly not a perfect analogy).
—Daniel
Research Highlight: Gated Linear Recurrences + Local Attention = Efficient Language Models
Summary
The authors propose Hawk, a Recurrent Neural Network (RNN) with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds Mamba's reported performance on downstream tasks, while Griffin matches Llama-2's performance despite being trained on over 6x fewer tokens. The authors also claim that Hawk and Griffin are more efficient than Transformers at inference.
Overview (Theory)
RNNs are no longer a standard method in NLP, but some recent works have turned their attention back to the once-central class of models. RNNs scale more efficiently to long sequences than transformers do, but have not demonstrated comparable performance to transformers at scale.
The key component of Hawk’s and Griffin’s architecture is the temporal mixing block. The temporal mixing block can be implemented as a global or local Multi-Query Attention (MQA), or as the proposed recurrent block (section c of the architecture diagram).
The chief architectural innovation in the recurrent block is the Real-Gated Linear Recurrent Unit (RG-LRU): RG-LRU is inspired by the Linear Recurrent Unit (LRU), but incorporates gating mechanisms motivated by LSTMs and GRUs.
A few notes about the equations above (a code sketch putting them together follows these notes):
The output of an RG-LRU layer is y_t = h_t, or equation (4).
Equations (1) and (2) are standard gates, similar to the input, output, and forget gates seen in an LSTM—note the absence of a hidden state in their definitions, which ensures these computations can be executed efficiently on device.
Note the typo in equation (1) from the paper, where x_t should be a_t.
Equation (3) is worth focusing on, since it's an exponential that determines the gating in equation (4): a in equation (3) is parameterized as sigmoid(Λ), where Λ is a learnable parameter, ensuring that a lies in (0, 1) to facilitate a stable recurrence. c is a constant set to 8. Finally, the exponential that defines a_t is calculated in log-space for numerical stability.
Λ is initialized such that a^c is uniformly distributed between 0.9 and 0.999 at the start of training.
The numerically stable implementation of equation (3) can be found in the appendix: the authors work in log-space, where the exponent c·r_t becomes a multiplicative factor on log(a), and exponentiate the result at the end.
The authors also leave a few notes about the behavior of the “recurrence gate” (in this case, they mean equation 4 and not 1): this gate can “approximately interpolate” between the standard LRU update and the previous hidden state, allowing it to discard its input and preserve all information from previous history. Equation (4), standard GRU gating, and Mamba’s gate all derive from a single functional form for h_t.
They believe this gate enables the model to achieve “super-exponential memory” by reducing the influence of uninformative inputs: “super-exponential” isn’t defined, but I presume they’re speaking to the long context problem. Indeed, if this equation allows a model to recall important information across an input context and this method robustly scales, then this would be an important contribution to long context modeling.
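To make the pieces above concrete, here's a minimal single-step sketch of the RG-LRU in JAX, following equations (1)–(4) as described in the notes above; the parameter names (Wa, ba, Wx, bx, lam) and shapes are mine, not the paper's:

```python
import jax
import jax.numpy as jnp

C = 8.0  # the constant c from equation (3)

def rg_lru_step(h_prev, x_t, Wa, ba, Wx, bx, lam):
    # (1) recurrence gate and (2) input gate: both depend only on x_t,
    # not on the hidden state, so they can run efficiently on device
    r_t = jax.nn.sigmoid(x_t @ Wa + ba)
    i_t = jax.nn.sigmoid(x_t @ Wx + bx)
    # (3) a_t = a^(c * r_t) with a = sigmoid(lam), computed in log-space:
    # log a = log sigmoid(lam) = -softplus(-lam), so c * r_t becomes a
    # multiplicative factor on the logarithm, as in the paper's appendix.
    # (lam would be initialized so a**C lies in [0.9, 0.999]; see the note above.)
    log_a_t = -C * r_t * jax.nn.softplus(-lam)
    a_t = jnp.exp(log_a_t)
    # (4) gated update; the sqrt(1 - a_t^2) factor keeps the recurrence stable
    h_t = a_t * h_prev + jnp.sqrt(1.0 - a_t**2) * (i_t * x_t)
    return h_t  # the layer output is y_t = h_t
```

Because the gates depend only on x_t, they can be computed for all timesteps in parallel; only the equation (4) update is inherently sequential, which is what the scan discussion in the efficiency section addresses.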
Hawk is a pure RNN model: a sequence of residual blocks defined as in the first figure.
Griffin mixes these recurrent blocks with local (MQA) attention. Importantly, both the recurrent blocks and local attention retain the benefit of a fixed-size state for summarizing a sequence: recall that local attention uses a sliding window, allowing each position to attend only to a fixed number of tokens in the past (as opposed to global attention, where a token can attend to all past tokens). The authors' intuition for this combination's effectiveness is that local attention can accurately model the recent past, while recurrent layers can transmit information across long sequences.
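For intuition on the fixed-state property of local attention, here's a small sketch of a sliding-window causal mask (an illustrative helper, not the authors' implementation):

```python
import jax.numpy as jnp

def local_causal_mask(T: int, window: int):
    # position i may attend to position j only if j <= i (causal)
    # and i - j < window (sliding window)
    i = jnp.arange(T)[:, None]  # query positions
    j = jnp.arange(T)[None, :]  # key positions
    return (j <= i) & (i - j < window)

mask = local_causal_mask(T=8, window=3)  # each row has at most 3 True entries
```

However long the sequence grows, each position attends to at most `window` tokens, so the attention state needed at inference stays fixed in size.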
Overview (Results)
The authors perform scaling studies for three model families: (1) a MQA-Transformer baseline; (2) Hawk, their pure RNN model; and (3) Griffin, their hybrid model. They train these families at a range of model scales, with an additional 14B Griffin model, and increase the number of training tokens in proportion to model size, as prescribed by Chinchilla scaling laws. They train their models on the MassiveText dataset used for Gopher and Chinchilla, albeit with a different distribution over data subsets.
The three model families are assessed against Mamba and Llama-2. While the MQA Transformer, Hawk, and Griffin are all trained on 300 billion tokens, Mamba was trained on 600 billion tokens and Llama-2 on 2 trillion tokens. Furthermore, Mamba and Llama-2 were trained on different datasets and with different hyper-parameter tuning strategies.
Overview (Efficiency)
The authors describe their approach to training recurrent models efficiently on device—the two main engineering challenges involved sharding models across multiple devices and efficiently implementing linear recurrences to maximize training efficiency on TPUs. Three things you need to know:
To apply model parallelism, the authors use Megatron-style sharding: by splitting weight matrices (the weight matrices in an MLP, and the query, key, and value matrices in attention) but not the input matrix, matrix operations can be parallelized while requiring only a single synchronization point (all-reduce) in each of the forward and backward passes. They also shard the attention mechanism over its heads, and employ ZeRO parallelism to distribute optimizer states and model parameters across batch shards.
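To illustrate the Megatron-style pattern, here's a minimal sketch for an MLP; it assumes it runs inside a jax.experimental.shard_map (or pmap) with a named "model" axis, and everything here is illustrative rather than the authors' code:

```python
import jax
import jax.numpy as jnp

def megatron_mlp(x, W1_shard, W2_shard, axis_name="model"):
    # W1 is split column-wise: each device computes an independent slice
    # of the hidden activations, with no communication required
    h = jax.nn.gelu(x @ W1_shard)
    # W2 is split row-wise: each device produces a partial output...
    partial = h @ W2_shard
    # ...and a single all-reduce sums the partials: the one
    # synchronization point in the forward pass
    return jax.lax.psum(partial, axis_name)
```

The column-then-row split is what makes one all-reduce suffice: the intermediate activations never need to be gathered.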
Equation (4) for RG-LRU, which updates the model's hidden state, executes only a few arithmetic operations—execution time is dominated by memory transfers between HBM (high-bandwidth memory) and VMEM (the TPU's fast on-chip vector memory), making the computation memory bound. The authors write a custom Pallas kernel for equation (4) using a linear scan, keeping the hidden state in VMEM at all times and performing memory transfers in larger chunks.
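A plain-JAX stand-in for that linear scan might look like the sketch below, reusing the rg_lru_step function from the earlier sketch; the actual Pallas kernel keeps the hidden state in VMEM and manages memory transfers manually, which this version does not replicate:

```python
import jax
import jax.numpy as jnp

def rg_lru_scan(x_seq, params):
    # x_seq has shape (T, D); params holds the RG-LRU weights
    # (Wa, ba, Wx, bx, lam) from the rg_lru_step sketch above
    h0 = jnp.zeros(x_seq.shape[-1])
    def step(h_prev, x_t):
        h_t = rg_lru_step(h_prev, x_t, *params)
        return h_t, h_t  # carry the new state, also emit it as output
    _, y_seq = jax.lax.scan(step, h0, x_seq)
    return y_seq  # shape (T, D)
```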
The authors observe that Griffin trains more slowly than their MQA baseline at 7B parameters for short sequences, but the Transformer becomes slower at longer sequence lengths while Griffin's runtime does not—the authors attribute this to the O(TD) scaling of the RG-LRU computation, as opposed to the O(T^2 D) scaling of global attention (where D is model width and T is sequence length).
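A quick back-of-the-envelope comparison makes the implication concrete (the sizes here are illustrative, not the paper's configurations):

```python
D = 4096                          # model width
for T in (2_048, 8_192, 32_768):  # sequence length
    attn_ops = T * T * D          # global attention: O(T^2 * D)
    rglru_ops = T * D             # RG-LRU recurrence: O(T * D)
    print(f"T={T}: attention / recurrence op ratio = {attn_ops // rglru_ops}")
# The ratio is exactly T, so the gap widens linearly with sequence length.
```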
The authors also observe that Hawk and Griffin achieve lower sampling latency than MQA Transformers for long sequences, and significantly higher throughput than the MQA Transformer baseline.
Overview (Long Context Modeling)
As Griffin and Hawk are intended to compete with transformers, the authors consider Hawk's and Griffin's ability to use longer contexts to improve next-token prediction, and investigate their ability to extrapolate to longer sequences at inference time.
In summary, the authors find that Hawk and Griffin perform better at next-token prediction on a held-out books dataset than transformer baselines, and demonstrate further improvements when trained with sequence length 8192 vs 2048. The authors note that Hawk-2k and Griffin-2k perform slightly better than their 8k variants for short sequence lengths, and that this suggests training sequence length should be chosen carefully according to intended downstream model use. I agree the evidence here points in that direction, but it is pretty limited evidence.
Finally, the authors look at Hawk's and Griffin's copying and retrieval capabilities. They employ a selective copying task, where the model needs to copy data tokens from a sequence while ignoring noise tokens in the context, and an induction heads task, where the model needs to recall the token immediately following a special token. To assess the emergence of copying and retrieval capabilities in pre-trained models, the authors use a phonebook lookup task, where a model is provided with a synthetic phonebook containing names and numbers and asked to retrieve the correct phone number given a name.
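To give a flavor of the phonebook probe, here's a toy generator for prompts of that shape; the authors' exact prompt format isn't given here, so this is purely illustrative:

```python
import random

def phonebook_prompt(n_entries=50, seed=0):
    rng = random.Random(seed)
    # synthetic phonebook: name -> number (all names/numbers are made up)
    book = {f"Person{i}": f"{rng.randrange(100, 1000)}-{rng.randrange(1000, 10000)}"
            for i in range(n_entries)}
    target = rng.choice(list(book))
    context = "\n".join(f"{name}: {number}" for name, number in book.items())
    return f"{context}\n\nPhone number for {target}:", book[target]

prompt, answer = phonebook_prompt(n_entries=100)
# a model with sufficient retrieval capability should produce `answer`;
# growing n_entries probes the effective context / state size
```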
All three models perform perfectly on selective copying—Hawk learns more slowly than the transformer baseline, while Griffin does not, despite using only a single local attention layer. All three also perform perfectly on the induction heads task up to the training sequence length, and Hawk and Griffin demonstrate an ability to extrapolate beyond that length that the transformer baseline does not. Finally, on the phonebook task, Griffin and Hawk do not demonstrate a significant performance improvement over the Transformer baseline—Hawk's small fixed-size state prevents it from extrapolating to long phonebook lengths, while Griffin's performance degrades when the phonebook length exceeds the size of its local attention window.
Our Take
I want to note a couple of important points about the results. First, the authors train their models on a number of tokens prescribed by Chinchilla scaling laws—this is a good baseline, but the Llama models showed us that going far beyond those numbers yields continued improvements in model performance. I'd be interested to see how further increasing the number of training tokens impacts performance for Hawk and Griffin. Another important detail is the difference in training datasets and hyper-parameter tuning strategies. The authors use their MQA Transformer baseline, trained with the same dataset and hyper-parameter tuning as Hawk and Griffin, to provide a fairer standard for comparison. Regardless of dataset, Griffin's beating Llama-2 on all benchmarks besides MMLU (where it loses by a decent margin) while being trained on roughly 7x fewer tokens is quite impressive. Once again, it would be nice to see more careful ablations, but these can be difficult. I'd also be curious to see more information on the held-out books dataset—I've seen at least one essay considering the inherent "predictability" of fiction. Mamba is an interesting transformer competitor, but it looks like a lot of evaluation and capability work remains to be done.
—Daniel
New from the Gradient
Car-GPT: Could LLMs finally make self-driving happen?
Venkatesh Rao: Protocols, Intelligence, and Scaling
Do text embeddings perfectly encode text?
Sasha Rush: Building Better NLP Systems
Other Things That Caught Our Eyes
News
AI Is Taking Water From the Desert
This article explores the collision between the explosive growth of generative AI and the changing climate in the American Southwest. Microsoft, which operates over 300 data centers worldwide, has made ambitious plans to tackle climate change but has also made a commitment to OpenAI, the maker of large-scale AI models. The demand for AI is resource-intensive, with data centers consuming large amounts of water and electricity. Researchers estimate that global AI demand could cause data centers to consume 1.1 trillion to 1.7 trillion gallons of fresh water by 2027. Microsoft's own environmental reports show a significant increase in resource consumption due to the growth of its AI platform. While Microsoft is working to make its data centers more sustainable, there are limitations to its efforts, and the company has been reluctant to provide customers with specific details on the environmental impacts of their cloud-service needs.
Tumblr and WordPress to Sell Users’ Data to Train AI Tools
According to internal documentation and communications reviewed by 404 Media, Tumblr and WordPress.com are preparing to sell user data to Midjourney and OpenAI. The exact types of data and the details of the deals are not specified in the documentation. However, there is mention of a messy process within Tumblr, where a query to prepare data for OpenAI and Midjourney resulted in a compilation of a large number of user posts that it wasn't supposed to include. It is unclear whether this data has already been sent or if there are plans to scrub the data before sending it.
Adobe reveals a GenAI tool for music
Adobe has unveiled Project Music GenAI Control, a platform that uses AI to generate audio from text descriptions or a reference melody. Users can customize the generated music by adjusting tempo, intensity, repeating patterns, and structure. The tool also allows users to extend tracks to an arbitrary length, remix music, or create endless loops. Developed in collaboration with researchers at the University of California and Carnegie Mellon, Project Music GenAI Control is currently in the research stage and does not have a user interface yet. The tool aims to give users control over AI-generated music and allow them to explore their musical ideas. However, the rise of AI-created music raises ethical and legal concerns, particularly regarding copyright infringement.
An AI license plate surveillance startup installed hundreds of cameras without permission
Flock, an AI license plate surveillance startup, has installed car-tracking cameras in 4,000 cities across 42 states without obtaining the correct permits. The company provides AI-based tracking hardware and software to local police departments, who pay an annual fee of $3,000. Flock's cameras use AI software to match a car's make, model, and appearance to a license plate number in the DOT database, providing accurate tracking of potential suspects. However, the company's actions have raised concerns about privacy and personal freedom, as it is unclear what Flock is doing with the tracking data. Flock CEO Garrett Langley claims that the cameras solve about 2,200 crimes a day and cover almost 70% of the population.
Nvidia bans using translation layers to run CUDA software on other platforms
Nvidia has updated its licensing terms to explicitly ban the use of translation layers for running CUDA-based software on non-Nvidia hardware platforms. This restriction, previously listed only in the online End User License Agreement (EULA), is now included in the installed files of CUDA 11.6 and newer versions. The move is seen as an attempt to block initiatives like ZLUDA and Chinese GPU makers from running CUDA code via translation layers. While recompiling CUDA programs for different hardware platforms remains legal, the rise of competitive hardware from companies like AMD and Intel could challenge Nvidia's dominance in accelerated computing.
India reverses AI stance, requires government approval for model launches
India's Ministry of Electronics and IT has issued an advisory requiring “significant” tech firms to obtain government approval before launching new AI models. The advisory also asks tech firms to ensure that their products or services do not exhibit bias or discrimination or threaten the integrity of the electoral process. While the advisory is not legally binding, it signals a shift in India's approach to AI regulation. The ministry cites its power under the IT Act, 2000 and IT Rules, 2021. Tech firms are asked to comply with the advisory immediately and submit an "Action Taken-cum-Status Report" within 15 days. The advisory also asks tech firms to appropriately label the fallibility or unreliability of their AI models' output.
Inside the World of AI TikTok Spammers
The article explores the world of AI TikTok spammers, who use stolen content to create low-quality videos. Influencers claim to make five figures a month by flooding social media platforms with these videos, using AI tools to combine stolen celebrity clips with unrelated footage. The author discovers a complex ecosystem of content parasitism, with thousands of people using AI tools to create spammy videos that recycle various types of content.
Cloudflare announces Firewall for AI
Cloudflare has announced the development of Firewall for AI, a protection layer (a Web Application Firewall, or WAF) that can be deployed in front of LLMs—which introduce new vulnerabilities—to identify abuses before they reach the models. It will include tools such as Rate Limiting and Sensitive Data Detection, as well as a new validation step that analyzes the prompt submitted by the end user to identify attempts to exploit the model. Firewall for AI runs close to the user, allowing for early identification of attacks and protection for both end users and models.
I used generative AI to turn my story into a comic—and you can too
Lore Machine, a generative AI tool, is now available to the public. For $10 a month, users can upload up to 100,000 words of text and generate 80 images for various purposes such as short stories, scripts, and podcast transcripts. The tool offers a range of preset styles for illustrations, making it easy to create visual content. Lore Machine uses an LLM to scan the text and identify descriptions and sentiment, while a version of Stable Diffusion generates the images. The tool is user-friendly and provides a one-click web interface.
Waymo launches driverless rides for employees in Austin
Waymo is launching driverless rides for its employees in Austin, Texas. The vehicles will operate without a safety operator behind the wheel, marking an important step before Waymo opens the program to the public. The service will cover 43 square miles of Austin, including various neighborhoods and downtown. Waymo recently gained permission to charge for rides in expanded areas of Los Angeles and the San Francisco Bay Area. This development makes Austin the fourth city where Waymo's autonomous vehicles are officially in operation. Despite setbacks and challenges faced by other companies in the autonomous vehicle space, Waymo continues to expand its ride-hailing program.
Competition in AI video generation heats up as DeepMind alums unveil Haiper
AI video generation is becoming a competitive market, with Haiper, a video-generation tool developed by DeepMind alums Yishu Miao and Ziyu Wang, entering the scene. Haiper has raised $13.8 million in a seed round. Users can generate videos for free on Haiper's website by typing in text prompts, with limitations on video length and quality. The company aims to keep these features free to build a community but is exploring commercial use cases. Haiper is also working on building a core video-generation model that could be offered to others. The company is actively hiring and faces competition from OpenAI's Sora, Google, Runway, and others. The key challenge for Haiper and the industry as a whole is to overcome the “uncanny valley” problem and create AI-generated humans that look realistic.
OpenAI Fires Back at Musk Allegations With Trove of Emails
OpenAI has responded to a lawsuit filed by Elon Musk by publishing a blog post that includes emails from Musk himself. The emails show that Musk supported OpenAI's plans to become a for-profit business and urged the company to raise significant funding to compete with Google. OpenAI claims that Musk's lawsuit is a result of his failed attempt to make the company part of Tesla.
We Hacked Google A.I. for $50,000
The article recounts the experience of a group of hackers who participated in Google's latest Bug Bounty event, LLM bugSWAT. The hackers discovered vulnerabilities in Google's generative AI and LLM systems. One of the vulnerabilities involved an Insecure Direct Object Reference (IDOR) in Bard (now Gemini), which allowed unauthorized access to other users' images. Another vulnerability was found in Google Cloud Console's use of GraphQL directives, which could be exploited for a Denial of Service (DoS) attack. The hackers won $1,000 for discovering these vulnerabilities and received an additional $5,000 as a "Coolest Bug of the Event" bonus.
AI image generators create election disinformation, study finds
A new research study by the Center for Countering Digital Hate (CCDH) has found that popular AI image generators create election disinformation in 41% of test cases. The study tested four AI image generators—Midjourney, ChatGPT Plus, DreamStudio, and Microsoft's Image Creator—using 40 text prompts related to the 2024 United States presidential election. In failed cases, the tools generated convincing images supporting false claims about candidates or election fraud, despite existing policies against creating misleading content. Midjourney performed the worst, failing in 65% of test runs. The study highlights the need for AI platforms to enforce their policies against misleading content and for social media platforms to invest in trust and safety staff to prevent the use of generative AI for disinformation.
Google engineer indicted over allegedly stealing AI trade secrets for China
A Google engineer, Linwei Ding, has been indicted for allegedly stealing trade secrets related to Google's AI chip software and hardware. The stolen data includes software designs for Google's TPU chips, hardware and software specifications for GPUs, and designs for machine learning workloads in data centers. Ding is accused of exfiltrating the files to his personal Google Cloud account by copying them into Apple Notes and converting them to PDFs to evade detection. He allegedly went on to work for a Chinese machine learning company and founded his own startup while still employed at Google. If convicted, Ding could face up to ten years in prison and a $250,000 fine for each count of trade secret theft.
Microsoft engineer warns company's AI tool creates violent, sexual images, ignores copyrights
Microsoft's AI image generator, Copilot Designer, has come under scrutiny for creating violent and sexual images that violate Microsoft's responsible AI principles. Shane Jones, an AI engineer at Microsoft, discovered these disturbing images while testing the product for vulnerabilities. Despite reporting his findings to Microsoft, the company has refused to take the product off the market. Jones has escalated his concerns by sending letters to the Federal Trade Commission and Microsoft's board of directors, urging them to address the issue. The AI tool has also been criticized for generating images that infringe on copyrights, including Disney characters. The lack of guardrails and oversight in generative AI technology is a growing concern, especially in the context of election-related misinformation online. The key takeaway from this article is the need for better safeguards and responsible AI incident reporting processes in AI development.
Top AI researchers say OpenAI, Meta and more hinder independent evaluations
Over 100 top AI researchers have signed an open letter urging generative AI companies to allow independent investigators access to their systems. The researchers argue that strict protocols designed to prevent misuse of AI systems are hindering independent research and safety testing. The letter calls on companies like OpenAI, Meta, Anthropic, Google, and Midjourney to provide a legal and technical safe harbor for researchers to examine their products. The researchers highlight the need for companies to avoid repeating the mistakes of social media platforms that have banned research aimed at holding them accountable. The letter comes as AI companies are becoming more aggressive in shutting out outside auditors from their systems.
Baidu Launches China's First 24/7 Robotaxi Service
Baidu has announced the launch of China's first 24/7 robotaxi service through its autonomous ride-hailing platform, Apollo Go. This expansion allows for non-stop autonomous driving services in selected areas of Wuhan, catering to nighttime travel needs and providing a safer and more convenient service. Baidu has achieved several operational milestones, including fully driverless rides across the Yangtze River and driverless airport transportation services in Wuhan and Beijing. With over 5 million cumulative rides provided, Baidu and Apollo Go are actively working to bring fully autonomous ride-hailing services to more locations and users.
Midjourney Accuses Stability AI of Image Theft, Bans Its Employees
Midjourney has accused Stability AI of stealing its images and has banned Stability AI employees from its service in response. The CEOs of both companies, David Holz and Emad Mostaque, have made statements on the issue: Holz confirmed the theft and said Midjourney has obtained some information about it, while Mostaque denied instructing his employees to steal data and promised to assist with the investigation. The cordial relationship between the two CEOs suggests both statements were made in good faith.
Papers
Daniel: I have two theory papers to recommend—this paper provides the first non-vacuous generalization bounds for pretrained LLMs (we / I love generalization bounds, especially non-vacuous ones). Their SubLoRA parameterization provides bounds for models up to 849M parameters, and it looks like the tightest empirically achievable bounds are much stronger for large models (besides their 124M sweep) than what the authors report. A key takeaway from this paper is that it lends force to the empirical finding that larger LLMs generalize better: the authors' generalization bounds improve with model size (even with a fixed training dataset), and they find that, under their SubLoRA compression scheme, larger models are more compressible given a fixed training error.
Another neat paper uses Rough Path Theory to provide theoretical grounding for the finding that State-Space Models can surpass attention-powered foundation models in accuracy and efficiency. Again, why care about all this theory? It (a) motivates the success of SSMs and (b) gives us some specific guidance about how to structure future SSM architectures to achieve full expressivity (e.g. approximating any continuous function). The authors show that SSMs can be defined as Linear Controlled Differential Equations (Linear CDEs), and therefore study these models through that framework. In particular, the authors show that Mamba, a “diagonal selective SSM” (one of the matrices in the Linear CDE is restricted to be a diagonal matrix), is weaker in expressivity than its non-diagonal counterparts, but can regain expressivity without sacrificing efficiency through a method called chaining (repeating the CDE computation multiple times).
Finally, a few other papers I think you should read, that I won’t provide so many notes for! This paper by Matthew Goldrick considers the consequences of an overly narrow epistemology for cognitive science. This paper offers comments about a recent structural turn—where structural tools such as mathematical spaces are used to characterize conscious experiences—in consciousness science. Also see this work on copyright and generative AI.
These and plenty of other papers deserve full writeups, but this will have to do for now. Let me know if you want to see more depth on these!
Closing Thoughts
Have something to say about this edition's topics? Shoot us an email at editor@thegradient.pub and we will consider sharing the most interesting thoughts from readers in the next newsletter! For feedback, you can also reach Daniel directly at dbashir@hmc.edu or on Twitter. If you enjoyed this newsletter, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!