Update #65: Foundation Models in Robotics and AI in the Media
Researchers consider future challenges for robotics with foundation models, and traditional media outlets take steps forward in adopting AI.
Welcome to the 65th update from the Gradient! If you’re new and like what you see, subscribe and follow us on Twitter :)
We’re recruiting editors! If you’re interested in helping us edit essays for our magazine, reach out to editor@thegradient.pub.
Want to write with us? Send a pitch using this form.
News Highlight: The role of AI in traditional media expands
Summary
On Tuesday, December 12, 2023, the New York Times announced the creation of a new newsroom position: editorial director of artificial intelligence initiatives. In a staff memo, the Times said they will “develop a plan and determine the ways [they] can draw upon the powers of AI to improve [their] product, while also not denigrating the quality of trusted journalism”. Of the many things they want to accomplish, the Times will begin by establishing principles on how they will and won’t use generative AI. In addition to the New York Times expanding its AI footprint, a new deal between Axel Springer (the publisher of Politico, Business Insider, and numerous other large publications) and OpenAI was announced to advance the training of OpenAI’s large language models and enable ChatGPT to pull content from Axel Springer’s publications to answer questions. The announcements from both the New York Times and Axel Springer cap off an explosive week of AI growth at traditional media publishers.
Overview
In a recent New York Times staff memo, the publisher announced that Zach Seward, a founding editor of Quartz (qz.com), was joining the Times as the editorial director of artificial intelligence initiatives. In this role he will help the organization prototype generative AI tools and partner with journalists to incorporate AI into the Times’ publishing tools and other digital products. Within a day of the announcement from the New York Times, we saw a similar announcement from Axel Springer touting a global partnership with OpenAI “to strengthen independent journalism in the age of artificial intelligence”. Some highlights from their announcements include but are not limited to:
Using Axel Springer Media Brands content to help train OpenAI’s language models
ChatGPT responses can include attribution links to full articles for further information on topics
Creation of new (and non-specified) financial opportunities for Axel Springer intended to holistically support their journalistic enterprises
Why does it matter?
Since the launch of ChatGPT and other successful large language models, many media publishers have wrestled with the role of AI in their newsrooms. One of the most noteworthy points in the Times’ announcement is Zach Seward’s firm belief that journalism at the Times will always be reported, written, and edited by expert (human) journalists. This belief is particularly refreshing, given how other publications like Sports Illustrated, Men's Health, CNET, Bankrate, Gizmodo, the A.V. Club, Buzzfeed, and USA Today have engaged in mass churning of AI-generated content attributed to “reporters” who do not exist, as we have highlighted in our recent newsletter. An unfortunately common side effect of using AI to generate content is the phenomenon of hallucinations. One kernel of optimism from both the Axel Springer and New York Times announcements is that they seem to be taking steps to mitigate these hallucinations. It seems that the New York Times will focus on using AI to empower their human journalists, while OpenAI will look to bring in third-party citations to act as an authoritative stamp on generated content. While it remains to be seen how effective these measures will be in practice, both seem to be good first steps towards reducing hallucination frequency and improving the quality of both human and AI-generated media and content.
Editor Comments
Justin: Axel Springer, particularly in Germany, has a decades-long history of scandalous journalistic practices that continually mix right-wing “activism” and journalism. Using their publications to train OpenAI’s language models, as well as attributing ChatGPT’s responses to Axel Springer’s “authoritative sources,” risks the models further internalizing right-wing propaganda and biasing their outputs. I really hope this is not the beginning of a trend where ChatGPT further incorporates other right-wing clickbait into the fold and we see websites like Infowars or Breitbart being cited by ChatGPT as “authoritative” news sources.
Research Highlight: Foundation Models in Robotics: Applications, Challenges, and the Future
Summary
Recently, we’ve seen a number of works apply LLMs to robotics and real-world manipulation, such as in SayCan and the Robotics Transformer. In this paper, researchers with eight affiliations survey applications for pretrained foundation models in robotics. Looking at recent papers in this domain, they consider how foundation models contribute to improving robot capabilities in perception, decision-making, and control. They discuss challenges hindering foundation model adoption in robotics and potential pathways for future work.
Overview
Foundation models offer researchers general models that perform well across a variety of tasks, as opposed to architectures trained for more specific capabilities. Traditional deep learning models for robotics were likewise typically trained for particular tasks, but foundation models’ learned representations hold the potential to be used in any part of the robot autonomy stack, including perception, decision-making, and control.
The authors cover many different cases for foundation models in robotics, and we will note a few examples here:
SayCan uses an LLM for high-level task planning in language, with a learned value function to ground instructions in its environment. The language model provides task-grounding, determining useful sub-goals based on high-level instructions. The system learns an affordance function to achieve world-grounding, which enables the identification of feasible actions to execute the plan (a minimal sketch of this scoring scheme follows these examples).
The Robotics Transformer (RT-1) is trained on a dataset of over 130k real-world robotic experiences—it receives images and natural language instructions as inputs, and outputs base and arm actions. It shows promising scalability properties in that it generalizes to new tasks, is robust in challenging environments, and can execute long-horizon instructions. RT-2 takes this a step further, using a vision-language-action model that draws on both web and robotics data to generate generalized actions for robotic control.
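To make the SayCan-style selection above concrete, here is a minimal, hypothetical sketch in Python. The names `llm_log_prob` and `affordance_value` are placeholders we introduce for illustration, not the paper’s actual API; the idea is simply to combine the LLM’s task-grounding score for each candidate skill with the learned affordance (world-grounding) estimate and pick the highest-scoring skill.

```python
import math

def saycan_select(instruction, state, candidate_skills, llm_log_prob, affordance_value):
    """Hypothetical SayCan-style step: score each candidate skill by the product of
    the LLM's task-grounding probability and the learned affordance value."""
    best_skill, best_score = None, float("-inf")
    for skill in candidate_skills:
        # Task-grounding: how useful the LLM considers this skill as a next
        # step toward the high-level instruction.
        lm_prob = math.exp(llm_log_prob(instruction, skill))
        # World-grounding: the value function's estimate that the skill can
        # actually be executed successfully from the current state.
        affordance = affordance_value(state, skill)
        score = lm_prob * affordance
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```

In the full system, a step like this would be run repeatedly, appending each selected skill to the plan until a terminating action is chosen; the sketch only shows the per-step choice.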
Finally, the authors consider challenges and future directions:
Data scarcity for training robotics foundation models: While text and image data are abundant, robotics-specific data is scarce. A few solutions are noted, such as using inpainting and VLMs for data augmentation, or performing high-fidelity simulation via gaming engines to collect data.
Inference time: The inference time for many foundation models does not meet the requirements for reliable real-time deployment of robotic systems.
Limitations in multimodal representation: In current multimodal representation learning, a simple embedding is assumed to be sufficient to identify a modality—it is an open challenge whether a single multimodal model can accommodate all modalities. Robotics applications also involve modalities for which sufficient data is not available; to align them with the rest, they first need to be converted into other modalities and then used.
Uncertainty quantification: Hallucinations are a well-known issue for current foundation models, but reliability assurances are important for deploying models in safety-critical robotics applications. The authors look at instance-level and distribution-level uncertainty quantification, and observe that estimates of uncertainty need to be calibrated (see the short sketch after this list for what calibration means in practice).
Safety evaluation: How can we rigorously test for the safety of a foundation model-based robotic system? This testing needs to occur prior to deployment, during the model’s runtime, and as a robot operates in its target environments.
Using existing foundation models as plug-and-play or building new foundation models for robotics: Foundation models could be integrated into various robotics applications without customization—the plug-and-play approach simplifies the integration of recent AI advances into robotics. However, specific domain expertise may be needed in particular applications, which would require building a foundation model from scratch or fine-tuning existing models.
High variability in robotic settings: Robot platforms are diverse, with different physical characteristics, configurations, and capabilities. Real-world environments are also diverse. Thus, a key requirement for general-purpose robotic foundation models is that they are task-agnostic, cross-embodiment, open-ended, and capture diverse robotic data.
Benchmarking and reproducibility in robotics settings: Finally, robotics research requires real-world hardware experiments—this creates issues for reproducibility, because replicating results from hardware experiments might require access to the exact equipment used in those experiments. Recent works have focused on simulators, but this leads to a large sim-to-real gap: real-world performance might be much more variable than performance in simulation, as low-level planning and control modules have to handle real-world physics.
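As a rough illustration of what “calibrated” uncertainty means in this context (a generic sketch, not code from the surveyed paper): a model is well calibrated if, among predictions it makes with confidence around p, roughly a fraction p turn out to be correct. The expected calibration error below measures the average gap between confidence and empirical accuracy.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic calibration check: bin predictions by confidence and compare each
    bin's average confidence to its empirical accuracy, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

For a safety-critical robot, a large gap here would mean the system’s reported confidence cannot be trusted as a signal for when to defer, slow down, or ask for help.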
Why does it matter?
Robotics continues to present AI with some of its hardest challenges. Foundation models have allowed for exciting strides in recent work, but they will not solve everything. This paper identifies a number of important directions for future research that roboticists and others will want to pay attention to.
New from the Gradient
Peter Tse: The Neuroscience of Consciousness and Free Will
Vera Liao: AI Explainability and Transparency
Other Things That Caught Our Eyes
News
Meta unveils Audiobox, an AI that clones voices and generates ambient sounds “While startups including ElevenLabs have received tens of millions in funding for dedicating themselves to this pursuit, Meta Platforms, the parent company of Facebook, Instagram, WhatsApp and Oculus VR has released its own free voice cloning program, Audiobox — with a catch.”
WALT is a new AI video tool that creates photorealistic clips from a single image — you have to see it to believe it “A new artificial intelligence model called WALT can take a simple image or text input and convert it into a photorealistic video. Preview clips include dragons breathing fire, asteroids hitting the Earth and horses walking on a beach.”
SEC Probes Investment Advisers’ Use of AI “The Securities and Exchange Commission is asking investment advisers how they use and oversee artificial intelligence, as agency head Gary Gensler continues to express skepticism about the technology.”
Google’s most capable AI, Gemini, is now available for enterprise development “Google today announced that its most powerful and capable generative AI model, Gemini, is now available to enterprises for their app development needs.”
Biden administration holds first White House AI council meeting “Members of the Biden administration met to discuss how to implement President Biden’s artificial intelligence (AI) executive order Tuesday for the inaugural meeting of the White House AI Council, according to an official.”
Microsoft Targets Nuclear to Power AI Operations “Microsoft is betting nuclear power can help sate its massive electricity needs as it ventures further into artificial intelligence and supercomputing. The technology industry’s thirst for power is enormous. A single new data center can use as much electricity as hundreds of thousands of homes.”
Big Tech's LLM evals are just marketing “Just last week we had Google announce the Gemini model suite (with the Pro version actually being available via API as of today), so this week we got to see the response of the Microsoft-OpenAI coalition.”
Meta used copyrighted books for AI training despite its own lawyers' warnings, authors allege “Meta Platforms' lawyers had warned it about the legal perils of using thousands of pirated books to train its AI models, but the company did it anyway, according to a new filing in a copyright infringement lawsuit initially brought this summer.”
Federal watchdog finds more than 1,000 ways government could use AI “A Government Accountability Office (GAO) report released Tuesday found federal agencies have more than 1,200 potential uses for artificial intelligence (AI), with more than 200 already being employed.”
Cruise slashes 24% of self-driving car workforce in sweeping layoffs “Cruise, the embattled GM self-driving car subsidiary, is laying off 900 employees, or about 24% of its workforce, as part of a plan to slash costs and attempt to revamp the company following an October 2 incident that left a pedestrian stuck under and then dragged by one of its robotaxis.”
Cheating Fears Over Chatbots Were Overblown, New Research Suggests “Last December, as high school and college students began trying out a new A.I. chatbot called ChatGPT to manufacture writing assignments, fears of mass cheating spread across the United States.”
Deepfakes for $24 a month: how AI is disrupting Bangladesh’s election “Pro-government news outlets and influencers in Bangladesh have in recent months promoted AI-generated disinformation created with cheap tools offered by artificial intelligence start-ups.”
ByteDance is secretly using OpenAI’s tech to build a competitor “TikTok’s entrancing ‘For You’ feed made its parent company, ByteDance, an AI leader on the world stage. But that same company is now so behind in the generative AI race that it has been secretly using OpenAI’s technology to develop its own competing large language model, or LLM.”
Pro-China YouTube Network Used A.I. to Malign U.S., Report Finds “The 10-minute post was one of more than 4,500 videos in an unusually large network of YouTube channels spreading pro-China and anti-U.S. narratives, according to a report this week from the Australian Strategic Policy Institute, a security-focused think tank.”
Papers
Daniel: OpenAI has a new paper and blog post presenting a new research direction for “superalignment”—their study finds that when they naively finetune strong pretrained models on labels generated by a weak model, those strong models consistently outperform their weak supervisors, a phenomenon the researchers term weak-to-strong generalization. Another paper, published on ChatGPT’s one-year anniversary, surveys tasks where open-source LLMs are claimed to be competitive with ChatGPT. This interesting paper from Meta introduces “System 2 Attention” (S2A), which leverages LLMs’ ability to reason in natural language (I’d take issue with the choice of the word “reason,” but ok) and follow instructions to decide what to attend to.
Closing Thoughts
Have something to say about this edition’s topics? Shoot us an email at editor@thegradient.pub and we will consider sharing the most interesting thoughts from readers in the next newsletter! For feedback, you can also reach Daniel directly at dbashir@hmc.edu or on Twitter. If you enjoyed this newsletter, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!