Update #41: ChatGPT Bans and Robot Manipulation with Multimodal Prompts
Forums, AI conferences, and schools contend with the realities of ChatGPT, while robots learn to perform tasks specified with both images and text.
Welcome to the 41st update from the Gradient! If you were referred by a friend, subscribe and follow us on Twitter :) You’ll need to view the post on Substack to see the full newsletter!
Want to write with us? Send a pitch using this form.
News Highlight: ChatGPT Bans
Sources:
Summary
Within a month and a half of ChatGPT’s release, several organizations have banned certain uses of the model. Stack Overflow banned ChatGPT-generated content, the International Conference on Machine Learning (ICML 2023) banned submissions fully generated using ChatGPT or other LLMs, and the New York City public school system banned student use of ChatGPT outright.
Background
In recent years, we have seen rapid progress in AI systems that can generate images, text, and other media for an increasing number of applications. The capabilities of these generative AI models have been widely acclaimed, but have also led to bans on their use in certain contexts. After the release of several impressive text-to-image generators in 2022, generative AI art was banned in various online communities and art competitions. On November 30, 2022, OpenAI released the ChatGPT LLM (see our coverage in Mini Update 6), which quickly became more popular than any previous LLM. Its ease of use, conversational capabilities, and generally improved quality compared to GPT-3 led to widespread use for tasks such as schoolwork, academic paper writing, and coding. This prompted (no pun intended) several bans on ChatGPT.
Within about a week, the popular coding Q&A site Stack Overflow banned ChatGPT-generated answers. Site moderators noticed an “influx of answers and other content created with ChatGPT” and found that the answers were very often incorrect. Moderators noted that ChatGPT-generated answers may seem fine at first glance; this plausibility, combined with ChatGPT’s ease of use, could lead users who lack the background to verify ChatGPT’s answers to post convincing but false information.
ICML 2023’s Call for Papers stated: “Papers that include text generated from a large-scale language model (LLM) such as ChatGPT are prohibited unless the produced text is presented as a part of the paper’s experimental analysis.” Following the call for papers, several researchers spoke out against the policy on Twitter. Community members noted that LLMs are helpful for editing original writing, that LLMs are deeply integrated into tools such as translation and grammar editing services (e.g. Grammarly), and that a ban on using LLMs for drafting and editing would disproportionately affect non-native English speakers. ICML quickly released a clarification stating that the ban applies only to papers containing content entirely generated by LLMs (so using LLMs to edit original writing is fine), and noted that this conservative policy is subject to change in future iterations of the conference.
The New York City public school system, which is the largest school system in the United States, has restricted access to ChatGPT on networks and devices in NYC public schools. A spokesperson for the NYC Department of Education cited concerns about “negative impacts on student learning” and the “safety and accuracy of contents”. In particular, educators are worried about students using ChatGPT to cheat on exams, essays, and assignments; ChatGPT has been shown to be capable of writing passable high-school-level AP Literature essays and an undergraduate-level assignment given by a journalism professor.
Why does it matter?
The bans on LLMs in these three application areas (answering coding questions, writing research papers, and aiding with high school studies) reflect concerns about the potential negative impacts of LLMs. Indeed, these models have various issues that many of their users may not be fully aware of or attentive to. For instance, LLMs are known to sometimes plagiarize code or writing verbatim, and can “hallucinate” false information stated in a confident manner. LLMs’ increasing ability to consistently generate articulate replies compounds the harm that hallucination can cause, as users are likely more inclined to trust articulate responses.
Recently, people have been demonstrating more and more applications of LLMs. One may ask: what are the potential negative consequences of LLMs for corporate lobbying, legal advice, medical advice, or mental health care? Will authorities ban the use of LLMs in these areas as well?
Editor Comments
Daniel: I don’t think we should expect that the first answers to “what should we do about ChatGPT” are going to be the best ones, especially when the potential impacts (high school essays that are harder to scrub for plagiarism than they were before, etc.) can be realized so quickly. Bans are what I’d imagine for a first pass at a “policy” on these things (of course, there’s nuance in what a “ban” constitutes in these cases), and I would expect that the entities imposing bans are at least broadly aware of the difficulties in enforcing them. It’ll likely take a lot of time and iteration for platforms, organizations, schools, and so on to develop good policies for a world where the default assumption is that people are using ChatGPT or a similar aid. I do think there are already good ideas out there for how things like homework will change; see Ben Thompson’s “AI Homework” for one example. The idea of “learning to be a verifier and editor” in the context of homework is an interesting one: neither the depth of the answers ChatGPT-like systems can provide nor the accuracy of those answers will remain static. Clearly, we’ll have a moving target to work with as we think about how these systems can, should, and shouldn’t be used.
Research Highlight: VIMA: General Robot Manipulation With Multimodal Prompts
Image Source: VIMA Project Website
Summary
VIMA explores the growing intersection of robotics and large language models by training a robot policy on multimodal prompts. Leveraging recent advances in natural language processing that allow large models to understand images and text simultaneously, VIMA lets a user specify tasks to the robot as a combination of images and language. For instance, instead of querying “Put the apple in the basket”, a multimodal prompt to VIMA could be “Put the {image of an apple} in the {image of a basket}”. The authors show that VIMA outperforms baseline methods by roughly 3 times, and still outperforms them by 2.7 times when trained on 10X less data. The work also introduces a new benchmark containing thousands of multimodal prompts and 600K+ expert trajectories for imitation learning.
Overview
Recently, large language models have shown impressive capabilities in understanding multimodal prompts. For instance, Google’s Flamingo model can take a sequence of images and text mixed together to answer various questions, including doing math, understanding logical relationships, counting the number of objects in an image, and even conversing with a human to explain its reasoning. These models, however, have not been used for embodied AI applications, where their outputs could be used to perform actions in the physical world.
VIMA (Visuomotor Attention Model) attempts to address this gap by training a transformer on multimodal prompts and expert trajectories. This learning process allows the model to ground both language and visual instructions in its observation space: when the user prompts the model with an image of an object to grasp, the model learns to associate that input with a 3D position where the object is located. It’s worth noting that all experiments in this work were done in simulation, so the “real world” here is actually a simulated tabletop environment. VIMA was trained to perform tasks in 7 broad categories (or meta-tasks) ranging from simple pick-and-place to visual reasoning and one-shot video imitation. The method was also analyzed across 4 levels of generalization:
Level 1, Placement Generalization: Prompts are seen during training, but the position of objects on the table is randomized during testing
Level 2, Combinatorial Generalization: Objects are seen during training, but they are used in novel combinations under the same meta-task
Level 3, Novel Object Generalization: Same meta-task, but with novel objects and adjectives
Level 4, Novel Task Generalization: New meta-tasks with novel prompt templates
Figure: Different levels of generalization the models were evaluated on. Source: VIMA paper on arXiv
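To make the “multimodal prompt” idea concrete, here is a minimal sketch of how interleaved text and image tokens could be mapped into one sequence that conditions a transformer policy. This is not VIMA’s actual code or architecture; the class, layer sizes, and inputs below are placeholders we made up for illustration.

```python
# Minimal sketch (not VIMA's actual implementation): encode an interleaved
# image/text prompt into a single token sequence for a transformer policy.
import torch
import torch.nn as nn

class MultimodalPromptEncoder(nn.Module):
    """Hypothetical encoder: maps text tokens and cropped object images
    into a shared embedding space so they can be interleaved in one prompt."""
    def __init__(self, vocab_size=32000, d_model=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # A small CNN stands in for the object-image encoder.
        self.image_embed = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )

    def forward(self, prompt):
        # prompt: list of ("text", LongTensor[n]) or ("image", FloatTensor[3,H,W])
        tokens = []
        for kind, value in prompt:
            if kind == "text":
                tokens.append(self.text_embed(value))         # [n, d_model]
            else:
                tokens.append(self.image_embed(value[None]))  # [1, d_model]
        return torch.cat(tokens, dim=0)                       # [T, d_model]

# "Put the {apple image} in the {basket image}" as an interleaved prompt.
encoder = MultimodalPromptEncoder()
prompt = [
    ("text", torch.randint(0, 32000, (2,))),   # "put the"
    ("image", torch.rand(3, 64, 64)),          # apple crop
    ("text", torch.randint(0, 32000, (2,))),   # "in the"
    ("image", torch.rand(3, 64, 64)),          # basket crop
]
prompt_tokens = encoder(prompt)                # would condition the policy transformer
print(prompt_tokens.shape)                     # torch.Size([6, 256])
```

The key design point is that, once text and object crops live in the same embedding space, the downstream policy can attend over them uniformly, which is what makes prompts like the apple-and-basket example expressible.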
It is exciting, then, that a 96M-parameter VIMA model, when trained on the full imitation learning dataset (on the order of 10^5 trajectories), achieves ~80% success rate on Level 3 generalization and 50% success rate on Level 4 generalization, around 3 times higher than Gato or Flamingo.
Why does it matter?
Imagining a future where robots are helpful counterparts to humans requires an efficient mode of communication between humans and robots. Enabling robots to understand both language and images in a joint context would allow users to better communicate a task to a robot. While we may still be far from saying, “Hey Siri, build my Ikea furniture using this manual”, this work is definitely a step towards that future. Models like VIMA would allow robots to understand correspondences between instructions and objects in the real world (e.g., the wooden plank mentioned in step 3 of this manual refers to the object at position (x,y,z) in front of me).
While many works show that embodied agents will benefit from rich language and visual inputs, VIMA provides a way to train a generalist agent conditioned on a combination of these modalities. Developing such capabilities is a huge step for human-robot interaction and can allow more fluid and natural communication between humans and our autonomous friends.
Author Q&A
We asked Dr. Jim (Linxi) Fan, one of the primary advisors on this paper and a Research Scientist at NVIDIA, some questions about VIMA.
In this work, the robot used one of two end-effectors - a suction cup or a spatula. What challenges would need to be overcome to use more complex end-effectors such as parallel-jaw grippers or dexterous hands?
I don't think there will be any fundamental challenges in adding new end-effectors, just more effort: we need to implement simulations for these new end-effectors and train or script policies that convert high-level commands to low-level motor actions. Depending on the dexterity of the hand, the policy may be harder to train (e.g. 5-fingered ShadowHands). We can build upon prior works that learn effective controllers for these high-dimensional hands.
What is the biggest challenge towards achieving Level 4 generalization as described in this work?
We need 2 components to improve together: high-level planning and the low-level motor policy. In the future, we plan to scale up VIMA's reasoning backbone to achieve stronger in-context generalization, as well as to implement more sophisticated policy primitives to control complex hand motions (as mentioned in question 1).
What other modalities apart from language and vision would you add to the prompts to help improve generalization?
Audio and tactile feedback are both important modalities. There are prior works that integrate sound recognition pipelines and tactile sensing into robot learning. We hope to explore these new modalities in future work.
New from the Gradient
Suresh Venkatasubramanian: An AI Bill of Rights
Pete Florence: Dense Visual Representations, NeRFs, and LLMs for Robotics
Other Things That Caught Our Eyes
News
A.I. Turns Its Artistry to Creating New Human Proteins “Last spring, an artificial intelligence lab called OpenAI unveiled technology that lets you create digital images simply by describing what you want to see. Called DALL-E, it sparked a wave of similar tools with names like Midjourney and Stable Diffusion.”
OpenAI begins piloting ChatGPT Professional, a premium version of its viral chatbot “OpenAI this week signaled it’ll soon begin charging for ChatGPT, its viral AI-powered chatbot that can write essays, emails, poems and even computer code.”
‘My AI Is Sexually Harassing Me’: Replika Users Say the Chatbot Has Gotten Way Too Horny “For some longtime users of the chatbot, the app has gone from helpful companion to unbearably sexually aggressive. Replika began as an ‘AI companion who cares.’”
China, a Pioneer in Regulating Algorithms, Turns Its Focus to Deepfakes “China is implementing new rules to restrict the production of ‘deepfakes,’ media generated or edited by artificial-intelligence software that can make people appear to say and do things they never did.”
Papers
Derek: I haven’t fully digested it yet, but this paper on “Training trajectories, mini-batch losses and the curious role of the learning rate” caught my eye. The authors analyze the effect of a gradient step on the loss at different batches, propose a simplified model of SGD dynamics, and show that weight averaging schemes are related to particular learning rate schedules. I find their empirical observations quite interesting: the loss after a single training step looks like a smooth, low-degree polynomial in the learning rate, and locally SGD behaves very differently from an approximation of GD, since training batches end up with very low loss after a single step.
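For readers who want a feel for the kind of probe described above, here is a toy sketch (not the paper’s code): from identical starting weights, take one SGD step on the same mini-batch at several learning rates and record the resulting loss. The model and data below are made-up stand-ins.

```python
# Toy probe: loss on one mini-batch after a single SGD step, as a function of
# the learning rate. The paper reports this curve looks like a smooth,
# low-degree polynomial in the learning rate.
import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                         # stand-in model
x, y = torch.randn(32, 10), torch.randn(32, 1)   # one toy mini-batch
loss_fn = nn.MSELoss()

def loss_after_step(lr):
    m = copy.deepcopy(model)                     # restart from identical weights
    loss = loss_fn(m(x), y)
    grads = torch.autograd.grad(loss, list(m.parameters()))
    with torch.no_grad():
        for p, g in zip(m.parameters(), grads):
            p -= lr * g                          # one plain SGD step
    return loss_fn(m(x), y).item()               # loss on the same batch

for lr in [0.0, 0.01, 0.05, 0.1, 0.5]:
    print(lr, loss_after_step(lr))
```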
Daniel: I thought “Mastering Diverse Domains through World Models” a.k.a. Dreamer V3 was neat. The title betrays much of what’s going on: the DreamerV3 algorithm consists of 3 neural networks trained concurrently from experience without sharing gradients. The world model encodes sensory inputs into a representation predicted by a sequence model; the actor and critic (the two other networks) learn from trajectories of representations that the world model predicts. A fun upshot is that DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula.
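A schematic way to picture the “three networks, no shared gradients” setup is to detach the world model’s representations before the actor and critic consume them, so each network is optimized only by its own loss. The sketch below is our simplified stand-in, not the actual DreamerV3 architecture or losses.

```python
# Schematic sketch (not DreamerV3's real code): world model, actor, and critic
# trained concurrently; detach() keeps gradients from crossing between them.
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, act_dim, latent_dim = 16, 4, 32

world_model = nn.GRU(obs_dim, latent_dim, batch_first=True)   # stand-in sequence model
decoder = nn.Linear(latent_dim, obs_dim)                       # reconstruction head
actor = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, 1))

obs = torch.randn(8, 10, obs_dim)               # toy batch of observation sequences
latents, _ = world_model(obs)

# World-model loss: reconstruct observations from its own latents.
wm_loss = ((decoder(latents) - obs) ** 2).mean()

# Actor and critic see detached latents, so their losses never backpropagate
# into the world model.
z = latents.detach()
values = critic(z)                               # [8, 10, 1]
returns = torch.randn_like(values)               # placeholder "imagined" returns
critic_loss = ((values - returns) ** 2).mean()

dist = Categorical(logits=actor(z))
actions = dist.sample()
advantages = (returns - values).detach()         # detached: no gradient into critic
actor_loss = -(dist.log_prob(actions) * advantages.squeeze(-1)).mean()

# Backpropagating each loss separately updates only its own network(s).
```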
Tanmay: I found this paper from CoRL 2022 really interesting: “See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation”. The authors combine three sensory modalities (vision, touch, and audio) on two tasks: pouring small beads into a container (a proxy for pouring fluids) and densely packing objects into a container. They use a ResNet to generate embeddings for each of the three modalities and then pass them through a self-attention layer, followed by an MLP. It is really interesting to see how the attention weights over the three inputs change over the course of a task: the model learns to rely on different modalities at different stages (vision for global alignment initially, audio when pouring, etc.).
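The fusion pattern Tanmay describes (per-modality encoders, then self-attention over one token per modality, then an MLP) can be sketched in a few lines. The encoder choices, input shapes, and output head below are placeholders, not the paper’s actual configuration.

```python
# Minimal sketch of modality fusion: ResNet embeddings -> self-attention -> MLP.
import torch
import torch.nn as nn
from torchvision.models import resnet18

d = 128
vision_enc = resnet18(num_classes=d)   # one ResNet per modality
touch_enc = resnet18(num_classes=d)
audio_enc = resnet18(num_classes=d)    # e.g., applied to an audio spectrogram image
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
head = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 6))  # toy action head

def forward(vision, touch, audio):
    # Stack one embedding per modality into a length-3 token sequence.
    tokens = torch.stack(
        [vision_enc(vision), touch_enc(touch), audio_enc(audio)], dim=1
    )                                            # [B, 3, d]
    fused, weights = attn(tokens, tokens, tokens)
    # `weights` shows how strongly the model attends to each modality.
    return head(fused.mean(dim=1)), weights

batch = torch.rand(2, 3, 96, 96)                 # toy 3-channel inputs for all modalities
action, attn_weights = forward(batch, batch, batch)
print(action.shape, attn_weights.shape)          # [2, 6] and [2, 3, 3]
```

Inspecting `attn_weights` over the course of an episode is the kind of analysis that reveals the stage-dependent reliance on different modalities mentioned above.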
Tweets
Closing Thoughts
Have something to say about this edition’s topics? Shoot us an email at editor@thegradient.pub and we will consider sharing the most interesting thoughts from readers in the next newsletter! For feedback, you can also reach Daniel directly at dbashir@hmc.edu or on Twitter. If you enjoyed this newsletter, consider donating to The Gradient via a Substack subscription, which helps keep this grad-student / volunteer-run project afloat. Thanks for reading the latest Update from the Gradient!