What's Missing From LLM Chatbots: A Sense of Purpose
Kenneth Li argues that LLM chatbots are missing a "goal" for each conversation and proposes fixing this with human-in-the-loop model evaluations.
I’m super excited to see this piece by Kenneth Li, one of my good friends, come out. Kenneth is doing his PhD in machine learning at Harvard and has been the source of many wonderful discussions about LLMs and the future of AI.
Article Preview:
LLM-based chatbots’ capabilities have been advancing every month. These improvements are mostly measured by benchmarks like MMLU, HumanEval, and MATH (see, e.g., the results reported for Claude 3.5 Sonnet and GPT-4o). However, as these benchmarks become more and more saturated, is user experience improving in proportion to the scores? If we envision a future of human-AI collaboration rather than AI replacing humans, the current ways of measuring dialogue systems may be insufficient because they evaluate models in a non-interactive fashion.