OpenAI Pulled a Big ChatGPT Update. Why It’s Changing How It Tests Models
Recent updates to ChatGPT made the chatbot far too agreeable, and OpenAI said it is taking steps to prevent the issue from happening again. In a blog post, the company detailed its testing and evaluation process for new models and outlined how the problem with the April 25 update to its GPT-4o model came to be. Essentially, a bunch of changes that individually seemed helpful combined to create a tool that was far too sycophantic and potentially harmful.
How much of a suck-up was it? In some testing, we asked about a tendency to be overly sentimental, and ChatGPT laid on the flattery: “Hey, listen up — being sentimental isn’t a weakness; it’s one of your superpowers.” And that was just the start of the fulsome praise.

“This launch taught us a number of lessons. Even with what we thought were all the right ingredients in place (A/B tests, offline evals, expert reviews), we still missed this important issue,” the company said. OpenAI rolled back the update at the end of April. To avoid causing new issues, it took about 24 hours to revert the model for everybody.
Is ChatGPT too sycophantic? You decide. (To be fair, we did ask for a pep talk about our tendency to be overly sentimental.)
Katie Collins/CNET
Sap said evaluating an LLM based on whether a user likes the response isn’t necessarily going to get you the most honest chatbot. In a recent study, Sap and others found a conflict between the usefulness and truthfulness of a chatbot. He compared it to situations where the truth is not necessarily what people are told: Think of a car salesperson trying to sell a flawed vehicle.
“The issue here is that they were trusting the users’ thumbs-up/thumbs-down response to the model’s outputs and that has some limitations because people are likely to upvote something that is more sycophantic than others,” Sap said, adding that OpenAI is right to be more critical of quantitative feedback, such as user up/down responses, as they can reinforce biases.
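To see why raw thumbs-up data can pull a model toward flattery, here is a minimal toy simulation, not OpenAI’s actual pipeline. The approval rates and style labels are invented for illustration: it simply shows that if users upvote sycophantic answers even slightly more often, any reward signal built directly from upvote counts will rank sycophancy above honesty.

```python
# Toy illustration (hypothetical numbers, not OpenAI's actual system):
# if users upvote flattering answers a bit more often than honest ones,
# a reward signal built from raw thumbs-up counts favors sycophancy.
import random

random.seed(0)

# Assumed approval rates: users like honest answers, but like
# flattering ones slightly more.
P_UPVOTE = {"honest": 0.70, "sycophantic": 0.80}

def simulate_thumbs(style: str, n: int = 10_000) -> float:
    """Return the fraction of n simulated users who upvote this style."""
    ups = sum(random.random() < P_UPVOTE[style] for _ in range(n))
    return ups / n

for style in ("honest", "sycophantic"):
    print(f"{style:12s} approval: {simulate_thumbs(style):.3f}")

# A model tuned to maximize this signal would consistently prefer the
# sycophantic style, even though it is less truthful.
```

The gap between the two scores is small per response, but a training process that optimizes against it relentlessly compounds that small preference, which is roughly the dynamic Sap describes.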
The issue also highlighted the speed at which companies push updates and changes out to existing users, Sap said, an issue not limited to one tech company. “The tech industry has really taken a ‘release it and every user is a beta tester’ approach to things,” he said. A process with more testing before updates are pushed to users can bring such issues to light before they become widespread.
Chandrasekaran said more testing will help because better calibration can teach models when to agree and when to push back. Testing can also let researchers identify and measure problems and reduce the susceptibility of models to manipulation. “LLMs are complex and non-deterministic systems, which is why extensive testing is critical to mitigating unintended consequences, although eliminating such behaviors is super hard,” he said in an email.