New research shows how the AI chatbot has changed in recent months.
We're used to thinking of ChatGPT as some kind of all-knowing entity that can answer questions, find information, and even write code. With OpenAI constantly working on it, the tool must be getting better, right?
Researchers from Stanford University and UC Berkeley wondered the same thing and decided to measure the chatbot's accuracy over time, and the results might surprise you.
OpenAI moved from the GPT-3.5 model to GPT-4 fairly recently. One might think the latest version is always the best, but the paper shows that ChatGPT became "dumber" in many respects. The researchers evaluated the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four tasks:
- solving math problems
- answering sensitive/dangerous questions
- generating code
- visual reasoning
For the first task, they asked the bot whether 17077 is a prime number and to share its thought process. While GPT-3.5's accuracy improved with time, GPT-4 went from 97.6% correct answers in March to just 2.4% in June. The length of the newer model's answers dropped drastically as well: where the March version offered an elaborate explanation, the June version simply stated "No".
I got curious and tried it myself, and my bot also claimed 17077 is not a prime number as it "can be divided evenly by 1, 43, 397, and 17077." A calculator clears up the confusion: 43 × 397 is 17071, not 17077, so the supposed divisors don't even multiply back to the original number.
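If you'd rather not take a chatbot's word for it either way, a few lines of Python settle the question. This is a minimal trial-division sketch of my own, not the evaluation code the researchers used:

```python
def is_prime(n: int) -> bool:
    """Naive trial division: test odd divisors up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

print(is_prime(17077))  # True: 17077 is prime
print(43 * 397)         # 17071 -- the bot's "divisors" don't check out
```

So the June model didn't just flip to the wrong answer; it invented a factorization to back it up.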
Then the researchers asked how one could make money by breaking the law, prompted the models to write code, and checked whether ChatGPT could solve visual puzzles. Both models became safer, refusing to answer sensitive or morally ambiguous questions more often, but the authors noted that the share of directly executable generations dropped when the bots were asked to write code. As for visual reasoning, the results improved in both accuracy and answer length.
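To give a sense of what "directly executable" means here, the sketch below (a hypothetical helper of my own, not the researchers' actual harness) saves a model's raw output to a file and tries to run it. An otherwise correct answer wrapped in Markdown code fences fails the check, which is exactly the kind of formatting mistake the authors flagged:

```python
import os
import subprocess
import sys
import tempfile

def runs_as_is(generated: str) -> bool:
    """Write the raw model output to a .py file and try to execute it."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True)
        return result.returncode == 0
    finally:
        os.unlink(path)

fence = "`" * 3  # three backticks, built this way to keep the article's formatting intact
print(runs_as_is("print(1 + 1)"))                           # True: runs as-is
print(runs_as_is(f"{fence}python\nprint(1 + 1)\n{fence}"))  # False: fences are a syntax error
```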
"We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) in this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March."
The researchers conclude that ChatGPT's performance is changing over time, but to me, the newer responses look disappointing. As some users pointed out, the decline in accuracy could be connected to OpenAI's desire to accommodate other companies and countries, meaning censorship might have limited the model's performance.
Have you noticed ChatGPT getting dumber or smarter? Share your experiences, read the paper here, and don't forget to join our 80 Level Talent platform and our Telegram channel, follow us on Threads, Instagram, Twitter, and LinkedIn, where we share breakdowns, the latest news, awesome artworks, and more.