Current-gen models got less accurate and hallucinate at a higher rate than the previous ones, both in my experience and according to OpenAI. I think it’s either because they’re trying to see how far they can squeeze the models, or because the models are starting to eat their own slop found while crawling.
I doubt it. LLMs have already become significantly more efficient and powerful in just the last couple of months.
In a year or two, we will be able to run something like Gemini 2.5 Pro, which right now requires a server farm, on a gaming PC.
Current-gen models got less accurate and hallucinate at a higher rate than the previous ones, both in my experience and according to OpenAI. I think it’s either because they’re trying to see how far they can squeeze the models, or because the models are starting to eat their own slop found while crawling.
https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
That’s one example, but what about other models? What you just did is called cherry-picking, or using selective evidence.
Those are previous-gen models. Here are the current-gen models: https://cdn.openai.com/pdf/8124a3ce-ab78-4f06-96eb-49ea29ffb52f/gpt5-system-card-aug7.pdf#page10