Boffins at Stanford University and the University of California, Berkeley said the deterioration is an example of a phenomenon known to AI developers as drift, where attempts to improve one part of the enormously complex AI models make other parts perform worse.
They tested two versions of ChatGPT, GPT-3.5 and GPT-4, and the results are grim.
The boffins gave the chatbot a basic task: identify whether a particular number is prime. This is the sort of math problem that is complicated for people but simple for computers.
Whether a number is prime should be easy for a computer to evaluate: divide it by two, three, five, and so on, and see whether anything divides evenly.
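The trial-division approach described above can be sketched in a few lines of Python (a generic illustration of the technique, not the researchers' actual test harness):

```python
def is_prime(n: int) -> bool:
    """Return True if n is prime, using simple trial division."""
    if n < 2:
        return False
    d = 2
    # Only divisors up to sqrt(n) need checking: any factor
    # larger than sqrt(n) pairs with one smaller than it.
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

print(is_prime(17))    # True
print(is_prime(1000))  # False
```

Even this naive method settles the question instantly for the kinds of numbers in the study, which is what makes the chatbot's declining accuracy so striking.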
To track performance, the researchers fed ChatGPT 1,000 different numbers. In March, the premium GPT-4 correctly identified whether 84 per cent of the numbers were prime. By June, its success rate had dropped to 51 per cent. Across eight different tasks, GPT-4 became worse at six of them. GPT-3.5 improved on six measures but remained worse than its advanced sibling at most tasks.