With the advent of AI text generators such as ChatGPT, the question arose: how can AI text be distinguished from text written by humans?
This question cuts across all domains, from teachers who need to grade assignments, to agencies that hire copywriters, to search engines that try to rank content. AI text detectors promise an answer, but they rarely deliver it reliably.
So far, tools like DetectGPT, GPTZero, and even OpenAI's own text classifier have failed to deliver convincing results for ChatGPT, GPT-3, and other AI generators: neither AI nor human text is reliably recognized as such. That can have serious consequences if decision-makers, for example in the education sector, rely on the results.
AI detectors do not seem to work reliably
Brandon Gorrell, a writer for the newsletter Pirate Wires, has run a more extensive test, feeding texts written by himself and by ChatGPT into the most popular AI detectors: OpenAI's classifier, GPTZero, Content at Scale, Writer.com, Corrector.app, and CopyLeaks. His tests show that the tools rarely agree, or are at best vague in their verdicts.
In a test run of five AI-generated texts submitted during the week of February 13, the detectors never unanimously and unambiguously classified a text as AI-generated.
The results of the tools for an AI-generated description of zebras:
GPTZero: “Your text is likely to be written entirely by AI”
OpenAI: “The classifier considers the text to be possibly AI-generated.”
Content at Scale: “Likely both AI and Human!”
Writer.com: “75% human generated content”
Corrector.app: “Fake 42.55%”
CopyLeaks: “AI content detected”
The results of the tools for an AI-generated wedding invite:
GPTZero: “Your text is likely to be written entirely by AI”
OpenAI: “The classifier considers the text to be possibly AI-generated.”
Content at Scale: “Unclear if it is AI content!”
Writer.com: “13% human generated content”
Corrector.app: “Fake 99.97%”
CopyLeaks: “AI content detected”
According to the experiment, the tools worked better with human-written text, and in some cases they were all correct. However, Gorrell also notes that the results varied greatly over the course of the study, making systematic evaluation virtually impossible, which is itself further evidence of the tools' unreliability.
Reliable AI text recognition may not be realistic
Tech journalist Jon Stokes, co-founder of Ars Technica, thinks he knows why: some detectors are likely tuned to the token probabilities of one particular model, he said, and are therefore overwhelmed by text generated by a different model.
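Stokes's point can be illustrated with a toy sketch. The code below is not a real detector; it uses a tiny add-one-smoothed bigram model (all names and the training snippet are hypothetical) to measure how "predictable" a text looks to a model trained on one corpus. Probability-based detectors rely on this kind of signal, and it is exactly what breaks down when the text comes from a model other than the one the detector assumes.

```python
import math
from collections import Counter

def train_bigram(corpus: str):
    """Estimate a toy bigram language model from a reference corpus."""
    words = corpus.lower().split()
    bigrams = Counter(zip(words, words[1:]))
    unigrams = Counter(words)
    return bigrams, unigrams, len(unigrams)

def perplexity(text: str, bigrams, unigrams, vocab_size: int,
               alpha: float = 1.0) -> float:
    """Perplexity of `text` under the add-alpha smoothed bigram model.

    Lower values mean the model finds the text more predictable;
    probability-based detectors treat low scores as evidence that
    the text came from the model they were tuned to."""
    words = text.lower().split()
    log_prob, n = 0.0, 0
    for w1, w2 in zip(words, words[1:]):
        p = (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab_size)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Train on one "model's" output, then score two texts against it.
bigrams, unigrams, v = train_bigram(
    "the cat sat on the mat the cat ate the fish"
)
in_domain = perplexity("the cat sat on the mat", bigrams, unigrams, v)
out_of_domain = perplexity("quantum flux oscillates rapidly today",
                           bigrams, unigrams, v)
# in_domain scores lower: the model "recognizes" text matching its
# training distribution, while unfamiliar text looks more surprising.
```

A detector built on this principle only works for text distributions it has modeled; swap in a different generating model and the perplexity scores lose their meaning, which is one plausible reason the tools in Gorrell's test disagree so often.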
This is all the more questionable because most AI detectors advertise themselves as working independently of any particular model. With increasingly easy-to-customize language models likely to make detection even harder, that claim does not reflect well on these often paid services.
Even OpenAI admitted, when it released its classifier, that the tool correctly identifies only a small fraction of AI-generated content. OpenAI CEO Sam Altman has also publicly stated several times that permanently reliable AI text detectors do not exist and that the education system should not rely on them.