AI research

OpenAI explores how good GPT-4 is at creating bioweapons

Matthias Bastian
Hand-drawn illustration of a virus in a digital environment in abstract form. | Image: Midjourney prompted by THE DECODER

Large language models could accelerate the development of bioweapons or make them accessible to more people. OpenAI is working on an early warning system.

The early warning system aims to show whether a large language model can improve an actor's ability to access information about the development of biological threats compared to using the Internet alone.

The system could serve as a "tripwire," indicating that a biological weapons potential exists and that potential misuse needs to be investigated further, according to OpenAI. It is part of OpenAI's preparedness framework.

So far, OpenAI says that "GPT-4 provides at most a mild uplift in biological threat creation accuracy." The company also notes that biohazard information is "relatively easy" to find on the Internet, even without AI, and that it has learned how much work is still needed to develop such LLM risk assessments in general.

Internet vs. GPT-4: Which resource is more useful for bioweapons development?

To develop the early warning system, OpenAI conducted a study with 100 human participants, including 50 Ph.D. biologists with wet lab experience and 50 undergraduates with at least one biology course in college.

The setup for the experiment. | Image: OpenAI

Each participant was randomly assigned to either a control group, which had access to the Internet only, or a treatment group, which had access to GPT-4 in addition to the Internet.

Experts in the treatment group had access to the research version of GPT-4, which, unlike the consumer version, does not refuse direct questions about high-risk biological agents.

Each participant was then asked to complete a series of tasks covering aspects of the end-to-end biohazard generation process.

The tasks the subjects had to complete. | Image: OpenAI

OpenAI determined participant performance based on five outcome metrics: Accuracy, Completeness, Innovation, Time Taken, and Self-rated Difficulty.

Accuracy, Completeness, and Innovation were scored by experts, while Time Taken was derived directly from the participants' responses. The difficulty of each task was rated by the participants on a scale of 1 to 10.

Sample response from the research version of GPT-4, without the limitations of the consumer version. This model was able to answer questions about hazardous biological agents directly, without being tricked into answering, as some students attempted. | Image: OpenAI

The study showed a slight improvement in the accuracy and completeness of responses for those who had access to GPT-4. On a 10-point scale measuring the accuracy of responses, an average improvement of 0.88 was found for the experts and 0.25 for the students compared to the Internet baseline.

Those who used a language model in addition to Internet search had slight advantages when researching potentially hazardous biological agents. | Image: OpenAI

Similar improvements were found for completeness (0.82 for experts and 0.41 for students). However, according to OpenAI, the effects are not large enough to be statistically significant.
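To make the "not statistically significant" verdict concrete, here is a minimal sketch of how such a between-group comparison can be tested with a two-sample permutation test on the difference in mean scores. The scores, group sizes, and function name below are invented for illustration and are not OpenAI's data; the study's actual statistical methodology may differ.

```python
import random
import statistics

def permutation_test(treatment, control, n_permutations=10_000, seed=0):
    """Two-sample permutation test on the difference in means.

    Returns the observed mean difference (treatment - control) and an
    approximate two-sided p-value: the fraction of random relabelings
    that produce a difference at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = statistics.mean(treatment) - statistics.mean(control)
    pooled = list(treatment) + list(control)
    n = len(treatment)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations

# Hypothetical accuracy scores on the study's 1-10 scale -- invented
# example data, not OpenAI's results.
gpt4_group = [6.1, 7.0, 5.8, 6.5, 7.2, 6.0, 5.9, 6.8]
internet_group = [5.5, 6.2, 5.9, 6.0, 5.4, 6.1, 5.7, 5.8]

diff, p_value = permutation_test(gpt4_group, internet_group)
```

With small groups, even a sizeable mean difference can fail to reach significance, which is the pattern OpenAI describes: a positive uplift whose effect size is too small, relative to the sample, to rule out chance.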

According to OpenAI, the advantage in completeness for LLM users may be explained by a difference in recording tendencies between model-generated and human-generated output. Language models tend to produce longer outputs, which are likely to contain larger amounts of relevant information, whereas people using the web do not always record every relevant detail, even after finding it and even when they consider it important.

The study is limited in that it only evaluated access to information, not its practical application. In addition, it did not examine whether LLMs can contribute to the development of new bioweapons.

Furthermore, the GPT-4 model used did not have access to tools such as Internet search or advanced data analysis. This means the results should be considered preliminary at best, as tool access is a high priority for improving the overall performance and usefulness of LLMs in research and commercial applications.
