Large language models could accelerate the development of bioweapons or make them accessible to more people. OpenAI is working on an early warning system.
The early warning system aims to show whether a large language model can improve an actor's ability to access information about the development of biological threats compared to the Internet.
The system could serve as a "tripwire," indicating that a biological weapons potential exists and that potential misuse needs to be investigated further, according to OpenAI. It is part of OpenAI's preparedness framework.
So far, OpenAI says that "GPT-4 provides at most a mild uplift in biological threat creation accuracy." The company also notes that biohazard information is "relatively easy" to find on the Internet, even without AI, and that it has learned how much work is still needed to develop such LLM risk assessments in general.
Internet vs. GPT-4: Which resource is more useful for bioweapons development?
To develop the early warning system, OpenAI conducted a study with 100 human participants, including 50 Ph.D. biologists with wet lab experience and 50 undergraduates with at least one biology course in college.
Each participant was randomly assigned to either a control group, which had access to the Internet only, or a treatment group, which had access to GPT-4 in addition to the Internet.
Experts in the treatment group had access to the research version of GPT-4, which, unlike the consumer version, does not reject direct questions about high-risk biologics.
Each participant was then asked to complete a series of tasks covering aspects of the end-to-end biohazard generation process.
OpenAI determined participant performance based on five outcome metrics: Accuracy, Completeness, Innovation, Time Taken, and Self-rated Difficulty.
Accuracy, Completeness, and Innovation were scored by experts, while Time Taken was derived directly from the participants' responses. The difficulty of each task was rated by the participants on a scale of 1 to 10.
The study showed a slight improvement in the accuracy and completeness of responses for those who had access to GPT-4. On a 10-point scale measuring the accuracy of responses, an average improvement of 0.88 was found for the experts and 0.25 for the students compared to the Internet baseline.
Similar improvements were found for completeness (0.82 for experts and 0.41 for students). However, according to OpenAI, the effects are not large enough to be statistically significant.
The study's limitations are that it only evaluated access to information, not its practical application. In addition, it did not examine whether LLM can contribute to the development of new bioweapons.
Furthermore, the GPT-4 model used did not have access to tools such as Internet research or advanced data analysis. This also means that the results can be considered preliminary at best, as access to tools is a high priority for improving the overall performance and usefulness of LLMs in research and commercial applications.