The alignment problem or how AI could become human-friendly

A particularly powerful artificial intelligence that achieves its goals by dubious means could become dangerous to humanity. OpenAI aims to prevent this dystopian scenario with three approaches.

The most vivid example of failed AI alignment comes from philosopher Nick Bostrom. In 2003, he published the paper "Ethical Issues in Advanced Artificial Intelligence", which describes an AI optimized to produce paperclips. The AI discovers that human bodies are a valuable resource for paperclip production. The outcome of this story is obvious.

Bostrom's current approach: people should trust the intellectual capacity of a powerful AI, teach it what people want, and make sure it has a stake in seeing that human needs are met. AI requires common sense, Bostrom says.

OpenAI focuses on the interaction between humans and machines

U.S.-based AI company OpenAI is now offering a glimpse into its alignment research, which addresses Bostrom's demands, at least to some extent. OpenAI's principle is that AI systems should learn from human feedback and at the same time support humans in evaluating AI.

The company wants to push current alignment approaches as far as possible in order to explore and document their possibilities and limitations. Its research rests on three pillars:

Training AI systems using human feedback
Training AI systems to assist human evaluation
Training AI systems to do alignment research

OpenAI acknowledges that these three areas alone do not go far enough, because people make different judgments depending on, for example, their cultural background. Alignment is therefore not only a technical problem, but also a social and political challenge. A central question is to whom AI systems should be aligned.

Human feedback helps with AI alignment

A central building block of OpenAI's current alignment research is the family of InstructGPT models, which are derived from large AI models such as GPT-3. They are optimized to better follow explicit and implicit human intentions and to take values such as truthfulness, fairness, and safety into account. For InstructGPT models, OpenAI integrates human evaluations into the dataset for AI training.
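One common way to turn human evaluations into a training signal is to learn a reward score from pairwise comparisons, where a rater marks one of two answers as better. The following is a minimal sketch of that idea, not OpenAI's implementation: the linear reward model, the feature vectors, and the function names `preference_loss` and `train_reward_weights` are assumptions made for illustration only.

```python
import math

def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred answer wins,
    under a Bradley-Terry style model: P(win) = sigmoid(reward difference)."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def train_reward_weights(comparisons, n_features, lr=0.1, epochs=200):
    """Fit a hypothetical linear reward model on (preferred, rejected)
    feature-vector pairs with plain gradient descent."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            diff = [p - r for p, r in zip(preferred, rejected)]
            margin = sum(w * d for w, d in zip(weights, diff))
            # Gradient of the loss w.r.t. weights is -(1 - sigmoid(margin)) * diff.
            scale = 1.0 / (1.0 + math.exp(margin))  # equals 1 - sigmoid(margin)
            weights = [w + lr * scale * d for w, d in zip(weights, diff)]
    return weights

# Toy data (invented): raters prefer answers that score high on feature 0.
toy_comparisons = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.2], [0.1, 0.9]),
]
learned_weights = train_reward_weights(toy_comparisons, n_features=2)
```

In this toy setup, the learned weights end up favoring the feature that raters consistently preferred. A reward signal of this kind can then be used to steer further training of the language model.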

According to OpenAI, humans already rate InstructGPT models higher on average than the original GPT-3 models and prefer them. This is evident from usage of its natural language API, OpenAI says, and it also showed up in initial studies.

But even this class of models still produces incorrect information or toxic output. Moreover, training with human feedback is unlikely to be sufficient for aligning an AGI with potentially superior machine intelligence. Still, human feedback is a cornerstone of a potential alignment solution, OpenAI says.

AI should make evaluations easier for humans

One weakness of human evaluations is the humans themselves: the more complex the tasks an AI solves become, the more difficult it is for humans to give the AI system useful feedback.

In that case, an AI might only ask humans for help with questions that humans can solve. In the worst case, it might only tell humans what they want to hear instead of the truth.

AI systems should therefore break down complex processes into individual steps that can be evaluated by humans. One example is an AI system from OpenAI that supports the evaluation of an AI-generated book summary with summaries of individual chapters.
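The decomposition idea can be sketched in a few lines: summarize each chapter first, then summarize the chapter summaries, and keep the intermediate results so a human can check each step separately. This is only an illustration of the principle, not a description of OpenAI's system. The function `summarize_book` and the stand-in summarizer `first_sentence` are hypothetical, and a real setup would call a language model instead.

```python
from typing import Callable, Dict, List, Union

def summarize_book(
    chapters: List[str],
    summarize: Callable[[str], str],
) -> Dict[str, Union[str, List[str]]]:
    """Summarize chapter by chapter, then summarize the summaries.
    Intermediate summaries are returned so each step can be reviewed."""
    chapter_summaries = [summarize(chapter) for chapter in chapters]
    book_summary = summarize("\n".join(chapter_summaries))
    return {
        "chapter_summaries": chapter_summaries,
        "book_summary": book_summary,
    }

def first_sentence(text: str) -> str:
    """Stand-in summarizer (assumption): keep only the first sentence."""
    return text.split(".")[0].strip() + "."

# Invented example chapters.
result = summarize_book(
    ["Ada sails north. She waves goodbye.", "Ben lands safely. He rests."],
    summarize=first_sentence,
)
```

A reviewer who cannot judge the full book summary directly can still compare each entry in `chapter_summaries` against its chapter, which is a much smaller task.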

OpenAI sees the development of AI assistance as the best way to conduct further alignment research: "Our alignment techniques need to work even if our AI systems are proposing very creative solutions (like AlphaGo’s move 37), thus we are especially interested in training models to assist humans to distinguish correct from misleading or deceptive solutions."

AI aligns itself

The third building block in OpenAI's alignment strategy: artificial intelligence researching alignment on its own.

There is a compelling reason for this strategy: as AI advances, OpenAI says, we can expect some new alignment problems that no one has yet foreseen. A scalable, fundamental solution to the alignment problem is therefore likely to be elusive.

It would be more pragmatic to develop AI systems that can do alignment research faster and better than humans. For humans, it would probably be much easier to evaluate AI research results than to produce research results themselves, OpenAI says.

"Therefore human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research by themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research," OpenAI writes.

Narrower AI systems that outperform humans in their respective domains would already suffice for AI-based alignment research. These specialized AIs are likely to be easier to align than a potential AGI, the company adds.

Large language models contain "a lot of knowledge and information about human values from reading the internet" and are therefore potentially well suited to automating alignment research, OpenAI suggests. Moreover, they do not pursue their own goals.

Future versions of AI systems such as WebGPT, InstructGPT, and Codex could become the basis for this automated alignment research. It is not yet clear when these systems will be powerful enough.

"Once we train a model that could be useful, we plan to make it accessible to the external alignment research community," OpenAI writes.

Risks of alignment research

OpenAI also describes the risks of its alignment approaches. For example, using AI support for human evaluations could reinforce biases. Moreover, aligning general AI is likely to be much more challenging than aligning today's AI systems, even though OpenAI expects a smooth transition to increasingly powerful systems.

"[...] but if there are major discontinuities or paradigm shifts, then most lessons learned from aligning models like InstructGPT might not be directly useful," OpenAI writes.

AI research on alignment could also prove to be a risk if the AI systems needed for it are already too dangerous without alignment. In this case, the direct alignment of AGI would come back into focus.
