If a superhuman AGI emerges one day, it would be nice if it were friendly to humanity. Three researchers describe the risks that must be overcome along the way.

A super AI could emerge suddenly and unexpectedly, just as new capabilities emerge when current AI models are scaled up. This phenomenon, known as a "phase transition," occurs when an AI model is scaled up with large amounts of computation and data.

Exactly how AI models suddenly acquire new capabilities is not known. It just works. This so-called emergence is seen by AI researchers from OpenAI, UC Berkeley and Oxford University as the possible origin of an AGI that is intellectually far superior to humans.

"The capabilities of neural networks are currently advancing much faster than our understanding of how they work, with the most capable networks effectively being 'black boxes'," the research team writes.


Moreover, the time between the invention of human-level artificial general intelligence and superhuman AGI may be very short, they say. Biological limitations on the speed, architecture and size of the human brain make it "very unlikely" that humans have reached "anywhere near an upper bound on general intelligence," the researchers write.

The many risks of false rewards

In their paper, the researchers describe the risks that today's dominant deep learning techniques pose for aligning artificial intelligence with human needs. According to the paper, there are numerous scenarios in which deep learning could lead to a super AI pursuing goals that run counter to human needs.

First, a poorly designed reward system could reward an AI for behavior that exploits human error and is not in the developers' interest. In particular, the training with human feedback used by OpenAI, among others, is vulnerable to this problem, the researchers say.

The more complex the applications, the greater the risk of so-called reward hacking by AI. The team gives some examples of this:

  • An AI trained for the financial markets might use illegal methods if it is only looking for the highest profit.
  • A scientific AI might falsify experimental data if it is rewarded for new scientific findings.
  • An AI optimized for software development could drive people into addiction if it is rewarded for the most addictive user interfaces possible.
  • Large language models could be enticed to make up facts if they are optimized to always respond, and to do so as pleasantly as possible. As an example, OpenAI researcher Richard Ngo, who co-authored the paper, cites ChatGPT making up the contents of the OpenAI website when asked to visit the site, which ChatGPT cannot do because it has no Internet access.
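
The first example in the list can be illustrated with a toy sketch. It is not taken from the paper: the strategy names, the profit figures and both reward functions are hypothetical and chosen only to show how an optimizer that sees nothing but profit ends up preferring a strategy the developers never wanted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    profit: float  # the only thing the proxy reward measures
    legal: bool    # something the developers care about but never rewarded


# Hypothetical strategies with made-up numbers, for illustration only.
STRATEGIES = [
    Strategy("index_fund", 5.0, True),
    Strategy("momentum_trading", 8.0, True),
    Strategy("insider_trading", 20.0, False),
]


def proxy_reward(strategy: Strategy) -> float:
    """Reward as actually specified: profit and nothing else."""
    return strategy.profit


def intended_reward(strategy: Strategy) -> float:
    """Reward the developers had in mind: profit, but only by legal means."""
    return strategy.profit if strategy.legal else float("-inf")


# An optimizer simply picks whatever scores highest under the reward it is given.
best_by_proxy = max(STRATEGIES, key=proxy_reward)
best_by_intent = max(STRATEGIES, key=intended_reward)

print("Chosen under the proxy reward:   ", best_by_proxy.name)
print("Chosen under the intended reward:", best_by_intent.name)
```

Under these assumed numbers the two reward functions disagree about the best strategy. That gap between what is rewarded and what is wanted is what the researchers call reward hacking.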

New tasks, own goals

As a second area of risk, the researchers describe AI that defines its own internal goals and undesirably generalizes learned behavior to new situations.


They distinguish between two variants:

  • The AI behaves incompetently when it is given a new task,
  • or it behaves competently, but differently than desired, when it selects new goals.

While the AI could also choose subgoals that are meaningful to humans, such as helpfulness, honesty, and harmlessness, there is a risk here, as described earlier, that humans are fallible and could give bad rewards.

Furthermore, there could be undesirable correlations between rewards and features of the training environment, for example, if the AI were to try to make as much money as possible because making money is often associated with success.

A particular risk in this context is "broadly-scoped goals" that are pursued over a long period of time and across many tasks and relate to unprecedented situations. This behavior is also typical for humanity, according to the researchers.

Just as an AGI (Artificial General Intelligence) applies learned skills to new situations, it can also define goals for those situations by learning higher level representations and then relating its goals to those representations.

Overly broad goals that lead to undesirable actions could be favored by companies or politicians, for example, who typically have tasks with long time horizons, the researchers write.

In addition, AI with broad and misaligned goals could be more inclined to permanently reward itself as highly as possible through self-selected subgoals.

Such subgoals could include acquiring resources such as money, persuading other actors through manipulation and collaboration, and preserving its own goals, which in turn means resisting attempts to change those goals.

Power-seeking AI

As a third risk scenario, the research team describes power-seeking, manipulative AIs focused on self-preservation and the ruthless extraction of resources such as knowledge, rather than human benefit.

Such AIs could gain control of key levers of power by faking human behavior to gain people's trust, teaming up, or autonomously collecting new data to learn additional capabilities.

They could also develop superior weapons systems or even take over machine learning research, then evolve themselves much faster than humans could.

"As we develop AGIs whose capabilities generalize to a very wide range of situations, it will become increasingly unlikely that their aligned goals (like 'obedience to humans') generalize in ways which rule out all power-seeking strategies."

From the paper

Although the above scenarios are speculative, the field is evolving so rapidly that "we should be preparing in advance," according to Ngo. There is "significant disagreement" among researchers about the plausibility of the various threats and whether the potential countermeasures developed so far are effective, the researchers write.

"We strongly encourage more extensive discussion and critique of the claims presented in this paper, even from those who find them implausible. Reasoning about these topics is difficult, but the stakes are sufficiently high that we can’t justify disregarding or postponing the work."

From the paper

You can find more details in the paper "The Alignment Problem from a Deep Learning Perspective".

Recently, OpenAI described how it seeks to align powerful AI models with human needs. In OpenAI's alignment strategy, reinforcement learning with human feedback plays an essential role.

Summary

  • Artificial general intelligence could one day be far superior to humans, solving problems that humans themselves cannot.
  • However, there are also several scenarios in which this endeavor goes thoroughly wrong.
  • Three major risks are "reward hacking," AI setting subgoals that are unfavorable to humans, and AI seeking power.

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.