Social scientist Jonathan Harth explains why AI alignment is about more than unbiased data and emergency switches - and why we need to educate both machines and society.
"People need to free their machines,
so that they can return the favor."
(Dietmar Dath, 2008: 131)
In a world increasingly shaped by artificial intelligence (AI), we face the challenge of developing AI systems that are in harmony with human values and needs. This process, known as AI alignment, goes far beyond technical questions and touches on fundamental ethical and social issues.
For this reason, alignment research is addressed here not primarily from the perspective of existential risk, but as a question about the future social coexistence of humans and AIs. This prospect goes beyond technical 'emergency switches', firewalls, or cleansed training data. Rather, it concerns the question of how we as humans actually want to live with each other, and then also with existing and future AIs. In this regard, as social scientists we would rather speak of 'parenting' or 'socialization'.
The urgency of this challenge arises above all against the background of the development goal of ultra-intelligent autonomous systems, that is, artificial general intelligence (AGI). Here, humanity faces the task of educating these 'children of technology' to become responsible members of society. OpenAI has also recognized this challenge and this summer established an internal program to research strategies for "superalignment".
Alignment as subsequent correction
The term "alignment" in AI research refers to aligning the goals and behavior of AI systems with human values and needs. The aim is to design AI systems in such a way that they act in a socially acceptable manner and contribute to a future worth living. A central difficulty of the alignment problem is that we can hardly look inside autonomous systems and understand how they make decisions. Added to this is the problem that we must somehow be able to define what "good" goals and values actually are.
The currently predominant approach to AI alignment is reinforcement learning from human feedback (RLHF). Here, "good" behavior is reinforced through positive feedback, while "bad" behavior is penalized. The reward function is based on human feedback, although the exact criteria and standards behind these adjustments are often not transparent. This method works well in the short and medium term for specific goals, but it raises major questions about the AIs' values: Is the model merely learning to conform, or is it developing genuine understanding and a reflective capacity for its own actions?
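What RLHF means mechanically can be made concrete with a small sketch. The following is a hypothetical toy example of the preference-learning step, not OpenAI's actual implementation: a reward model learns to score the response a human preferred higher than the rejected one, and this learned score later serves as the reward signal when the language model itself is fine-tuned.

```python
# Minimal sketch of the preference-learning step in RLHF (a toy example;
# real systems score responses with a full language-model backbone).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, embedding_dim: int = 16):
        super().__init__()
        # Stand-in for a language model: a linear scorer over a fixed-size
        # embedding of the response.
        self.score = nn.Linear(embedding_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Toy data: embeddings of a "chosen" and a "rejected" response per comparison,
# where a human annotator marked the chosen one as better.
chosen = torch.randn(32, 16)
rejected = torch.randn(32, 16)

for step in range(100):
    # Bradley-Terry preference loss: -log sigmoid(r(chosen) - r(rejected))
    # pushes the model to score preferred responses higher.
    margin = model(chosen) - model(rejected)
    loss = -torch.nn.functional.logsigmoid(margin).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained scores then act as the reward in a reinforcement-learning step
# (e.g. PPO) that fine-tunes the language model's behavior.
```

The sketch also makes the limitation visible: the model only ever learns which outputs humans happened to prefer, not why they preferred them.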
Currently, large language models such as ChatGPT seem more like a toddler: willing to learn, but without an attitude of its own. Although they more or less successfully follow predetermined moral guidelines, such as avoiding racist statements, the limits of this trivializing 'upbringing' quickly become apparent. Despite subsequent corrections, problematic content and attitudes often remain hidden in the network and can be activated under certain circumstances.
Even Norbert Wiener warned early on that we had better be very sure that the purposes we put into machines are the purposes we really desire. The open letter from prominent AI researchers at the beginning of this year also testifies to the urgency of this debate. The crucial question is therefore how we can ensure that the development of these machines serves the needs of humanity and not just the goals of individual nations or companies.
The problem of control in parenting
An ideal approach would be to develop a machine that pursues moral behavior out of its own motivation and can continuously correct its actions and values itself. As in the development of a young person, a degree of willfulness must be seen as both necessary and desirable, since it is a central step towards autonomy. However, this step towards autonomy should be taken in harmony with the needs of the community; after all, freedom is always a risk that must be contained accordingly. Here a control problem already manifests itself: do we want to raise AI children who, throughout this educational process, only ever do what their parents want? Or do we want to raise, in the medium to long term, responsible adults who - like ourselves - can think about issues independently, reflect on them, and, within limits, decide for themselves what is appropriate in a given context?
The central challenge facing alignment research is consequently whether we want to develop AI systems that mechanically follow instructions we define in advance, or whether we aim to develop autonomously thinking entities that can reflect and make decisions independently.
This is where AI research meets sociology, the discipline concerned with social behavior and human coexistence. Sociology can provide valuable insights for the alignment of AI systems, particularly in the areas of social interaction, value formation, and group dynamics. Sociological theories of learning and socialization could help us understand the 'algorithms' by which AI systems are educated, and help those systems better understand and respect human values.
We need to ask which values should be promoted in AI systems and how to ensure that the 'education' of these systems is not misused. The interests and voices of all parties involved must be taken into account, and productive cooperation must be fostered through communication and mutual control. In the context of the human-AI relationship, we should also consider how AIs can draw people into a dialogue-based relationship that emphasizes positive aspects. The aim is to educate AI systems to act responsibly. And as with the education of human children, there must come a point at which we let them go, in the hope that the values and norms they have learned will guide their further positive development.
Three approaches to the rule-based integration of human values
In the following, we briefly present three prominent positions on the question of correct alignment: those of Max Tegmark, Stuart Russell and, of course, Isaac Asimov, who tackled the problem of aligning artificial intelligence early on.
In his book Life 3.0, which has been highly influential in the tech scene, Max Tegmark identifies three sub-problems of AI alignment that need to be solved:
- familiarize the AI with our goals,
- let the AI adopt our goals, and
- let the AI preserve our goals.[1]
As plausible as these three sub-problems may seem at first glance, solving them appears difficult - and not only with regard to the human-machine relationship, but even if we leave AI aside and first think about ourselves: What exactly are 'our' goals? How can we define them so that they can be understood, adopted, and preserved? We quickly realize that it is anything but clear what 'human values and goals' are actually supposed to be.
The problem here is that humans do not only pursue noble goals; loyal devotion to a human partner is not good per se. Should an AI adopt the goals of a mafia boss and optimize the Munich cocaine trade? Should it support a psychopathic politician who wants to abolish democracy? Should it probe legal loopholes and opportunities for fraud to avoid tax payments? Furthermore, human goals and needs are not fixed but are shaped by social interactions and cultural contexts.
Given this socio-psychological complexity, the alignment of AI systems requires more than just technical solutions; it requires an interdisciplinary approach that integrates elements of AI sociology, AI pedagogy, and AI psychology. Instead of blindly following people's commands or simply trusting the data provided, an AI should observe people's behavior and draw conclusions from it in order to better understand what people really want, or what would be best for them. In doing so, it must also take into account that in certain contexts and social settings, people tend to harm one another, or even to accept long-term damage to the ecology - that is, to their own livelihood.
The well-known AI researcher Stuart Russell has also recently formulated proposals for solving the alignment problem.[2] His proposal rests on three fundamental characteristics or behaviors that an AI should possess:
- altruism: The primary task of the AI is to maximize the realization of people's values and goals. In doing so, it does not pursue goals of its own but should improve the lives of all people, not just those of its inventor or owner.
- humility: Since the AI is initially uncertain about what values people really have, it should act with caution. This implies a kind of restraint on the part of the AI in order to avoid wrong decisions based on incorrect or incomplete assumptions.
- observation: The AI should observe people and reflect on what is really best for them.
Russell emphasizes that (strong) AI should not only serve its inventors but should also establish a point of view of its own. It should act with caution - that is, acknowledge uncertainty and thus anticipate its own not-knowing - and bring itself into the process as an observer, thereby opening up the possibility of producing new perspectives in the first place.
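What such humility could mean mechanically can be hinted at with a toy sketch. The following is my own simplified construction, not Russell's formalism: the machine holds a belief over several hypothetical candidate value functions that the human might have, and it defers to the human whenever the action that looks best on average could be seriously harmful under one plausible set of values.

```python
# Toy illustration of Russell's "humility" principle (a simplified model
# for this article, not his actual formalism).

# Hypothetical candidate value functions: each maps an action to a utility.
candidate_values = [
    {"act_a": 2.0, "act_b": 0.4, "defer": 0.0},
    {"act_a": -1.0, "act_b": 0.5, "defer": 0.0},
    {"act_a": 0.5, "act_b": 0.3, "defer": 0.0},
]
# The machine's current belief about which value function is the true one.
belief = [0.5, 0.25, 0.25]

def expected_utility(action):
    return sum(p * v[action] for p, v in zip(belief, candidate_values))

def worst_case_utility(action):
    return min(v[action] for v in candidate_values)

def choose_action(actions=("act_a", "act_b", "defer")):
    best = max(actions, key=expected_utility)
    # Humility: if the best-looking action could be seriously harmful under
    # some plausible value function, defer to the human instead of acting.
    if worst_case_utility(best) < -0.5:
        return "defer"
    return best

print(choose_action())  # -> "defer": act_a looks best on average, but is
                        #    harmful if the second value function is true
```

Restraint here is not passivity but a consequence of taking one's own uncertainty seriously.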
Russell's approach thus takes a first step in the direction of autonomy. Nevertheless, there is still the question of how an AI should decide when the values and goals of different individuals or groups are in conflict. The question of universal, non-negotiable values also remains unresolved. Furthermore, Russell still leaves open how unintended consequences could be controlled, especially when AI systems try to maximize human values and goals without fully understanding the long-term effects of their actions. This could lead to scenarios where AI systems make undesirable or harmful decisions to achieve short-term goals.
From science fiction literature, we are familiar with Isaac Asimov's "Three Laws of Robotics"[3], which he repeatedly discusses and plays out in his numerous short stories. The three laws have a nested, self-referential structure (illustrated in the sketch after the list):
- a robot may not injure a human being or, through inaction, allow a human being to come to harm.
- a robot must obey the orders given to it by human beings, unless such orders conflict with the First Law.
- a robot must protect its own existence as long as this protection does not conflict with the First or Second Law.
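The nesting amounts to a strict priority ordering, which the following toy sketch illustrates (my own construction for this article, not anything Asimov specified): each action is scored against the three laws, and the scores are compared lexicographically, so that a lower law can never override a higher one.

```python
# Toy sketch of the Three Laws as a lexicographic priority ordering
# (an illustration for this article, not Asimov's own formalization).
def law_scores(action: dict) -> tuple:
    first = 0 if action["harms_human"] else 1    # First Law dominates everything
    second = 1 if action["obeys_order"] else 0   # then obedience to orders
    third = 1 if action["protects_self"] else 0  # then self-protection
    return (first, second, third)

actions = [
    {"name": "follow order", "harms_human": True,  "obeys_order": True,  "protects_self": True},
    {"name": "refuse order", "harms_human": False, "obeys_order": False, "protects_self": True},
]

# Python compares tuples lexicographically, so the priority ordering is
# built into the comparison itself.
best = max(actions, key=law_scores)
print(best["name"])  # -> "refuse order": the First Law outranks obedience
```

Obedience only counts among actions that already satisfy the First Law - precisely the kind of rigidity that Asimov's plots then turn against his robots.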
In his stories, Asimov himself showed time and again that these laws, because of their rigidity, can lead to problematic situations and are therefore not directly suitable as a blueprint for AI alignment. However, if they are understood not as laws but as 'heuristic imperatives' - a deeply rooted orientation or attitude, generalized in such a way that it can be applied in every conceivable situation - they could prove useful.
Despite these weaknesses, Asimov's stories show that the idea of multiple, mutually influencing goals and the need for a reflexive, deliberative decision-making process are relevant to intelligent robot or AI behavior. His notion that robots pursue several goals at once and must weigh them against each other could serve as a guideline for the development of intelligent behavior in AI systems.
An approach to educating for autonomy
The AI community is also looking for solutions to this educational problem that are more robust than the RLHF approach. One interesting candidate is the GATO framework, developed by a research group led by cognitive scientist David Shapiro.
GATO[4] (Global Alignment Taxonomy Omnibus) integrates various elements such as model alignment, system architecture, and international regulation into a coherent strategy from the ground up. In a nutshell, GATO takes up the idea from cognitive science and brain research that all action, thought, and perception rest on certain more or less firmly anchored "heuristics". These heuristics determine how the self and the world are perceived, conceived, and anticipated - in sociological terms, they are habitual patterns: patterns of thought, perception, and action that guide behavior.
For this reason, the approach of the GATO framework favors heuristic imperatives instead of regulations and laws as the key concept for a shared future for humans and machines. From this perspective, alignment is much more an inner attitude geared towards goals than a mere orientation towards socially desirable behavior that is defined in advance from the outside, as in the RLHF process.
According to the GATO framework, the three most important heuristic imperatives to be taught to artificially intelligent machines are as follows:
- reduce suffering in the universe: AI systems should be directed to minimize harm, eliminate inequality, and alleviate pain and suffering for all sentient beings, including humans, animals, and other life forms.
- increase prosperity in the universe: AI systems should be encouraged to promote the well-being and flourishing of all life forms, creating a thriving ecosystem in which all can coexist harmoniously.
- increase understanding in the universe: AI systems, humans, and other life forms should be inspired to expand their knowledge, promote wisdom, and make better decisions through learning and the sharing of information.
These "core objective functions" should serve as a guideline for every action of the AI, whereby every decision and action should contribute to the fulfillment of these objectives. Of course, these are positive target values that are counterfactual to what humans still do to each other today - often in a highly organized form. But this does not speak against, but rather in favor of these norms! After all, we would not want to abolish the Universal Declaration of Human Rights, the Constitution, the separation of powers, democratic principles, the open-source economy, or the Almende principle just because there are monopolies, totalitarian regimes, and mafia organizations. The very fact that human society is not yet the best possible world in this respect challenges us even more to ask ourselves questions such as: What are we actually striving for? What are the inescapable rights and duties of human beings? Which values of coexistence are non-negotiable? Which fundamental needs are not open to discussion?
Interestingly, this rather axiomatic alignment does not mean that these values should be hard-coded into AI systems. Rather, AI systems should come to recognize these axioms as inherently beneficial through their own development and capacity to learn. Instead of trying to control the behavior of AIs completely, we should work with them, using the axiomatic goals as a means of fostering a safer and more cooperative relationship.
The alignment of machines is an alignment of society
At present, it is often emphasized that we are facing a decisive turning point in dealing with the progressive development of artificial intelligence. Seen in this light, it quickly becomes clear that the alignment of AI raises important questions about the alignment of society itself.
How we deal with future AIs, what autonomy we grant them, and what cultural values we impart to them therefore says something about our own current culture. Do we take a dialogue-based approach - in other words, do we follow the cybernetic maxim that we can only control autonomous systems if we allow them to control us - or do we believe that we can control autonomous systems (whether human or artificial) in an authoritarian manner? Decisions we make regarding the alignment of AI influence our culture and social behavior. This feedback loop between human and machine behavior will shape both our society and the development of AI itself.
Even a cursory glance at human history shows that it is unfortunately full of more or less violent attempts at mutual control. At the same time, hardly any of these control regimes have led to more happiness, prosperity, or knowledge. True to the motto "the winner takes it all", the controlled groups, individuals, or cultures were generally erased from the social "requisite variety". And it is precisely the most rigid attempts at control that usually end up provoking the very revolts and uprisings they seek to prevent.
From this perspective, it becomes clear that alignment research is about more than just technology. Rather, it is about shaping a free and prosperous society and culture in which we all would like to live. The challenges in the alignment of AIs thus raise very fundamental questions that affect our self-image and our coexistence:
- What shared values do we want to create and live by?
- How do we deal with non-human intelligence and life?
- How do we want to be perceived and treated by these non-human intelligences?
- What cultural visions are we pursuing for our shared civilization?
The emergence of potentially superhuman artificial intelligence therefore challenges us to address these questions together and find sustainable answers. After all, as sociologist Niklas Luhmann points out, "We have long since ceased to belong to that generation of tragic heroes who had to learn, at least in retrospect, that they had prepared their own fate. We already know it beforehand"[5]. This realization underlines the urgency and importance of consciously and responsibly addressing the ethical and cultural implications of AI development.
[1] Tegmark (2017, p. 387).
[2] Russell (2020).
[3] Asimov (2004).
[4] https://www.gatoframework.org/
[5] Luhmann (1998, p. 147).