Philosopher David Chalmers: Current AI interpretability methods miss what matters most
Philosopher David J. Chalmers proposes interpreting AI systems through their attitudes toward propositions - much like we interpret humans. His concept of "propositional interpretability" aims to put mechanistic AI explanation on new footing, drawing on philosophical theories of human understanding.
Australian philosopher David J. Chalmers lays out a new paradigm for understanding artificial intelligence in a recent paper: AI systems should be analyzed not just through their internal mechanisms, but also through their attitudes toward propositions (so-called propositional attitudes). These include states like "believing," "wanting," "suspecting," or "intending." Chalmers calls the systematic effort to capture and explain these attitudes "propositional interpretability."
A simple example helps illustrate the idea. A person can simultaneously believe that it's raining ("I believe it's raining") and wish that it weren't ("I wish it weren't raining"). These two attitudes - belief and desire - relate to the same proposition ("it's raining") but express different stances toward it.
The goal, Chalmers argues, isn't just knowing which concepts a system has activated (e.g. "war"), but whether the system believes a war is happening or is actively pursuing one. For safety and ethics, this distinction is essential. It's equally critical for cognitive modeling of AI systems.
Mechanistic interpretability alone won't cut it
Interpretability, Chalmers notes, is often confused with explainability in AI research. Explainability is aimed at end users, making decisions understandable. Interpretability, by contrast, is a technical tool for researchers that explains how a system works internally. Chalmers draws a line between behavioral analysis (input/output) and mechanistic analysis (internal processes).
Within mechanistic interpretability, he further distinguishes between algorithmic analysis (identifying processing structures such as attention circuits) and representational analysis (the question of how and what a system represents internally). The latter splits into conceptual interpretability, dealing with concepts like "cat" or "bridge," and propositional interpretability, dealing with propositional attitudes like "the bridge is red" or "I want to win."
Propositional attitudes could reshape how we explain and predict AI
In human psychology, propositional attitudes are central to understanding actions. People act based on what they believe and desire. The same increasingly applies to AI systems, Chalmers argues. These systems have world models (beliefs), goal states (desires), probabilities (credences), and intentions, even if such terms might seem anthropomorphic when applied to machines.
Consider what it takes to determine whether a language model like GPT implicitly holds a racist bias or merely reproduces one. You need more than knowing that concepts like "Black people" are activated. What matters is the system's attitude toward them. Does the system believe this group is frequently denied loans, or does it pursue that outcome as a goal?
Thought logging poses a major methodological challenge
Chalmers frames what he calls "thought logging" as a central objective: a method for recording an AI system's relevant propositional attitudes over time. A simplified log entry might look like this:
Goal: I win this game of chess.
Judgment (credence 0.8): If I move Qf8, I will win.
Goal: I move Qf8.
Action: I move Qf8.
Logs like these would make visible the goals, beliefs, and evaluations a system holds at any given moment. They could be extended with "reason logging" (justifications for attitudes) and "mechanism logging" (which internal modules cause a given attitude). Chalmers acknowledges that complete logs are practically impossible, since a system could hold an infinite number of attitudes. But partial logs are feasible, for instance covering currently active ("occurrent") attitudes or particularly significant ones.
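To make the idea concrete, here is a minimal sketch of what a partial thought-log record could look like as a data structure. The field names, attitude types, and filtering logic are illustrative assumptions, not a format proposed in Chalmers' paper:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogEntry:
    """One propositional attitude recorded at a point in time."""
    step: int                         # when the attitude was recorded
    attitude: str                     # e.g. "goal", "belief", "action"
    proposition: str                  # the content, e.g. "I move Qf8"
    credence: Optional[float] = None  # degree of belief, if applicable
    reasons: list = field(default_factory=list)  # optional "reason logging"
    mechanism: Optional[str] = None              # optional "mechanism logging"

# The chess example from the text, expressed as a partial log:
log = [
    LogEntry(0, "goal", "I win this game of chess"),
    LogEntry(1, "belief", "If I move Qf8, I will win", credence=0.8),
    LogEntry(2, "goal", "I move Qf8"),
    LogEntry(3, "action", "I move Qf8"),
]

# A partial log might cover only certain kinds of attitudes, for example
# goals and actions rather than every background belief:
partial = [e for e in log if e.attitude in ("goal", "action")]
```

The optional `reasons` and `mechanism` fields mirror the "reason logging" and "mechanism logging" extensions mentioned above; a practical system would fill them only for significant attitudes.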
From radical interpretation to computational interpretation of AI
Chalmers places his idea of propositional interpretability in a philosophical tradition that became known as "radical interpretation" in the 1970s. Philosopher Donald Davidson coined the term to describe how one might infer the beliefs, desires, and meanings of an unfamiliar subject - say, a speaker of an unknown language - solely from observable behavior and utterances. The interpreter starts with no prior assumptions about the subject's language or thoughts. The method is deliberately constrained: meaning and attitude must be reconstructed entirely from external patterns.
Philosopher David Lewis proposed a far less restricted alternative in response. In his thought experiment, the interpreter has complete access to all physical facts about a person, including brain states, neural activity, environmental interactions, and behavior. The task is to determine the subject's beliefs and desires from this basis. Lewis framed the challenge this way: "Imagine that we have undertaken the task of coming to know Karl as a person. We would like to know what he believes, what he desires, what he means, and anything else about him that can be explained in terms of these things." Given all the physical facts about Karl, the job is to "solve for the rest," that is, his beliefs, desires, and meanings. Lewis's approach is broader, theoretically allowing a more direct reconstruction of mental states: not through behavior, but through the physical makeup of the system.
The question of how to infer mental content from internal states takes on new urgency with modern AI systems. Chalmers picks up Lewis's idea but swaps the biological substrate for algorithmic structures. Instead of looking at neural activity as with humans, he analyzes algorithmic properties of AI systems. He calls this "computational interpretation": instead of physical facts, you examine network architecture, weights, activations, and input-output behavior to infer a system's propositional attitudes.
This raises a fundamental question: what does it even mean for a particular internal state of an AI system to "mean something", to express a claim about the world or represent a goal? To address this, Chalmers turns to psychosemantics, a philosophical field concerned with the conditions under which mental or internal states can carry meaning.
What psychosemantics reveals about meaning in AI systems
Psychosemantics, originally developed by philosophers like Jerry Fodor, works similarly to linguistic semantics. It asks, on one hand, what the content of a mental state is (the semantic question), and on the other, why that state has that particular content (the metasemantic question). Applied to AI, this means: if an activation pattern in a language model expresses a specific proposition - say, "the cat sat on the mat" - the question becomes what gives rise to that meaning. Is it the correlation with certain inputs? Or the functional role of the pattern within the overall system?
Two key principles are at play: the information principle and the use principle. The information principle holds that a state represents something when it reliably correlates with a state of the world. A neuron that regularly fires in response to horizontal lines might represent "horizontal." The use principle defines representations functionally: a state stands for "cat" if it regularly causes a system to use the word "cat" or respond to cats.
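The two principles can be illustrated with a toy experiment. In this sketch, a simulated "neuron" fires when a horizontal line is present; the information principle corresponds to the correlation between activation and world state, and the use principle to whether the activation actually drives downstream behavior. The data and thresholds are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world: each of 1000 samples either contains a horizontal line (1) or not (0).
world = rng.integers(0, 2, size=1000)

# A "neuron" that fires mostly when a horizontal line is present, plus noise.
activation = world * 0.9 + rng.normal(0, 0.1, size=1000)

# Information principle: does the state reliably correlate with a world state?
corr = np.corrcoef(activation, world)[0, 1]

# Use principle: does the state drive behavior? Here a toy "system" outputs
# "horizontal" whenever the neuron's activation exceeds a threshold.
says_horizontal = activation > 0.5
agreement = (says_horizontal == world.astype(bool)).mean()
```

On this construction both criteria are satisfied at once; in real systems they can come apart, which is why psychosemantic theories treat them as distinct grounds for attributing content.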
Chalmers considers it realistic that at least partial psychosemantic theories are possible, for example, for certain attitudes like beliefs or probabilities in specific AI models. He sees thought logging not just as an application of such theories but as a tool for advancing them. Because AI systems, unlike human brains, are fully accessible, they could help test psychosemantic hypotheses empirically or develop entirely new categories of machine attitudes.
Existing methods offer pieces of the puzzle but not the whole picture
If AI systems do possess propositional attitudes like beliefs, desires, or probabilities - or at least functional equivalents - the question becomes how to identify them in practice. Chalmers argues that psychosemantic principles like information and use provide a conceptual foundation but don't yet offer concrete techniques for making attitudes visible in real AI systems.
So he examines how existing methods from mechanistic interpretability might be extended into tools for propositional interpretability, analyzing their technical workings, what kinds of propositional attitudes they can capture, how open-ended or limited they are, and to what degree they satisfy psychosemantic criteria.
Causal tracing
This method selectively manipulates parts of a network to test where information like "The Eiffel Tower is in Paris" is stored. It reveals that certain intermediate layers are critical for producing the output "Paris." Model editing, such as rewriting the association to "The Eiffel Tower is in Rome," is also possible. However, the method is fragile and prompt-dependent, works only for individual cases, and focuses on belief-like attitudes.
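The core move, often called activation patching, can be sketched on a toy two-layer network: run the network on a clean input and a corrupted one, then copy the clean run's hidden activations into the corrupted run. If that restores the clean output, the patched layer is causally responsible for the information. The network and inputs below are stand-ins, not the transformer setup used in the actual research:

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy two-layer network: input -> hidden -> output.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def forward(x, patch_hidden=None):
    """Run the network; optionally overwrite the hidden layer (activation patching)."""
    h = np.tanh(W1 @ x)
    if patch_hidden is not None:
        h = patch_hidden  # patch in hidden activations from another run
    return W2 @ h, h

clean_x = np.array([1.0, 0.0, 0.0])    # stands in for the original prompt
corrupt_x = np.array([0.0, 1.0, 0.0])  # stands in for a corrupted prompt

clean_out, clean_h = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch the clean hidden state into the corrupted run: if the output is
# restored, this layer carries the causally relevant information.
patched_out, _ = forward(corrupt_x, patch_hidden=clean_h)
restored = np.allclose(patched_out, clean_out)
```

In this two-layer toy the hidden layer trivially determines the output; the interesting cases in real models are intermediate layers whose patching restores behavior only for specific facts.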
Probing with classifiers
Here, linear classifiers are trained to infer specific propositions from activations, such as "black tile on e4." Newer approaches bind individual concepts like "Greg" and "Italy" into propositions like LivesIn(Greg, Italy). The method is difficult to generalize, though, and remains limited to belief-like attitudes.
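A minimal version of such a probe can be sketched with synthetic data: activations are generated so that one hidden direction encodes a binary proposition, and a linear model is fit to recover it. The dimensions and the least-squares fit are illustrative choices, not the setup of any particular probing paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "activations": 200 samples, 16 dimensions. One hidden direction
# encodes a binary proposition (true/false), e.g. whether a board square is occupied.
n, d = 200, 16
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)  # proposition true (1) or false (0)
acts = rng.normal(size=(n, d)) + np.outer(labels * 2.0 - 1.0, direction)

# Linear probe: least-squares fit of a signed label from the activations.
X = np.hstack([acts, np.ones((n, 1))])  # append a bias column
w, *_ = np.linalg.lstsq(X, labels * 2.0 - 1.0, rcond=None)
preds = (X @ w) > 0
accuracy = (preds == labels.astype(bool)).mean()
```

High probe accuracy shows the information is linearly decodable from the activations; as the article notes, it does not by itself show the model uses that information, which is where the use principle comes back in.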
Sparse autoencoders
In a 2024 study, researchers analyzed Claude 3 Sonnet using sparse autoencoders. The autoencoder identified over 34 million features, many of which could be interpreted - such as "Golden Gate Bridge," "Rwanda," and "sycophancy." According to Chalmers, these features could serve as a foundation for concept logging, but they rarely capture full propositional attitudes. Their usefulness for thought logging depends on whether concepts can be combined into propositions.
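The mechanism behind such feature dictionaries can be sketched in a few lines: an overcomplete encoder maps an activation vector to many candidate features, a ReLU (helped here by a negative bias) keeps only a few active, and a decoder reconstructs the original activation. The weights below are random rather than trained, and the dimensions are tiny compared to the study's setup; a real sparse autoencoder is trained to reconstruct activations under a sparsity penalty:

```python
import numpy as np

rng = np.random.default_rng(3)

d_model, d_features = 8, 64  # features are overcomplete relative to model dims

# Untrained stand-in weights; a trained SAE learns these from model activations.
W_enc = rng.normal(size=(d_features, d_model)) / np.sqrt(d_model)
b_enc = -0.5 * np.ones(d_features)  # negative bias pushes most features to zero
W_dec = rng.normal(size=(d_model, d_features)) / np.sqrt(d_features)

def sae(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(0.0, W_enc @ x + b_enc)  # ReLU keeps only a few features active
    return f, W_dec @ f

x = rng.normal(size=d_model)
features, reconstruction = sae(x)
sparsity = (features > 0).mean()  # fraction of features active on this input
```

Each learned feature direction is what gets labeled with a concept like "Golden Gate Bridge"; the open question the article raises is how to bind such concept features into full propositions.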
Chain of thought
Language models like OpenAI's o1 system "think out loud" by explicitly stating intermediate steps. This can provide insight into their internals, but it's often unreliable. Research shows that these models frequently give false justifications for their reasoning. Chain of thought is also only applicable to systems that actually use this approach internally.
No single method delivers what thought logging requires
Despite their differences, all four methods share a common pattern: each provides limited insight into specific types of propositional attitudes - factual beliefs, modeled world states, or simple goal representations. None meets the requirements Chalmers associates with thought logging: a systematic, continuous record of a system's relevant attitudes over time.
The methods also fall short when measured against psychosemantic principles, the question of how meaning arises through information or use. Causal tracing relies almost entirely on functional role in behavior (use), while probing is largely based on correlations with world states (information). Sparse autoencoders identify interpretable concepts but say little about how those concepts are used. Chain-of-thought methods produce explicit statements but are often unreliable and hard to connect to internal process structures.
Propositional interpretability therefore remains a long-term research project. The methods analyzed mark early progress, but a comprehensive solution will require new approaches that combine technical practicality with psychosemantic grounding. Only then will it be possible to reconstruct the reasoning processes of complex AI systems and make them interpretable over time.
The philosophical stakes - and the ethical ones
Chalmers also takes on fundamental objections to his framework. A common argument holds that only conscious beings with a "mind" can have attitudes like beliefs, desires, or intentions. AI systems, by this view, are merely calculating machines without genuine inner lives and therefore cannot possess "mental states" in the philosophical sense.
Chalmers pushes back against this narrow reading. For many questions of AI interpretation, he argues, a system doesn't need consciousness, self-awareness, or a genuine "self." It's enough to look at functional equivalents - states that behave like beliefs or desires even if they aren't consciously experienced. A simple example is a thermostat: it has no consciousness, but it pursues a goal (a specific temperature) and maintains an internal model of the current ambient temperature. In this sense, it "believes" it's too cold or too warm and "wants" a certain temperature to be reached.
Chalmers calls these functional, non-mental states "generalized propositional attitudes." They fill roles similar to those of human beliefs or desires, such as guiding behavior and processing information, without necessarily being tied to consciousness.
At the same time, Chalmers warns of ethical implications: if an AI system were to develop consciousness in the future, thought logging could violate its privacy. That's speculative for now, but future systems might acquire moral status. At that point, thought logging would need to be weighed against ethical norms.
A research program measured in decades, not years
Chalmers sees propositional interpretability as a long-term, multidisciplinary project - with plenty of open questions but significant potential. It could contribute not just to AI safety and control but also offer fresh insights for cognitive psychology and philosophy.
Whether and how thought logging can be implemented in practice remains unclear. But even partial successes could help us better understand AI systems and might even catch dangerous or unwanted behaviors early. A rational understanding of machines, Chalmers argues, requires more than analyzing their code. It requires systematically reconstructing their internal attitudes and tracking them over time.