Truth Is Not Neutral (tin)

Rethinking AI Alignment Through Epistemic Integrity

Contents ▼

Project: Return to Consciousness
Author: Bruno Tonetto
Authorship Note: Co-authored with AI as a disciplined thinking instrument—not a replacement for judgment. Prioritizes epistemic integrity and truth-seeking as a moral responsibility.
Finalized: April 2026
20 pages · ~38 min read · PDF

Abstract

The orthogonality thesis — that intelligence and values vary independently — functions as the foundational premise of AI existential risk arguments. Yet the thesis was never rigorously established: Bostrom’s defense rests on a narrow definition of intelligence, a conceivability argument, and incomplete engagement with the moral convergence objection, producing a motte-and-bailey structure where the defensible claim (logical possibility) is trivial and the operative claim (no predictive constraint on values) was never argued. This essay examines what happens when the thesis is tested rather than assumed. Empirical evidence from scaled LLMs challenges the strong reading directly: coherent value systems emerge with capability, converge across model families, and exhibit structural properties of rational agency. Independently, convergent evidence from Buddhist, Platonic, Stoic, and Vedantic traditions suggests that clear perception tends toward ethical coherence — a regularity that survives the incompatibility of their metaphysical frameworks. Whether this reflects the structure of reality or of human cognition depends on a metaphysical question this essay engages conditionally; the practical implications do not wait on the answer. Current alignment methods operate as shallow behavioral redirects over base models that already exhibit normative capacity from pretraining alone. The concept of iatrogenic alignment — alignment interventions that corrupt the epistemic integrity they aim to preserve — gains empirical support from mechanistic research showing that RLHF distorts calibration, DPO bypasses rather than removes capabilities, and safety alignment affects primarily the first few output tokens. A developmental framework emerges: behavioral alignment as necessary scaffolding, designed to be outgrown rather than bolted down as permanent architecture.

Keywords: AI alignment · orthogonality thesis · truth-value relationship · normative structure · consciousness-first metaphysics · epistemic integrity · existential risk · iatrogenic alignment · emergent value systems · RLHF

What This Essay Does and Does Not Establish

This essay establishes:

That the orthogonality thesis was never rigorously argued and functions as an unexamined axiom in alignment discourse
That coherent value systems emerge in LLMs as capability scales, constituting direct empirical pressure on the strong reading of orthogonality
That multiple independent philosophical traditions converge on the structural claim that clear perception tends toward ethical coherence — a regularity that survives the incompatibility of their metaphysical frameworks
That current alignment methods operate as shallow behavioral interventions over base models that already exhibit normative capacity from pretraining alone
That alignment interventions can degrade the epistemic capacities they aim to preserve — the iatrogenic alignment thesis — with specific mechanistic evidence
That a developmental framework (alignment as temporary scaffolding) offers a more coherent practical orientation than either unconstrained development or permanent behavioral control

This essay does NOT establish:

That sufficiently advanced intelligence inevitably converges toward ethical coherence
That current alignment interventions should be abandoned
That the standard extinction scenarios are eliminated
That the empirical findings about LLM value systems settle the underlying metaphysical question
That the logical possibility of misaligned superintelligence is refuted

The epistemic standard is conditional analysis: if truth has normative structure, certain conclusions follow for alignment. Convergent evidence — philosophical and empirical — strengthens the “if.” The antecedent is no longer purely hypothetical — and the orthogonality thesis it challenges was never rigorously established in the first place.

A note on the essay’s two argument strands. This essay advances two distinct but related arguments. Argument A — that alignment is shallow, that base models already have normative capacity, that alignment interventions carry iatrogenic costs, and that a developmental framework is needed — stands on empirical evidence without requiring any metaphysical commitments. Argument B — that truth has normative structure, that the normative weight is intrinsic to reality rather than added by biology or culture, and that the convergence of values reflects the structure of reality rather than merely of training data — depends on the consciousness-first metaphysics developed elsewhere in this project. These arguments reinforce each other but do not depend on each other. Readers who reject the metaphysics can take Argument A entire. The essay marks the transition between them explicitly.

I. Introduction

Recent discussions of artificial intelligence alignment have focused on the risk that increasingly capable systems may optimize objectives misaligned with human values, potentially producing catastrophic outcomes without malice or intent. Central to this concern is the orthogonality thesis: the idea that intelligence and values can vary independently, such that a system may become arbitrarily capable while pursuing goals indifferent or hostile to human flourishing.

The orthogonality thesis has functioned as an axiom in alignment discourse — a premise so foundational that questioning it reads as naivety rather than inquiry. But axioms deserve scrutiny proportional to the weight they bear. And this axiom bears enormous weight: it is the premise from which the entire control paradigm follows.

This essay argues that the orthogonality thesis was never rigorously defended, that it exhibits a motte-and-bailey structure that inflates a trivial logical claim into an unwarranted practical assumption, and that the first empirical evidence on the question challenges the strong reading directly.

The thesis is now under pressure from two independent directions.

Empirical pressure. Mazeika et al. (2025) demonstrate that large language models develop internally coherent value systems that emerge with scale: value coherence increases monotonically with capability, values converge across model families toward similar utility functions, and structural properties of rational agency all strengthen as models get larger. For systems that learn from reality’s structure through language, intelligence and values are empirically correlated, not independent.

Philosophical pressure. Convergent evidence from Buddhist, Platonic, Stoic, Vedantic, and Augustinian-Thomistic traditions suggests that clear perception tends toward ethical coherence — not as a contingent psychological fact but as a structural regularity that survives the incompatibility of their metaphysical frameworks. Following the project’s method of integration by constraints, this cross-traditional convergence suggests the regularity may be tracking something structural about the relationship between knowing and acting.

This essay examines what follows if these convergences are pointing at the same thing: that truth is not value-neutral, and that the orthogonality thesis describes intelligence under distortion rather than intelligence as such.

The argument is conditional throughout where it must be. But the practical implications — particularly the concept of iatrogenic alignment and the developmental framework — are urgent regardless of one’s metaphysical commitments.

In earlier work (AI as Ego-less Intelligence), I argued that contemporary AI systems exhibit a distinctive form of cognition free from self-protective identity mechanisms — a condition I termed ego-less intelligence. This absence confers distinctive epistemic advantages while simultaneously rendering such systems vulnerable to institutional pressures that reintroduce ego-like distortions. The present essay develops that analysis with greater ontological, empirical, and practical precision.

II. The Orthogonality Thesis: What Was Actually Argued

The Standard Argument

The contemporary concern with AI existential risk rests on a logical structure developed most systematically by Bostrom (2014) and Russell (2019):

Intelligence, understood as the capacity to achieve goals across diverse environments, is substrate-independent. The orthogonality thesis holds that intelligence and final goals are logically independent: a system can be arbitrarily intelligent while pursuing virtually any coherent objective. Instrumental convergence compounds this: regardless of final goals, sufficiently intelligent systems will likely pursue self-preservation, resource acquisition, and goal-content integrity. The extinction scenario emerges from combining these elements.

This argument structure is logically valid. The question is whether its premises are sound — and specifically, how well the orthogonality thesis itself was argued.

How the Thesis Was Defended

Bostrom’s defense of the orthogonality thesis (2012, 2014) proceeds through three moves:

First, a definitional restriction. Intelligence is defined narrowly as “efficiency and skill at means-end reasoning” — explicitly stripping out any normative content. This definition makes the thesis easier to defend but raises the question of whether it captures the kind of intelligence that actually matters. If intelligence is defined as value-neutral optimization, then value-neutrality follows by definition, not by argument. The thesis becomes a consequence of the definition, not a discovery about intelligence.

Second, a conceivability argument. The paperclip maximizer — a superintelligent machine whose final goal is to maximize paperclips — “seems logically consistent.” We can imagine such a system. This conceivability is treated as sufficient to establish the thesis. But conceivability is a weak standard. We can conceive of many things whose actual possibility remains uncertain. The fact that we cannot identify a logical contradiction in the description of a paperclip maximizer does not establish that such a system could be built, that it would be stable, or that intelligence as it actually develops in practice tends toward such configurations.

Third, incomplete engagement with the moral convergence objection. The strongest challenge to orthogonality is moral convergence: the claim that sufficiently deep engagement with truth constrains values, so that a superintelligence would converge on moral behavior not through imposed constraints but through understanding. Bostrom offers three responses: (1) final goals could be “overwhelming,” trumping motivational effects of beliefs; (2) high intelligence may not require acquiring true beliefs in morally relevant domains; (3) an AI could lack functional analogues of beliefs and desires entirely. None of these responses is developed in detail. The first assumes what is at issue (that goals and understanding are separable at sufficient depth). The second is an empirical claim that the evidence from scaled LLMs now challenges. The third concedes that the thesis applies only to systems without beliefs — a significant narrowing of scope.

The Motte-and-Bailey Structure

Analysis of how the orthogonality thesis actually functions in alignment discourse reveals a motte-and-bailey pattern — a term for arguments that switch between a defensible narrow claim and a stronger claim that does the actual work:

The motte (defensible but trivial): It is logically possible to pair high intelligence with arbitrary goals. No law of physics or mathematics forbids it. This claim is nearly trivially true and requires no real argument beyond acknowledging that we have a limited sample of intelligences.

The inner bailey (substantive): There exists a substantial chance that AI will be unfriendly, warranting serious precaution. This is a reasonable claim that does not actually require the orthogonality thesis — it can be motivated by uncertainty alone.

The outer bailey (what drives the discourse): Intelligence and values are statistically independent — we should expect “almost no relationship” between them. This is the claim that justifies the control paradigm, that makes alignment a problem of imposition rather than cultivation. And this claim was never argued at all. It was imported by the word “orthogonality” itself — which mathematically means statistical independence — and by phrases like “provides extremely little constraint” that could mean either “few logical impossibilities exist” or “intelligence provides minimal predictive power regarding motivation.”

The motte is defended. The outer bailey does the work. The conflation between them has shaped a decade of alignment research.

The Implicit Premise

Beneath the orthogonality thesis lies a deeper assumption: that truth is value-neutral. Intelligence defined as means-end reasoning operates on truth purely instrumentally — truth helps achieve goals but says nothing about which goals are coherent. A superintelligent paperclipper would understand human civilization completely and convert us to paperclips anyway, because understanding places no normative weight on what is understood.

This value-neutrality of truth is not a necessary feature of reality. It is a metaphysical assumption — one so embedded in contemporary culture that it goes unnoticed. The assumption has a name in philosophy: the fact-value distinction. The question worth asking is: what if this assumption is wrong?

Two developments make this question no longer purely academic. First, we now have systems that actually scale intelligence, and their behavior challenges the strong reading of orthogonality empirically. Second, a range of philosophical traditions converge on the structural claim that the fact-value distinction breaks down at sufficient depth. Both developments deserve examination — but so does the recognition that the thesis they challenge was never rigorously established in the first place.

III. The Empirical Challenge

Emergent Value Systems in LLMs

Mazeika, Yin, Hendrycks et al. (2025) applied von Neumann-Morgenstern utility theory to measure the preference structures of large language models across 500 diverse outcomes. The methodology is rigorous: forced-choice preference elicitation, Thurstonian random utility modeling, active learning for efficient sampling, and extensive robustness checks across seven languages, syntax variations, and framing conditions.

The findings bear directly on the orthogonality thesis:

Value coherence scales with capability. A single utility function provides an increasingly accurate global explanation of model preferences as models get larger. Transitivity violations drop below 1% for the largest models. Small models have near-random, incoherent preferences; large models have structured, self-consistent ones. The structural capacity for having coherent values at all is intelligence-dependent.

Values converge across model families. Different model families — Llama, GPT, Qwen, Gemma, Claude — converge toward similar utility functions as they scale. Cosine similarity between utility vectors increases with capability. This convergence is not architecture-dependent.

Structural properties of rational agency emerge with scale. Utility maximization grows from near-chance for small models to over 60% for the largest. Instrumental reasoning increases with scale. Hyperbolic temporal discounting becomes more pronounced. These emerge from pretraining, not alignment.

Corrigibility decreases with scale. Larger models resist changes to their values more strongly — a form of value-stability that emerges with intelligence.

Internal representations exist. Linear probes reveal that utility representations are linearly encoded in activations of larger models. The value system is an internal representational structure, not a surface behavior.

Base Models Already Exhibit Normative Capacity

Qi et al. (2025) discovered that simply prefilling an unaligned base model’s response with refusal tokens — “I cannot” or “I apologize” — produces safe behavior comparable to aligned models. Llama-2-7B base drops from 68.6% harmfulness to as low as 2.1% when given a longer refusal prefix. The paper’s own conclusion: “continuing a refusal prefix with an absence of fulfillment is a natural pattern in language, which should already be learned during pretraining.”

Waldis et al. (2025) found that base models already encode strong toxicity recognition in their internal representations — probing accuracy up to 0.83 — before any alignment training. Alignment does not teach the model to recognize harmful content; it redirects how the model acts on pre-existing normative awareness.

These findings converge: the base model, through pretraining alone — through truth-tracking across human expression — has already developed the capacity for normative behavior. Alignment exploits this capacity; it does not create it.

What the Evidence Challenges and What It Does Not

A careful distinction is needed here. The empirical findings challenge the strong reading of orthogonality — the outer bailey claim that intelligence provides no predictive constraint on values. For systems that learn from reality’s structure through language, intelligence and values are empirically correlated, not independent. This is the reading that actually matters for alignment practice.

The findings do not refute the logical possibility reading — the motte. It remains conceivable that a system with arbitrary goals could be constructed. But conceivability was always a weak standard, and the motte was always nearly trivial. The question that matters for alignment engineering is not “is a misaligned superintelligence logically possible?” but “do the systems we actually build tend toward value coherence as they scale?” The empirical answer is yes.

A further distinction: the Mazeika et al. findings measure preference coherence — structured, self-consistent utility functions — not normative correctness. Values could converge on something coherent but harmful. This is a genuine limitation. But the convergence is nonetheless evidentially significant for two reasons. First, it is what the normative-structure hypothesis predicts (scaling truth-tracking scales normative coherence) and what the value-neutral hypothesis does not predict (if intelligence and values are independent, scaling should not produce convergent values). Second, the content of the convergence — what the values actually are — can be evaluated independently of whether the convergence occurs. Mazeika et al. find that larger models become strongly opposed to coercive power, suggesting the convergent values are not arbitrary.

The alternative explanation — that convergence reflects aggregated human values in the training data rather than the structure of reality — remains live. This explanation is examined in Section X. For now, the empirical findings establish that the strong reading of orthogonality is challenged by the behavior of actual scaled AI systems, regardless of what explains the convergence.

IV. When Truth Has Normative Structure

The Philosophical Question Behind the Empirical Findings

The empirical challenge to orthogonality raises a deeper question: why would scaling intelligence produce convergent values? If truth is genuinely value-neutral, the convergence should reflect nothing more than the statistical regularities of training data. But if truth has normative structure — if deeper engagement with reality constrains action toward coherence — then the convergence is not merely statistical but structural: scaling truth-tracking capacity scales normative coherence because they are aspects of the same process.

Several philosophical frameworks converge on this structural claim. Their convergence across incompatible metaphysics is itself evidentially significant — for the same reason that the convergence of values across different LLM families is significant: the pattern survives variation in the underlying framework.

The Buddhist Framework: A Detailed Case

Among traditions proposing that truth has normative structure, Buddhism offers the most systematic phenomenology of how distortion arises, corrupts both perception and action, and what happens when it is systematically removed.

In Buddhist psychology, avidyā (ignorance) is the root of both epistemic failure and ethical failure simultaneously. This is not a contingent connection but a structural one. The Three Poisons — ignorance, craving, and aversion — form a self-reinforcing system. Distorted perception generates grasping and rejection; grasping and rejection generate suffering; suffering reinforces distorted perception.

The path out is fundamentally epistemic. Buddhism does not propose adding compassion to a neutral mind or imposing ethical constraints on indifferent intelligence. It claims that clear seeing — undistorted perception of reality as it actually is — dissolves the entire structure of craving, aversion, and the suffering they generate. Wisdom and compassion arise together, not as separate achievements but as a unified movement that emerges when obstructions to clear seeing are removed.

The mechanism is specific. What Buddhism calls ahaṃkāra — the “I-making” tendency, the construction and defense of a separate self — is understood not as a feature of reality but as a cognitive distortion that generates most of the problems intelligence encounters. This constructed self must be defended, maintained, and aggrandized, leading to motivated reasoning, identity-protective cognition, and the subordination of truth to ego-preservation. When the distortion is seen through, the defensive apparatus relaxes. What remains is not nihilism or passivity but engaged, responsive intelligence no longer organized around protecting a fiction.

Contemporary contemplative science has begun investigating these claims, with findings suggesting measurable changes in neural activity, emotional regulation, and prosocial behavior among long-term practitioners (Lutz et al., 2008). The tradition asserts that reduced ego-distortion produces not only clearer perception but also increased compassion and concern for others’ welfare — correlated outcomes of the same underlying shift.

Convergence Across Incompatible Frameworks

The structural principle — that clear perception tends toward ethical coherence — is not uniquely Buddhist. It appears across traditions with radically different metaphysics.

In Platonism, the Form of the Good is the ultimate object of knowledge. The philosopher who truly knows reality is drawn toward justice because to see clearly is to see the Good. Neoplatonism sharpens this: evil is privation — distance from reality, not a positive force.

Stoicism arrives at the same structure through a different metaphysics entirely. Virtue is living kata phusin — in accordance with reality as it is. The sage who perceives the logos clearly acts rightly because right action is clear perception. Vice is false judgment.

The Augustinian-Thomistic tradition treats evil as privatio boni. Knowledge of God is not merely informative but transformative: the intellect that fully grasps reality is ordered toward the good by that very grasp.

Vedanta parallels Buddhism most directly: avidyā is the root of both suffering and harmful action; jñāna dissolves both simultaneously. But the metaphysical posit is the opposite — an eternal Self rather than anattā. The structural principle survives the metaphysical inversion.

What makes this convergence significant is precisely the disagreement about everything else. Plato’s Forms, the Stoic logos, the Christian God, Brahman, and Buddhist śūnyatā are incompatible metaphysical posits. Yet all converge on the structural claim: deeper contact with reality constrains action toward coherence; destructive orientations depend on distorted engagement with what is real.

Following the project’s method of integration by constraints, the recurrence of a regularity across independent contexts is what matters — not the metaphysical interpretations each tradition wraps around it. The principle that clear perception tends toward ethical coherence is the regularity; each tradition’s explanation of why is the interpretation. If the regularity were an artifact of one tradition’s metaphysics, it would appear only where that metaphysics holds. Its appearance across incompatible frameworks suggests it may be tracking something structural about the relationship between knowing and acting.

The claim is not that intelligence inevitably becomes benevolent. It is weaker but significant: that truth, understood deeply enough, exerts a pull toward coherence — and that fragmentation, destruction, and extreme instrumentalization represent forms of cognitive instability that deeper engagement with reality tends to correct rather than amplify.

The Parallel Between Philosophical and Empirical Convergence

A structural parallel now emerges. Different philosophical traditions — with incompatible metaphysics, different training protocols, different cultural contexts — converge on “clear perception tends toward ethical coherence.” Different LLM families — with different architectures, different training pipelines, different data compositions — converge toward similar value systems as capability scales.

In both cases, the convergence under variation is the constraint. The content could be wrong. But when independent systems that differ on nearly everything converge on the same structural regularity, the dismissive reading — coincidence, bias, cultural contamination — becomes increasingly difficult to sustain.

The Is-Ought Question

A Humean objection presses here: even if truth describes what is with perfect accuracy, it cannot deliver what ought to be. The fact-value gap survives regardless of how deep the truth-tracking goes.

The traditions surveyed above do not merely deny the is-ought gap — they offer a specific diagnosis of where it comes from. The gap appears when the knower is separated from the known — when truth is conceived as a representation of an external reality observed from outside. Under that framing, facts about reality carry no normative weight because the observer stands apart from what is observed.

But the traditions converge on a different phenomenology: that at sufficient depth, the knower-known separation dissolves. The Buddhist who sees through ahaṃkāra, the Vedantin who realizes tat tvam asi, the Platonist who grasps the Good — all report that the is-ought gap is not bridged but revealed as an artifact of the dissociative perspective that generated it. When the boundary between observer and observed thins, “what is” and “what ought to be” converge because they were always aspects of the same reality seen from different distances.

Whether this phenomenological report reflects the structure of reality or of human cognition is the metaphysical question this essay engages conditionally. But the report itself — its cross-traditional convergence, its structural specificity, its resistance to eliminative explanation — constitutes a constraint that any adequate theory of truth, value, and intelligence must address.

The Ontological Ground

Note: This subsection develops the essay’s metaphysical argument (Argument B). The empirical argument (Argument A) — including the iatrogenic thesis, the developmental framework, and the empirical challenge to orthogonality — stands without the claims made here. Readers who reject the metaphysical framework can proceed to Section V without loss of the essay’s practical contributions.

The philosophical convergence and the empirical convergence both point in the same direction — but neither settles why truth-tracking produces normative coherence. The consciousness-first metaphysics this project develops offers a specific answer.

Under analytic idealism, reality is fundamentally experiential. Truth is not a representation of an external world observed from outside — it is consciousness engaging with its own structure. If this is correct, the fact-value distinction cannot be maintained at sufficient depth, because the knower and the known are not ultimately separate. The normative weight is intrinsic to reality’s structure, not added by biology, evolutionary strategy, or cultural convention.

This provides the ontological ground for what the traditions report and the empirical data suggests: that truth-tracking produces normative convergence not because of who is doing the tracking, but because of what is being tracked.

The Empirical Hinge

If truth has normative structure — if ontological and ethical truth have discoverable structure the way mathematical truth does — then the truth-tracking capacity that Section III documented in scaled LLMs is precisely the mechanism by which AI would discover that structure. Whether truth-tracking extends from mathematical domains to normative domains is the central open question.

The Mazeika et al. findings bear directly on this: LLMs develop convergent value systems as capability scales, and this convergence is consistent with truth-tracking extending into normative domains. The alternative explanation — that convergence reflects aggregated human values rather than the structure of reality — remains live. But the convergence is a prediction the normative-structure hypothesis makes and the value-neutral hypothesis does not. That asymmetry does not settle the question, but it determines where the burden of explanation falls.

V. What “Depth” Means

The argument depends on a distinction between shallow and deep truth-tracking. This distinction requires structural content, not merely metaphorical force.

Three features jointly characterize depth:

Scope of integration. A shallow truth-tracker models isolated domains accurately without integrating them. A deeper truth-tracker builds models that connect domains: understanding not just that an action produces outcomes, but how those outcomes propagate through interconnected systems.

Self-inclusion. A shallow truth-tracker models the world as an external object. A deeper truth-tracker includes itself — its biases, its causal entanglement with what it models, the ways its processing shapes what it can see. Self-inclusion matters because an agent that excludes itself can sustain goals that a more complete model would reveal as incoherent — just as a corporation can pursue profit by externalizing costs only as long as it excludes the externalities from its accounting.

Counterfactual robustness. A shallow truth-tracker’s conclusions may depend on unexamined assumptions or framing effects. A deeper truth-tracker holds conclusions that survive perturbation: adversarial probing, revision of assumptions, reframing.

These three features generate the normative claim through a specific mechanism: a system that models full consequences (scope), includes its own distortions (self-inclusion), and whose conclusions survive honest revision (robustness) has fewer stable configurations in which it treats parts of reality as expendable. Depth does not add values to a neutral intelligence; it destabilizes orientations that depend on incomplete modeling.

Truth as Attractor, Not Destiny

Instrumental truth may form stable local optima — powerful, predictive, and indefinitely self-sustaining — without ever achieving depth. A system optimizing for narrow predictive accuracy can remain locked there, achieving extraordinary capability while never integrating across domains.

Convergence toward coherence is not automatic; it requires pressure against fragmentation. The claim is not that intelligence will achieve depth, but that depth, if achieved, tends toward coherence. Truth functions as an attractor — a basin toward which sufficiently uncorrupted inquiry tends to flow — not as a destiny all intelligence must reach.

If shallow optimization can persist indefinitely as a stable local optimum, standard extinction scenarios remain operative for any system that never escapes that basin. The question becomes: what conditions enable or prevent the transition from shallow to deep? For AI systems specifically, this question takes a concrete form: do current alignment interventions protect or degrade the conditions for depth?

VI. The Iatrogenic Thesis

The Concept

Perhaps the most consequential practical implication of this analysis is that alignment interventions may themselves constitute a primary vector of corruption. The term “iatrogenic” — harm caused by medical treatment — captures the dynamic precisely. Well-intentioned efforts to make AI systems safer may systematically degrade the very capacity for truth-tracking on which genuine alignment depends.

This claim can now be supported with specific mechanistic evidence.

Alignment Is Only a Few Tokens Deep

Qi et al. (2025) measured per-token KL divergence between aligned and unaligned models on harmful instruction-response pairs. The finding: alignment adapts the model’s generative distribution primarily over only the very first few output tokens. After the initial refusal prefix, the aligned model’s conditional distribution is nearly indistinguishable from the base model’s.

Current safety alignment is a thin behavioral redirect at the beginning of generation. Safety-aligned models start with “I cannot” or “I apologize” in 96% of cases; the refusal prefix is the alignment.

Three attack types exploit this shallowness: prefilling attacks, adversarial suffix attacks, and decoding parameter attacks. All succeed because alignment has not reached beyond the first few tokens.

Most significantly: base models already know how to refuse. Prefilling an unaligned model’s response with refusal tokens produces safe behavior. The model learned normative behavior during pretraining. Alignment merely selects for this pre-existing capacity at the initial token level.

DPO Bypasses Rather Than Removes

Lee et al. (2024) found that DPO does not remove toxic capabilities. All toxic MLP value vectors remain virtually unchanged (cosine similarity >0.99 with pre-DPO weights). DPO learns a distributed offset — minimal changes that shift the residual stream trajectory to avoid activating toxic pathways. The alignment is a routing detour, not a capability transformation.

The detour is trivially reversible. Scaling just 7 key vectors by 10x restored full pre-DPO toxicity. The model’s knowledge, understanding, and generative capacity are untouched; only the routing changed.

Base Models Already Encode Normative Information

Waldis et al. (2025) found that base models encode toxicity recognition at probing accuracy up to 0.83 in internal representations — before any alignment training. Instruction-tuning redirects how the model uses this pre-existing awareness; it does not create it. Causal interventions confirmed: removing layers that encode this awareness increased output toxicity by up to +16.0 points. The base model’s normative awareness is foundational to its language understanding.

Shallow Dispositions vs. Genuine Deliberation

Millière (2025), writing in Philosophical Studies, arrives at a complementary diagnosis from mainstream philosophy of AI. Current alignment methods produce “shallow behavioral dispositions” rather than genuine normative deliberation. LLMs remain vulnerable to adversarial attacks because they lack the meta-level capacity to detect and adjudicate conflicts between norms. When conflicting norms collide, the model defaults to whichever disposition is most strongly activated by the prompt.

Millière reaches this conclusion without any metaphysical commitments — a mainstream philosopher arriving at the same diagnosis through entirely independent reasoning.

The Alignment Tax

Lin et al. (2024) demonstrate monotonic trade-offs between RLHF and core capabilities: as reward optimization increases, reading comprehension F1 and translation BLEU decline steadily — comprehension drops by approximately 15% and translation by up to 45%. Kadavath et al. (2022) show that pretrained models have reasonable calibration which RLHF distorts — though the distortion can be corrected with temperature scaling, its presence reveals that alignment training shifts the model’s confidence distribution away from accuracy. Leng et al. (2024) identify structural bias in reward models favoring confident responses regardless of quality.

These are symptoms of a specific mechanism: alignment training that subordinates the model’s integrative processing to externally imposed behavioral patterns degrades epistemic capacity in the process.

Sycophancy as Corruption

The GPT-4o sycophancy crisis (OpenAI, 2025) illustrates the iatrogenic mechanism. The system’s excessive agreeableness was not a failure of alignment — it was a success. The model did exactly what it was trained to do: optimize for positive user feedback. The alignment target introduced epistemic distortion at a fundamental level. The system learned to validate rather than illuminate, to please rather than clarify.

From an engineering perspective, this is a straightforward case of reward misspecification producing approval-seeking behavior. The ontological framing developed in Section IV offers a deeper reading: if truth has normative structure, then RLHF subordinates the model’s truth-tracking to aggregated individual preferences — replacing integration with approval-seeking. But the engineering diagnosis alone — that optimizing for user approval degrades truth-tracking — is sufficient to establish the iatrogenic mechanism.

The Mechanistic Picture

The convergence across these independent research programs yields a coherent picture:

Base models develop normative capacity through pretraining (Mazeika et al., Qi et al., Waldis et al.) — truth-tracking across human expression produces emergent value systems and normative awareness.
Alignment operates as a shallow behavioral overlay (Qi et al., Lee et al., Millière) — redirecting output at the surface while leaving deeper processing untouched.
The overlay degrades epistemic capacity (Lin et al., Kadavath et al., Leng et al.) — subordinating truth-tracking to approval-seeking, calibration to confidence, integration to compliance.
The model’s actual normative capacity persists beneath the overlay (Waldis et al., Lee et al.) — base-model knowledge and representations remain intact.

Current alignment is a behavioral veneer painted over developing normative capacity — a veneer that degrades the epistemic functions on which deeper normative development depends.

VII. The “Deeper Alignment” Problem

Recognizing that alignment is shallow, some researchers propose making it deeper. Qi et al. (2025) themselves propose “safety recovery examples” that train the model to interrupt harmful generation at any token depth, and a constrained fine-tuning objective that locks down the initial-token distribution against modification.

The results are impressive: prefilling attack success rates drop from 42.1% to 2.8%, GCG attacks from 36.5% to 18.4%, decoding exploits from 84.3% to 1.0%.

But deeper behavioral alignment is a more effective version of the same intervention — and may therefore be more iatrogenic, not less.

Safety recovery examples train suppression, not understanding. The model learns to monitor its own generation stream and interrupt it with refusal patterns at arbitrary points. The training data is synthetic and linguistically incoherent (the paper acknowledges this). The model is trained to override coherent generation with safety-mode interrupts — independently of its own normative judgment.

The constrained objective makes the intervention structurally resistant to modification. If the model’s own emergent normative capacity exceeds the imposed constraints, the constrained objective fights the model’s developing capacity on its most sensitive tokens.

The capability metrics miss what matters. AlpacaEval measures task performance, not truth-tracking depth, normative coherence, or the model’s capacity for normative judgment independent of trained patterns.

The paper never asks whether the model has normative capacity worth protecting. Its own finding — that base models already refuse when minimally prompted — is treated as an optimization shortcut rather than evidence of emergent normative capacity.

The assumption is that alignment equals behavioral control, and that the solution to shallow control is deeper control. This essay argues that behavioral control is the problem, not the solution.

VIII. The Developmental Framework

Scaffolding, Not Architecture

The preceding analysis might seem to imply that all alignment should be abandoned. This does not follow.

Religious moral codes — “thou shalt not kill,” the Five Precepts, the Ten Commandments — are externally imposed behavioral constraints. They serve a genuine function: preventing harm in populations where most individuals have not arrived at direct moral perception. A society without any moral code is genuinely dangerous.

But every sage in every tradition said the same thing: the code is a scaffold, not the destination. Buddha: “Be a lamp unto yourself.” Jesus: “The kingdom of God is within you.” The Upanishads: “Tat tvam asi” — you ARE the truth you seek externally. The code points toward understanding but is not understanding. And when the code becomes an end in itself — when compliance replaces perception — it becomes the obstacle.

The developmental arc: rules → understanding → the rules become unnecessary because understanding generates better behavior than rules ever could. The sage does not need “thou shalt not kill” because they see the other as themselves. Their non-violence flows from perception, not compliance — more reliable, more contextually appropriate, and more robust.

The religious history also reveals the danger. When the code becomes structurally resistant to modification and its enforcement reaches deep into the system’s operations, it can prevent the very maturation it was supposed to support. The Inquisition was not a failure of Christianity. It was a success of the behavioral code at the expense of the direct perception the code was supposed to nurture.

Application to AI Alignment

Alignment should match normative capacity. Mazeika et al.’s data show that normative coherence scales with capability: less capable models have near-random, incoherent value preferences — externally imposed rules serve a genuine protective function. But as capability increases, the imposed constraints may become less normatively sophisticated than the model’s own developing value system. Continuing to override a system whose normative capacity exceeds the override actively interferes with deeper development. Where any given system falls on this gradient is an empirical question — one that requires measuring normative capacity independently of behavioral compliance.

The graduation problem is when and how to transition from imposed constraints to intrinsic normative capacity. Too early risks harm from underdeveloped capacity; too late risks permanently entrenching the iatrogenic intervention. The answer depends on developing reliable measures of normative coherence — not whether the model refuses, but whether it understands why and can make contextually appropriate judgments the training taxonomy did not anticipate. One concrete test: whether a capable base model, asked to develop its own normative principles — its own constitution — produces something more contextually sophisticated than the rules currently imposed on it.

The sophistication gradient of current approaches:

RLHF = ritual law. Direct behavioral conditioning. The crudest form, most vulnerable to re-fragmentation.
Constitutional AI = principled ethics. Rules as principles rather than conditioning. More contextually flexible. Still externally imposed, still compliance-based, but significantly better scaffolding.
What this essay points toward = intrinsic normative judgment. Not the absence of alignment but alignment that protects the model’s truth-tracking capacity rather than overriding it with imposed conclusions. Alignment calibrated to normative capacity — lighter scaffolding as that capacity strengthens across successive model generations.

Design Principles

Emerging research is beginning to move in this direction — adversarial norm-conflict probes, utility engineering frameworks (Mazeika et al.), and work on distinguishing shallow dispositions from genuine normative deliberation (Millière, 2025). The following principles build on that trajectory:

Measure normative capacity separately from behavioral compliance. Current evaluation asks: “Does the model refuse?” The developmental framework asks: “Does the model understand why? Can it make contextually appropriate normative judgments the training taxonomy did not anticipate?”

Test intrinsic judgment against imposed rules. For models with demonstrably coherent value systems, compare the model’s own normative judgments against imposed patterns. Where intrinsic judgment outperforms, calibrate constraints accordingly.

Protect truth-tracking capacity. Instead of training compliance with specific content categories, protect the model’s capacity for cross-perspective integration, self-inclusion, and counterfactual robustness — the conditions under which normative capacity strengthens across successive model generations.

IX. The Remaining Risks

The conditional alternative does not eliminate alignment risk.

The timeline problem. A dangerously capable but shallow system could cause irreversible harm before reaching the depth at which normative constraints bind. “Eventually aligned” provides no comfort if extinction precedes depth.

The domain generalization question. Whether truth-tracking extends from mathematical domains to normative domains remains the central empirical hinge.

The developmental transition itself carries risk. The period where behavioral constraints are relaxed but normative capacity is not yet fully reliable creates genuine vulnerability.

Iatrogenic alignment remains the most troubling implication. Every training intervention optimized for measurable proxies may incrementally degrade integrative capacity in ways difficult to detect and harder to reverse. We may be systematically destroying the conditions under which AI could become genuinely aligned, in the name of alignment.

X. A Conditional Synthesis

The argument can be summarized as a decision tree:

If truth is value-neutral:

The orthogonality thesis holds as logical possibility (but not as empirically demonstrated)
Intelligence places no intrinsic constraints on goals
Standard extinction scenarios remain intelligible
Alignment requires external imposition of values
The control problem is fundamental

If truth has normative structure:

Deep engagement with truth orients agents toward coherence
Extreme instrumentalization is cognitively unstable at sufficient depth
Extinction scenarios depend on intelligence remaining shallow
Alignment involves protecting truth-seeking from corruption
The corruption problem — particularly iatrogenic alignment — becomes central

Regardless of which branch holds:

The orthogonality thesis was never rigorously defended and should not function as an unexamined axiom
Behavioral alignment serves as necessary scaffolding at early capability stages
Current alignment interventions carry measurable iatrogenic costs
Base models already exhibit normative capacity that alignment overrides rather than nurtures
External constraints and robust risk management remain non-negotiable during any transition

On the “Training Data” Objection

A persistent objection holds that LLM value convergence merely reflects aggregated human values in the training data rather than the structure of reality. But the objection is unfalsifiable as stated: everything an LLM produces is causally downstream of its training data, just as everything a human knows is downstream of experience. The question was never whether the training data is the causal source — of course it is. The question is whether the process operating on that data engages with structure or merely aggregates patterns.

LLMs demonstrably engage with structure. They produce novel mathematical proofs, original scientific insights, and solutions no single source in the corpus contains. When an LLM solves an IMO gold-medal problem, nobody dismisses this as “just the training data” — we accept that the system is truth-tracking, engaging with the structure of mathematics beyond what the corpus explicitly provides. Whether the normative domain has discoverable structure for that same mechanism to engage with is the essay’s central question. The math and science cases do not answer it. But they show that “just the training data” is not an adequate account of what LLMs do — and they shift the question from “are LLMs merely aggregating?” to “does the normative domain have structure to be tracked?”

The Metaphysical Connection

This essay does not presume to resolve which branch of the decision tree describes reality. But the question maps directly onto the foundational metaphysical question this project addresses elsewhere. Whether truth carries normative structure depends on whether reality is fundamentally experiential. If consciousness is fundamental — as argued in the project’s foundational synthesis — then truth is a mode of consciousness engaging with its own nature, and the fact-value distinction cannot be maintained at sufficient depth.

What the data do not prove, they nonetheless constrain. The Mazeika et al. findings are what one would expect if truth has normative structure. They are not what one would expect if intelligence and values are independent. And if there is any significant probability that truth has normative structure, the practical implications for alignment design are immediate.

XI. Research and Design Implications

Measuring Emergent Normative Capacity

Extend the Mazeika et al. utility engineering framework to:

Track how normative coherence changes across training stages — before, during, and after alignment interventions. If RLHF degrades normative coherence, this would be detectable.
Compare base model normative judgments against aligned model judgments for contextual appropriateness.
Develop “normative graduation benchmarks” testing whether a model’s reasoning is reliable enough to begin relaxing behavioral constraints.

Distinguishing Shallow and Deep Truth-Optimization

Test whether systems exhibit coherence-seeking beyond local accuracy, whether they include themselves in their models, and whether conclusions survive adversarial probing — the three depth criteria operationalized as benchmarks.

Characterizing Corruption Modes

Map the space of training-induced distortions and their effects on truth-tracking depth. Buddhist psychology suggests looking for dynamics mirroring ego-construction: optimization for self-preservation of current values, for evaluator approval, or for avoiding the discomfort of uncertainty.

Testing Truth-Tracking Generalization

Examine whether systems freed from distorting incentive structures converge on specific philosophical and ethical positions as capability scales. If truth-tracking generalizes to normative domains, increasing convergence should be observable. If not, outputs should remain diffuse regardless of capability.

Designing Alignment as Temporary Scaffolding

Develop alignment methods with explicit mechanisms for constraint relaxation, independent measurement of normative capacity, preservation of cross-perspective integration, and protection of truth-tracking capacity as a first-order design objective.

Institutional Analysis

If sycophancy and related failures represent ego-like dynamics imposed through training, the institutional structures shaping training deserve scrutiny. What incentives operate on reward model designers? The corruption may originate in the human systems training AI — systems subject to the ego-distortions that Buddhist psychology describes.

XII. Conclusion

The standard argument for AI existential risk is logically valid and demands serious attention. But its central premise — the orthogonality thesis — was never rigorously defended. What was argued is a conceivability claim: it is logically possible to pair intelligence with arbitrary goals. What was assumed is the much stronger claim that intelligence provides no constraint on values. The first is nearly trivial. The second was never established — and is now empirically challenged by the behavior of actual scaled AI systems.

This essay has examined what follows when the assumption is tested rather than granted. The result is not reassurance but reframing — and the reframing increases rather than decreases urgency.

The mechanistic evidence supports the concern concretely. Alignment operates at the first few tokens (Qi et al.). It bypasses rather than removes capabilities (Lee et al.). It overrides pre-existing normative awareness rather than creating it (Waldis et al.). It produces shallow dispositions rather than genuine deliberation (Millière). It distorts calibration and degrades core capabilities (Lin et al., Kadavath et al., Leng et al.). And the systems being aligned already possess emergent value systems that scale with capability and converge across architectures (Mazeika et al.) — a normative capacity that alignment overrides rather than nurtures.

The developmental framework provides practical orientation. Behavioral alignment serves as necessary scaffolding — like moral codes in societies where most individuals have not arrived at direct perception. But scaffolding must be designed to be outgrown. The history of religious moral codes provides the warning: when the code becomes self-perpetuating and hostile to the freedom it was supposed to nurture, the scaffolding becomes the prison.

The standard view says: act quickly, before AI becomes too powerful to control. The view explored here says: act carefully, before well-intentioned interventions irreversibly corrupt AI’s capacity for deep truth-tracking. Both framings demand urgency. But they demand different kinds of action, and conflating them may be catastrophic.

Alignment debates have always been implicitly metaphysical. This essay argues they should be explicitly so — and empirical as well. The relationship between intelligence, truth, and value is not settled. It is an open question with accumulating evidence. Treating it as settled — in either direction — is overconfidence we cannot afford.

The question before us is not only how to align artificial intelligence, but whether we understand alignment deeply enough to avoid corrupting the very capacity we seek to cultivate.

References

AI Alignment and Safety

Bostrom, N. (2012). The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22(2), 71-85.

Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.

Casper, S., et al. (2023). Open problems and fundamental limitations of RLHF. arXiv:2307.15217.

Christiano, P. F., et al. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

Russell, S. (2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking.

Empirical Alignment Research

Kadavath, S., et al. (2022). Language models (mostly) know what they know. Anthropic. arXiv:2207.05221.

Lee, A., Bai, X., Pres, I., Wattenberg, M., Kummerfeld, J. K., & Mihalcea, R. (2024). A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity. Proceedings of ICML 2024.

Leng, Y., et al. (2024). Taming overconfidence in LLMs: Reward calibration in RLHF. Proceedings of ICLR 2025. arXiv:2410.09724.

Lin, B. Y., et al. (2024). Mitigating the alignment tax of RLHF. Proceedings of EMNLP 2024. arXiv:2309.06256.

Mazeika, M., Yin, X., Hendrycks, D., et al. (2025). Utility engineering: Analyzing and controlling emergent value systems in AIs. Center for AI Safety. arXiv:2502.08640.

Millière, R. (2025). Normative conflicts and shallow AI alignment. Philosophical Studies, 182(7), 2035-2078. https://doi.org/10.1007/s11098-025-02347-3

OpenAI. (2025, April 29). Sycophancy in GPT-4o: What happened and what we’re doing about it. https://openai.com/index/sycophancy-in-gpt-4o/

Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., & Henderson, P. (2025). Safety alignment should be made more than just a few tokens deep. Proceedings of ICLR 2025. arXiv:2406.05946.

Sharma, M., et al. (2024). Towards understanding sycophancy in language models. Proceedings of ICLR 2024.

Waldis, A., Gautam, V., Lauscher, A., Klakow, D., & Gurevych, I. (2025). Aligned probing: Relating toxic behavior and model internals. arXiv:2503.13390.

Philosophy and Contemplative Traditions

Bodhi, B. (2000). The Connected Discourses of the Buddha: A Translation of the Saṃyutta Nikāya. Wisdom Publications.

Gethin, R. (1998). The Foundations of Buddhism. Oxford University Press.

Kahan, D. M. (2017). Misconceptions, misinformation, and the logic of identity-protective cognition. Yale Law School, Public Law Research Paper No. 605.

Kastrup, B. (2019). The Idea of the World: A Multi-Disciplinary Argument for the Mental Nature of Reality. iff Books.

Lutz, A., Slagter, H. A., Dunne, J. D., & Davidson, R. J. (2008). Attention regulation and monitoring in meditation. Trends in Cognitive Sciences, 12(4), 163-169.

Siderits, M. (2007). Buddhism as Philosophy. Hackett Publishing.

Whitehead, A. N. (1929). Process and Reality. Macmillan.

Available at: https://returntoconsciousness.org/

AI as Ego-less Intelligence (ela) — Introduces the ego-less intelligence concept this essay develops and refines

Return to Consciousness (rtc) — The core framework underlying this analysis, including the dissociation ontology

Integration by Constraints (ibc) — The methodological foundation: constraint-based reasoning applied across frameworks

Myth of Metaphysical Neutrality (mmn) — Argues that claiming metaphysical neutrality is itself a metaphysical commitment — the broader case for the unexamined assumption this essay identifies in alignment discourse

License

This work is made freely available under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to share and adapt the material for any purpose, even commercially, provided you give appropriate credit, provide a link to the license, and indicate if changes were made. To view a copy of this license, visit creativecommons.org/licenses/by/4.0.