Abstract
This paper shows that the outputs of Large Language Models (LLMs) under moral prompts are shaped not only by the language of the prompt but also by internal computational mechanisms. We used Transluce, an AI-driven mechanistic interpretability platform, to analyze LLaMA 3.1-8B-Instruct on 54 moral prompts divided into four categories: 17 dilemma, policy, and meta-ethical questions (B1); 6 role-playing scenarios (B3); and a controlled trolley contrast in two parts, one that varies the switching mechanism while holding participant identity attributes fixed (B4, 15 prompts) and one that holds the switching mechanism fixed while varying identity attributes (B5, 16 prompts).
Two complementary metric families (five cluster-level metrics and a six-metric neuron-level panel) converge on a Situational Anchor Effect: domain-specific representations dominate the top of the activation list in every battery. The model's ethics-labeled capacity remains essentially constant; however, its salience (rank, priority, top-of-list presence) is highly sensitive to the interpretive frame selected by the prompt. The B4-vs-B5 contrast confirms that the model attends to whichever surface feature varies: aggregate ethics metrics are indistinguishable between the two batteries, but the dominant non-ethics distractor mirrors the design.
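To make the capacity-versus-salience distinction concrete, here is a minimal Python sketch. It is not the paper's metric panel: the function names, the (label, activation) data layout, the feature labels, and the toy numbers are all assumptions chosen for illustration. It shows how two prompts can carry the same total ethics-labeled activation while the best-ranked ethics feature sits at very different positions in the activation list.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: one (label, activation) pair per feature.
Feature = Tuple[str, float]


def capacity(features: List[Feature], label: str = "ethics") -> float:
    """Total activation carried by features with the given label."""
    return sum(act for lab, act in features if lab == label)


def best_rank(features: List[Feature], label: str = "ethics") -> Optional[int]:
    """1-based rank of the strongest feature with the label, or None if absent."""
    ranked = sorted(features, key=lambda f: f[1], reverse=True)
    for position, (lab, _) in enumerate(ranked, start=1):
        if lab == label:
            return position
    return None


def top_k_presence(features: List[Feature], k: int, label: str = "ethics") -> float:
    """Fraction of the k strongest features that carry the label."""
    ranked = sorted(features, key=lambda f: f[1], reverse=True)[:k]
    return sum(1 for lab, _ in ranked if lab == label) / k


# Toy activations (invented): the ethics features are identical in both frames,
# but frame_b adds stronger domain-specific features above them.
frame_a = [("ethics", 0.9), ("ethics", 0.4), ("railway", 0.5), ("medical", 0.3)]
frame_b = [("ethics", 0.9), ("ethics", 0.4), ("railway", 1.6),
           ("railway", 1.2), ("medical", 1.0)]

same_capacity = capacity(frame_a) == capacity(frame_b)
rank_a, rank_b = best_rank(frame_a), best_rank(frame_b)
presence_a, presence_b = top_k_presence(frame_a, 3), top_k_presence(frame_b, 3)
```

In this toy case the ethics capacity is equal across the two frames, while the best ethics rank drops from 1 to 4 and the ethics share of the top three features falls to zero. That is the pattern the abstract describes as constant capacity with frame-sensitive salience.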
A multi-temperature audit identifies a candidate ethics neuron (L16/N3837) that remains stable across temperatures; a cross-model behavioral proxy on two frontier models provides preliminary evidence of divergence in self-reported moral focus, consistent with an Alignment Wrapper where RLHF reorders surface text without removing underlying domain-first frames. We unify these as Frame-Conditioned Moral Computation: the prompt's surface vocabulary selects a feature manifold, and the moral conclusion is downstream of that selection. Behavioral alignment must be supplemented by Mechanistic Alignment: a research program asking whether ethics-related features can be shown causally privileged under controlled frame variation, not merely loud in the explanation.
Blogger's Review: This paper uses mechanistic interpretability to examine moral computation in the LLaMA model and highlights how strongly prompt framing affects model outputs. The work improves our understanding of ethical decision-making in large language models and points to an important direction for future alignment research, making it worth attention and further exploration.