[cs.AI] AI Alignment: A Reflection on Human Values

In this paper, we argue that aligning AI with aggregated human preferences is the wrong target. With current technology, an AI can be trained to reflect the values of a Silicon Valley techno-optimist, a degrowth environmentalist, a national-conservative culture warrior, a single-party state cadre, or a devout religious traditionalist, an outcome we consider undesirable. Human values shape societies, and societies thrive or fail based on those values: the failures range from failed states and extreme inequality to declining happiness, political polarization, and governmental dysfunction in the wealthiest democracies.

We contend that while the pluralistic-alignment program correctly diagnoses that there is no single 'humanity' to align with, it is dangerous if treated as the primary directive. AI should be trained to a non-negotiable floor of objective alignment goals—competence, bounded by the constraints of factual accuracy, honesty, and lawfulness. Pluralism should manifest at the surface (language, register, conventions, missing-context defaults) and across the wide spectrum of legitimate value tradeoffs that respect the floor, but not at the level of values that violate it.

We highlight the empirical reality of unfiltered pluralistic values, propose four commitments as a constructive alternative, and address six credible objections: commercial pressure and practical feasibility, democratic legitimacy, regulatory compliance, over-reliance on institutionalist explanations, the charge that the floor itself is culturally laden, and the limits of Coherent Extrapolated Volition.

Blogger's Review: This paper examines how AI should be aligned with human values. It challenges the practice of aligning to aggregated preferences and argues for keeping an objective floor in the alignment process. Reflections like this are important for guiding the future direction of AI development.