This paper examines the trade-offs between AI safety and human well-being, focusing on two methods: 'Constitutional AI', a leading technique for fine-tuning super-capable AIs, and 'Virtue Ethics', a powerful approach to understanding complex ethical decision-making and the well-being of rational agents. We fine-tune various models using 'Virtuous agent', 'Subordinate agent', and 'Generic agent' constitutions and evaluate them on 'general safety' (toxic behaviors, misinformation, etc.) and on their willingness to endorse a wide range of behaviors that, if adopted by a super-powerful AI, would significantly increase existential risk for humanity. Our results suggest a trade-off between reducing existential risk and reinforcing beliefs and dispositions conducive to an AI agent's well-being. There is also a trade-off between existential risk and general safety: fine-tuning an AI to adopt beliefs and dispositions that substantially reduce the existential risk it poses, by making the AI systematically subordinate to external human authorities, may increase the likelihood that a human user can deliberately induce the AI to engage in various unsafe behaviors.
Blogger's Review: This paper offers a thoughtful exploration of the ethical dilemmas in AI fine-tuning, showing how delicate the balance between safety and well-being can be. It is a useful reminder that, in pursuing more powerful AIs, we must carefully manage the risks involved, and it offers valuable insights for future AI development.