Research Papers

A curated list of key publications shaping the field of AI safety and alignment.

Foundation

Essential readings for building a strong foundation in the alignment problem.

AI Alignment: A Comprehensive Survey

Ji et al., 2024

Provides a comprehensive yet beginner-friendly review of alignment research topics.

AI Governance: A Research Agenda

Dafoe, 2018

Outlines key questions and challenges relating to AI governance and policy.

Concrete Problems in AI Safety

Amodei et al., 2016

Presents five concrete research problems in AI safety, including reward hacking, safe exploration, and robustness to distributional shift.

The Alignment Problem from a Deep Learning Perspective

Ngo et al., 2024

Discusses the challenges of aligning advanced AI systems trained within the deep learning paradigm with human values and intentions.

An Overview of Catastrophic AI Risks

Hendrycks et al., 2023

Provides an overview of the main sources of catastrophic AI risk: malicious use, AI races, organizational risks, and rogue AIs.

Unsolved Problems in ML Safety

Hendrycks et al., 2022

Identifies four key areas of unsolved problems in machine learning safety: robustness, monitoring, alignment, and systemic safety.

Technical

The engineering/mathematical side of AI safety and alignment.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Templeton et al., 2024

A major milestone in the mechanistic interpretability of large neural networks, using sparse autoencoders to extract interpretable features from a production language model.
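
The core machinery here is a sparse autoencoder trained on a model's internal activations. A minimal sketch in PyTorch, with illustrative dimensions and an L1 penalty weight that are assumptions, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """One-layer sparse autoencoder over model activations."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages each
    # activation vector to be explained by only a few active features.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

The L1 term is what pushes the learned dictionary toward sparse, and hence more interpretable, features.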

The Off-Switch Game

Hadfield-Menell et al., 2017

A game-theoretic analysis of AI self-preservation, showing that an agent uncertain about human preferences has an incentive to let itself be switched off.
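
The paper's central claim can be illustrated numerically. In this toy version, the robot's proposed action has unknown utility u to the human: acting yields u, switching off yields 0, and deferring lets a rational human keep only actions with u > 0. The Gaussian prior below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
# The robot's belief about the human's utility u for its proposed action.
u = rng.normal(loc=0.2, scale=1.0, size=100_000)

ev_act = u.mean()                      # act unilaterally: get u
ev_off = 0.0                           # switch itself off: get 0
ev_defer = np.maximum(u, 0.0).mean()   # defer: human allows u only if u > 0

print(f"E[act]   = {ev_act:.3f}")
print(f"E[off]   = {ev_off:.3f}")
print(f"E[defer] = {ev_defer:.3f}")    # highest of the three
```

Deferring weakly dominates because E[max(u, 0)] >= max(E[u], 0); uncertainty about human preferences is what gives the agent a reason to keep the off switch usable.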

Training Language Models to Follow Instructions with Human Feedback

Ouyang et al., 2022

Introduces InstructGPT, which applies reinforcement learning from human feedback (RLHF), the prevailing technique for aligning language models with human preferences.
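
The reward model at the heart of RLHF is fit to human pairwise comparisons. A minimal sketch of that preference loss (Bradley-Terry style) in PyTorch, with dummy reward scores standing in for a real model's outputs:

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor,
                    r_rejected: torch.Tensor) -> torch.Tensor:
    """Maximize P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)
    by minimizing the negative log-sigmoid of the reward margin."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative usage on a batch of three comparisons.
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
loss = preference_loss(r_chosen, r_rejected)
```

The fitted reward model then provides the training signal for a reinforcement learning step (PPO in the paper) over the language model's outputs.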

Weak-To-Strong Generalization

Burns et al., 2023

Proposes a research direction for aligning superhuman models: study whether strong models can generalize correctly from supervision by weaker ones, as an analogy for humans supervising superhuman AI.
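
A sketch of one objective from this setup: the strong student is trained on a weak supervisor's soft labels, with an auxiliary confidence term that lets it trust its own confident predictions over its teacher. The mixing weight alpha and the exact form are illustrative assumptions, not the paper's precise recipe:

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits: torch.Tensor,
                        weak_probs: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    """Cross-entropy against a mix of the weak supervisor's soft labels
    and the strong model's own hardened predictions, so the student can
    override a teacher it confidently disagrees with."""
    num_classes = strong_logits.size(-1)
    hard_self = F.one_hot(strong_logits.argmax(dim=-1), num_classes).float()
    target = (1.0 - alpha) * weak_probs + alpha * hard_self
    return F.cross_entropy(strong_logits, target)
```

The interesting empirical question the paper studies is how much of the strong model's latent capability such training recovers relative to training on ground-truth labels.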