Research Papers
A curated list of key publications shaping the field of AI safety and alignment.
Foundation
Essential readings for building a strong foundation in the alignment problem.
AI Alignment: A Comprehensive Survey
Ji et al. • 2024
Provides a comprehensive yet beginner-friendly review of alignment research.
AI Governance: A Research Agenda
Dafoe • 2018
Outlines key questions and challenges relating to AI governance and policy.
Concrete Problems in AI Safety
Amodei et al. • 2016
Presents five concrete open research problems in AI safety, including reward hacking, safe exploration, and robustness to distributional shift.
The Alignment Problem from a Deep Learning Perspective
Ngo et al. • 2024
Discusses the challenges of aligning advanced deep learning models with human values and intentions.
An Overview of Catastrophic AI Risks
Hendrycks et al. • 2023
Provides an overview of the main sources of catastrophic AI risks.
Unsolved Problems in ML Safety
Hendrycks et al. • 2022
Identifies four key areas of unsolved problems in machine learning safety.
Technical
The engineering/mathematical side of AI safety and alignment.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton et al. • 2024
A major milestone in the mechanistic interpretability of large neural networks: uses sparse autoencoders to extract interpretable features from a production-scale language model.
Training Language Models to Follow Instructions with Human Feedback
Ouyang et al. • 2022
Introduces InstructGPT, trained with RLHF, the prevailing technique for aligning language models with human intent.
Weak-to-Strong Generalization
Burns et al. • 2023
Opens a new research direction for alignment: studying how weak supervisors, standing in for humans, can elicit the capabilities of stronger, superhuman models.