(Coauthored with Ramana Kumar and cross-posted from the Alignment Forum.)
There are a few different classifications of safety problems, including the Specification, Robustness and Assurance (SRA) taxonomy and the Goodhart’s Law taxonomy. In SRA, the specification category is about defining the purpose of the system, i.e. specifying its incentives. Since incentive problems can be seen as manifestations of Goodhart’s Law, we explore how the specification category of the SRA taxonomy maps to the Goodhart taxonomy. The mapping is an attempt to integrate different breakdowns of the safety problem space into a coherent whole. We hope that a consistent classification of current safety problems will help develop solutions that are effective for entire classes of problems, including future problems that have not yet been identified.
The SRA taxonomy defines three different types of specifications of the agent’s objective: ideal (a perfect description of the wishes of the human designer)…
View original post 1,324 more words