TLDR: Models trained on a small amount of beneficial trait data improve across 53 diverse alignment and benefits evaluations, even when trained only on health-domain data. These improvements seem persistent: prompting models to misbehave is less successful, and we see early evidence of resistance to adversarial finetuning.
This is the first research release for our new AGI Benefits team, and we hope it’s a step towards models that robustly support humanity in realizing the upside of AGI.
Read more on the OpenAI alignment blog and in the paper.