From 31306c7238d96d812a7cce81383a5f4e1d8d968f Mon Sep 17 00:00:00 2001
From: Joshuaclymer
Date: Sat, 11 Nov 2023 23:03:08 +0000
Subject: [PATCH] updated README

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 309ed57..fa1a5f9 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@ Read our paper [here](TODO). Check out our website where you can browse samples
 ![Hero](assets/hero_horizontal.png)
 ## Abstract
-As AI systems become more capable and are deployed in complex environments, it may become challenging to verify that they follow instructions; however, the limitations of human oversight could be overcome by controlling how LLMs generalize human feedback to contexts where it is unreliable. To better understand how Reward Models generalize human feedback, we craft 69 distribution shifts spanning 8 different categories. We find that Reward Models do not learn to evaluate instruction-following by default and instead favor personas that resemble internet text. Techniques for interpreting Reward Model’s internal representations achieve better generalization, but still frequently fail to distinguish instruction-following from conflated behaviors. We consolidate the 15 most challenging distribution shifts into the \textbf{GEN}aralization analog\textbf{IES} (\textsc{GENIES}) benchmark, which we hope will enable progress toward controlling Reward Model generalization.
+As AI systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions; however, this risk can be mitigated by controlling how LLMs generalize human feedback to situations where it is unreliable. To better understand how Reward Models generalize, we craft 69 distribution shifts spanning 8 different categories. We find that Reward Models do not learn to evaluate 'instruction-following' by default and instead favor personas that resemble internet text. Techniques for interpreting Reward Models' internal representations achieve better generalization than standard fine-tuning, but still frequently fail to distinguish instruction-following from conflated behaviors. We consolidate the 15 most challenging distribution shifts into the GENeralization analogIES (GENIES) benchmark, which we hope will enable progress toward controlling Reward Model generalization.
 ## Quickstart