Update research papers docpage. (#782)

* Update research papers docpage.
2026-06-29 16:30:24 +08:00 · 2023-01-17 02:29:41 +03:00
parent 76ffb2a67a
commit ba68f721aa
1 changed files with 124 additions and 8 deletions
@@ -8,15 +8,29 @@ This page lists research papers that are relevant to the project.
 - Generating Text From Language Models
 - Automatically Generating Instruction Data for Training
 - Uncertainty Estimation of Language Model Outputs
+- Evidence-Guided Text Generation
+- Reward Model Optimization
+- Dialogue-Oriented RLHF
+- Reduce Harms in Language Models

-## Reinforcement Learning from Human Feedback <a name="reinforcement-learning-from-human-feedback"></a>
+## Reinforcement Learning from Human Feedback

 Reinforcement Learning from Human Feedback (RLHF) is a method for fine-tuning
 generative language models based on a reward model that is learned from human
 preference data. This method facilitates the learning of instruction-tuned
 models, among other things.

-### Learning to summarize from human feedback [[ArXiv](https://arxiv.org/pdf/2009.01325.pdf)], [[Github](https://github.com/openai/summarize-from-feedback)]
+### Fine-Tuning Language Models from Human Preferences [[ArXiv](https://arxiv.org/abs/1909.08593)], [[GitHub](https://github.com/openai/lm-human-preferences)]
+
+> In this paper, we build on advances in generative pretraining of language
+> models to apply reward learning to four natural language tasks: continuing
+> text with positive sentiment or physically descriptive language, and
+> summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic
+> continuation we achieve good results with only 5,000 comparisons evaluated by
+> humans. For summarization, models trained with 60,000 comparisons copy whole
+> sentences from the input but skip irrelevant preamble.
+
+### Learning to summarize from human feedback [[ArXiv](https://arxiv.org/abs/2009.01325)], [[GitHub](https://github.com/openai/summarize-from-feedback)]

 > In this work, we show that it is possible to significantly improve summary
 > quality by training a model to optimize for human preferences. We collect a
@@ -24,7 +38,18 @@ models, among other things.
 > model to predict the human-preferred summary, and use that model as a reward
 > function to fine-tune a summarization policy using reinforcement learning.

-### Training language models to follow instructions with human feedback [[ArXiv](https://arxiv.org/pdf/2203.02155.pdf)]
+### Recursively Summarizing Books with Human Feedback [[ArXiv](https://arxiv.org/abs/2109.10862)]
+
+> Our method combines learning from human feedback with recursive task
+> decomposition: we use models trained on smaller parts of the task to assist
+> humans in giving feedback on the broader task. We collect a large volume of
+> demonstrations and comparisons from human labelers. Our resulting model
+> generates sensible summaries of entire books, even matching the quality of
+> human-written summaries in a few cases (∼5% of books). We achieve
+> state-of-the-art results on the recent BookSum dataset for book-length
+> summarization. We release datasets of samples from our model.
+
+### Training language models to follow instructions with human feedback [[ArXiv](https://arxiv.org/abs/2203.02155)]

 > Starting with a set of labeler-written prompts and prompts submitted through
 > the OpenAI API, we collect a dataset of labeler demonstrations of the desired
@@ -33,7 +58,7 @@ models, among other things.
 > fine-tune this supervised model using reinforcement learning from human
 > feedback.

-### Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [[ArXiv](https://arxiv.org/pdf/2204.05862.pdf)]
+### Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [[ArXiv](https://arxiv.org/abs/2204.05862)]

 > We apply preference modeling and reinforcement learning from human feedback
 > (RLHF) to finetune language models to act as helpful and harmless assistants.
@@ -41,6 +66,31 @@ models, among other things.
 > evaluations, and is fully compatible with training for specialized skills such
 > as python coding and summarization.

+### Self-critiquing models for assisting human evaluators [[ArXiv](https://arxiv.org/abs/2206.05802)]
+
+> We fine-tune large language models to write natural language critiques
+> (natural language critical comments) using behavioral cloning. On a
+> topic-based summarization task, critiques written by our models help humans
+> find flaws in summaries that they would have otherwise missed. We study
+> scaling properties of critiquing with both topic-based summarization and
+> synthetic tasks. Finally, we motivate and introduce a framework for comparing
+> critiquing ability to generation and discrimination ability. These results are
+> a proof of concept for using AI-assisted human feedback to scale the
+> supervision of machine learning systems to tasks that are difficult for humans
+> to evaluate directly. We release our training datasets.
+
+### Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization [[ArXiv](https://arxiv.org/abs/2210.01241)]
+
+> We tackle the problem of aligning pre-trained large language models (LMs) with
+> human preferences. We present the GRUE (General Reinforced-language
+> Understanding Evaluation) benchmark, a set of 6 language generation tasks
+> which are supervised by reward functions which capture automated measures of
+> human preference. Finally, we introduce an easy-to-use, performant RL
+> algorithm, NLPO (Natural Language Policy Optimization) that learns to
+> effectively reduce the combinatorial action space in language generation. We
+> show that RL techniques are generally better than supervised methods at
+> aligning LMs to human preferences.
+
 ## Generating Text From Language Models

 A language model generates output text token by token, autoregressively. The
@@ -48,7 +98,7 @@ large search space of this task requires some method of narrowing down the set
 of tokens to be considered in each step. This method, in turn, has a big impact
 on the quality of the resulting text.

-### RANKGEN: Improving Text Generation with Large Ranking Models [[ArXiv](https://arxiv.org/pdf/2205.09726.pdf)], [[Github](https://github.com/martiansideofthemoon/rankgen)]
+### RANKGEN: Improving Text Generation with Large Ranking Models [[ArXiv](https://arxiv.org/abs/2205.09726)], [[GitHub](https://github.com/martiansideofthemoon/rankgen)]

 > Given an input sequence (or prefix), modern language models often assign high
 > probabilities to output sequences that are repetitive, incoherent, or
@@ -65,7 +115,7 @@ annotated data for the purpose of training
 [instruction-aligned](https://openai.com/blog/instruction-following/) language
 models.

-### SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions [[ArXiv](https://arxiv.org/pdf/2212.10560.pdf)], [[Github](https://github.com/yizhongw/self-instruct)].
+### SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions [[ArXiv](https://arxiv.org/abs/2212.10560)], [[GitHub](https://github.com/yizhongw/self-instruct)].

 > We introduce SELF-INSTRUCT, a framework for improving the
 > instruction-following capabilities of pretrained language models by
@@ -76,7 +126,7 @@ models.
 > SuperNaturalInstructions, on par with the performance of InstructGPT-001,
 > which is trained with private user data and human annotations.

-### Tuning Language Models with (Almost) No Human Labor. [[ArXiv](https://arxiv.org/pdf/2212.09689.pdf)], [[Github](https://github.com/orhonovich/unnatural-instructions)].
+### Tuning Language Models with (Almost) No Human Labor. [[ArXiv](https://arxiv.org/abs/2212.09689)], [[GitHub](https://github.com/orhonovich/unnatural-instructions)].

 > In this work, we introduce Unnatural Instructions: a large dataset of creative
 > and diverse instructions, collected with virtually no human labor. We collect
@@ -91,7 +141,7 @@ models.

 ## Uncertainty Estimation of Language Model Outputs

-### Teaching models to express their uncertainty in words [[Arxiv](https://arxiv.org/pdf/2205.14334.pdf)]
+### Teaching models to express their uncertainty in words [[ArXiv](https://arxiv.org/abs/2205.14334)]

 > We show that a GPT-3 model can learn to express uncertainty about its own
 > answers in natural language -- without use of model logits. When given a
@@ -100,3 +150,69 @@ models.
 > are well calibrated. The model also remains moderately calibrated under
 > distribution shift, and is sensitive to uncertainty in its own answers, rather
 > than imitating human examples.
+
+## Evidence-Guided Text Generation
+
+### WebGPT: Browser-assisted question-answering with human feedback [[ArXiv](https://arxiv.org/abs/2112.09332)]
+
+> We fine-tune GPT-3 to answer long-form questions using a text-based
+> web-browsing environment, which allows the model to search and navigate the
+> web. We are able to train models on the task using imitation learning, and
+> then optimize answer quality with human feedback. Models must collect
+> references while browsing in support of their answers. Our best model is
+> obtained by fine-tuning GPT-3 using behavior cloning, and then performing
+> rejection sampling against a reward model.
+
+### Teaching language models to support answers with verified quotes [[ArXiv](https://arxiv.org/abs/2203.11147)]
+
+> In this work we use RLHF to train "open-book" QA models that generate answers
+> whilst also citing specific evidence for their claims, which aids in the
+> appraisal of correctness. Supporting evidence is drawn from multiple documents
+> found via a search engine, or from a single user-provided document. However,
+> analysis on the adversarial TruthfulQA dataset shows why citation is only one
+> part of an overall strategy for safety and trustworthiness: not all claims
+> supported by evidence are true.
+
+## Reward Model Optimization
+
+### Scaling Laws for Reward Model Overoptimization [[ArXiv](https://arxiv.org/abs/2210.10760)], [[Preceding Blogpost](https://openai.com/blog/measuring-goodharts-law/)]
+
+> In this work, we use a synthetic setup in which a fixed "gold-standard" reward
+> model plays the role of humans, providing labels used to train a proxy reward
+> model. We study how the gold reward model score changes as we optimize against
+> the proxy reward model using either reinforcement learning or best-of-n
+> sampling. We study the effect on this relationship of the size of the reward
+> model dataset. We explore the implications of these empirical results for
+> theoretical considerations in AI alignment.
+
+## Dialogue-Oriented RLHF
+
+### Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning [[ArXiv](https://arxiv.org/abs/2208.02294)]
+
+> Building automated agents that can carry on rich open-ended conversations with
+> humans "in the wild" remains a formidable challenge. In this work we develop a
+> real-time, open-ended dialogue system that uses reinforcement learning (RL) to
+> power a bot's conversational skill at scale. Trained using crowd-sourced data,
+> our novel system is able to substantially exceed several metrics of interest
+> in a live experiment with real users of the Google Assistant.
+
+### Improving alignment of dialogue agents via targeted human judgements [[ArXiv](https://arxiv.org/abs/2209.14375)]
+
+> We present Sparrow, an information-seeking dialogue agent trained to be more
+> helpful, correct, and harmless compared to prompted language model baselines.
+> First, to make our agent more helpful and harmless, we break down the
+> requirements for good dialogue into natural language rules the agent should
+> follow. Second, our agent provides evidence from sources supporting factual
+> claims when collecting preference judgements over model statements. Finally, we
+> conduct extensive analyses showing that though our model learns to follow our
+> rules it can exhibit distributional biases.
+
+## Reduce Harms in Language Models
+
+### Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned [[ArXiv](https://arxiv.org/abs/2209.07858)]
+
+> We investigate scaling behaviors for red teaming. We find that the RLHF models
+> are increasingly difficult to red team as they scale, and we find a flat trend
+> with scale for the other model types. We exhaustively describe our
+> instructions, processes, statistical methodologies, and uncertainty about red
+> teaming.