Update research papers docpage. (#782)

* Update research papers docpage.
stalkermustang
2023-01-17 02:29:41 +03:00
committed by GitHub
parent 76ffb2a67a
commit ba68f721aa
+124 -8
@@ -8,15 +8,29 @@ This page lists research papers that are relevant to the project.
- Generating Text From Language Models
- Automatically Generating Instruction Data for Training
- Uncertainty Estimation of Language Model Outputs
- Evidence-Guided Text Generation
- Reward Model Optimization
- Dialogue-Oriented RLHF
- Reduce Harms in Language Models
## Reinforcement Learning from Human Feedback <a name="reinforcement-learning-from-human-feedback"></a>
## Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is a method for fine-tuning
generative language models based on a reward model that is learned from human
preference data. This method facilitates the learning of instruction-tuned
models, among other things.
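
As a rough illustration of how a reward model can be learned from preference
data, the sketch below computes a pairwise (Bradley-Terry style) loss for one
human comparison. This is one common formulation, not necessarily the exact
objective of every paper listed here; the function name and the scalar reward
inputs are assumptions made for the example.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred response wins.

    Both arguments are scalar scores a (hypothetical) reward model assigned
    to the preferred and the rejected response for the same prompt.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the reward model scores the preferred response further
above the rejected one, and equals `log(2)` when the two scores are tied.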
### Learning to summarize from human feedback [[ArXiv](https://arxiv.org/pdf/2009.01325.pdf)], [[Github](https://github.com/openai/summarize-from-feedback)]
### Fine-Tuning Language Models from Human Preferences [[ArXiv](https://arxiv.org/abs/1909.08593)], [[GitHub](https://github.com/openai/lm-human-preferences)]
> In this paper, we build on advances in generative pretraining of language
> models to apply reward learning to four natural language tasks: continuing
> text with positive sentiment or physically descriptive language, and
> summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic
> continuation we achieve good results with only 5,000 comparisons evaluated by
> humans. For summarization, models trained with 60,000 comparisons copy whole
> sentences from the input but skip irrelevant preamble.
### Learning to summarize from human feedback [[ArXiv](https://arxiv.org/abs/2009.01325)], [[GitHub](https://github.com/openai/summarize-from-feedback)]
> In this work, we show that it is possible to significantly improve summary
> quality by training a model to optimize for human preferences. We collect a
@@ -24,7 +38,18 @@ models, among other things.
> model to predict the human-preferred summary, and use that model as a reward
> function to fine-tune a summarization policy using reinforcement learning.
### Training language models to follow instructions with human feedback [[ArXiv](https://arxiv.org/pdf/2203.02155.pdf)]
### Recursively Summarizing Books with Human Feedback [[ArXiv](https://arxiv.org/abs/2109.10862)]
> Our method combines learning from human feedback with recursive task
> decomposition: we use models trained on smaller parts of the task to assist
> humans in giving feedback on the broader task. We collect a large volume of
> demonstrations and comparisons from human labelers. Our resulting model
> generates sensible summaries of entire books, even matching the quality of
> human-written summaries in a few cases (5% of books). We achieve
> state-of-the-art results on the recent BookSum dataset for book-length
> summarization. We release datasets of samples from our model.
### Training language models to follow instructions with human feedback [[ArXiv](https://arxiv.org/abs/2203.02155)]
> Starting with a set of labeler-written prompts and prompts submitted through
> the OpenAI API, we collect a dataset of labeler demonstrations of the desired
@@ -33,7 +58,7 @@ models, among other things.
> fine-tune this supervised model using reinforcement learning from human
> feedback.
### Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [[ArXiv](https://arxiv.org/pdf/2204.05862.pdf)]
### Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [[ArXiv](https://arxiv.org/abs/2204.05862)]
> We apply preference modeling and reinforcement learning from human feedback
> (RLHF) to finetune language models to act as helpful and harmless assistants.
@@ -41,6 +66,31 @@ models, among other things.
> evaluations, and is fully compatible with training for specialized skills such
> as python coding and summarization.
### Self-critiquing models for assisting human evaluators [[ArXiv](https://arxiv.org/abs/2206.05802)]
> We fine-tune large language models to write natural language critiques
> (natural language critical comments) using behavioral cloning. On a
> topic-based summarization task, critiques written by our models help humans
> find flaws in summaries that they would have otherwise missed. We study
> scaling properties of critiquing with both topic-based summarization and
> synthetic tasks. Finally, we motivate and introduce a framework for comparing
> critiquing ability to generation and discrimination ability. These results are
> a proof of concept for using AI-assisted human feedback to scale the
> supervision of machine learning systems to tasks that are difficult for humans
> to evaluate directly. We release our training datasets.
### Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization [[ArXiv](https://arxiv.org/abs/2210.01241)]
> We tackle the problem of aligning pre-trained large language models (LMs) with
> human preferences. We present the GRUE (General Reinforced-language
> Understanding Evaluation) benchmark, a set of 6 language generation tasks
> which are supervised by reward functions which capture automated measures of
> human preference. Finally, we introduce an easy-to-use, performant RL
> algorithm, NLPO (Natural Language Policy Optimization) that learns to
> effectively reduce the combinatorial action space in language generation. We
> show that RL techniques are generally better than supervised methods at
> aligning LMs to human preferences.
## Generating Text From Language Models
A language model generates output text token by token, autoregressively. The
@@ -48,7 +98,7 @@ large search space of this task requires some method of narrowing down the set
of tokens to be considered in each step. This method, in turn, has a big impact
on the quality of the resulting text.
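
One simple way to narrow down the candidate tokens at each step is top-k
sampling, sketched below with the standard library only. It is shown purely as
an example of such a method and is not the approach of any paper in this
section; the function name and the plain-list representation of logits are
assumptions.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

With `k=1` this reduces to greedy decoding; larger values of `k` trade
determinism for diversity.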
### RANKGEN: Improving Text Generation with Large Ranking Models [[ArXiv](https://arxiv.org/pdf/2205.09726.pdf)], [[Github](https://github.com/martiansideofthemoon/rankgen)]
### RANKGEN: Improving Text Generation with Large Ranking Models [[ArXiv](https://arxiv.org/abs/2205.09726)], [[GitHub](https://github.com/martiansideofthemoon/rankgen)]
> Given an input sequence (or prefix), modern language models often assign high
> probabilities to output sequences that are repetitive, incoherent, or
@@ -65,7 +115,7 @@ annotated data for the purpose of training
[instruction-aligned](https://openai.com/blog/instruction-following/) language
models.
### SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions [[ArXiv](https://arxiv.org/pdf/2212.10560.pdf)], [[Github](https://github.com/yizhongw/self-instruct)].
### SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions [[ArXiv](https://arxiv.org/abs/2212.10560)], [[GitHub](https://github.com/yizhongw/self-instruct)].
> We introduce SELF-INSTRUCT, a framework for improving the
> instruction-following capabilities of pretrained language models by
@@ -76,7 +126,7 @@ models.
> SuperNaturalInstructions, on par with the performance of InstructGPT-001,
> which is trained with private user data and human annotations.
### Tuning Language Models with (Almost) No Human Labor. [[ArXiv](https://arxiv.org/pdf/2212.09689.pdf)], [[Github](https://github.com/orhonovich/unnatural-instructions)].
### Tuning Language Models with (Almost) No Human Labor. [[ArXiv](https://arxiv.org/abs/2212.09689)], [[GitHub](https://github.com/orhonovich/unnatural-instructions)].
> In this work, we introduce Unnatural Instructions: a large dataset of creative
> and diverse instructions, collected with virtually no human labor. We collect
@@ -91,7 +141,7 @@ models.
## Uncertainty Estimation of Language Model Outputs
### Teaching models to express their uncertainty in words [[Arxiv](https://arxiv.org/pdf/2205.14334.pdf)]
### Teaching models to express their uncertainty in words [[ArXiv](https://arxiv.org/abs/2205.14334)]
> We show that a GPT-3 model can learn to express uncertainty about its own
> answers in natural language -- without use of model logits. When given a
@@ -100,3 +150,69 @@ models.
> are well calibrated. The model also remains moderately calibrated under
> distribution shift, and is sensitive to uncertainty in its own answers, rather
> than imitating human examples.
## Evidence-Guided Text Generation
### WebGPT: Browser-assisted question-answering with human feedback [[ArXiv](https://arxiv.org/abs/2112.09332)]
> We fine-tune GPT-3 to answer long-form questions using a text-based
> web-browsing environment, which allows the model to search and navigate the
> web. We are able to train models on the task using imitation learning, and
> then optimize answer quality with human feedback. Models must collect
> references while browsing in support of their answers. Our best model is
> obtained by fine-tuning GPT-3 using behavior cloning, and then performing
> rejection sampling against a reward model.
### Teaching language models to support answers with verified quotes [[ArXiv](https://arxiv.org/abs/2203.11147)]
> In this work we use RLHF to train "open-book" QA models that generate answers
> whilst also citing specific evidence for their claims, which aids in the
> appraisal of correctness. Supporting evidence is drawn from multiple documents
> found via a search engine, or from a single user-provided document. However,
> analysis on the adversarial TruthfulQA dataset shows why citation is only one
> part of an overall strategy for safety and trustworthiness: not all claims
> supported by evidence are true.
## Reward Model Optimization
### Scaling Laws for Reward Model Overoptimization [[ArXiv](https://arxiv.org/abs/2210.10760)], [[Preceding Blogpost](https://openai.com/blog/measuring-goodharts-law/)]
> In this work, we use a synthetic setup in which a fixed "gold-standard" reward
> model plays the role of humans, providing labels used to train a proxy reward
> model. We study how the gold reward model score changes as we optimize against
> the proxy reward model using either reinforcement learning or best-of-n
> sampling. We study the effect on this relationship of the size of the reward
> model dataset. We explore the implications of these empirical results for
> theoretical considerations in AI alignment.
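
The best-of-n sampling mentioned in the abstract can be sketched in a few
lines: draw n candidates and keep the one the proxy reward model scores
highest. The callables `generate` and `proxy_reward` are hypothetical
stand-ins for a language model sampler and a learned reward model.

```python
def best_of_n(generate, proxy_reward, n):
    """Return the candidate with the highest proxy reward among n samples."""
    # Draw n independent candidates from the (hypothetical) sampler.
    candidates = [generate() for _ in range(n)]
    # Keep the candidate the proxy reward model likes best.
    return max(candidates, key=proxy_reward)
```

Increasing `n` optimizes harder against the proxy reward, which is the regime
in which the paper studies how the gold reward score changes.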
## Dialogue-Oriented RLHF
### Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning [[ArXiv](https://arxiv.org/abs/2208.02294)]
> Building automated agents that can carry on rich open-ended conversations with
> humans "in the wild" remains a formidable challenge. In this work we develop a
> real-time, open-ended dialogue system that uses reinforcement learning (RL) to
> power a bot's conversational skill at scale. Trained using crowd-sourced data,
> our novel system is able to substantially exceed several metrics of interest
> in a live experiment with real users of the Google Assistant.
### Improving alignment of dialogue agents via targeted human judgements [[ArXiv](https://arxiv.org/abs/2209.14375)]
> We present Sparrow, an information-seeking dialogue agent trained to be more
> helpful, correct, and harmless compared to prompted language model baselines.
> First, to make our agent more helpful and harmless, we break down the
> requirements for good dialogue into natural language rules the agent should
> follow. Second, our agent provides evidence from sources supporting factual
> claims when collecting preference judgements over model statements. Finally, we
> conduct extensive analyses showing that though our model learns to follow our
> rules it can exhibit distributional biases.
## Reduce Harms in Language Models
### Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned [[ArXiv](https://arxiv.org/abs/2209.07858)]
> We investigate scaling behaviors for red teaming. We find that the RLHF models
> are increasingly difficult to red team as they scale, and we find a flat trend
> with scale for the other model types. We exhaustively describe our
> instructions, processes, statistical methodologies, and uncertainty about red
> teaming.