Expanding survey of relevant research (#557)

mrcabbage972
2023-01-09 05:45:54 -05:00
committed by GitHub
parent 225aa58f89
commit c22b26d21a
+69 -1
@@ -1,7 +1,63 @@
# General
# Research
This page lists research papers that are relevant to the project.
## Table of Contents
- Reinforcement Learning from Human Feedback
- Generating Text From Language Models
- Automatically Generating Instruction Data for Training
- Uncertainty Estimation of Language Model Outputs
## Reinforcement Learning from Human Feedback <a name="reinforcement-learning-from-human-feedback"></a>
Reinforcement Learning from Human Feedback (RLHF) is a method for fine-tuning a
generative language model using a reward model that is learned from human
preference data. Among other uses, this method makes it possible to train
instruction-following models.
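
As a rough illustration of how a reward model can be learned from preference
data, the sketch below computes a pairwise (Bradley-Terry style) loss for a
single human comparison. It assumes the reward model has already produced one
scalar score per candidate output; the function name and the plain-float
interface are illustrative assumptions and are not taken from any of the papers
listed here.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred output wins.

    Assumes a Bradley-Terry model: P(preferred wins) = sigmoid(r_pref - r_rej).
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    # Branch on the sign so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the preferred output already scores higher,
# and large when the reward model ranks the pair the wrong way round.
loss_agree = pairwise_preference_loss(2.0, 0.0)
loss_disagree = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many comparisons pushes the reward model to score
preferred outputs higher; the resulting scores can then serve as the reward
signal during reinforcement learning.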
### Learning to summarize from human feedback [[ArXiv](https://arxiv.org/pdf/2009.01325.pdf)], [[GitHub](https://github.com/openai/summarize-from-feedback)]
> In this work, we show that it is possible to significantly improve summary
> quality by training a model to optimize for human preferences. We collect a
> large, high-quality dataset of human comparisons between summaries, train a
> model to predict the human-preferred summary, and use that model as a reward
> function to fine-tune a summarization policy using reinforcement learning.
### Training language models to follow instructions with human feedback [[ArXiv](https://arxiv.org/pdf/2203.02155.pdf)]
> Starting with a set of labeler-written prompts and prompts submitted through
> the OpenAI API, we collect a dataset of labeler demonstrations of the desired
> model behavior, which we use to fine-tune GPT-3 using supervised learning. We
> then collect a dataset of rankings of model outputs, which we use to further
> fine-tune this supervised model using reinforcement learning from human
> feedback.
### Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [[ArXiv](https://arxiv.org/pdf/2204.05862.pdf)]
> We apply preference modeling and reinforcement learning from human feedback
> (RLHF) to finetune language models to act as helpful and harmless assistants.
> We find this alignment training improves performance on almost all NLP
> evaluations, and is fully compatible with training for specialized skills such
> as python coding and summarization.
## Generating Text From Language Models
A language model generates output text autoregressively, one token at a time.
Because the space of possible output sequences is very large, decoding needs
some method for narrowing down the set of tokens considered at each step. The
choice of method, in turn, strongly affects the quality of the resulting text.
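
Two widely used ways of narrowing down the candidate tokens at each step are
top-k and nucleus (top-p) sampling. The sketch below shows both on a toy
distribution, using only the Python standard library. The function names and
the example logits are illustrative assumptions; real decoders operate on
model logits over a full vocabulary.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def sample(filtered, rng=random):
    """Draw one token id from a filtered {token_id: probability} mapping."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]


# Toy next-token scores for a 4-token vocabulary (hypothetical values).
probs = softmax([2.0, 1.0, 0.1, -1.0])
next_token = sample(top_p_filter(probs, 0.9), random.Random(0))
```

Smaller values of `k` or `p` make the output more conservative, while larger
values admit less likely tokens and increase diversity.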
### RANKGEN: Improving Text Generation with Large Ranking Models [[ArXiv](https://arxiv.org/pdf/2205.09726.pdf)], [[GitHub](https://github.com/martiansideofthemoon/rankgen)]
> Given an input sequence (or prefix), modern language models often assign high
> probabilities to output sequences that are repetitive, incoherent, or
> irrelevant to the prefix; as such, model-generated text also contains such
> artifacts. To address these issues we present RankGen, a 1.2B parameter
> encoder model for English that scores model generations given a prefix.
> RankGen can be flexibly incorporated as a scoring function in beam search and
> used to decode from any pretrained language model.
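
The general idea of scoring candidate continuations against a prefix can be
sketched as a simple reranking step. The code below is not the RankGen API:
`rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap
scorer is only a stand-in for a learned ranking model.

```python
from typing import Callable, List, Sequence


def rerank(
    prefix: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> List[str]:
    """Order candidate continuations from highest to lowest score for the prefix."""
    return sorted(candidates, key=lambda c: score_fn(prefix, c), reverse=True)


def toy_overlap_score(prefix: str, continuation: str) -> float:
    """Stand-in scorer: fraction of continuation words that also occur in the prefix."""
    prefix_words = set(prefix.lower().split())
    words = continuation.lower().split()
    if not words:
        return 0.0
    return sum(w in prefix_words for w in words) / len(words)


# Candidates would normally be sampled from a language model; these are made up.
ranked = rerank(
    "the cat sat on the mat",
    ["the cat sat down", "quantum flux capacitor", "on the mat"],
    toy_overlap_score,
)
```

In practice `score_fn` would wrap a trained ranking model, and the same scoring
step could be applied to partial hypotheses inside beam search rather than only
to complete candidates.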
## Automatically Generating Instruction Data for Training
This line of work is about significantly reducing the need for manually
@@ -32,3 +88,15 @@ models.
> rivals the effectiveness of training on open-source manually-curated datasets,
> surpassing the performance of models such as T0++ and Tk-Instruct across
> various benchmarks.
## Uncertainty Estimation of Language Model Outputs
### Teaching models to express their uncertainty in words [[ArXiv](https://arxiv.org/pdf/2205.14334.pdf)]
> We show that a GPT-3 model can learn to express uncertainty about its own
> answers in natural language -- without use of model logits. When given a
> question, the model generates both an answer and a level of confidence (e.g.
> "90% confidence" or "high confidence"). These levels map to probabilities that
> are well calibrated. The model also remains moderately calibrated under
> distribution shift, and is sensitive to uncertainty in its own answers, rather
> than imitating human examples.
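
One common way to check whether stated confidence levels are well calibrated is
the expected calibration error (ECE). The sketch below is a minimal version
under the assumption that each answer comes with a confidence in [0, 1] and a
correctness label. It is not necessarily the metric used in the paper above,
and the function name and binning scheme are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# A model that says "90% confidence" and is right 9 times out of 10
# is well calibrated on this toy sample, so its ECE is close to zero.
ece_good = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
```

A value near zero means stated confidence tracks observed accuracy; larger
values indicate over- or under-confidence.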