mirror of
https://github.com/wassname/Open-Assistant.git
synced 2026-06-27 16:10:30 +08:00
Expanding survey of relevant research (#557)
@@ -1,7 +1,63 @@
# General
# Research

This page lists research papers that are relevant to the project.

## Table of Contents

- Reinforcement Learning from Human Feedback
- Generating Text From Language Models
- Automatically Generating Instruction Data for Training
- Uncertainty Estimation of Language Model Outputs

## Reinforcement Learning from Human Feedback <a name="reinforcement-learning-from-human-feedback"></a>

Reinforcement Learning from Human Feedback (RLHF) is a method for fine-tuning
generative language models using a reward model that is learned from human
preference data. Among other things, this method facilitates the training of
instruction-tuned models.

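As a rough illustration of how a reward model can be fit to human preference
data, the sketch below implements a pairwise (Bradley-Terry style) loss, which
is small when the human-preferred response receives the higher reward and
large otherwise. The function names and the toy reward values are hypothetical
and are not taken from the papers listed below, whose training setups differ
in detail.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    This is the negative log-likelihood that the human-preferred response
    wins under a Bradley-Terry model of the comparison.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign of the margin
    # so that exp() is never called with a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs) -> float:
    """Average the pairwise loss over (reward_chosen, reward_rejected) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pairs must not be empty")
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)


# Toy usage with made-up scalar rewards (assumed values, for illustration only).
print(pairwise_preference_loss(0.0, 0.0))  # equal rewards give log(2)
print(mean_preference_loss([(2.0, 0.5), (0.1, 1.3)]))
```

In practice the two rewards would come from a learned model scoring two
responses to the same prompt, and the loss would be minimized with gradient
descent; the scalar version here only shows the shape of the objective.
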
### Learning to summarize from human feedback [[ArXiv](https://arxiv.org/pdf/2009.01325.pdf)], [[Github](https://github.com/openai/summarize-from-feedback)]

> In this work, we show that it is possible to significantly improve summary
> quality by training a model to optimize for human preferences. We collect a
> large, high-quality dataset of human comparisons between summaries, train a
> model to predict the human-preferred summary, and use that model as a reward
> function to fine-tune a summarization policy using reinforcement learning.

### Training language models to follow instructions with human feedback [[ArXiv](https://arxiv.org/pdf/2203.02155.pdf)]

> Starting with a set of labeler-written prompts and prompts submitted through
> the OpenAI API, we collect a dataset of labeler demonstrations of the desired
> model behavior, which we use to fine-tune GPT-3 using supervised learning. We
> then collect a dataset of rankings of model outputs, which we use to further
> fine-tune this supervised model using reinforcement learning from human
> feedback.

### Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [[ArXiv](https://arxiv.org/pdf/2204.05862.pdf)]

> We apply preference modeling and reinforcement learning from human feedback
> (RLHF) to finetune language models to act as helpful and harmless assistants.
> We find this alignment training improves performance on almost all NLP
> evaluations, and is fully compatible with training for specialized skills such
> as python coding and summarization.

## Generating Text From Language Models

A language model generates output text token by token, autoregressively. The
large search space of this task requires some method of narrowing down the set
of tokens to be considered in each step. This method, in turn, has a big impact
on the quality of the resulting text.

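One simple way to narrow the candidate set at each step is top-k sampling:
keep only the k highest-scoring tokens, renormalize, and sample among them.
The sketch below is a minimal, library-free version of that idea. It is only
one of several decoding strategies and is not specific to any paper in this
section; the function name and the toy logits are hypothetical.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample one token index from the k highest-scoring logits."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    # Indices of the k largest logits; all other tokens are discarded.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (max subtracted for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy usage: a made-up score vector over a five-token vocabulary.
toy_logits = [0.2, 3.1, 2.9, -1.0, 0.5]
rng = random.Random(0)  # seeded only to make the demo repeatable
print([top_k_sample(toy_logits, k=2, rng=rng) for _ in range(5)])
```

With `k=1` this reduces to greedy decoding, while larger values of `k` trade
some likelihood for diversity.
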
### RANKGEN: Improving Text Generation with Large Ranking Models [[ArXiv](https://arxiv.org/pdf/2205.09726.pdf)], [[Github](https://github.com/martiansideofthemoon/rankgen)]

> Given an input sequence (or prefix), modern language models often assign high
> probabilities to output sequences that are repetitive, incoherent, or
> irrelevant to the prefix; as such, model-generated text also contains such
> artifacts. To address these issues we present RankGen, a 1.2B parameter
> encoder model for English that scores model generations given a prefix.
> RankGen can be flexibly incorporated as a scoring function in beam search and
> used to decode from any pretrained language model.

## Automatically Generating Instruction Data for Training

This line of work is about significantly reducing the need for manually
@@ -32,3 +88,15 @@ models.
> rivals the effectiveness of training on open-source manually-curated datasets,
> surpassing the performance of models such as T0++ and Tk-Instruct across
> various benchmarks.

## Uncertainty Estimation of Language Model Outputs

### Teaching models to express their uncertainty in words [[ArXiv](https://arxiv.org/pdf/2205.14334.pdf)]

> We show that a GPT-3 model can learn to express uncertainty about its own
> answers in natural language -- without use of model logits. When given a
> question, the model generates both an answer and a level of confidence (e.g.
> "90% confidence" or "high confidence"). These levels map to probabilities that
> are well calibrated. The model also remains moderately calibrated under
> distribution shift, and is sensitive to uncertainty in its own answers, rather
> than imitating human examples.

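Whether stated confidences are well calibrated can be checked by comparing
them with observed accuracy. The sketch below computes expected calibration
error (ECE), a commonly used calibration metric, from a list of stated
confidences and correctness flags. It is not necessarily the metric used in
the paper above; the function name, the equal-width binning, and the toy data
are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(not 0.0 <= c <= 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, bool(ok)))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


# Toy usage: four answers stated with full confidence, two of them correct.
print(expected_calibration_error([1.0, 1.0, 1.0, 1.0], [True, False, True, False]))
```

A value near zero means that answers given with, say, 90% confidence are
correct about 90% of the time; larger values indicate over- or
under-confidence.
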