Expanding survey of relevant research (#557)

Expanding survey of relevant research
2026-06-27 16:10:30 +08:00 · 2023-01-09 05:45:54 -05:00
parent 225aa58f89
commit c22b26d21a
1 changed files with 69 additions and 1 deletions
@@ -1,7 +1,63 @@
-# General
+# Research

 This page lists research papers that are relevant to the project.

+## Table of Contents
+
+- Reinforcement Learning from Human Feedback
+- Generating Text From Language Models
+- Automatically Generating Instruction Data for Training
+- Uncertainty Estimation of Language Model Outputs
+
+## Reinforcement Learning from Human Feedback <a name="reinforcement-learning-from-human-feedback"></a>
+
+Reinforcement Learning from Human Feedback (RLHF) is a method for fine-tuning a
+generative language model against a reward model that is learned from human
+preference data. Among other uses, it is a common way to train
+instruction-following models.
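
As a rough illustration of how a reward model can be learned from preference data, the sketch below computes a pairwise loss of the form `-log(sigmoid(r_chosen - r_rejected))`, a formulation used in the papers listed in this section. The function name and the scalar-reward interface are assumptions made for this example; a real reward model would produce the rewards from a neural network and average the loss over a batch of comparisons.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for one human comparison (hypothetical helper).

    Returns -log(sigmoid(reward_chosen - reward_rejected)): small when the
    reward model scores the human-preferred output higher, large otherwise.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the preferred output is ranked further ahead.
agree = preference_loss(2.0, 0.0)     # model agrees with the labeler
disagree = preference_loss(0.0, 2.0)  # model disagrees with the labeler
```

Minimizing this loss pushes the reward of the preferred output above that of the rejected one; the fine-tuning step then optimizes the language model against the learned reward.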
+
+### Learning to summarize from human feedback [[ArXiv](https://arxiv.org/pdf/2009.01325.pdf)], [[Github](https://github.com/openai/summarize-from-feedback)]
+
+> In this work, we show that it is possible to significantly improve summary
+> quality by training a model to optimize for human preferences. We collect a
+> large, high-quality dataset of human comparisons between summaries, train a
+> model to predict the human-preferred summary, and use that model as a reward
+> function to fine-tune a summarization policy using reinforcement learning.
+
+### Training language models to follow instructions with human feedback [[ArXiv](https://arxiv.org/pdf/2203.02155.pdf)]
+
+> Starting with a set of labeler-written prompts and prompts submitted through
+> the OpenAI API, we collect a dataset of labeler demonstrations of the desired
+> model behavior, which we use to fine-tune GPT-3 using supervised learning. We
+> then collect a dataset of rankings of model outputs, which we use to further
+> fine-tune this supervised model using reinforcement learning from human
+> feedback.
+
+### Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [[ArXiv](https://arxiv.org/pdf/2204.05862.pdf)]
+
+> We apply preference modeling and reinforcement learning from human feedback
+> (RLHF) to finetune language models to act as helpful and harmless assistants.
+> We find this alignment training improves performance on almost all NLP
+> evaluations, and is fully compatible with training for specialized skills such
+> as python coding and summarization.
+
+## Generating Text From Language Models
+
+A language model generates output text token by token, autoregressively. The
+search space is too large to explore exhaustively, so a decoding method must
+narrow down the set of tokens considered at each step. The choice of method
+has a large impact on the quality of the resulting text.
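
To make "narrowing down the set of tokens" concrete, here is a minimal sketch of top-k sampling, one widely used decoding method. It is chosen purely for illustration; this page does not single it out. The function names and the toy logits are assumptions, and a real decoder would obtain the logits from a language model at every step.

```python
import math
import random


def top_k_filter(logits: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k highest-scoring tokens and renormalize them with a softmax."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    max_logit = top[0][1]  # subtract the maximum for numerical stability
    weights = {token: math.exp(logit - max_logit) for token, logit in top}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


def sample_next_token(logits: dict[str, float], k: int, rng: random.Random) -> str:
    """Sample one token from the renormalized top-k distribution."""
    probs = top_k_filter(logits, k)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Toy next-token scores (made up for this example).
toy_logits = {"cat": 2.0, "dog": 1.0, "the": -1.0, "xyz": -5.0}
next_token = sample_next_token(toy_logits, k=2, rng=random.Random(0))
```

With `k=2` only `"cat"` and `"dog"` can ever be sampled; a smaller `k` gives more predictable text, a larger `k` more varied text.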
+
+### RANKGEN: Improving Text Generation with Large Ranking Models [[ArXiv](https://arxiv.org/pdf/2205.09726.pdf)], [[Github](https://github.com/martiansideofthemoon/rankgen)]
+
+> Given an input sequence (or prefix), modern language models often assign high
+> probabilities to output sequences that are repetitive, incoherent, or
+> irrelevant to the prefix; as such, model-generated text also contains such
+> artifacts. To address these issues we present RankGen, a 1.2B parameter
+> encoder model for English that scores model generations given a prefix.
+> RankGen can be flexibly incorporated as a scoring function in beam search and
+> used to decode from any pretrained language model.
+
 ## Automatically Generating Instruction Data for Training

 This line of work is about significantly reducing the need for manually
@@ -32,3 +88,15 @@ models.
 > rivals the effectiveness of training on open-source manually-curated datasets,
 > surpassing the performance of models such as T0++ and Tk-Instruct across
 > various benchmarks.
+
+## Uncertainty Estimation of Language Model Outputs
+
+### Teaching models to express their uncertainty in words [[ArXiv](https://arxiv.org/pdf/2205.14334.pdf)]
+
+> We show that a GPT-3 model can learn to express uncertainty about its own
+> answers in natural language -- without use of model logits. When given a
+> question, the model generates both an answer and a level of confidence (e.g.
+> "90% confidence" or "high confidence"). These levels map to probabilities that
+> are well calibrated. The model also remains moderately calibrated under
+> distribution shift, and is sensitive to uncertainty in its own answers, rather
+> than imitating human examples.