2023-04-29 18:46:58 +08:00

This is a list of resources for reinforcement learning from human feedback (RLHF) and other methods to instruct large language models.

Data

Data can generally be divided along two axes:

  • high quality 🗹 or lower quality ☐
  • natural 🧑 or unnatural 🤖

Depending on your training objectives, you will want either lots of low-quality instruction data or a small amount of high-quality data. Which should you use? Let's see what Anthropic have to say in Askell et al.:

How can we improve the sample efficiency of preference modeling? We find that we can significantly improve sample efficiency using a preference model pre-training (PMP) stage of training, where we first pre-train on large public datasets that encode human preference information, such as Stack Exchange, Reddit, and Wikipedia edits, before finetuning on smaller datasets encoding more specific human preferences.
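To make "preference modeling" concrete, here is a minimal sketch of a pairwise ranking loss that is commonly used to train preference models. It is an assumption for illustration, not taken from the quote above: the function names are hypothetical, and the scores stand in for the scalar outputs of a preference model on a preferred ("chosen") and a dispreferred ("rejected") response.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably.

    The loss is small when the preferred response scores well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def mean_loss(pairs):
    """Average the loss over (score_chosen, score_rejected) pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)
```

Under this sketch, a pre-training stage on large public preference data and a later finetuning stage on a smaller dataset could share the same loss; only the source of the comparison pairs would change.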

Natural 🧑 & High quality 🗹

Natural 🧑 & Lower quality ☐

Unnatural 🤖 & High quality 🗹

Unnatural 🤖 & Lower quality ☐

  • unnatural-instructions: used the above and GPT-3 to make 256k examples
  • OIG (Open Instruction Generalist Dataset): a compilation of ~43M instructions. "The OIG dataset is almost purely a synthetic data set created using data augmentation."
    • Note: there is a higher-quality subset, OIG-small-chip2

Uncategorized

Finding more data

A great way to find new instruction datasets is to

Training

Libraries

Tutorials

Papers/Methods

  • RLHF
  • Chain of Hindsight, https://arxiv.org/abs/2302.02676: the model is trained to rank its own output, so it's kind of like diffusion, letting the model operate iteratively.
  • SFT (Supervised Fine-Tuning): this is normal fine-tuning.
  • Pretraining Language Models with Human Preferences (tweet): "You can (and should) do RL from human feedback during pretraining itself! In our new paper, we show how training w/ human preferences early on greatly reduces undesirable LM behaviors"
  • HIR (Hindsight Instruction Relabeling) 💩: offline RL reinvented with extra steps. FARL: SL. Algorithm Distillation: classical control problems, offline RL.
  • Hindsight Instruction Relabeling (HIR), https://arxiv.org/abs/2302.05206: "outperforms the baseline algorithms and is comparable to or even surpasses supervised finetuning."
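As a rough illustration of the hindsight relabeling idea in the list above, the sketch below rewrites each sampled output's instruction so that it describes what the model actually produced, then turns the result into an ordinary supervised example. Everything here is an assumption for illustration: the `Sample` class, the instruction strings, and the prompt layout are hypothetical and are not the paper's code.

```python
from dataclasses import dataclass

# Hypothetical instruction strings; the wording is an assumption.
CORRECT = "Generate a correct answer to this problem."
WRONG = "Generate a wrong answer to this problem."

@dataclass
class Sample:
    instruction: str  # instruction the model was given
    query: str        # the task input
    output: str       # what the model actually generated
    is_correct: bool  # feedback on the output

def relabel(sample: Sample) -> Sample:
    """Replace the instruction with one that matches the observed output."""
    hindsight = CORRECT if sample.is_correct else WRONG
    return Sample(hindsight, sample.query, sample.output, sample.is_correct)

def to_sft_example(sample: Sample) -> dict:
    """Format a relabeled sample as a prompt/completion pair for supervised training."""
    return {"prompt": f"{sample.instruction}\n{sample.query}", "completion": sample.output}
```

After relabeling, a failed attempt is no longer a bad example of following the original instruction. It becomes a valid example of following the relabeled one, which is why plain supervised training can be applied to it.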

Evaluation

There are multiple ways to formally evaluate LLM capabilities. Right now projects generally use one of these three libraries. Personally I prefer Eleuther's work, but opinions and GitHub stars are divided.

Similar lists
