awesome-rlhf/README.md
2023-04-22 11:29:29 +08:00


This is a list of resources for reinforcement learning from human feedback (RLHF) and other methods to instruct large language models.

Data

Data can generally be divided along two axes:

  • high quality 🗹 or lower quality ☐
  • natural 🧑 or unnatural 🤖

Depending on your training objectives, you will want either lots of lower-quality instruction data or a small amount of high-quality data. Which should you use? Let's see what Anthropic have to say in Askell et al.:

How can we improve the sample efficiency of preference modeling? We find that we can significantly improve sample efficiency using a preference model pre-training (PMP) stage of training, where we first pre-train on large public datasets that encode human preference information, such as Stack Exchange, Reddit, and Wikipedia edits, before finetuning on smaller datasets encoding more specific human preferences.
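
As a rough illustration of what "preference modeling" optimizes, here is a minimal sketch of a pairwise ranking loss: the model assigns a scalar score to each of two responses, and the loss is small when the human-preferred one scores higher. This is a commonly used formulation and an assumption on my part, not necessarily the exact objective in Askell et al.; the function names and scores are hypothetical.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs) -> float:
    """Average the loss over (chosen_score, rejected_score) pairs."""
    losses = [pairwise_preference_loss(c, r) for c, r in score_pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical scores: the first pair is ranked correctly, the second is not.
    print(mean_preference_loss([(2.0, -1.0), (-0.5, 0.5)]))
```

Under this sketch, pre-training on large public preference data and then finetuning on a smaller, more specific dataset would both minimize the same kind of loss, just on different pairs.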

Natural 🧑 & High quality 🗹

Natural 🧑 & Lower quality ☐

Unnatural 🤖 & High quality 🗹

Unnatural 🤖 & Lower quality ☐

  • unnatural-instructions: used the above and GPT-3 to generate 256k examples
  • OIG - Open Instruction Generalist Dataset: a compilation of ~43M instructions. "The OIG dataset is almost purely a synthetic data set created using data augmentation."
    • Note: there is a higher-quality subset, OIG-small-chip2

Uncategorized

Finding more data

A great way to find new instruction datasets is to

Training

Libraries

Papers/Methods

  • RLHF
  • Chain of Hindsight https://arxiv.org/abs/2302.02676: the model is trained to rank its own output, so it's kind of like diffusion, letting the model operate iteratively.
  • SFT - Supervised Fine-Tuning: this is normal fine-tuning
  • HIR: Hindsight Instruction Relabeling 💩 offline RL reinvented with extra steps
  • FARL: SL
  • Algorithm Distillation: classical control problems. Offline RL
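
For the SFT entry in the list above, a minimal sketch of how one training example is often prepared: the prompt and response token ids are concatenated, and the prompt positions are masked out of the loss so that only the response is learned. Masking the prompt is a common convention rather than something this list prescribes, and the token ids and function name are hypothetical. The value -100 is an assumption chosen to match the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
# Assumption: -100 matches the default ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100


def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Positions labelled IGNORE_INDEX contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


if __name__ == "__main__":
    # Hypothetical token ids for an instruction and its answer.
    ids, labels = build_sft_example([11, 12, 13], [21, 22])
    print(ids)
    print(labels)
```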

Evaluation

There are multiple ways to formally evaluate LLM capabilities. Right now, projects generally use one of these three libraries. Personally I prefer Eleuther's work, but opinions and GitHub stars are divided.

Similar lists