awesome-rlhf/README.md at 24b79e585a4e40984a7a6fcbe9f9f37e88bc8d2e

wassname/awesome-rlhf

mirror of https://github.com/wassname/awesome-rlhf.git synced 2026-06-27 18:03:47 +08:00

wassname 24b79e585a Create README.md

2023-04-22 11:17:56 +08:00

This is a list of resources for reinforcement learning from human feedback and other methods to instruct large language models.

Evaluation

There are multiple ways to formally evaluate LLM capabilities. Right now, projects generally use one of three libraries. Personally I prefer Eleuther's work, but opinions and GitHub stars are divided.

Training

Data

Data can generally be divided along two axes:

High quality 🗹 or lower quality ☐
Natural 🧑 or unnatural 🤖

Depending on your training objectives, you will want either lots of low-quality instruction data or a small amount of high-quality data. Which should you use? Let's see what Anthropic has to say (Askell et al., Anthropic):

How can we improve the sample efficiency of preference modeling? We find that we can significantly improve sample efficiency using a ‘preference model pre-training’ (PMP) stage of training, where we first pre-train on large public datasets that encode human preference information, such as Stack Exchange, Reddit, and Wikipedia edits, before finetuning on smaller datasets encoding more specific human preferences.
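As a rough illustration of what a preference model is trained to do, here is a minimal sketch of a pairwise ranking loss. It assumes the common formulation where the model assigns a scalar score to each response and is penalized when the human-preferred response does not score higher. The function name and arguments are hypothetical, and this is not necessarily the exact objective used in the paper quoted above.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # Equivalent to log(1 + exp(-margin)), split by sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Under this sketch, the same loss would apply in both stages described in the quote: first on pairs from large public datasets, then on the smaller, more specific dataset.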

Natural 🧑 & High quality 🗹

oasst - from Open Assistant, 22k rows, 2023-04-12
SHP - Stanford Human Preferences - a dataset of instructions inferred from high-quality subreddits, 300k rows, 2023-02-23 (tweet)
HH-RLHF - Anthropic RLHF, 91k rows
https://github.com/allenai/natural-instructions 64k rows
https://github.com/hendrycks/ethics 130k rows

Natural 🧑 & Lower quality ☐

ELI5 - a Reddit-based dataset of questions and answers. The SHP dataset improved on its processing by comparing score and time
https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences - 10M Stack Exchange instructions, used in the Anthropic paper
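The "comparing score and time" idea can be sketched as follows. This is a minimal illustration, assuming the rule is that a comment counts as preferred only when it was posted later than another comment on the same post and still earned a higher score. The `Comment` class and its field names are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Comment:
    text: str
    score: int        # net upvotes
    created_utc: int  # posting time, seconds since epoch


def infer_preferences(comments: List[Comment]) -> List[Tuple[Comment, Comment]]:
    """Return (preferred, other) pairs: posted later, yet scored higher."""
    pairs = []
    for a, b in combinations(comments, 2):
        earlier, later = (a, b) if a.created_utc < b.created_utc else (b, a)
        # Skip ties in time; require the later comment to have the higher score.
        if later.created_utc > earlier.created_utc and later.score > earlier.score:
            pairs.append((later, earlier))
    return pairs
```

The timing condition matters because earlier comments have had more time to collect votes, so a raw score comparison alone would favor whichever answer arrived first.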

Unnatural 🤖 & High quality 🗹

alpaca_data_cleaned.json - cleaned Alpaca data, including removing "as a large language model" responses: https://github.com/gururise/AlpacaDataCleaned
https://github.com/teknium1/GPTeacher
https://github.com/databrickslabs/dolly
OIG-small-chip2 a subset of the OIG dataset

Unnatural 🤖 & Lower quality ☐

unnatural-instructions - used natural-instructions (listed above) and GPT3 to make 256k examples
OIG - Open Instruction Generalist Dataset, a compilation of ~43M instructions. "The OIG dataset is almost purely a synthetic data set created using data augmentation."
- Note: there is a higher-quality subset, OIG-small-chip2

Uncategorized

b-mc2 (https://huggingface.co/datasets/b-mc2/wikihow_lists).

Finding more data

A great way to find new instruction datasets is to:

Search Hugging Face's datasets
Look at compilations like OIG
Browse GitHub's instruction-tuning tag
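The Hugging Face search can also be scripted. The sketch below assumes the `huggingface_hub` package is installed and that network access is available. The helper names and the idea of merging several keyword searches are illustrative choices, not part of the original list.

```python
from typing import Iterable, List


def merge_unique(id_lists: Iterable[Iterable[str]]) -> List[str]:
    """Merge several result lists, keeping first-seen order and dropping duplicates."""
    seen = set()
    merged = []
    for ids in id_lists:
        for dataset_id in ids:
            if dataset_id not in seen:
                seen.add(dataset_id)
                merged.append(dataset_id)
    return merged


def search_hub(queries: List[str], limit: int = 20) -> List[str]:
    """Query the Hugging Face Hub once per keyword (needs network access)."""
    # Imported lazily so merge_unique works without the dependency installed.
    from huggingface_hub import HfApi

    api = HfApi()
    results = [
        [info.id for info in api.list_datasets(search=query, limit=limit)]
        for query in queries
    ]
    return merge_unique(results)
```

Calling `search_hub(["instruct", "rlhf", "preference"])` would return dataset ids matching any of those keywords. The keywords themselves are guesses and worth adjusting.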

Similar lists
