This is a list of resources for reinforcement learning from human feedback and other methods to instruct large language models.
Evaluation
There are multiple ways to formally evaluate LLM capabilities. Right now, projects generally use one of these three libraries. Personally I prefer EleutherAI's work, but opinions and GitHub stars are divided.
- EleutherAI/lm-evaluation-harness: A framework for few-shot evaluation of autoregressive language models.
- openai/evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- stanford-crfm/helm: Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110).
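To make "few-shot evaluation" concrete, here is a minimal sketch of the idea in plain Python. It does not use the API of any of the three libraries above: the function names, the `Q:`/`A:` prompt format, and the `toy_model` stand-in are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, reference answer)


def build_few_shot_prompt(shots: List[Example], question: str) -> str:
    """Concatenate k solved examples, then the unanswered question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def exact_match_accuracy(
    generate: Callable[[str], str],
    shots: List[Example],
    test_set: List[Example],
) -> float:
    """Fraction of test questions whose generated answer matches exactly."""
    correct = 0
    for question, answer in test_set:
        prediction = generate(build_few_shot_prompt(shots, question))
        correct += int(prediction.strip() == answer.strip())
    return correct / len(test_set)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: adds the two numbers in the last question."""
    last_question = prompt.rsplit("Q: ", 1)[1].split("\n")[0]
    left, right = last_question.split(" + ")
    return str(int(left) + int(right))


if __name__ == "__main__":
    shots = [("1 + 1", "2")]
    test_set = [("2 + 3", "5"), ("4 + 4", "8")]
    print(build_few_shot_prompt(shots, "2 + 3"))
    print(exact_match_accuracy(toy_model, shots, test_set))
```

In practice `generate` would wrap a real model call, and the libraries above add task registries, other metrics, and batching on top of this basic loop.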
Training
Data
Data can generally be divided along two axes:
- high quality 🗹 or lower quality ☐
- natural 🧑 or unnatural 🤖
Depending on your training objectives, you will want either lots of low quality instruction data or a small amount of high quality data. Which should you use? Let's see what Anthropic has to say (Askell et al., Anthropic):
How can we improve the sample efficiency of preference modeling? We find that we can significantly improve sample efficiency using a ‘preference model pre-training’ (PMP) stage of training, where we first pre-train on large public datasets that encode human preference information, such as Stack Exchange, Reddit, and Wikipedia edits, before finetuning on smaller datasets encoding more specific human preferences.
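For readers new to preference modeling, the sketch below shows one common way to score a (chosen, rejected) pair: a pairwise logistic loss on the difference of two scalar scores. This is an assumption for illustration, not necessarily the exact objective used in the quoted paper, and the function names are hypothetical. Under the PMP idea, a loss of this kind could be applied first to large public comparison data and then to the smaller, more specific dataset.

```python
import math
from typing import List, Tuple


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)).

    Written in a numerically stable softplus form so large margins
    do not overflow math.exp.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs: List[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen, rejected) score pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in score_pairs) / len(score_pairs)


if __name__ == "__main__":
    # The loss is small when the chosen response scores higher, large otherwise.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
```

The scores would come from a preference model's scalar output for each response; only the loss is shown here.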
Natural 🧑 & High quality 🗹
- oasst - from Open Assistant, 22k rows, 2023-04-12
- SHP - Stanford Human Preferences - a dataset of instructions inferred from high quality subreddits. 300k rows. 2023-02-23 tweet
- HH-RLHF - Anthropic RLHF, 91k rows
- https://github.com/allenai/natural-instructions 64k rows
- https://github.com/hendrycks/ethics 130k rows
Natural 🧑 & Lower quality ☐
- ELI5: a Reddit-based dataset of questions and answers. The SHP dataset improved on its processing by comparing score and time.
- https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences 10M instructions from Stack Exchange, used in the Anthropic paper
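The "comparing score and time" idea mentioned for SHP can be sketched as follows: if one comment was posted later than another but still has a higher score, treat the later comment as preferred. This is a simplified illustration of that heuristic, not the SHP pipeline itself. The `Comment` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Comment:
    text: str
    score: int        # net upvotes (hypothetical field name)
    created_utc: int  # posting time in seconds (hypothetical field name)


def infer_preferences(comments: List[Comment]) -> List[Tuple[Comment, Comment]]:
    """Return (preferred, other) pairs for comments on the same post.

    A pair is kept only when the later comment has the higher score, since
    the earlier comment had more time to collect votes.
    """
    pairs = []
    for a, b in combinations(comments, 2):
        earlier, later = sorted((a, b), key=lambda c: c.created_utc)
        if later.created_utc > earlier.created_utc and later.score > earlier.score:
            pairs.append((later, earlier))
    return pairs


if __name__ == "__main__":
    comments = [
        Comment("early, popular", 50, 100),
        Comment("late, more popular", 80, 200),
        Comment("latest, unpopular", 3, 300),
    ]
    for preferred, other in infer_preferences(comments):
        print(f"{preferred.text!r} preferred over {other.text!r}")
```

Pairs where the earlier comment has the higher score are dropped because that gap may reflect exposure time rather than quality.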
Unnatural 🤖 & High quality 🗹
- alpaca_data_cleaned.json: cleaned Alpaca data, including removing "as a large language model" https://github.com/gururise/AlpacaDataCleaned
- https://github.com/teknium1/GPTeacher
- https://github.com/databrickslabs/dolly
- OIG-small-chip2 a subset of the OIG dataset
Unnatural 🤖 & Lower quality ☐
- unnatural-instructions: used natural-instructions (listed above) and GPT-3 to make 256k examples
- OIG - Open Instruction Generalist Dataset, a compilation of ~43M instructions. "The OIG dataset is almost purely a synthetic data set created using data augmentation."
- Note: there is a higher quality subset, OIG-small-chip2
Uncategorized
Finding more data
A great way to find new instruction datasets is to:
- search Hugging Face's datasets
- look at compilations like OIG
- browse the GitHub instruction-tuning tag
Similar lists