diff --git a/docs/supervised_datasets.md b/docs/supervised_datasets.md
new file mode 100644
index 00000000..0f8c986d
--- /dev/null
+++ b/docs/supervised_datasets.md
@@ -0,0 +1,79 @@
+# Supervised datasets
+
+For discussion about usage of supervised data see issue
+.
+
+## Motivation
+
+An important part of making the assistant useful is to teach it to understand
+and follow instructions, and to perform a large set of tasks well.
+
+While RLHF seems like the main ingredient, using existing supervised data might
+help.
+
+There are two large-scale projects in the area of instruction-following /
+multitask learning: Promptsource and Natural Instructions. These projects
+crowdsourced templates and turned existing NLP datasets into
+instruction-following seq2seq form in natural language. They include both
+long-output training examples, like generating a sentence that is a likely
+consequence of the sentence in the prompt, and short-output ones, like
+predicting a rating from a review. (Pre-)training on such datasets should help
+the model understand and follow instructions, and teach it many abilities
+necessary to perform a large set of tasks correctly. However, these data are
+not dialog-like: they do not look like a normal conversation.
+
+There are also supervised dialog datasets such as Blended Skill Talk or SODA. In
+contrast to instruction-following datasets, dialog data is not as focused on
+"academic tasks" or correctness, but encourages the model to respond naturally,
+like a person would.
+
+### Promptsource
+
+- GitHub:
+- paper:
+  [Multitask Prompted Training Enables Zero-Shot Task Generalization](https://arxiv.org/abs/2110.08207)
+- a project for preparing templates and working with them
+- they generated a dataset using the templates:
+  -
+  - (with multilingual data but
+    English prompt)
+  - (with multilingual data
+    and machine-translated prompt)
+- they trained zero-shot models (i.e. models that follow instructions given in
+  the input)
+  - based on the T5 architecture (encoder-decoder), called the T0 family (and
+    MT0 for multilingual)
+  - and based on the GPT architecture (decoder-only), called the BloomZ family
+  - Hugging Face demo: [T0](https://huggingface.co/bigscience/T0pp),
+    [MT0](https://huggingface.co/bigscience/mt0-large),
+    [BloomZ](https://huggingface.co/bigscience/bloomz)
+  - GitHub repo for T0:
+  - GitHub repo for BloomZ and MT0:
+
+
+### Natural Instructions
+
+- GitHub:
+- paper:
+  [Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks](https://arxiv.org/abs/2204.07705)
+- they directly crowdsource data prepared for instruction following (and
+  learning from a few examples)
+- the GitHub repo = the dataset. It contains JSON files
+- they trained zero-shot and in-context few-shot models (in multiple sizes):
+  - mT5 architecture (encoder-decoder, multilingual pretraining)
+  - Hugging Face demo few-shot:
+
+  - Hugging Face demo zero-shot:
+
+
+### Blended Skill Talk
+
+- used by Facebook in the Blenderbot project
+- Hugging Face dataset:
+- example model trained on it:
+
+
+### SODA
+
+- GitHub:
+- paper: