Merge pull request #239 from markcheeky/patch-1

Create docs/supervised_datasets.md, suggested by yk in issue 186
2026-06-28 16:20:34 +08:00 · 2023-01-03 21:35:05 +01:00
parent 1970b43a57 c7ff4c6a97
commit e14c670cb0
1 changed files with 79 additions and 0 deletions
@@ -0,0 +1,79 @@
+# Supervised datasets
+
+For discussion about usage of supervised data see issue
+<https://github.com/LAION-AI/Open-Assistant/issues/186>.
+
+## Motivation
+
+An important part of making the assistant useful is to teach it to understand
+and follow instructions, and to perform large set of tasks well.
+
+While RLHF seems like the main ingredient, using existing supervised data might
+help.
+
+There are two large-scale projects in the area of instruction-following /
+multitask learning: Promptsource and Natural Instructions - these projects
+crowdsourced templates and turned existing NLP datasets into
+instruction-following seq2seq form in natural langauge. They include both
+long-output training examples like generating a sentence that is a likely
+consequence of sentence in the prompt, and short-output, like rating prediction
+from review. (Pre-)training on such datasets should help model understand and
+follow instructions and teach it many abilities neccessary to perform a large
+set of tasks correctly. However, these data are not dialog-like - they do not
+look like a normal conversation.
+
+There are also supervised dialog datasets such as Blended Skill Talk or SODA. In
+constrast to instruction-following datasets, dialog data is not as focused on
+"academic tasks" or correctness, but encourage the model to respond naturally
+like a person would.
+
+### Promptsource
+
+- GitHub: <https://github.com/bigscience-workshop/promptsource>
+- paper:
+  [Multitask Prompted Training Enables Zero-Shot Task Generalization](https://arxiv.org/abs/2110.08207)
+- project for preparing templates and working with them
+- they generated a dataset using the templates:
+  - <https://huggingface.co/datasets/bigscience/P3>
+  - <https://huggingface.co/datasets/bigscience/xP3> (with multilingual data but
+    English prompt)
+  - <https://huggingface.co/datasets/bigscience/xP3mt> (with multilingual data
+    and machine-translated prompt)
+- they trained zero-shot models (= models for following instructions in the
+  input)
+  - based on T5 architecture (encoder-decoder) called T0 family (and MT0 for
+    multilingual)
+  - and based on GPT architecture (decoder-only) called BloomZ family
+  - Huggingface demo: [T0](https://huggingface.co/bigscience/T0pp),
+    [MT0](https://huggingface.co/bigscience/mt0-large),
+    [BloomZ](https://huggingface.co/bigscience/bloomz),
+  - GitHub repo for T0: <https://github.com/bigscience-workshop/t-zero>
+  - GitHub repo for BloomZ and MT0:
+    <https://github.com/bigscience-workshop/xmtf>
+
+### Natural instructions
+
+- GitHub: <https://github.com/allenai/natural-instructions>
+- paper:
+  [Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks](https://arxiv.org/abs/2204.07705)
+- they crowdsource directly the data prepared for instruction following (and
+  learning from a few examples)
+- the GitHub repo = the dataset. It contains jsons
+- they trained zero-shot and in-context few-shot models (in multiple sizes):
+  - mT5 architecture (encoder-decoder, multilingual pretraining)
+  - Huggingface demo few-shot:
+    <https://huggingface.co/allenai/tk-instruct-3b-def-pos>
+  - Huggingface demo zero-shot:
+    <https://huggingface.co/allenai/tk-instruct-3b-def>
+
+### Blended Skill Talk
+
+- used by Facebook in Blenderbot project
+- HuggingFace dataset: <https://huggingface.co/datasets/blended_skill_talk>
+- example model trained on it:
+  <https://huggingface.co/facebook/blenderbot_small-90M>
+
+### SODA
+
+- GitHub: <https://github.com/skywalker023/sodaverse>
+- paper: <https://arxiv.org/abs/2212.10465>