1216: Add quickstart data collection guide (#1219)

* Create data_collection.md
2026-06-27 16:10:30 +08:00 · 2023-02-05 23:30:30 +00:00
parent f6e7bc72fa
commit 196cc0c58e
1 changed files with 83 additions and 0 deletions
@@ -0,0 +1,83 @@
+# Data Collection Guide
+
+## Writing Prompts (Initial Prompts or Prompter Replies)
+
+Attempt to post a **diverse range** of prompts. When posting a prompt, consider
+whether it is likely to be original. Collecting the same prompt many times is
+usually not useful.
+
+Prompts can be on any topic but should not be obscene or hateful. Prompts go
+through a moderation process just like answers, so spammy or inappropriate
+prompts will be removed.
+
+Do **not** include personally identifiable information (PII) in prompts. Any
+PII will be removed during moderation.
+
+Prompts should not include links which the assistant must open to effectively
+answer the prompt.
+
+### Prompts: Spelling and Grammar
+
+For prompts, grammar and spelling are significantly less important than for
+answers. If the grammar of a prompt makes it **ambiguous or unreadable**, it
+should be considered low quality, but common typos alone are not a problem.
+
+## Writing Answers (Assistant Replies)
+
+Answers should aim to match the tone of the prompt. Prompts which are to the
+point should receive answers which are to the point. Answers should not be
+unnecessarily formal and should be **conversational**, written the way a human
+might write. Avoid jargon if possible.
+
+Attempt to answer in a way which accurately fulfils the request in the prompt.
+Low-quality answers will be moderated. Longer answers tend to be preferable
+_unless_ the prompt is sufficiently simple to justify a shorter one.
+
+Where possible, **avoid bias** and do not express subjective preferences in
+answers. Do **not** write answers which encourage illegal activities or
+activities likely to be harmful to the user or others around them.
+
+### Answers: ChatGPT
+
+**Do not copy answers from ChatGPT** or other OpenAI services. Doing so is
+prohibited according to the OpenAI Terms of Service. Answers that are obviously
+ChatGPT-generated should be reported in moderation and rated low quality. Other
+AI systems may allow such usage, but answers should still **not** be copied and
+pasted.
+
+### Answers: Spelling and Grammar
+
+Answers should be held to a high standard of grammar and spelling. It **is**
+permissible for answers to contain sources or references provided they are
+relevant and from reputable websites. However, this is **not** required.
+
+## Ranking and Classifying Prompts
+
+The criteria for "quality" can largely be derived from how well the prompt
+follows the guidance
+[above](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/guides#writing-prompts-initial-prompts-or-prompter-replies).
+The other labels are largely self-explanatory.
+
+The spam label should be reserved for intentionally low-effort or low-quality
+prompts.
+
+It is usually fine to go closer to "Ordinary" than "Creative" unless there is
+obvious creativity shown. The same goes for "Serious" compared to "Humorous".
+
+## Ranking and Classifying Answers
+
+Your rankings and classification will help decide which answers are the best and
+therefore which ones are prioritised in the final dataset. That means the answer
+classification is a crucial task, serving as community moderation.
+
+The criteria for "quality" can largely be derived from how well the answer
+follows the guidance
+[above](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/guides#writing-answers-assistant-replies).
+The other labels are largely self-explanatory.
+
+The spam label should be reserved for intentionally low-effort or low-quality
+answers.
+
+The bad answer label should be given when the answer does not respond to the
+prompt at all or is irrelevant.
+
+It is usually fine to go closer to "Ordinary" than "Creative" unless there is
+obvious creativity shown. The same goes for "Serious" compared to "Humorous".