diff --git a/docs/docs/guides/data_collection.md b/docs/docs/guides/data_collection.md
new file mode 100644
index 00000000..4dce205c
--- /dev/null
+++ b/docs/docs/guides/data_collection.md
@@ -0,0 +1,83 @@
+# Data Collection Guide
+
+## Writing Prompts (Initial Prompts or Prompter Replies)
+
+Attempt to post a **diverse range** of prompts. When posting a prompt, consider
+whether it is likely to be original. Collecting the same prompt many times is
+usually not useful.
+
+Prompts can be on any topic but should not be obscene or hateful. Prompts go
+through a moderation process just like answers, so spammy or inappropriate
+prompts will be removed.
+
+Do **not** include personally identifiable information (PII) in prompts. This
+will be moderated out.
+
+Prompts should not include links which the assistant must open to effectively
+answer the prompt.
+
+### Prompts: Spelling and Grammar
+
+For prompts, grammar and spelling are significantly less important than for
+answers. If the grammar of a prompt makes it **ambiguous or unreadable**, it
+should be considered low quality, but if it simply contains common typos, this
+is not a problem.
+
+## Writing Answers (Assistant Replies)
+
+Answers should aim to match the tone of the prompt. Prompts which are to the
+point should receive answers which are to the point. Answers should not be
+unnecessarily formal and should be **conversational**, in the way a human might
+write. Avoid jargon if possible.
+
+Attempt to answer in a way which accurately fulfils the request in the prompt.
+Low-quality answers will be moderated. Longer answers tend to be preferable
+_unless_ the prompt is sufficiently simple to justify a shorter one.
+
+Where possible, **avoid bias** or expressing subjective preferences in answers.
+Do **not** write answers which encourage illegal activities or activities likely
+to be harmful to the user or others around them.
+
+### Answers: ChatGPT
+
+**Do not copy answers from ChatGPT** or other OpenAI services. This is
+prohibited under the OpenAI Terms of Service. Obvious ChatGPT-generated answers
+should be reported in moderation and rated low quality. Other AI systems may
+allow such usage, but answers should still **not** be copied and pasted.
+
+### Answers: Spelling and Grammar
+
+Answers should be held to a high standard of grammar and spelling. It **is**
+permissible for answers to contain sources or references provided they are
+relevant and from reputable websites. However, this is **not** required.
+
+## Ranking and Classifying Prompts
+
+The criteria for "quality" can largely be derived from how well the prompt
+follows the guidance
+[above](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/guides#writing-prompts-initial-prompts-or-prompter-replies).
+The other labels are largely self-explanatory.
+
+Spam should be reserved for intentionally low-effort or low-quality prompts.
+
+It is usually fine to go closer to "Ordinary" than "Creative" unless there is
+obvious creativity shown. The same goes for "Serious" compared to "Humorous".
+
+## Ranking and Classifying Answers
+
+Your rankings and classifications help decide which answers are the best and
+therefore which ones are prioritised in the final dataset. That makes answer
+classification a crucial task, serving as community moderation.
+
+The criteria for "quality" can largely be derived from how well the answer
+follows the guidance
+[above](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/guides#writing-answers-assistant-replies).
+The other labels are largely self-explanatory.
+
+Spam should be reserved for intentionally low-effort or low-quality answers.
+
+The bad answer label should be given when the answer does not respond to the
+prompt at all or is irrelevant.
+
+It is usually fine to go closer to "Ordinary" than "Creative" unless there is
+obvious creativity shown. The same goes for "Serious" compared to "Humorous".