1216: Add quickstart data collection guide (#1219)

* Create data_collection.md
Oliver Stanley
2023-02-05 23:30:30 +00:00
committed by GitHub
parent f6e7bc72fa
commit 196cc0c58e
+83
@@ -0,0 +1,83 @@
# Data Collection Guide
## Writing Prompts (Initial Prompts or Prompter Replies)
Attempt to post a **diverse range** of prompts. When posting a prompt, consider
whether it is likely to be original. Collecting the same prompt many times is
usually not useful.
Prompts can be on any topic but should not be obscene or hateful. Prompts go
through a moderation process just like answers, so spammy or inappropriate
prompts will be removed.
Do **not** include personally identifiable information (PII) in prompts. Any
PII will be moderated out.
Prompts should not include links which the assistant must open to effectively
answer the prompt.
### Prompts: Spelling and Grammar
For prompts, grammar and spelling are significantly less important than they
are for answers. If the grammar of a prompt makes it **ambiguous or
unreadable**, it should be considered low quality, but common typos alone are
not a problem.
## Writing Answers (Assistant Replies)
Answers should aim to match the tone of the prompt. Prompts which are to the
point should receive answers which are to the point. Answers should not be
unnecessarily formal and should be **conversational**, written the way a human
might write. Avoid jargon if possible.
Attempt to answer in a way which accurately fulfils the request in the prompt.
Low-quality answers will be moderated out. Longer answers tend to be preferable
_unless_ the prompt is simple enough to justify a shorter one.
Where possible, **avoid bias** or expressing subjective preferences in answers.
Do **not** write answers which encourage illegal activities or activities likely
to be harmful to the user or others around them.
### Answers: ChatGPT
**Do not copy answers from ChatGPT** or other OpenAI services. Doing so is
prohibited by the OpenAI Terms of Service. Answers that are obviously
ChatGPT-generated should be reported in moderation and rated low quality. Other
AI systems may allow such usage, but answers should still **not** be copied and
pasted.
### Answers: Spelling and Grammar
Answers should be held to a high standard of grammar and spelling. It **is**
permissible for answers to contain sources or references provided they are
relevant and from reputable websites. However, this is **not** required.
## Ranking and Classifying Prompts
The criteria for "quality" largely come down to how well the prompt follows
the guidance
[above](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/guides#writing-prompts-initial-prompts-or-prompter-replies).
The other labels are largely self-explanatory.
The spam label should be reserved for intentionally low-effort or low-quality
prompts.
It is usually fine to go closer to "Ordinary" than "Creative" unless there is
obvious creativity shown. The same goes for "Serious" compared to "Humorous".
## Ranking and Classifying Answers
Your rankings and classification will help decide which answers are the best and
therefore which ones are prioritised in the final dataset. That means the answer
classification is a crucial task, serving as community moderation.
The criteria for "quality" largely come down to how well the answer follows
the guidance
[above](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/guides#writing-answers-assistant-replies).
The other labels are largely self-explanatory.
The spam label should be reserved for intentionally low-effort or low-quality
answers.
The bad answer label should be given when the answer does not respond to the
prompt at all or is irrelevant.
It is usually fine to go closer to "Ordinary" than "Creative" unless there is
obvious creativity shown. The same goes for "Serious" compared to "Humorous".