1216: Add quickstart data collection guide (#1219)
* Create data_collection.md
@@ -0,0 +1,83 @@
# Data Collection Guide

## Writing Prompts (Initial Prompts or Prompter Replies)

Attempt to post a **diverse range** of prompts. When posting a prompt, consider
whether it is likely to be original. Collecting the same prompt many times is
usually not useful.

Prompts can be on any topic but should not be obscene or hateful. Prompts go
through a moderation process just like answers, so spammy or inappropriate
prompts will be removed.

Do **not** include personally identifiable information (PII) in prompts. This
will be moderated out.

Prompts should not include links which the assistant must open to effectively
answer the prompt.

### Prompts: Spelling and Grammar

For prompts, grammar and spelling are significantly less important than for
answers. If the grammar of a prompt makes it **ambiguous or unreadable**, it
should be considered low quality, but if it simply contains common typos, this
is not a problem.

## Writing Answers (Assistant Replies)

Answers should aim to match the tone of the prompt. Prompts which are to the
point should receive answers which are to the point. Answers should not be
unnecessarily formal and should be **conversational**, written the way a human
might write. Avoid jargon if possible.

Attempt to answer in a way which accurately fulfils the request in the prompt.
Low-quality answers will be moderated. Longer answers tend to be preferable
_unless_ the prompt is sufficiently simple to justify a shorter one.

Where possible, **avoid bias** or expressing subjective preferences in answers.
Do **not** write answers which encourage illegal activities or activities likely
to be harmful to the user or others around them.

### Answers: ChatGPT

**Do not copy answers from ChatGPT** or other OpenAI services. Doing so is
prohibited by the OpenAI Terms of Service. Obvious ChatGPT-generated answers
should be reported in moderation and rated low quality. Other AI systems may
allow such usage, but answers should still **not** be copied and pasted.

### Answers: Spelling and Grammar

Answers should be held to a high standard of grammar and spelling. It **is**
permissible for answers to contain sources or references provided they are
relevant and from reputable websites. However, this is **not** required.

## Ranking and Classifying Prompts

The criteria for "quality" largely come down to how well the prompt follows the
guidance
[above](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/guides#writing-prompts-initial-prompts-or-prompter-replies).
The other labels are largely self-explanatory.

The spam label should be reserved for intentionally low-effort or low-quality
prompts.

It is usually fine to go closer to "Ordinary" than "Creative" unless there is
obvious creativity shown. The same goes for "Serious" compared to "Humorous".

## Ranking and Classifying Answers

Your rankings and classifications will help decide which answers are the best
and therefore which ones are prioritised in the final dataset. This makes answer
classification a crucial task that serves as community moderation.

The criteria for "quality" largely come down to how well the answer follows the
guidance
[above](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/guides#writing-answers-assistant-replies).
The other labels are largely self-explanatory.

The spam label should be reserved for intentionally low-effort or low-quality
answers.

The bad answer label should be given when the answer does not respond to the
prompt at all or is irrelevant.

It is usually fine to go closer to "Ordinary" than "Creative" unless there is
obvious creativity shown. The same goes for "Serious" compared to "Humorous".