# Dataset collections overview

Currently, the datasets can be divided into three classes:

- language knowledge

  - summarization

  - translation

- dialogue: don't let the user know you are a robot

- STEM: knowledge about the world

  - coding

  - world knowledge (ideally we want to handle this via prefix context)

Issues and TODO:

- As the number of datasets grows, how can we avoid having to update this
  section so often?

- Ideally, we would only update the config YAML, and new datasets would be
  downloaded from the hub.

  - One possible idea is to upload the transformed format of these datasets to
    the OA hub.
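
The config-driven idea in the TODO list could look roughly like the sketch below. It is only an illustration, not the repository's actual loader: the config layout, the `DatasetSpec` class, the helper names, and the `your-org/...` hub ids are all hypothetical. The loader is passed in as a function so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical config, shaped like the result of parsing a YAML file.
# Names and hub ids are placeholders, not real repositories.
CONFIG: Dict[str, Any] = {
    "datasets": [
        {"name": "example_summarization",
         "hub_id": "your-org/example-summarization",
         "category": "summarization"},
        {"name": "example_translation",
         "hub_id": "your-org/example-translation",
         "category": "translation"},
        {"name": "example_coding",
         "hub_id": "your-org/example-coding",
         "category": "coding"},
    ]
}


@dataclass(frozen=True)
class DatasetSpec:
    """One config entry: a local name, a hub id, and a category."""
    name: str
    hub_id: str
    category: str


def parse_config(config: Dict[str, Any]) -> List[DatasetSpec]:
    """Turn the raw config mapping into typed dataset specs."""
    return [DatasetSpec(**entry) for entry in config["datasets"]]


def group_by_category(specs: List[DatasetSpec]) -> Dict[str, List[str]]:
    """Derive the category overview from the config instead of the README."""
    groups: Dict[str, List[str]] = {}
    for spec in specs:
        groups.setdefault(spec.category, []).append(spec.name)
    return groups


def load_all(specs: List[DatasetSpec],
             loader: Callable[[str], Any]) -> Dict[str, Any]:
    """Load every configured dataset with the supplied loader function."""
    return {spec.name: loader(spec.hub_id) for spec in specs}


if __name__ == "__main__":
    specs = parse_config(CONFIG)
    # Stub loader for demonstration; it only echoes the hub id.
    loaded = load_all(specs, loader=lambda hub_id: f"loaded:{hub_id}")
    print(group_by_category(specs))
    print(loaded)
```

Assuming "hub" refers to the Hugging Face Hub, the stub loader could be swapped for `datasets.load_dataset`, and adding a dataset would then only mean adding one entry to the config. The category overview in this README could likewise be generated from `group_by_category` rather than maintained by hand.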