Open-Assistant/model/supervised_finetuning/custom_datasets

Dataset collections overview:

Currently, the datasets can be divided into 3 classes:

  • language knowledge

    • summarization

    • translation

  • dialogue: don't let the user know you are a robot

  • STEM: knowledge about the world

    • coding

    • world knowledge (ideally we want to handle this via prefix context)

Issues and TODO:

  • as the number of datasets grows, how can we reduce how often this section needs updating?

  • ideally, we can update the config YAML and new datasets will be downloaded from the hub

    • one possible idea is to upload the transformed format of these datasets to the OA hub
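
A minimal sketch of what the config-driven idea could look like, assuming each dataset is described by one entry in the config. Everything here is hypothetical: the key names (`name`, `hub_id`, `category`, `split`), the `DatasetSpec` class, the helper functions, and the placeholder hub IDs are not part of the existing code. The config is shown as the dict a YAML parser would return, and the download step assumes the Hugging Face `datasets` library is installed.

```python
from dataclasses import dataclass

# Hypothetical config, written as the dict a YAML parser would produce.
# Names and hub IDs are placeholders, not real datasets.
EXAMPLE_CONFIG = {
    "datasets": [
        {
            "name": "example_summarization",
            "hub_id": "example-org/example_summarization",
            "category": "language knowledge",
        },
        {
            "name": "example_dialogue",
            "hub_id": "example-org/example_dialogue",
            "category": "dialogue",
            "split": "train",
        },
    ]
}

REQUIRED_KEYS = {"name", "hub_id", "category"}


@dataclass(frozen=True)
class DatasetSpec:
    """One config entry: where the transformed dataset lives and its class."""

    name: str
    hub_id: str
    category: str
    split: str = "train"


def parse_dataset_specs(config):
    """Validate the config entries and turn them into DatasetSpec objects."""
    specs = []
    for entry in config.get("datasets", []):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"dataset entry is missing keys: {sorted(missing)}")
        specs.append(DatasetSpec(**entry))
    return specs


def load_from_hub(spec):
    """Download one dataset from the hub (needs network access)."""
    # Imported lazily so parsing the config works without the library installed.
    from datasets import load_dataset

    return load_dataset(spec.hub_id, split=spec.split)


specs = parse_dataset_specs(EXAMPLE_CONFIG)
```

With this layout, adding a dataset would only mean adding one config entry, provided its transformed format has already been uploaded to the hub.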