Datasets

This folder contains the dataset loading scripts used to train OpenAssistant. The current list of datasets can be found here.

Adding a New Dataset

To add a new dataset to OpenAssistant, follow these steps:

  1. Create an issue: Open a new issue that describes the dataset you propose to add.

  2. Create a dataset on HuggingFace: Upload your data as a dataset on HuggingFace. See "Creating a Dataset on HuggingFace" below for details.

  3. Make a pull request: Add a new dataset loading script to this folder and link the issue in the pull request description. See "Making a Pull Request" below for details.

Creating a Dataset on HuggingFace

To create a new dataset on HuggingFace, follow these steps:

1. Convert your dataset file(s) to the Parquet format using the pandas library (the pyarrow engine used below must also be installed):

import pandas as pd

# Create a pandas dataframe from your dataset file(s)
df = pd.read_json(...) # or any other way

# Save the file in the Parquet format
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow")
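
Before converting, it can help to check that every row has the fields you expect. The following is a minimal sketch, assuming the source data is a JSON Lines file; the helper names and the column names `INSTRUCTION` and `RESPONSE` are hypothetical and only serve as an illustration.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def find_problems(rows, required_columns):
    """List rows whose required columns are missing or empty (intended for text columns)."""
    problems = []
    for i, row in enumerate(rows):
        for col in required_columns:
            if not row.get(col):
                problems.append(f"row {i}: missing or empty '{col}'")
    return problems


def jsonl_to_parquet(src_path, dst_path, required_columns):
    """Validate a JSON Lines file, then write it as Parquet. Returns the row count."""
    # Imported lazily so the helpers above also work without pandas installed.
    import pandas as pd

    with open(src_path, encoding="utf-8") as f:
        rows = parse_jsonl(f)
    problems = find_problems(rows, required_columns)
    if problems:
        # Report only the first few problems to keep the message readable.
        raise ValueError("; ".join(problems[:5]))
    df = pd.DataFrame(rows)
    # Same call as in the step above; requires pyarrow.
    df.to_parquet(dst_path, row_group_size=100, engine="pyarrow")
    return len(df)
```

For example, `jsonl_to_parquet("dataset.jsonl", "dataset.parquet", ["INSTRUCTION", "RESPONSE"])` would refuse to write the file if any row lacks one of the two (hypothetical) columns.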

2. Install the HuggingFace CLI

The huggingface-cli command ships with the huggingface_hub package:

pip install huggingface_hub

3. Log in to HuggingFace

Use your access token to log in:

  • Via the terminal
huggingface-cli login
  • In a Jupyter notebook
from huggingface_hub import notebook_login
notebook_login()

4. Push the Parquet file to HuggingFace using the following code:

from datasets import Dataset
ds = Dataset.from_parquet("dataset.parquet")
ds.push_to_hub("your_huggingface_name/dataset_name")
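
A typo in the repository name is easy to make, so the push can be wrapped in a small check. This is only a sketch: the function names are hypothetical, and the pattern is a deliberately simplified approximation of the `your_huggingface_name/dataset_name` form, not the full naming rules that HuggingFace applies.

```python
import re

# Simplified pattern for "<user_or_org>/<dataset_name>". HuggingFace applies
# its own naming rules, which this sketch does not try to reproduce.
_REPO_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def looks_like_repo_id(repo_id):
    """Return True if repo_id has the shape 'name/dataset' with no spaces or extra slashes."""
    return _REPO_ID.fullmatch(repo_id) is not None


def push_parquet(parquet_path, repo_id):
    """Load a Parquet file and push it to HuggingFace after a basic name check."""
    if not looks_like_repo_id(repo_id):
        raise ValueError(f"expected 'your_huggingface_name/dataset_name', got {repo_id!r}")
    # Imported lazily so the name check can be used without the datasets library.
    from datasets import Dataset

    ds = Dataset.from_parquet(parquet_path)
    ds.push_to_hub(repo_id)  # requires a prior login (step 3)
    return ds
```

Calling `push_parquet("dataset.parquet", "your_huggingface_name/dataset_name")` then performs the same two calls as the snippet above.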

5. Update the README.md file

Edit the README.md file of your dataset at https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md, replacing your_huggingface_name and dataset_name with your own values.
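
The README.md of a HuggingFace dataset usually starts with a YAML metadata header, followed by a Markdown description. The sketch below generates such a file as a string; the function name and the example values are hypothetical, and only three common metadata keys (`license`, `language`, `pretty_name`) are shown.

```python
import json


def render_dataset_card(pretty_name, license_id, languages, description):
    """Build README.md text: a YAML metadata header followed by a Markdown body."""
    header = ["---", f"license: {license_id}", "language:"]
    header += [f"- {code}" for code in languages]
    # json.dumps yields a double-quoted string, which is also a valid YAML scalar,
    # so names containing ':' or '#' do not break the header.
    header += [f"pretty_name: {json.dumps(pretty_name)}", "---"]
    body = [f"# {pretty_name}", "", description.strip(), ""]
    return "\n".join(header + [""] + body)
```

The returned text, for example from `render_dataset_card("My Dataset", "mit", ["en"], "Short description.")`, can be pasted into the README editor linked above.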

Making a Pull Request

1. Fork this repository

2. Create a new branch in your fork

3. Add your dataset to the repository

  • Create a folder with the name of your dataset.

  • Add a loading script that loads your dataset from HuggingFace, for example:

    from datasets import load_dataset
    
    if __name__ == "__main__":
        ds = load_dataset("your_huggingface_name/dataset_name")
        print(ds["train"])
    
  • Optionally, add any other files that describe your dataset and its creation, such as a README, notebooks, scrapers, etc.
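
A loading script can also print a short summary of each split, which makes it easier for reviewers to see what the dataset contains. This is a sketch rather than a required layout: the function names are hypothetical, and it assumes, like the example above, that the dataset has a `train` split.

```python
def summarize_splits(splits):
    """Return one line per split with its row count and column names."""
    lines = []
    for name, split in splits.items():
        columns = ", ".join(split.column_names)
        lines.append(f"{name}: {split.num_rows} rows, columns: {columns}")
    return lines


def main(repo_id):
    """Load the dataset from HuggingFace and print a summary of its splits."""
    # Imported lazily so summarize_splits can be reused without the datasets library.
    from datasets import load_dataset

    splits = load_dataset(repo_id)  # a mapping of split name to dataset
    if "train" not in splits:
        raise SystemExit(f"{repo_id} has no 'train' split")
    for line in summarize_splits(splits):
        print(line)


# To run as a script, call: main("your_huggingface_name/dataset_name")
```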

4. Stage your changes and run the pre-commit hook

pre-commit run

5. Submit a pull request

  • Include a link to the issue your pull request resolves in its description, for example: Resolves #123