mirror of https://github.com/wassname/Open-Assistant.git synced 2026-06-27 16:10:30 +08:00


Datasets

This folder contains the dataset loading scripts used to train OpenAssistant. The current list of datasets can be found here.

Adding a New Dataset

To add a new dataset to OpenAssistant, follow these steps:

1. Create an issue: Open a new issue describing your proposal for the new dataset.
2. Create a dataset on HuggingFace: Upload your data as a HuggingFace dataset, following the steps in "Creating a Dataset on HuggingFace" below.
3. Make a pull request: Add a dataset loading script to this folder and link the issue in the pull request description, following the steps in "Making a Pull Request" below.

Creating a Dataset on HuggingFace

To create a new dataset on HuggingFace, follow these steps:

1. Convert your dataset file(s) to the Parquet format using the pandas library:

import pandas as pd

# Create a pandas dataframe from your dataset file(s)
df = pd.read_json(...) # or any other way

# Save the file in the Parquet format
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow")

2. Install the HuggingFace CLI

The huggingface-cli command is provided by the huggingface_hub package:

pip install huggingface_hub

3. Log in to HuggingFace

Use your access token to log in:

In a terminal:

huggingface-cli login

In a Jupyter notebook:

from huggingface_hub import notebook_login
notebook_login()

4. Push the Parquet file to HuggingFace using the following code:

from datasets import Dataset

# Load the Parquet file into a Dataset object
ds = Dataset.from_parquet("dataset.parquet")
# Upload it to the Hub under your account
ds.push_to_hub("your_huggingface_name/dataset_name")

5. Update the `README.md` file

Edit the README.md file of your dataset at https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md, substituting your own HuggingFace username and dataset name.
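To avoid typos when filling in the link, the URL can be built from the two names. This is a small sketch: the helper `readme_edit_url` is a hypothetical name introduced here, and the URL pattern is taken directly from the link above.

```python
from urllib.parse import quote


def readme_edit_url(user_name: str, dataset_name: str) -> str:
    """Build the web-editor URL for a dataset's README.md (hypothetical helper)."""
    if not user_name or not dataset_name:
        raise ValueError("both user_name and dataset_name are required")
    # Percent-encode each part so stray characters cannot break the path.
    repo_id = f"{quote(user_name, safe='')}/{quote(dataset_name, safe='')}"
    return f"https://huggingface.co/datasets/{repo_id}/edit/main/README.md"


print(readme_edit_url("your_huggingface_name", "dataset_name"))
```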

Making a Pull Request

1. Fork this repository

2. Create a new branch in your fork

3. Add your dataset to the repository

Create a folder with the name of your dataset.

Add a loading script that loads your dataset from HuggingFace, for example:

from datasets import load_dataset

if __name__ == "__main__":
    ds = load_dataset("your_huggingface_name/dataset_name")
    print(ds["train"])

Optionally, add any other files that describe your dataset and its creation, such as a README, notebooks, scrapers, etc.
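The folder and loading script from this step could be scaffolded with a short helper. This is a sketch under stated assumptions: the function `scaffold_dataset_folder` and the choice to name the script after the dataset are hypothetical, not a convention required by this repository. The script template reuses the loading example above.

```python
import tempfile
from pathlib import Path

# Template based on the loading example above; {repo_id} is filled in per dataset.
LOADING_SCRIPT_TEMPLATE = '''from datasets import load_dataset

if __name__ == "__main__":
    ds = load_dataset("{repo_id}")
    print(ds["train"])
'''


def scaffold_dataset_folder(datasets_root: str, dataset_name: str, repo_id: str) -> Path:
    """Create <datasets_root>/<dataset_name>/ with a loading script (hypothetical helper)."""
    folder = Path(datasets_root) / dataset_name
    folder.mkdir(parents=True, exist_ok=True)
    # Assumption: the script is named after the dataset; pick any name you prefer.
    script_path = folder / f"{dataset_name}.py"
    if not script_path.exists():  # never overwrite an existing script
        script_path.write_text(LOADING_SCRIPT_TEMPLATE.format(repo_id=repo_id), encoding="utf-8")
    return script_path


if __name__ == "__main__":
    # Demonstrate in a temporary directory instead of a real checkout.
    with tempfile.TemporaryDirectory() as root:
        path = scaffold_dataset_folder(root, "dataset_name", "your_huggingface_name/dataset_name")
        print(path.read_text(encoding="utf-8"))
```

In a real checkout, `datasets_root` would point at this folder of the repository, and the generated script can then be edited by hand.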

4. Stage your changes and run the pre-commit hook

pre-commit run

5. Submit a pull request

Submit a pull request and include a link to the issue it resolves in the description, for example: Resolves #123
