From cbf3c22560bbc5ecdbfa4f2e63c6498604c4e4b8 Mon Sep 17 00:00:00 2001
From: Vechtomov
Date: Thu, 2 Feb 2023 09:52:39 +0300
Subject: [PATCH] Add new datasets contribution guide (#1061)

* Add readme

* change link in CONTRIBUTING
---
 CONTRIBUTING.md                  |   2 +-
 openassistant/datasets/README.md | 108 +++++++++++++++++++++++++++++++
 2 files changed, 109 insertions(+), 1 deletion(-)
 create mode 100644 openassistant/datasets/README.md

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 39773b41..c0d77f69 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -140,4 +140,4 @@ automatically deploy the built release to the dev machine.
 ### Contribute a Dataset
 
 See
-[here](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/data/datasets.md)
+[here](https://github.com/LAION-AI/Open-Assistant/blob/main/openassistant/datasets/README.md)
diff --git a/openassistant/datasets/README.md b/openassistant/datasets/README.md
new file mode 100644
index 00000000..4d523f1f
--- /dev/null
+++ b/openassistant/datasets/README.md
@@ -0,0 +1,108 @@
+# **Datasets**
+
+This folder contains datasets loading scripts that are used to train
+OpenAssistant. The current list of datasets can be found
+[here](https://docs.google.com/spreadsheets/d/1NYYa6vHiRnk5kwnyYaCT0cBO62--Tm3w4ihdBtp4ISk).
+
+## **Adding a New Dataset**
+
+To add a new dataset to OpenAssistant, follow these steps:
+
+1. **Create an issue**: Create a new
+   [issue](https://github.com/LAION-AI/Open-Assistant/issues/new) and describe
+   your proposal for the new dataset.
+
+2. **Create a dataset on HuggingFace**: Create a dataset on
+   [HuggingFace](https://huggingface.co). See
+   [below](#creating-a-dataset-on-huggingface) for more details.
+
+3. **Make a pull request**: Add a new dataset loading script to this folder and
+   link the issue in the pull request description. For more information, see
+   [below](#making-a-pull-request).
+
+## **Creating a Dataset on HuggingFace**
+
+To create a new dataset on HuggingFace, follow these steps:
+
+#### 1. Convert your dataset file(s) to the Parquet format using the [pandas](https://pandas.pydata.org/) library:
+
+```python
+import pandas as pd
+
+# Create a pandas dataframe from your dataset file(s)
+df = pd.read_json(...) # or any other pandas reader that fits your data
+
+# Save the file in the Parquet format
+df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow")
+```
+
+#### 2. Install HuggingFace CLI
+
+```bash
+pip install huggingface_hub
+```
+
+#### 3. Log in to HuggingFace
+
+Use your [access token](https://huggingface.co/docs/hub/security-tokens) to
+log in:
+
+- Via terminal
+
+```bash
+huggingface-cli login
+```
+
+- In a Jupyter notebook
+
+```python
+from huggingface_hub import notebook_login
+notebook_login()
+```
+
+#### 4. Push the Parquet file to HuggingFace using the following code:
+
+```python
+from datasets import Dataset
+ds = Dataset.from_parquet("dataset.parquet")
+ds.push_to_hub("your_huggingface_name/dataset_name")
+```
+
+#### 5. Update the `README.md` file
+
+Update the `README.md` file of your dataset by visiting this link:
+https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md
+(replace `your_huggingface_name` and `dataset_name` with your own values)
+
+## **Making a Pull Request**
+
+#### 1. Fork this repository
+
+#### 2. Create a new branch in your fork
+
+#### 3. Add your dataset to the repository
+
+- Create a folder with the name of your dataset.
+- Add a loading script that loads your dataset from HuggingFace, for example:
+
+  ```python
+  from datasets import load_dataset
+
+  if __name__ == "__main__":
+      ds = load_dataset("your_huggingface_name/dataset_name")
+      print(ds["train"])
+  ```
+
+- Optionally, add any other files that describe your dataset and its creation,
+  such as a README, notebooks, scrapers, etc.
+
+#### 4. Stage your changes and run the pre-commit hook
+
+```bash
+pre-commit run
+```
+
+#### 5. Submit a pull request
+
+- Submit a pull request and include a link to the issue it resolves in the
+  description, for example: `Resolves #123`
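
Reviewer note, not part of the patch: the README asks contributors to substitute their HuggingFace name and dataset name into both the `push_to_hub` identifier and the README edit link. A minimal sketch of that substitution follows, using only the standard library. The helper names `repo_id` and `readme_edit_url` are hypothetical and do not come from the repository or from any HuggingFace library; the URL layout is copied from the link shown in the README.

```python
from urllib.parse import quote

# Base URL taken from the links in the README above.
HUB_URL = "https://huggingface.co"


def repo_id(user: str, dataset: str) -> str:
    """Join the two parts into the "name/dataset" form passed to push_to_hub.

    Hypothetical helper: it only rejects empty parts and stray slashes.
    """
    if not user or not dataset or "/" in user or "/" in dataset:
        raise ValueError("user and dataset must be non-empty and contain no '/'")
    return f"{user}/{dataset}"


def readme_edit_url(user: str, dataset: str) -> str:
    """Build the README edit link from the README's URL pattern."""
    # quote() leaves '/' intact by default and escapes unsafe characters.
    return f"{HUB_URL}/datasets/{quote(repo_id(user, dataset))}/edit/main/README.md"


if __name__ == "__main__":
    print(repo_id("your_huggingface_name", "dataset_name"))
    print(readme_edit_url("your_huggingface_name", "dataset_name"))
```

Deriving both strings from the same two inputs keeps the pushed dataset and the edited README pointing at the same place.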