Add new datasets contribution guide (#1061)

* Add readme
* Change link in CONTRIBUTING
2026-06-27 16:10:30 +08:00 · 2023-02-02 09:52:39 +03:00
parent 5f7013fb81
commit cbf3c22560
2 changed files with 109 additions and 1 deletions
@@ -140,4 +140,4 @@ automatically deploy the built release to the dev machine.
 ### Contribute a Dataset

 See
-[here](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/data/datasets.md)
+[here](https://github.com/LAION-AI/Open-Assistant/blob/main/openassistant/datasets/README.md)
@@ -0,0 +1,108 @@
+# **Datasets**
+
+This folder contains dataset loading scripts that are used to train
+OpenAssistant. The current list of datasets can be found
+[here](https://docs.google.com/spreadsheets/d/1NYYa6vHiRnk5kwnyYaCT0cBO62--Tm3w4ihdBtp4ISk).
+
+## **Adding a New Dataset**
+
+To add a new dataset to OpenAssistant, follow these steps:
+
+1. **Create an issue**: Create a new
+   [issue](https://github.com/LAION-AI/Open-Assistant/issues/new) and describe
+   your proposal for the new dataset.
+
+2. **Create a dataset on HuggingFace**: Create a dataset on
+   [HuggingFace](https://huggingface.co). See
+   [below](#creating-a-dataset-on-huggingface) for more details.
+
+3. **Make a pull request**: Add a new dataset loading script to this folder and
+   link the issue in the pull request description. For more information, see
+   [below](#making-a-pull-request).
+
+## **Creating a Dataset on HuggingFace**
+
+To create a new dataset on HuggingFace, follow these steps:
+
+#### 1. Convert your dataset file(s) to the Parquet format using the [pandas](https://pandas.pydata.org/) library:
+
+```python
+import pandas as pd
+
+# Create a pandas dataframe from your dataset file(s)
+df = pd.read_json(...) # or any other way
+
+# Save the file in the Parquet format
+df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow")
+```
+
+#### 2. Install HuggingFace CLI
+
+```bash
+# The huggingface-cli command ships with the huggingface_hub package
+pip install huggingface_hub
+```
+
+#### 3. Log in to HuggingFace
+
+Use your [access token](https://huggingface.co/docs/hub/security-tokens) to
+log in:
+
+- Via terminal
+
+```bash
+huggingface-cli login
+```
+
+- In a Jupyter notebook
+
+```python
+from huggingface_hub import notebook_login
+notebook_login()
+```
+
+#### 4. Push the Parquet file to HuggingFace using the following code:
+
+```python
+from datasets import Dataset
+ds = Dataset.from_parquet("dataset.parquet")
+ds.push_to_hub("your_huggingface_name/dataset_name")
+```
+
+#### 5. Update the `README.md` file
+
+Update the `README.md` file of your dataset by visiting this link:
+https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md
+(replace `your_huggingface_name` and `dataset_name` with your own HuggingFace
+name and dataset name)
+
+## **Making a Pull Request**
+
+#### 1. Fork this repository
+
+#### 2. Create a new branch in your fork
+
+#### 3. Add your dataset to the repository
+
+- Create a folder with the name of your dataset.
+- Add a loading script that loads your dataset from HuggingFace, for example:
+
+  ```python
+  from datasets import load_dataset
+
+  if __name__ == "__main__":
+      ds = load_dataset("your_huggingface_name/dataset_name")
+      print(ds["train"])
+  ```
+
+- Optionally, add any other files that describe your dataset and its creation,
+  such as a README, notebooks, scrapers, etc.
+
+#### 4. Stage your changes and run the pre-commit hook
+
+```bash
+# pre-commit only checks staged files, so stage your changes first
+git add .
+pre-commit run
+```
+
+#### 5. Submit a pull request
+
+- Submit a pull request and include a link to the issue it resolves in the
+  description, for example: `Resolves #123`
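
The guide above reuses the placeholder `your_huggingface_name/dataset_name` in three places: the `push_to_hub` call, the README edit link, and the loading script. As a minimal sketch, assuming you want to fill in the placeholder once and derive the rest, the helpers below build the repository ID and the README edit URL. The function names `dataset_repo_id` and `readme_edit_url` are hypothetical and not part of the repository or of any HuggingFace library; the URL pattern is the one given in step 5.

```python
from urllib.parse import quote

HUB_BASE = "https://huggingface.co"


def dataset_repo_id(user: str, dataset: str) -> str:
    """Return the "<user>/<dataset>" string used by push_to_hub and load_dataset."""
    user, dataset = user.strip(), dataset.strip()
    if not user or not dataset:
        raise ValueError("both the HuggingFace name and the dataset name are required")
    if "/" in user or "/" in dataset:
        raise ValueError("names must not contain '/'")
    return f"{user}/{dataset}"


def readme_edit_url(user: str, dataset: str) -> str:
    """Return the README edit link from step 5 with the placeholders filled in."""
    repo_id = dataset_repo_id(user, dataset)
    # quote() leaves "/" intact and escapes any character that is unsafe in a URL path
    return f"{HUB_BASE}/datasets/{quote(repo_id)}/edit/main/README.md"


if __name__ == "__main__":
    # Hypothetical values; substitute your own names.
    print(dataset_repo_id("your_huggingface_name", "dataset_name"))
    print(readme_edit_url("your_huggingface_name", "dataset_name"))
```

Checking the names before uploading should catch an empty or slash-containing name early, before it reaches `push_to_hub` or ends up in a broken link in the pull request.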