Add new datasets contribution guide (#1061)
* Add readme
* change link in CONTRIBUTING
@@ -140,4 +140,4 @@ automatically deploy the built release to the dev machine.
### Contribute a Dataset

See
-[here](https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/data/datasets.md)
+[here](https://github.com/LAION-AI/Open-Assistant/blob/main/openassistant/datasets/README.md)

@@ -0,0 +1,108 @@
# **Datasets**

This folder contains the dataset loading scripts used to train
OpenAssistant. The current list of datasets can be found
[here](https://docs.google.com/spreadsheets/d/1NYYa6vHiRnk5kwnyYaCT0cBO62--Tm3w4ihdBtp4ISk).

## **Adding a New Dataset**

To add a new dataset to OpenAssistant, follow these steps:

1. **Create an issue**: Create a new
   [issue](https://github.com/LAION-AI/Open-Assistant/issues/new) and describe
   your proposal for the new dataset.

2. **Create a dataset on HuggingFace**: Create a dataset on
   [HuggingFace](https://huggingface.co). See
   [below](#creating-a-dataset-on-huggingface) for more details.

3. **Make a pull request**: Add a new dataset loading script to this folder and
   link the issue in the pull request description. For more information, see
   [below](#making-a-pull-request).

## **Creating a Dataset on HuggingFace**

To create a new dataset on HuggingFace, follow these steps:

#### 1. Convert your dataset file(s) to the Parquet format using the [pandas](https://pandas.pydata.org/) library:

```python
import pandas as pd

# Create a pandas dataframe from your dataset file(s)
df = pd.read_json(...) # or any other way

# Save the file in the Parquet format
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow")
```
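
A Parquet file stores a single schema, so it can help to check that every record has the same fields before converting. The following is a minimal sketch, assuming the source data is in JSON Lines form (one JSON object per line). The helper name `count_field_sets` and the sample records are hypothetical and not part of this repository.

```python
import json
from collections import Counter


def count_field_sets(lines):
    """Count how many records share each (sorted) set of field names."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[tuple(sorted(record))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample records; in practice, pass an open file object.
    sample = [
        '{"question": "What is Parquet?", "answer": "A columnar file format."}',
        '{"question": "A record without an answer"}',
    ]
    for fields, n in count_field_sets(sample).items():
        print(n, fields)
```

If more than one field set is reported, some records lack fields that others have, which is worth resolving before writing the Parquet file.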

#### 2. Install the HuggingFace CLI

The `huggingface-cli` command ships with the `huggingface_hub` package:

```bash
pip install huggingface_hub
```

#### 3. Log in to HuggingFace

Use your [access token](https://huggingface.co/docs/hub/security-tokens) to
log in:

- Via the terminal

```bash
huggingface-cli login
```

- In a Jupyter notebook

```python
from huggingface_hub import notebook_login
notebook_login()
```

#### 4. Push the Parquet file to HuggingFace using the following code:

```python
from datasets import Dataset
ds = Dataset.from_parquet("dataset.parquet")
ds.push_to_hub("your_huggingface_name/dataset_name")
```

#### 5. Update the `README.md` file

Update the `README.md` file of your dataset by visiting this link:
https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md
(replace `your_huggingface_name` and `dataset_name` with your HuggingFace name
and dataset name)
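
The edit link follows a fixed pattern, so it can be built from the two names. A minimal sketch, assuming the URL layout shown above; the helper name `readme_edit_url` is hypothetical.

```python
from urllib.parse import quote


def readme_edit_url(user: str, dataset: str) -> str:
    """Build the README.md edit URL for a dataset, following the pattern above."""
    if not user or not dataset:
        raise ValueError("both the user name and the dataset name are required")
    # Percent-encode each path segment in case a name contains unusual characters.
    return (
        "https://huggingface.co/datasets/"
        f"{quote(user, safe='')}/{quote(dataset, safe='')}/edit/main/README.md"
    )


if __name__ == "__main__":
    print(readme_edit_url("your_huggingface_name", "dataset_name"))
```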

## **Making a Pull Request**

#### 1. Fork this repository

#### 2. Create a new branch in your fork

#### 3. Add your dataset to the repository

- Create a folder with the name of your dataset.
- Add a loading script that loads your dataset from HuggingFace, for example:

```python
from datasets import load_dataset

if __name__ == "__main__":
    ds = load_dataset("your_huggingface_name/dataset_name")
    print(ds["train"])
```

- Optionally, add any other files that describe your dataset and its creation,
  such as a README, notebooks, scrapers, etc.
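
A loading script can also fail early when the uploaded dataset does not have the expected layout. The following is a sketch only, assuming the dataset has a `train` split. The column names in `EXPECTED_COLUMNS` and the helpers `missing_columns` and `load_and_check` are hypothetical and should be adapted to your dataset.

```python
# Hypothetical column names; replace them with the columns of your dataset.
EXPECTED_COLUMNS = ("question", "answer")


def missing_columns(column_names, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from column_names."""
    return [name for name in expected if name not in column_names]


def load_and_check(repo_id, split="train"):
    """Load one split from HuggingFace and raise if expected columns are missing."""
    # Imported lazily so missing_columns can be used without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(repo_id, split=split)
    missing = missing_columns(ds.column_names)
    if missing:
        raise ValueError(f"{repo_id} is missing columns: {missing}")
    return ds
```

Calling `load_and_check("your_huggingface_name/dataset_name")` under the script's `if __name__ == "__main__":` guard would take the place of the bare `load_dataset` call.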

#### 4. Stage your changes and run the pre-commit hook

```bash
pre-commit run
```

#### 5. Submit a pull request

- Submit a pull request and include a link to the issue it resolves in the
  description, for example: `Resolves #123`