Lewis Tunstall ca8d3c8f8d Fix style
2023-01-04 13:45:05 +11:00

Datasets

The datasets for this project are currently hosted under the Open-Assistant organization on the Hugging Face Hub. Each of them can be loaded by first installing the 🤗 Datasets library:

python -m pip install datasets

and then running:

from datasets import load_dataset

dataset = load_dataset("OpenAssistant/{dataset-name}")
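If you load several of the project's datasets, a small wrapper can save some typing. The following is only a sketch: `repo_id` and `load_project_dataset` are hypothetical helper names, not part of 🤗 Datasets, and the `OpenAssistant/` prefix is taken from the snippet above.

```python
ORG = "OpenAssistant"  # organization prefix used in the snippet above


def repo_id(dataset_name: str, org: str = ORG) -> str:
    """Build the "<org>/<name>" identifier that load_dataset expects."""
    name = dataset_name.strip()
    if not name or "/" in name:
        raise ValueError(f"expected a bare dataset name, got {dataset_name!r}")
    return f"{org}/{name}"


def load_project_dataset(dataset_name: str, **kwargs):
    """Load one of the project's datasets from the Hub (needs network access)."""
    # Imported lazily so repo_id() works even without 🤗 Datasets installed.
    from datasets import load_dataset

    # Extra keyword arguments (for example split="train") are passed through.
    return load_dataset(repo_id(dataset_name), **kwargs)
```

For example, `load_project_dataset("my_dataset")` would request `OpenAssistant/my_dataset`, where `my_dataset` is a placeholder name.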

See the instructions below if you'd like to contribute a new dataset to the project.

Adding a new dataset

Uploading a dataset to the Hugging Face Hub

Adding a new dataset for the OpenAssistant project typically involves the following steps:

  1. Setup
  2. Create a new dataset repository
  3. Create a dataset loading script and dataset card
  4. Upload to the Hub

1. Setup

To upload a dataset to the OpenAssistant organization, you first need a Hugging Face account and membership in the organization.

By default, your role in the organization is contributor, which gives you write access to any datasets that you create (and only those). If you'd like to make changes to other datasets, open a discussion or Hub pull request.

Next, check that you're correctly logged in and that git-lfs is installed so that the dataset can be uploaded. To log in, create an access token with write permission from your Hugging Face profile: click the icon in the top right corner on hf.co, then go to Settings -> Access Tokens -> User Access Tokens -> New Token. Alternatively, you can go to your token settings directly.

Once you've created a token, run:

huggingface-cli login

in a terminal or, in case you're working in a notebook:

from huggingface_hub import notebook_login

notebook_login()

You can then copy-paste your token to log in locally.
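To double-check that the login worked, you can ask the Hub who you are. The sketch below uses `whoami` from `huggingface_hub`; the helper names `org_names` and `check_login` are hypothetical, and the response layout (a `name` key plus an `orgs` list of dictionaries with a `name` key) is an assumption that may differ between library versions.

```python
def org_names(user_info: dict) -> list:
    """Extract organization names from a whoami()-style dictionary."""
    # Assumed layout: {"name": ..., "orgs": [{"name": ...}, ...]}
    return [org["name"] for org in user_info.get("orgs", []) if "name" in org]


def check_login(org: str = "OpenAssistant") -> bool:
    """Return True if the stored token belongs to a member of `org`."""
    # Imported lazily; calling this function needs network access.
    from huggingface_hub import whoami

    info = whoami()  # expected to raise if no valid token is stored
    print(f"Logged in as {info.get('name')}")
    return org in org_names(info)
```

If `check_login()` returns `False`, the token is valid but the account does not appear to belong to the organization yet.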

Next, let's make sure that git-lfs is correctly installed. To do so, simply run:

git-lfs -v

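The output should show something like git-lfs/2.13.2 (GitHub; linux amd64; go 1.15.4). If your console reports that the git-lfs command was not found, install it by following the official Git LFS installation instructions or, on Debian or Ubuntu, simply via the commands below, which also set the Git identity used for your commits: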

sudo apt-get install git-lfs
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
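The same check can be scripted, which is handy in setup scripts or notebooks. This is a minimal sketch using only the standard library; the function names are hypothetical, and the parser assumes the output format shown above (`git-lfs/<major>.<minor>.<patch> ...`).

```python
import re
import shutil
import subprocess


def parse_git_lfs_version(output: str):
    """Return (major, minor, patch) from `git-lfs -v` output, or None."""
    match = re.search(r"git-lfs/(\d+)\.(\d+)\.(\d+)", output)
    return tuple(int(part) for part in match.groups()) if match else None


def installed_git_lfs_version():
    """Run `git-lfs -v` if the executable is on PATH; return None otherwise."""
    if shutil.which("git-lfs") is None:
        return None
    result = subprocess.run(["git-lfs", "-v"], capture_output=True, text=True)
    # Look at both streams in case the version is not printed to stdout.
    return parse_git_lfs_version(result.stdout + result.stderr)
```

A return value of `None` from `installed_git_lfs_version()` means git-lfs was not found or its output could not be parsed.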

The final step of the setup is to install the 🤗 Datasets library by running:

python -m pip install datasets

2. Create a new dataset repository

We've created a Gradio application on Hugging Face Spaces that will create a new dataset repository for you with the following template files:

  • A dataset loading script
  • A dataset card

Simply provide the name of the new dataset and your access token from Step 1, and you're good to go!

3. Create a dataset loading script and dataset card

If you've followed Step 2, a template dataset loading script will have been created in your new dataset repository. Edit the script according to the 🤗 Datasets guide on writing a dataset loading script. Then fill out the missing details in the dataset card (the README.md file).
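In a loading script built on `datasets.GeneratorBasedBuilder`, most of the dataset-specific work happens in the `_generate_examples` method, which yields `(key, example)` pairs. The sketch below shows that logic as a standalone function so it can be tried without the library. It assumes the raw data is in JSON Lines format, and the field names `text` and `source` are hypothetical; the template script in your repository may be organized differently.

```python
import json


def generate_examples(lines):
    """Yield (key, example) pairs from an iterable of JSON Lines strings."""
    for key, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines; keys stay unique but may have gaps
        record = json.loads(line)
        # "text" and "source" are hypothetical field names for illustration.
        yield key, {"text": record["text"], "source": record.get("source", "")}
```

In a real script, the body of this function would typically live inside `_generate_examples` and iterate over an opened data file, with the yielded fields matching the features declared for the dataset.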

4. Upload to the Hub

Once the dataset script and card are ready, use Git to push them to the Hub (along with any data files you may need).
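The Git step can also be driven from Python, for instance at the end of a data preparation script. This is a sketch under stated assumptions: you have already cloned the dataset repository locally, the remote is called `origin`, and the branch is `main`. The function names are hypothetical.

```python
import subprocess


def push_commands(message: str, branch: str = "main"):
    """Return the Git commands that stage, commit, and push all changes."""
    return [
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "push", "origin", branch],  # assumes a remote named "origin"
    ]


def push_to_hub(repo_dir: str, message: str = "Add dataset script and card"):
    """Run the commands inside a local clone of the dataset repository."""
    for command in push_commands(message):
        # check=True stops at the first failing command.
        subprocess.run(command, cwd=repo_dir, check=True)
```

Calling `push_to_hub("path/to/your/clone")` runs the three commands in order; the path is a placeholder for wherever you cloned the repository.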

At this point, you can load the dataset by running:

from datasets import load_dataset

load_dataset("OpenAssistant/my_dataset")

Congratulations - you've now added a dataset to the OpenAssistant org!