13 KiB
OpenAssistant Datasets
The datasets for this project are currently hosted as loading scripts on the Open-Assistant organization the Hugging Face Hub. Each of them can be loaded by first installing the 🤗 Datasets library:
python -m pip install datasets
and then running:
from datasets import load_dataset
dataset = load_dataset("OpenAssistant/{dataset-name}")
We use this GitHub repository to accept new submissions and standardize quality control. See the instructions below if you'd like to contribute a new dataset to the project.
Adding a new dataset
0. Pre-Requisites
Install Git and create a GitHub account prior to implementing a dataset; you can follow instructions to install Git here.
You will also need at least Python 3.8+. If you are installing Python, we recommend downloading Anaconda to curate a python environment with necessary packages. We strongly recommend Python 3.8+ for stability.
1. Fork the OpenAssistant repository
Fork the
OpenAssistantrepository. To do
this, click the link to the repository and click "Fork" in the upper-right
corner. You should get an option to fork to your account, provided you are
signed into Github.
After you fork, clone the repository locally. You can do so as follows:
git clone git@github.com:<your_github_username>/OpenAssistant.git
cd OpenAssistant # enter the directory
Next, you want to set your upstream location to enable you to push/pull (add
or receive updates). You can do so as follows:
git remote add upstream git@github.com:LAION-AI/Open-Assistant.git
You can optionally check that this was set properly by running the following command:
git remote -v
The output of this command should look as follows:
origin git@github.com:<your_github_username>/Open-Assistant.git (fetch)
origin git@github.com:<your_github_username>/Open-Assistant.git (push)
upstream git@github.com:LAION-AI/Open-Assistant.git (fetch)
upstream git@github.com:LAION-AI/Open-Assistant.git (push)
If you do NOT have an origin for whatever reason, then run:
git remote add origin git@github.com:<your_github_username>/OpenAssistant.git
The goal of upstream is to keep your repository up-to-date to any changes that
are made officially to the OpenAssistant repo. You can do this as follows by
running the following commands:
git fetch upstream
git pull
Provided you have no merge conflicts, this will ensure the repo stays up-to-date as you make changes. However, before you make changes, you should make a custom branch to implement your changes.
You can make a new branch as such:
git checkout -b <dataset_name>
Please do not make changes on the master branch!
Always make sure you're on the right branch with the following command:
git branch
The correct branch will have a asterisk * in front of it.
2. Create a development environment
You can make an environment in any way you choose to. We highlight two possible options:
2a) Create a conda environment
The following instructions will create an Anaconda openassistant environment.
- Install anaconda for your appropriate operating system.
- Run the following command while in the
biomedicalfolder (you can pick your python version):
conda create -n openassistant python=3.8 # Creates a conda env
conda activate openassistant # Activate your conda environment
cd openassistant
pip install -r dev-requirements.txt # Install this while in the openassistant folder
You can deactivate your environment at any time by either exiting your terminal
or using conda deactivate.
2b) Create a venv environment
Python 3.3+ has venv automatically installed; official information is found here.
python3 -m venv <your_env_name_here>
source <your_env_name_here>/bin/activate # activate environment
cd openassistant
pip install -r dev-requirements.txt # Install this while in the openassistant folder
Make sure your pip package points to your environment's source.
3. Prepare a folder in datasets for your dataloader
Make a new directory within the openassistant/datasets directory:
mkdir openassistant/datasets/<dataset_name>
NOTE: Please use snake_case, i.e. lowercase letters and underscores when
choosing a <dataset_name>.
Add an __init__.py file to this directory:
touch openassistant/datasets/<dataset_name>/__init__.py
Next, copy the template.py script and hub.py module of templates into your
dataset folder. The template.py script has "TODOs" to fill in for your
dataloading script:
cp templates/hub.py openassistant/datasets/<dataset_name>/
cp templates/template.py openassistant/datasets/<dataset_name>/<dataset_name>.py
(Optional) Prepare local dataset files
If your dataset files aren't publicly available via URLs (e.g. because you implemented a web scraper), you'll need to implement some extra logic to store and prepare the data locally prior to implementing a loading script in 🤗 Datasets.
To do so, first copy the template script for dataset creation:
cp templates/prepare.py openassistant/datasets/<dataset_name>/
Next, implement any logic that is needed to prepare a local version of the
dataset files (by convention we store them in datasets/<dataset_name>/data/).
Add any extra dependencies to a requirements.txt file and provide instructions
on how to prepare the dataset files in a README:
touch openassistant/datasets/<dataset_name>/requirements.txt
cp templates/README.py openassistant/datasets/<dataset_name>/
Note: Do not commit any dataset files to the OpenAssistant repo - all data will be hosted on the Hugging Face Hub. This step is needed for the project's data admins to be able to replicate the dataset creation process before pushing to the Hub.
4. Implement your dataset
To implement your dataloader, you will need to follow template.py and fill in
all necessary TODOs. There are three key methods that are important:
_info: Specifies the schema of the expected dataloader_split_generators: Downloads and extracts data for each split (e.g. train/val/test) or associate local data with each split._generate_examples: Create examples from data that conform to each schema defined in_info.
For the _info_ function, you will need to define features for your
DatasetInfo object. For each dataset config, choose the right schema from our
list of examples. You can find the schemas in the
schemas directory.
You will use this schema in the _generate_examples return value.
Populate the information in the dataset according to this schema; some fields may be empty.
Example scripts
TODO
Running & debugging
You can run your data loader script during development by appending the following statement to your code (templates/template.py already includes this):
if __name__ == "__main__":
datasets.load_dataset(__file__)
If you want to use an interactive debugger during development, you will have to
use breakpoint() instead of setting breakpoints directly in your IDE. Most
IDEs will recognize the breakpoint() statement and pause there during
debugging. If your preferred IDE doesn't support this, you can always run the
script in your terminal and debug with pdb.
5. Check if your dataloader works
Make sure your dataset is implemented correctly by checking in python the following commands:
from datasets import load_dataset
data = load_dataset("openassistant/datasets/<dataset_name>/<dataset_name>.py", name="<dataset_name>_<schema>")
Run these commands from the top level of the OpenAssistant repo.
6. Create a dataset card
Copy and fill out the template dataset card:
cp templates/dataset_card.md openassistant/datasets/<dataset_name>/README.md
7. Format your code
From the main directory, run the code quality checks via the following command:
pre-commit run --all-files
This runs the black formatter, isort, and lints to ensure that the code is readable and looks nice. Flake8 linting errors may require manual changes.
8. Commit your changes
First, commit your changes to the branch to "add" the work:
git add openassistant/datasets/<dataset_name>/*.py
git commit -m "A message describing your commits"
Then, run the following commands to incorporate any new changes in the master branch of datasets as follows:
git fetch upstream
git rebase upstream/main
Run these commands in your custom branch.
Push these changes to your fork with the following command:
git push -u origin <dataset_name>
9. Make a pull request
Make a Pull Request to implement your changes on the main repository here. To do so, click "New Pull Request". Then, choose your branch from your fork to push into "base:main".
When opening a PR, please link the
issue corresponding to your
dataset using
closing keywords
in the PR's description, e.g. resolves #17.
[Admins] Uploading a dataset to the Hugging Face Hub
Uploading a new dataset from openassistant/datasets/<dataset_name> to the
Hugging Face Hub typically involves the following steps:
- Setup
- Create a new dataset repository
- Copy a dataset loading script and dataset card
- Upload to the Hub
1. Setup
To upload a dataset to the OpenAssistant organization, you first need to:
- Create a Hugging Face account (it's free)
- Join the OpenAssistant organization by clicking on the Request to join this org button on the top right-hand side
Next, check that you're correctly logged in and that git-lfs is installed so
that the dataset can be uploaded. To log in, create a write access token
that can be found under your Hugging Face profile (icon in the top right corner
on hf.co, then Settings -> Access Tokens -> User Access Tokens
-> New Token. Alternatively, you can go to
your token settings directly.
Once you've created a token, run:
huggingface-cli login
in a terminal, or case you're working in a notebook
from huggingface_hub import notebook_login
notebook_login()
You can then copy-paste your token to log in locally.
Next, let's make sure that git-lfs is correctly installed. To do so, simply
run:
git-lfs -v
The output should show something like
git-lfs/2.13.2 (GitHub; linux amd64; go 1.15.4). If your console states that
the git-lfs command was not found, please make sure to install it
here or simply via:
sudo apt-get install git-lfs
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
The final step of the setup is to install the 🤗 Datasets library by running:
python -m pip install datasets
2. Create a new dataset repository
Follow this guide for
instructions on creating a new dataset repo on the Hub. Use the same snake_case
name as the dataset in openassistant/datasets/<dataset_name>.
Once you've created the dataset repo, clone it by running:
git clone https://huggingface.co/datasets/OpenAssistant/<dataset_name>
cd <dataset_name>
3. Copy a dataset loading script and dataset card
Next, copy the loading script and dataset card to your repo:
cp openassistant/datasets/<dataset_name>/<dataset_name>.py .
cp openassistant/datasets/<dataset_name>/README.md .
(Optional) Prepare local dataset files
If the dataset files of openassistant/datasets/<dataset_name> aren't public,
you'll need to run the openassistant/datasets/<dataset_name>/prepare.py script
to create them. Store them in the same directory that is specified by the
loading script (data by default).
4. Upload to the Hub
Once the dataset script and card are ready, use Git to push them to the Hub (along with any data files you may need).
At this point, you can load the dataset by running:
from datasets import load_dataset
load_dataset("OpenAssistant/{dataset_name}")
Congratulations - you've now added a dataset to the OpenAssistant org!