Merge pull request #324 from lewtun/add-datasets-docs

Add dataset creation guide
2026-06-28 16:20:34 +08:00 · 2023-01-04 14:47:56 +01:00
parent d83e75dbe4 6950fff781
commit 099d035d98
9 changed files with 696 additions and 1 deletions
@@ -26,7 +26,7 @@
 #
 # /WARNING!

-exclude: build|stubs|^bot/templates/$
+exclude: build|stubs|^bot/templates/$|openassistant/templates

 repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
@@ -0,0 +1,421 @@
+# OpenAssistant Datasets
+
+The datasets for this project are currently hosted as loading scripts on the
+[Open-Assistant organization](https://huggingface.co/OpenAssistant) the Hugging
+Face Hub. Each of them can be loaded by first installing the 🤗 Datasets
+library:
+
+```bash
+python -m pip install datasets
+```
+
+and then running:
+
+```python
+from datasets import load_dataset
+
+dataset = load_dataset("OpenAssistant/{dataset-name}")
+```
+
+We use this GitHub repository to accept new submissions and standardize quality
+control. See the instructions below if you'd like to contribute a new dataset to
+the project.
+
+## Adding a new dataset
+
+### 0. Pre-Requisites
+
+Install Git and create a GitHub account prior to implementing a dataset; you can
+follow instructions to install Git
+[here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
+
+You will also need at least Python 3.8+. If you are installing Python, we
+recommend downloading
+[Anaconda](https://docs.anaconda.com/anaconda/install/index.html) to curate a
+python environment with necessary packages. **We strongly recommend Python 3.8+
+for stability**.
+
+### 1. **Fork the OpenAssistant repository**
+
+Fork the
+`OpenAssistant`[repository](https://github.com/LAION-AI/Open-Assistant). To do
+this, click the link to the repository and click "Fork" in the upper-right
+corner. You should get an option to fork to your account, provided you are
+signed into Github.
+
+After you fork, clone the repository locally. You can do so as follows:
+
+```bash
+git clone git@github.com:<your_github_username>/OpenAssistant.git
+cd OpenAssistant  # enter the directory
+```
+
+Next, you want to set your `upstream` location to enable you to push/pull (add
+or receive updates). You can do so as follows:
+
+```bash
+git remote add upstream git@github.com:LAION-AI/Open-Assistant.git
+```
+
+You can optionally check that this was set properly by running the following
+command:
+
+```bash
+git remote -v
+```
+
+The output of this command should look as follows:
+
+```bash
+origin  git@github.com:<your_github_username>/Open-Assistant.git (fetch)
+origin  git@github.com:<your_github_username>/Open-Assistant.git (push)
+upstream        git@github.com:LAION-AI/Open-Assistant.git (fetch)
+upstream        git@github.com:LAION-AI/Open-Assistant.git (push)
+```
+
+If you do NOT have an `origin` for whatever reason, then run:
+
+```bash
+git remote add origin git@github.com:<your_github_username>/OpenAssistant.git
+```
+
+The goal of `upstream` is to keep your repository up-to-date to any changes that
+are made officially to the OpenAssistant repo. You can do this as follows by
+running the following commands:
+
+```
+git fetch upstream
+git pull
+```
+
+Provided you have no _merge conflicts_, this will ensure the repo stays
+up-to-date as you make changes. However, before you make changes, you should
+make a custom branch to implement your changes.
+
+You can make a new branch as such:
+
+```
+git checkout -b <dataset_name>
+```
+
+<p style="color:red"> <b> Please do not make changes on the master branch! </b></p>
+
+Always make sure you're on the right branch with the following command:
+
+```
+git branch
+```
+
+The correct branch will have a asterisk \* in front of it.
+
+### 2. **Create a development environment**
+
+You can make an environment in any way you choose to. We highlight two possible
+options:
+
+#### 2a) Create a conda environment
+
+The following instructions will create an Anaconda `openassistant` environment.
+
+- Install [anaconda](https://docs.anaconda.com/anaconda/install/) for your
+  appropriate operating system.
+- Run the following command while in the `biomedical` folder (you can pick your
+  python version):
+
+```bash
+conda create -n openassistant python=3.8  # Creates a conda env
+conda activate openassistant  # Activate your conda environment
+cd openassistant
+pip install -r dev-requirements.txt # Install this while in the openassistant folder
+```
+
+You can deactivate your environment at any time by either exiting your terminal
+or using `conda deactivate`.
+
+#### 2b) Create a venv environment
+
+Python 3.3+ has venv automatically installed; official information is found
+[here](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/).
+
+```
+python3 -m venv <your_env_name_here>
+source <your_env_name_here>/bin/activate  # activate environment
+cd openassistant
+pip install -r dev-requirements.txt # Install this while in the openassistant folder
+```
+
+Make sure your `pip` package points to your environment's source.
+
+### 3. Prepare a folder in `datasets` for your dataloader
+
+Make a new directory within the `openassistant/datasets` directory:
+
+```bash
+mkdir openassistant/datasets/<dataset_name>
+```
+
+**NOTE**: Please use snake_case, i.e. lowercase letters and underscores when
+choosing a `<dataset_name>`.
+
+Add an `__init__.py` file to this directory:
+
+```bash
+touch openassistant/datasets/<dataset_name>/__init__.py
+```
+
+Next, copy the `template.py` script and `hub.py` module of `templates` into your
+dataset folder. The `template.py` script has "TODOs" to fill in for your
+dataloading script:
+
+```bash
+cp templates/hub.py openassistant/datasets/<dataset_name>/
+cp templates/template.py openassistant/datasets/<dataset_name>/<dataset_name>.py
+```
+
+#### (Optional) Prepare local dataset files
+
+If your dataset files aren't publicly available via URLs (e.g. because you
+implemented a web scraper), you'll need to implement some extra logic to store
+and prepare the data locally prior to implementing a loading script in 🤗
+Datasets.
+
+To do so, first copy the template script for dataset creation:
+
+```bash
+cp templates/prepare.py openassistant/datasets/<dataset_name>/
+```
+
+Next, implement any logic that is needed to prepare a local version of the
+dataset files (by convention we store them in `datasets/<dataset_name>/data/`).
+Add any extra dependencies to a `requirements.txt` file and provide instructions
+on how to prepare the dataset files in a README:
+
+```bash
+touch openassistant/datasets/<dataset_name>/requirements.txt
+cp templates/README.py openassistant/datasets/<dataset_name>/
+```
+
+**Note:** Do not commit any dataset files to the OpenAssistant repo - all data
+will be hosted on the Hugging Face Hub. This step is needed for the project's
+data admins to be able to replicate the dataset creation process before pushing
+to the Hub.
+
+### 4. Implement your dataset
+
+To implement your dataloader, you will need to follow `template.py` and fill in
+all necessary TODOs. There are three key methods that are important:
+
+- `_info`: Specifies the schema of the expected dataloader
+- `_split_generators`: Downloads and extracts data for each split (e.g.
+  train/val/test) or associate local data with each split.
+- `_generate_examples`: Create examples from data that conform to each schema
+  defined in `_info`.
+
+For the `_info_` function, you will need to define `features` for your
+`DatasetInfo` object. For each dataset config, choose the right schema from our
+list of examples. You can find the schemas in the
+[schemas directory](openassistant/utils/schemas/).
+
+You will use this schema in the `_generate_examples` return value.
+
+Populate the information in the dataset according to this schema; some fields
+may be empty.
+
+#### Example scripts
+
+TODO
+
+#### Running & debugging
+
+You can run your data loader script during development by appending the
+following statement to your code ([templates/template.py](templates/template.py)
+already includes this):
+
+```python
+if __name__ == "__main__":
+    datasets.load_dataset(__file__)
+```
+
+If you want to use an interactive debugger during development, you will have to
+use `breakpoint()` instead of setting breakpoints directly in your IDE. Most
+IDEs will recognize the `breakpoint()` statement and pause there during
+debugging. If your preferred IDE doesn't support this, you can always run the
+script in your terminal and debug with `pdb`.
+
+### 5. Check if your dataloader works
+
+Make sure your dataset is implemented correctly by checking in python the
+following commands:
+
+```python
+from datasets import load_dataset
+
+data = load_dataset("openassistant/datasets/<dataset_name>/<dataset_name>.py", name="<dataset_name>_<schema>")
+```
+
+Run these commands from the top level of the `OpenAssistant` repo.
+
+### 6. Create a dataset card
+
+Copy and fill out the template dataset card:
+
+```bash
+cp templates/dataset_card.md openassistant/datasets/<dataset_name>/README.md
+```
+
+### 7. Format your code
+
+From the main directory, run the code quality checks via the following command:
+
+```
+pre-commit run --all-files
+```
+
+This runs the black formatter, isort, and lints to ensure that the code is
+readable and looks nice. Flake8 linting errors may require manual changes.
+
+### 8. Commit your changes
+
+First, commit your changes to the branch to "add" the work:
+
+```
+git add openassistant/datasets/<dataset_name>/*.py
+git commit -m "A message describing your commits"
+```
+
+Then, run the following commands to incorporate any new changes in the master
+branch of datasets as follows:
+
+```
+git fetch upstream
+git rebase upstream/main
+```
+
+**Run these commands in your custom branch**.
+
+Push these changes to **your fork** with the following command:
+
+```
+git push -u origin <dataset_name>
+```
+
+### 9. **Make a pull request**
+
+Make a Pull Request to implement your changes on the main repository
+[here](https://github.com/LAION-AI/Open-Assistant/pulls). To do so, click "New
+Pull Request". Then, choose your branch from your fork to push into "base:main".
+
+When opening a PR, please link the
+[issue](https://github.com/LAION-AI/Open-Assistant/issues) corresponding to your
+dataset using
+[closing keywords](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue)
+in the PR's description, e.g. `resolves #17`.
+
+## [Admins] Uploading a dataset to the Hugging Face Hub
+
+Uploading a new dataset from `openassistant/datasets/<dataset_name>` to the
+Hugging Face Hub typically involves the following steps:
+
+1. Setup
+2. Create a new dataset repository
+3. Copy a dataset loading script and dataset card
+4. Upload to the Hub
+
+### 1. Setup
+
+To upload a dataset to the OpenAssistant organization, you first need to:
+
+- Create a [Hugging Face account](https://huggingface.co/join) (it's free)
+- Join the [OpenAssistant organization](https://huggingface.co/OpenAssistant) by
+  clicking on the _Request to join this org_ button on the top right-hand side
+
+Next, check that you're correctly logged in and that `git-lfs` is installed so
+that the dataset can be uploaded. To log in, create a **write access token**
+that can be found under your Hugging Face profile (icon in the top right corner
+on [hf.co](http://hf.co/), then Settings -> Access Tokens -> User Access Tokens
+-> New Token. Alternatively, you can go to
+[your token settings](https://huggingface.co/settings/tokens) directly.
+
+Once you've created a token, run:
+
+```bash
+huggingface-cli login
+```
+
+in a terminal, or case you're working in a notebook
+
+```python
+from huggingface_hub import notebook_login
+
+notebook_login()
+```
+
+You can then copy-paste your token to log in locally.
+
+Next, let's make sure that `git-lfs` is correctly installed. To do so, simply
+run:
+
+```bash
+git-lfs -v
+```
+
+The output should show something like
+`git-lfs/2.13.2 (GitHub; linux amd64; go 1.15.4)`. If your console states that
+the `git-lfs` command was not found, please make sure to install it
+[here](https://git-lfs.github.com/) or simply via:
+
+```bash
+sudo apt-get install git-lfs
+git config --global user.email "you@example.com"
+git config --global user.name "Your Name"
+```
+
+The final step of the setup is to install the 🤗 Datasets library by running:
+
+```bash
+python -m pip install datasets
+```
+
+### 2. Create a new dataset repository
+
+Follow [this guide](https://huggingface.co/docs/datasets/upload_dataset) for
+instructions on creating a new dataset repo on the Hub. Use the same snake_case
+name as the dataset in `openassistant/datasets/<dataset_name>`.
+
+Once you've created the dataset repo, clone it by running:
+
+```bash
+git clone https://huggingface.co/datasets/OpenAssistant/<dataset_name>
+cd <dataset_name>
+```
+
+### 3. Copy a dataset loading script and dataset card
+
+Next, copy the loading script and dataset card to your repo:
+
+```bash
+cp openassistant/datasets/<dataset_name>/<dataset_name>.py .
+cp openassistant/datasets/<dataset_name>/README.md .
+```
+
+#### (Optional) Prepare local dataset files
+
+If the dataset files of `openassistant/datasets/<dataset_name>` aren't public,
+you'll need to run the `openassistant/datasets/<dataset_name>/prepare.py` script
+to create them. Store them in the same directory that is specified by the
+loading script (`data` by default).
+
+### 4. Upload to the Hub
+
+Once the dataset script and card are ready, use Git to push them to the Hub
+(along with any data files you may need).
+
+At this point, you can load the dataset by running:
+
+```python
+from datasets import load_dataset
+
+load_dataset("OpenAssistant/{dataset_name}")
+```
+
+Congratulations - you've now added a dataset to the OpenAssistant org!
@@ -0,0 +1 @@
+datasets>=2.8,<3.0.0
@@ -0,0 +1,10 @@
+# Dataset preparation instructions for {dataset_name}
+
+## Setup
+
+Add any installation details here.
+
+## Usage
+
+Explain how to run any scripts that involve preparing local dataset files, e.g.
+if the dataset files aren't public or are produced by a web scraper.
@@ -0,0 +1,28 @@
+---
+license: mit
+tags:
+- open-assistant
+- human-feedback
+- dialogue-modeling
+- language-modeling
+---
+
+# Dataset card for {dataset_name}
+
+This is a dataset card template for the [LAION-AI OpenAssistant project](https://github.com/LAION-AI/Open-Assistant). Fill out this template when adding a new dataset to the Hugging Face Hub.
+
+## Dataset summary
+
+[More information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+## Usage
+
+[More information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+## Source data
+
+[More information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+## Citation
+
+[More information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
@@ -0,0 +1,22 @@
+from dataclasses import dataclass
+
+import datasets
+
+
+@dataclass
+class OpenAssistantConfig(datasets.BuilderConfig):
+    """BuilderConfig for OpenAssistant datasets."""
+
+    name: str = None
+    version: datasets.Version = None
+    description: str = None
+    schema: str = None
+    subset_id: str = None
+
+
+lm_features = datasets.Features(
+    {
+        "text": datasets.Value("string"),
+        "meta": [datasets.Value("string")],
+    }
+)
@@ -0,0 +1,8 @@
+import typer
+
+def main(output_dir: str = "data"):
+    """Download and prepare the dataset for use."""
+    raise NotImplementedError
+
+if __name__ == "__main__":
+    typer.run(main)
@@ -0,0 +1,205 @@
+# Copyright 2023 The OpenAssistant Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This template serves as a starting point for contributing a dataset to the OpenAssistant repo.
+
+When modifying it for your dataset, look for TODO items that offer specific instructions.
+
+To create a dataset loading script you will create a class and implement 3 methods:
+  * `_info`: Establishes the schema for the dataset, and returns a datasets.DatasetInfo object.
+  * `_split_generators`: Downloads and extracts data for each split (e.g. train/val/test) or associates local data with each split.
+  * `_generate_examples`: Creates examples from data on disk that conform to each schema defined in `_info`.
+
+Full documentation on writing dataset loading scripts can be found here:
+https://huggingface.co/docs/datasets/dataset_script
+
+This template is adapted from the one provided by BigScience's BigBIO library:
+https://github.com/bigscience-workshop/biomedical/blob/main/templates/template.py
+
+TODO: Before submitting your script, delete this docstring and replace it with a description of your dataset.
+"""
+
+import os
+from typing import Dict, List, Tuple
+
+import datasets
+
+from .hub import OpenAssistantConfig
+
+# TODO: import the schema (i.e. features) that fits your dataset:
+from .hub import
+
+# TODO: Add BibTeX citation where appropriate
+_CITATION = """\
+@article{,
+  author    = {},
+  title     = {},
+  journal   = {},
+  volume    = {},
+  year      = {},
+  url       = {},
+  doi       = {},
+  biburl    = {},
+  bibsource = {}
+}
+"""
+
+# TODO: create a module level variable with your dataset name (should match the script name)
+#  E.g. The Pile: [dataset_name] --> the_pile
+_DATASETNAME = "[dataset_name]"
+# TODO: create a pretty display name for your dataset
+_DISPLAYNAME = "Dataset Name"
+
+# TODO: Add a description of the dataset here
+# You can copy an official description
+_DESCRIPTION = """\
+This dataset is designed for XXX NLP task.
+"""
+
+# TODO: Add a link to an official homepage for the dataset here (if possible)
+_HOMEPAGE = ""
+
+# TODO: Add the licence for the dataset here (if possible)
+# Note that this doesn't have to be a common open source license.
+# Some datasets have custom licenses. In this case, simply put the full license terms
+# into `_LICENSE`
+_LICENSE = ""
+
+# TODO: Add links to the URLs needed to download your dataset files.
+# This variable can be a relative path for datasets whose files need to be
+# manually downloaded or preprocessed in advance.
+
+# For publicly available datasets you will most likely end up passing these URLs to dl_manager in _split_generators.
+# However, if you need to access different files for each config you can have multiple entries in this dict.
+# This can be an arbitrarily nested dict/list of URLs (see below in `_split_generators` method)
+_URLS = {
+    _DATASETNAME: "url or list of urls or relative path like ./data ",
+}
+
+# TODO: add supported task by dataset. One dataset may support multiple tasks
+_SUPPORTED_TASKS = []  # example: [Tasks.TRANSLATION, Tasks.NAMED_ENTITY_RECOGNITION, Tasks.RELATION_EXTRACTION]
+
+# TODO: set this to a version that is associated with the dataset. if none exists use "1.0.0"
+#  This version doesn't have to be consistent with semantic versioning. Anything that is
+#  provided by the original dataset as a version goes.
+_VERSION = ""
+
+
+# TODO: Name the dataset class to match the script name using CamelCase instead of snake_case
+#  Append "Dataset" to the class name: ThePile --> ThePileDataset
+class NewDataset(datasets.GeneratorBasedBuilder):
+    """TODO: Short description of my dataset."""
+
+    VERSION = datasets.Version(_VERSION)
+
+    # You will be able to load each dataset with
+    # dataset = datasets.load_dataset('my_dataset')
+
+    # TODO: For each dataset, implement a config for each subset;
+    #  If a dataset contains more than one subset, implement a config for EACH of them.
+    #  Each of them should contain:
+    #   - name: should be unique for each dataset config eg. the_pile_[schema_name]
+    #   - version: VERSION
+    #   - description: one line description for the dataset
+    #   - schema: open_assistant_[schema_name]
+    #   - subset_id: subset id is the canonical name for the dataset (eg. the_pile)
+    #  where [schema_name] = (language_modeling)
+
+    BUILDER_CONFIGS = [
+        OpenAssistantConfig(
+            name=f"{_DATASETNAME}_[schema_name]",
+            version=VERSION,
+            description=f"OpenAssistant dataset config for {_DATASETNAME}",
+            schema_name="[schema_name]",
+            subset_id=_DATASETNAME,
+        )
+    ]
+
+    DEFAULT_CONFIG_NAME = _DATASETNAME
+
+    def _info(self) -> datasets.DatasetInfo:
+        # TODO: Implement the schema for your dataset here.
+        raise NotImplementedError()
+
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=features,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager) -> List[datasets.SplitGenerator]:
+        """Returns SplitGenerators."""
+        # TODO: This method is tasked with downloading/extracting the data and defining the splits depending on the configuration
+
+        # If you need to access a config choice, that will be in self.config.name
+
+        # dl_manager is a datasets.download.DownloadManager that can be used to download and extract URLs; many examples use the download_and_extract method; see the DownloadManager docs here: https://huggingface.co/docs/datasets/package_reference/builder_classes.html#datasets.DownloadManager
+
+        # dl_manager can accept any type of nested list/dict and will give back the same structure with the url replaced with the path to local files.
+
+        urls = _URLS[_DATASETNAME]
+        data_dir = dl_manager.download_and_extract(urls)
+
+        # Not all datasets have predefined canonical train/val/test splits.
+        # If your dataset has no predefined splits, use datasets.Split.TRAIN for all of the data.
+
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                # Whatever you put in gen_kwargs will be passed to _generate_examples
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, "train.jsonl"),
+                    "split": "train",
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, "test.jsonl"),
+                    "split": "test",
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                gen_kwargs={
+                    "filepath": os.path.join(data_dir, "dev.jsonl"),
+                    "split": "dev",
+                },
+            ),
+        ]
+
+    # method parameters are unpacked from `gen_kwargs` as given in `_split_generators`
+
+    # TODO: change the args of this function to match the keys in `gen_kwargs`. You may add any necessary kwargs.
+
+    def _generate_examples(self, filepath, split: str) -> Tuple[int, Dict]:
+        """Yields examples as (key, example) tuples."""
+        # TODO: This method handles input defined in _split_generators to yield (key, example) tuples from the dataset.
+
+        # The `key` is for legacy reasons (tfds) and is not important in itself, but must be unique for each example.
+
+        # NOTE: For local datasets you will have access to self.config.data_dir and self.config.data_files
+
+        if self.config.schema == "[schema_name]":
+            # TODO: yield (key, example) tuples in the given schema
+            for key, example in thing:
+                yield key, example
+
+# This allows you to run your dataloader with `python [dataset_name].py` during development
+# TODO: Remove this before making your PR
+if __name__ == "__main__":
+    datasets.load_dataset(__file__)