Add UnifiedQA datasets that are not in xP3 (#666)

* Add UnifiedQA datasets that are not in xP3
2026-06-27 16:10:30 +08:00 · 2023-01-14 01:57:22 -08:00
parent 90e71f3da2
commit c36a7ae9da
2 changed files with 528 additions and 0 deletions
@@ -0,0 +1,515 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "00b2848c",
+   "metadata": {},
+   "source": [
+    "# UnifiedQA Dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b2e3c95c",
+   "metadata": {},
+   "source": [
+    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/data-argumentation/UnifiedQA.ipynb)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d81932b9",
+   "metadata": {},
+   "source": [
+    "The purpose of this notebook is to download datasets from the UnifiedQA dataset collection and convert them into a format that can be used for training the OpenAssistant.\n",
+    "\n",
+    "The UnifiedQA repo can be found here: https://github.com/allenai/unifiedqa\n",
+    "\n",
+    "If you extend or use this work, please cite the relevant papers:\n",
+    "```\n",
+    "@article{khashabi2022unifiedqa,\n",
+    "    title={UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training},\n",
+    "    author={Khashabi, Daniel and Kordi, Yeganeh and Hajishirzi, Hannaneh},\n",
+    "    journal={arXiv preprint arXiv:2202.12359},\n",
+    "    year={2022}\n",
+    "}\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "74d91250",
+   "metadata": {},
+   "source": [
+    "## Compare xP3 and UnifiedQA"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d7ca591",
+   "metadata": {},
+   "source": [
+    "As many of the datasets that are in UnifiedQA are already in xP3, we do a simple (and incomplete) check to limit the number of datasets that we download."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 82,
+   "id": "95b57b2c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "xp3_list = [\n",
+    "    \"Code Miscellaneous\",\n",
+    "    \"CodeComplex\",\n",
+    "    \"Docstring Corpus\",\n",
+    "    \"GreatCode\",\n",
+    "    \"State Changes\",\n",
+    "    \"Closed-book QA\",\n",
+    "    \"Hotpot QA\",\n",
+    "    \"Trivia QA\",\n",
+    "    \"Web Questions\",\n",
+    "    \"Wiki QA\",\n",
+    "    \"Extractive QA\",\n",
+    "    \"Adversarial QA\",\n",
+    "    \"CMRC2018\",\n",
+    "    \"DRCD\",\n",
+    "    \"DuoRC\",\n",
+    "    \"MLQA\",\n",
+    "    \"Quoref\",\n",
+    "    \"ReCoRD\",\n",
+    "    \"ROPES\",\n",
+    "    \"SQuAD v2\",\n",
+    "    \"xQuAD\",\n",
+    "    \"TyDI QA\",\n",
+    "    \"Primary\",\n",
+    "    \"Goldp\",\n",
+    "    \"Multiple-Choice QA\",\n",
+    "    \"ARC\",\n",
+    "    \"C3\",\n",
+    "    \"CoS-E\",\n",
+    "    \"Cosmos\",\n",
+    "    \"DREAM\",\n",
+    "    \"MultiRC\",\n",
+    "    \"OpenBookQA\",\n",
+    "    \"PiQA\",\n",
+    "    \"QUAIL\",\n",
+    "    \"QuaRel\",\n",
+    "    \"QuaRTz\",\n",
+    "    \"QASC\",\n",
+    "    \"RACE\",\n",
+    "    \"SciQ\",\n",
+    "    \"Social IQA\",\n",
+    "    \"Wiki Hop\",\n",
+    "    \"WiQA\",\n",
+    "    \"Paraphrase Identification\",\n",
+    "    \"MRPC\",\n",
+    "    \"PAWS\",\n",
+    "    \"PAWS-X\",\n",
+    "    \"QQP\",\n",
+    "    \"Program Synthesis\",\n",
+    "    \"APPS\",\n",
+    "    \"CodeContests\",\n",
+    "    \"JupyterCodePairs\",\n",
+    "    \"MBPP\",\n",
+    "    \"NeuralCodeSearch\",\n",
+    "    \"XLCoST\",\n",
+    "    \"Structure-to-text\",\n",
+    "    \"Common Gen\",\n",
+    "    \"Wiki Bio\",\n",
+    "    \"Sentiment\",\n",
+    "    \"Amazon\",\n",
+    "    \"App Reviews\",\n",
+    "    \"IMDB\",\n",
+    "    \"Rotten Tomatoes\",\n",
+    "    \"Yelp\",\n",
+    "    \"Simplification\",\n",
+    "    \"BiSECT\",\n",
+    "    \"Summarization\",\n",
+    "    \"CNN Daily Mail\",\n",
+    "    \"Gigaword\",\n",
+    "    \"MultiNews\",\n",
+    "    \"SamSum\",\n",
+    "    \"Wiki-Lingua\",\n",
+    "    \"XLSum\",\n",
+    "    \"XSum\",\n",
+    "    \"Topic Classification\",\n",
+    "    \"AG News\",\n",
+    "    \"DBPedia\",\n",
+    "    \"TNEWS\",\n",
+    "    \"TREC\",\n",
+    "    \"CSL\",\n",
+    "    \"Translation\",\n",
+    "    \"Flores-200\",\n",
+    "    \"Tatoeba\",\n",
+    "    \"Word Sense disambiguation\",\n",
+    "    \"WiC\",\n",
+    "    \"XL-WiC\",\n",
+    "    \"Evaluation datasets (included in xP3all except for HumanEval)\",\n",
+    "    \"Natural Language Inference\",\n",
+    "    \"ANLI\",\n",
+    "    \"CB\",\n",
+    "    \"RTE\",\n",
+    "    \"XNLI\",\n",
+    "    \"Coreference Resolution\",\n",
+    "    \"Winogrande\",\n",
+    "    \"XWinograd\",\n",
+    "    \"Program Synthesis\",\n",
+    "    \"HumanEval\",\n",
+    "    \"Sentence Completion\",\n",
+    "    \"COPA\",\n",
+    "    \"Story Cloze\",\n",
+    "    \"XCOPA\",\n",
+    "    \"XStoryCloze\",\n",
+    "    \"Additional xP3all datasets\",\n",
+    "    \"Coreference Resolution\",\n",
+    "    \"WSC (Fixed)\",\n",
+    "    \"Sentence Completion\",\n",
+    "    \"HellaSwag\",\n",
+    "    \"Translation\",\n",
+    "    \"MultiEurlex\",\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 83,
+   "id": "cba85ada",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "unifiedQA_list = [\n",
+    "    \"SQuAD 1.1\",\n",
+    "    \"SQuAD 2\",\n",
+    "    \"NewsQA\",\n",
+    "    \"Quoref\",\n",
+    "    \"ROPES\",\n",
+    "    \"NarrativeQA\",\n",
+    "    \"DROP\",\n",
+    "    \"NaturalQuestions\",\n",
+    "    \"MCTest\",\n",
+    "    \"RACE\",\n",
+    "    \"OpenBookQA\",\n",
+    "    \"ARC\",\n",
+    "    \"CommonsenseQA\",\n",
+    "    \"QASC\",\n",
+    "    \"PhysicalIQA\",\n",
+    "    \"SocialIQA\",\n",
+    "    \"Winogrande\",\n",
+    "    \"BoolQ\",\n",
+    "    \"MultiRC (yes/no)\",\n",
+    "    \"BoolQ-NP\",\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "78fc6ee9",
+   "metadata": {},
+   "source": [
+    "Now that we've defined the list of datasets (which we found in the paper for UnifiedQA and on the Hugging Face page of xP3) we can do the simple check."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 84,
+   "id": "9cf91317",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "SQuAD 1.1\n",
+      "SQuAD 2\n",
+      "NewsQA\n",
+      "NarrativeQA\n",
+      "DROP\n",
+      "NaturalQuestions\n",
+      "MCTest\n",
+      "CommonsenseQA\n",
+      "PhysicalIQA\n",
+      "SocialIQA\n",
+      "BoolQ\n",
+      "MultiRC (yes/no)\n",
+      "BoolQ-NP\n"
+     ]
+    }
+   ],
+   "source": [
+    "for ds in unifiedQA_list:\n",
+    "    if ds not in xp3_list:\n",
+    "        print(ds)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "93eeabad",
+   "metadata": {},
+   "source": [
+    "The SQuAD dataset is actually covered (with a slightly different name) but the other ones should be downloaded."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a8c98078",
+   "metadata": {},
+   "source": [
+    "# OpenAssistant Data Scheme"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2731f88f",
+   "metadata": {},
+   "source": [
+    "We will use the data scheme that can be found in the docs for Open-Assistant. This code is taken from the StackExchange notebook."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 85,
+   "id": "d35ab066",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from typing import TypeVar, List, Dict, Any, Literal\n",
+    "from json import JSONEncoder\n",
+    "\n",
+    "T = TypeVar(\"T\", bound=\"ConversationTreeNode\")\n",
+    "\n",
+    "\n",
+    "class ConversationTreeNode:\n",
+    "    text: str  # The text of the node\n",
+    "    role: Literal[\"prompter\", \"assistant\"]  # Whether the node is a user prompt/follow-up or an assistant response\n",
+    "    children: List[T]  # The children of the node (if you have a linear conversation, this will be of length 0 or 1)\n",
+    "    metadata: Dict[str, Any]  # Node metadata (see below)\n",
+    "\n",
+    "    def __init__(\n",
+    "        self, text: str, role: Literal[\"prompter\", \"assistant\"], children: List[T], metadata: Dict[str, Any]\n",
+    "    ) -> None:\n",
+    "        self.text = text\n",
+    "        self.role = role\n",
+    "        self.children = children\n",
+    "        self.metadata = metadata\n",
+    "\n",
+    "\n",
+    "class ConversationTree:\n",
+    "    root: ConversationTreeNode  # The node containing the initial prompt\n",
+    "    metadata: Dict[str, Any]  # Tree metadata, different from root node metadata.\n",
+    "\n",
+    "    def __init__(self, root: ConversationTreeNode, metadata: Dict[str, Any]) -> None:\n",
+    "        self.root = root\n",
+    "        self.metadata = metadata\n",
+    "\n",
+    "\n",
+    "# subclass JSONEncoder\n",
+    "class TreeEncoder(JSONEncoder):\n",
+    "    def default(self, o):\n",
+    "        return o.__dict__"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e7457bae",
+   "metadata": {},
+   "source": [
+    "# Download and convert"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "54b0fd63",
+   "metadata": {},
+   "source": [
+    "We firstly import pandas, which we'll use to download the TSV files from Google Cloud Storage, and any other libraries that we'll need."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 86,
+   "id": "9317d4b4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import json"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "62dc4e18",
+   "metadata": {},
+   "source": [
+    "The following is a simple function to take the data (which has two columns) and convert it to a tree with a root note (question) and one child (answer)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 87,
+   "id": "963e0d92",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def convert_unified_qa(dataset_url):\n",
+    "    # download using pandas\n",
+    "    ds = pd.read_csv(dataset_url, on_bad_lines=\"skip\", names=[\"Question\", \"Answer\"], sep=\"\\t\")\n",
+    "    # get name for metatdata\n",
+    "    ds_name = dataset_url.split(\"/unifiedqa/data/\")[1].split(\"/\")[0]\n",
+    "\n",
+    "    # create conversation forest\n",
+    "    conversation_forest = []\n",
+    "    for item in ds.itertuples():\n",
+    "        # build nodes and tree\n",
+    "        root = ConversationTreeNode(text=item.Question, role=\"prompter\", children=[], metadata=None)\n",
+    "        child = ConversationTreeNode(text=item.Answer, role=\"assistant\", children=[], metadata=None)\n",
+    "        root.children.append(child)\n",
+    "        conversation_tree = ConversationTree(root=root, metadata={\"dataset\": ds_name})\n",
+    "\n",
+    "        conversation_forest.append(conversation_tree)\n",
+    "\n",
+    "    conversation_forest_json = [\n",
+    "        json.loads(TreeEncoder().encode(conversation_tree)) for conversation_tree in conversation_forest\n",
+    "    ]\n",
+    "\n",
+    "    print(json.dumps(conversation_forest_json, indent=4), file=open(f\"./{ds_name}.json\", \"w+\"))\n",
+    "\n",
+    "    print(\"*****\", ds_name, \"****\")\n",
+    "    print(ds.head(2))\n",
+    "    print(\"....\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e4448c9a",
+   "metadata": {},
+   "source": [
+    "We now define the list of URLs that we want to download. These URLs were found by manually going UnifiedQA'S Google Cloud bucket: https://console.cloud.google.com/storage/browser/unifiedqa/data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 88,
+   "id": "43e188b6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "urls = [\n",
+    "    \"https://storage.googleapis.com/unifiedqa/data/natural_questions/train.tsv\",\n",
+    "    \"https://storage.googleapis.com/unifiedqa/data/narrativeqa/train.tsv\",\n",
+    "    \"https://storage.googleapis.com/unifiedqa/data/newsqa/train.tsv\",\n",
+    "    \"https://storage.googleapis.com/unifiedqa/data/drop/train.tsv\",\n",
+    "    \"https://storage.googleapis.com/unifiedqa/data/commonsenseqa/train.tsv\",\n",
+    "    \"https://storage.googleapis.com/unifiedqa/data/physical_iqa/train.tsv\",\n",
+    "    \"https://storage.googleapis.com/unifiedqa/data/social_iqa/train.tsv\",\n",
+    "    \"https://storage.googleapis.com/unifiedqa/data/boolq/train.tsv\",\n",
+    "    \"https://storage.googleapis.com/unifiedqa/data/boolq_np/train.tsv\",\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 77,
+   "id": "b39bb154",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "***** natural_questions ****\n",
+      "                                            Question  \\\n",
+      "0  which is the most common use of opt-in e-mail ...   \n",
+      "1           how i.met your mother who is the mother?   \n",
+      "\n",
+      "                                              Answer  \n",
+      "0  a newsletter sent to an advertising firm's cus...  \n",
+      "1                                    Tracy McConnell  \n",
+      "....\n",
+      "***** narrativeqa ****\n",
+      "                                            Question  \\\n",
+      "0  Who is Miss Delmer? \\n  At Madeline Hall, an o...   \n",
+      "1  Who is Miss Delmer? \\n  At Madeline Hall, an o...   \n",
+      "\n",
+      "                                              Answer  \n",
+      "0   the elderly spinster aunt of the Earl de Vers...  \n",
+      "1                      She's Captail Delmar's aunt.   \n",
+      "....\n",
+      "***** newsqa ****\n",
+      "                                            Question      Answer\n",
+      "0  How many Americans are part of the federal foo...  31 million\n",
+      "1  How much did Sean Callebs live on? \\n (CNN) --...        $176\n",
+      "....\n",
+      "***** drop ****\n",
+      "                                            Question Answer\n",
+      "0  How many points did the buccaneers need to tie...      3\n",
+      "1  How many field goals did the Lions score? \\n T...      2\n",
+      "....\n",
+      "***** commonsenseqa ****\n",
+      "                                            Question           Answer\n",
+      "0  The sanctions against the school were a punish...           ignore\n",
+      "1  Sammy wanted to go to where the people were.  ...  populated areas\n",
+      "....\n",
+      "***** physical_iqa ****\n",
+      "                                            Question  \\\n",
+      "0  When boiling butter, when it's ready, you can ...   \n",
+      "1  To permanently attach metal legs to a chair, y...   \n",
+      "\n",
+      "                                              Answer  \n",
+      "0                                 Pour it into a jar  \n",
+      "1   Weld the metal together to get it to stay fir...  \n",
+      "....\n",
+      "***** social_iqa ****\n",
+      "                                            Question           Answer\n",
+      "0  How would Others feel as a result? \\n (A) like...   like attending\n",
+      "1  What will Others want to do next? \\n (A) disag...      get to work\n",
+      "....\n",
+      "***** boolq ****\n",
+      "                                            Question Answer\n",
+      "0  can you buy beer at a grocery store in pa? \\n ...    yes\n",
+      "1  is the baby in fuller house a twin? \\n (List o...    yes\n",
+      "....\n",
+      "***** boolq_np ****\n",
+      "                                            Question Answer\n",
+      "0  do iran and afghanistan write the same languag...    yes\n",
+      "1  do iran and afghanistan read the same language...    yes\n",
+      "....\n"
+     ]
+    }
+   ],
+   "source": [
+    "for url in urls:\n",
+    "    convert_unified_qa(url)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f0309674",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,13 @@
+# UnifiedQA Downloader
+
+UnifiedQA is a notebook that downloads data from UnifiedQA's Google Cloud bucket
+and converts it into OpenAssistant Data Scheme formats. Files for each dataset
+in the UnifiedQA collection (excluding the ones already in xP3) are saved to
+JSON.
+
+---
+
+## Contributing
+
+Feel free to contribute to this notebook. It's not perfect and additional
+functionality is planned.