diff --git a/notebooks/data-argumentation/UnifiedQA.ipynb b/notebooks/data-argumentation/UnifiedQA.ipynb new file mode 100644 index 00000000..4b93f596 --- /dev/null +++ b/notebooks/data-argumentation/UnifiedQA.ipynb @@ -0,0 +1,515 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "00b2848c", + "metadata": {}, + "source": [ + "# UnifiedQA Dataset" + ] + }, + { + "cell_type": "markdown", + "id": "b2e3c95c", + "metadata": {}, + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/data-argumentation/UnifiedQA.ipynb)" + ] + }, + { + "cell_type": "markdown", + "id": "d81932b9", + "metadata": {}, + "source": [ + "The purpose of this notebook is to download datasets from the UnifiedQA dataset collection and convert them into a format that can be used for training the OpenAssistant.\n", + "\n", + "The UnifiedQA repo can be found here: https://github.com/allenai/unifiedqa\n", + "\n", + "If you extend or use this work, please cite the relevant papers:\n", + "```\n", + "@article{khashabi2022unifiedqa,\n", + " title={UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training},\n", + " author={Khashabi, Daniel and Kordi, Yeganeh and Hajishirzi, Hannaneh},\n", + " journal={arXiv preprint arXiv:2202.12359},\n", + " year={2022}\n", + "}\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "74d91250", + "metadata": {}, + "source": [ + "## Compare xP3 and UnifiedQA" + ] + }, + { + "cell_type": "markdown", + "id": "8d7ca591", + "metadata": {}, + "source": [ + "As many of the datasets that are in UnifiedQA are already in xP3, we do a simple (and incomplete) check to limit the number of datasets that we download." + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "95b57b2c", + "metadata": {}, + "outputs": [], + "source": [ + "xp3_list = [\n", + " \"Code Miscellaneous\",\n", + " \"CodeComplex\",\n", + " \"Docstring Corpus\",\n", + " \"GreatCode\",\n", + " \"State Changes\",\n", + " \"Closed-book QA\",\n", + " \"Hotpot QA\",\n", + " \"Trivia QA\",\n", + " \"Web Questions\",\n", + " \"Wiki QA\",\n", + " \"Extractive QA\",\n", + " \"Adversarial QA\",\n", + " \"CMRC2018\",\n", + " \"DRCD\",\n", + " \"DuoRC\",\n", + " \"MLQA\",\n", + " \"Quoref\",\n", + " \"ReCoRD\",\n", + " \"ROPES\",\n", + " \"SQuAD v2\",\n", + " \"xQuAD\",\n", + " \"TyDI QA\",\n", + " \"Primary\",\n", + " \"Goldp\",\n", + " \"Multiple-Choice QA\",\n", + " \"ARC\",\n", + " \"C3\",\n", + " \"CoS-E\",\n", + " \"Cosmos\",\n", + " \"DREAM\",\n", + " \"MultiRC\",\n", + " \"OpenBookQA\",\n", + " \"PiQA\",\n", + " \"QUAIL\",\n", + " \"QuaRel\",\n", + " \"QuaRTz\",\n", + " \"QASC\",\n", + " \"RACE\",\n", + " \"SciQ\",\n", + " \"Social IQA\",\n", + " \"Wiki Hop\",\n", + " \"WiQA\",\n", + " \"Paraphrase Identification\",\n", + " \"MRPC\",\n", + " \"PAWS\",\n", + " \"PAWS-X\",\n", + " \"QQP\",\n", + " \"Program Synthesis\",\n", + " \"APPS\",\n", + " \"CodeContests\",\n", + " \"JupyterCodePairs\",\n", + " \"MBPP\",\n", + " \"NeuralCodeSearch\",\n", + " \"XLCoST\",\n", + " \"Structure-to-text\",\n", + " \"Common Gen\",\n", + " \"Wiki Bio\",\n", + " \"Sentiment\",\n", + " \"Amazon\",\n", + " \"App Reviews\",\n", + " \"IMDB\",\n", + " \"Rotten Tomatoes\",\n", + " \"Yelp\",\n", + " \"Simplification\",\n", + " \"BiSECT\",\n", + " \"Summarization\",\n", + " \"CNN Daily Mail\",\n", + " \"Gigaword\",\n", + " \"MultiNews\",\n", + " \"SamSum\",\n", + " \"Wiki-Lingua\",\n", + " \"XLSum\",\n", + " \"XSum\",\n", + " \"Topic Classification\",\n", + " \"AG News\",\n", + " \"DBPedia\",\n", + " \"TNEWS\",\n", + " \"TREC\",\n", + " \"CSL\",\n", + " \"Translation\",\n", + " \"Flores-200\",\n", + " \"Tatoeba\",\n", + " \"Word Sense disambiguation\",\n", + " \"WiC\",\n", + " \"XL-WiC\",\n", + " \"Evaluation datasets (included in xP3all except for HumanEval)\",\n", + " \"Natural Language Inference\",\n", + " \"ANLI\",\n", + " \"CB\",\n", + " \"RTE\",\n", + " \"XNLI\",\n", + " \"Coreference Resolution\",\n", + " \"Winogrande\",\n", + " \"XWinograd\",\n", + " \"Program Synthesis\",\n", + " \"HumanEval\",\n", + " \"Sentence Completion\",\n", + " \"COPA\",\n", + " \"Story Cloze\",\n", + " \"XCOPA\",\n", + " \"XStoryCloze\",\n", + " \"Additional xP3all datasets\",\n", + " \"Coreference Resolution\",\n", + " \"WSC (Fixed)\",\n", + " \"Sentence Completion\",\n", + " \"HellaSwag\",\n", + " \"Translation\",\n", + " \"MultiEurlex\",\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "cba85ada", + "metadata": {}, + "outputs": [], + "source": [ + "unifiedQA_list = [\n", + " \"SQuAD 1.1\",\n", + " \"SQuAD 2\",\n", + " \"NewsQA\",\n", + " \"Quoref\",\n", + " \"ROPES\",\n", + " \"NarrativeQA\",\n", + " \"DROP\",\n", + " \"NaturalQuestions\",\n", + " \"MCTest\",\n", + " \"RACE\",\n", + " \"OpenBookQA\",\n", + " \"ARC\",\n", + " \"CommonsenseQA\",\n", + " \"QASC\",\n", + " \"PhysicalIQA\",\n", + " \"SocialIQA\",\n", + " \"Winogrande\",\n", + " \"BoolQ\",\n", + " \"MultiRC (yes/no)\",\n", + " \"BoolQ-NP\",\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "78fc6ee9", + "metadata": {}, + "source": [ + "Now that we've defined the list of datasets (which we found in the paper for UnifiedQA and on the Hugging Face page of xP3) we can do the simple check." + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "9cf91317", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SQuAD 1.1\n", + "SQuAD 2\n", + "NewsQA\n", + "NarrativeQA\n", + "DROP\n", + "NaturalQuestions\n", + "MCTest\n", + "CommonsenseQA\n", + "PhysicalIQA\n", + "SocialIQA\n", + "BoolQ\n", + "MultiRC (yes/no)\n", + "BoolQ-NP\n" + ] + } + ], + "source": [ + "for ds in unifiedQA_list:\n", + " if ds not in xp3_list:\n", + " print(ds)" + ] + }, + { + "cell_type": "markdown", + "id": "93eeabad", + "metadata": {}, + "source": [ + "The SQuAD dataset is actually covered (with a slightly different name) but the other ones should be downloaded." + ] + }, + { + "cell_type": "markdown", + "id": "a8c98078", + "metadata": {}, + "source": [ + "# OpenAssistant Data Scheme" + ] + }, + { + "cell_type": "markdown", + "id": "2731f88f", + "metadata": {}, + "source": [ + "We will use the data scheme that can be found in the docs for Open-Assistant. This code is taken from the StackExchange notebook." + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "d35ab066", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import TypeVar, List, Dict, Any, Literal\n", + "from json import JSONEncoder\n", + "\n", + "T = TypeVar(\"T\", bound=\"ConversationTreeNode\")\n", + "\n", + "\n", + "class ConversationTreeNode:\n", + " text: str # The text of the node\n", + " role: Literal[\"prompter\", \"assistant\"] # Whether the node is a user prompt/follow-up or an assistant response\n", + " children: List[T] # The children of the node (if you have a linear conversation, this will be of length 0 or 1)\n", + " metadata: Dict[str, Any] # Node metadata (see below)\n", + "\n", + " def __init__(\n", + " self, text: str, role: Literal[\"prompter\", \"assistant\"], children: List[T], metadata: Dict[str, Any]\n", + " ) -> None:\n", + " self.text = text\n", + " self.role = role\n", + " self.children = children\n", + " self.metadata = metadata\n", + "\n", + "\n", + "class ConversationTree:\n", + " root: ConversationTreeNode # The node containing the initial prompt\n", + " metadata: Dict[str, Any] # Tree metadata, different from root node metadata.\n", + "\n", + " def __init__(self, root: ConversationTreeNode, metadata: Dict[str, Any]) -> None:\n", + " self.root = root\n", + " self.metadata = metadata\n", + "\n", + "\n", + "# subclass JSONEncoder\n", + "class TreeEncoder(JSONEncoder):\n", + " def default(self, o):\n", + " return o.__dict__" + ] + }, + { + "cell_type": "markdown", + "id": "e7457bae", + "metadata": {}, + "source": [ + "# Download and convert" + ] + }, + { + "cell_type": "markdown", + "id": "54b0fd63", + "metadata": {}, + "source": [ + "We firstly import pandas, which we'll use to download the TSV files from Google Cloud Storage, and any other libraries that we'll need." + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "9317d4b4", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import json" + ] + }, + { + "cell_type": "markdown", + "id": "62dc4e18", + "metadata": {}, + "source": [ + "The following is a simple function to take the data (which has two columns) and convert it to a tree with a root note (question) and one child (answer)." + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "963e0d92", + "metadata": {}, + "outputs": [], + "source": [ + "def convert_unified_qa(dataset_url):\n", + " # download using pandas\n", + " ds = pd.read_csv(dataset_url, on_bad_lines=\"skip\", names=[\"Question\", \"Answer\"], sep=\"\\t\")\n", + " # get name for metatdata\n", + " ds_name = dataset_url.split(\"/unifiedqa/data/\")[1].split(\"/\")[0]\n", + "\n", + " # create conversation forest\n", + " conversation_forest = []\n", + " for item in ds.itertuples():\n", + " # build nodes and tree\n", + " root = ConversationTreeNode(text=item.Question, role=\"prompter\", children=[], metadata=None)\n", + " child = ConversationTreeNode(text=item.Answer, role=\"assistant\", children=[], metadata=None)\n", + " root.children.append(child)\n", + " conversation_tree = ConversationTree(root=root, metadata={\"dataset\": ds_name})\n", + "\n", + " conversation_forest.append(conversation_tree)\n", + "\n", + " conversation_forest_json = [\n", + " json.loads(TreeEncoder().encode(conversation_tree)) for conversation_tree in conversation_forest\n", + " ]\n", + "\n", + " print(json.dumps(conversation_forest_json, indent=4), file=open(f\"./{ds_name}.json\", \"w+\"))\n", + "\n", + " print(\"*****\", ds_name, \"****\")\n", + " print(ds.head(2))\n", + " print(\"....\")" + ] + }, + { + "cell_type": "markdown", + "id": "e4448c9a", + "metadata": {}, + "source": [ + "We now define the list of URLs that we want to download. These URLs were found by manually going UnifiedQA'S Google Cloud bucket: https://console.cloud.google.com/storage/browser/unifiedqa/data" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "43e188b6", + "metadata": {}, + "outputs": [], + "source": [ + "urls = [\n", + " \"https://storage.googleapis.com/unifiedqa/data/natural_questions/train.tsv\",\n", + " \"https://storage.googleapis.com/unifiedqa/data/narrativeqa/train.tsv\",\n", + " \"https://storage.googleapis.com/unifiedqa/data/newsqa/train.tsv\",\n", + " \"https://storage.googleapis.com/unifiedqa/data/drop/train.tsv\",\n", + " \"https://storage.googleapis.com/unifiedqa/data/commonsenseqa/train.tsv\",\n", + " \"https://storage.googleapis.com/unifiedqa/data/physical_iqa/train.tsv\",\n", + " \"https://storage.googleapis.com/unifiedqa/data/social_iqa/train.tsv\",\n", + " \"https://storage.googleapis.com/unifiedqa/data/boolq/train.tsv\",\n", + " \"https://storage.googleapis.com/unifiedqa/data/boolq_np/train.tsv\",\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "b39bb154", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "***** natural_questions ****\n", + " Question \\\n", + "0 which is the most common use of opt-in e-mail ... \n", + "1 how i.met your mother who is the mother? \n", + "\n", + " Answer \n", + "0 a newsletter sent to an advertising firm's cus... \n", + "1 Tracy McConnell \n", + "....\n", + "***** narrativeqa ****\n", + " Question \\\n", + "0 Who is Miss Delmer? \\n At Madeline Hall, an o... \n", + "1 Who is Miss Delmer? \\n At Madeline Hall, an o... \n", + "\n", + " Answer \n", + "0 the elderly spinster aunt of the Earl de Vers... \n", + "1 She's Captail Delmar's aunt. \n", + "....\n", + "***** newsqa ****\n", + " Question Answer\n", + "0 How many Americans are part of the federal foo... 31 million\n", + "1 How much did Sean Callebs live on? \\n (CNN) --... $176\n", + "....\n", + "***** drop ****\n", + " Question Answer\n", + "0 How many points did the buccaneers need to tie... 3\n", + "1 How many field goals did the Lions score? \\n T... 2\n", + "....\n", + "***** commonsenseqa ****\n", + " Question Answer\n", + "0 The sanctions against the school were a punish... ignore\n", + "1 Sammy wanted to go to where the people were. ... populated areas\n", + "....\n", + "***** physical_iqa ****\n", + " Question \\\n", + "0 When boiling butter, when it's ready, you can ... \n", + "1 To permanently attach metal legs to a chair, y... \n", + "\n", + " Answer \n", + "0 Pour it into a jar \n", + "1 Weld the metal together to get it to stay fir... \n", + "....\n", + "***** social_iqa ****\n", + " Question Answer\n", + "0 How would Others feel as a result? \\n (A) like... like attending\n", + "1 What will Others want to do next? \\n (A) disag... get to work\n", + "....\n", + "***** boolq ****\n", + " Question Answer\n", + "0 can you buy beer at a grocery store in pa? \\n ... yes\n", + "1 is the baby in fuller house a twin? \\n (List o... yes\n", + "....\n", + "***** boolq_np ****\n", + " Question Answer\n", + "0 do iran and afghanistan write the same languag... yes\n", + "1 do iran and afghanistan read the same language... yes\n", + "....\n" + ] + } + ], + "source": [ + "for url in urls:\n", + " convert_unified_qa(url)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f0309674", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/data-argumentation/UnifiedQA.md b/notebooks/data-argumentation/UnifiedQA.md new file mode 100644 index 00000000..54b4bbf8 --- /dev/null +++ b/notebooks/data-argumentation/UnifiedQA.md @@ -0,0 +1,13 @@ +# UnifiedQA Downloader + +UnifiedQA is a notebook that downloads data from UnifiedQA's Google Cloud bucket +and converts it into OpenAssistant Data Scheme formats. Files for each dataset +in the UnifiedQA collection (excluding the ones already in xP3) are saved to +JSON. + +--- + +## Contributing + +Feel free to contribute to this notebook. It's not perfect and additional +functionality is planned.