Add UnifiedQA datasets that are not in xP3 (#666)

* Add UnifiedQA datasets that are not in xP3
This commit is contained in:
Valentino
2023-01-14 01:57:22 -08:00
committed by GitHub
parent 90e71f3da2
commit c36a7ae9da
2 changed files with 528 additions and 0 deletions
@@ -0,0 +1,515 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "00b2848c",
"metadata": {},
"source": [
"# UnifiedQA Dataset"
]
},
{
"cell_type": "markdown",
"id": "b2e3c95c",
"metadata": {},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/data-argumentation/UnifiedQA.ipynb)"
]
},
{
"cell_type": "markdown",
"id": "d81932b9",
"metadata": {},
"source": [
"The purpose of this notebook is to download datasets from the UnifiedQA dataset collection and convert them into a format that can be used for training the OpenAssistant.\n",
"\n",
"The UnifiedQA repo can be found here: https://github.com/allenai/unifiedqa\n",
"\n",
"If you extend or use this work, please cite the relevant papers:\n",
"```\n",
"@article{khashabi2022unifiedqa,\n",
" title={UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training},\n",
" author={Khashabi, Daniel and Kordi, Yeganeh and Hajishirzi, Hannaneh},\n",
" journal={arXiv preprint arXiv:2202.12359},\n",
" year={2022}\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "74d91250",
"metadata": {},
"source": [
"## Compare xP3 and UnifiedQA"
]
},
{
"cell_type": "markdown",
"id": "8d7ca591",
"metadata": {},
"source": [
"As many of the datasets that are in UnifiedQA are already in xP3, we do a simple (and incomplete) check to limit the number of datasets that we download."
]
},
{
"cell_type": "code",
"execution_count": 82,
"id": "95b57b2c",
"metadata": {},
"outputs": [],
"source": [
"xp3_list = [\n",
" \"Code Miscellaneous\",\n",
" \"CodeComplex\",\n",
" \"Docstring Corpus\",\n",
" \"GreatCode\",\n",
" \"State Changes\",\n",
" \"Closed-book QA\",\n",
" \"Hotpot QA\",\n",
" \"Trivia QA\",\n",
" \"Web Questions\",\n",
" \"Wiki QA\",\n",
" \"Extractive QA\",\n",
" \"Adversarial QA\",\n",
" \"CMRC2018\",\n",
" \"DRCD\",\n",
" \"DuoRC\",\n",
" \"MLQA\",\n",
" \"Quoref\",\n",
" \"ReCoRD\",\n",
" \"ROPES\",\n",
" \"SQuAD v2\",\n",
" \"xQuAD\",\n",
" \"TyDI QA\",\n",
" \"Primary\",\n",
" \"Goldp\",\n",
" \"Multiple-Choice QA\",\n",
" \"ARC\",\n",
" \"C3\",\n",
" \"CoS-E\",\n",
" \"Cosmos\",\n",
" \"DREAM\",\n",
" \"MultiRC\",\n",
" \"OpenBookQA\",\n",
" \"PiQA\",\n",
" \"QUAIL\",\n",
" \"QuaRel\",\n",
" \"QuaRTz\",\n",
" \"QASC\",\n",
" \"RACE\",\n",
" \"SciQ\",\n",
" \"Social IQA\",\n",
" \"Wiki Hop\",\n",
" \"WiQA\",\n",
" \"Paraphrase Identification\",\n",
" \"MRPC\",\n",
" \"PAWS\",\n",
" \"PAWS-X\",\n",
" \"QQP\",\n",
" \"Program Synthesis\",\n",
" \"APPS\",\n",
" \"CodeContests\",\n",
" \"JupyterCodePairs\",\n",
" \"MBPP\",\n",
" \"NeuralCodeSearch\",\n",
" \"XLCoST\",\n",
" \"Structure-to-text\",\n",
" \"Common Gen\",\n",
" \"Wiki Bio\",\n",
" \"Sentiment\",\n",
" \"Amazon\",\n",
" \"App Reviews\",\n",
" \"IMDB\",\n",
" \"Rotten Tomatoes\",\n",
" \"Yelp\",\n",
" \"Simplification\",\n",
" \"BiSECT\",\n",
" \"Summarization\",\n",
" \"CNN Daily Mail\",\n",
" \"Gigaword\",\n",
" \"MultiNews\",\n",
" \"SamSum\",\n",
" \"Wiki-Lingua\",\n",
" \"XLSum\",\n",
" \"XSum\",\n",
" \"Topic Classification\",\n",
" \"AG News\",\n",
" \"DBPedia\",\n",
" \"TNEWS\",\n",
" \"TREC\",\n",
" \"CSL\",\n",
" \"Translation\",\n",
" \"Flores-200\",\n",
" \"Tatoeba\",\n",
" \"Word Sense disambiguation\",\n",
" \"WiC\",\n",
" \"XL-WiC\",\n",
" \"Evaluation datasets (included in xP3all except for HumanEval)\",\n",
" \"Natural Language Inference\",\n",
" \"ANLI\",\n",
" \"CB\",\n",
" \"RTE\",\n",
" \"XNLI\",\n",
" \"Coreference Resolution\",\n",
" \"Winogrande\",\n",
" \"XWinograd\",\n",
" \"Program Synthesis\",\n",
" \"HumanEval\",\n",
" \"Sentence Completion\",\n",
" \"COPA\",\n",
" \"Story Cloze\",\n",
" \"XCOPA\",\n",
" \"XStoryCloze\",\n",
" \"Additional xP3all datasets\",\n",
" \"Coreference Resolution\",\n",
" \"WSC (Fixed)\",\n",
" \"Sentence Completion\",\n",
" \"HellaSwag\",\n",
" \"Translation\",\n",
" \"MultiEurlex\",\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 83,
"id": "cba85ada",
"metadata": {},
"outputs": [],
"source": [
"unifiedQA_list = [\n",
" \"SQuAD 1.1\",\n",
" \"SQuAD 2\",\n",
" \"NewsQA\",\n",
" \"Quoref\",\n",
" \"ROPES\",\n",
" \"NarrativeQA\",\n",
" \"DROP\",\n",
" \"NaturalQuestions\",\n",
" \"MCTest\",\n",
" \"RACE\",\n",
" \"OpenBookQA\",\n",
" \"ARC\",\n",
" \"CommonsenseQA\",\n",
" \"QASC\",\n",
" \"PhysicalIQA\",\n",
" \"SocialIQA\",\n",
" \"Winogrande\",\n",
" \"BoolQ\",\n",
" \"MultiRC (yes/no)\",\n",
" \"BoolQ-NP\",\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "78fc6ee9",
"metadata": {},
"source": [
"Now that we've defined the list of datasets (which we found in the paper for UnifiedQA and on the Hugging Face page of xP3) we can do the simple check."
]
},
{
"cell_type": "code",
"execution_count": 84,
"id": "9cf91317",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SQuAD 1.1\n",
"SQuAD 2\n",
"NewsQA\n",
"NarrativeQA\n",
"DROP\n",
"NaturalQuestions\n",
"MCTest\n",
"CommonsenseQA\n",
"PhysicalIQA\n",
"SocialIQA\n",
"BoolQ\n",
"MultiRC (yes/no)\n",
"BoolQ-NP\n"
]
}
],
"source": [
"for ds in unifiedQA_list:\n",
" if ds not in xp3_list:\n",
" print(ds)"
]
},
{
"cell_type": "markdown",
"id": "93eeabad",
"metadata": {},
"source": [
"The SQuAD dataset is actually covered (with a slightly different name) but the other ones should be downloaded."
]
},
{
"cell_type": "markdown",
"id": "a8c98078",
"metadata": {},
"source": [
"# OpenAssistant Data Scheme"
]
},
{
"cell_type": "markdown",
"id": "2731f88f",
"metadata": {},
"source": [
"We will use the data scheme that can be found in the docs for Open-Assistant. This code is taken from the StackExchange notebook."
]
},
{
"cell_type": "code",
"execution_count": 85,
"id": "d35ab066",
"metadata": {},
"outputs": [],
"source": [
"from typing import TypeVar, List, Dict, Any, Literal\n",
"from json import JSONEncoder\n",
"\n",
"T = TypeVar(\"T\", bound=\"ConversationTreeNode\")\n",
"\n",
"\n",
"class ConversationTreeNode:\n",
" text: str # The text of the node\n",
" role: Literal[\"prompter\", \"assistant\"] # Whether the node is a user prompt/follow-up or an assistant response\n",
" children: List[T] # The children of the node (if you have a linear conversation, this will be of length 0 or 1)\n",
" metadata: Dict[str, Any] # Node metadata (see below)\n",
"\n",
" def __init__(\n",
" self, text: str, role: Literal[\"prompter\", \"assistant\"], children: List[T], metadata: Dict[str, Any]\n",
" ) -> None:\n",
" self.text = text\n",
" self.role = role\n",
" self.children = children\n",
" self.metadata = metadata\n",
"\n",
"\n",
"class ConversationTree:\n",
" root: ConversationTreeNode # The node containing the initial prompt\n",
" metadata: Dict[str, Any] # Tree metadata, different from root node metadata.\n",
"\n",
" def __init__(self, root: ConversationTreeNode, metadata: Dict[str, Any]) -> None:\n",
" self.root = root\n",
" self.metadata = metadata\n",
"\n",
"\n",
"# subclass JSONEncoder\n",
"class TreeEncoder(JSONEncoder):\n",
" def default(self, o):\n",
" return o.__dict__"
]
},
{
"cell_type": "markdown",
"id": "e7457bae",
"metadata": {},
"source": [
"# Download and convert"
]
},
{
"cell_type": "markdown",
"id": "54b0fd63",
"metadata": {},
"source": [
"We firstly import pandas, which we'll use to download the TSV files from Google Cloud Storage, and any other libraries that we'll need."
]
},
{
"cell_type": "code",
"execution_count": 86,
"id": "9317d4b4",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import json"
]
},
{
"cell_type": "markdown",
"id": "62dc4e18",
"metadata": {},
"source": [
"The following is a simple function to take the data (which has two columns) and convert it to a tree with a root note (question) and one child (answer)."
]
},
{
"cell_type": "code",
"execution_count": 87,
"id": "963e0d92",
"metadata": {},
"outputs": [],
"source": [
"def convert_unified_qa(dataset_url):\n",
" # download using pandas\n",
" ds = pd.read_csv(dataset_url, on_bad_lines=\"skip\", names=[\"Question\", \"Answer\"], sep=\"\\t\")\n",
" # get name for metatdata\n",
" ds_name = dataset_url.split(\"/unifiedqa/data/\")[1].split(\"/\")[0]\n",
"\n",
" # create conversation forest\n",
" conversation_forest = []\n",
" for item in ds.itertuples():\n",
" # build nodes and tree\n",
" root = ConversationTreeNode(text=item.Question, role=\"prompter\", children=[], metadata=None)\n",
" child = ConversationTreeNode(text=item.Answer, role=\"assistant\", children=[], metadata=None)\n",
" root.children.append(child)\n",
" conversation_tree = ConversationTree(root=root, metadata={\"dataset\": ds_name})\n",
"\n",
" conversation_forest.append(conversation_tree)\n",
"\n",
" conversation_forest_json = [\n",
" json.loads(TreeEncoder().encode(conversation_tree)) for conversation_tree in conversation_forest\n",
" ]\n",
"\n",
" print(json.dumps(conversation_forest_json, indent=4), file=open(f\"./{ds_name}.json\", \"w+\"))\n",
"\n",
" print(\"*****\", ds_name, \"****\")\n",
" print(ds.head(2))\n",
" print(\"....\")"
]
},
{
"cell_type": "markdown",
"id": "e4448c9a",
"metadata": {},
"source": [
"We now define the list of URLs that we want to download. These URLs were found by manually going UnifiedQA'S Google Cloud bucket: https://console.cloud.google.com/storage/browser/unifiedqa/data"
]
},
{
"cell_type": "code",
"execution_count": 88,
"id": "43e188b6",
"metadata": {},
"outputs": [],
"source": [
"urls = [\n",
" \"https://storage.googleapis.com/unifiedqa/data/natural_questions/train.tsv\",\n",
" \"https://storage.googleapis.com/unifiedqa/data/narrativeqa/train.tsv\",\n",
" \"https://storage.googleapis.com/unifiedqa/data/newsqa/train.tsv\",\n",
" \"https://storage.googleapis.com/unifiedqa/data/drop/train.tsv\",\n",
" \"https://storage.googleapis.com/unifiedqa/data/commonsenseqa/train.tsv\",\n",
" \"https://storage.googleapis.com/unifiedqa/data/physical_iqa/train.tsv\",\n",
" \"https://storage.googleapis.com/unifiedqa/data/social_iqa/train.tsv\",\n",
" \"https://storage.googleapis.com/unifiedqa/data/boolq/train.tsv\",\n",
" \"https://storage.googleapis.com/unifiedqa/data/boolq_np/train.tsv\",\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "b39bb154",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"***** natural_questions ****\n",
" Question \\\n",
"0 which is the most common use of opt-in e-mail ... \n",
"1 how i.met your mother who is the mother? \n",
"\n",
" Answer \n",
"0 a newsletter sent to an advertising firm's cus... \n",
"1 Tracy McConnell \n",
"....\n",
"***** narrativeqa ****\n",
" Question \\\n",
"0 Who is Miss Delmer? \\n At Madeline Hall, an o... \n",
"1 Who is Miss Delmer? \\n At Madeline Hall, an o... \n",
"\n",
" Answer \n",
"0 the elderly spinster aunt of the Earl de Vers... \n",
"1 She's Captail Delmar's aunt. \n",
"....\n",
"***** newsqa ****\n",
" Question Answer\n",
"0 How many Americans are part of the federal foo... 31 million\n",
"1 How much did Sean Callebs live on? \\n (CNN) --... $176\n",
"....\n",
"***** drop ****\n",
" Question Answer\n",
"0 How many points did the buccaneers need to tie... 3\n",
"1 How many field goals did the Lions score? \\n T... 2\n",
"....\n",
"***** commonsenseqa ****\n",
" Question Answer\n",
"0 The sanctions against the school were a punish... ignore\n",
"1 Sammy wanted to go to where the people were. ... populated areas\n",
"....\n",
"***** physical_iqa ****\n",
" Question \\\n",
"0 When boiling butter, when it's ready, you can ... \n",
"1 To permanently attach metal legs to a chair, y... \n",
"\n",
" Answer \n",
"0 Pour it into a jar \n",
"1 Weld the metal together to get it to stay fir... \n",
"....\n",
"***** social_iqa ****\n",
" Question Answer\n",
"0 How would Others feel as a result? \\n (A) like... like attending\n",
"1 What will Others want to do next? \\n (A) disag... get to work\n",
"....\n",
"***** boolq ****\n",
" Question Answer\n",
"0 can you buy beer at a grocery store in pa? \\n ... yes\n",
"1 is the baby in fuller house a twin? \\n (List o... yes\n",
"....\n",
"***** boolq_np ****\n",
" Question Answer\n",
"0 do iran and afghanistan write the same languag... yes\n",
"1 do iran and afghanistan read the same language... yes\n",
"....\n"
]
}
],
"source": [
"for url in urls:\n",
" convert_unified_qa(url)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0309674",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
+13
View File
@@ -0,0 +1,13 @@
# UnifiedQA Downloader
UnifiedQA is a notebook that downloads data from UnifiedQA's Google Cloud bucket
and converts it into OpenAssistant Data Scheme formats. Files for each dataset
in the UnifiedQA collection (excluding the ones already in xP3) are saved to
JSON.
---
## Contributing
Feel free to contribute to this notebook. It's not perfect and additional
functionality is planned.