mirror of
https://github.com/wassname/Open-Assistant.git
synced 2026-06-27 16:10:30 +08:00
Add UnifiedQA datasets that are not in xP3 (#666)
* Add UnifiedQA datasets that are not in xP3
This commit is contained in:
@@ -0,0 +1,515 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "00b2848c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# UnifiedQA Dataset"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b2e3c95c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"[](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/data-argumentation/UnifiedQA.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d81932b9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The purpose of this notebook is to download datasets from the UnifiedQA dataset collection and convert them into a format that can be used for training the OpenAssistant.\n",
|
||||
"\n",
|
||||
"The UnifiedQA repo can be found here: https://github.com/allenai/unifiedqa\n",
|
||||
"\n",
|
||||
"If you extend or use this work, please cite the relevant papers:\n",
|
||||
"```\n",
|
||||
"@article{khashabi2022unifiedqa,\n",
|
||||
" title={UnifiedQA-v2: Stronger Generalization via Broader Cross-Format Training},\n",
|
||||
" author={Khashabi, Daniel and Kordi, Yeganeh and Hajishirzi, Hannaneh},\n",
|
||||
" journal={arXiv preprint arXiv:2202.12359},\n",
|
||||
" year={2022}\n",
|
||||
"}\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "74d91250",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Compare xP3 and UnifiedQA"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8d7ca591",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As many of the datasets that are in UnifiedQA are already in xP3, we do a simple (and incomplete) check to limit the number of datasets that we download."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 82,
|
||||
"id": "95b57b2c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"xp3_list = [\n",
|
||||
" \"Code Miscellaneous\",\n",
|
||||
" \"CodeComplex\",\n",
|
||||
" \"Docstring Corpus\",\n",
|
||||
" \"GreatCode\",\n",
|
||||
" \"State Changes\",\n",
|
||||
" \"Closed-book QA\",\n",
|
||||
" \"Hotpot QA\",\n",
|
||||
" \"Trivia QA\",\n",
|
||||
" \"Web Questions\",\n",
|
||||
" \"Wiki QA\",\n",
|
||||
" \"Extractive QA\",\n",
|
||||
" \"Adversarial QA\",\n",
|
||||
" \"CMRC2018\",\n",
|
||||
" \"DRCD\",\n",
|
||||
" \"DuoRC\",\n",
|
||||
" \"MLQA\",\n",
|
||||
" \"Quoref\",\n",
|
||||
" \"ReCoRD\",\n",
|
||||
" \"ROPES\",\n",
|
||||
" \"SQuAD v2\",\n",
|
||||
" \"xQuAD\",\n",
|
||||
" \"TyDI QA\",\n",
|
||||
" \"Primary\",\n",
|
||||
" \"Goldp\",\n",
|
||||
" \"Multiple-Choice QA\",\n",
|
||||
" \"ARC\",\n",
|
||||
" \"C3\",\n",
|
||||
" \"CoS-E\",\n",
|
||||
" \"Cosmos\",\n",
|
||||
" \"DREAM\",\n",
|
||||
" \"MultiRC\",\n",
|
||||
" \"OpenBookQA\",\n",
|
||||
" \"PiQA\",\n",
|
||||
" \"QUAIL\",\n",
|
||||
" \"QuaRel\",\n",
|
||||
" \"QuaRTz\",\n",
|
||||
" \"QASC\",\n",
|
||||
" \"RACE\",\n",
|
||||
" \"SciQ\",\n",
|
||||
" \"Social IQA\",\n",
|
||||
" \"Wiki Hop\",\n",
|
||||
" \"WiQA\",\n",
|
||||
" \"Paraphrase Identification\",\n",
|
||||
" \"MRPC\",\n",
|
||||
" \"PAWS\",\n",
|
||||
" \"PAWS-X\",\n",
|
||||
" \"QQP\",\n",
|
||||
" \"Program Synthesis\",\n",
|
||||
" \"APPS\",\n",
|
||||
" \"CodeContests\",\n",
|
||||
" \"JupyterCodePairs\",\n",
|
||||
" \"MBPP\",\n",
|
||||
" \"NeuralCodeSearch\",\n",
|
||||
" \"XLCoST\",\n",
|
||||
" \"Structure-to-text\",\n",
|
||||
" \"Common Gen\",\n",
|
||||
" \"Wiki Bio\",\n",
|
||||
" \"Sentiment\",\n",
|
||||
" \"Amazon\",\n",
|
||||
" \"App Reviews\",\n",
|
||||
" \"IMDB\",\n",
|
||||
" \"Rotten Tomatoes\",\n",
|
||||
" \"Yelp\",\n",
|
||||
" \"Simplification\",\n",
|
||||
" \"BiSECT\",\n",
|
||||
" \"Summarization\",\n",
|
||||
" \"CNN Daily Mail\",\n",
|
||||
" \"Gigaword\",\n",
|
||||
" \"MultiNews\",\n",
|
||||
" \"SamSum\",\n",
|
||||
" \"Wiki-Lingua\",\n",
|
||||
" \"XLSum\",\n",
|
||||
" \"XSum\",\n",
|
||||
" \"Topic Classification\",\n",
|
||||
" \"AG News\",\n",
|
||||
" \"DBPedia\",\n",
|
||||
" \"TNEWS\",\n",
|
||||
" \"TREC\",\n",
|
||||
" \"CSL\",\n",
|
||||
" \"Translation\",\n",
|
||||
" \"Flores-200\",\n",
|
||||
" \"Tatoeba\",\n",
|
||||
" \"Word Sense disambiguation\",\n",
|
||||
" \"WiC\",\n",
|
||||
" \"XL-WiC\",\n",
|
||||
" \"Evaluation datasets (included in xP3all except for HumanEval)\",\n",
|
||||
" \"Natural Language Inference\",\n",
|
||||
" \"ANLI\",\n",
|
||||
" \"CB\",\n",
|
||||
" \"RTE\",\n",
|
||||
" \"XNLI\",\n",
|
||||
" \"Coreference Resolution\",\n",
|
||||
" \"Winogrande\",\n",
|
||||
" \"XWinograd\",\n",
|
||||
" \"Program Synthesis\",\n",
|
||||
" \"HumanEval\",\n",
|
||||
" \"Sentence Completion\",\n",
|
||||
" \"COPA\",\n",
|
||||
" \"Story Cloze\",\n",
|
||||
" \"XCOPA\",\n",
|
||||
" \"XStoryCloze\",\n",
|
||||
" \"Additional xP3all datasets\",\n",
|
||||
" \"Coreference Resolution\",\n",
|
||||
" \"WSC (Fixed)\",\n",
|
||||
" \"Sentence Completion\",\n",
|
||||
" \"HellaSwag\",\n",
|
||||
" \"Translation\",\n",
|
||||
" \"MultiEurlex\",\n",
|
||||
"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 83,
|
||||
"id": "cba85ada",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"unifiedQA_list = [\n",
|
||||
" \"SQuAD 1.1\",\n",
|
||||
" \"SQuAD 2\",\n",
|
||||
" \"NewsQA\",\n",
|
||||
" \"Quoref\",\n",
|
||||
" \"ROPES\",\n",
|
||||
" \"NarrativeQA\",\n",
|
||||
" \"DROP\",\n",
|
||||
" \"NaturalQuestions\",\n",
|
||||
" \"MCTest\",\n",
|
||||
" \"RACE\",\n",
|
||||
" \"OpenBookQA\",\n",
|
||||
" \"ARC\",\n",
|
||||
" \"CommonsenseQA\",\n",
|
||||
" \"QASC\",\n",
|
||||
" \"PhysicalIQA\",\n",
|
||||
" \"SocialIQA\",\n",
|
||||
" \"Winogrande\",\n",
|
||||
" \"BoolQ\",\n",
|
||||
" \"MultiRC (yes/no)\",\n",
|
||||
" \"BoolQ-NP\",\n",
|
||||
"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "78fc6ee9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now that we've defined the list of datasets (which we found in the paper for UnifiedQA and on the Hugging Face page of xP3) we can do the simple check."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 84,
|
||||
"id": "9cf91317",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"SQuAD 1.1\n",
|
||||
"SQuAD 2\n",
|
||||
"NewsQA\n",
|
||||
"NarrativeQA\n",
|
||||
"DROP\n",
|
||||
"NaturalQuestions\n",
|
||||
"MCTest\n",
|
||||
"CommonsenseQA\n",
|
||||
"PhysicalIQA\n",
|
||||
"SocialIQA\n",
|
||||
"BoolQ\n",
|
||||
"MultiRC (yes/no)\n",
|
||||
"BoolQ-NP\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for ds in unifiedQA_list:\n",
|
||||
" if ds not in xp3_list:\n",
|
||||
" print(ds)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "93eeabad",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The SQuAD dataset is actually covered (with a slightly different name) but the other ones should be downloaded."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a8c98078",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# OpenAssistant Data Scheme"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2731f88f",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We will use the data scheme that can be found in the docs for Open-Assistant. This code is taken from the StackExchange notebook."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 85,
|
||||
"id": "d35ab066",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from typing import TypeVar, List, Dict, Any, Literal\n",
|
||||
"from json import JSONEncoder\n",
|
||||
"\n",
|
||||
"T = TypeVar(\"T\", bound=\"ConversationTreeNode\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"class ConversationTreeNode:\n",
|
||||
" text: str # The text of the node\n",
|
||||
" role: Literal[\"prompter\", \"assistant\"] # Whether the node is a user prompt/follow-up or an assistant response\n",
|
||||
" children: List[T] # The children of the node (if you have a linear conversation, this will be of length 0 or 1)\n",
|
||||
" metadata: Dict[str, Any] # Node metadata (see below)\n",
|
||||
"\n",
|
||||
" def __init__(\n",
|
||||
" self, text: str, role: Literal[\"prompter\", \"assistant\"], children: List[T], metadata: Dict[str, Any]\n",
|
||||
" ) -> None:\n",
|
||||
" self.text = text\n",
|
||||
" self.role = role\n",
|
||||
" self.children = children\n",
|
||||
" self.metadata = metadata\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"class ConversationTree:\n",
|
||||
" root: ConversationTreeNode # The node containing the initial prompt\n",
|
||||
" metadata: Dict[str, Any] # Tree metadata, different from root node metadata.\n",
|
||||
"\n",
|
||||
" def __init__(self, root: ConversationTreeNode, metadata: Dict[str, Any]) -> None:\n",
|
||||
" self.root = root\n",
|
||||
" self.metadata = metadata\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# subclass JSONEncoder\n",
|
||||
"class TreeEncoder(JSONEncoder):\n",
|
||||
" def default(self, o):\n",
|
||||
" return o.__dict__"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e7457bae",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Download and convert"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "54b0fd63",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We firstly import pandas, which we'll use to download the TSV files from Google Cloud Storage, and any other libraries that we'll need."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 86,
|
||||
"id": "9317d4b4",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import json"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "62dc4e18",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The following is a simple function to take the data (which has two columns) and convert it to a tree with a root note (question) and one child (answer)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 87,
|
||||
"id": "963e0d92",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def convert_unified_qa(dataset_url):\n",
|
||||
" # download using pandas\n",
|
||||
" ds = pd.read_csv(dataset_url, on_bad_lines=\"skip\", names=[\"Question\", \"Answer\"], sep=\"\\t\")\n",
|
||||
" # get name for metatdata\n",
|
||||
" ds_name = dataset_url.split(\"/unifiedqa/data/\")[1].split(\"/\")[0]\n",
|
||||
"\n",
|
||||
" # create conversation forest\n",
|
||||
" conversation_forest = []\n",
|
||||
" for item in ds.itertuples():\n",
|
||||
" # build nodes and tree\n",
|
||||
" root = ConversationTreeNode(text=item.Question, role=\"prompter\", children=[], metadata=None)\n",
|
||||
" child = ConversationTreeNode(text=item.Answer, role=\"assistant\", children=[], metadata=None)\n",
|
||||
" root.children.append(child)\n",
|
||||
" conversation_tree = ConversationTree(root=root, metadata={\"dataset\": ds_name})\n",
|
||||
"\n",
|
||||
" conversation_forest.append(conversation_tree)\n",
|
||||
"\n",
|
||||
" conversation_forest_json = [\n",
|
||||
" json.loads(TreeEncoder().encode(conversation_tree)) for conversation_tree in conversation_forest\n",
|
||||
" ]\n",
|
||||
"\n",
|
||||
" print(json.dumps(conversation_forest_json, indent=4), file=open(f\"./{ds_name}.json\", \"w+\"))\n",
|
||||
"\n",
|
||||
" print(\"*****\", ds_name, \"****\")\n",
|
||||
" print(ds.head(2))\n",
|
||||
" print(\"....\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e4448c9a",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We now define the list of URLs that we want to download. These URLs were found by manually going UnifiedQA'S Google Cloud bucket: https://console.cloud.google.com/storage/browser/unifiedqa/data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 88,
|
||||
"id": "43e188b6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"urls = [\n",
|
||||
" \"https://storage.googleapis.com/unifiedqa/data/natural_questions/train.tsv\",\n",
|
||||
" \"https://storage.googleapis.com/unifiedqa/data/narrativeqa/train.tsv\",\n",
|
||||
" \"https://storage.googleapis.com/unifiedqa/data/newsqa/train.tsv\",\n",
|
||||
" \"https://storage.googleapis.com/unifiedqa/data/drop/train.tsv\",\n",
|
||||
" \"https://storage.googleapis.com/unifiedqa/data/commonsenseqa/train.tsv\",\n",
|
||||
" \"https://storage.googleapis.com/unifiedqa/data/physical_iqa/train.tsv\",\n",
|
||||
" \"https://storage.googleapis.com/unifiedqa/data/social_iqa/train.tsv\",\n",
|
||||
" \"https://storage.googleapis.com/unifiedqa/data/boolq/train.tsv\",\n",
|
||||
" \"https://storage.googleapis.com/unifiedqa/data/boolq_np/train.tsv\",\n",
|
||||
"]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 77,
|
||||
"id": "b39bb154",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"***** natural_questions ****\n",
|
||||
" Question \\\n",
|
||||
"0 which is the most common use of opt-in e-mail ... \n",
|
||||
"1 how i.met your mother who is the mother? \n",
|
||||
"\n",
|
||||
" Answer \n",
|
||||
"0 a newsletter sent to an advertising firm's cus... \n",
|
||||
"1 Tracy McConnell \n",
|
||||
"....\n",
|
||||
"***** narrativeqa ****\n",
|
||||
" Question \\\n",
|
||||
"0 Who is Miss Delmer? \\n At Madeline Hall, an o... \n",
|
||||
"1 Who is Miss Delmer? \\n At Madeline Hall, an o... \n",
|
||||
"\n",
|
||||
" Answer \n",
|
||||
"0 the elderly spinster aunt of the Earl de Vers... \n",
|
||||
"1 She's Captail Delmar's aunt. \n",
|
||||
"....\n",
|
||||
"***** newsqa ****\n",
|
||||
" Question Answer\n",
|
||||
"0 How many Americans are part of the federal foo... 31 million\n",
|
||||
"1 How much did Sean Callebs live on? \\n (CNN) --... $176\n",
|
||||
"....\n",
|
||||
"***** drop ****\n",
|
||||
" Question Answer\n",
|
||||
"0 How many points did the buccaneers need to tie... 3\n",
|
||||
"1 How many field goals did the Lions score? \\n T... 2\n",
|
||||
"....\n",
|
||||
"***** commonsenseqa ****\n",
|
||||
" Question Answer\n",
|
||||
"0 The sanctions against the school were a punish... ignore\n",
|
||||
"1 Sammy wanted to go to where the people were. ... populated areas\n",
|
||||
"....\n",
|
||||
"***** physical_iqa ****\n",
|
||||
" Question \\\n",
|
||||
"0 When boiling butter, when it's ready, you can ... \n",
|
||||
"1 To permanently attach metal legs to a chair, y... \n",
|
||||
"\n",
|
||||
" Answer \n",
|
||||
"0 Pour it into a jar \n",
|
||||
"1 Weld the metal together to get it to stay fir... \n",
|
||||
"....\n",
|
||||
"***** social_iqa ****\n",
|
||||
" Question Answer\n",
|
||||
"0 How would Others feel as a result? \\n (A) like... like attending\n",
|
||||
"1 What will Others want to do next? \\n (A) disag... get to work\n",
|
||||
"....\n",
|
||||
"***** boolq ****\n",
|
||||
" Question Answer\n",
|
||||
"0 can you buy beer at a grocery store in pa? \\n ... yes\n",
|
||||
"1 is the baby in fuller house a twin? \\n (List o... yes\n",
|
||||
"....\n",
|
||||
"***** boolq_np ****\n",
|
||||
" Question Answer\n",
|
||||
"0 do iran and afghanistan write the same languag... yes\n",
|
||||
"1 do iran and afghanistan read the same language... yes\n",
|
||||
"....\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for url in urls:\n",
|
||||
" convert_unified_qa(url)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f0309674",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
@@ -0,0 +1,13 @@
|
||||
# UnifiedQA Downloader
|
||||
|
||||
UnifiedQA is a notebook that downloads data from UnifiedQA's Google Cloud bucket
|
||||
and converts it into OpenAssistant Data Scheme formats. Files for each dataset
|
||||
in the UnifiedQA collection (excluding the ones already in xP3) are saved to
|
||||
JSON.
|
||||
|
||||
---
|
||||
|
||||
## Contributing
|
||||
|
||||
Feel free to contribute to this notebook. It's not perfect and additional
|
||||
functionality is planned.
|
||||
Reference in New Issue
Block a user