diff --git a/notebooks/diverse/README.md b/notebooks/diverse/README.md new file mode 100644 index 00000000..a56806c6 --- /dev/null +++ b/notebooks/diverse/README.md @@ -0,0 +1,11 @@ +# DIVERSE Downloader + +Diverse is a notebook that downloads the DIVERSE dataset and converts it into +OpenAssistant Data Scheme formats. + +--- + +## Contributing + +Feel free to contribute to this notebook. It's not perfect and additional +functionality is planned. diff --git a/notebooks/diverse/diverse.ipynb b/notebooks/diverse/diverse.ipynb new file mode 100644 index 00000000..c4c6a518 --- /dev/null +++ b/notebooks/diverse/diverse.ipynb @@ -0,0 +1,477 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "00b2848c", + "metadata": {}, + "source": [ + "# Diverse Dataset" + ] + }, + { + "cell_type": "markdown", + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CmXjXVrmPtpAVBaogBSuDclM0O6Zzewf?usp=sharing)" + ], + "metadata": { + "collapsed": false + } + }, + { + "cell_type": "markdown", + "id": "d81932b9", + "metadata": {}, + "source": [ + "The purpose of this notebook is to download the DIVERSE dataset and convert it into a format that can be used for training the OpenAssistant.\n", + "\n", + "The DIVERSE repo can be found here: https://github.com/microsoft/CodeT/tree/main/DIVERSE\n", + "\n", + "If you extend or use this work, please cite the relevant papers:\n", + "```\n", + "@article{li2022advance,\n", + " title={On the Advance of Making Language Models Better Reasoners},\n", + " author={Li, Yifei and Lin, Zeqi and Zhang, Shizhuo and Fu, Qiang and Chen, Bei and Lou, Jian-Guang and Chen, Weizhu},\n", + " journal={arXiv preprint arXiv:2206.02336},\n", + " year={2022}\n", + "}\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "a8c98078", + "metadata": {}, + "source": [ + "# OpenAssistant Data Scheme" + ] + }, + { + "cell_type": "markdown", + "id": "2731f88f", + "metadata": {}, + "source": [ + "We will use the data scheme that can be found in the docs for Open-Assistant. This code is taken from the StackExchange notebook." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "d35ab066", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import TypeVar, List, Dict, Any, Literal\n", + "from json import JSONEncoder\n", + "\n", + "T = TypeVar(\"T\", bound=\"ConversationTreeNode\")\n", + "\n", + "\n", + "class ConversationTreeNode:\n", + " text: str # The text of the node\n", + " role: Literal[\"prompter\", \"assistant\"] # Whether the node is a user prompt/follow-up or an assistant response\n", + " children: List[T] # The children of the node (if you have a linear conversation, this will be of length 0 or 1)\n", + " metadata: Dict[str, Any] # Node metadata (see below)\n", + "\n", + " def __init__(\n", + " self, text: str, role: Literal[\"prompter\", \"assistant\"], children: List[T], metadata: Dict[str, Any]\n", + " ) -> None:\n", + " self.text = text\n", + " self.role = role\n", + " self.children = children\n", + " self.metadata = metadata\n", + "\n", + "\n", + "class ConversationTree:\n", + " root: ConversationTreeNode # The node containing the initial prompt\n", + " metadata: Dict[str, Any] # Tree metadata, different from root node metadata.\n", + "\n", + " def __init__(self, root: ConversationTreeNode, metadata: Dict[str, Any]) -> None:\n", + " self.root = root\n", + " self.metadata = metadata\n", + "\n", + "\n", + "# subclass JSONEncoder\n", + "class TreeEncoder(JSONEncoder):\n", + " def default(self, o):\n", + " return o.__dict__" + ] + }, + { + "cell_type": "markdown", + "id": "e7457bae", + "metadata": {}, + "source": [ + "# Download and convert" + ] + }, + { + "cell_type": "markdown", + "id": "54b0fd63", + "metadata": {}, + "source": [ + "We firstly import pandas and any other libraries that we'll need." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "9317d4b4", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import json" + ] + }, + { + "cell_type": "markdown", + "id": "62dc4e18", + "metadata": {}, + "source": [ + "The following is a simple function to take the data (which has two columns) and convert it to a tree with a root note (question) and one child (answer)." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "963e0d92", + "metadata": {}, + "outputs": [], + "source": [ + "import re\n", + "\n", + "\n", + "def convert_diverse(dataset_json_path):\n", + " # read files using pandas\n", + " ds = pd.read_json(dataset_json_path, lines=True)\n", + "\n", + " # create dataset name from path\n", + " ds_name = \"diverse\" + file.split(\"data\")[-1].replace(\"/\", \"_\").split(\".\")[0]\n", + " print(\"*****\", ds_name, \"****\")\n", + " print(\"Example of raw dataset\")\n", + " print(ds.head(2))\n", + "\n", + " # create conversation forest\n", + " # Print first sample so the user of this notebook has an idea of what he's looking at\n", + " first_sample = True\n", + " print(\"\\nExamples from converted dataset\")\n", + " conversation_forest = []\n", + " for item in ds[\"context\"]:\n", + " # build nodes and tree\n", + " # Find all answers:\n", + "\n", + " answers = re.findall(r\"Answer:?(.*?)#\", item.replace(\"\\n\", \" \"))\n", + " questions = re.findall(r\"Question:?(.*?) Answer:\", item.replace(\"\\n\", \" \"))\n", + "\n", + " # The last question does not contain an aswer so we drop it every time.\n", + " if len(answers) < len(questions):\n", + " questions.pop(-1)\n", + "\n", + " for (answer, question) in zip(answers, questions):\n", + " if first_sample:\n", + " print(f\"Q: {question}\")\n", + " print(f\"A: {answer}\")\n", + " root = ConversationTreeNode(text=question, role=\"prompter\", children=[], metadata=None)\n", + " child = ConversationTreeNode(text=answer, role=\"assistant\", children=[], metadata=None)\n", + " root.children.append(child)\n", + " conversation_tree = ConversationTree(root=root, metadata={\"dataset\": ds_name})\n", + " conversation_forest.append(conversation_tree)\n", + "\n", + " first_sample = False\n", + "\n", + " conversation_forest_json = [\n", + " json.loads(TreeEncoder().encode(conversation_tree)) for conversation_tree in conversation_forest\n", + " ]\n", + "\n", + " print(json.dumps(conversation_forest_json, indent=4), file=open(f\"./{ds_name}.json\", \"w+\"))\n", + " print(\"\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "e4448c9a", + "metadata": {}, + "source": [ + "We now clone the repository containing the dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "06e7719e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Cloning into 'CodeT'...\r\n", + "remote: Enumerating objects: 144, done.\u001B[K\r\n", + "remote: Counting objects: 100% (16/16), done.\u001B[K\r\n", + "remote: Compressing objects: 100% (16/16), done.\u001B[K\r\n", + "remote: Total 144 (delta 1), reused 0 (delta 0), pack-reused 128\u001B[Kving objects: 8% (12/144), 3.70 MiB | 1.35 MiB/s Receiving objects: 12% (18/144), 5.13 MiB | 1.58 MiB/s Receiving objects: 13% (19/144), 11.36 MiB | 2.39 MiB/s Receiving objects: 13% (20/144), 19.19 MiB | 3.97 MiB/s Receiving objects: 15% (22/144), 19.19 MiB | 3.97 MiB/s Receiving objects: 23% (34/144), 22.15 MiB | 4.37 MiB/s Receiving objects: 27% (39/144), 22.15 MiB | 4.37 MiB/s Receiving objects: 29% (42/144), 25.30 MiB | 4.77 MiB/s Receiving objects: 32% (47/144), 28.71 MiB | 5.22 MiB/s Receiving objects: 41% (60/144), 32.41 MiB | 5.62 MiB/s Receiving objects: 54% (78/144), 32.41 MiB | 5.62 MiB/s Receiving objects: 60% (87/144), 39.34 MiB | 6.20 MiB/s Receiving objects: 61% (88/144), 47.14 MiB | 6.79 MiB/s Receiving objects: 64% (93/144), 51.36 MiB | 7.12 MiB/s Receiving objects: 66% (96/144), 55.58 MiB | 7.40 MiB/s \r\n", + "Receiving objects: 100% (144/144), 56.76 MiB | 4.97 MiB/s, done.\r\n", + "Resolving deltas: 100% (33/33), done.\r\n", + "Checking out files: 100% (64/64), done.\r\n" + ] + } + ], + "source": [ + "!git clone https://github.com/microsoft/CodeT.git" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "89a166c9", + "metadata": {}, + "outputs": [], + "source": [ + "diverse_files = [\n", + " \"CodeT/DIVERSE/data/sqa/split1/test.jsonl\",\n", + " \"CodeT/DIVERSE/data/sqa/split1/train.jsonl\",\n", + " \"CodeT/DIVERSE/data/sqa/split2/test.jsonl\",\n", + " \"CodeT/DIVERSE/data/sqa/split2/train.jsonl\",\n", + " \"CodeT/DIVERSE/data/gsm8k/test.jsonl\",\n", + " \"CodeT/DIVERSE/data/gsm8k/train.jsonl\",\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "2da75b14", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "***** diverse_sqa_split1_test ****\n", + "Example of raw dataset\n", + " context \\\n", + "0 Question:\\nIs clerk of Supreme Court of Canada... \n", + "1 Question:\\nIs Saturn named after king of gods ... \n", + "\n", + " samples \\\n", + "0 [Snoopy is a dog.\\nChance is a dog.\\nDogs look... \n", + "1 [Snoopy is a cartoon dog.\\nChance is a cartoon... \n", + "\n", + " metadata \n", + "0 {'type': 'solution', 'question': 'Does Snoopy ... \n", + "1 {'type': 'solution', 'question': 'Does Snoopy ... \n", + "\n", + "Examples from converted dataset\n", + "Q: Is clerk of Supreme Court of Canada safe profession for someone with seismophobia?\n", + "A: The Supreme Court of Canada is in Ottawa, Canada. Ottawa is in Ontario, Canada. Ontario is in the Canadian Shield. The Canadian Shield is a stable tectonic plate. Thus, Ottawa is not prone to earthquakes. Thus, the clerk of the Supreme Court of Canada is a safe profession for someone with seismophobia. So the answer is yes.\n", + "Q: During the Cuban revolution, did the US experience a population boom?\n", + "A: The Cuban revolution was in 1959. The US population in 1959 was about 180 million. The US population in 2010 was about 310 million. Thus, the US population increased by about 130 million. So the answer is yes.\n", + "Q: Can the largest crustacean stretch out completely on a king-sized mattress?\n", + "A: The largest crustacean is the Japanese spider crab. The Japanese spider crab has a leg span of 3.8 meters. A king-sized mattress is 1.9 meters wide. Thus, the Japanese spider crab could not stretch out completely on a king-sized mattress. So the answer is no.\n", + "Q: Could morphine cure HIV?\n", + "A: Morphine is a painkiller. HIV is a virus. Painkillers do not cure viruses. Thus, morphine could not cure HIV. So the answer is no.\n", + "Q: Is Christopher Nolan indebted to Bob Kane?\n", + "A: Bob Kane created Batman. Christopher Nolan directed the Dark Knight trilogy. The Dark Knight trilogy is about Batman. Thus, Christopher Nolan is indebted to Bob Kane. So the answer is yes.\n", + "Q: In baseball, is a \"Homer\" named after the poet Homer who wrote the Odyssey?\n", + "A: A \"Homer\" is a home run. The term \"home run\" was coined by Harry Wright in 1858. Homer lived from about 800 BC to about 700 BC. Thus, the term \"home run\" was not named after the poet Homer. So the answer is no.\n", + "\n", + "\n", + "***** diverse_sqa_split1_train ****\n", + "Example of raw dataset\n", + " context \\\n", + "0 Question:\\nIs eating a Dicopomorpha echmeptery... \n", + "1 Question:\\nWas Alexander the Great baptized?\\n... \n", + "\n", + " samples \\\n", + "0 [New York Public Library is a library.\\nSix Fl... \n", + "1 [The New York Public Library is in New York Ci... \n", + "\n", + " metadata \n", + "0 {'type': 'solution', 'question': 'Could you go... \n", + "1 {'type': 'solution', 'question': 'Could you go... \n", + "\n", + "Examples from converted dataset\n", + "Q: Is eating a Dicopomorpha echmepterygis size Uranium pellet fatal?\n", + "A: The Dicopomorpha echmepterygis is a beetle. The mass of a beetle is about 0.1 g. The mass of a Uranium pellet is about 10 g. Thus, eating a Dicopomorpha echmepterygis is not the same as eating a Uranium pellet. So the answer is no.\n", + "Q: Does Jerry Seinfeld hang out at the Budweiser Party Deck?\n", + "A: Jerry Seinfeld is a comedian. Comedians perform at comedy clubs. The Budweiser Party Deck is not a comedy club. Thus, Jerry Seinfeld does not hang out at the Budweiser Party Deck. So the answer is no.\n", + "Q: Does The Jungle Book contain racist subtext?\n", + "A: The Jungle Book is a book about a boy raised by wolves. The boy is white. The wolves are black. Thus, The Jungle Book contains racist subtext. So the answer is yes.\n", + "Q: Does Dean Cain have less days to birthday than Will Ferrell every 4th of July?\n", + "A: Dean Cain was born on July 31. Will Ferrell was born on July 16. July 4 is the middle of July. Thus, Dean Cain has more days to his birthday than Will Ferrell every 4th of July. So the answer is no.\n", + "Q: Do mail carriers need multiple uniforms?\n", + "A: Mail carriers need to wear uniforms. Uniforms are clothes. People need multiple clothes. Thus, mail carriers need multiple uniforms. So the answer is yes.\n", + "Q: Can Curiosity take samples of rocks from Lacus Temporis?\n", + "A: Curiosity is a rover on Mars. Lacus Temporis is a lake on Mars. Curiosity is not in Lacus Temporis. Thus, Curiosity cannot take samples of rocks from Lacus Temporis. So the answer is no.\n", + "\n", + "\n", + "***** diverse_sqa_split2_test ****\n", + "Example of raw dataset\n", + " context \\\n", + "0 Question:\\nDoes the density of helium cause vo... \n", + "1 Question:\\nIs Mark Cuban able to visit Norther... \n", + "\n", + " samples \\\n", + "0 [The New York Public Library is in New York Ci... \n", + "1 [The distance from New York Public Library to ... \n", + "\n", + " metadata \n", + "0 {'type': 'solution', 'question': 'Could you go... \n", + "1 {'type': 'solution', 'question': 'Could you go... \n", + "\n", + "Examples from converted dataset\n", + "Q: Does the density of helium cause voices to sound deeper?\n", + "A: The density of helium is 0.1785 g/L, which is less than air. The density of air is 1.225 g/L. The density of helium is less than air, so helium causes voices to sound higher. Thus, helium does not cause voices to sound deeper. So the answer is no.\n", + "Q: Would Janet Jackson avoid a dish with ham?\n", + "A: Janet Jackson is a vegetarian. Vegetarians do not eat meat. Ham is a type of meat. Thus, Janet Jackson would avoid a dish with ham. So the answer is yes.\n", + "Q: Do people watching Coen brothers films in Guinea Bissau need subtitles?\n", + "A: The Coen brothers are American. Americans speak English. Guinea Bissau is in Africa. Africans speak Portuguese. Thus, people in Guinea Bissau would need subtitles to watch Coen brothers films. So the answer is yes.\n", + "Q: Could a Gladiator's weapon crush a diamond?\n", + "A: A gladiator's weapon was a sword. The hardness of a diamond is 10. The hardness of a sword is 5. Thus, a gladiator's weapon could not crush a diamond. So the answer is no.\n", + "Q: Can Spartina Patens thrive in the Sahara Desert?\n", + "A: Spartina Patens is a salt marsh grass. The Sahara Desert is a desert. Deserts are dry. Thus, Spartina Patens would not thrive in the Sahara Desert. So the answer is no.\n", + "Q: Were all the materials to make a cannon known during the bronze age?\n", + "A: The bronze age was about 3000 BC to 1200 BC. The bronze age was before the iron age. Thus, iron was not known during the bronze age. Iron is needed to make a cannon. Thus, all the materials to make a cannon were not known during the bronze age. So the answer is no.\n", + "\n", + "\n", + "***** diverse_sqa_split2_train ****\n", + "Example of raw dataset\n", + " context \\\n", + "0 Question:\\nWas the Peak of the Andes hidden fr... \n", + "1 Question:\\nIs Mark Cuban able to visit Norther... \n", + "\n", + " samples \\\n", + "0 [Snoopy is a dog.\\nChance is a dog.\\nChance is... \n", + "1 [Snoopy is a cartoon dog.\\nChance is a cartoon... \n", + "\n", + " metadata \n", + "0 {'type': 'solution', 'question': 'Does Snoopy ... \n", + "1 {'type': 'solution', 'question': 'Does Snoopy ... \n", + "\n", + "Examples from converted dataset\n", + "Q: Was the Peak of the Andes hidden from the view of the Colossus of Rhodes?\n", + "A: The Colossus of Rhodes was a statue of the Greek god Helios. The Colossus of Rhodes was located on the island of Rhodes. The Peak of the Andes is in South America. Thus, the Peak of the Andes was not visible from the Colossus of Rhodes. So the answer is yes.\n", + "Q: Can you swim to Miami from New York?\n", + "A: The distance from New York to Miami is about 1,500 miles. The fastest swimmer can swim about 2 miles per hour. Thus, it would take about 750 hours to swim from New York to Miami. Thus, you could not swim to Miami from New York. So the answer is no.\n", + "Q: Is Freya a combination of Athena and Aphrodite?\n", + "A: Freya is the Norse goddess of love and beauty. Athena is the Greek goddess of wisdom and war. Aphrodite is the Greek goddess of love and beauty. Thus, Freya is a combination of Athena and Aphrodite. So the answer is yes.\n", + "Q: Were the Great Pyramids built by a theocratic government?\n", + "A: The Great Pyramids were built by the Egyptians. The Egyptians were ruled by a pharaoh. The pharaoh was considered a god. Thus, the Great Pyramids were built by a theocratic government. So the answer is yes.\n", + "Q: Was P. G. Wodehouse's favorite book The Hunger Games?\n", + "A: P. G. Wodehouse's favorite book was The Pickwick Papers. The Pickwick Papers was written by Charles Dickens. The Hunger Games was written by Suzanne Collins. Thus, P. G. Wodehouse's favorite book was not The Hunger Games. So the answer is no.\n", + "Q: Can Burundi's communicate with citizens of New Brunswick?\n", + "A: Burundi's speak French. New Brunswick is in Canada. Canada is a bilingual country. Thus, Burundi's can communicate with citizens of New Brunswick. So the answer is yes.\n", + "\n", + "\n", + "***** diverse_gsm8k_test ****\n", + "Example of raw dataset\n", + " context \\\n", + "0 Question:\\nA community is building a metal fen... \n", + "1 Question:\\nThe white rabbit can hop 15 meters ... \n", + "\n", + " samples \\\n", + "0 [She eats 3 and uses 4, so that is 7 eggs.\\n16... \n", + "1 [She eats 3 eggs for breakfast and uses 4 in m... \n", + "\n", + " metadata \n", + "0 {'question': 'Janet’s ducks lay 16 eggs per da... \n", + "1 {'question': 'Janet’s ducks lay 16 eggs per da... \n", + "\n", + "Examples from converted dataset\n", + "Q: A community is building a metal fence. Each fence panel is made of 3 metal sheets, and 2 metal beams. The fence is made of 10 fence panels. If each sheet is made of 10 metal rods and each metal beam is made of 4 metal rods, how many metal rods does the community need for the fence?\n", + "A: In each panel, the metal sheets use 3 metal sheets * 10 metal rods = <<3*10=30>>30 metal rods. In each panel, the metal beams use 2 metal beams * 4 metal rods = <<2*4=8>>8 metal rods. So each panel uses 30 + 8 = <<30+8=38>>38 metal rods. The entire fence therefore needs 38 metal rods * 10 fence panels = <<38*10=380>>380 metal rods. \n", + "Q: John buys 3 dress shirts. They sell for $20 each. He also has to pay 10% tax on everything. How much did he pay in total?\n", + "A: The shirts cost 3*$20=$<<3*20=60>>60 before tax The tax cost $60*.1=$<<60*.1=6>>6 So in total they paid $60+$6=$<<60+6=66>>66 \n", + "Q: Bob gets rent assistance because he's low-income. If he gets a raise of $0.50/hour and works 40 hours a week, how much more will he actually earn a week if his housing benefit is reduced by $60/month?\n", + "A: First find the total increase in Bob's earnings: $0.50/hour * 40 hours/week = $<<0.50*40=20>>20/week Then find the weekly decrease in Bob's housing assistance: $60/month / 4 weeks/month = $<<60/4=15>>15/week Then subtract the lost assistance from the increased wages to find Bob's net increase in money: $20/week - $15/week = $<<20-15=5>>5/week \n", + "Q: Annie plants 3 pots of basil, 9 pots of rosemary, and 6 pots of thyme. Each basil plant has 4 leaves, each rosemary plant has 18 leaves, and each thyme plant has 30 leaves. How many leaves are there total?\n", + "A: First find the total number of basil leaves: 3 pots * 4 leaves/pot = <<3*4=12>>12 leaves Then find the total number of rosemary leaves: 9 pots * 18 leaves/pot = <<9*18=162>>162 leaves Then find the total number of thyme leaves: 6 pots * 30 leaves/pot = <<6*30=180>>180 leaves Then add the number of each type of leaf to find the total number of leaves: 12 leaves + 162 leaves + 180 leaves = <<12+162+180=354>>354 leaves \n", + "Q: There are 7 mL of solution in each of 6 test tubes. Dr. Igor takes all of the solution and then evenly distributes it into 3 beakers. How many mL of solution are in each beaker?\n", + "A: 7 * 6 = <<7*6=42>>42 mL 42/3 = <<42/3=14>>14 mL Each beaker holds 14 mL of solution. \n", + "Q: Janet pays $40/hour for 3 hours per week of clarinet lessons and $28/hour for 5 hours a week of piano lessons. How much more does she spend on piano lessons than clarinet lessons in a year?\n", + "A: First find the total Janet spends on clarinet lessons per week: $40/hour * 3 hours/week = $<<40*3=120>>120/week Then find the total Janet spends on piano lessons per week: $28/hour * 5 hours/week = $<<28*5=140>>140/week Then subtract her weekly clarinet spending from her weekly piano spending to find the weekly difference: $140/week - $120/week = $<<140-120=20>>20/week Then multiply the weekly difference by the number of weeks in a year to find the annual difference: $20/week * 52 weeks/year = $<<20*52=1040>>1040/year \n", + "Q: A normal lemon tree produces 60 lemons per year. Jim has specially engineered lemon trees that produce 50% more lemons per year. He has a grove that is 50 trees by 30 trees. How many lemons does he produce in 5 years?\n", + "A: Each tree produces 60*.5=<<60*.5=30>>30 more lemons than normal So they each produce 60+30=<<60+30=90>>90 lemons He has 50*30=<<50*30=1500>>1500 trees So every year he produces 1500*90=<<1500*90=135000>>135000 lemons That means he produces 135000*5=<<135000*5=675000>>675,000 \n", + "Q: Billy weighs 9 pounds more than Brad. Brad weighs 5 pounds more than Carl. If Carl weighs 145 pounds, how much does Billy weigh, in pounds?\n", + "A: Brad weighs 145+5=<<145+5=150>>150 pounds. Billy weighs 150+9=<<150+9=159>>159 pounds. \n", + "\n", + "\n", + "***** diverse_gsm8k_train ****\n", + "Example of raw dataset\n", + " context \\\n", + "0 Question:\\nA magician has a top hat with 20 re... \n", + "1 Question:\\nA community is building a metal fen... \n", + "\n", + " samples \\\n", + "0 [There are 125 cars in total\\n64% of them are ... \n", + "1 [64% of 125 is <<64/100*125=80>>80.\\n8% of 125... \n", + "\n", + " metadata \n", + "0 {'question': 'Pauline has 125 matchbox cars. T... \n", + "1 {'question': 'Pauline has 125 matchbox cars. T... \n", + "\n", + "Examples from converted dataset\n", + "Q: A magician has a top hat with 20 red marbles and a top hat with 30 blue marbles. If he takes away 3 red marbles and four times as many blue marbles as red marbles (without looking), how many marbles in total does he have left?\n", + "A: He had 20 red marbles and took away 3 leaving 20-3 = <<20-3=17>>17 red marbles He took 4 times as many blue marbles as red marbles which is 4*3 = <<4*3=12>>12 blue marbles He took 12 blue marbles from 30 leaving 30-12 = 18 blue marbles He now has 17+18 = <<17+18=35>>35 marbles left \n", + "Q: Lucas wants to get a dog but his parents think he already has too many pets and won't have enough space. He already has 12 pet beds in his room but manages to fit another 8 pet beds. His parents argue that each pet is going to need 2 beds each to feel comfortable. According to his parent's argument, how many pets does Lucas have enough room for?\n", + "A: Lucas has a total of 12 existing pet beds + 8 new pet beds = <<12+8=20>>20 pet beds. So according to his parents, Lucas has enough room for 20 pet beds / 2 pet beds per pet = <<20/2=10>>10 pets. \n", + "Q: Super Clean Car Wash Company cleans 80 cars per day. They make $5 per car washed. How much money will they make in 5 days?\n", + "A: Each day they will make 80 × $5 = $<<80*5=400>>400. They will make $400 × 5 = $<<400*5=2000>>2000 in 5 days. \n", + "Q: Eighteen hours ago, Beth and I took 100 photographs of our project. Today, Beth and I will take 20% fewer photographs of the same project. If we were to take 300 photographs of the project, how many photographs would we take to reach the target?\n", + "A: If you took 100 photographs of the project 18 hours ago, and today 20% few photographs have been taken, then 20/100*100 = 20 fewer photographs of the project have been taken today. The total number of photographs of the project that have been taken today is 100-20 = <<100-20=80>>80 So far, you've taken 80+100 = <<80+100=180>>180 photographs of the project. Since the target number of photographs is 300, the number of photographs that you need to take to reach the target is 300-180 = <<300-180=120>>120 \n", + "Q: Ruby was going to order pizza for dinner. Her son would only eat pepperoni pizza. Her daughter would only eat sausage. Ruby and her husband wanted black olive and mushroom pizza. To make life easy, Ruby decided to order an entire pizza for each of her children and she would split one with her husband. The pizza restaurant charged $10.00 per pizza and $1.00 per topping. She also needed to add a $5.00 tip. Including tip, how much would the pizza order cost?\n", + "A: Ruby was going to order 1 for her son, 1 for her daughter and 1 to share with her husband. So she needed to order 1+1+1 = <<1+1+1=3>>3 pizzas Each pizza cost $10 and she was ordering 3 so that comes to 10*3 = $<<10*3=30.00>>30.00 She needed to order pepperoni, sausage, black olive and mushroom, which came to 4 toppings at $1.00 each so 4*1 = $<<4*1=4.00>>4.00 extra for toppings The pizzas cost $30 and $4 for the toppings so the total costs of the pizzas came to 30+4 = $<<30+4=34.00>>34.00 She also had to add a $5.00 tip to her current order total of $34.00 so 5+34.00 = $<<5+34=39.00>>39.00 for the total order \n", + "Q: Yves and his siblings ordered pizza and asked to have it cut into 16 slices. During dinner time, they only ate one-fourth of it. The next day, Yves ate one-fourth of the remaining pizza. Then his two siblings ate 2 slices each. How many slices of pizza were left?\n", + "A: During dinner time, Yves and his siblings ate 16/4 = <<16/4=4>>4 slices. So the next day, there were still 16 - 4 = <<16-4=12>>12 slices left. The next day, Yves ate 12/4 = <<12/4=3>>3 slices of pizza. Thus, there were 12 - 3 = <<12-3=9>>9 slices left. Then, his two siblings ate 2 x 2 = <<2*2=4>>4 slices of pizza. Therefore, there were still 9 - 4 = <<9-4=5>>5 slices of pizza left. \n", + "Q: Pria bought a new car that advertised an estimated gas mileage of 35 miles per gallon. The car has a 12-gallon tank. She filled her car full of gas and was able to drive a total of 372 miles. What was the difference, in miles per gallon, between Pria's mileage and the advertised mileage?\n", + "A: Pria's car achieved a rate of 372 miles / 12 gallons = <<372/12=31>>31 miles per gallon. Therefore, it was a difference of 35 - 31 = <<35-31=4>>4 miles per gallon. \n", + "Q: Janet has 24 dresses. Half of them have pockets. Of those, a third have 2 pockets and the rest have 3 pockets. How many total pockets do her dresses have?\n", + "A: She has 24/2=<<24/2=12>>12 dresses with pockets Of those 12/3=4 have 2 pockets So 12-4=<<12-4=8>>8 have three pockets So the dresses with 2 pockets have 2*4=<<2*4=8>>8 pockets The other dresses contribute 8*3=<<8*3=24>>24 pockets So she has a total of 8+24=<<8+24=32>>32 pockets \n", + "\n", + "\n" + ] + } + ], + "source": [ + "for file in diverse_files:\n", + " convert_diverse(file)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [], + "metadata": { + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "name": "conda-env-open-assistant-py", + "language": "python", + "display_name": "Python [conda env:open-assistant]" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.0" + }, + "vscode": { + "interpreter": { + "hash": "25d5c2324055587ceaeef27650c79ce8358ea61d7689f2e0b8ada5d53f85bce4" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}