Merge pull request #571 from jquesnelle/soda-synthetic-dialogue

add soda_synthetic_dialogue dataset
ontocord
2023-01-21 08:29:50 -08:00
committed by GitHub
6 changed files with 483 additions and 0 deletions
@@ -0,0 +1,108 @@
---
annotations_creators:
- no-annotation
language:
- en
language_creators:
- machine-generated
license:
- mit
multilinguality:
- monolingual
pretty_name: "SODA Synthetic Dialogue"
size_categories:
- 1M<n<10M
source_datasets:
- extended|allenai/soda
tags:
- open-assistant
- conversations-summarization
task_categories:
- conversational
task_ids:
- dialogue-generation
dataset_info:
  features:
  - name: conversation
    dtype: string
  config_name: soda_synthetic_dialogue_dialogue_modeling
  splits:
  - name: train
    num_bytes: 1524941476
    num_examples: 1191582
  - name: test
    num_bytes: 190565707
    num_examples: 148968
  - name: validation
    num_bytes: 187161878
    num_examples: 146346
  download_size: 1952926579
  dataset_size: 1902669061
---
# Dataset Card for 🥤SODA Synthetic Dialogue
## Dataset Description
- **Repository:**
[Code](https://github.com/LAION-AI/Open-Assistant/openassistant/datasets/soda_synthetic_dialogue/)
## Dataset Summary
🥤SODA Synthetic Dialogue is a set of synthetic dialogues between Assistant and
User. In each conversation, User asks Assistant to perform a summarization or
story generation task, starting from a snippet of an existing dialogue, a
story, a title, or a theme.
This data was created by taking the dialogues in
[🥤Soda](https://huggingface.co/datasets/allenai/soda) and applying a set of
templates to turn each one into a User/Assistant conversation. The original
SODA research paper can be found
[here](https://arxiv.org/pdf/2212.10465v1.pdf).
Example:
```
User: Can you write the next few lines of dialogue for this scene:
Cornell: I know what you're thinking, Dontrell. I can sense your feelings and know exactly what you're going to say or do next. You're thinking about how much you miss your family, and how you wish you could see them again. You're wondering if there's anything you can do to make that happen. And you're also feeling a little guilty, because you know that if you could see them again, it would only be for a short while before they died.
Dontrell: Cornell, how can you possibly know all that? It's like you can read my mind!
Assistant: Sure, the next dialogue for this scene could be:
Cornell: I told you, I can sense your feelings. It's like I have a connection to your thoughts and emotions. I'm not sure why or how it works, but ever since we met, I've been able to understand what you're thinking and feeling pretty easily.
Dontrell: Wow...that's amazing! And a little bit scary too. But it does explain why we've always been so close - even though we've only known each other for a few months.
Cornell: Yeah, I guess our connection is pretty strong. But don't worry, Dontrell - I'll never use it to manipulate or control you in any way. I promise."
Dontrell: That's a relief! I was starting to feel a little bit like you could read my mind, and it was kind of creepy. But if you promise not to use your abilities to control me, then I'm okay with it.
Cornell: I swear, Dontrell. I'll never use my abilities to control or manipulate you in any way. You have my word.
User: And a title?
Assistant: Sure, a title for this dialogue could be:
Cornell is intuitive. Cornell reads Dontrell's mind.
User: How about a short description?
Assistant: Sure, a short description for this dialogue could be:
Cornell knows what Dontrell is thinking. He can sense her feelings and know exactly what she is going to say or do next.
User: What would be one theme of this story?
Assistant: One theme of this story could be:
being intuitive
```
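The templating step described above can be sketched roughly as follows. This is a minimal illustration, not the preparation script itself: the `render` helper and the sample record are hypothetical, and the exact whitespace inside the template is an assumption. The field names (`literal`, `narrative`, `speakers`, `dialogue`) are the ones the preparation script reads from SODA.

```python
# One of the conversation templates, reproduced here for illustration.
NEW_DIALOGUE_TEMPLATE = """User: Can you write a short dialogue based on this story:
 {story}
Assistant: Sure, a dialogue for this story could be:
 {dialogue}
User: And a title?
Assistant: Sure, a title for this dialogue could be:
 {title}"""


def render(record: dict) -> str:
    """Pair each utterance with its speaker, then fill the template."""
    lines = [
        f"{speaker}: {utterance}"
        for speaker, utterance in zip(record["speakers"], record["dialogue"])
    ]
    return NEW_DIALOGUE_TEMPLATE.format(
        story=record["narrative"],
        dialogue="\n ".join(lines),
        title=record["literal"],
    )


# Hand-written stand-in for a SODA row (not real dataset content).
record = {
    "literal": "Ana is tired. Ana goes home.",
    "narrative": "Ana feels tired after work, so she tells Ben she is going home.",
    "speakers": ["Ana", "Ben"],
    "dialogue": ["I'm heading home.", "Okay, rest well."],
}

conversation = render(record)
print(conversation)
```

The other templates follow the same pattern and differ only in which field is given as the prompt and which fields are produced as the Assistant's answers.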
## Usage
The dataset contains one configuration,
`soda_synthetic_dialogue_dialogue_modeling` (schema `dialogue_modeling`), which
has a single string feature, `conversation`.
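As a rough sketch of consuming the data, the snippet below parses one JSONL record of the kind the preparation script writes and splits its `conversation` string back into User and Assistant turns. The `split_turns` helper and the sample record are hypothetical, written only for illustration. The rule that a turn starts at a line beginning with `User: ` or `Assistant: ` is an assumption based on the templates, not something the dataset script provides.

```python
import json
from typing import List, Tuple

ROLES = ("User", "Assistant")


def split_turns(conversation: str) -> List[Tuple[str, str]]:
    """Split a conversation string into (role, text) turns.

    A new turn starts at any line beginning with "User: " or "Assistant: ".
    Every other line is treated as a continuation of the current turn.
    """
    turns: List[Tuple[str, str]] = []
    for line in conversation.split("\n"):
        for role in ROLES:
            prefix = role + ": "
            if line.startswith(prefix):
                turns.append((role, line[len(prefix):]))
                break
        else:
            # No role prefix matched: append the line to the previous turn, if any.
            if turns:
                role, text = turns[-1]
                turns[-1] = (role, text + "\n" + line)
    return turns


# Hand-written stand-in for one line of train.jsonl (not real dataset content).
jsonl_line = json.dumps(
    {
        "conversation": "User: And a title?\n"
        "Assistant: Sure, a title for this dialogue could be:\n"
        " Ana is tired."
    }
)

example = json.loads(jsonl_line)
turns = split_turns(example["conversation"])
for role, text in turns:
    print(role, "->", repr(text))
```

Dialogue lines such as `Cornell: ...` stay attached to the surrounding turn because they do not start with one of the two role prefixes.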
## Source data
The script to synthesize this dataset was originally created by
[ontocord](https://github.com/ontocord) in
[this Colab notebook](https://colab.research.google.com/drive/1Sw3px5dP8whdqT7QMNoqwmqIasZkMbJi?usp=sharing)
and prepared for Hugging Face by
[Jeffrey Quesnelle](https://github.com/jquesnelle/).
## Citation
Please cite our work if you find the resources in this repository useful:
```
@article{ontocord2023sodasynth,
author = {ontocord and Jeffrey Quesnelle},
title = {SODA Synthetic Dialogue},
year = {2023}
}
```
@@ -0,0 +1,21 @@
from dataclasses import dataclass

import datasets


@dataclass
class OpenAssistantConfig(datasets.BuilderConfig):
    """BuilderConfig for OpenAssistant datasets."""

    name: str = None
    version: datasets.Version = None
    description: str = None
    schema: str = None
    subset_id: str = None


features = datasets.Features(
    {
        "conversation": datasets.Value("string"),
    }
)
@@ -0,0 +1,246 @@
"""Prepare the SODA Synthetic Dialogue Dataset"""
import json
import os
import random
import sys
from datasets import load_dataset
from tqdm import tqdm
# adapted from https://colab.research.google.com/drive/1Sw3px5dP8whdqT7QMNoqwmqIasZkMbJi?usp=sharing
SUMMARY_TEMPLATE = """User: Can you give me a short story description for this dialogue?
{dialogue}
Assistant: Sure, a short story description for this dialogue could be:
{story}
User: And a title?
Assistant: Sure, a title for this dialogue could be:
{title}"""
THEME_TEMPLATE = """
User: What would be one theme of this story?
Assistant: One theme of this story could be:
{theme}"""
NEW_DIALOGUE_TEMPLATE = """User: Can you write a short dialogue based on this story:
{story}
Assistant: Sure, a dialogue for this story could be:
{dialogue}
User: And a title?
Assistant: Sure, a title for this dialogue could be:
{title}"""
NEXT_LINES_TEMPLATE = """User: Can you write the next few lines of dialogue for this scene:
{scene}
Assistant: Sure, the next dialogue for this scene could be:
{dialogue}
User: And a title?
Assistant: Sure, a title for this dialogue could be:
{title}
User: How about a short description?
Assistant: Sure, a short description for this dialogue could be:
{story}"""
NEW_STORY_AND_DIALOGUE_TEMPLATE = """User: Can you write a short story and dialogue about:
{title1}
Assistant: Sure, a short story and dialogue about: "{title1}" could be:
{story}"""
FULL_DIALOGUE_TEMPLATE = """{conversation}
{dialogue}"""
MORE_DIALOGUE_TEMPLATE = """{conversation}
{dialogue1}
User: Can you provide more dialogue assuming "{title2}"?
Assistant: Sure, the next dialogue for this scene could be:
{dialogue2}"""
NEXT_DIALOGUE_TEMPLATE = """{conversation}
{dialogue1}
User: More please.
Assistant: Sure, the next dialogue for this scene could be:
{dialogue2}"""
NEW_STORY_AND_DIALOGUE_FROM_THEME_TEMPLATE = """User: Can you write short story and dialogue based on the theme:
{theme}
Assistant: Sure, a short story and dialogue based on the theme "{theme}" could be:
{story}
{dialogue}
User: And a title?
Assistant: Sure, a title for this dialogue could be:
{title}"""
PRINT = len(sys.argv) > 1 and sys.argv[1] == "--print"
def main(output_dir: str = "data"):
    """Download and prepare the dataset for use."""
    random.seed(42)
    dataset = load_dataset("allenai/soda")
    os.makedirs(output_dir, exist_ok=True)
    for split in ["train", "test", "validation"]:
        with open(f"{output_dir}/{split}.jsonl", "w", encoding="utf8") as output:
            for i in tqdm(range(len(dataset[split])), desc=split):
                # Index into the split being written; reading dataset["train"] here
                # would fill test.jsonl and validation.jsonl with training rows.
                dat = dataset[split][i]
                title = dat["literal"]
                story = dat["narrative"]
                if dat["relation"] == "xWant":
                    theme = "wanting " + dat["tail"]
                elif dat["relation"] == "xNeed":
                    theme = "needing " + dat["tail"]
                elif not dat["tail"].startswith("to ") and not dat["tail"].startswith("and "):
                    theme = "being " + dat["tail"]
                elif dat["tail"].startswith("and "):
                    theme = "people are " + dat["tail"].replace("and PersonY ", "")
                else:
                    theme = dat["tail"]
                theme = theme.replace("PersonY", "another person")
                theme = theme.replace("being is", "being")
                # Prefix every utterance with its speaker's name.
                dialogue = [s2 + ": " + s1 for s1, s2 in zip(dat["dialogue"], dat["speakers"])]
                if random.randint(0, 6) == 0:
                    # print("##")
                    # print(f"User: Can you give me a short story description for this dialog?")
                    # print(" " + "\n ".join(dialog))
                    # print(f"Assistant: Sure, a short story description for this dialog could be: \n {story}")
                    # print("User: And a title?")
                    # print(f"Assistant: Sure, a title for this dialog could be: \n {title}")
                    # if theme:
                    #     print("User: What would be one theme of this story?")
                    #     print(f'Assistant: One theme of this story could be: "{theme}"')
                    conversation = SUMMARY_TEMPLATE.format(dialogue="\n ".join(dialogue), story=story, title=title)
                    if theme:
                        conversation = conversation + THEME_TEMPLATE.format(theme=theme)
                elif random.randint(0, 6) == 0:
                    # print("##")
                    # print(f"User: Can you write a short dialog based on this story:\n {story}")
                    # print(f"Assistant: Sure, a dialog for this story could be:")
                    # print(" " + "\n ".join(dialog))
                    # print("User: And a title?")
                    # print(f"Assistant: Sure, a title for this dialog could be: \n {title}")
                    # if theme:
                    #     print("User: What would be one theme of this story?")
                    #     print(f'Assistant: One theme of this story could be: "{theme}"')
                    conversation = NEW_DIALOGUE_TEMPLATE.format(
                        story=story, dialogue="\n ".join(dialogue), title=title
                    )
                    if theme:
                        conversation = conversation + THEME_TEMPLATE.format(theme=theme)
                elif random.randint(0, 3) == 0:
                    # print("##")
                    # print(f"User: Can you write the next few lines of dialog for this scene:")
                    # if random.randint(0, 1) == 0:
                    #     print(" " + "\n ".join(dialog[:-5]))
                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
                    #     print(" " + "\n ".join(dialog[-5:]))
                    # elif random.randint(0, 1) == 0:
                    #     print(" " + "\n ".join(dialog[:-3]))
                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
                    #     print(" " + "\n ".join(dialog[-3:]))
                    # else:
                    #     print(" " + "\n ".join(dialog[:-4]))
                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
                    #     print(" " + "\n ".join(dialog[-4:]))
                    # print("User: And a title?")
                    # print(f"Assistant: Sure, a title for this dialog could be: \n {title}")
                    # print("User: How about a short description?")
                    # print(f"Assistant: Sure, a short description for this dialog could be: \n {story}")
                    # if theme:
                    #     print("User: What would be one theme of this story?")
                    #     print(f'Assistant: One theme of this story could be: "{theme}"')
                    # Choose how many trailing lines the Assistant has to continue.
                    if random.randint(0, 1) == 0:
                        depth = -5
                    elif random.randint(0, 1) == 0:
                        depth = -3
                    else:
                        depth = -4
                    conversation = NEXT_LINES_TEMPLATE.format(
                        scene="\n ".join(dialogue[:depth]),
                        dialogue="\n ".join(dialogue[depth:]),
                        title=title,
                        story=story,
                    )
                    if theme:
                        conversation = conversation + THEME_TEMPLATE.format(theme=theme)
                elif random.randint(0, 3) == 0:
                    # print("##")
                    # title1 = title.split(".")[0]
                    # title2 = title.split(".")[1]
                    # print(f"User: Can you write short story and dialog about: {title1}")
                    # print(f'Assistant: Sure, a short story and dialog about: "{title1}" could be:')
                    # print(f" {story}")
                    # if random.randint(0, 1) == 0:
                    #     print(" " + "\n ".join(dialog))
                    # elif random.randint(0, 1) == 0 and len(dialog) > 5:
                    #     print(" " + "\n ".join(dialog[:-5]))
                    #     print(f'User: Can you provide more dialog assuming "{title2}"?')
                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
                    #     print(" " + "\n ".join(dialog[-5:]))
                    # elif random.randint(0, 1) == 0:
                    #     print(" " + "\n ".join(dialog[:-3]))
                    #     print("User: more please.")
                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
                    #     print(" " + "\n ".join(dialog[-3:]))
                    # else:
                    #     print(" " + "\n ".join(dialog[:-4]))
                    #     print(f'User: Can you provide more dialog assuming "{title2}"?')
                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
                    #     print(" " + "\n ".join(dialog[-4:]))
                    # if theme:
                    #     print("User: What would be one theme of this story?")
                    #     print(f'Assistant: One theme of this story could be: "{theme}"')
                    title1 = title.split(".")[0]
                    title2 = title.split(".")[1]
                    conversation = NEW_STORY_AND_DIALOGUE_TEMPLATE.format(title1=title1, story=story)
                    if random.randint(0, 1) == 0:
                        conversation = FULL_DIALOGUE_TEMPLATE.format(
                            conversation=conversation, dialogue="\n ".join(dialogue)
                        )
                    elif random.randint(0, 1) == 0 and len(dialogue) > 5:
                        conversation = MORE_DIALOGUE_TEMPLATE.format(
                            conversation=conversation,
                            dialogue1="\n ".join(dialogue[:-5]),
                            title2=title2,
                            dialogue2="\n ".join(dialogue[-5:]),
                        )
                    elif random.randint(0, 1) == 0:
                        conversation = NEXT_DIALOGUE_TEMPLATE.format(
                            conversation=conversation,
                            dialogue1="\n ".join(dialogue[:-3]),
                            dialogue2="\n ".join(dialogue[-3:]),
                        )
                    else:
                        conversation = MORE_DIALOGUE_TEMPLATE.format(
                            conversation=conversation,
                            dialogue1="\n ".join(dialogue[:-4]),
                            title2=title2,
                            dialogue2="\n ".join(dialogue[-4:]),
                        )
                    if theme:
                        conversation = conversation + THEME_TEMPLATE.format(theme=theme)
                else:
                    # print("##")
                    # print(f"User: Can you write short story and dialog based on the theme:\n {theme}")
                    # print(f'Assistant: Sure, a short story and dialog based on the theme "{theme}" could be:')
                    # print(f" {story}")
                    # print(" " + "\n ".join(dialog))
                    # print("User: And a title?")
                    # print(f"Assistant: Sure, a title for this dialog could be: \n {title}")
                    conversation = NEW_STORY_AND_DIALOGUE_FROM_THEME_TEMPLATE.format(
                        theme=theme, story=story, dialogue="\n ".join(dialogue), title=title
                    )
                if PRINT:
                    print("##")
                    print(conversation)
                # One JSON object per line, with a single "conversation" field.
                output.write(f"{json.dumps({'conversation': conversation})}\n")


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,108 @@
# Copyright 2023 The OpenAssistant Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This dataset is a set of dialogues synthesized from the SODA dataset.
In each dialogue, User and Assistant have a conversation about a story.
The original Colab notebook for this dataset can be found at:
https://colab.research.google.com/drive/1Sw3px5dP8whdqT7QMNoqwmqIasZkMbJi?usp=sharing
"""
import json
from typing import Dict, List, Tuple
import datasets
from .hub import OpenAssistantConfig, features
_CITATION = """\
@article{ontocord2023sodasynth,
author = {ontocord and Jeffrey Quesnelle},
title = {SODA Synthetic Dialogue},
year = {2023}
}
"""
_DATASETNAME = "soda_synthetic_dialogue"
_DISPLAYNAME = "🥤SODA Synthetic Dialogue"
_DESCRIPTION = "A set of dialogues synthesized from the SODA dataset."
_HOMEPAGE = ""
_LICENSE = "mit"
_URLS = {
_DATASETNAME: {"train": "./data/train.jsonl", "test": "./data/test.jsonl", "validation": "./data/validation.jsonl"}
}
_SUPPORTED_TASKS = ["dialogue-modeling"]
_VERSION = "1.0.0"
class SODASyntheticDialogueDataset(datasets.GeneratorBasedBuilder):
    """A set of dialogues synthesized from the SODA dataset."""

    VERSION = datasets.Version(_VERSION)

    BUILDER_CONFIGS = [
        OpenAssistantConfig(
            name=f"{_DATASETNAME}_dialogue_modeling",
            version=VERSION,
            description=f"OpenAssistant dataset config for {_DATASETNAME}",
            schema="dialogue_modeling",
            subset_id=_DATASETNAME,
        )
    ]

    DEFAULT_CONFIG_NAME = f"{_DATASETNAME}_dialogue_modeling"

    def _info(self) -> datasets.DatasetInfo:
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager) -> List[datasets.SplitGenerator]:
        """Returns SplitGenerators."""
        urls = _URLS[_DATASETNAME]
        data_dir = dl_manager.download_and_extract(urls)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": data_dir, "split": "train"},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={"filepath": data_dir, "split": "test"},
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={"filepath": data_dir, "split": "validation"},
            ),
        ]

    def _generate_examples(self, filepath, split: str) -> Tuple[int, Dict]:
        """Yields examples as (key, example) tuples."""
        if self.config.schema == "dialogue_modeling":
            # filepath maps split names to local JSONL paths; each line is one example.
            with open(filepath[split], "r", encoding="utf8") as data:
                for key, line in enumerate(data):
                    yield key, json.loads(line)