Merge pull request #571 from jquesnelle/soda-synthetic-dialogue

add soda_synthetic_dialogue dataset
2026-06-27 16:10:30 +08:00 · 2023-01-21 08:29:50 -08:00
parent e08b36e675 3222485d6f
commit 128102e6f1
6 changed files with 483 additions and 0 deletions
@@ -0,0 +1,108 @@
+---
+annotations_creators:
+  - no-annotation
+language:
+  - en
+language_creators:
+  - machine-generated
+license:
+  - mit
+multilinguality:
+  - monolingual
+pretty_name: "SODA Synthetic Dialogue"
+size_categories:
+  - 1M<n<10M
+source_datasets:
+  - extended|allenai/soda
+tags:
+  - open-assistant
+  - conversations-summarization
+task_categories:
+  - conversational
+task_ids:
+  - dialogue-generation
+dataset_info:
+  features:
+    - name: conversation
+      dtype: string
+  config_name: soda_synthetic_dialogue_dialogue_modeling
+  splits:
+    - name: train
+      num_bytes: 1524941476
+      num_examples: 1191582
+    - name: test
+      num_bytes: 190565707
+      num_examples: 148968
+    - name: validation
+      num_bytes: 187161878
+      num_examples: 146346
+  download_size: 1952926579
+  dataset_size: 1902669061
+---
+
+# Dataset Card for 🥤SODA Synthetic Dialogue
+
+## Dataset Description
+
+- **Repository:**
+  [Code](https://github.com/LAION-AI/Open-Assistant/openassistant/datasets/soda_synthetic_dialogue/)
+
+## Dataset Summary
+
+🥤SODA Synthetic Dialogue is a set of synthetic dialogues between Assistant and
+User. In each conversation, User asks Assistant to perform summarization or
+story generation tasks based on a snippet of an existing dialogue or story, or
+on a title or theme.
+
+This data was created by taking the dialogues in
+[🥤Soda](https://huggingface.co/datasets/allenai/soda) and applying a set of
+templates that turn each one into a User/Assistant conversation. The original
+SODA research paper can be found
+[here](https://arxiv.org/pdf/2212.10465v1.pdf).
+
+Example:
+
+```
+User: Can you write the next few lines of dialogue for this scene:
+  Cornell: I know what you're thinking, Dontrell. I can sense your feelings and know exactly what you're going to say or do next. You're thinking about how much you miss your family, and how you wish you could see them again. You're wondering if there's anything you can do to make that happen. And you're also feeling a little guilty, because you know that if you could see them again, it would only be for a short while before they died.
+  Dontrell: Cornell, how can you possibly know all that? It's like you can read my mind!
+Assistant: Sure, the next dialogue for this scene could be:
+  Cornell: I told you, I can sense your feelings. It's like I have a connection to your thoughts and emotions. I'm not sure why or how it works, but ever since we met, I've been able to understand what you're thinking and feeling pretty easily.
+  Dontrell: Wow...that's amazing! And a little bit scary too. But it does explain why we've always been so close - even though we've only known each other for a few months.
+  Cornell: Yeah, I guess our connection is pretty strong. But don't worry, Dontrell - I'll never use it to manipulate or control you in any way. I promise."
+  Dontrell: That's a relief! I was starting to feel a little bit like you could read my mind, and it was kind of creepy. But if you promise not to use your abilities to control me, then I'm okay with it.
+  Cornell: I swear, Dontrell. I'll never use my abilities to control or manipulate you in any way. You have my word.
+User: And a title?
+Assistant: Sure, a title for this dialogue could be:
+  Cornell is intuitive. Cornell reads Dontrell's mind.
+User: How about a short description?
+Assistant: Sure, a short description for this dialogue could be:
+  Cornell knows what Dontrell is thinking. He can sense her feelings and know exactly what she is going to say or do next.
+User: What would be one theme of this story?
+Assistant: One theme of this story could be:
+  being intuitive
+```
+
+## Usage
+
+The dataset contains one configuration,
+`soda_synthetic_dialogue_dialogue_modeling`, which has a single string feature,
+`conversation`.
+
+## Source data
+
+The script to synthesize this dataset was originally created by
+[ontocord](https://github.com/ontocord) in
+[this Colab notebook](https://colab.research.google.com/drive/1Sw3px5dP8whdqT7QMNoqwmqIasZkMbJi?usp=sharing)
+and prepared for Hugging Face by
+[Jeffrey Quesnelle](https://github.com/jquesnelle/).
+
+## Citation
+
+Please cite our work if you find the resources in this repository useful:
+
+```
+@article{ontocord2023sodasynth,
+  author    = {ontocord and Jeffrey Quesnelle},
+  title     = {SODA Synthetic Dialogue},
+  year      = {2023}
+}
+```
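Note on the dataset card above: its Usage section describes each record as a single `conversation` string. For readers who need structured turns, the sketch below shows one way to split that string back into (speaker, text) pairs. It assumes the layout produced by the templates in this PR (turn lines start with `User: ` or `Assistant: `, and continuation lines are indented by two spaces). `split_turns` and `SPEAKERS` are hypothetical names for illustration and are not part of the PR.

```python
import json
from typing import List, Tuple

# Turn prefixes used by the templates in this PR (assumed to be the only two).
SPEAKERS = ("User", "Assistant")


def split_turns(conversation: str) -> List[Tuple[str, str]]:
    """Split a conversation string into (speaker, text) turns.

    A line starting with "User: " or "Assistant: " opens a new turn.
    Any other non-empty line is appended to the current turn.
    """
    turns: List[Tuple[str, str]] = []
    for line in conversation.splitlines():
        speaker, sep, rest = line.partition(": ")
        if sep and speaker in SPEAKERS:
            turns.append((speaker, rest))
        elif turns and line.strip():
            prev_speaker, prev_text = turns[-1]
            turns[-1] = (prev_speaker, prev_text + "\n" + line.strip())
    return turns


# One JSONL line in the shape the preparation script writes.
sample_line = (
    '{"conversation": "User: And a title?\\n'
    'Assistant: Sure, a title for this dialogue could be:\\n'
    '  Cornell is intuitive."}'
)
turns = split_turns(json.loads(sample_line)["conversation"])
for speaker, text in turns:
    print(f"[{speaker}] {text}")
```

Indented dialogue lines such as `  Cornell: ...` are not mistaken for turns because their prefix includes the leading spaces and so never matches an entry in `SPEAKERS`.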
@@ -0,0 +1,21 @@
+from dataclasses import dataclass
+
+import datasets
+
+
+@dataclass
+class OpenAssistantConfig(datasets.BuilderConfig):
+    """BuilderConfig for OpenAssistant datasets."""
+
+    name: str = None
+    version: datasets.Version = None
+    description: str = None
+    schema: str = None
+    subset_id: str = None
+
+
+features = datasets.Features(
+    {
+        "conversation": datasets.Value("string"),
+    }
+)
@@ -0,0 +1,246 @@
+"""Prepare the SODA Synthetic Dialogue Dataset"""
+
+import json
+import os
+import random
+import sys
+
+from datasets import load_dataset
+from tqdm import tqdm
+
+# adapted from https://colab.research.google.com/drive/1Sw3px5dP8whdqT7QMNoqwmqIasZkMbJi?usp=sharing
+
+SUMMARY_TEMPLATE = """User: Can you give me a short story description for this dialogue?
+  {dialogue}
+Assistant: Sure, a short story description for this dialogue could be:
+  {story}
+User: And a title?
+Assistant: Sure, a title for this dialogue could be:
+  {title}"""
+
+THEME_TEMPLATE = """
+User: What would be one theme of this story?
+Assistant: One theme of this story could be:
+  {theme}"""
+
+NEW_DIALOGUE_TEMPLATE = """User: Can you write a short dialogue based on this story:
+  {story}
+Assistant: Sure, a dialogue for this story could be:
+  {dialogue}
+User: And a title?
+Assistant: Sure, a title for this dialogue could be:
+  {title}"""
+
+NEXT_LINES_TEMPLATE = """User: Can you write the next few lines of dialogue for this scene:
+  {scene}
+Assistant: Sure, the next dialogue for this scene could be:
+  {dialogue}
+User: And a title?
+Assistant: Sure, a title for this dialogue could be:
+  {title}
+User: How about a short description?
+Assistant: Sure, a short description for this dialogue could be:
+  {story}"""
+
+NEW_STORY_AND_DIALOGUE_TEMPLATE = """User: Can you write a short story and dialogue about:
+  {title1}
+Assistant: Sure, a short story and dialogue about: "{title1}" could be:
+  {story}"""
+
+FULL_DIALOGUE_TEMPLATE = """{conversation}
+  {dialogue}"""
+
+MORE_DIALOGUE_TEMPLATE = """{conversation}
+  {dialogue1}
+User: Can you provide more dialogue assuming "{title2}"?
+Assistant: Sure, the next dialogue for this scene could be:
+  {dialogue2}"""
+
+NEXT_DIALOGUE_TEMPLATE = """{conversation}
+  {dialogue1}
+User: More please.
+Assistant: Sure, the next dialogue for this scene could be:
+  {dialogue2}"""
+
+NEW_STORY_AND_DIALOGUE_FROM_THEME_TEMPLATE = """User: Can you write a short story and dialogue based on the theme:
+  {theme}
+Assistant: Sure, a short story and dialogue based on the theme "{theme}" could be:
+  {story}
+  {dialogue}
+User: And a title?
+Assistant: Sure, a title for this dialogue could be:
+  {title}"""
+
+PRINT = len(sys.argv) > 1 and sys.argv[1] == "--print"
+
+
+def main(output_dir: str = "data"):
+    """Download and prepare the dataset for use."""
+
+    random.seed(42)
+    dataset = load_dataset("allenai/soda")
+    os.makedirs(output_dir, exist_ok=True)
+
+    for split in ["train", "test", "validation"]:
+        with open(f"{output_dir}/{split}.jsonl", "w", encoding="utf8") as output:
+
+            for i in tqdm(range(len(dataset[split])), desc=split):
+                dat = dataset[split][i]
+                title = dat["literal"]
+                story = dat["narrative"]
+
+                if dat["relation"] == "xWant":
+                    theme = "wanting " + dat["tail"]
+                elif dat["relation"] == "xNeed":
+                    theme = "needing " + dat["tail"]
+                elif not dat["tail"].startswith("to ") and not dat["tail"].startswith("and "):
+                    theme = "being " + dat["tail"]
+                elif dat["tail"].startswith("and "):
+                    theme = "people are " + dat["tail"].replace("and PersonY ", "")
+                else:
+                    theme = dat["tail"]
+                theme = theme.replace("PersonY", "another person")
+                theme = theme.replace("being is", "being")
+
+                dialogue = [s2 + ": " + s1 for s1, s2 in zip(dat["dialogue"], dat["speakers"])]
+
+                if random.randint(0, 6) == 0:
+                    # print("##")
+                    # print(f"User: Can you give me a short story description for this dialog?")
+                    # print("  " + "\n  ".join(dialog))
+                    # print(f"Assistant: Sure, a short story description for this dialog could be: \n  {story}")
+                    # print("User: And a title?")
+                    # print(f"Assistant: Sure, a title for this dialog could be: \n  {title}")
+                    # if theme:
+                    #     print("User: What would be one theme of this story?")
+                    #     print(f'Assistant: One theme of this story could be: "{theme}"')
+                    conversation = SUMMARY_TEMPLATE.format(dialogue="\n  ".join(dialogue), story=story, title=title)
+                    if theme:
+                        conversation = conversation + THEME_TEMPLATE.format(theme=theme)
+                elif random.randint(0, 6) == 0:
+                    # print("##")
+                    # print(f"User: Can you write a short dialog based on this story:\n  {story}")
+                    # print(f"Assistant: Sure, a dialog for this story could be:")
+                    # print("  " + "\n  ".join(dialog))
+                    # print("User: And a title?")
+                    # print(f"Assistant: Sure, a title for this dialog could be: \n  {title}")
+                    # if theme:
+                    #     print("User: What would be one theme of this story?")
+                    #     print(f'Assistant: One theme of this story could be: "{theme}"')
+                    conversation = NEW_DIALOGUE_TEMPLATE.format(
+                        story=story, dialogue="\n  ".join(dialogue), title=title
+                    )
+                    if theme:
+                        conversation = conversation + THEME_TEMPLATE.format(theme=theme)
+                elif random.randint(0, 3) == 0:
+                    # print("##")
+                    # print(f"User: Can you write the next few lines of dialog for this scene:")
+                    # if random.randint(0, 1) == 0:
+                    #     print("  " + "\n  ".join(dialog[:-5]))
+                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
+                    #     print("  " + "\n  ".join(dialog[-5:]))
+                    # elif random.randint(0, 1) == 0:
+                    #     print("  " + "\n  ".join(dialog[:-3]))
+                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
+                    #     print("  " + "\n  ".join(dialog[-3:]))
+                    # else:
+                    #     print("  " + "\n  ".join(dialog[:-4]))
+                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
+                    #     print("  " + "\n  ".join(dialog[-4:]))
+                    # print("User: And a title?")
+                    # print(f"Assistant: Sure, a title for this dialog could be: \n  {title}")
+                    # print("User: How about a short description?")
+                    # print(f"Assistant: Sure, a short description for this dialog could be: \n  {story}")
+                    # if theme:
+                    #     print("User: What would be one theme of this story?")
+                    #     print(f'Assistant: One theme of this story could be: "{theme}"')
+                    if random.randint(0, 1) == 0:
+                        depth = -5
+                    elif random.randint(0, 1) == 0:
+                        depth = -3
+                    else:
+                        depth = -4
+                    conversation = NEXT_LINES_TEMPLATE.format(
+                        scene="\n  ".join(dialogue[:depth]),
+                        dialogue="\n  ".join(dialogue[depth:]),
+                        title=title,
+                        story=story,
+                    )
+                    if theme:
+                        conversation = conversation + THEME_TEMPLATE.format(theme=theme)
+                elif random.randint(0, 3) == 0:
+                    # print("##")
+                    # title1 = title.split(".")[0]
+                    # title2 = title.split(".")[1]
+                    # print(f"User: Can you write short story and dialog about: {title1}")
+                    # print(f'Assistant: Sure, a short story and dialog about: "{title1}" could be:')
+                    # print(f"  {story}")
+                    # if random.randint(0, 1) == 0:
+                    #     print("  " + "\n  ".join(dialog))
+                    # elif random.randint(0, 1) == 0 and len(dialog) > 5:
+                    #     print("  " + "\n  ".join(dialog[:-5]))
+                    #     print(f'User: Can you provide more dialog assuming "{title2}"?')
+                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
+                    #     print("  " + "\n  ".join(dialog[-5:]))
+                    # elif random.randint(0, 1) == 0:
+                    #     print("  " + "\n  ".join(dialog[:-3]))
+                    #     print("User: more please.")
+                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
+                    #     print("  " + "\n  ".join(dialog[-3:]))
+                    # else:
+                    #     print("  " + "\n  ".join(dialog[:-4]))
+                    #     print(f'User: Can you provide more dialog assuming "{title2}"?')
+                    #     print(f"Assistant: Sure, the next dialog for this scene could be:")
+                    #     print("  " + "\n  ".join(dialog[-4:]))
+                    # if theme:
+                    #     print("User: What would be one theme of this story?")
+                    #     print(f'Assistant: One theme of this story could be: "{theme}"')
+                    title1 = title.split(".")[0]
+                    title2 = title.split(".")[1]
+                    conversation = NEW_STORY_AND_DIALOGUE_TEMPLATE.format(title1=title1, story=story)
+                    if random.randint(0, 1) == 0:
+                        conversation = FULL_DIALOGUE_TEMPLATE.format(
+                            conversation=conversation, dialogue="\n  ".join(dialogue)
+                        )
+                    elif random.randint(0, 1) == 0 and len(dialogue) > 5:
+                        conversation = MORE_DIALOGUE_TEMPLATE.format(
+                            conversation=conversation,
+                            dialogue1="\n  ".join(dialogue[:-5]),
+                            title2=title2,
+                            dialogue2="\n  ".join(dialogue[-5:]),
+                        )
+                    elif random.randint(0, 1) == 0:
+                        conversation = NEXT_DIALOGUE_TEMPLATE.format(
+                            conversation=conversation,
+                            dialogue1="\n  ".join(dialogue[:-3]),
+                            dialogue2="\n  ".join(dialogue[-3:]),
+                        )
+                    else:
+                        conversation = MORE_DIALOGUE_TEMPLATE.format(
+                            conversation=conversation,
+                            dialogue1="\n  ".join(dialogue[:-4]),
+                            title2=title2,
+                            dialogue2="\n  ".join(dialogue[-4:]),
+                        )
+                    if theme:
+                        conversation = conversation + THEME_TEMPLATE.format(theme=theme)
+                else:
+                    # print("##")
+                    # print(f"User: Can you write short story and dialog based on the theme:\n  {theme}")
+                    # print(f'Assistant: Sure, a short story and dialog based on the theme "{theme}" could be:')
+                    # print(f"  {story}")
+                    # print("  " + "\n  ".join(dialog))
+                    # print("User: And a title?")
+                    # print(f"Assistant: Sure, a title for this dialog could be: \n  {title}")
+                    conversation = NEW_STORY_AND_DIALOGUE_FROM_THEME_TEMPLATE.format(
+                        theme=theme, story=story, dialogue="\n  ".join(dialogue), title=title
+                    )
+                if PRINT:
+                    print("##")
+                    print(conversation)
+
+                output.write(f"{json.dumps({'conversation': conversation})}\n")
+
+
+if __name__ == "__main__":
+    sys.exit(main())
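Note on the preparation script above: the template for each record is chosen by a chain of `if`/`elif` checks, each with its own `random.randint` draw, so a later branch is only reached when every earlier check has failed. The sketch below works out the resulting template mix with exact fractions. It assumes the draws are independent and uniform (`random.randint(a, b)` includes both endpoints) and only models the top-level branches, not the nested choices of split depth. `BRANCHES` and `branch_probabilities` are hypothetical names for illustration.

```python
from fractions import Fraction

# (template name, chance that this branch's own check passes once it is reached)
BRANCHES = [
    ("SUMMARY_TEMPLATE", Fraction(1, 7)),  # randint(0, 6) == 0
    ("NEW_DIALOGUE_TEMPLATE", Fraction(1, 7)),  # randint(0, 6) == 0
    ("NEXT_LINES_TEMPLATE", Fraction(1, 4)),  # randint(0, 3) == 0
    ("NEW_STORY_AND_DIALOGUE_TEMPLATE", Fraction(1, 4)),  # randint(0, 3) == 0
    ("NEW_STORY_AND_DIALOGUE_FROM_THEME_TEMPLATE", Fraction(1)),  # final else
]


def branch_probabilities(branches):
    """Return the overall probability of each branch in an if/elif chain."""
    remaining = Fraction(1)  # probability that no earlier branch was taken
    result = {}
    for name, passes in branches:
        result[name] = remaining * passes
        remaining *= 1 - passes
    return result


probs = branch_probabilities(BRANCHES)
for name, p in probs.items():
    print(f"{name}: {p} (~{float(p):.1%})")
```

Under these assumptions the theme-based template is the most common, at 81/196 (about 41%) of records, while the summary template accounts for 1/7.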
@@ -0,0 +1,108 @@
+# Copyright 2023 The OpenAssistant Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+This dataset is a set of dialogues synthesized from the SODA dataset.
+In each dialogue, User and Assistant have a conversation about a story.
+
+The original Colab notebook for this dataset can be found at:
+https://colab.research.google.com/drive/1Sw3px5dP8whdqT7QMNoqwmqIasZkMbJi?usp=sharing
+"""
+
+import json
+from typing import Dict, List, Tuple
+
+import datasets
+
+from .hub import OpenAssistantConfig, features
+
+_CITATION = """\
+@article{ontocord2023sodasynth,
+  author    = {ontocord and Jeffrey Quesnelle},
+  title     = {SODA Synthetic Dialogue},
+  year      = {2023}
+}
+"""
+_DATASETNAME = "soda_synthetic_dialogue"
+_DISPLAYNAME = "🥤SODA Synthetic Dialogue"
+_DESCRIPTION = "A set of dialogues synthesized from the SODA dataset."
+_HOMEPAGE = ""
+_LICENSE = "mit"
+_URLS = {
+    _DATASETNAME: {"train": "./data/train.jsonl", "test": "./data/test.jsonl", "validation": "./data/validation.jsonl"}
+}
+_SUPPORTED_TASKS = ["dialogue-modeling"]
+_VERSION = "1.0.0"
+
+
+class SODASyntheticDialogueDataset(datasets.GeneratorBasedBuilder):
+    """A set of dialogues synthesized from the SODA dataset."""
+
+    VERSION = datasets.Version(_VERSION)
+
+    BUILDER_CONFIGS = [
+        OpenAssistantConfig(
+            name=f"{_DATASETNAME}_dialogue_modeling",
+            version=VERSION,
+            description=f"OpenAssistant dataset config for {_DATASETNAME}",
+            schema="dialogue_modeling",
+            subset_id=_DATASETNAME,
+        )
+    ]
+
+    DEFAULT_CONFIG_NAME = f"{_DATASETNAME}_dialogue_modeling"
+
+    def _info(self) -> datasets.DatasetInfo:
+
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=features,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager) -> List[datasets.SplitGenerator]:
+        """Returns SplitGenerators."""
+
+        urls = _URLS[_DATASETNAME]
+        data_dir = dl_manager.download_and_extract(urls)
+
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={"filepath": data_dir, "split": "train"},
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                gen_kwargs={"filepath": data_dir, "split": "test"},
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                gen_kwargs={"filepath": data_dir, "split": "validation"},
+            ),
+        ]
+
+    def _generate_examples(self, filepath, split: str) -> Tuple[int, Dict]:
+        """Yields examples as (key, example) tuples."""
+
+        if self.config.schema == "dialogue_modeling":
+            with open(filepath[split], "r", encoding="utf8") as data:
+                # One JSON object per line; the line index serves as the example key.
+                for key, line in enumerate(data):
+                    yield key, json.loads(line)
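Note on the loading script above: it expects every line of `train.jsonl`, `test.jsonl`, and `validation.jsonl` to be a JSON object with exactly one string field, `conversation`, matching the `features` declared in `hub.py`. The sketch below is one way to sanity-check a prepared file against that shape and count its records, for example to compare against the `num_examples` values in the dataset card. `count_valid_records` is a hypothetical helper, and the temporary file stands in for real prepared data.

```python
import json
import os
import tempfile


def count_valid_records(path: str) -> int:
    """Count JSONL records, raising if one lacks the expected shape."""
    count = 0
    with open(path, "r", encoding="utf8") as data:
        for line_number, line in enumerate(data, start=1):
            record = json.loads(line)
            has_expected_keys = set(record) == {"conversation"}
            if not has_expected_keys or not isinstance(record["conversation"], str):
                raise ValueError(f"{path}:{line_number}: unexpected record shape")
            count += 1
    return count


# Write two records the same way the preparation script does, then check them.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    with open(path, "w", encoding="utf8") as out:
        for text in ("User: And a title?\nAssistant: Sure.", "User: More please."):
            out.write(f"{json.dumps({'conversation': text})}\n")
    n_records = count_valid_records(path)

print(n_records)
```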