StackExchange Builder update

Included markdown file and the initial notebook for ingesting and processing StackExchange data dumps
2026-06-27 16:10:30 +08:00 · 2023-01-03 21:36:05 -05:00
parent d6efdd6b49
commit 65524a2140
2 changed files with 1931 additions and 0 deletions
@@ -0,0 +1,98 @@
+# StackExchange Builder
+
+StackExchange Builder is a notebook that downloads data from StackExchange data dumps 
+and converts it into different formats. It will parse the XML files, group questions and answers, can filter the dataset and puts the results into the Open-Assistant Data Scheme.
+Files can be saved to either JSON, JSONL, Parquet, or CSV.
+
+---
+#### Sample Data Open-Assistant Data Scheme:
+
+Open-Assistant Data Scheme as outlined here: https://github.com/LAION-AI/Open-Assistant/blob/main/docs/data_schemas.md
+
+```
+{
+    "root": {
+        "text": "Science Fiction has frequently shown AI to be a threat to the very existence of mankind. AI systems have often been the antagonists...",
+        "role": "prompter",
+        "children": [
+            {
+                "text": "Nothing.  \nIts in almost everyone's favor for it to stay that way financially. Having non-technical individuals associate AI with terminators...",
+                "role": "assistant",
+                "children": [],
+                "metadata": {
+                    "AnswerScore": 2.0,
+                    "AcceptedAnswerFlag": true
+                }
+            }
+        ],
+        "metadata": {
+            "QuestionScore": 5,
+            "QuestionTags": "social, artificial consciousness"
+        }
+    },
+    "metadata": {
+        "Title": "\"AI will kill us all! The machines will rise up!\" - what is being done to dispel such myths?",
+        "QuestionContentLicense": "CC BY-SA 4.0",
+        "DataSource": "https://ia600107.us.archive.org/view_archive.php?archive=/27/items/stackexchange/ai.stackexchange.com.7z&file=Posts.xml",
+        "CreationDate": "2019-10-16T13:57:37.143"
+    }
+}
+```
+
+---
+#### JSONL format
+Each question and all related answers are on a single line in JSONL format.
+
+```
+{
+    "Title": "1 hidden layer with 1000 neurons vs. 10 hidden layers with 100 neurons",
+    "Question": "These types of questions may be problem-dependent...",
+    "QuestionScore": 16,
+    "QuestionTags": "neural networks",
+    "QuestionContentLicense": "CC BY-SA 3.0",
+    "DataSource": "https://ia600107.us.archive.org/view_archive.php?archive=/27/items/stackexchange/ai.stackexchange.com.7z&file=Posts.xml",
+    "CreationDate": "2017-05-04T13:06:37.990",
+    "Answers": [
+        {
+            "Answer": "Basically, having multiple layers (aka a deep network) makes your network more eager to recognize certain aspects of input data...",
+            "AnswerScore": 13.0,
+            "AcceptedAnswerFlag": true
+        },
+        {
+            "Answer": "There are so many aspects.\n1. Training:\nTraining deep nets is a hard job due to the vanishing (rearly exploding) gradient problem...",
+            "AnswerScore": 4.0,
+            "AcceptedAnswerFlag": false
+        },
+        {
+            "Answer": "If the problem you are solving is linearly separable, one layer of 1000 neurons can do better job...",
+            "AnswerScore": 1.0,
+            "AcceptedAnswerFlag": false
+        },
+        {
+            "Answer": "\nI think you have a confusion in the basics of the neural networks.\n  Every layer has a separate activation...",
+            "AnswerScore": 0.0,
+            "AcceptedAnswerFlag": false
+        }
+    ]
+}
+```
+
+
+#### Table/CSV/Parquet Format
+
+There are a lot more columns left over in the table format. `_q` and `_a` are suffixes indiciating if the column came from the question or answer table as leftover from a join statement.
+
+```
+|  Id_q |                                          Question | ParentId_a | AcceptedAnswerId |    Id_a |                                            Answer | AnswerScore | AcceptedAnswerFlag |
+|------:|--------------------------------------------------:|-----------:|-----------------:|--------:|--------------------------------------------------:|------------:|-------------------:|
+| 15730 | As a human being, we can think infinity. In pr... |    15730.0 |            15744 | 15744.0 | I think this is a fairly common misconception ... |        62.0 |               True |
+| 15730 | As a human being, we can think infinity. In pr... |    15730.0 |            15744 | 15753.0 | I think your premise is flawed.\nYou seem to a... |        19.0 |              False |
+| 15730 | As a human being, we can think infinity. In pr... |    15730.0 |            15744 | 15747.0 | TL;DR: The subtleties of infinity are made app... |        12.0 |              False |
+| 15730 | As a human being, we can think infinity. In pr... |    15730.0 |            15744 | 15756.0 | In Haskell, you can type:\nprint [1..]\nand it... |         9.0 |              False |
+```
+
+---
+
+## Contributing
+
+Feel free to contribute to this notebook. It's not perfect and additional functionality is planned.