StackExchange Builder update

Included markdown file and the initial notebook for ingesting and processing StackExchange data dumps
b-mc2
2023-01-03 21:36:05 -05:00
committed by GitHub
parent d6efdd6b49
commit 65524a2140
2 changed files with 1931 additions and 0 deletions
@@ -0,0 +1,98 @@
# StackExchange Builder
StackExchange Builder is a notebook that downloads StackExchange data dumps
and converts them into different formats. It parses the XML files, groups questions with their answers, optionally filters the dataset, and puts the results into the Open-Assistant Data Scheme.
Files can be saved as JSON, JSONL, Parquet, or CSV.
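The parse-and-group step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes the usual `Posts.xml` layout of the data dumps (one `<row>` element per post, `PostTypeId` of `1` for questions and `2` for answers, answers pointing at their question through `ParentId`), and the names `SAMPLE_XML` and `group_posts` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny stand-in for Posts.xml (made-up content, assumed attribute layout).
SAMPLE_XML = """<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="2" Score="5" Title="What is AI?" Body="Question body" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="3" Body="Accepted answer" />
  <row Id="3" PostTypeId="2" ParentId="1" Score="1" Body="Another answer" />
</posts>"""


def group_posts(xml_text):
    """Return one record per question, with its answers attached."""
    questions = {}
    answers = defaultdict(list)
    for row in ET.fromstring(xml_text).iter("row"):
        post = row.attrib
        if post["PostTypeId"] == "1":    # question
            questions[post["Id"]] = post
        elif post["PostTypeId"] == "2":  # answer, keyed by its question's Id
            answers[post["ParentId"]].append(post)

    records = []
    for qid, question in questions.items():
        accepted_id = question.get("AcceptedAnswerId")
        records.append({
            "Title": question.get("Title", ""),
            "Question": question.get("Body", ""),
            "QuestionScore": int(question.get("Score", 0)),
            "Answers": [
                {
                    "Answer": a.get("Body", ""),
                    "AnswerScore": int(a.get("Score", 0)),
                    "AcceptedAnswerFlag": a["Id"] == accepted_id,
                }
                for a in answers[qid]
            ],
        })
    return records


records = group_posts(SAMPLE_XML)
```

For full-size dumps, a streaming parser such as `ET.iterparse` would likely be a better fit than loading the whole file into memory.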
---
#### Sample Data Open-Assistant Data Scheme:
Open-Assistant Data Scheme as outlined here: https://github.com/LAION-AI/Open-Assistant/blob/main/docs/data_schemas.md
```
{
  "root": {
    "text": "Science Fiction has frequently shown AI to be a threat to the very existence of mankind. AI systems have often been the antagonists...",
    "role": "prompter",
    "children": [
      {
        "text": "Nothing. \nIts in almost everyone's favor for it to stay that way financially. Having non-technical individuals associate AI with terminators...",
        "role": "assistant",
        "children": [],
        "metadata": {
          "AnswerScore": 2.0,
          "AcceptedAnswerFlag": true
        }
      }
    ],
    "metadata": {
      "QuestionScore": 5,
      "QuestionTags": "social, artificial consciousness"
    }
  },
  "metadata": {
    "Title": "\"AI will kill us all! The machines will rise up!\" - what is being done to dispel such myths?",
    "QuestionContentLicense": "CC BY-SA 4.0",
    "DataSource": "https://ia600107.us.archive.org/view_archive.php?archive=/27/items/stackexchange/ai.stackexchange.com.7z&file=Posts.xml",
    "CreationDate": "2019-10-16T13:57:37.143"
  }
}
```
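As a rough sketch, a flat question record could be nested into this tree shape as shown here. The function name `to_oa_tree` and the input record are hypothetical; only the output keys are taken from the sample above, and the notebook itself may build the structure differently.

```python
def to_oa_tree(record):
    """Nest one flat question record into the prompter/assistant tree."""
    children = [
        {
            "text": answer["Answer"],
            "role": "assistant",
            "children": [],
            "metadata": {
                "AnswerScore": answer["AnswerScore"],
                "AcceptedAnswerFlag": answer["AcceptedAnswerFlag"],
            },
        }
        for answer in record["Answers"]
    ]
    return {
        "root": {
            "text": record["Question"],
            "role": "prompter",
            "children": children,
            "metadata": {
                "QuestionScore": record["QuestionScore"],
                "QuestionTags": record["QuestionTags"],
            },
        },
        "metadata": {
            "Title": record["Title"],
            "QuestionContentLicense": record["QuestionContentLicense"],
            "DataSource": record["DataSource"],
            "CreationDate": record["CreationDate"],
        },
    }


# Made-up record used only to exercise the function.
example = {
    "Title": "Example title",
    "Question": "Example question body",
    "QuestionScore": 5,
    "QuestionTags": "example",
    "QuestionContentLicense": "CC BY-SA 4.0",
    "DataSource": "https://example.org/Posts.xml",  # placeholder URL
    "CreationDate": "2019-10-16T13:57:37.143",
    "Answers": [
        {"Answer": "Example answer", "AnswerScore": 2.0, "AcceptedAnswerFlag": True},
    ],
}

tree = to_oa_tree(example)
```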
---
#### JSONL format
Each question and all of its answers are stored on a single line in the JSONL output. The record below is expanded across multiple lines for readability.
```
{
  "Title": "1 hidden layer with 1000 neurons vs. 10 hidden layers with 100 neurons",
  "Question": "These types of questions may be problem-dependent...",
  "QuestionScore": 16,
  "QuestionTags": "neural networks",
  "QuestionContentLicense": "CC BY-SA 3.0",
  "DataSource": "https://ia600107.us.archive.org/view_archive.php?archive=/27/items/stackexchange/ai.stackexchange.com.7z&file=Posts.xml",
  "CreationDate": "2017-05-04T13:06:37.990",
  "Answers": [
    {
      "Answer": "Basically, having multiple layers (aka a deep network) makes your network more eager to recognize certain aspects of input data...",
      "AnswerScore": 13.0,
      "AcceptedAnswerFlag": true
    },
    {
      "Answer": "There are so many aspects.\n1. Training:\nTraining deep nets is a hard job due to the vanishing (rearly exploding) gradient problem...",
      "AnswerScore": 4.0,
      "AcceptedAnswerFlag": false
    },
    {
      "Answer": "If the problem you are solving is linearly separable, one layer of 1000 neurons can do better job...",
      "AnswerScore": 1.0,
      "AcceptedAnswerFlag": false
    },
    {
      "Answer": "\nI think you have a confusion in the basics of the neural networks.\n Every layer has a separate activation...",
      "AnswerScore": 0.0,
      "AcceptedAnswerFlag": false
    }
  ]
}
```
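A minimal sketch of writing such records as JSONL and reading them back, using only the standard library. The records and the file name `sample.jsonl` are made up for illustration, and the notebook may well use a different writer.

```python
import json
import os
import tempfile

# Made-up records; real ones carry the fields shown in the sample above.
records = [
    {"Title": "First question", "Question": "Line one\nLine two", "Answers": []},
    {
        "Title": "Second question",
        "Question": "Body",
        "Answers": [
            {"Answer": "Reply", "AnswerScore": 1.0, "AcceptedAnswerFlag": True},
        ],
    },
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")

# One JSON document per line; json.dumps escapes newlines inside strings,
# so each record stays on a single physical line.
with open(path, "w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

# Reading back is a line-by-line json.loads.
with open(path, encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh if line.strip()]
```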
#### Table/CSV/Parquet Format
The table format keeps many more columns than the other formats. The `_q` and `_a` suffixes indicate whether a column came from the question table or the answer table; they are left over from the join.
```
| Id_q | Question | ParentId_a | AcceptedAnswerId | Id_a | Answer | AnswerScore | AcceptedAnswerFlag |
|------:|--------------------------------------------------:|-----------:|-----------------:|--------:|--------------------------------------------------:|------------:|-------------------:|
| 15730 | As a human being, we can think infinity. In pr... | 15730.0 | 15744 | 15744.0 | I think this is a fairly common misconception ... | 62.0 | True |
| 15730 | As a human being, we can think infinity. In pr... | 15730.0 | 15744 | 15753.0 | I think your premise is flawed.\nYou seem to a... | 19.0 | False |
| 15730 | As a human being, we can think infinity. In pr... | 15730.0 | 15744 | 15747.0 | TL;DR: The subtleties of infinity are made app... | 12.0 | False |
| 15730 | As a human being, we can think infinity. In pr... | 15730.0 | 15744 | 15756.0 | In Haskell, you can type:\nprint [1..]\nand it... | 9.0 | False |
```
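The suffixes can be reproduced with a pandas merge along these lines. This is a sketch under assumptions: the notebook is assumed to use pandas, the join keys and the `how="left"` choice are guesses based on the table above, and the frames hold made-up text with only the Ids and scores borrowed from the sample rows.

```python
import pandas as pd

questions = pd.DataFrame({
    "Id": [15730, 15731],
    "Body": ["Question with answers", "Question without answers"],
    "AcceptedAnswerId": [15744, None],
})
answers = pd.DataFrame({
    "Id": [15744, 15753],
    "ParentId": [15730, 15730],
    "Body": ["Accepted answer", "Other answer"],
    "Score": [62, 19],
})

# Columns present in both frames (Id, Body) receive the _q / _a suffixes.
merged = questions.merge(
    answers,
    how="left",
    left_on="Id",
    right_on="ParentId",
    suffixes=("_q", "_a"),
)

# An answer is flagged as accepted when its Id matches the question's AcceptedAnswerId.
merged["AcceptedAnswerFlag"] = merged["Id_a"] == merged["AcceptedAnswerId"]
```

A left join introduces missing values for questions without answers, which turns integer columns into floats. That is a likely reason values such as `15744.0` appear in the table above.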
---
## Contributing
Feel free to contribute to this notebook. It's not perfect and additional functionality is planned.