Changes from pre-commit

b-mc2
2023-01-05 14:55:26 -05:00
parent 65524a2140
commit ee370e85a4
2 changed files with 1855 additions and 1835 deletions
File diff suppressed because one or more lines are too long
@@ -1,13 +1,17 @@
# StackExchange Builder
StackExchange Builder is a notebook that downloads StackExchange data dumps and
converts them into different formats. It parses the XML files, groups questions
with their answers, optionally filters the dataset, and writes the results in
the Open-Assistant Data Scheme. Files can be saved as JSON, JSONL, Parquet, or
CSV.
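
As a rough illustration of the parse-and-group step, here is a minimal sketch
using only the Python standard library. It assumes the usual `Posts.xml` layout
of the data dumps, where each `<row>` carries a `PostTypeId` (`1` for
questions, `2` for answers) and answers point at their question through
`ParentId`. The sample XML, the `group_posts` helper, and the output keys are
hypothetical and are not taken from the notebook.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature stand-in for a Posts.xml file from a data dump.
SAMPLE_POSTS_XML = """<posts>
  <row Id="1" PostTypeId="1" Title="How do I exit Vim?" Body="I am stuck." AcceptedAnswerId="2" Score="10" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="Type :q and press Enter." Score="25" />
  <row Id="3" PostTypeId="2" ParentId="1" Body="Use ZZ to save and quit." Score="7" />
</posts>"""


def group_posts(xml_text):
    """Return one record per question with the bodies of its answers."""
    questions = {}
    answers = defaultdict(list)
    for row in ET.fromstring(xml_text).iter("row"):
        post = row.attrib
        if post.get("PostTypeId") == "1":    # question
            questions[post["Id"]] = post
        elif post.get("PostTypeId") == "2":  # answer
            answers[post["ParentId"]].append(post)
    # The output keys are illustrative only, not the Open-Assistant scheme.
    return [
        {
            "question": question["Title"],
            "answers": [a["Body"] for a in answers.get(question_id, [])],
        }
        for question_id, question in questions.items()
    ]


grouped = group_posts(SAMPLE_POSTS_XML)
```

For a real dump, the same loop could read from a file with `ET.iterparse`
instead of an in-memory string, which avoids loading a large `Posts.xml` at
once.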
---
#### Sample Data Open-Assistant Data Scheme:
Open-Assistant Data Scheme as outlined here:
https://github.com/LAION-AI/Open-Assistant/blob/main/docs/data_schemas.md
```
{
@@ -40,7 +44,9 @@ Open-Assistant Data Scheme as outlined here: https://github.com/LAION-AI/Open-As
```
---
#### JSONL format
Each question and all related answers are on a single line in JSONL format.
```
@@ -77,10 +83,11 @@ Each question and all related answers are on a single line in JSONL format.
}
```
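
A file in the JSONL layout described above can be written and read one line at
a time. The sketch below assumes nothing about the notebook's own code: the
record keys and the `write_jsonl` and `read_jsonl` helpers are hypothetical,
and an in-memory buffer stands in for a file on disk.

```python
import io
import json

# Hypothetical records: one question with all of its answers per entry.
records = [
    {"question": "How do I exit Vim?", "answers": ["Type :q.", "Use ZZ."]},
    {"question": "What is JSONL?", "answers": ["One JSON value\nper line."]},
]


def write_jsonl(rows, handle):
    """Write each record as one JSON document followed by a newline."""
    for row in rows:
        # json.dumps escapes newlines inside strings, so a record never
        # spills onto a second line.
        handle.write(json.dumps(row) + "\n")


def read_jsonl(handle):
    """Parse every non-empty line as an independent JSON document."""
    return [json.loads(line) for line in handle if line.strip()]


buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
round_tripped = read_jsonl(buffer)
```

Because every line is a complete JSON document, a large file can be processed
record by record without loading it all into memory.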
#### Table/CSV/Parquet Format
The table format keeps many more columns. The `_q` and `_a` suffixes indicate
whether a column came from the question table or the answer table; they are
left over from the join.
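
One way such suffixes can arise is a pandas merge with `suffixes=("_q", "_a")`.
The sketch below assumes pandas is used for the join, which the text above does
not confirm; the two small tables and their columns are hypothetical.

```python
import pandas as pd

# Hypothetical miniature question and answer tables; the notebook's real
# columns may differ.
questions = pd.DataFrame(
    {
        "Id": [1, 2],
        "Body": ["How do I exit Vim?", "What is JSONL?"],
        "AcceptedAnswerId": [11, 21],
    }
)
answers = pd.DataFrame(
    {
        "Id": [11, 12, 21],
        "Body": ["Type :q.", "Use ZZ.", "One JSON value per line."],
        "ParentId": [1, 1, 2],
        "Score": [25, 7, 3],
    }
)

# Join each answer to its question. Column names present in both tables
# (Id, Body) receive the _q / _a suffixes; the others keep their names.
merged = questions.merge(
    answers, left_on="Id", right_on="ParentId", suffixes=("_q", "_a")
)
```

Only column names that occur in both tables are suffixed, which is consistent
with unsuffixed columns such as `AcceptedAnswerId` in the table.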
```
| Id_q | Question | ParentId_a | AcceptedAnswerId | Id_a | Answer | AnswerScore | AcceptedAnswerFlag |
@@ -95,4 +102,5 @@ There are a lot more columns left over in the table format. `_q` and `_a` are su
## Contributing
Feel free to contribute to this notebook. It's not perfect, and additional
functionality is planned.