mirror of
https://github.com/wassname/Open-Assistant.git
synced 2026-06-27 16:10:30 +08:00
PR: Create notebook to convert r/changemyview data (#839)
* (#737) Create notebook to convert r/changemyview data into cleaner format
This commit is contained in:
@@ -0,0 +1,37 @@
|
||||
# README
|
||||
|
||||
## Introduction
|
||||
|
||||
This program converts data obtained from the subreddit r/changemyview into a cleaner format for further data processing. The data is not clean enough to be used directly in a model yet, and additional preprocessing is required.
|
||||
|
||||
## Data Format
|
||||
|
||||
The cleaned data is stored in an Apache Parquet file with the following columns:
|
||||
|
||||
| Column Name | Description | Data Type |
|
||||
|-------------|------------------------------------------------------------------------|----------------|
|
||||
| INSTRUCTION | Post title + body text | String |
|
||||
| RESPONSE | Body text of comments attempting to change OP's mind of `INSTRUCTION`. | List\<String\> |
|
||||
| SOURCE | Permalink to the reddit post | String |
|
||||
| METADATA | Metadata related to `RESPONSE`. | Dict\<Variant> |
|
||||
|
||||
### Metadata
|
||||
Currently, metadata is only broken into one category:
|
||||
- `detoxify_labels`- A Dictionary of values outputted by the [Unitaryai Detoxifier](https://github.com/unitaryai/detoxify) model, fitted to every comment under any given post.
|
||||
|
||||
## Usage
|
||||
|
||||
To use the program, follow these instructions:
|
||||
|
||||
1. **Clone the repository** - `git clone https://github.com/LAION-AI/Open-Assistant.git`
|
||||
2. **Navigate to the project directory** - `cd notebooks/data-augmentation/changemyview-builder`
|
||||
3. **Open the Jupyter Notebook** - `jupyter notebook data_processor.ipynb`
|
||||
4. **Run the program** - Go through the notebook and run the cells
|
||||
|
||||
## Contributing
|
||||
|
||||
If you would like to contribute to this project, please fork the repository and submit a pull request with your changes.
|
||||
|
||||
## License
|
||||
|
||||
This project is licensed under the Apache-2.0 License - see the [LICENSE](LICENSE) file for details.
|
||||
@@ -0,0 +1,577 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"# r/ChangeMyView data converter\n",
|
||||
"Converts subreddit data into readable format for ML training\n",
|
||||
"\n",
|
||||
"[](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/main/notebooks/data-augmentation/changemyview-builder/data_processor.ipynb)"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 65,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"### REMEMBER: setup the .env before running this code!\n",
|
||||
"\n",
|
||||
"\"\"\"CONSTANTS\"\"\"\n",
|
||||
"\n",
|
||||
"# Set the head number to the amount of entries you want to load in minus one\n",
|
||||
"ENTRIES_COUNT = 10\n",
|
||||
"\n",
|
||||
"# Set the threshold for toxic comments to be removed\n",
|
||||
"TOXIC_THRESHOLD = 0.95"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 66,
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Requirement already satisfied: pandas in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (1.4.4)\r\n",
|
||||
"Requirement already satisfied: python-dateutil>=2.8.1 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from pandas) (2.8.2)\r\n",
|
||||
"Requirement already satisfied: pytz>=2020.1 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from pandas) (2022.1)\r\n",
|
||||
"Requirement already satisfied: numpy>=1.18.5 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from pandas) (1.21.5)\r\n",
|
||||
"Requirement already satisfied: six>=1.5 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\r\n",
|
||||
"Requirement already satisfied: praw in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (7.6.1)\r\n",
|
||||
"Requirement already satisfied: websocket-client>=0.54.0 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from praw) (0.58.0)\r\n",
|
||||
"Requirement already satisfied: update-checker>=0.18 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from praw) (0.18.0)\r\n",
|
||||
"Requirement already satisfied: prawcore<3,>=2.1 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from praw) (2.3.0)\r\n",
|
||||
"Requirement already satisfied: requests<3.0,>=2.6.0 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from prawcore<3,>=2.1->praw) (2.28.1)\r\n",
|
||||
"Requirement already satisfied: six in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from websocket-client>=0.54.0->praw) (1.16.0)\r\n",
|
||||
"Requirement already satisfied: idna<4,>=2.5 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from requests<3.0,>=2.6.0->prawcore<3,>=2.1->praw) (3.3)\r\n",
|
||||
"Requirement already satisfied: certifi>=2017.4.17 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from requests<3.0,>=2.6.0->prawcore<3,>=2.1->praw) (2022.9.24)\r\n",
|
||||
"Requirement already satisfied: charset-normalizer<3,>=2 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from requests<3.0,>=2.6.0->prawcore<3,>=2.1->praw) (2.0.4)\r\n",
|
||||
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from requests<3.0,>=2.6.0->prawcore<3,>=2.1->praw) (1.26.11)\r\n",
|
||||
"Requirement already satisfied: python-dotenv in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (0.21.0)\r\n",
|
||||
"Requirement already satisfied: pyarrow in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (10.0.1)\r\n",
|
||||
"Requirement already satisfied: numpy>=1.16.6 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from pyarrow) (1.21.5)\r\n",
|
||||
"Requirement already satisfied: detoxify in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (0.5.1)\r\n",
|
||||
"Requirement already satisfied: transformers==4.22.1 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from detoxify) (4.22.1)\r\n",
|
||||
"Requirement already satisfied: torch>=1.7.0 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from detoxify) (1.13.1)\r\n",
|
||||
"Requirement already satisfied: sentencepiece>=0.1.94 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from detoxify) (0.1.97)\r\n",
|
||||
"Requirement already satisfied: huggingface-hub<1.0,>=0.9.0 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from transformers==4.22.1->detoxify) (0.11.1)\r\n",
|
||||
"Requirement already satisfied: regex!=2019.12.17 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from transformers==4.22.1->detoxify) (2022.7.9)\r\n",
|
||||
"Requirement already satisfied: pyyaml>=5.1 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from transformers==4.22.1->detoxify) (6.0)\r\n",
|
||||
"Requirement already satisfied: tqdm>=4.27 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from transformers==4.22.1->detoxify) (4.64.1)\r\n",
|
||||
"Requirement already satisfied: tokenizers!=0.11.3,<0.13,>=0.11.1 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from transformers==4.22.1->detoxify) (0.12.1)\r\n",
|
||||
"Requirement already satisfied: filelock in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from transformers==4.22.1->detoxify) (3.6.0)\r\n",
|
||||
"Requirement already satisfied: packaging>=20.0 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from transformers==4.22.1->detoxify) (21.3)\r\n",
|
||||
"Requirement already satisfied: numpy>=1.17 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from transformers==4.22.1->detoxify) (1.21.5)\r\n",
|
||||
"Requirement already satisfied: requests in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from transformers==4.22.1->detoxify) (2.28.1)\r\n",
|
||||
"Requirement already satisfied: typing-extensions in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from torch>=1.7.0->detoxify) (4.3.0)\r\n",
|
||||
"Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from packaging>=20.0->transformers==4.22.1->detoxify) (3.0.9)\r\n",
|
||||
"Requirement already satisfied: idna<4,>=2.5 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from requests->transformers==4.22.1->detoxify) (3.3)\r\n",
|
||||
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from requests->transformers==4.22.1->detoxify) (1.26.11)\r\n",
|
||||
"Requirement already satisfied: charset-normalizer<3,>=2 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from requests->transformers==4.22.1->detoxify) (2.0.4)\r\n",
|
||||
"Requirement already satisfied: certifi>=2017.4.17 in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (from requests->transformers==4.22.1->detoxify) (2022.9.24)\r\n",
|
||||
"Requirement already satisfied: tqdm in /Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages (4.64.1)\r\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Install any dependencies\n",
|
||||
"!pip install pandas\n",
|
||||
"!pip install praw\n",
|
||||
"!pip install python-dotenv\n",
|
||||
"!pip install pyarrow\n",
|
||||
"!pip install detoxify\n",
|
||||
"!pip install tqdm"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 67,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import praw\n",
|
||||
"import os\n",
|
||||
"from os.path import join, dirname\n",
|
||||
"from dotenv import main\n",
|
||||
"\n",
|
||||
"# Make sure you create a .env file and fill in all the necessary information in the same folder as this script!\n",
|
||||
"main.load_dotenv(join(dirname(os.path.realpath('__file__')), '.env'))\n",
|
||||
"\n",
|
||||
"reddit = praw.Reddit(\n",
|
||||
" client_id=os.environ.get(\"CLIENT_ID\"),\n",
|
||||
" client_secret=os.environ.get(\"CLIENT_SECRET\"),\n",
|
||||
" user_agent=\"CMV_Scraper\",\n",
|
||||
")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 68,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# load the data\n",
|
||||
"import tarfile\n",
|
||||
"import os.path\n",
|
||||
"import json\n",
|
||||
"import re\n",
|
||||
"from bz2 import BZ2File\n",
|
||||
"from urllib import request\n",
|
||||
"from io import BytesIO\n",
|
||||
"\n",
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"fname = \"cmv.tar.bz2\"\n",
|
||||
"url = \"https://chenhaot.com/data/cmv/\" + fname\n",
|
||||
"\n",
|
||||
"# download if not exists\n",
|
||||
"if not os.path.isfile(fname):\n",
|
||||
" f = BytesIO()\n",
|
||||
" with request.urlopen(url) as resp, open(fname, 'wb') as f_disk:\n",
|
||||
" data = resp.read()\n",
|
||||
" f_disk.write(data) # save to disk too\n",
|
||||
" f.write(data)\n",
|
||||
" f.seek(0)\n",
|
||||
"else:\n",
|
||||
" f = open(fname, 'rb')\n",
|
||||
"\n",
|
||||
"\n"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 69,
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/Users/kayjaymac/opt/anaconda3/lib/python3.9/bz2.py:124: ResourceWarning: unclosed file <_io.BufferedReader name='cmv.tar.bz2'>\n",
|
||||
" self._buffer = None\n",
|
||||
"ResourceWarning: Enable tracemalloc to get the object allocation traceback\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"#tar = tarfile.open(fileobj=f, mode=\"r:bz2\")\n",
|
||||
"tar = tarfile.open(fileobj=f, mode=\"r\")\n",
|
||||
"\n",
|
||||
"# Extract the file we are interested in\n",
|
||||
"\n",
|
||||
"train_fname = \"op_task/train_op_data.jsonlist.bz2\"\n",
|
||||
"test_fname = \"op_task/heldout_op_data.jsonlist.bz2\"\n",
|
||||
"\n",
|
||||
"train_bzlist = tar.extractfile(train_fname)"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 70,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Deserialize the JSON list\n",
|
||||
"original_posts_train = [\n",
|
||||
" json.loads(line.decode('utf-8'))\n",
|
||||
" for line in BZ2File(train_bzlist)\n",
|
||||
"]"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 71,
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": "[{'title': \"CMV: I shouldn't get a job in this economic climate because it'll be automated anyway; I should just wait for a post-scarcity utopia.\",\n 'delta_label': False,\n 'name': 't3_2rpsl8',\n 'selftext': \"I think the world is automating fast enough that a utopia will arise where no one will have to work anymore. Within the next 2 decades or so, having a job won't mean much, and most people will be artists and scientists. \\n\\nMy parents let me live with them, so I can just wait until the utopia happens.\\n\\nCMV.\"}]"
|
||||
},
|
||||
"execution_count": 71,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"original_posts_train[:1]"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 72,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Load the jsonlist file into a dataframe\n",
|
||||
"#df = pd.read_json(original_posts_train, orient='list', lines=True)\n",
|
||||
"df = pd.DataFrame(original_posts_train)"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 73,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Function to check if the posts still exists on reddit\n",
|
||||
"def try_get_post(post_id):\n",
|
||||
" try:\n",
|
||||
" submission = reddit.submission(id=post_id)\n",
|
||||
" submission.name\n",
|
||||
" return True\n",
|
||||
" except Exception as e:\n",
|
||||
" return False"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 74,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Set up the detoxifier model:\n",
|
||||
"from detoxify import Detoxify"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 75,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import re\n",
|
||||
"\n",
|
||||
"# Removes > sign and the template message at the end of a message\n",
|
||||
"def cleanup_body_text(cmv_post):\n",
|
||||
" lines = [line for line in cmv_post.splitlines()\n",
|
||||
" if not line.lstrip().startswith(\">\")\n",
|
||||
" and not line.lstrip().startswith(\"____\")\n",
|
||||
" and not line.lstrip().startswith(\"So go forth and CMV, noble redditors!\")\n",
|
||||
" and \"edit\" not in \" \".join(line.lower().split()[:2])\n",
|
||||
" ]\n",
|
||||
" return \"\\n\".join(lines)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Create the function that will be handling all the data gathering\n",
|
||||
"def get_top_comment_and_clean_data(post_id):\n",
|
||||
" #print(post_id.lstrip(\"t3_\"))\n",
|
||||
" last_author = \"\"\n",
|
||||
" # Grab the post\n",
|
||||
" submission = reddit.submission(id=post_id.lstrip(\"t3_\"))\n",
|
||||
" #print(submission.title)\n",
|
||||
"\n",
|
||||
" # Grab the highest rated comment on root layer\n",
|
||||
" submission.submission_type = 'best'\n",
|
||||
" submission.comments.replace_more(limit=0)\n",
|
||||
" replies = list(submission.comments)[0].replies.list()\n",
|
||||
"\n",
|
||||
" # Just some variables\n",
|
||||
" pros = []\n",
|
||||
"\n",
|
||||
" # If the post author doesn't exist this submission was deleted (submission.deleted doesn't work)\n",
|
||||
" if type(submission.author) == type(None):\n",
|
||||
" last_author = \"[deleted]\"\n",
|
||||
" else:\n",
|
||||
" last_author = submission.author.name\n",
|
||||
"\n",
|
||||
" is_pro_argument = False\n",
|
||||
"\n",
|
||||
" for comment in replies:\n",
|
||||
"\n",
|
||||
" # If redditor object doesn't exist, the account is invalid/deleted\n",
|
||||
" if type(comment.author) != type(None):\n",
|
||||
" author = comment.author.name\n",
|
||||
" else:\n",
|
||||
" author = \"[deleted]\"\n",
|
||||
"\n",
|
||||
" # Assume that whenever the user changes, they are countering the previous person\n",
|
||||
" if author != last_author:\n",
|
||||
" is_pro_argument = !is_pro_argument\n",
|
||||
"\n",
|
||||
" if author == \"[deleted]\" or author==\"DeltaBot\":\n",
|
||||
" #print(\"Skipping comment...\")\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" # Remove meta and duplicate comments\n",
|
||||
" comment.body = \" \".join([line for line in comment.body.splitlines()\n",
|
||||
" if not re.search(r\"(?i)(Change\\smy\\sview|CMV)\", line)\n",
|
||||
" and line not in pros # Why doesn't this line work\n",
|
||||
" ])\n",
|
||||
"\n",
|
||||
" # Sometimes for some reason duplicate entries exist\n",
|
||||
" # Also remove automated message with \"Δ\" in it\n",
|
||||
"\n",
|
||||
" if comment.body in pros:\n",
|
||||
" #print(\"Skipping duplicate entry\")\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" #print(\"\\t\\t>>\\t\",comment.body)\n",
|
||||
"\n",
|
||||
" # Remove toxic comments\n",
|
||||
" if Detoxify(\"multilingual\").predict(comment.body)[\"toxicity\"] > TOXIC_THRESHOLD:\n",
|
||||
" #print(\"Identified toxic comment, ignoring...\")\n",
|
||||
" comment.body = \"\"\n",
|
||||
"\n",
|
||||
" # Add to the respective argument type \n",
|
||||
" if is_pro_argument:\n",
|
||||
" pros.append(comment.body)\n",
|
||||
" \n",
|
||||
" last_author = comment.author.name\n",
|
||||
" \n",
|
||||
" # Pros = arguments for the Title of this post\n",
|
||||
" # Cons = arguments against the title of this post\n",
|
||||
"\n",
|
||||
" pros.append(comment.body)\n",
|
||||
" return pros"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 76,
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Loading in 10 posts\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(f\"Loading in {ENTRIES_COUNT} posts\")\n",
|
||||
"dataset = df.head(ENTRIES_COUNT)\n"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 77,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# the name column does some weird sh** because dataframes already have a name property, so migrate to a different column name\n",
|
||||
"\n",
|
||||
"import warnings\n",
|
||||
"warnings.filterwarnings('ignore')\n",
|
||||
"\n",
|
||||
"dataset[\"post_id\"] = dataset[\"name\"]\n",
|
||||
"warnings.filterwarnings('default')"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 78,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Loading in data... This will take a while.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": " 0%| | 0/10 [00:00<?, ?it/s]",
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"version_major": 2,
|
||||
"version_minor": 0,
|
||||
"model_id": "203fd74c8a5146b68b8af961bb3874c8"
|
||||
}
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"<timed exec>:29: SettingWithCopyWarning: \n",
|
||||
"A value is trying to be set on a copy of a slice from a DataFrame\n",
|
||||
"\n",
|
||||
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
|
||||
"/Users/kayjaymac/opt/anaconda3/lib/python3.9/site-packages/torch/serialization.py:997: ResourceWarning: unclosed file <_io.BufferedReader name='cmv.tar.bz2'>\n",
|
||||
" storage = zip_file.get_storage_from_record(name, numel, torch._UntypedStorage).storage()._untyped()\n",
|
||||
"ResourceWarning: Enable tracemalloc to get the object allocation traceback\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"CPU times: user 7min 49s, sys: 2min 29s, total: 10min 19s\n",
|
||||
"Wall time: 8min 45s\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"%%time\n",
|
||||
"\n",
|
||||
"from tqdm.auto import tqdm\n",
|
||||
"# Reset variables for if we run this multiple times\n",
|
||||
"all_pros = []\n",
|
||||
"all_names = []\n",
|
||||
"all_titles = []\n",
|
||||
"all_sources = []\n",
|
||||
"\n",
|
||||
"print(\"Loading in data... This will take a while.\")\n",
|
||||
"\n",
|
||||
"for i in tqdm(range(dataset.shape[0])):\n",
|
||||
"\n",
|
||||
" post = dataset.iloc[i]\n",
|
||||
" modified_title = post.title.replace('CMV', \"Change my mind\")\n",
|
||||
" #print(f\"\\n Loading entry {i+1}/{dataset.shape[0]}:\\n\\t\\\"{modified_title}\\\"\")\n",
|
||||
"\n",
|
||||
" if type(post) == type(None):\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" assert(post.post_id != i)\n",
|
||||
"\n",
|
||||
" pros = get_top_comment_and_clean_data(post.post_id)\n",
|
||||
"\n",
|
||||
" if post.title == \"[deleted]\":\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" pros = \" \".join([*set(pros)])\n",
|
||||
" pros = pros.replace(\"[deleted]\",\"\")\n",
|
||||
"\n",
|
||||
" post.selftext = cleanup_body_text(post.selftext)\n",
|
||||
" all_titles.append(modified_title + \" \" + post.selftext)\n",
|
||||
" all_pros.append(pros)\n",
|
||||
" all_names.append(post.name)\n",
|
||||
" all_sources.append(f\"https://reddit.com/r/changemyview/comments/{post.post_id}\")\n",
|
||||
" #print(post.title)\n",
|
||||
"\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 83,
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": "'it\\'s already been signed. They even claim to be adhering to it, though they\\'ve been found to be violating it before. There is no such thing as \"de facto acceptance of Israel\\'s nuclear program.\" the Non-Proliferation Treaty is only binding for signatory states. Israel is not a signatory. Article 10 of the NPT allows them to withdraw if they so choose. they have not done so. a whole new country which explicitly has a right to withdraw from the NPT and has not chosen to do so. It\\'s more accurate, I think, to say that the problem with Iran here from a legal standpoint is that they aren\\'t honoring their own commitments, rather than that they\\'re building weapons. They could pull out of the NPT at any time, and the ball would be essentially in America\\'s court, because their nuclear program would no longer be illegal by international legal standards. However, Iran insists both on developing nukes *and* remaining an NPT signatory non-nuclear state, and that\\'s what makes their program illegal. I\\'d also like to clarify that I\\'m not making an ethical argument here, this is just how international law currently works. because international law doesn\\'t require states to sign treaties, it only requires them to adhere to treaties they\\'ve already signed. Israel isn\\'t defying the UN, at least not in this particular case. Think of the NPT less like a standard law within a state and more like a contract. Once you\\'ve signed, you\\'re bound by the contract, but if you never sign it then you haven\\'t broken a law, you\\'ve just decided not to agree to the terms you were offered. > Because Iran did sign the treaty, and thus are bound by it. They signed on July 1, 1968. Hmm. So is the argument here that it\\'s not \"ok\" for Iran to have a nuke, since they signed treaty not to do so. But it\\'s \"ok\" for Israel to have one because they never signed such thing? Can\\'t quite put my finger on it, but doesn\\'t seem quite right this one.'"
|
||||
},
|
||||
"execution_count": 83,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"all_pros[1]"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 80,
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Place it all into a Pandas Dataframe\n",
|
||||
"clean_df = pd.DataFrame({\n",
|
||||
" \"INSTRUCTION\": all_titles,\n",
|
||||
" \"RESPONSE\": all_pros,\n",
|
||||
" \"SOURCE\": all_sources\n",
|
||||
"}, index=all_names\n",
|
||||
")"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 81,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create Apache Paquete file\n",
|
||||
"\n",
|
||||
"import pyarrow as pa\n",
|
||||
"import pyarrow.parquet as pq\n",
|
||||
"\n",
|
||||
"table = pa.Table.from_pandas(clean_df)\n",
|
||||
"pq.write_table(table,\"output.parquet\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 82,
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": " INSTRUCTION \\\n0 Change my mind: I shouldn't get a job in this ... \n1 Change my mind: Iran has the right to develop ... \n2 Change my mind: The events in Paris suck...but... \n3 Change my mind: It is ok to hate a religion so... \n4 Change my mind: There is no productive reason ... \n5 Change my mind: Diet soda is perfectly healthy... \n6 Change my mind:Essential Oils are bullshit My ... \n7 Change my mind: I think the Paris shooting mak... \n8 Change my mind: Printing an image of the Musli... \n9 Change my mind: Philosophy has no tangible val... \n\n RESPONSE \\\n0 That is what someone in the 1500s would have s... \n1 it's already been signed. They even claim to b... \n2 Hm I guess I made the OP incorrectly. The mai... \n3 I don't understand your analogy. Promoting a ... \n4 ∆ I hadn't thought it from a \"let's trick peop... \n5 Thanks for a fresh argument! I hadn't conside... \n6 Most do. Some smell kinda funky. \n7 I already said in different comments that thi... \n8 The first bacon sandwich came about because 9... \n9 >Why restrict it to 50 years? I can name all s... \n\n SOURCE \n0 https://reddit.com/r/changemyview/comments/t3_... \n1 https://reddit.com/r/changemyview/comments/t3_... \n2 https://reddit.com/r/changemyview/comments/t3_... \n3 https://reddit.com/r/changemyview/comments/t3_... \n4 https://reddit.com/r/changemyview/comments/t3_... \n5 https://reddit.com/r/changemyview/comments/t3_... \n6 https://reddit.com/r/changemyview/comments/t3_... \n7 https://reddit.com/r/changemyview/comments/t3_... \n8 https://reddit.com/r/changemyview/comments/t3_... \n9 https://reddit.com/r/changemyview/comments/t3_... ",
|
||||
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>INSTRUCTION</th>\n <th>RESPONSE</th>\n <th>SOURCE</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Change my mind: I shouldn't get a job in this ...</td>\n <td>That is what someone in the 1500s would have s...</td>\n <td>https://reddit.com/r/changemyview/comments/t3_...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>Change my mind: Iran has the right to develop ...</td>\n <td>it's already been signed. They even claim to b...</td>\n <td>https://reddit.com/r/changemyview/comments/t3_...</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Change my mind: The events in Paris suck...but...</td>\n <td>Hm I guess I made the OP incorrectly. The mai...</td>\n <td>https://reddit.com/r/changemyview/comments/t3_...</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Change my mind: It is ok to hate a religion so...</td>\n <td>I don't understand your analogy. Promoting a ...</td>\n <td>https://reddit.com/r/changemyview/comments/t3_...</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Change my mind: There is no productive reason ...</td>\n <td>∆ I hadn't thought it from a \"let's trick peop...</td>\n <td>https://reddit.com/r/changemyview/comments/t3_...</td>\n </tr>\n <tr>\n <th>5</th>\n <td>Change my mind: Diet soda is perfectly healthy...</td>\n <td>Thanks for a fresh argument! I hadn't conside...</td>\n <td>https://reddit.com/r/changemyview/comments/t3_...</td>\n </tr>\n <tr>\n <th>6</th>\n <td>Change my mind:Essential Oils are bullshit My ...</td>\n <td>Most do. Some smell kinda funky.</td>\n <td>https://reddit.com/r/changemyview/comments/t3_...</td>\n </tr>\n <tr>\n <th>7</th>\n <td>Change my mind: I think the Paris shooting mak...</td>\n <td>I already said in different comments that thi...</td>\n <td>https://reddit.com/r/changemyview/comments/t3_...</td>\n </tr>\n <tr>\n <th>8</th>\n <td>Change my mind: Printing an image of the Musli...</td>\n <td>The first bacon sandwich came about because 9...</td>\n <td>https://reddit.com/r/changemyview/comments/t3_...</td>\n </tr>\n <tr>\n <th>9</th>\n <td>Change my mind: Philosophy has no tangible val...</td>\n <td>>Why restrict it to 50 years? I can name all s...</td>\n <td>https://reddit.com/r/changemyview/comments/t3_...</td>\n </tr>\n </tbody>\n</table>\n</div>"
|
||||
},
|
||||
"execution_count": 82,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Test to see if it was sucessful\n",
|
||||
"table = pq.read_table(\"output.parquet\")\n",
|
||||
"table.to_pandas()"
|
||||
],
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
}
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.13"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
|
||||
Reference in New Issue
Block a user