diff --git a/notebooks/gutenberg/README.md b/notebooks/gutenberg/README.md
new file mode 100644
index 00000000..d30a5943
--- /dev/null
+++ b/notebooks/gutenberg/README.md
@@ -0,0 +1,92 @@
+---
+dataset_info:
+  features:
+    - name: Text#
+      dtype: int64
+    - name: Issued
+      dtype: timestamp[us]
+    - name: Title
+      dtype: string
+    - name: Authors
+      dtype: string
+    - name: Subjects
+      dtype: string
+    - name: LoCC
+      dtype: string
+    - name: Bookshelves
+      dtype: string
+    - name: Body
+      dtype: string
+  splits:
+    - name: train
+      num_bytes: 193811664
+      num_examples: 495
+  download_size: 125691249
+  dataset_size: 193811664
+license: mit
+task_categories:
+ - text-generation
+ - conversational
+language:
+ - hu
+tags:
+ - project gutenberg
+ - ebook
+ - gutenberg.org
+pretty_name: Hungarian language eBooks from Project Gutenberg
+size_categories:
+ - n<1K
+---
+
+# Dataset Card for "gutenberg_hu"
+
+# Dataset Card for Project Gutenberg - Hungarian eBooks
+
+## Dataset Description
+
+- **Repository:**
+ [Code](https://github.com/LAION-AI/Open-Assistant/openassistant/datasets/gutenberg/)
+
+## Source data
+
+Please **READ** the site's TOS before running the crawler Notebook and follow
+these instructions:
+
+- The website will IP ban crawlers for going through each book's metadata page
+ separately. Instead use `catalog()` to access the list of available E-books.
+ For more information, visit: https://www.gutenberg.org/ebooks/feeds.html
+- You can avoid running the crawler by mirroring the entire Project Gutenberg
+  database or by using one of their FTP servers instead, and then calling the
+  `parse()` function on each text.
+- For more on robot access see:
+ https://www.gutenberg.org/policy/robot_access.html
+
+How does it work?
+
+- The crawler downloads the raw HTML code for each E-book based on **Text#** id
+ (if available)
+- The metadata and the body of text are not clearly separated so the parser will
+ try to split them, then remove transcriber's notes and e-book related
+ information from the body of text (text marked as copyrighted or malformed
+ will be skipped)
+- If any text remains, the metadata and the cleaned body of text are both
+  saved; the body is then added to a filtered parquet file (containing only the
+  catalog information and body of text of the successfully retrieved books)
+
+Copyright notice:
+
+- Some of the books are copyrighted! The crawler (parser) will skip all books
+  with an English copyright header by using a regular expression, but make
+  sure to check the metadata of each book manually to ensure it is okay
+  to use in your country! More information on copyright:
+  https://www.gutenberg.org/help/copyright.html and
+  https://www.gutenberg.org/policy/permission.html
+- Project Gutenberg has the following request when using books without
+  metadata: _Books obtained from the Project Gutenberg site should have the
+  following legal note next to them: "This eBook is for the use of anyone
+  anywhere in the United States and most other parts of the world at no cost and
+  with almost no restrictions whatsoever. You may copy it, give it away or
+  re-use it under the terms of the Project Gutenberg License included with this
+  eBook or online at www.gutenberg.org. If you are not located in the United
+  States, you will have to check the laws of the country where you are located
+  before using this eBook."_
diff --git a/notebooks/gutenberg/__init__.py b/notebooks/gutenberg/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/notebooks/gutenberg/hub.py b/notebooks/gutenberg/hub.py
new file mode 100644
index 00000000..e7d38166
--- /dev/null
+++ b/notebooks/gutenberg/hub.py
@@ -0,0 +1,16 @@
+from datasets import concatenate_datasets, load_dataset
+
+
+def load(languages: list = ["en", "de", "fr", "es", "it", "pt", "nl", "hu"]):
+    # concatenate_datasets() expects Dataset objects (one split each), not repository names
+    ds = None
+    for lang in languages:
+        part = load_dataset(f"sedthh/gutenberg_{lang}", split="train")
+        ds = part if ds is None else concatenate_datasets([ds, part])
+    return ds
+
+
+if __name__ == "__main__":
+    ds = load()
+    # load() returns the concatenated "train" split
+    print(ds)
diff --git a/notebooks/gutenberg/project_gutenberg_crawler.ipynb b/notebooks/gutenberg/project_gutenberg_crawler.ipynb
new file mode 100644
index 00000000..76a19bb2
--- /dev/null
+++ b/notebooks/gutenberg/project_gutenberg_crawler.ipynb
@@ -0,0 +1,644 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Project Gutenberg Crawler\n",
+ "\n",
+ "Make sure you read the site's TOS and the notebook's README.md on how to use the crawler."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "[Open In Colab](https://colab.research.google.com/github/LAION-AI/Open-Assistant/blob/notebooks/gutenberg/project_gutenberg_crawler.ipynb)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# uncomment and run below lines to set up if running in colab\n",
+ "# !git clone https://github.com/LAION-AI/Open-Assistant.git\n",
+ "# %cd Open-Assistant/notebooks/gutenberg\n",
+ "# !pip install -r requirements.txt"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# global settings\n",
+ "\n",
+ "LANG = (\n",
+ "    \"en\"  # crawl English language books, NOTE: there are a few hundred books with multiple languages such as 'en; es'\n",
+ ")\n",
+ "FOLDER = \"text\" # save metadata and body of text to this folder\n",
+ "CHUNKS = 1 # optionally divide the dataset into this many compressed parquet files if you have less memory\n",
+ "STATUS = \"crawled.csv\" # save the list of downloaded files and their status into this csv"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# import required packages\n",
+ "import os\n",
+ "import io\n",
+ "import re\n",
+ "import requests\n",
+ "import time\n",
+ "import warnings\n",
+ "\n",
+ "try:\n",
+ "    from BeautifulSoup import BeautifulSoup\n",
+ "except ImportError:\n",
+ "    from bs4 import BeautifulSoup\n",
+ "from tqdm import tqdm\n",
+ "\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "from typing import Tuple, Optional, Any"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Code for crawler"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "class GutenbergCrawler:\n",
+ "\n",
+ "    HEADER = {\n",
+ "        \"User-Agent\": \"Mozilla/5.0 (compatible; GutenbergCrawler/0.1)\",\n",
+ "    }\n",
+ "    TIMER = 600  # wait ms between calls\n",
+ "    MIRRORS = [\n",
+ "        \"http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/\",\n",
+ "        \"https://www.gutenberg.org/dirs/\",\n",
+ "        \"http://mirrors.xmission.com/gutenberg/\",\n",
+ "    ]  # see https://www.gutenberg.org/MIRRORS.ALL for available mirrors\n",
+ "\n",
+ "    def __init__(self, folder: Optional[str] = None) -> None:\n",
+ "        self.folder = folder\n",
+ "        if self.folder is not None:\n",
+ "            os.makedirs(self.folder, exist_ok=True)\n",
+ "        self.calls = 0\n",
+ "        self.last_call = 0\n",
+ "\n",
+ "    def _get(self, url: str) -> Optional[str]:\n",
+ "        self.calls += 1\n",
+ "        diff = max(0.0, self.TIMER - (time.time() - self.last_call) * 1000.0)  # ms left to wait\n",
+ "        if diff:\n",
+ "            time.sleep(diff / 1000.0)\n",
+ "        data = requests.get(url, headers=self.HEADER)\n",
+ "        self.last_call = time.time()\n",
+ "        if data.status_code == 404:\n",
+ "            return None\n",
+ "        try:\n",
+ "            return data.content.decode(\"utf-8\")\n",
+ "        except UnicodeDecodeError:\n",
+ "            try:\n",
+ "                return data.content.decode(\"ISO-8859-1\")  # latin-1\n",
+ "            except UnicodeDecodeError:\n",
+ "                return data.content.decode(\"utf-8\", \"backslashreplace\")\n",
+ "\n",
+ "    def catalog(self) -> pd.DataFrame:\n",
+ "        try:\n",
+ "            csv = pd.read_csv(\"https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv.gz\", sep=\",\")\n",
+ "        except Exception:\n",
+ "            raw = self._get(\"https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv\")\n",
+ "            if raw is None:\n",
+ "                raise ValueError(\"Catalog CSV file does not exist!\")\n",
+ "            csv = pd.read_csv(io.StringIO(raw), sep=\",\")\n",
+ "        return csv.loc[csv[\"Type\"] == \"Text\"].reset_index(drop=True)\n",
+ "\n",
+ "    def search(self, url: str) -> dict:\n",
+ "        \"\"\"Use catalog() instead! Returns dict with book_id: 'book title' pairs for gutenberg.org pages\"\"\"\n",
+ "        assert \"/www.gutenberg.org\" in url, \"The URL must be a page at https://www.gutenberg.org/\"\n",
+ "        html = self._get(url)\n",
+ "        if html is None:\n",
+ "            return {}\n",
+ "        dom = BeautifulSoup(html, \"html.parser\")\n",
+ "        results = {}\n",
+ "        for a in dom.find_all(\"a\"):  # pattern below is an assumption: links of the form <a href=\"/ebooks/ID\">Title</a>\n",
+ "            for elem in re.findall(r'<a[^>]*href=\"/ebooks/(\\d+)\"[^>]*>(.+?)</a>', str(a)):\n",
+ "                ebook, title = elem\n",
+ "                results[int(ebook)] = title.replace(\"\\r<br/>\", \"\\r\\n\")\n",
+ "        return results\n",
+ "\n",
+ "    def download(self, book: int) -> Optional[str]:\n",
+ "        book = int(book)\n",
+ "        assert book > 0\n",
+ "        mirror = np.random.choice(self.MIRRORS)\n",
+ "        if book < 10:\n",
+ "            page = f\"0/{book}/\"\n",
+ "        else:\n",
+ "            page = \"/\".join([char for char in str(book)[:-1]]) + f\"/{book}/\"\n",
+ "        url = f\"{mirror}{page}{book}-h/{book}-h.htm\"\n",
+ "        return self._get(url)\n",
+ "\n",
+ "    def parse(self, book: int, html: str) -> Tuple[Optional[str], Optional[str]]:\n",
+ "        book = int(book)\n",
+ "        assert book > 0\n",
+ "        if html is None:\n",
+ "            return None, None\n",
+ "        dom = BeautifulSoup(html, \"html.parser\")\n",
+ "        if dom is None or dom.title is None or dom.title.string is None or \"404\" in dom.title.string:\n",
+ "            return None, None\n",
+ "\n",
+ "        meta = \"\"\n",
+ "        for pre in dom.select(\"title, pre\"):\n",
+ "            meta += str(pre.get_text()).strip()\n",
+ "            # remove metadata from dom afterwards\n",
+ "            pre.extract()\n",
+ "        if re.findall(r\"(?i)\\*{2,}[^\\n]+?(?:please.+?copyright|copyrighted.+?project)[^\\n]+?\\*{2,}\\r?\\n\", meta):\n",
+ "            warnings.warn(f\"Book {book} is copyrighted.\")\n",
+ "            return None, None\n",
+ "        for img in dom.select(\"img\"):\n",
+ "            # add image alt attributes as text\n",
+ "            try:\n",
+ "                img.insert(0, img[\"alt\"])\n",
+ "            except KeyError:\n",
+ "                pass\n",
+ "        text = str(dom.get_text()).strip()\n",
+ "        if re.findall(r\"(?i)\\*{2,}[^\\n]+?(?:please.+?copyright|copyrighted.+?project)[^\\n]+?\\*{2,}\\r?\\n\", text):\n",
+ "            warnings.warn(f\"Book {book} is copyrighted.\")\n",
+ "            return None, None\n",
+ "\n",
+ "        s = re.split(r\"(?i)\\*{2,}[^\\n]+?project gutenberg[^\\n]+?\\*{2,}\\s*[\\r\\n]+\", text)  # 49843\n",
+ "        if len(s) > 1:\n",
+ "            if len(s) > 3:\n",
+ "                warnings.warn(f\"Book {book} is malformed.\")\n",
+ "                return None, None\n",
+ "            meta += s[0]\n",
+ "            return meta, s[1]\n",
+ "        return meta, text\n",
+ "\n",
+ "    @staticmethod\n",
+ "    def pretty(text: Optional[str]) -> str:\n",
+ "        if not text:\n",
+ "            return \"\"\n",
+ "        # attempt to remove transcriber's notes\n",
+ "        text = re.sub(r\"(?i)(?:\\[|\\b)transcriber[\\'’]?s? notes?\\s*(?:[^\\xa0\\n].*?\\]?(?:\\r?\\n){1,2})+\", \"\", text)\n",
+ "        # attempt to remove e-text info\n",
+ "        text = re.sub(\n",
+ "            r\"(?i)e-text prepared(?:[^\\xa0]\\(?.+\\)?\\r?\\n{1,3})+(?:\\xa0*\\s*note\\:\\s*(?:.+\\s*\\r{0,2}\\n{1,2}){1,5}\\xa0\\s+)?\",\n",
+ "            \"\",\n",
+ "            text,\n",
+ "        )\n",
+ "        # standardize line endings\n",
+ "        text = \"\\r\\n\".join(text.splitlines())\n",
+ "        text = re.sub(r\"(\\r\\n){3,}\", \"\\r\\n\\r\\n\\r\\n\", text).strip()\n",
+ "        return text\n",
+ "\n",
+ "    def _write(self, file: str, content: str) -> None:\n",
+ "        path = os.path.join(self.folder, file) if self.folder is not None else file\n",
+ "        with open(path, \"w+\", encoding=\"utf-8\") as f:\n",
+ "            f.write(content)\n",
+ "\n",
+ "    def save(self, book: int) -> bool:\n",
+ "        html = self.download(book)\n",
+ "        meta, text = self.parse(book, html)\n",
+ "        if meta:\n",
+ "            self._write(f\"{book}_meta.txt\", meta)\n",
+ "        if text:\n",
+ "            self._write(f\"{book}_text.txt\", text)\n",
+ "        return bool(text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Start crawling"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "gc = GutenbergCrawler(FOLDER) # use text/ folder to save files"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>Text#</th>\n",
+ "      <th>Type</th>\n",
+ "      <th>Issued</th>\n",
+ "      <th>Title</th>\n",
+ "      <th>Language</th>\n",
+ "      <th>Authors</th>\n",
+ "      <th>Subjects</th>\n",
+ "      <th>LoCC</th>\n",
+ "      <th>Bookshelves</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>9936</th>\n",
+ "      <td>10631</td>\n",
+ "      <td>Text</td>\n",
+ "      <td>2004-01-01</td>\n",
+ "      <td>Halleck's New English Literature</td>\n",
+ "      <td>en</td>\n",
+ "      <td>Halleck, Reuben Post, 1859-1936</td>\n",
+ "      <td>English literature -- History and criticism</td>\n",
+ "      <td>PR</td>\n",
+ "      <td>NaN</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>20456</th>\n",
+ "      <td>21443</td>\n",
+ "      <td>Text</td>\n",
+ "      <td>2007-05-15</td>\n",
+ "      <td>Vesty of the Basins</td>\n",
+ "      <td>en</td>\n",
+ "      <td>Greene, Sarah Pratt McLean, 1856-1935</td>\n",
+ "      <td>Maine -- Fiction</td>\n",
+ "      <td>PS</td>\n",
+ "      <td>NaN</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>35395</th>\n",
+ "      <td>36710</td>\n",
+ "      <td>Text</td>\n",
+ "      <td>2011-07-12</td>\n",
+ "      <td>The Black Opal</td>\n",
+ "      <td>en</td>\n",
+ "      <td>Prichard, Katharine Susannah, 1884-1969</td>\n",
+ "      <td>Opal mines and mining -- Australia -- Fiction</td>\n",
+ "      <td>PR</td>\n",
+ "      <td>NaN</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>27832</th>\n",
+ "      <td>29139</td>\n",
+ "      <td>Text</td>\n",
+ "      <td>2009-06-17</td>\n",
+ "      <td>No Pets Allowed</td>\n",
+ "      <td>en</td>\n",
+ "      <td>Cummings, Monette, 1914-1999</td>\n",
+ "      <td>Science fiction; Short stories</td>\n",
+ "      <td>PS</td>\n",
+ "      <td>Science Fiction</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>67010</th>\n",
+ "      <td>68338</td>\n",
+ "      <td>Text</td>\n",
+ "      <td>2022-06-17</td>\n",
+ "      <td>Nick Carter Stories No. 160, October 2, 1915: ...</td>\n",
+ "      <td>en</td>\n",
+ "      <td>Carter, Nicholas (House name); Lebhar, Bertram...</td>\n",
+ "      <td>Popular literature -- Periodicals; Detective a...</td>\n",
+ "      <td>PS</td>\n",
+ "      <td>NaN</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",