From 6f9c32ebb404800598af4fc7b335dd1d184ee884 Mon Sep 17 00:00:00 2001 From: wassname Date: Tue, 2 Jan 2024 13:46:21 +0800 Subject: [PATCH] wip --- README.md | 22 +- main.ipynb | 279 +------------ nbs/01_use_tldr_prompt.ipynb | 782 +++++++++++++++++++++++++++++++++++ nbs/02_use_lora.ipynb | 372 +++++++++++++++++ poetry.lock | 27 +- pyproject.toml | 2 + samples.json | 1 + 7 files changed, 1205 insertions(+), 280 deletions(-) create mode 100644 nbs/01_use_tldr_prompt.ipynb create mode 100644 nbs/02_use_lora.ipynb create mode 100644 samples.json diff --git a/README.md b/README.md index a4c8b33..7aa2f1f 100644 --- a/README.md +++ b/README.md @@ -13,4 +13,24 @@ The ideal way to do this would be by fine tuning. But that would be momory inten If yes/yes then it's may be suprising new information. If either is not true, then it's proboably BS. -See main.ipynb +See [main.ipynb](main.ipynb) for the code and results. + + +# Results + +When using microsoft/phi-2 we get this amount of perplexity reduction by including a summary of the key learnings + +| | sample | learning% | +|---:|:----------------------------------------------|-------------:| +| 3 | einsteins theory of general relativity | 0.0751468 | +| 5 | wikipedia on LK-99 | 0.0674738 | +| 8 | Schmidhuber 2023 Subjective Novelty, Surprise | 0.0396319 | +| 1 | good_ml | 0.0321225 | +| 0 | bad_ml | -9.58801e-05 | +| 2 | sokal hoax | -0.0168107 | +| 7 | AI gen fake paper | -0.134864 | +| 4 | lorem ipsum | -0.69694 | +| 6 | I have a dream | -0.796421 | + + +As you can see, some of these are probobly in the training set diff --git a/main.ipynb b/main.ipynb index db10f97..f510f15 100644 --- a/main.ipynb +++ b/main.ipynb @@ -18,283 +18,6 @@ "from evaluate import load\n" ] }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "samples = [\n", - " {\"name\": \"bad_ml\", \"url\":\"https://arxiv.org/abs/2312.10868\", \"text\": \"\"\"This roadmap survey has embarked on an 
exploration of the\n", - "transformative trends in generative AI research, particularly focusing on speculated advancements like Q* and the progressive strides towards AGI. Our analysis highlights a crucial paradigm shift, driven by innovations such as MoE, multi-modal learning, and the pursuit of AGI. These advancements signal a future where AI systems could significantly extend their capabilities in reasoning, contextual understanding, and creative problem-solving. This study reflects on AI’s dual potential to either contribute to or impede global equity and justice. The equitable distribution of AI benefits and its role in decision-making processes raise crucial questions about fairness and inclusivity. It is imperative to thoughtfully integrate AI into societal structures to enhance justice and reduce disparities. Despite these advancements, several open questions and research gaps remain. These include ensuring the ethical alignment of advanced AI systems with human values and societal norms, a challenge compounded by their increasing autonomy. The safety and robustness of AGI systems in diverse environments also remain a significant research gap. Addressing these challenges requires a multidisciplinary approach, incorporating ethical, social, and philosophical perspectives. Our survey has highlighted key areas for future inter-disciplinary research in AI, emphasizing the integration of ethical, sociological, and technical perspectives. This approach will foster collaborative research, bridging the gap between technological advancement and societal needs, ensuring that AI development is aligned with human values and global welfare. The roles of MoE, multimodal, and AGI in reshaping generative AI have been identified as significant, as their advancements can enhance model performance and versatility, and pave the way for future research in areas like ethical AI alignment and AGI. 
As we forge ahead, the balance between AI advancements and human creativity is not just a goal but a necessity, ensuring AI’s role as a complementary force that amplifies our capacity to innovate and solve complex challenges. Our responsibility is to guide these advancements towards enriching the human experience, aligning technological progress with ethical standards and societal well-being. \"\"\"},\n", - " {\"name\": \"good_ml\", \"url\":\"https://arxiv.org/abs/2310.01405\", \"text\": \"\"\"We explored representation engineering (RepE), an approach to top-down transparency for AI systems. Inspired by the Hopfieldian view in cognitive neuroscience, RepE places representations and the transformations between them at the center of analysis. As neural networks exhibit more coherent internal structures, we believe analyzing them at the representation level can yield new insights, aiding in effective monitoring and control. Taking early steps in this direction, we proposed new RepE methods, which obtained state-of-the-art on TruthfulQA, and we demonstrated how RepE and can provide traction on a wide variety of safety-relevant problems. While we mainly analyzed subspaces of representations, future work could investigate trajectories, manifolds, and state-spaces of representations. We hope this initial step in exploring the potential of RepE helps to foster new insights into understanding and controlling AI systems, ultimately ensuring that future AI systems are trustworthy and safe.\"\"\"},\n", - "\n", - " {\"name\": \"sokal hoax\", \"url\":\"www.physics.nyu.edu/faculty/sokal/transgress_v2/transgress_v2_singlefile.html\", \"text\": \"\"\" There are many natural scientists, and especially physicists, who continue to reject the notion that the disciplines concerned with social and cultural criticism can have anything to contribute, except perhaps peripherally, to their research. 
Still less are they receptive to the idea that the very foundations of their worldview must be revised or rebuilt in the light of such criticism. Rather, they cling to the dogma imposed by the long post-Enlightenment hegemony over the Western intellectual outlook, which can be summarized briefly as follows: that there exists an external world, whose properties are independent of any individual human being and indeed of humanity as a whole; that these properties are encoded in ``eternal'' physical laws; and that human beings can obtain reliable, albeit imperfect and tentative, knowledge of these laws by hewing to the ``objective'' procedures and epistemological strictures prescribed by the (so-called) scientific method.\n", - "\n", - " But deep conceptual shifts within twentieth-century science have undermined this Cartesian-Newtonian metaphysics1; revisionist studies in the history and philosophy of science have cast further doubt on its credibility2; and, most recently, feminist and poststructuralist critiques have demystified the substantive content of mainstream Western scientific practice, revealing the ideology of domination concealed behind the façade of ``objectivity''.3 It has thus become increasingly apparent that physical ``reality'', no less than social ``reality'', is at bottom a social and linguistic construct; that scientific ``knowledge\", far from being objective, reflects and encodes the dominant ideologies and power relations of the culture that produced it; that the truth claims of science are inherently theory-laden and self-referential; and consequently, that the discourse of the scientific community, for all its undeniable value, cannot assert a privileged epistemological status with respect to counter-hegemonic narratives emanating from dissident or marginalized communities. 
These themes can be traced, despite some differences of emphasis, in Aronowitz's analysis of the cultural fabric that produced quantum mechanics4; in Ross' discussion of oppositional discourses in post-quantum science5; in Irigaray's and Hayles' exegeses of gender encoding in fluid mechanics6; and in Harding's comprehensive critique of the gender ideology underlying the natural sciences in general and physics in particular.7 \"\"\",},\n", - " \n", - "\n", - " {\"name\": \"einsteins theory of general relativity\", \"url\":\"\", \"text\": \"\"\"In recent years I have worked, in part together with my friend Grossman, on a [1] generalization of the theory of relativity. During these investigations, a kaleidoscopic mixture of postulates from physics and mathematics has been introduced and used as heuristical tools; as a consequence it is not easy to see through and characterize the theory from a formal mathematical point of view, that is, only based upon these papers. The primary objective of the present paper is to close this gap. In particular, it has been possible to obtain the equations of the gravitational field in a purely covariance-theoretical manner (section D). I also tried to give simple derivations of the basic laws of absolute differential calculus-in part, they are probably new ones (section B)-in order to allow the reader to get a complete grasp of the theory without having to read other, purely mathematical tracts. As an illustration of the mathematical methods, I derived the (Eulerian) equations of hydrodynamics and the field equations of the electrodynamics of moving bodies (section C). Section E shows that Newton's theory of gravitation follows from the general theory as an approximation. 
The most elementary features of the present theory are also derived inasfar as [2] they are characteristic of a Newtonian (static) gravitational field (curvature of light rays, shift of spectral \"\"\",},\n", - "\n", - " {\"name\": \"lorem ipsum \", \"url\":\"\", \"text\": \"\"\"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\"\"\"},\n", - "\n", - " {\"name\": \"wikipedia on LK-99\", \"url\":\"https://en.wikipedia.org/wiki/LK-99\", \"text\": \"\"\"Some small LK-99 samples were reported to show strong diamagnetic properties, including a response confusingly[23] referred to as \"partial levitation\" over a magnet.[18] This was misinterpreted by some as a sign of superconductivity, although it is a sign of regular diamagnetism or ferromagnetism.\n", - "\n", - "While initial preprints claimed the material was a room-temperature superconductor,[18]: 1  they did not report observing any definitive features of superconductivity, such as zero resistance, the Meissner effect, flux pinning, AC magnetic susceptibility, the Josephson effect, a temperature-dependent critical field and current, or a sudden jump in specific heat around the critical temperature.[24]\n", - "\n", - "As it is common for a new material to spuriously seem like a potential candidate for high-temperature superconductivity,[13] thorough experimental reports normally demonstrate a number of these expected properties. 
As of 15 October 2023, not one of these properties had been observed by the original experiment or any replications.[25] \"\"\",},\n", - "\n", - "\n", - "{\"name\": \"I have a dream\", \"url\":\"\", \"text\": \"\"\"So even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream. I have a dream that one day this nation will rise up and live out the true meaning of its creed: We hold these truths to be self-evident, that all men are created equal.\n", - "\n", - "People clap and sing along to a freedom song between speeches at the March on Washington for Jobs and Freedom in 1963.\n", - "Express Newspapers via Getty Images\n", - "\n", - "I have a dream that one day on the red hills of Georgia, the sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood.\n", - "\n", - "I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression will be transformed into an oasis of freedom and justice.\"\"\",},\n", - "{\"name\": \"AI gen fake paper\", \"url\":\"\", \"text\": \"\"\"\n", - "Machine Learning (ML) as a field has largely embraced binary constructs as foundational to its functioning, with 0/1 and true/false distinctions underpinning many of its core algorithms. This paper argues that these binary frameworks are not merely technical conveniences but are, in fact, reflective of deeper hegemonic paradigms that perpetuate exclusionary practices and systemic biases. By uncritically adopting these binaries, ML inadvertently reinforces a worldview that marginalizes complex, nuanced identities and experiences.\n", - "\n", - "We propose a critical examination of these binaries, questioning the necessity and ubiquity of dualistic thinking within ML. 
We suggest that the field's reliance on binary classification not only limits its predictive accuracy in certain contexts but also fails to capture the rich, fluid nature of human experiences and societal structures. Instead, we advocate for a \"fluid\" approach to algorithms that allows for more nuanced and inclusive representations of reality, rejecting the oversimplified and often exclusionary nature of strict binary outcomes.\n", - "\n", - "Furthermore, this paper argues for the integration of intersectional data that reflects the diverse and overlapping categories of identity, including race, gender, class, and more. Current ML models often overlook these complexities, leading to outcomes that fail to serve, and even harm, underrepresented populations. We critique the prevailing notion that ML is a neutral, objective tool, highlighting the lack of socio-political context in algorithmic decision-making processes.\n", - "\n", - "In conclusion, we call for an epistemological shift in the field of Machine Learning. This shift involves moving away from a purely positivistic, binary approach towards one that is reflective, inclusive, and aware of the social dimensions of technology. 
By reimagining the foundational paradigms of ML, we can work towards a more equitable and nuanced understanding of the world, one that respects and represents the full spectrum of human experience.\n", - "\"\"\",},\n", - "\n", - "]" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "\n", - "model_names = [\n", - " # \"EleutherAI/pythia-70m-v0\",\n", - " \"EleutherAI/pythia-160m\",\n", - " # \"EleutherAI/pythia-410m-v0\",\n", - " # \"EleutherAI/pythia-1b-v0\",\n", - " \"EleutherAI/pythia-2.8b\",\n", - " # \"EleutherAI/pythia-6.9b-v0\",\n", - " \"EleutherAI/pythia-12b\",\n", - "]" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from dotenv import load_dotenv\n", - "\n", - "load_dotenv() " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "What's the best way to measure the information content of text? I don't care about style or jargon or acryonyms.\n", - "\n", - "One simple way is: perplexity_large_model - perplexity_small_model. This measures if having a bigger brain helps understand/write this text. How would that work if\n", - "\n", - "- new jargon?\n", - "- new results?\n", - "- new writing style?\n", - "- math, acronyms\n", - "- simple language\n", - "- incoherent\n", - "- vauge language?" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'tl;dr: Representation engineering (RepE) is a transparent approach for analyzing AI systems. By focusing on representations and their transformations, RepE can provide new insights and aid in monitoring and controlling AI systems. Initial RepE methods achieved state-of-the-art results on TruthfulQA, showing promise for safety-related problems. 
Future work can explore different aspects of representations. RepE has the potential to improve understanding and trustworthiness of AI systems.'" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import os\n", - "from openai import OpenAI\n", - "\n", - "def summize(text):\n", - " client = OpenAI()\n", - " content = f\"Make a tl;dr of this text in <280 chars.\\n\\n## Text\\n\\n{text}\\n\\n## Instruction\\n\\nMake a tl;dr of this text in <280 chars. Start with the most important, as extra text will be discarded :\\n\\ntl;dr:\"\n", - " chat_completion = client.chat.completions.create(\n", - " messages=[\n", - " {\n", - " \"role\": \"user\",\n", - " \"content\": content,\n", - " }\n", - " ],\n", - " model=\"gpt-3.5-turbo\",\n", - " )\n", - " # print(content)\n", - " r = chat_completion.choices[0].message.content\n", - " return r\n", - "\n", - "r = summize(samples[1][\"text\"])\n", - "r" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "490" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len(r)" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "from evaluate import evaluator\n", - "from datasets import Dataset\n", - "from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\n", - "import torch\n", - "import pandas as pd" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "perplexity = load(\"perplexity\", module_type=\"metric\")\n", - "data = []\n", - "\n", - "summaries = {}\n", - "for model_name in model_names:\n", - " for sample in samples:\n", - " if sample['name'] not in summaries:\n", - " summaries[sample['name']] = summize(sample['text'])[:600]\n", - " summary = summaries[sample['name']]\n", - "\n", - " # before \n", - " s = 
sample['text']\n", - " results = perplexity.compute(predictions=s, model_id=model_name, device='cuda')\n", - " before = results['mean_perplexity']\n", - "\n", - " # after \n", - " s = f\"\"\"\n", - " High level summary: {summary}\n", - "\n", - "Text:\n", - "{sample['text']}\n", - " \"\"\"\n", - " results = perplexity.compute(predictions=s, model_id=model_name, device='cuda')\n", - " after = results['mean_perplexity']\n", - "\n", - " print(model_name, sample['name'], before, after)\n", - " data.append(dict(before=before, after=after, model=model_name, sample=sample['name']))\n", - "\n", - "\n", - "# df = pd.DataFrame(data)\n", - "# df\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "df = pd.DataFrame(data)\n", - "df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df['improvement'] = df['before'] - df['after']\n", - "df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Are smarter models less confused?\n", - "# Does it confuse dumber models?\n", - "df.groupby('sample').apply(lambda x: x['perplexity'].max()-x['perplexity'].min())" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Are smarter models less confused?\n", - "# Does it confuse dumber models?\n", - "df.groupby('sample').apply(lambda x: x['perplexity'].max())" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Are smarter models less confused?\n", - "# Does it confuse dumber models?\n", - "def content_rating(x):\n", - " max = x['perplexity'].max()\n", - " min = x['perplexity'].min()\n", - " return (max-min)/max\n", - "\n", - "df.groupby('sample').apply(content_rating)" - ] - }, { "cell_type": "code", "execution_count": null, @@ -326,7 +49,7 @@ "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.0rc1" + "version": "3.11.0" } }, "nbformat": 4, diff --git a/nbs/01_use_tldr_prompt.ipynb b/nbs/01_use_tldr_prompt.ipynb new file mode 100644 index 0000000..81f177b --- /dev/null +++ b/nbs/01_use_tldr_prompt.ipynb @@ -0,0 +1,782 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "An attempt to measure suprise in text using adapters\n", + "\n", + "https://github.com/huggingface/peft/blob/main/examples/fp4_finetuning/finetune_fp4_opt_bnb_peft.py" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/media/wassname/SGIronWolf/projects5/bs_writing_detector/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "import os\n", + "from transformers import AutoModelForCausalLM, AutoTokenizer\n", + "import torch\n", + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "import torch\n", + "import torch.nn as nn\n", + "import transformers\n", + "from datasets import load_dataset\n", + "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n", + "\n", + "# from peft import LoraConfig, get_peft_model\n", + "\n", + "\n", + "# os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'name': 'bad_ml',\n", + " 'url': 'https://arxiv.org/abs/2312.10868',\n", + " 'text': 'This roadmap survey has embarked on an exploration of the\\ntransformative trends in generative 
AI research, particularly focusing on speculated advancements like Q* and the progressive strides towards AGI. Our analysis highlights a crucial paradigm shift, driven by innovations such as MoE, multi-modal learning, and the pursuit of AGI. These advancements signal a future where AI systems could significantly extend their capabilities in reasoning, contextual understanding, and creative problem-solving. This study reflects on AI’s dual potential to either contribute to or impede global equity and justice. The equitable distribution of AI benefits and its role in decision-making processes raise crucial questions about fairness and inclusivity. It is imperative to thoughtfully integrate AI into societal structures to enhance justice and reduce disparities. Despite these advancements, several open questions and research gaps remain. These include ensuring the ethical alignment of advanced AI systems with human values and societal norms, a challenge compounded by their increasing autonomy. The safety and robustness of AGI systems in diverse environments also remain a significant research gap. Addressing these challenges requires a multidisciplinary approach, incorporating ethical, social, and philosophical perspectives. Our survey has highlighted key areas for future inter-disciplinary research in AI, emphasizing the integration of ethical, sociological, and technical perspectives. This approach will foster collaborative research, bridging the gap between technological advancement and societal needs, ensuring that AI development is aligned with human values and global welfare. The roles of MoE, multimodal, and AGI in reshaping generative AI have been identified as significant, as their advancements can enhance model performance and versatility, and pave the way for future research in areas like ethical AI alignment and AGI. 
As we forge ahead, the balance between AI advancements and human creativity is not just a goal but a necessity, ensuring AI’s role as a complementary force that amplifies our capacity to innovate and solve complex challenges. Our responsibility is to guide these advancements towards enriching the human experience, aligning technological progress with ethical standards and societal well-being. ',\n", + " 'in_training': False}" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "MAX_LEN = 2000\n", + "import json\n", + "samples = json.load(open(\"../samples.json\"))\n", + "df_samples = pd.DataFrame(samples)\n", + "df_samples['len'] = df_samples['text'].str.len()\n", + "df_samples\n", + "\n", + "\n", + "sample = samples[0]\n", + "sample" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Helpers" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv() " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Exploring representation engineering (RepE) for AI transparency, we proposed new methods that improved results on TruthfulQA and presented a tool for AI safety issues. Future RepE research could further boost our control over AI systems.'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import os\n", + "from openai import OpenAI\n", + "\n", + "def summize(text):\n", + " client = OpenAI()\n", + " content = f\"Make a tl;dr of this text in <280 chars.\\n\\n## Text\\n\\n{text}\\n\\n## Instruction\\n\\nMake a tl;dr of this text in <280 chars. 
Start with the most important, as extra text will be discarded :\\n\\ntl;dr:\"\n", + " chat_completion = client.chat.completions.create(\n", + " messages=[\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": content,\n", + " }\n", + " ],\n", + " model=\"gpt-4\",\n", + " )\n", + " # print(content)\n", + " r = chat_completion.choices[0].message.content\n", + " return r\n", + "\n", + "r = summize(samples[1][\"text\"])\n", + "r" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# modified from https://github.dev/huggingface/evaluate/blob/8dfe05784099fb9af55b8e77793205a3b7c86465/measurements/perplexity/perplexity.py#L154\n", + "\n", + "# from evaluate.measurements.perplexity import Perplexity\n", + "import evaluate\n", + "from evaluate import logging\n", + "from torch.nn import CrossEntropyLoss\n", + "\n", + "# @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)\n", + "def perplexity_compute(\n", + " data, model, tokenizer, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None\n", + "):\n", + "\n", + " if device is not None:\n", + " assert device in [\"gpu\", \"cpu\", \"cuda\"], \"device should be either gpu or cpu.\"\n", + " if device == \"gpu\":\n", + " device = \"cuda\"\n", + " else:\n", + " device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", + "\n", + " # model = AutoModelForCausalLM.from_pretrained(model_id)\n", + " model = model.to(device)\n", + "\n", + " # tokenizer = AutoTokenizer.from_pretrained(model_id)\n", + "\n", + " # if batch_size > 1 (which generally leads to padding being required), and\n", + " # if there is not an already assigned pad_token, assign an existing\n", + " # special token to also be the padding token\n", + " if tokenizer.pad_token is None and batch_size > 1:\n", + " existing_special_tokens = list(tokenizer.special_tokens_map_extended.values())\n", + " # check that the model already has at least one 
special token defined\n", + " assert (\n", + " len(existing_special_tokens) > 0\n", + " ), \"If batch_size > 1, model must have at least one special token to use for padding. Please use a different model or set batch_size=1.\"\n", + " # assign one of the special tokens to also be the pad token\n", + " tokenizer.add_special_tokens({\"pad_token\": existing_special_tokens[0]})\n", + "\n", + " if add_start_token and max_length:\n", + " # leave room for token to be added:\n", + " assert (\n", + " tokenizer.bos_token is not None\n", + " ), \"Input model must already have a BOS token if using add_start_token=True. Please use a different model, or set add_start_token=False\"\n", + " max_tokenized_len = max_length - 1\n", + " else:\n", + " max_tokenized_len = max_length\n", + "\n", + " encodings = tokenizer(\n", + " data,\n", + " add_special_tokens=False,\n", + " padding=True,\n", + " truncation=True if max_tokenized_len else False,\n", + " max_length=max_tokenized_len,\n", + " return_tensors=\"pt\",\n", + " return_attention_mask=True,\n", + " ).to(device)\n", + "\n", + " encoded_texts = encodings[\"input_ids\"]\n", + " attn_masks = encodings[\"attention_mask\"]\n", + "\n", + " # check that each input is long enough:\n", + " if add_start_token:\n", + " assert torch.all(torch.ge(attn_masks.sum(1), 1)), \"Each input text must be at least one token long.\"\n", + " else:\n", + " assert torch.all(\n", + " torch.ge(attn_masks.sum(1), 2)\n", + " ), \"When add_start_token=False, each input text must be at least two tokens long. 
Run with add_start_token=True if inputting strings of only one token, and remove all empty input strings.\"\n", + "\n", + " ppls = []\n", + " loss_fct = CrossEntropyLoss(reduction=\"none\")\n", + "\n", + " for start_index in logging.tqdm(range(0, len(encoded_texts), batch_size)):\n", + " end_index = min(start_index + batch_size, len(encoded_texts))\n", + " encoded_batch = encoded_texts[start_index:end_index]\n", + " attn_mask = attn_masks[start_index:end_index]\n", + "\n", + " if add_start_token:\n", + " bos_tokens_tensor = torch.tensor([[tokenizer.bos_token_id]] * encoded_batch.size(dim=0)).to(device)\n", + " encoded_batch = torch.cat([bos_tokens_tensor, encoded_batch], dim=1)\n", + " attn_mask = torch.cat(\n", + " [torch.ones(bos_tokens_tensor.size(), dtype=torch.int64).to(device), attn_mask], dim=1\n", + " )\n", + "\n", + " labels = encoded_batch\n", + "\n", + " with torch.no_grad():\n", + " out_logits = model(encoded_batch, attention_mask=attn_mask).logits\n", + "\n", + " shift_logits = out_logits[..., :-1, :].contiguous()\n", + " shift_labels = labels[..., 1:].contiguous()\n", + " shift_attention_mask_batch = attn_mask[..., 1:].contiguous()\n", + "\n", + " perplexity_batch = torch.exp(\n", + " (loss_fct(shift_logits.transpose(1, 2), shift_labels) * shift_attention_mask_batch).sum(1)\n", + " / shift_attention_mask_batch.sum(1)\n", + " )\n", + "\n", + " ppls += perplexity_batch.tolist()\n", + "\n", + " return {\"perplexities\": ppls, \"mean_perplexity\": np.mean(ppls)}" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForMaskedLM, AutoModelForCausalLM, AutoConfig, PreTrainedTokenizerBase, PreTrainedTokenizer, GPTQConfig, BitsAndBytesConfig\n", + "\n", + "def load_model(model_name):\n", + " trust_remote_code = True\n", + " tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=trust_remote_code)\n", + " config = 
AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code)\n", + " # print(config)\n", + " if config.quantization_config is not None:\n", + " config.quantization_config['disable_exllama'] = True\n", + " if 'use_exllama' in config.quantization_config:\n", + " del config.quantization_config['use_exllama']\n", + " model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=trust_remote_code, \n", + " config=config,\n", + " )\n", + " return model, tokenizer\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "import gc\n", + "\n", + "def clear_mem():\n", + " torch.cuda.empty_cache()\n", + " gc.collect()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Results" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "models = [\n", + " \"TheBloke/phi-2-GPTQ\",\n", + " \"TheBloke/Llama-2-7B-GPTQ\",\n", + " \"TheBloke/Llama-2-13B-GPTQ\",\n", + " \"TheBloke/Mistral-7B-v0.1-GPTQ\",\n", + "]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n", + "Using pad_token, but it is not set yet.\n", + "100%|██████████| 1/1 [00:00<00:00, 4.20it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 4.14it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ bad_ml 12.456705093383789 11.44649887084961\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 5.00it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 4.92it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ good_ml 22.6639461517334 20.115413665771484\n" + ] + }, + { + "name": 
"stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 4.17it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 3.81it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ sokal hoax 14.285429000854492 14.216052055358887\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 4.48it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 4.06it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ Theory o. general relativity 20.507640838623047 19.644332885742188\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 5.02it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 5.00it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ lorem ipsum 1.1642249822616577 2.3542158603668213\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 4.48it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 4.50it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ wikipedia on LK-99 18.05230140686035 14.71042537689209\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 4.83it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 4.60it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ I have a dream 2.8362326622009277 4.256137371063232\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 4.45it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 4.28it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ AI gen fake paper 7.09130334854126 7.495457649230957\n" + ] + }, + { + "name": "stderr", + "output_type": 
"stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 3.54it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 3.71it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ Schmidhuber 2023 Subjective Novelty, Surprise 28.30998992919922 27.353872299194336\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 4.34it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 3.98it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ email_to_fauci 21.060319900512695 18.637592315673828\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 5.12it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 5.21it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ enron_email1 22.875591278076172 19.7093563079834\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 4.48it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 4.16it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/phi-2-GPTQ openai_board_ann 8.552927017211914 9.061805725097656\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Using pad_token, but it is not set yet.\n", + "100%|██████████| 1/1 [00:00<00:00, 2.40it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 2.40it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ bad_ml 7.999119281768799 7.641711235046387\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 3.24it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 2.78it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ good_ml 14.435519218444824 13.1925630569458\n" + ] + }, + { + "name": 
"stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 2.44it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 2.40it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ sokal hoax 7.42789363861084 7.882516860961914\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 2.74it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 2.70it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ Theory o. general relativity 13.795381546020508 13.612862586975098\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 3.35it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 3.29it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ lorem ipsum 1.24347722530365 2.416914463043213\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 3.21it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 2.75it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ wikipedia on LK-99 12.341974258422852 9.84363842010498\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 3.28it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 2.75it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ I have a dream 2.2682995796203613 3.098977565765381\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 2.48it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 2.47it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ AI gen fake paper 6.16485071182251 6.050373554229736\n" + ] + }, + { + "name": "stderr", + 
"output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 2.37it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 2.38it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ Schmidhuber 2023 Subjective Novelty, Surprise 13.61117172241211 13.45300006866455\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 2.52it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 2.48it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ email_to_fauci 11.317075729370117 10.221866607666016\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 3.63it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 3.44it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ enron_email1 13.783302307128906 11.154099464416504\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 1/1 [00:00<00:00, 2.71it/s]\n", + "100%|██████████| 1/1 [00:00<00:00, 2.73it/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "TheBloke/Llama-2-7B-GPTQ openai_board_ann 5.916965007781982 5.880436897277832\n" + ] + } + ], + "source": [ + "summaries = {}\n", + "data = []\n", + "for model_name in models:\n", + " model, tokenizer = load_model(model_name)\n", + " for sample in samples:\n", + " if sample['name'] not in summaries:\n", + " summaries[sample['name']] = summize(sample['text'])[:600]\n", + " summary = summaries[sample['name']]\n", + "\n", + " # before \n", + " s1 = sample['text']\n", + " results = perplexity_compute(data=s1, model=model, tokenizer=tokenizer, device='cuda')\n", + " before = results['mean_perplexity']\n", + "\n", + " # after \n", + " s2 = f\"\"\"\n", + " High level summary: {summary}\n", + "\n", + "Text:\n", + "{sample['text']}\n", + " \"\"\"\n", + " results 
= perplexity_compute(data=s2, model=model, tokenizer=tokenizer, device='cuda')\n", + " after = np.array(results['perplexities'])[-len(s1):].mean()\n", + "\n", + " print(model_name, sample['name'], before, after)\n", + " data.append(dict(before=before, after=after, model=model_name, sample=sample['name'], in_training=sample.get('in_training')))\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# results\n", + "df = pd.DataFrame(data)\n", + "df[\"learning%\"] = (df[\"before\"] - df[\"after\"])/df[\"before\"]\n", + "# df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for n,d in df.groupby(\"model\"):\n", + " print(n)\n", + " d = d[['sample', 'learning%', 'in_training']].sort_values(\"learning%\", ascending=False)\n", + " print(d.to_markdown())\n", + " display(d)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# TODO: compare big and small" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0rc1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/nbs/02_use_lora.ipynb b/nbs/02_use_lora.ipynb new file mode 100644 index 0000000..f814dd3 --- /dev/null +++ b/nbs/02_use_lora.ipynb @@ -0,0 +1,372 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "https://github.com/huggingface/peft/blob/main/examples/fp4_finetuning/finetune_fp4_opt_bnb_peft.py" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + 
"outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/media/wassname/SGIronWolf/projects5/bs_writing_detector/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "import os\n", + "\n", + "import torch\n", + "import torch.nn as nn\n", + "import transformers\n", + "from datasets import load_dataset\n", + "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n", + "import numpy as np\n", + "from peft import LoraConfig, get_peft_model, IA3Config" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n", + "\n", + "model_name = \"microsoft/phi-2\"\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " model_name,\n", + " # max_memory=max_memory,\n", + " # quantization_config=BitsAndBytesConfig(\n", + " # load_in_4bit=True,\n", + " # llm_int8_threshold=6.0,\n", + " # llm_int8_has_fp16_weight=False,\n", + " # bnb_4bit_compute_dtype=torch.float16,\n", + " # bnb_4bit_use_double_quant=True,\n", + " # bnb_4bit_quant_type=\"nf4\",\n", + " # ),\n", + " # torch_dtype=torch.float16,\n", + " trust_remote_code=True,\n", + ")\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True,)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MAX_LEN = 2000\n", + "import json\n", + "samples = json.load(open(\"../samples.json\"))\n", + "\n", + "sample = samples[0]\n", + "sample" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Helpers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# modified from 
https://github.dev/huggingface/evaluate/blob/8dfe05784099fb9af55b8e77793205a3b7c86465/measurements/perplexity/perplexity.py#L154\n", + "\n", + "# from evaluate.measurements.perplexity import Perplexity\n", + "import evaluate\n", + "from evaluate import logging\n", + "from torch.nn import CrossEntropyLoss\n", + "\n", + "# @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)\n", + "def perplexity_compute(\n", + " data, model, tokenizer, batch_size: int = 16, add_start_token: bool = True, device=None, max_length=None\n", + "):\n", + "\n", + " if device is not None:\n", + " assert device in [\"gpu\", \"cpu\", \"cuda\"], \"device should be either gpu or cpu.\"\n", + " if device == \"gpu\":\n", + " device = \"cuda\"\n", + " else:\n", + " device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", + "\n", + " # model = AutoModelForCausalLM.from_pretrained(model_id)\n", + " # model = model.to(device)\n", + "\n", + " # tokenizer = AutoTokenizer.from_pretrained(model_id)\n", + "\n", + " # if batch_size > 1 (which generally leads to padding being required), and\n", + " # if there is not an already assigned pad_token, assign an existing\n", + " # special token to also be the padding token\n", + " if tokenizer.pad_token is None and batch_size > 1:\n", + " existing_special_tokens = list(tokenizer.special_tokens_map_extended.values())\n", + " # check that the model already has at least one special token defined\n", + " assert (\n", + " len(existing_special_tokens) > 0\n", + " ), \"If batch_size > 1, model must have at least one special token to use for padding. 
Please use a different model or set batch_size=1.\"\n", + " # assign one of the special tokens to also be the pad token\n", + " tokenizer.add_special_tokens({\"pad_token\": existing_special_tokens[0]})\n", + "\n", + " if add_start_token and max_length:\n", + " # leave room for token to be added:\n", + " assert (\n", + " tokenizer.bos_token is not None\n", + " ), \"Input model must already have a BOS token if using add_start_token=True. Please use a different model, or set add_start_token=False\"\n", + " max_tokenized_len = max_length - 1\n", + " else:\n", + " max_tokenized_len = max_length\n", + "\n", + " encodings = tokenizer(\n", + " data,\n", + " add_special_tokens=False,\n", + " padding=True,\n", + " truncation=True if max_tokenized_len else False,\n", + " max_length=max_tokenized_len,\n", + " return_tensors=\"pt\",\n", + " return_attention_mask=True,\n", + " ).to(device)\n", + "\n", + " encoded_texts = encodings[\"input_ids\"]\n", + " attn_masks = encodings[\"attention_mask\"]\n", + "\n", + " # check that each input is long enough:\n", + " if add_start_token:\n", + " assert torch.all(torch.ge(attn_masks.sum(1), 1)), \"Each input text must be at least one token long.\"\n", + " else:\n", + " assert torch.all(\n", + " torch.ge(attn_masks.sum(1), 2)\n", + " ), \"When add_start_token=False, each input text must be at least two tokens long. 
Run with add_start_token=True if inputting strings of only one token, and remove all empty input strings.\"\n", + "\n", + " ppls = []\n", + " loss_fct = CrossEntropyLoss(reduction=\"none\")\n", + "\n", + " for start_index in logging.tqdm(range(0, len(encoded_texts), batch_size)):\n", + " end_index = min(start_index + batch_size, len(encoded_texts))\n", + " encoded_batch = encoded_texts[start_index:end_index]\n", + " attn_mask = attn_masks[start_index:end_index]\n", + "\n", + " if add_start_token:\n", + " bos_tokens_tensor = torch.tensor([[tokenizer.bos_token_id]] * encoded_batch.size(dim=0)).to(device)\n", + " encoded_batch = torch.cat([bos_tokens_tensor, encoded_batch], dim=1)\n", + " attn_mask = torch.cat(\n", + " [torch.ones(bos_tokens_tensor.size(), dtype=torch.int64).to(device), attn_mask], dim=1\n", + " )\n", + "\n", + " labels = encoded_batch\n", + "\n", + " with torch.no_grad():\n", + " out_logits = model(encoded_batch, attention_mask=attn_mask).logits\n", + "\n", + " shift_logits = out_logits[..., :-1, :].contiguous()\n", + " shift_labels = labels[..., 1:].contiguous()\n", + " shift_attention_mask_batch = attn_mask[..., 1:].contiguous()\n", + "\n", + " perplexity_batch = torch.exp(\n", + " (loss_fct(shift_logits.transpose(1, 2), shift_labels) * shift_attention_mask_batch).sum(1)\n", + " / shift_attention_mask_batch.sum(1)\n", + " )\n", + "\n", + " ppls += perplexity_batch.tolist()\n", + "\n", + " return {\"perplexities\": ppls, \"mean_perplexity\": torch.tensor(ppls).mean()}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Perplexity" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "results = perplexity_compute(data=sample['text'], model=model, tokenizer=tokenizer, device='cuda')\n", + "results['mean_perplexity']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Learn" + ] + }, + { + "cell_type": "code", + "execution_count": null, + 
"metadata": {}, + "outputs": [], + "source": [ + "\"\"\"### Post-processing on the model\n", + "\n", + "Finally, we need to apply some post-processing on the 8-bit model to enable training, let's freeze all our layers, and cast the layer-norm in `float32` for stability. We also cast the output of the last layer in `float32` for the same reasons.\n", + "\"\"\"\n", + "\n", + "print(model)\n", + "\n", + "for param in model.parameters():\n", + " param.requires_grad = False # freeze the model - train adapters later\n", + " if param.ndim == 1:\n", + " # cast the small parameters (e.g. layernorm) to fp32 for stability\n", + " param.data = param.data.to(torch.float32)\n", + "\n", + "# model.gradient_checkpointing_enable() # reduce number of stored activations\n", + "# model.model.decoder.project_in = lambda x: x.requires_grad_(True)\n", + "\n", + "\n", + "class CastOutputToFloat(nn.Sequential):\n", + " def forward(self, x):\n", + " return super().forward(x).to(torch.float32)\n", + "\n", + "\n", + "model.lm_head = CastOutputToFloat(model.lm_head)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "peft_config = IA3Config(\n", + " target_modules=[ \"fc2\", \"Wqkv\",], \n", + " feedforward_modules=[\"fc2\"],\n", + " inference_mode=False,\n", + ")\n", + "model = get_peft_model(model, peft_config)\n", + "model.config.use_cache = False" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Verifying the datatypes.\n", + "dtypes = {}\n", + "for _, p in model.named_parameters():\n", + " dtype = p.dtype\n", + " if dtype not in dtypes:\n", + " dtypes[dtype] = 0\n", + " dtypes[dtype] += p.numel()\n", + "total = 0\n", + "for k, v in dtypes.items():\n", + " total += v\n", + "for k, v in dtypes.items():\n", + " print(k, v, v / total)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + 
"\"\"\"### Training\"\"\"\n", + "from datasets import Dataset\n", + "\n", + "# data = load_dataset(\"Abirate/english_quotes\")\n", + "data = Dataset.from_dict({\"text\": [sample['text'][:len(sample['text'])//2]]*100})\n", + "data = data.map(lambda samples: tokenizer(samples[\"text\"]), batched=True).with_format(\"torch\")\n", + "data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from torch.utils.data import DataLoader\n", + "# batch.keys()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)\n", + "model.train()\n", + "model = model.to('cuda')\n", + "for epoch in range(10):\n", + " for batch in DataLoader(data, batch_size=1):\n", + " b_in = {'input_ids': batch['input_ids'].to('cuda').to(dtype), 'attention_mask': batch['attention_mask'].to('cuda').to(dtype)}\n", + " optimizer.zero_grad()\n", + " loss = model(**batch).loss\n", + " loss.backward()\n", + " optimizer.step()\n", + " print(loss.item())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "results2 = perplexity_compute(data=sample['text'], model=model, tokenizer=tokenizer, device='cuda')\n", + "results['mean_perplexity'], results2['mean_perplexity']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0rc1" + } + }, + "nbformat": 4, + "nbformat_minor": 2 
+} diff --git a/poetry.lock b/poetry.lock index 6a234fb..1f0768d 100644 --- a/poetry.lock +++ b/poetry.lock @@ -663,6 +663,17 @@ files = [ {file = "distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed"}, ] +[[package]] +name = "einops" +version = "0.7.0" +description = "A new flavour of deep learning operations" +optional = false +python-versions = ">=3.8" +files = [ + {file = "einops-0.7.0-py3-none-any.whl", hash = "sha256:0f3096f26b914f465f6ff3c66f5478f9a5e380bb367ffc6493a68143fbbf1fd1"}, + {file = "einops-0.7.0.tar.gz", hash = "sha256:b2b04ad6081a3b227080c9bf5e3ace7160357ff03043cd66cc5b2319eb7031d1"}, +] + [[package]] name = "evaluate" version = "0.4.1" @@ -2781,6 +2792,20 @@ files = [ [package.dependencies] mpmath = ">=0.19" +[[package]] +name = "tabulate" +version = "0.9.0" +description = "Pretty-print tabular data" +optional = false +python-versions = ">=3.7" +files = [ + {file = "tabulate-0.9.0-py3-none-any.whl", hash = "sha256:024ca478df22e9340661486f85298cff5f6dcdba14f3813e8830015b9ed1948f"}, + {file = "tabulate-0.9.0.tar.gz", hash = "sha256:0095b12bf5966de529c0feb1fa08671671b3368eec77d7ef7ab114be2c068b3c"}, +] + +[package.extras] +widechars = ["wcwidth"] + [[package]] name = "threadpoolctl" version = "3.2.0" @@ -3376,4 +3401,4 @@ multidict = ">=4.0" [metadata] lock-version = "2.0" python-versions = ">=3.10,<3.13" -content-hash = "6db4fa9fb490014cf343d01f7425a26dc498aeece735a58a626a8ed2d3be806a" +content-hash = "14d2aa11627a0f3e72e1eaefecf4a6ee762fb5bcc2f66caa2bbf1869f74c7031" diff --git a/pyproject.toml b/pyproject.toml index 1a9fc54..9a2912d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -23,6 +23,8 @@ evaluate = "^0.4.1" auto-gptq = "^0.6.0" openai = "^1.6.1" python-dotenv = "^1.0.0" +einops = "^0.7.0" +tabulate = "^0.9.0" [[tool.poetry.source]] name = "pytorch" diff --git a/samples.json b/samples.json new file mode 100644 index 0000000..ae663c8 --- /dev/null +++ b/samples.json @@ -0,0 +1 @@ 
+[{"name": "bad_ml", "url": "https://arxiv.org/abs/2312.10868", "text": "This roadmap survey has embarked on an exploration of the\ntransformative trends in generative AI research, particularly focusing on speculated advancements like Q* and the progressive strides towards AGI. Our analysis highlights a crucial paradigm shift, driven by innovations such as MoE, multi-modal learning, and the pursuit of AGI. These advancements signal a future where AI systems could significantly extend their capabilities in reasoning, contextual understanding, and creative problem-solving. This study reflects on AI\u2019s dual potential to either contribute to or impede global equity and justice. The equitable distribution of AI benefits and its role in decision-making processes raise crucial questions about fairness and inclusivity. It is imperative to thoughtfully integrate AI into societal structures to enhance justice and reduce disparities. Despite these advancements, several open questions and research gaps remain. These include ensuring the ethical alignment of advanced AI systems with human values and societal norms, a challenge compounded by their increasing autonomy. The safety and robustness of AGI systems in diverse environments also remain a significant research gap. Addressing these challenges requires a multidisciplinary approach, incorporating ethical, social, and philosophical perspectives. Our survey has highlighted key areas for future inter-disciplinary research in AI, emphasizing the integration of ethical, sociological, and technical perspectives. This approach will foster collaborative research, bridging the gap between technological advancement and societal needs, ensuring that AI development is aligned with human values and global welfare. 
The roles of MoE, multimodal, and AGI in reshaping generative AI have been identified as significant, as their advancements can enhance model performance and versatility, and pave the way for future research in areas like ethical AI alignment and AGI. As we forge ahead, the balance between AI advancements and human creativity is not just a goal but a necessity, ensuring AI\u2019s role as a complementary force that amplifies our capacity to innovate and solve complex challenges. Our responsibility is to guide these advancements towards enriching the human experience, aligning technological progress with ethical standards and societal well-being. ", "in_training": false}, {"name": "good_ml", "url": "https://arxiv.org/abs/2310.01405", "text": "We explored representation engineering (RepE), an approach to top-down transparency for AI systems. Inspired by the Hopfieldian view in cognitive neuroscience, RepE places representations and the transformations between them at the center of analysis. As neural networks exhibit more coherent internal structures, we believe analyzing them at the representation level can yield new insights, aiding in effective monitoring and control. Taking early steps in this direction, we proposed new RepE methods, which obtained state-of-the-art on TruthfulQA, and we demonstrated how RepE and can provide traction on a wide variety of safety-relevant problems. While we mainly analyzed subspaces of representations, future work could investigate trajectories, manifolds, and state-spaces of representations. 
We hope this initial step in exploring the potential of RepE helps to foster new insights into understanding and controlling AI systems, ultimately ensuring that future AI systems are trustworthy and safe.", "in_training": false}, {"name": "sokal hoax", "url": "www.physics.nyu.edu/faculty/sokal/transgress_v2/transgress_v2_singlefile.html", "text": " There are many natural scientists, and especially physicists, who continue to reject the notion that the disciplines concerned with social and cultural criticism can have anything to contribute, except perhaps peripherally, to their research. Still less are they receptive to the idea that the very foundations of their worldview must be revised or rebuilt in the light of such criticism. Rather, they cling to the dogma imposed by the long post-Enlightenment hegemony over the Western intellectual outlook, which can be summarized briefly as follows: that there exists an external world, whose properties are independent of any individual human being and indeed of humanity as a whole; that these properties are encoded in ``eternal'' physical laws; and that human beings can obtain reliable, albeit imperfect and tentative, knowledge of these laws by hewing to the ``objective'' procedures and epistemological strictures prescribed by the (so-called) scientific method.\n\n But deep conceptual shifts within twentieth-century science have undermined this Cartesian-Newtonian metaphysics1; revisionist studies in the history and philosophy of science have cast further doubt on its credibility2; and, most recently, feminist and poststructuralist critiques have demystified the substantive content of mainstream Western scientific practice, revealing the ideology of domination concealed behind the fa\u00e7ade of ``objectivity''.3 It has thus become increasingly apparent that physical ``reality'', no less than social ``reality'', is at bottom a social and linguistic construct; that scientific ``knowledge\", far from being objective, reflects 
and encodes the dominant ideologies and power relations of the culture that produced it; that the truth claims of science are inherently theory-laden and self-referential; and consequently, that the discourse of the scientific community, for all its undeniable value, cannot assert a privileged epistemological status with respect to counter-hegemonic narratives emanating from dissident or marginalized communities. These themes can be traced, despite some differences of emphasis, in Aronowitz's analysis of the cultural fabric that produced quantum mechanics4; in Ross' discussion of oppositional discourses in post-quantum science5; in Irigaray's and Hayles' exegeses of gender encoding in fluid mechanics6; and in Harding's comprehensive critique of the gender ideology underlying the natural sciences in general and physics in particular.7 ", "in_training": true}, {"name": "Theory o. general relativity", "url": "", "text": "In recent years I have worked, in part together with my friend Grossman, on a [1] generalization of the theory of relativity. During these investigations, a kaleidoscopic mixture of postulates from physics and mathematics has been introduced and used as heuristical tools; as a consequence it is not easy to see through and characterize the theory from a formal mathematical point of view, that is, only based upon these papers. The primary objective of the present paper is to close this gap. In particular, it has been possible to obtain the equations of the gravitational field in a purely covariance-theoretical manner (section D). I also tried to give simple derivations of the basic laws of absolute differential calculus-in part, they are probably new ones (section B)-in order to allow the reader to get a complete grasp of the theory without having to read other, purely mathematical tracts. 
As an illustration of the mathematical methods, I derived the (Eulerian) equations of hydrodynamics and the field equations of the electrodynamics of moving bodies (section C). Section E shows that Newton's theory of gravitation follows from the general theory as an approximation. The most elementary features of the present theory are also derived inasfar as [2] they are characteristic of a Newtonian (static) gravitational field (curvature of light rays, shift of spectral ", "in_training": true}, {"name": "lorem ipsum ", "url": "", "text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.", "in_training": true}, {"name": "wikipedia on LK-99", "url": "https://en.wikipedia.org/wiki/LK-99", "text": "Some small LK-99 samples were reported to show strong diamagnetic properties, including a response confusingly[23] referred to as \"partial levitation\" over a magnet.[18] This was misinterpreted by some as a sign of superconductivity, although it is a sign of regular diamagnetism or ferromagnetism.\n\nWhile initial preprints claimed the material was a room-temperature superconductor,[18]:\u200a1\u200a they did not report observing any definitive features of superconductivity, such as zero resistance, the Meissner effect, flux pinning, AC magnetic susceptibility, the Josephson effect, a temperature-dependent critical field and current, or a sudden jump in specific heat around the critical temperature.[24]\n\nAs it is common for a new material to spuriously seem like a potential candidate for high-temperature superconductivity,[13] thorough experimental reports normally demonstrate a 
number of these expected properties. As of 15 October 2023, not one of these properties had been observed by the original experiment or any replications.[25] ", "in_training": false}, {"name": "I have a dream", "url": "", "text": "So even though we face the difficulties of today and tomorrow, I still have a dream. It is a dream deeply rooted in the American dream. I have a dream that one day this nation will rise up and live out the true meaning of its creed: We hold these truths to be self-evident, that all men are created equal.\n\nPeople clap and sing along to a freedom song between speeches at the March on Washington for Jobs and Freedom in 1963.\nExpress Newspapers via Getty Images\n\nI have a dream that one day on the red hills of Georgia, the sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood.\n\nI have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression will be transformed into an oasis of freedom and justice.", "in_training": true}, {"name": "AI gen fake paper", "url": "", "text": "\nMachine Learning (ML) as a field has largely embraced binary constructs as foundational to its functioning, with 0/1 and true/false distinctions underpinning many of its core algorithms. This paper argues that these binary frameworks are not merely technical conveniences but are, in fact, reflective of deeper hegemonic paradigms that perpetuate exclusionary practices and systemic biases. By uncritically adopting these binaries, ML inadvertently reinforces a worldview that marginalizes complex, nuanced identities and experiences.\n\nWe propose a critical examination of these binaries, questioning the necessity and ubiquity of dualistic thinking within ML. 
We suggest that the field's reliance on binary classification not only limits its predictive accuracy in certain contexts but also fails to capture the rich, fluid nature of human experiences and societal structures. Instead, we advocate for a \"fluid\" approach to algorithms that allows for more nuanced and inclusive representations of reality, rejecting the oversimplified and often exclusionary nature of strict binary outcomes.\n\nFurthermore, this paper argues for the integration of intersectional data that reflects the diverse and overlapping categories of identity, including race, gender, class, and more. Current ML models often overlook these complexities, leading to outcomes that fail to serve, and even harm, underrepresented populations. We critique the prevailing notion that ML is a neutral, objective tool, highlighting the lack of socio-political context in algorithmic decision-making processes.\n\nIn conclusion, we call for an epistemological shift in the field of Machine Learning. This shift involves moving away from a purely positivistic, binary approach towards one that is reflective, inclusive, and aware of the social dimensions of technology. By reimagining the foundational paradigms of ML, we can work towards a more equitable and nuanced understanding of the world, one that respects and represents the full spectrum of human experience.\n", "in_training": false}, {"name": "Schmidhuber 2023 Subjective Novelty, Surprise", "url": "https://arxiv.org/pdf/0812.4360.pdf", "text": "We pointed out that a surprisingly simple algorithmic principle based on the notions\nof data compression and data compression progress informally explains fundamental aspects of attention, novelty, surprise, interestingness, curiosity, creativity, subjective beauty, jokes, and science & art in general. 
The crucial ingredients of the corresponding formal framework are (1) a continually improving predictor or compressor of the continually growing data history, (2) a computable measure of the compressor\u2019s progress (to calculate intrinsic rewards), (3) a reward optimizer or reinforcement learner translating rewards into action sequences expected to maximize future reward. To improve our previous implementations of these ingredients (Section 3), we will (1) study better adaptive compressors, in particular, recent, novel RNNs [94] and other general but practically feasible methods for making predictions [75]; (2) investigate under which conditions learning progress measures can be computed both accurately and efficiently, without frequent expensive compressor performance evaluations on the entire history so far; (3) study the applicability of recent improved RL techniques in the fields of policy gradients [110, 119, 118, 56, 100, 117], artificial evolution [43, 20, 21, 19, 22, 23, 24], and others [71, 75]. Apart from building improved artificial curious agents, we can test the predictions of our theory in psychological investigations of human behavior, extending previous studies in this vein [32] and going beyond anecdotal evidence mentioned above. It should be easy to devise controlled experiments where test subjects must anticipate initially unknown but causally connected event sequences exhibiting more or less complex, learnable patterns or regularities. The subjects will be asked to quantify their intrinsic rewards in response to their improved predictions. Is the reward indeed strongest when the predictions are improving most rapidly? Does the intrinsic reward indeed vanish as the predictions become perfect or do not improve any more? Finally, how to test our predictions through studies in neuroscience? Currently we hardly understand the human neural machinery. 
But it is well-known that certain neurons seem to predict others, and brain scans show how certain brain areas light up in response to reward. Therefore the psychological experiments suggested above should be accompanied by neurophysiological studies to localize the origins of intrinsic rewards, possibly linking them to improvements of neural predictors. Success in this endeavor would provide additional motivation to implement our principle on robots.", "in_training": false}, {"name": "email_to_fauci", "url": "https://s3.documentcloud.org/documents/20793561/leopold-nih-foia-anthony-fauci-emails.pdf", "text": "Dear Doctor Fauci,\nI am writing to you as a member of the SPEACproject which CEPI has funded to assist with the\nevaluation of the safety of vaccines in their portfolio. As part of this effort, we are assisting\nwith developing preclinical and clinical test ing criteria to evaluate the risk of enhanced disease\nfol lowing vaccination with COVID-19 vaccine cand idat es. As you know, this had been an issue\nwith some prior SARSvaccine candidates.\nWe are convening a two day virtual meeting of experts via video conference next week on\nMarch 12 and 13 between 8 am and 1 pm Eastern tim e each day. Participants in the meet ing\nare shown in the attached agenda but include Paul Henri Lambert from Geneva and Barney\nGraham from NIH. The meeting will actively invo lve the partic ipants on the agenda the first day\nNIH-000981\nNIH-000982and on the second day the meeting will be open for several peer reviewers including Stanley\nPlotkin and Andy Pollard to review and comment on possible small and NHP animal models as\nwell as appropriate immunologic testing to be done in early phase one trials.\nI am wanted to make you aware of the meeting so that you could attend all or part as an\nobserver if you wish but also to invit e you to consider joining on day two as one of our formal\npeer reviewers. 
The goal of the meeting would be to share recommendations with CEPI\nCOVID-19 developers as well as other interested parties.\nAny comments you have on the agenda or draft questions for consideration would be greatly\nappreciated.\nI look forward to hear ing back from you", "in_training": false}, {"name": "enron_email1", "url": "http://www.enron-mail.com/email/lay-k/discussion_threads/References_for_Board_Appointment_1.html", "text": "Ken,\n\nBruce Peterson at Korn/Ferry International has offered to suggest my name to\nvarious corporations for board appointments. I plan to leave El Paso at the\nend of the year. Please find attached a draft of my background information.\nMay I use you as a reference? Let me know if this meets with your approval.\nI enjoyed seeing you and your family at the convention. Thank you in\nadvance for your assistance.\n\nNancy\n\n<