Files
adapters_as_hypotheses/docs/ia3_few_shot_peft.md
T
wassname 3800481a30 Initial commit: adapters as representational hypotheses catalog
30 PEFT methods reframed as hypotheses about transformer geometry.
Each entry: pseudocode, hypothesis, evidence, grade.
All papers saved to docs/ (full text).
2026-02-22 19:48:37 +08:00

79 KiB
Raw Blame History

Title: 2205.05638v2.pdf

URL Source: https://arxiv.org/pdf/2205.05638

Published Time: Mon, 23 Jan 2023 14:43:06 GMT

Number of Pages: 23

Markdown Content:

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

Haokun Liu Derek Tam Mohammed Muqeeth

Jay Mohta Tenghao Huang Mohit Bansal Colin Raffel

Department of Computer Science University of North Carolina at Chapel Hill

{haokunl,dtredsox,muqeeth,craffel}@cs.unc.edu

Abstract

Few-shot in-context learning (ICL) enables pre-trained language models to per-form a previously-unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (PEFT) (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and PEFT and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new PEFT method called (IA) 3 that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. We also propose a simple recipe based on the T0 model [ 1 ] called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark [ 2 ], attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute. All of the code used in our experiments is publicly available. 1

1 Introduction

Pre-trained language models have become a cornerstone of natural language processing, thanks to the fact that they can dramatically improve data efficiency on tasks of interest i.e., using a pre-trained language model for initialization often produces better results with less labeled data. A historically common approach has been to use the pre-trained models parameters for initialization before performing gradient-based fine-tuning on a downstream task of interest. While fine-tuning has produced many state-of-the-art results [ 1], it results in a model that is specialized for a single task with an entirely new set of parameter values, which can become impractical when fine-tuning a model on many downstream tasks. An alternative approach popularized by [ 3, 4] is in-context learning (ICL), which induces a model to perform a downstream task by inputting prompted examples. Few-shot prompting converts a small collection of input-target pairs into (typically) human-understandable instructions and examples [3, 4 ], along with a single unlabeled example for which a prediction is desired. Notably, ICL requires no gradient-based training and therefore allows a single model to immediately perform a wide variety of tasks. Performing ICL therefore solely relies on the capabilities that a model learned during pre-training. These characteristics have led to a great deal of recent interest in ICL methods [510].

Equal contribution.

1

https://github.com/r-three/t-few

Preprint. Under review.

arXiv:2205.05638v2 [cs.LG] 26 Aug 2022

V K Q

softmax Dense Nonlinearity Dense

T0

Susie loves her grandma's banana bread. Susie called her grandma and asked her to send some. Grandma lived very far away. A week passed and grandma surprised Susie by coming to visit. What is a possible continuation for the story? Susie was so happy. Susie was upset.

(IA) 3 Losses used in T-Few Figure 1: Diagram of (IA) 3 and the loss terms used in the T-Few recipe. Left: (IA) 3 introduces the learned vectors lk, l v, and lff which respectively rescale (via element-wise multiplication, visualized as

) the keys and values in attention mechanisms and the inner activations in position-wise feed-forward networks. Right: In addition to a standard cross-entropy loss LLM , we introduce an unlikelihood loss

LUL that lowers the probability of incorrect outputs and a length-normalized loss LLN that applies a standard softmax cross-entropy loss to length-normalized log-probabilities of all output choices. Despite the practical benefits of ICL, it has several major drawbacks. First, processing all prompted input-target pairs every time the model makes a prediction incurs significant compute costs. Second, ICL typically produces inferior performance compared to fine-tuning [ 4 ]. Finally, the exact formatting of the prompt (including the wording [ 11 ] and ordering of examples [ 12 ]) can have significant and unpredictable impact on the models performance, far beyond inter-run variation of fine-tuning. Recent work has also demonstrated that ICL can perform well even when provided with incorrect labels, raising questions as to how much learning is taking place at all [9]. An additional paradigm for enabling a model to perform a new task with minimal updates is parameter-efficient fine-tuning (PEFT), where a pre-trained model is fine-tuned by only updating a small number of added or selected parameters. Recent methods have matched the performance of fine-tuning the full model while only updating or adding a small fraction (e.g. 0.01%) of the full models parameters [13 , 14 ]. Furthermore, certain PEFT methods allow mixed-task batches where different examples in a batch are processed differently [14], making both PEFT and ICL viable for multitask models. While the benefits of PEFT address some shortcomings of fine-tuning (when compared to ICL), there has been relatively little focus on whether PEFT methods work well when very little labeled data is available. Our primary goal in this paper is to close this gap by proposing a recipe i.e., a model, a PEFT method, and a fixed set of hyperparameters that attains strong performance on novel, unseen tasks while only updating a tiny fraction of the models parameters. Specifically, we base our approach on the T0 model [ 1], a variant of T5 [ 15 ] fine-tuned on a multitask mixture of prompted datasets. To improve performance on classification and multiple-choice tasks, we add unlikelihood [ 16 , 17 ]and length normalization-based [ 4] loss terms. In addition, we develop (IA) 3, a PEFT method that multiplies intermediate activations by learned vectors. (IA) 3 attains stronger performance than full-model fine-tuning while updating up to 10,000 × fewer parameters. Finally, we demonstrate the benefits of pre-training the (IA) 3 parameters before fine-tuning [ 18 , 19 ]. Our overall recipe, which we dub “ T-Few ”, performs significantly better than ICL (even against 16 × larger models) and outperforms humans for the first time on the real-world few-shot learning benchmark RAFT [ 2]while requiring dramatically less compute and allowing for mixed-task batches during inference. To facilitate the use of T-Few on new problems and future research on PEFT, we release our code. 1

After providing background on ICL and PEFT in the following section, we discuss the design of

T-Few in section 3. In section 4, we present experiments comparing T-Few to strong ICL baselines. Finally, we discuss related work in appendix B and conclude in section 5.

2 Background

In this section, we provide am verview of ICL and PEFT with a focus on characterizing the com-putation, memory, and on-disk storage costs of making a prediction. Real-world costs depend on implementation and hardware, so we report costs in terms of FLOPs for computation and bytes for memory and storage, respectively. Additional related work is discussed in appendix B.

2.1 Few-shot in-context learning (ICL)

ICL [ 3, 4] aims to induce a model to perform a task by feeding in concatenated and prompted input-target examples (called “shots”) along with an unlabeled query example. Taking the cycled 2letter task from Brown et al. [4] as an example, a 4-shot input or context would be “ Please unscramble the letters into a word, and write that word: asinoc = casino, yfrogg = froggy, plesim = simple, iggestb = biggest, astedro = ”, for which the desired output would be “ roasted ”. ICL induces an autoregressive language model to perform this task by feeding in the context and sampling from the model. For classification tasks, each label is associated with a string (e.g. “ positive ” and “ negative ” for sentiment analysis) and a label is assigned by choosing the label string that the model assigns the highest probability to. For multiple-choice tasks (e.g. choosing between N possible answers to a question), the models prediction is similarly determined by determining which choice is assigned the highest probability. The primary advantage of ICL is that it enables a single model to perform many tasks immediately without fine-tuning. This also enables mixed-task batches , where different examples in a batch of data correspond to different tasks by using different contexts in the input. ICL is also typically performed with only a limited number of labeled examples called few-shot learning making it data-efficient. Despite these advantages, ICL comes with significant practical drawbacks: First, making a prediction is dramatically more expensive because the model needs to process all of the in-context labeled examples. Specifically, ignoring the quadratic complexity of self-attention operations in Transformer language models (which are typically small compared to the costs of the rest of the model [ 20 ]), processing the k training examples for k-shot ICL increases the computational cost by approximately

k + 1 times compared to processing the unlabeled example alone. Memory costs similarly scale approximately linearly with k, though during inference the memory costs are typically dominated by storing the models parameters. Separately, there is a small amount of on-disk storage required for storing the in-context examples for a given task. For example, storing 32 examples for a task where the prompted input and target for each example is 512 tokens long would require about 66 kilobytes of storage on disk ( 32 examples × 512 tokens × 32 bits). Beyond the aforementioned costs, ICL also exhibits unintuitive behavior. Zhao et al. [12] showed that the ordering of examples in the context heavily influences the models predictions. Min et al. [9] showed that ICL can still perform well even if the labels of the in-context examples are swapped (i.e. made incorrect), which raises questions about whether ICL is really “learning” from the labeled examples. Various approaches have been proposed to mitigate these issues. One way to decrease computational costs is to cache the key and value vectors for in-context examples. This is possible because decoder-only Transformer language models have a causal masking pattern, so the models activations for the context do not do not depend on the unlabeled example. In an extreme case, 32 -shot ICL with 512

tokens per in-context example would result in over 144 gigabytes of cached key and value vectors for the GPT-3 model ( 32 examples × 512 tokens × 96 layers × 12288 d model × 32 bits each for the key and value vectors). Separately, Min et al. [21] proposed ensemble ICL , where instead of using the output probability from concatenating the k training examples, the output probabilities of the model on each training example (i.e. 1-shot ICL for each of the k examples) are multiplied together. This lowers the non-parameter memory cost by a factor of k/ 2 but increases the computational cost by a factor of 2. In terms of task performance, Min et al. [21] find that ensemble ICL outperforms the standard concatenative variant.

2.2 Parameter-efficient fine-tuning

While standard fine-tuning updates all parameters of the pre-trained model, it has been demonstrated that it is possible to instead update or add a relatively small number of parameters. Early methods proposed adding adapters [22 24 ], which are small trainable feed-forward networks inserted between the layers in the fixed pre-trained model. Since then, various sophisticated PEFT methods have been proposed, including methods that choose a sparse subset of parameters to train [ 25 , 26 ], produce low-rank updates [ 13 ], perform optimization in a lower-dimensional subspace [ 27 ], add low-rank adapters using hypercomplex multiplication [ 28 ], and more. Relatedly, prompt tuning [14 ] and prefix tuning [29 ] concatenate learned continuous embeddings to the models input or activations to induce it to perform a task; this can be seen as a PEFT method [ 30 ]. State-of-the-art PEFT methods can match the performance of fine-tuning all of the models parameters while updating only a tiny fraction (e.g. 0.01%) of the models parameters. PEFT drastically reduces the memory and storage requirements for training and saving the model. In addition, certain PEFT methods straightforwardly allow mixed-task batches for example, prompt 3tuning enables a single model to perform many tasks simply by concatenating different prompt embeddings to each example in the batch [ 14 ]. On the other hand, PEFT methods that re-parameterize the model (e.g. [ 27 , 13 ]) are costly or onerous for mixed-task batches. Separately, different PEFT methods increase the computation and memory required to perform inference by different amounts. For example, adapters effectively add additional (small) layers to the model, resulting in small but non-negligible increases in computational costs and memory. An additional cost incurred by PEFT is the cost of fine-tuning itself, which must be performed once and is then amortized as the model is used for inference. However, we will show that PEFT can be dramatically more computationally efficient when considering both fine-tuning and inference while achieving better accuracy than ICL.

3 Designing the T-Few Recipe

Given that PEFT allows a model to be adapted to a new task with relatively small storage requirements and computational cost, we argue that PEFT presents a promising alternative to ICL. Our goal is therefore to develop a recipe that allows a model to attain high accuracy on new tasks with limited labeled examples while allowing mixed-task batches during inference and incurring minimal computational and storage costs. By recipe , we mean a specific model and hyperparameter setting that provides strong performance on any new task without manual tuning or per-task adjustments. In this way, we can ensure that our approach is a realistic option in few-shot settings where limited labeled data is available for evaluation [31, 32].

3.1 Model and Datasets

As a first step, we must choose a pre-trained model. Ideally, the model should attain high performance on new tasks after fine-tuning on a limited number of labeled examples. In preliminary experiments applying PEFT methods to different pre-trained models, we attained the best performance with T0 [1]. T0 is based on T5 [ 15 ], an encoder-decoder Transformer model [ 33 ] that was pre-trained via a masked language modeling objective [ 34 ] on a large corpus of unlabeled text data. T0 was created by fine-tuning T5 on a multitask mixture of datasets in order to enable zero-shot generalization, i.e. the ability to perform tasks without any additional gradient-based training. Examples in the datasets used to train T0 were prompted by applying the prompt templates from the Public Pool of Prompts (P3 [35 ]), which convert each example in each dataset to a prompted text-to-text format where each label corresponds to a different string. For brevity, we omit a detailed description of T0 and T5; interested readers can refer to Sanh et al. [1] and Raffel et al. [15] . T0 was released in three billion and eleven billion parameter variants, referred to as “T0-3B” and simply “T0” respectively. In this section (where our goal is to design the T-Few recipe through extensive experimentation), we use T0-3B to reduce computational costs. For all models and experiments, we use Hugging Face Transformers [36]. While T0 was designed for zero-shot generalization, we will demonstrate that it also attains strong performance after fine-tuning with only a few labeled examples. To test T0s generalization, Sanh et al. [1] chose a set of tasks (and corresponding datasets) to hold out from the multitask training mixture specifically, sentence completion (COPA [ 37 ], H-SWAG [ 38 ], and Story Cloze [ 39 ] datasets), natural language inference (ANLI [ 40 ], CB [ 41 ], and RTE [ 42 ]), coreference resolution (WSC [ 43 ]and Winogrande [ 44 ]), and word sense disambiguation (WiC [ 45 ]). Evaluation of generalization capabilities can then be straightforwardly done by measuring performance on these held-out datasets. We also will later test T-Few s abilities in the RAFT benchmark [ 2] in section 4.3, a collection of unseen “real-world” few-shot tasks with no validation set and a held-out test set. ANLI, WiC, WSC is licensed under a Creative Commons License. Winogrande is licnsed under an Apache license. COPA is under a BSD-2 Clause license. We could not find the license of RTE and CB but they are part of SuperGLUE which mentions the datasets are allowed for use in research context. To ease comparison, we use the same number of few-shot training examples for each dataset as Brown et al. [4] , which varies from 20 to 70. Unfortunately, the few-shot dataset subsets used by Brown et al. [4] have not been publicly disclosed. To allow for a more robust comparison, we therefore constructed five few-shot datasets by sampling subsets with different seeds and report the median and interquartile range. We prompt examples from each dataset using the prompt templates from P3 Bach et al. [35] , using a randomly-sampled prompt template for each example at each step. Unless otherwise stated, we train our model for 1K steps with a batch size of 8 and report performance at the end of training. For evaluation, we use “rank classification”, where the models log-probabilities for all possible label strings are ranked and the models prediction is considered correct if the highest-ranked choice is the 4correct answer. Rank classification evaluation is compatible with both classification and multiple-choice tasks. Since model performance can vary significantly depending on the prompt template used, we report the median accuracy across all prompt templates from P3 and across few-shot data subsets for each dataset. For all datasets, we report the accuracy on the test set or validation set when the test labels are not public (e.g. SuperGLUE datasets). In the main text, we report median accuracy across the nine datasets mentioned above. Detailed results on each dataset are provided in the appendices.

3.2 Unlikelihood Training and Length Normalization

Before investigating PEFT methods, we first explore two additional loss terms to improve the performance of few-shot fine-tuning of language models. Language models are normally trained with cross-entropy loss LLM = 1

T

t

log p(yt|x, y <t ) where the model is trained to increase the probability of the correct target sequence y = ( y1, y 2, . . . , y T ) given the input sequence x.For evaluation, we use rank classification (described in section 3.1) which depends on both the probability that the model assigns to the correct choice as well as the probabilities assigned by the model to the incorrect choices. To account for this during training, we consider adding an unlikelihood loss [16, 17]:

LUL =

∑Nn=1

∑T (n)

t=1

log(1 p(ˆ y(n)

i

|x, ˆy(n)

<t

))

∑Nn=1 T (n) (1) which discourages the model from predicting tokens from incorrect target sequences, where ˆy(n) =(ˆ y1, ˆy2, . . . , ˆyT (n) ) is the n-th of N incorrect target sequences. We hypothesize that adding LUL will improve results on rank classification because the model will be trained to assign lower probabilities to incorrect choices, thereby improving the chance that the correct choice is ranked highest. The possible target sequences for a given training example can have significantly different lengths, especially in multiple-choice tasks. Ranking each choice based on probability can therefore “favor” shorter choices because the models assigned probability to each token is ≤ 1. To rectify this, we consider using length normalization when performing rank classification, which divides the models score on each possible answer choice by the number of tokens in the choice (as used in GPT-3 [ 4 ]). When using length normalization during evaluation, we introduce an additional loss term during training that more closely reflects length-normalized evaluation. First, we compute the length-normalized log probability of a given output sequence β(x, y) = 1

T

∑Tt=1 log p(yt|x, y <t ).Then, we maximize the length-normalized log probability of the correct answer choice by minimizing the softmax cross-entropy loss:

LLN = log exp( β(x, y)) exp( β(x, y)) + ∑Nn=1 exp( β(x, ˆy(n))) (2) When training a model with LLM , LUL , and LLN , we simply sum them. This avoids introducing any hyperparameters that would be problematic to tune in the few-shot setting (where realistically-sized validation sets are tiny by necessity [31, 32]). We report the results of fine-tuning all of T0-3Bs parameters with and without length normalization on all datasets in appendix C. We find that adding LLN improves the accuracy from 60.7% to 62.71% and including both LUL and LLN provides a further improvement to 63.3%. Since these loss terms improve performance without introducing any additional hyperparameters, we include them in our recipe and use them in all following experiments.

3.3 Parameter-efficient fine-tuning with (IA) 3

In order to compare favorably to few-shot ICL, we need a PEFT method that has the following properties: First, it must add or update as few parameters as possible to avoid incurring storage and memory costs. Second, it should achieve strong accuracy after few-shot training on new tasks. Finally, it must allow for mixed-task batches, since that is a capability of ICL. In order to easily enable mixed-task batches, a PEFT method should ideally not modify the model itself. Otherwise, each example in a batch would effectively need to be processed by a different model or computational graph. A more convenient alternative is provided by methods that directly modify the activations of the model since this can be done independently and cheaply to each example in the batch according to which task the example corresponds to. Prompt tuning and prefix tuning methods [ 14 , 29 ] work by concatenating learned vectors to activation or embedding sequences and are therefore examples of activation-modifying PEFT methods that allow for mixed-task batches. However, as we will discuss 5later, we were unable to attain reasonable accuracy with prompt tuning and found that the more performant PEFT methods did not allow for mixed-task batches. We therefore developed a new PEFT method that meets our desiderata. As an alternative, we explored element-wise multiplication (i.e. rescaling) of the models activations against a learned vector. Specifically, we consider adaptation of the form l x where l ∈ Rd is a learned task-specific vector, represents element-wise multiplication, and x ∈ RT ×d is a length-T

sequence of activations. We use “broadcasting notation” [ 46 ] so that the (i, j )th entry of l x is lj xi,j .In preliminary experiments, we found it was not necessary to introduce a learned rescaling vector for each set of activations in the Transformer model. Instead, we found it was sufficient to introduce rescaling vectors on the keys and values in self-attention and encoder-decoder attention mechanisms and on the intermediate activation of the position-wise feed-forward networks. Specifically, using the notation from Vaswani et al. [33] , we introduce three learned vectors lk ∈ Rdk , l v ∈ Rdv , and

lff ∈ Rdff , which are introduced into the attention mechanisms as:

softmax

( Q(lk KT )

√dk

)

(lv V )

and in the position-wise feed-forward networks as (lff γ(W1x)) W2, where γ is the feed-forward network nonlinearity. We introduce a separate set of lk, l v, and lff vectors in each Transformer layer block. This adds a total of L(dk + dv + dff ) new parameters for a L-layer-block Transformer encoder and L(2 dk + 2 dv + dff ) (with factors of 2 accounting for the presence of both self-attention and encoder-decoder attention) for a L-layer-block decoder. lk, l v, and lff are all initialized with ones so that the overall function computed by the model does not change when they are added. We call our method (IA) 3, which stands for “Infused Adapter by Inhibiting and Amplifying Inner Activations”.

(IA) 3 makes mixed-task batches possible because each sequence of activations in the batch can be separately and cheaply multiplied by its associated learned task vector. We also note that, in the event that a model will only be used on a single task, the modifications introduced by (IA) 3 can also be applied to weight matrices permanently so that no elementwise multiplication is required and the models architecture remains unchanged. This possible because element-wise multiplications performed in (IA) 3 always co-occur with a matrix multiplication, and l W x = ( l W )x. In this case, our method incurs no additional computational cost compared to the original model. To validate (IA) 3, we compare it to a large variety of existing adaptation methods in our setting of fine-tuning T0-3B on few-shot datasets from held-out tasks. Specifically, we compare with 9 strong PEFT methods: BitFit [ 47 ] which updates only the bias parameters; Adapters [ 23 ] which introduce task-specific layers after the self-attention and position-wise feed-forward networks; Compacter and Compacter++ [ 28 ] which improve upon adapters by using low-rank matrices and hypercomplex mul-tiplication; prompt tuning [ 14 ] which learns task-specific prompt embeddings that are concatenated to the models input; FISH Mask [ 26 ] which chooses a subset of parameters to update based on their ap-proximate Fisher information; Intrinsic SAID [ 27 ] which performs optimization in a low-dimensional subspace; prefix-tuning [ 29 ] which learns task-specific vectors that are concatenated to the models activations; and LoRA [ 13 ] which assigns low-rank updates to parameter matrices. Additionally, we include the baselines of full-model fine-tuning and updating only the layer normalization parameters. For certain methods that allow changing the parameter efficiency, we report results for different budgets: 0.2% and 0.02% sparsity for FISH Mask, 10 and 100 learned prompt vectors for prompt tuning, and 20,000- or 500,000-dimensional subspaces for Intrinsic SAID. The results are shown in fig. 2, with detailed per-dataset results in appendix D. We find that (IA) 3

is the only method that attains higher accuracy than the full-model-fine-tuning baseline. While other PEFT methods (e.g. Intrinsic SAID and prompt tuning) update or introduce fewer parameters,

(IA) 3 performs considerably better. Our results and setting differ with some past work on the PEFT methods we compare against. Mahabadi et al. [28] report that Compacter and Compacter++ outperform full-model fine-tuning, including in the few-shot setting. Lester et al. [14] found that prompt tuning could match full-model fine-tuning, and in subsequent work Wei et al. [48] found that prompt tuning performed well when applied to a multitask fine-tuned model in the few-shot setting. In both cases, we experimented with various hyperparameter choices to try to match past results. We hypothesize the disagreement comes from us using a different model and different datasets. For prompt tuning specifically, we noticed that the validation set performance could fluctuate wildly over the course of training, hinting at possible optimization issues. 60.001% 0.01% 0.1%

% of parameters updated 50 55 60 65 Accuracy All parameters (IA)³ LoRA BitFit Layer Norm Compacter Compacter++ Prompt Tuning Prefix Tuning Adapter FISH Mask Intrinsic SAID

Figure 2: Accuracy of PEFT methods with LUL

and LLN when applied to T0-3B. Methods that with variable parameter budgets are represented with larger and smaller markers for more or less parameters. 10 12 10 13 10 14 10 15

FLOPs per example 50 55 60 65 70 Accuracy T-Few T0 T5+LM GPT-3 6.7B GPT-3 13B GPT-3 175B

Figure 3: Accuracy of different few-shot learning methods. T-Few uses (IA) 3 for PEFT methods of T0, T0 uses zero-shot learning, and T5+LM and the GPT-3 variants use few-shot ICL. The x-axis corresponds to inference costs; details are provided in section 4.2.

3.4 Pre-training (IA) 3

In recent work, Gu et al. [18] , Vu et al. [19] showed that pre-training the prompt embeddings in prompt tuning can improve performance when fine-tuning on downstream few-shot tasks. For pre-training, Gu et al. [18] use a suite of self-supervised tasks applied to unlabeled text data, and Vu et al. [19] consider using embeddings from a separate task or multitask mixture. We follow Vu et al. [19] and simply pre-train the new parameters introduced by (IA) 3 on the same multitask mixture used to train T0. We pre-train for 100,000 steps with a batch size of 16 before fine-tuning the (IA) 3

parameters on each individual downstream dataset. A full comparison of accuracy with and without pre-training (IA) 3 is detailed in appendix E. We find that pre-training improves fine-tuned accuracy from 64.6 to 65.8 and therefore add it to our recipe.

3.5 Combining the ingredients

In summary, the T-Few recipe is defined as follows: We use the T0 model as a backbone. We add

(IA) 3 for downstream task adaptation and use parameters initialized from pre-training (IA) 3 on the same multitask mixture for T0. As an objective, we use the sum of a standard language modeling loss LLM , an unlikelihood loss LUL for incorrect choices, and a length-normalized loss LLN . We train for 1,000 steps with a batch size of 8 sequences using the Adafactor optimizer [ 49 ] with a learning rate of 3e3 and a linear decay schedule with a 60-step warmup. We apply prompt templates to downstream datasets during training and inference to convert each example into an instructive text-to-text format. Importantly, we apply this recipe to every downstream dataset in exactly the same way without per-dataset hyperparameter tuning or modifications. This makes the recipe a realistic option for few-shot learning settings where validation sets are tiny by definition [31, 32].

4 Outperforming ICL with T-Few

Having designed and established the T-Few recipe on T0-3B, we now apply it to T0 (with 11 billion parameters) and compare performance to strong few-shot ICL baselines. From this point onwards, we use exactly the same recipe and hyperparameters across all tasks.

4.1 Performance on T0 tasks

First, we evaluate T-Few on the datasets that were held out from T0s training mixture. We compare against zero-shot learning with T0 [ 1] (since we found few-shot ICL to performed worse than zero-7shot for T0, see appendix F); few-shot ICL with T5+LM [ 14 ] (the next-step-prediction language model upon which T0 is based); and few-shot ICL with the 6.7, 13, and 175 billion parameter variants of GPT-3. See appendix F for more details on these baselines. The accuracy on the held-out T0 datasets (described in section 3.1) is shown in table 1 and fig. 3, with per-dataset results reported in appendix F. We find that T-Few outperforms all other methods by a substantial margin. Notably,

T-Few achieves a 6% higher accuracy than few-shot ICL with GPT-3 175B despite being about 16 ×

smaller and outperforms the smaller GPT-3 variants by an even larger margin. T-Few also attains significantly higher accuracy than both zero-shot learning with T0 and few-shot ICL with T5+LM. Method Inference FLOPs Training FLOPs Disk space Acc.

T-Few 1.1e12 2.7e16 4.2 MB 72.4% T0 [1] 1.1e12 0 0 B 66.9% T5+LM [14] 4.5e13 0 16 kB 49.6% GPT-3 6.7B [4] 5.4e13 0 16 kB 57.2% GPT-3 13B [4] 1.0e14 0 16 kB 60.3% GPT-3 175B [ 4] 1.4e15 0 16 kB 66.6% Table 1: Accuracy on held-out T0 tasks and computational costs for different few-shot learning methods and models. T-Few

attains the highest accuracy with 1,000 × lower computational cost than ICL with GPT-3 175B. Fine-tuning with T-Few costs about as much as ICL on 20 examples with GPT-3 175B. Method Acc.

T-Few 75.8% Human baseline [2] 73.5% PET [50] 69.6% SetFit [51] 66.9% GPT-3 [4] 62.7% Table 2: Top-5 best methods on RAFT as of writing. T-Few is the first method to outperform the human baseline and achieves over 6% higher accuracy than the next-best method.

4.2 Comparing computational costs

Having established that T-Few significantly outperforms ICL-based models, we now compare the relative costs of each few-shot learning approach. For simplicity, we use the FLOPs-per-token estimates for Transformer-based language models introduced by Kaplan et al. [20] . Specifically, we estimate that a decoder-only Transformer (e.g. the GPT series) with N parameters uses 2N FLOPs per token for inference and 6N FLOPs per token for training. Encoder-decoder models like T0 and T5 (where the encoder and decoder have the same number of layers and layer sizes) only process each token with either the encoder or decoder (each having roughly half the parameters of the full model), so the FLOPs per token estimates are halved to N and 3N FLOPs per token for inference and training. We note that FLOPs are not a direct measurement of real-world computational cost because latency, power usage, and other costs can vary significantly depending on hardware and other factors [ 52 ]. However, we focus on FLOPs because it is a hardware-independent metric that closely with real-world costs the hardware setup used for running the different methods we consider would likely vary significantly across methods. We summarize the costs in table 1 and discuss them below. For all estimates, we use the median number of shots (41) across the datasets we consider. Rank evaluation and our unlikelihood loss both require processing every possible output choice to attain a prediction for an unlabeled example. The median combined tokenized sequence length for the input and all possible targets is 103 for the datasets we consider. For in-context examples processed for few-shot ICL, only the correct target is required, producing a median sequence length of 98. Assuming that key and value vectors are cached, processing a single example with ICL therefore involves processing

41 × 98 + 103 tokens. A summary of our cost estimates is provided in table 1.

Inference cost. Beyond improved accuracy, the primary advantage of avoiding few-shot ICL is dramatically lower inference costs. Processing a single input and all target choices with T-Few

requires 11 e9 × 103 = 1 .1e12 FLOPs, whereas few-shot ICL with GPT-3 175B requires 2 × 175 e9 ×

(41 × 98 + 103) = 1 .4e15 FLOPs more than 3 orders of magnitude more. Inference costs with ICL using the smaller GPT-3 variants are also dramatically higher than the inference cost of T-Few . As discussed in section 2.1, caching the key and value vectors when the same set of in-context examples is to be reused can reduce the computational cost of ICL. However, this would only result in an approximately 41 × reduction, which is not nearly enough to make any of the GPT-3 ICL costs as low as T-Few .

Training cost. Since T-Few is the only method that involves updating parameters, it is the only method that incurs a training cost. Training an eleven billion parameter encoder-decoder model for 1,000 steps with a batch size of 8 length-103 sequences requires approximately 3 × 11 e9 × 1, 000 ×

88 × 103 = 2 .7e16 FLOPs. While not insignificant, this is only about 20 times larger than the FLOPs required to process a single example with few-shot ICL using GPT-3 175B. In other words, training

T-Few costs as much as using GPT-3 175B to process 20 examples with few-shot ICL. We also found that fine-tuning T0 with T-Few on a single dataset only takes about a half an hour on a single NVIDIA A100 GPU. As of writing, this would cost about $2 USD using Microsoft Azure. 2

Storage cost. T-Few also incurs the largest storage cost. When stored as single-precision floats, the parameters added by (IA) 3 take up 4.2 MB of space on disk. In contrast, ICL methods only require storing the tokenized in-context examples (typically stored as 32-bit integers), resulting in a smaller

41 × 98 × 32 bits = 16 kB disk space requirement. However, we note that 4.2 MB is dwarfed by the on-disk size of the model checkpoints themselves storing the (IA) 3 adaptation vectors for 10,000 tasks would take about as much space as the T0 checkpoint (41.5 GB).

Memory usage. During inference, the primary memory cost is incurred by the models parameters. The only model smaller than T0 (used by T-Few ) is GPT-3 6.7B; otherwise, T-Few will incur a lower memory cost during inference. Additional memory costs are incurred when training T-Few due to the need to cache intermediate activations for backpropagation and for the gradient accumulator variables in Adafactor. However, as mentioned above, it is possible to use the T-Few recipe on a single 80GB A100 GPU.

4.3 Performance on Real-world Few-shot Tasks (RAFT)

So far, we have evaluated performance on a collection of datasets that were not explicitly designed for benchmarking few-shot learning. To better evaluate T-Few s performance in the real world, we evaluated our approach on the RAFT benchmark [2]. RAFT consists of 11 “economically valuable” tasks that aim to mirror real-world applications. Importantly, each RAFT datasets has only 50 training examples with no validation set and a (larger) test set with no public labels, so it is impossible to “cheat” by tuning on an unrealistically-large validation set or by peeking at the test set [ 32 , 31 ]. We apply T-Few to RAFT by using the standard prompts released alongside the dataset. The accuracy of the current top-5 methods is shown in table 2, with further details provided in appendix H. T-Few

attains a state-of-the-art accuracy of 75.8% and outperforms the human baseline (73.5% accuracy) for the first time. The next-best model (from Schick and Schütze [50] ) achieves 6% lower accuracy and GPT-3 175B attains only 62.7%. These results validate that T-Few can be readily applied as-is to novel real-world tasks to attain strong performance.

4.4 Ablation experiments

Given that our T-Few design experiments were on T0-3B, we perform an ablation of some of the ingredients of T-Few on T0. Detailed results are shown in appendix G. While the gains from adding each ingredient does not always significant increase the accuracy on each individual dataset, each ingredient consistently improves the average performance across datasets: Removing pre-training decreases accuracy by 1.6%, removing unlikelihood training and length normalization decreases accuracy by 4.1%, and removing both pre-training and our additional loss terms reduces accuracy by 2.5%.

5 Conclusion

We introduced T-Few , a parameter-efficient few-shot learning recipe that attains higher accuracy than few-shot ICL at a lower computational cost. T-Few uses (IA) 3, a new PEFT method that rescales inner activations with learned vectors. Using (IA) 3 produces better performance than fine-tuning the full model while only introducing a tiny amount of additional parameters. T-Few also uses two additional loss terms that encourage the model to output lower probabilities for incorrect choices and account for the length of different answer choices. When applying T-Few as-is (with no task-specific hyperparameter tuning or other changes) to the RAFT benchmark, we attained super-human performance for the first time and outperformed prior submissions by a large margin. Through detailed characterization of computational costs, we found that T-Few uses over 1,000 × fewer FLOPs during inference than few-shot ICL with GPT-3 and only requires 30 minutes to train on a single NVIDIA A100 GPU. Since all of our experiments were on classification tasks, we are interested in applying T-Few to generative tasks like as summarization and question answering in future work. We hope our results provide a new perspective on how best to perform few-shot learning with large language models.

2https://docs.microsoft.com/en-us/azure/virtual-machines/ndm-a100-v4-series

9References

[1] Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207 , 2021. [2] Neel Alex, Eli Lifland, Lewis Tunstall, Abhishek Thakur, Pegah Maham, C Jess Riedel, Emmie Hine, Carolyn Ashurst, Paul Sedille, Alexis Carlier, et al. RAFT: A real-world few-shot text classification benchmark. arXiv preprint arXiv:2109.14076 , 2021. [3] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog , 2019. [4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 , 2020. [5] Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. Meta-learning via language model in-context tuning. arXiv preprint arXiv:2110.07814 , 2021. [6] Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. Metaicl: Learning to learn in context. arXiv preprint arXiv:2110.15943 , 2021. [7] Andrew Kyle Lampinen, Ishita Dasgupta, Stephanie C. Y. Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L. McClelland, Jane X. Wang, and Felix Hill. Can language models learn from explanations in context? ArXiv , abs/2204.02329, 2022. [8] Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering.

arXiv preprint arXiv:2203.05115 , 2022. [9] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837 , 2022. [10] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705 , 2022. [11] Albert Webson and Ellie Pavlick. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247 , 2021. [12] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. arXiv preprint arXiv:2102.09690 , 2021. [13] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. ArXiv , abs/2106.09685, 2021. [14] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 , 2021. [15] Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv , abs/1910.10683, 2020. [16] Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. Improving and simplifying pattern exploiting training. arXiv preprint arXiv:2103.11955 , 2021. [17] Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319 , 2019. [18] Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. PPT: Pre-trained prompt tuning for few-shot learning. arXiv preprint arXiv:2109.04332 , 2021. 10 [19] Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, and Daniel Cer. SPoT: Better frozen model adaptation through soft prompt transfer. arXiv preprint arXiv:2110.07904 , 2021. [20] Jared Kaplan, Sam McCandlish, T. J. Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeff Wu, and Dario Amodei. Scaling laws for neural language models.

arXiv preprint arXiv:2001.08361 , 2020. [21] Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Noisy channel language model prompting for few-shot text classification. arXiv preprint arXiv:2108.04106 , 2021. [22] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. Advances in neural information processing systems , 30, 2017. [23] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. arXiv preprint arXiv:1902.00751 , 2019. [24] Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Simple, scalable adaptation for neural machine translation. arXiv preprint arXiv:1909.08478 , 2019. [25] Demi Guo, Alexander M. Rush, and Yoon Kim. Parameter-efficient transfer learning with diff pruning. arXiv preprint arXiv:2012.07463 , 2020. [26] Yi-Lin Sung, Varun Nair, and Colin Raffel. Training neural networks with fixed sparse masks.

arXiv preprint arXiv:2111.09839 , 2021. [27] Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255 , 2020. [28] Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. arXiv preprint arXiv:2106.04647 , 2021. [29] Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing continuous prompts for generation.

arXiv preprint arXiv:2101.00190 , 2021. [30] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. To-wards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366 ,2021. [31] Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models.

arXiv preprint arXiv:2105.11447 , 2021. [32] Avital Oliver, Augustus Odena, Colin Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. Advances in Neural Information Processing Systems , 2018. [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems , 2017. [34] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 ,2018. [35] Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Févry, et al. PromptSource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279 , 2022. [36] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , 2020. 11 [37] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. 2011 AAAI Spring Symposium Series , 2011. [38] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830 , 2019. [39] Rishi Sharma, James Allen, Omid Bakhshandeh, and Nasrin Mostafazadeh. Tackling the story ending biases in the story cloze test. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages 752757, 2018. [40] Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599 , 2019. [41] Marie-Catherine de Marneffe, Mandy Simons, and Judith Tonhauser. The CommitmentBank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung 23 , 2019. [42] Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop , pages 177190. Springer, 2005. [43] Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. Thir-teenth International Conference on the Principles of Knowledge Representation and Reasoning ,2012. [44] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence , 2020. [45] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. arXiv preprint arXiv:1808.09121 , 2018. [46] Stefan Van Der Walt, S. Chris Colbert, and Gael Varoquaux. The numpy array: a structure for efficient numerical computation. Computing in science & engineering , 13(2), 2011. [47] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199 ,2021. [48] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652 , 2021. [49] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning . PMLR, 2018. [50] Timo Schick and Hinrich Schütze. True few-shot learning with promptsa real-world perspective.

arXiv preprint arXiv:2111.13440 , 2021. [51] Moshe Wasserblat. Sentence transformer fine-tuning (SetFit): Outperforming GPT-3 on few-shot text-classification while being 1600 times smaller, 2021. [52] Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, and Yi Tay. The efficiency misnomer. arXiv preprint arXiv:2110.12894 , 2021. [53] Guanghui Qin and Jason Eisner. Learning how to ask: Querying LMs with mixtures of soft prompts. arXiv preprint arXiv:2104.06599 , 2021. [54] Xiao Liu, Kaixuan Ji, Yicheng Fu, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-Tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602 , 2021. 12 [55] Shengnan An, Yifei Li, Zeqi Lin, Qian Liu, Bei Chen, Qiang Fu, Weizhu Chen, Nanning Zheng, and Jian-Guang Lou. Input-Tuning: Adapting unfamiliar inputs to frozen pretrained models.

arXiv preprint arXiv:2203.03131 , 2022. [56] Yulong Chen, Yang Liu, Li Dong, Shuohang Wang, Chenguang Zhu, Michael Zeng, and Yue Zhang. AdaPrompt: Adaptive model training for prompt-based NLP. arXiv preprint arXiv:2202.04824 , 2022. [57] Shizhe Diao, Xuechun Li, Yong Lin, Zhichao Huang, and Tong Zhang. Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531 , 2022. [58] Daniel Khashabi, Shane Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sameer Singh, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, et al. Prompt wayward-ness: The curious case of discretized interpretation of continuous prompts. arXiv preprint arXiv:2112.08348 , 2021. [59] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. arXiv preprint arXiv:2112.08654 , 2021. [60] Zonghan Yang and Yang Liu. On robust prefix-tuning for text classification. arXiv preprint arXiv:2203.10378 , 2022. [61] Yuting Yang, Pei Huang, Juan Cao, Jintao Li, Yun Lin, Jin Song Dong, Feifei Ma, and Jian Zhang. A prompting-based approach for adversarial example generation and robustness enhancement.

arXiv preprint arXiv:2203.10714 , 2022. [62] Xiaochen Liu, Yu Bai, Jiawei Li, Yinan Hu, and Yang Gao. PSP: Pre-trained soft prompts for few-shot abstractive summarization. arXiv preprint arXiv:2204.04413 , 2022. [63] Xavier Garcia and Orhan Firat. Using natural language prompts for machine translation. arXiv preprint arXiv:2202.11822 , 2022. [64] Hunter Lang, Monica Agrawal, Yoon Kim, and David Sontag. Co-training improves prompt-based learning for large language models. arXiv preprint arXiv:2202.00828 , 2022. [65] Boshi Wang, Xiang Deng, and Huan Sun. Shepherd pre-trained language models to develop a train of thought: An iterative prompting approach. arXiv preprint arXiv:2203.08383 , 2022. [66] Xu Zou, Da Yin, Qingyang Zhong, Hongxia Yang, Zhilin Yang, and Jie Tang. Controllable gener-ation from pre-trained language models via inverse prompting. arXiv preprint arXiv:2103.10685 ,2021. [67] Yusheng Su, Xiaozhi Wang, Yujia Qin, Chi-Min Chan, Yankai Lin, Zhiyuan Liu, Peng Li, Juanzi Li, Lei Hou, Maosong Sun, et al. On transferability of prompt tuning for natural language understanding. arXiv preprint arXiv:2111.06719 , 2021. [68] Yun He, Huaixiu Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, et al. HyperPrompt: Prompt-based task-conditioning of transformers. arXiv preprint arXiv:2203.00759 , 2022. [69] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. arXiv preprint arXiv:2203.12119 , 2022. [70] Timo Schick and Hinrich Schütze. Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676 , 2020. [71] Teven Le Scao and Alexander M. Rush. How many data points is a prompt worth? arXiv preprint arXiv:2103.08493 , 2021. [72] Sen Yang, Yunchen Zhang, Leyang Cui, and Yue Zhang. Do prompts solve NLP tasks using natural language? arXiv preprint arXiv:2203.00902 , 2022. 13 [73] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Auto-Prompt: Eliciting knowledge from language models with automatically generated prompts.

arXiv preprint arXiv:2010.15980 , 2020. [74] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723 , 2020. [75] Ningyu Zhang, Luoqiu Li, Xiang Chen, Shumin Deng, Zhen Bi, Chuanqi Tan, Fei Huang, and Huajun Chen. Differentiable prompt makes pre-trained language models better few-shot learners. arXiv preprint arXiv:2108.13161 , 2021. [76] Rabeeh Karimi Mahabadi, Luke Zettlemoyer, James Henderson, Marzieh Saeidi, Lambert Mathias, Veselin Stoyanov, and Majid Yazdani. PERFECT: Prompt-free and efficient few-shot learning with language models. arXiv preprint arXiv:2204.01172 , 2022. [77] Nafise Sadat Moosavi, Quentin Delfosse, Kristian Kersting, and Iryna Gurevych. Adaptable adapters. arXiv preprint arXiv:2205.01549 , 2022. [78] Eleni Triantafillou, Hugo Larochelle, Richard Zemel, and Vincent Dumoulin. Learning a universal template for few-shot dataset generalization. arXiv preprint arXiv:/2105.07029 , 2021. [79] James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, and Richard E. Turner. Fast and flexible multi-task classification using conditional neural adaptive processes. arXiv preprint arXiv:1906.07697 , 2019. [80] Wei-Hong Li, Xialei Liu, and Hakan Bilen. Universal representation learning from multiple domains for few-shot classification. Proceedings of the IEEE/CVF International Conference on Computer Vision. , 2021. [81] Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628 .14 A Compute resources used

All T0-3B models were trained on 48GB A6000s. Training T0-3B with different PEFT methods took about an hour to train, except for Intrinsic SAID and FishMask which each took about two hours to train. Pre-training (IA) 3 took 1 day on 4 A6000s. All T0 models were trained 80GB A100s from DataCrunch 3 and took about half an hour to train each. Pre-training (IA) 3 took about 1 day on 4 A100s.

Currently, prompt tuning is one of the most parameter-efficient methods for large language models [ 29 , 14 , 53 ]. Liu et al. [54] introduce several tricks to improve prompt tuning, An et al. [55] tune prompts along with input embeddings for boost in performance, and Chen et al. [56] improve prompt embeddings through continued pre-training. Given optimization difficulties when training prompt embeddings, Diao et al. [57] recently used black-box optimization to train prompt embeddings without requiring gradients. Several works have analyzed prompt tuning from the perspective of interpretability Khashabi et al. [58] and its similarity to other PEFT methods He et al. [30] . Prompt tuning has been applied to various applications for NLP including continual learning [ 59 ], model robustness [ 60 , 61 ], summarization [ 62 ], machine translation [ 63 ], co-training [ 64 ], probing language models [ 65 , 65 ], inverse prompting [ 66 ] and transfer learning [ 67 ]. He et al. [68] recently proposed the use of a hypernetwork to predict prompts for new tasks (rather than training the prompt parameters with gradient descent). Prompt tuning and other PEFT methods have also been explored outside of the context of language models (e.g. vision [22, 69] and vision-and-language models [26]). Separately, various studies have considered few-shot full-model fine-tuning with discrete prompts [70 ]. Recent work has analyzed training with discrete prompts, demonstrating a boost in performance with prompting when training on various numbers of examples [ 71 ], finding that models perform similarly when trained on good and bad prompts [ 11 ], and exploring which prompts work well for few-shot and full-shot setting [ 72 ]. There have also been efforts to develop methods that find performant discrete prompts [ 73 , 74 ] and training prompts using methods similar to prompt tuning [75]. There has also been a great deal of work on improving ICL. Chen et al. [5] , Min et al. [6] use ICL for meta-learning to perform few-shot learning on new tasks. Lampinen et al. [7] show ICL can improve when explanations are provided and [ 8] use ICL with text retrieved from the web for open-domain question-answering. Meanwhile, Min et al. [9] analyze how ICL works and show that ICL can still perform well when incorrect labels are provided for the in-context examples. With the advent of large language models with billions of parameters, there has been a great deal of recent interest in PEFT methods. A small amount of recent work has also begun to explore the compatibility of PEFT methods in the few-shot setting. Mahabadi et al. [28] found that PEFT can outperform standard fine-tuning in the low-resource setting. In concurrent work, Mahabadi et al. [76] compare PEFT to the use of discrete prompts (e.g. PET [ 70 ]) during few-shot fine-tuning and find that PEFT compares favorably. Also concurrently, Moosavi et al. [77] propose a framework for introducing adapters whose architecture and design vary from task to task and demonstrate improved results in few-shot settings. Gu et al. [18] and Vu et al. [19] both explored how pre-training prompt tuning parameters can improve when limited labeled data is available. For few-shot learning, Triantafillou et al. [78] explore learning universal and dataset dependent parameters that can be blended for generalization. Requeima et al. [79] use conditional neural adaptive processes and Li et al. [80] leverage distillation from multiple feature extractors for learning new classes or domains in few-shot learning.

C Full Unlikelihood Training and Length Normalization Results

Table 3 shows the full results with unlikelihood training and length normalization.

D Full PEFT Results

We compare against the following PEFT methods, using a linear decay with warmup scheduler with a warm-up ratio of 0.06 and the Adafactor optimizer [ 49 ]. We show the full per-dataset result of all

3https://cloud.datacrunch.io/

15 COPA H-Swag StoryCloze Winogrande WSC WiC FT 78 .02.0 39 .20.2 91 .51.0 54 .50.9 66 .41.0 53 .81.7

  • UL 81 .03.0 46 .14.8 93 .62.5 56 .52.2 61 .58.7 56 .44.1

  • LN 86 .04.0 47 .122 .4 94 .00.6 56 .93.8 65 .43.9 53 .92.0

  • UL + LN 81 .011 .0 46 .48.8 93 .82.7 56 .51.5 65 .47.7 57 .73.9

RTE CB ANLI-R1 ANLI-R2 ANLI-R3 FT 75 .85.4 82 .15.4 47 .81.5 40 .60.8 37 .81.8

  • UL 77 .61.4 89 .31.8 47 .91.9 40 .91.9 38 .85.0

  • LN 75 .84.3 89 .37.1 48 .20.6 40 .90.9 38 .31.6

  • UL + LN 79 .83.6 87 .55.4 46 .62.5 41 .30.9 40 .25.3

Table 3: Per-dataset results for comparing the effect of including the additional loss terms introduced in section 3.2. Subscripts are IQR. PEFT methods we considered and ablate the losses. Table 4 includes all losses, Table 5 includes LLN ,Table 6 includes LUL , and Table 7 does not include either loss.

Full Model Fine-tuning We train for 300 steps with a learning rate of 3e4.

BitFit [47] We train for 300 steps with a learning rate of 3e4.

LayerNorm We train for 300 steps with a learning rate of 3e4.

Adapter [23] We use a reduction factor of 32 , ReLU nonlinearity, and residual connections. We train for 500 steps with a learning rate of 3e3.

Compacter [28] We train for 500 steps with a learning rate of 3e3 and hyper complex division factor of 4 (n = 4) .

Compacter++ [28] We train for 500 steps with a learning rate of 3e3 and hyper complex division factor of 4 (n = 4) .

Prompt tuning [14] We train for 1000 steps with a learning rate of 3e1 and use 10 and 100 prompt embeddings.

Prefix tuning [29] We train for 1000 steps with a learning rate of 3e3 and adopt the two-layer MLP parameterization in the paper with hidden size 512. We use "Question:" and "Answer:" as initialization text for the prefixes attached to the input and target sequence, respectively.

FishMask [26] The Fisher is first computed on the training examples and we keep 0.2% or 0.02%

of the parameters. Then, these parameters are trained for 1500 steps with a learning rate of

3e4.

Intrinsic SAID [27] We train for 3000 steps with a learning rate of 3e2. Due to large model size, we use Intrinsic SAID to produce rank-1 updates for 2D weights via an outer product of two vectors.

LoRA [13] We use a rank of 4 with initialization scale of 0.01 and update all the attention and feedforward module. We train for 1000 steps with a learning rate of 3e3.

E Full Pre-training Results

Table 8 shows the per-dataset results for of pre-training (IA) 3.

F Full Main Results

We compare against the following baselines:

T0. To measure the improvement in performance conferred through parameter-efficient few-shot learning, we compare to zero-shot evaluation using T0 itself. In preliminary experiments, we found that T0 was not able to perform few-shot ICL performance actually decreased as we increased the 16 number of in-context examples. This is likely because of the zero-shot format used during multitask prompted fine-tuning and corroborates a recent finding by [10].

T5+LM. Since T0 is unable to perform ICL on its own, we also compare to T5+LM, the next-step-prediction language model upon which T0 is based. Specifically, we use the LM-adapted variant of T5.1.1.xxl released by Lester et al. [14] , which has the same architecture and number of parameters as T0. Due to memory constraints and because of its improved performance, we use ensemble ICL for T5+LM [ 6 ]. Specifically, we perform one-shot ICL using each example in the training set individually and average the predictions for a given query example. For fair comparison with GPT-3 models, we use the EleutherAI evaluation harness [ 81 ], which was designed to replicate the evaluation setup done by Brown et al. [4].

GPT-3. For a strong ICL baseline, we consider models in the GPT-3 family [ 4]. Specifically, we compare to the 6.7, 13, and 175 billion parameter variants of GPT-3. Because these models have not been publicly released, we report numbers directly from Brown et al. [4] . While GPT-3 is available through the commercial OpenAI API, re-running evaluation through the API would be more than an order of magnitude more expensive than running all of the experiments performed for this paper.

G Full Ablation Results

Table table 10 shows the T-Few ablation results.

H RAFT Experiment Details

RAFT consists of 11 tasks: Ade Corpus V2, Banking 77, NeurIps Impact Statement Risks, One Stop English, Overruling, Systematic Review Inclusion, Tai Safety Research, Terms of Service, Tweet Eval Hate, and Twitter Complaints. We use the T-Few recipe on all datasets without putting the labels into the input string except Banking 77. Since Banking 77 has 77 classes which causes memory issues for unlikelihood training, we turn off unlikelihood training for Banking 77. We also feed in all the labels as part of the input string for Banking 77 since there were some labels never seen during training and clean the labels by replacing "." with ",". Per-dataset results of T-Few and the other top-5 methods on RAFT are shown in table 11. 17 # of Param COPA H-Swag StoryCloze Winogrande Full Model Fine-tuning 3B 81 .011 .0 46 .48.8 93 .82.7 56 .51.5

BitFit (with LayerNorm) 1.3M 75 .02.0 29 .53.6 88 .60.7 49 .61.3

LayerNorm 250K 76 .02.0 29 .63.4 88 .70.9 49 .41.4

Adapter 12.9M 84 .03.0 41 .93.8 91 .73.7 54 .73.6

Compacter 807K 84 .05.0 46 .42.5 93 .52.2 55 .52.9

Compacter++ 540K 86 .03.0 46 .33.0 93 .51.2 55 .11.1

Prompt tuning (10) 41K 67 .05.0 29 .90.6 84 .20.8 51 .91.6

Prompt tuning (100) 409K 60 .019 .0 26 .80.6 74 .03.4 51 .10.8

Prefix tuning 576K 71 .08.0 42 .14.0 90 .23.1 52 .01.3

FishMask (0.2%) 6M 82 .05.0 44 .14.2 94 .21.8 54 .52.1

FishMask (0.02%) 600K 84 .06.0 38 .23.6 93 .60.7 53 .92.8

Intrinsic SAID 500K 77 .04.0 36 .74.5 89 .32.3 52 .72.1

Intrinsic SAID 20K 76 .04.0 38 .36.4 89 .72.7 50 .91.0

LoRA 9.1M 88 .05.0 47 .13.2 93 .62.1 56 .83.3

(IA) 3 540K 87 .03.0 49 .44.6 94 .72.7 59 .80.6

of Param WSC WiC RTE CB Full Model Fine-tuning 3B 65 .47.7 57 .73.9 79 .83.6 87 .55.4

BitFit (with LayerNorm) 1.3M 61 .511 .5 51 .72.2 72 .21.1 57 .11.8

LayerNorm 250K 63 .512 .5 52 .21.6 71 .80.4 57 .11.8

Adapter 12.9M 65 .41.0 55 .52.7 76 .23.6 87 .53.6

Compacter (n = 4) 807K 64 .46.7 55 .23.8 75 .86.1 82 .13.6

Compacter++ (n = 4) 540K 65 .43.9 54 .12.2 76 .90.4 82 .13.6

Prompt tuning (10) 41K 54 .810 .6 51 .62.0 52 .75.4 66 .11.8

Prompt tuning (100) 409K 60 .64.8 50 .01.1 48 .02.9 53 .617 .9

Prefix tuning 576K 56 .73.3 54 .23.3 68 .63.3 84 .01.8

FishMask (0.2%) 6M 63 .54.8 52 .53.3 76 .94.7 83 .93.6

FishMask (0.02%) 600K 61 .51.0 53 .51.3 75 .55.4 76 .83.6

SAID 500K 61 .58.7 55 .02.7 69 .07.6 80 .40.0

SAID 20K 55 .86.7 55 .30.5 66 .15.4 83 .91.8

LoRA 9.1M 60 .65.8 55 .25.0 78 .37.6 85 .71.8

(IA) 3 540K 68 .36.7 56 .04.6 78 .02.5 87 .51.8

of Param ANLI-R1 ANLI-R2 ANLI-R3 Full Model Fine-tuning 3B 46 .62.5 41 .30.9 40 .25.3

BitFit (with LayerNorm) 1.3M 36 .50.8 35 .32.2 36 .60.8

LayerNorm 250K 36 .50.7 35 .12.6 36 .31.0

Adapter 12.9M 45 .12.6 40 .41.2 35 .31.3

Compacter 807K 40 .83.3 37 .40.2 35 .83.3

Compacter++ 540K 41 .70.4 38 .31.8 36 .91.5

Prompt tuning (10) 41K 34 .21.9 33 .51.1 33 .51.3

Prompt tuning (100) 409K 33 .41.2 33 .80.5 33 .30.8

Prefix tuning 576K 43 .34.1 37 .51.2 36 .51.5

FishMask (0.2%) 6M 43 .70.3 39 .71.4 37 .21.1

FishMask (0.02%) 600K 39 .90.9 38 .12.0 36 .21.8

SAID 500K 40 .43.3 35 .44.1 35 .51.6

SAID 20K 41 .31.3 38 .51.8 35 .82.0

LoRA 9.1M 45 .12.5 41 .01.4 39 .54.8

(IA) 3 540K 48 .62.0 40 .81.5 40 .82.3

Table 4: Per-dataset accuracies for the PEFT methods we consider when adding LUL and LLN .Subscripts are IQR. 18 # of Param COPA H-Swag StoryCloze Winogrande Full Model Fine-tuning 3B 86 .00 4.00 47 .12 22 .44 93 .96 0.59 56 .91 3.79

BitFit (with LayerNorm) 1.3M 80 .00 6.00 31 .33 0.16 92 .89 0.27 51 .38 0.71

LayerNorm 250K 82 .00 2.00 31 .25 0.64 92 .84 0.48 51 .14 0.39

Adapter 12.9M 84 .00 5.00 44 .05 3.22 92 .89 2.35 52 .64 0.55

Compacter (n = 4) 807K 85 .00 3.00 47 .20 5.34 94 .33 1.23 53 .91 1.34

Compacter++ (n = 4) 540K 85 .00 2.00 47 .86 1.65 94 .55 0.69 54 .38 2.92

Prompt tuning (10) 41K 72 .00 5.00 30 .43 1.07 90 .38 1.23 50 .51 0.95

Prompt tuning (100) 409K 65 .00 1.00 27 .93 4.69 87 .01 3.05 51 .93 0.39

Prefix tuning 576K 79 .00 6.00 34 .40 9.71 90 .33 3.15 51 .10 1.72

FishMask (0.2%) 6M 85 .00 4.00 26 .65 0.14 93 .80 0.90 54 .38 0.16

FishMask (0.02%) 600K 82 .00 2.00 26 .65 0.14 93 .64 1.12 53 .91 1.97

Intrinsic SAID 500K Intrinsic SAID 20K LoRA 9.1M 86 .00 1.00 48 .68 2.62 94 .44 1.66 56 .12 1.03

(IA) 3 540K 90 .00 2.00 50 .03 3.02 95 .40 1.12 58 .25 0.55

of Param WSC WiC RTE CB Full Model Fine-tuning 3B 65 .38 3.85 53 .92 2.04 75 .81 4.33 89 .29 7.14

BitFit (with LayerNorm) 1.3M 63 .46 2.88 54 .23 3.13 75 .45 1.81 67 .86 0.00

LayerNorm 250K 60 .58 2.88 55 .33 1.88 76 .17 1.44 67 .86 1.79

Adapter 12.9M 63 .46 3.85 55 .49 3.61 77 .26 3.97 80 .36 3.57

Compacter (n = 4) 807K 64 .42 3.85 53 .29 5.49 75 .45 2.89 82 .14 5.36

Compacter++ (n = 4) 540K 65 .38 3.85 54 .86 3.45 77 .26 5.78 76 .79 7.14

Prompt tuning (10) 41K 53 .85 4.81 52 .04 1.72 55 .23 2.53 66 .07 3.57

Prompt tuning (100) 409K 50 .96 6.73 51 .88 1.57 48 .38 3.69 62 .50 12 .50

Prefix tuning 576K 60 .58 3.85 68 .95 0.72 80 .36 12 .50 75 .00 8.93

FishMask (0.2%) 6M 66 .35 2.88 54 .23 1.10 75 .81 3.61 83 .93 7.14

FishMask (0.02%) 600K 60 .58 1.92 52 .82 1.10 75 .09 3.61 76 .79 3.57

SAID 500K SAID 20K LoRA 9.1M 61 .54 1.92 55 .02 4.70 74 .73 4.69 85 .71 1.79

(IA) 3 540K 66 .35 3.85 53 .76 0.63 76 .90 2.89 83 .93 0.00

of Param ANLI-R1 ANLI-R2 ANLI-R3 Avg.

Full Model Fine-tuning 3B 48 .20 0.60 40 .90 0.90 38 .25 1.58 63 .25

BitFit (with LayerNorm) 1.3M 36 .10 1.40 35 .60 1.40 35 .42 2.00 56 .7

LayerNorm 250K 37 .30 0.50 37 .10 0.70 36 .25 1.08 57 .07

Adapter 12.9M 42 .40 3.20 38 .80 0.60 36 .50 3.83 60 .71

Compacter (n = 4) 807K 42 .90 3.90 38 .00 0.80 37 .33 2.33 61 .27

Compacter++ (n = 4) 540K 41 .90 0.50 38 .50 2.40 36 .00 0.58 61 .13

Prompt tuning (10) 41K 34 .20 1.10 34 .20 1.30 34 .42 0.83 52 .12

Prompt tuning (100) 409K 34 .10 1.10 34 .20 0.20 34 .08 1.25 49 .82

Prefix tuning 576K 37 .50 3.60 34 .17 4.50 34 .40 9.71 58 .71

FishMask (0.2%) 6M 43 .40 0.60 40 .00 0.90 36 .75 2.83 60 .03

FishMask (0.02%) 600K 40 .10 0.90 38 .00 2.00 35 .50 0.75 57 .73

SAID 500K SAID 20K LoRA 9.1M 46 .20 1.70 41 .40 0.90 38 .42 2.67 62 .57

(IA) 3 540K 49 .20 2.80 40 .30 2.30 40 .42 3.17 64 .05

Table 5: Per-dataset accuracies for the PEFT methods we consider when adding LLN . Subscripts are IQR. 19 # of Param COPA H-Swag StoryCloze Winogrande Full Model Fine-tuning 3B 81 .00 3.00 46 .12 4.82 93 .64 2.51 56 .51 2.21

BitFit (with LayerNorm) 1.3M 81 .00 4.00 35 .51 2.34 92 .78 0.86 50 .91 0.08

LayerNorm 250K 82 .00 1.00 34 .60 2.31 92 .68 0.75 51 .78 1.26

Adapter 12.9M 83 .00 1.00 42 .53 5.35 90 .49 3.15 53 .67 3.63

Compacter (n = 4) 807K 88 .00 3.00 42 .95 4.06 92 .89 1.87 54 .62 1.50

Compacter++ (n = 4) 540K 85 .00 2.00 48 .26 2.95 93 .85 1.60 54 .85 2.84

Prompt tuning (10) 41K 74 .00 5.00 29 .24 2.48 88 .88 1.12 51 .38 0.47

Prompt tuning (100) 409K 68 .00 7.00 28 .51 2.43 86 .91 4.33 50 .59 0.16

Prefix tuning 576K 69 .00 2.00 29 .04 10 .83 86 .44 2.35 50 .63 1.41

FishMask (0.2%) 6M 85 .00 5.00 27 .78 0.51 94 .01 1.55 53 .67 2.60

FishMask (0.02%) 600K 84 .00 4.00 27 .78 0.51 93 .16 1.23 53 .59 2.21

Intrinsic SAID 500K Intrinsic SAID 20K LoRA 9.1M 87 .00 3.00 46 .97 1.98 93 .11 2.03 57 .93 3.63

(IA) 3 540K 86 .00 4.00 48 .78 4.12 94 .01 2.83 58 .72 1.34

of Param WSC WiC RTE CB Full Model Fine-tuning 3B 61 .54 8.65 56 .43 4.08 77 .62 1.44 89 .29 1.79

BitFit (with LayerNorm) 1.3M 64 .42 3.85 53 .61 2.51 76 .17 3.61 60 .71 1.79

LayerNorm 250K 60 .58 8.65 53 .92 2.35 75 .09 1.81 57 .14 3.57

Adapter 12.9M 65 .38 6.73 54 .39 3.13 79 .06 5.42 85 .71 3.57

Compacter (n = 4) 807K 65 .38 4.81 54 .55 3.61 75 .45 5.05 82 .14 0.00

Compacter++ (n = 4) 540K 64 .42 3.85 55 .64 3.61 77 .62 4.69 80 .36 7.14

Prompt tuning (10) 41K 54 .81 6.73 52 .82 3.29 52 .71 1.08 69 .64 5.36

Prompt tuning (100) 409K 50 .00 3.85 50 .16 0.94 52 .71 4.33 58 .93 12 .50

Prefix tuning 576K 55 .77 1.92 71 .12 6.14 82 .14 5.36 83 .93 8.93

FishMask (0.2%) 6M 62 .50 3.85 53 .61 1.41 76 .17 2.17 83 .93 8.93

FishMask (0.02%) 600K 59 .62 1.92 53 .61 0.47 74 .37 5.05 75 .00 1.79

SAID 500K SAID 20K LoRA 9.1M 59 .62 12 .50 55 .49 4.86 79 .06 1.81 87 .50 1.79

(IA) 3 540K 65 .38 4.81 56 .74 4.39 77 .26 2.53 87 .50 1.79

of Param ANLI-R1 ANLI-R2 ANLI-R3 Avg.

Full Model Fine-tuning 3B 47 .90 1.90 40 .90 1.90 38 .83 5.00 62 .71

BitFit (with LayerNorm) 1.3M 36 .40 1.10 34 .00 0.70 35 .25 2.42 56 .43

LayerNorm 250K 37 .00 1.90 36 .00 2.10 35 .58 2.17 56 .03

Adapter 12.9M 43 .90 1.10 38 .60 1.10 36 .17 2.17 61 .17

Compacter (n = 4) 807K 41 .80 1.30 37 .60 3.00 37 .17 1.92 61 .14

Compacter++ (n = 4) 540K 41 .70 0.60 38 .20 2.50 35 .58 0.33 61 .41

Prompt tuning (10) 41K 35 .00 2.10 33 .80 0.60 33 .67 2.75 52 .36

Prompt tuning (100) 409K 35 .70 0.90 33 .80 1.50 33 .00 2.17 49 .85

Prefix tuning 576K 34 .60 1.60 36 .83 4.67 38 .52 3.00 58

FishMask (0.2%) 6M 44 .10 1.00 38 .70 1.50 38 .25 0.83 59 .79

FishMask (0.02%) 600K 40 .50 2.60 37 .00 1.20 35 .58 0.75 57 .66

SAID 500K SAID 20K LoRA 9.1M 45 .90 2.20 41 .10 1.70 38 .83 1.08 62 .96

(IA) 3 540K 49 .80 2.10 40 .30 0.30 40 .17 3.33 64 .06

Table 6: Per-dataset accuracies for the PEFT methods we consider when adding LUL . Subscripts are IQR. 20 # of Param COPA H-Swag StoryCloze Winogrande Full Model Fine-tuning 3B 78 .00 2.00 39 .16 0.24 91 .45 0.96 54 .46 0.87

BitFit (with LayerNorm) 1.3M 77 .00 7.00 33 .76 0.38 90 .49 0.27 51 .54 0.16

LayerNorm 250K 77 .00 7.00 33 .58 0.65 90 .43 0.21 51 .38 0.32

Adapter 12.9M 76 .00 5.00 36 .41 2.27 90 .59 1.71 52 .01 0.47

Compacter (n = 4) 807K 81 .00 5.00 37 .53 0.67 91 .50 0.21 52 .57 0.87

Compacter++ (n = 4) 540K 78 .00 2.00 37 .00 1.02 91 .98 0.91 53 .12 0.87

Prompt tuning (10) 41K 73 .00 4.00 30 .09 1.67 88 .88 1.12 52 .25 0.32

Prompt tuning (100) 409K 66 .00 4.00 26 .31 4.46 87 .44 0.21 51 .14 0.55

Prefix tuning 576K 70 .00 3.00 27 .98 6.62 86 .75 2.24 51 .07 1.10

FishMask (0.2%) 6M 77 .00 3.00 35 .45 0.87 90 .54 1.07 52 .96 0.87

FishMask (0.02%) 600K 74 .00 2.00 31 .15 1.30 89 .52 1.28 52 .57 0.47

Intrinsic SAID 500K Intrinsic SAID 20K LoRA 9.1M 80 .00 5.00 39 .14 1.26 92 .04 1.07 53 .75 0.47

(IA) 3 540K 82 .00 1.00 40 .59 0.56 92 .57 0.48 56 .91 2.53

of Param WSC WiC RTE CB Full Model Fine-tuning 3B 66 .35 0.96 53 .76 1.72 75 .81 5.42 82 .14 5.36

BitFit (with LayerNorm) 1.3M 61 .54 3.85 53 .13 1.72 76 .53 1.08 64 .29 8.93

LayerNorm 250K 61 .54 3.85 53 .29 1.72 76 .17 2.17 62 .50 8.93

Adapter 12.9M 65 .38 7.69 54 .70 1.72 77 .26 2.89 83 .93 1.79

Compacter (n = 4) 807K 61 .54 2.88 55 .33 3.61 76 .17 2.17 83 .93 0.00

Compacter++ (n = 4) 540K 61 .54 1.92 54 .70 4.23 73 .65 1.81 78 .57 5.36

Prompt tuning (10) 41K 53 .85 7.69 52 .51 1.88 57 .40 4.33 69 .64 10 .71

Prompt tuning (100) 409K 56 .73 6.73 52 .35 0.63 54 .15 3.97 53 .57 19 .64

Prefix tuning 576K 52 .88 7.69 52 .51 0.31 72 .56 11 .91 75 .00 17 .86

FishMask (0.2%) 6M 62 .50 4.81 54 .23 2.04 77 .26 5.42 82 .14 1.79

FishMask (0.02%) 600K 58 .65 2.88 54 .39 1.10 76 .17 5.05 75 .00 3.57

SAID 500K SAID 20K LoRA 9.1M 64 .42 12 .50 54 .86 3.45 77 .26 4.33 87 .50 3.57

(IA) 3 540K 64 .42 3.85 54 .23 1.57 77 .98 1.81 82 .14 5.36

of Param ANLI-R1 ANLI-R2 ANLI-R3 Avg.

Full Model Fine-tuning 3B 47 .80 1.50 40 .60 0.80 37 .75 1.83 60 .66

BitFit (with LayerNorm) 1.3M 37 .30 1.80 36 .10 2.60 35 .17 3.67 56 .08

LayerNorm 250K 37 .50 1.50 36 .00 2.80 35 .08 3.42 55 .86

Adapter 12.9M 40 .70 3.70 39 .20 1.10 35 .83 1.92 59 .27

Compacter (n = 4) 807K 41 .80 2.70 38 .00 0.80 36 .00 2.75 59 .58

Compacter++ (n = 4) 540K 41 .10 1.50 38 .90 2.50 36 .92 1.42 58 .68

Prompt tuning (10) 41K 33 .60 0.70 33 .80 1.10 34 .83 1.00 52 .71

Prompt tuning (100) 409K 35 .60 1.70 34 .50 0.70 34 .75 1.42 50 .23

Prefix tuning 576K 37 .60 2.30 34 .10 3.50 35 .08 0.67 54 .14

FishMask (0.2%) 6M 43 .50 0.30 40 .30 0.40 36 .42 2.25 59 .3

FishMask (0.02%) 600K 40 .40 2.20 37 .50 1.00 36 .42 1.08 56 .89

SAID 500K SAID 20K LoRA 9.1M 44 .20 2.60 40 .40 1.20 37 .58 0.58 61 .01

(IA) 3 540K 48 .50 0.90 40 .20 1.80 39 .42 1.67 61 .72

Table 7: Per-dataset accuracies for the PEFT methods we consider without LUL or LLN . Subscripts are IQR. 21 COPA H-Swag StoryCloze Winogrande WSC WiC

(IA) 3 87 .03.0 49 .44.6 94 .72.7 59 .80.6 68 .36.7 56 .04.6

  • PT 89 .05.0 51 .24.6 95 .12.5 62 .61.1 70 .28.7 57 .22.5

RTE CB ANLI-R1 ANLI-R2 ANLI-R3 Acc.

(IA) 3 78 .02.5 87 .51.8 48 .62.0 40 .81.5 40 .83 2.3 64.6 + PT 80 .91.4 87 .51.8 49 .31.1 41 .10.5 39 .84.8 65.8 Table 8: Per-dataset results when pre-training (PT) (IA) 3 vs. not pre-training (IA) 3. Subscripts are IQR. COPA H-Swag StoryCloze Winogrande WSC WiC

T-Few 93 .02.0 67 .16.0 97 .90.3 74 .31.5 75 .05.5 62 .27.8

T0 90 .8 33 .7 94 .7 60 .5 64 .4 57 .2

T5+LM 68 .0 60 .95 62 .8 56 .9 63 .5 50 .0

GPT-3 (175B) 92 .0 79 .3 87 .7 77 .7 75 .0 55 .3

GPT-3 (13B) 86 .0 71 .3 83 .0 70 .0 75 .0 51 .1

GPT-3 (6.7B) 83 .0 67 .3 81 .2 67 .4 67 .3 53 .1

RTE CB ANLI-R1 ANLI-R2 ANLI-R3

T-Few 85 .62.9 87 .53.6 59 .33.6 49 .82.6 44 .88.0

T0 81 .2 78 .6 44 .7 39 .4 42 .4

T5 + LM 53 .4 32 .1 33 .3 32 .7 34 .1

GPT-3 (175B) 72 .9 82 .1 36 .8 34 .0 40 .2

GPT-3 (13B) 60 .6 66 .1 33 .3 32 .6 34 .5

GPT-3 (6.7B) 49 .5 60 .7 33 .1 33 .1 33 .9

Table 9: Comparing T-Few with few-shot ICL methods. All GPT-3 numbers are from Brown et al. [4] and all T0 numbers are from Sanh et al. [1]. Subscripts are IQR. COPA H-Swag StoryCloze Winogrande WSC WiC

T-Few 93 .02.0 67 .16.0 97 .90.3 74 .31.5 75 .05.5 62 .15 7.8

  • PT 92 .02.0 64 .56.6 97 .80.8 72 .71.0 73 .16.3 60 .86.4

  • LUL - LLN 91 .02.0 52 .12.7 97 .40.5 71 .91.1 71 .21.0 62 .22.4

  • PT - LUL - LLN 94 .02.3 52 .74.9 98 .00.3 74 .01.1 72 .64.8 62 .65.0

RTE CB ANLI-R1 ANLI-R2 ANLI-R3 Acc.

T-Few 85 .62.9 87 .53.6 59 .33.6 49 .82.6 44 .88.0 72.4 - PT 84 .52.8 83 .95.4 57 .93.2 48 .63.0 43 .15.7 70.8 - LUL - LLN 82 .00.7 82 .13.6 54 .80.4 46 .10.6 40 .85.2 68.3 - PT - LUL - LLN 84 .52.9 80 .43.6 57 .13.1 47 .12.4 43 .85.9 69.7 Table 10: T-Few ablation results when omitting (IA) 3 pre-training (PT) and/or the LUL and LLN

losses. Subscripts are IQR. 22 Method Ade Corpus V2 Banking 77 Neurips Impact Statement Risks One Stop English Overruling Semiconductor Org Types Systematic Review Inclusion Tai Safety Research Terms Of Service Tweet Eval Hate Twitter Complaints

T-Few 80 .4 69 .5 83 .3 67 .6 95 .0 91 .5 50 .8 73 .6 75 .0 58 .6 87 .9

Human baseline [2] 83 .0 60 .7 85 .7 64 .6 91 .7 90 .8 46 .8 60 .9 62 .7 72 .2 89 .7

PET [50] 82 .2 59 .3 85 .7 64 .6 90 .8 81 .6 49 .3 63 .8 57 .6 48 .3 82 .4

SetFit [51] 72 .6 53 .8 87 .2 52 .1 90 .7 68 .2 49 .3 62 .8 62 .0 53 .2 83 .7

GPT-3 [4] 68 .6 29 .9 67 .9 43 .1 93 .7 76 .9 51 .6 65 .6 57 .4 52 .6 82 .1

Table 11: Detailed per-dataset results for T-Few and the other top-5 methods on RAFT. 23