Title: 2309.14859v2.pdf

URL Source: https://arxiv.org/pdf/2309.14859

Published Time: Tue, 12 Mar 2024 03:16:10 GMT

Number of Pages: 79

Markdown Content:

# Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation

Shih-Ying Yeh¹*, Yu-Guan Hsieh²*†, Zhidong Gao³*, Bernard B W Yang⁴, Giyeong Oh⁵, Yanmin Gong³

> ¹ National Tsing Hua University, Taiwan
> ² Apple
> ³ University of Texas at San Antonio, USA
> ⁴ University of Toronto, Canada
> ⁵ Yonsei University, South Korea

https://github.com/KohakuBlueleaf/LyCORIS

## Abstract

Text-to-image generative models have garnered immense attention for their ability to produce high-fidelity images from text prompts. Among these, Stable Diffusion distinguishes itself as a leading open-source model in this fast-growing field. However, the intricacies of fine-tuning these models pose multiple challenges, from new methodology integration to systematic evaluation. Addressing these issues, this paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion), an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion. Furthermore, we present a thorough framework for the systematic assessment of varied fine-tuning techniques. This framework employs a diverse suite of metrics and delves into multiple facets of fine-tuning, including hyperparameter adjustments and the evaluation with different prompt types across various concept categories. Through this comprehensive approach, our work provides essential insights into the nuanced effects of fine-tuning parameters, bridging the gap between state-of-the-art research and practical application.

## 1 Introduction

The recent advancements in deep generative models along with the availability of vast data on the internet have ushered in a new era of text-to-image synthesis (Balaji et al., 2022; Ramesh et al., 2022; Saharia et al., 2022).
These models allow users to transform text prompts into high-quality, visually appealing images, revolutionizing the way we conceive of and interact with digital media (Ko et al., 2023; Zhang et al., 2023). Moreover, the models’ wide accessibility and user-friendly interfaces extend their influence beyond the research community to laypeople who aspire to create their own artworks. Among these, Stable Diffusion (Rombach et al., 2022) emerges as one of the pioneering open-source models offering such capabilities. Its open-source nature has served as a catalyst for a multitude of advances, attracting researchers and casual users alike. Extensions such as cross-attention control (Liu et al., 2022) and ControlNet (Zhang et al., 2023) have further enriched the landscape, broadening the model’s appeal and utility.

> *Equal contribution.
> †Corresponding author. Work done during the author’s Ph.D. at Université Grenoble Alpes.
> arXiv:2309.14859v2 [cs.CV] 11 Mar 2024

While these models offer an extensive repertoire of image generation, they often fall short in capturing highly personalized or novel concepts, leading to a burgeoning interest in model customization techniques. Initiatives like DreamBooth (Ruiz et al., 2023) and Textual Inversion (Gal et al., 2023) have spearheaded efforts in this domain, allowing users to imbue pretrained models like Stable Diffusion with new concepts through a small set of representative images (see Appendix A for detailed related work). Coupled with user-friendly trainers designed to customize Stable Diffusion, the ecosystem now boasts a plethora of specialized models and dedicated platforms that host them, often witnessing the upload of thousands of new models to a single website in just one week.

In spite of this burgeoning landscape, our understanding of the intricacies involved in fine-tuning these models remains limited.
The complexity of the task (from variations in datasets, image types, and captioning strategies to the abundance of available methods, each with its own set of hyperparameters) makes it a challenging terrain to navigate. While new methods proposed by researchers offer much potential, they are not always seamlessly integrated into the existing ecosystem, which can hinder comprehensive testing and wider adoption. Moreover, current evaluation paradigms lack a systematic approach that covers the full depth and breadth of what fine-tuning entails.

To address these gaps and bridge the divide between research innovations and casual usage, we present our contributions as follows.

1. We develop LyCORIS, an open-source library dedicated to fine-tuning of Stable Diffusion. This library encapsulates a spectrum of methodologies ranging from the most standard LoRA to a number of emerging strategies such as LoHa, LoKr, GLoRA, and (IA)³ that are newer and lesser-explored in the context of text-to-image models.
2. To enable rigorous comparisons between methods, we propose a comprehensive evaluation framework that incorporates a wide range of metrics, capturing key aspects such as concept fidelity, text-image alignment, image diversity, and preservation of the base model’s style.
3. Through extensive experiments, we compare the performances of different fine-tuning algorithms implemented in LyCORIS and assess the impacts of various hyperparameters, offering insights into how these factors influence the results. Concurrently, we underscore the complexities inherent in model evaluation, advocating for the development and adoption of more comprehensive and systematic evaluation processes.

## 2 Preliminary

In this section, we briefly review the two core components of our study: Stable Diffusion and LoRA for model customization.

### 2.1 Stable Diffusion

Diffusion models (Ho et al., 2020; Sohl-Dickstein et al., 2015) are a family of probabilistic generative models that are trained to capture a data distribution through a sequence of denoising operations. Given an initial noise map $x_T \sim \mathcal{N}(0, I)$, the models iteratively refine it by reversing the diffusion process until it is synthesized into a desired image $x_0$. These models can be conditioned on elements such as text prompts, class labels, or low-resolution images, allowing conditioned generation.

Specifically, our work is based on Stable Diffusion, a text-to-image latent diffusion model (Rombach et al., 2022) pretrained on the LAION 5-billion image dataset (Schuhmann et al., 2022). Latent diffusion models reduce the cost of diffusion models by shifting the denoising operation into the latent space of a pre-trained variational autoencoder, composed of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$. During training, noise is added to the encoder’s latent output $z = \mathcal{E}(x_0)$ for each time step $t \in \{0, \ldots, T\}$, resulting in a noisy latent $z_t$. Then, the model is trained to predict the noise applied to $z_t$, given text conditioning $c = \mathcal{T}(l)$ obtained from an image description $l$ (also known as the image’s caption) using a text encoder $\mathcal{T}$. Formally, with $\theta$ denoting the parameters of the denoising U-Net and $\epsilon_\theta(\cdot)$ representing the noise predicted by this model, we aim to minimize

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0, c, \epsilon, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2\right], \tag{1}$$

where $x_0, c$ are drawn from the dataset, $\epsilon \sim \mathcal{N}(0, I)$, and $t$ is uniformly drawn from $\{1, \ldots, T\}$.

### 2.2 Model Customization With LoRA

To enable more personalized experiences, model customization has been proposed as a means to adapt foundational models to specific domains or concepts. In the case of Stable Diffusion, this frequently involves fine-tuning a pretrained model by minimizing the original loss function (1) on a new dataset, containing as few as a single image for each target concept.
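
As an illustration of the objective in Eq. (1), which fine-tuning continues to minimize, the following NumPy sketch draws one Monte Carlo sample of the loss. Several pieces are assumptions made for the example and are not specified in the text: the forward noising rule follows the standard DDPM parameterization of Ho et al. (2020), the linear beta schedule and the toy latent shape are hypothetical, and `eps_model` is a placeholder callable standing in for the denoising U-Net.

```python
import numpy as np


def denoising_loss(eps_model, z0, c, alphas_cumprod, rng):
    """One-sample Monte Carlo estimate of the objective in Eq. (1)."""
    T = len(alphas_cumprod)
    t = int(rng.integers(1, T + 1))      # t uniform on {1, ..., T}
    eps = rng.standard_normal(z0.shape)  # eps ~ N(0, I)
    a_bar = alphas_cumprod[t - 1]
    # Assumed DDPM forward process; the paper only states that noise is added.
    z_t = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps
    residual = eps - eps_model(z_t, t, c)
    return float(np.sum(residual ** 2))  # squared L2 norm


# Hypothetical schedule and toy latent, for illustration only.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)
z0 = np.random.default_rng(0).standard_normal((4, 8, 8))


def zero_model(z_t, t, c):
    """A useless predictor that always outputs zero noise."""
    return np.zeros_like(z_t)


def oracle_model(z_t, t, c):
    """A predictor that knows z0 and inverts the forward process exactly."""
    a_bar = alphas_cumprod[t - 1]
    return (z_t - np.sqrt(a_bar) * z0) / np.sqrt(1.0 - a_bar)


loss_zero = denoising_loss(zero_model, z0, None, alphas_cumprod, np.random.default_rng(1))
loss_oracle = denoising_loss(oracle_model, z0, None, alphas_cumprod, np.random.default_rng(1))
print(loss_zero, loss_oracle)
```

The oracle predictor recovers the sampled noise up to floating-point error, so its loss is essentially zero, while the zero predictor pays the full squared norm of the noise. Customization lowers the same quantity, with the captions of the new dataset supplying the conditioning `c`.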
In this process, we introduce a concept descriptor $[V]$ for each target concept, comprising a neutral trigger word $[V_{\text{trigger}}]$ and an optional class word $[V_{\text{class}}]$ to denote the category to which the concept belongs. This concept descriptor is intended for use in both the image captions and text prompts. While it is possible to include a prior-preservation loss by utilizing a set of regularization images (Kumari et al., 2023; Ruiz et al., 2023), we have chosen not to employ this strategy in the current study.

Low-Rank Adaptation (LoRA). When integrated into the model customization process, Low-Rank Adaptation (LoRA) can substantially reduce the number of parameters that need to be updated. It was originally developed for large language models (Hu et al., 2021), and later adapted for Stable Diffusion by Simo Ryu (2022). LoRA operates by constraining fine-tuning to a low-rank subspace of the original parameter space. More specifically, the weight update $\Delta W \in \mathbb{R}^{p \times q}$ is pre-factorized into two low-rank matrices $B \in \mathbb{R}^{p \times r}$ and $A \in \mathbb{R}^{r \times q}$, where $p, q$ are the dimensions of the original model parameter, $r$ is the dimension of the low-rank matrices, and $r \ll \min(p, q)$. During fine-tuning, the foundational model parameter $W_0$ remains frozen, and only the low-rank matrices are updated. Formally, the forward pass $h' = W_0 h + b$ is modified to

$$h' = W_0 h + b + \gamma \Delta W h = W_0 h + b + \gamma B A h, \tag{2}$$

where $\gamma$ is a merge ratio that balances the retention of pretrained model information and its adaptation to the target concepts.¹ Following Hu et al. (2021), we further define $\alpha = \gamma r$ so that $\gamma = \alpha / r$.

## 3 The LyCORIS Library

Building upon the initiative of LoRA, this section introduces LyCORIS, our open-source library that provides an array of different methods for fine-tuning Stable Diffusion.

### 3.1 Design and Objectives

LyCORIS stands for Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion.
Broadly speaking, the library’s main objective is to serve as a test bed for users to experiment with a variety of fine-tuning strategies for Stable Diffusion models. Seamlessly integrating into the existing ecosystem, LyCORIS is compatible with easy-to-use command-line tools and graphic interfaces, allowing users to leverage the algorithms implemented in the library effortlessly. Additionally, native support exists in popular user interfaces designed for image generation, facilitating the use of models fine-tuned through LyCORIS methods.

For most of the algorithms implemented in LyCORIS, stored parameters naturally allow for the reconstruction of the weight update $\Delta W$. This design brings inherent flexibility: it enables the weight update to be scaled and applied to a base model $W'_0$ different from the one originally used for training, expressed as $W' = W'_0 + \lambda \Delta W$. Furthermore, a weight update can be combined with those from other fine-tuned models, further compressed, or integrated with advanced tools like ControlNet. This opens up a diverse range of possibilities for the application of these fine-tuned models.

### 3.2 Implemented Algorithms

We now discuss the core of the library: the algorithms implemented in LyCORIS. For conciseness, we will primarily focus on three main algorithms: LoRA (LoCon), LoHa, and LoKr. The merge ratio $\gamma = \alpha / r$ introduced in (2) is implemented for all these methods.

> ¹ Setting $\gamma$ is mathematically equivalent to scaling the initialization of $B$ and $A$ by $\sqrt{\gamma}$ and scaling the learning rate by $\sqrt{\gamma}$ or $\gamma$, depending on the optimizer used. See Appendix B.1 for a generalization of this result.

[Figure 1 panels: LoRA, LoHa (ours), and LoKr (ours), each added to a frozen pretrained weight. Legend: matrix product, Hadamard product, Kronecker product.]

Figure 1: This figure shows the structure of the proposed LoHa and LoKr modules implemented in LyCORIS.

LoRA (LoCon). In the work of Hu et al.
(2021), the focus was on applying the low-rank adapter to the attention layers of large language models. In contrast, the convolutional layers play a key role in Stable Diffusion. Therefore, we extend the method to the convolutional layers of diffusion models (details are provided in Appendix B.2). The intuition is that, with more layers involved during fine-tuning, the performance (generated image quality and fidelity) should be better.

LoHa. Inspired by the basic idea underlying LoRA, we explore potential enhancements in fine-tuning methods. In particular, it is well recognized that methods based on matrix factorization suffer from the low-rank constraint. Within the LoRA framework, weight updates are confined to a low-rank space, inevitably impacting the performance of the fine-tuned model. To achieve better fine-tuning performance, we conjecture that a relatively large rank might be necessary, particularly when working with larger fine-tuning datasets or when the data distribution of downstream tasks greatly deviates from the pretraining data. However, this could lead to increased memory usage and greater storage demands.

FedPara (Hyeon-Woo et al., 2022) is a technique originally developed for federated learning that aims to mitigate the low-rank constraint when applying low-rank decomposition methods to federated learning. One of the advantages of FedPara is that the maximum rank of the resulting matrix is larger than that obtained from conventional low-rank decomposition (such as LoRA). More precisely, for $\Delta W = (B_1 A_1) \odot (B_2 A_2)$, where $\odot$ denotes the Hadamard product (element-wise product), $B_1, B_2 \in \mathbb{R}^{p \times r}$, $A_1, A_2 \in \mathbb{R}^{r \times q}$, and $r \le \min(p, q)$, the rank of $\Delta W$ can be as large as $r^2$. To make a fair comparison, we assume the low-rank dimension in equation (2) is $2r$, so that both methods have the same number of trainable parameters. Then, the reconstructed matrix $\Delta W = BA$ has a maximum rank of $2r$. Clearly, $2r < r^2$ if $r > 2$.
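
The rank argument above can be checked numerically. The sketch below is illustrative only: the sizes `p = q = 32` and `r = 4` are assumptions chosen so that $r^2 \le \min(p, q)$, and random Gaussian factors stand in for trained weights. It builds a LoRA update with inner dimension $2r$ and a Hadamard-product update with inner dimension $r$, then compares parameter counts and ranks.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 32, 32, 4  # illustrative sizes with r**2 <= min(p, q)

# LoRA-style update with inner dimension 2r: Delta W = B A
B = rng.standard_normal((p, 2 * r))
A = rng.standard_normal((2 * r, q))
delta_lora = B @ A

# Hadamard-product update with inner dimension r:
# Delta W = (B1 A1) * (B2 A2), where * is the element-wise product
B1, B2 = rng.standard_normal((2, p, r))
A1, A2 = rng.standard_normal((2, r, q))
delta_loha = (B1 @ A1) * (B2 @ A2)

# Both parameterizations store 2r(p + q) trainable numbers.
params_lora = B.size + A.size
params_loha = B1.size + A1.size + B2.size + A2.size

rank_lora = int(np.linalg.matrix_rank(delta_lora))
rank_loha = int(np.linalg.matrix_rank(delta_loha))

print("trainable parameters:", params_lora, params_loha)
print("rank of update:", rank_lora, rank_loha)
```

With generic factors of these sizes, both updates use 512 trainable parameters, the LoRA update reaches the bound $2r = 8$, and the Hadamard-product update reaches $r^2 = 16$. Trained weights need not attain these upper bounds; the example only shows that the bounds differ.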
This implies that decomposing the weight update with the Hadamard product could improve the fine-tuning capability given the same number of trainable parameters. We term this method LoHa (Low-rank adaptation with Hadamard product). The forward pass $h' = W_0 h + b$ is then modified to

$$h' = W_0 h + b + \gamma \Delta W h = W_0 h + b + \gamma \left[(B_1 A_1) \odot (B_2 A_2)\right] h. \tag{3}$$

LoKr. In the same spirit of maximizing matrix rank while minimizing parameter count, our library offers LoKr (Low-rank adaptation with Kronecker product) as another viable option. This method is an extension of the KronA technique, initially proposed by Edalati et al. (2022) for fine-tuning of language models, and employs Kronecker products for matrix decomposition. Importantly, we have adapted this technique to work with convolutional layers, similar to what we achieved with LoCon. A unique advantage of using Kronecker products lies in the multiplicative nature of their ranks, allowing us to move beyond the limitations of low-rank assumptions. Going further, to provide finer granularity for model fine-tuning, we additionally incorporate an optional low-rank decomposition (which users can choose to apply or not) that focuses exclusively on the right block resulting from the Kronecker decomposition.² In summary, writing $\otimes$ for the Kronecker product, the forward pass $h' = W_0 h + b$ is modified to

$$h' = W_0 h + b + \gamma \Delta W h = W_0 h + b + \gamma \left[C \otimes (BA)\right] h. \tag{4}$$

The sizes of these matrices are determined by two user-specified hyperparameters: the factor $f$ and the dimension $r$. With these, we have $C \in \mathbb{R}^{u_p \times u_q}$, $B \in \mathbb{R}^{v_p \times r}$, and $A \in \mathbb{R}^{r \times v_q}$, where

$$u_p = \max\left\{u \le \min(f, \sqrt{p}) \;\middle|\; p \bmod u = 0\right\}, \qquad v_p = \frac{p}{u_p}. \tag{5}$$

The two scalars $u_q$ and $v_q$ are defined in the same way.

> ² As shown in Eq. (5), in our implementation, the right block is always the larger of the two.

Interestingly, LoKr has the widest range of potential parameter counts among the three methods and can yield the smallest file sizes when appropriately configured.
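
To make Eq. (5) concrete, the following sketch computes the Kronecker block sizes and the resulting number of trainable parameters when the low-rank decomposition of the right block is enabled. The function names `lokr_factor` and `lokr_param_count` are hypothetical helpers written for this illustration and are not the LyCORIS API; the layer size `p = q = 320` is likewise an assumed example.

```python
import math


def lokr_factor(dim: int, f: int) -> tuple[int, int]:
    """Split dim into (u, v) as in Eq. (5).

    u is the largest divisor of dim with u <= min(f, sqrt(dim)), and v = dim / u.
    Assumes dim >= 1 and f >= 1.
    """
    bound = min(f, math.isqrt(dim))  # an integer u <= sqrt(dim) iff u <= isqrt(dim)
    u = max(d for d in range(1, bound + 1) if dim % d == 0)
    return u, dim // u


def lokr_param_count(p: int, q: int, f: int, r: int) -> int:
    """Trainable parameters of C (u_p x u_q), B (v_p x r), and A (r x v_q)."""
    u_p, v_p = lokr_factor(p, f)
    u_q, v_q = lokr_factor(q, f)
    return u_p * u_q + r * (v_p + v_q)


# Illustrative layer size (an assumption, not a value from the paper).
p = q = 320
for f in (4, 8, 16):
    print(f, lokr_factor(p, f), lokr_param_count(p, q, f, r=4))
```

Because $u \le \sqrt{\text{dim}}$, the second factor is never smaller than the first, which matches the remark that the right block is always the larger of the two. For comparison, a full update of a $320 \times 320$ weight has 102,400 entries, so the factor and the dimension together span a wide range of parameter budgets.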
Additionally, it can be interpreted as an adapter that is composed of a number of linear layers, as detailed in Appendix B.3.

Others. In addition to LoRA, LoHa, and LoKr described earlier, our library features other algorithms including DyLoRA (Valipour et al., 2022), GLoRA (Chavan et al., 2023), and (IA)³ (Liu et al., 2022). Moreover, between the date of submission and the preparation of the camera-ready version for the main conference, we have further expanded LyCORIS by incorporating more recent advancements, notably OFT (Qiu et al., 2023), BOFT (Liu et al., 2024), and DoRA (Liu et al., 2024). However, the discussion of these supplementary algorithms is beyond the scope of this paper.

## 4 Evaluating Fine-Tuned Text-To-Image Models

With the wide range of algorithmic choices and hyperparameter settings made possible by LyCORIS, one naturally wonders: Is there an optimal algorithm or set of hyperparameters for fine-tuning Stable Diffusion? To tackle this question in a comprehensive manner, it is essential to first establish a clear framework for model evaluation. With this in mind, in this section, we turn our focus to two independent but intertwined components that are crucial for a systematic evaluation of fine-tuned text-to-image models: i) the types of prompts used for image generation and ii) the evaluation of the generated images. While these two components are commonly considered as a single entity in existing literature, explicitly distinguishing between them allows for a more nuanced evaluation of model performance (see Appendix A for a comprehensive overview of related works on text-to-image model evaluation). Below, we explore each of these components in detail.

### 4.1 Classification of Prompts for Image Generation

To fully understand the model’s behavior, it is important to distinguish between different types of prompts that guide image generation.
We categorize these into three main types as follows:

• Training Prompts: These are the prompts originally used for training the model. The images generated from these prompts are expected to closely align with the training dataset, providing insight into how well the model has captured the target concepts.

• Generalization Prompts: These prompts seek to generate images that generalize learned concepts to broader contexts, going beyond the specific types of images encountered in the training set. This includes, for example, combining the innate knowledge of the base model with the learned concepts, combining concepts trained within the same model, and combining concepts trained across different models which are later merged together. Such prompts are particularly useful to evaluate the disentanglement of the learned representations.

• Concept-Agnostic Prompts: These are prompts that deliberately avoid using trigger words from the training set and are often employed to assess concept leak; see, e.g., Kumari et al. (2023). When training also involves class words, this category can be further refined to distinguish between prompts that do and do not use these class words.

### 4.2 Evaluation Criteria

After detailing the different types of prompts that guide the image generation process, the next important step is to identify the aspects that we would like to look at when evaluating the generated images, as we outline below.

• Fidelity measures the extent to which generated images adhere to the target concept.

• Controllability evaluates the model’s ability to generate images that align well with text prompts.

• Diversity assesses the variety of images that are produced from a single prompt or a set of prompts.

• Base Model Preservation measures how much fine-tuning affects the base model’s inherent capabilities, particularly in ways that may be undesirable.
For example, if the target concept is an object, retaining the background and style as generated by the base model might be desired.

• Image Quality concerns the visual appeal of the generated images, focusing primarily on aspects like naturalness, absence of artifacts, and lack of weird deformations. Aesthetics, though related, are considered to be more dependent on the dataset than on the training method, and are therefore not relevant for our purpose.

Taken together, the prompt classification of Section 4.1 and the evaluation criteria listed above offer a nuanced and comprehensive framework for assessing fine-tuned text-to-image models. Notably, these tools also enable us to evaluate other facets of model performance, such as the ability to learn multiple distinct concepts without mutual interference and the capability for parallel training of multiple models that can later be successfully merged.

## 5 Experiments

In this section, we perform extensive experiments to compare different LyCORIS algorithms and to assess the impact of the hyperparameters. Our experiments employ the non-EMA version of Stable Diffusion 1.5 as the base model. All the experimental details not included in the main text, along with presentations of additional experiments, can be found in the appendix.

### 5.1 Dataset

Contrary to prior studies that primarily focus on single-concept fine-tuning with very few images, we consider a dataset that spans a wide variety of concepts with an imbalance in the number of images for each. Our dataset is hierarchically structured, featuring 1,706 images across five categories: anime characters, movie characters, scenes, stuffed toys, and styles. These categories further break down into various classes and sub-classes. Importantly, classes under “scenes” and “stuffed toys” contain only 4 to 12 images, whereas other categories have 45 to 200 images per class.
The influence of training captions on the fine-tuned model is also widely acknowledged in the community. It is particularly observed that training with uninformative captions such as “A photo of [V]”, which are commonly employed in the literature, can lead to subpar results. In light of this, we use a publicly available tagger to tag the training images. We then remove tags that are inherently tied to each target concept. The resulting tags are combined with the concept descriptor to create more informative captions of the form “[V], {tag 1}, ..., {tag k}”. To justify this choice, comparative analyses for models trained using different captions are presented in Appendix H.3.

### 5.2 Algorithm Configuration and Evaluation

Our experiments focus on methods that are implemented in the LyCORIS library, notably LoRA, LoHa, LoKr, and native fine-tuning (note that DreamBooth (Ruiz et al., 2023) can be simply regarded as native fine-tuning with regularization images). For each of these four algorithms, we define a set of default hyperparameters and then individually vary one of the following hyperparameters: learning rate, trained layers, dimension and alpha for LoRA and LoHa, and factor for LoKr. This leads to 26 distinct configurations. For each configuration, three models are trained using different random seeds, and three checkpoints are saved during each fine-tuning run, yielding 26 × 3 × 3 = 234 checkpoints in total. While other parameter-efficient fine-tuning methods exist in the literature, most of the proposed modifications are complementary to our approach. We thus do not include them for simplicity.

Data Balancing. To address dataset imbalance, we repeat each image a number of times within each epoch to ensure images from different classes are equally exposed during training.

Evaluation Procedure.
To evaluate the trained models, we consider the following four types of prompts i) training captions, ii ) concept descriptor alone, iii ) generalization prompts with content alteration, and iv )