Unlock Limitless Creativity: A Comprehensive Guide to Stable Diffusion Textual Inversion

Stable Diffusion is a capable image generator, but getting it to produce exactly what you envision can be tricky with prompts alone. This guide walks through the setup for Textual Inversion, a technique for teaching the model a new concept from a handful of example images so you can generate highly customized results.

What is Stable Diffusion Textual Inversion?

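Textual Inversion is a method for teaching Stable Diffusion new concepts or styles by creating unique "words" (tokens) associated with specific image features. Training learns an embedding for the new token in the text encoder while the rest of the model stays frozen, so the result is a small add-on rather than a new checkpoint. It allows fine-grained control over your generated images, going beyond simple prompt engineering. It's like teaching your AI a new word in its visual language.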

More Control, Better Results: Textual Inversion lets you inject specific artistic styles, object details, or even personal characteristics into your images.
Fine-Tuning Without the Heavy Lifting: Unlike full model retraining, Textual Inversion is computationally efficient, requiring less processing power and time.
Expand Your Creative Palette: Combine Textual Inversion with other Stable Diffusion techniques like DreamBooth for additional creative control.
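
To make the idea concrete, here is a toy sketch (not Stable Diffusion code) of what "learning a new word" means: an embedding table whose existing rows stay fixed while a single new row is nudged toward a target. The two-dimensional vectors, the target, and the learning rate are all made-up assumptions for illustration.

```python
# Toy illustration only: a tiny "vocabulary" of embedding vectors.
# Every name and number here is invented for the example.
embeddings = {
    "cat": [0.9, 0.1],
    "tree": [0.1, 0.9],
}
frozen_snapshot = {word: vec[:] for word, vec in embeddings.items()}

# Initialise the new token from an existing, related word.
embeddings["<grooty>"] = embeddings["tree"][:]

target = [0.3, 0.7]    # stand-in for "what the training images look like"
learning_rate = 0.25

for _ in range(50):
    vec = embeddings["<grooty>"]
    # Gradient of the squared error sum((v - t)^2) with respect to v is 2 * (v - t).
    grad = [2 * (v - t) for v, t in zip(vec, target)]
    # Only the new row is updated; every other row is left untouched.
    embeddings["<grooty>"] = [v - learning_rate * g for v, g in zip(vec, grad)]

print(embeddings["<grooty>"])                      # close to the target
print(embeddings["cat"] == frozen_snapshot["cat"])  # True: existing rows unchanged
```

In the real setting the "target" is not given directly; the gradient comes from the diffusion model's denoising loss on the training images, but the update still touches only the new token's embedding.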

Setting up Textual Inversion: A Step-by-Step Guide

Ready to dive in? The steps below set up Textual Inversion for one concrete example: teaching the model to recognize a specific object, a plastic toy Groot from Guardians of the Galaxy.

1. Essential Installations & Setup

First, install the necessary libraries and create directories for your project:

!pip install -qq accelerate tensorboard ftfy
!pip install -qq -U transformers
!pip install -qq -U diffusers
!pip install -qq bitsandbytes
!pip install gradio
!mkdir inputs_textual_inversion
!git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui

2. Import Libraries & Helper Functions

Import the required Python libraries and define a helper function to display images:

import argparse
import itertools
import math
import os
import random

import numpy as np
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch.utils.data import Dataset

import PIL
from accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import set_seed
from diffusers import AutoencoderKL, DDPMScheduler, PNDMScheduler, StableDiffusionPipeline, UNet2DConditionModel
# diffusers.hub_utils (init_git_repo, push_to_hub) is omitted: it is unused in this guide and missing from recent diffusers releases.
from diffusers.optimization import get_scheduler
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from PIL import Image
from torchvision import transforms
from tqdm.auto import tqdm
from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer

## Define a helper function that tiles images into a grid
def image_grid(imgs, rows, cols):
    assert len(imgs) == rows * cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols * w, rows * h))

    # Paste each image at its (column, row) offset
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid
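
The only non-obvious part of the helper is the paste offset. As a small sketch, the function below (a hypothetical name, `grid_positions`, not part of the guide's code) reproduces just that arithmetic so you can see where each image lands.

```python
def grid_positions(num_images, cols, w, h):
    """Return the top-left (x, y) paste offset for each image index.

    Mirrors the box=(i % cols * w, i // cols * h) expression:
    the remainder selects the column, the integer division selects the row.
    """
    return [(i % cols * w, i // cols * h) for i in range(num_images)]


# Four 512x512 images arranged in a 2x2 grid
positions = grid_positions(num_images=4, cols=2, w=512, h=512)
print(positions)  # [(0, 0), (512, 0), (0, 512), (512, 512)]
```

Images fill the grid row by row, left to right.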

3. Load Stable Diffusion Model

Select your Stable Diffusion checkpoint. You can either specify a local path or download the model from Hugging Face:

## Checkpoint to load: a local directory if one exists at this path, otherwise a Hugging Face Hub repo ID
pretrained_model_name_or_path = "stable-diffusion-v1-5/stable-diffusion-v1-5" #@param {type:"string"}

## Download online files
#@markdown Please read and, if you agree, accept the LICENSE [here](https://huggingface.co/runwayml/stable-diffusion-v1-5) if you see an error
# pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5" #@param {type:"string"}

4. Gathering Your Training Images

Collect a set of images representing the concept you want to teach Stable Diffusion. 3-5 images are usually sufficient to start. For this example, we will use images of a plastic toy Groot.

#@markdown Add here the URLs to the images of the concept you are adding. 3-5 should be fine
urls = [
 "https://datasets-server.huggingface.co/assets/valhalla/images/--/valhalla--images/train/7/image/image.jpg",
 "https://datasets-server.huggingface.co/assets/valhalla/images/--/valhalla--images/train/5/image/image.jpg",
 "https://datasets-server.huggingface.co/assets/valhalla/images/--/valhalla--images/train/7/image/image.jpg"]

Download and save the images to the designated directory:

# @title Setup and check the images you have just added
import requests
import glob
from io import BytesIO

def download_image(url):
    # Return the image as RGB, or None if the request fails
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return Image.open(BytesIO(response.content)).convert("RGB")

# Keep only the downloads that succeeded
images = list(filter(None, [download_image(url) for url in urls]))
save_path = "./inputs_textual_inversion"
os.makedirs(save_path, exist_ok=True)
for i, image in enumerate(images):
    image.save(f"{save_path}/{i}.jpeg")
image_grid(images, 1, len(images))
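
A URL list can easily end up with the same address twice when entries are copy-pasted, which would put the same training image in the folder more than once. One possible safeguard, sketched here with placeholder `example.com` addresses that are assumptions and not real image links, is to drop repeats while keeping the original order:

```python
# Hypothetical URLs for illustration; the first and last are identical.
example_urls = [
    "https://example.com/images/a.jpg",
    "https://example.com/images/b.jpg",
    "https://example.com/images/a.jpg",
]

# dict keys are unique and preserve insertion order, so this removes
# repeated entries without reshuffling the list.
unique_urls = list(dict.fromkeys(example_urls))

print(len(example_urls), "->", len(unique_urls))  # 3 -> 2
```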

5. Defining Your New Concept

Define the key parameters for your new concept:

#@title Settings for your newly created concept

concept_name = "grooty"

#@markdown `initializer_token` is a word that can summarise what your new concept is, to be used as a starting point

initializer_token = "groot" #@param {type:"string"}

#@markdown `what_to_teach`: what is it that you are teaching? `object` enables you to teach the model a new object to be used, `style` allows you to teach the model a new style one can use.

what_to_teach = "object" #@param ["object", "style"]

#@markdown `placeholder_token` is the token you are going to use to represent your new concept (so when you prompt the model, you will say "A `<grooty>` in an amusement park"). We use angle brackets to differentiate a token from other words/tokens, to avoid collision.

placeholder_token = f'<{concept_name}>'

concept_name: A short, descriptive name for your concept (e.g., "grooty").
initializer_token: A word that closely resembles your concept (e.g., "groot").
what_to_teach: Specify whether you're teaching an "object" or a "style."
placeholder_token: A unique token enclosed in angle brackets that will represent your new concept in prompts (e.g., "<grooty>").
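
As a quick illustration of how these settings fit together, the sketch below builds the placeholder token from the concept name and drops it into a prompt. The prompt wording is just an example.

```python
concept_name = "grooty"

# Angle brackets keep the new token from colliding with ordinary words.
placeholder_token = f"<{concept_name}>"

# After training, the placeholder is used in prompts like any other word.
prompt = f"A {placeholder_token} in an amusement park"
print(prompt)  # A <grooty> in an amusement park
```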

6. Setting Up the Prompt Templates

Create prompts to help the model associate the placeholder token with visual features:

#@title Setup the prompt templates for training
imagenet_templates_small = [
 "a photo of a {}",
 "a rendering of a {}",
 "a cropped photo of the {}",
 "the photo of a {}",
 "a photo of a clean {}",
 "a photo of a dirty {}",
 "a dark photo of the {}",
 "a photo of my {}",
 "a photo of the cool {}",
 "a close-up photo of a {}",
 "a bright photo of the {}",
 "a cropped photo of a {}",
 "a photo of the {}",
 "a good photo of the {}",
 "a photo of one {}",
 "a close-up photo of the {}",
 "a rendition of the {}",
 "a photo of the clean {}",
 "a rendition of a {}",
 "a photo of a nice {}",
 "a good photo of a {}",
 "a photo of the nice {}",
 "a photo of the small {}",
 "a photo of the weird {}",
 "a photo of the large {}",
 "a photo of a cool {}",
 "a photo of a small {}",
]

imagenet_style_templates_small = [
 "a painting in the style of {}",
 "a rendering in the style of {}",
 "a cropped painting in the style of {}",
 "the painting in the style of {}",
 "a clean painting in the style of {}",
 "a dirty painting in the style of {}",
 "a dark painting in the style of {}",
 "a picture in the style of {}",
 "a cool painting in the style of {}",
 "a close-up painting in the style of {}",
 "a bright painting in the style of {}",
 "a cropped painting in the style of {}",
 "a good painting in the style of {}",
 "a close-up painting in the style of {}",
 "a rendition in the style of {}",
 "a nice painting in the style of {}",
 "a small painting in the style of {}",
 "a weird painting in the style of {}",
 "a large painting in the style of {}",
]
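
To see how the templates are meant to be used, here is a minimal sketch that picks a template at random and fills in the placeholder token. The two short lists are a subset of the templates above, and `make_training_prompt` is a hypothetical helper name introduced only for this example.

```python
import random

# Small subsets of the template lists, kept short for the example
object_templates = ["a photo of a {}", "a close-up photo of the {}"]
style_templates = ["a painting in the style of {}", "a rendition in the style of {}"]


def make_training_prompt(placeholder_token, what_to_teach, rng):
    """Choose a template matching the concept type and insert the token."""
    templates = style_templates if what_to_teach == "style" else object_templates
    return rng.choice(templates).format(placeholder_token)


rng = random.Random(0)  # seeded so repeated runs pick the same templates
object_prompt = make_training_prompt("<grooty>", "object", rng)
style_prompt = make_training_prompt("<grooty>", "style", rng)
print(object_prompt)
print(style_prompt)
```

Varying the surrounding words while keeping the placeholder fixed encourages the new embedding to capture the concept itself rather than any one phrasing.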

7. Creating the Dataset

Create a dataset class to manage the training images and prompts:

#@title Setup the dataset
class TextualInversionDataset(Dataset):
    def __init__(
        self,
        data_root,
        tokenizer,
        learnable_property="object",  # [object, style]
        size=512,
        repeats=100,
        interpolation="bicubic",
        flip_p=0.5,
        set="train",
        placeholder_token="*",
        center_crop=False,
    ):

        self.data_root = data_root
        self.tokenizer = tokenizer
        self.learnable_property = learnable_property
        self.size = size
        self.placeholder_token = placeholder_token
        self.center_crop = center_crop
        self.flip_p = flip_p

        self.image_paths = [os.path.join(self.data_root, file_path) for file_path in os.listdir(self.data_root)]

        self.num_images = len(self.image_paths)
        self._length = self.num_images

        # In training mode, each image is visited `repeats` times per epoch
        if set == "train":
            self._length = self.num_images * repeats

        # Pillow 10 removed Image.LINEAR; use the Resampling enum (Pillow >= 9.1)
        self.interpolation = {
            "linear": PIL.Image.Resampling.BILINEAR,
            "bilinear": PIL.Image.Resampling.BILINEAR,
            "bicubic": PIL.Image.Resampling.BICUBIC,
            "lanczos": PIL.Image.Resampling.LANCZOS,
        }[interpolation]

        self.templates = imagenet_style_templates_small if learnable_property == "style" else imagenet_templates_small
        self.flip_transform = transforms.RandomHorizontalFlip(p=self.flip_p)

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        example = {}
        # Cycle through the images so indices beyond num_images wrap around
        image = Image.open(self.image_paths[i % self.num_images])

        if not image.mode == "RGB":
            image = image.convert("RGB")

        # Build a random prompt containing the placeholder token
        placeholder_string = self.placeholder_token
        text = random.choice(self.templates).format(placeholder_string)

        example["input_ids"] = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.tokenizer.model_max_length,
            return_tensors="pt",
        ).input_ids[0]

        # default to score-sde preprocessing
        img = np.array(image).astype(np.uint8)

        if self.center_crop:
            crop = min(img.shape[0], img.shape[1])
            h, w, = (
                img.shape[0],
                img.shape[1],
            )
            img = img[(h - crop) // 2 : (h + crop) // 2, (w - crop) // 2 : (w + crop) // 2]

        image = Image.fromarray(img)
        image = image.resize((self.size, self.size), resample=self.interpolation)

        image = self.flip_transform(image)
        image = np.array(image).astype(np.uint8)
        # Scale pixel values from [0, 255] to [-1, 1]
        image = (image / 127.5 - 1.0).astype(np.float32)

        # Reorder from (height, width, channels) to (channels, height, width)
        example["pixel_values"] = torch.from_numpy(image).permute(2, 0, 1)
        return example
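
Two pieces of arithmetic in the dataset class are easy to miss: how a few images become a long dataset, and how pixel values are rescaled. The sketch below isolates both in plain Python, assuming three training images and the default `repeats=100`; the helper names are made up for this example.

```python
num_images = 3
repeats = 100

# In "train" mode the dataset reports num_images * repeats items.
dataset_length = num_images * repeats


def image_index(i):
    """Map a dataset index back to one of the real images by wrapping around."""
    return i % num_images


def normalise_pixel(value):
    """Rescale a pixel from the [0, 255] range to [-1, 1]."""
    return value / 127.5 - 1.0


print(dataset_length)                                  # 300
print([image_index(i) for i in range(5)])              # [0, 1, 2, 0, 1]
print(normalise_pixel(0), normalise_pixel(255))        # -1.0 1.0
```

Each pass over the 300 indices therefore shows every image 100 times, each time paired with a freshly chosen prompt template and a random horizontal flip.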

8. Tokenizer Loading and Special Token Addition

Load the CLIP tokenizer and add the placeholder token as a special token:

#@title Load the tokenizer and add the placeholder token as an additional special token.
#@markdown Please read and, if you agree, accept the LICENSE [here](https://huggingface.co/runwayml/stable-diffusion-v1-5) if you see an error
tokenizer = CLIPTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    subfolder="tokenizer",
)

# Add the placeholder token to the tokenizer; add_tokens returns how many tokens were actually new
num_added_tokens = tokenizer.add_tokens(placeholder_token)
if num_added_tokens == 0:
    raise ValueError(
        f"The tokenizer already contains the token {placeholder_token}. Please pass a different"
        " `placeholder_token` that is not already in the tokenizer."
    )
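
The collision check relies on `add_tokens` returning the number of tokens that were actually added. The sketch below shows that contract with a toy stand-in class; `ToyTokenizer` and `register_placeholder` are hypothetical and are not the CLIPTokenizer API, they only mimic the return value so the example runs without downloading a model.

```python
class ToyTokenizer:
    """Hypothetical stand-in that mimics only the add_tokens return value."""

    def __init__(self, vocab):
        self.vocab = {token: idx for idx, token in enumerate(vocab)}

    def add_tokens(self, token):
        # Return 0 if the token already exists, 1 if it was newly added.
        if token in self.vocab:
            return 0
        self.vocab[token] = len(self.vocab)
        return 1


def register_placeholder(tokenizer, placeholder_token):
    """Add the placeholder and return its id, refusing to reuse an existing token."""
    if tokenizer.add_tokens(placeholder_token) == 0:
        raise ValueError(f"The tokenizer already contains the token {placeholder_token}.")
    return tokenizer.vocab[placeholder_token]


toy = ToyTokenizer(["a", "photo", "of", "groot"])
new_id = register_placeholder(toy, "<grooty>")
print(new_id)  # 4: the new token is appended after the existing vocabulary
```

Calling `register_placeholder(toy, "<grooty>")` a second time raises `ValueError`, which is the situation the check in the guide protects against.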

Why Use Stable Diffusion with Textual Inversion?

Stable Diffusion with Textual Inversion provides granular control, allowing you to:

Introduce new objects: Add a specific object, such as the toy Groot used in this guide, to your generated scenes by training a token for it.
Replicate artistic styles: Master the styles of your favorite artists and apply them to your creations.
Personalize your images: Insert unique characteristics to create truly one-of-a-kind visuals.

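By following this guide, you have put in place the pieces Textual Inversion needs: the libraries, the training images, the prompt templates, the dataset class, and a tokenizer that knows your placeholder token. From here, training the new token's embedding and a little experimentation with prompts are what stand between you and detailed, personalized images.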
