Stable Diffusion Textual Inversion: Control Your AI Image Generation
Excited about AI image generation but want more control? This tutorial dives into Stable Diffusion textual inversion, a powerful technique to fine-tune your results. Learn how to teach Stable Diffusion new concepts and styles by adding custom "words" to its vocabulary. We'll guide you step-by-step, making complex AI accessible and fun!
What is Stable Diffusion Textual Inversion?
Textual inversion allows you to "teach" Stable Diffusion specific concepts or styles without retraining the entire model. Instead, you create a new "word" (a unique token) that represents the desired concept. When this token is used in a prompt, Stable Diffusion will generate images incorporating that specific concept. This empowers you with more creative control for generating novel images for personal or commercial projects.
Why Use Textual Inversion?
- Precise Control: Finely control the generated images by teaching the model new objects and styles.
- Computational Efficiency: Less demanding than full model fine-tuning, making it accessible to more users.
- Combine with Dreambooth: Enhance your existing Dreambooth models with even greater stylistic control.
Setting Up Your Environment for Textual Inversion
Before we begin, ensure you have the necessary libraries installed. This setup prepares your system for training a unique "word" token.
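A typical setup, assuming a working Python environment with pip (exact package names and versions are a reasonable baseline, not a requirement of this tutorial), looks like this:

```shell
# Install the libraries used throughout this tutorial.
# diffusers/transformers provide the model components,
# torchvision the image transforms, accelerate faster training.
pip install diffusers transformers accelerate torch torchvision numpy Pillow
```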
Preparing Your Training Data
The quality of your training data directly impacts the effectiveness of textual inversion. For consistent results, 3 to 5 images of your concept are usually sufficient.
- Gather Images: Find images that represent the concept you want to teach Stable Diffusion.
- Save Images: Download these images to a dedicated folder such as `inputs_textual_inversion`.
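As a small sketch, the folder from the step above can be prepared and checked programmatically (the folder name `inputs_textual_inversion` comes from the step above; the image files themselves you copy in by hand):

```python
import os

data_root = "inputs_textual_inversion"  # folder name from the step above
os.makedirs(data_root, exist_ok=True)

# After copying your 3-5 training images into the folder, verify they are found.
image_files = [f for f in os.listdir(data_root)
               if f.lower().endswith((".png", ".jpg", ".jpeg"))]
print(f"Found {len(image_files)} training image(s) in {data_root!r}")
```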
Defining Your New Concept
Let's tell Stable Diffusion what you're teaching it!
- Concept Name: Choose a descriptive name for your concept (e.g., "grooty").
- Initializer Token: Select a "seed" word that relates to your concept (e.g., "groot").
- Placeholder Token: Create a unique token, wrapped in angle brackets, to represent your concept (e.g., "&lt;grooty&gt;").
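Collected in code, the three names might look like the following (the concrete strings are just the examples above):

```python
concept_name = "grooty"           # descriptive name for the concept
initializer_token = "groot"       # existing word whose embedding seeds the new one
placeholder_token = f"<{concept_name}>"  # unique token used in prompts

# The angle brackets keep the placeholder from colliding with any
# existing word in the tokenizer's vocabulary.
print(placeholder_token)  # → <grooty>
```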
Create a Dataset for Textual Inversion
The dataset is critical for teaching Stable Diffusion the new concept.
- Prompt Templates: Select or create sentences that use your `placeholder_token` to describe the concept.
- Dataset Class: Use the class below to create the training data.
```python
import os
import random

import numpy as np
import PIL
from PIL import Image
import torch
from torch.utils.data import Dataset
from torchvision import transforms

# Abridged prompt templates; the full lists live in the Hugging Face
# diffusers textual inversion example.
imagenet_templates_small = [
    "a photo of a {}",
    "a rendering of a {}",
    "a cropped photo of the {}",
    "a photo of a clean {}",
    "a photo of a dirty {}",
]

imagenet_style_templates_small = [
    "a painting in the style of {}",
    "a rendering in the style of {}",
    "a cropped painting in the style of {}",
]


class TextualInversionDataset(Dataset):
    def __init__(
        self,
        data_root,
        tokenizer,
        learnable_property="object",  # [object, style]
        size=512,
        repeats=100,
        interpolation="bicubic",
        flip_p=0.5,
        set="train",
        placeholder_token="*",
        center_crop=False,
    ):
        self.data_root = data_root
        self.tokenizer = tokenizer
        self.learnable_property = learnable_property
        self.size = size
        self.placeholder_token = placeholder_token
        self.center_crop = center_crop
        self.flip_p = flip_p

        self.image_paths = [os.path.join(self.data_root, file_path) for file_path in os.listdir(self.data_root)]

        self.num_images = len(self.image_paths)
        self._length = self.num_images

        if set == "train":
            self._length = self.num_images * repeats

        self.interpolation = {
            "linear": PIL.Image.BILINEAR,  # PIL.Image.LINEAR was removed in Pillow 10
            "bilinear": PIL.Image.BILINEAR,
            "bicubic": PIL.Image.BICUBIC,
            "lanczos": PIL.Image.LANCZOS,
        }[interpolation]

        self.templates = imagenet_style_templates_small if learnable_property == "style" else imagenet_templates_small
        self.flip_transform = transforms.RandomHorizontalFlip(p=self.flip_p)

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        example = {}
        image = Image.open(self.image_paths[i % self.num_images])

        if not image.mode == "RGB":
            image = image.convert("RGB")

        placeholder_string = self.placeholder_token
        text = random.choice(self.templates).format(placeholder_string)

        example["input_ids"] = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.tokenizer.model_max_length,
            return_tensors="pt",
        ).input_ids[0]

        # default to score-sde preprocessing
        img = np.array(image).astype(np.uint8)

        if self.center_crop:
            crop = min(img.shape[0], img.shape[1])
            h, w = img.shape[0], img.shape[1]
            img = img[(h - crop) // 2 : (h + crop) // 2, (w - crop) // 2 : (w + crop) // 2]

        image = Image.fromarray(img)
        image = image.resize((self.size, self.size), resample=self.interpolation)

        image = self.flip_transform(image)
        image = np.array(image).astype(np.uint8)
        image = (image / 127.5 - 1.0).astype(np.float32)

        example["pixel_values"] = torch.from_numpy(image).permute(2, 0, 1)
        return example
```
Load Stable Diffusion Model Files
Load the base Stable Diffusion model to prepare it for textual inversion.
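A sketch of loading the individual components with the diffusers library; the model id `runwayml/stable-diffusion-v1-5` is one common choice, and actually calling this downloads several gigabytes of weights:

```python
def load_sd_components(model_id="runwayml/stable-diffusion-v1-5"):
    """Load the four components textual inversion needs from a pretrained checkpoint."""
    # Imports are inside the function so the sketch parses without the
    # libraries installed; calling it requires diffusers and transformers.
    from transformers import CLIPTextModel, CLIPTokenizer
    from diffusers import AutoencoderKL, UNet2DConditionModel

    tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
    text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
    vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
    unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
    return tokenizer, text_encoder, vae, unet

# tokenizer, text_encoder, vae, unet = load_sd_components()  # requires network access
```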
Adding Your Unique Token
Make sure Stable Diffusion recognizes your new concept.
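One way to register the placeholder, sketched after the diffusers textual inversion example (the default token strings follow the earlier examples and are assumptions): add the token to the tokenizer, grow the text encoder's embedding matrix, and seed the new row with the initializer token's embedding.

```python
def add_placeholder_token(tokenizer, text_encoder,
                          placeholder_token="<grooty>",
                          initializer_token="groot"):
    """Add the placeholder to the vocabulary and seed its embedding."""
    num_added = tokenizer.add_tokens(placeholder_token)
    if num_added == 0:
        raise ValueError(f"Tokenizer already contains {placeholder_token!r}; pick another name.")

    # The initializer must map to exactly one existing token.
    token_ids = tokenizer.encode(initializer_token, add_special_tokens=False)
    if len(token_ids) > 1:
        raise ValueError("The initializer token must be a single token.")
    initializer_token_id = token_ids[0]
    placeholder_token_id = tokenizer.convert_tokens_to_ids(placeholder_token)

    # Grow the embedding matrix and copy the initializer's vector into the new slot.
    text_encoder.resize_token_embeddings(len(tokenizer))
    token_embeds = text_encoder.get_input_embeddings().weight.data
    token_embeds[placeholder_token_id] = token_embeds[initializer_token_id]
    return placeholder_token_id
```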
Next Steps
This tutorial covers the initial steps of textual inversion. Subsequent steps would cover training the new embedding, loading the learned "word" back into the model, and generating images with it.