Master Stable Diffusion: A Step-by-Step Guide to Textual Inversion
Want to take your AI image generation to the next level? This guide shows you how to use textual inversion with Stable Diffusion to gain precise control over image creation.
Updated for 2024, this tutorial walks you through the process, enabling you to generate images with specific characteristics.
What is Stable Diffusion Textual Inversion?
Textual inversion teaches Stable Diffusion new concepts by introducing unique "words" into its vocabulary. Unlike full fine-tuning, it doesn't alter the model's weights; instead, it learns an embedding for a new token and associates it with the visual features of your training images. This method is computationally efficient and gives you more text-based control over your generated images.
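To see where this ends up: once training has produced an embedding file, diffusers can load it and bind it to a new token at inference time. Here is a minimal sketch, assuming a trained embedding saved as learned_embeds.bin (the filename is illustrative; the token matches the one this tutorial trains below):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Bind the learned embedding to a new "word" usable in prompts
pipe.load_textual_inversion("./learned_embeds.bin", token="<grooty>")
image = pipe("a photo of <grooty> surfing").images[0]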
Why Use Textual Inversion for Stable Diffusion?
- Enhanced Control: Fine-tune image generation by manipulating text inputs.
- Efficient Training: Requires less computational power than other fine-tuning methods.
- Combined Power: Integrates seamlessly with other techniques like DreamBooth for advanced customization.
Getting Started: Setup and Installations
First, install the necessary libraries and clone the Stable Diffusion web UI:
!pip install -qq accelerate tensorboard ftfy
!pip install -qq -U transformers
!pip install -qq -U diffusers
!pip install -qq bitsandbytes
!pip install gradio
# Create the directory we will use for the task
!mkdir inputs_textual_inversion
!git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
Loading Stable Diffusion v1-5: Model Prep
To teach Stable Diffusion new tricks, you need the base model. Clone the repository directly from Hugging Face:
!apt-get install git-lfs && git lfs clone https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
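As a quick sanity check (a minimal sketch, assuming the clone landed in ./stable-diffusion-v1-5), you can load the downloaded checkpoint with diffusers:

from diffusers import StableDiffusionPipeline

# Loading from the local clone confirms the weights downloaded correctly
pipe = StableDiffusionPipeline.from_pretrained("./stable-diffusion-v1-5")
print(pipe.unet.config.sample_size)  # 64 (the latent resolution) for v1-5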
Step-by-Step: Teaching Your New Concept
Textual inversion trains Stable Diffusion to recreate image features by learning the embedding of a new word token. Here’s how to prepare your data:
- Gather Images: Collect 3-5 images representing the concept you want to teach (e.g., a specific object or style).
- Prepare Your Data: Download the sample "Groot" images:
import os
import requests
from io import BytesIO
from PIL import Image

urls = [
    "https://datasets-server.huggingface.co/assets/valhalla/images/--/valhalla--images/train/7/image/image.jpg",
    "https://datasets-server.huggingface.co/assets/valhalla/images/--/valhalla--images/train/5/image/image.jpg",
    "https://datasets-server.huggingface.co/assets/valhalla/images/--/valhalla--images/train/7/image/image.jpg",
]

def download_image(url):
    # Return the image as RGB, or None if the request fails
    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return Image.open(BytesIO(response.content)).convert("RGB")

def image_grid(imgs, rows, cols):
    # Paste the images side by side into a single grid for preview
    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid

images = list(filter(None, [download_image(url) for url in urls]))

save_path = "./inputs_textual_inversion"
os.makedirs(save_path, exist_ok=True)
for i, image in enumerate(images):
    image.save(f"{save_path}/{i}.jpeg")

image_grid(images, 1, len(images))
Defining Your Stable Diffusion Concept: Names & Tokens
Define the concept you’re teaching Stable Diffusion:
concept_name = "grooty"
initializer_token = "groot"
what_to_teach = "object"
placeholder_token = f'<{concept_name}>'
- `concept_name`: A simple name for your concept.
- `initializer_token`: A word that summarizes the object or style of the concept.
- `what_to_teach`: Specifies whether you're teaching an "object" or a "style."
- `placeholder_token`: A unique token to represent your new concept in prompts.
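A trivial check, just to make the mechanics concrete: with the values above, the placeholder expands into prompts like this.

example_prompt = f"a photo of a {placeholder_token}"
print(example_prompt)  # a photo of a <grooty>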
Crafting the Dataset: Stable Diffusion Prompts
Create prompt templates to guide the model; during training, the `{}` in each template is filled in with your `placeholder_token`. Modify these templates to suit your specific images.
imagenet_templates_small = [
    "a photo of a {}",
    "a rendering of a {}",
    "a cropped photo of the {}",
    "the photo of a {}",
    "a photo of a clean {}",
    "a photo of a dirty {}",
    "a dark photo of the {}",
    "a photo of my {}",
    "a photo of the cool {}",
    "a close-up photo of a {}",
    "a bright photo of the {}",
    "a cropped photo of a {}",
    "a photo of the {}",
    "a good photo of the {}",
    "a photo of one {}",
    "a close-up photo of the {}",
    "a rendition of the {}",
    "a photo of the clean {}",
    "a rendition of a {}",
    "a photo of a nice {}",
    "a good photo of a {}",
    "a photo of the nice {}",
    "a photo of the small {}",
    "a photo of the weird {}",
    "a photo of the large {}",
    "a photo of a cool {}",
    "a photo of a small {}",
]
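To make the connection to training concrete: each training example pairs one of your images with a template filled in by the placeholder token. Below is a minimal sketch of such a dataset, loosely modeled on Hugging Face's textual-inversion example; the class name and details are illustrative, not the exact library code:

import random
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class TextualInversionDataset(Dataset):
    def __init__(self, data_root, tokenizer, placeholder_token, size=512):
        self.image_paths = sorted(Path(data_root).glob("*.jpeg"))
        self.tokenizer = tokenizer
        self.placeholder_token = placeholder_token
        self.transform = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),  # scale pixels to [-1, 1]
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Pair each image with a randomly chosen template prompt
        text = random.choice(imagenet_templates_small).format(self.placeholder_token)
        input_ids = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.tokenizer.model_max_length,
            return_tensors="pt",
        ).input_ids[0]
        image = Image.open(self.image_paths[idx]).convert("RGB")
        return {"input_ids": input_ids, "pixel_values": self.transform(image)}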
Loading Model Files: Bringing in Stable Diffusion
pretrained_model_name_or_path = "stable-diffusion-v1-5/stable-diffusion-v1-5"
Set `pretrained_model_name_or_path` to your desired Stable Diffusion checkpoint.
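Besides the tokenizer (loaded in the next section), the training recipe also needs the checkpoint's individual components. Here is a minimal sketch of loading them from the checkpoint's subfolders, following the standard diffusers layout:

from transformers import CLIPTextModel
from diffusers import AutoencoderKL, UNet2DConditionModel

# Each component lives in its own subfolder of the checkpoint
text_encoder = CLIPTextModel.from_pretrained(pretrained_model_name_or_path, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(pretrained_model_name_or_path, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(pretrained_model_name_or_path, subfolder="unet")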
Setting Up Your New Token: Expanding the Lexicon
Add the placeholder token to the tokenizer so it becomes part of the vocabulary:
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained(
    pretrained_model_name_or_path,
    subfolder="tokenizer",
)

# Add the placeholder token to the tokenizer
num_added_tokens = tokenizer.add_tokens(placeholder_token)
if num_added_tokens == 0:
    raise ValueError(
        f"The tokenizer already contains the token {placeholder_token}. Please pass a different"
        " `placeholder_token` that is not already in the tokenizer."
    )
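With the token registered, the text encoder's embedding table has to grow by one row, and the new row is initialized from the initializer token's embedding so training starts from a sensible point. A minimal sketch of this step, assuming `text_encoder` was loaded as shown earlier:

# The initializer must map to exactly one existing token
token_ids = tokenizer.encode(initializer_token, add_special_tokens=False)
if len(token_ids) != 1:
    raise ValueError("The initializer token must map to a single token.")
initializer_token_id = token_ids[0]
placeholder_token_id = tokenizer.convert_tokens_to_ids(placeholder_token)

# Grow the embedding table and seed the new token's vector
text_encoder.resize_token_embeddings(len(tokenizer))
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = token_embeds[initializer_token_id]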
Unleash New Levels of Creativity
By following this tutorial, you can effectively harness the power of textual inversion to mold Stable Diffusion to your creative vision. Experiment with different concepts and prompts to discover the endless possibilities of AI-powered image generation.