Dreambooth Hyperparameter Guide

A technical guide for getting better stable diffusion dreambooth results.

Jef Packer
12 min read · Jan 23, 2023

This guide is for those who understand the basics of dreambooth, are training models, and want to get better results on their models. Through making models I’ve developed a few best practices and insights.

I start with selecting images and generating class images. Then I review training parameter choices. Finally, I show how to pick the best checkpoint along with writing prompts.

I won’t be reviewing how dreambooth works, or where to run it. Some great resources include the dreambooth hackathon page, the dreambooth paper, and the huggingface diffusers class (unit 3). For a complete list check out the Resources section at the bottom.

Setup

I used the Hugging Face sd_dreambooth_training notebook, for anyone wanting to follow along.

Instance Prompt

First, a subject prompt is needed. There are two key variables here: instance_prompt and class_prompt. The instance_prompt contains both your unique identifier and the class identifier; the class_prompt contains only the class identifier. Using “photo of” helps keep the images less abstract, especially when generating class images.

Here are some examples:

instance_prompt = "photo of jefsnacker person"
prior_preservation_class_prompt = "photo of a person"
instance_prompt = "photo of azzy cat"
prior_preservation_class_prompt = "photo of a cat"
# leva is short for lunar extra vehicular activity (moon landscape)
instance_prompt = "leva landscape"
prior_preservation_class_prompt = "a landscape"
# sks is the original instance prompt used in the dreambooth paper
instance_prompt = "sks dog"
prior_preservation_class_prompt = "a dog"

It should not look like this:

instance_prompt = "azzy"
prior_preservation_class_prompt = "cat"

Note that the choice of unique identifier word is important. If an unused word is chosen, the model will have no pre-existing notion of what the output images should look like. If a real word is chosen, training starts from that existing meaning, and the model may output images somewhere between the training images and the pre-existing notion. Be careful here and know the consequences.
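One way to sanity-check a candidate identifier is to see how the CLIP tokenizer splits it. This is a minimal sketch, assuming the openai/clip-vit-large-patch14 tokenizer used by Stable Diffusion 1.x models: a rare identifier typically splits into several sub-word pieces rather than mapping to a single common word.

from transformers import CLIPTokenizer

# Tokenizer used by Stable Diffusion 1.x text encoders (assumption: an SD 1.x base model).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

print(tokenizer.tokenize("jefsnacker"))  # e.g. several sub-word pieces -> little pre-existing meaning
print(tokenizer.tokenize("dog"))         # a single common token -> strong existing meaning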

Subject Images

This is one of the most important parts of the process: the training images need to be representative of the subject.

Picking photos: It’s good to have a variety of photos with similar expressions. I’ll use an example of training a model for my cat Azriel. I’ll give an example of a training set that didn’t work for me, and then one that did.

This is a bad example of a training set.

Problem 1: Eyes and mouth

Notice how in some pictures Azzy’s eyes are closed and in others they’re open? Imagine a trained model trying to reconstruct a picture: it may pick eyes that are halfway open, it may pick closed, and as the model overfits it may try to do both at the same time. All of these could look bad. The same concept applies to makeup and facial expressions. Try to keep these things consistent across all training photos.

Verdict: Keep the subject’s face consistent (eyes, mouth, makeup, etc.). It’s okay for 5% of the training images to be different, but nowhere near half.

Problem 2: Background

Notice how the first two images have a furry gray/orange thing in the background? As the model gets more fit, recurring background patterns can start to emerge in the output images, especially if they look like part of the subject (like his fur).

Verdict: Use a variety of backgrounds. Avoid repeating backgrounds.

Problem 3: Important features

Diffusion models generally have problems with intricate features such as fingers, eyes, and teeth. To help remedy this, it’s good to include a training image or two that highlight these features. Generating too many fingers, in particular, is a very common failure. So for Azzy, I’ll include extra images of his paws.

Verdict: Include images with important features like hands, fingers, paws, and teeth. Whatever’s important for the model to get correct.

Problem 4: Subject Size

When selecting photos it’s good to have ones at different distances. The model will only be able to produce images similar to what it has seen. For example, if all the training images are portraits, then it will have a hard time producing a full-body shot, because it won’t know what the face looks like from far away. This training set has a good variety, so distance isn’t a problem here.

Verdict: Include pictures at different distances including portrait, full-body, and mid-shots. Most importantly, include the type that the model will output.

New training set: with these points in mind, here’s a better set of training images.

A better set of training images.

Other important points when selecting your images are cropping and the number of photos.

Cropping: The model expects a specific image size (normally 512x512, but make sure to check!). Images that aren’t this size will be automatically cropped and resized, so you’ll have much more control if you crop them yourself before training.
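Here’s a minimal pre-cropping sketch using PIL; the folder names raw_photos and instance_images and the 512 target are placeholders for your own setup.

from pathlib import Path
from PIL import Image, ImageOps

SIZE = 512  # match the resolution your base model expects

src = Path("raw_photos")       # hypothetical folder of original photos
dst = Path("instance_images")  # hypothetical output folder
dst.mkdir(exist_ok=True)

for path in sorted(src.glob("*.jpg")):
    img = Image.open(path).convert("RGB")
    # Center-crop to a square and resize to the target resolution.
    img = ImageOps.fit(img, (SIZE, SIZE), method=Image.LANCZOS)
    img.save(dst / path.name)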

The Number of Photos: You’ll need more than 5, but I haven’t seen much improvement above 20. Some of my best models came from around 12 for faces, and I’ve seen great ones online from both 8 and 50 images.

Naming the Photos: I’ve seen posts that recommend naming all photos in the same format using the instance_prompt. While I haven’t noticed a difference in training, it’s still good to keep everything organized.

azzy_01.jpg
azzy_02.jpg
azzy_03.jpg
...
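If you want to batch-rename your photos into a pattern like this, here is a minimal sketch; the instance_images folder and the azzy prefix are placeholders for your own names.

from pathlib import Path

prefix = "azzy"                  # placeholder: use your own instance name
folder = Path("instance_images") # placeholder: your training-image folder

for i, path in enumerate(sorted(folder.glob("*.jpg")), start=1):
    path.rename(folder / f"{prefix}_{i:02d}.jpg")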

Class Images

Most training scripts include a section to generate class images. These can also be pre-generated or downloaded, which saves LOTS of time. I’ve opened a dreambooth class images git repo that has large sets of class images. When generating or using class images, there are important factors to keep in mind (a minimal generation sketch follows the list):

  • Use enough images that your model doesn’t overfit the class images (it can overfit because training alternates these images with subject images). Use at least 2x the number of class images; more is better. I generally use 1500.
  • Double-check the resolution! The class images should match the output resolution of the model being fine-tuned, otherwise the same cropping problem as with the training images can occur.
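Here’s a minimal sketch for pre-generating class images with diffusers; the model id, output folder, and counts are assumptions to adapt to your setup (ideally use the same base model you plan to fine-tune).

import torch
from pathlib import Path
from diffusers import StableDiffusionPipeline

class_prompt = "photo of a cat"
out_dir = Path("class_images_cat")  # placeholder folder
out_dir.mkdir(exist_ok=True)
num_class_images = 1500
batch_size = 4

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumption: same base model you'll fine-tune
    torch_dtype=torch.float16,
).to("cuda")

count = 0
while count < num_class_images:
    # Generate a batch of class images from the class prompt and save them to disk.
    images = pipe(class_prompt, num_images_per_prompt=batch_size).images
    for img in images:
        img.save(out_dir / f"{count:05d}.jpg")
        count += 1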

Training

Time to train, the exciting hurry up and wait!

Training Arguments

I’ve plopped my training args struct here with guiding comments.

The most important arguments to note are num_photos, learning_rate, max_train_steps, and save_steps. These are all correlated with each other. I discuss them after the args struct.

args = Namespace(
    # The model to be fine-tuned. Something like stable-diffusion-XXX
    pretrained_model_name_or_path=pretrained_model_name_or_path,

    # Make sure your training images match the model output size.
    # Most of the time it's 512, but some models output 768 (sd-2.1).
    resolution=vae.sample_size,

    # What to do if training images don't match 'resolution'.
    # Shouldn't happen if you followed this guide ;)
    center_crop=True,

    # Very important to be True,
    # otherwise the text encoder won't learn the new instance prompt word.
    train_text_encoder=True,

    # Where the subject (instance) training images are stored.
    instance_data_dir=save_path,

    # What are your training images of?
    instance_prompt=instance_prompt,

    # I've found 1e-06 to be the best. Any higher and the model learns too fast
    # (at least for complicated things like faces; simple things could be higher).
    learning_rate=1e-06,

    # Approx rule: num_photos * 100 (for lr=2e-6), num_photos * 200 (for lr=1e-6),
    # possibly up to num_photos * 300 (for lr=1e-6) [faces look better when overfit].
    # NOTE: discussed below.
    max_train_steps=3000,

    # How frequently to save. Takes longer, but more options for the best model.
    save_steps=200,

    # How many images to train on at a time. Set to 1 if using prior preservation.
    train_batch_size=1,
    gradient_accumulation_steps=1,

    # Training configs that I didn't touch.
    max_grad_norm=1.0,
    mixed_precision="fp16",  # Set to "fp16" for mixed-precision training.

    # Set this to True to lower the memory usage.
    gradient_checkpointing=True,

    # True: reduces RAM usage, but might degrade performance.
    use_8bit_adam=False,

    # For reproducibility.
    seed=3434554,

    # Set to True to use class images for training.
    with_prior_preservation=prior_preservation,

    # I've always used 1.0 for this, never tried anything else.
    prior_loss_weight=prior_loss_weight,

    # 2 and 4 both give stable results, didn't try anything else.
    sample_batch_size=4,

    # Path to class images.
    class_data_dir=prior_preservation_class_folder,

    # What the class images are of.
    class_prompt=prior_preservation_class_prompt,

    # Make sure there are enough class images!
    num_class_images=num_class_images,

    # Flat lr curve because we're fine-tuning.
    lr_scheduler="constant",

    # During warmup a higher LR can be used.
    # Because we're fine-tuning we don't need this.
    lr_warmup_steps=0,

    # Where to save checkpoints and the final model.
    output_dir="dreambooth-concept",
)

Scheduler

I’ve had success with the DDPMScheduler during training, and the DPMSolverMultistepScheduler for inference. It’s interesting that the script uses one for training and one for inference. I’m a little confused as to why, so if anyone knows please point me in the right direction.

# Inference
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    model,
    scheduler=DPMSolverMultistepScheduler.from_pretrained(model, subfolder="scheduler"),
    torch_dtype=torch.float16,
).to("cuda")
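On the training side, the noise scheduler is loaded from the base model’s scheduler config. A minimal sketch of that call, assuming a recent diffusers version:

from diffusers import DDPMScheduler

# Training: the scheduler used to add noise to the latents at each training step.
noise_scheduler = DDPMScheduler.from_pretrained(
    pretrained_model_name_or_path, subfolder="scheduler"
)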

This huggingface blog post goes into detail about different schedulers and their impacts.

Learning Rate

There’s a direct relationship between the learning rate and the number of steps. Assuming the gradient is always the same, then learning_rate and steps are inversely proportional to maintain the same weights after training.

# Gradient descent update rule:
weights = weights - learning_rate * grad

# Fixing grad to a constant C:
weights = weights - learning_rate * C

# Two steps at lr = 1e-6 move the weights as much as one step at lr = 2e-6:
2 * (1e-6 * C) == 1 * (2e-6 * C)

# Basically, 2 steps at 1e-6 is similar to 1 step at 2e-6.

In short — lower learning_rate means that more steps are needed.

Internet resources have converged upon 1e-6 being the best option for faces. Higher learning rates can be used for less detailed subjects. FollowFox has a great comparison of learning rates and the number of steps.

Number of Photos

The next variable to think about is num_photos. Assuming all training photos are unique, what we care about is the number of times the model “sees” each photo during training, which gives us the following:

# num_times_seen   - number of times the model sees each photo
# num_photos       - number of photos in the training set
# train_batch_size - number of photos used at each training step
steps_needed = num_photos * num_times_seen / train_batch_size

In short, more unique photos require more training steps.

This of course is a simplification. If two photos in the training set are similar then perhaps the model can learn the same features from both, and fewer steps are needed.

Putting it all Together

After training a few models to learn the faces of friends and family members I put together a simple linear model. The model guesses the number of steps needed based on learning_rate and num_photos.

# For learning faces
step_guess = num_photos * 300 / (learning_rate * 1e6)

# Examples
step_guess = (12 * 300) / (1e-6 * 1e6)  # 12 photos at lr=1e-6 -> 3600 steps
step_guess = (6 * 300) / (2e-6 * 1e6)   # 6 photos at lr=2e-6  -> 900 steps

If anyone knows better equations for landscapes, food, or anything else please share!

The only problem is that the training arguments don’t take a step_guess; they take max_train_steps and save_steps. We now want to pick these so that a few checkpoints land close to this ideal number of steps.

Here are some examples of what I might choose:

# step_guess = 3600 (lr=1e-6, num_photos=12)
# save checkpoints at 600, 1200, 1800, 2400, 3000, 3600, 4200, 4800
max_train_steps = 4800
save_steps = 600


# step_guess = 900 (lr=2e-6, num_photos=6)
# save checkpoints at 200, 400, 600, 800, 1000, 1200, 1400
max_train_steps = 1400
save_steps = 200

I tend to go 30–40% over the step_guess, so the model can be seen transitioning to overfit. While it seems tempting to generate a ton of checkpoints, note that checkpoints take a few GB each and slow down training.
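Here’s a small helper that captures this bookkeeping. The rounding to hundreds, the eight-checkpoint target, and the 35% overshoot are my own assumptions based on the examples above, so adjust to taste.

def plan_checkpoints(step_guess, num_checkpoints=8, overshoot=1.35):
    """Pick save_steps and max_train_steps around step_guess.

    Assumption: aim for roughly num_checkpoints evenly spaced checkpoints,
    overshooting step_guess so the transition to overfit is visible.
    """
    save_steps = max(100, round(step_guess * overshoot / num_checkpoints / 100) * 100)
    max_train_steps = save_steps * num_checkpoints
    return save_steps, max_train_steps

print(plan_checkpoints(3600))  # e.g. (600, 4800)
print(plan_checkpoints(900))   # e.g. (200, 1600)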

Post-Training

Now that training is complete we need to select which checkpoint to use and prompt the model.

Checkpoint Selection

Most of the time the final checkpoint turns out to be overfit (probably because I go over by 30–40%). So rather than selecting the final one, I compare all the saved checkpoints using the following script:

prompt = "portrait of azzy cat swimming underwater"
num_samples = 6
guidance_scale = 8
num_inference_steps = 50

# Paths to all checkpoints.
# Note: Google Colab has a hard time displaying more than 4 rows at a time.
model_list = [
# 'dreambooth-concept/checkpoint-200',
'dreambooth-concept/checkpoint-400',
# 'dreambooth-concept/checkpoint-600',
'dreambooth-concept/checkpoint-800',
# 'dreambooth-concept/checkpoint-1000',
'dreambooth-concept/checkpoint-1200',
# 'dreambooth-concept/checkpoint-1400',
'dreambooth-concept/checkpoint-1600',
]

# output images
all_images = []
for model in model_list:
# Setup pipeline for checkpoint
pipe = StableDiffusionPipeline.from_pretrained(
model,
scheduler = DPMSolverMultistepScheduler.from_pretrained(model, subfolder="scheduler"),
torch_dtype=torch.float16,
).to("cuda")

# Set the seed to compare checkpoints.
generator = torch.Generator(device="cuda").manual_seed(42)

# Generate images & add to output
images = pipe(
prompt,
num_images_per_prompt=num_samples,
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
generator=generator,
).images
all_images.extend(images)

# Display all in a grid
grid = image_grid(all_images, len(model_list), num_samples)
grid
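If your environment doesn’t already define image_grid, a minimal version of the usual notebook helper looks like this (a sketch using PIL, assuming all images share one size):

from PIL import Image

def image_grid(imgs, rows, cols):
    # Paste the images into a rows x cols grid.
    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid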

I’ll use the example of my cat Azriel. The above script will generate a grid like the following (using a different prompt and checkpoints):

A variety of checkpoints with the same seed for my cat Azriel. The top is the smallest number of steps, and the bottom is the highest.

There are a few key things to notice. The first row is underfit. Pay attention to the eyes and how well it knows the subject.

By the final row the model is overfit. It’s outputting images of Azriel similar to the training images and ignoring the prompt.

A checkpoint somewhere in the middle is best. In this case row 2.

Examples from the grid above. From left to right: Slightly underfit (notice the hands, and how the mouth/eyes look a little weird), Overfit (copying pictures from the training set), Good fit (It even turned the hands into paws!).

Trying out many prompts is vital to selecting the best checkpoint!

Prompting

Stealing prompts and altering them always yields the best results. Check out sites like lexica for inspiration.

When altering the prompt it’s important to think about where and how to use the instance and class prompts. Every word has a strength with the model. Words like King Kong and Pikachu are like gravity wells for the model, which leads to it ignoring other words in the prompt. The same can happen with your class prompt depending on how many steps were run. There are some tricks to counteract this.

# Training prompts
instance_prompt = "photo of azzy cat"
class_prompt = "photo of a cat"

# Will strongly depict azzy by using "azzy cat"
prompt = "azzy cat wearing a tuxedo at a wedding"

## Trick 1: inclusion/exclusion of class prompt
# Will portray azzy in a more general way, not as strong
prompt = "azzy wearing a tuxedo at a wedding"

## Trick 2: placement of instance key word
# Weak portrayal. Use if "azzy" is stronger than the rest of the prompt
prompt = "at a wedding with azzy cat in a tuxedo"

## Trick 3: weighting
# This is built into the model generation, and can accommodate negative numbers.
prompt = "(azzy cat:0.4) wearing a tuxedo at a wedding"

Trying out many options is normally needed to get the best results.

Finish Up

Now you’re done! Time to share your model and get some fancy photos made.

Resources

This is a set of resources I found helpful when learning. This list is by no means complete.

Training scripts:

Training guides:

AI art prompt inspiration sites:
