Using this repo:
https://github.com/kanewallmann/Dreambooth-Stable-Diffusion
Folder structure, using a project name of "ff7r" as an example (you can name it whatever you want):
/reg/man/ (all your regularization images of men)
/training_samples/ff7r/man (all your images of men to train)
/reg/woman/ (all your regularization images of women)
/training_samples/ff7r/woman (all your images of women to train)
/reg/group/ (all your regularization images of groups of people)
/training_samples/ff7r/group (all your images of multiple characters in one frame)
/reg/city/ (all your regularization images of city stuff, like "aerial photo of a city at night" or "photo of a city street")
/training_samples/ff7r/city (all your images of city styles to train)
etc. Add as many pairings as you want: /indoors, /building, whatever. Pair each training set with a reg set by using identical subfolder names under /reg and /training_samples/projectname.
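Since every training subfolder is supposed to have a reg subfolder of the same name, a quick check before kicking off a run can catch a missing pairing. This is a minimal sketch, not part of the repo: the function name `unpaired_subfolders` is made up for illustration, and it only compares folder names.

```python
from pathlib import Path


def unpaired_subfolders(train_root, reg_root):
    """Compare subfolder names under the training and reg roots.

    Returns (missing_in_reg, missing_in_train) as sorted lists of names.
    Hypothetical helper, only checks names, not folder contents.
    """
    train = {p.name for p in Path(train_root).iterdir() if p.is_dir()}
    reg = {p.name for p in Path(reg_root).iterdir() if p.is_dir()}
    return sorted(train - reg), sorted(reg - train)


# Example call, using the folder names from this writeup:
#   missing_reg, missing_train = unpaired_subfolders("training_samples/ff7r", "reg")
#   Both lists should be empty if every pairing exists.
```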
Python run command to kick off training:
python main.py --base configs/stable-diffusion/v1-finetune_unfrozen.yaml -t --actual_resume last.ckpt -n ff7r --gpus 0, --data_root training_samples\ff7r --reg_data_root reg
Last successful run:
Training images are run through the BLIP interrogator (16 beams), and each file is renamed to the caption it spits out.
"a man", "a woman", and so forth are changed to "cloud strife" or "barret wallace", i.e. the correct name of the character shown in the image.
Every single training image has a custom caption such as "
120-140 images each of Cloud Strife and Barret Wallace in /training_samples/ff7r/man
120-140 images each of Aerith Gainsborough and Tifa Lockhart in /training_samples/ff7r/woman
80 images of Jessie Rasberry in /training_samples/ff7r/woman
60 group photos (various combinations of characters) in /training_samples/ff7r/group
30 images of Wedge and Biggs in /training_samples/ff7r/man
10 images of Red XIII in /training_samples/ff7r/dog
10 aerial screenshots of Midgar city in /training_samples/ff7r/city
10 images of city streets and concept art in /training_samples/ff7r/city
etc.
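The caption fix-up step (swapping "a man" / "a woman" for the character name) could be scripted roughly like this. It is only a sketch under assumptions: the regex and the name `personalize_caption` are illustrative, you still have to supply the right character for each image yourself, and it only touches the first generic subject in the caption.

```python
import re

# Generic subjects a captioner tends to produce; extend as needed (assumption).
GENERIC_SUBJECT = re.compile(r"\b(?:a|an|the) (?:man|woman)\b", re.IGNORECASE)


def personalize_caption(caption, character):
    """Replace the first generic subject in the caption with a character name."""
    return GENERIC_SUBJECT.sub(character, caption, count=1)


print(personalize_caption("a man sitting on a motorcycle", "cloud strife"))
# cloud strife sitting on a motorcycle
```

Captions with more than one person (group shots) still need a manual pass, since only the first match is replaced.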
Results: Cloud, Barret, Aerith, Tifa, and Jessie all look very good.
Biggs/Wedge look like PS2-era renders and kind of smoothed over, but are at least there; more training samples should fix this.
Style transfer for the city of Midgar works fairly well given the limited set.
Tom Cruise still looks like Tom Cruise, Emma Watson still looks like Emma Watson, etc.
"photo of city streets" does not turn into Midgar unless "midgar city" or "midgar" is in the prompt.
There is some degradation, but context mashups still work VERY well, e.g. Cloud Strife as Captain America or Robert Downey Jr as Cloud Strife.
Future:
1400 images in the next training set, more Wedge/Biggs, etc.
Adding "slums district" and "business district" in the next model; fairly certain it will do extremely well.
Adding more training images for Wedge/Biggs, Sephiroth, President Shinra, Heidegger, Rufus Shinra, etc.
Model:
https://drive.google.com/file/d/1opVuuEOZLY_D7clYHh8QKLxiYY5rv-ks/view?usp=sharing
Well trained:
"tifa lockhart"
"cloud strife"
"barret wallace"
"aerith gainsborough"
"jessie rasberry"
Poorly trained:
"wedge ff7r"
"biggs ff7r"
"scarlet ff7r"
"sephiroth"
"president shinra"
"rufus shinra"
"shinra security officer holding an assault rifle"
Garbage level training:
"red xiii"
Styles:
"streets of midgar city"
"aerial photo of midgar city"
"in the style of midgar city" or "in the style of midgar"
You can monitor training by looking at the logs folder, for example: logs[ff7r2022-10-11T04-19-59_ff7rv4]\images\train
It will spit out test images every so many steps, based on the ImageLogger settings in the finetune yaml.
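If you don't want to keep refreshing the folder by hand, something like the following lists the most recently written sample images. This is a rough sketch: `newest_images` is a made-up helper, the extension filter is an assumption about what the logger writes, and you pass in whatever images\train path your own run created.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed output formats


def newest_images(image_dir, count=5):
    """Return the `count` most recently modified image files in image_dir."""
    files = [
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)[:count]


# Example call (replace the path with your own run's folder):
#   for path in newest_images(r"logs\<your run folder>\images\train"):
#       print(path.name)
```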
Captions are per-image in kanewallmann's repo and are taken from the filename. An underscore marks the end of the caption, so you can have multiple images with the same caption without filename collisions.
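To make the filename convention concrete, here is a small sketch of how a caption could be recovered from a filename. It is not the repo's actual code: it assumes the extension is dropped and everything from the first underscore onward is discarded, and `caption_from_filename` is a name invented for this example.

```python
from pathlib import Path


def caption_from_filename(filename):
    """Drop the extension, then keep only the text before the first underscore."""
    stem = Path(filename).stem
    return stem.split("_", 1)[0].strip()


print(caption_from_filename("cloud strife standing in a burning alleyway_2.jpg"))
# cloud strife standing in a burning alleyway
```

One consequence of this convention is that the caption itself cannot contain an underscore.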
Ex.
"zack fair in a black outfit holding a broadsword.jpg"
"cloud strife sitting on a motorcycle with his buster sword in his hand.png"
"cloud strife standing in a burning alleyway_1.jpg"
"cloud strife standing in a burning alleyway_2.jpg"
"cloud strife standing in a burning alleyway_3.jpg"
"a food truck in the slums distrct of midgar city_1.png"
"a food truck in the slums distrct of midgar city_2.png"
"ruined streets of midgar city with a fallen building in the background and people standing around.png"
"ruined streets of midgar city with a fallen building in the background and people standing around_1.png"
Same goes for reg images!!
"a small a 2-story apartment building_ (1).png"
"a small a 2-story apartment building_ (11).png"
"an interior photo of a small hometown bar with a cash register on the counter_ (1).png"
"an interior photo of a small hometown bar with a cash register on the counter_ (2).png"
New ckpt with another epoch (~4080 steps) added to the above at LR 5e-7, about 14k training steps total (~18k with validation?):
https://drive.google.com/file/d/1BpaJi9JtOoekd0cjXBHni9-R9UnpItWk/view?usp=sharing
Samples from v4 (not 4.1 posted above): https://imgur.com/a/hVOyRmZ#8xoSy4i
Please look at the comments on each image.
New 4.1 samples: https://imgur.com/a/J8lJYrQ
If you're interested in discussing fine-tuning techniques that move past Dreambooth, I'll be sharing more here as well as on Discord: https://discord.gg/UwM6T5Jp
Discord invite seems invalid.
hopefully permanent link. A lot has happened in the last week! Now using LAION ground truth data for model preservation, and up to 2200+ training images.
finetune yaml settings:
LR: 1.0e-6
train repeats: 5
validation repeats: 1
I train to 3 epochs, which works out to ~13k steps with ~900 images. That number will grow as I add more training data, since I keep shooting for 5 repeats and 3 epochs, which seems to be the sweet spot.
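The step count above is just images x repeats x epochs. A quick sketch of the arithmetic, assuming a batch size of 1 and ignoring validation steps (both are assumptions, and `estimated_steps` is only an illustrative helper):

```python
import math


def estimated_steps(num_images, repeats, epochs, batch_size=1):
    """Rough training step count: each image is seen `repeats` times per epoch."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


print(estimated_steps(900, 5, 3))  # 13500
```

With ~900 images this gives 13,500, in line with the ~13k figure; plug in your own image count to see how the total grows as the training set does.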