SDXL vs. SD 1.5: A Deep Dive into Image Generation AI Performance

Fever, digital art by Daniel Sandner
"Fever", 2023, by Daniel Sandner

I am creating this comparison to perhaps demystify some issues with SDXL. I was trying to avoid this for some time. But the only statistics you can trust are those you falsified yourself, so here we are.

Performance Comparison

The test was performed on consecutive generation of 3 images (Batch size 1), 50 steps each (i.e. Batch count 3, 150 steps). GPU A4000, SDXL 1.0 base and refiner, XFOMERS were active. Comparisons in tables: 

SDXL base v1.0 and refiner, SD 1.5 Photomatix merge, and SD 1.5 PhotomatixRT TensorRT to add some spice.

50 steps. Batch count 3 (150 steps)Euler aDPM++SDEDPM++3M SDE
SDXL 10241:172:371:22 (3:40 with AD)
SDXL 1024 + Refiner 0.82:283:422:50 ( 5:05 with AD)
SD15 10241:082:131:10
SD15RT 8320:350:370:37

If using ADetailer (AD) with face (1st) and eyes (2nd) corrections, consider these times basically doubled or tripled (supposing 1 character face is detected in the scene). Also the times using AD slightly varies from image to image.

50 steps. Batch count 3 (150 steps)Euler aDPM++SDEDPM++3M SDE
SDXL 832 + Refiner 0.8 +AD4:157:044:23
SDXL 832 + Refiner 0.8 +AD + Hires fix 1.634 (to 1359)5:22not performednot performed, similar speeds to Euler a
SDXL 8320:561:50not performed, similar speeds to Euler a
SD15 832 +AD2:044:09not performed, similar speeds to Euler a
SD15RT 832 +AD1:503:50not performed, similar speeds to Euler a

I have added SDXL without AD and Refiner into the table, because the quality of face rendering are often good (in close and middle shots).

As you can see in the tables, SDXL does perform basically the same in the term of speed in comparable areas. The illusion that SDXL is terribly slow comes from low resolution workflows people are used to in SD15 (also you would probably use 20 steps only most of the time, compared to 50-70 with SDXL, see table below). And the confusion also comes, frankly, from the CLAIMS about huge performance boost of SDXL (which are partially true, but we will come to this later). 

20 steps. Batch count 3 (60 steps)Euler aDPM++SDE
SDXL 512x7680:220:41
SDXL 512x768 + Refiner 0.81:391:49
SD15 512x7680:100:22
SD15RT 512x7680:080:15

portrait 1 woman (Style: Cinematic)

TIP: Try just the SDXL refiner model version for smaller resolutions (f. i. 512x768) if your hardware struggles with full 1024 renders.

Technology Comparison

SDXL uses base model for high-noise diffusion stage and refiner model for  low-noise diffusion stage. Base resolution is 1024x1024 (although different resolutions training is possible).

SD15 base resolution is 512x512  (although different resolutions training is possible, common is 768x768).

So, the SDXL version indisputably has a higher base image resolution (1024x1024) and should have better prompt recognition, along with more advanced LoRA training and full fine-tuning support. The truth is, SDXL is much harder to overtrain than SD15. We will discuss SDXL LoRA training further in the next article.

SDXL refiner part is trained for high resolution data and is used to finish the image usually in the last 20% of diffusion process. 

Because of various manipulations possible with SDXL, a lot of users started to use ComfyUI with its node workflows (and a lot of people did not, because of its node workflows).

SDXL does react well to styles and combination of styles. You may download my test style files for A1111 from Hugginface (styles.csv) or Github (styles.csv).

Outputs

Comparison of overall aesthetics is hard. SD 1.5 has issues at 1024 resolutions obviously (it generates multiple persons, twins, fused limbs or malformations). SDXL struggles with proportions at this point, in face and body alike (it can be partially fixed with LoRAs). SDXL also exaggerates styles more than SD15. In this mixed gallery, you can tell the model by hovering over an image (the SD15 model in the examples is fine-tuned for photorealism).

Interestingly, SD15 also makes mistakes in body proportions in higher resolutions, but face proportions are usually right. The prompt raw color, (1 middle aged man:1.1) art portrait photomodel, happy, model pose, bright light was using styles, the full parameters (man/woman modified):

analog film photo isometric style raw color, (1 middle aged woman:1.1) art portrait photomodel, happy, model pose, bright light . vibrant, beautiful, crisp, detailed, ultra detailed, intricate . faded film, desaturated, 35mm photo, grainy, vignette, vintage, Kodachrome, Lomography, stained, highly detailed, found footage
Negative prompt: 3d, nsfw, breasts visible , painting, drawing, deformed, mutated, ugly, disfigured, blur, blurry, noise, noisy, realistic, photographic, painting, drawing, illustration, glitch, deformed, mutated, cross-eyed, ugly, disfigured
Steps: 50, Sampler: DPM++ 3M SDE, CFG scale: 7, Seed: 1054841253, Size: 1024x1024, Model hash: 57661c1ae7, Model: photomatix, ADetailer model: face_yolov8n.pt, ADetailer confidence: 0.3, ADetailer dilate/erode: 4, ADetailer mask blur: 4, ADetailer denoising strength: 0.4, ADetailer inpaint only masked: True, ADetailer inpaint padding: 32, ADetailer model 2nd: mediapipe_face_mesh_eyes_only, ADetailer confidence 2nd: 0.3, ADetailer dilate/erode 2nd: 4, ADetailer mask blur 2nd: 4, ADetailer denoising strength 2nd: 0.4, ADetailer inpaint only masked 2nd: True, ADetailer inpaint padding 2nd: 32, ADetailer version: 23.8.0, Version: v1.5.1-495-g541ef924

SDXL is meant to be used with another "refiner" model to finish the render (and also is designed and intended to be used with styles). For lower output resolutions (which you may want to try for various reasons) you may experiment with using the refiner part only: 

Comparison of SDXL models
"woman holding candle" without a style (comparison with SD1.5 fine-tuned model)
Synthetic SDXL photo comparisons, portrait of woman holding candle
The same prompt with Style: Cinematic

However, without any doubt, SDXL excels in higher resolution renders and complex scenes with more subjects.

If you compare the base 1.5 model (not fine-tuned) with the base SDXL and compare the outputs, SDXL is visually clearly lightyears ahead.

Control

Full ControlNet funcionality is not yet available for SDXL. Not much to add here. Points for SD15. Also I wonder how big the CN models for SDXL will be and how much VRAM you will need (there are some ControNet models for ComfyUI in sizes 2.5-5GB each).

Updated: ControlNet updated for SDXL, models here.

SDXL was nicknamed as "Midjourney killer" in the promo. But SD1.5 it already was.

Conclusion

SDXL performance does seem sluggish for SD 1.5 users not used for 1024 resolution, and it actually IS slower in lower resolutions. Also memory requirements—especially for model training—are disastrous for owners of older cards with less VRAM (this issue will disappear soon as better cards will resurface on second hand market.

SDXL is undisputably leap forward and has huge potential to produce great results. It will need some time to really explore all the options. Also new checkpoints could—and will—change the landscape a lot. But hardware wise, one thing is clear:  You're gonna need a bigger boat.

References

You may also like:

Subscribe

Stay connected to make sure you don’t miss anything. Join our newsletter community for artists, designers, and art and science enthusiasts.