Forge (UI) Ahead: LLM Vision and Text Prompt Engineering for FLUX.1

A FLUX.1 still life image generated using LM Studio running the Llama 3.1 Storm LLM model

The FLUX.1 model is versatile (even if better suited to photorealism at the moment), and it reacts nicely to verbose descriptions. To effectively leverage the power of FLUX.1, you can use an LLM vision/text Llama 3 model for prompt generation directly from the A1111 Forge UI interface (node-based solutions for ComfyUI have also existed for some time). This article will guide you through setting up the Auto LLM extension and a suitable model for your LLM-Vision and LLM-Text experiments with FLUX.1.

What You Will Need

  • Forge UI Installation
  • Auto LLM A1111 Extension
  • LM Studio Installation
  • Llama 3 model (text-only, or with a vision adapter for LLM-vision)
  • FLUX.1 model
  • 16GB+ of VRAM recommended (the technique may also work with less VRAM, e.g. using smaller quantized FLUX and LLM models)

Quick Installation Instructions

Forge UI + Auto LLM + FLUX.1

Install the free version control manager Git from git-scm.com. In the target folder where you want your Forge UI installation, open a cmd terminal and run git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git
Run webui-user.bat to automatically set up Forge UI.

After the installation finishes, run Forge UI (webui-user.bat) and head to the Extensions tab. Go to Available and click the "Load from:" button. After the list appears, find the Auto LLM (sd-webui-decadetw-auto-prompt-llm) extension and install it. Apply the changes and restart Forge UI.

Now download the FLUX.1 model version you want (consider your VRAM size). Unless you use the flux1 bnb-nf4 versions, you will also need the CLIP, text encoder, and the correct VAE files. You will find a more detailed article about FLUX.1 in Forge here.

Depending on your VRAM size, you may consider a smaller quantized GGUF version of FLUX: SCHNELL (https://huggingface.co/city96/FLUX.1-schnell-gguf) or DEV (https://huggingface.co/city96/FLUX.1-dev-gguf).
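If you prefer to script the download, a minimal sketch using the huggingface_hub Python package could look like this; the filename is only an example, so check the repository for the exact quantization you want, and adjust the target folder to your Forge installation:

```python
# Sketch: download a quantized FLUX.1 GGUF file from Hugging Face.
# The filename below is an example; browse the repository to pick the
# quantization that fits your VRAM.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="city96/FLUX.1-dev-gguf",
    filename="flux1-dev-Q4_K_S.gguf",  # example quantization, verify on the repo page
    local_dir="stable-diffusion-webui-forge/models/Stable-diffusion",  # adjust to your setup
)
print("Downloaded to", path)
```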

LM Studio + Llama Vision Model

With LM Studio you can easily install and run LLM models locally. Install LM Studio from https://lmstudio.ai/

In LM Studio, go to the Search section (the magnifying glass icon) and search for xtuner. Download two files:

  • llava-llama-3-8b-v1_1-mmproj-f16.gguf  (VISION ADAPTER)
  • llava-llama-3-8b-v1_1-int4.gguf (MODEL)

You may experiment with various Llama 3 models for text generation, but only versions with a vision adapter will work in the LLM-vision mode of the Auto LLM extension.

An article about LM Studio and running LLMs locally is here on the site.

Settings in LM Studio and Forge

To run the downloaded vision/text LLM model in LM Studio, use the "Local Server" icon on the left, click the top button for loading models, and load the xtuner llava llama 3 model. The server should start automatically (if it does not, use the "Start Server" button in the Configuration panel).

LM Studio running the server with the LLM vision model
Auto LLM basic setup for LM Studio (llama 3 model)

With these settings, the Auto LLM controls should work for both LLM-text and LLM-vision.
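Under the hood, Auto LLM talks to LM Studio's local server through its OpenAI-compatible API, using the [LLM-URL] and [LLM-API-Key] values shown in the troubleshooting section below. If you want to verify the server independently of Forge, a minimal sketch along these lines should work, assuming the openai Python package and LM Studio's default port 1234:

```python
# Quick connectivity check against the LM Studio local server.
# Assumes the "openai" package is installed and the server runs on the
# default port 1234 with a model loaded.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# List the model(s) the server currently exposes; a connection error or
# an empty list means the server is not running or nothing is loaded.
for model in client.models.list():
    print(model.id)
```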

Auto LLM Usage

LLM-text

  1. In Auto LLM, activate "Enable LLM-Answer to SD-prompt"
  2. In Auto LLM, put your prompt into [LLM-Your-Prompt]; you can then use the A1111 "Generate" button to create an image even with an empty normal prompt.
  3. Additionally, you may use the "Call LLM above" button in the Auto LLM extension to generate the prompt into the text window and copy and paste it from there (a rough sketch of the underlying request follows this list).
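For illustration, the LLM-text step boils down to a single chat-completion request to the LM Studio server: a short idea goes in, a verbose prompt comes out. The sketch below is an approximation, not the extension's exact code; the system instruction and model name are placeholders (LM Studio typically answers with whichever model it has loaded):

```python
# Sketch of the LLM-text step: expand a short idea into a verbose FLUX.1 prompt.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

reply = client.chat.completions.create(
    model="llava-llama-3-8b-v1_1",  # placeholder; LM Studio uses the loaded model
    messages=[
        {"role": "system",
         "content": "Expand the user's idea into a detailed, verbose image prompt."},
        {"role": "user", "content": "a moody still life with brass instruments"},
    ],
    temperature=0.7,  # higher = more creative output (see Tips below)
    max_tokens=500,   # roughly the LLM Max length (tokens) setting
)

print(reply.choices[0].message.content)  # paste this into the SD prompt
```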

LLM-vision

  1. Drag and drop an image and press "Call LLM-vision above". You need a model with the vision adapter installed; see the instructions above.
  2. For another image, close the current one with the cross symbol and drop the next one.
  3. If "Enable LLM-vision" is active, the prompt generated from the input image is used as the main prompt, similarly to LLM-text (a sketch of the vision request follows below).
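The LLM-vision step works the same way, except the request also carries the image, base64-encoded, to a model loaded together with the vision adapter. A rough sketch, with an illustrative file name and instruction text:

```python
# Sketch of the LLM-vision step: describe an image, then reuse the description
# as a FLUX.1 prompt. Requires the vision adapter (mmproj file) loaded.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("input.jpg", "rb") as f:  # illustrative example image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

reply = client.chat.completions.create(
    model="llava-llama-3-8b-v1_1",  # placeholder; LM Studio uses the loaded model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this image as a detailed image-generation prompt."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=500,
)

print(reply.choices[0].message.content)
```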

Note: Auto LLM is currently under development, so some new features may not work as expected.

Tips

  • You may use automated prompt generation, or take both the text and vision outputs and use them in a prompt manually
  • Auto LLM has interesting iterative functions; check LLM-text Loop, for instance
  • Experiment with the LLM temperature: the higher it is, the more creative the LLM output. Setting it too high may cause errors in some scenarios, though.
  • Experiment with LLM Max length (tokens); you may set it to the maximum and see the results.

Solutions to Errors You May Encounter

Error code: 400 - {'error': 'Vision model is not loaded. Cannot process images...

LLM-vision error. You are using a Llama model without a vision adapter. Install the proper model with a vision adapter as described above.

Connection error.

LLM-text error. Check that the LM Studio server is running with a loaded model. The error may also appear with an incorrect setup. Check the Auto LLM setup and fill in all the information needed to connect to LM Studio: [LLM-URL] http://localhost:1234/v1 and [LLM-API-Key] lm-studio
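A quick way to tell whether the problem is the server or the extension settings is to query the endpoint directly; this standard-library sketch only assumes LM Studio's default port:

```python
# Diagnose a connection error: query the LM Studio server directly.
# If this fails, the server is down or on another port; if it succeeds,
# recheck the Auto LLM [LLM-URL] and [LLM-API-Key] fields.
import urllib.request

with urllib.request.urlopen("http://localhost:1234/v1/models") as resp:
    print(resp.read().decode("utf-8"))
```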

[Low GPU VRAM Warning]

This may show in the A1111 Forge terminal. Rendering will usually be very slow after the warning. Use a smaller model, free some VRAM, or prepare the final prompts separately before rendering. In LM Studio, experiment with GPU Acceleration/GPU Offload in the 'GPU Settings' tab on the right; try lowering the offload to VRAM (with the Low setting, text generation will take much longer).

Testing LLM Prompts Created by Llama 3.1 Storm

The model (akjindal53244/Llama-3.1-Storm-8B-GGUF/Llama-3.1-Storm-8B.Q6_K.gguf) makes great use of the Auto LLM extension's LLM Max length (tokens) cap of 5000 (:p). Although it does NOT have a vision adapter, it is worth testing for prompt text generation (more VRAM needed). I would suggest 2500 tokens as a reasonable cap for testing, depending on your GPU and FLUX model. Check References for a link to my test settings.

How to use an LLM Llama model to enhance a Stable Diffusion prompt, an image by Daniel Sandner
Enhance the details in a composition-heavy scene.

Conclusion

By following the steps outlined in this article, you've successfully set up a suitable LLM-Vision and LLM-Text model for your experiments with FLUX.1.
One of the significant advantages of using FLUX.1 is the ability to experiment with highly verbose prompts (especially with techniques like Hyper-SD). This flexibility allows you to provide detailed descriptions and instructions, leading to more precise and tailored results.

You can expand your existing prompts and test them with FLUX.1. This iterative process enables you to refine your prompts, learn from the model's responses, and continually improve your generations and styles.

References
