> Your text prompts simply don't have enough semantic capacity.
Mostly, current tools are abysmal at making use of the semantic capacity they do have. Midjourney is great at generating things that look good, but terrible at piecing scenes together.
Recent example I tried: a robot playing Magic: The Gathering, seated across from a human.
Even getting the human in the picture is a challenge, but then the model doesn't know enough about MTG (it correctly pattern matches to "board game" or "card game").
Some generated pictures are much better than others, and it would be great to take e.g. the table setup from one picture and the robot from another, but doing so isn't really possible at the moment ("blend" doesn't work for that).
I have no doubt this will improve, but I'm wondering if there's something underlying this that could be a more general limitation? Maybe simply not enough data (a Google image search for Magic: The Gathering is also pretty disappointing).
Another example: a glowing blue <company logo> carved into a stone monolith. It sometimes (rarely) got the logo carved, but the logo itself was never glowing blue; usually the whole monolith, or some part of it, glowed instead.
Sorry I noticed your post too late, not sure you'll read it.
The reason this happens is that the models are far too small (parameter-count-wise) and the prompt-understanding part is simplistic: usually it's either CLIP or, in the best case, a small and dumb transformer. (But regardless of current capabilities, text is just not a great tool for expressing artistic intent.)
Generally, what you want can be done by giving the model higher-order hints like sketches and pose skeletons; see ControlNets for Stable Diffusion, for example. The overall idea is to use a custom model, created specifically to guide the diffusion model, that conditions on the non-textual input. The problem is that Midjourney can't do this; you have to use SD.
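To make the "higher-order hints" idea concrete, here's a minimal sketch of preparing a pose-skeleton conditioning image with plain PIL (the coordinates and the seated-figure layout are made up for illustration). The commented-out part shows roughly how it would be fed to an OpenPose ControlNet via the `diffusers` library; it's not run here since it pulls multi-GB model weights.

```python
from PIL import Image, ImageDraw

# Build a crude pose skeleton as the conditioning image an OpenPose-style
# ControlNet consumes: white "bones" on a black canvas.
def pose_skeleton(size=(512, 512)):
    img = Image.new("RGB", size, "black")
    d = ImageDraw.Draw(img)
    # Hypothetical seated figure: head, spine, arms reaching toward a table.
    d.ellipse((240, 80, 272, 112), outline="white", width=3)   # head
    d.line((256, 112, 256, 260), fill="white", width=3)        # spine
    d.line((256, 150, 180, 220), fill="white", width=3)        # left arm
    d.line((256, 150, 332, 220), fill="white", width=3)        # right arm
    d.line((256, 260, 210, 360), fill="white", width=3)        # left leg
    d.line((256, 260, 302, 360), fill="white", width=3)        # right leg
    return img

control = pose_skeleton()

# Feeding this into Stable Diffusion would then look roughly like
# (not executed here -- it downloads large pretrained weights):
#
#   from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
#   cn = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose")
#   pipe = StableDiffusionControlNetPipeline.from_pretrained(
#       "runwayml/stable-diffusion-v1-5", controlnet=cn)
#   image = pipe("a robot playing a card game", image=control).images[0]
```

The point is that the skeleton pins down composition (where the figure sits, where the arms go) far more reliably than any amount of prompt wording could.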
Another thing is photobashing/compositing. Avoid fitting the entire composition into one generation; it makes the model lose track of your scene. Using multiple passes helps a lot. It's best to add objects and details in a specific spot by inpainting them, or running img2img with non-textual guidance.
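The compositing step itself is nothing exotic; a minimal sketch with pure PIL (solid-color placeholder images stand in for two generations, and the crop box is made up):

```python
from PIL import Image

# Hypothetical two generations: imgA has the table setup you like,
# imgB has the robot you like. Composite the robot region onto A,
# then hand the result to an img2img/inpainting pass to blend the seams.
imgA = Image.new("RGB", (512, 512), (120, 90, 60))   # stand-in "table" render
imgB = Image.new("RGB", (512, 512), (40, 40, 200))   # stand-in "robot" render

robot_box = (300, 100, 480, 400)          # where the robot sits in imgB
robot = imgB.crop(robot_box)

composite = imgA.copy()
composite.paste(robot, robot_box[:2])     # rough photobash

# A mask marking the pasted region; an inpainting pass at low denoising
# strength would use it to regenerate just that area so lighting and
# style match the rest of the scene.
mask = Image.new("L", composite.size, 0)
mask.paste(255, robot_box)
```

The crude paste looks terrible on its own; the low-strength inpainting pass over the masked region is what makes the elements cohere.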