> Your text prompts simply don't have enough semantic capacity.
Mostly, current tools are abysmal at making use of the semantic capacity they do have. Midjourney is great at generating things that look good, but terrible at piecing scenes together.
Recent example I tried: a robot playing Magic: The Gathering, seated across from a human.
Even getting the human in the picture is a challenge, but then the model doesn't know enough about MTG (it correctly pattern matches to "board game" or "card game").
Some generated pictures are much better than others, and it would be great to take e.g. the table setup from one picture and the robot from another, but doing so isn't really possible at the moment ("blend" doesn't work for that).
I have no doubt this will improve, but I'm wondering if there's something underlying this that could be a more general limitation? Maybe simply not enough data (a Google image search for Magic: The Gathering is also pretty disappointing).
Another example: a glowing blue <company logo> carved into a stone monolith. It sometimes (rarely) got the logo carved, but the logo itself was never glowing blue; usually the whole monolith, or some part of it, glowed instead.
Sorry I noticed your post too late, not sure you'll read it.
The reason this happens is that the models are far too small (parameter-count-wise) and the prompt-understanding part is simplistic: usually it's either CLIP or, in the best case, a small and dumb transformer. (But regardless of current capabilities, text is just not a great tool for expressing artistic intent.)
Generally, what you want can be done by giving the model higher-order hints like sketches and pose skeletons; see ControlNets for Stable Diffusion, for example. The overall idea is to use a custom model, created specifically to guide the diffusion model, that conditions on the non-textual input. The problem is that Midjourney can't do this; you have to use SD.
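To make the "higher-order hints" idea concrete, here's a minimal sketch of preparing a pose-skeleton conditioning image with plain PIL (the coordinates and the seated-figure layout are made up for illustration). The commented-out part shows roughly how it would be fed to an OpenPose ControlNet via the `diffusers` library; it's not run here since it pulls multi-GB model weights.

```python
from PIL import Image, ImageDraw

# Build a crude pose skeleton as the conditioning image an OpenPose-style
# ControlNet consumes: white "bones" on a black canvas.
def pose_skeleton(size=(512, 512)):
    img = Image.new("RGB", size, "black")
    d = ImageDraw.Draw(img)
    # Hypothetical seated figure: head, spine, arms reaching toward a table.
    d.ellipse((240, 80, 272, 112), outline="white", width=3)   # head
    d.line((256, 112, 256, 260), fill="white", width=3)        # spine
    d.line((256, 150, 180, 220), fill="white", width=3)        # left arm
    d.line((256, 150, 332, 220), fill="white", width=3)        # right arm
    d.line((256, 260, 210, 360), fill="white", width=3)        # left leg
    d.line((256, 260, 302, 360), fill="white", width=3)        # right leg
    return img

control = pose_skeleton()

# Feeding this into Stable Diffusion would then look roughly like
# (not executed here -- it downloads large pretrained weights):
#
#   from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
#   cn = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose")
#   pipe = StableDiffusionControlNetPipeline.from_pretrained(
#       "runwayml/stable-diffusion-v1-5", controlnet=cn)
#   image = pipe("a robot playing a card game", image=control).images[0]
```

The point is that the skeleton pins down composition (where the figure sits, where the arms go) far more reliably than any amount of prompt wording could.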
Another thing is photobashing/compositing. Avoid fitting the entire composition into one generation; it makes the model lose track of your scene. Using multiple passes helps a lot. It's best to add objects and details in a specific spot by inpainting them, or running img2img with non-textual guidance.
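The compositing step itself is nothing exotic; a minimal sketch with pure PIL (solid-color placeholder images stand in for two generations, and the crop box is made up):

```python
from PIL import Image

# Hypothetical two generations: imgA has the table setup you like,
# imgB has the robot you like. Composite the robot region onto A,
# then hand the result to an img2img/inpainting pass to blend the seams.
imgA = Image.new("RGB", (512, 512), (120, 90, 60))   # stand-in "table" render
imgB = Image.new("RGB", (512, 512), (40, 40, 200))   # stand-in "robot" render

robot_box = (300, 100, 480, 400)          # where the robot sits in imgB
robot = imgB.crop(robot_box)

composite = imgA.copy()
composite.paste(robot, robot_box[:2])     # rough photobash

# A mask marking the pasted region; an inpainting pass at low denoising
# strength would use it to regenerate just that area so lighting and
# style match the rest of the scene.
mask = Image.new("L", composite.size, 0)
mask.paste(255, robot_box)
```

The crude paste looks terrible on its own; the low-strength inpainting pass over the masked region is what makes the elements cohere.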