
"This has nothing to do with the model's poor understanding of natural language, and will not change until we have something that could reasonably pass for AGI, and likely not even then. Your text prompts simply don't have enough semantic capacity."

I don't think it's going to take AGI to get to this point. It's 'just' going to take a top-tier model adding robust multi-modal input imho. A detailed prompt plus a bunch of examples of the style you're looking for seems like it would be enough.
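One plausible shape for that kind of multi-modal conditioning, sketched conceptually (this is not any particular model's API; the toy embeddings and the blend weight are invented for illustration): blend a text-prompt embedding with the mean of several style-reference embeddings into a single conditioning vector.

```python
# Conceptual sketch only: combine a text-prompt embedding with the mean
# of several style-example embeddings into one conditioning vector.
# All vectors and the blend weight are made up for illustration.

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

def blend_conditioning(text_emb, style_embs, style_weight=0.3):
    """Weighted blend: (1 - w) * text + w * mean(style examples)."""
    style_mean = mean_vector(style_embs)
    return [(1 - style_weight) * t + style_weight * s
            for t, s in zip(text_emb, style_mean)]

# Toy 4-dimensional embeddings standing in for real encoder outputs.
text = [1.0, 0.0, 0.0, 0.0]
styles = [[0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0]]
cond = blend_conditioning(text, styles)
```

Real systems do something far more involved (e.g. cross-attention over full embedding sequences rather than a single averaged vector), but the building-block nature of the idea is the point: the pieces for "prompt plus style examples" already exist.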

That's not to say it isn't really hard, but it doesn't seem like it requires fundamental innovations to do this. The building blocks that are needed already exist.



The biggest problem I see with LLM-generated imagery is a near-total inability to get details right, which makes perfect sense when one considers how these models work.

These models pick out patterns in the data they're trained on and then regurgitate them. This works great for broad strokes, because broad strokes have relatively little variance between training pieces and have distinct visual signatures that act as anchors.

Details, on the other hand, differ dramatically between pieces and have no such consistent visual anchor. Take limbs, which are notoriously problematic: arms, legs, and especially hands and fingers can appear in innumerable articulations, positions relative to the rest of the body, states of clothing, degrees of occlusion, and so on. A model that doesn't actually understand the subject matter is predictably terrible at drawing the connections between all of these disparate states, and struggles to render them without human guidance.

You see this effect in other fine details, too. Jewelry, chain-link fences, fishing nets, chainmail, lace, etc. are all near-guaranteed disasters for these things.


It's mostly a problem of resolution, model size, and dataset quality, all of which can be mitigated with compositing. Larger models don't have problems with hands, and when they do, it can be solved with higher-order guidance (e.g. ControlNets) and multiple supersampled passes on regions, to avoid fitting too much detail into one generation. Even SD 1.5's (a notoriously tiny model) issues with faces and hands can be solved with multiple passes, which is what everyone does.
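The regional multi-pass idea can be sketched in toy form (pure Python; images are plain 2D grids of floats, and the diffusion "refine" step is a stub, where a real pipeline would re-denoise the crop with a model before compositing it back):

```python
# Toy sketch of a supersampled regional pass: crop a region, upscale it,
# "refine" it (stubbed here; a real pipeline would run a diffusion model
# on the crop), then downscale and composite it back into the image.

def upscale(region, factor):
    """Nearest-neighbour upscale of a 2D grid."""
    return [[region[r // factor][c // factor]
             for c in range(len(region[0]) * factor)]
            for r in range(len(region) * factor)]

def downscale(region, factor):
    """Box-filter downscale: average each factor x factor block."""
    h, w = len(region) // factor, len(region[0]) // factor
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            block = [region[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            out[r][c] = sum(block) / len(block)
    return out

def refine(region):
    """Stand-in for a model pass; here it just snaps values toward 0/1."""
    return [[float(round(v)) for v in row] for row in region]

def regional_pass(image, top, left, size, factor=2):
    """Crop, upscale, refine, downscale, and paste back one region."""
    crop = [row[left:left + size] for row in image[top:top + size]]
    refined = downscale(refine(upscale(crop, factor)), factor)
    for r in range(size):
        image[top + r][left:left + size] = refined[r]
    return image

img = [[0.4, 0.4, 0.1, 0.1],
       [0.4, 0.6, 0.1, 0.1],
       [0.1, 0.1, 0.9, 0.9],
       [0.1, 0.1, 0.9, 0.6]]
regional_pass(img, 0, 0, 2)  # only the top-left 2x2 region is touched
```

The point of the structure: the model only ever sees one small region at full effective resolution, so it never has to fit all the detail of a face or a hand into a few pixels of a single generation.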


There are two problems with this: a) natural language is inherently poor at giving artistic direction compared to higher-order methods like sketching and references, even with a human on the other end of the wire, and b) to create something conceptually appealing or novel, the model has to have much better conceptualizing ability than is currently possible with the best LLMs, which already need some mighty hardware to run. Besides, tweaking the prompt will probably never be stable, partly for the reasons outlined in the OP; although you could optimize for that, I guess.

That said, better understanding is always welcome. DeepFloyd IF tried to pair a full-fledged transformer with a diffusion part (albeit with only 11B parameters). It improved the understanding of complex prompts like "koi fish doing a handstand on a skateboard", but it also pushed the hardware requirements way up, and it didn't solve the fundamental issues above.


I think you're right about the current limitations, but imagine a trillion- or ten-trillion-parameter model trained and RLHF'd for this specific use case. It may take a year or two, but I see no reason to think it isn't coming.

Yes, hardware requirements will be steep, but it will still be cheap compared to equivalent human illustrators. And compute costs will go down in the long run.



