Considering that a fleur-de-lis involves somewhat intricate curves, I think I'd be pretty happy with myself if I could get that task done in an hour.
Given a harness that allows the model to validate the result of its program visually, and given the models are capable of using this harness to self correct (which isn't yet consistently true), then you're in a situation where in that hour you are free to do some other work.
A dishwasher might take 3 hours to do for what a human could do in 30 minutes, but they're still very useful because the machine's labor is cheaper than human labor.
Even with well understood languages, if there isn't much in the public domain for the framework you're using it's not really that helpful. You know you're at the edges of its knowledge when you can see the exact forum posts you are looking at showing up verbatim in it's responses.
I think some industries with mostly proprietary code will be a bit disappointing to use AI within.
Supposedly the frontier LLMs are multimodal and trained on images as well, though I don't know how much that helps for tasks that don't use the native image input/output support.
Whatever the cause, LLMs have gotten significantly better over time at generating SVGs of pelicans riding bicycles:
I have to admit I'm seeing this for the first time and am somewhat impressed by the results and even think they will get better with more training, why not... But are these multimodal LLMs still LLMs though? I mean, they're still LLMs but with a sidecar that does other things and the training of the image takes place outside the LLMs so in a way the LLMs still don't "know" anything about these images, they're just generating them on the fly upon request.
Some of the LLMs that can draw (bad) pelicans on bicycles are text-input-only LLMs.
The ones that have image input do tend to do better though, which I assume is because they have better "spatial awareness" as part of having been trained on images in addition to text.
I use the term vLLMs or vision LLMs to define LLMs that are multimodal for image and text input. I still don't have a great name for the ones that can also accept audio.
The pelican test requires SVG output because asking a multimodal output model like Gemini Flash Image (aka Nano Banana) to create an image is a different test entirely.
I got Opus 4.6 to one shot it, took 5-ish mins. "Write me a python program that outputs an svg of a fleur-de-lis. Use freely available images to double check your work."
It basically just re-created the wikipedia article fleur-de-lis, which I'm not sure proves anything beyond "you have to know how to use LLMs"
Just for reference, Codex using GPT-5.4 and that exact prompt was a 4-shot that took ten minutes. The first result was a horrific caricature. After a slight rebuke ("That looks terrible. Read https://en.wikipedia.org/wiki/Fleur-de-lis for a better understanding of what it should look like."), it produced a very good result but it then took two more prompts about the right side of the image being clipped off before it got it right.
Same, I used Sonnet 4.6 with the prompt, "Write a simple program that displays a fleur-de-lis. Python is a good language for this." Took five or six minutes, but it wrong a nice Python TK app that did exactly what it was supposed to.
The model stumbles when asked to invent procedural geometry it has rarely tokenized because LLMs predict tokens, not precise coordinate math. For reliable output define acceptance criteria up front and require a strict format such as an SVG path with absolute coordinates and explicit cubic Bezier control points, plus a tiny rendering test that checks a couple of landmark pixels.
Break the job into microtasks, ask for one petal as a pair of cubic Beziers with explicit numeric control points, render that snippet locally with a simple rasterizer, then iterate on the numbers. If determinism matters accept the tradeoff of writing a small generator using a geometry library like Cairo or a bezier solver so you get reproducible coordinates instead of watching the model flounder for an hour.
I tried to use Codex to write a simple TCP to QUIC proxy. I intentionally kept the request fairly simple, take one TCP connection and map it to a QUIC connection. Gave a detailed spec, went through plan mode, clarified all the misunderstandings, let it write it in Python, had it research the API, had it write a detailed step by step roadmap... The result was a fucking mess.
Beyond the fact that it was "correct" in the same way the author of the article talked about, there was absolutely bizarre shit in there. As an example, multiple times it tried to import modules that didn't exist. It noticed this when tests failed, and instead of figuring out the import problem it add a fucking try/except around the import and did some goofy Python shenanigans to make it "work".
Try writing code from description without looking at the picture or generated graphics. Visual LLM with a suggestion to find coordinates of different features and use lines/curves to match them might do better.
No exaggeration it floundered for an hour before it started to look right.
It's really not good at tasks it has not seen before.