Context: I've been using agents (both Claude Code and Codex) for my daily work and for personal projects, but always in domains where I have some knowledge, and I'm currently happy with them.
I tried using Claude Code to build an RPG game with Godot and GDScript, using free-to-use assets: a total failure :/
The game was supposed to take many implementation steps, but I asked Claude to first produce a one-area demo, so I could test the assets and choose the ones I liked. First it produced some garbage, using the assets randomly. Then it tried to copy from an existing demo, but it had no idea where a door or a path was, and at a certain point it even admitted it with something like: "I can't design a usable and nice area: I either make it functional and ugly, or I copy and adapt the existing demo but will have no clue about what is what."
I've never even attempted to develop games before, so I'm sure I don't even know the basic concepts, but this use case definitely didn't work for me.
Maybe it could generate the code of the game if I provided the full design?
That's exactly the failure mode this project exists to solve. The core issue is that Claude Code has no way to see what it's producing: code compiles fine, but assets are floating, paths lead nowhere, layouts are garbage. It even told you as much.
Godogen closes that loop: after writing code, it captures screenshots from the running engine and a vision model evaluates them. That's the difference between "compiles but broken" and "actually playable."
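If it helps to picture it, here's a minimal sketch of that loop in Python (not Godogen's actual code). It assumes a small GDScript autoload in the project that saves a frame to shots/area_demo.png on startup, and a vision-capable model behind an OpenAI-compatible endpoint; all names here are illustrative:

```python
# Minimal sketch of a render-then-review loop. Assumptions: the project
# contains an autoload script that saves a screenshot before --quit-after
# stops the engine, and an OpenAI-compatible vision model is available.
import base64
import subprocess
from pathlib import Path

from openai import OpenAI

client = OpenAI()

def capture_screenshot(project_dir: str) -> None:
    # Run the project for ~120 frames; the assumed autoload writes the frame.
    subprocess.run(
        ["godot", "--path", project_dir, "--quit-after", "120"],
        check=True,
        timeout=120,
    )

def review_screenshot(png_path: str) -> str:
    # Ask the vision model for structured, actionable feedback on the frame.
    image_b64 = base64.b64encode(Path(png_path).read_bytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in: use whatever vision model you prefer
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Critique this game screenshot: are assets placed "
                    "sensibly? Are doors and paths usable? Reply as a "
                    "bullet list of concrete fixes.")},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

capture_screenshot("my_game")
print(review_screenshot("my_game/shots/area_demo.png"))
# Feed this feedback into the coding agent's next prompt.
```

The valuable part is that last step: the feedback is about what the game looks like, not whether it compiles.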
And yes — providing design docs helps a lot. The pipeline generates those automatically (visual reference, architecture, task plan), but you can provide your own and customize the skills to match your vision.
It would be a hit if you packaged that loop as an MCP. Opus can make really pretty 3D models even using three.js primitives, but they tend to have serious issues (like facial features inside the head). Being able to have it automatically generate a set of screenshots, then have Gemini scrutinize them and provide structured feedback, would be a time saver. Curiously, I could not get Gemini 3.1 Pro to ever generate anything even remotely passable.
And it's exactly what I was trying to do manually :D
I accept the limitation and admit that making a video game is probably not for me, but it's nice that a solution exists.
Question: how can you find the exact session you are looking for, among hundreds of them? I had a look at my ~/.claude/projects/*/ and I couldn't even find my last session.
I had exactly this problem and didn't see anything good out there (claude --resume only searches session names and auto-created titles), so I got a tool built that uses a Rust/Tantivy full-text search index. It's part of the aichat command suite, called "aichat search".
It brings up a nice TUI for filtering and further actions. There's also a --json flag so agents can use it as a CLI search tool to find context about any past work. There's a plugin that provides a corresponding session-searcher agent that knows to use this tool to search sessions.
I have hundreds/thousands of past sessions and this has been a life saver; I can just ask the main agent, “use the session searcher agent to get the details of how we built the tmux-cli tool so we can add some features”.
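For agent-side use, the wiring can be as simple as shelling out and parsing the JSON. A sketch under assumptions: the query is a plain positional argument and the output is a JSON array (check aichat search --help for the real shape).

```python
# Illustrative only: calling `aichat search` with the --json flag mentioned
# above so another tool can consume the hits. The positional-query shape and
# the result fields are assumptions, not the tool's documented API.
import json
import subprocess

result = subprocess.run(
    ["aichat", "search", "--json", "tmux-cli"],
    capture_output=True, text=True, check=True,
)
for hit in json.loads(result.stdout):
    print(hit)  # e.g. session path, title, matching snippet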
Ha, good question. Short answer: I often let Claude Code find it.
Sessions are grouped by the folder where you ran Claude Code (e.g. ~/.claude/projects/Users-<user>-<path>), so if you don’t run everything from the same directory, it’s usually easy to narrow down.
They’re also plain JSONL files, so grep works well if you remember part of a prompt.
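If you'd rather script it than grep, a rough sketch under the same assumptions (plain JSONL files under ~/.claude/projects/, one file per session):

```python
# Quick-and-dirty session search over the layout described above. It just
# substring-matches raw JSONL lines, so no field names are assumed.
from pathlib import Path

def search_sessions(needle: str):
    for path in Path.home().glob(".claude/projects/*/*.jsonl"):
        with open(path, errors="replace") as f:
            if any(needle.lower() in line.lower() for line in f):
                yield path

for session in search_sessions("tmux-cli"):
    print(session)
```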
That said, it might be nice for claude-replay to add a helper command to list or search recent sessions.
I must have missed something: why are people moving away from OpenAI? Since they released gpt-5.3-codex I've been using it alongside Claude with opus-4.6, and Codex has always been better: more accurate, less prone to hallucinations. I can do more with a $20 OpenAI plan than with a Claude Max $100.
More specifically, OpenAI has agreed to its models being used for domestic mass surveillance and for autonomous (no human in the loop anywhere) drone attacks. ChatGPT will decide which building to destroy, and then it will be destroyed.
I'm only waiting for OpenAI to provide an equivalent ~100 USD subscription to ditch Claude entirely.
Opus has gone downhill continuously in the last week (and before you start flooding the replies: I've been testing Opus and Codex in parallel for the last week, and I have plenty of examples of Claude going off track, then apologising, then saying "now it's all fixed!" and then only fixing part of it, while Codex nailed it on the first shot).
I can accept specific model limits, but not swings in reliability. And don't even get me started on how bad the Claude client has become. Others are finally catching up, and gpt-5.3-codex is definitely better than opus-4.6.
Everyone else (Codex CLI, Copilot CLI, etc.) is going open source; they are going closed. Others (OpenAI, Copilot, etc.) explicitly allow using OpenCode; they explicitly forbid it.
We’re still in the mid-late 2020s. Once we really get to the late 2020s, attention spans won’t be long enough to even finish reading your comment. People will be speaking (not typing) to LLMs and getting distracted mid-sentence.
Opus 4.6 genuinely seems worse than 4.5 was in Q4 2025 for me. I know everyone always says this, and anecdote != data, but this is the first time I've really felt it with a new model, to the point where I still reach for the old one.
Huh… I've seen this comment a lot in this thread, but I've really been impressed with both Anthropic's latest models and latest tooling (plugins like /frontend-design mean it actually designs real front ends instead of the vibe-coded purple-gradient look). And I see it doing more planning and making fewer mistakes than before. I have to do far less oversight and debugging of broken code these days.
But if people really like Codex better, maybe I’ll try it. I’ve been trying not to pay for 2 subscriptions at once but it might be worth a test.
> And I see it doing more planning and making fewer mistakes than before
Anecdotally, maybe this is the reason? It does seem to spend a lot more time “thinking” before giving what feels like equivalent results, most of the time.
Probably eats into the gambling-style adrenaline cycles.
Heh, I find Codex to be a far, far smarter model than Claude Code.
And there's a good reason the most "famous" vibe coders, including the OpenClaw creator, all moved to Codex: it's just better.
Claude writes a lot more code to do anything: tons of redundant code, repeated code, etc. Codex is the only model I've seen that occasionally removes more code than it writes.
Funnily enough, I've been using Codex 5.3 on maximum thinking for bug hunting and code reviews, and it's been really good at it (it just seems to have a completely different focus than Opus).
I generally don't like the way Codex approaches coding itself, so I just feed its review comments back into Claude Code and off we go.
I just created an OpenCode skill where these two models talk to each other and discuss bug-finding approaches.
In my experience, two different models together work much better than one; that's why this subscription banning is distressing: I won't be able to use a tool that can use both models.
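Roughly, that kind of skill boils down to piping one model's findings into the other. A sketch, assuming both CLIs expose a non-interactive single-prompt mode (codex exec and claude -p at the time of writing; verify against your installed versions):

```python
# Two-model bug-hunting loop, sketched with subprocess calls. The exact
# CLI flags are assumptions; adjust to your installed versions.
import subprocess

def ask(cmd: list[str], prompt: str) -> str:
    out = subprocess.run(cmd + [prompt], capture_output=True, text=True,
                         check=True)
    return out.stdout

findings = ask(["codex", "exec"],
               "Review src/ for concurrency bugs and list concrete findings.")
verdict = ask(["claude", "-p"],
              "A second reviewer reported these issues:\n"
              f"{findings}\n"
              "Which are real? Propose fixes only for the real ones.")
print(verdict)
```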
It is (slower), especially at the xhigh setting. But if I have to redo things three times and keep confirming trivial stuff (Claude Code seems to keep changing the commands it uses to read code: once it uses "bash-read", once "tree", once "head", and I have to keep confirming permissions), I definitely waste more time than by giving a command to Codex (or, in my case, OpenCode + the Codex model) and coming back after 10 minutes.
I was underwhelmed by Opus 4.6. I didn't get a sense of significant improvement, but the token usage was excessive, to the point that I dropped the subscription for Codex. I suspect that all the models are so glib that they can create a quagmire for themselves in a project. I have not yet found a satisfying strategy for non-destructive resets when the system's own comments and notes poison new output. Fortunately, deleting and starting over is cheap.
No offense, but this is the most predictable outcome ever. The software industry at large does this over and over again, and somehow we're surprised: provide a thing for free or for cheap, then slowly draw back availability once you have dominant market share or find yourself needing money (ahem).
The providers want to control what AI does in order to make money or dominate an industry, so they don't have to make their money back right away. This was inevitable; I do not understand why we trust these companies, ever.
Well, yes. They know what they are doing. They know that, given the option, the consumer makes the affordable choice. I just don't have to like or condone their practices. Maybe instead of taking on billions of dollars of debt they should have thought about a business model that makes sense first? Maybe the collective "we" (consumers and investors, but especially investors) should keep it in our pants until the product is proven and sustainable?
It will be really interesting if the haters are right and this technology is not the breakthrough the investors assume it to be AFTER it is already sewn into everyone's workflows. Everyone keeps talking about how jobs will be displaced, yet few are asking what happens when a dependency is swept out from underneath the industry as a whole if/when this massive gamble doesn't pay off.
Whatever. I am squawking into the void as we just repeat history.
Or the companies can be transparent about their product roadmap. I can guarantee this enshittification was on the roadmap way before we knew about it. They let us operate under false information, that's just weak behavior.
First, we are not talking about a cheap service here. We are talking about a monthly subscription which costs 100 USD or 200 USD per month, depending on which plan you choose.
Second, it's like selling me a pizza and then demanding that I only eat it while sitting at your table. I want to eat the pizza at home. I'm not taking 2-3 extra pizzas; I'm still getting the same pizza everyone else gets.
It's the most overrated model there is. I do Elixir development primarily, and the model sucks in comparison to Gemini and GPT-5.x. But the Claude fanboys will swear by it and will attack you if you ever say even something remotely negative about their god-sent model. It fails miserably even in basic chat and research contexts and constantly goes off track. I wired it up to fire off some tasks; it kept hallucinating and swearing it had done them when it hadn't even attempted them. It was so unreliable I had to revert to Gemini.
It might simply be that it was not trained as much in Elixir RL environments as Gemini and GPT were.
I use it for both TS and Python and it's certainly better than Gemini. Against Codex, it depends on the task.
Claude has gotten a lot of popular media attention in the last few weeks, and the influx of users is constraining compute/memory on an already compute-heavy model.
So you get all the suspected "tricks" like quantization, shorter thinking, KV cache optimizations.
It feels like the same thing that happened to Gemini 3, and it's something you can even feel throughout the day (the models seem smartest at 12am).
Dario, in his interview with Dwarkesh last week, also lamented the same refrain as other lab leaders: compute is constrained and there are big tradeoffs in how you allocate it. It feels safe to reason, then, that they will use any trick they can to free up compute.
I regularly run the same prompts twice and through different models, particularly when making changes to agent metadata like agent files or skills.
At least weekly I run a set of prompts to compare Codex and Claude against each other. This is quite easy: the prompt sessions are just text files that get saved.
The problem is doing it enough for statistical significance and judging the output as better or not.
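Concretely, the replay harness can be tiny: run each saved prompt file through both models and keep the outputs side by side for later judging. A sketch (the CLI invocations are assumptions; check your installed versions):

```python
# Replay each saved prompt through both models and store outputs for
# side-by-side judging. CLI flags are assumptions, not documented APIs.
import subprocess
from pathlib import Path

MODELS = {"claude": ["claude", "-p"], "codex": ["codex", "exec"]}
Path("results").mkdir(exist_ok=True)

for prompt_file in sorted(Path("prompts").glob("*.txt")):
    prompt = prompt_file.read_text()
    for name, cmd in MODELS.items():
        run = subprocess.run(cmd + [prompt], capture_output=True, text=True)
        Path("results", f"{prompt_file.stem}.{name}.txt").write_text(run.stdout)
```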
I suspect you may not be writing code regularly...
If I have to ask Claude the same things three times and it keeps saying "You are right, now I've implemented it!" and the code is still missing 1 out of 3 things or worse, then I can definitely say the model has become worse (since this wasn't happening before).
I haven't experienced this with gpt-5.3-codex (xhigh), for example. Opus/Sonnet usually work well when just released, then degrade quite regularly. I know the prompts are not the same every day, or even across the day, but if the types of problems are always the same (at least in my case) and a model starts doing stupid things, then something is wrong. Everyone I know who uses Claude regularly has the same experience whenever I notice it degrade.
When I use Claude daily (both professionally and personally, with a Max subscription), there are things it does differently between 4.5 and 4.6. It's hard to point to any single conversation, but in aggregate I'm finding that certain tasks don't go as smoothly as they used to. In my view, Opus 4.6 is a lot better at long-running conversations (which has value), but does worse with critical details within smaller conversations.
A few things I've noticed:
* 4.6 doesn't look at certain files that it used to
* 4.6 tends to jump into writing code before it's fully understood the problem (annoying but promptable)
* 4.6 is less likely to do research, write to artifacts, or make external tool calls unless you specifically ask it to
* 4.6 is much more likely to ask annoying (blocking) questions that it could reasonably figure out on its own
* 4.6 is much more likely to miss a critical detail in a planning document after being explicitly told to plan for that detail
* 4.6 needs to more proactively write its memories to file within a conversation to avoid going off track
* 4.6 is a lot worse about demonstrating critical details. I'm so tired of it explaining something conceptually without thinking through how it would implement the details.
Just hit a situation where 4.6 is driving me crazy.
I'm working through a refactor and I explicitly told it to use a block (as in Ruby blocks), and it completely overlooked that. Totally missed it as something I asked it to do.
Same! I personally released a couple of CLIs (written using Claude Code) that I regularly use for my work: logbasset (to access Scalyr logs) and sentire (to access Sentry issues). I never use them manually; I wrote them to be used by LLMs. I think they are lighter than an MCP.
I don't think that retention part was clear at all; it was separate from the opt-out. I assume I'm now opted out, but that they'll keep the data for five years anyway.
Same, except with 64GB and an M3 Max, smh... it takes literally minutes to open the "Labels" popup and make a PR... it's completely unacceptable for a product like this...
1) I tried to use it on an existing project, asking: "Analyse the project and create a GEMINI.md". It fumbled some nonsense for 10-15 minutes, then said it was done, but it had only analysed a few files in the root and hadn't generated anything at all.
2) Despite finding a way to log in with my workspace account, it then asks me for GOOGLE_CLOUD_PROJECT, which doesn't make any sense to me.
3) It's not clear AT ALL if and how my data and code will be used to train the models. Until this is made clear, it's a no-go for me.
P.S.: it feels like a promising project that has been rushed out too quickly :/