Hacker Times | Bjorkbat's comments

I was about to post a snarky comment along the lines of "Sierra? The publishers of Homeworld and Homeworld 2?"


I always got the impression that the only way a relatively average software developer could make a “successful” SaaS is if they built something weird and niche that appeals to an audience of <1000 paying customers. In that sense, it doesn’t really matter if your competition is better at SEO since your competition never even thought to build something like this, never cared, and not to mention that the market for this thing is so small that SEO is arguably a wasted skill. You’ll need to acquire these people by finding them directly or through word of mouth.

This blog post seems to fundamentally misunderstand the nature of solo-developer SaaS, but then again arguably most software developers also fundamentally misunderstand it.


Finding out that this is over 10 years old has made me profoundly sad. Despite the age of LLMs arguably unlocking massive amounts of productivity and agency for developers and non-developers alike, it feels as though we are living in a dark age of creativity on the web, maybe even a dark age for computer culture in general.


New interesting artsy web projects are being posted on hn all the time. neal.fun is an obvious example but there are plenty of others as well.

https://ambient.garden/

https://cannoneyed.com/isometric-nyc/

https://terra.layoutit.com/

https://ambigr.am/hall-of-fame

https://autism-simulator.vercel.app/


I'm keenly aware, I have a pretty extensive collection of Hacker News bookmarks. It's hard to articulate why I think these are different, but I think the best way to put it is that cachemonet feels a lot more avant garde, and perhaps also a reflection of a very particular form of "web culture" that has no clear successors.

People are experimenting with what you can do on the web, but the experiments aren't very "aesthetically inspiring". For that reason I'm kind of lukewarm on neal.fun.

EDIT: so I think a better way to describe it is that when artists experiment with technology, you get something like cachemonet. When developers experiment with technology, you get a web experiment that challenges conventional notions of what you can do with the web, but with varying degrees of creativity. I think terra.layoutit.com is best appreciated by other web devs who can appreciate the sheer amount of work required to figure out how to render a terrain map in CSS, but otherwise it's basically just a tool to generate terrain height maps, and not a particularly good one. Generating terrain maps in CSS is not a feature, but a handicap.


I wonder when peak demoscene occurred .. some of those mini code demos seem artistically and technically innovative.


I believe the demoscene is still ongoing, especially in France. Would love to understand why that is (French tech: parallel early teletype internet, high demoscene, more open approach to UFOs (GEIPAN) - French don't seem obviously "that different" to Americans, but there's obviously something different going on).


To me demoscene is kind of synonymous with the Amiga, so I would argue that peak demoscene lines up with the rise and fall of the Amiga brand.

So I think that's maybe the other differentiator between web experiment and art, because demoscene has a very distinct but difficult to describe cultural element that makes me identify it as art.


This is a good description. Well done.


I posit that periods of relatively high creativity [ in art science music literature ] coincide with periods of relatively low inequality.

i.e. if everyone is working so hard to pay rent / college, nobody has time to work on side projects in the garage, or go deep into books, or dedicate spare time to a craft or go down a science research rabbit hole.

I'm not sure LLMs will free up much time for people in the middle of the economy - they might produce more but get paid the same.


I'm not sure if that's true. The Renaissance was peak creativity, but also high inequality - from peasants to the Medicis. Chinese and Japanese art seemed to flourish during wealthy imperial times, but decline during war, where the blender of chaos made people much more equal. Chinese art surged back in the last two decades in new modern forms.

Basquiat thrived during peak 1980s New York, and had a rags to riches trajectory, I think. Art is not generally something people get to "as a hobby" when they have time among normal life. The artist mindset is different: you need to do it. It's survival. Not about money. You have to express and create. You probably don't choose like other people.

The true creatives find a way with what they have. This is not to denigrate people who take up painting or photography as a hobby and often produce high quality stuff. It is to distinguish separate experiences. It's also to highlight that "great creativity" comes from a psychic imperative and visceral drive on part of the people who do it.


Ironically the physics are kind of my biggest criticism. They call these "world models", but I think it's more accurate to call them "video game models" because they employ "video game physics" rather than real world physics, among other things

This is most evident in the way things collide.


It's getting better staggeringly fast, just a year ago I wouldn't expect it to be at even video game physics level so quickly.

If it keeps improving at a rate similar to LLMs, a way to simulate fluid dynamics or structural dynamics with reasonable accuracy and speed could unlock a much faster pace of innovation in the physical world (provided the results are validated with rigorous scientific methods).


Numerical simulation is a well explored field, we know how to do all sorts of things, the issues lie rather in the tooling and robustness of it all put together (from geometry to numerical results) than in conceptual barriers. Finite Differences have existed since the 1700's! What hadn't for the longest time, is the computational power to crunch billions of operations per simulation.

A nice thing about numerical simulation from first principles, is it innately supports arbitrary speed/precision, that's in fact the backbone of the mathematical analysis for why it works.
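To make the speed/precision trade-off concrete, here is a minimal sketch using a central finite difference. The test function (sin at x = 1) and the step sizes are illustrative choices, not anything from a real solver; the point is only that shrinking the step buys accuracy at the cost of more work.

```python
import math

def central_difference(f, x, h):
    """Second-order central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def convergence_table(f, exact, x, steps):
    """Return (h, absolute error) pairs for each step size in `steps`."""
    return [(h, abs(central_difference(f, x, h) - exact)) for h in steps]

# Illustrative choice: differentiate sin at x = 1, where the exact answer is cos(1).
table = convergence_table(math.sin, math.cos(1.0), 1.0, [0.1, 0.05, 0.025])
for h, err in table:
    print(f"h={h:<6} error={err:.3e}")
```

Halving the step should cut the error by roughly a factor of four here, which is the kind of predictable convergence the mathematical analysis relies on.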

In some cases, as is the case for CFD, we're actually mathematically screwed because you just have to resolve the small scales to get the macro dynamics. So the standard remains a kind of hack, which is to introduce additional equations (turbulence models) that steer the dynamics in place of the small (unresolved) scales. We know how to do better though (DNS), but it costs an arm and a leg (like years to millennia on a supercomputer).


I’m sure there’s some boring neuro-chemical explanation for this, and I won’t doubt or deny the neuro-chemical explanation, but the fact that there’s a mushroom that consistently brings about hallucinations of tiny people is so bizarre that I kind of want to indulge in equally bizarre explanations. Maybe it’s not a hallucination and this mushroom simply allows us to see the tiny people all around us. Maybe mushrooms are intelligent and are intentionally making us hallucinate tiny people.

It’s a little bit crazy, I know, but it’s odd to me that evolutionary forces would produce a mushroom that makes you have some specific hallucinations, rather than simply make things swirl together or simply produce intense feelings of euphoria or dread. I mean, marijuana just gets you high and that’s that.


Perhaps human-like creatures are so common in drug hallucinations because we're human, social animals, creatures who are maximally interested in other humans. If you gave drugs to dogs then perhaps they'd see human-like things mixed with dog-like things. I assume crocodiles, solitary animals, would see nothing besides wounded fish or maybe sexy female crocodiles.


Something I find weird about AI image generation models is that even though they no longer produce weird "artifacts" that give away the fact that it was AI generated, you can still recognize that it's AI due to stylistic choices.

Not all examples they gave were like this. The example they gave of the word "Typography" would have fooled me as human-made. The infographics stood out though. I would have immediately noticed that the String of Turtles infographic was AI generated because of the stylistic choices. Same for the guide on how to make chai. I would be "suspicious" of the example they gave of the weather forecast but wouldn't immediately flag it as AI generated.

Similar note, earlier I was able to tell if something was AI generated right off the bat by noticing that it had a "Deviant Art" quality to it. My immediate guess is that certain sources of training data are over-represented.


We are just very sharp when it comes to seeing small differences in images.

I'm reminded of when the air force decided to create a pilot seat that worked for everyone. They took the average body dimensions of all their recruits and designed a seat to fit the average. It turned out, the seat fit none of their recruits. [1]
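A toy simulation of why a seat built for the average can fit almost nobody. This assumes each body measurement is an independent standard normal and that "fits" means within 0.3 standard deviations of the mean; both are made-up simplifications (real body dimensions are correlated), so treat it as a sketch of the effect rather than a reconstruction of the air force study.

```python
import random

def fraction_near_average(n_people, n_dims, tolerance=0.3, seed=0):
    """Fraction of simulated people within `tolerance` standard deviations
    of the mean on every one of `n_dims` independent measurements."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_people):
        # A person "fits" only if every single measurement is near average.
        if all(abs(rng.gauss(0.0, 1.0)) <= tolerance for _ in range(n_dims)):
            hits += 1
    return hits / n_people

one_dim = fraction_near_average(10_000, 1)
ten_dims = fraction_near_average(10_000, 10)
print(f"near average on 1 measurement:   {one_dim:.3f}")
print(f"near average on 10 measurements: {ten_dims:.4f}")
```

Under these assumptions a fair share of people are close to average on any one measurement, but the share that is close on all ten at once collapses toward zero.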

I think AI image generation is a lot like this. When you train on all images, you get to this weird sort of average space. AI images look like that, and we recognize it immediately. You can prompt or fine tune image models to get away from this, though -- the features are there; it's a matter of getting them out. Lots of people are trying stuff like this: https://www.reddit.com/r/StableDiffusion/comments/1euqwhr/re..., and the results are nearly impossible to distinguish from real images.

[1] https://www.thestar.com/news/insight/when-u-s-air-force-disc...


What determines which “average” AI models latch onto? At a pixel level, the average of every image is a grayish rectangle; that's obviously not what we mean and AI does not produce that. At a slightly higher level, the average of every image is the average of every subject ever photographed or drawn (human, tree, house, plate of food, ...) in concept space; but AI still doesn't generate a human with branches or a house with spaghetti on it. At a still higher level there are things we recognize as sensible scenes, e.g., barista pouring a cup of coffee, anime scene of a guy fighting a robot, watercolor of a boat on a lake, which AI still does not (by default) average into, say, an equal parts watercolor/anime/photorealistic image of a barista fighting a robot on a boat while pouring a cup of coffee.

But it is undeniable that AI images do have an “average” feel to them. What causes this? What is the space over which AI is taking an average to produce its output? One possible answer is that a finite model size means that the model can only explore image space with a limited resolution, and as models get bigger/better they can average over a smaller and smaller portion of this space, but it is always limited.

But that raises the question of why models don't just naturally land on a point in image space. Is this just a limitation of training, which punishes big failures more strongly than it rewards perfection? Or is there something else at play here that's preventing models from landing directly on a “real” image?


> At a pixel level, the average of every image is a grayish rectangle; that's obviously not what we mean and AI does not produce that.

That isn't correct since images in the real world aren't uniformly distributed from [0, 255] color-wise. Take, for example, the famous ImageNet normalization magic numbers:

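    # Per-channel (RGB) statistics for pixel values scaled to [0, 1]
    from torchvision import transforms  # assumes torchvision is installed
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

If pixel values were actually uniformly distributed (after scaling to [0, 1]), the mean for each channel would be 0.5 and the standard deviation would be about 0.289 (1/√12). Also, due to z-normalization, the "image" most image models see is not how humans typically see images.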
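For anyone who wants to check the uniform-distribution numbers, a quick sketch using only the standard library. The samples are synthetic random values standing in for pixel intensities scaled to [0, 1], not real image data.

```python
import math
import random

# Closed form for a uniform distribution on [0, 1]:
# mean = 1/2, standard deviation = sqrt(1/12).
uniform_mean = 0.5
uniform_std = math.sqrt(1.0 / 12.0)

# Empirical check with simulated "pixel" intensities (synthetic, not real images).
rng = random.Random(0)
samples = [rng.random() for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((s - sample_mean) ** 2 for s in samples) / len(samples))

print(f"closed form: mean={uniform_mean:.3f} std={uniform_std:.3f}")
print(f"simulated:   mean={sample_mean:.3f} std={sample_std:.3f}")
```

Both come out near 0.5 and 0.289, whereas the ImageNet statistics quoted above are darker on average and less spread out.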


The model "averages" in the latent space. That is in the space of packed image representations. I put "averages" into scare quotes, because I think it might be due to legal reasons. The model training might be organized in such a way as to push its default style away from styles of prominent artists. I might be wrong though.


Isn't the space you're talking about the input images that are close to the textual prompt?

These models are trained on image+text pairs. So if you prompt something like "an apple" you get a conceptual average of all images containing apples. Depending on your dataset, it's likely going to be a photograph of an apple in the center.


See the third diagram in https://www.mdpi.com/1424-8220/24/18/6049 . There are elements of noise, of input embeddings in the form of images, or in the form of text.


Tragedy of the aggregate.


It's a bit odd to say, but another big clue identifying something as AI-generated is that it simply looks "too good" for what it is being used for. If I see a little info graphic demonstrating something relatively mundane, and it has nice 3D rendered characters or graphical elements, at this point it's basically guaranteed to be AI, because you just sort of intuitively know when something would've justified the human labor necessary to produce that.


Funny enough that had crossed my mind with the woodchuck example, because at a glance I can't see any weird artifacts, but I felt confident I could tell it was AI generated immediately if I saw it in the wild, and I couldn't really explain why. My immediate guess was "well, who the hell would actually bother to make something like this?"


It's not odd to say. It was one of the first telling signs to identify AI artists[0] on Twitter: overly detailed backgrounds.

Of course now a lot of them have learned the lesson and it's much harder to tell.

[0]: I know, I know...


I think it's because they're all trained on the same data (everything they could possibly scrape from the open web). The models tend to learn some kind of distribution of what is most likely for a given prompt. It tends to produce things that are very average looking, very "likely", but as a result also predictable and unoriginal.

If you want something that looks original, you have to come up with a more original prompt. Or we have to find a way to train these models to sample things that are less likely from their distribution? Find a way to mathematically describe what it means to be original.
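One existing knob for "sample less likely things" is temperature scaling of a categorical distribution. The sketch below is a toy with made-up scores for three candidate outputs; it is the text-model style of sampling control, and I am not claiming image generators expose it in this form or that it amounts to a definition of originality.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert scores to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate outputs, most to least "likely".
logits = [4.0, 2.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: safe picks dominate
flat = softmax_with_temperature(logits, 2.0)   # high temperature: unlikely picks gain mass
print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])
```

Raising the temperature shifts probability toward the unlikely candidates, but it only makes outputs less predictable, which is not the same thing as making them more interesting.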


A more original prompt won't fix things. Modern base models want to eliminate everything that puts their creators at risk, which is anything that is clearly made by someone else, more or less accurately reproducible. If you avoid decent representation of any artist style, or anything/anyone that is likely to go to court, you won't get the chance of a creative synthesis either.


Do you know of some tools with a parameter that asks it to be "weird" and increase diversity of outputs?


If you want a chance for real creativity, flexibility and you have a decent gpu go local. Check out comfyui, download models and play around. The mainstream services have zero knobs to play around with, local is infinite.


If you ever had a pinterest account and a deviant art account, all becomes clear.


It still has some artifacts more often than not, they are a lot subtler in nature but they still come out, whether it's texture, proportion, lighting, or perspective. Now some things are easier to fix on second pass edits, some are not. I guess it's why they consider image editing to be the next challenge.


We can also pick up hints on discordant production value. This is quite noticeable on websites such as Amazon/Alibaba/Etsy/Ebay/etc where there's a lot of scam listings that use AI images for cheap or basic items.

So even though the image shown doesn't present obvious flaws, the fact that the image is high quality is the tell-tale sign of being AI generated.

This also isn't something that can be easily fixed - even if we produce convincing low production value imagery using AI, then the scam listing doesn't achieve its goal because it looks like junky crap.


The problem is how they are fine tuned with human feedback that is not opinionated, so they produce some "average taste" that is very recognizable. Early models didn't have this issue, it's a paradox... Lower quality / broken images but often more interesting. Krea & Black Forest did a blog post about that some time ago.


Oh yeah, funny enough even though I’m a bit of an AI art hater I actually thought very early Midjourney looked good because it all had an impressionistic, dreamy quality.


I wonder if we'll get to the point where we train different personalities into an image model that we can bring out in the prompt and these personalities have distinct art/picture styles they produce.


I don't think it's solely a data issue. Flux models for example are quite stylized, very notable with photorealism. But I think it was a deliberate choice to have outputs that are absent of likeness and distinct style. I think it's a side effect that it washes away fine details and creates outputs that feel artificial. The problem is that closed models can't be fixed easily, while models like Flux or even older architectures can add back details and style with fine tuning and LoRAs.


Maybe the AI feeling is an illusion because you already know it's AI-generated, just confirmation bias. Like wine tasting better after you learn it's expensive. In the real world, AI-generated images have passed the Turing test. Only with a double-blind test can you be really sure.


Reminds me of the old gem of the Web 1.0 internet that was Exit Mundi


One of my most frustrating things regarding the potential of an AI bubble was some very smart and intelligent researcher being incredibly bullish on AI on Twitter because if you extrapolate graphs measuring AI's ability to complete long-duration tasks (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...) or other benchmarks, then by 2026 or 2027 you've basically invented AGI.

I'm going to take his statements at face value and assume that he really does have faith in his own predictions and isn't trying to fleece us.

My gripe with this statement is that this prediction is based on proxies for capability that aren't particularly reliable. To elaborate, the latest frontier models score something like 65% on SWE-bench, but I don't think they're as capable as a human that also scored 65%. That isn't to say that they're incapable, but just that they aren't as capable as an equivalent human. I think there's a very real chance that a model absolutely crushes the SWE-bench benchmark but still isn't quite ready to function as an independent software engineering agent.

So a lot of this bullishness basically hinges on the idea that if you extrapolate some line on a graph into the future, then by next year or the year after all white-collar work can be automated. Terrifying as that is, this all hinges on the idea that these graphs, these benchmarks, are good proxies.

And if they aren't, oh wow.


There's a huge disconnect between what the benchmarks are showing and what those of us using LLMs day-to-day are experiencing. According to SWE-bench, I should be able to outsource a lot of tasks to LLMs by now. But practically speaking, I can't get them to reliably do even the most basic of tasks. Benchmaxxing is a real phenomenon. Internal private assessments are the most accurate source of information that we have, and those seem to be quite mixed for the most recent models.


How ironic that these LLMs appear to be overfitting to the benchmark scores. Presumably these researchers deal with overfitting every day, but can't recognize it right in front of them.


I'm sure they all know it's happening. But the incentives are all misaligned. They get promotions and raises for pushing the frontier which means showing SOTA performance on benchmarks.


> very smart and intelligent researcher being incredibly bullish on AI on Twitter

A bit offtopic but as time goes by, I believe we can be very intelligent in some aspects and very, very naive and/or wrong in other aspects.


>> by next year or the year after all white-collar work can be automated

Work generates work. If you remove the need for 50% of the work then a significant amount of the remaining work never needs to be done. It just doesn't appear.

The software that is used by people in their jobs will no longer be needed if those people aren't hired to do their jobs. There goes Slack, Teams, GitHub, Zoom, Powerpoint, Excel, whatever... And if the software isn't needed then it doesn't need to be written, by either a person or an AI. So any need for AI Coders shrinks considerably.


You mean Julian Schrittwieser (collaborator on AlphaGo and first author on MuZero)?

https://www.julian.ac/blog/2025/09/27/failing-to-understand-...


I think people underestimate the degree to which fun matters when it comes to productivity. If something isn’t fun then I’ll likely put it off. A 15 minute task can become hours, maybe days long, because I’m going to procrastinate on doing it.

If managing a bunch of AI agents is a very un-fun way to spend time, then I don’t think it’s the future. If the new way of doing this is more work and more tedium, then why the hell have we collectively decided this is the new way to work when historically the approach has been to automate and abstract tedium so we can focus on what matters?

The people selling you the future of work don’t necessarily know better than you.


I think some people have more fun using LLM agents and generative AI tools. Not my case, but you can definitely read a bunch of comments from people using the tools and having fun or experiencing a state of flow like they have never had before.


>I think some people have more fun using LLM agents and generative AI tools

I think I'm one of them

The rate at which I can explore new paths, or revisit old ones with a new perspective, has _exploded_ and I love it

But then I'm the kind of person who could spend hours on Wikipedia going from one page to the next, so that might have something to do with it

There's just so much to learn, I'm in my element

(Though I use agents mostly in Ask mode, or I manually review every line of code in Agent mode and never commit anything I don't understand)


I definitely agree with you there. I contracted with a company that had some older engineers who were in largely managerial roles who really liked using AI for personal projects, and honestly, I kind of get it. Their work flow was basically prompt, get results, prompt again with modifications, rinse and repeat, it's low effort and has a nice REPL-like loop. Paraphrasing a bit, but it basically re-kindled the joy of programming for them.

Haven't gotten the chance to ask, but I imagine managing a team of AI agents would feel a little too much like their day job, and consequently, suck the fun out of it.

That said, looking back, I think the reason why generative AI is so fun for so many coders is because programming has become unnecessarily complex. I have to admit, programming nowadays for me feels like a bit of a slog at times because of the sheer effort it can sometimes take to implement the simplest things. Doesn't have to be that way, but I think LLM copy-paste machines are probably the wrong direction.


I think the majority of people I've worked with who have the title of "Software Engineer" do not like coding. They got into it for the money/career, and dream of eventually moving out of coding into management. I can count the number of coders who I've met who like coding on one hand


It's a different kind of fun for me.

I've been enjoying seeing my agents produce code while I am otherwise too busy to program, or seeing refined prompts & context engineering get better results. The boring kinds of programming tasks that I would normally put off are now lower friction, and now there's an element of workflow tinkering with all these different AI tools that lets me have some fun with it.

I also recently programmed for a few hours on a plane, with no LLM assistance whatsoever, and it was a refreshing way to reconnect with the joy of just internalizing a problem and fitting the pieces together in realtime. I am a bit sad that this kind of fun may no longer be lucrative in the near future, but I am thankful I got to experience it.


I’ll be that voice I guess - I have fun “vibe coding”.

I’m a professional software engineer in Silicon Valley, and I’m fortunate to have been able to work on household-name consumer products across my career. I definitely know how to do “real” professional work “at scale” or whatever. Point is, I can do real work and understand things on my own, and I can generally review code and guide architecture and all that jazz. I became a software engineer because I love creating things that I and others could use, and I don’t care about “solving the puzzle” type satisfaction from writing code. In engineering school, software had the fastest turnaround time from idea in my head to something I could use, and that’s why I became a software engineer.

LLM assisted coding accelerates this trend. I can guide an LLM to help me create things quickly and easily. Things I can mostly create myself, of course, but I find it faster for a whole category of easy tasks like generating UIs. It really lowers the “activation energy” to experiment. I think of it like 3D printing, where I can prototype ideas in an afternoon instead of a long weekend or a few weeks.


>because I love creating things that I and others could use, and I don’t care about “solving the puzzle” type satisfaction from writing code.

Please don't take offense to this, but it sounds like you just don't like building software? It seems like the end goal is what excites you, not the process.

I think for many of us who prefer to write code ourselves, the relationship we have with building software is for the craft/intellectual stimulation. The working product is cool of course, but the real joy is knowing how to do something new.


I understand where you're coming from (and I don't take offense), but based on your reply, I don't really feel like my views came across.

When I was a student, I took classes on chip and circuit design. One class, the professor had us work on all these complex circuits to do things like flash lights and produce various signals with analog circuits. The next lesson, he had us replace all that complex work with a microcontroller and 20 lines of C - "the way it's done in industry". The students mourned the loss of the "real" engineering because the circuit that required skill and careful math was replaced by a cheap chip and some trivial software. Their entire concept of the craft was destroyed when they were given a tool that replaced the "fun parts" with some trivial and comparatively boring work. That same concept of replacing circuits with digital logic scaled up is how extremely complex and well engineered circuits like FPGAs work.

Maybe it was just my earlier wording, but I think there is joy in the act of turning your ideas into something real - creation - not just having something real. Shopping is not building. Importantly, it takes careful thought and practice and a learned instinct to engineer and create things correctly, and do it repeatably, as the original article discusses. Craft is about practice, and learning, and trying something new with what you've learned.

If LLMs mean that I'll never have to write another trivial set of methods to store a JSON object in a SQL database, I don't think I'll lose any project-wide joy. Expressing creativity and trying new things is what's great, not typing something that's been done a million times before. It's a tired analogy, but I do think of it more like a level of abstraction, like the LLM is a "compiler" for design docs or specifications. For myself, I usually don't see a difference between a prompt instructing an LLM to write some function, and the code for the function itself - in the same way that a method in Java, bytecode, and asm are basically the same (with some caveats here around complexity and originality).


For a lot of folks, the derivation of joy is not as scale-free as seems necessary to move up the hierarchy in this way. The jump in abstraction kills some joy by removing the tangible process. The tactile enjoyment someone gets from knitting is not there when operating a loom, much less when managing someone else who operates the loom.

The change in agency also kills the joy for me. I thrive on abstraction in the language and mathematics sense. But I do not at all enjoy indirection and delegation through unreliable agents. I am not interested in the loss of control and the new risk management task. I would never accept a "stochastic compiler" that offered to optimize my code but with risk of randomly changing the semantics. That determinism in the semantics needs to remain for me to accept a tool as a valid abstraction.

For context, I am a computer scientist by title and a programmer at heart. I got my CS degree from a liberal arts program rather than an engineering school. My temperament is more that of a hands-on artist at an easel or typewriter and not that of a manager of an engineering department. In my long career, I have thrived with peers or betters on collaborative projects. I have zero interest in "advancing" to a managerial role.

But honestly, the loss of control, lack of trust, and associated risk management is a big problem for me. I have rarely delegated work to less skilled or less reliable juniors, and I have never enjoyed that. The scenario of a confidently wrong subordinate is a huge trigger for me. It evokes long term trauma from growing up with a mentally ill family member. It feels like all of the burden of being a caregiver to someone with delusions, but with none of the moral context to make that worth the cost.


There is nothing wrong with finding joy however one finds joy, and that can vary from person to person. Someone may find joy from knitting by hand, but maybe someone else finds joy from experimenting with pattern and material, and a loom lets them focus on the parts that interest them.

I'm glad you found what interests you.


As a thought experiment, do you think it would be just as fun if you were given access to an infinite database of apps, and you were able to search through the database for an existing app that suits your needs, and then it gave it to you?

Or would it no longer be fun, because it no longer feels like creating?


I'll repeat something I said to a sibling comment. I guess my original wasn't particularly clear.

> I think there is joy in the act of turning your ideas into something real - creation - not just having something real. Shopping is not building.


No, you were clear. I suppose I was interested to see where you drew the distinction between creating and shopping.

For example, let's say LLMs improve to the point where they can now reliably one-shot entire apps with no more input than the original prompt. Would you no longer consider that creating? What's the difference between that and typing your prompt into an infinite app store?


> Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.

Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release and it doesn't show up at all in their system card. It's only through an article on The Verge that we get more context. Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)

I have very low expectations around what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions as to the quality of the output


Interestingly the internet is full of "slack clone" dev tutorials. I used to work for a company that provides chat backend/frontend components as a service. It was one of their go-to examples, and the same is true for their competitors.

While it's impressive that you can now just have an llm build this, I wouldn't be surprised if the result of these 30 hours is essentially just a re-hash of one of those example Slack clones. Especially since all of these models have internet access nowadays; I honestly think 30 hours isn't even that fast for something like this, where you can realistically follow a tutorial and have it done.

In fact, I just did a quick google search and found this 15 hour course about building a slack clone: https://www.codewithantonio.com/projects/slack-clone


This is obviously much more than just taking an LLM and letting it run for 30 hours. You have to build a whole environment together with external tool integration and context management and then tune the prompts and perhaps even set up a multi-agent system. I believe that if someone puts a ton of work into this you can have an LLM run for that long and still produce sellable outputs, but let's not pretend like this is something that average devs can do by buying some API tokens and kicking off a frontier model.


Well, yes, that's Claude Code. And OpenAI Codex. And Google Gemini CLI.

Your average dev can just use those.


Yes, but you need to set up quite a bit of tooling to provide feedback loops.

It's one thing to get an LLM to do something unattended for long durations; it's another to give it the means of verification.

For example, I'm busy upgrading a 500k LoC Rails 1 codebase to Rails 8 and built several DSLs that give it proper authorised sessions in a headless browser with basic HTML parsing tooling so it can "see" what effect its fixes have. Then you somehow need to also give it a reliable way to keep track of the past and its own learnings, which sounds simple, but I have yet to see any tool or model solve it on this scale... I'll give Sonnet 4.5 a try this weekend, but yeah, none of the models I tried are able to produce meaningful results over long periods on this upgrade task without good tooling and strong feedback loops.

Btw, I have upgraded the app and am taking it to alpha testing now, so it is possible.
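
For anyone wondering what a "feedback loop" means in practice, here is a minimal sketch, not the tooling described above. It assumes the harness (not the model) runs a verification command after each change, appends the outcome to a notes file, and hands a one-line summary back to the agent. The function name `run_check` and the notes file are hypothetical.

```python
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path


def run_check(command, log_path):
    """Run a verification command and append the outcome to a notes file.

    Returns a one-line summary that a harness could feed back to the model.
    `command` is a list of arguments, e.g. a test runner invocation.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    status = "PASS" if result.returncode == 0 else "FAIL"

    # Keep only the last few output lines so the feedback stays small.
    tail = (result.stdout + result.stderr).strip().splitlines()[-5:]

    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = f"[{stamp}] {status}: {' '.join(command)}\n"
    entry += "".join(f"    {line}\n" for line in tail)

    # Append, never overwrite: the file doubles as the agent's memory
    # of what has already been tried.
    with Path(log_path).open("a", encoding="utf-8") as log:
        log.write(entry)

    return f"{status}: " + (tail[-1] if tail else "(no output)")
```

Something like `run_check([sys.executable, "-m", "pytest", "-q"], "agent_notes.md")` would be one way to call it. The point of the design is that logging happens outside the model, so it cannot be skipped or forgotten.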


I've tried asking it to log every request and response to a project_log.md but it routinely ignores that.

I've also tried using Playwright for testing in a headless browser and taking screenshots for a blog that can effectively act as a log, but it just seems like too tall an order for it.

It sounds like you're streets ahead of where I am. Could you give me some pointers on getting started with a feedback loop, please?


> rails 1 codebase to rails 8

A bit off topic, but Rails *1* ? I hope this was an internal app and not on the public internet somewhere …


haha no it's an old (15 years old) abandoned enterprise app running on-prem that hasn't seen updates in more than a decade.


Wow Rails 3 came out 15 years ago, so that thing started life out of date.


> enterprise app

> started life out of date

That tracks my experiences.


But then that goes back to the original question, considering my own experiences observing the amount of damage CC or Codex can do in a working code base with a couple tiny initial mistakes or confusion about intent while being left unattended for ten minutes, let alone 30 hours....


If you had used any of those, you'd know they clearly don't work well enough for such long tasks. We're not yet at the point where we have general purpose fire-and-forget frameworks. But there have been a few research examples from constrained environments with a complex custom setup.


Claude Code with a good prompt can run for hours.


That sounds to me like a full room of guys trying to figure out the most outrageous thing they can say about the update, without being accused of lying. Half of them on ketamine, the other on 5-MeO-DMT. Bat country. 2 months of 007 work.

Imagine reviewing 30 hours of 2025-LLM code.


What they don't mention is all the tooling, MCPs and other stuff they've added to make this work. It's not 30 hours out of the box. It's probably heavily guard-railed, with a lot of validated plans, checklists and verification points they can check. It's similar to 'lab conditions', you won't get that output in real-world situations.


Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE results are barely an improvement yet somehow the model is a more significant improvement when it comes to long tasks. You'd expect a huge gain in one to translate to the other.

Unless the main area of improvement was tools and scaffolding rather than the model itself.


“30 hours of unattended work” is totally vague and it doesn’t mean anything on its own. It - at the very least - highly depends on the amount of tokens you were able to process.

Just to illustrate, say you are running on a slow machine that outputs 1 token per hour. At that speed you would produce approximately one sentence.


"Slack clone" is also super vague:

(First of all: Why would anyone in their right mind want a Slack clone? Slack is a cancer. The only people who want it are non-technical people, who inflict it upon their employees.)

Is it just a chat with a group or 1on1 chat? Or does it have threads, emojis, voice chat calls, pinning of messages, all the CSS styling (which probably already is 11k lines or more for the real Slack), web hooks/apps?

Also, of course it is just a BS announcement, without honesty, if they don't publish a reproducible setup, that leads to the same outcome they had. It's the equivalent of "But it worked on my machine!" or "scientific" papers that prove anti gravity with superconductors and perpetuum mobile infinite energy, that only worked in a small shed where some supposed physics professor lives.


Has their comment been edited? A few words later it says it resulted in 11,000 LoC.

> [..] left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code [..]


Their point still stands though? They said the 1 tok/hr example was illustrative only. 11,000 LoC could be generated line-by-line in one shot, taking not much more than 11,000 * avg_tokens_per_line tokens. Or the model could be embedded in an agent and spend a million tokens contemplating every line.
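
A rough back-of-the-envelope version of that arithmetic, as a sketch only: the value of 10 tokens per line is an assumption picked purely for illustration, and the helper names are made up.

```python
def min_tokens(lines_of_code, avg_tokens_per_line):
    """Lower bound on output tokens if every line is emitted exactly once."""
    return lines_of_code * avg_tokens_per_line


def tokens_per_second(total_tokens, hours):
    """Average throughput needed to emit total_tokens within the given hours."""
    return total_tokens / (hours * 3600)


LOC = 11_000              # figure quoted in the thread
AVG_TOKENS_PER_LINE = 10  # assumption, not a measured value
HOURS = 30                # figure quoted in the thread

one_shot = min_tokens(LOC, AVG_TOKENS_PER_LINE)
rate = tokens_per_second(one_shot, HOURS)

print(one_shot)          # 110000
print(round(rate, 2))    # 1.02

# The 1 token/hour machine from the parent comment, for comparison:
print(1 * HOURS)         # 30
```

Under that assumption, a one-shot generation needs only about one token per second averaged over the 30 hours, so the wall-clock time says very little about how many tokens were actually spent.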


> Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code

it's going to be an issue I think, now that lots of these agents support computer use, we are at the point where you can install an app, tell the agent you want something that works exactly the same and just let it run until it produces it.

The software world may find it's got more in common with book authors than they thought sooner rather than later once full clones of popular apps are popping out of coding tools. It will be interesting to see if this results in a war of attrition with counter measures and strict ToU that prohibit use by AI agents etc.


That just means that owning the walled gardens and network effects will become yet more important.


It has been trivial to build a clone of most popular services for years, even before LLMs. One of my first projects was Miguel Grinberg's Flask tutorial, in which a total noob can build a Twitter clone in an afternoon.

What keeps people in are network effects and some dark patterns like vendor lock-in and data unportability.


There's a marked difference between running a Twitter-like application that scales to even a few hundred thousand users, and one that is a global scale application.

You may find quickly that, network effects aside, you would find yourself crushed under the weight and unexpected bottlenecks of that network you desire.


Agreed entirely but not sure that's relevant in what I'm replying to.

> we are at the point where you can install an app, tell the agent you want something that works exactly the same and just let it run until it produces it

That won't produce a global-scale application infrastructure either, it'll just reproduce the functionality available to the user.


Curious about this too – does it use the standard context management tools that ship with Claude Code? At 200K context size (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.


Sub-agents. I've had Claude Code run a prompt for hours on end.


What kind of agents do you have set up?


You can use the built-in task agent. When you have a plan and are ready for Claude to implement, just say something along the lines of “begin implementation, split each step into their own subagent, run them sequentially”


Subagents are where Claude Code shines and Codex still lags behind. Claude Code can do some things in parallel within a single session with subagents and Codex cannot.


By parallel, do you mean editing the codebase in parallel? Does it use some kind of mechanism to prevent collisions (e.g. work trees)?


Yeah, in parallel. They don't call it yolo mode for nothing! I have Claude configured to commit units of work to git, and after reviewing the commits by hand, they're cleanly separated by file. The todos don't conflict in the first place though; e.g. changes to the admin API code won't conflict with changes to submission frontend code, so that's the limited human mechanism I'm using for that.

I'll admit it's a bit insane to have it make changes in the same directory simultaneously. I'm sure I could ask it to use git worktrees and have it use separate directories, but I haven't (needed to) try that (yet), so I won't comment on how well it would actually do with that.
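
For reference, an untested sketch of what the worktree setup could look like. It only builds the `git worktree add -b <branch> <path>` commands, one per task, and executes nothing. The `agent/` branch prefix, the `wt-` directory prefix, and the task names are all made up for illustration.

```python
from pathlib import Path


def worktree_commands(parent_dir, task_names):
    """Build one `git worktree add` command per task (nothing is executed).

    Each task gets its own directory and its own branch, so parallel
    agents never write into the same working tree.
    """
    commands = []
    for name in task_names:
        path = Path(parent_dir) / f"wt-{name}"  # hypothetical naming scheme
        branch = f"agent/{name}"                # hypothetical branch prefix
        commands.append(["git", "worktree", "add", "-b", branch, str(path)])
    return commands


for cmd in worktree_commands("..", ["admin-api", "submission-frontend"]):
    print(" ".join(cmd))
```

Each printed command could be run from inside the main checkout; merging the resulting branches back is still a manual step.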


I personally do not do any writes in parallel but parallel works great for read operations like investigating multiple failing tests.


Have they released the code for this? Does it work? Or are there x number of caveats and excuses? I'm kind of sick of them (and others) getting a free pass at saying stuff like this.


They don't seem to link any source code or demo. They could have run Claude for 10 hours to write thousands of The Verge articles as well.

