I think they don't understand what milquetoast actually means, as the post definitely isn't - Django quite clearly asserted themselves and their rules.
What the parent comment was probably trying to say was something like "a completely reasonable, uncontroversial post that I'm glad to see them make", but chose milquetoast (a word that no normal human ever uses - and certainly not in casual conversation) due to an affectation of one kind or another.
On the contrary, they could have stated their points much more bluntly and strongly than they did in the post. I had the same impression upon reading it.
Milquetoast perfectly describes it. I am happy to see less common words used around here (especially when they convey the intended meaning this precisely), and I find the accusation of "affectation" against the person who used it unnecessarily rude.
A use of LLMs is when you are in your second reply and you don’t have the will to make your own argument.
The post is timid and conciliatory, spending words on some weird bargaining on all the wonderful things you can do with LLMs in preparation for a contribution. Who cares? I’m not in the Django project, but I’d think (living in These Times and all) that the thrust ought to be more about how no-effort faux contributions are wasting people’s time. At some point you can say: you’ve been warned, others have warned about this for years as well, and we don’t take kindly to you pinging us in any form.
But if someone disagrees with this milquetoast proposal or stance? If they want to defy even this and go ahead and “spend tokens” by trying to shovel unlabeled, generated code into the project? Then that’s the kind of person that I don’t want to work with. I hope that clarifies milquetoast hermeneutics.
What I wrote last was the expanded thought process for my original comment, the same sentiment. For those who don’t understand what milquetoast means.
You can try your luck with some text slot machines. Maybe they will be able to analyze and find some discrepancies. Not that I would read those analyses.
The problem is finding the needle in the haystack. When you can cheaply develop AI slop by the millions, good luck finding that one game where a human put blood, sweat and tears into realizing their vision/dream. Even if you somehow have access to at-scale distribution, economics will ultimately always trump everything else, and more slop will be pushed because it makes economic sense.
It will take at least a full decade for people to realize the slop isn't helping and has made us all collectively mediocre, and to start seeking out people with real specializations. By then I sure hope those who are specializing haven't lost the motivation to do great things and moved on to other fields.
This probably only works properly in developed countries. In developing countries like India we suffered through decades of "booth captures" [1] where armed gangs would take over a polling booth and cast votes for their political candidate at gunpoint. Villagers would be disallowed from casting their votes. In many instances, the polling booth itself would be set on fire, ensuring that those votes were never counted.
With EVMs the polling officer can just deactivate the machine (which stops the counting at that moment) making booth capturing pointless.
Not saying this is not possible in developed countries. It could very well happen sometime in the future where armed gangs take over polling booths (especially if the candidate in question is bound to lose due to corruption/scandal and needs to cling onto political power to prevent himself/herself from going to prison).
> This probably only works properly in developed countries. In developing countries like India we suffered through decades of "booth captures" [1] where armed gangs would take over a polling booth and cast votes for their political candidate at gunpoint. Villagers would be disallowed from casting their votes. In many instances, the polling booth itself would be set on fire, ensuring that those votes were never counted.
Yeah, but these are visible! They provide evidence that the voting was not fair.
Compare to electronic voting, where a capture might be done and no one ever finds out.
We want rigging of elections to be visible. That's the whole point.
I mean, it looks like a booth capture can take over one booth at most; to capture more, you practically need an armed rebellion. But if we automate voting, then you only need to capture one location to capture all the booths in the region.
I don't think any system can do much if things have degraded to the point where armed gangs are running around with impunity. I think systems (paper or otherwise) presuppose a certain level of functional civil society.
> Not saying this is not possible in developed countries. It could very well happen sometime in the future where armed gangs take over polling booths…
I fully expect this to happen more as the systems degrade in the West, and arguably it already has happened several times now in many different ways, even if executed in more “sophisticated” ways that make it less apparent.
What do you call the many “color revolutions” the US and EU have now perpetrated in many different ways and places? The ”gang” was just a state level actor with immense resources and methods that exceed the local capacity to prevent them… just like a local gang using arms to take over a local polling booth.
There are declassified versions of old and obsolete CIA guides on how to conduct the precursors of such “color revolutions” through long term “capacity building” that is then activated if/when necessary. That’s the voluntarily declassified manual of the CIA; someone might suggest there are more effective instructions that are classified.
There have also been medium-sophistication events like those of the last several years in Europe: Merkel ordered an election result cancelled through technicalities because she/the literal The Party did not like the result (I guess you can take the woman out of the dictatorship…); the EU simply used the judiciary to force a “runoff” because the election results were not to its liking, de facto canceling elections; and there are even subtler measures like visually misrepresenting election results, where the bar or pie chart does not match the numerical data, to suppress public mandate and perceptions about results, i.e., higher result numbers being represented by smaller bars than lower numbers.
I would argue these are all examples of the very same things you describe, the equivalent of “…gangs take over polling booths…”, only it’s done through process, authority, policy, or even law, and those in power tell themselves they’re doing it for “our democracy”, justified through similar dystopian, narcissistic, megalomaniacal, authoritarian mindsets: “I need to be in power for your own good because you don’t know any better”.
It could go both ways. Either things will increasingly degrade even more as power slips out of the “gang’s” hands and the system starts crumbling around them; or, if “digital voting” is fully implemented, there will essentially be “backdoors” to make sure the powers can “preserve our democracy”, just like they need OS backdoors and media control to “protect the children”. Coincidentally, that always seems to coincide with them remaining in power and control, with the people not even being asked about major upheavals of their society, and with their votes being effectively meaningless because the agenda is continuous regardless of election results.
It’s like those people who used to play slot machines at the casino (now doing so digitally on their phones), pounding at buttons that do absolutely nothing, since the algorithm is what determines where the spin ends, not them rapidly hitting an essentially dead button. The “clicking”, the “voting”, just makes them think they have control… “our democracy”, where you and I are not part of that “our”.
Thank you! Please also make a separate Show HN for AI-generated/vibe-coded projects (specifically open-source projects) and queue any project that has a .claude/.codex (or whatever flavor of the month) into a slow queue automatically.
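A minimal sketch of the automatic filter proposed above, assuming submissions arrive as a list of repository file paths; the set of tool directory names is an illustrative assumption, not a standard:

```python
# Hypothetical filter: flag a submitted repo if it contains config
# directories commonly left behind by AI coding tools. The directory
# names below are illustrative examples, not an authoritative list.

AI_TOOL_DIRS = {".claude", ".codex", ".cursor", ".aider"}

def looks_ai_assisted(repo_paths):
    """True if any path in the repo sits under a known AI-tool directory."""
    return any(
        part in AI_TOOL_DIRS
        for path in repo_paths
        for part in path.split("/")
    )

print(looks_ai_assisted(["src/main.py", ".claude/settings.json"]))  # True
print(looks_ai_assisted(["src/main.py", "README.md"]))              # False
```

Submissions that trip the check would be routed to the slow queue rather than rejected outright.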
> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.
How would that work? We still have no legal conclusion on whether AI-model-generated code, trained on all publicly available source code (irrespective of license type), is legal or not. IANAL, but IMHO it is totally illegal, as no permission was sought from the authors of the source code the models were trained on. So there is no way to just release code created by a machine into the public domain without knowing how the model was inspired to come up with the generated code in the first place. Pretty sure it would be considered in the scope of "reverse engineering", and that is not specific only to humans; you can extend it to machines as well.
EDIT: I would go so far as to say the most restrictive license that the model is trained on should be applied to all model-generated code. And a licensing model should be set up with the original authors (all GitHub users who contributed code in some form) so that they are reimbursed by AI companies. In other words, a % of profits must flow back to the community as a whole every time code-related tokens are generated. Even if everyone receives pennies, it doesn't matter. That is fair. This should also extend to artists whose art was used for training.
> I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code.
That license is called "All Rights Reserved", in which case you wouldn't be able to legally use the output for anything.
There are research models out there which are trained on only permissively licensed data (i.e. no "All Rights Reserved" data), but they're, colloquially speaking, dumb as bricks when compared to state-of-art.
But I guess the funniest consequence of "model outputs are a derivative work of their training data" would be that it'd essentially wipe out (or at the very least force a revert to a pre-AI-era commit in) every open source project which may have included any AI-generated or AI-assisted code, which currently pretty much includes every major open source project out there. And it would also make it impossible to legally train any new models whose training data isn't strictly pre-AI, since otherwise you wouldn't know whether your training data is contaminated or not.
> There are research models out there which are trained on only permissively licensed data
Models whose authors tried to train only on permissively licensed data.
For example, https://huggingface.co/bigcode/starcoder2-15b tried to use a permissively licensed dataset, but it filtered only on the repository-level license, not file-level. So when searching for "under the terms of the GNU General Public License" on https://huggingface.co/spaces/bigcode/search-v2 back when it was working, you would find it was trained on many files with a GPL header.
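The repository-level vs. file-level distinction can be made concrete with a small sketch; the file names, contents, and marker string are illustrative (the marker is the same phrase the search above used):

```python
# Sketch: why repository-level license filtering is not enough. A repo
# under a permissive license can still vendor individual files that
# carry a GPL header; only a file-level scan catches those.

GPL_MARKER = "under the terms of the GNU General Public License"

repo_files = {  # illustrative repo: MIT at the repository level
    "src/util.py": "# MIT-licensed helper\ndef add(a, b):\n    return a + b\n",
    "src/vendored.py": (
        "# This file is free software; you can redistribute it and/or\n"
        "# modify it under the terms of the GNU General Public License.\n"
        "def gpl_code():\n    pass\n"
    ),
}

def flag_gpl_files(files):
    """Return paths whose contents contain a GPL license header."""
    return sorted(path for path, text in files.items() if GPL_MARKER in text)

print(flag_gpl_files(repo_files))  # only the vendored file is flagged
```

A repository-level filter would have looked at the repo's top-level license, accepted everything, and missed the vendored file entirely.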
I agree with your assessment. Which is why I was proposing a middle ground where an agreement is set up between the model-training company and the collective of developers/artists et al. to come up with a license agreement where they are rewarded for their original work in perpetuity. A tiny % of the profits can be shared, which would be a form of UBI. This is fair not only because companies are using AI-generated output, but developers themselves are also paying for and using AI-generated output that is trained on other developers' input. I would feel good (in my conscience) that I am not "stealing" someone else's effort and that they are being paid for it.
Why settle for some private agreement between creators and AI companies where a tiny percentage is shared? Let's just tax the hell out of AI companies and redistribute.
Because the authors of the original content deserve recompense for their work.
That's what the whole copyright and patent regimes are designed to achieve.
It's to encourage the creation of knowledge.
US Constitution, Article I, section 8:
To promote the Progress of Science and useful Arts, by
securing for limited Times to Authors and Inventors the
exclusive Right to their respective Writings
and Discoveries;
Right, it says exclusive rights, which does not translate to "we siphon everything and you get a tiny percentage of our profits", it means I can choose to say no to all of this. To me the matter of compensation and that of authorship rights are mostly orthogonal.
Agreed, but the right to compensation is derived from the right of licensing something you author.
The courts have ruled that something machine generated does not have a human author, so therefore it is not subject to copyright, in the US.
So if enough authors agreed and sued the AI companies to remove their copyrighted elements from the AI training, then that would be a reasonable solution as well.
However, any lawsuit is highly likely to result in some sort of compensation paid if decided in favor of the authors.
> let's just tax the hell out of AI companies and redistribute.
That's not what I favor because you are inserting a middleman, the Government, into the mix. The Government ALWAYS wants to maximize tax collections AND fully utilize its budget. There is no concept of "savings" in any Government anywhere in the World. And Government spending is ALWAYS wasteful. Tenders floated by Government will ALWAYS go to companies that have senators/ministers/prime ministers/presidents/kings etc as shareholders. In other words, the tax money collected will be redistributed again amongst the top 500 companies. There is no trickle down. Which is why agreements need to be between creators and those who are enjoying fruits of the creation. What have Governments ever created except for laws that stifle innovation/progress every single time?
Just because you have a failure of imagination for how government should work, doesn’t mean it can’t work. And stifling innovation is exactly what I want, when that innovation is “steal from everyone so we can invent the torment nexus” or whatever’s going on these days.
A pension fund is an example of what, exactly? All countries have pension funds. This has nothing to do with Governments wasting money. Please go beyond tiny European countries that have very few verticals and are largely dependent on outside support for protecting their sovereignty. They are not representative of most of the World.
> As its name suggests, the Government Pension Fund Global is invested in international financial markets, so the risk is independent from the Norwegian economy. The fund is invested in 8,763 companies in 71 countries (as of 2024).
Basically what I said above. You give your tax dollars to Government and it will invest it into top 500 companies. In the Norway Pension Fund case it is 8,763 companies in 71 countries. None of them are startups/small businesses/creators.
> And stifling innovation is exactly what I want, when that innovation is “steal from everyone so we can invent the torment nexus” or whatever’s going on these days.
You are confusing current lack of laws regulating this space with innovation being evil. Innovation is not evil. The technology per se is not evil. Every innovation brings with it a set of challenges which requires us to think of new legislation. This has ALWAYS been the case for thousands of years of human innovation.
In all seriousness, without the government you would have no innovation and progress, because it's the public school system, functioning roads, research grants, and a stable and lawful society that allow you to do any kind of innovation.
Apart from that, you have answered a strawman. I said redistribute, not give to the government. I explicitly worded things that way because I don't think we should be having a discussion on policy.
I think we are moving to an economy where the share of profits taken by capital becomes much larger than the share taken by labor. If that happens, laborers will have very little discretionary income to fuel consumption, and even capitalists will end up suffering. We can choose to redistribute now or wait for it to happen naturally; however, that usually happens in a much more violent way, be it hyperinflation, famine, war or revolution.
> Apart from that, you have answered to a strawman. I said redistribute, not give to the government
You said: "let's just tax the hell out of AI companies and redistribute." Only the Government has the power to tax. The question of redistribution does not even arise without first having access to the coffers of the Company, which neither you nor I have. The Government CAN have it if it wants, by either Nationalizing the Company or, as you said, "taxing the hell out of" the company. Please explain how you would go about taxing and redistributing without involving the Government?
> In all seriousness, without the government you would have no innovation and progress, because it's the public school system, functioning roads, research grants, and a stable and lawful society that allow you to do any kind of innovation.
These fall under the ambit of governance and hence why you have a Government. That's the only power Governments should have. Governments SHOULD NOT be managing private enterprises.
> I think we are moving to an economy where the share of profits taken by capital becomes much larger than the share taken by labor. If that happens, laborers will have very little discretionary income to fuel consumption, and even capitalists will end up suffering. We can choose to redistribute now or wait for it to happen naturally; however, that usually happens in a much more violent way, be it hyperinflation, famine, war or revolution.
Agreed. Which is why I was proposing private agreements in the first place (without involving a third-party like the Government which, more often than not, mismanages funds).
> Which is why I was proposing a middle ground where an agreement is set up between the model-training company and the collective of developers/artists et al. to come up with a license agreement where they are rewarded for their original work in perpetuity. A tiny % of the profits can be shared, which would be a form of UBI. This is fair
That wouldn't be fair, because these models are not only trained on code. A huge chunk of the training data is just "random" webpages scraped off the Internet. How do you propose those people are compensated in such a scheme? How do you even know who contributed, and how much, and to whom to even direct the money?
I think the only "fair" model would be to essentially require models trained on data that you didn't explicitly license to be released as open weights under a permissive license (possibly with a slight delay to allow you to recoup costs). That is: if you want to gobble up the whole Internet to train your model without asking for permission then you're free to do so, but you need to release the resulting model so that the whole humanity can benefit from it, instead of monopolizing it behind an API paywall like e.g. OpenAI or Anthropic does.
Those big LLM companies harvest everyone's data en masse without permission, train their models on it, and then not only do they not release jack squat, but they have the gall to put up malicious, explicit roadblocks (hiding CoT traces, banning competitors, etc.) so that no one else can do it to them, and when people try they call it an "attack"[1]. This is what people should be angry about.
I don't know how far it would get, but I imagine that a FAANG will be able to get the farthest here by virtue of having mountains of corporate data that they have complete ownership over.
They’d probably get the farthest, but they won’t pursue that because they don’t want to end up leaking the original data from training.
It is possible in regular language/text subsets of models to reconstruct massive consecutive parts of the training data [1], so it ought to be possible for their internal code, too.
Copyright for me not for thee? :) That's a good point though. Maybe they could round trip things? E.g., use the model trained only on internal content to generate training data (which you could probably do some kind of screening to remove anything you don't want leaking) and then train a new model off just that?
AI can't claim ownership, and humans can't either, as they haven't produced it. If it is guaranteed that no one can claim ownership, it is often seen as being in the public domain (1).
In general it is irrelevant what the copyright of the AI training data is. At least in the US, judges have been relatively clear about that. (Except if the AI reproduces input data close to verbatim. _But in general we aren't speaking about AI being trained on a code base but an AI using/rewriting it_.)
(1): Which isn't the same as saying no one knows who has ownership. It also might be owned by no one in the sense that no one can grant you copyright permission (so the opposite of public domain), but also no one can sue (so de facto public domain).
Humans can't claim ownership, but they are still liable for the product of their bot. That's why MS was so quick to indemnify their users, they know full well that it is going to be super hard to prove that there is a key link to some original work.
The main analogy is this one: you take a massive pile of copyrighted works, cut them up into small sections and toss the whole thing in a centrifuge; then, when prompted to produce a work, you use a statistical method to pull pieces of those copyrighted works out of the centrifuge. Sometimes you may find that you are pulling pieces out in the order in which they went in, which after a certain number of tokens becomes a copyright violation.
This suggests there are some obvious ways in which AI companies could protect themselves from claims of infringement, but as far as I'm aware not a single one has protections in place to ensure that they do not materially reproduce any fraction of the input texts, other than recognizing prompts that explicitly ask for it.
So it won't produce the lyrics of 'Let it be'. But they'll be happy to write you mountains of prose that strongly resembles some of the inputs.
The fact that they are not doing that tells you all you really need to know: they know that everything that their bots spit out is technically derived from copyrighted works. They also have armies of lawyers and technical arguments to claim the opposite.
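For what it's worth, the kind of protection described above as missing is not hard to sketch. Assuming access to the training corpus, one naive approach is to hash every word n-gram in it and flag outputs that contain long runs of consecutive matches; the corpus, output, and threshold below are all made-up illustrations:

```python
# Naive verbatim-reproduction check: collect all word n-grams from the
# training corpus, then measure the longest run of consecutive output
# n-grams that also appear in the corpus. Long runs indicate the output
# is replaying training text in order.

N = 5  # n-gram length; a real system would tune this

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

corpus = "let it be let it be whisper words of wisdom let it be".split()
corpus_grams = set(ngrams(corpus, N))

def longest_verbatim_run(output_words):
    """Longest run of consecutive output n-grams found in the corpus."""
    best = run = 0
    for gram in ngrams(output_words, N):
        run = run + 1 if gram in corpus_grams else 0
        best = max(best, run)
    return best

output = "she said whisper words of wisdom let it be tonight".split()
print(longest_verbatim_run(output))
```

A provider could refuse or rewrite any completion whose longest run exceeds some threshold, rather than only pattern-matching on the prompt.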
> which is about AI using code as input to produce similar code as output
> not about AI being trained on code
The two are very directly connected.
The LLM would not be able to do what it does without being trained, and it was trained on copyrighted works of others. Giving it a piece of code for a rewrite is a clear case of transformation, no matter what, but now it also rests on a mountain of other copyrighted code.
So now you're doubly in the wrong, you are willfully using AI to violate copyright. AI does not create original works, period.
Every programmer is trained on the copyrighted works of others. There are vanishingly few modern programs with available source code in the public domain.
It isn't clear how/if an LLM is different from the brain, but we all have training by looking at copywrited source code at some time.
> but we all have training by looking at copywrited[sic] source code at some time.
The single word "training" is here being used to describe two very different processes; what an LLM does with text during training is at basically every step fundamentally distinct from what a human does with text.
Word embedding and gradient descent just aren't anything at all like reading text!
Indeed, but that's just a misdirection. We don't actually know how a human brain learns, so it is hard to base any kind of legal definition on that difference. Obviously there are massive differences but what those differences are is something you can debate just about forever.
I have a lot of music in my head that I've listened to for decades. I could probably replicate it note-for-note given the right gear and enough time. But that would not make any of my output copyrightable works. But if I doodle for three minutes on the piano, even if it is going to be terrible that is an original work.
> humans can't either, as they haven't produced it. If it is guaranteed that no one can claim ownership, it is often seen as being in the public domain
Says who? The US ruling the article refers to does not cover this.
It is different in other countries. Even if US law says it is public domain (which is probably not the case) you had better not distribute it internationally. For example, UK law explicitly says a human is the author of machine generated content: https://qht.co/item?id=47260110
I would be totally fine with all code generated by LLMs being considered to be under GPL v3 unless the model authors can prove without any doubt it was not trained on any GPL v3 code - viral licensing to the max. ;-)
"We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not."
I think it will depend on HOW the AI arrived at the new code.
If it was using the original source code, then it probably is guilty-by-association. But in theory an AI model could also generate a rewrite if fed intermediary data not based on that project.
> "We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not."
It depends on the country you are in, but overall in the US judges have mostly consistently ruled it legal, and this is extremely unlikely to change or be effectively interpreted differently.

But where things are more complex is:

- the model containing training data (instead of generic abstractions based on it), determined by whether or not it can be convinced to produce close-to-verbatim output of the training data in question

- the model producing close-to-verbatim training data

The latter seems to mostly (always?) be seen as a copyright violation, with the issue that the person who commits the violation (i.e., the one who uses the produced output) might not know.

The former could mean that not just the output but the model itself counts as a form of database containing copyright-violating content. In that case the model provider has to remove it, which is technically impossible (1)... The pain point with that approach is that it would likely kill public models, while privately kept models will in every case put in a filter, _claim_ to have removed it, and likely get away with it. So while IMHO it should be a violation conceptually, it probably is better if it isn't.

But also, the case the original article refers to is more about models interacting with/using a code base than about them being trained on it.

(1): Impossible for the model weights, that is; it is very much removable from knowledge bases used by LLMs.
You should just look at it as a giant computation graph. If some of the inputs in this graph are tainted by copyright, and an output depends on those inputs (changing them can change the output), then the output is tainted too.
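A rough sketch of that taint-propagation view, with a toy dependency graph; the node names and taint labels are purely illustrative:

```python
# Taint propagation over a computation graph: an output is tainted if
# any input it transitively depends on is tainted by copyright.

deps = {  # node -> nodes it depends on (toy example)
    "weights": ["gpl_code", "public_domain_text"],
    "output": ["weights", "prompt"],
}
tainted_sources = {"gpl_code"}  # copyrighted inputs

def is_tainted(node, seen=None):
    """True if node transitively depends on a tainted source."""
    seen = seen if seen is not None else set()
    if node in tainted_sources:
        return True
    if node in seen:
        return False
    seen.add(node)
    return any(is_tainted(dep, seen) for dep in deps.get(node, []))

print(is_tainted("output"))  # True: output -> weights -> gpl_code
print(is_tainted("prompt"))  # False: no tainted ancestry
```

Under this view almost any model output would be tainted, since the weights depend on the whole training set; that is exactly the commenter's point.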
> We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not.
That horse has bolted. No one knows where all the AI code is any more, and it would no longer be possible to be compliant with a ruling that no one can use AI-generated code.
There may be some mental and legal gymnastics to make it possible, but it will be made legal because it’s too late to do anything else now.
I hate that this may be true, but I also don't think the law will fix this for us.
I think this is down to the community and the culture to draw our red lines and enforce them. If we value open source, we will find a way to prevent its complete collapse through model-assisted copyright laundering. If not, OSS will be slowly enshittified as control of projects slowly flows to the most profit-motivated entities.
But what tools do we have to stop this happening? I agree, we can (and should) all refuse to participate in licence laundering, but there will always be folks less principled.
I don't either, but I guess we're both about to find out. The only surety is that there will be moves and countermoves. As far as I can tell, the best thing we can do right now is fund software-legal organizations like the EFF, which are likely to be the ones to litigate the test cases. What's hurting us most right now is that we don't know what the law means in this context, so we don't fully understand the scale of what we need to protect against or what tools we have that the courts will recognize.
I don't think you can classify "public data in" as public domain. Public data could also be under commercial licenses which forbid using it in any way other than what the license states. Just because the source is open for viewing does not necessarily mean it is open-source licensed.
That's the core issue here. All models are trained on ALL source code that is publicly available, irrespective of how it was licensed. It is illegal, but every company training LLMs is doing it anyway.
Only (?) in America. In the EU, scraping is legal by default unless explicitly opted out with machine-readable instructions like robots.txt. That covers "training input". For training output, the rule is: "if the output is unrecognizable to the input, the license of the input does not matter" (otherwise, any project X could sue project Y for copyright infringement even if the projects only barely resemble each other). The cases where companies actually got sued were where the output was a direct copy or repetition of the input, even if an LLM was involved.
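For what it's worth, the "machine-readable opt-out" part is mundane in practice: Python's standard library can evaluate a robots.txt policy directly. The policy text and bot names below are illustrative assumptions:

```python
# Reading a machine-readable scraping opt-out with the stdlib
# robots.txt parser. This illustrative policy blocks a hypothetical
# AI crawler while leaving the site open to everyone else.
import urllib.robotparser

robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("ExampleAIBot", "https://example.org/page"))  # False
print(rp.can_fetch("SomeOtherBot", "https://example.org/page"))  # True
```

Whether a given crawler honors such a reservation is, of course, the actual legal question.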
There is, however, a larger philosophical divide between the US and the EU based on history and religion. The US philosophy is highly individualistic, capitalistic, and considers "first-order principles." Copyright is a "property right": "I own this string of bits, you used them, therefore you owe me" (principle of absolute ownership).
Continental philosophy is more social and considers "second-order / causal effects." Copyright is a "personality right" that exists within a social ecosystem. The focus is on the effect of the action rather than a singular principle like "intellectual property." If the new code provides a secondary benefit to society and doesn't "hurt" the original creator's unique intellectual stamp, the law is inclined to view it as a new work.
In terms of legal sociology, America and Britain are more "individual-property-atomistic" thanks to their Protestant heritage, focusing on the rights of the individual (sola me, and my property, and God). Meanwhile, Europe was, at least to a large part, Catholic (esp. France), which focuses more on works, results, and effects on society to determine morality. While the states are officially secular, the heritage of this echoes in different definitions of what is considered "legal" or "moral", depending on which side of the ocean you are on.
Copyright is not a blacklist but an allowlist of things kept aside for the holder. Everything else is fair game. LLM ingestion comes under fair use, so no worries. If someone can get their hands on it, nothing in law stops it from being ingested for training.
We can debate whether this law is moral. Like the GP, I too agree that public data in -> public domain out is what's right for society. Copyright as an artificial concept has gone on for long enough.
I don't think so. It is nowhere near "limited use". The entirety of the source code is ingested for training the model; in other words, it meets the bar of the "heart of the work" being used for training. There are other factors as well, such as not harming the owner's ability to profit from the original work.
This hasn't gone to Supreme Court yet. And this is just USA. Courts in rest of the World will also have to take a call. It is not as simple as you make it out to be. Developers are spread across the World with majority living outside USA. Jurisdiction matters in these things.
Copyright's ambit has been pretty much defined and run by the US for over a century.
You're holding out for some grace on this from the wrong venue. The right avenue would be lobbying for new laws to regulate and use LLMs, not try to find shelter in an archaic and increasingly irrelevant bit of legalese.
I don't disagree. However, your assertion that copyright was initially defined by the US does not make the US the jurisdiction (and it is not a fact anyway: it was England that came up with it, and it was adopted by the Commonwealth, which the US was also a part of until its independence). Even if the US Supreme Court rules one way or the other, it doesn't matter, as the rest of the World has its own definitions and legalese that need to be scrutinized and modernized.
Alsup absolutely did not vindicate Anthropic as "fair use".
> Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies. [0]
It was only fair use, where they already had a license to the information at hand.
> I’m starting to think that’s going to be personality and feel and polish, but turned up a notch. That’s what I used to do when I started writing apps, but in some ways I have really toned it down in favor of OS alignment.
That's all that is really required. I mean, look at the Microslop fiasco: they ruined a perfectly good editor, Notepad, with AI slop. But this is not reflected in their sales; they are still showing record revenues.
Just because a competing product exists does not mean your product is suddenly obsolete. There will always be people who will want to buy (provided the market is not oversaturated). Because that is how humans do things. AI won't change that behavior overnight [1]. Look around you and you will see every product you hold in your hand has at least 5-10 competitors.
[1] Think about all the things that are still not computerized and which require you to fill out one form of paperwork or another. We have had computers for nearly 6 decades now. We STILL have physical forms that we fill out from time to time. Computerization was touted to revolutionize this, and yet here we are, still not at 100% digitalization. The same will happen with AI as well. There is an initial burst of excitement (which is the phase we are in) until reality sets in, and that's when people will learn how to best use the technology. What you are seeing today (vibe coding et al.) is NOT IT.