Hacker News | new | past | comments | ask | show | jobs | submit | login

Finally, the great IP washing machine hums to life, ready to dissolve the whole structure. Bring forth your disassembly: generate a draft, then re-generate clean source code. Cooperate-communism! It is done!


I don't think this proves you can just launder away copyright - nor do I think we even want that at this point.

First off: the claims dismissed have to do with 17 USC 1202, the part of the DMCA that deals with copyright management information. It's a bit of a plaintiff meme[0] to add a CMI claim onto a copyright infringement lawsuit. Obviously, if you infringe copyright, you're also not going to preserve the CMI. And if an AI were to regurgitate output, it doesn't even know that it did so, so it can't preserve CMI even if it wanted to.

Problem is, the AI doesn't regurgitate consistently enough to support a legal claim of CMI removal. The model does generate legally distinct outputs sometimes. You need to point to specific generations and connect the dots from the model to the output in a way that legally implicates GitHub, OpenAI, and/or Microsoft in ways that would not be disclaimed by, say, 17 USC 512 safe harbor. This is distinct from the training-time infringement claims, which are still live, wouldn't rely on secondary liability, can't be disclaimed by honoring DMCA takedowns, and which I think are the stronger claims.

Let's step out of the realm of legality. Why do we want to get rid of copyright? For me, it's because copyright centralizes control over creativity. It tells other artists what they can do and forces them into larger and larger hierarchies. The problem is that AI models do the same thing. Using an AI model doesn't make you an artist[1], but it does move artistic control further towards the large creative industry. This is why a lot of publisher and big-media CEOs are strangely bullish on AI, why a bunch of artists who ordinarily post shit for free are angry about it, and why the FOSS people who hate software copyright were the first to sue.

In other words, AI is breaking copyright in order to replace it with more of the thing we hate about copyright.

[0] Or at least Richard Liebowitz liked to do it before he got disbarred.

[1] In the same way that commissioning an art piece does not itself make you an artist


Thank you for this; you really hit the nail on the head as to why this is so gross. If building, training, and operating an LLM were something well within the resources of anyone to do, I'd probably have much less of a problem with my open source code being copyright-laundered through products like Copilot.

But that's just not where we are right now, and the result of that feels awful.


Open source models exist and can be run locally. It doesn't matter if training from scratch is impractical for the average person, if we already have free (as in freedom) models we can build off of.


I have several "open" models sitting around for experimentation and occasional use, but I don't think downloadable weights solves the underlying issue.

First off, none of the good models are FOSS in the sense we normally expect - i.e. the four freedoms. At the least onerous end of the scale, Stable Diffusion models under the OpenRAIL license have a morality clause[0] and technical protections[1] to enforce that clause. LLaMA's licensing is only open for entities below a certain MAU, and Stable Diffusion 3 recently switched away from OpenRAIL to a LLaMA-like "free as in beer" license. Not only is this non-free, it's getting increasingly proprietary as the entities who are paying for the AI training start demanding a return on investment, and the easiest way to get that is to just demand licensing fees.

The reason why AI companies - not the artists or programmers they stole from - are in a position to demand these licensing terms at all is because they're the ones controlling the capital necessary to train models. If training from scratch at the frontier was still viable for FOSS tinkerers, we wouldn't have to worry about OpenAI reading all our GPT queries or Stability finding ways to put asterisks on their openness branding. FOSS software development is something you can do as a hobby, so if a project screws something up, you can build a new one. That's not how AI works. If Stability screws up, you still have to obey Stability's rules unless you train a new foundation model, and that's very expensive.

You see how this is very "copyright-like" even though it's legally contravening the letter and spirit of copyright law? Barriers to entry drive industrial consolidation and ratchet us towards a capitalist, privately-owned command economy. If I could train models from scratch, I'd make a decent model trained purely on public domain datasets[2], slather Grokfast[3] on it, and build in some UI to selectively fine-tune on licensed or custom data.

[0] To be clear, I don't have much against morality clauses personally; that's why I've used OpenRAIL models. But I still think adding morality clauses to otherwise FOSS licensing is a bad idea. At the very least, in order to put moral values into a legal contract, we have to agree as a community on what moral values should be enforced by copyright. Furthermore, copyright and contracts are a bad tool for enforcing morals.

[1] e.g. the Stable Diffusion safety filter

[2] Believe me, I tried

[3] An algorithm that speeds up grokking (delayed generalization) by treating the sequence of gradients as a signal, taking its FFT, and amplifying the slow (low-frequency) components.
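For what it's worth, the core of [3] is simple enough to sketch in a few lines. This is my reading of the idea as a scalar toy, not the paper's reference code: a running EMA acts as a cheap low-pass filter on the gradient, and the amplified slow component is added back before the optimizer step. The parameter names and defaults (`alpha`, `lamb`) are illustrative, not the paper's.

```python
# Toy sketch of a Grokfast-style gradient filter (my reading of the
# idea, not the reference implementation). The real method works
# per-parameter on tensors; this is a single scalar gradient.
def grokfast_step(grad, ema, alpha=0.98, lamb=2.0):
    """One filtering step.

    alpha: EMA decay - how "slow" the low-pass filter is (illustrative).
    lamb:  amplification factor for the slow component (illustrative).
    Returns (filtered_grad, updated_ema).
    """
    ema = alpha * ema + (1 - alpha) * grad  # low-pass filter the gradient
    return grad + lamb * ema, ema           # amplify the slow component
```

You'd feed `filtered_grad` to the optimizer in place of the raw gradient and carry `ema` between steps.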


That was the point of a clean-room implementation of a spec, which is how the Phoenix BIOS was done for PC clones clear back in the 1980s.

So "finally" may not be an accurate word...


This makes me think we need models to deliberately try to make code that's equivalent to copyrighted code, but sufficiently changed that it's not infringing.

The end state would be to make the rewriting powerful enough that trying to claim infringement would also hit manually created code.

Alternately, generate code that is optimized for some task by some metric, and show that because the code is best by this criterion, it doesn't show creativity.

Another possibility here is for the LLM vendor to log the code generation tasks typically asked for and then salt the model with vetted, correct, non-infringing code for those questions.
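The "salting" step above could be as mundane as mining the request logs for the most common tasks and pairing them with vetted answers to form a fine-tuning set. A rough sketch of that dataset-building step, with all names made up for illustration:

```python
# Hypothetical sketch: pick the most frequently logged generation
# tasks that have a vetted, non-infringing reference solution, and
# pair them up as a fine-tuning ("salting") dataset. All names and
# the data format are illustrative assumptions.
from collections import Counter

def build_salt_dataset(logged_prompts, vetted_solutions, top_n=1000):
    """Return prompt/completion pairs for the top_n most common
    prompts that have a vetted solution."""
    counts = Counter(p.strip().lower() for p in logged_prompts)
    dataset = []
    for prompt, _count in counts.most_common(top_n):
        if prompt in vetted_solutions:  # only vetted answers make the cut
            dataset.append({"prompt": prompt,
                            "completion": vetted_solutions[prompt]})
    return dataset
```

The resulting pairs would then be fine-tuned into the model so the common queries reliably surface the vetted code.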


While I think that would probably get around the copyright infringement issues, it still bothers me.

I don't like the idea that a corporation can hoover up countless open source projects and contributions (including mine), and then use that to make money selling code generation assistance to other people, even if the output of that code generation would be different enough from any specific copyrighted block of code that it couldn't result in a copyright infringement claim.

It's not even clear that we can stop this from happening; it's certainly possible that a "GPLv4" that had a provision against using covered code for LLM training would be legally (if not just practically) unenforceable.

To me, the ickiest part is centralization. While models and training tools will probably (maybe?) get cheaper over time, building and operating something like Copilot requires a lot of money and resources. Do we really want these capabilities to be locked up inside big, well-capitalized corporations? For me, the answer to that is a resounding hell no.



