Opinionated article by Alexander Hanff, a computer scientist and privacy technologist who helped develop Europe’s GDPR (General Data Protection Regulation) and ePrivacy rules.
We cannot allow Big Tech to continue to ignore our fundamental human rights. Had such an approach been taken 25 years ago in relation to privacy and data protection, arguably we would not have the situation we have today, where some platforms routinely ignore their legal obligations to the detriment of society.
Legislators did not understand the impact of weak laws or weak enforcement 25 years ago, but we have enough hindsight now to ensure we don’t make the same mistakes moving forward. The time to regulate unlawful AI training is now, and we must learn from mistakes past to ensure that we provide effective deterrents and consequences to such ubiquitous law breaking in the future.
It’s more like “Slapping on the wrist isn’t helping.” The Alex Jones bankruptcy is the first time I’ve seen anyone fined significantly to the point of it mattering. Fines are meant to be significant enough that the company would do its best to avoid them. If the fine is palatable, then it’s just the cost of doing business.
The Alex Jones bankruptcy is the first time I’ve seen anyone fined significantly to the point of it mattering.
The Alex Jones case is a textbook example of what happens when a rich person is so overconfident that he does even less than the absolute bare minimum to defend himself in a court case. He defaulted on the case! That’s the absolute zero of stupidity in legal terms.
I don’t really consider the Alex Jones case to be a win. It was a fluke, and if he had put up even a slight bit of effort, it would have turned out very differently. You know, like 99% of the other cases where the rich are legally attacking the poor.
Not to mention that the rest of the outcome of that is being handled rather poorly
Yes the fines are not high enough. IMHO there should be two payments: a return of all earnings which are related to the violation PLUS a hefty fine and/or jail for the executives
That’s the only way it isn’t cost efficient for the big companies to ignore the laws. Also, make sure the fines are actually paid in full and in a reasonable amount of time
So set the fine in percentage of the company?
I’d argue Alex Jones is completely different in the eyes of the law. His Sandy Hook case and subsequent bankruptcy are very different from the fines levied against tech companies. Which is why there’s a huge difference. In general, crimes that don’t physically hurt people have fewer consequences. And that should change. Fining these companies a significant amount, so that the fines can no longer be treated as a line item on the budget, would be a good start. There definitely needs to be a change. But I’m no expert to truly evaluate what changes would be effective.
That’s stupid. The damage is still done to the owner of that data used illegally. Make them destroy it.
But when you levy such minuscule fines that are less than they stand to make from it, it’s just a cost of business. Fines can work if they are appropriate to the value derived.
Yeah, the only threat to Big Tech is that they might sink a lot of money into training material they’d have to give away later. But releasing the material into the Public Domain is not exactly an improvement for the people whose data and work has been used without consent or payment.
“Congratulations, your rights are still being violated, but now the data is free to use for everyone”.
They would actually still benefit from public-domain’ing LLMs, because they themselves also get to use the data produced by others. Everyone takes losses but also gets gains under this idea, which is much better than the current model.
Whether rights have been violated depends on the jurisdiction, of course.
Semantics. If person A is protected by privacy rights in her jurisdiction, but her data is scraped by project B from one where such rights conveniently aren’t legally respected, A should still be able to expect some way of injunction.
I guess the idea is that the models themselves are not infringing copyright, but the training process DID. Some of the big players have admitted to using pirated material in training data. The rest obviously did even if they haven’t admitted it.
While language models have the capacity to produce infringing output, I don’t think the models themselves are infringing (though there are probably exceptions). I mean, gzip can reproduce infringing material too with the correct input. If producing infringing work requires both the algorithm AND specific, intentional user input, then I don’t think you should put the blame solely on the algorithm.
Either way, I don’t think existing legal frameworks are suitable to answer these questions, so I think it’s more important to think about what the law should be rather than what it currently is.
I remember stories about the RIAA suing individuals for many thousands of dollars per mp3 they downloaded. If you applied that logic to OpenAI — maximum fine for every individual work used — it’d instantly bankrupt them. Honestly, I’d love to see it. But I don’t think any copyright holder has the balls to try that against someone who can afford lawyers. They’re just bullies.
I guess the idea is that the models themselves are not infringing copyright, but the training process DID.
I’m still not understanding the logic. Here is a copyrighted picture. I can search for it, download it, view it, see it with my own eye balls. My browser already downloaded the image for me, in order for me to see it in the browser. I can take that image and edit it in a photo editor. I can do whatever I want with the image on my own computer, as long as I don’t publish the image elsewhere on the internet. All of that is legal. None of it infringes on copyright.
Hell, it could be argued that if I transform the image to a significant degree, I can still publish it under Fair Use. But, that still gets into a gray area for each use case.
What is not a gray area is what AI training does. They download the image and use it in training, which is like me looking at a picture in a browser. The image isn’t republished, or stored in the published model, or represented in any way that could be reconstructed back to the source image in any reasonable form. It just changes a bunch of weights in a model. It’s mathematically impossible for a 4GB model to somehow store the many, many terabytes of images on the internet.
Where is the copyright infringement?
I remember stories about the RIAA suing individuals for many thousands of dollars per mp3 they downloaded. If you applied that logic to OpenAI — maximum fine for every individual work used — it’d instantly bankrupt them. Honestly, I’d love to see it. But I don’t think any copyright holder has the balls to try that against someone who can afford lawyers. They’re just bullies.
You want to use the same bullshit tactics and unreasonable math that the RIAA used in their court cases?
I agree that the models themselves are clearly transformative. That doesn’t mean it’s legal for Meta to pirate everything on earth to use for training. THAT’S where the infringement is. And they admitted they used pirated material: https://www.techspot.com/news/101507-meta-admits-using-pirated-books-train-ai-but.html
You want to use the same bullshit tactics and unreasonable math that the RIAA used in their court cases?
I would enjoy seeing megacorps held to at least the same standards as individuals. I would prefer for those standards to be reasonable across the board, but that’s not really on the table here.
If you take that image, copy it and then try to resell it for profit you’ll find you’re quickly in breach of copyright.
The LLM is, in most cases, being licensed out to users for a profit off of the input data without which it could not exist in its current form.
You could see it akin to plagiarism if you think ctrl+c, ctrl+v is too extreme.
If you take that image, copy it and then try to resell it for profit you’ll find you’re quickly in breach of copyright.
That’s not what’s happening. Did you even read my comment?
OK, if you ignore the hyperbole of my pre-christmas stress aggressive start, how much of the rest do you disagree with?
Less combatively, I’m of the stance that we should just make AI-generated materials exempt from copyright, and you’ll at least limit mass adoption in public-facing things by big money. Doesn’t address all the issues, though.
AI-generated materials are already exempt from copyright. It falls under the same arguments as the monkey selfie. Which is great.
Crack copyright like a fucking egg. It only benefited the rich, anyway.
That’s good, and I’m glad to have been informed of it.
Thank you.
My preferred copyright change is 17 years from first publication. Feels maybe still a little long, but much better than what we have now.
Destroying it is both not an option, and an objectively regressive suggestion to even make.
Destruction isn’t possible because even if you deleted every bit of information from every hard drive in the world, now that we know it’s possible, someone would recreate it all in a matter of months.
Regressive because you’re literally suggesting that we destroy a new technology because we’re afraid of what it will do to the technology it replaces. Meanwhile, there’s a very decent chance that AI is our best shot at solving the energy/climate crises through advancing nuclear tech, as well as surviving the next pandemic via groundbreaking protein folding tech.
I realize AI tech makes people uncomfortable (for…so many reasons), but becoming old fashioned conservatives in response is not a solution.
I would take it a step further than public domain, though. I would also make any profits from illegally trained AI need to be licensed from the public. If you’re going to use an AI to replace workers, then you need to pay taxes to the people proportional to what you would be paying those it replaces.
I never suggested destroying the technology that is “AI”. I’m not uncomfortable about AI, I’ve even considered pivoting my career in that direction.
I suggested destroying the particular implementation that was trained on the illegitimate data. If someone can recreate it using legitimate data, GREAT. That’s what we want to happen. The tool isn’t the problem. It’s the method they’re using to train them.
Please don’t make up random ass narratives I never even hinted at, and then argue against them.
I didn’t misinterpret what you were saying, everything I said applies to the specific case you lay out. If illegal networks were somehow entirely destroyed, someone would just make them again. That’s my point, there’s no way around that, there’s just holding people accountable when they do it. IMO that takes the form of restitutions to the people proportional to profits.
This is the dumb kind of “best do nothing, because nothing is perfect” approach that makes sure no disincentives are ever applied, because someone somewhere else might also try to do the illegal thing that they’ll lose access to the moment they’re caught…
What the? I’m literally saying what action to take, what is happening? Is there maybe a bug where you only see the first few characters of my post? Are you able to read these characters I’m typing? Testing testing testing. Let me know how far you get. Maybe there’s just too many words for you? Test test. Say “elephant” if you can read this.
Mate, LLMs are literally gobbling up energy as if they’re working at a power plant gloryhole. It’s furthering the climate crisis, not solving it. They’re also incapable of the logic needed to make something new, so they’re not gonna invent anything. AI in general has its uses, but LLMs are not the golden goose you should bet on. And profits from them are afaik non-existent. The money only comes from investors thinking it’ll be profitable some day, but it’s way too energy-intensive a process to be profitable.
I understand that you are familiar with the buzzword “LLM”, but let me introduce you to a different one: transformers.
Virtually all modern successful AIs are based on transformers, LLMs included. I agree that LLMs currently amount to a Chinese-room-inspired parlor trick, but the money involved has no doubt advanced all transformer-based AI research, both directly (what works for LLMs may generalize) and indirectly (the market demand for LLMs in consumer products has created a demand for power and compute hardware).
We have transformer-based AI to thank for our understanding of the COVID-19 protein, and for developing a safe and effective vaccine in a timely manner.
The massive demand for energy has convinced Microsoft, Meta, and others to invest in their own modern nuclear power plants, representing a monumental step forward in sustainable energy generation that we have been trying to convince the US government to take for decades.
Modern AI is being used to solve the hardest problems of nuclear fusion. If we can finally crack that nut, there’s no telling what’s possible.
But specifically when it comes to LLMs, profitable or not, people obviously find them useful. People aren’t using it in place of search engines, or doing all their homework with it because they don’t find it useful. My only argument is that any AI trained on public content without consent should be required to effectively buy a license from, or pay royalties to the public. If McDonald’s is going to replace their front counters with AI trained on public content, then they should have to pay taxes proportional to how much use they get from that AI.
In the theoretical extreme, if someone trains an AI on the general public’s data, and is able to create an AI that somehow replaces every job on earth, then congrats, we now live in a post-work society, we just need to reach out and take it rather than letting one person capitalize infinitely.
And at the end of the day, if you honestly believe the profits from AI are non-existent, then what are you worried about? All those companies putting all their eggs in the LLM basket are going to disappear overnight when the AI bubble finally pops, right?
There’s a reason why in my comment I talked about LLMs as bad while saying AI in general has its uses. The reason being that this post is about LLMs.
I know very well that specialized AI has a lot of uses in medical science and other fields but that’s not really what got hit with all the hype, is it? The hype is managers saw a language model give seemingly better answers to questions than John Rando from 2 blocks down the road so they’re now looking to cut out all the already low paid workers and spoiler alert we will not land in a society where the general public profits from not having work. It will be the same owners of capital profiting as per usual.
we will not land in a society where the general public profits from not having work. It will be the same owners of capital profiting as per usual.
If we do nothing, sure. I’m suggesting, like the article, that we do something.
The only sentiment I took issue with was the poster above who suggested that somehow the solution would be to delete/destroy illegally trained networks. I’m just saying that’s neither practical nor progressive. AI is here to stay, we just need to create legislation that ensures it works for us, especially when it couldn’t have been built without us.
would love to see a source for AI helping with the covid 19 vaccine
For sure, here you go.
I’d argue it’s not useless; rather, it would remove any financial incentive for these companies to sink who knows how much into training AI. By putting the models in the public domain, they would lose their competitive advantage over other cloud providers who could exploit them all the same, all the while not disturbing the current usage of AI.
Now, I do agree that destroying it would be even better, but I fear something like that would face too much pushback from the parts of civil society who do use AI.
Strongly agree. Legislators have to come up with a way to handle how copyright works in conjunction with AI. I think it’s a sound approach to say companies can’t copyright it and keep it to themselves, if most of what went in was other people’s copyrighted work.
And it’d help make AI more democratic. I.e. not just entirely dominated by the motives of those super rich companies who have the millions of dollars to do it.
Legislators have to come up with a way to handle how copyright works in conjunction with AI.
That’s the neat part. It doesn’t.
Copyright hasn’t worked for the past 100 years. Copyright was born out of a social agreement that works protected by it would enter the public domain in a reasonable time frame. Thanks to Mark Twain and Disney, the limit is basically forever, or it might as well be. Here we are still arguing about the next Bond film for a book series that was written in the fucking 1950s. Or the Lord of the Rings series, the genesis of all fantasy. Or thousands of other things that deserve to be in the public domain already.
Copyright is a blunt tool that rich people use to bash the poor with. Whatever you think copyright is doing to protect your rights or your works is easy enough for them to just spend enough money with lawyers and cases until you cave. If copyright isn’t working for the public good, then we should abolish it.
People hate AI because it’s mostly developed and used by the rich as a shitty way to save money and lay off even more people than have already been laid off. But it doesn’t have to be. All of these LLM projects were based on freely available research. Hell, Stable Diffusion is still something you can just download and use for free, despite the fact that Stability AI is still trying to wrestle back control of the model.
Instead of sticking our fingers in our ears and saying “la la la la, AI doesn’t exist, it must be destroyed/regulated/fined”, we could push this technology to be open sourced as much as possible. I mean, let’s assume that we somehow regulate AI so that people have to pay to use copyrighted works for training (as absurd as that is). AI training goes down drastically, and stagnates. Countries like China are not going to follow those same rules, and eventually, China will be the technological leader here.
Or the program works, and other people who don’t give a shit about copyright freely allow AI to train on their works. Then you have AI models that have to follow these arcane rules but arrive at the same spot anyway, only now just for the rich people who can afford the systems that such regulation requires. What the fuck was the point of the regulation, except to make models even more expensive to make?
I mean, let’s assume that we somehow regulate AI so that people have to pay to use copyrighted works for training (as absurd as that is).
ISBNDB approximates there to be 158,464,880 published books in existence.
Meta’s annual revenue was ~$156 billion last year.
Assuming a one-time purchase scenario and a $20 average cost, that’s ~$3.2 billion, or ~2% of their annual revenue.
Or you could assume a $0.20 annual license (similar to a lot of technology licenses), or $0.002 per “stream” (which in this instance would be “use of data to train model”).
I agree with most of what you said, but if you buy into a lot of the economic paradigms your arguments are based on, you must also realize that those paradigms require that the copyrighted works be paid for, and it’s not unreasonable to do so.
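For anyone who wants to check or tweak the back-of-the-envelope numbers above, here is a minimal Python sketch. The book count and the revenue figure are the ones quoted in this comment; the helper names (`total_cost`, `share_of_revenue`) are made up for illustration, and the per-“stream” scenario assumes each book is used exactly once, which is purely an assumption.

```python
# Figures quoted in the comment above; everything else is an illustrative assumption.
BOOKS_IN_EXISTENCE = 158_464_880           # ISBNDB estimate cited above
META_ANNUAL_REVENUE_USD = 156_000_000_000  # ~$156 billion, as cited above


def total_cost(price_per_book_usd: float, books: int = BOOKS_IN_EXISTENCE) -> float:
    """Cost of paying a flat price once for every book (hypothetical helper)."""
    return price_per_book_usd * books


def share_of_revenue(cost_usd: float, revenue_usd: float = META_ANNUAL_REVENUE_USD) -> float:
    """Cost expressed as a percentage of annual revenue (hypothetical helper)."""
    return 100 * cost_usd / revenue_usd


# The three pricing scenarios floated above. The last one assumes a single
# training use per book, which is an assumption, not something from the thread.
scenarios = {
    "one-time purchase at $20": 20.0,
    "annual license at $0.20": 0.20,
    "single training use at $0.002": 0.002,
}

for label, price in scenarios.items():
    cost = total_cost(price)
    print(f"{label}: ${cost:,.0f} ({share_of_revenue(cost):.4f}% of revenue)")
```

Swap in a different average price or revenue figure to see how sensitive the roughly 2% result is to those assumptions.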
Sure. Copyright is broken. And it certainly doesn’t help that I’m paying Spotify etc. just so they can pocket the money. But don’t we need something so Hollywood can produce my favorite TV show? I mean, that stuff costs millions and millions to make before it somehow arrives on my screen. Or so an author can make a decent living coming up with a nice fantasy novel series? What’s the alternative until we arrive at Star Trek and money is a thing of the past?
I’m pretty sure the AI companies are stealing copyrighted work. Afaik Meta admitted doing it. For several older ones we know which books were in the training datasets. There are several ongoing lawsuits dealing with books being used to train AI, Scarlett Johansson’s voice etc.
I agree. As is, AI is a plaything for rich companies. They have complete control, since they hired the experts and they have the money for all the graphics cards and electricity. If it’s as disruptive as people claim, it’s our bad. Because we’re out of the loop.
This feels like either a weak response or a shift in position. If privacy is the issue, how is the PD a serious solution? Of course it isn’t. So PD is a penalty of sorts, which is no better or worse than any other penalty. Meh.