It’s so ridiculous: when corporations steal everyone’s work for their own profit, no one bats an eye, but when a group of individuals does the same to make education and knowledge free for everyone, it’s somehow illegal, unethical, immoral and whatnot.
Using publicly available data to train isn’t stealing.
Daily reminder that the ones pushing this narrative are literally corporations like OpenAI. If you can’t use copyrighted materials freely to train on, it drives up the cost in such a way that only a handful of companies can afford the data.
They want to kill the open-source scene and are manipulating you to do so. Don’t build their moat for them.
And using publicly available data to train gets you a shitty chatbot…
Hell, even using copyrighted data to train isn’t that great.
Like, what do you even think they’re doing here for your conspiracy?
You think OpenAI is saying they should pay for the data? They’re trying to use it for free.
Was this a meta joke and you had a chatbot write your comment?
if someone said this to me I’d cry
The point that was being made was that publicly available data includes a whole lot of copyrighted data to begin with, and it’s pretty much impossible to filter it out.
Grand example: the Eiffel Tower in Paris is not copyright protected, but the lights on it are, so you can only use pictures of the Eiffel Tower taken during the day, and only if the picture itself isn’t copyright protected by the original photographer. Copyright law has all these complex caveats and exceptions that make it impossible to tell at a glance whether or not something is protected.
This in turn means that if AI cannot legally train on copyrighted materials it finds online without paying huge sums of money, then effectively only mega corporations that can pay copyright fines as a cost of doing business will be able to afford to train a decent AI.
The only other option for producing an AI of this type is a very narrow, curated set of known materials with a public-use license, but that is not going to get you anything competent on its own.
EDIT: In case it isn’t clear, I am clarifying what I understood from Grimy@lemmy.world’s comment, not adding to it.
So then we as a society aren’t ready to untangle the mess of our infancy in the digital age. ChatGPT isn’t something we must have at all costs; it’s something we should have when we can deploy it while still respecting the rights of the people who made the content being used to train it.
I would go even further and say that we shouldn’t have it until we can be sure it will respect others’ rights. All kinds of rights, not only copyright. Unlike Bing at the beginning, with all its bullying and menaces, or ChatGPT regurgitating private information gathered from God knows where.
The problem with waiting is the arms race with other governments. I feel it’s similar to fossil fuels, but all governments need to take the risk of being disadvantaged. Damned prisoner’s dilemma.
I didn’t want any of this shit. IDGAF if we don’t have AI. I’m still not sure the internet actually improved anything, let alone what the benefits of AI are supposed to be.
It doesn’t matter what you want. What matters is if corporations can extract $ from you, gain an efficiency, or cut their workforce using it.
That’s what the drive for AI is all about.
No doubt.
A perfectly valid stance to take.
It’s not like all this data was randomly dumped on the AIs. For data sets to serve as good training materials, they need contextual information so that the AI can discern patterns and replicate them when prompted.
We see this when you can literally prompt AIs with the name of the artist whose style you want them to emulate. Meaning that the data they were fed had such information.
Midjourney is facing extra backlash from artists after a spreadsheet was leaked containing a list of artist styles their AI was trained on. Meaning they can keep track of it, and they trained the AI on those artists’ works deliberately. They simply pretend this is impossible to figure out so that they won’t be obliged to seek permission from, and compensate, the artists whose works were used.
That’s insane logic…
Like you’re essentially saying I can copy/paste any article without a paywall to my own blog and sell adspace on it…
And you’re still saying OpenAI is trying to make AI companies pay?
Like, do you think AI runs off free cloud services? The hardware is insanely expensive.
And OpenAI is trying to argue the opposite, that AI companies shouldn’t have to pay to use copyrighted works.
You have zero idea what is going on, but you are really confident you do.
I clarified the comment above, which was misunderstood; whether it makes a moral/sane argument is subjective and I am not covering that.
I am not sure why you think there is a claim that OpenAI is trying to make companies pay. On the contrary, the comment I was clarifying (so not my opinion/words) states that OpenAI is making an argument that anyone should be able to use copyrighted materials for free to train AI.
The cost of running an online service like ChatGPT is wildly beside the point of the argument presented. You can run your own open source large language models at home about as well as you can run Bethesda’s Starfield on a same spec’d PC.
Those open source large language models are trained on the same collections of data, including copyrighted data.
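The logic being used here is:
If it becomes globally forbidden to train AI with copyrighted materials, or there is a large price or fine for using them in training, then the non-corporate, free, open source side of AI will perish or have to go underground, while the for-profit mega corporations will continue to exploit and train AI as usual because they can pay to settle in court.
The ethical dilemma as I understand it is:
Allowing AI to train for free is a direct threat to creatives and a win for BigProfit Entertainment; not allowing it to train for free is a threat to public, democratic AI and a win for BigTech merging with BigCrime.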
That is very well put, I really wish I could have started with that.
Though I envision it as a loss for BigProfit Entertainment, since I see this as a real boon for the indie gaming, animation and eventually filmmaking industries.
It’s definitely overall quite a messy situation.
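“You can run your own open source large language models at home about as well as you can run Bethesda’s Starfield on a same spec’d PC”
…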
Yes, you can download an executable of a chatbot lol.
That’s different than running something remotely like even OpenAI.
The more it has to reference, the more the system scales up. Not just storage, but everything else.
Like, in your example of video games it would be more like stripping down a PS5 game of all the assets, then playing it on a NES at 1 frame per five minutes.
You’re not only wildly overestimating chatbots’ abilities, you’re doing that while drastically underestimating the resources needed.
Edit:
I think you literally don’t know what people are talking about…
Do you think people are talking about AI image generators?
No one else is…
I think you’re confusing training it with running it. After it’s trained, you can run it on much weaker hardware.
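To put rough numbers on the training-versus-running gap, here’s a back-of-the-envelope sketch in Python. It only counts memory for the model weights (plus gradients and optimizer state in the training case) and ignores activations, the context cache and everything else, so treat the results as lower bounds. The bytes-per-parameter figures are assumptions on my part (2 for fp16, 0.5 for 4-bit quantization, roughly 16 for mixed-precision training with Adam), and `weight_memory_gb` is a made-up helper, not part of any library. The 13 billion parameter count is just picked to match a 13B-class model like the Vicuna 13b mentioned in this thread.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough memory estimate in GB (10^9 bytes): parameters times bytes each."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 13e9  # assumed size of a 13B-class model

# Assumed bytes per parameter for each scenario (rules of thumb, not measurements).
SCENARIOS = {
    "run, fp16 weights": 2.0,
    "run, 4-bit quantized weights": 0.5,
    "train, mixed precision + Adam": 16.0,
}

for label, bytes_per_param in SCENARIOS.items():
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bytes_per_param):.1f} GB")
```

Under those assumptions the quantized model fits on a single gaming GPU, while training the same model needs a couple of hundred GB before you even count activations, which is the gap between “running it at home” and “building it at home”.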
The issue is it reproducing copyrighted works verbatim…
It can’t do that unless it contains the entire text to begin with…
I am talking about generative AI, be it text or image; both have a challenge with copyrighted material.
“executable of a chatbot”
lol, aint you cute
“example of video games”
Are you referring to my joke?
I am far from overestimating capacity. Starfield’s performance on a modern gaming system is mediocre compared to other games, and the Vicuna 13B LLM’s performance on the same system is mediocre compared with GPT-3.5. To this date there is no local model that I would trust for professional use, and ChatGPT 3.5 doesn’t hit that level either.
But it remains a very interesting, rapidly evolving technology that I hope receives as much future open source support as possible.
“I think you literally don’t know what people are talking about”
I hate to break it to you but you’re embarrassing yourself.
I presume you must believe the following Lemmy community and resources were typed up by a group of children; either that or you’re just naive.
https://lemmy.world/c/fosai
https://www.fosai.xyz/
https://github.com/huggingface/transformers
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
https://huggingface.co/microsoft/phi-2 & https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
https://www.theguardian.com/technology/2023/may/05/google-engineer-open-source-technology-ai-openai-chatgpt
Hey man, that’s damn hurtful
I’m not sure if someone else has brought this up, but I could see OpenAI and other early adopters pushing for tighter controls of training data as a means to be the only players in town. You can’t build your own competing AI because you won’t have the same amount of data as us and we’ll corner the market.
OpenAI is definitely not the one arguing that they have stolen data to train their AIs, and Disney will be fine whether AI requires owning the rights to training materials or not. Small artists, the ones protesting the most against it, will not. They are already seeing jobs and commission opportunities declining due to it.
Being publicly available in some form is not permission to use and reproduce those works however you feel like. Only the real owner has the right to decide. We on the internet have always been a bit blasé about it, sometimes deservedly, but as we get to the point where we are driving away the very same artists that we enjoy and get inspired by, maybe we should be a bit more understanding about their position.
That depends on what your definition of “publicly available” is. If you’re scraping New York Times articles and pulling art off Tumblr then yeah, it’s exactly stealing in the same way Sci-Hub is. Only difference is, Sci-Hub isn’t boiling the oceans in an attempt to make rich people even richer.
Also, Sci-Hub doesn’t make any money off the works.
Yeah, by using the argument you just gave as an excuse to “launder” copyleft works in the training data into permissively-licensed output.
Including even a single copyleft work in the training data ought to force every output of the system to be copyleft. Or if it doesn’t, then the alternative is that the output shouldn’t be legal to use at all.
We have a mechanism for people to make their work publicly visible while reserving certain rights for themselves.
Are you saying that creators cannot (or ought not be able to) reserve the right to ML training for themselves? What if they want to selectively permit that right to FOSS or non-profits?
Scientific research papers are generally public too, in that you can always reach out to the researcher and they’ll provide the papers for free; it’s just the “corporate” journals that need their profit off of other people’s work…
All of the AI fear mongering is fuelled by mega corps who fear that AI in some form will eat into their profits and that they can’t make money off of it.
Image generation also had a similar outcry because open source models smoked all the commercial ones.
Yeah, just wait until they see the AI design tools that allow anyone to casually describe the spare part or upgrade they want and have it designed and printed at home or at a local fab shop.
A lot of once fairly safe monopolies are going to start looking very shaky, and then things like natural language cookery toolarms will disrupt even more…
We’ve only barely started to see what the tech we have now is able to do. Yes, a million shitty chat bots / img gen apps are cashing in on the hype, but when we start seeing some killer apps emerge, that’s when people won’t be able to ignore it any longer.
True, Big Tech loves monopoly power. It’s hard to see how there can be an AI monopoly without expanding intellectual property rights.
It would mean a nice windfall profit for intellectual property owners. I doubt they worry about open source or competition but only think as far as lobbying to be given free money. It’s weird how many people here, who are probably not all rich, support giving extra money to owners, merely for owning things. That’s how it goes when you grow up on Ayn Rand, I guess.
This is the hardest thing to explain to people. Just convert it into a person with unlimited memory.
OpenAI is sending said person to view every piece of human work; the person learns and makes connections, then makes art or reports based on what you tell/ask them.
Sci-Hub is doing the same thing, but you can ask it for a specific book and it will write it down word for word for you, an exact copy.
Both morally should be free to do so. But we have laws that say the Sci-Hub human is illegally selling the work of others. Whereas the OpenAI human has to be given so many specific instructions to reproduce a human work that it’s practically like handing it a book and it handing the book back to you.
What data is public?
Cue the Max Headroom episode where the blanks (disconnected people) are chased by the censors because the blanks steal cable so their children can watch the educational shows and learn to read, and they are forced to use clandestine printing presses to teach them.
what’s this? an anti-corporate message that sneers at cable TV companies??? CANCEL THAT SHOW!!!
that show was so amazingly prescient: the theme of the first episode was how advertising literally kills its viewers and the news covers things up. No wonder they didn’t get renewed. ;)
Reminds me of this: https://www.gnu.org/philosophy/right-to-read.html
Because it’s easy to get these chatbots to output direct copyrighted text…
Even ones the company never paid for, not even just a subscription for a single human to view the articles they’re reproducing. Like, think of it as buying a movie, then burning a copy for anyone who asks.
And reproducing it word for word for people who didn’t pay is a whole other issue. So this is more like torrenting a movie, then seeding it.
It’s not that easy; don’t believe the articles being broadcast every day. They are heavily cherry-picked.
Also, if someone is creating copyrighted works, it is on that person to be responsible if they release or sell them, not the tool they used. Just because the tool can be good (learns well and responds well when asked to make a clone of something) doesn’t mean that is the only thing it does or must do. It is following instructions, which were to make a thing. The one giving the instructions is the issue, and the intent of that person when they distribute is the issue.
If I draw a perfect clone of Donald Duck in the privacy of my home after looking at hundreds of Donald Duck images online, there is nothing wrong with that. If I go on Etsy and start selling them without a license, they will come after ME. Not because I drew it, but because I am selling it and violating a copyright. They won’t go after the pencil or ink manufacturer. And they won’t go after Adobe if I drew it on a computer with Photoshop.
“If I draw a perfect clone of Donald Duck in the privacy of my home after looking at hundreds of Donald Duck images online, there is nothing wrong with that”
In your picture example it would be an exact copy…
But even if you started a business and, when people asked for a picture of Donald Duck, gave them a traced copy, that would still be copyright infringement… Hell, even in your bad analogy of a person’s own drawing, it’s still copyright infringement.
The worst thing about these chatbots is that the people who think they’re amazing don’t understand what they’re doing. If you understood it, it wouldn’t be impressive.
Because humans have more rights than tools. You are free to look at copyrighted text and pictures, memorize them and describe them to others. It doesn’t mean you can use a camera to take and share pictures of them.
Acting like every right that AIs have must be identical to humans’, and if not that means the erosion of human rights, is a fundamentally flawed argument.
Whoosh