• Lemminary@lemmy.world · 7 months ago

    Is sampling and analyzing publicly available data and not storing it considered stealing?

    I’m not defending anyone here, but that’s just weird. It’s also weird to take a meme seriously, but w/e.

    • owen@lemmy.ca · 7 months ago

      You can’t use someone’s work for whatever you want just because it’s publicly accessible.

      • JorMaFur@lemm.ee · 7 months ago

        I’m actually still not sure how I feel about this.

        I can use books to learn a new language. An AI can use texts to learn its kind of language, in a sense.

        I’m not sure where the limit is or should be though.

        • wander1236@sh.itjust.works · 7 months ago

          I don’t think that’s really the same thing. Most people learning another language aren’t doing it specifically so they can turn around and sell translations to millions of customers.

          And if they were, they’d probably need to be accredited and licensed, using standardized sources that they pay for, directly or indirectly.

          • Lemminary@lemmy.world · 7 months ago

            > so they can turn around and sell translations to millions of customers

            That sounds like a translator to me. Also, they're kind of doing it for free. What they're selling is access to their latest models, their API, and their plugin store. They're not exactly selling the transformed information itself.

      • Lemminary@lemmy.world · 7 months ago

        Hasn’t web scraping been done for like forever, though? How is this any different? You get publicly accessible information and you derive data from it. You’re literally not stealing anything or storing it as-is.

    • Danitos@reddthat.com · 7 months ago

      Public data still has a license. E.g., some open-source licenses force you to open-source the software you create using them, something OpenAI doesn't do.

      • Lemminary@lemmy.world · 7 months ago

        If you’re using it as you found it, then yeah. But if I take derived data from it, like word count and word frequency, that's not the same thing; we call that statistics. Now, if I record how often certain words appear together, and then compound that with millions of other sources to create a map of related words and concepts, I'm no longer using the data as you described, because I'm doing something entirely different with it. What LLMs do is generate new information from their underlying sources.
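        A minimal sketch of the kind of derived statistics I mean (word frequency and per-document co-occurrence counts; the documents and names here are made up for illustration, and only the counts are kept, not the text itself):

```python
from collections import Counter
from itertools import combinations

# Two tiny "source documents" — stand-ins for scraped text.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

word_freq = Counter()  # how often each word appears overall
cooccur = Counter()    # how often two words appear in the same document

for doc in docs:
    words = doc.split()
    word_freq.update(words)
    # count each unordered pair of distinct words once per document
    cooccur.update(combinations(sorted(set(words)), 2))

print(word_freq["the"])        # 4
print(cooccur[("on", "sat")])  # 2
```

        After the loop, the original strings can be discarded: what remains is a statistical map of which words go together, not a copy of any source.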

        • Danitos@reddthat.com · 7 months ago

          In my example, they would still be using the source code to create new software that is not open source, no matter how many Markov chains are behind it.

    • wander1236@sh.itjust.works · 7 months ago

      It has to be stored in some form for the AI to “learn” from and remember it, and a lot of the debate is around whether AI is actually able to learn, or if it can only really blindly combine 1:1 copies of elements into something derivative.

      There’s also the debate of whether what humans learn and produce based on influence can be compared to AI, but humans aren’t able to consume millions of records in seconds like AI.

      • Lemminary@lemmy.world · 7 months ago (edited)

        They’re not storing the original data, and OpenAI even states so themselves. An LLM compounds derived associations between words and concepts from whatever it analyzes, which are further modified by every other source it analyzes, and that's what gets stored during training. It doesn't matter if it's a few sources or a million: none of it is stored as-is. It's very much like how we process information ourselves over our entire lives by making generalizations. We don't memorize everything precisely, beyond the foundational blocks of language, but our neurons do fire in a certain pattern when given a trigger. How is that stealing?