• utopiah@lemmy.world · 5 days ago

    There are AI’s that are ethically trained

    Can you please share examples and criteria?

    • dogslayeggs@lemmy.world · 4 days ago

      Sure. My company has a database of all technical papers written by employees in the last 30-ish years. Nearly all of these contain proprietary information from other companies (we deal with tons of other companies and have access to their data), so we can’t build a public LLM nor use a public LLM. So we created an internal-only LLM that is only trained on our data.

      • Fmstrat@lemmy.world · 4 days ago

        I’d bet my lunch this internal LLM is a trained open weight model, which has lots of public data in it. Not complaining about what your company has done, as I think that makes sense, just providing a counterpoint.

      • utopiah@lemmy.world · 4 days ago

        Are you using solely your own data, or fine-tuning an existing LLM, or doing RAG?

        I’m not an expert, but AFAIK training an LLM requires, by definition, a vast amount of text, so I’m skeptical that ANY company publishes enough papers to do so. I understand if you can’t share more about the process. Maybe me saying “AI” was too broad.
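        To illustrate the distinction being asked about: with RAG, internal documents are retrieved at query time and pasted into the prompt rather than trained into the model. Below is a minimal sketch of that idea; the documents, titles, and word-overlap scoring are all made up for illustration, not anyone's actual pipeline.

```python
# Minimal RAG-style sketch: instead of training a model on internal
# papers, retrieve the most relevant one at query time and prepend it
# to the prompt. All data and names here are hypothetical.

def score(query: str, doc: str) -> int:
    """Crude relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: dict[str, str]) -> str:
    """Return the title of the best-matching internal document."""
    return max(corpus, key=lambda title: score(query, corpus[title]))

def build_prompt(query: str, corpus: dict[str, str]) -> str:
    """Prepend the retrieved document to the user's question."""
    title = retrieve(query, corpus)
    return f"Context ({title}):\n{corpus[title]}\n\nQuestion: {query}"

corpus = {
    "Turbine fatigue report": "fatigue cracks in turbine blades under load",
    "Lubricant study": "viscosity of synthetic lubricants at low temperature",
}
print(build_prompt("Why do turbine blades crack?", corpus))
```

        A real system would use vector embeddings instead of word overlap, but the shape is the same: the proprietary text only ever appears in the prompt, never in the model weights.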

    • Fmstrat@lemmy.world · 4 days ago

      Apertus was developed with due consideration to Swiss data protection laws, Swiss copyright laws, and the transparency obligations under the EU AI Act. Particular attention has been paid to data integrity and ethical standards: the training corpus builds only on data which is publicly available. It is filtered to respect machine-readable opt-out requests from websites, even retroactively, and to remove personal data, and other undesired content before training begins.

      https://www.swiss-ai.org/apertus

      Fully open source, even the training data is provided for download. That being said, this is the only one I know of.
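      The “machine-readable opt-out” filtering described above can be sketched with Python’s standard-library robots.txt parser. This only illustrates the idea, not Apertus’s actual pipeline; the crawler name and rules are invented.

```python
# Sketch of honoring a machine-readable opt-out before training.
# The opt-out here is a robots.txt-style rule; the crawler name and
# rules are invented for illustration.
from urllib import robotparser

OPT_OUT_RULES = """\
User-agent: ExampleAITrainer
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(OPT_OUT_RULES.splitlines())

def may_train_on(url: str, crawler: str = "ExampleAITrainer") -> bool:
    """True if the site has not opted this crawler out of the URL."""
    return parser.can_fetch(crawler, url)

pages = ["https://example.com/paper.html", "https://example.com/blog"]
corpus = [u for u in pages if may_train_on(u)]
print(corpus)  # this site opted out of AI training, so nothing survives
```

      The interesting part of the Apertus claim is applying such filtering retroactively, i.e. re-checking opt-outs published after the pages were first crawled.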

      • utopiah@lemmy.world · 4 days ago

        Thanks, a friend did recommend it a few days ago, but unfortunately AFAICT they don’t provide the CO2eq in their model card, nor an equivalent analogy that non-technical users could understand.

      • utopiah@lemmy.world · 4 days ago

        Right, and to be clear I’m not saying it’s not possible (in fact I have some models in mind, but I’d rather let others share first). This isn’t a trick question, it’s a genuine request to hopefully be able to rely on such tools.

    • Hackworth@piefed.ca · 4 days ago

      Adobe’s image generator (Firefly) is trained only on images from Adobe Stock.

        • Hackworth@piefed.ca · 3 days ago

          The Firefly image generator is a diffusion model, and the Firefly video generator is a diffusion transformer. LLMs aren’t involved in either process; rather, the models learn image-text relationships from meta tags. I believe there are some ChatGPT integrations with Reader and Acrobat, but that’s unrelated to Firefly.

          • utopiah@lemmy.world · 3 days ago

            Surprising, I would expect it’d rely at some point on something like CLIP in order to be prompted.

            • Hackworth@piefed.ca · 2 days ago (edited)

              As I understand it, CLIP (and other text encoders in diffusion models) aren’t trained like LLMs, exactly. They’re trained on image/text pairings, which you get from the metadata creators upload with their photos to Adobe Stock. OpenAI trained CLIP on alt text from scraped images, but I assume Adobe would want to train its own text encoder on the more extensive tags on the stock images it’s already using.
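              The image/text pairing idea can be sketched like this: a CLIP-style model is trained so that an image embedding is most similar to the embedding of its own caption. The 2-D embeddings below are hand-made stand-ins, not outputs of any real encoder.

```python
# Toy illustration of CLIP-style image/text pairing: after training,
# each image's embedding should be closest (by cosine similarity) to
# the embedding of its own caption. Embeddings are hand-made stand-ins.
import math

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Pretend image embeddings and caption embeddings, paired by index
# (captions standing in for stock-photo metadata tags).
image_embs = [[1.0, 0.1], [0.1, 1.0]]
text_embs = [[0.9, 0.2], [0.2, 0.9]]

# For each image, the best-matching caption should be its own pair.
for i, img in enumerate(image_embs):
    best = max(range(len(text_embs)), key=lambda j: cosine(img, text_embs[j]))
    print(f"image {i} -> caption {best}")
```

              The contrastive training objective pushes matching pairs toward high similarity and mismatched pairs toward low similarity, which is what makes the model promptable with text afterwards.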

              All that said, Adobe hasn’t published their entire architecture. And there were some reports during the training of Firefly 1 back in '22 that they weren’t filtering out AI-generated images in the training set. At the time, those made up ~5% of the full stock library. Currently, AI images make up about half of Adobe Stock, though filtering them out seems to work well. We don’t know if they were included in later versions of Firefly. There’s an incentive for Adobe to filter them out, since AI trained on AI tends to lose its tails (the ability to handle edge cases well), and that would be pretty devastating for something like generative fill.

              I figure we want to encourage companies to do better, whatever that looks like. For a monopolistic giant like Adobe, they seem to have at least done better. And at some point, they have to rely on the artists uploading stock photos to be honest. Not just about AI, but about release forms, photo shoot working conditions, local laws being followed while shooting, etc. They do have some incentive to be honest, since Adobe pays them, but I don’t doubt there are issues there too.