• Grimy@lemmy.world
    6 hours ago

    All LLMs and gen AI are trained on data their builders don’t own. The Pile, which served as a starting point for most LLMs, is all scraped or pirated info. Image gen is all trained on images scraped from the web. Speech-to-text and video gen mainly use YouTube data.

    So either you put a price tag on that data, which means only a handful of companies (including Meta) can afford to build these tools, or you accept that piracy is the only way for most people to acquire this data. And since the use is highly transformative, it isn’t breaching copyright or directly stealing from the data’s owners the way piracy “normally” does.

    I’m being pragmatic.