<p>TL;DR: <a href="https://infosec.exchange/tags/AI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>AI</span></a> & <a href="https://infosec.exchange/tags/ML" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>ML</span></a> data aggregators break the law when they <em>consume</em> data unlawfully, whether or not the <em>output</em> of their systems is considered fair use. Let's talk about the underlying data, not just the tools!</p><p>Most big AI & ML data aggregators aren't sharing their data openly in compliance with <a href="https://infosec.exchange/tags/FOSS" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>FOSS</span></a> or open-content licenses, or adhering to the <a href="https://infosec.exchange/tags/TOS" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>TOS</span></a> of data they scrape from less litigious content creators. They are trying to avoid legal liability by signing "content deals" with big media companies. Those deals merely paper over the fact that commercializing data in violation of copyright holders' license terms, terms of service, or contracts of adhesion is a criminal act; hiding that data inside proprietary databases and models is simply one more way these companies attempt to dodge lawsuits and liability.</p><p>Copyright theft hurts independent writers, bloggers, educators, and journalists more than it hurts media moguls. Signing content licensing agreements with the likes of GitHub, Hachette Book Group, or the New York Times is all well and good, but it typically doesn't compensate the actual content producers or bring the unlawful aggregation into compliance. These sorts of deals insult every <a href="https://infosec.exchange/tags/opensource" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opensource</span></a> and <a href="https://infosec.exchange/tags/opencontent" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>opencontent</span></a> creator. <a href="https://infosec.exchange/tags/DRM" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>DRM</span></a> isn't the answer, and neither is paying off big media. <a href="https://infosec.exchange/tags/OpenAI" class="mention hashtag" rel="nofollow noopener" target="_blank">#<span>OpenAI</span></a> and others should either pay for that data directly, make the data publicly available under the share-alike clauses common to most open-source/open-content licenses, or exclude it altogether.</p><p>Many years ago, Canada took an interesting approach that wasn't based on chasing individual "pirates." Instead, it taxed storage media like burnable CDs. The main flaw in that approach was that the levy still mostly benefited large companies, which could collect meaningful sums from it, rather than compensating small, independent artists. Nevertheless, it was (and is) a better model <em>for society</em> & the creative commons than paying "protection money" to big media conglomerates while continuing to build for-profit business models that violate the copyrights and licensing terms of anyone who isn't deemed a significant legal threat.</p><p>As a society, we can and <em>must</em> do better. If our copyright laws (including commercial software licenses and terms of service) are outdated and no longer serve society, then they should be updated or fixed.
However, deliberate and open theft should never be permitted to become business-as-usual. Whether or not the <em>output</em> of such systems is sufficiently transformative to escape being labeled theft or plagiarism, almost all of today's commercial options rely on very large data sets filled with material that belongs to others who weren't compensated for their work, and most of those copyright owners lack the visibility even to determine whether their work was unlawfully used.</p>