August 24, 2023
Intellectual property ownership matters to individuals and businesses alike. IP rights already face well-known threats - IP theft and patent trolls, to name two of the most egregious examples.
A newer threat to individuals and small businesses is the scraping of data without consent to train AI. If AI companies wish to use data to train their AI, they should do one or both of the following:
a) use public domain data - this does NOT include copyrighted material that merely happens to be on the web, such as blogs.
b) enter into a license agreement with the content creators, blog owners, etc. whose data they wish to use for AI training purposes.
Most importantly, the difficulty of entering into licensing agreements at scale is NOT the copyright owner's (blog owner's) problem. It rests solely on AI companies. Likewise, if paying license fees for all of their training data makes it hard for AI companies to turn a profit, that is a problem inherent to the business model of AI (which requires huge amounts of data), and certainly not the content creator's problem.
If AI companies cite profitability concerns as a reason to scrape data without permission or a license agreement, that is not a valid argument. Any company that steals someone else's material will obviously make more of a profit than one that pays fairly for it, but that doesn't make stealing any more acceptable.
On a somewhat related note, I recently tried to block the Common Crawl bot (CCBot) and GPTBot in my robots.txt file - then a few days later Google Search Console said it couldn't index some pages because of my robots.txt file. I'm pretty sure CCBot and GPTBot don't have anything to do with Google, so this is a little weird! My robots.txt file at the time looked like this:
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
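A quick way to sanity-check rules like these is to run them through Python's standard-library `urllib.robotparser` and ask which crawlers they block. The sketch below is only an illustration: the `blocked_agents` helper and the `https://example.com/some-post` URL are made up for the example, and Python's parser may not interpret a file exactly the way Googlebot or any other real crawler does.

```python
from urllib.robotparser import RobotFileParser

# The rules from this post, one directive per line.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/some-post"):
    """Hypothetical helper: map each user agent to True if `url` is blocked for it."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


result = blocked_agents(ROBOTS_TXT, ["GPTBot", "CCBot", "Googlebot"])
for agent, blocked in result.items():
    print(f"{agent}: {'blocked' if blocked else 'allowed'}")
```

If this reports Googlebot as allowed, these two rule groups on their own are probably not what Search Console is complaining about, and the cause may lie elsewhere (for example, in a different version of the file than the one being tested).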
I'm still tinkering with the robots.txt file - I may wind up allowing those bots but setting a really long crawl-delay for them.
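For what the crawl-delay variant might look like, here is a rough sketch that again uses Python's `urllib.robotparser` to read the values back. The 3600-second delay and the `crawl_delays` helper are placeholders chosen for illustration, not settings taken from this blog. Keep in mind that `Crawl-delay` is a non-standard directive, so not every crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical variant: allow the bots but ask them to wait between requests.
# 3600 seconds is an arbitrary placeholder value.
ROBOTS_WITH_DELAY = """\
User-agent: GPTBot
Crawl-delay: 3600

User-agent: CCBot
Crawl-delay: 3600
"""


def crawl_delays(robots_txt, agents):
    """Hypothetical helper: map each user agent to its Crawl-delay (None if unset)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.crawl_delay(agent) for agent in agents}


delays = crawl_delays(ROBOTS_WITH_DELAY, ["GPTBot", "CCBot", "Googlebot"])
for agent, delay in delays.items():
    print(f"{agent}: {delay}")
```

A crawler with no matching group, such as Googlebot here, comes back as `None`, meaning the file requests no delay for it.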