Public domain, business models, and 'teaching' AI
April 18, 2025 • 1,720 words
Data theft?
There was a striking moment in Carole Cadwalladr's TED talk this year - at 13:38 if you want to jump to it, but please watch the whole talk - where she showed ChatGPT writing a TED talk in the style of Carole Cadwalladr and said to Sam Altman 'I do not consent'.
There was a fascinating exchange about this moment in a follow-up interview with Chris Anderson, which is worth quoting (I am editing to get the key points):
CA: Would you feel differently about it if there was an improved business model here where the platform's committed to respecting individual talent? So that, for example, when a request is made to specifically embody the style of a musician or a writer or an artist, there would actually be some conversation back to that person, so that you could say, actually, this is a way in which I could amplify my impact on the planet, and I will actually be compensated for it.
CC: So in a theoretical world with an ethical AI company which asked your permission before it scraped your data and then paid you whenever it used that in some ways, but as we know, it's like it's so hard to make that assessment right, because it's taken in such vast amounts of data and then it's mixed it all up into some weird sausage, which it's now putting back out there...
CA: they would argue, and I'm not saying I agree with this, but they would argue that every time technology changes that the rules need to be worked out again, that you've got a situation where you know your words were published, put out freely for anyone on the internet to read, no matter how many more people read your past words, you don't get any more payment. And so the data is out there. They would argue that it's out there as a sort of public resource for fair use...
CC: It's actually much, much deeper than that, right, which is that every nation state in the world has some form of property law, right? Which is, you can't walk into somebody's house and just steal the silver. ... if you can't respect the basic, fundamental underlying principles with which we order society, which is, do not steal, then, then what are you left with? It's like, it's fine. We're going to take your silver, and then if we sell it on eBay, we might give you, like, 5% of it.
CA: Yeah. So I get the anger. There is a difference between a physical object, where if you steal it, that person doesn't have it, versus a digital property, where if you, quote, steal it, you still have it; there's no difference. Well, you know, I think there's a traditional difference in, like, when an idea is out there, it can be built on and amplified, and, like, for example, in the music business, there's constant building on one person's work by the next artist. The kindest way of viewing what they're doing, for them, is to say we're not stealing, we're amplifying.1
I used to think the disagreement in exchanges like this was based upon an ambiguity in the phrase 'public domain'. There is an ordinary, everyday use, according to which if you publish something, if you put it out there for anyone to see or read, then it is in the public domain. This is the sense of the difference between doing something in private and doing it in public, and if you do it in public you cannot complain if someone else looks and responds to what you have done in whatever ways they like. Then there is the legal sense in which being in the public domain means being free of copyright.
Now it looks like Carole Cadwalladr is saying that just because her work is in the public domain in the first sense, that doesn't make it free of copyright, and so the use of it to train AI is theft of her intellectual property. And Chris Anderson's response on behalf of the data scrapers could be seen as saying that an AI is as free to read and learn from her works as a person is, and that all that matters is that she is recompensed should the company make money out of it, thereby discharging the legal liabilities of copyright.
If that is the basis for the disagreement, then the 'Spotify' business model that Chris Anderson proposes does look like a possible resolution. In fact, a form of it is already being used in the syndication deal between Cambridge University Press and various AI companies. We may have many good reasons to critique such business models for their longer-term effects on the creative industries, but it is hard to call them 'theft'.
Learning from vs. using for
However, I think there is something more subtle and invidious going on here. The first sense of 'public domain', where someone puts their work 'out there', making it accessible to almost anyone but retaining copyright, is built upon a distinction between two things we might do when we engage with such works:
1) We might read/view/listen to them, take pleasure, learn things, and possibly respond, either privately or publicly.
2) We might share or redistribute them (in original or modified form), or even make money from them.
Now, traditional forms of copyright allow anyone to do 1) but require permission and usually fees for 2). Creative Commons licenses split up 2), allowing unrestricted sharing of the original while optionally requiring attribution (BY) and restricting commercial use (NC) and the sharing of derivative works (ND). However, the clear conceptual distinction between 1) and 2) remains at the heart of how we think about what is in the public domain in the first sense. And that is because we place a very high value on retaining some control over how the things we share in the public domain are used.2
The real problem with the Spotify business model for AI is that AI blurs that distinction in a subtle way.
Human minds are shaped, usually imperceptibly and in tiny increments, by everything they read, see, and hear. You will probably never recall reading the last 1,000 words, nor make any explicit use of them, but possibly, some time in the future, when talking about the public domain, what you say will be slightly influenced by them.
The same is true of AI models. It is exceptionally unlikely that anyone will ever prompt a genAI model in such a way as to cause it to explicitly draw upon this blog or anything else I have put in the public domain. So even on a Spotify model, I will not get a millionth of a penny. However, the scraping of this blog for training will have incrementally altered the model. And it will (hopefully) be better for that, which means that the responses it gives to prompts will be more useful to its users. The blurring comes because those users are - either directly or indirectly through some means like advertising - paying customers of the company that owns the AI. The learning looks like 1), but the effect is more like 2).
Teaching AI
I have been an academic for 30 years. During that time I have written or spoken millions of words to tens of thousands of readers and listeners. While citations are gratifying, I would be an arrogant fool to think they were the true measure of the value of what I have done. I do this work in the hope that my audiences - whether students, other academics, or the wider public - will be able to think better as a consequence. I even have the temerity to hope that some of my students will have better lives and perhaps make better contributions to society as a consequence. My purpose in life has been to teach in that sense. And I have the great good fortune to be paid for that through the funding of universities.
Now it may be tempting to make the analogy complete and say that the AI models trained on my words may also - to some tiny degree - learn to 'think' better and make better contributions to society as a consequence. So why should I worry about it? I teach people and I teach AI. I am just as unaware of how many people read this blog, let alone who they are, as I am of the AI models trained on it.
The difference which matters to me is that the AI models are owned and controlled by powerful individuals and companies who may - and some palpably do - have very different ideas from mine about the public good and, more generally, about what being ethical looks like. So even if my contribution to the training data gives the model an 'understanding' of my perspective, the way in which that understanding manifests in the outputs of the model will be subject to the control of the owners. It is like educating someone who will blindly follow the orders of an untrustworthy autocrat.
Perhaps some of my students and audiences will end up doing and thinking things I hoped they would not, but at least the choice to do so was made by them after hearing what I had to say. The AI models whose outputs go against what I was hoping to teach will do so not because they disagree, not because they have made a judgement (even if they were capable of so doing), but because their owners have manipulated that output.
We are seeing this happen in public view in real time - Facebook Pushes Its Llama 4 AI Model to the Right, Wants to Present “Both Sides” - but it is in the nature of these models that the reinforcement learning and fine-tuning they go through before release are based on a clear and explicit - but usually hidden from our view - set of values chosen by monopolistic, anti-democratic, rent-seeking profit-maximisers.