AI's Ethics Overdraft

Most of what passes for AI is machine learning trained on massive data sets. It is that data which has enabled the dramatic advances we see in generative AI like DALL-E and ChatGPT, but also in systems ranging from facial recognition to self-driving vehicles. The data has been collected over the past 15+ years from the internet, which is undoubtedly the largest repository of machine-readable data anyone has ever 'assembled'.

In particular, the advent of Web 2.0 and the exponential increase in social media usage and user-generated content on commercial websites has provided a treasure trove of texts and images from ordinary people and ordinary lives. 'Data-scraping' is the technique used to harvest this data, which once harvested can be categorised and analysed. The underlying principle here, endorsed not only by corporations big and small but also by University Research Ethics Committees the world over, is that if something has been posted publicly, then no consent is needed for it to be scraped and analysed like this. Basically, if anyone can read it, they can do what they like with it.
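To see how low the technical barrier is, here is a minimal sketch of such a scraper in Python. (The URL and the '.post-body' selector are hypothetical, invented for illustration; real platforms add their own markup, rate limits and terms of service, but the principle is the same.)

```python
# A minimal sketch of the scraping described above, assuming a
# hypothetical public page of user posts. Nothing here asks for consent.
import requests
from bs4 import BeautifulSoup

def scrape_public_posts(url: str) -> list[str]:
    """Fetch a publicly readable page and harvest the text of every post."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # '.post-body' is an invented selector standing in for a real site's markup
    return [el.get_text(strip=True) for el in soup.select(".post-body")]

posts = scrape_public_posts("https://example.com/public-feed")
print(f"Collected {len(posts)} posts, ready to be categorised and analysed.")
```

A dozen lines, and anything served to a browser becomes training data.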

Setting aside the fact that - at least within Universities - this principle probably rests on an equivocation on 'public domain', which academics use to mean 'published' and lawyers use to mean 'without copyright', there is something ethically questionable about the practice of data-scraping itself. (It is worth noting that the absence of an explicit statement of copyright does not usually amount to the absence of copyright.)

Let us focus on a paradigm case: someone posting information about themselves, and photographs, on a public Twitter or Instagram account. It is quite clear that they have consented to other users of that social media platform viewing, liking, sharing and commenting on their posts. (Well, assuming the 'public' settings were obvious enough ...) It is also plausible that they have consented to journalists referring to or quoting their posts, and possibly also to academics researching what is posted on social media, for example looking for trends relating to real-world events. While much of that builds on what the original creator has posted, it seems consistent with the original intent of the post. In copyright terms, it would be allowed under fair use in many cases, and under a Creative Commons CC BY-NC (or CC BY-NC-ND) licence in others, suggesting that the additional value has been created rather than extracted.

What about data-scraping? Clearly the data so collected has not been posted with any thought to the possibility that it might be used to train an AI, let alone one which will be exploited for commercial purposes. And it is far from clear that even half of those people would give consent if asked. It would likely violate any fair use exemption and most Creative Commons licences. But most importantly, it violates the dignity of the person who posted in a particular way. For this use of their creation treats them not as a person but as an object which happens to produce content of value to someone who cares not who or what they are. Even the troll who quotes and pours scorn on your post treats you as someone worthy of scorn. A researcher looking for responses to some event may take the post out of context, but its value to them comes in part from a recognition that you are a person contributing to a public discourse. But when a post becomes data to train an AI, the humanity of the data creator is merely instrumental: a property of the source that makes the data more valuable. We are often reassured that the data is anonymised, but in fact it is dehumanised, and that is the ethical problem. The original post was shared as an act of self-expression, but when value is extracted from it as data, the self that created it disappears.

Maybe some, even many, would consent to that, but many would and do feel uncomfortable with it. The process of data-scraping doesn't merely overrule or set aside such responses; it is blind to them.

Data-scraping people's social media posts is not the only ethically dubious data collection practice. Web browsers and mobile apps collect masses of data about people's online activities and interactions - scrolling, touchscreen gestures, typing speed and so on - which can be used in ways the typical data subject could never conceive of, rendering the simulacrum of consent meaningless. As well as collecting data we generate as a by-product, many systems coerce our labour: a CAPTCHA asking users to identify cars or traffic lights in an image extracts free human labour for the tiresome process of training autonomous vehicles, all as a condition of providing a service.
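To make that concrete, here is a sketch - with invented event data, not any vendor's actual pipeline - of how raw keystroke timestamps of the kind such telemetry captures can be distilled into a typing-rhythm profile, the sort of behavioural fingerprint studied in the keystroke-dynamics literature. Nobody clicking through a cookie banner imagines they are consenting to that.

```python
# A hedged sketch: turning keystroke timestamps, as a telemetry script
# might record them, into a crude typing-rhythm profile. The events
# below are invented for illustration.
from statistics import mean, stdev

# (key, timestamp in milliseconds)
events = [("h", 0.0), ("e", 95.0), ("l", 210.0), ("l", 290.0), ("o", 420.0)]

# Inter-key intervals are the raw material of keystroke dynamics
intervals = [t2 - t1 for (_, t1), (_, t2) in zip(events, events[1:])]

profile = {
    "mean_interval_ms": mean(intervals),
    "interval_stdev_ms": stdev(intervals),
    "keys_per_second": 1000.0 * len(intervals) / (events[-1][1] - events[0][1]),
}
print(profile)  # a behavioural fingerprint the user never knowingly provided
```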

If this is right, and becomes widely recognised and accepted in the near future, then AI has a problem: the problem we might call an 'Ethics Overdraft'. Existing AI, and anything built upon it, was only made possible by practices which are ethically unacceptable. By not thinking through the ethical issues carefully enough at the time, the organisations responsible, corporate and academic, have effectively postponed the ethical assessment of their work. That is a form of borrowing, and society will ask for the loan to be paid back some time.

On a more positive note, this creates a new opportunity for ethical organisations to get ahead. If they start collecting data in an ethically acceptable way, with consent and recognition of 'ownership' (in a sense that needs some care) by the data creators, then when the ethics bailiffs come knocking on the doors of Big Tech, these smaller, responsible, probably federated, organisations will have the commercial and research advantage. It is a bit like investing in renewable energy: eventually the oil will run out, or fossil fuels will be so heavily restricted that renewables become the 'new oil'. Similarly, we can expect ethically sourced data to drive the AI of the 2030s, so prepare for it now.

