Data-scraping, ethics, and copyright
February 2, 2023•828 words
Public social media posts provide a vast and rich source of data for research and commercial purposes. Posts are date- and time-stamped, usually geo-located, have an authorial identifier (this could be a real name or not, but it is an identity which has data value even in the few cases where it cannot be connected to other identities), and usually carry rich information about readers and their reactions. It is hard for researchers and developers to resist this treasure trove.
Who owns the copyright of this material? Some social media platforms may have copyright policies which they may have communicated to their users and which may hold in some jurisdictions, but the overwhelming majority of this data comes with no information about who the copyright holder is, or even whether there is one.
I am no lawyer and intend to make no judgements or assumptions about the legal facts of copyright in such cases. My interest in copyright is ethical. Where copyright is explicit, it is an expression of the author's wishes about how their work (and yes, even a five-word Tweet is something someone created and can be regarded as their work) may be used by others. Their ability to fully and accurately express those wishes through copyright depends upon the details of the current copyright law in the place they publish, but the categories of copyright remain a useful way to understand the range of different wishes an author may have with respect to the content they create.
Thus we can formulate the fundamental ethical question facing anyone who wants to data-mine public social media as:
In the absence of any explicit statement of copyright, what should we assume about the author's wishes for how their work is used?
Many actual data-scraping projects appear to answer this question like this:
- If there is no statement of copyright, we can assume that there is no copyright, and that the author's wishes are consistent with us doing whatever we like with their content.
But that looks obviously unethical. Merely because I have failed to make explicit what you may and may not do with my property does not give you the right to do whatever you wish with it. (Not locking something away does not give implicit permission for others to use it.) The analogy of social media posts to physical property is not perfect by any means, but there remains an ethically relevant parallel: in both cases, the absence of prohibition is not the presence of permission.
Another partial analogy might be sexual harassment and assault: how the victim was dressed, where they went and with whom, their degree of intoxication - none of these ever amount to implicit consent. It is not just that 'No means No' but also that 'Only Yes means Yes'.
These analogies suggest we can construct a strong ethical argument against the free-for-all proposal. So what should we assume about the author's wishes?
Firstly, the content is not being sold by the author. Of course, some content creators and also most platforms try to monetize the readers' attention through advertising, so there is a potential loss of income if the content is republished elsewhere. We need here to distinguish between the platform and the author. If the platform has an advertising-based business model and wants to restrict republishing of content created by its users, then we really are in the area of law, not ethics, at least insofar as all the users are made aware of this situation. An author who has knowingly chosen to publish on such a platform has thereby expressed their wish about how others may use their content.
(Of course, very few of the billions of social media users do make such informed choices and ethically we should care about what their wishes are even though those wishes have been confounded by deceptive business practices.)
It seems then that for the billions of social media authors who are not attempting to make money from their posts, and have not given informed consent for someone else to make money from posts other than by showing advertising in the original place of publication, we can assume:
- The author wishes their work to be part of the shared culture and knowledge, i.e. the Creative Commons.
By the same reasoning, we ought to assume that the author does not wish others to monetize their work without permission:
- The author prohibits commercial use without permission.
And we ought also to assume that the author wishes whatever way they chose to identify themselves in the original post, whether that be a real name, a pseudonym, or just an image, to be retained in any reuse. This also means that if they chose anonymity or pseudonymity, that should be preserved unless permission is sought.
- The author requires the original attribution of authorship to be preserved.
And so we have concluded that - ethically speaking - we should assume that public social media posts with no explicit copyright are CC-BY-NC.