Plain text again

I realised that there is an easier way of describing the difference between the several versions of 'plain text' email.

Three definitions of 'plain text'

What people want to transmit by email is generally some sequence of graphemes, possibly together with ancillary data, such as how they are to be laid out, related images, sounds, and so on. Graphemes are mostly the same as characters, but not completely so: to understand this requires understanding Unicode, which I suspect almost no-one does [1].
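One small way to see the gap between graphemes and characters is the sketch below, which uses Python's standard unicodedata module. The variable names are mine, and this covers only one corner of the problem (composed versus decomposed forms), not graphemes in general.

```python
import unicodedata

# Two spellings of what a reader sees as the single grapheme "é".
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" followed by COMBINING ACUTE ACCENT

# The number of code points differs although the grapheme is the same.
print(len(composed), len(decomposed))

# Plain string comparison works on code points, so the two are unequal.
print(composed == decomposed)

# Normalising to NFC composes the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

So even before any encoding into octets happens, 'the same text' can already be two different sequences of characters.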

To transmit this information in such a way that nothing is lost requires encoding it into some sequence of octets which will not be changed by the MTAs between the sender & receivers of the mail. Call the source text (graphemes, other data) m, then there is some encoding function E such that E(m) is a sequence of octets, say m', and then there should exist a function E⁻¹ such that E⁻¹(m') = E⁻¹(E(m)) = m.
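As a minimal sketch of such a pair of functions, assume m is just a string and take E to be UTF-8 followed by base64. That choice is mine, for illustration only: it is one encoding whose output uses only 7-bit octets, not a description of what any particular MUA does. E_inv stands for the inverse of E.

```python
import base64

def E(m: str) -> bytes:
    """Encode text into octets that all fall in the 7-bit ASCII range."""
    return base64.b64encode(m.encode("utf-8"))

def E_inv(m_prime: bytes) -> str:
    """Undo E exactly, recovering the original text."""
    return base64.b64decode(m_prime).decode("utf-8")

m = "Zoë sent a patch\n\tindented line\n"
m_prime = E(m)

# Nothing is lost: the round trip gives back m, newlines and tabs included.
print(E_inv(m_prime) == m)

# Every octet of m' is below 128, so 7-bit transports can carry it.
print(all(octet < 128 for octet in m_prime))
```

Note that m' looks nothing like m to a human, which is exactly why this E is not the identity.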

Plain text (1). The first definition of 'plain text' simply puts very strong restrictions on what m can be: it can only be a sequence of graphemes together with some very limited layout information (newlines, tabs and so on).

This means that E, and correspondingly E⁻¹, is much easier to write – it no longer has to encode things like images, tabular layouts, font information and so on.

Plain text (2). The second definition of 'plain text' is the same as the first except that it additionally requires that the mapping from graphemes to octets be 1-1, and often that only a very restricted range of octets (7-bit ASCII) can be used.

This definition makes E even simpler to write. For general purposes it's also very obviously racist [2], although for restricted purposes it may be acceptable (for instance as a lowest common denominator, as with the LKML).
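The cost of the 7-bit restriction is easy to demonstrate. In this sketch E_ascii and representable are hypothetical names of my own, and the sample names are arbitrary; the only library behaviour relied on is that Python's "ascii" codec rejects anything outside 7-bit ASCII.

```python
def E_ascii(m: str) -> bytes:
    """An E in the sense of definition (2): one grapheme, one 7-bit octet."""
    return m.encode("ascii")

def representable(m: str) -> bool:
    """True if m can be encoded at all under the 7-bit restriction."""
    try:
        E_ascii(m)
        return True
    except UnicodeEncodeError:
        return False

for name in ["Linus", "Zoë", "山田太郎"]:
    print(name, representable(name))
```

Only the first of those names survives, which is the point of the footnote.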

Plain text (3). The third definition simply requires that E⁻¹ = E because both are the identity. In other words the mapping from graphemes to octets is not only 1-1, it is the identity mapping, and graphemes can therefore be represented directly by octets.

This final meaning is what the LKML means by 'plain text'.


The first confusion seems to be the belief that definition (1) or, more usually, (2) is sufficient: neither is. In particular E is certainly not the identity mapping for (1), because there are hugely more graphemes than there are octet values. For (2) there are no more graphemes than octet values (and in fact fewer), but E is still not in general the identity mapping, because it needs to get the data through MTAs whose restrictions are more severe than those of (2), such as line-length restrictions and restrictions on the content of various lines.
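To make that concrete, here is a rough checker for whether a message could go through with E as the identity. The function name and the particular checks are my own picks for illustration, not a complete list: the 998-octet limit is the per-line maximum from RFC 5322, and the 'From ' check reflects the habit of mbox-style storage of rewriting such lines.

```python
MAX_LINE_OCTETS = 998  # RFC 5322 per-line limit, not counting the CRLF

def problems_with_identity(m: str) -> list[str]:
    """List reasons why m could not safely be sent with E as the identity."""
    problems = []
    if not m.isascii():
        problems.append("contains non-ASCII graphemes")
    for number, line in enumerate(m.split("\n"), start=1):
        if len(line.encode("utf-8")) > MAX_LINE_OCTETS:
            problems.append(f"line {number} is too long")
        if line.startswith("From "):
            # Illustrative content restriction: mbox storage may rewrite this line.
            problems.append(f"line {number} starts with 'From '")
    return problems

print(problems_with_identity("a short ASCII message\n"))
print(problems_with_identity("x" * 1000))
print(problems_with_identity("From here on it gets worse\n"))
```

A message which passes (2) can still fail every one of the per-line checks.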

The second confusion is that E⁻¹ may not in general exist, or may only exist in the sense that it produces text that looks the same to a human: I'll call this F. In particular, in a transmission from MUA1 via some series of MTAs to MUA2, MUA2's F may not be the inverse of MUA1's E, but merely something which is good enough for humans. For instance F(E(m)) may not preserve the line breaks in the original text. This is very often fine for humans, but is not fine for sending patches.
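Here is a sketch of such an F, assuming a receiving MUA which re-wraps each paragraph for display. The reflowing behaviour is an assumption of mine, standing in for whatever a real MUA does, and the patch fragment is made up.

```python
import textwrap

def E(m: str) -> bytes:
    """The sender's E: here simply the identity on ASCII text."""
    return m.encode("ascii")

def F(m_prime: bytes) -> str:
    """A receiver which reflows each paragraph to a comfortable width."""
    text = m_prime.decode("ascii")
    paragraphs = text.split("\n\n")
    return "\n\n".join(textwrap.fill(p, width=72) for p in paragraphs)

patch = "-old_call(a, b);\n+new_call(a, b);\n context line\n"
received = F(E(patch))

# A human sees the same words in the same order...
print(received.split() == patch.split())

# ...but the line structure a patch depends on is gone.
print(received == patch)
print(received)
```

The words all arrive, so a human reader is content, but nothing that cares about lines can use the result.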

I was astonished at the combination of vitriol and complete lack of understanding of the problem in the comments about this, especially when it is completely clear from very casual reading that the LKML requires (3). I should not have been astonished of course.

  1. A lot of the people who think they understand what plain text mail means, but don't, will of course also think they understand Unicode, but won't. 

  2. If this is not obvious, consider the impact on someone of making it impossible for them to represent their own name in email. 
