TAO

Creating TAO, the universal syntax for structured communication at cosmic scale.

Fixing CSV

CSV is a simple format that works fairly well for representing tabular data.

However, it does have some problems.

Let's see how we can solve them by creating an even simpler format out of a subset of TAO.

CSV problems and solutions

The trouble, as always, appears in the edge cases.

New lines in data

CSV separates rows of data by newlines. This becomes a problem if we need to include newlines in the data itself.

Solution: use a TAO operator to separate rows. Let's keep the newline character as the operator.

Commas in data

CSV separates data cells by commas. This becomes a problem if we need to include commas in the data itself.

Solution: use a TAO operator to separate cells. Let's keep the comma character as the operator.

Quoted data

To solve the problem of newlines and commas in data, CSV introduces quoted values. This creates the need to escape the quote symbols in the data.

Solution: since we've solved the previous problems differently, we can drop the quoting and get rid of the incidental issue.

Trimming whitespace

It is unclear whether and when whitespace should be trimmed from CSV data, which causes inconsistencies in practice. This is somewhat related to the problem of quoting.

Solution: for simplicity, let's specify that all whitespace is part of the data. The problem of quoting has already been solved by subtracting an unnecessary feature, so there is no further ambiguity.

TOSA

The format we ended up with shall be called TOSA for Tabular Operator-Separated Annotations (using nomenclature consistent with TAO).

Compared to CSV it is much simpler and at the same time less brittle.

There is a fixed cost we have to pay for that -- the separators now take twice as many symbols. However, we have a much simpler parser and have got rid of the major headaches of CSV.

The only special symbol in our format and so the only character that needs to be escaped in the data is the operator meta symbol.
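To make the escaping rule concrete, here is a minimal sketch of a TOSA writer in Python. The names escape_cell and to_tosa are hypothetical, and the sketch assumes that a literal ` in the data is written as the operator applied to itself (``), which is how the operator escapes its own meta symbol in TAO.

```python
OP = "`"  # the operator meta symbol, the only special character in TOSA


def escape_cell(cell: str) -> str:
    # Assumption: a literal backtick in the data is written as "``".
    return cell.replace(OP, OP + OP)


def to_tosa(rows):
    # Cells are separated by "`," and rows by "`" followed by a newline.
    # Commas, newlines, quotes, and whitespace in the data pass through as-is.
    cell_sep = OP + ","
    row_sep = OP + "\n"
    return row_sep.join(
        cell_sep.join(escape_cell(cell) for cell in row) for row in rows
    )
```

Nothing except the backtick needs special treatment, so the writer is a pair of joins plus a single replace.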

This CSV example from Wikipedia:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00

can be translated to TOSA like so:

Year`,Make`,Model`,Description`,Price`
1997`,Ford`,E350`,ac, abs, moon`,3000.00`
1999`,Chevy`,Venture "Extended Edition"`,`,4900.00`
1999`,Chevy`,Venture "Extended Edition, Very Large"`,`,5000.00`
1996`,Jeep`,Grand Cherokee`,MUST SELL!
air, moon roof, loaded`,4799.00
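Reading the format back can be sketched in a few lines of Python. This is only an illustration under stated assumptions: parse_tosa is a hypothetical name, `` is assumed to stand for a literal backtick, and any other operator is rejected because TOSA as described here defines only the cell and row separators.

```python
OP = "`"  # the operator meta symbol


def parse_tosa(text: str):
    """Split TOSA text into a list of rows, each a list of cell strings."""
    rows, row, cell = [], [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != OP:
            cell.append(ch)  # ordinary data, including commas and newlines
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling operator symbol at end of input")
        nxt = text[i + 1]
        if nxt == ",":  # cell separator
            row.append("".join(cell))
            cell = []
        elif nxt == "\n":  # row separator
            row.append("".join(cell))
            cell = []
            rows.append(row)
            row = []
        elif nxt == OP:  # assumed escape for a literal backtick
            cell.append(OP)
        else:
            raise ValueError(f"unknown operator: {OP}{nxt}")
        i += 2
    # The last row is not terminated by a row separator.
    row.append("".join(cell))
    rows.append(row)
    return rows
```

Applied to the translated example above, the last row comes back with the Description cell still containing its newline and commas, without any quoting.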

Summary

Even simple formats like CSV suffer from issues related to escaping.

Equally expressive, yet simpler formats are however possible, as demonstrated here.

The format created herein can be made fully compatible with TAO by introducing two additional special symbols ([ and ]) for building nested structures.

It is also possible to maintain only a single special symbol and build nested structures with additional operators (`[ and `]) instead.

If we pick a control character in place of ` as the operator symbol, we get as close as possible to eliminating the escaping problem from data space.

This however creates a few trade-offs which reduce the practical utility of the syntax. This is why TAO does not go to such an extreme. It aims to achieve a pragmatic equilibrium of ease and simplicity that will make it useful in as many domains as possible, today.

Design routes however remain open for the future.

Demo

A TOSA to CSV converter demo is available on the TAO blog.

Operator, please dial the number

Theory

The grammar of TAO includes the following rule (in BNF):

<operator>   ::= "`" <any>

Where any is any printable Unicode character.

The operator is a very simple, yet very versatile concept, which captures the essence of many syntactical constructs. It has 2 basic roles in TAO:

  1. Escape mechanism for the 3 meta symbols of the grammar, i.e. [, ], and ` (the operator symbol itself).

  2. Extension mechanism for future notations based on TAO as well as for custom ad-hoc notations -- either community-built notations that might become standard or limited-use internal notations. There is thus a risk associated with abuse of this mechanism. In the spirit of TAO, the use of operators should be kept to a minimum.

Furthermore, an important property of the operator is that a single operator meta symbol ` introduces two annotation insertion points at the same depth in the syntax tree -- on either side of the operator. This enables slightly more compact notations than could be achieved otherwise.

This property is also where the operator gets its name. It is however a more primitive and lower-level construct than a programming language operator -- a programming language built on top of TAO could use it to represent operators.
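The following Python sketch shows how the operator sits next to trees and annotations in a parser. Only the operator rule is taken from the grammar above; the rest (a tree is a bracketed sequence, an annotation is a run of non-meta characters) is an assumption made for illustration, and parse_tao and the node tags are hypothetical names. The sketch accepts any character after the operator symbol and does not check that it is printable.

```python
def parse_tao(text: str):
    """Parse text into nested ("note", str), ("op", char), ("tree", list) nodes."""
    pos = 0

    def parse_level(depth: int):
        nonlocal pos
        nodes, note = [], []

        def flush():
            # Close the annotation collected so far, if any.
            if note:
                nodes.append(("note", "".join(note)))
                note.clear()

        while pos < len(text):
            ch = text[pos]
            if ch == "`":
                if pos + 1 >= len(text):
                    raise ValueError("operator symbol without a following character")
                flush()
                nodes.append(("op", text[pos + 1]))  # <operator> ::= "`" <any>
                pos += 2
            elif ch == "[":
                flush()
                pos += 1
                nodes.append(("tree", parse_level(depth + 1)))
            elif ch == "]":
                if depth == 0:
                    raise ValueError("unbalanced ]")
                flush()
                pos += 1
                return nodes
            else:
                note.append(ch)
                pos += 1
        if depth > 0:
            raise ValueError("unclosed [")
        flush()
        return nodes

    return parse_level(0)
```

For the input a`,b the result is three sibling nodes: the annotation a, the operator `, and the annotation b. The annotations on either side of the operator sit at the same depth, which is the property described above.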

Practice

What follows from the grammar definition, but is perhaps worth noting, is that operators are always single-character. A possible way to model multi-character high-level operators could be to use a low-level operator for "quoting" them, e.g.

a`.<=>`.b

Here `. is the low-level quoting operator and <=> is the high-level multi-character quoted operator.

A less generic but more compact way could be:

a`<=>[b]

Here `< could be a low-level operator that introduces a class of high-level operators. The annotation that follows it -- => -- determines a specific high-level operator (<=>). What follows is a tree that contains the right-hand-side argument to the high-level operator.

Better practice

A programming language built in the true spirit of TAO, however, would avoid both of the above solutions. A better one would not use operators in this case at all, e.g.:

[a]<=>[b]

But that's a story for another time.

Streaming spreadsheets

Streaming

In mathematical terms, TAO can be seen as a superset of the Dyck language.

Like the Dyck language, TAO is closed under the operation of concatenation.

This is an important property which has practical implications.

For example, TAO lends itself very well to streaming, unlike syntaxes which do not have this property.

One such syntax is JSON, where concatenating two objects or arrays does not produce a valid syntactical structure.

This can be worked around with various JSON streaming techniques such as ndjson or JSON Lines.

However having this property built into the syntax certainly makes life simpler.
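The difference can be checked with a small Python sketch. The function is_well_formed is a hypothetical, simplified well-formedness check (balanced brackets, with the character after an operator symbol skipped), not a full TAO parser; json_is_valid wraps the standard json.loads.

```python
import json


def is_well_formed(text: str) -> bool:
    """Check bracket balance, skipping the character after each operator symbol."""
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "`":
            if i + 1 >= len(text):
                return False  # operator symbol with nothing after it
            i += 2  # the operator consumes exactly one following character
            continue
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False
        i += 1
    return depth == 0


def json_is_valid(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


first, second = "[a][b]", "[c]"
print(is_well_formed(first + second))  # concatenation stays well-formed
print(json_is_valid("[1]" + "[2]"))    # concatenated JSON arrays are rejected
```

Two well-formed inputs always concatenate to a well-formed input under this check, because each one leaves the depth at zero and no operator is left dangling at its end.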

Spreadsheets

An interesting example from the JSON Lines specification:

["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, false]
["Deloise", "2012A", 19, true] 

Can be translated to Data TAO:

[[Name][Session][Score][Completed]]
[[Gilbert][2013][24][true]]
[[Alexa][2013][29][true]]
[[May][2012B][14][false]]
[[Deloise][2012A][19][true]]

or with more readable formatting:

[[Name]    [Session] [Score] [Completed]]
[[Gilbert] [2013]    [24]    [true]]
[[Alexa]   [2013]    [29]    [true]]
[[May]     [2012B]   [14]    [false]]
[[Deloise] [2012A]   [19]    [true]]

Is it just me or does it look a lot like a spreadsheet?

Interestingly, if we drop Data TAO compliance (but still maintain TAO compliance) this can be compacted even more. For example:

[Name`,Session`,Score`,Completed]
[Gilbert`,2013`,24`,true]
[Alexa`,2013`,29`,true]
[May`,2012B`,14`,false]
[Deloise`,2012A`,19`,true]

or even:

Name`,Session`,Score`,Completed`;Gilbert`,2013`,24`,true`;Alexa`,2013`,29`,true`;May`,2012B`,14`,false`;Deloise`,2012A`,19`,true`;

If we unescape the commas and replace `; with new lines we get CSV. Except TAO is much more generic, elegant, and powerful.
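That conversion can be sketched in Python with the standard csv module. The function name tao_table_to_csv is hypothetical, and the sketch assumes that `, separates cells, `; terminates rows, and `` stands for a literal backtick; any other operator is rejected.

```python
import csv
import io


def tao_table_to_csv(text: str) -> str:
    """Convert the compact operator-separated table form into CSV text."""
    rows, row, cell = [], [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "`":
            cell.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling operator symbol at end of input")
        op = text[i + 1]
        if op == ",":  # cell separator
            row.append("".join(cell))
            cell = []
        elif op == ";":  # row terminator
            row.append("".join(cell))
            cell = []
            rows.append(row)
            row = []
        elif op == "`":  # assumed escape for a literal backtick
            cell.append("`")
        else:
            raise ValueError(f"unknown operator: `{op}")
        i += 2
    if cell or row:  # tolerate a final row without a terminator
        row.append("".join(cell))
        rows.append(row)
    out = io.StringIO()
    # csv.writer adds CSV quoting wherever a cell contains a comma or a quote.
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

The quoting that CSV needs for cells containing commas shows up only on the CSV side, where csv.writer adds it; the operator-separated input never needed it.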

Nested query params

There are many ways to encode lists and nested structures in query params. For example, like this (example from the JSON API spec):

?include=author&fields[articles]=title,body&fields[people]=name HTTP/1.1

A more-or-less equivalent JSON:

{
  "include": "author",
  "fields": {
    "articles": ["title", "body"],
    "people": ["name"]
  }
}

How could that look in Data TAO?

include [author]
fields [
  articles [[title][body]]
  people [[name]]
]

More compactly:

include[author]fields[articles[[title][body]]people[[name]]]
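The mapping from the JSON structure to the compact form can be sketched as a short recursive encoder in Python. The name to_data_tao and the mapping rules are assumptions inferred from the example above (an object becomes a sequence of key[value] pairs, a list becomes a sequence of [item] trees, and a primitive becomes its text with the three meta symbols escaped); this is not a definition of Data TAO.

```python
def escape(text: str) -> str:
    # Escape the three meta symbols by prefixing each with the operator symbol.
    return "".join("`" + ch if ch in "`[]" else ch for ch in text)


def to_data_tao(value) -> str:
    if isinstance(value, dict):
        # Each entry becomes: key[encoded value]
        return "".join(
            f"{escape(str(key))}[{to_data_tao(item)}]" for key, item in value.items()
        )
    if isinstance(value, list):
        # Each item becomes its own tree: [encoded item]
        return "".join(f"[{to_data_tao(item)}]" for item in value)
    return escape(str(value))


params = {
    "include": "author",
    "fields": {"articles": ["title", "body"], "people": ["name"]},
}
print(to_data_tao(params))
# include[author]fields[articles[[title][body]]people[[name]]]
```

The printed string is the compact form shown above, produced without any per-context encoding rules beyond the single escape function.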

There is an even more compact way that is still TAO, but not strictly Data TAO.

Point is: an extremely minimal syntax like TAO can fit in all the contexts that JSON, XML, or other less minimal syntaxes can, as well as in contexts where they are impractical. Thus the syntax is more universal and there is no need to invent, parse, and translate any error-prone ad-hoc solutions. Incidental complexity is reduced.

No escaping

Today I'd like to illustrate one important design aspect of TAO.

It is the problem of how to interleave structure and data that goes together with it within a minimal syntax.

This is related to escaping and more generally in-band signalling.

The problems that appear in most designs are the leaning toothpick syndrome and delimiter collision.

I have explored many solutions to these problems, including inventing elaborate mechanisms that effectively allow in-band redefinition of syntax, only to conclude that they conflict with the basic design goals of simplicity and minimalism.

Thus I settled on the most practical approach that meets these goals, i.e. the syntax has a minimal number of special symbols (3) that, on average, occur relatively rarely. One of the symbols (`) provides a generalized escaping mechanism. In addition, it serves as a separator and an extension mechanism.

This seems to be a pragmatic solution in most contexts, but it is not ideal. So I continue to explore the space in search of variants of TAO that will fit in the remaining contexts.

These variants could, for example, use control characters for the special symbols, virtually eliminating the aforementioned problems.

What is TAO?

TAO (Tree Annotation Operator) is a minimal syntax, akin to S-expressions, that is used to build simple and compatible (sharing the same parser) notations for data (intended for the same purposes as JSON), markup (intended for the same purposes as XML/HTML), code, and various other domain-specific applications.

TAO is based on an extremely simple grammar that encodes only the most generic syntactical constructs: trees (for creating structures), annotations (for encoding primitives associated with the structures), and operators (for escaping and auxiliary purposes).

The idea of TAO is to introduce a very simple and generic universal syntax, on top of which similarly universal notations can be built, standardized, and reused, dramatically reducing the need for countless fundamentally unnecessary translations between incompatible formats. As a consequence, all software that uses TAO becomes more cross-compatible and better able to intercommunicate, and the writers and users of this software become happy, free, and capable of achieving more.

TAO blog and public newsletter

Welcome to the official Listed blog for TAO.

Posts from the official TAO blog will be cross-published here.

The public newsletter is launched together with the blog, for all who wish to be actively kept up to date on the latest official TAO news, some of which then make it to the blog.