The problems with ‘simple markup’

Recently, I debated the virtues of plain-text lightweight HTML markup languages with Noah Slater and Sean B. Palmer. Their view, which I did not initially share, is that HTML (especially with the advent of HTML5, which makes many things easy again where XHTML complicated them) is sufficiently lightweight on its own. Noah’s opinion is that Markdown and its brethren are both pointless and dangerous, for several reasons. They all require an external script to convert from the lightweight format to HTML before publication. This complicates the process of publishing, often requiring much more scripting to turn the document into a fully valid, styled HTML document. None of them is formally defined, which leaves a great deal of implementation-specific behaviour. And this makes them fragile in the long term, since your documents might well stop rendering as you intended if your favourite implementation ever becomes unsupported.

Their view, which I have shared for a short while, but which I am once again coming to disagree with, is that HTML needn’t be as complicated as people make out. They espouse leaving out all tags that aren’t strictly necessary, such as <html>, <head>, <body>, </p>, etc. For an example, look at the source code of this page. Even I use some things which aren’t strictly needed, out of habit or paranoia.

Here I’ve attempted to document the most painful parts of writing HTML in this way, and to pontificate on possible solutions. Some of these are general complaints about HTML and its implementations; others are specific to why I find HTML such a pain as a writing format. (As a format for laying out a page, it’s not so bad, because most of this doesn’t apply.) I compare most often to Markdown because I’m most familiar with it; I believe most of my comparisons apply equally to other lightweight markup languages.

Typesetting code in HTML is a pain, especially when talking about HTML in HTML

Typesetting code in raw HTML is a complete pain compared to Markdown. In standard Markdown, one simply indents code blocks by a tab (or four spaces) to have them wrapped in <pre><code></code></pre>; most variants support ‘fenced code blocks’, too, which don’t require indentation, but the syntax varies between implementations. For short inline code, wrapping code in `backticks` wraps text in <code></code>.

HTML traditionally requires that all instances of <, >, and & be escaped as entities. HTML5 sensibly removes this requirement when it’s clear from context that these characters aren’t being used as tag delimiters or starting entities, respectively. (It’s still officially disallowed to put a raw < in a context where it doesn’t open a tag, but the official parsing algorithm does what you’d expect, as does every previous implementation of HTML, to my knowledge.)

Talking about HTML in HTML, though, remains a complete pain in the arse. The easiest way is just to replace the opening < of each tag with an &lt;, and the opening & of each entity with &amp;. This is extremely tedious even when talking about short snippets like those above; though I haven’t had to do it yet, I imagine that it would be mind-numbing to fix for every instance in a large code block. Naturally, it can be done by a script in your text editor, but when you want to change the code, you pretty much have to remove all your entities before making changes, then re-add them once you’re done.
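As a rough illustration of the kind of editor script meant here, the following Python sketch escapes a snippet for display and reverses the process before editing. It relies on the standard library’s html.escape and html.unescape; the function names escape_snippet and unescape_snippet are hypothetical, chosen only for this example. Note that html.escape also escapes >, which is harmless but goes slightly beyond the minimum described above.

```python
import html


def escape_snippet(source: str) -> str:
    """Escape &, < and > so the snippet shows up as literal text in HTML."""
    # quote=False leaves " and ' alone; they only matter inside attributes.
    return html.escape(source, quote=False)


def unescape_snippet(escaped: str) -> str:
    """Turn the entities back into raw characters before editing the snippet."""
    return html.unescape(escaped)


if __name__ == "__main__":
    snippet = "<p>Fish &amp; chips</p>"
    escaped = escape_snippet(snippet)
    print(escaped)
    # Unescaping the escaped text gives back the original snippet.
    print(unescape_snippet(escaped) == snippet)
```

Even with a helper like this bound to a key, the escape, edit, re-escape cycle remains; the script only makes each step quicker.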

Even then, it’s a pain to wrap things in <code></code> compared to just backticks; even worse is <pre><code></code></pre>. I’m actually fairly happy to leave inline code spans in normal text, rather than marking them up for a monospaced font; likewise to just use <pre></pre>, leaving out the inner <code></code> tags. But still, <pre> requires that I drop any indentation level in the middle of a page to typeset code, which is unsightly. To do it any other way than I do naturally in Markdown is to cause the markup language I write with to have an effect on the quality of the typography of my work, which goes against all principles of separating design and content.

Footnotes must be marked up explicitly

I like to use this markup for footnotes:

<p>Lorem ipsum dolor sit amet,<sup id="fnref1"><a href="#fn1">1</a></sup> consectetur adipiscing elit.

<!-- at the end of the document -->

<div class="footnotes">
<hr>
<ul>
  <li id="fn1">This is the first footnote. <a href="#fnref1">↩</a>
</ul>
</div>

Marking this up in HTML is a pain. It involves nested tags, one of which is quite long, adding noise to the document; it moves the footnote content well away from the footnote reference; and it needs the ↩ character (U+21A9 LEFTWARDS ARROW WITH HOOK), which I like using but which is tricky to type, as is remembering its numbered entity (admittedly, I could use the word ‘back’ or similar).

Compare this to the usual Markdown syntax, invented (I believe) by the PHP Markdown Extra variant and now common elsewhere. One simply thinks of a name for the footnote, like ‘foo’ (the counterpart of ‘fn1’ in the HTML example), puts it in square brackets with a caret, like [^foo], at the point of reference, then fills in the content of the footnote anywhere later in the document:

[^foo]: This is an example of a footnote

See also my notes on HTML’s lack of named links below, which is related to this complaint.
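To make the comparison concrete, here is a minimal Python sketch of the conversion a Markdown-style processor has to perform to get from [^foo] to the HTML footnote markup shown earlier. It is not how PHP Markdown Extra or any other implementation actually works; the function name expand_footnotes is hypothetical, and the sketch assumes one-line definitions and that each footnote is referenced only once (a second reference would produce a duplicate id).

```python
import re

# A definition line looks like "[^name]: text"; a reference is a bare "[^name]".
DEFINITION = re.compile(r"^\[\^([^\]]+)\]:[ \t]*(.*)$", re.MULTILINE)
REFERENCE = re.compile(r"\[\^([^\]]+)\]")


def expand_footnotes(text: str) -> str:
    """Replace [^name] markers with <sup> links and append the footnote list."""
    # Collect the definitions, then strip those lines from the body.
    notes = {name: content for name, content in DEFINITION.findall(text)}
    body = DEFINITION.sub("", text).rstrip()

    order = []  # footnote names, in order of first reference

    def link(match):
        name = match.group(1)
        if name not in notes:
            return match.group(0)  # leave unknown markers untouched
        if name not in order:
            order.append(name)
        n = order.index(name) + 1
        return f'<sup id="fnref{n}"><a href="#fn{n}">{n}</a></sup>'

    body = REFERENCE.sub(link, body)
    if not order:
        return body

    items = "\n".join(
        f'  <li id="fn{n}">{notes[name]} <a href="#fnref{n}">\u21a9</a>'
        for n, name in enumerate(order, start=1)
    )
    return f'{body}\n\n<div class="footnotes">\n<hr>\n<ul>\n{items}\n</ul>\n</div>\n'


if __name__ == "__main__":
    print(expand_footnotes(
        "<p>Lorem ipsum dolor sit amet,[^foo] consectetur adipiscing elit.\n\n"
        "[^foo]: This is the first footnote.\n"
    ))
```

The numbering, the ids and the back-link arrow are all generated, which is exactly the bookkeeping that has to be done by hand in raw HTML.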

HTML links are very noisy, and can’t be abbreviated with names

Links are the worst part of all markup syntaxes, in my opinion. There is almost no notation which feels natural, and they all feel noisy to some degree. HTML’s, though, is far worse than most lightweight languages, especially in one respect: URLs can’t be ‘named’ for short reference elsewhere in the document. Therefore, the longer the URL, the noisier a paragraph becomes; and if you link to the same page more than once in a document you have to do more work than should be necessary in order to change it later.

Markdown allows you to do this:

This is a [link elsewhere][r] -- who knows where it will lead?

[r]: http://en.wikipedia.org/wiki/Special:Random

Raw HTML, on the other hand, doesn’t have such a convenience.

<p>This is a <a href="http://en.wikipedia.org/wiki/Special:Random">link elsewhere</a> -- who knows where it will lead?

For URLs much longer than 30 characters or so, this quickly becomes very noisy. It’s also especially annoying if, like me, you like to hard-wrap your source code, because it’s hard to know what to do. Put the <a> tag on a separate line? Just let it overflow? Noah recommends against hard-wrapping anyway, so perhaps I should just break this habit …

Ideas for solving these issues

To recap from the top, the desirable properties of a markup language which HTML satisfies are:

  1. You don’t need to run a script in order to convert it to HTML
  2. You can publish it almost directly as a document, without too much additional markup for layout, or for historical reasons
  3. It’s formally defined, with multiple interoperable implementations (and a likelihood that none of them will go away soon, or that new ones will be developed to replace them if they do)
  4. It’s not likely to break in the long term: documents written in the language will probably carry on, working and usable, for hundreds of years

HTML satisfies all of these in a way that no lightweight substitute ever could. Any language which attempts to solve the issues on this page will always require conversion into HTML; for that reason, it can’t be published directly as a document (at least, not fully interactively); all the attempts to create a lightweight formally defined language have failed for various reasons; and, for all these reasons, such a language isn’t likely to continue working in the long term.

Below, though, are some ideas for what can be done about this if we’re willing to sacrifice some or all of these to solve the problems of raw HTML.

Make up faux-HTML tags which can be translated to real HTML with a simple script

This was the first idea I considered. I could invent my own tags, <codeblock> and <codespan> perhaps, which would automatically take care of escaping entities and newlines in <pre>; I could add a tag, perhaps <fn>, for footnotes; I could modify the functionality of the <a> tag to add support for named links. Before publishing, I could just run a simple script on the file to transform away these basic abstractions to raw HTML for browsers to consume.

It would not be too hard to write a script to attack an HTML document with regexps in order to perform these simple transformations; would it be worth it? Of the list above, we lose 1 (you have to run this script before publishing), perhaps gain some of 2 (since the script could add the cruftier markup for you), and lose some of 3 and 4 (while still being based on the formally specified HTML, and easily transformable to it, as soon as your script breaks, a page is un-renderable without recreating the program you wrote). It also still doesn’t really solve the issue of escaping entities for talking about HTML.

Here’s an example of what such markup might look like:

<p>Here is an example paragraph.<fn ref=1> I'll add a <a ref=r>link</a> to some random page. <codespan><p>Here I'm talking about some code.</p></codespan>

<url name=r href="http://en.wikipedia.org/wiki/Special:Random">
<fn id=1>This is an example of a footnote</fn>
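A regexp-based translator for part of this markup might look something like the Python sketch below. It is only an illustration under assumptions: the function name translate and the exact attribute syntax it accepts (unquoted name and ref values, a double-quoted href) are guesses based on the example above, and the <fn> tags are left out to keep it short. Code spans are escaped first so that any tags inside them are not mistaken for real markup later on.

```python
import html
import re

# Patterns for the invented tags; the accepted syntax is an assumption.
CODESPAN = re.compile(r"<codespan>(.*?)</codespan>", re.DOTALL)
URL_DEF = re.compile(r'<url\s+name=(\w+)\s+href="([^"]*)">\n?')
LINK_REF = re.compile(r"<a\s+ref=(\w+)>")


def translate(source: str) -> str:
    """Turn <codespan>, <url name=...> and <a ref=...> into plain HTML."""
    # 1. Escape the contents of code spans and wrap them in <code>.
    source = CODESPAN.sub(
        lambda m: "<code>" + html.escape(m.group(1), quote=False) + "</code>",
        source,
    )

    # 2. Collect the named URLs, then remove their definitions from the text.
    urls = dict(URL_DEF.findall(source))
    source = URL_DEF.sub("", source)

    # 3. Replace each <a ref=name> with a real href; unknown names are kept as-is.
    def resolve(match):
        name = match.group(1)
        if name not in urls:
            return match.group(0)
        return f'<a href="{urls[name]}">'

    return LINK_REF.sub(resolve, source)


if __name__ == "__main__":
    print(translate(
        "<p>I'll add a <a ref=r>link</a> to some random page. "
        "<codespan><p>Here I'm talking about some code.</p></codespan>\n\n"
        '<url name=r href="http://en.wikipedia.org/wiki/Special:Random">\n'
    ))
```

The weakness already noted still shows: a code span that needs to mention </codespan> itself would end early, so the escaping problem is moved rather than solved.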

Don’t use inline markup; use a GUI tool to edit pages

This is a moderately attractive option. I considered a system like the one Ted Nelson wanted for Xanadu, and which was partly implemented in old Macintosh file formats, where the “source” of each page is just plain text, and the markup is stored somewhere different altogether. On the old Macintosh, the ‘markup’ (which usually just described the styling of each part of the document, not the semantic information carried by some HTML tags, or hyperlinks of any kind) was stored in the resource fork, making reference to the plain text information stored in the data fork. I don’t know what was planned for Xanadu, but I imagine it was probably something quite similar.

It’s not necessary to go this length, though: I could simply use a WYSIWYG (or, more practically, WYSIWYM) editor to edit pages in HTML, instead of this mad separation-of-markup-and-content idea. I keep all four of the desirable elements of HTML this way, and, potentially, all my complaints could be irrelevant, since I wouldn’t care about the output markup.

Some problems: all the existing GUI editors suck. I don’t really have the JavaScript (or any other GUI) programming ability right now to create one which doesn’t suck. I also happen to really like writing in BBEdit, though I would not want that to become part-WYSIWYG, as some other text editors have done with their Markdown support.

Use an existing lightweight markup language

Existing markup languages all suffer from problems of their own. The best one out there is Markdown, but even it has many problems shared by all the rest. Notably, it is a ‘lossy’ recoding of HTML syntax, which happens to have the real HTML syntax kind-of available-if-you-want-it, but not in a way that supports the full syntax of HTML.

You can’t express every HTML tag in Markdown’s lightweight syntax. To solve this, Markdown lets you use HTML tags when you need them, and, in most variants, they’re left as untouched source in the output.

Markdown loses on 1; it also loses on 2, since it can only really create document fragments, requiring an external script to add in DOCTYPEs and charset tags and title tags as may be necessary; it’s far from formally defined, with the ‘official’ implementation consisting of a ton of regexp substitutions (I’m fairly sure you could implement a ‘conforming’ Markdown in sed(1), if you wanted), so it loses on 3; and though it will still be readable as text in the long term, it will kind of lose on 4 if you want your links to be clickable and your emphasised text to be emphasised. As far as I can tell, the exact same complaints apply to Textile, reStructuredText, AsciiDoc …
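The external script needed for point 2 does not have to be large. As a minimal sketch, assuming the Markdown converter has already produced an HTML fragment, a Python wrapper along these lines would add the DOCTYPE, charset and title; the function name wrap_fragment is hypothetical. In keeping with the ‘simple markup’ style, it leaves out the optional <html>, <head> and <body> tags.

```python
import html


def wrap_fragment(fragment: str, title: str) -> str:
    """Wrap a converted HTML fragment in the minimum a full document needs."""
    return (
        "<!DOCTYPE html>\n"
        '<meta charset="utf-8">\n'
        # The title is plain text, so any & or < in it must be escaped.
        f"<title>{html.escape(title, quote=False)}</title>\n"
        f"{fragment}\n"
    )


if __name__ == "__main__":
    print(wrap_fragment("<p>Hello, world.", "Fish & chips"))
```

Small as it is, this is still one more moving part between the source file and the published page, which is the substance of the complaint.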

Use a new lightweight markup language, designed to reduce these problems

Avocado was designed before I tried the ‘simple markup’ approach. It seems to score better in many respects than Markdown: instead of allowing embedding a subset of HTML syntax, it recodes it altogether, allowing arbitrary tags to be expressed in its own syntax instead of unsatisfactorily mixing in raw source, as Markdown does. There are still some problems which I’d fix if I were to take the idea further: entities are taken straight from HTML, making it difficult to talk about them (you still have to replace &#x2014; with &amp;#x2014; if you want to talk about the entity itself, rather than the character), so I would add my own Avocado syntax for that, were I to change it; I’d scrap the support for lettered and Roman numeral ordered lists, because of the ambiguity; but on the whole it’s fairly good. You can express any HTML node tree you like in Avocado — you just can’t do it with HTML’s original syntax, unless you wrap a block in {{{ and }}}.

It still suffers from one problem which Markdown et al share: you can’t start and end a span safely outside of a word boundary. My solution to this problem is to ignore it: it’s a rare use case in actual writing; if you really want it, use raw HTML inside a {{{ }}} block. (I have thought of solutions such as using double-percent signs or some other unlikely sequence of characters for this, but no syntax I’ve yet conceived has really stuck.)

Avocado is not yet implemented, because I switched to ‘simple markup’ soon after designing it. Naturally, it loses on 1 again; it almost makes 2, except for DOCTYPE (I’m sure I can fix this, somehow); I could formally define it if I wanted, but for now 3 is null; and it scores the same as Markdown on 4. As for 3, I personally wouldn’t mind the lack of a formal definition, since the way I would use it would be to generate HTML and forget about it (bake, don’t fry), and, since I wrote it, I don’t have to care about anyone else’s data breaking. This is an irresponsible position to take if I ever open-source an Avocado implementation, though, so I would immediately formally define it in that case.