md2html.awk

History of a rewrite


For some time, I have maintained an awk(1) script to generate html code from text files with a markdown syntax. What started as a small subset of markdown has grown with time to become an almost compatible implementation. However, the original code has turned into a mess at the same rate that new features have been added, and some early mistakes are starting to make it unbearable. That is why I am rewriting it.

Unfortunately, markdown syntax is not clearly defined. There are corner cases which are not evident from reading the syntax document, and some features - like nested lists - are not documented at all. In order to follow some "standard" the suite at http://git.michelf.com/mdtest/ will be used for testing purposes.

The script is tested with Plan9 awk and gawk. Other awk implementations should normally give the same results.


The rewrite process is briefly explained in progressive steps. Features are added in the least intrusive way that I have found. There are no real explanations in the text, but it helps to read the code - which, I hope, explains itself - and it also serves as a reference for the exact md2html.awk syntax and its differences from markdown.pl.

The newest version can be found at the mercurial repository.

md2html.awk is being heavily tested NOW. It can be considered in β state, so - as this page - it could change at any moment.

HTML Blocks

There are two main sorts of blocks inside a markdown document: html code and markdown formatted text. The first version of the script will only handle html blocks, and will consider the rest of the text as paragraphs separated by blank lines.

md2html.01.awk

First things first.

#!/bin/awk -f

The function printp() finishes a paragraph if there is some text (not only whitespace), and adds the corresponding tag if one is provided.

function printp(tag) {
	if(!match(text, /^[ 	]*$/)){
		if(tag != "")
			print "<" tag ">" text "</" tag ">";
		else
			print text;
	}
	text = "";
}

After the functions are defined, the variables are initialized in the BEGIN block:

BEGIN {
	html = 0;
	hr = 0;
	text = "";
}

The html processing comes next. First, if the beginning of a block is detected, we set the html variable. The hr variable is required to support multi-line hr blocks (hardly worth it, if you ask me, but added to pass one more php test).

# html
/^<(address|blockquote|center|dir|div|dl|fieldset|form|h[1-6r]|\
isindex|menu|noframes|noscript|ol|p|pre|table|ul|!--)/ {
	html = match($0, /((^<hr ?\/?)|(--))>$/) ? 0 : 1;
	print;
	if(html && match($0, /^<hr/))
		hr = 1;
	next;
}

We do the opposite when a closing tag is found.

html && (/(^<\/(address|blockquote|center|dir|div|dl|fieldset|form|h[1-6r]|\
isindex|menu|noframes|noscript|ol|p|pre|table|ul).*)|(--)>$/ ||
(hr && />$/)) {
	print;
	html = 0;
	hr = 0;
	next;
}

When we are inside html blocks, the text will be printed out without further processing.

html {
	print;
	next;
}

Blank lines finish the current paragraph.

# Paragraph	
/^$/ { 
	printp("p");
	next;
}

If not, add the current line to the rest of the paragraph.

# Add text
{ text = (text ? text " " : "") $0; }

Markdown allows the user to be as lazy as possible (that is a feature), so the blank line at the end of the document is not a requirement.

END { printp("p"); }

Nested html is not supported, but it will work correctly if the code inside the parent block is indented.

Horizontal rules and headers

Html is not easy to get right. Let's go for an easy one now.

md2html.02.awk

Support for horizontal rules is straightforward. The !text test tells us that we are not in the middle of a paragraph.

# Horizontal rules
!text && /^ ? ? ?([-*_][ 	]*)([-*_][ 	]*)([-*_][ 	]*)+$/ {
	print "<hr>";
	next;
}
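To see which lines the pattern accepts, the test can be wrapped in a small function and tried on a few samples. This is only a sketch: ishr() and the sample lines are made up for the illustration, the !text condition is left out, and \t is written where the script uses a literal tab.

```awk
# Hypothetical helper (not part of md2html.awk): returns 1 when the
# line matches the horizontal rule pattern used above.
function ishr(line) {
	return (line ~ /^ ? ? ?([-*_][ \t]*)([-*_][ \t]*)([-*_][ \t]*)+$/);
}

BEGIN {
	# Sample lines, separated by semicolons
	n = split("***;- - -;  ___;--;    ***;* item", samples, ";");
	for(i = 1; i <= n; i++)
		print "[" samples[i] "] " (ishr(samples[i]) ? "rule" : "text");
}
```

Only the first three samples should be reported as rules: two symbols are not enough, four leading spaces belong to a code block, and the last sample has other text after the symbol.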

Setext-style headers are not more difficult. The printp() function will be handy here.

# Setext-style Headers
text && /^=+$/ {printp("h1"); next;}
text && /^-+$/ {printp("h2"); next;} 

We will need a new variable (par) to identify our paragraph when we reach a blank line.

# BEGIN {
	...
	par = "p";
}

...

The values it can take are h[1-6] for headers or p for paragraphs. Then, we will use this variable as the argument of the printp() function.

# Atx-Style headers
/^#+/ {
	# Count the leading # symbols; n has to be reset for every header
	n = 0;
	while(sub(/^#/, ""))
		n++;
	n = (n > 6) ? 6 : n;
	par = "h" n;
}

# Paragraph	
/^$/ {
	printp(par);
	par = "p";
	next;
}

...

END { printp(par); }
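The counting of the # symbols can be tried in isolation. The following is a minimal sketch, assuming a helper called atxlevel() (a name invented here) that keeps its counter in a local variable, so nothing has to be reset between headers.

```awk
# Hypothetical helper: level of an atx-style header line, or 0 when
# the line does not start with a # symbol. n is a local variable.
function atxlevel(line,    n) {
	for(n = 0; sub(/^#/, "", line); n++);
	return (n > 6) ? 6 : n;
}

BEGIN {
	print atxlevel("# Title");
	print atxlevel("##### The Incredibly Strange Creatures");
	print atxlevel("######## Too deep");
	print atxlevel("Plain text");
}
```

The third sample has eight symbols, so its level is capped at 6, the deepest header html provides.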

We are printing all the previous text, so we can use multiple-line titles:

Night of the Day of the Dawn of the Son of the Bride of the
Return of the Revenge of the Terror of the Attack of the Evil,
Mutant, Alien, Flesh Eating, Hellbound, Zombified Living Dead
=============================================

Part 2:
In Shocking 2-D
-------------

Atx-style headers could have been added in a similar way to Setext ones (printing the line once we know it is a title), but once more it is more convenient to allow multi-line paragraphs:

##### The Incredibly Strange Creatures Who
Stopped Living and Became Mixed-Up Zombies!

Code

There is a special kind of paragraph in markdown: all the text preceded by a tab or four spaces is considered pre-formatted code, and has to be printed as is, once html (<, &) has been escaped.

For some reason, markdown syntax allows blank lines in the middle of code blocks. In my opinion, a blank line should finish the block. This way, two consecutive code blocks could be inserted, and blank lines could still be added if they were preceded by the right indentation. In any case, markdown syntax will be respected.

md2html.03.awk

A new variable is needed, to know if we are processing code or normal text.

# BEGIN {
	...
	code = 0;
	...
}

If raw html is found, and we are inside a code block, closing it could be a good idea.

/^<(address|blockquote|center|dir|div|dl|fieldset|form|h[1-6r]|\
isindex|menu|noframes|noscript|ol|p|pre|table|ul|!--)/ {
	if(code)
		print "</pre></code>";
	...

We don't want to find rules inside code blocks.

# Horizontal rules
(blank || (!text && !code)) && /^[ 	]*([-*_][ 	]*)([-*_][ 	]*)([-*_][ 	]*)+$/ {
	if(code){
		print "</pre></code>";
		code = 0;
	}
	blank = 0;
	...

The blank variable will be used to know if a blank line was found inside a code block. The beginning of a code block is only possible after a blank line (in other words, when there is no pending text to be printed out). Once we don't find the right indentation any more, we quickly close the code block before processing the current line.

# Code blocks
code && /^$/ { 
	if(blank)
		print "";
	blank = 1;
	next;
}
!text && sub(/^(	|    )/, "") {
	if(blank)
		print "";
	blank = 0;
	if(!code)
		print "<code><pre>";
	code = 1;
	gsub("&", "\\&amp;");
	gsub("<", "\\&lt;");
	print;
	next;
}
code {
	print "</pre></code>";
	code = 0;
}

Once again, we take care of finishing the document.

END {
	if(code){
		print "</pre></code>";
		code = 0;
	}
	printp(par);
}

Quote blocks

Until now, our script supports html blocks, horizontal rules, headings and paragraphs. In the last step, code blocks were added. The rest of blocks supported by markdown are lists and quotes. These blocks have the possibility of containing nested paragraphs, code, or even more lists and quote blocks. In this fourth step, quote blocks will be implemented.

Once more, markdown shows a weird behavior here: a quote block can start in the middle of a paragraph, and no blank line is needed to finish the previous one, so the !text test will not be used.

md2html.04.awk

We don't have to pay anybody to add another variable, so here we go.

# BEGIN {
	...
	quote = 0;
}

Once more, we have to handle a special case. If raw html is found after a blank line inside a block quote, we close the block.

/^<(address|blockquote|center|dir|div|dl|fieldset|form|h[1-6r]|\
isindex|menu|noframes|noscript|ol|p|pre|table|ul|!--)/ {
	if(code)
		print "</pre></code>";
	for(; !text && quote > 0; quote--)
		print "</blockquote>";
	...

The technique we will use is not complicated. First, we check for the level of quoting of the current line.

# Quote blocks
{ for(nquote = 0; sub(/^ ? ? ?> ?/, ""); nquote++); }

Paragraphs can be continued without adding quoting to every line (laziness rule again).

nquote < quote && text { nquote = quote; }

If we are at a different level, finish the current paragraph.

nquote != quote {
	if(code){
		print "</pre></code>";
		code = 0;
	}
	printp(par);
}

And adjust the nesting level.

nquote < quote {
	for(; quote > nquote; quote--)
		print "</blockquote>";
}
nquote > quote {
	for(; quote < nquote; quote++)
		print "<blockquote>";
}

Don't leave open blocks behind.

END {
	if(code){
		print "</pre></code>";
		code = 0;
	}
	printp(par);
	for(; quote > 0; quote--)
		print "</blockquote>";
}

Since we are checking for block-quotes and printing the corresponding tags before anything else, there is nothing we have to do to allow nesting of headers, code and paragraphs inside quote blocks.
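The level counting used for quote blocks can be checked on its own. This is a sketch under a few assumptions: quotelevel() and the global variable rest are names invented for the example, and the laziness rule is not covered.

```awk
# Hypothetical helper: strips the quote markers from a line and
# returns the nesting level. The stripped line is left in the
# global variable rest.
function quotelevel(line,    n) {
	for(n = 0; sub(/^ ? ? ?> ?/, "", line); n++);
	rest = line;
	return n;
}

BEGIN {
	n = quotelevel("> > nested quote");
	print n ": " rest;
	n = quotelevel(">>> deeper");
	print n ": " rest;
	n = quotelevel("plain line");
	print n ": " rest;
}
```

The function call and the use of rest are kept in separate statements, because the order in which awk evaluates the operands of a concatenation should not be relied upon.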

List blocks

To add lists, we have to consider that list and quote blocks can be arbitrarily nested. This means that we will have to take into account not only the nesting level (which will be renamed from quote to nl), but also what type of block we have at every level. An array block[] seems like the most reasonable way to store such information. The rules that define when paragraphs start inside lists are somewhat complicated. I cannot guarantee that this code works with every corner case, but it seems to handle the ones in the markdown test suite quite well.

md2html.05.awk

At the beginning of the document, there is no nesting. The quote variable is gone.

# BEGIN {
	...
	nl = 0;
}

This change has to be reflected on the raw html corner case.

/^<(address|blockquote|center|dir|div|dl|fieldset|form|h[1-6r]|\
isindex|menu|noframes|noscript|ol|p|pre|table|ul|!--)/ {
	if(code)
		print "</pre></code>";
	for(; !text && block[nl] == "blockquote"; nl--)
		print "</blockquote>";
	...

We start removing the indentation.

# List and quote blocks

#   Remove indentation
{
	for(nnl = 0; nnl < nl; nnl++)
		if(((block[nnl + 1] == "ul" || block[nnl + 1] == "ol") && !sub(/^(  ? ?|	)/, "")) || \
		(block[nnl + 1] == "blockquote" && !sub(/^ ? ? ?> ?/, "")))
			break;
}

Or pretending we did, if we apply the laziness rule.

nnl < nl && text { nnl = nl; }

Next, let's see if we are going into deeper levels of quoting. Deeper blocks from now on will go on the nblock variable, which will be used later to compare with block.

#   Quote blocks
{ 
	while(sub(/^( ? ? ?> ?)+/, ""))
		nblock[++nnl] = "blockquote";
}

Blank lines inside lists have a particular meaning. We will need two variables: blank to identify blank lines and newli for new list items.

#   List items
block[nl] ~ /[ou]l/ && /^$/ { 
	blank = 1;
	next;
}
{ newli = 0; }

We are ready now to look for list items.

# Unordered items: the new level is recorded in nblock, as for quotes
(nnl != nl || !text || block[nl] ~ /[ou]l/) && /^ ? ? ?[*+-][ 	]/ {
	sub(/^ ? ? ?[*+-][ 	]/, "");
	nblock[++nnl] = "ul";
	newli = 1;
}
# Ordered items
(nnl != nl || !text || block[nl] ~ /[ou]l/) && /^ ? ? ?([0-9]+[\.-]?)+[ 	]/ {
	sub(/^ ? ? ?([0-9]+[\.-]?)+[ 	]/, "");
	nblock[++nnl] = "ol";
	newli = 1;
}

If one is found, we have to start a new paragraph. The block formed by the closing (/li) and opening (li) list item tag is used here as a separator between list items. The first and the last delimiters will be added together with the list block tags.

newli { 
	if(blank && nnl == nl)
		par = "p";
	blank = 0;
	printp(par);
	if(nnl == nl && block[nl] == nblock[nl])
		print "</li><li>";
}

To finish, we move to the right nesting level in much the same way as we did with block-quotes, this time comparing nblock[] with the block[] array mentioned earlier.

# Close old blocks and open new ones
nnl != nl || nblock[nl] != block[nl] {
	if(code){
		print "</pre></code>";
		code = 0;
	}
	printp(par);
	b = (nnl > nl) ? nblock[nnl] : block[nnl];
	par = (match(b, /[ou]l/)) ? "" : "p";
}
nnl < nl || (nnl == nl && nblock[nl] != block[nl]) {
	for(; nl > nnl || (nnl == nl && nblock[nl] != block[nl]); nl--){
		if(match(block[nl], /[ou]l/))
			print "</li>";
		print "</" block[nl] ">";
	}
}
nnl > nl {
	for(; nl < nnl; nl++){
		block[nl + 1] = nblock[nl + 1];
		print "<" block[nl + 1] ">";
		if(match(block[nl + 1], /[ou]l/))
			print "<li>";
	}
}
hr {
	print "<hr>";
	next;
}

...

END {
	if(code){
		print "</pre></code>";
		code = 0;
	}
	printp(par);
	for(; nl > 0; nl--){
		if(match(block[nl], /[ou]l/))
			print "</li>";
		print "</" block[nl] ">";
	}
}
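The recognition of list markers can be tried separately from the nesting logic. The sketch below assumes a helper called listtype() (not part of the script) and assumes that a marker may be preceded by up to three spaces; \t stands for the literal tab used in the script.

```awk
# Hypothetical helper: returns "ul" or "ol" when the line starts with
# a list marker followed by a blank, and "" otherwise.
function listtype(line) {
	if(line ~ /^ ? ? ?[*+-][ \t]/)
		return "ul";
	if(line ~ /^ ? ? ?([0-9]+[.-]?)+[ \t]/)
		return "ol";
	return "";
}

BEGIN {
	n = split("* item;+\titem;12. item;*emphasis*", samples, ";");
	for(i = 1; i <= n; i++)
		print "[" samples[i] "] -> [" listtype(samples[i]) "]";
}
```

The last sample is not a list item, because the symbol is not followed by a space or a tab.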

Unfortunately, the code at this point is not as easy to understand as it once was. The good news is that not much more is needed: from now on we will only have to do replacements on the text string by calling different functions.

In-line spans: code, html, auto-links and backslash escaping

To get our nice html output we only have to replace the in-line elements (links, images, emphasized text, code spans, html tags, ...) and escape html (< and &). Everything will be encapsulated into an inline() function, so it can be reused in other scripts and does not get mixed with block processing.

md2html.06.awk

First, we will define a new eschtml() function, that can also be used for code blocks:

function eschtml(t) {
	gsub("&", "\\&amp;", t);
	gsub("<", "\\&lt;", t);
	return t;
}

...

!text && sub(/^(	|    )/, "") {
	...
	$0 = eschtml($0);
	...
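The doubled backslash in the replacement strings is easy to overlook. In the replacement argument of sub() and gsub(), a bare & stands for the matched text, so it has to be escaped to get a literal ampersand. A short sketch, with made-up sample strings, shows the difference:

```awk
function eschtml(t) {
	gsub("&", "\\&amp;", t);
	gsub("<", "\\&lt;", t);
	return t;
}

BEGIN {
	# Both symbols are escaped, the ampersand first
	print eschtml("if(a < b && c)");

	# Without the backslash, & is replaced by the matched text,
	# so the result is "a <lt; b" instead of "a &lt; b"
	t = "a < b";
	gsub("<", "&lt;", t);
	print t;
}
```

The order of the two gsub() calls matters as well: if < were replaced first, the ampersand of the new &lt; would be escaped again by the second call.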

The function nextil() will be recursively called to process the next inline (or span) element. If there are no special symbols in the text t, we return it as it came.

function nextil(t) {
	if(!match(t, /[`<&\\]/))
		return t;

But if not, we split it in three parts: the text before the tag t1, the tag tag and the text after the tag t2.

	t1 = substr(t, 1, RSTART - 1);
	tag = substr(t, RSTART, RLENGTH);
	t2 = substr(t, RSTART + RLENGTH);

Code spans need special treatment, all the html has to be escaped.

	if(ilcode && tag != "`")
		return eschtml(t1 tag) nextil(t2);

Before proceeding with substitutions, we escape all the special symbols preceded by a backslash.

	# Backslash escaping
	if(tag == "\\"){
		# Only the character right after the backslash can be escaped
		if(match(t2, /^[\\`*_{}\[\]()#+\-\.!]/)){
			tag = substr(t2, 1, 1);
			t2 = substr(t2, 2);
		}
		return t1 tag nextil(t2);
	}

The delimiter can be one or two back-ticks; if two back-ticks are used, a single back-tick is allowed inside the code span.

	# Inline Code
	if(tag == "`"){
		if(match(t2, /^`/)){
			ilcode2 = !ilcode2;
			t2 = substr(t2, 2);
		}
		else if(ilcode2)
			return t1 tag nextil(t2);
		tag = "<code>";
		if(ilcode){
			t1 = eschtml(t1);
			tag = "</code>";
		}
		ilcode = !ilcode;
		return t1 tag nextil(t2);
	}

Auto-link URL matching is poor: we only check for some text (without blank characters) followed by a dot or an at symbol. Unlike in markdown, mail addresses are not encoded as html entities.

	if(tag == "<"){
	# Auto links
		if(match(t2, /^[^ 	]+[\.@][^ 	].+>/)){
			url = eschtml(substr(t2, 1, RLENGTH - 1));
			t2 = substr(t2, RLENGTH + 1);
			linktext = url;
			if(match(url, /@/) && !match(url, /^mailto:/))
				url = "mailto:" url;
			return t1 "<a href=\"" url "\">" linktext "</a>" nextil(t2);
		}

Html tags and special entities are simple substitutions.

	# Html tags
		if(match(t2, /^[A-Za-z\/!][^>]*>/)){
			tag = tag substr(t2, RSTART, RLENGTH);
			t2 = substr(t2, RLENGTH + 1);
			return t1 tag nextil(t2);
		}
		return t1 "&lt;" nextil(t2);
	}
	# Html special entities
	if(tag == "&"){
		if(match(t2, /^#?[A-Za-z0-9]+;/)){
			tag = tag substr(t2, RSTART, RLENGTH);
			t2 = substr(t2, RLENGTH + 1);
			return t1 tag nextil(t2);
		}
		return t1 "&amp;" nextil(t2);
	}
}

Another function will be used to call nextil() the first time.

function inline(t) {
	ilcode = 0;
	ilcode2 = 0;
	
	return nextil(t);
}

We could set the variables and call nextil() directly from printp(), saving some lines of code, but the inline() function will help us to keep our code better organized.

function printp(tag) {
	if(!match(text, /^[ 	]*$/)){
		text = inline(text);
		...
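The split-and-recurse technique of nextil() can be seen more clearly in a reduced version. The sketch below is an assumption-laden simplification, not the function from the script: miniil() is an invented name, it only knows about & and <, it escapes every < instead of looking for tags, and it declares t1, tag and t2 as local variables.

```awk
# Reduced, hypothetical variant of nextil(): splits the text at the
# first special symbol, handles it and recurses on the remainder.
function miniil(t,    t1, tag, t2) {
	if(!match(t, /[<&]/))
		return t;
	t1 = substr(t, 1, RSTART - 1);
	tag = substr(t, RSTART, RLENGTH);
	t2 = substr(t, RSTART + RLENGTH);
	# An ampersand that starts an entity is left untouched
	if(tag == "&" && match(t2, /^#?[A-Za-z0-9]+;/))
		return t1 tag miniil(t2);
	return t1 (tag == "&" ? "&amp;" : "&lt;") miniil(t2);
}

BEGIN {
	print miniil("AT&T &copy; 2 < 3");
}
```

With local variables, a nested call cannot overwrite the values of the caller, which is the reason why extra parameters are the usual way to get locals in awk.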

Inline images and links

We are ready now to add images and links support to the nextil() function. For the moment, we won't worry about referenced links and images, only the inline ones.

md2html.07.awk

Most of the code of the nextil function will remain as it was in the previous step. We only have to modify one of the first regular expressions to check for the new special symbols introduced.

	...
	if(!match(t, /[`<&\[\\]|(\!\[)/))
		return t;

The rest of the code, that will perform the substitutions, is added at the end of the function.

	# Images
	if(tag == "!["){
		match(t2, /^[^\]]+/);
		alt = substr(t2, 1, RLENGTH);
		t2 = substr(t2, RLENGTH + 2);
		match(t2, /^[^\)]+/);
		url = eschtml(substr(t2, 2, RLENGTH - 1));
		t2 = substr(t2, RLENGTH + 2);
		title = "";
		if(match(url, /[ 	]+\".*\"[ 	]*$/)) {
			title = substr(url, RSTART, RLENGTH);
			url = substr(url, 1, RSTART - 1);
			match(title, /\".*\"/);
			title = " title=\"" substr(title, RSTART + 1, RLENGTH - 2) "\"";
		}
		if(match(url, /^<.*>$/))
			url = substr(url, 2, RLENGTH - 2);
		return t1 "<img src=\"" url "\" alt=\"" alt "\"" title " />" nextil(t2);
	}

Links are the same, with the difference that other inline elements can be nested inside the link text.

	# Links
	if(tag == "["){
		match(t2, /^[^\]]+(\[[^\]]+\][^\]]*)*/);
		linktext = substr(t2, 1, RLENGTH);
		t2 = substr(t2, RLENGTH + 2);
		match(t2, /^[^\)]+(\([^\)]+\)[^\)]*)*/);
		url = substr(t2, 2, RLENGTH - 1);
		pt2 = substr(t2, RLENGTH + 2);
		title = "";
		if(match(url, /[ 	]+\".*\"[ 	]*$/)) {
			title = substr(url, RSTART, RLENGTH);
			url = substr(url, 1, RSTART - 1);
			match(title, /\".*\"/);
			title = " title=\"" substr(title, RSTART + 1, RLENGTH - 2) "\"";
		}
		if(match(url, /^<.*>$/))
			url = substr(url, 2, RLENGTH - 2);
		url = eschtml(url);
		return t1 "<a href=\"" url "\"" title ">" nextil(linktext) "</a>" nextil(pt2);
	}

We need the pt2 variable to store the previous t2 value because t2 can change during the call to nextil(linktext).
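The handling of the optional title and of the angle brackets is the same for images and links, so it can be tried in isolation. In this sketch spliturl(), lurl and ltitle are names invented for the example, and only the text between the parentheses is passed in.

```awk
# Hypothetical helper: splits the text found between the parentheses
# into the url and the optional title. The results are left in the
# global variables lurl and ltitle.
function spliturl(s) {
	lurl = s;
	ltitle = "";
	if(match(s, /[ \t]+".*"[ \t]*$/)) {
		ltitle = substr(s, RSTART, RLENGTH);
		lurl = substr(s, 1, RSTART - 1);
		match(ltitle, /".*"/);
		ltitle = substr(ltitle, RSTART + 1, RLENGTH - 2);
	}
	# Angle brackets around the url are optional
	if(match(lurl, /^<.*>$/))
		lurl = substr(lurl, 2, RLENGTH - 2);
}

BEGIN {
	spliturl("http://example.com/ \"Example\"");
	print "url=" lurl " title=" ltitle;
	spliturl("<http://example.com/>");
	print "url=" lurl " title=" ltitle;
}
```

Since awk functions can only return one value, the two results are passed back through global variables here, with the same risk of being overwritten that made pt2 necessary.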

Emphasis

We are finishing our awk markdown implementation. This is an easy step: we only need to add a few more span elements before the last step, in which referenced links and images will be added. Emphasized and strong text have to be processed together, because their delimiters can appear in groups of one (em), two (strong) or three (strong and em).

md2html.08.awk

The set of special symbols is complete now.

	if(!match(t, /[`<&\[*_\\]|(\!\[)/))
		return t;
	...

First we check if we have found an em or a strong delimiter.

	# Emphasis
	if(match(tag, /[*_]/)){
		ntag = tag;
		if(sub("^" tag, "", t2)){

Closing an em span takes precedence over opening a new strong one. This convention allows the use of three consecutive symbols for emphasized strong text.

			if(stag[ns] == tag && match(t2, "^" tag))
				t2 = tag t2;
			else
				ntag = tag tag
		}
		n = length(ntag);
		tag = (n == 2) ? "strong" : "em";

If the symbol is surrounded by spaces, it is not a delimiter, it is just a symbol and must be printed out.

		if(match(t1, / $/) && match(t2, /^ /))
			return t1 ntag nextil(t2);

We can set the stag[] and ns variables and return.

		if(stag[ns] == ntag){
			tag = "/" tag;
			ns--;
		}
		else
			stag[++ns] = ntag;
		tag = "<" tag ">";
		return t1 tag nextil(t2);
	}

PHP markdown supports a fancier syntax. For example, this input

***Zombie** Attack*

will generate the right output

<em><strong>Zombie</strong> Attack</em>

This behavior could be implemented in future versions, but for the moment I don't see much need for it, since you could always use one of these input lines:

_**Zombie** Attack_
*__Zombie__ Attack*

References

Links and images by reference are the last stone. We will need to buffer the output until all the pending references are known (but it will be printed as soon as possible, so that md2html.awk can still be used as a filter connected to a pipe). We will use the otext variable to store the output text, and will call our custom oprint() function instead of the awk built-in print. We will also use a ref[] array to store the referenced links.

md2html.09.awk

We will start adding our custom print: oprint().

function oprint(t){
	if(nr == 0)
		print t;
	else
		otext = otext "\n" t;
}

Another function will be used to replace the references in otext. To identify the references, we need some kind of special mark. Since the text that arrives at otext is almost ready-to-use html, it should never contain two < symbols together, so << will be our mark.

function subref(id){
	for(; nr > 0 && sub("<<" id, ref[id], otext); nr--);
	if(nr == 0 && otext) {
		print otext;
		otext = "";
		}
}
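The interplay between the two functions is easier to follow with a small run. The sketch below repeats oprint() and subref() as shown above and feeds them by hand; the reference name home and the example.com address are made up, and the definition is stored in ref[] directly instead of being parsed from the input.

```awk
function oprint(t){
	if(nr == 0)
		print t;
	else
		otext = otext "\n" t;
}

function subref(id){
	for(; nr > 0 && sub("<<" id, ref[id], otext); nr--);
	if(nr == 0 && otext) {
		print otext;
		otext = "";
	}
}

BEGIN {
	nr = 0;
	otext = "";
	oprint("<p>first</p>");		# nothing pending: printed at once
	nr++;				# a link to an unknown reference appears
	oprint("<a href=\"<<home\">home</a>");
	oprint("<p>later</p>");		# buffered behind the pending link
	ref["home"] = "http://example.com/";
	subref("home");			# resolves the mark and flushes the buffer
}
```

Note that sub() treats an & in the replacement as the matched text, so in this sketch an address containing that symbol would have to be escaped before it is stored in ref[].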

The variable nr holds the number of references that are still waiting for a definition. We also initialize (just in case) the output text otext.

BEGIN {
	...
	nr = 0;
	otext = "";
	...

References can appear in images or links.

	# Images
	if(tag == "!["){
		if(!match(t2, /(\[.*\])|(\(.*\))/))
			return t1 tag nextil(t2);
		match(t2, /^[^\]]*/);
		alt = substr(t2, 1, RLENGTH);
		t2 = substr(t2, RLENGTH + 2);
		if(match(t2, /^\(/)){
			# Inline
			...
		}
		else{
			# Referenced
			sub(/^ ?\[/, "", t2);
			id = alt;
			if(match(t2, /^[^\]]+/))
				id = substr(t2, 1, RLENGTH);
			t2 = substr(t2, RLENGTH + 2);
			if(ref[id])
				r = ref[id];
			else{
				r = "<<" id;
				nr++;
			}
			return t1 "<img src=\"" r "\" alt=\"" alt "\" />" nextil(t2);
		}
	}

The implementation is almost identical.

	# Links
	if(tag == "["){
		if(!match(t2, /(\[.*\])|(\(.*\))/))
			return t1 tag nextil(t2);
		match(t2, /^[^\]]*(\[[^\]]*\][^\]]*)*/);
		linktext = substr(t2, 1, RLENGTH);
		t2 = substr(t2, RLENGTH + 2);
		if(match(t2, /^\(/)){
			# Inline
			...
		}
		else{
			# Referenced
			sub(/^ ?\[/, "", t2);
			id = linktext;
			if(match(t2, /^[^\]]+/))
				id = substr(t2, 1, RLENGTH);
			t2 = substr(t2, RLENGTH + 2);
			if(ref[id])
				r = ref[id];
			else{
				r = "<<" id;
				nr++;
			}
			pt2 = t2;
			return t1 "<a href=\"" r "\">" nextil(linktext) "</a>" nextil(pt2);
		}
	}

Just in case there is still any referenced link or image without a definition, we will remove the special symbol << and will print out the text.

END {
	gsub(/<</, "", otext);
	if(otext)
		print otext;
}

The references implementation is not very advanced: it does not pass all the tests in the markdown test-suite, but it should work well enough with simple input. So, for the moment...

We are done!


Of course, we were not done:

What next?

At this point, a professional developer would spend a lot of time testing, fixing bugs, and being completely sure of the quality of his product (ha!). But since I am not a professional, what I do is to release it in β status and hope that somebody will find the bugs and will report them to me. Meanwhile, in the spirit of eating my own dog food, md2html.awk is processing the markdown text in anarchyinthetubes.

Some goals for the long term:

In other words: there is more work to do, it will be ready when it is ready.

The newest version can be found at the mercurial repository.
Comments at:

and yiyu.jgl