MarkdownSharp and Encoded HTML

During the creation of this site I ran into an issue with MarkdownSharp and encoded HTML.

The Problem

MarkdownSharp encodes any HTML it finds inside code blocks, so that it displays as literal text. Like this:

<span style="color: red;">I am encoded HTML</span>

This becomes a problem if you pass pre-encoded text to MarkdownSharp. It encodes the already encoded HTML within code blocks a second time, so the entities themselves show up on the page. For example:

&lt;span style="color: red;"&gt;I am encoded HTML&lt;/span&gt;

Yeah...not exactly what you expect it to look like.
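To make the double encoding concrete, here is a minimal sketch. It uses .NET's WebUtility.HtmlEncode as a stand-in for the encoding step, which is an assumption on my part: MarkdownSharp has its own encoder and may treat some characters differently. The class and method names are made up for illustration.

```csharp
using System;
using System.Net;

public static class DoubleEncodingDemo
{
    // Stand-in for the encoding step. This is NOT MarkdownSharp's own encoder,
    // just the framework's general-purpose HTML encoder.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    public static void Main()
    {
        string raw = "<b>bold</b>";

        // First pass: what a caller does when it pre-encodes the text.
        string once = Encode(raw);

        // Second pass: what happens when the already encoded text is encoded again.
        string twice = Encode(once);

        Console.WriteLine(once);   // &lt;b&gt;bold&lt;/b&gt;
        Console.WriteLine(twice);  // &amp;lt;b&amp;gt;bold&amp;lt;/b&amp;gt;
    }
}
```

A browser renders the first string as the literal text <b>bold</b>, but it renders the second one as &lt;b&gt;bold&lt;/b&gt;, which is the effect shown above.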

The Fix

There are a couple of options here:

The first and easiest option is a fix that I submitted to the MarkdownSharp team. It is a modified version of the Markdown.cs file that contains a new boolean property called EncodeCodeBlocks. If you set it to false, MarkdownSharp stops encoding HTML within code blocks. The modified file is attached to my linked issue.

The second option is the one I prefer. I leave my EncodeCodeBlocks option set to true and sanitize the HTML instead, in much the same way that Stack Overflow does: they use a whitelist of allowed HTML tags and filter out the rest. This allows users to submit comments that contain "safe" HTML tags and/or Markdown syntax. In the helper further down, the sanitizer runs on the HTML that MarkdownSharp produces. Here is the code I use to sanitize the HTML:

private static Regex _tags = new Regex("<[^>]*(>|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);
private static Regex _whitelist = new Regex(@"^</?(b(lockquote)?|code|d(d|t|l|el)|em|h(1|2|3)|i|kbd|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$|^<(b|h)r\s?/?>$", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);
private static Regex _whitelist_a = new Regex(@"^<a\shref=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)""(\stitle=""[^""<>]+"")?\s?>$|^</a>$", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);
private static Regex _whitelist_img = new Regex(@"^<img\ssrc=""https?://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+""(\swidth=""\d{1,3}"")?(\sheight=""\d{1,3}"")?(\salt=""[^""<>]*"")?(\stitle=""[^""<>]*"")?\s?/?>$", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

/// <summary>
/// sanitize any potentially dangerous tags from the provided raw HTML input using
/// a whitelist based approach, leaving the "safe" HTML tags
/// CODESNIPPET:4100A61A-1711-4366-B0B0-144D1179A937
/// </summary>
public static string Sanitize(string html)
{
    if (String.IsNullOrEmpty(html)) return html;
    string tagname;
    Match tag;
    // match every HTML tag in the input
    MatchCollection tags = _tags.Matches(html);
    for (int i = tags.Count - 1; i > -1; i--)
    {
        tag = tags[i];
        tagname = tag.Value.ToLowerInvariant();
        if (!(_whitelist.IsMatch(tagname) || _whitelist_a.IsMatch(tagname) || _whitelist_img.IsMatch(tagname)))
        {
            html = html.Remove(tag.Index, tag.Length);
            System.Diagnostics.Debug.WriteLine("tag sanitized: " + tagname);
        }
    }
    return html;
}
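To show what the sanitizer does to a couple of inputs, here is a small self-contained sketch of the same technique. It reuses the tag-matching pattern from above but swaps in a much shorter whitelist (a handful of attribute-free tags, no links or images), purely to keep the example readable. The SanitizeDemo class name and the sample inputs are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SanitizeDemo
{
    // Same tag-matching pattern as the code above.
    private static readonly Regex Tags = new Regex(
        "<[^>]*(>|$)",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture);

    // Reduced whitelist, for illustration only: a few tags, no attributes allowed.
    private static readonly Regex Whitelist = new Regex(
        @"^</?(b|code|em|i|p|pre|strong)>$",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture);

    public static string Sanitize(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        MatchCollection tags = Tags.Matches(html);

        // Walk backwards so the indexes of earlier matches stay valid
        // after each removal.
        for (int i = tags.Count - 1; i >= 0; i--)
        {
            Match tag = tags[i];
            if (!Whitelist.IsMatch(tag.Value.ToLowerInvariant()))
            {
                html = html.Remove(tag.Index, tag.Length);
            }
        }
        return html;
    }

    public static void Main()
    {
        // The <script> tags are removed; the text between them is left in place.
        Console.WriteLine(Sanitize("<p>Hello <script>alert(1)</script><b>world</b></p>"));
        // <p>Hello alert(1)<b>world</b></p>

        // A tag with an attribute does not match the whitelist and is removed,
        // while its plain closing tag matches and is kept.
        Console.WriteLine(Sanitize("<B onclick=\"evil()\">hi</B>"));
        // hi</B>
    }
}
```

Two things stand out here: only the tags themselves are stripped, never the text between them, and removing an opening tag can leave its closing tag behind. Whether unbalanced tags need a separate cleanup step depends on how the output is used.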

I dropped the Sanitize method into a "Common Utilities" project in my MVC solution. I have an HtmlHelper in my project that lets me call MarkdownSharp really easily. I modified it to run MarkdownSharp's output through the sanitize function:

public static class HtmlHelpers
{
    public static MvcHtmlString Markdown(this HtmlHelper helper, string text)
    {
        string html = MarkdownUtils.FormatMarkdown(text);
        html = CommonUtils.Sanitize(html);
        return MvcHtmlString.Create(html);
    }
}

All of this is part of a project I call MvcUtilities. I am planning on open-sourcing this project after I add some more things to it. It will contain a lot of little goodies. Be sure to check out the Projects section often; I'll add it there eventually. That section isn't done yet but it shouldn't be too much longer.