This document is background information for a series of projects I am currently (as of early 2020) working on. I am making it public in an attempt to solicit useful comments and advice on the ideas documented here; to the best of my knowledge and belief, it contains no information proprietary to either my present employer or any other commercial, non-BSD-license-using organisation.
Markdown makes a far better standard document format for hand-editable content than HTML. Previous forays into HTML-to-Markdown conversion several years ago were frustrating due to the immaturity of both the available Markdown spec(s) and of the tools then available to do the conversion (notably, now-ancient versions of Pandoc).
Pandoc is a highly capable document converter. It can convert from dozens of formats, to dozens of formats.
Note that there is a maintained Ruby-wrapper Gem for Pandoc on GitHub. That Gem has apparent limitations, however, as noted below.
Pandoc supports several variants of Markdown for both input and output:
- commonmark -- a vendor-neutral, "standardised" Markdown;
- gfm -- GitHub-Flavoured Markdown;
- markdown_github -- a deprecated but more versatile GFM parser (highly deprecated for output);
- markdown -- Pandoc's extended Markdown, comparable but not identical to GFM;
- markdown_mmd -- MultiMarkdown, an extended, longer-lived set of extensions built on top of the historical Markdown core; and
- markdown_strict -- John Gruber's original Markdown, ca. 2004.
Not all variants of Markdown support all features of modern HTML, either because they predate the current language (e.g., `markdown_strict`) or because the variant doesn't support a feature supported by other Markdown variants, such as footnotes.
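Which of these variants a given installation actually accepts depends on the Pandoc version. As a minimal sketch, assuming a Pandoc recent enough to provide the `--list-output-formats` option, the following filters Pandoc's format listing down to the Markdown family. The helper names `markdown_variants` and `installed_markdown_writers` are hypothetical and not part of Pandoc or `pandoc-ruby`.

```ruby
require 'open3'

# Hypothetical helper: pick the Markdown-family names out of a
# newline-separated listing such as `pandoc --list-output-formats` prints.
def markdown_variants(listing)
  listing.lines.map(&:strip).select do |name|
    name.match?(/\A(commonmark|gfm|markdown)(_\w+)?\z/)
  end
end

# Hypothetical helper: ask the installed pandoc for its writers.
# Returns nil if pandoc is missing or the call fails.
def installed_markdown_writers
  out, status = Open3.capture2('pandoc', '--list-output-formats')
  status.success? ? markdown_variants(out) : nil
rescue Errno::ENOENT
  nil
end
```

Calling `installed_markdown_writers` on a machine with Pandoc installed should return an array of format names; checking it for `'markdown_github'` before relying on that writer is one way to fail early.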
Note that valid HTML is, by definition, valid Markdown, although it is not native Markdown; i.e., it has neither the clear legibility nor the easy authoring and modification capabilities that Markdown has. For example,
<h1>A SQL walked into a bar...</h1>
<p>...walked up to two tables and asked, "May I <a href="https://www.databasestar.com/sql-joins/">JOIN</a> you?"</p>
can be rendered in native Markdown as
# A SQL walked into a bar...
...walked up to two tables and asked, "May I [JOIN](https://www.databasestar.com/sql-joins/) you?"
Which is easier for you to read and write? (Note that the document you are reading now was authored and is maintained in GitHub-Flavoured Markdown.)
Any current Markdown-to-HTML converter will take the above Markdown and reproduce the above HTML from it.
To convert from hand-coded "rich" HTML, possibly including links, code blocks, etc., we want to use the `html` input format and the `markdown_github` output format. Note that this output format is deprecated as of Pandoc 2.9.2.1 (and likely somewhat earlier), but it is the most resilient GFM-compliant Markdown output supported by Pandoc. However, the `pandoc-ruby` Gem does not support, and apparently never has supported, the `markdown_github` output format. A means to pass format names directly to Pandoc was previously supported (e.g., `pandoc json` for JSON); a quick read of the commit history does not make it obvious whether that has since been removed. If it has not, then supplying an output format of `'pandoc markdown_github'` may work; that should be investigated further.
To convert from non-generated (hand-coded) HTML to Markdown, one would use a command line such as
$ pandoc -f html -t markdown_github -o output.md input.html
or, with the `pandoc-ruby` Gem,
# at some likely initialisation point
PandocRuby::WRITERS['markdown_github'] = 'pandoc markdown_github'
PandocRuby::READERS['markdown_github'] = 'pandoc markdown_github'
PandocRuby::STRING_WRITERS['markdown_github'] = 'pandoc markdown_github'
# ...
content = PandocRuby.new(['/path/to/input.html'], from: 'html').to_markdown_github
# ... or ...
content = PandocRuby.new(html_content_str, from: 'html').to_markdown_github
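If the gem's writer table turns out not to be extensible this way, one fallback is to bypass the gem and call the CLI directly. The following is only a sketch under that assumption: `pandoc_argv` and `html_to_markdown` are hypothetical helper names, and the only Pandoc options used are the `-f` and `-t` flags from the command line above. With no input file named, Pandoc reads the document from standard input.

```ruby
require 'open3'

# Hypothetical helper: build the argument vector for a pandoc call.
def pandoc_argv(from:, to:)
  ['pandoc', '-f', from.to_s, '-t', to.to_s]
end

# Hypothetical helper: convert an HTML string to Markdown via the CLI.
# Raises on a non-zero exit status, so diagnostics written to stderr are
# never mixed into the converted content.
# Assumption: the installed pandoc still accepts the deprecated
# `markdown_github` writer; pass another name via `to:` if it does not.
def html_to_markdown(html, to: 'markdown_github')
  argv = pandoc_argv(from: 'html', to: to)
  out, err, status = Open3.capture3(*argv, stdin_data: html)
  raise "pandoc failed: #{err.strip}" unless status.success?

  out
end
```

A call such as `html_to_markdown(html_content_str)` would then stand in for the `to_markdown_github` calls above, at the cost of losing the gem's other conveniences.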
For machine-generated or validated HTML, we can instead use the `gfm` (preferred), `commonmark`, or `html` (HTML5/HTML4) output (or, indeed, any of the writers supported by `pandoc-ruby`); e.g.,
content = PandocRuby.new(['/path/to/valid_input.html'], from: 'html').to_gfm
# ... or ...
content = PandocRuby.new(valid_html_content_str, from: 'html').to_commonmark
(CommonMark is, if you didn't know, a "strongly defined, highly compatible specification of Markdown". It supports most commonly-used features of GFM.)
The output format should probably be configurable in some way, since the `pandoc-ruby` Gem returns error messages as though they were output content, which thus needs to be inspected appropriately. Worse, input that should be converted using `markdown_github` will yield more-or-less mangled Markdown when converted with one of the other format specifiers, making it even harder for inspection to determine correctly whether the output has been mangled.
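One possible shape for such configuration, purely as a sketch, is a small lookup keyed on how the HTML was produced. The `WRITER_FOR` table, the `:hand_coded` and `:validated` keys, and the `writer_for` method are all hypothetical names for illustration; none of them exist in `pandoc-ruby`.

```ruby
# Hypothetical mapping from the origin of the HTML to a Pandoc writer name.
WRITER_FOR = {
  hand_coded: 'markdown_github', # deprecated, but most tolerant of "rich" HTML
  validated: 'gfm'               # preferred for machine-generated or validated HTML
}.freeze

# Look up the writer for an origin, letting callers override the defaults.
# Raises ArgumentError for an origin that has no configured writer.
def writer_for(origin, overrides = {})
  WRITER_FOR.merge(overrides).fetch(origin) do
    raise ArgumentError, "unknown HTML origin: #{origin.inspect}"
  end
end
```

For example, `writer_for(:validated, validated: 'commonmark')` selects CommonMark for one call without changing the defaults. Keeping the decision in one place also gives any output inspection a single point at which to learn which writer was used.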
Note also that the `pandoc-ruby` Gem requires that the `pandoc` CLI executable be in the current `PATH`.
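That requirement can be checked up front rather than discovered at conversion time. The sketch below uses a hypothetical helper, `executable_on_path?`, which simply scans the directories in `PATH`. It assumes a POSIX-style system and makes no attempt to handle Windows executable extensions.

```ruby
# Hypothetical helper: true if an executable file called `name` exists in
# one of the directories listed in `path` (defaults to the current PATH).
def executable_on_path?(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).reject(&:empty?).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end
```

An initialiser might then run something like `abort 'pandoc not found on PATH' unless executable_on_path?('pandoc')` before the first conversion is attempted.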