github-flavored-markdown-to-html

A user-friendly Python module and command-line frontend for converting markdown to html. By default it uses GitHub's online markdown-to-html API (which requires an internet connection), but it also comes with an option for offline conversion (which closely imitates GitHub's behavior), and any other Python or command-line tool can be plugged into it as well. Whatever you use it with is automatically extended with a ton of functionality, such as more input and output options, github-flavored CSS, formula support, image downloading, host-ready file and image placement, pdf conversion, and more.

Whilst its main purpose is the creation of static pages from markdown files, for example in conjunction with a static website builder or GitHub Actions if you host on GitHub, it works just as well for any other purpose.

Advantages include:

  • Lets you specify the markdown to convert as a string, as a repository path, as a local
    file name or as a hyperlink.
  • Pulls any images referenced in the markdown files from the web or your local storage and
    places them in a directory relative to your specified website root, so the resulting file structure is host-ready for
    static sites. Multiple arguments allow you to customize the saving locations, but the images will always be
    referenced correctly in the resulting html files. This is especially useful since it mirrors GitHub's behavior of serving cached copies of README images instead of linking to them directly, reducing tracking and possibly downscaling overly large images in the process.
  • Creates all links as root-relative hyperlinks and lets you specify the root directory
    as well as the locations for css and images, but uses smart standard values for
    everything.
  • Supports inline LaTeX formulas (written as $formula$), which GitHub usually doesn't. gh-md-to-html uses
    LaTeX and dvisvgm if they are both installed (advantage: fast,
    requires no internet), and otherwise the Codecogs EqnEditor (advantage: doesn't require
    you to install 3 GB of LaTeX libraries) to achieve this.
  • Supports exporting to pdf with or without Github styling, using the
    pdfkit python module (if it is installed).
  • Tested and optimized to look good when using DarkReader (the
    .js-module as well as the browser extension). This is especially relevant considering that DarkReader doesn't usually
    shift the colors of svg images, and the formulas added by gh-md-to-html's formula support are embedded as inline svg.
    gh-md-to-html ensures that the formulas are the same color as the text, shifted in accordance with DarkReader's
    current/enabled colorscheme.
  • Supports umlauts and other non-ascii-characters in plain text as well as in multiline code blocks, which the github
    REST api usually doesn't.
  • Allows you to choose which tool or module to use at its core for the basic markdown to html conversion.
  • Styles its output with github's README-css (can be turned off).
  • Allows you to choose a width for the box surrounding the text; this can increase readability if you intend to host the
    markdown file stand-alone rather than embedded into a different html file (see
    #25 and
    Wikipedia).
  • Comes with optional support for the use of [[_TOC_]], {:toc} and [toc] at the beginning of an otherwise empty
    line to create a table of contents for the document, like GitLab-flavored markdown does, among others.
  • Comes with an option to compress and downscale all images referenced in the markdown file (does not affect the
    original images) with a specified background color (default is white) for converting RGBA to RGB, and a specified
    compression rate (default is 90). Images with a specified width or height attribute in pixels get scaled down to that
    size to reduce loading time. This helps severely reduce the size of generated pages for markdown files with lots of
    images. There is also an option to save all images in multiple sizes and let the html viewer/browser pick the one
    fitting for the viewport size (using the img srcset attribute), thus making gh-md-to-html the only md-to-html
    converter with builtin srcset support for image load reduction.
  • If two equal images from equal or different sources are referenced in the given markdown file, and both would be saved
    in the same resolution et cetera, both are pointed to the same copy in the generated html to minimize loading
    overhead.
  • Comes with an option to closely imitate GitHub's markdown-to-html-conversion behavior offline!
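
The image-compression option from the list above takes JSON data whose keys (bg-color, progressive, srcset, quality) and defaults are documented in the tool's help text. The following is a minimal sketch of how such a value could be resolved against those documented defaults. It is not the tool's actual implementation, and the helper name resolve_compress_options is hypothetical.

```python
import json

# Defaults as documented in the help text for --compress-images.
DEFAULTS = {"bg-color": "white", "progressive": False, "srcset": False, "quality": 90}
# Documented shortcut: "srcset": true stands for this list of widths.
DEFAULT_SRCSET_WIDTHS = [500, 800, 1200, 1500, 1800, 2000]


def resolve_compress_options(arg):
    """Hypothetical helper: turn a --compress-images value into a full options dict.

    Accepts True (all defaults), a JSON string, or a dict.
    Returns None if compression is disabled.
    """
    if isinstance(arg, str):
        arg = json.loads(arg)  # JSON data as passed on the command line
    if arg is None or arg is False:
        return None  # option not used: no compression
    if arg is True:
        arg = {}  # True means "use all default values"
    options = {**DEFAULTS, **arg}  # omitted keys fall back to the defaults
    if options["srcset"] is True:
        options["srcset"] = list(DEFAULT_SRCSET_WIDTHS)
    return options


print(resolve_compress_options('{"quality": 75, "srcset": true}'))
```

On the command line, the same JSON would be passed as the value of -e, for example -e '{"quality": 75, "srcset": true}'.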

Whilst using pandoc to convert from markdown to pdf usually yields more beautiful results (pandoc uses LaTeX, after
all), gh-md-to-html has its own set of advantages when it comes to quickly converting complex files for a homework
assignment or other purposes where reliability weighs more than beauty:

  • pandoc converts .md to LaTeX and then renders it to pdf, which means that images embedded in the .md are shown where
    they fit best in the .pdf and not, as one would expect from a .md-file, exactly where they were embedded.
  • pandoc's pandoc-flavored markdown supports formulas; however, some specific rules apply regarding the amount of
    whitespace surrounding the $-signs and what characters the formula may start with. These rules do not apply in some
    common markdown editors like MarkText, though, which leads to lots of frustration when formulas that worked in the
    editor don't work anymore when converting with pandoc (MarkText's own export-to-pdf-function sometimes fails on
    formula-heavy files without an error message, though, which makes it even less reliable). The worst part is that,
    whenever pandoc fails converting .md to .pdf because of this, it shows the line number of the error based on the
    intermediate .tex-file instead of the input .md-file, which makes it difficult to find the problem's root.
    As you might have guessed, gh-md-to-html couldn't care less about the amount of whitespace you start your formulas
    with, leaving the decision up to you.
  • pandoc supports multiple markdown flavors. The sole formula-supporting one of these is pandoc-flavored markdown, which
    comes with some quite specific requirements regarding the amount of trailing whitespace before a sub-list in a nested
    list, and other requirements to create multi-line bullet point entries. These requirements are not fulfilled by many
    markdown editors (such as MarkText) and not required by many other markdown flavors, causing pandoc to not render
    multiline bullet point entries and nested lists correctly in many cases.
    gh-md-to-html, on the other hand, supports both nested lists as you would expect them and formulas, taking
    the burden of having to edit entire markdown files to make them work with pandoc's md-to-html-conversion off your
    shoulders.

To sum it up, pandoc's md-to-pdf-conversion behaves quite unusually when it comes to images, nested lists, multiline bullet
point entries, or formulas, and gh-md-to-html does not.

Installation

Use pip3 install gh-md-to-html to install directly from the Python Package Index, or python3 -m pip install ... if
you are on Windows.

Or use

git clone https://github.com/phseiff/github-flavored-markdown-to-html.git
cd github-flavored-markdown-to-html
pip3 install .

to clone from master and add changes before installing.

Both might require sudo on Linux. If you want to use the optional pdf features, you can additionally run

sudo apt-get install wkhtmltopdf
python3 -m pip install pdfkit

to add pdf support to your installation.
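
To see which optional features are likely to work on a given machine, a small check like the following can help. It is only a sketch: the function name optional_features is made up for this example, and it assumes that the LaTeX executable is called latex on your PATH.

```python
import importlib.util
import shutil


def optional_features():
    """Report which optional gh-md-to-html features look usable on this machine."""
    return {
        # pdf export needs the pdfkit module and the wkhtmltopdf binary.
        "pdf": importlib.util.find_spec("pdfkit") is not None
        and shutil.which("wkhtmltopdf") is not None,
        # Offline formula rendering needs LaTeX and dvisvgm
        # (assumption: the LaTeX executable is named "latex").
        "offline_formulas": shutil.which("latex") is not None
        and shutil.which("dvisvgm") is not None,
    }


print(optional_features())
```

If offline_formulas is reported as False, formulas are still supported through the Codecogs EqnEditor, which needs an internet connection.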

Usage

If you want to access the interface with your command line, you can just supply
gh-md-to-html with the arguments documented in the help text (accessible with
gh-md-to-html -h and shown below). On windows, you must supply python3 -m gh_md_to_html with the corresponding
arguments.
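
If you call the command line interface from a script, it can be convenient to assemble the arguments programmatically. The sketch below uses only flags that appear in the help text (-t, -w, -d, -n, -b); the helper build_command and the example file names are hypothetical.

```python
import shlex


def build_command(md_origin, origin_type="file", website_root=None,
                  destination=None, output_name=None, box_width=None):
    """Assemble an argument list for the gh-md-to-html command line."""
    cmd = ["gh-md-to-html", md_origin, "-t", origin_type]
    optional = [
        ("-w", website_root),
        ("-d", destination),
        ("-n", output_name),
        ("-b", box_width),
    ]
    for flag, value in optional:
        if value is not None:  # leave unset options to the tool's defaults
            cmd += [flag, value]
    return cmd


cmd = build_command("README.md", website_root="docs",
                    output_name="index.html", box_width="25cm")
print(shlex.join(cmd))
# To actually run it (requires gh-md-to-html to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

On Windows, the first list element would be replaced by python3 -m gh_md_to_html, as described above.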

If you want to access the interface via python, you can use

import gh_md_to_html

and then use gh_md_to_html.main() with the same arguments (and default values) you would
supply to the command line interface.

If you only want to imitate the conversion results produced by GitHub's REST API offline, but don't want image caching,
formula support or fancy CSS styling, use

html_as_a_string = gh_md_to_html.core_converter.markdown(your_markdown_as_a_string)

in Python.
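
The help text also states that, when used from Python, the core converter may be any callable that converts markdown to html. The following sketch shows what plugging in such a callable might look like. The toy converter is not a real markdown converter, and the keyword names passed to gh_md_to_html.main() are an assumption based on the long command-line options (--origin-type, --output-name, --core-converter).

```python
import html


def plain_text_converter(markdown_text):
    """Toy core converter (not real markdown): escape the input and wrap it in <pre>."""
    return "<pre>" + html.escape(markdown_text) + "</pre>"


def convert_with_custom_core(markdown_text):
    """Hand the callable to gh-md-to-html; requires the package to be installed.

    Assumption: keyword names mirror the long command-line options.
    """
    import gh_md_to_html  # imported here so the sketch loads without the package

    return gh_md_to_html.main(
        markdown_text,
        origin_type="string",
        output_name="print",  # documented: return the html instead of saving it
        core_converter=plain_text_converter,
    )


print(plain_text_converter("a < b"))
```

Any real converter with the same shape (a string of markdown in, a string of html out) could take the place of plain_text_converter.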

Documentation

All arguments and how they work are documented in the help text of the program, which looks
like this:

usage: __main__.py [-h] [-t {file,repo,web,string}]
                   [-w WEBSITE_ROOT [WEBSITE_ROOT ...]]
                   [-d DESTINATION [DESTINATION ...]]
                   [-i IMAGE_PATHS [IMAGE_PATHS ...]]
                   [-c CSS_PATHS [CSS_PATHS ...]]
                   [-n OUTPUT_NAME [OUTPUT_NAME ...]]
                   [-p OUTPUT_PDF [OUTPUT_PDF ...]] [-s STYLE_PDF]
                   [-f FOOTER [FOOTER ...]] [-m MATH]
                   [-x EXTRA_CSS [EXTRA_CSS ...]]
                   [-o CORE_CONVERTER [CORE_CONVERTER ...]]
                   [-e COMPRESS_IMAGES [COMPRESS_IMAGES ...]]
                   [-b BOX_WIDTH [BOX_WIDTH ...]] [-a TOC]
                   MD-origin [MD-origin ...]

Convert markdown to HTML using the GitHub API and some additional tweaks with
python.

positional arguments:
  MD-origin             Where to find the markdown file that should be
                        converted to html

optional arguments:
  -h, --help            show this help message and exit
  -t {file,repo,web,string}, --origin-type {file,repo,web,string}
                        In what way the MD-origin-argument describes the origin
                        of the markdown file to use. Defaults to file. The
                        options mean: 
                        * file: takes a relative or absolute path to a file
                        * repo: takes a path to a markdown-file in a github
                        repository, such as <user_name>/<repo_name>/<branch-
                        name>/<path_to_markdown>.md 
                        * web: takes an url to a markdown file
                        * string: takes a string containing the files content
  -w WEBSITE_ROOT [WEBSITE_ROOT ...], --website-root WEBSITE_ROOT [WEBSITE_ROOT ...]
                        Only relevant if you are creating the html for a static
                        website which you manage using git or something similar.
                        --website-root is the directory from which you serve
                        your website (which is needed to correctly generate the
                        links within the generated html, such as the link
                        pointing to the css, since they are all root- relative),
                        and can be a relative as well as an absolute path.
                        Defaults to the directory you called this script from.
                        If you intent to view the html file on your laptop
                        instead of hosting it on a static site, website-root
                        should be a dot and destination not set. The reason the
                        generated html files use root-relative links to embed
                        images is that on many static websites,
                        https://foo/bar/index.html can be accessed via
                        https://foo/bar, in which case relative (non-root-
                        relative) links in index.html will be interpreted as
                        relative to foo instead of bar, which can cause images
                        not to load.
  -d DESTINATION [DESTINATION ...], --destination DESTINATION [DESTINATION ...]
                        Where to store the generated html. This path is relative
                        to --website-root. Defaults to "".
  -i IMAGE_PATHS [IMAGE_PATHS ...], --image-paths IMAGE_PATHS [IMAGE_PATHS ...]
                        Where to store the images needed or generated for the
                        html. This path is relative to website-root. Defaults to
                        the "images"-folder within the destination folder.
  -c CSS_PATHS [CSS_PATHS ...], --css-paths CSS_PATHS [CSS_PATHS ...]
                        Where to store the css needed for the html (as a path
                        relative to the website root). Defaults to the
                        "<WEBSITE_ROOT>/github-markdown-css"-folder.
  -n OUTPUT_NAME [OUTPUT_NAME ...], --output-name OUTPUT_NAME [OUTPUT_NAME ...]
                        What the generated html file should be called like. Use
                        <name> within the value to refer to the name of the
                        markdown file that is being converted (if you don't use
                        "-t string"). You can use '-n print' to print the file
                        (if using the command line interface) or return it (if
                        using the python module), both without saving it.
                        Default is '<name>.html'.
  -p OUTPUT_PDF [OUTPUT_PDF ...], --output-pdf OUTPUT_PDF [OUTPUT_PDF ...]
                        If set, the file will also be saved as a pdf file in the
                        same directory as the html file, using pdfkit, a python
                        library which will also need to be installed for this to
                        work. You may use the <name> variable in this value like
                        you did in --output-name.
  -s STYLE_PDF, --style-pdf STYLE_PDF
                        If set to false, the generated pdf (only relevant if you
                        use --output-pdf) will not be styled using github's css.
  -f FOOTER [FOOTER ...], --footer FOOTER [FOOTER ...]
                        An optional piece of html which will be included as a
                        footer where the 'hosted with <3 by github'-footer in a
                        gist usually is. Defaults to None, meaning that the
                        section usually containing said footer will be omitted
                        altogether.
  -m MATH, --math MATH  If set to True, which is the default, LaTeX-formulas
                        using $formula$-notation will be rendered.
  -x EXTRA_CSS [EXTRA_CSS ...], --extra-css EXTRA_CSS [EXTRA_CSS ...]
                        A path to a file containing additional css to embed into
                        the final html, as an absolute path or relative to the
                        working directory. This file should contain css between
                        two <style>-tags, so it is actually a html file, and can
                        contain javascript as well. It's worth mentioning and
                        might be useful for your css/js that every element of
                        the generated html is a child element of an element with
                        id xxx, where xxx is "article-" plus the filename
                        (without extension) of: 
                        * output- name, if output-name is not "print" and not
                        the default value.
                        * the input markdown file, if output- name is "print",
                        and the input type is not string. * the file with the
                        extra-css otherwise. If none of these cases applies, no
                        id is given.
  -o CORE_CONVERTER [CORE_CONVERTER ...], --core-converter CORE_CONVERTER [CORE_CONVERTER ...]
                        The converter to use to convert the given markdown to
                        html, before additional modifications such as formula
                        support and image downloading are applied; this defaults
                        to using GitHub's REST API and can be 
                        * on Unix/ any system with a cmd: a command containing
                        the string "{md}", where "{md}" will be replaced with an
                        escaped version of the markdown file's content, and
                        which returns the finished html. Please note that
                        commands for Unix-system won't work on Windows systems,
                        and vice versa etc. 
                        * when using gh-md-to- html in python: A callable which
                        converts markdown to html, or a string as described
                        above. 
                        * OFFLINE as a value to indicate that gh-md-to-html
                        should imitate the output of their builtin
                        md-to-html-converter using mistune. This requires the
                        optional dependencies for "offline_conversion" to be
                        satisfied, by using `pip3 install
                        gh-md-to-html[offline_conversion]` or `pip3 install
                        mistune>=2.0.0rc1`. 
                        * OFFLINE+ behaves identical to OFFLINE, but it doesn't
                        remove potentially harmful content like javascript and
                        css like the GitHub REST API usually does. DO NOT USE
                        THIS FEATURE unless you need a way to convert secure
                        manually-checked markdown files without having all your
                        inline js stripped away!
  -e COMPRESS_IMAGES [COMPRESS_IMAGES ...], --compress-images COMPRESS_IMAGES [COMPRESS_IMAGES ...]
                        Reduces load time of the generated html by saving all
                        images referenced by the given markdown file as jpeg.
                        This argument takes a piece of json data containing the
                        following information; if it is not used, no compression
                        is done: 
                        * bg-color: the color to use as a background color when
                        converting RGBA-images to jpeg (an RGB-format). Defaults
                        to "white" and accepts almost any HTML5 color-value
                        ("#FFFFFF", "#ffffff", "white" and "rgb(255, 255, 255)"
                        would've all been valid values).
                        * progressive: Save images as progressive jpegs. Default
                        is False. 
                        * srcset: Save differently scaled versions of the image
                        and provide them to the image in its srcset attribute.
                        Defaults to False. Takes an array of different widths or
                        True, which serves as a shortcut for "[500, 800, 1200,
                        1500, 1800, 2000]".
                        * quality: a value from 0 to 100 describing at which
                        quality the images should be saved (this is done after
                        they are scaled down, if they are scaled down at all).
                        Defaults to 90. If a specific size is specified for a
                        specific image in the html, the image is always
                        converted to the right size. If this argument is left
                        empty, no compression is down at all. If this argument
                        is set to True, all default values are used. If it is
                        set to json data and values are omitted, the defaults
                        are also used. If a dict is passed instead of json data
                        (when using the tool as a python module), the dict is
                        used as the result of the json data.
  -b BOX_WIDTH [BOX_WIDTH ...], --box-width BOX_WIDTH [BOX_WIDTH ...]
                        The text of the rendered file is always displayed in a
                        box, like GitHub READMEs and issues are. By default,
                        this box fills the entire screen (max-width: 100%), but
                        you can use this option to reduce its max width to be
                        more readable when hosted stand-alone; the resulting box
                        is always centered. --box-width accepts the same
                        arguments the css max-width attribute accepts, e.g. 25cm
                        or 800px.
  -a TOC, --toc TOC     Enables the use of `[[_TOC_]]`, `{:toc}` and `[toc]`
                        at the beginning of an otherwise empty line to create a
                        table of content for the document. These syntax are
                        supported by different markdown flavors, the most
                        prominent probably being GitLab-flavored markdown
                        (supports `[[_TOC_]]`), and since GitLab displays its
                        READMEs quite similar to how GitHub does it, this option
                        was added to improve support for GitLab- flavored
                        markdown.
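
The reasoning behind root-relative links given in the --website-root help text can be checked with Python's standard library. This short sketch resolves a relative and a root-relative image link against the same page, once addressed with and once without the file name; the image path images/pic.png is made up for the example.

```python
from urllib.parse import urljoin

# The same page, reached with and without the explicit file name.
page_with_file = "https://foo/bar/index.html"
page_without_file = "https://foo/bar"

relative_link = "images/pic.png"            # relative to the current page
root_relative_link = "/bar/images/pic.png"  # relative to the website root

for page in (page_with_file, page_without_file):
    print(page)
    print("  relative:     ", urljoin(page, relative_link))
    print("  root-relative:", urljoin(page, root_relative_link))
```

The relative link resolves to two different addresses depending on how the page was reached, whereas the root-relative link resolves to the same address both times.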


As mentioned above, any image referenced in the markdown file is stored locally and
referenced using a root-relative hyperlink in the generated html. How the converter
guesses the location of the image is shown in the following table, with the type of image link noted at the top and the type of input markdown noted on the left:

| input type | https:// or http:// | abs. filepath | rel. filepath | starting with / (e.g. /image.png) | not starting with / (e.g. image.png) |
|---|---|---|---|---|---|
| -t file | from the address | abs. filepath | rel. filepath (from where the .md-file lies) | - | - |
| -t string | from the address | abs. filepath, but needs confirmation for security reasons | rel. filepath (to where the tool is called from), but needs confirmation for security reasons | - | - |
| -t repo (e.g. username/repo/dir/file.md) | from the address | - | - | username/repo/imagedir/image.png | username/repo/dir/imagedir/image.png |
| -t web (e.g. https://foo.com/bar/baz.md) | from the address | - | - | https://foo.com/image.png | https://foo.com/bar/image.png |
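
The repo and web rows of the table can be reproduced with a few lines of Python. This is a sketch of the lookup rules as the table states them, not the converter's actual code; the function resolve_image is hypothetical, and it follows the table's examples in treating the first two path segments of a repo origin as the repository root.

```python
import posixpath
from urllib.parse import urljoin


def resolve_image(md_origin, origin_type, image_link):
    """Hypothetical helper mirroring the 'repo' and 'web' rows of the table."""
    if image_link.startswith(("http://", "https://")):
        return image_link  # full URLs are always fetched from the address
    if origin_type == "web":
        # Standard URL resolution against the markdown file's URL.
        return urljoin(md_origin, image_link)
    if origin_type == "repo":
        user, repo, *path = md_origin.split("/")
        md_dir = "/".join(path[:-1])  # directory containing the .md file
        if image_link.startswith("/"):
            # Leading slash: relative to the repository root.
            return posixpath.join(user, repo, image_link.lstrip("/"))
        # No leading slash: relative to the markdown file's directory.
        return posixpath.normpath(posixpath.join(user, repo, md_dir, image_link))
    raise ValueError("only 'repo' and 'web' are covered by this sketch")


print(resolve_image("username/repo/dir/file.md", "repo", "imagedir/image.png"))
print(resolve_image("https://foo.com/bar/baz.md", "web", "/image.png"))
```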

GitHub

https://github.com/phseiff/github-flavored-markdown-to-html