
DITA XML vs Markdown Syntax and Capabilities Comparison

16 Mar 2023

This article compares the DITA XML standard with Markdown, covering both syntax and features. I tried to write the comparison without bias towards either format. If I missed any DITA XML or Markdown features, it was out of ignorance and not malice. Feedback is always welcome.

Table 1. DITA XML vs Markdown
DITA XML Markdown
Short description
DITA XML: DITA XML is a standard for designing, writing, managing, and publishing information. There are multiple versions of the DITA standard, the most popular one being version 1.3.
Markdown: Markdown is a lightweight markup language that you can use to add formatting elements to plain text documents. There was an effort to standardize Markdown in a specification named CommonMark. There are many Markdown flavors and extensions, most of them sharing a common set of features. The most popular are probably CommonMark and GitHub Flavored Markdown.
Why should I use this format?
DITA XML:
  • You work (or want to work) for a company that manufactures similar products (large possibility of content reuse).
  • You want to obtain multiple output formats (HTML based, quality PDF).
  • You want to work with a standard and be able to change tools if necessary.
  • You want to impose structural validation constraints on the edited content.
Markdown:
  • You work for a company that has mostly one product and publishes the documentation mostly on the web.
  • You work with ticketing systems that use Markdown for styling content.
  • You produce internal documentation or small articles.
Useful resources for learning
DITA XML: Resources for learning DITA with Oxygen
Pros

DITA XML:

  • OASIS Open standard.
  • Advanced support for content reuse either at topic, block, or inline level.
  • Advanced support for filtering (generating multiple similar user guides from the same content).
  • Open-source publishing engine with lots of supported output formats (some free, some commercial) like HTML5, Windows Help, PDF, Word, EPUB, and so on.

Why use DITA

Markdown:

  • Large user base. Familiar to software engineers who use it to write issues.
  • Basic syntax, easy to learn.
  • Easier to read without specialized tools.
  • Offline and online free editing tools.
  • The base syntax is quite easy to edit in a plain text editor.
  • Many open-source static site generators, such as MkDocs or Jekyll.
Cons
DITA XML:
  • Smaller user base.
  • Harder to learn.
  • XML is more verbose than plain text.
  • Visual editing requires the use of a commercial tool like Oxygen.
  • Fewer open-source tools to generate professional-looking outputs.
Markdown:
  • Not all language features are available in the base Markdown "specification". The various flavors differ in syntax, so you probably need to pick one flavor and stick to it.
  • Advanced features like content reuse are not in the base standard, but may be implemented with different syntaxes in various flavors.
  • Static site generators are not compatible with each other: each has its own configuration files and its own way of linking between files.
  • Not many possibilities to assemble multiple Markdown files and publish outputs like PDF or Word.
  • Table cells cannot hold complex content (multiple paragraphs, for example) without falling back to HTML, and list items need carefully indented continuation lines to hold it.
Cross-Compatibility
DITA XML: A DITA Map can refer to a GitHub Flavored Markdown file, and the publishing engine can perform a dynamic conversion from Markdown to DITA while editing.
Markdown: -
Table of contents
DITA XML: Gathering multiple DITA topics in a larger publication and defining the table of contents is done by using DITA Maps.

Working with DITA Maps

Markdown: CommonMark does not define a way to create a table of contents or to aggregate multiple Markdown files into larger publications.

Static site generators each define the table of contents in their own way, usually in YAML, as MkDocs does.
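
To illustrate the DITA side of the two rows above, here is a minimal sketch of a DITA Map that defines a small table of contents and also references a Markdown file. All file names are made up for illustration, and the format="markdown" value assumes a publishing engine that accepts Markdown input, as recent DITA Open Toolkit versions do.

```xml
<!-- Hypothetical map; every href value is a placeholder file name. -->
<map>
  <title>User Guide</title>
  <!-- Nesting topicrefs defines the hierarchy of the table of contents. -->
  <topicref href="introduction.dita">
    <topicref href="installation.dita"/>
    <topicref href="configuration.dita"/>
  </topicref>
  <!-- A Markdown file referenced from the map (assumes Markdown input support). -->
  <topicref href="release-notes.md" format="markdown"/>
</map>
```

The order and nesting of the topic references is what drives the generated table of contents, so restructuring a publication means editing only the map, not the topics.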

Validation
DITA XML:
  • Validation against the DITA specification DTDs/schemas is done when publishing or when editing.
  • Additional validation can be done with Schematron rules.
Markdown:
  • Usually with Markdown, you can look at a live preview while typing to see that everything looks OK.
  • There are various processors that may be used to validate Markdown, for example using a set of JSON rules.
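
As a rough sketch of the Schematron rules mentioned for DITA in the row above, the following schema warns about long titles. The 60-character limit and the message text are arbitrary choices made for this example. The rule matches elements by name, so it would not catch specialized title elements, and the role value is only a convention that some tools read as a severity level.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- Matches every <title> by element name (an assumption that ignores specializations). -->
    <sch:rule context="title">
      <!-- The message is reported when the title is longer than 60 characters. -->
      <sch:assert test="string-length(normalize-space(.)) &lt;= 60" role="warn">
        Titles should not be longer than 60 characters.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```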
Publishing
DITA XML:
  • The DITA Open Toolkit publishing engine comes with default support for publishing DITA Maps to plain HTML5 and PDF, and for customizing those outputs.
  • There are additional open-source plugins to publish to MS Word or EPUB.
  • Other curated open-source plugins are available in the DITA-OT plugin registry.
  • Commercial plugins are available to publish to WebHelp output like Oxygen WebHelp or Fluid Topics.
Markdown: Most publishing libraries rely on the conversion from Markdown to HTML.
  • Many open-source static site generators.
  • Many libraries (JavaScript, Java, Python, etc.) to convert Markdown to HTML.
  • Other conversion types are available using Pandoc.
Translation
DITA XML: Some translation agencies accept DITA XML content directly, or you can convert DITA XML to XLIFF and use a translation system. Each DITA XML topic or map can have an @xml:lang attribute to specify the language in which it is written.

Translating your DITA Project

Markdown: There are various tools like Simpleen that seem to specifically handle Markdown translation.
Extensibility
DITA XML:
  • Possibility to define a new specialization of the DITA vocabulary with new element names.
  • Use the @outputclass attribute on elements to set custom values used when styling the output.
  • Use the DITA <data> element with custom names and values and take them into account with publishing-time customizations.
  • Use the DITA <foreign> element (for example, embed HTML inside it using a custom publishing plugin).
Markdown:
  • Use HTML elements inside Markdown, for example when defining complex tables or when there is no Markdown equivalent.
  • YAML headers.
  • Certain Markdown flavors/extensions allow attributes to be defined on each element.
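
A small sketch of two of the DITA extension points from the row above, @outputclass and <data>, might look like this. The topic ID, the "review-status" name, and the "highlight" class are invented for the example, and what happens to them in the output depends on the publishing customization in use.

```xml
<topic id="extensibility_sample">
  <title>Extensibility sample</title>
  <prolog>
    <!-- Custom name/value pair; the name "review-status" is made up. -->
    <data name="review-status" value="approved"/>
  </prolog>
  <body>
    <!-- HTML-based outputs usually pass @outputclass through as a class value. -->
    <p outputclass="highlight">A paragraph with a custom output class.</p>
  </body>
</topic>
```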
Metadata
DITA XML:
  • The DITA <prolog> element can contain a lot of metadata that is not visible in the published output. Example:
    <topic id="topic_wcj_tgy_5wb">
      <title>The Title</title>
      <prolog>
        <author>The Author</author>
        <metadata>
          <keywords>
            <keyword>one</keyword>
            <keyword>two</keyword>
          </keywords>
        </metadata>
      </prolog>
    </topic>
  • The <indexterm> elements are also considered metadata, as they are used to generate an index.
Markdown:
  • Sometimes, Markdown files contain YAML headers before the actual content that define simple keys and values. Example:
    ---
    title: The Title
    author: The Author
    keywords: [one, two, three, four]
    ---
    # A Heading
    Text body.
Content reuse
Markdown: The base standard has no support for content reuse. Various extensions do exist, for example:
  • Redocly uses HTML <embed> tags with references to Markdown files to reuse entire chunks of Markdown content placed inside a file.
  • Hugo uses special notations named shortcodes.
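
For comparison, the block-level content reuse listed among the DITA pros earlier is typically done with content references (the @conref attribute). The sketch below uses two hypothetical files, shared.dita and upgrade.dita, and the IDs are made up for the example.

```xml
<!-- shared.dita: hypothetical topic that holds reusable content -->
<topic id="shared">
  <title>Reusable content</title>
  <body>
    <note id="backup_warning" type="warning">Back up your data before upgrading.</note>
  </body>
</topic>
```

```xml
<!-- upgrade.dita: pulls in the note by reference instead of copying it -->
<topic id="upgrade">
  <title>Upgrading</title>
  <body>
    <!-- conref syntax: file#topicID/elementID -->
    <note conref="shared.dita#shared/backup_warning"/>
  </body>
</topic>
```

When the second topic is published, the empty <note> is replaced with the content of the referenced one, so the warning text is maintained in a single place.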
Filters
DITA XML: You can use profiling attributes in DITA XML topics or on topic references in a DITA Map. By using a single DITA Map and filtering it differently, you can obtain multiple publications from it.

For example, for the Oxygen user's manual, we obtain distinct publications for "Oxygen XML Editor", "Oxygen XML Author", and "Oxygen XML Web Author" from the same DITA Map.

Markdown: Such a feature may exist, but I am not aware of one in Markdown.
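
To make the DITA filtering described above more concrete, here is a minimal sketch: a list fragment profiled with the @product attribute, followed by a DITAVAL filter file that could be passed to the publishing engine. The attribute values "editor" and "author" and the file names are invented for this example and are not taken from the actual Oxygen user's manual.

```xml
<!-- install.dita (fragment): list items profiled with the @product attribute -->
<ul>
  <li product="editor">Open the Editor preferences.</li>
  <li product="author">Open the Author preferences.</li>
  <li>Restart the application.</li>
</ul>
```

```xml
<!-- author.ditaval: hypothetical filter file for the "author" publication -->
<val>
  <prop att="product" val="editor" action="exclude"/>
  <prop att="product" val="author" action="include"/>
</val>
```

Publishing with this filter would drop the first list item and keep the other two. A second DITAVAL file that excludes "author" instead would produce the other variant from the same source.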
Headings
DITA XML:
  • DITA topics have a <title> element that appears as a heading 1 when published and is also used for the <title> element in the published HTML document.
  • You can nest topics one inside the other, and the generated HTML output will have <h2>, <h3>, etc. for each nested topic, depending on the nesting depth.
  • You can have <section> elements with <title> elements inside a topic (sections cannot be nested one inside the other).
    <topic id="topic_wcj_tgy_5wb">
     <title>Title1</title>
     <body>
      <section>
        <title>Section 1</title>
        <p>paragraph</p>
      </section>
     </body>
     <topic id="inner">
      <title>Inner topic title</title>
     </topic>
    </topic>
Markdown: You can use a number of # characters followed by a space and text to define a new heading. Heading levels do not need to be sequential: you can start with heading level 2 and then have a heading level 1.
# Heading level 1
### Heading level 3
## Heading level 2
....
Block elements
DITA XML: There are multiple topic types like <concept>, <task>, <reference>, and extra topic types can be added using a specialization. The basic block elements are <topic>, <title>, paragraph <p> elements, <codeblock>, lists <ul> <ol>, <table>, <section>, <fig>, <image>, <note>. There are also other block-level elements, depending on the topic type.
Markdown: Paragraphs, tables, lists, images, block quotes, etc.
Inline elements
DITA XML: <b>, <i>, <u>, <sup>, <sub>, and other inline elements with more semantic meaning (like <codeph>, <uicontrol>, <filepath>).
Markdown: Bold and italic. Depending on the Markdown flavor, other inlines like underline, subscript, superscript, strike-through.
Equations / Formulas
DITA XML:
  • MathML equations added directly inside the DITA topics or referenced as separate files.
  • LaTeX equations embedded in DITA <foreign> elements using third-party plugins.
Markdown: GitHub Flavored Markdown and some other flavors support embedding LaTeX equations in Markdown content using $ delimiters.
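
As an illustration of MathML added directly inside a DITA topic, the following fragment embeds the formula A = πr² in a paragraph. It is only a sketch and assumes a DITA 1.3 document type that includes the equation and MathML domains, which provide the <equation-inline> and <mathml> elements.

```xml
<p>The area of a circle is
  <equation-inline>
    <mathml>
      <!-- MathML markup in its own namespace, bound here to the "m" prefix. -->
      <m:math xmlns:m="http://www.w3.org/1998/Math/MathML">
        <m:mi>A</m:mi>
        <m:mo>=</m:mo>
        <m:mi>π</m:mi>
        <m:msup>
          <m:mi>r</m:mi>
          <m:mn>2</m:mn>
        </m:msup>
      </m:math>
    </mathml>
  </equation-inline>.
</p>
```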
Audio/Video
DITA XML: The DITA <object> element can be used to reference audio, video, or iframe content.
Markdown: No official support. You can embed HTML content or add a link to the audio/video instead.
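
A video reference with the DITA <object> element could be sketched as follows. The file path is a placeholder, and the outputclass value and the parameter are assumptions: how (and whether) they are turned into an HTML video player depends on the publishing engine and its plugins.

```xml
<!-- Hypothetical video reference; "media/intro.mp4" is a placeholder path. -->
<object data="media/intro.mp4" type="video/mp4" outputclass="video">
  <!-- Parameter names are interpreted by the publishing engine, if at all. -->
  <param name="controls" value="true"/>
</object>
```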
Tables
DITA XML: The DITA <table> element is based on the CALS table specification. Cells can span multiple rows or columns and can contain block-level content like lists and paragraphs. The table can have header and body rows.
Markdown: Markdown tables are usually written in an ASCII-art-like representation that allows cell content to be aligned left, right, or center. By default, cells can contain only inline content, not block-level elements. If more complex table structures are needed, HTML tables can be inserted directly in Markdown if the Markdown flavor in use supports HTML elements.
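
The DITA table features described in the row above (header rows, spanning cells, block content inside cells) can be sketched like this. The table content is invented for the example.

```xml
<table>
  <title>Supported platforms</title>
  <tgroup cols="2">
    <colspec colname="c1"/>
    <colspec colname="c2"/>
    <thead>
      <row>
        <entry>Platform</entry>
        <entry>Notes</entry>
      </row>
    </thead>
    <tbody>
      <row>
        <!-- Horizontal span from column c1 to column c2. -->
        <entry namest="c1" nameend="c2">Desktop</entry>
      </row>
      <row>
        <!-- morerows="1" makes this cell span one additional row. -->
        <entry morerows="1">Windows</entry>
        <entry>
          <p>First paragraph.</p>
          <ul>
            <li>A list inside a cell.</li>
          </ul>
        </entry>
      </row>
      <row>
        <entry>Second row of the vertical span.</entry>
      </row>
    </tbody>
  </tgroup>
</table>
```

The verbosity is noticeable compared to a Markdown table, which is the trade-off for spans and block content in cells.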
Lists
DITA XML: Ordered <ol>, unordered <ul>, or definition lists <dl>. Other topic types like <task> contain, for example, the <steps> element, which is an ordered list of steps. Each list item can contain block elements like paragraphs, other lists, tables, etc.
Markdown: Ordered and unordered lists. List items usually hold simple content. Block-level content such as nested lists or multiple paragraphs has to be indented under the list item, which is easy to get wrong.

The task list is an interesting extension that shows checkboxes next to each list item.

Other types of lists (definition lists, for example) or list items with complex block-level content can be inserted as HTML directly in Markdown if the Markdown flavor in use supports HTML elements.
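
As a sketch of the <steps> element mentioned on the DITA side of this row, here is a small task topic in which one step carries extra block content. The topic ID and the step text are made up for illustration.

```xml
<task id="install_plugin">
  <title>Installing a plugin</title>
  <taskbody>
    <steps>
      <step>
        <cmd>Download the plugin archive.</cmd>
      </step>
      <step>
        <cmd>Unpack the archive.</cmd>
        <!-- <info> can hold block content such as paragraphs and nested lists. -->
        <info>
          <p>The archive contains:</p>
          <ul>
            <li>The plugin descriptor.</li>
            <li>The stylesheets.</li>
          </ul>
        </info>
      </step>
    </steps>
  </taskbody>
</task>
```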

Links
DITA XML:
  • Internal links (cross references):
    • Link to another topic.
    • Link to a particular element in another topic.
    • Links to web resources.
  • Related links (at the end of each topic):
    • Link to another topic.
    • Link to a particular element in another topic.
    • Links to web resources.
Markdown: Links to other web resources.
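
The DITA link types listed in this row can be sketched in a single topic as follows. The file names, IDs, and the example.com URL are placeholders.

```xml
<topic id="links_sample">
  <title>Links sample</title>
  <body>
    <!-- Cross reference to another topic. -->
    <p>See <xref href="installation.dita"/> for setup details.</p>
    <!-- Cross reference to an element in another topic: file#topicID/elementID. -->
    <p>The limits are listed in <xref href="reference.dita#reference/limits_table"/>.</p>
    <!-- Link to a web resource. -->
    <p>Visit the <xref href="https://www.example.com" format="html" scope="external">example site</xref>.</p>
  </body>
  <!-- Related links are rendered at the end of the topic. -->
  <related-links>
    <link href="troubleshooting.dita"/>
  </related-links>
</topic>
```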
Conclusion
DITA XML:
  • Harder to type in a plain text area; requires DITA editing tools (most of which are not free).
  • Advanced support for structured validation.
  • Advanced support for content reuse and profiling conditional text.
  • Publishing engine allows publishing to multiple output formats like HTML, PDF, and others based on plugins that can be installed.
Markdown:
  • Easy to type manually in a plain text area, but a preview definitely helps.
  • More complex elements need to be inserted as HTML elements.
  • Various Markdown extensions add extra support, for example for content reuse.
  • Mostly targeted towards obtaining web-based HTML content.
  • Does not look like a language intended to do the heavy lifting of producing multiple deliverables and output formats from the same content.