Word to DITA Conversion FAQ

29 Nov 2022

How can I fix unrecognized style warnings?

When a Word document is converted, the styles that have no mapping in the Word styles mapping table from the preferences page are converted to simple paragraph elements, and a warning is reported for each of them in the Results view.

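Let's take the following example: I have converted a Word document and I see the following warnings in the Results view:
    Unrecognized "Document Title" style for "p" Word element
    Unrecognized "Keyboard Key" style for "r" Word element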

To add the missing mappings, follow these steps:
  1. Open the Plugins / Batch Documents Converter preferences page in Oxygen.
  2. To fix the Unrecognized "Document Title" style for "p" Word element warning, add a new row to the Word styles mapping table with the following cells:
    1. Type "p" in the Word element cell, because the unrecognized style was found on a Word paragraph.
    2. Type "Document Title" in the Word style cell.
    3. In the HTML elements cell, add a corresponding HTML element. Here, a suitable element is <h1>, the same one used by the default mapping of the "Title" style, so type "h1:fresh" in this cell. The ":fresh" suffix instructs the converter to create a new element every time it finds this kind of paragraph. Without it, the converter tries to reuse elements and combines sequences of paragraphs that have the same style into a single element.
  3. To fix the Unrecognized "Keyboard Key" style for "r" Word element warning, add a new row to the Word styles mapping table with the following cells:
    1. Type "r" in the Word element cell, because the unrecognized style was found on a run of characters (a character style) in Word.
    2. Type "Keyboard Key" in the Word style cell.
    3. In the HTML elements cell, add a corresponding HTML element. Here, a suitable element is <kbd>, the same one used by the default mapping of the "HTML Keyboard" style, so type "kbd" in this cell.
After these steps, the Word styles mapping table should contain these two rows:
    Word element    Word style        HTML elements
    p               Document Title    h1:fresh
    r               Keyboard Key      kbd

For more information about the Word styles mapping configuration, see the Conversions from Word section of the documentation.

How can I configure the styles mapping when the wanted element doesn't exist in HTML?

When configuring the mapping for a custom style, you will often find that no corresponding element exists in HTML, although one exists in DITA.

Let's take the following example: I have a Word document that contains a custom character style named "Filepath". A corresponding element (<filepath>) exists in DITA, but there is none in HTML.

These are the steps that can be applied to handle this case:

  1. Go to the Plugins / Batch Documents Converter preferences page and add the following mapping to the Word styles mapping table:
    r Filepath i.filepath
  2. Convert the Word document to DITA. The characters styled with the Filepath style in Word are converted to <i> elements that have the @outputclass attribute set to "filepath" in the DITA output.
  3. Move the output files into your project, select them, and apply the "Rename element" refactoring operation, using the "//i[@outputclass = 'filepath']" XPath to match the target <i> elements and change them to <filepath> DITA elements.
  4. Apply the "Remove attribute" refactoring operation, using the "//filepath[@outputclass = 'filepath']" XPath to match the target elements and delete their @outputclass attributes.
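
To make the effect of steps 2 to 4 concrete, here is a minimal sketch of how one paragraph might look at each stage. The sentence and the file path in it are invented for illustration, and the exact markup produced by the converter may differ in details.

```xml
<!-- After step 2: converter output for text styled with "Filepath" -->
<p>Open the <i outputclass="filepath">config/settings.xml</i> file.</p>

<!-- After step 3 ("Rename element"): the attribute is still present -->
<p>Open the <filepath outputclass="filepath">config/settings.xml</filepath> file.</p>

<!-- After step 4 ("Remove attribute"): the final DITA markup -->
<p>Open the <filepath>config/settings.xml</filepath> file.</p>
```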

Instead of steps 3 and 4, you can also create a custom refactoring operation that makes both changes, like this:

  1. Create an XSLT file (for example, named batch-converter-post-processing.xsl) that iterates over all elements in the document, finds the <i> elements that have the @outputclass attribute set to "filepath", and replaces them with <filepath> elements without copying the @outputclass attribute:
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">
    
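        <!-- Identity template: copy every attribute and node unchanged by default. -->
        <xsl:template match="@* | node()">
            <xsl:copy>
                <xsl:apply-templates select="@* | node()"/>
            </xsl:copy>
        </xsl:template>
    
        <!-- Replace <i outputclass="filepath"> with <filepath>; only child nodes
             are processed, so the @outputclass attribute is not carried over. -->
        <xsl:template match="i[@outputclass = 'filepath']">
            <xsl:element name="filepath">
                <xsl:apply-templates select="node()"/>
            </xsl:element>
        </xsl:template>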
    </xsl:stylesheet>
  2. Create an XML Refactoring operation descriptor (for example, named batch-converter-post-processing.xml) that references the stylesheet and provides descriptions:
    <?xml version="1.0" encoding="UTF-8"?>
    <refactoringOperationDescriptor xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="http://www.oxygenxml.com/ns/xmlRefactoring" id="op_qzq_y2x_nsb"
        name="Post-processing Batch Documents Converter">
        <description>Post-process the resulting DITA documents from the Word conversion using the Batch Documents Converter add-on.</description>
        <script type="XSLT" href="batch-converter-post-processing.xsl"/>
    </refactoringOperationDescriptor>
  3. Copy these two files into a folder that Oxygen XML Editor scans when it loads custom refactoring operations.
  4. Apply the new custom operation named "Post-processing Batch Documents Converter", which is now available in the list of refactoring operations from the "XML Refactoring" dialog.

How can I configure the styles mapping for paragraphs styled as code blocks?

In Word, a code block is represented as a sequence of paragraphs styled with a custom style that adds a custom font and border.

To add a mapping for this custom style, follow these steps:
  1. Open the Options → Preferences → Plugins → Batch Documents Converter preferences page in Oxygen.
  2. Add a new row to the Word styles mapping table.
  3. Enter "p" in the Word element cell, and enter "Code Snippet" in the Word style cell to match the paragraphs styled with "Code Snippet".
  4. Enter "pre:separator('\n')" in the HTML elements cell. The <pre> element is the corresponding HTML element for these types of paragraphs. Because consecutive "Code Snippet" paragraphs should be merged into one element, the ":fresh" marker is not used. The ":separator('\n')" syntax sets the separator (here, a newline) that is inserted between consecutive paragraphs of the same type when they are merged. If no separator is specified, the result is a <pre> element with a single line of text.
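
The difference between these mapping variants can be sketched with a hypothetical two-paragraph code block. The sample content is invented, and the output shown is inferred from the behavior described above rather than copied from an actual conversion, so treat it as an approximation.

```xml
<!-- Two consecutive "Code Snippet" paragraphs in Word (hypothetical content):
       int x = 1;
       int y = 2;
-->

<!-- Mapping "pre" (no separator): merged into a single line of text -->
<pre>int x = 1;int y = 2;</pre>

<!-- Mapping "pre:separator('\n')": merged, with a newline between paragraphs -->
<pre>int x = 1;
int y = 2;</pre>

<!-- Mapping "pre:fresh": no merging, one element per paragraph -->
<pre>int x = 1;</pre>
<pre>int y = 2;</pre>
```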

With this configuration, each code block sequence is converted to a <pre> element in the DITA output. To obtain <codeblock> elements instead, set a class on the resulting <pre> HTML element by entering "pre.codeblock:separator('\n')" in the HTML elements cell, then create a custom refactoring operation as described in How can I configure the styles mapping when the wanted element doesn't exist in HTML?
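
As a rough sketch of that custom refactoring operation, the stylesheet below follows the same pattern as the <filepath> example. It assumes that the "pre.codeblock" mapping produces <pre> elements with the @outputclass attribute set to "codeblock", by analogy with the "i.filepath" mapping. It would also need its own XML Refactoring operation descriptor that references it, like the one shown for the <filepath> case.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <!-- Identity template: copy everything that is not matched below. -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Assumption: the converter emits <pre outputclass="codeblock">.
         Only child nodes are processed, so @outputclass is dropped. -->
    <xsl:template match="pre[@outputclass = 'codeblock']">
        <xsl:element name="codeblock">
            <xsl:apply-templates select="node()"/>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>
```

Like the original stylesheet, this one does not copy any attributes from the matched element. If the <pre> elements in your output carry other attributes that should be kept, the template would have to copy those explicitly.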