Word to DITA Conversion FAQ

Question 1

How can I fix unrecognized style warnings?

Answer

When converting a Word document, the styles that don't have a mapping in the Word styles mapping table from the preferences page are converted to simple paragraph elements and a warning is reported for each of them in the Results view.

As an example, suppose I have converted the Word document and I see the following warnings in the Results view:

Here are the steps that you should follow for this configuration:

Open the Plugins / Batch Documents Converter preferences page in Oxygen.
To fix the Unrecognized "Document Title" style for "p" Word element warning, add a new row in the Word styles mapping table with the following cells:
1. Type "p" in the Word element cell (because the unrecognized style was found on a Word paragraph).
2. Type "Document Title" in the Word style cell.
3. In the "HTML elements" cell, you need to add a corresponding HTML element. For this example, a corresponding element is "<h1>" for the default mapping of the "Title" style. So, type "h1:fresh" into this cell. The ":fresh" suffix instructs the converter to create a new element every time it finds this kind of paragraph. When it's not set, the converter will try to reuse the elements and combine sequences of the same style of paragraphs into a single element.
To fix the Unrecognized "Keyboard Key" style for "r" Word element warning, add a new row in the Word styles mapping table with the following cells:
1. Type "r" in the Word element cell (because the unrecognized style was found on a Word character).
2. Type "Keyboard Key" in the Word style cell.
3. In the HTML elements cell, you need to add a corresponding HTML element. For this example, a corresponding element is "<kbd>" for the default mapping of the "HTML Keyboard" style. So, type "kbd" in this cell.

After completing these steps, you should have these two rows in the configuration table:

p	Document Title	h1:fresh
r	Keyboard Key	kbd

For more information about the Word styles mapping configuration, see the following section from the documentation: Conversions from Word.

Question 2

How can I configure the styles mapping when the wanted element doesn't exist in HTML?

Answer

A frequent case when setting the mapping configuration for a custom style is to not find a correspondent element in HTML, although one exists in DITA.

As an example, suppose I have a Word document that contains a character custom style named "filepath". We know that a correspondent element exists in DITA, but we cannot find one in HTML.

These are the steps that can be applied to handle this case:

Go to the Plugins / Batch Documents Converter preferences page and add the following mapping in the Word styles mapping table:
r Filepath i.filepath
Convert the Word document to DITA. The characters styled with the Filepath style in Word are converted to the <i> element with the "filepath" @outputclass attribute.
Move the output files to your project, select them, and apply the "Rename element" refactoring operation using the "//i[@outputclass = 'filepath']" XPath for matching the target <i> element and changing them to the <filepath> DITA element.
Apply the "Remove attribute" refactoring operation using the "//filepath[@outputclass = 'filepath']" XPath for matching the target elements and deleting the @outputclass attributes.

Instead of steps 3 and 4, you can also create a custom refactoring operation that makes these two changes, like this:

Create an XSLT file (for example, named batch-converter-post-processing.xsl) that iterates over all elements from the document, finds the <i> elements with the "filepath" @outputclass attribute, and replaces them with the <filepath> elements without copying the @outputclass attribute:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="i[@outputclass = 'filepath']">
        <xsl:element name="filepath">
            <xsl:apply-templates select="node()"/>
        </xsl:element>
    </xsl:template>
</xsl:stylesheet>

Create an XML Refactoring operation descriptor (for example, named batch-converter-post-processing.xml) that references the stylesheet and provides descriptions:

<?xml version="1.0" encoding="UTF-8"?>
<refactoringOperationDescriptor xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://www.oxygenxml.com/ns/xmlRefactoring" id="op_qzq_y2x_nsb"
    name="Post-processing Batch Documents Converter">
    <description>Post-process the resulting DITA documents from the Word conversion using the Batch Documents Converter add-on.</description>
    <script type="XSLT" href="batch-converter-post-processing.xsl"/>
</refactoringOperationDescriptor>

Copy these two files to a folder scanned by Oxygen XML Editor when it loads the custom operation.
Apply the new custom operation named "Post-processing Batch Documents Converter" that can now be found in the list of refactoring operations from the "XML Refactoring" dialog box.

Question 3

How can I configure the styles mapping for paragraphs styled as code blocks?

Answer

In Word, a code block is represented as a sequence of paragraphs styled with a custom style that adds a custom font and border, like this:

To add a mapping for this custom style, follow these steps:

Open the Options → Preferences → Plugins → Batch Documents Converter preferences page in Oxygen.
Add a new row to the Word styles mapping table.
Enter "p" in the Word element cell, and enter "Code Snippet" in the Word style cell to match the paragraphs styled with "Code Snippet".
Enter "pre:separator('\n')" in the HTML elements cell. The <pre> element is the corresponding HTML element for these types of paragraphs. Since we want to merge these sequences of "Code Snippet" style paragraphs, the ":fresh" marker was not used. The ":separator('')" syntax configures a separator when the same type of consecutive paragraphs are merged. If it isn't specified, we would obtain a <pre> element with a single line of text.

Using this configuration, a <pre> element is the result in the converted DITA output for every code block sequence. For obtaining <codeblock> elements in DITA, see How can I configure the styles mapping when the wanted element doesn't exist in HTML? for setting a class attribute on the resulting <pre> HTML element (by setting "pre.codeblock:separator('\n')" in the HTML elements cell) and creating a custom refactoring operation.

Question 4

How can I preserve the Word shapes?

Answer

The converter doesn't support the Word shapes and these will be ignored when the document is converted to HTML or DITA. A workaround is to save them manually as SVG images in MS Word before the conversion. For this, you should follow these steps for every shape inside the document:

Select the shape.
Invoke the Cut action from the contextual menu.
Invoke the Paste Special... action from the Home > Paste menu.
In the resulting dialog box, select Picture (SVG) and OK.

After completing these steps, the document will contain "SVG" images that will be preserved in the HTML or DITA results.