Wednesday, October 28, 2015

How Special Paste works in Oxygen


If you've worked with one of the XML vocabularies that Oxygen supports out of the box, such as DITA, DocBook, TEI or XHTML, you've probably already used its support for converting content pasted from external applications such as Microsoft Word, Excel or any web browser. This feature is very useful for converting various types of content to XML because it preserves and converts styling, links, lists, tables and image references.

The feature relies on the fact that when you copy content in the applications mentioned above, they place the HTML equivalent of the copied content in the clipboard. So all Oxygen has to do is clean up that HTML, make it well-formed XHTML and apply conversion XSLT stylesheets over it.
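The last step of that pipeline (applying a conversion stylesheet over well-formed XHTML) can be sketched with the standard JAXP API. This is only an illustration, not Oxygen's actual code: the class name, the `convert` method and the tiny inline stylesheet are all made up for the example, and the real stylesheets handle far more than paragraphs and bold text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class PasteConversionSketch {

    // Hypothetical stand-in for a conversion stylesheet: it only maps
    // XHTML <p> to <p> and <strong>/<b> to <b>, dropping the XHTML namespace.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + "    xmlns:h='http://www.w3.org/1999/xhtml'"
      + "    exclude-result-prefixes='h'>"
      + "  <xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "  <xsl:template match='h:html|h:body'><xsl:apply-templates/></xsl:template>"
      + "  <xsl:template match='h:p'><p><xsl:apply-templates/></p></xsl:template>"
      + "  <xsl:template match='h:strong|h:b'><b><xsl:apply-templates/></b></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the given stylesheet to an already well-formed XHTML string.
    static String convert(String xhtml, String xslt) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                     + "<p>Hello <strong>world</strong></p></body></html>";
        System.out.println(convert(xhtml, XSLT));
    }
}
```

The clean-up step that turns clipboard HTML into well-formed XHTML is left out here; the sketch assumes that has already happened.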

This support is not hardcoded and anybody who is developing an Oxygen framework customization for a certain XML vocabulary can provide conversion stylesheets for external pasted HTML content.

I will describe how this works for the DITA framework and you can do the same for yours. You can also use this information to modify the way in which smart paste works for the bundled framework configurations.
  1. In the Preferences->Document Type Association page you can choose to edit (or extend) the DITA document type association.
  2. In the Extensions tab the Extensions bundle implementation is set to DITAExtensionsBundle which resides in the DITA Java extensions archive dita.jar.
  3. The DITAExtensionsBundle is an extension of the ExtensionsBundle API and it provides its own external object insertion handler:
      /**
       * @see ro.sync.ecss.extensions.api.ExtensionsBundle#createExternalObjectInsertionHandler()
       */
      @Override
      public AuthorExternalObjectInsertionHandler createExternalObjectInsertionHandler() {
        return new DITAExternalObjectInsertionHandler();
      }
  4. The DITAExternalObjectInsertionHandler extends the base class AuthorExternalObjectInsertionHandler and provides a reference to its specific conversion stylesheet:
      /**
       * @see ro.sync.ecss.extensions.api.AuthorExternalObjectInsertionHandler#getImporterStylesheetFileName(ro.sync.ecss.extensions.api.AuthorAccess)
       */
      @Override
      protected String getImporterStylesheetFileName(AuthorAccess authorAccess) {
        return "xhtml2ditaDriver.xsl";
      }
    Note: The Extensions tab also allows you to specify the external object insertion handler as a separate extension.
  5. In the same Document Type edit dialog, in the Classpath tab, you will see that there is a reference to a framework-specific resources folder like: ${framework}/resources/
  6. If you look on disk in the DITA framework resources folder "OXYGEN_INSTALL_DIR\frameworks\dita\resources", you will find the xhtml2ditaDriver.xsl stylesheet there. The stylesheet imports various other stylesheets which apply various cleanups on HTML produced with MS Word and which you could probably fully reuse. It also handles the conversion from the pasted HTML content to DITA, so you can copy the entire set of XSLT stylesheets to your framework and use them as a starting point.

4 comments:

  1. Is there a way to access the temporary XHTML file without having to initially write the plugin?

    Replies
    1. No, right now you need to go through all those framework customization steps. After you write the custom XSLT, you can add an xsl:message to it so that the XSLT messages console shows all the XHTML content which gets transformed.
      As a workaround you could use a clipboard inspector to obtain the HTML flavor set in it, save it to a file and then use Oxygen's "File->Import->HTML File" action to pass the HTML through the Neko parser and obtain the XHTML.
      Or you could write your own small Java program which looks in the clipboard and passes the HTML content through the Neko (or another) HTML parser.
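A small program of that kind could start out like the sketch below, which only covers the clipboard half. It assumes the copying application exposes an HTML flavor that Java maps to `DataFlavor.allHtmlFlavor`; the class and method names are made up for the example, and passing the result through Neko or another HTML parser is left out.

```java
import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.Transferable;

class ClipboardHtmlDump {

    // Returns the HTML flavor of the given content, or null if it has none.
    static String htmlFrom(Transferable content) throws Exception {
        if (content != null
                && content.isDataFlavorSupported(DataFlavor.allHtmlFlavor)) {
            return (String) content.getTransferData(DataFlavor.allHtmlFlavor);
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Needs a desktop session; this call fails in a headless environment.
        Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();
        String html = htmlFrom(clipboard.getContents(null));
        System.out.println(html != null ? html : "No HTML flavor in the clipboard.");
    }
}
```

Copy something from Word or a browser, run the program, and redirect the output to a file if you want to inspect or import it.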

  2. Alternatively:

    1. Save .odt/.doc as .docx
    2. Remove headers, footers, set bulk of text style back to default or just get rid of paragraph auto-indenting/spacing for "Text Body" or equivalent.
    3. Use Pandoc to convert docx to HTML5 (Pandoc strips out most cruft).
    4. Load the HTML file in Oxygen XML Editor.
    5. Run XHTML to DITA converter.
    6. Tweak as necessary (including adding or editing DITA map files or switching to D4P).
    7. Transform & publish.

    That right there is how I get LibreOffice docs channeled through to my preferred publishing format(s). It will get a bunch of chapter files for a novel into a fit state to generate ePubs and other formats in a matter of minutes (with each chapter being a single topic).

    Replies
    1. Hi Ben,

      Sure, there can be multiple solutions for conversions. Smart paste is a good thing for small projects; for large projects you would need some automation.
      Also, when converting content you should always check how complex table layouts are preserved and how links and image references are ported.

      Regards,
      Radu
