Monday, June 12, 2017

Batch converting HTML to XHTML


Suppose you have a bunch of existing HTML documents that may not be well-formed, and you want to process them using XSLT. For example, you may want to migrate them to DITA using the predefined XHTML to DITA Topic transformation scenario available in Oxygen. To do that, you first need to convert the existing HTML documents to well-formed XHTML, and you want to do it as an automated batch process.

Many open source processors can convert HTML to its well-formed XHTML equivalent. For this blog post, we'll use HTML Tidy. Here are the steps to automate the process:
  1. Create a new folder on your hard drive (for example, I created one on my Desktop: C:\Users\radu_coravu\Desktop\tidy).
  2. Download the HTML Tidy executable specific for your platform (http://binaries.html-tidy.org/) and place it in the folder you created in step 1.
  3. In that same folder, create an ANT build file called build.xml with the following content:
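    <project basedir="." name="TidyUpHTMLtoXHTML" default="main">
      <!-- File name of the current document, without its path. -->
      <basename property="filename" file="${file}"/>
      <target name="main">
        <!-- -asxhtml makes Tidy write well-formed XHTML; -o names the output file. -->
        <exec executable="${basedir}/tidy.exe">
          <arg line="-asxhtml -o"/>
          <arg file="${output.dir}/${filename}"/>
          <arg file="${file}"/>
        </exec>
      </target>
    </project>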
  4. In the Oxygen Project view, link the entire folder where the original HTML documents are located.
  5. Right-click the folder, choose Transform->Configure Transformation Scenarios... and create a new transformation scenario of the type: ANT Scenario. Modify the following properties in the transformation scenario:
    1. Change the scenario name to something relevant, like HTML to XHTML.
    2. Change the Working Directory to point to the folder where the ANT build file is located (in my case: C:\Users\radu_coravu\Desktop\tidy).
    3. Change the Build file to point to your custom build.xml (in my case: C:\Users\radu_coravu\Desktop\tidy\build.xml).
    4. In the Parameters tab, add a parameter called file with the value ${cf} and a parameter called output.dir with the value of the path to the output folder where the equivalent XHTML files will be stored (in my case, I set it to: C:\Users\radu_coravu\Desktop\testOutputXHTML).
  6. Apply the new transformation scenario on the entire folder that contains the HTML documents. When it finishes, in the output folder you will find the XHTML equivalents of the original HTML files (XHTML documents that can later be processed using XML technologies such as XSLT or XQuery).
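
If you would rather convert a whole folder in a single ANT run (for example from the command line, outside of Oxygen), the same idea can be sketched with the ANT apply task, which runs an executable once for every file in a fileset. This is only a minimal sketch under a few assumptions: the project name, the input.dir property, its default folder names, and the .xhtml output extension are made up for illustration, tidy.exe is assumed to sit next to the build file, and only the HTML files directly inside the input folder are picked up.

    <project basedir="." name="TidyFolderToXHTML" default="main">
      <!-- Hypothetical defaults; override with -Dinput.dir=... -Doutput.dir=... -->
      <property name="input.dir" location="input"/>
      <property name="output.dir" location="output"/>
      <target name="main">
        <!-- Make sure the output folder exists before Tidy writes to it. -->
        <mkdir dir="${output.dir}"/>
        <!-- Run Tidy once per HTML file. failonerror is left at its default
             (false) because Tidy exits with a non-zero code on warnings. -->
        <apply executable="${basedir}/tidy.exe" dest="${output.dir}">
          <arg value="-asxhtml"/>
          <arg value="-o"/>
          <!-- Replaced with the mapped output file, then the input file. -->
          <targetfile/>
          <srcfile/>
          <fileset dir="${input.dir}" includes="*.html"/>
          <!-- Map each input name.html to name.xhtml in the output folder. -->
          <mapper type="glob" from="*.html" to="*.xhtml"/>
        </apply>
      </target>
    </project>

Because the apply task compares timestamps between source and target files, running the build again should only reconvert the HTML files that changed since the last run.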

3 comments:

  1. François Violette 11:27 PM

    While that may be the less challenging piece in an "Export legacy HTML to DITA" scenario, it is the missing link between pre- and post-processing. You can get good results quickly, for hundreds of pages. To reap the benefits, fiddle with the options a bit, even if you get acceptable content at first.

    1. I fully agree we could have an ANT scenario which first applies the HTML to XHTML and then XHTML to DITA. It would be more user friendly. We might have enough time to add this feature in Oxygen 19.1.

  2. This only works with the subset of HTML which Tidy can handle, of course. It's not hard to find HTML out there in the wild at which even Tidy will throw up its hands in despair...
