Batch converting HTML to XHTML

Suppose you have a set of existing HTML documents that may not be well-formed, and you want to process them using XSLT. For example, you may want to migrate the HTML documents to DITA using the predefined XHTML to DITA Topic transformation scenario available in Oxygen. To do that, you first need to convert the existing HTML documents to well-formed XHTML, and you need to do it as an automated batch process.

Many open source projects provide processors that can convert HTML to its well-formed XHTML equivalent. For this blog post, we'll use HTML Tidy. The following steps automate the process:

Create a new folder on your hard drive (for example, I created one on my Desktop: C:\Users\radu_coravu\Desktop\tidy).
Download the HTML Tidy executable for your platform (http://binaries.html-tidy.org/) and place it in the folder you just created.

In that same folder, create an ANT build file called build.xml with the following content:

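<project basedir="." name="TidyUpHTMLtoXHTML" default="main">
  <basename property="filename" file="${filePath}"/>
  <target name="main">
    <!-- -asxhtml tells Tidy to write XHTML; separate args keep paths with spaces intact -->
    <exec executable="tidy.exe">
      <arg value="-asxhtml"/>
      <arg value="-o"/>
      <arg file="${output.dir}/${filename}"/>
      <arg file="${filePath}"/>
    </exec>
  </target>
</project>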
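
To make it easier to see what the build file does for a single document, here is a minimal Python sketch of the same idea. It is only an illustration, not part of the Oxygen setup: the function name tidy_command and its tidy parameter are made up for this example, and the sketch passes Tidy's -asxhtml option explicitly to request XHTML output.

```python
import os


def tidy_command(file_path, output_dir, tidy="tidy"):
    """Build the Tidy argument list for converting one HTML file (illustrative helper)."""
    # Same role as <basename> in build.xml: keep only the file name.
    filename = os.path.basename(file_path)
    # -asxhtml requests XHTML output, -o names the file Tidy writes to.
    return [tidy, "-asxhtml", "-o", os.path.join(output_dir, filename), file_path]
```

The returned list can be handed to subprocess.run. Note that Tidy exits with status 1 when it only reports warnings and 2 when it reports errors, so a non-zero status does not always mean that no output was written.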

In the Oxygen Project view, link the entire folder where the original HTML documents are located.
Right-click the folder, choose Transform->Configure Transformation Scenarios..., and create a new transformation scenario of the type ANT Scenario. Then modify the following properties in the transformation scenario:
1. Change the scenario name to something relevant, like HTML to XHTML.
2. Change the Working Directory to point to the folder where the ANT build file is located (in my case: C:\Users\radu_coravu\Desktop\tidy).
3. Change the Build file to point to your custom build.xml (in my case: C:\Users\radu_coravu\Desktop\tidy\build.xml).
4. In the Parameters tab, add a parameter called filePath with the value ${cf} and a parameter called output.dir with the value of the path to the output folder where the equivalent XHTML files will be stored (in my case, I set it to: C:\Users\radu_coravu\Desktop\testOutputXHTML).
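
The roles of the two parameters can be sketched outside Oxygen as a plain loop: filePath changes for every file, while output.dir stays fixed. The Python below is a rough equivalent under a few assumptions: the name convert_folder and its run parameter are hypothetical, only files ending in .html are picked up, a tidy executable is assumed to be on the PATH, and the output folder is created if it is missing.

```python
import subprocess
from pathlib import Path


def convert_folder(source_dir, output_dir, run=subprocess.run):
    """Run Tidy once per HTML file and return the names of the files it was asked to write."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # assumption: create the output folder if needed
    converted = []
    for html in sorted(Path(source_dir).glob("*.html")):
        # html plays the role of filePath (${cf}): it changes on every iteration.
        target = out / html.name  # output.dir stays the same for every file
        run(["tidy", "-asxhtml", "-o", str(target), str(html)])
        converted.append(target.name)
    return converted
```

The run parameter exists only so that the loop can be tried without Tidy installed, for example by passing a function that records the commands instead of executing them.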
Apply the new transformation scenario to the entire folder that contains the HTML documents. When it finishes, the output folder contains the XHTML equivalents of the original HTML files, which can then be processed using XML technologies such as XSLT or XQuery.
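
Before running XSLT over the results, it can be useful to confirm that every converted file really parses as XML. The following Python sketch is one possible quick check, not something the scenario does for you; the function name find_malformed is made up for this example. Keep in mind that Python's built-in parser does not read the XHTML DTD, so a file that uses a named entity such as &nbsp; will also show up in the report.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed(output_dir):
    """Return (file name, parser message) pairs for files that fail to parse as XML."""
    problems = []
    for path in sorted(Path(output_dir).iterdir()):
        if not path.is_file():
            continue  # skip subfolders
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            problems.append((path.name, str(err)))
    return problems
```

Calling find_malformed with the path of the output folder returns an empty list when every file parsed cleanly. Otherwise, each entry names a file together with the parser's message, which includes the line and column of the first problem.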