State of the art in generating PowerPoint Presentations with HTML content


Posted by Steven

Programmatically generating files in Microsoft formats is a common feature in the applications I know. Excel spreadsheets are used for reporting, PowerPoint slides for presenting data, and Word documents for textual reports. In this article, I want to focus on the current PowerPoint file format: .pptx. The difficulty in creating these files lies in the format itself: it has changed fundamentally in the past, its specifications are several hundred pages long, and there are plenty of other problems that I don't want to go into here. Also, this article is about exporting .pptx files, not importing them. HTML-formatted content is a particular difficulty. In our application, the user can format text with an HTML editor and generate a .pptx report with that content. Of course, the generated file should look like the HTML the user entered.

At the moment, we use Aspose.Slides for that task. Because of several problems we encountered, I searched for alternatives. Here is what I found. 

Aspose.Slides

Aspose.Slides is a Java port of a .NET library and is our current solution for creating PowerPoint files. Aspose itself doesn't understand HTML and cannot render it. To display formatted content, we wrote code that translates HTML into Aspose objects. This code maps only a small subset of HTML tags, because the dependencies between tags become complicated quickly, even when supporting just a few of them: bold, italic, underline, and numbered and bulleted lists.
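To give an idea of what such a mapper involves, here is a minimal sketch in plain Java. It is not our production code and it does not use the Aspose API: TextRun and HtmlRunMapper are hypothetical names, only the tags b, i and u are recognized, and lists, attributes and HTML entities are ignored entirely. It splits the input into runs of text that share the same formatting, which is roughly the shape a slide library expects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a formatted piece of text; not an Aspose class. */
final class TextRun {
    final String text;
    final boolean bold;
    final boolean italic;
    final boolean underline;

    TextRun(String text, boolean bold, boolean italic, boolean underline) {
        this.text = text;
        this.bold = bold;
        this.italic = italic;
        this.underline = underline;
    }
}

/** Hypothetical mapper for a tiny HTML subset: <b>, <i>, <u> without attributes. */
final class HtmlRunMapper {
    private static final Pattern TAG = Pattern.compile("<(/?)([biu])>");

    static List<TextRun> toRuns(String html) {
        List<TextRun> runs = new ArrayList<>();
        int bold = 0, italic = 0, underline = 0; // nesting depth per style
        int pos = 0;
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            // Text between the previous tag and this one becomes one run.
            if (m.start() > pos) {
                runs.add(new TextRun(html.substring(pos, m.start()),
                        bold > 0, italic > 0, underline > 0));
            }
            int delta = m.group(1).isEmpty() ? 1 : -1; // opening or closing tag
            switch (m.group(2)) {
                case "b": bold = Math.max(0, bold + delta); break;
                case "i": italic = Math.max(0, italic + delta); break;
                default:  underline = Math.max(0, underline + delta); break;
            }
            pos = m.end();
        }
        // Trailing text after the last tag.
        if (pos < html.length()) {
            runs.add(new TextRun(html.substring(pos), bold > 0, italic > 0, underline > 0));
        }
        return runs;
    }
}
```

For the input a<b>b<i>c</i></b>d this yields four runs: "a" (plain), "b" (bold), "c" (bold and italic) and "d" (plain). Each run would then have to be translated into the text object of the slide library. Lists are where it gets complicated, because nesting and indentation have to be tracked as well.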

Pro:

  1. The template itself can be a .pptx-file that already has images, headers and other corporate identity components. That way, the customer can simply change the template. 

Con:

  1. Not open source, and the code is obfuscated. When writing our mapping code, it would have been hugely helpful to see the original variable names in the decompiled code.
  2. Works with its own objects like Shape and Slide, so we had to build libraries that translate HTML into these objects. This is an additional source of failures. Currently, we have several problems, such as wrong indentation with headers. What we want is a library that understands HTML out of the box.
  3. Aspose is quite slow and consumes a lot of heap space. Generating huge reports is not possible because our server cannot provide enough heap space. Our current workaround is to generate a number of small reports, each with 30 slides in it.
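The workaround from the last point boils down to splitting the slide data into batches before handing each batch to the report generator. A minimal sketch, assuming the slide data is available as a list; ReportBatcher and partition are hypothetical names, and the actual rendering step is left out:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits report data into fixed-size batches. */
final class ReportBatcher {
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }
}
```

With a batch size of 30, a report of 65 slides ends up as three files with 30, 30 and 5 slides.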

Jasper

Jasper is an open source Java library that generates .pptx files.

Pro:

  1. open source
  2. Jasper understands HTML out of the box, so there is no need to write a mapper from HTML to Java objects like we had to with Aspose.
  3. Much faster than Aspose and needs far less memory. It can generate up to 200 slides in one .pptx file instead of just 30.

Con:

  1. In HTML-formatted text areas, bullet lists are just "drawn" instead of being real lists. Each bullet point is an inserted special character, and the indentation consists of spaces. The result is not a real list and cannot be edited as one after the report is generated. This behaviour is documented in this ancient defect from 2006.
  2. The template has to be created with a special tool, JasperSoft Studio, and cannot be a simple PowerPoint file. We don't want to add another piece of software to our utility belt.

Apache POI & docx4j

Apache POI and docx4j are both open source libraries that can manipulate the XML in a .pptx file.
 
Pro:

  1. open source

Con:

  1. Rudimentary support for .pptx. In particular, there are very few Google hits for generating or writing .pptx files.
  2. POI's HSLF component, which lives in the Scratchpad, handles the older binary .ppt format; .pptx is covered by the separate XSLF component. From what we saw, the API is not as stable as we wish.

Generally, the approach of writing XML is too low level. Ideally, we would simply paste HTML content into a field that is rendered by some library.

Write pptx file completely from scratch

Because .pptx is an open format (Office Open XML, a ZIP archive of XML files), it can be written from scratch or edited afterwards.
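The package structure can be explored with nothing but the JDK. The following sketch writes a ZIP archive with a few placeholder parts and lists them again. The part names follow the usual layout of a .pptx package, but the contents passed in are placeholders, so the result is not a presentation PowerPoint would open. PptxPackageSketch is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper that treats a .pptx package as what it is: a ZIP of XML parts. */
final class PptxPackageSketch {

    /** Writes {name, xml} pairs into an in-memory ZIP archive. */
    static byte[] writeParts(String[][] parts) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String[] part : parts) {
                zip.putNextEntry(new ZipEntry(part[0]));
                zip.write(part[1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns the part names of a package in archive order. */
    static List<String> listParts(byte[] pkg) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

Calling listParts on the bytes of a real .pptx file shows parts such as [Content_Types].xml, ppt/presentation.xml and ppt/slides/slide1.xml. Writing from scratch means producing every one of them correctly.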
 
Pro:

  1. no library needed
  2. full control of rendering

Con:

  1. Unpredictable cost, because this approach is basically a reimplementation of existing frameworks.
  2. Very low level.

This approach could be eased by creating the .pptx file itself with POI and editing only the contents with custom XML. Nevertheless, this is far too low level and will take a lot of time to get right.
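To illustrate what "custom XML" means here, the sketch below builds the DrawingML markup for a single bulleted paragraph with one text run. It assumes the fragment is inserted into a text body where the a: namespace prefix is already declared. The indentation values and the class name DrawingMlSketch are illustrative choices, not taken from our code.

```java
/** Hypothetical builder for a DrawingML paragraph fragment. */
final class DrawingMlSketch {

    /** Escapes the characters that are not allowed in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Returns one real bullet-list paragraph containing a single run. */
    static String bulletParagraph(String text, boolean bold) {
        return "<a:p>"
                // marL/indent are in EMU; buChar makes this a real bullet item.
                + "<a:pPr marL=\"285750\" indent=\"-285750\">"
                + "<a:buChar char=\"&#8226;\"/>"
                + "</a:pPr>"
                + "<a:r>"
                + "<a:rPr lang=\"en-US\"" + (bold ? " b=\"1\"" : "") + "/>"
                + "<a:t>" + escape(text) + "</a:t>"
                + "</a:r>"
                + "</a:p>";
    }
}
```

The a:buChar element is what makes the paragraph a real, editable list item, as opposed to a bullet character and spaces typed into the text. Nested lists, numbering and mixed formatting within one item all need more markup, which is where the effort goes.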

Inserting OpenDocumentText-objects in the slides of Aspose

Because Aspose uses .pptx files as template, we could insert an OpenDocumentText-object in this template and fill it with HTML content. This way, it gets rendered as HTML.
 
Con:

  1. HTML is not rendered properly. We faced the same styling errors with bullet lists as described above.
  2. The OpenDocumentText-object cannot easily be edited afterwards because it is not an ordinary text area but a special object. 

Conclusion

Sadly, there doesn't seem to be a good library that generates .pptx files with HTML support out of the box. We decided to keep using Aspose and to rewrite our HTML-to-Aspose mapper. That could take a while, but it's the best option because we already have experience with Aspose. Still, this decision is far from a great solution.

What do you use to generate .pptx files? Do you render HTML content?

tl;dr

There is no good way of generating .pptx files with HTML content.

(Photo: Frits Ahlefeldt-Laurvig, https://www.flickr.com/photos/hikingartist/3515471358)
