Sunday, April 17, 2016

PDF text layout made easy with PDFBox-Layout

More than a decade ago I was using iText to create PDF documents from scratch. It was quite easy to use, and did all the stuff I needed like organizing text in paragraphs, performing word wrapping and marking up text with bold and italic. But once upon a time Bruno Lowagie - the developer of iText - switched from open source to a proprietary license for reasons I do understand.

So when I now had to do some PDF processing for a new project, I was looking for an alternative. PDFBox is definitely the best open source choice, since it is quite mature.But when I was searching on how to do layout, I found a lot of people looking for exactly those features, and the common answer was: you have to do it on your own! Say what? Ouch. There must be someone out there, who already wrote that stuff... Sure there is, but google did not find him. So I started to write some simple word wrapping. And some simple pagination. And some simple markup for easy highlighting with bold an italic. Don't get me wrong: the stuff I wrote is neither sophisticated nor complete. It is drop dead simple, and does the things I need. But just in case someone out there may find it useful, I made it public under MIT license on GitHub.



PDFBox-Layout acts as a layer on top of PDFBox that performs some basic layout operations for you:
  • word wrapping
  • text alignment
  • paragraphs
  • pagination
The API actually has two parts: the (low-level) text layout API, and the document layout API.

The Text Layout API

The text layout API is thought for direct usage with the low level PDFBox API. You may organize text into blocks, do word wrapping, alignment, and highlight text with markup. Means: most features described in the remainder of this article may be used directly with PDFBox without the document layout API.  For more details on this API see the Text API Wiki page. What the document layout API gives you as a surplus, is paragraph layout and pagination.

The Document Layout API

The ingredients of the document layout API are documents, paragraphs and layouts. It is thought to easily create complete PDF documents from scratch, and performs things like word-wrapping, paragraph layout and pagination for you.
Let's start with a simple example:

document = new Document(Constants.A4, 40, 60, 40, 60);

Paragraph paragraph = new Paragraph();
paragraph.addText("Hello Document", 20, PDType1Font.HELVETICA);

final OutputStream outputStream = 
    new FileOutputStream("hellodoc.pdf");;

We start with creating a Document, which acts as a container for elements like e.g. paragraphs. You specify the media box - A4 in this case - and the left, right, top and bottom margin of the document. The margins are applied to each page. After that we create a paragraph which is a container for text fragments. We add a text "Hello Document" with the font type HELVETICA and size 20 to the paragraph. That's it, let's save it to a file. The result looks like this:


Word Wrapping

As already said, you can also perform word wrapping with PDFBox-Layout. Just use the method setMaxWidth() to set a maximum width, and the text container will do its best to not exceed the maximum width by word wrapping the text:

Paragraph paragraph = new Paragraph();
    "This is some slightly longer text wrapped to a width of 100.", 
    11, PDType1Font.HELVETICA);


If you do not specify an explicit max width, the documents media box and the margins dictate the max width for a paragraph. Means: you may just write text, write text and more text without the need for any line beaks, and the layout will do the word wrapping in order to fit the paragraph into the page boundaries.


As you might have already seen, you can specify a text alignment on the paragraph:
Paragraph paragraph = new Paragraph();
    "This is some slightly longer text wrapped to a width of 100.", 


The alignment tells the draw method what to do with extra horizontal space, where the extra space is the difference between the width of the text container and the line. This means, that the alignment is effective only in case of multiple lines. Currently, Left, Center and Right alignment is supported.


The paragraphs in a document are sized and positioned using a layout strategy. By default, paragraphs are stacked vertically by the VerticalLayout. If a paragraph’s width is smaller than the page width, you can specify an alignment with a  layout hint:

    new VerticalLayoutHint(Alignment.Left, 10, 10, 20, 0));

You can combine text- and paragraph-alignment anyway you want:


An alternative to the vertical layout is the Column-Layout, which allows you to arrange the paragraphs in multiple columns on a page.

Document document = 
    new Document(Constants.A4, 40, 60, 40, 60);
Paragraph title = new Paragraph();
title.addMarkup("*This Text is organized in Colums*", 
    20, BaseFont.Times);
document.add(title, VerticalLayoutHint.CENTER);
document.add(new VerticalSpacer(5));

// use column layout from now on
document.add(new ColumnLayout(2, 10));

Paragraph paragraph1 = new Paragraph();
paragraph1.addMarkup(text1, 11, BaseFont.Times);


But you may also set an absolute position on an element. If this is set, the layout will ignore this element, and render it directly at the given position:

Paragraph footer = new Paragraph();
footer.addMarkup("This is some example footer", 6, BaseFont.Times);
paragraph.setAbsolutePosition(new Position(20, 20));


As you add more and more paragraphs to the document, the layout automatically creates a new page if the content does not fit completely on the current page. Elements have different strategies how they will divide on multiple pages. Text is simply split by lines. Images may decide to either split, or - if they fit completely on the next page - to introduce some vertical spacer in order to be drawn on the next page. Anyway, you can always insert a NEW_PAGE element to trigger a new page.


Often you want use just some basic text styling: use a bold font here, some words emphasized with italic there, and that's it. Let's say we want to use different font types for the following sentence:

"Markup supports bold, italic, and even mixed markup."

If you want to do that using the standard API, it would look like this:

Paragraph paragraph = new Paragraph();
paragraph.addText("Markup supports ", 11, PDType1Font.HELVETICA);
paragraph.addText("bold", 11, PDType1Font.HELVETICA_BOLD);
paragraph.addText(", ", 11, PDType1Font.HELVETICA);
paragraph.addText("italic", 11, PDType1Font.HELVETICA_OBLIQUE);
paragraph.addText(", and ", 11, PDType1Font.HELVETICA);
paragraph.addText("even ", 11, PDType1Font.HELVETICA_BOLD);
paragraph.addText("mixed", 11, PDType1Font.HELVETICA_BOLD_OBLIQUE);
paragraph.addText(" markup", 11, PDType1Font.HELVETICA_OBLIQUE);
paragraph.addText(".\n", 11, PDType1Font.HELVETICA);

That's annoying, isn't it? That's what the markup API is intended for. Use * to mark bold content, and _ for italic. Let's do the same example with markup:

Paragraph paragraph = new Paragraph();
    "Markup supports *bold*, _italic_, and *even _mixed* markup_.\n", 

To make things even more easy, you may specify only the font family instead:

paragraph = new Paragraph();
    "Markup supports *bold*, _italic_, and *even _mixed* markup_.\n",
    11, BaseFont.Helvetica);


That’s it

This was a short overview on what PDFBox Layout can do for you. Have a look at the Wiki and the examples for more information and some visual impressions.