How to correct the reading order in PDF documents automatically

1 September 2017

You may have noticed that text selection can behave erratically in your PDF reader (such as Adobe Reader). This is because the text in a PDF document is not necessarily stored in logical reading order. You can see an example of this behaviour in Figure 1.

Figure 1

So why is this a problem? There are two reasons. First, it means that Google will most likely index the text in your document in the wrong order. Sentences are in some cases broken up, and as a result you may not get the search hits you’re after for your content.

The second problem is that people who use screen readers won’t be able to read the documents properly. Text will most likely be read out in the wrong order.

 

Fixing the reading order automatically

The typical solution to reading order problems is PDF tagging. This requires you to go through the entire document, tag each piece of text, and mark which piece of text should follow it in the reading order.

Using machine learning, FlowPaper is able to reconstruct the logical reading order of your documents without any manual tagging work. Single-column layouts, multi-column layouts, you name it. FlowPaper does this by analysing the layout of the page in much the same way as the human eye recognises a page and its reading order.
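To make the idea of layout-based ordering a little more concrete, here is a minimal sketch in Python. It is not FlowPaper's machine learning approach. It is a much simpler heuristic that assumes a fixed number of equal-width columns, and the `Fragment` class, the `reading_order` function and the coordinates are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its position on the page (hypothetical model)."""
    text: str
    x: float  # left edge, in points
    y: float  # distance from the top of the page, in points


def reading_order(fragments, page_width, columns=2):
    """Sort fragments column by column, then top to bottom within a column.

    Assumes equal-width columns, which real documents often do not have.
    """
    column_width = page_width / columns

    def key(fragment):
        # Clamp so a fragment on the right page edge stays in the last column.
        column = min(int(fragment.x // column_width), columns - 1)
        return (column, fragment.y, fragment.x)

    return sorted(fragments, key=key)


# Fragments in the order they might be stored in the file: row by row,
# jumping between the two columns.
stored = [
    Fragment("left column, line 1", 50, 100),
    Fragment("right column, line 1", 320, 100),
    Fragment("left column, line 2", 50, 120),
    Fragment("right column, line 2", 320, 120),
]

ordered = [fragment.text for fragment in reading_order(stored, page_width=600)]
print(ordered)
```

A fixed column grid like this breaks down quickly on real pages with headings, sidebars or images that span columns, which is why a learned layout analysis is attractive.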

 

How to fix the reading order of your PDF

Make sure you have the desktop publisher installed. It can be downloaded from our public download page.

Figure 2

  1. Start the desktop publisher and select the “Elements – Slide” viewer in the top right corner, as seen in Figure 2.
  2. Make sure the “Improved Accessibility” checkbox is ticked in the “Accessibility & SEO” section.
  3. You can now go ahead and import your PDF. Make your desired adjustments to the style and click “Publish” at the top.

 

Verifying the results

So how can we check that the reading order has actually been fixed?

You could let a screen reader read it out if you have one installed. You can also check the text order manually if you know your way around Chrome a little bit.

  1. Open the publication in Chrome using “View Offline Version” from the desktop publisher, then open the Chrome developer tools from the View->Developer->Developer Tools menu.
  2. You will now see all the HTML5 elements that the desktop publisher created when converting your PDF document. An easy way to check the reading order is to step through the elements one by one, as in the animation below.
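If you would rather not click through the elements by hand, the same check can be roughly scripted. The sketch below uses only Python's standard `html.parser` module to print the text of an HTML page in the order it appears in the markup, which is the order you are stepping through in the dev tools. The `sample` markup is invented for illustration. The actual structure of a published document will differ, so treat this as a starting point and feed it the HTML you saved from your own publication.

```python
from html.parser import HTMLParser


class TextInDomOrder(HTMLParser):
    """Collects visible text chunks in the order they occur in the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def text_in_dom_order(html):
    """Return the non-empty text chunks of an HTML string in document order."""
    parser = TextInDomOrder()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Invented sample markup; replace it with the HTML of your own publication.
sample = """
<div class="page">
  <p>First paragraph of the left column.</p>
  <p>Second paragraph of the left column.</p>
  <p>First paragraph of the right column.</p>
</div>
"""

for number, chunk in enumerate(text_in_dom_order(sample), start=1):
    print(number, chunk)
```

Reading the numbered output from top to bottom should match how you would read the page yourself. If a chunk turns up in the wrong place, that is the spot to inspect more closely in the dev tools.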

 

Voilà! Please let us know if you have any questions about reading order or how it can be used in other scenarios!