PAGE XML format collection for document image page content and more

For an introduction, please see the following publication:

The most actively used XML formats are:

  • PAGE XML for page content (regions, text lines, words, glyphs, reading order, text content, ...)
  • PAGE XML for layout analysis evaluation (evaluation profiles, evaluation results, ...)
  • PAGE XML for document image dewarping (dewarping grids)
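
As an illustration of the page content format, the following is a minimal sketch of reading text lines from a PAGE XML document with the Python standard library. The element and attribute names (PcGts, Page, TextRegion, TextLine, Coords/@points, TextEquiv/Unicode) follow the page content format; the sample document, its namespace URI, and the helper names (local_name, parse_points, extract_lines) are assumptions made for this example only. The code matches elements by local name, so it does not depend on a particular schema version.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The namespace URI below is a placeholder,
# not the official schema location.
SAMPLE = """<PcGts xmlns="urn:example:page-content">
  <Page imageFilename="page1.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="10,10 500,10 500,200 10,200"/>
      <TextLine id="l1">
        <Coords points="10,10 500,10 500,60 10,60"/>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <Coords points="10,70 500,70 500,120 10,120"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for candidate in elem:
        if local_name(candidate.tag) == name:
            return candidate
    return None


def parse_points(points):
    """Convert a points string such as '10,10 500,10' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def extract_lines(xml_text):
    """Collect id, outline polygon and text of every TextLine element."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        coords = child(elem, "Coords")
        equiv = child(elem, "TextEquiv")
        unicode_el = child(equiv, "Unicode") if equiv is not None else None
        lines.append({
            "id": elem.get("id"),
            "polygon": parse_points(coords.get("points")) if coords is not None else [],
            "text": (unicode_el.text or "") if unicode_el is not None else "",
        })
    return lines


if __name__ == "__main__":
    for line in extract_lines(SAMPLE):
        print(line["id"], line["text"])
```

Lines are returned in document order. This sketch does not consult the reading order element, and it does not validate the input against the schema.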

Each format is defined by an XML schema, hosted officially on

Please see the wiki for more information.

Note: The master branch contains the proposed changes for the next release.

Page Content

Proposed media type for page content: "application/"