Bug#1108130: pdf2xml.dtd should be installable for use with pdftohtml

John Scott jscott at posteo.net
Sat Jun 21 00:59:12 BST 2025


Package: poppler-utils
Version: 25.03.0-4
Severity: minor

The pdftohtml utility included with poppler-utils can produce XML output when given the -xml command-line switch. Unlike SVG output, this format is convenient to parse and scrape afterwards, or to convert further. To try it, one can run:
	pdftohtml -l 1 -q -i -stdout -xml /usr/share/doc/debian-reference-common/docs/debian-reference.en.pdf
and obtain something like
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="25.03.0">
	<page number="1" position="absolute" top="0" left="0" height="1262" width="892"/>
	<outline>
		<item page="28">GNU/Linux tutorials</item>
		<outline>
			<item page="28">Console basics</item>
...
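
As a rough illustration of the kind of parsing this format allows, here is a minimal Python sketch using only the standard library's xml.etree.ElementTree. It is an assumption about how one might consume the output, not something shipped with poppler. The sample is modeled on the truncated output above, with the closing tags added by hand so that it parses, and the helper name outline_items is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the truncated output above. The closing tags are an
# assumption, added by hand so the fragment is well-formed.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="25.03.0">
	<page number="1" position="absolute" top="0" left="0" height="1262" width="892"/>
	<outline>
		<item page="28">GNU/Linux tutorials</item>
		<outline>
			<item page="28">Console basics</item>
		</outline>
	</outline>
</pdf2xml>
"""


def outline_items(xml_bytes):
    """Return a list of (depth, page, title) for every <item> in the outline.

    Hypothetical helper: it assumes nested <outline> elements express
    nesting depth, as the sample output suggests.
    """
    root = ET.fromstring(xml_bytes)
    found = []

    def walk(node, depth):
        # Visit children in document order so the result keeps reading order.
        for child in node:
            if child.tag == "item":
                title = (child.text or "").strip()
                found.append((depth, int(child.get("page")), title))
            elif child.tag == "outline":
                walk(child, depth + 1)

    for top in root.findall("outline"):
        walk(top, 0)
    return found


if __name__ == "__main__":
    # Print the outline as an indented table of contents.
    for depth, page, title in outline_items(SAMPLE):
        print("  " * depth + f"{title} (page {page})")
```

The parser does not fetch the external pdf2xml.dtd named in the DOCTYPE, so this works even though the DTD is not installed; it also means nothing is validated.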

That DTD file is in the source tree and can be used to validate the XML or to aid in understanding the format. It is not installed, but it would be helpful if poppler-utils shipped it. Be mindful that, if I recall correctly, Debian has a specific packaging policy for SGML/XML DTDs such as this one, so that tools wanting them can find them easily.


