Bug#1120823: XML::LibXML::Parser.3pm: Some remarks and a patch with editorial changes for this man page

Mon Nov 17 00:23:17 GMT 2025

Package: libxml-libxml-perl
Version: 2.0207+dfsg+really+2.0134-6+b1
Severity: minor
Tags: patch

From "/usr/share/doc/debian/bug-reporting.txt.gz":

  Don't file bugs upstream

   If you file a bug in Debian, don't send a copy to the upstream software
   maintainers yourself, as it is possible that the bug exists only in
   Debian. If necessary, the maintainer of the package will forward the
   bug upstream.

-.-

  I do not send reports upstream if I have to get an account there.
The Debian maintainers have one already.

  If I get a negative (or no) response from upstream, I henceforth send
bugs to Debian.

-.-

   * What led up to the situation?

     Checking for defects with a new version

test-[g|n]roff -mandoc -t -K utf8 -rF0 -rHY=0 -rCHECKSTYLE=0 -ww -z < "man page"

  [Use 

grep -n -e ' $' -e '\\~$' -e ' \\f.$' -e ' \\"' <file>

  to find (most) trailing spaces.]

  ["test-groff" is a script in the repository for "groff"; it is not shipped
(local copy and "troff" slightly changed by me).]

  [The fate of "test-nroff" was decided in groff bug #55941.]

   * What was the outcome of this action?

Output from "test-nroff  -mandoc -t -K utf8 -rF0 -rHY=0 -rCHECKSTYLE=0 -ww -z ":

troff:<stdin>:632: warning [page 1, line 549]: cannot break line in l adjust mode; overset by 19n


   * What outcome did you expect instead?

     No output (no warnings).

-.-

  General remarks and further material, if a diff file exists, are in the
attachments.


-- System Information:
Debian Release: forky/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 6.16.12+deb14+1-amd64 (SMP w/2 CPU threads; PREEMPT)
Locale: LANG=is_IS.iso88591, LC_CTYPE=is_IS.iso88591 (charmap=ISO-8859-1), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: sysvinit (via /sbin/init)

Versions of packages libxml-libxml-perl depends on:
ii  libc6                         2.41-12
ii  libxml-namespacesupport-perl  1.12-2
ii  libxml-sax-perl               1.02+dfsg-4
ii  libxml2-16                    2.15.1+dfsg-0.3
ii  perl                          5.40.1-6
ii  perl-base [perlapi-5.40.1]    5.40.1-6

libxml-libxml-perl recommends no packages.

libxml-libxml-perl suggests no packages.

-- no debconf information
-------------- next part --------------
Input file is XML::LibXML::Parser.3pm

Output from "mandoc -T lint  XML::LibXML::Parser.3pm": (shortened list)

     29 STYLE: input text line longer than 80 bytes: 

-.-.

Output from
test-nroff -mandoc -t -ww -z XML::LibXML::Parser.3pm: (shortened list)

      1 cannot break line in l adjust mode; overset by 19n

-.-.

Show if Pod::Man generated this.

2:.\" Automatically generated by Pod::Man 5.0102 (Pod::Simple 3.45)

Latest version in Debian testing:

This is perl 5, version 40, subversion 1 (v5.40.1) built for x86_64-linux-gnu-thread-multi
(with 48 registered patches, see perl -V for more detail)

-.-.

Change '-' (\-) to '\(en' (en-dash) for a (numeric) range.

GNU gnulib has recently (2023-06-18) updated its
"build_aux/update-copyright" to recognize "\(en" in man pages.

XML::LibXML::Parser.3pm:284:the fastest choice, about 6\-8 times faster then \fBparse_fh()\fR.
XML::LibXML::Parser.3pm:868:2001\-2007, AxKit.com Ltd.
XML::LibXML::Parser.3pm:870:2002\-2006, Christian Glahn.
XML::LibXML::Parser.3pm:872:2006\-2009, Petr Pajas.
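
A minimal sketch of how such ranges could be located and converted,
assuming POSIX grep and sed.
The function names "find_ranges" and "fix_ranges" are illustrative
and not part of any package.
Only a "\-" directly between two digits is matched,
so option names like "\-\-help" are left alone;
dates and subtractions match as well, so the result needs a manual review.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from groff or Debian).

# List lines in which "\-" stands directly between two digits.
find_ranges() {
    grep -n -e '[0-9]\\-[0-9]' "$1"
}

# Write the file to standard output with such "\-" replaced by "\(en".
fix_ranges() {
    sed -e 's/\([0-9]\)\\-\([0-9]\)/\1\\(en\2/g' "$1"
}
```

Calling "find_ranges <man page>" lists the candidates;
"fix_ranges <man page> > <new file>" writes a converted copy.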

-.-.

Three full stops (periods) are used for an ellipsis

836:\&  # but this works really globally, also in XML::LibXSLT include etc..)

-.-.

Strings longer than 3/4 of a standard line length (80).

Use "\:" to split the string at the end of an output line, for example a
long URL (web address).
This is a groff extension.

632 exploits of the type described at  (<http://searchsecuritychannel.techtarget.com/generic/0,295582,sid97_gci1304703,00.html>), where a service is tricked to expose its private data by letting it parse a

-.-.

Add a "\&" (or a comma (Oxford comma)) after an abbreviation
or use English words
(man-pages(7)).
Abbreviation points should be marked as such and protected against being
interpreted as an end of sentence when they are not one,
independently of their position on the line.

227:(e.g. an XML fragment from a database). With XML::LibXML it is not required to
356:the incoming pieces of XML (e.g. to detect document boundaries).
378:true value (e.g. 1), the parsing will be stopped and the resulting document
552:if it can be used with a \f(CW\*(C`XML::LibXML\*(C'\fR parser object (i.e. passed to \f(CW\*(C`XML::LibXML\->new\*(C'\fR, \f(CW\*(C`XML::LibXML\->set_option\*(C'\fR, etc.)
631:This method can be used to completely disable entity loading, e.g. to prevent
633:remote file (RSS feed) that contains an entity reference to a local file (e.g. \f(CW\*(C`/etc/fstab\*(C'\fR).
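
A minimal sketch of how the "\&" could be added mechanically,
assuming POSIX sed.
The function name "protect_abbrev" is illustrative.
Only "e.g." and "i.e." followed by a space are handled;
other abbreviations, and abbreviations at the end of an input line,
would still have to be checked by hand.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative).

# Append "\&" to "e.g." and "i.e." when a space follows, so that the
# final period is not taken as the end of a sentence.
# Occurrences already followed by "\&" are not matched.
protect_abbrev() {
    sed -e 's/e\.g\. /e.g.\\\& /g' \
        -e 's/i\.e\. /i.e.\\\& /g' "$1"
}
```

The converted text is written to standard output,
so the original file is left unchanged.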

-.-.

Wrong distance (not two spaces) between sentences in the input file.

  Separate the sentences and subordinate clauses; each begins on a new
line.  See man-pages(7) ("Conventions for source file layout") and
"info groff" ("Input Conventions").

  The best procedure is to always start a new sentence on a new line,
at least, if you are typing on a computer.

Remember coding: Only one command ("sentence") on each (logical) line.

E-mail: Easier to quote exactly the relevant lines.

Generally: Easier to edit the sentence.

Patches: Less unaffected text.

Searching for two adjacent words is easier when they belong to the same line
and the same phrase.

  The amount of space between sentences in the output can then be
controlled with the ".ss" request.

Mark a final abbreviation point as such by suffixing it with "\&".

Some sentences (etc.) do not begin on a new line.

Split (sometimes) lines after a punctuation mark; before a conjunction.

  Lines with only one (or two) space(s) between sentences could be split,
so that later sentences begin on a new line.

Use

#!/usr/bin/sh

sed -e '/^\./n' \
-e 's/\([[:alpha:]]\)\.  */\1.\n/g' "$1"

to split lines after a sentence period.
Check result with the difference between the formatted outputs.
See also the attachment "general.bugs"
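
A small variation of the same idea as a shell function,
given here as a sketch only.
The function name "split_sentences" is illustrative.
It assumes GNU sed (for "\n" in the replacement) and uses an address
with "!" so that every line beginning with a period is left alone.
Abbreviations such as "e.g." are split as well,
which is one more reason to compare the formatted outputs afterwards.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative).

# Start a new input line after a period that follows a letter and is
# followed by one or more spaces.
# Lines that begin with a period (requests and macro calls) are skipped.
# The "\n" in the replacement is a GNU sed feature.
split_sentences() {
    sed -e '/^\./!s/\([[:alpha:]]\)\.  */\1.\n/g' "$1"
}
```

For an input line "First sentence. Second sentence." this writes the two
sentences on separate lines.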

[List of affected lines removed.]

-.-.

Split lines longer than 80 characters (they completely fill
an A4 sized page line on a terminal)
into two or more lines.
Appropriate break points are the end of a sentence and a subordinate
clause; after punctuation marks.
Add "\:" to split the string for the output, "\<newline>" in the source.

[List of affected lines removed.]

Longest line is number 790 with 298 characters
If called without an argument, returns true if the current value of the \f(CW\*(C`recover\*(C'\fR parser option is 2 and returns false otherwise. With a true argument sets the \f(CW\*(C`recover\*(C'\fR parser option to 2; with a false argument sets the \f(CW\*(C`recover\*(C'\fR parser option to 0.
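
A minimal sketch of how such lines could be listed, assuming a POSIX awk.
The function name "long_lines" and its optional second argument
(the limit, default 80) are illustrative.
Depending on the awk implementation and the locale, "length" counts
bytes or characters, so the numbers can differ from those of mandoc.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative).

# Print "number:length" for every line longer than the limit
# (second argument, default 80), followed by a summary of the
# longest line in the file.
long_lines() {
    awk -v max="${2:-80}" '
        length($0) > max { print NR ":" length($0) }
        length($0) > longest { longest = length($0); where = NR }
        END {
            if (where)
                print "longest: line " where " with " longest " characters"
        }
    ' "$1"
}
```

Usage would be "long_lines <man page>" or, with another limit,
"long_lines <man page> 72".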

-.-.

Add a zero (0) in front of a decimal fraction that begins with a period
(.)

7:.if t .sp .5v

-.-.

Put a parenthetical sentence or phrase on a separate line,
if it is not part of code.
See man-pages(7), item "semantic newline".

[List of affected lines removed.]

-.-.

Use a character "\(->" instead of plain "->" or "\->", if not typeset with
a constant width font.


-.-.

Only one space character is after a possible end of sentence
(after a punctuation, that can end a sentence).

[List of affected lines removed.]

-.-.

Add lines to use the CR font for groff instead of CW.

.if t \{\
.  ie \\n(.g .ft CR
.  el .ft CW
.\}


11:.ft CW

-.-.

.\" Define a fallback for font CW with

.if \n(.g \{\
.  ie t .ftr CW CR
.  el .ftr CW R
.\}

[List of affected lines removed.]

-.-.

Put a (long) web address on a new output line to reduce the possibility of
splitting the address between two output lines.
Or inhibit hyphenation with "\%" in front of the name.


582:please see <http://bugzilla.gnome.org/show_bug.cgi?id=325533> for more details.
632:exploits of the type described at  (<http://searchsecuritychannel.techtarget.com/generic/0,295582,sid97_gci1304703,00.html>), where a service is tricked to expose its private data by letting it parse a

-.-.

Output from "test-nroff  -mandoc -t -K utf8 -rF0 -rHY=0 -rCHECKSTYLE=0 -ww -z ":

troff:<stdin>:632: warning [page 1, line 549]: cannot break line in l adjust mode; overset by 19n

-.-

Generally:

Split (sometimes) lines after a punctuation mark; before a conjunction.
-------------- next part --------------

--- XML::LibXML::Parser.3pm	2025-11-15 01:24:10.850990093 +0000
+++ XML::LibXML::Parser.3pm.new	2025-11-16 23:58:31.885215011 +0000
@@ -4,7 +4,7 @@
 .\" Standard preamble:
 .\" ========================================================================
 .de Sp \" Vertical space (when we can't use .PP)
-.if t .sp .5v
+.if t .sp 0.5v
 .if n .sp
 ..
 .de Vb \" Begin verbatim text
@@ -224,7 +224,7 @@ Such HTML documents should be parsed usi
 .PP
 The functions described above are implemented to parse well formed documents.
 In some cases a program gets well balanced XML instead of well formed documents
-(e.g. an XML fragment from a database). With XML::LibXML it is not required to
+(e.g.\& an XML fragment from a database). With XML::LibXML it is not required to
 wrap such fragments in the code, because XML::LibXML is capable even to parse
 well balanced XML fragments.
 .IP parse_balanced_chunk 4
@@ -281,7 +281,7 @@ This is an alias to process_xincludes, b
 .Sp
 This function parses an XML document from a file or network; \f(CW$xmlfilename\fR can
 be either a filename or an URL. Note that for parsing files, this function is
-the fastest choice, about 6\-8 times faster then \fBparse_fh()\fR.
+the fastest choice, about 6\(en8 times faster then \fBparse_fh()\fR.
 .IP parse_fh 4
 .IX Item "parse_fh"
 .Vb 1
@@ -353,7 +353,7 @@ a given source the push parser waits for
 .PP
 This allows one to parse large documents without waiting for the parser to
 finish. The interface is especially useful if a program needs to pre-process
-the incoming pieces of XML (e.g. to detect document boundaries).
+the incoming pieces of XML (e.g.\& to detect document boundaries).
 .PP
 While XML::LibXML parse_*() functions force the data to be a well-formed XML,
 the push parser will take any arbitrary string that contains some XML data. The
@@ -375,7 +375,7 @@ In XML::LibXML this is done by a single
 \&\fBparse_chunk()\fR tries to parse a given chunk of data, which isn't necessarily
 well balanced data. The function takes two parameters: The chunk of data as a
 string and optional a termination flag. If the termination flag is set to a
-true value (e.g. 1), the parsing will be stopped and the resulting document
+true value (e.g.\& 1), the parsing will be stopped and the resulting document
 will be returned as the following example describes:
 .Sp
 .Vb 5
@@ -549,7 +549,9 @@ used.
 Each of the flags listed below is labeled
 .IP /parser/ 4
 .IX Item "/parser/"
-if it can be used with a \f(CW\*(C`XML::LibXML\*(C'\fR parser object (i.e. passed to \f(CW\*(C`XML::LibXML\->new\*(C'\fR, \f(CW\*(C`XML::LibXML\->set_option\*(C'\fR, etc.)
+if it can be used with a \f(CW\*(C`XML::LibXML\*(C'\fR parser object (i.e. passed to
+.br
+\f(CW\*(C`XML::LibXML\->new\*(C'\fR, \f(CW\*(C`XML::LibXML\->set_option\*(C'\fR, etc.)
 .IP /html/ 4
 .IX Item "/html/"
 if it can be used passed to the \f(CW\*(C`parse_html_*\*(C'\fR methods
@@ -628,9 +630,20 @@ content of an external entity. It is cal
 (URI) and the public ID. The value returned by the subroutine is parsed as the
 content of the entity.
 .Sp
-This method can be used to completely disable entity loading, e.g. to prevent
-exploits of the type described at  (<http://searchsecuritychannel.techtarget.com/generic/0,295582,sid97_gci1304703,00.html>), where a service is tricked to expose its private data by letting it parse a
-remote file (RSS feed) that contains an entity reference to a local file (e.g. \f(CW\*(C`/etc/fstab\*(C'\fR).
+This method can be used to completely disable entity loading, e.g.\& to prevent
+exploits of the type described at
+.br
+.ie t \{\
+(<http://searchsecuritychannel.techtarget.com/generic/0,295582,sid97_gci1304703,00.html>),
+.\}
+.el \{\
+(<http://searchsecuritychannel.techtarget.com/generic/
+0,295582,sid97_gci1304703,00.html>),
+.\}
+.br
+where a service is tricked to expose its private data by letting it parse a
+remote file (RSS feed) that contains an entity reference to a local file
+(e.g.\& \f(CW\*(C`/etc/fstab\*(C'\fR).
 .Sp
 A more granular solution to this problem, however, is provided by custom URL
 resolvers, as in
@@ -833,7 +846,7 @@ Loads the XML catalog file \f(CW$catalog
 .Sp
 .Vb 2
 \&  # Global external entity loader (similar to ext_ent_handler option
-\&  # but this works really globally, also in XML::LibXSLT include etc..)
+\&  # but this works really globally, also in XML::LibXSLT include etc...)
 \&
 \&  XML::LibXML::externalEntityLoader(\e&my_loader);
 .Ve
@@ -865,11 +878,11 @@ Petr Pajas
 2.0134
 .SH COPYRIGHT
 .IX Header "COPYRIGHT"
-2001\-2007, AxKit.com Ltd.
+2001\(en2007, AxKit.com Ltd.
 .PP
-2002\-2006, Christian Glahn.
+2002\(en2006, Christian Glahn.
 .PP
-2006\-2009, Petr Pajas.
+2006\(en2009, Petr Pajas.
 .SH LICENSE
 .IX Header "LICENSE"
 This program is free software; you can redistribute it and/or modify it under
-------------- next part --------------
  Any program (person), that produces man pages, should check the output
for defects by using (both groff and nroff)

[gn]roff -mandoc -t -ww -b -z -K utf8 <man page>

  To find trailing space use

grep -n -e ' $' -e ' \\f.$' -e ' \\"' <man page>

  The same goes for man pages that are used as an input.
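
  As a sketch, the grep call can be wrapped in a shell function;
the name "trailing_spaces" is illustrative.
The three patterns match a space at the end of a line,
a space before a two-character font escape (such as "\fR") at the end
of a line, and a space before a roff comment.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative).

# List, with line numbers, the lines that contain trailing spaces:
#   ' $'      a space at the end of the line
#   ' \\f.$'  a space before a font escape such as \fR at the end
#   ' \\"'    a space before a roff comment (\")
trailing_spaces() {
    grep -n -e ' $' -e ' \\f.$' -e ' \\"' "$1"
}
```

  The exit status is that of grep: 0 if such lines were found, 1 if not.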

-.-

  For a style guide use

  mandoc -T lint

-.-

  For general input conventions consult the man page "nroff(7)" (item
"Input conventions") or the Texinfo manual about the same item.

-.-

  Any "autogenerator" should check its products with the above mentioned
'groff', 'mandoc', and additionally with 'nroff ...'.

  It should also check its input files for too long (> 80) lines.

  This is just a simple quality control measure.

  The "autogenerator" may have to be corrected to get a better man page,
and so may the source file and any additional file.

-.-

  Common defects:

  Not removing trailing spaces (in in- and output).
  The reason for these trailing spaces should be found and eliminated.

  "git" has a "tool" to point out whitespace,
see for example "git-apply(1)" and "git-config(1)".

-.-

  Not beginning each input sentence on a new line.

Line length and patch size should thus be reduced when that has been fixed.

  The script "reportbug" uses 'quoted-printable' encoding when a line is
longer than 1024 characters in an 'ascii' file.

  See man-pages(7), item "semantic newline".

-.-

The difference between the formatted output of the original
and patched file can be seen with:

  nroff -mandoc <file1> > <out1>
  nroff -mandoc <file2> > <out2>
  diff -d -u <out1> <out2>

and for groff, using

"printf '%s\n%s\n' '.kern 0' '.ss 12 0' | groff -mandoc -Z - "

instead of 'nroff -mandoc'

  Add the option '-t', if the file contains a table.

  Read the output from 'diff -d -u ...' with 'less -R' or similar.
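
  The nroff steps above can be sketched as one shell function.
The function name "diff_formatted" and the variable "FORMATTER" are
assumptions of this sketch, not part of groff or man-db;
"FORMATTER" defaults to "nroff -mandoc" and can be set to another
command that takes the file name as its last argument.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative).

# Format two versions of a man page and show a unified diff of the
# results.  The formatter command is taken from the variable FORMATTER
# (an assumption of this sketch) and defaults to "nroff -mandoc".
# Returns the exit status of diff: 0 if the outputs are identical,
# 1 if they differ.
diff_formatted() {
    out1=$(mktemp) || return 2
    out2=$(mktemp) || { rm -f "$out1"; return 2; }
    # FORMATTER is left unquoted on purpose, so that it may carry options.
    ${FORMATTER:-nroff -mandoc} "$1" > "$out1"
    ${FORMATTER:-nroff -mandoc} "$2" > "$out2"
    diff -d -u "$out1" "$out2"
    status=$?
    rm -f "$out1" "$out2"
    return "$status"
}
```

  A call would look like "diff_formatted <file1> <file2> | less -R".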

-.-.

  If 'man' (man-db) is used to check the manual for warnings,
the following must be set:

  The option "--warnings=w"

  The environment variable:

export MAN_KEEP_STDERR=yes (or any non-empty value)

  or

  (produce only warnings):

export MANROFFOPT="-ww -b -z"

export MAN_KEEP_STDERR=yes (or any non-empty value)

-.-