Bug#342619: [exim] Potential logic error in retry handling for IPv4+IPv6 hosts

Mon Dec 19 02:33:32 UTC 2005

(Please cc me on all replies, I'm not subscribed)

On Sat, Dec 17, 2005 at 05:32:15PM +0100, Florian Weimer wrote:
> * Marc Haber:
> 
> > This is, btw, not an ipv6 issue exclusively, it might happen in
> > ipv4-only setups as well. See Debian Bug #342619 for another example.
> 
> I'm not sure if it's the same bug, and I wouldn't be surprised if the
> behavior was deliberate in that case (after all, the whole "long
> failure period" business is there to generate immediate bounces, so
> that users won't have to wait for five days until they are told about
> their mistake).
> 
> In the example in the bug report, there are two A RRs:
> 
> mailrelay.direct-adsl.nl. 86400 IN      A       195.121.6.12
> mailrelay.direct-adsl.nl. 86400 IN      A       195.121.6.56
> 
> But resolvers MUST cache the whole set of records and expire them at
> the same time.  If the resolver fails to do this properly and provides
> a wrong view on DNS, there is no workaround on Exim's side.

The DNS setup has changed in the meantime because of what I now know was a
wrong guess at the cause of the failure. The old setup had MXs from several
different zones. The one MX that had a long failure was also served by the
DNS server of the mail server, but the secondary MXs, to which the mail
should have been delivered, were only served by remote DNS servers. So the
DNS server in question at times had only the broken MX cached (well, it was
even authoritative for it), and only that one was in the additional section;
the IP addresses of the working MXs had dropped out of the cache.

The only solution, it seems to me, is to actively query for all A (and AAAA)
records of all MXs before determining that no MXs are available for
delivery -- the additional DNS section cannot be trusted to ever give an
exhaustive list of IP addresses to try. I'd even say that this needs to
happen at every delivery attempt once delivery to all the MXs in the
additional section has been found to be unsuccessful, because some MXs can
appear in it more often than others. And you don't want delivery to fail
just because all MXs happen to be down at the final attempt when some of
them were up during the past 4 days.
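
To make the proposal above concrete, here is a minimal sketch in Python. It
is not Exim code, and every name in it is an assumption made for
illustration: `resolve` stands for some hypothetical resolver callable that
takes a host name and a record type and returns a list of address strings,
and `additional` is whatever the additional section happened to contain. The
point is only that the additional section is treated as a hint, while every
MX host is still queried explicitly for A and AAAA records.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical resolver interface (an assumption, not a real library API):
# takes (hostname, rrtype) and returns a list of address strings, or an
# empty list if there are no records of that type.
Resolver = Callable[[str, str], List[str]]


def addresses_for_mx_hosts(
    mx_hosts: Iterable[str],
    additional: Dict[str, List[str]],
    resolve: Resolver,
) -> Dict[str, List[str]]:
    """Collect every known address for each MX host.

    Addresses found in the additional section are kept, but they are never
    assumed to be exhaustive: each host is also queried for A and AAAA.
    """
    result: Dict[str, List[str]] = {}
    for host in mx_hosts:
        # Start from the (possibly incomplete) additional-section data.
        addrs = list(additional.get(host, []))
        for rrtype in ("A", "AAAA"):
            for addr in resolve(host, rrtype):
                if addr not in addrs:
                    addrs.append(addr)
        result[host] = addrs
    return result
```

With such a helper, a host would only be considered to have no usable
addresses if the explicit queries also come back empty, not merely because
it was missing from the additional section.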

Note that RFC 974, MAIL ROUTING AND THE DOMAIN SYSTEM, from 1986 (predating
IPv6 by 12 years), explicitly warns against wrong handling of the DNS
additional section for MX queries:

|  The incomplete data problem also requires some care when handling
|  domain queries.  If the answer section of a query is incomplete
|  critical MX RRs may be left out.  This may result in mail looping, or
|  in a message being mistakenly labelled undeliverable.  As a result,
|  mailers may only accept responses from the domain system which have
|  complete answer sections.  Note that this entire problem can be
|  avoided by only using virtual circuits for queries, but since this
|  situation is likely to be very rare and datagrams are the preferred
|  way to interact with the domain system, implementors should probably
|  just ensure that their mailer will repeat a query with virtual
|  circuits should the truncation bit ever be set.
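
The retry that the RFC recommends can be sketched roughly as follows, again
in Python and again only as an illustration. The truncation (TC) bit is the
0x0200 bit of the flags field in the 12-byte DNS header. The two transport
callables, `send_udp` and `send_tcp`, are hypothetical placeholders for
whatever actually sends the query; each is assumed to take the raw query and
return the raw response.

```python
import struct
from typing import Callable

# Truncation bit in the 16-bit flags field of the DNS header.
TC_FLAG = 0x0200

# Hypothetical transport: takes a raw query, returns a raw response.
Transport = Callable[[bytes], bytes]


def is_truncated(response: bytes) -> bool:
    """Report whether a raw DNS response has the TC bit set."""
    if len(response) < 12:  # a DNS header is 12 bytes long
        raise ValueError("response shorter than a DNS header")
    # The header starts with the 16-bit message ID, then the 16-bit flags.
    _msg_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


def query_with_tcp_retry(
    query: bytes, send_udp: Transport, send_tcp: Transport
) -> bytes:
    """Try a datagram first; repeat over a virtual circuit if truncated."""
    response = send_udp(query)
    if is_truncated(response):
        response = send_tcp(query)
    return response
```

This only covers the truncation case the RFC describes; it does nothing
about answers that are incomplete for other reasons.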

Admittedly, the RFC only mentions problems related to DNS datagram
truncation. It does not mention the issue that is more relevant here:
incomplete answers due to unbalanced caching, a problem that can only
happen with MXs from multiple zones or with IPv6 (or with explicitly
unbalanced TTLs in one zone, but I consider *that* very unlikely).

--Jeroen

-- 
Jeroen van Wolffelaar
jeroen at wolffelaar.nl
http://jeroen.A-Eskwadraat.nl