I am very fond of the exim MTA. Its extremely flexible
configuration file format may just be Turing-complete, and it allows
me to play with my email in just about any way I like.
In one way, though, exim is strangely limited. When I think of the
many intersections between email and DNS, I expect an MTA to have many
varied ways to access DNS information. Exim uses DNS widely, of
course, as it must, but it can only access it the “bog standard” way,
through the stub resolver in the C runtime (i.e. the gethostbyname
library function). It doesn’t use any of the modern alternative DNS
libraries such as ares. As a consequence, it is not possible
to specify an alternative nameserver, nor an alternative port number
for DNS queries.
Another place where I often find exim slightly deficient is in the
choices it offers for storing data sets of IP addresses (usually
called lookups in the exim world). Such data sets are needed,
to give just the most basic example, for blocking SMTP connections
from known active spam bots. There are really only two ways of storing
such data built directly into exim: flat files accessed via linear
search, and hashed key-value stores such as the legendary Berkeley DB,
now owned and maintained by Oracle.
But hashing is not the most natural data structure for IP address data.
For most purposes one doesn’t want to store individual addresses but
rather CIDR ranges, i.e. parts of IP space carved out by fixing some number
of most significant bits and letting the least significant bits vary.
And this shape of data is a great fit for a trie, or better a radix tree,
which is basically a trie cleverly optimized by path compression.
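To make the idea concrete, here is a minimal sketch of a binary trie holding IPv4 CIDR ranges, with a lookup that returns the value of the most specific range covering an address. The class name CidrTrie and its methods are made up for this illustration, and the sketch leaves out the path compression that a real radix tree would add.

```python
import ipaddress


class CidrTrie:
    """Uncompressed binary trie keyed on the leading bits of IPv4 addresses."""

    def __init__(self):
        # Each node is a dict with optional children under keys 0 and 1,
        # and an optional "value" entry if a CIDR range ends at that node.
        self.root = {}

    def insert(self, cidr, value=True):
        net = ipaddress.IPv4Network(cidr)
        bits = int(net.network_address)
        node = self.root
        # Walk only the fixed (most significant) bits of the range.
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1
            node = node.setdefault(bit, {})
        node["value"] = value

    def lookup(self, address):
        bits = int(ipaddress.IPv4Address(address))
        node = self.root
        best = node.get("value")  # covers a 0.0.0.0/0 entry
        for i in range(32):
            bit = (bits >> (31 - i)) & 1
            node = node.get(bit)
            if node is None:
                break
            if "value" in node:
                best = node["value"]  # a longer prefix wins
        return best


trie = CidrTrie()
trie.insert("12.34.0.0/16", "listed")
trie.insert("12.34.56.0/24", "listed-worse")
print(trie.lookup("12.34.56.78"))
print(trie.lookup("8.8.8.8"))
```

A lookup costs at most 32 steps regardless of how many ranges are stored, which is the property that makes this shape attractive compared with a linear scan of a flat file.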
corkipset
One library that offers this kind of storage and functions to query it
is corkipset. But there are still two problems with getting
exim to use it:
- Unless one’s ready to hack on the exim code locally, it’s necessary to
use “dlopening”, i.e. runtime loading of shared libraries. This has
uncomfortable security implications, as in most installations exim runs
setuid root, for rather complex reasons.
- Because the exim daemon spawns a worker process to handle each email
message, and each worker process calls the exec syscall to restart
the exim program image from scratch, no data can stay permanently
loaded and accessible to the worker processes. Instead each worker process
must open the necessary data files anew. This is not a problem for
small files and pieces of data but is clearly inefficient for huge
data sets such as spammer block lists, especially when they must be parsed
anew every time from a human-readable format.
rbldnsd
rbldnsd is a nice solution to the last problem described above.
It is an extremely lightweight daemon which stores sets of CIDR ranges
in its working memory and answers queries about them on a socket. But,
conveniently or not depending on situation and point of view, the socket
is a UDP one and the protocol spoken is DNS. For example, to query whether
the address 12.34.56.78 is in the dataset, or to get the value
associated with it, we must make a DNS A record query for
78.56.34.12.ds.rbl.example.net, where ds.rbl.example.net is the DNS zone
for which rbldnsd is made authoritative. As a result, if we wanted to query
rbldnsd directly from exim, using the native way exim does DNS, we’d have to
let rbldnsd use port 53 for its listening socket, and make the
listening address the systemwide resolver address, i.e. put it on
the nameserver line in /etc/resolv.conf. This is clearly impossible:
rbldnsd is authoritative only for its own zones and does no recursion,
so every other DNS lookup on the system would fail.
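The mapping from an address to the name being queried is simple enough to sketch in a few lines. The helper name rbl_query_name is hypothetical and the sketch handles IPv4 only; it just reverses the octets and appends the zone.

```python
import ipaddress


def rbl_query_name(address, zone):
    """Build the DNS name to query for an IPv4 address in a DNSBL-style zone."""
    ip = ipaddress.IPv4Address(address)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(ip).split(".")))
    return f"{reversed_octets}.{zone}"


print(rbl_query_name("12.34.56.78", "ds.rbl.example.net"))
# prints 78.56.34.12.ds.rbl.example.net
```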
Stub zones
Fortunately there is a workaround. Recursive DNS resolver daemons such
as unbound have a concept of stub zones. For example, we can
configure the ds.rbl.example.net zone to be a stub zone in unbound,
which means unbound will send any query for the zone to a fixed
nameserver of our choosing if it cannot find the answer in its own cache. The only
remaining problem is to find an address and a port for rbldnsd to listen
on such that it doesn’t clash with the system resolver. This can be done
in a couple of different ways; I have found it easiest to use an IPv6
address, because I have a huge supply of them while I only have a single
IPv4 address for the only interface on my VPS (other than loopback).
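For illustration, a stub zone of this kind might look roughly as follows in unbound.conf. This is only a sketch under assumptions: the address 2001:db8::53 (taken from the IPv6 documentation prefix) and the port 5353 are placeholders for wherever rbldnsd actually listens, and the domain-insecure line only matters if unbound is doing DNSSEC validation, since a zone served by rbldnsd is unsigned.

```
server:
    # Assumption: relevant only when unbound validates DNSSEC;
    # the zone served by rbldnsd is not signed.
    domain-insecure: "ds.rbl.example.net"

stub-zone:
    name: "ds.rbl.example.net"
    # Placeholder address and port; '@' introduces a non-default port.
    stub-addr: 2001:db8::53@5353
```

With something like this in place, exim keeps talking to the system resolver as usual, and only queries under ds.rbl.example.net end up at rbldnsd.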