Wednesday, March 13, 2013

Autoremove unreachable node from balanced DNS

When you offer a high-availability service, you often balance users among nodes using DNS. The problem with DNS is propagation time: resolvers keep serving cached records until their TTL expires, so when a node fails, a quick response is very important. That is why I developed a tiny script for the bind9 DNS server that checks the status of the balanced nodes.

The script needs a small modification to your bind9 zone files. This is the syntax:
[..]

ftp.example.com 3600 IN A 258.421.3.125

;; BALANCED_AUTOCHECK::80
balancer-www 120 IN A 258.421.3.182
balancer-www 120 IN A 258.421.3.183

balancer-http 3600 IN CNAME balancer-www.example.com.

[..]
Here the subdomain balancer-www is balanced between two hosts with IPs 258.421.3.182 and 258.421.3.183. Just above these A records is the marker that the script needs in order to know how to proceed. The syntax is simple: ;; BALANCED_AUTOCHECK::<service_port>. The BALANCED_AUTOCHECK part is only a matching pattern, and <service_port> is the port of the service to check. In the example above we are checking a balancer for an HTTP service, so we use port 80.

NOTE: For reference, the marker line is matched with the regular expression: ;;\s*BALANCED_AUTOCHECK::\d+$
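
To illustrate how that pattern could be applied, here is a small Python sketch. It is not taken from the script itself: the helper name find_autocheck_port is hypothetical, and the capture group around the port is added only so the example can return the number.

```python
import re

# Pattern quoted in the note above. The parentheses around \d+ are an
# addition for this example so the port can be extracted.
MARKER_RE = re.compile(r";;\s*BALANCED_AUTOCHECK::(\d+)$")


def find_autocheck_port(line):
    """Return the service port if `line` is a marker comment, else None.

    Hypothetical helper; the real script may parse the marker differently.
    """
    match = MARKER_RE.match(line.strip())
    return int(match.group(1)) if match else None
```

With this sketch, a line such as ";; BALANCED_AUTOCHECK::80" yields the port 80, while an ordinary A record yields None.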

Please keep in mind that no protocol-level check is made (e.g., for HTTP 400 errors); the script only attempts a plain socket connection. If the connection fails, the IP is marked as down by commenting out its A record; if a recovery is detected, the A record is uncommented.
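
The check-and-toggle behaviour just described could be sketched roughly as follows. This is only an approximation under stated assumptions, not the script's actual code: the function names is_reachable and toggle_record are hypothetical, and I am assuming a record is commented out by prefixing it with a single ";" (the zone-file comment character).

```python
import socket


def is_reachable(ip, port, timeout=5.0):
    """Return True if a plain TCP connection to (ip, port) succeeds.

    No protocol check is done; only the socket connection is attempted.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def toggle_record(line, up):
    """Comment out an A record when the host is down, uncomment it when up.

    Assumption: a disabled record is marked with one leading ';'.
    """
    stripped = line.lstrip()
    if up:
        # Remove the leading ';' only if the record is currently commented.
        return stripped[1:].lstrip() if stripped.startswith(";") else line
    # Avoid adding a second ';' to a record that is already commented.
    return line if stripped.startswith(";") else ";" + line
```

A caller would run is_reachable for each A record under a marker, using the port from the marker line, and rewrite the zone file only if toggle_record changed at least one line.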

Here is the help output of the script:
Usage: bind9_check_balancer [options] [dns_files]

Options:
  -h, --help          show this help message and exit
  -c COMMAND, --command=COMMAND
                        Command to be executed if changes are made.
                        (example:'service bind9 restart')
  -t TIMEOUT, --timeout=TIMEOUT
                        Socket timeout for unreachable hosts in seconds.
                        Default: 5
I think it explains quite well how it works but, just in case, here are some examples:

# Check test.net.hosts and test.com.hosts. If a change is made
# (down/recover detected), exec the command: /etc/init.d/bind9 reload
bind9_check_balancer -c '/etc/init.d/bind9 reload' /etc/named/test.net.hosts /etc/bind/test.com.hosts

# Check all files in /etc/named/zones directory and set a timeout
# of 1 second for connection checks. Also exec: service bind9 reload
bind9_check_balancer -t 1 -c 'service bind9 reload' /etc/named/zones/*



This script is intended to be run as a cron job every minute or every 5 minutes (or however often you want). You can get the script from the dgtool GitHub repository at: https://github.com/diego-XA/dgtool/blob/master/ha/bind9_check_balancer