Wednesday, March 13, 2013

Autoremove unreachable node from balanced DNS

When you offer a high-availability service, you often balance users among nodes using DNS. The problem with DNS is propagation time, so a quick response to a node failure is very important. That is why I developed a tiny script that checks the status of the balanced nodes for the bind9 DNS server.

The script needs a small addition to your bind9 zone files. This is the syntax:
[..] 3600 IN A 258.421.3.125

;; BALANCED_AUTOCHECK::80
balancer-www 120 IN A 258.421.3.182
balancer-www 120 IN A 258.421.3.183

balancer-http 3600 IN CNAME

Here we have the subdomain balancer-www balanced between two hosts with the IPs 258.421.3.182 and 258.421.3.183. Just above these A records goes the "code" that tells the script how to proceed. The syntax is simple: ;; BALANCED_AUTOCHECK::<service_port>. The BALANCED_AUTOCHECK part is only a matching pattern, and <service_port> is the port of the service to check. In the above example, we are checking a balancer for an HTTP service, so we use port 80.

NOTE: In case you are interested, the rule is matched with the regular expression: ;;\s*BALANCED_AUTOCHECK::\d+$
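
To make the matching rule concrete, here is a minimal Python sketch of how such a marker line could be recognized. It is not the script's actual code: the function name marker_port is hypothetical, the capturing group around the port is added for illustration, and using search() rather than match() is an assumption.

```python
import re

# Pattern quoted in the post; the parentheses around \d+ are an addition
# for this sketch so the port can be extracted.
MARKER_RE = re.compile(r";;\s*BALANCED_AUTOCHECK::(\d+)$")


def marker_port(line):
    """Return the port named by a marker line, or None for any other line.

    Hypothetical helper: the real script may parse zone files differently.
    """
    match = MARKER_RE.search(line.rstrip())
    return int(match.group(1)) if match else None
```

Because the pattern ends with $, anything after the port number (other than trailing whitespace, which this sketch strips) prevents a match.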

Please keep in mind that no protocol-level check is made (e.g. HTTP 400 errors), only a plain socket connection. If the socket connection fails, the IP is marked as down by commenting out its A record; when a recovery is detected, the A record is uncommented.
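
The check-and-toggle idea can be sketched roughly as follows. This is an illustration under assumptions, not the script itself: the names is_reachable and update_record are hypothetical, and I assume here that a record is commented out with a single leading ";" (bind9's comment character).

```python
import socket


def is_reachable(host, port, timeout=5.0):
    """Plain TCP connect: True if the port accepts a connection in time.

    No protocol-level check is done, so an HTTP server answering with
    errors still counts as up.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False


def update_record(line, reachable):
    """Return the A record line commented out (down) or uncommented (up).

    Assumption: a single leading ';' marks a record as disabled. Only
    pass A record lines here, never the ';; BALANCED_AUTOCHECK' marker.
    """
    stripped = line.lstrip()
    commented = stripped.startswith(";")
    if reachable and commented:
        return stripped.lstrip(";").lstrip()  # host recovered
    if not reachable and not commented:
        return ";" + line                     # host went down
    return line                               # state unchanged
```

A caller would compare the returned line with the original to decide whether the zone file changed and whether the reload command needs to run.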

Here is the help output of the script:
Usage: bind9_check_balancer [options] [dns_files]

  -h, --help            show this help message and exit
  -c COMMAND, --command=COMMAND
                        Command to be executed if changes are made.
                        (example:'service bind9 restart')
  -t TIMEOUT, --timeout=TIMEOUT
                        Socket timeout for unreachable hosts in seconds.
                        Default: 5
I think it explains quite well how it works but, just in case, here are some examples:

# Check and, if a change is made
# (down/recover detected), exec the command: /etc/init.d/bind9 reload
bind9_check_balancer -c '/etc/init.d/bind9 reload' /etc/named/ /etc/bind/

# Check all files in /etc/named/zones directory and set a timeout
# of 1 second for connection checks. Also exec: service bind9 reload
bind9_check_balancer -t 1 -c 'service bind9 reload' /etc/named/zones/*

This script is intended to be run as a cron job every minute or every 5 minutes (or as often as you want). You can get the script from the dgtool GitHub repository at:

1 comment:

  1. I found a little problem when parsing domains with a dash. Sorry if anybody was using this. It is fixed now.