I was trying to load-balance some packets...

neilschelly's picture

I was trying to load-balance some syslog packets. A log receiver (running Logstash, yay!) was overflowing its socket buffers: syslog packets came in at such volume and with so much burstiness that some were inevitably getting dropped. What to do...

First idea... More Socket Buffers!
The obvious move: make the socket buffers bigger. I bumped some sysctl values to give each listening socket 8M of receive buffer, up from the far smaller Linux default, and raised the kernel's incoming packet backlog while I was at it. This helped most of my relays, but some were still getting overwhelmed.

net.core.rmem_max=8388608
net.core.rmem_default=8388608
net.core.netdev_max_backlog=2000

I didn't want to just keep increasing this for two reasons:

  • It didn't help. I tried, and nothing got better when I went beyond 8M.
  • rmem_default applies to every socket on the system, and at some point you don't want all of them to have huge buffers. I don't know where that limit is, but I didn't want to find it.
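
One way to tell whether a bigger buffer is actually helping is to watch the kernel's own UDP drop counter. The sketch below assumes a Linux box where /proc/net/snmp has a pair of "Udp:" rows (a header row and a value row) that include a RcvbufErrors column. The function name udp_rcvbuf_errors is made up for this example.

```shell
#!/bin/sh
# Print the UDP RcvbufErrors counter from /proc/net/snmp-style text on stdin.
# The first "Udp:" row holds the column names, the second holds the values.
udp_rcvbuf_errors() {
    awk '
        $1 == "Udp:" && !have_header {
            # Header row: find which column is RcvbufErrors.
            for (i = 2; i <= NF; i++) if ($i == "RcvbufErrors") col = i
            have_header = 1
            next
        }
        $1 == "Udp:" && col { print $col; exit }
    '
}

# Typical use on the log receiver (run it twice and compare):
#   udp_rcvbuf_errors < /proc/net/snmp
```

If the number keeps climbing while traffic is flowing, the socket is still overflowing no matter what the sysctl values say.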

Second idea... IPTables Will Save Me!
Next, I thought I'd use iptables to load-balance the packets across multiple incoming sockets. I was receiving this traffic as UDP packets on port 514, so I set up Logstash to also listen on 515-518 for the same types of traffic. I configured IPTables to do something like this:

/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 4 --packet 0 -j DNAT --to-destination 127.0.0.1:515
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 4 --packet 1 -j DNAT --to-destination 127.0.0.1:516
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 4 --packet 2 -j DNAT --to-destination 127.0.0.1:517
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 4 --packet 3 -j DNAT --to-destination 127.0.0.1:518

This totally looks like the answer. It wasn't. Lots of packet loss resulted. I can't quite confirm whether this used to be the correct way to use the statistic module in iptables, but it definitely isn't anymore. On first read, it looks fine. What it really does is send every 4th packet to the first target, every 4th of the packets that are left to the second target, every 4th of what's left after that to the third target, and so on. Each target gets a shrinking fraction of the load, and a bunch of packets are left over on port 514.
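
To put numbers on that, here is a rough simulation of those four rules. It is a simplified model, not the kernel's actual code: it assumes each rule keeps a private counter, counts only the packets that reach it, and matches one packet in every N. The function name simulate_nth and the EVERY:PACKET argument format are invented for this sketch.

```shell
#!/bin/sh
# Simplified model of a chain of "-m statistic --mode nth" DNAT rules.
# Usage: simulate_nth TOTAL EVERY:PACKET [EVERY:PACKET ...]
# Prints the hits per rule, then the number of packets no rule matched.
simulate_nth() {
    total=$1
    shift
    awk -v total="$total" -v specs="$*" '
        BEGIN {
            n = split(specs, spec, " ")
            for (i = 1; i <= n; i++) {
                split(spec[i], part, ":")
                every[i] = part[1]; offset[i] = part[2]
                seen[i] = 0; hits[i] = 0
            }
            unmatched = 0
            for (p = 0; p < total; p++) {
                matched = 0
                for (i = 1; i <= n; i++) {
                    # A rule only counts the packets that get as far as it.
                    matched = (seen[i] % every[i] == offset[i])
                    seen[i]++
                    if (matched) { hits[i]++; break }  # DNAT ends the walk
                }
                if (!matched) unmatched++
            }
            out = ""
            for (i = 1; i <= n; i++) out = out hits[i] " "
            print out unmatched
        }'
}

# The four rules above: all "--every 4", with --packet 0 through 3.
simulate_nth 256 4:0 4:1 4:2 4:3
# → 64 48 36 27 81
```

Out of 256 packets, the targets get 64, 48, 36 and 27, and 81 packets (almost a third) never match any rule and stay on port 514.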

That explains the traffic I saw on port 514 at least, but I wasn't getting any traffic on 515-518 as expected. Those packets were getting lost somewhere. It turns out that 127.0.0.1 gets special treatment: the kernel normally refuses to route packets that arrive from the network to a loopback address, so the DNAT'd packets were silently dropped. If you attach an IP, say 10.1.2.3, to your loopback interface with something like `ip addr add 10.1.2.3/32 dev lo`, you can direct traffic to it. The following iptables rules fix both the packet balancing and the destination address:

/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 4 --packet 0 -j DNAT --to-destination 10.1.2.3:515
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 3 --packet 0 -j DNAT --to-destination 10.1.2.3:516
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 2 --packet 0 -j DNAT --to-destination 10.1.2.3:517
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 1 --packet 0 -j DNAT --to-destination 10.1.2.3:518

It's important to recognize that each use of the statistic module keeps its own counter, and since a matching rule jumps the packet off to its new destination, each rule counts only the packets that actually reach it. Now, the first rule will match the 1st packet of every 4, the second rule will match the first packet of every 3 that remain, the third rule will match every other packet of what's left, and the fourth rule will match everything remaining. And no more packets get dropped for being NAT'd to 127.0.0.1.
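
The countdown generalizes to any number of targets: with N targets, rule number i (counting from zero) needs --every N-i. Here is a small sketch that prints such a rule set rather than applying it. The function name gen_nth_rules and its argument order are my own invention; the iptables options are the same ones used above.

```shell
#!/bin/sh
# Print (not run) one DNAT rule per target so each gets an equal share.
# Each rule only sees what earlier rules left behind, so --every counts
# down from the number of targets to 1.
# Usage: gen_nth_rules LISTEN_PORT TARGET_IP:PORT [TARGET_IP:PORT ...]
gen_nth_rules() (
    listen_port=$1
    shift
    remaining=$#
    for target in "$@"; do
        echo "/sbin/iptables -t nat -A PREROUTING -p udp --dport ${listen_port}" \
             "-m statistic --mode nth --every ${remaining} --packet 0" \
             "-j DNAT --to-destination ${target}"
        remaining=$(( remaining - 1 ))
    done
)

gen_nth_rules 514 10.1.2.3:515 10.1.2.3:516 10.1.2.3:517 10.1.2.3:518
```

Called with four targets, it prints the four rules shown above. Review the output, then pipe it to sh as root if it looks right.
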
There's a nice advantage here over the first idea above, too. Now I can add targets on other machines and use IPTables as a load balancer across many machines! If you want to do that, you'll have to do a bit of extra work to allow forwarding packets to the other machines, like 1.2.3.4 in the next example, which is set up with the same listeners as localhost.

# Eight targets now, so the rules count down from --every 8 to --every 1.
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 8 --packet 0 -j DNAT --to-destination 10.1.2.3:515
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 7 --packet 0 -j DNAT --to-destination 10.1.2.3:516
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 6 --packet 0 -j DNAT --to-destination 10.1.2.3:517
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 5 --packet 0 -j DNAT --to-destination 10.1.2.3:518
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 4 --packet 0 -j DNAT --to-destination 1.2.3.4:515
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 3 --packet 0 -j DNAT --to-destination 1.2.3.4:516
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 2 --packet 0 -j DNAT --to-destination 1.2.3.4:517
/sbin/iptables -t nat -A PREROUTING -p udp --dport 514 -m statistic --mode nth --every 1 --packet 0 -j DNAT --to-destination 1.2.3.4:518
# By the time packets hit FORWARD they already carry their new destination port.
/sbin/iptables -A FORWARD -p udp --destination 1.2.3.4 --dport 515 -j ACCEPT
/sbin/iptables -A FORWARD -p udp --destination 1.2.3.4 --dport 516 -j ACCEPT
/sbin/iptables -A FORWARD -p udp --destination 1.2.3.4 --dport 517 -j ACCEPT
/sbin/iptables -A FORWARD -p udp --destination 1.2.3.4 --dport 518 -j ACCEPT
/sbin/sysctl net.ipv4.ip_forward=1

So it works, right? Yeah, that'd have been sweet!

Sorta... The problem now is that some machines are really squawky over syslog. Some machines are quite quiet. I want to load-balance the squawky ones, and I don't really care about the quiet ones. They aren't causing my load problems. A squawky one is going to keep using the same source and destination ports, so IPTables will consider its packets a "UDP connection". Only the first packet of a connection is run through the nat rules, and IPTables helps me out by making sure all the rest follow the same path. That's the opposite of load-balancing.

Third idea... IPTables Can Still Save Me!
Everything above would work great if IPTables would just stop creating a state entry for every UDP flow coming through. There's a target called NOTRACK that disables the automatic state creation in IPTables. Let's add some rules:

iptables -t raw -A PREROUTING -p udp --dport 514 -j NOTRACK
iptables -t raw -A PREROUTING -p udp --dport 515 -j NOTRACK
iptables -t raw -A PREROUTING -p udp --dport 516 -j NOTRACK
iptables -t raw -A PREROUTING -p udp --dport 517 -j NOTRACK
iptables -t raw -A PREROUTING -p udp --dport 518 -j NOTRACK

That does a lot more than prevent a state from being created. Good luck getting packets to reach a DNAT target in the PREROUTING chain of the nat table. Your packets will now skip the nat table entirely, because IPTables never sends packets without a state through it. Fundamentally, NAT in IPTables implies state, so there's no way to use DNAT to blast UDP packets through a firewall without creating state.
If you're like me, you'll be so close and want to spend hours toiling over diagrams on Google. It won't work. Back to the drawing board...

Seriously. IPVS FTW
The Linux Virtual Server Project has the answer I needed.

ipvsadm -A -u 1.2.3.5:514 -s rr -o
ipvsadm -a -u 1.2.3.5:514 -r 10.1.2.3:515 -m -w 1
ipvsadm -a -u 1.2.3.5:514 -r 10.1.2.3:516 -m -w 1
ipvsadm -a -u 1.2.3.5:514 -r 10.1.2.3:517 -m -w 1
ipvsadm -a -u 1.2.3.5:514 -r 10.1.2.3:518 -m -w 1
ipvsadm -a -u 1.2.3.5:514 -r 1.2.3.4:515 -m -w 1
ipvsadm -a -u 1.2.3.5:514 -r 1.2.3.4:516 -m -w 1
ipvsadm -a -u 1.2.3.5:514 -r 1.2.3.4:517 -m -w 1
ipvsadm -a -u 1.2.3.5:514 -r 1.2.3.4:518 -m -w 1

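The first command sets up a UDP virtual service on my listening address, 1.2.3.5 port 514, using the round-robin (rr) scheduler. Most importantly, it uses the -o (or --ops) option to turn on one-packet scheduling, so the scheduler picks a destination for every single packet instead of pinning a whole "connection" to one server. The remaining commands add each listener as a real server behind that service, with masquerading (-m) and a weight of 1. OPS requires a recent ipvsadm (version 1.26). Ubuntu 12.04 didn't have it, but it can be trivially backported, or you can upgrade to a newer release.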
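
If the list of listeners changes often, the same commands can be generated instead of typed. This is only a sketch that prints the ipvsadm lines for review; the function name gen_ipvs_rules and its argument layout are made up here, and the ipvsadm options are the ones already used above.

```shell
#!/bin/sh
# Print (not run) the ipvsadm commands for one UDP virtual service with
# one-packet scheduling and a real server for every host/port pair.
# Usage: gen_ipvs_rules VIP:PORT "HOST [HOST ...]" "PORT [PORT ...]"
gen_ipvs_rules() (
    vip=$1
    hosts=$2
    ports=$3
    echo "ipvsadm -A -u ${vip} -s rr -o"
    for host in $hosts; do
        for port in $ports; do
            echo "ipvsadm -a -u ${vip} -r ${host}:${port} -m -w 1"
        done
    done
)

gen_ipvs_rules 1.2.3.5:514 "10.1.2.3 1.2.3.4" "515 516 517 518"
```

With these arguments it prints the same nine commands listed above.
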
This is working great now.