Linux IPv6 FIB Bugs
Posted by prox, from Seattle, on February 16, 2019 at 12:50 local (server) time

tl;dr - I'm running into some FIB bug on Linux > 4.14

Ever since I upgraded excalibur (a bare-metal IPv6 router running BGP and tunnels) past Linux 4.14, I've been having some weird IPv6 FIB problems.  The host takes a few full IPv6 BGP feeds (currently ~60K routes) and installs them into the Linux FIB via FRR.  It also terminates a few OpenVPN and 6in4 tunnels for friends & family, and happens to host dax, the VM where prolixium.com web content is hosted.
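For context, the FRR side of this is nothing exotic.  Here's a minimal sketch of standing up one such IPv6 BGP session via vtysh (the ASNs and neighbor address are placeholders, not my real ones):

% sudo vtysh -c "conf t" \
    -c "router bgp 65000" \
    -c "neighbor 2001:db8::1 remote-as 65001" \
    -c "address-family ipv6 unicast" \
    -c "neighbor 2001:db8::1 activate"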

The problems started when I upgraded to the Linux 4.19 kernel packaged in Debian testing.  About 90 minutes after the reboot, once everything had converged, I started seeing reachability issues to some IPv6 destinations.  The routes were in the RIB (FRR) and in the FIB, but traffic was being bitbucketed.  Even directly connected routes were affected.  Here's excalibur's VirtualBox interface to dax going AWOL from an IPv6 perspective:

(excalibur:11:02:EST)% ip -6 addr show dev vboxnet0
15: vboxnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 2620:6:200f:3::1/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::800:27ff:fe00:0/64 scope link
       valid_lft forever preferred_lft forever
(excalibur:11:02:EST)% ip -6  ro |grep 2620:6:200f:3
2620:6:200f:3::/64 dev vboxnet0 proto kernel metric 256 pref medium
(excalibur:11:02:EST)% ip -6 route get 2620:6:200f:3::2
2620:6:200f:3::2 from :: dev vboxnet0 proto kernel src 2620:6:200f:3::1 metric 256 pref medium
(excalibur:11:02:EST)% ping6 -c4 2620:6:200f:3::2
PING 2620:6:200f:3::2(2620:6:200f:3::2) 56 data bytes

--- 2620:6:200f:3::2 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3ms 

(excalibur:11:02:EST)% 

In the above case, the route was there, and Netlink confirmed it, but no traffic would flow.  The fix here was to either bounce the interface, restart FRR, or reboot.
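For the record, "bounce the interface" means something like the following; the FRR restart assumes Debian's packaged systemd unit:

# ip link set vboxnet0 down && ip link set vboxnet0 up
# systemctl restart frr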

Other times Netlink provides a negative response:

(excalibur:12:30:EST)% ip -6 route get 2620:6:200e:8100::
RTNETLINK answers: Invalid argument. 
(excalibur:12:30:EST)% ip -6 ro|grep 2620:6:200e:8100::/56
2620:6:200e:8100::/56 via 2620:6:200e::2 dev tun2 proto static metric 20 pref medium

In this case, the route appeared to be there but Netlink had some issue when querying it.  Traffic to that prefix was being bitbucketed.  The fix was to re-add the static route in FRR:

(excalibur:12:32:EST)# vtysh -c "conf t" -c "no ipv6 route 2620:6:200e:8100::/56 2620:6:200e::2"
(excalibur:12:32:EST)# vtysh -c "conf t" -c "ipv6 route 2620:6:200e:8100::/56 2620:6:200e::2"
(excalibur:12:32:EST)% ip -6 route get 2620:6:200e:8100::
2620:6:200e:8100:: from :: via 2620:6:200e::2 dev tun2 proto static src 2620:6:200e::1 metric 20 pref medium

Downgrading from 4.19 to 4.16 seemed to make the situation much better, but didn't fix it completely.  Instead of ~50% of routes failing after 90 minutes, only a handful of prefixes break.  I'm not sure how many a handful is, but it's more than one.  I was running 4.14 for about 6 months without a problem, so I might just downgrade to that for now.

I did try reproducing this on a local VM running 4.19, FRR, and two BGP feeds, but the problem doesn't manifest there.  I'm wondering if it's traffic- or load-related, or maybe even related to the existence of tunnels.  I don't think it's FRR's fault, but it certainly might be doing something funny with its Netlink socket that triggers the kernel bug.  I also don't know how to debug this further, so I'm going to need to do some research.
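If I do dig in, the two obvious starting points I know of are watching FIB changes as the kernel announces them, and watching what zebra (FRR's kernel-interface daemon) actually sends over its Netlink socket.  A rough sketch:

# watch IPv6 route adds/deletes as the kernel reports them
% ip -6 monitor route

# Netlink messages ride on sendmsg/recvmsg, so this shows zebra's traffic
% sudo strace -f -e trace=sendmsg,recvmsg -p $(pidof zebra)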

Update 2019-02-16

I started playing with that local VM running 4.19 and can successfully cause IPv6 connectivity to "hiccup" if I do the following on it:

% ip -6 route|egrep "^[0-9a-f]{1,4}:"|awk '{ print $1; }'|sed "s#/.*##"|xargs -L 1 ip -6 route get

This basically walks the IPv6 Linux FIB and does an "ip -6 route get" for the first address in each prefix (a more readable spelling of the one-liner is sketched after the output below).  After exactly 4,261 prefixes, Netlink just gives me network unreachable:

[...]
2001:df0:456:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
2001:df0:45d:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
2001:df0:465:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
[...]
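As promised, here's a more readable spelling of that one-liner (same behavior, just commented):

# walk every IPv6 prefix in the FIB and ask Netlink to resolve
# the first address in each one (one RTM_GETROUTE query per prefix)
ip -6 route | egrep "^[0-9a-f]{1,4}:" | awk '{ print $1 }' | while read prefix; do
    addr="${prefix%%/*}"    # strip the prefix length: 2001:db8::/32 -> 2001:db8::
    ip -6 route get "$addr"
done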

It's funny because it's always at the same exact point.  The route after 2001:df0:465::/48 is 2001:df0:467::/48, which I can query just fine outside of the loop:

(nltest:11:12:PST)% ip -6 route get 2001:df0:467::   
2001:df0:467:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
(nltest:11:12:PST)% 

The only explanation I can come up with is that I'm hitting some Netlink limit and messages are getting dropped.  If I don't Ctrl-C the script and just let it sit there spewing unreachable messages on the screen, eventually all IPv6 connectivity to my VM hiccups and causes my BGP sessions to bounce.  I can see this when running an adaptive ping to the VM:

[...]
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1215 ttl=64 time=0.386 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1216 ttl=64 time=0.372 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1217 ttl=64 time=0.143 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1218 ttl=64 time=0.383 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1235 ttl=64 time=1022 ms     <--- icmp_seq 1219..1234 gone
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1236 ttl=64 time=822 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1237 ttl=64 time=621 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1238 ttl=64 time=421 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1239 ttl=64 time=221 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1240 ttl=64 time=20.6 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1241 ttl=64 time=0.071 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1242 ttl=64 time=0.078 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1243 ttl=64 time=0.081 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1244 ttl=64 time=0.076 ms
[...]
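One thing worth checking while this is happening: the kernel keeps per-socket Netlink statistics, including a drop counter, so if messages really are being dropped it should show up in the Drops column below.  I haven't correlated this with the hiccup yet, though:

% cat /proc/net/netlink
% ss -f netlink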

The next step here is to downgrade the VM to 4.14 and run the same thing.  It's possible I could just be burning out Netlink and this is normal, but I'm suspicious.

Update 2 2019-02-16

Downgrading my local VM to Linux 4.14 and running the same shell fragment above produces no network unreachable messages from Netlink, doesn't disturb IPv6 connectivity at all, and doesn't bounce any BGP sessions:

(nltest:11:44:PST)% ip -6 route|egrep "^[0-9a-f]{1,4}:"|awk '{ print $1; }'|sed "s#/.*##"|xargs -L 1 ip -6 route get 1> /dev/null 
(nltest:11:45:PST)% 

Something in Netlink definitely changed or broke after 4.14.

Update 3 2019-02-16

After testing a few kernels, it seems this was introduced in Linux 4.18.  More investigation needed.

Update 4 2019-11-17

It looks like I may have found something.  I upgraded to 5.3.0-2-amd64 (the Debian kernel) and ran the same test as above.  I got the same results, but this time I saw something interesting in the dmesg output:

[  119.460300] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
[  120.666697] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.
[  121.668727] Route cache is full: consider increasing sysctl net.ipv[4|6].route.max_size.

Apparently, net.ipv6.route.max_size was set very low:

(netlink:19:17:PST)% sudo sysctl -A|grep max_size
net.ipv4.route.max_size = 2147483647
net.ipv6.route.max_size = 4096

Well, I certainly have more than 4,096 routes.  So, I increased it to 1048576.  It WORKED!

(netlink:19:18:PST)% ip -6 route|egrep "^[0-9a-f]{1,4}:"|awk '{ print $1; }'|sed "s#/.*##"|xargs -L 1 ip -6 route get 1> /dev/null 
(netlink:19:23:PST)% 

No output means no RTNETLINK errors.
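To make the bump survive a reboot, something like this should do it (the file name under /etc/sysctl.d is my choice):

% sudo sysctl -w net.ipv6.route.max_size=1048576
% echo "net.ipv6.route.max_size = 1048576" | sudo tee /etc/sysctl.d/99-ipv6-route.conf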

This net.ipv6.route.max_size key is present and set to 4096 on my production routers, which carry ~77K IPv6 routes on 4.14 kernels with no issue.  So, I have lots of questions here.

More research is needed, but at least there's a way forward with kernels > 4.17.
