FreeBSD Sucks

Why? The kernel panics almost daily, and nobody cares to fix it.

I suppose I should elaborate…

I run two FreeBSD boxes, starfire and dax. dax, with the help of Apache's httpd, generates and serves the content you're reading right now, as well as a few other sites. It's located in NAC's data center in Parsippany, NJ. starfire is my main router at home. Both run the Quagga routing suite, terminate several OpenVPN tunnels, provide IPv6 connectivity, and selectively allow connections to and from the public Internet with pf. You can find them both on my network diagrams.

dax was preinstalled by Voxel with FreeBSD 5.2.1-RELEASE back in 2005. It was upgraded to 5.3-STABLE without incident, and appeared to handle anything I threw at it.

starfire was installed with 6.0-RELEASE a few months after dax, and seemed stable, too, although it was only handling a small amount of traffic from my home network.

Concluding that 6.0 was indeed stable, I upgraded dax to 6.0-STABLE early in 2006 (Jan 1st, actually). This was when the problems started. A few weeks after the upgrade, the box mysteriously rebooted. I thought it was power or something, and ignored it. This kept happening every couple of weeks, so I left a serial console session running in screen and watched the console output. Sure enough, it was rebooting due to a panic that looked similar to the following:

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x78
fault code = supervisor read, page not present
instruction pointer = 0x20:0xc0555579
stack pointer = 0x28:0xd43f2b28
frame pointer = 0x28:0xd43f2b2c
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = resume, IOPL = 0
current process = 11 (swi1: net)

Not knowing much about kernel debugging on FreeBSD, I didn't set up crash dumps correctly, so there wasn't much to go on. I poked around some forums and asked a few folks in -bsd on Lily, but got nowhere. I upgraded to the latest 6-STABLE, which didn't fix anything, so I eventually had to downgrade back to 5.4-RELEASE, which was a complete nightmare. Thankfully, the box became stable once again.
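For anyone else in the same spot, here is a minimal sketch of what crash dump setup in /etc/rc.conf usually looks like. The swap device name in the comment is hypothetical, and whether "AUTO" is accepted depends on the release, so check rc.conf(5) on your own box before trusting it:

```shell
# /etc/rc.conf fragment (a sketch, not a tested config for any specific release)

# Ask the kernel to write a crash dump to a swap device when it panics.
# "AUTO" lets the system pick a swap device on releases that support it;
# otherwise name one explicitly, e.g. /dev/ad0s1b (hypothetical device name).
dumpdev="AUTO"

# Directory where savecore(8) stores the dump on the next boot.
dumpdir="/var/crash"
```

The dump device generally needs to be large enough to hold the contents of RAM, and a kernel built with debugging symbols (makeoptions DEBUG=-g in the kernel config) makes the resulting vmcore far more useful in kgdb(1).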

I figured I would try to replicate this on starfire, since only my home network would be impacted during a panic. I upgraded to the latest 6-STABLE, and waited. Nothing... stable as a rock. However, a few months later, after doing upgrades every couple of weeks, the panics started there, too. I was able to get some crash dumps, and realized that I was dealing with two types of panics, not just one. Both appeared to be Quagga-related.

So, in late October, I submitted a PR (kern/104569) to the FreeBSD folks, complete with tons of system information (rc.conf, pkg_info dump, kernel config, etc.). I didn't hear anything. I kept updating the PR with information as I upgraded to the latest 6-STABLE. Since there was a lack of activity on the current PR, I decided to submit a new one (kern/105966) regarding the other panic I was receiving.

Fortunately, kern/105966 was quickly resolved (or so I thought), since apparently other users were reporting similar problems triggered by route6d(8). I didn't see it recur for a month or two, yet still kept encountering panics as described in kern/104569, and kept adding information to the PR. It appears that this bug is triggered by Quagga's zebra process during a route removal or addition (i.e., an LSA ages out when an OSPF neighbor is lost), if the box has been up for over a day or so.

I was going to submit a bug to the Quagga folks, but realized it would be closed the minute it was posted, with "not a Quagga problem," or similar. And, I'd agree with them... userland apps shouldn't cause the kernel to freak out, unless kernel memory is being overwritten, or something (right?).

To my dismay, the daily panics became more frequent, and then I realized that the panic described in kern/105966 was back, and I was battling two problems once again. I posted a response to the original PR, asking for it to be reopened, and even emailed the developer who originally fixed the bug. Nothing…

At this point, I'm not sure what to do. I've added tons of follow-ups to both of the PRs, made some forum postings, and sent several emails to the freebsd-stable mailing list. Except for the initial fix for kern/105966, I have received NO responses or input from any developers about this.

I'm left with few options, at this point:

1. Convert both FreeBSD boxes to Linux
2. Migrate to other routing daemons
3. Keep both boxes running 5.4-RELEASE ... forever

#1 is a possibility, but is going to be a PAIN to accomplish for dax, since I'll probably need to pay Voxel at some ridiculous hourly rate to get a base install on the box. Also, iptables is much less elegant than pf. Anyone have suggestions for #2? Requirements are: OSPFv2, OSPFv3, and [multiprotocol] BGP support. OpenBGPD is portable, and might work well, but I don't know of any other portable OSPF implementations. #3 just seems like a bad idea, since 5.4-RELEASE is going to be unmaintained soon. So much for security updates.

Well, that's why I think FreeBSD sucks, at the moment. Thoughts? Bugfixes?