Prolixium dot com: News >> Blog

Although I say upgrade, it's really a replacement. This weekend, I upgraded my main workstation, tacolinux, from an AMD Athlon64 3200+ to an Intel Core 2 Extreme QX6700. The new system has been christened destiny. I took some photos of the process.

The system is very fast, yet stays cool and quiet, most of the time. Here are some specs:

Intel Core 2 Extreme QX6700
Intel D975XBX2 Motherboard
2GB DDR2 PC2-5300 Crucial Ballistix
EVGA 256-P2-N549-TR GeForce 7600GS
Western Digital WD1600JS-41S (160GB SATA 3.0Gb/sec)
Zalman CNPS9500 LED
Antec Phantom 500

There are a few outstanding issues, though.

First off, the system won't boot from the hard disk at all. Don't get me wrong, it reads the disk fine, but the BIOS just won't boot from it. I've fooled around with AHCI settings in the BIOS until I was blue in the face, and flashed the BIOS while I was at it. The old SATA disk from tacolinux doesn't boot, either. The workaround for the moment is to use a GRUB bootable CD-ROM. So, basically, I have to type two commands to get my system to boot. I suppose it can be viewed as security through obscurity?

The other open item relates to hardware monitoring. I believe there's a Windows application that displays data from thermal sensors and fans, but I couldn't get lm_sensors to pick up anything useful. More research is needed.

So, I finally broke down and started buying new computer parts. Still waiting for the hard drive, and time this weekend for building…

On Gentoo, bluepin is horribly broken. This will prevent your bluez host from pairing with any device requiring a PIN. Observe:

% sudo hidd --search
Searching ...
Connecting to device 00:0C:1D:C5:44:91
Can't get device information: Host is down

It's really lying to you. About two seconds after the "Connecting to device" message, hidd tries to run bluepin to pop up a dialog for you to enter the PIN. However, bluepin, a Python script, can't open the display (which is the bug), throws a whole bunch of GTK errors, and then causes python to receive a SIGSEGV (signal 11):

/usr/lib64/python2.4/site-packages/gtk-2.0/gtk/__init__.py:69: GtkWarning: could not open display
warnings.warn(str(e), _gtk.Warning)
/usr/bin/bluepin:48: Warning: invalid (NULL) pointer instance
gtk.Dialog.__init__(self)

Then, segfault:

bluepin[22482]: segfault at 0000000000000000 rip 00002b29d54167dc rsp 00007fffd6dffcb0 error 4

Running xhost + or starting X as root doesn't make any difference.

As described here, just enter the PIN in /etc/bluetooth/pin, then change /etc/bluetooth/hcid.conf accordingly:

# PIN helper
#pin_helper /usr/bin/bluepin;
pin_helper /etc/bluetooth/pin-helper;

You'll probably have to update that /etc/bluetooth/pin file each time you pair a device that requires a different PIN. This also allows you to pair devices without using X.
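For illustration, a helper of this kind can be a few lines of Python. This is a minimal sketch and not the Gentoo script itself: it assumes hcid expects the helper to print PIN:<pin> on standard output (and treats ERR as a failure), and the function name pin_reply is made up for this example.

```python
#!/usr/bin/env python
"""Hypothetical stand-in for /etc/bluetooth/pin-helper (no X required)."""

PIN_FILE = "/etc/bluetooth/pin"  # the file named in the post


def pin_reply(path=PIN_FILE):
    """Return the line to hand back to hcid.

    Assumption: hcid reads "PIN:<pin>" from the helper's stdout and
    treats "ERR" as a failure.
    """
    try:
        with open(path) as handle:
            pin = handle.read().strip()
    except OSError:
        # Missing or unreadable PIN file.
        return "ERR"
    return "PIN:" + pin if pin else "ERR"


if __name__ == "__main__":
    # Any arguments hcid passes to the helper are ignored in this sketch.
    print(pin_reply())
```

Since nothing here touches GTK or the display, it sidesteps the failure above entirely.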

Why? The kernel panics almost daily, and nobody cares to fix it.

I suppose I should elaborate…

I run two FreeBSD boxes, starfire and dax. dax, with the help of Apache's httpd, generates and serves the content you're reading right now, along with a few other sites. It's located in NAC's data center in Parsippany, NJ. starfire is my main router at home. Both run the Quagga routing suite, terminate several OpenVPN tunnels, provide IPv6 connectivity, and selectively allow connections to and from the public Internet with pf. You can find them both on my network diagrams.

dax was preinstalled by Voxel with FreeBSD 5.2.1-RELEASE back in 2005. It was upgraded to 5.3-STABLE without incident, and appeared to handle anything I threw at it.

starfire was installed with 6.0-RELEASE a few months after dax, and seemed stable, too, although it was only handling a small amount of traffic from my home network.

Concluding that 6.0 was indeed stable, I upgraded dax to 6.0-STABLE early in 2006 (Jan 1st, actually). This was when the problems started. A few weeks after the upgrade, the box mysteriously rebooted. I thought it was power, or something, and ignored it. This kept happening every couple of weeks, so I screened a serial console session and watched the console output. Sure enough, it was rebooting due to a panic that looked similar to the following:

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x78
fault code = supervisor read, page not present
instruction pointer = 0x20:0xc0555579
stack pointer = 0x28:0xd43f2b28
frame pointer = 0x28:0xd43f2b2c
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = resume, IOPL = 0
current process = 11 (swi1: net)

Not knowing much about kernel debugging on FreeBSD, I didn't set up crash dumps correctly, so there wasn't much to go on. I poked around some forums, and asked a few folks in -bsd on Lily, but got nowhere. I upgraded to the latest 6-STABLE, which didn't fix anything, so I eventually had to downgrade back to 5.4-RELEASE, which was a complete nightmare. Thankfully, the box became stable once again.

I figured I would try to replicate this on starfire, since only my home network would be impacted during a panic. I upgraded to the latest 6-STABLE, and waited. Nothing... stable as a rock. However, a few months later, after doing upgrades every couple of weeks, it started happening. I was able to get some crash dumps, and realized that I was dealing with two types of panics, not just one. Both appeared to be Quagga-related.

So, in late October, I submitted a PR (kern/104569) to the FreeBSD folks, complete with tons of system information (rc.conf, pkg_info dump, kernel config, etc.). I didn't hear anything. I kept updating the PR with information as I upgraded to the latest 6-STABLE. Since there was a lack of activity on the current PR, I decided to submit a new one (kern/105966) regarding the other panic I was receiving.

Fortunately, kern/105966 was quickly resolved (or so I thought), since apparently other users were reporting similar problems triggered by route6d(8). I didn't see it recur for a month or two, yet still kept encountering panics as described in kern/104569, and kept adding information to the PR. It appears that this bug is triggered by Quagga's zebra process during a route removal or addition (e.g., an LSA ages out when an OSPF neighbor is lost), if the box has been up for over a day or so.

I was going to submit a bug to the Quagga folks, but realized it would be closed the minute it was posted, with "not a Quagga problem," or similar. And, I'd agree with them... userland apps shouldn't cause the kernel to freak out, unless kernel memory is being overwritten, or something (right?).

To my dismay, the daily panics became more frequent, and then I realized that the panic described in kern/105966 was back, and I was battling two problems once again. I posted a response to the original PR, asking it to be reopened, and even emailed the developer who originally fixed the bug. Nothing…

At this point, I'm not sure what to do. I've added tons of follow-ups to both of the PRs, made some forum postings, and sent several emails to the freebsd-stable mailing list. Except for the initial fix for kern/105966, I have received NO responses or input from any developers about this.

I'm left with few options, at this point:

1. Convert both FreeBSD boxes to Linux
2. Migrate to other routing daemons
3. Keep both boxes running 5.4-RELEASE ... forever

#1 is a possibility, but is going to be a PAIN to accomplish for dax, since I'll probably need to pay Voxel at some ridiculous hourly rate to get a base install on the box. Also, iptables is much less elegant than pf. Anyone have suggestions for #2? Requirements are: OSPFv2, OSPFv3, and [multiprotocol] BGP support. OpenBGPD is portable, and might work well, but I don't know of any other portable OSPF implementations. #3 just seems like a bad idea, since 5.4-RELEASE will soon be unmaintained. So much for security updates.

Well, that's why I think FreeBSD sucks, at the moment. Thoughts? Bugfixes?

Telia / NAC have really been annoying me, lately:

Latency to lo-0.hsa1.NewYork1.Level3.net

traceroute to hsa1.NewYork1.Level3.net (209.244.2.210), 64 hops max, 60 byte packets
1  voxel.prolixium.net (69.9.189.181)  0.634 ms
2  499.ge-6-1-0.gbr2.oct.nac.net (216.118.70.178)  0.809 ms
3  0.ge-0-0-0.gbr1.oct.nac.net (209.123.11.49)  1.036 ms
4  0.so-2-2-0.gbr1.tl9.nac.net (209.123.11.143)  89.376 ms
5  nyk-b1-link.telia.net (80.91.250.161)  13.986 ms
6  te-4-2.car2.NewYork1.Level3.net (4.68.110.81)  13.138 ms
7  lo-0.hsa1.NewYork1.Level3.net (209.244.2.210)  15.210 ms

Normally, I wouldn't care, but look at the trend. The average RTT to anything in Manhattan is (well, used to be) roughly 3 ms. NAC is one of Voxel's transit providers, and Telia is one of NAC's. Ironically, Voxel had network capacity upgrades on the 19th and 20th. They also have Level(3) as transit, but I suppose they are now taking advantage of hot potato routing (BGP) in order to whisk packets from inexpensive servers (mine) out cheaper transit (NAC).
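To put numbers on the traceroute, here's a rough sketch that pulls the hop lines apart and compares the final RTT against the ~3 ms I'd normally expect. The regular expression and the helper names (parse_hops, slowest_hop) are my own for this example, and it only reads the single sample per hop shown above, so it says nothing about the trend over time.

```python
import re

TRACE = """\
traceroute to hsa1.NewYork1.Level3.net (209.244.2.210), 64 hops max, 60 byte packets
1  voxel.prolixium.net (69.9.189.181)  0.634 ms
2  499.ge-6-1-0.gbr2.oct.nac.net (216.118.70.178)  0.809 ms
3  0.ge-0-0-0.gbr1.oct.nac.net (209.123.11.49)  1.036 ms
4  0.so-2-2-0.gbr1.tl9.nac.net (209.123.11.143)  89.376 ms
5  nyk-b1-link.telia.net (80.91.250.161)  13.986 ms
6  te-4-2.car2.NewYork1.Level3.net (4.68.110.81)  13.138 ms
7  lo-0.hsa1.NewYork1.Level3.net (209.244.2.210)  15.210 ms
"""

# hop number, hostname, (address), RTT in milliseconds
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)\s+([\d.]+) ms")


def parse_hops(text):
    """Return a list of (hop, hostname, rtt_ms) tuples; other lines are skipped."""
    hops = []
    for line in text.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), float(match.group(4))))
    return hops


def slowest_hop(hops):
    """Return the hop tuple with the largest RTT."""
    return max(hops, key=lambda hop: hop[2])


hops = parse_hops(TRACE)
expected_ms = 3.0  # the usual RTT to Manhattan mentioned above
print("slowest hop:", slowest_hop(hops))
print("extra latency at destination: %.2f ms" % (hops[-1][2] - expected_ms))
```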

Oh, yeah, this is killing my IPv6 latency :P

I took a trip to Asheville over the weekend, and passed by the Biltmore Estate. Huge house; complete with electricity and clock synchronization, so all the guests' tea and crumpets arrived on time.

Front view:

Inside (whoops, no photography):

Roof:

Top, again:

If you haven't been living under a rock for the past year, you might know that daylight saving time (DST) is changing. From The Energy Policy Act of 2005:

The bill amends the Uniform Time Act of 1966 by changing the start and end dates of daylight saving time starting in 2007. Clocks will be set ahead one hour on the second Sunday of March (March 11, 2007) instead of the current first Sunday of April (April 1, 2007). Clocks will be set back one hour on the first Sunday in November (November 4, 2007), rather than the last Sunday of October (October 28, 2007). This will make electronic clocks that had pre-programmed dates for adjusting to daylight saving time obsolete and will require updates to computer operating systems.
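The dates in that quote are easy to sanity-check. Here's a small sketch using Python's standard calendar module; the nth_sunday helper is my own name for this example, not something from a library.

```python
import calendar
from datetime import date


def nth_sunday(year, month, n):
    """Return the n-th Sunday of a month; n = -1 means the last Sunday."""
    sundays = [
        day
        for day in calendar.Calendar().itermonthdates(year, month)
        if day.month == month and day.weekday() == calendar.SUNDAY
    ]
    return sundays[n - 1] if n > 0 else sundays[n]


new_start = nth_sunday(2007, 3, 2)    # second Sunday of March
old_start = nth_sunday(2007, 4, 1)    # first Sunday of April
new_end = nth_sunday(2007, 11, 1)     # first Sunday of November
old_end = nth_sunday(2007, 10, -1)    # last Sunday of October

print(new_start, old_start, new_end, old_end)
# Extra days of DST under the new rules
print(((new_end - new_start) - (old_end - old_start)).days)
```

For 2007 this gives March 11 and November 4 under the new rules versus April 1 and October 28 under the old ones, which works out to four extra weeks of DST.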

Along with possibly fattening up your kids by increasing daylight on Halloween, this requires most operating systems to be patched to compensate for the whims of the US government. Some Unix-like systems can be patched by doing the following:

Gentoo GNU/Linux: emerge --sync && emerge sys-libs/timezone-data, assuming you're on a current profile
Debian and Ubuntu GNU/Linux: apt-get update && apt-get install tzdata
FreeBSD (releases before 6.2): portinstall misc/zoneinfo or sync sources and rebuild world
Solaris 10 for SPARC: Grab patches 124630-03 and 119081-25 (or later) from Sun

Those are really the only operating systems I care about. I couldn't test the Solaris patch, since little ol' me doesn't have an expensive support contract, and 119081-25 isn't available to the general public. I suppose my Ultra 10 will have to suffer for a few weeks. Pity.

Network gear is affected, too. However, I suggest just switching to UTC, as it'll make your life easier.