Blog

> Linux IPv6 FIB Bugs
Posted by prox, from Seattle, on February 16, 2019 at 12:50 local (server) time

tl;dr - I'm running into some FIB bug on Linux > 4.14

Ever since I upgraded excalibur (a bare-metal IPv6 router running BGP and tunnels) to a Linux kernel newer than 4.14, I've been having some weird IPv6 FIB problems.  The host takes a few full IPv6 BGP feeds (currently ~60K routes) and puts them into the Linux FIB via FRR.  It also terminates a few OpenVPN and 6in4 tunnels for friends & family and happens to host dax, the VM that serves the prolixium.com web content.

The problems started when I upgraded to the Linux 4.19 kernel packaged in Debian testing.  About 90 minutes after the reboot and after everything had converged, I started seeing reachability issues to some IPv6 destinations.  The routes were in the RIB (FRR) and in the FIB but traffic was being bitbucketed.  Even direct routes were affected.  Here's excalibur's VirtualBox interface to dax going AWOL from an IPv6 perspective:

(excalibur:11:02:EST)% ip -6 addr show dev vboxnet0
15: vboxnet0:  mtu 1500 state UP qlen 1000
    inet6 2620:6:200f:3::1/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::800:27ff:fe00:0/64 scope link
       valid_lft forever preferred_lft forever
(excalibur:11:02:EST)% ip -6  ro |grep 2620:6:200f:3
2620:6:200f:3::/64 dev vboxnet0 proto kernel metric 256 pref medium
(excalibur:11:02:EST)% ip -6 route get 2620:6:200f:3::2
2620:6:200f:3::2 from :: dev vboxnet0 proto kernel src 2620:6:200f:3::1 metric 256 pref medium
(excalibur:11:02:EST)% ping6 -c4 2620:6:200f:3::2
PING 2620:6:200f:3::2(2620:6:200f:3::2) 56 data bytes

--- 2620:6:200f:3::2 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3ms 

(excalibur:11:02:EST)% 

In the above case, the route was there and Netlink confirmed it, but no traffic would flow.  The fix here was to either bounce the interface, restart FRR, or reboot.

Other times Netlink provides a negative response:

(excalibur:12:30:EST)% ip -6 route get 2620:6:200e:8100::
RTNETLINK answers: Invalid argument. 
(excalibur:12:30:EST)% ip -6 ro|grep 2620:6:200e:8100::/56
2620:6:200e:8100::/56 via 2620:6:200e::2 dev tun2 proto static metric 20 pref medium

In this case, the route appeared to be there but Netlink had some issue when querying it.  Traffic to that prefix was being bitbucketed.  The fix was to re-add the static route in FRR:

(excalibur:12:32:EST)# vtysh -c "conf t" -c "no ipv6 route 2620:6:200e:8100::/56 2620:6:200e::2"
(excalibur:12:32:EST)# vtysh -c "conf t" -c "ipv6 route 2620:6:200e:8100::/56 2620:6:200e::2"
(excalibur:12:32:EST)% ip -6 route get 2620:6:200e:8100::
2620:6:200e:8100:: from :: via 2620:6:200e::2 dev tun2 proto static src 2620:6:200e::1 metric 20 pref medium

Downgrading from 4.19 to 4.16 seems to have made the situation much better but has not fixed it completely.  Instead of 50% of routes failing after 90 minutes, only a handful of prefixes break.  I'm not sure how many a handful is, but it's more than one.  I was running 4.14 for about 6 months without a problem so I might just downgrade to that for now.

I did try reproducing this on a local VM running 4.19, FRR, and two BGP feeds but the problem isn't manifesting itself.  I'm wondering if this is traffic or load related or maybe even related to the existence of tunnels.  I don't think it's FRR's fault but it certainly might be doing something funny with its Netlink socket that triggers the kernel bug.  I also don't know how to debug this further, so I'm going to need to do some research.

Update 2019-02-16

I started playing with that local VM running 4.19 and can successfully cause IPv6 connectivity to "hiccup" if I do the following on it:

% ip -6 route|egrep "^[0-9a-f]{1,4}:"|awk '{ print $1; }'|sed "s#/.*##"|xargs -L 1 ip -6 route get

This basically walks the IPv6 Linux FIB and does an "ip -6 route get" for each prefix (first address in each).  After exactly 4,261 prefixes Netlink just gives me network unreachable:

[...]
2001:df0:456:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
2001:df0:45d:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
2001:df0:465:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
RTNETLINK answers: Network is unreachable
[...]

It's funny because it's always at the same exact point.  The route after 2001:df0:465::/48 is 2001:df0:467::/48, which I can query just fine outside of the loop:

(nltest:11:12:PST)% ip -6 route get 2001:df0:467::   
2001:df0:467:: from :: via fe80::21b:21ff:fe3b:a9b4 dev eth0 proto bgp src 2620:6:2003:105:250:56ff:fe1a:afc2 metric 20 pref medium
(nltest:11:12:PST)% 
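
To pin down where the walk starts failing, the one-liner above can be wrapped in a small loop that counts successful lookups and stops at the first failure.  This is only a sketch: the function names first_addrs and walk_until_failure are made up for illustration, and the lookup command is passed in as arguments so the loop can be tried against something other than ip.

```shell
#!/bin/sh
# Print the first address of every IPv6 prefix read from stdin
# (input is expected to look like "ip -6 route" output).
first_addrs() {
    grep -E '^[0-9a-f]{1,4}:' | awk '{ print $1 }' | sed 's#/.*##'
}

# Read addresses from stdin and run the given lookup command on each one.
# Stops at the first failing lookup and reports how many succeeded before it.
walk_until_failure() {
    count=0
    while read -r addr; do
        # stdin is redirected so the lookup cannot eat the address list
        if "$@" "$addr" </dev/null >/dev/null 2>&1; then
            count=$((count + 1))
        else
            echo "lookup failed for $addr after $count successful lookups"
            return 1
        fi
    done
    echo "all $count lookups succeeded"
}

# Example invocation on the router or VM (not run here):
#   ip -6 route | first_addrs | walk_until_failure ip -6 route get
```

Stopping at the first failure also avoids leaving the loop running long enough to disturb connectivity.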

The only possible explanation I can come up with is that I'm hitting some Netlink limit and messages are getting dropped.  If I don't Ctrl-C the script and just let it sit there spewing the unreachable messages on the screen, eventually all IPv6 connectivity to my VM hiccups and causes my BGP sessions to bounce.  I can see this when running an adaptive ping to the VM:

[...]
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1215 ttl=64 time=0.386 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1216 ttl=64 time=0.372 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1217 ttl=64 time=0.143 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1218 ttl=64 time=0.383 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1235 ttl=64 time=1022 ms     <--- segments 1219..1234 gone
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1236 ttl=64 time=822 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1237 ttl=64 time=621 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1238 ttl=64 time=421 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1239 ttl=64 time=221 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1240 ttl=64 time=20.6 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1241 ttl=64 time=0.071 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1242 ttl=64 time=0.078 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1243 ttl=64 time=0.081 ms
64 bytes from 2620:6:2003:105:250:56ff:fe1a:afc2: icmp_seq=1244 ttl=64 time=0.076 ms
[...]
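
Spotting the hole in the sequence numbers by eye gets old quickly, so here is a rough sketch of a filter that reports gaps in icmp_seq from saved ping output.  The function name ping_gaps and the file name in the usage comment are hypothetical; the only thing it assumes about the input is the icmp_seq=N field shown above.

```shell
#!/bin/sh
# Read ping output on stdin and report gaps in the icmp_seq numbers.
ping_gaps() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # Skip the 9-character "icmp_seq=" prefix to get the number.
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (seen && seq > prev + 1)
                printf "missing icmp_seq %d..%d (%d lost)\n", prev + 1, seq - 1, seq - prev - 1
            prev = seq
            seen = 1
        }
    '
}

# Example with a saved capture (hypothetical file name):
#   ping_gaps < ping-nltest.log
```

Fed the excerpt above, it reports the same 1219..1234 range that is annotated by hand in the output.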

Next step here is to downgrade the VM to 4.14 and run the same thing.  It's possible I could just be burning out Netlink and this is normal, but I'm suspicious.

Update 2 2019-02-16

Downgrading my local VM to Linux 4.14 and running the same shell fragment above produces no network unreachable messages from Netlink, does not disturb IPv6 connectivity at all, and no BGP sessions bounce:

(nltest:11:44:PST)% ip -6 route|egrep "^[0-9a-f]{1,4}:"|awk '{ print $1; }'|sed "s#/.*##"|xargs -L 1 ip -6 route get 1> /dev/null 
(nltest:11:45:PST)% 

Something definitely changed or got bugged in Netlink after 4.14.

Update 3 2019-02-16

After testing a few kernels, it seems this was introduced in Linux 4.18.  More investigation needed.

Comments: 33
> Core i9-9900K Upgrade
Posted by prox, from Seattle, on December 10, 2018 at 16:24 local (server) time

It's been over eight years since I last built a PC.  The last one was a Core i7-980X (destiny) in 2010, back when Intel actually produced motherboards!

destiny is my main home workstation.  I use it for some VMs, video encoding, photo editing, and the occasional game (Quake III Arena and Quake 4).  For the last two years I mulled over upgrading it to something more modern, even though my workloads on it didn't really justify the upgrade.  This year, I decided to pull the trigger and upgrade it to something that'll last for another 8 years.  I looked at the Ryzen Threadripper series and even added a few builds to my cart, but then realized it was probably going to be a waste since 99% of the time the plethora of cores won't be used.  So, I decided to look at the leader in single thread performance, which turned out to be Intel:

Single Thread Performance, Nov 2018

(source for the above screenshot is here)

AMD doesn't even make it to the top 10.  Since the i9-9900K had just been released by the time I was looking, I decided to go with it.  Here's the build:

New components:

Existing components:

The OCZ SSD is pretty old and a 3.5" form factor but it's still working so I have it mounted in ~/tmp for scratch space.  I'll probably replace it or some of the other SATA SSDs with M.2 SSDs if it turns out I need more space.  Right now, it's not a priority for me.  The Blu-ray optical drives limited my choice in cases, but the 200R is a really nice case and another one of my systems uses the same one so I've got experience with it.

The integration was, honestly, pretty easy.  I expected more of a fight.  It was a little tricky to get the Cooler Master heat sink oriented and attached to the processor but I wasn't in a rush and took my time.

MSI MEG Z390 ACE + i9-9900K + Cooler Master Hyper 212 EVO

I had fully intended to use the old 850 watt Corsair power supply that I had used on my i7-980X system (TX850w) but it lacked the 2x CPU power connectors so I went for a new RM 850x.  Anyway, here's the [almost] final build:

Final Build

I say almost final because the OCZ SSD gave me some problems.  The power and SATA connectors on it stick out pretty far, so the case wouldn't close.  I even tried the L-shaped SATA connector that came with the MSI motherboard but that didn't work, either.  At the end of the day I just took it out of the 3.5" bay and placed it "free" in the case.  Yes, this is horrible and more of a reason for me to replace it with an M.2 SSD in the near future.

OCZ SSD Problems

Actually, I can see the above problem happening for any full 3.5" drive, which is a little weird.  Here's the final result:

Final Build (case)

Unlike many PCs I've built in the past, this one worked the first time!  After assembling everything the BIOS indicated to me that everything was set to defaults, so I explored a little bit of the BIOS and then booted the OS.

MSI BIOS

The BIOS is really slick, BTW.  I'm not one to like GUIs but it's nice to actually see a map of the board with annotations of what's connected to what.  The one thing that I still think I need to change is the memory speed.  Without OC'ing, Intel rates the memory speed of the i9-9900K at 2666 MHz, which is the speed of the RAM I have.  It's running at 2133 MHz right now.

I booted up Linux and did some tests running HandBrakeCLI and messing with cpufrequtils.  While I can set the processor to run at 5.00 GHz, it throttles itself down to 4.70 GHz under full load at 77°C, which seems fine to me.  The default governor is powersave, which runs all cores at 800 MHz during idle times and actually results in lower TDP than the 980X.
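
For checking the governor and per-core clocks without cpufrequtils, the standard cpufreq files in sysfs can be read directly.  The sketch below assumes the usual /sys/devices/system/cpu layout; the function names and the optional root-directory argument are made up for illustration.

```shell
#!/bin/sh
# Summarise the cpufreq governor in use across all cores as "governor count".
# The optional argument overrides the sysfs CPU directory (handy for testing).
governor_summary() {
    root="${1:-/sys/devices/system/cpu}"
    cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Print the current frequency of each core in MHz (sysfs reports kHz).
core_freqs_mhz() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue
        cpu=${f#"$root"/}      # drop the leading directory
        cpu=${cpu%%/*}         # keep only the "cpuN" component
        echo "$cpu $(( $(cat "$f") / 1000 ))"
    done
}
```

Running core_freqs_mhz during an encode and again at idle is a quick way to see the throttling and the powersave behaviour described above.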

HandBrakeCLI Test

(sorry for the photo instead of a screenshot, but I was lazy)

It's been a while, so I also fired up Quake 4, which was super smooth even with the quality turned up.  That said, most of that is GPU-heavy and even at max vsync (60 FPS) only half a core was utilized during gameplay.

I'll probably have some updates here to share over the next few days but for now, I'm happy with the upgrade.  The system feels considerably faster and compared to the 980X, it definitely is:

i7-980X vs. i9-9900K

(source for the above screenshot is here)

Find dmesg, cpuinfo, and lspci output here, if you're curious.  All grainy cellphone photos are here, too.

Comments: 0
> Who hosts your cloud provider's status page?
Posted by prox, from Seattle, on October 22, 2018 at 23:08 local (server) time

I thought this was a little funny.  Here are the links to a few of the top cloud providers' status pages:

Now, here's who hosts the status page (courtesy of ipin, which is truly hideous Perl code that you should not read):

CenturyLink

(destiny:20:00:PDT)% ipin status.ctl.io.
  A record #1
4 Address: 172.217.6.211
4 PTR: lga25s54-in-f19.1e100.net.
4 PTR: lga25s54-in-f211.1e100.net.
4 Prefix: 172.217.6.0/24
4 Origin: AS15169 [GOOGLE - Google LLC, US]
  AAAA record #1
6 Address: 2607:f8b0:4006:804::2013
6 PTR: lga25s54-in-x13.1e100.net.
6 Prefix: 2607:f8b0:4006::/48
6 Origin: AS15169 [GOOGLE - Google LLC, US]

Amazon Web Services

(destiny:20:00:PDT)% ipin status.aws.amazon.com.
  A record #1
4 Address: 52.94.241.74
4 Prefix: 52.94.240.0/22
4 Origin: AS16509 [AMAZON-02 - Amazon.com, Inc., US]

Microsoft Azure

(destiny:20:01:PDT)% ipin azure.microsoft.com.  
  A record #1
4 Address: 13.82.93.245
4 Prefix: 13.64.0.0/11
4 Origin: AS8075 [MICROSOFT-CORP-MSN-AS-BLOCK - Microsoft Corporation, US]

Oracle Cloud

(destiny:20:02:PDT)% ipin ocistatus.oraclecloud.com.
  A record #1
4 Address: 18.234.32.149
4 PTR: ec2-18-234-32-149.compute-1.amazonaws.com.
4 Prefix: 18.232.0.0/14
4 Origin: AS14618 [AMAZON-AES - Amazon.com, Inc., US]

Google Cloud

(destiny:20:02:PDT)% ipin status.cloud.google.com.  
  A record #1
4 Address: 172.217.6.206
4 PTR: lga25s54-in-f14.1e100.net.
4 PTR: lga25s54-in-f206.1e100.net.
4 Prefix: 172.217.6.0/24
4 Origin: AS15169 [GOOGLE - Google LLC, US]
  AAAA record #1
6 Address: 2607:f8b0:4006:804::200e
6 PTR: lga25s54-in-x0e.1e100.net.
6 Prefix: 2607:f8b0:4006::/48
6 Origin: AS15169 [GOOGLE - Google LLC, US]

Linode

(destiny:20:03:PDT)% ipin status.linode.com.      
  A record #1
4 Address: 18.234.32.150
4 PTR: ec2-18-234-32-150.compute-1.amazonaws.com.
4 Prefix: 18.232.0.0/14
4 Origin: AS14618 [AMAZON-AES - Amazon.com, Inc., US]

Vultr

(destiny:20:03:PDT)% ipin status.vultr.com. 
  A record #1
4 Address: 104.20.23.240
4 Prefix: 104.20.16.0/20
4 Origin: AS13335 [CLOUDFLARENET - Cloudflare, Inc., US]
  A record #2
4 Address: 104.20.22.240
4 Prefix: 104.20.16.0/20
4 Origin: AS13335 [CLOUDFLARENET - Cloudflare, Inc., US]
  AAAA record #1
6 Address: 2606:4700:10::6814:16f0
6 Prefix: 2606:4700:10::/44
6 Origin: AS13335 [CLOUDFLARENET - Cloudflare, Inc., US]
  AAAA record #2
6 Address: 2606:4700:10::6814:17f0
6 Prefix: 2606:4700:10::/44
6 Origin: AS13335 [CLOUDFLARENET - Cloudflare, Inc., US]

Rackspace

(destiny:20:04:PDT)% ipin status.apps.rackspace.com.
  A record #1
4 Address: 152.195.12.244
4 Prefix: 152.195.12.0/24
4 Origin: AS15133 [EDGECAST - MCI Communications Services, Inc. d/b/a Verizon Business, US]

I'm not gonna lie, I had a chuckle when I saw who hosted Oracle Cloud's status page.  Other takeaways:

Really, this isn't indicative of anything.  I'd probably host my status page elsewhere if I ran a hosting service, TBQH.
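
For anyone who would rather not read ipin, the origin AS part can be approximated with dig and Team Cymru's IP-to-ASN DNS service.  This is a sketch of an alternative, not a description of how ipin works; the function names are hypothetical and it only handles IPv4.

```shell
#!/bin/sh
# Build the query name for Team Cymru's origin lookup: the IPv4 octets
# reversed, followed by origin.asn.cymru.com.
cymru_qname() {
    echo "$1" | awk -F. '{ print $4 "." $3 "." $2 "." $1 ".origin.asn.cymru.com" }'
}

# Query the TXT record that carries the origin AS and prefix.
# Needs dig and working DNS, so it is not exercised here.
origin_of() {
    dig +short TXT "$(cymru_qname "$1")"
}

# Example (requires network access):
#   origin_of 18.234.32.149
```

Pairing this with "dig +short" for the A record and "dig +short -x" for the PTR gets close to the fields ipin prints above.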

Comments: 0
> Bragging
Posted by prox, from Seattle, on September 08, 2018 at 14:24 local (server) time

In the mid-1990s my friends used to brag about how many TVs their family had, how many cars they owned, and, in general, how much stuff they had.  I'll admit that I used to brag about how many computers I had or how I connected to AOL over a LAN connection.  All of this was annoying.

Things are a bit different in 2018 but ultimately the same.  Instead of bragging about how much stuff people have they now brag about how much stuff they don't have.  Here are the typical statements I hear people periodically brag about, at least around the PNW:

I hear the no-car and no-TV statements most often and they're usually stated out of context.  None of these bother me much because I'm an adult but every once in a while it gets really annoying (hence this blog post).  Maybe I should counter these by bragging about my wife and I having no kids.

Tasteless?  Maybe.

Comments: 0
> Boot Messages
Posted by prox, from Seattle, on August 04, 2018 at 13:23 local (server) time

Most modern switches and routers today are based on a Linux or *BSD-flavoured operating system.  It's a given that these operating systems are fairly complex but what boggles my mind is when vendors ship them with their products and don't bother cleaning up the initialization scripts.

For example, Junos:

Attaching /cf/packages/junos via /dev/mdctl...
Mounted junos package on /dev/md1...
A
Media check on da0
Automatic reboot in progress...
** /dev/da0s2a (NO WRITE)
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
161 files, 75937 used, 74101 free (21 frags, 9260 blocks, 0.0% fragmentation)
mount reload of '/' failed: Operation not supported 

-a: not found
-a: not found
-a: not found
-a: not found
-a: not found
-a: not found
-a: not found
-a: not found
-a: not found
-a: not found
Checking integrity of BSD labels:
  s1: Passed
  s2: Passed
  s3: Passed
  s4: Passed

That -a: not found bugs my OCD and makes me worry that the -a argument was ignored because it was treated as a file.  The mount error is fun, too.
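
I have no idea what the Junos init scripts actually do, but one common way for a shell to print this kind of message is a command name that comes from an empty variable: the variable expands to nothing and the shell then tries to execute -a itself as a command.  A minimal sketch, with a made-up variable name:

```shell
#!/bin/sh
# TOOL is deliberately empty, so "$TOOL -a" collapses to just "-a" and the
# child shell tries to run "-a" as a command.
show_missing_command() {
    TOOL='' sh -c '$TOOL -a' 2>&1
}

# Example:
#   show_missing_command
# prints a "not found" style message whose exact wording depends on the shell.
```

If that is what is happening, the -a was never handed to the intended program at all, which is arguably worse than it being ignored.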

Comments: 0
> Linux USB Identifiers and Error Messages
Posted by prox, from Seattle, on April 14, 2018 at 20:38 local (server) time

It took me a few minutes to track this down, so I figured I'd share it with the world.

In the event of a USB error or warning, the Linux kernel will print a message like the following:

[15740840.830734] usb 2-3: Failed to suspend device, error -71

Most of us have a ton of USB-connected devices, so how does one figure out what "usb 2-3" refers to in order to diagnose the problem?  At first, I thought lsusb(8) would help:

(atlantis:17:29:PDT)% lsusb
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 010 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 009 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 008 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 003: ID 0bc2:5031 Seagate RSS LLC FreeAgent GoFlex USB 3.0
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 002 Device 002: ID 2109:2812 VIA Labs, Inc. VL812 Hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 0409:0058 NEC Corp. HighSpeed Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 007 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 006 Device 005: ID 067b:2303 Prolific Technology, Inc. PL2303 Serial Port
Bus 006 Device 004: ID 1781:0a98 Multiple Vendors raphnet.net USBTenki
Bus 006 Device 003: ID 051d:0002 American Power Conversion Uninterruptible Power Supply
Bus 006 Device 002: ID 0451:2046 Texas Instruments, Inc. TUSB2046 Hub
Bus 006 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub

That's nice, but I don't see bus number 2 and device number 3 or bus number 3 and device number 2 in that list. The verbose (-v) flag doesn't appear to help, either.  So, I try usb-devices(1) and am presented with even more information, like this for each device:

T:  Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=480 MxCh= 6
D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=1d6b ProdID=0002 Rev=04.13
S:  Manufacturer=Linux 4.13.0-1-amd64 ehci_hcd
S:  Product=EHCI Host Controller
S:  SerialNumber=0000:00:1a.7
C:  #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=0mA
I:  If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub

I still couldn't find a combination of 2-3 or 3-2 in there.  So, I started hunting around sysfs for an answer, and ended up finding it:

(atlantis:17:35:PDT)% cd /sys/bus/usb/devices/2-3                 
(atlantis:17:35:PDT)% lsusb|grep $(cat idVendor):$(cat idProduct)
Bus 002 Device 002: ID 2109:2812 VIA Labs, Inc. VL812 Hub

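For some reason, lsusb(8) doesn't display the identifier the kernel uses.  As far as I can tell, the "2-3" is the bus number followed by the device path (the chain of hub ports).  The device number on this machine also happens to be 2, which is why the following prints the same thing: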

(atlantis:17:35:PDT)% echo $(cat devnum)-$(cat devpath)          
2-3

Although, I have three "hubs" connected to this machine, so tracking those down is another story.  At least I know what I'm looking for, now.
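
The lookup above can be rolled into a couple of helpers that go from a kernel log line straight to the vendor:product pair.  This is a sketch under the assumption that the name in the message matches a directory under /sys/bus/usb/devices, as it did here; the function names and the optional sysfs-root argument are hypothetical.

```shell
#!/bin/sh
# Extract the USB device name (e.g. "2-3" or "1-1.4") from kernel log lines
# read on stdin.
usb_name_from_log() {
    sed -n 's/.*usb \([0-9][0-9]*-[0-9][0-9.]*\):.*/\1/p'
}

# Print "vendor:product" for a device name by reading sysfs.
# The optional second argument overrides the sysfs directory (for testing).
usb_id() {
    dir="${2:-/sys/bus/usb/devices}/$1"
    [ -r "$dir/idVendor" ] || return 1
    echo "$(cat "$dir/idVendor"):$(cat "$dir/idProduct")"
}

# Example (on the affected machine):
#   name=$(dmesg | usb_name_from_log | tail -n 1)
#   lsusb -d "$(usb_id "$name")"
```

With several identical hubs attached, the vendor:product pair alone will not single one out, so the sysfs directory name is still the thing to track.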

Comments: 0
> No More Quagga
Posted by prox, from Seattle, on April 04, 2018 at 01:51 local (server) time

It took a while, but I finally converted the last two software routers (well, hosts that run routing protocols) on my network that were running Quagga to FRR:

bazooka.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
centauri.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
evolution.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
excalibur.prolixium.com.: Version: 4.0-1~debian9+1
exodus.prolixium.com.: Version: 1.6.3-3
firefly.prolixium.com.: Version: 1.6.3-3
mercury.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
nat.prolixium.com.: Version: 4.1-dev-1.0-1~debian9+1
nox.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
pathfinder.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
proteus.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
remus.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
scimitar.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
sprint.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
starfire.prolixium.com.: Version: 4.1-dev-1.0-1~debian9+1
storm.prolixium.com.: Version: 1.6.3-3
tachyon.prolixium.com.: Version: 3.1-dev
tiny.prolixium.com.: Version: 4.0-1~debian9+1
trident.prolixium.com.: Version: 3.1-dev
trill.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1
valen.prolixium.com.: Version: 4.1-dev-1.0-1~debian9+1
orca.prolixium.com.: Version: 3.1-dev-1.0-1~debian9+1

The -dev versions above are hand-rolled from the latest source code.  There are no pre-built Debian packages for i386 so I was forced to roll them by hand.

The 1.6 versions above are actually BIRD.

Comments: 0
> Odd iFit Boot Loop Fix
Posted by prox, from Seattle, on April 03, 2018 at 22:47 local (server) time

tl;dr I ran into a boot loop issue with my NordicTrack treadmill.  Turning off NTP solved the problem.

My wife and I purchased a NordicTrack C 990 treadmill in late 2016.  It doesn't get all that much use (I still prefer going to the pool and swimming) but we both periodically use it.  I have an iFit membership that's mostly a waste of money but allows the machine to report and track my workouts online.

The control plane of the machine runs Android 2.x and has always felt pretty brittle outside of the iFit application.  Connecting to Wi-Fi, for example, is done through the Android system dialog screens rather than through an iFit-branded screen.

Anyway, the whole system was working fine until I decided to use it today.  I put the key in and Android indicated it couldn't connect to Wi-Fi.  So, I power cycled the system (naturally).  Upon reboot the iFit screen would load but then after 10-15 seconds trigger a reboot of Android.  I searched around and found instructions like this that described how to reinstall the iFit application.  However, these instructions didn't work for me because even if I could draw the "figure 8" on the screen to exit the iFit application's splash page, the OS would still reboot seconds later.

I took a guess that something Wi-Fi-related was causing the reboot so I shut the 2.4GHz radios on my two Cisco WAPs (the treadmill is one of two devices that still use 2.4GHz only).  The reboots stopped.  Something network-related was definitely causing it.  Maybe it's some update check that is returning a value that is triggering a bug in Android?  So, I ran tcpdump(8) on my local router.  I started a continuous ping and the last packets transferred before the system rebooted were NTP queries.  Thinking that something time-related was killing the OS I went into Android settings and disabled network-provided time.  After that, the system stayed stable after boot even with Wi-Fi on.

The system date was 2012-01-01 so I tried setting it to 2018-04-03.  Instantly, the system locked up and after a few seconds rebooted.  I even tried setting it to a last known good date earlier in the year when I knew the treadmill was still working - same thing, triggered a reboot.  It would appear that either something in the OS can't handle the date changing too drastically or there's something that can't handle a 2018 date.

So, the treadmill is functional but I now can't log in to my iFit account.  I'm guessing that somehow the date is passed as one of the login parameters and the iFit platform rejects the login attempt.  I'll play more with it later and will not be renewing my iFit membership if I still can't log in.

Hopefully this post will be useful to someone who's given up and about to buy a new treadmill.

Update: I played around with setting the date a bit more.  Even setting it to 2012-01-02 triggers a reboot.  It would appear the date can't actually be set, now.

Comments: 0
