
> JBOD death
Posted by prox, from Charlotte, on March 30, 2008 at 22:53 local (server) time

For the past four years, I've kept all of my media files (music, etc.) on a single filesystem spread across a varying number of physical volumes, utilizing LVM 2.0 on Linux.  This JBOD array resided on atlantis, my old SuperMicro system with dual P3-800 CPUs.

The progression went something like the following, if I remember correctly:

Adding and removing drives was simple: a couple of LVM commands and a lengthy resize2fs.  It worked well, although there was obviously no redundancy to speak of.  I kept backing up a good portion (80%) of the content to CD-Rs, then DVD-Rs, then DVD-R DLs, just in case.
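
For reference, growing the array generally looked something like this (the device, volume group, and mount point names here are hypothetical, not my exact setup):

    # Turn the new disk into a physical volume and add it to the volume group
    pvcreate /dev/hde1
    vgextend vg_media /dev/hde1

    # Grow the logical volume into the new space, then grow ext3 to match
    lvextend -l +100%FREE /dev/vg_media/media
    umount /mnt/media
    e2fsck -f /dev/vg_media/media
    resize2fs /dev/vg_media/media
    mount /dev/vg_media/media /mnt/media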

Yesterday, I set off to replace the old array with 2x Western Digital 1TB green disks that were much quieter and supposedly didn't chew up as much power as 4x EIDE disks.  I had to recreate the filesystem, since ext3 doesn't support a 2TiB filesystem with a 1KiB block size.  I created the new logical volume and ext3 filesystem, and started to copy the data.
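
Roughly what that looked like, with illustrative device and volume names (and a 4KiB block size this time, to get around the old 1KiB limit):

    # Build a fresh volume group across the two 1TB disks
    pvcreate /dev/sda1 /dev/sdb1
    vgcreate vg_new /dev/sda1 /dev/sdb1
    lvcreate -l 100%FREE -n media vg_new

    # ext3 with 4KiB blocks, then mount it and start copying
    mke2fs -j -b 4096 /dev/vg_new/media
    mount /dev/vg_new/media /mnt/media_new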

Unfortunately, both the 400GB (Seagate) and 250GB (Western Digital) disks decided to die 20% into the copy process (I used rsync).  Great!  I started & stopped the rsync, remounted, rebooted, etc. for an hour or two, and realized that things weren't looking good.  Each time ext3 encountered an I/O error, it would invalidate the inode and cause it to be inaccessible, so rsync would log an error and skip it, but not before creating a destination file filled with NULLs.
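
The copy itself was nothing exotic, just a plain rsync along these lines (paths are illustrative):

    # One-way copy from the old JBOD filesystem to the new one
    rsync -av --progress /mnt/media/ /mnt/media_new/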

I tried running fsck.ext3 on the filesystem with -c, enabling the badblocks program.  Unfortunately, that process would have taken a good two weeks, judging from the rate of completion.
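
That was just e2fsck with the badblocks pass enabled, roughly (with the filesystem unmounted):

    # -f forces a check, -c has e2fsck run badblocks (read-only) and
    # record anything it finds in the bad block inode
    fsck.ext3 -f -c /dev/vg_media/media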

I started up rsync again, this time in only one directory (should have been a couple hundred GiB) and let it go all night.  It finished halfway through today, but by that time, the superblock(s) were corrupt, and the filesystem was toast.  I tried freezing the drives, but that just made them click louder.
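
The usual recourse when the primary superblock is damaged is to point e2fsck at one of the backup copies; mke2fs -n, run with the same options the filesystem was originally created with, prints where those live.  Something like the following, although by this point the damage was done:

    # Show where the backup superblocks live (writes nothing); pass the
    # same block size the filesystem was originally made with
    mke2fs -n -b 1024 /dev/vg_media/media

    # Then attempt a repair against one of the backups
    e2fsck -b 8193 /dev/vg_media/media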

I picked up _another_ Western Digital 1TB disk today, and decided to run LVM on top of RAID-5, just in case anything like this happens again (well, hoping I'd only lose one drive at a time).  I am still in the process of copying what made it onto the new filesystem over to the 750GB disk and validating it, so I can pull all three 1TB disks into a RAID-5 device and put LVM on top.  I just hope the 750GB holds up for the process …
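
The rough plan for the new layout, once the 1TB disks are free again (device names hypothetical, and the old volume group torn down first):

    # Three 1TB disks into a single RAID-5 md device
    mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1

    # LVM on top of the array, then ext3 with 4KiB blocks
    pvcreate /dev/md0
    vgcreate vg_raid /dev/md0
    lvcreate -l 100%FREE -n media vg_raid
    mke2fs -j -b 4096 /dev/vg_raid/media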

The moral of the story: JBOD is bad.  Overall, I think I lost about 40% of the files, according to a periodic ls -lR listing I run during system (boot disk) backups.
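
That listing is nothing fancier than something along the lines of:

    # Snapshot the media tree's file listing alongside the boot-disk backups
    ls -lR /mnt/media > /backup/media-listing-$(date +%F).txt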

Comment by Alex on March 31, 2008 at 21:42 local (server) time

Have you considered ddrescue?  I have (luckily, perhaps) recovered data from clicking disks, inoperable iPods, scratched CD-ROMs, all kinds of media.  It is relatively simple to use; the only tricky thing would be your target "dump" directory, because ddrescue will output an image at whatever size your JBOD array is.  Oh, and it takes forever :p
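
Basic usage is something along these lines (device, image, and log file names are of course whatever fits your setup):

    # First pass: grab everything that reads cleanly, logging the bad areas
    ddrescue -n /dev/sdb /mnt/dump/disk.img /mnt/dump/disk.log

    # Second pass: go back and retry the problem areas a few times
    ddrescue -r 3 /dev/sdb /mnt/dump/disk.img /mnt/dump/disk.log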

Comment by Mark Kamichoff [Website] on April 01, 2008 at 02:24 local (server) time

Thanks for the suggestion, I wasn't aware of ddrescue, and I'll be sure to use it next time!

