[OHM] ...a funny thing happened to me on the way to the machine room.
Rob H Warren
warren at muninn-project.org
Mon Jun 18 04:10:05 UTC 2018
After a bit of forensics and spending most of Friday in a machine room, here is a synopsis of what has happened:
The OHM server is a 'redundant everything' setup: everything comes in twos, including the power supplies. As is best practice, one was routed to the UPS and the other directly to the mains to allow battery maintenance without downtime. Sometime in the past few months, the power supply hooked up to the UPS failed without triggering the software alert. This left the server running on the mains-fed supply alone, so power bumps would trigger a hard reboot while the UPS reported "that everything was completely under control, move along now".
Mystery reboots and filesystem faults had been something that was being investigated; they were too much for the /usr filesystem, which had been quietly corrupting itself and causing random software faults. Several layers of disk redundancy, data consistency checking and backups have ensured that no OHM data was lost, but the base system itself is a mess at this point. To that end I will reinitialize the entire storage array, reinstall from scratch and update the rails port. We should be back up by Friday; a placeholder has been put up in the meantime.
If you happen to be the person that keeps pushing pins and nails into a voodoo doll mockup of OHM, would you stop already? -rhw