Archived posting to the Leica Users Group, 2014/07/01


Subject: [Leica] Narrative about the extended LUG outage
From: reid at mejac.palo-alto.ca.us (Brian Reid)
Date: Tue, 01 Jul 2014 21:24:45 -0700

In case you care.

Server computers that are engineered for reliability have two power 
supplies and two power cords. Power supplies are the component that 
fails most often in server computers, so having two of them lets the 
machine survive the failure of one.

The server computer that had supported the LUG had two power supplies. 
They were stacked vertically, one on top of the other. Both power 
supplies had been running 24x7 for about 9 years, and their fans had 
sucked in a certain amount of lint. Lint is flammable. The bottom power 
supply failed, and the lint caught fire. The flame rose to the upper 
power supply and ignited its stored-up lint also. Like firestarters in a 
Franklin stove, the 20-second burst of flame was enough to ignite the 
various flammable items (including lint) in the main enclosure. The 
flash fire probably only lasted 40 or 50 seconds, but it was hot enough 
to destroy most of the solder traces that were near the power supplies 
on the circuit boards. There were various plastic tags on some of the 
cables, which added flammable material.

You can go to the store and buy a laptop or a desktop computer, but you 
really can't go buy a server computer. Yes, this being Silicon Valley, 
there are stores around that sell server computers (Central Computer is 
the best of the lot) but buying a server computer at a retail store is 
like buying a bicycle at a department store. It's just not the same 
thing. Server computers are special-order, because there are so many 
variations on how they are built that no one can afford to keep good 
ones in inventory.

The fire was on a Saturday morning, and I knew that the soonest I could 
even place an order for a replacement server was Monday, and even at 
rush-rush prices I wouldn't get it until Thursday. At the time a 
Saturday-to-Thursday outage seemed unconscionable. So I decided to move 
the LUG and its supporting software to the newest and emptiest of my 
half-dozen servers. It wasn't exactly a spare--it was running a few 
little things--but mostly it was idle.

The LUG server had been running software from the era of its 
installation, about 2005. The new server was built with chips and 
components that the old software didn't understand, so I couldn't just 
restore the LUG server backups onto the new server. They wouldn't run. I 
had to get the new software working on the replacement server and then 
manually move over each piece.

I made the mistake of believing the operating system documentation, 
which detailed a function called "system upgrade". It was supposed to 
work the way Mac or Windows updates work--you let it do its thing for a 
while, and then you reboot and all is well. After running the system 
upgrade, nothing worked any more, including the few services that had 
been on that machine. After asking the experts, I realized that I was 
going to have to wipe the machine, do a clean install, get all of the 
necessary apps installed, and then restore both sets of backups (LUG 
server and previous contents of that server) to the clean system.

So far this is not a crazy plan. I've done things like it many times 
before, though the 9-year software update gap made for a few challenges.

Once I got all of the apps installed and the backups restored, I 
immediately typed the command to turn it all on
        /local/mailman/bin/mailmanctl start
and nothing happened. The error log showed a preposterous, deeply hard 
to believe error message.

The wise person's first step in debugging strange failures on computers 
is to type the error message into a search engine (I use Bing) to see if 
other people had asked about it. To my great astonishment, no one had. 
This never happens. Somebody else *always* has the same problem and has 
asked about it.

I then started reading the source code of Mailman, trying to see what 
circumstances would cause it to generate that message.  Mailman is 
written in a language called Python. When you are having trouble like 
this, a good step is to explore "version skew". Mailman Version XXX 
works only with Python Version YYY. The versions of Python that are 
extant just now are 2.5, 2.6, 2.7, 3.2, 3.3, and 3.4.  This is an 
abnormally large spread of "current" versions, which usually means that 
the language developers have made incompatible changes and have to keep 
old versions around for apps that have come to depend on them.

I tried all 6 of those Python versions. I got the same odd error in the 
2.* versions, and absolute chaos in the 3.* versions. Since the version 
of Mailman that I wanted to use (2.1.18) failed the same way with all of 
the 2.* Python versions, I wiped the slate clean one last time and 
installed Python 2.7.

Gonna have to find this problem the old-fashioned way.

Many days pass as I read documentation, run tests, explore the software, 
use debuggers, create and read log files, all to no avail.

Then I decided to instrument and log what was happening when 
Mailman/Python started up. Figuring out how much information to put in a 
log file is a black art. If you log too much, you will never find what 
you are looking for in the swamp of details. If you log too little, you 
probably won't log what you're looking for.

After far too much time staring at the logs, I saw that Python was 
initializing from a library that was not listed in the Mailman 
documentation.
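
For readers who want to try this kind of startup logging themselves, here is a minimal sketch. It is not the instrumentation described above, and it is written for a current Python 3 rather than the 2.7 of this story. The logger name and the list of modules are assumptions chosen for illustration. It logs one line per module saying which file the module was actually loaded from.

```python
import importlib
import logging
import sys

log = logging.getLogger("startup-trace")  # hypothetical logger name


def module_origin(name):
    """Import `name` and return the file it was loaded from.

    Built-in modules such as `sys` have no file, so None is returned.
    """
    module = importlib.import_module(name)
    return getattr(module, "__file__", None)


def log_origins(names):
    """Log one line per module and return a {name: origin} dictionary."""
    origins = {}
    for name in names:
        origins[name] = module_origin(name)
        log.info("%s loaded from %s", name, origins[name])
    return origins


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, stream=sys.stderr)
    # The module list is illustrative; log whichever modules you suspect.
    log_origins(["email", "json", "sys"])
```

One line per module is a deliberately small amount of logging: enough to spot a path that should not be there, not enough to drown in.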

An aside: language systems like Python tend to be aggressive in how they 
find libraries. They look around and if they find something that looks 
like a library, they use it. I'm sure the Python designers (none of whom 
is named Monty) thought they were doing the world a favor by making it 
go out and find its own libraries. "Autoconfiguration" run amok. Bad idea.
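
The search behavior is easy to reproduce. The sketch below is an illustration only: the package name and the two temporary directories are made up, and it targets Python 3. It creates an "old" and a "new" copy of the same package, puts both directories on the search path with the old one first, and reports which copy the import picked up.

```python
import importlib
import os
import shutil
import sys
import tempfile


def make_package(parent_dir, package_name, version):
    """Create parent_dir/package_name/__init__.py defining VERSION."""
    package_dir = os.path.join(parent_dir, package_name)
    os.makedirs(package_dir)
    with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
        handle.write("VERSION = %r\n" % version)


def first_match_wins(package_name="demo_shadowed_pkg"):
    """Return (VERSION, file path) of the copy that Python actually imports."""
    old_dir = tempfile.mkdtemp(prefix="old_lib_")
    new_dir = tempfile.mkdtemp(prefix="new_lib_")
    make_package(old_dir, package_name, "old")
    make_package(new_dir, package_name, "new")

    saved_path = list(sys.path)
    try:
        # The obsolete directory is searched first, so it shadows the new copy.
        sys.path[:0] = [old_dir, new_dir]
        importlib.invalidate_caches()
        sys.modules.pop(package_name, None)
        module = importlib.import_module(package_name)
        return module.VERSION, module.__file__
    finally:
        # Restore the interpreter state and remove the temporary files.
        sys.path[:] = saved_path
        sys.modules.pop(package_name, None)
        shutil.rmtree(old_dir, ignore_errors=True)
        shutil.rmtree(new_dir, ignore_errors=True)


if __name__ == "__main__":
    print(first_match_wins())
```

Python raises no warning when a second copy exists further down the path. The first directory that contains something with the right name simply wins.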

This library was obsolete. In the 9 years of not upgrading, the Mailman 
software had changed the place where it kept certain library functions, 
and both the old and the new locations were present in the version I was 
trying to run. The 
"wipe clean and reinstall" function only wiped the directories that it 
knew about, and this obsolete directory was not on its list -- it had 
been retired years ago -- so it didn't get removed by the "wipe clean" 
function.

If I had run all 12 of the upgrades between Mailman 2.1.6 and 2.1.18, 
one of them would surely have deleted that newly-obsolete directory. But 
I didn't, so it was still there.

When a complex computer system is using two different versions of the 
same library, with creation dates 7 years apart, it doesn't stand a 
chance of working.
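
A check for this situation can be sketched in a few lines. This is an assumption-laden illustration for Python 3, not a tool that was used here: the function name is made up, and it only recognizes ordinary packages (directories containing `__init__.py`). It reports every package name that appears in more than one of the given directories.

```python
import os
from collections import defaultdict


def find_duplicate_packages(search_dirs):
    """Map each package name found in several directories to those directories.

    Directories are examined in the order given, which is also the order in
    which Python would search them, so the first entry of each list is the
    copy an import would use.
    """
    seen = defaultdict(list)
    visited = set()
    for directory in search_dirs:
        real = os.path.realpath(directory)
        # Skip empty or missing entries and directories already examined.
        if not os.path.isdir(directory) or real in visited:
            continue
        visited.add(real)
        for entry in sorted(os.listdir(directory)):
            candidate = os.path.join(directory, entry)
            # A directory containing __init__.py is treated as a package.
            if os.path.isfile(os.path.join(candidate, "__init__.py")):
                seen[entry].append(directory)
    return {name: dirs for name, dirs in seen.items() if len(dirs) > 1}


if __name__ == "__main__":
    import sys

    for name, dirs in find_duplicate_packages(sys.path).items():
        print(name, "found in:", ", ".join(dirs))
```

Run against `sys.path` from inside the failing application, any name it prints is a candidate for the kind of two-versions conflict described above.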

I typed the Unix command
        rm -rf /local/mailman/Mailman/pythonlib/email
which got rid of the ancient and incompatible library,
and everything started working. Perfectly.

There were hundreds of loose ends, and I spent the next week hunting 
them down, but it wasn't taking 18 hours a day and LUG mail was flowing 
while I did it.

Thanks for listening.
Brian Reid
LUG Saloonkeeper and server wrangler





