Archived posting to the Leica Users Group, 2014/07/01


Subject: [Leica] Narrative about the extended LUG outage
From: philippe.amard at sfr.fr (philippe.amard)
Date: Wed, 2 Jul 2014 07:46:58 +0200
References: <53B3898D.2030005@mejac.palo-alto.ca.us>

Way too technical for me and on the verge of poetry as a result ;-)
But OMG thank you Brian for restoring the community to its normal state.
Well, if the LUG ever has a normal state :-)

A thousand thanks, and bravo squared.

Warm regards
Philippe

On 2 Jul 2014, at 06:24, Brian Reid wrote:

> In case you care.
>
> Server computers that are engineered for reliability have two power  
> supplies and two power cords. Power supplies are the component that
> fails most often in server computers, so having two of them lets
> the server survive the failure of one.
>
> The server computer that had supported the LUG had two power  
> supplies. They were stacked vertically, one on top of the other.  
> Both power supplies had been running 24x7 for about 9 years, and  
> their fans had sucked in a certain amount of lint. Lint is  
> flammable. The bottom power supply failed, and the lint caught fire.  
> The flame rose to the upper power supply and ignited its stored-up  
> lint also. Like firestarters in a Franklin stove, the 20-second  
> burst of flame was enough to ignite the various flammable items  
> (including lint) in the main enclosure. The flash fire probably only  
> lasted 40 or 50 seconds, but it was hot enough to destroy most of  
> the solder traces that were near the power supplies on the circuit  
> boards. There were various plastic tags on some of the cables, which  
> added flammable material.
>
> You can go to the store and buy a laptop or a desktop computer, but  
> you really can't go buy a server computer. Yes, this being Silicon
> Valley, there are stores around that sell server computers (Central
> Computer is the best of the lot) but buying a server computer at a  
> retail store is like buying a bicycle at a department store. It's  
> just not the same thing. Server computers are special-order, because  
> there are so many variations on how they are built that no one can  
> afford to keep good ones in inventory.
>
> The fire was on a Saturday morning, and I knew that the soonest I  
> could even place an order for a replacement server was Monday, and  
> even at rush-rush prices I wouldn't get it until Thursday. At the  
> time a Saturday-to-Thursday outage seemed unconscionable. So I  
> decided to move the LUG and its supporting software to the newest  
> and emptiest of my half-dozen servers. It wasn't exactly a spare--it  
> was running a few little things--but mostly it was idle.
>
> The LUG server had been running software from the era of its  
> installation, about 2005. The new server was built with chips and  
> components that the old software didn't understand, so I couldn't  
> just restore the LUG server backups onto the new server. They  
> wouldn't run. I had to get the new software working on the  
> replacement server and then manually move over each piece.
>
> I made the mistake of believing the operating system documentation,  
> which detailed a function called "system upgrade". It was supposed  
> to work the way Mac or Windows updates work--you let it do its
> thing for a while, and then you reboot and all is well. After  
> running the system upgrade, nothing worked any more, including the  
> few services that had been on that machine. After asking the  
> experts, I realized that I was going to have to wipe the machine, do  
> a clean install, get all of the necessary apps installed, and then  
> restore both sets of backups (LUG server and previous contents of  
> that server) to the clean system.
>
> So far this is not a crazy plan. I've done things like it many times  
> before, though the 9-year software update gap made for a few  
> challenges.
>
> Once I got all of the apps installed and the backups restored, I  
> immediately typed the command to turn it all on
>       /local/mailman/bin/mailmanctl start
> and nothing happened. The error log showed a preposterous,
> hard-to-believe error message.
>
> The wise person's first step in debugging strange failures on  
> computers is to type the error message into a search engine (I use  
> Bing) to see if other people had asked about it. To my great  
> astonishment, no one had. This never happens. Somebody else *always*  
> has the same problem and has asked about it.
>
> I then started reading the source code of Mailman, trying to see  
> what circumstances would cause it to generate that message.  Mailman  
> is written in a language called Python. When you are having trouble  
> like this, a good step is to explore "version skew". Mailman Version  
> XXX works only with Python Version YYY. The versions of Python that  
> are extant just now are 2.5, 2.6, 2.7, 3.2, 3.3, and 3.4.  This is  
> an abnormally large spread of "current" versions, which usually  
> means that the language developers have made incompatible changes  
> and have to keep old versions around for apps that have come to  
> depend on them.
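
A version-skew check of the kind described above can be sketched in a few lines of Python. This is only an illustration: the supported range below is a hypothetical example, not a statement of what any Mailman release requires, and `version_in_range` is a name made up here.

```python
import sys

def version_in_range(version, minimum, maximum):
    """Return True if minimum <= (major, minor) of version <= maximum."""
    return tuple(minimum) <= tuple(version[:2]) <= tuple(maximum)

# Hypothetical range for an application that only runs on Python 2.x.
SUPPORTED_MIN = (2, 5)
SUPPORTED_MAX = (2, 7)

running = sys.version_info[:2]
if not version_in_range(running, SUPPORTED_MIN, SUPPORTED_MAX):
    # Report the mismatch instead of letting the app fail in some odd way later.
    print("version skew: running %d.%d, expected %d.%d through %d.%d"
          % (running + SUPPORTED_MIN + SUPPORTED_MAX))
```

A check like this only rules skew in or out. As the story goes on to show, the interpreter version can be fine while something else is stale.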
>
> I tried all 6 of those Python versions. I got the same odd error in  
> the 2.* versions, and absolute chaos in the 3.* versions. Since the  
> version of Mailman that I wanted to use (2.1.18) failed the same way  
> with all of the 2.* Python versions, I wiped the slate clean one  
> last time and installed Python 2.7.
>
> Gonna have to find this problem the old-fashioned way.
>
> Many days pass as I read documentation, run tests, explore the  
> software, use debuggers, create and read log files, all to no avail.
>
> Then I decided to instrument and log what was happening when
> Mailman/Python started up. Figuring out how much information to put
> in a log file is a black art. If you log too much, you will never
> find what you are looking for in the swamp of details. If you log
> too little, you probably won't log what you're looking for.
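
The too-much versus too-little trade-off is what log levels are for. A minimal sketch using Python's standard `logging` module follows. The logger name, messages and paths are invented for illustration and have nothing to do with Mailman's real logging.

```python
import io
import logging

def make_logger(name, stream, level):
    """Build a logger that writes records at or above `level` to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers = []        # start clean if the logger already exists
    logger.propagate = False    # keep records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

buf = io.StringIO()
log = make_logger("startup", buf, logging.INFO)

# DEBUG detail is dropped while the level is INFO ...
log.debug("searching path entry %s", "/hypothetical/dir")
# ... but the one line that matters is kept.
log.info("loaded module %s from %s", "email", "/hypothetical/dir/email")

print(buf.getvalue())
```

Lowering the level to `logging.DEBUG` brings the swamp of details back when it is needed, without touching the call sites.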
>
> After far too much time staring at the logs, I saw that Python was  
> initializing from a library that was not listed in the Mailman  
> documentation.
>
> An aside: language systems like Python tend to be aggressive in how  
> they find libraries. They look around and if they find something  
> that looks like a library, they use it. I'm sure the Python  
> designers (none of whom is named Monty) thought they were doing the  
> world a favor by making it go out and find its own libraries.  
> "Autoconfiguration" run amok. Bad idea.
>
> This library was obsolete. In the 9 years of not upgrading, the  
> Mailman software had changed the place where it kept certain library  
> functions, and both of them were present in the version I was trying  
> to run. The "wipe clean and reinstall" function only wiped the  
> directories that it knew about, and this obsolete directory was not  
> on its list -- it had been retired years ago -- so it didn't get  
> removed by the "wipe clean" function.
>
> If I had run all 12 of the upgrades between Mailman 2.1.6 and  
> 2.1.18, one of them would surely have deleted that newly-obsolete  
> directory. But I didn't, so it was still there.
>
> When a complex computer system is using two different versions of  
> the same library, with creation dates 7 years apart, it doesn't  
> stand a chance of working.
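
The situation above, two copies of one library visible to the same interpreter, can be checked for directly. Below is a minimal sketch in present-day Python 3, so it is not something that would have run under the Python 2.7 in the story. `module_origin` and `find_copies` are names made up for illustration.

```python
import importlib.util
import os
import sys

def module_origin(name):
    """Return the file the import system would load for `name`, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None

def find_copies(name, search_path):
    """List, in search order, every package or module called `name`."""
    hits = []
    for entry in search_path:
        entry = entry or os.curdir   # an empty entry means the current directory
        if not os.path.isdir(entry):
            continue                 # skip zip files and missing directories
        package = os.path.join(entry, name, "__init__.py")
        module = os.path.join(entry, name + ".py")
        if os.path.isfile(package):
            hits.append(package)
        elif os.path.isfile(module):
            hits.append(module)
    return hits

print("email resolves to:", module_origin("email"))
copies = find_copies("email", sys.path)
if len(copies) > 1:
    print("more than one copy on the search path:")
    for path in copies:
        print("  ", path)
```

The first entry that `find_copies` returns is the one that wins, since directories are searched in order. Any later entries are the ones to be suspicious of.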
>
> I typed the Unix command
>       rm -rf /local/mailman/Mailman/pythonlib/email
> which got rid of the ancient and incompatible library,
> and everything started working. Perfectly.
>
> There were hundreds of loose ends, and I spent the next week hunting  
> them down, but it wasn't taking 18 hours a day and LUG mail was  
> flowing while I did it.
>
> Thanks for listening.
> Brian Reid
> LUG Saloonkeeper and server wrangler
>
> _______________________________________________
> Leica Users Group.
> See http://leica-users.org/mailman/listinfo/lug for more information

One sees clearly only with the heart. What is essential is invisible  
to the eye. Antoine de Saint-Exupéry in Le Petit Prince.
NO ARCHIVE






In reply to: Message from reid at mejac.palo-alto.ca.us (Brian Reid) ([Leica] Narrative about the extended LUG outage)