Archived posting to the Leica Users Group, 2014/07/02
+1 and a standing ovation!

On Wed, Jul 2, 2014 at 7:15 AM, Richard Taylor <r.s.taylor at comcast.net> wrote:

> Oy vey!, as we used to say in the old country (Flatbush). Been there done that on some systems I've worked on. I feel your pain.
>
> Best,
>
> Dick
>
> On Jul 02, 2014, at 12:24 AM, Brian Reid <reid at mejac.palo-alto.ca.us> wrote:
>
> > In case you care.
> >
> > Server computers that are engineered for reliability have two power supplies and two power cords. Power supplies are the most frequent component to fail in server computers, so having two of them lets the machine survive the failure of one.
> >
> > The server computer that had supported the LUG had two power supplies. They were stacked vertically, one on top of the other. Both power supplies had been running 24x7 for about 9 years, and their fans had sucked in a certain amount of lint. Lint is flammable. The bottom power supply failed, and the lint caught fire. The flame rose to the upper power supply and ignited its stored-up lint also. Like firestarters in a Franklin stove, the 20-second burst of flame was enough to ignite the various flammable items (including lint) in the main enclosure. The flash fire probably only lasted 40 or 50 seconds, but it was hot enough to destroy most of the solder traces that were near the power supplies on the circuit boards. There were various plastic tags on some of the cables, which added flammable material.
> >
> > You can go to the store and buy a laptop or a desktop computer, but you really can't go buy a server computer. Yes, this being Silicon Valley, there are stores around that sell server computers (Central Computer is the best of the lot) but buying a server computer at a retail store is like buying a bicycle at a department store. It's just not the same thing.
> > Server computers are special-order, because there are so many variations on how they are built that no one can afford to keep good ones in inventory.
> >
> > The fire was on a Saturday morning, and I knew that the soonest I could even place an order for a replacement server was Monday, and even at rush-rush prices I wouldn't get it until Thursday. At the time a Saturday-to-Thursday outage seemed unconscionable. So I decided to move the LUG and its supporting software to the newest and emptiest of my half-dozen servers. It wasn't exactly a spare--it was running a few little things--but mostly it was idle.
> >
> > The LUG server had been running software from the era of its installation, about 2005. The new server was built with chips and components that the old software didn't understand, so I couldn't just restore the LUG server backups onto the new server. They wouldn't run. I had to get the new software working on the replacement server and then manually move over each piece.
> >
> > I made the mistake of believing the operating system documentation, which detailed a function called "system upgrade". It was supposed to work the way Mac or Windows updates work--you let it do its thing for a while, and then you reboot and all is well. After running the system upgrade, nothing worked any more, including the few services that had been on that machine. After asking the experts, I realized that I was going to have to wipe the machine, do a clean install, get all of the necessary apps installed, and then restore both sets of backups (LUG server and previous contents of that server) to the clean system.
> >
> > So far this is not a crazy plan. I've done things like it many times before, though the 9-year software update gap made for a few challenges.
> >
> > Once I got all of the apps installed and the backups restored, I immediately typed the command to turn it all on
> >     /local/mailman/bin/mailmanctl start
> > and nothing happened. The error log showed a preposterous, deeply hard to believe error message.
> >
> > The wise person's first step in debugging strange failures on computers is to type the error message into a search engine (I use Bing) to see if other people had asked about it. To my great astonishment, no one had. This never happens. Somebody else *always* has the same problem and has asked about it.
> >
> > I then started reading the source code of Mailman, trying to see what circumstances would cause it to generate that message. Mailman is written in a language called Python. When you are having trouble like this, a good step is to explore "version skew". Mailman Version XXX works only with Python Version YYY. The versions of Python that are extant just now are 2.5, 2.6, 2.7, 3.2, 3.3, and 3.4. This is an abnormally large spread of "current" versions, which usually means that the language developers have made incompatible changes and have to keep old versions around for apps that have come to depend on them.
> >
> > I tried all 6 of those Python versions. I got the same odd error in the 2.* versions, and absolute chaos in the 3.* versions. Since the version of Mailman that I wanted to use (2.1.18) failed the same way with all of the 2.* Python versions, I wiped the slate clean one last time and installed Python 2.7.
> >
> > Gonna have to find this problem the old-fashioned way.
> >
> > Many days pass as I read documentation, run tests, explore the software, use debuggers, create and read log files, all to no avail.
> >
> > Then I decided to instrument and log what was happening when Mailman/Python started up. Figuring out how much information to put in a log file is a black art.
> > If you log too much, you will never find what you are looking for in the swamp of details. If you log too little, you probably won't log what you're looking for.
> >
> > After far too much time staring at the logs, I saw that Python was initializing from a library that was not listed in the Mailman documentation.
> >
> > An aside: language systems like Python tend to be aggressive in how they find libraries. They look around and if they find something that looks like a library, they use it. I'm sure the Python designers (none of whom is named Monty) thought they were doing the world a favor by making it go out and find its own libraries. "Autoconfiguration" run amok. Bad idea.
> >
> > This library was obsolete. In the 9 years of not upgrading, the Mailman software had changed the place where it kept certain library functions, and both of them were present in the version I was trying to run. The "wipe clean and reinstall" function only wiped the directories that it knew about, and this obsolete directory was not on its list -- it had been retired years ago -- so it didn't get removed by the "wipe clean" function.
> >
> > If I had run all 12 of the upgrades between Mailman 2.1.6 and 2.1.18, one of them would surely have deleted that newly-obsolete directory. But I didn't, so it was still there.
> >
> > When a complex computer system is using two different versions of the same library, with creation dates 7 years apart, it doesn't stand a chance of working.
> >
> > I typed the Unix command "rm -rf /local/mailman/Mailman/pythonlib/email"
> > which got rid of the ancient and incompatible library
> > and everything started working. Perfectly.
> >
> > There were hundreds of loose ends, and I spent the next week hunting them down, but it wasn't taking 18 hours a day and LUG mail was flowing while I did it.
> >
> > Thanks for listening.
> > Brian Reid
> > LUG Saloonkeeper and server wrangler
> >
> > _______________________________________________
> > Leica Users Group.
> > See http://leica-users.org/mailman/listinfo/lug for more information
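
For anyone who wants to check for the kind of stale-library problem described above (two copies of the same library, with Python quietly picking up the wrong one), here is a minimal sketch. It is written for Python 3, not the Python 2.7 used in the story. The helper names `find_copies` and `report` are made up for illustration and are not part of Mailman or Python. It only looks for plain package directories and .py files on the search path, so it will miss zipped or otherwise exotic installs.

```python
import importlib.util
import os
import sys


def find_copies(package_name, search_path=None):
    """Return every location on the search path holding a copy of package_name.

    Hypothetical helper: checks only for a package directory containing
    __init__.py or a single-file module named <package_name>.py.
    """
    if search_path is None:
        search_path = sys.path
    copies = []
    for entry in search_path:
        base = entry or os.getcwd()  # an empty entry means the current directory
        pkg_dir = os.path.join(base, package_name)
        module_file = pkg_dir + ".py"
        if os.path.isfile(os.path.join(pkg_dir, "__init__.py")):
            copies.append(pkg_dir)
        elif os.path.isfile(module_file):
            copies.append(module_file)
    return copies


def report(package_name):
    """Print where the import system would load package_name from, plus all copies found."""
    spec = importlib.util.find_spec(package_name)
    origin = spec.origin if spec else None
    print(f"{package_name} would load from: {origin}")
    for path in find_copies(package_name):
        print(f"  copy on search path: {path}")


if __name__ == "__main__":
    # "email" is the library named in the post; any importable name works here.
    report("email")
```

If `report` lists more than one copy, the first one on the search path is normally the one that gets used, so an old directory sitting ahead of the intended one is worth a closer look.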