Archived posting to the Leica Users Group, 2014/07/02
Congrats Brian. Sounds like a lot of work? Was it "apt-get autoremove" that failed? I have been finding that a pain on my systems too. Thanks for all your hard work on our behalf. We are all very grateful for keeping the LUG going.

Peter

On 02/07/2014 05:24, Brian Reid wrote:
> In case you care.
>
> Server computers that are engineered for reliability have two power supplies and two power cords. Power supplies are the most frequent component to fail in server computers, so having two of them makes it survive the outage of one.
>
> The server computer that had supported the LUG had two power supplies. They were stacked vertically, one on top of the other. Both power supplies had been running 24x7 for about 9 years, and their fans had sucked in a certain amount of lint. Lint is flammable. The bottom power supply failed, and the lint caught fire. The flame rose to the upper power supply and ignited its stored-up lint also. Like firestarters in a Franklin stove, the 20-second burst of flame was enough to ignite the various flammable items (including lint) in the main enclosure. The flash fire probably only lasted 40 or 50 seconds, but it was hot enough to destroy most of the solder traces that were near the power supplies on the circuit boards. There were various plastic tags on some of the cables, which added flammable material.
>
> You can go to the store and buy a laptop or a desktop computer, but you really can't go buy a server computer. Yes, this being silicon valley, there are stores around that sell server computers (Central Computer is the best of the lot) but buying a server computer at a retail store is like buying a bicycle at a department store. It's just not the same thing.
> Server computers are special-order, because there are so many variations on how they are built that no one can afford to keep good ones in inventory.
>
> The fire was on a Saturday morning, and I knew that the soonest I could even place an order for a replacement server was Monday, and even at rush-rush prices I wouldn't get it until Thursday. At the time a Saturday-to-Thursday outage seemed unconscionable. So I decided to move the LUG and its supporting software to the newest and emptiest of my half-dozen servers. It wasn't exactly a spare--it was running a few little things--but mostly it was idle.
>
> The LUG server had been running software from the era of its installation, about 2005. The new server was built with chips and components that the old software didn't understand, so I couldn't just restore the LUG server backups onto the new server. They wouldn't run. I had to get the new software working on the replacement server and then manually move over each piece.
>
> I made the mistake of believing the operating system documentation, which detailed a function called "system upgrade". It was supposed to work the way Mac or Windows updates work--you let it do its thing for a while, and then you reboot and all is well. After running the system upgrade, nothing worked any more, including the few services that had been on that machine. After asking the experts, I realized that I was going to have to wipe the machine, do a clean install, get all of the necessary apps installed, and then restore both sets of backups (LUG server and previous contents of that server) to the clean system.
>
> So far this is not a crazy plan. I've done things like it many times before, though the 9-year software update gap made for a few challenges.
>
> Once I got all of the apps installed and the backups restored, I immediately typed the command to turn it all on
>     /local/mailman/bin/mailmanctl start
> and nothing happened. The error log showed a preposterous, deeply hard to believe error message.
>
> The wise person's first step in debugging strange failures on computers is to type the error message into a search engine (I use Bing) to see if other people had asked about it. To my great astonishment, no one had. This never happens. Somebody else *always* has the same problem and has asked about it.
>
> I then started reading the source code of Mailman, trying to see what circumstances would cause it to generate that message. Mailman is written in a language called Python. When you are having trouble like this, a good step is to explore "version skew". Mailman Version XXX works only with Python Version YYY. The versions of Python that are extant just now are 2.5, 2.6, 2.7, 3.2, 3.3, and 3.4. This is an abnormally large spread of "current" versions, which usually means that the language developers have made incompatible changes and have to keep old versions around for apps that have come to depend on them.
>
> I tried all 6 of those Python versions. I got the same odd error in the 2.* versions, and absolute chaos in the 3.* versions. Since the version of Mailman that I wanted to use (2.1.18) failed the same way with all of the 2.* Python versions, I wiped the slate clean one last time and installed Python 2.7.
>
> Gonna have to find this problem the old-fashioned way.
>
> Many days pass as I read documentation, run tests, explore the software, use debuggers, create and read log files, all to no avail.
>
> Then I decided to instrument and log what was happening when Mailman/Python started up. Figuring out how much information to put in a log file is a black art.
> If you log too much, you will never find what you are looking for in the swamp of details. If you log too little, you probably won't log what you're looking for.
>
> After far too much time staring at the logs, I saw that Python was initializing from a library that was not listed in the Mailman documentation.
>
> An aside: language systems like Python tend to be aggressive in how they find libraries. They look around and if they find something that looks like a library, they use it. I'm sure the Python designers (none of whom is named Monty) thought they were doing the world a favor by making it go out and find its own libraries. "Autoconfiguration" run amok. Bad idea.
>
> This library was obsolete. In the 9 years of not upgrading, the Mailman software had changed the place where it kept certain library functions, and both of them were present in the version I was trying to run. The "wipe clean and reinstall" function only wiped the directories that it knew about, and this obsolete directory was not on its list -- it had been retired years ago -- so it didn't get removed by the "wipe clean" function.
>
> If I had run all 12 of the upgrades between Mailman 2.1.6 and 2.1.18, one of them would surely have deleted that newly-obsolete directory. But I didn't, so it was still there.
>
> When a complex computer system is using two different versions of the same library, with creation dates 7 years apart, it doesn't stand a chance of working.
>
> I typed the Unix command "rm -rf /local/mailman/Mailman/pythonlib/email"
> which got rid of the ancient and incompatible library
> and everything started working. Perfectly.
>
> There were hundreds of loose ends, and I spent the next week hunting them down, but it wasn't taking 18 hours a day and LUG mail was flowing while I did it.
>
> Thanks for listening.
> Brian Reid
> LUG Saloonkeeper and server wrangler
>
> _______________________________________________
> Leica Users Group.
> See http://leica-users.org/mailman/listinfo/lug for more information

-- 
===========================================================
Dr Peter Dzwig
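
For anyone chasing a similar stale-library problem: the failure Brian describes comes from Python taking the first matching package it finds on its search path. The sketch below is one rough way to check for that. It assumes a current Python 3 interpreter (not the 2.7 that Mailman 2.1.18 ran on), and the helper names find_package_copies and resolved_location are made up for illustration; they are not part of Mailman or of Python itself.

```python
import importlib.util
import os
import sys


def find_package_copies(name, search_path=None):
    """Return every location on the search path that holds a copy of `name`.

    More than one result means an earlier copy shadows the later ones.
    Hypothetical helper: it only checks plain directories and .py files.
    """
    if search_path is None:
        search_path = sys.path
    copies = []
    for entry in search_path:
        base = entry or os.getcwd()  # an empty entry means the current directory
        pkg_dir = os.path.join(base, name)
        module_file = os.path.join(base, name + ".py")
        if os.path.isfile(os.path.join(pkg_dir, "__init__.py")):
            copies.append(pkg_dir)
        elif os.path.isfile(module_file):
            copies.append(module_file)
    return copies


def resolved_location(name):
    """Return the file Python would actually import for `name`, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None


if __name__ == "__main__":
    # "email" is the package named in the post; any top-level name works.
    print("would import:", resolved_location("email"))
    for path in find_package_copies("email"):
        print("copy found at:", path)
```

If the second list shows more than one copy, the first one on the path is the one that gets used. Whether the old /local/mailman/Mailman/pythonlib directory was on the search path in exactly this way is an assumption here; the post only says that Python was initializing from a library not listed in the Mailman documentation.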