OpenSuse 12.3 NFS Problems
Did OpenSuse 12.3 cause a problem? If you call hanging by the neck until dead a problem, then as Perry Mason would say: Yes. Yes it did. Update: Partial solution found
Summary
The Dell PowerEdge T410 uses a PERC H200 RAID controller. With hardware RAID, the BIOS takes care of all RAID functions like recovery and consistency checking transparently. The OS just sees one large disk. Someone at our location bought two of these babies, and despite our problems with OpenSuse 12.1 we decided to give OpenSuse another chance. We deleted the Microsoft Server partitions, added three Linux ones (root, /home, and a swap) and formatted them with ext4. OpenSuse installed without any major incident, but locked up solid after 300-400 GB of data were copied onto the hard drive.
Background
Since I am the closest thing my employer has to a Unix expert, they asked me to configure them for one of their locations, which is a three- to four-hour drive (depending on the traffic) away from where I work. Here are the gory details of the problems we had. Some of them were caused by wonky hardware. But the rest were caused by changes in Linux that seem to have been made simply for the sake of change.
- Mexican keyboard: Somebody found a great deal on some USB keyboards.
Unfortunately, they turned out to be Mexican keyboards. Only a few of the
punctuation keys actually matched what was printed on the keycaps. For
example, to get a forward slash ('/'), you have to press the '_' key.
To get a '+', you press the '¿' key. Now you know why there are
so few good computer programmers in Mexico.
- Acer V183HV Monitor: This inexpensive and compact monitor seems
to only have one screen mode: 1366×768. Linux Xorg doesn't handle
this mode without major tweaking. But we couldn't even get far enough to
get to the tweaking step, because Opensuse switched to graphics mode as soon
as it hit GRUB. At that point, the screen became unreadable. Configuration
was just not possible with this monitor attached. Changing GRUB's config
file boot parameters to use text mode or other VGA modes didn't work—they
were all ignored.
- Other Monitors: Even though there's a computer store two blocks
away, no one at this location is allowed to purchase anything. Rules are
rules. The administrative staff has to ensure their continued employment
somehow. So for installation, we borrowed a CRT monitor from their old server.
Then someone found an old 14-inch LCD monitor that was not being used, stuck
behind the desk of a former employee. As soon as it was attached, Xorg
decided that there were “No Valid Screen Modes” and unceremoniously
deleted its own configuration file. As a result, no monitors would work.
It turns out there is no more xorg.conf file. It's been replaced by a /etc/X11/xorg.conf.d directory with a bunch of separate files in it. But if you create an xorg.conf file, Xorg can use it. We finally found a file called /etc/X11/xorg.conf.install and copied it to /etc/X11/xorg.conf.
- X11 doesn't start for regular user: Edit /etc/permissions.local and uncomment the last line about Xorg being 4711. Then, as root, run: chmod 4711 /usr/bin/Xorg
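If you want to confirm that the permission change actually stuck (or find out whether a later update quietly reset it), a few lines of Python will do. This is only a sketch: the helper name has_mode is made up for illustration, and the path and mode are simply the ones mentioned above.

```python
import os
import stat


def has_mode(path, expected=0o4711):
    """Return True if the permission bits of path equal expected."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected


if __name__ == "__main__":
    xorg = "/usr/bin/Xorg"  # path from the text; may differ on other distros
    if os.path.exists(xorg):
        print(xorg, "is 4711" if has_mode(xorg) else "is NOT 4711")
    else:
        print(xorg, "not found")
```

Run it after the next reboot or update if regular users suddenly can't start X again.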
Installation Oddities
Opensuse 12.3 has quite a few nice shiny new bugs. Or maybe we just never noticed them before.
- Automatic Login: On one computer we accidentally left this option checked.
By default the system boots up into what used to be called runlevel 5, and with automatic
login, rather than logging you in automatically as you might expect, it presents the xdm
login prompt for that particular user. If you enter their password, it says “login
successful” and then recycles endlessly back to their password prompt. So the
“Auto Login” feature actually blocks that particular person from
ever logging in. We couldn't find any way to fix this problem short of re-installing
the entire operating system.
- inittab no longer works: The standard way of changing the default
runlevel no longer works. Damn kids, always changing stuff.
- Systemd: All the distros (except Debian, which is planning to)
have switched to systemd, which means your old custom startup scripts don't work.
Someday we might get the time to figure out whether there's a way to make
systemd handle them. For now, we just put a note on the machine reminding
people to run the scripts manually.
- Some config files work, and some don't: We were able to change from xdm to gdm by editing /etc/sysconfig/displaymanager. But you can't use this method to switch to using startx instead of xdm. Changing it to none just causes Xorg to hang. Go figure.
- IMAP: The Cyrus Imapd on the DVD didn't work. See linuxsetup117.html for details.
- Apache httpd: The Apache web server, for the first time ever, worked out of the box, or at least seemed to, putting up a blank web page that says "It works." But we couldn't find it in the process table, and it didn't work with our PHP files. So we compiled and installed a custom one.
OpenSuse 12.3 NFS Hangs By The Neck Until It Is Dead
So we got past all these little annoyances. I guess the new guys have to make
their mark on Linux, so they make useless changes like switching from sysvinit to
systemd and eliminating /etc/inittab, so this sort of thing is
inevitable. Maybe it even makes desktops boot faster. But a server will take 5-10
minutes before it even gets to GRUB. Those few seconds systemd saves mean nothing.
All it does is make the machine harder to administer remotely. Used to be I could log in
over my cell phone and fix problems. Ah, the good old days.
We also scraped all the dried salsa off that Mexican keyboard, and resigned ourselves to using a giant 20-year-old CRT monitor from the Ronald Reagan era. Then we found something wonky with the NFS in OpenSuse 12.3 that caused big problems.
After four hours of copying users' data files over NFS, it just locked up. I
use cp -pRuv so I can watch the files coming in, and after copying 300-400
GB, they just stopped coming. Typing df caused the terminal to hang. In
another terminal, we found that all the partitions were readable except '/', which
caused Suse to hang whenever we tried to read it.
What. The. Fuque. By now it was 4 p.m., with an almost four hour drive ahead of me, and Opensuse had given me another flare-up of the old Multiple Continuous Spewing Expletive (MCSE) Syndrome. (Not to be confused with Microsoft Certified Systems Engineer Syndrome, which is very similar.)
The system load was near zero, and of course there were no error messages. So I rebooted into "rescue system" mode, and had no problems reading the files in the root directory ('/'). Everything looked intact. Nothing in log/messages. After reboot, I get the normal password prompt, then these symptoms:
- Hit the Entrar key without typing anything: it gives another login prompt immediately, as normal.
- Type an invalid user name or password: it hangs for 30 seconds, then says "invalid username or password" and the login prompt returns.
- Type a valid username and password: it hangs for 60 seconds, then says "Login timed out after 60 seconds" and the login prompt returns.
- Try to log in over ssh: after you type your password, it hangs indefinitely.
We considered that maybe the system was doing LDAP for some idiotic reason, possibly looking on the network for authentication. We booted into rescue mode, edited nsswitch.conf, and wiped everything that we could find that had anything to do with NIS or LDAP. (Of course, it was not possible to run yast2.) No effect.
No problem, we have another server configured identically. So I started copying files onto that one, and the Exact Same Thing happened, except this time it happened immediately after the first reboot. So it was not a bad hard drive; anyway, a RAID consistency check (a great feature, except that it takes over an hour) showed no errors.
Partial solution: We finally discovered that the login problem was caused by having copied the shadow file from the old machine several hours earlier. Apparently the file format has been changed. The new file contains several extra fields.
Here is how to reproduce the problem:
- Copy an old-style shadow file (with 20 or so usernames) and its corresponding passwd file to /etc/shadow on the new machine.
- Set the root password and one user password on the new machine. You now have some passwords in the new format and the rest in the old format.
- Verify that it is possible to log in and that su works. The system works fine, with no errors or warnings, until you reboot.
- Reboot.
- Suddenly none of the passwords work, including the new ones in the new format, and login hangs. You are locked out of the system.
Login will also hang if the number of lines in /etc/shadow is
different from the number of lines in /etc/passwd. Again, logins
work fine until you reboot, possibly months later, and you discover that you
are suddenly unable to log in.
It seems that login processes the file differently after a reboot.
A single misplaced character in any line of /etc/shadow could cause
it to hang. This is tough to identify, because it has no effect for weeks or
months until the next time you reboot. The only way to recover is to boot into
rescue mode and delete the shadow file, then rebuild it user by user. A pain in
the neck if you have thousands of them.
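Since the damage stays invisible until the next reboot, it seems worth checking the two files right after touching them. Here is a minimal sketch in Python of the checks described above: matching line counts, matching usernames, and the expected number of colon-separated fields (7 in passwd, 9 in a current-format shadow). The function names are invented for illustration, and passing these checks does not prove that login will be happy; it only catches the conditions we ran into.

```python
PASSWD_FIELDS = 7   # name:passwd:uid:gid:gecos:home:shell
SHADOW_FIELDS = 9   # name:hash:lastchg:min:max:warn:inactive:expire:reserved


def parse_names(lines, expected_fields, label):
    """Return (names, problems) for a passwd- or shadow-style list of lines."""
    names, problems = [], []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            problems.append(f"{label}:{lineno}: blank line")
            continue
        fields = line.split(":")
        if len(fields) != expected_fields:
            problems.append(
                f"{label}:{lineno}: {len(fields)} fields, expected {expected_fields}"
            )
        names.append(fields[0])
    return names, problems


def check_consistency(passwd_lines, shadow_lines):
    """Return a list of human-readable mismatches between the two files."""
    p_names, problems = parse_names(passwd_lines, PASSWD_FIELDS, "passwd")
    s_names, s_problems = parse_names(shadow_lines, SHADOW_FIELDS, "shadow")
    problems += s_problems
    if len(passwd_lines) != len(shadow_lines):
        problems.append(
            f"line counts differ: passwd={len(passwd_lines)}, "
            f"shadow={len(shadow_lines)}"
        )
    for name in sorted(set(p_names) - set(s_names)):
        problems.append(f"{name}: in passwd but not in shadow")
    for name in sorted(set(s_names) - set(p_names)):
        problems.append(f"{name}: in shadow but not in passwd")
    return problems


if __name__ == "__main__":
    # Made-up sample data: one good pair, one old-style short shadow entry.
    sample_passwd = [
        "root:x:0:0:root:/root:/bin/bash\n",
        "alice:x:1000:100:Alice:/home/alice:/bin/bash\n",
    ]
    sample_shadow = [
        "root:$6$salt$hash:15800:0:99999:7:::\n",
        "bob:oldhash:12000\n",
    ]
    for problem in check_consistency(sample_passwd, sample_shadow):
        print(problem)
```

To run it against the real files, read /etc/passwd and /etc/shadow with readlines() as root and pass the two lists in. The pwck utility that ships with the shadow tools does similar checking (pwck -r for read-only mode) and is probably the better first stop if it is installed.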
Another NFS problem: Copying files over NFS from a different old
server running Suse 9 with kernel 2.4.21-99 caused the new Suse 12.3 server to
hang repeatedly. The mouse, keyboard, and network were all non-responsive. It
didn't even respond to pings. The
only way to recover was by yanking the power cable. The logs in the old server
had multiple repetitions of rpc-serv/tcp: nfsd sent only -32 bytes of 32900 -
shutting down socket. We fixed this easily by changing the number of NFS
server processes on the old computer from 4 to 64 in /etc/sysconfig/nfs.
Despite the errors, the old computer continued to run just fine, but the shiny new
computer with Suse 12.3 and Linux 2.6.25.20-0.5-pae locked up solid. This, children,
is what we call progress.
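For the record, the thread-count change on the old server can be scripted. The sketch below only edits the text of a sysconfig-style file in memory. The helper set_sysconfig_value is hypothetical, and the variable name USE_KERNEL_NFSD_NUMBER is an assumption about what the setting is called in /etc/sysconfig/nfs, so check your own file before trusting it.

```python
import re


def set_sysconfig_value(text, key, value):
    """Return text with the first KEY=... line replaced by KEY="value".

    If KEY is not present, a new KEY="value" line is appended instead.
    """
    pattern = re.compile(rf"^{re.escape(key)}=.*$", re.MULTILINE)
    new_line = f'{key}="{value}"'
    if pattern.search(text):
        # A function replacement avoids backslash handling in the value.
        return pattern.sub(lambda m: new_line, text, count=1)
    sep = "" if text.endswith("\n") or not text else "\n"
    return f"{text}{sep}{new_line}\n"


if __name__ == "__main__":
    # Made-up file contents; the variable name is an assumption (see above).
    sample = '# NFS server settings\nUSE_KERNEL_NFSD_NUMBER="4"\n'
    print(set_sysconfig_value(sample, "USE_KERNEL_NFSD_NUMBER", 64), end="")
```

Writing the result back to the file and restarting the NFS server is left to you; the new thread count does nothing until the server is restarted.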