The HP-UX Admin Man: It's Not My Fault
Fred's been reading about "fault management" and shares some of the blame about what he's learned.
The term "fault management" is an interesting one. In a book I read recently, it is defined as "detecting and reporting unusual or unacceptable behavior." I suppose it is like the half-full/half-empty glass syndrome. Would you rather manage system integrity or system faults? Does your job description say you are responsible for system faults? If so, what an easy job. You can spend the whole day going around unplugging network cables and peripherals, killing daemons, shaking disks and creating massive sparse files to overflow disks.
On the other hand, your job responsibility might include availability management. This would include monitoring the performance and behavior of the systems and networks under your control. You would be monitoring for abnormal conditions, so that you can fix them quickly if they arise, or before they affect availability.
Hmmm ... after thinking about it, maybe fault management is a pretty good term after all. This train of thought occurred recently when a book fell from heaven and landed in my snailmail box. The title, you could guess, is UNIX Fault Management (HP Professional Books Series, Prentice Hall) by Brad Stone and Julie Symons.
In the book, the authors make the assumption that the system operators are not the system administrators, which I found interesting. For many of us, they are the same person.
One of the first paragraphs in the book states that a study conducted by the GartnerGroup determined that operators are responsible for most unplanned downtime situations, which reinforces my thought that fault management really is a good term. It goes on to imply that automating system monitoring is a good idea, and one that I agree with. This leads to what I think is one of the two best uses of this book.
If you are trying to decide which of the many available system monitoring products to purchase, read the book. It reviews the usage, extensibility and capability of several products, discussing both free and purchased methods. It also talks about using system commands for this purpose. I think this would be very helpful in making a decision, partly because the book specifically mentions HP-UX products.
The second great use of this book would be to make it required browsing for new system administrators. The second chapter describes major events in the life of a system and identifies those pieces that should be recorded, tracked or detected. They are all good things for admins to be aware of, if they are not already.
Chapter three is a good overview of how monitoring tools might (and do) work, listing out the general behavior of several tools, such as IT/Operations, Unicenter ITG, SyMON, BMC Patrol, PLATINUM ProVision and Measureware. This is good reading, since it tries to inform you of the general approach each tool takes, and you can decide if that fits your perspective.
The remainder of the book goes into more detail about each section of the system you might need to monitor (system, disks, network, applications, databases and the enterprise as a whole). The book covers how the various tools perform and, most important to a cheap company like ours, how to use system tools for each area of monitoring.
One thing I found rather annoying about the book is that the authors did not use any visual divider when changing from discussions of one tool to another. I would strongly suggest something more visible than just a heading in a larger point size. How about a new page, or a horizontal rule?
Here are some of the things I learned about using standard commands in just a brief browsing of chapter four.
The "file table full" error message often catches new system administrators. If you did not configure enough file table space, you can have a "full" disk when most of it is empty. This can happen when not enough space is allowed for the "index" of files (the inodes). I had never happened to read about the -i option to the bdf command before. When used, it adds inode columns to the usual report, among them iused and ifree. This lets you know quite easily if you are approaching a problem. In fact, a rather simple Perl program could be used to test for either type of disk full problem (not enough space, or not enough index area). The program lines:
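open(BDF, "bdf -i |") or die "cannot run bdf: $!";
while (<BDF>) {
    @data = split;  # assumed fields: 0 = file system, 5 = iused, 6 = ifree
    next unless @data == 9 && $data[5] =~ /^\d+$/ && $data[5] + $data[6] > 0;  # skip header and wrapped lines
    printf "%s %d\n", $data[0], $data[5] / ($data[5] + $data[6]) * 100;
}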
Might result in one line per file system, giving its name and the percentage of inode space in use.
A first line showing 89, for example, is an indicator that the first disk is getting into trouble (89 percent of the allocated inode space is used up). If you then compare this with the disk free space, you might find that it is normal (full disk and full inode space is okay). If the disk free space is high but the disk is running out of inode space, this would be very bad.
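The comparison described above, inode usage set against free space, can be sketched in Perl along the following lines. This is only a sketch under stated assumptions: it assumes bdf -i prints the columns Filesystem, kbytes, used, avail, %used, iused, ifree, %iuse and Mounted on, in that order; the sample report is invented for illustration rather than captured from a real system; the 85 percent limit is arbitrary; and the names parse_bdf_i, pct and classify are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample in the assumed "bdf -i" layout (not real output).
my $SAMPLE = <<'END';
Filesystem          kbytes    used   avail %used  iused  ifree %iuse Mounted on
/dev/vg00/lvol3      86016   30120   55896   35%   8900   1100   89% /
/dev/vg00/lvol5     512000  486400   25600   95%   4000  76000    5% /var
/dev/vg00/lvol6     256000   64000  192000   25%   1200  30800    4% /opt
END

# Percentage of used out of used + free, rounded to a whole number.
sub pct {
    my ($used, $free) = @_;
    my $total = $used + $free;
    return $total ? int(100 * $used / $total + 0.5) : 0;
}

# Turn bdf -i text into one record per file system.
sub parse_bdf_i {
    my ($text) = @_;
    my (@records, $pending);
    for my $line (split /\n/, $text) {
        next if $line =~ /^Filesystem/;           # skip the header
        my @f = split ' ', $line;
        next unless @f;
        if (@f == 1) { $pending = $f[0]; next; }  # long device name on its own line
        if (defined $pending) { unshift @f, $pending; undef $pending; }
        next unless @f >= 9;
        push @records, {
            fs        => $f[0],
            mount     => $f[8],
            space_pct => pct($f[2], $f[3]),       # used, avail
            inode_pct => pct($f[5], $f[6]),       # iused, ifree
        };
    }
    return @records;
}

# Say which kind of "disk full" a record is approaching, if any.
sub classify {
    my ($rec, $limit) = @_;
    my $space  = $rec->{space_pct} >= $limit;
    my $inodes = $rec->{inode_pct} >= $limit;
    return 'both'   if $space && $inodes;  # full disk, full inode space
    return 'space'  if $space;             # plain shortage of disk space
    return 'inodes' if $inodes;            # free space left, inodes nearly gone
    return 'ok';
}

for my $rec (parse_bdf_i($SAMPLE)) {
    printf "%-6s %3d%% space %3d%% inodes  %s\n",
        $rec->{mount}, $rec->{space_pct}, $rec->{inode_pct},
        classify($rec, 85);
}
```

On a live system the sample text would be replaced by the real report, for instance by reading the output of bdf -i with backticks. The 'inodes' result is the case worth an alarm, since free space alone would not reveal it.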
Another surprise to me was the ipcs command. I had been asked in the past about a way to list shared memory segments, and the book mentioned that there was such a command. It also lists active message queues and semaphores. Another memory jogger for me was the sysdef command, which lists kernel configurations.
Most of the usual system monitoring commands are mentioned, though without much detail or usage information. That is why I recommend that new admins browse the book to become familiar with available methods. Some of the commands mentioned for system monitoring are mailstat, ps, sar, iostat, ioscan, swapinfo and top.
The disk and network chapters do a good job of listing out available standard commands and what they are used for. I found no surprises there, but the section on available monitoring programs was quite interesting, since I have little experience with them.
I learned a lot by reading through descriptions of the different tools, and would want to read this if I were responsible for making a decision about an impending purchase.
The book also has a couple of case studies that give examples of detecting and correcting problems with memory and disk mirroring, which would give newer administrators a taste of real life.
Chapter eight of the book is about databases, which is an area I have little experience with. I think I’ll go read it again.
And if you agree that it’s not Fred’s fault, he can be reached at firstname.lastname@example.org.