Case Study: Bypassing Performance False Alarms
How a prominent hospital increased the productivity of its IT staff with a tool that paid for itself in about two months
When a heart surgeon’s beeper goes off in the middle of the night, there's every reason to believe something's wrong that requires her immediate attention. Unhappily, the same is not always true for the on-call IT administrator. Chances are, many of you have fielded a call or been beeped during your off hours, only to discover that the incident which prompted the call wasn't serious after all.
Dave Stinson, a network director with Oklahoma Heart Hospital (OHH), knows the feeling. Over time, he grew increasingly distrustful of the alerts generated by his performance monitoring software (too many cases of “crying wolf,” he says) and, more importantly, was losing productivity as a result of following up on false alarms.
It was with some relief, he says, that he tapped a new performance-monitoring offering from BMC Software Inc.—PATROL Analytics—that’s able to differentiate between real alerts and false alarms.
“PATROL is on all of our big nodes where all of our patient data is, and it generates a lot of alarms. It’s not too much that there are absolutely too many alarms, and they’re all legitimate, but some are more of an alarm than others,” he explains. “What I needed was a way to get them all under control.”
So when does an “alarm” cease to be an alarm? As far as Stinson is concerned, it's when an event becomes predictable and repeatable. A classic example is the lowly corporate e-mail server, which is typically busiest on Monday mornings, when workers spend more than the usual amount of time sorting through a weekend’s worth of accumulated messages. Not surprisingly, this activity tends to drive up CPU use on the mail server, which means that, more often than not, the organization’s performance-monitoring software is going to generate an alarm. Of course, IT can set the mail server’s threshold artificially high for just this reason, but then some legitimate alarms, such as those triggered by a new mass-mailing worm that gets into the wild, will be ignored.
The problem, Stinson says, is that he and his colleagues were forced to sift through all of the “alarms” catalogued by PATROL—regardless of whether they were false or the real deal. The false alarms weren’t just limited to weekly e-mail spikes, either.
“Every Monday morning, when people come in, systems just get hit, and there is a lot of fallback from that,” he explains. “For example, locks get put on the database, but we were just finding out that it happens all of the time and it clears itself up in an hour. So those alerts really weren’t important to me. I’d rather get through those false alarms and make them positive alarms.”
Of course, OHH wasn’t just concerned about the performance of its e-mail or database servers. After all, the hospital also uses PATROL to monitor the health of its Cerner Millennium Health Information System, which provides vital information to doctors, nurses, and hospital administrators. At OHH and other hospitals, keeping Cerner up, running, and responsive is literally a matter of life and death. “It’s got payroll, medication, [and] pharmacy built into it. And it involves many different systems, many interfaces, TCP ports here, databases everywhere,” Stinson notes. “It’s a very complex, hard-to-manage solution, so just getting it all under control was my goal.”
That’s why Stinson and OHH tapped PATROL Analytics, a new offering from BMC that uses a third-party analysis engine from BMC partner Netuitive Inc. to crunch through performance data. Like a full-fledged business-intelligence or data-mining tool, PATROL Analytics sifts through this data, looking for hidden trends, relationships, and other items of interest. Over time, it collects performance information about an organization’s environment, essentially compiling a “profile” of system, network, and application performance over the course of an average business cycle. This lets it separate the wheat from the chaff, so to speak, of performance alerts.
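The idea of a learned performance “profile” can be illustrated with a small Python sketch. This is not BMC's or Netuitive's algorithm, which the article does not describe; it is a deliberately simple stand-in that assumes a baseline is kept per weekday-and-hour slot and that a sample counts as unusual when it strays more than a few standard deviations from that slot's history. All function names, the sample CPU figures, and the 80 percent static threshold are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_profile(history):
    """Learn a (mean, spread) baseline for each (weekday, hour) slot.

    history: iterable of (weekday, hour, cpu_percent) samples,
    with weekday 0 = Monday. Purely illustrative input format.
    """
    buckets = defaultdict(list)
    for weekday, hour, value in history:
        buckets[(weekday, hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(profile, weekday, hour, value, k=3.0, min_spread=2.0):
    """Flag a sample that deviates more than k spreads from its slot baseline."""
    slot = (weekday, hour)
    if slot not in profile:
        # Design choice for this sketch: stay quiet until a baseline exists.
        return False
    baseline, spread = profile[slot]
    # min_spread keeps a very stable slot from alarming on tiny wiggles.
    return abs(value - baseline) > k * max(spread, min_spread)


def static_alarm(value, threshold=80.0):
    """Conventional fixed-threshold check, shown for comparison."""
    return value > threshold


# Hypothetical four weeks of 9:00 a.m. mail-server CPU readings.
history = (
    [(0, 9, v) for v in (88, 92, 90, 86)]    # Mondays: routinely busy
    + [(2, 9, v) for v in (35, 40, 38, 37)]  # Wednesdays: routinely quiet
)
profile = build_profile(history)

# A normal Monday spike trips the fixed threshold but not the profile.
print(static_alarm(91), is_anomalous(profile, 0, 9, 91))
# An unusual Wednesday surge stays under the fixed threshold but is flagged.
print(static_alarm(75), is_anomalous(profile, 2, 9, 75))
```

The two comparisons mirror the Monday-morning mail server case: a fixed threshold fires on the expected weekly spike and misses an out-of-character surge at a normally quiet hour, while a per-slot baseline does the reverse.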
“We begin to learn the behavior profiles of not only the system or a business service, but the correlated data or the parameters that impact that service,” says Sean Duclaux, director of infrastructure management with BMC. “We have found we are able to eliminate almost 100 percent of false positive alerts. … Not only are we reducing the overall volume of the alerts, but when we raise a trusted alarm, you know that is the one rabbit you should chase down the hole.”
Because OHH was already a PATROL customer, Stinson says he was intrigued by BMC’s new Analytics offering. After he saw a demo, Stinson confirms, he came away thinking that PATROL Analytics was like “a present from God.”
Since then, PATROL Analytics has more than delivered the goods, as far as Stinson is concerned. “It gives me one screen to look at what’s going on, as far as current alarms, current utilization, [and] what have you. On Monday morning, I’ll look at one of our thousands of ops jobs, and I can see what the trend is,” he explains. “Maybe some ops jobs aren’t running, maybe we didn’t get the reports—either way, I can look at one chart that at 2:00 on Tuesday morning, see that this happens every Tuesday morning, and I don’t get an alarm for it.”
In a few cases, Stinson says, PATROL Analytics helped his organization troubleshoot long-standing performance issues. “Over time, we find out what’s acceptable, and we expect it. Then we can start moving jobs around within the system to adapt to that,” he says. “We’re monitoring 20 jobs that are important to us, and they all seem to have problems around the same time. We found out that there is an Oracle archive job that happens at the same time, so guess what? We moved it to off-peak hours, and the problem was solved.”
BMC claims that most customers will realize an ROI on PATROL Analytics within the first three months. While Stinson stresses that OHH hasn’t commissioned an ROI study per se, he believes PATROL Analytics more or less paid for itself within about two months. “I think this would be the case for just about anybody. It depends on how long [PATROL Analytics] takes to learn [about your environment], and figure out what is trusted or not,” he says. “As for ROI, my time is the biggest cost savings, because I’m not that cheap. But you can’t put a cost on patient care. I’m now able to dedicate my time more toward proactive improvement of things for the doctors and patients here than fire-fighting.”
For PATROL customers, Stinson argues, BMC’s new Analytics offering is a no-brainer: “There’s really no other product like it, and—because of what you get back from it—it should be very affordable for their IT budget.”
PATROL Analytics is a new offering, however, and there are at least a couple of things Stinson would like to see changed, although, he stresses, these are mostly minor, cosmetic issues.
“I believe there were a few minor improvements that they could’ve made to it, graphical things, mostly just font issues,” he suggests.
Stephen Swoyer is a Nashville, TN-based freelance journalist who writes about technology.