How Rapid Problem Identification Improves Network Troubleshooting
Why new tools are needed for rapid problem identification
by David Messina
Large enterprise networks were initially built around static endpoints and centralized applications. Today they are dynamic environments characterized by a plethora of diverse and highly distributed IP endpoints. These endpoints can be end-user devices scattered across both the LAN and the WAN, including mobile PDAs, smartphones, and laptops. Servers act as endpoints for applications, and with peer-to-peer networking, endpoints can act as servers. Even unmanned devices like IP-based security cameras, building controls, and slot machines add to the endpoint mix in today’s highly networked society.
Networked endpoints create and consume commercial and proprietary applications, including some not sanctioned by the corporate IT organization. If you could map the "conversations" these endpoints create, the resulting picture would be a tangled web of rapidly changing interactions.
This complexity has a wide-reaching impact on corporate network operations groups, particularly because most troubleshooting tools have no concept of an "endpoint" and thus have difficulty identifying and locating the sources of problems. Most often the end user is the "first responder" when problems arise, typically reporting symptoms of sluggish performance. Even though fingers most often point at network operations when users complain of performance issues, in the vast majority of cases the source of the problem lies not in the network itself but in some anomalous endpoint or application usage.
Traditional network performance monitoring and troubleshooting tools cannot offer a comprehensive approach to identifying problems because of their limited scope. Legacy tools typically view IT as a series of functional silos. One tool might identify problems with a specific networking device, while another might be designed for use with servers, SANs (storage), or some other device or application. The IT personnel who use these tools are frequently "silo experts" who are unaware of the troubleshooting intricacies of devices or applications outside their domain.
With these siloed views, network operations personnel have no way to understand how all elements interact with one another via applications across the network, yet "performance problems transcend silos" (Metzler, 2007). As a result, problems often "ping pong" between IT departments for days or weeks, passing the buck on resolution while end users suffer.
Traditional tools are also very demanding of IT’s time. Ninety percent of network operations’ troubleshooting effort is spent identifying problems (Kerravala, Yankee Group, 2007). The majority of the time, troubleshooting and monitoring tools fail to identify a problem, which then goes unresolved (Nudler, Enterprise Management Associates, 2007). This is a troubling finding: issues affecting productivity and revenue can potentially persist indefinitely.
With so much time and effort devoted to trying to identify problems, never mind fixing them, some enterprises spend 90 percent of their IT budgets just standing still. On average, $8 out of every $10 spent in IT is "dead money" that does not contribute directly to business change and growth (Gartner, 2007). Add to that the fact that, on average, six service-desk calls are needed just to identify the problem owner (Garbani, Forrester Research, 2007), and you are left with an unproductive IT organization. Analysts who cover network operations uniformly believe better tools are needed to identify problems and speed resolution.
Building the Big Picture from the Endpoints’ Perspective
A newly available technology can build a comprehensive view of network performance by looking in aggregate at endpoint interactions and conversations and analyzing them in fine-grained detail. Called rapid problem identification (RPI), this technology automatically discovers all endpoints and all applications across all enterprise locations, and then precisely profiles their normal interactions to create a fine-grained view of what "should be"; that is, it builds a "big-picture" view of normal network performance patterns. By correlating real-time activity with normalized profile behavior on tens of thousands of endpoints and applications, RPI can quickly isolate the source of aberrations and provide network operations personnel with actionable data for addressing them.
RPI gives network operations the concrete evidence needed to make the most effective response to every performance problem and ensure the highest productivity for users. Problems that might have taken days or weeks to identify can be sourced in minutes or hours with RPI. The result is a dramatic improvement both in the proactivity of network operations personnel and in the productivity of end users, who are spared the performance issues that otherwise would have resulted.
How RPI Works
RPI uses flow information (NetFlow, sFlow, cFlow) or packets available from existing enterprise switches and routers to automatically discover all the IP endpoints (known and unknown). As part of this discovery process, RPI ties each endpoint to a specific location (topology) and evaluates all application usage across endpoints. (It is common for RPI technology to discover significantly more endpoints than IT personnel knew existed on their enterprise network.)
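The discovery step can be pictured with a short Python sketch. It is an illustration only, not how any RPI product is implemented: it assumes flow records have already been decoded into simple dictionaries, and the site map (SITE_SUBNETS), the field names, and the sample addresses are all hypothetical.

```python
import ipaddress
from collections import defaultdict

# Hypothetical site map: subnet -> location label (an assumption for this sketch).
SITE_SUBNETS = {
    ipaddress.ip_network("10.1.0.0/16"): "HQ",
    ipaddress.ip_network("10.2.0.0/16"): "Branch-East",
}

def locate(ip):
    """Return the location whose subnet contains ip, or 'unknown'."""
    addr = ipaddress.ip_address(ip)
    for net, site in SITE_SUBNETS.items():
        if addr in net:
            return site
    return "unknown"

def discover_endpoints(flows):
    """Build {ip: {"location": str, "apps": set}} from decoded flow records.

    Each flow is a dict with src, dst, proto, and dst_port keys, a simplified
    stand-in for the fields carried in real flow exports.
    """
    endpoints = defaultdict(lambda: {"location": None, "apps": set()})
    for f in flows:
        app = (f["proto"], f["dst_port"])
        # Both ends of a conversation count as endpoints.
        for ip in (f["src"], f["dst"]):
            ep = endpoints[ip]
            ep["location"] = locate(ip)
            ep["apps"].add(app)
    return dict(endpoints)

flows = [
    {"src": "10.1.0.5", "dst": "10.2.0.9", "proto": "tcp", "dst_port": 443},
    {"src": "10.1.0.5", "dst": "192.0.2.7", "proto": "udp", "dst_port": 53},
]
inventory = discover_endpoints(flows)
```

Because every address seen in a flow is recorded, endpoints that were never entered in an asset database still show up in the inventory, here with an "unknown" location.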
Using DNS, RPI maps the IP address of each endpoint to the device’s logical name; it then uses LDAP to tie physical assets to organizational entities such as users and services. With this mapping, operations personnel can determine exactly what resources are affected by an issue, giving IT strong performance context for each endpoint.
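A rough Python sketch of this enrichment step follows. The reverse lookup uses the standard library's socket.gethostbyaddr; the LDAP query is not shown and is replaced by a plain dictionary, which is an assumption made to keep the example self-contained. The hostnames, owners, and the enrich function itself are hypothetical.

```python
import socket

def reverse_dns(ip):
    """Return the reverse-DNS name for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

def enrich(ip, directory, resolver=reverse_dns):
    """Attach a logical name and an owner record to an endpoint IP.

    directory is a plain dict keyed by hostname. It stands in for the
    results of an LDAP search (an assumption; no LDAP call is made here).
    """
    name = resolver(ip)
    owner = directory.get(name, {})
    return {
        "ip": ip,
        "name": name,
        "user": owner.get("user"),
        "service": owner.get("service"),
    }

# Hypothetical data; a lookup table replaces live DNS so the example is repeatable.
directory = {"crm-db01.example.com": {"user": "dba-team", "service": "CRM"}}
fake_dns = {"10.1.0.5": "crm-db01.example.com"}
record = enrich("10.1.0.5", directory, resolver=fake_dns.get)
```

Passing the resolver as a parameter keeps the mapping logic testable without a network; in live use the default reverse_dns function would be called instead.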
RPI then builds a precision profile of each endpoint across a range of variables: bit rate (in and out), packet rate (in and out), burstiness, interactions, endpoint affinity, application affinity, location affinity, and time affinity. The level of granularity is 40 times more precise than that offered by traditional troubleshooting tools.
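To make the idea of a profile concrete, the Python sketch below summarizes a few of the listed variables (bit rate in and out, endpoint affinity, application affinity, and time affinity) per hour of day. The per-hour bucketing, the use of mean and standard deviation, and the sample values are assumptions chosen for illustration; the article does not say how an RPI system computes its profiles.

```python
from collections import defaultdict
from statistics import mean, pstdev

def build_profile(samples):
    """Summarize an endpoint's history per hour of day.

    samples: iterable of (hour, bits_in, bits_out, peer, app) tuples.
    Returns {hour: {"bits_in": (mean, stdev), "bits_out": (mean, stdev),
                    "peers": set, "apps": set}}.
    """
    by_hour = defaultdict(list)
    for s in samples:
        by_hour[s[0]].append(s)          # time affinity: one bucket per hour
    profile = {}
    for hour, rows in by_hour.items():
        ins = [r[1] for r in rows]
        outs = [r[2] for r in rows]
        profile[hour] = {
            "bits_in": (mean(ins), pstdev(ins)),
            "bits_out": (mean(outs), pstdev(outs)),
            "peers": {r[3] for r in rows},   # endpoint affinity
            "apps": {r[4] for r in rows},    # application affinity
        }
    return profile

# Hypothetical history for one endpoint.
history = [
    (9, 100, 40, "10.2.0.9", "https"),
    (9, 120, 60, "10.2.0.9", "https"),
    (14, 300, 50, "10.2.0.7", "smb"),
]
profile = build_profile(history)
```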
Armed with fine-grained profiles that give a view of "a day in the life of" endpoints and applications, RPI can identify whether any current behavior is likely or unlikely, in excess or in deficit. When current behavior differs from the precision profile, the system flags the difference as a symptom. To identify the core problem, an RPI system must then be optimized to correlate the flagged symptoms.
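One simple way to picture "excess or deficit" is a deviation test against the profile, sketched below in Python. The standard-deviation test and the threshold of 3.0 are assumptions picked for illustration, not values taken from any RPI product, and the function names and sample readings are hypothetical.

```python
def classify(current, baseline_mean, baseline_stdev, threshold=3.0):
    """Label a reading as 'excess', 'deficit', or None (within profile).

    threshold is the number of standard deviations tolerated; 3.0 is an
    illustrative default only.
    """
    if baseline_stdev == 0:
        # No variation in the baseline: any difference counts as a deviation.
        if current == baseline_mean:
            return None
        return "excess" if current > baseline_mean else "deficit"
    z = (current - baseline_mean) / baseline_stdev
    if z > threshold:
        return "excess"
    if z < -threshold:
        return "deficit"
    return None

def find_symptoms(readings, profile):
    """Return (endpoint, metric, label) for each reading outside its profile."""
    symptoms = []
    for endpoint, metric, value in readings:
        mu, sigma = profile[(endpoint, metric)]
        label = classify(value, mu, sigma)
        if label:
            symptoms.append((endpoint, metric, label))
    return symptoms

# Hypothetical baseline: (endpoint, metric) -> (mean, standard deviation).
profile = {
    ("10.1.0.5", "bits_in"): (110.0, 10.0),
    ("10.1.0.5", "pkts_out"): (50.0, 5.0),
}
readings = [
    ("10.1.0.5", "bits_in", 400.0),   # far above the profiled rate
    ("10.1.0.5", "pkts_out", 52.0),   # close to the profiled rate
]
symptoms = find_symptoms(readings, profile)
```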
Correlating the Symptoms
The next step for RPI is to narrow the issue to a single, specific problem. This requires deeper correlation and analysis. First, the technology begins to understand the relationship between the symptoms and the commonalities they share. Once they have been logically bucketed together, the system can, through heuristic analysis, eliminate groups or symptoms that have little or no probability of pointing to the root cause.
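The bucketing and elimination steps might look like the Python sketch below. The article does not describe the heuristics involved, so the rule used here (drop any group that holds less than a fixed share of all symptoms) and the 0.25 cutoff are stand-in assumptions, as are the function names and the sample symptoms.

```python
from collections import defaultdict

def bucket_symptoms(symptoms, key):
    """Group symptom dicts by the value they share for key."""
    groups = defaultdict(list)
    for s in symptoms:
        groups[s[key]].append(s)
    return dict(groups)

def prune(groups, total, min_share=0.25):
    """Drop groups holding less than min_share of all symptoms.

    min_share is an illustrative cutoff, not a documented value.
    """
    return {k: v for k, v in groups.items() if len(v) / total >= min_share}

# Hypothetical symptoms flagged in an earlier step.
symptoms = [
    {"endpoint": "10.1.0.5", "app": "smb", "site": "HQ"},
    {"endpoint": "10.1.0.6", "app": "smb", "site": "HQ"},
    {"endpoint": "10.1.0.7", "app": "smb", "site": "HQ"},
    {"endpoint": "10.1.0.8", "app": "smb", "site": "Branch-East"},
    {"endpoint": "10.2.0.9", "app": "https", "site": "Branch-East"},
]
by_app = bucket_symptoms(symptoms, "app")
likely = prune(by_app, total=len(symptoms))
```

In this sample, four of the five symptoms share one application, so the lone symptom in the other group is set aside as unlikely to point to the root cause.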
Higher-probability symptoms are then examined across three dimensions: time, location, and application. In the vast majority of cases, this analysis points to a single problem source. Network operations personnel can then determine exactly who "owns" the problem and immediately escalate the issue to the appropriate group for resolution.
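As a simplified picture of this three-dimensional examination, the Python sketch below counts symptoms per (time window, location, application) cell and reports the busiest one. Counting is only a stand-in for whatever analysis a real RPI system performs; the five-minute window, the field names, and the sample data are assumptions.

```python
from collections import Counter

def window(ts, width=300):
    """Snap a Unix timestamp to the start of its width-second window."""
    return ts - ts % width

def likely_source(symptoms, width=300):
    """Return the (window, site, app) cell with the most symptoms and its share.

    Returns (None, 0.0) when there are no symptoms to examine.
    """
    if not symptoms:
        return None, 0.0
    cells = Counter(
        (window(s["ts"], width), s["site"], s["app"]) for s in symptoms
    )
    cell, count = cells.most_common(1)[0]
    return cell, count / len(symptoms)

# Hypothetical symptoms with a timestamp, a location, and an application.
symptoms = [
    {"ts": 1000, "site": "HQ", "app": "smb"},
    {"ts": 1100, "site": "HQ", "app": "smb"},
    {"ts": 1150, "site": "HQ", "app": "smb"},
    {"ts": 1700, "site": "Branch-East", "app": "https"},
]
cell, share = likely_source(symptoms)
```

A cell that accounts for most of the symptoms suggests a single source, and its location and application indicate which group should own the problem.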
Today’s help desk is optimized for desktop support: helping end users update their software, fixing damage caused by viruses, updating passwords, and the like. For a problem as sophisticated as a performance issue, the help desk can only make an educated guess, given the limited view provided by legacy troubleshooting tools. The status quo is to assume the network is the problem. Understanding performance issues requires looking at the situation from the end-user perspective and examining where those issues are felt the most. That perspective is gathered from the interaction of network endpoints and the applications they produce and use.
RPI technology is the first troubleshooting alternative that gives network operations the complete picture of endpoint and application activity and performance; it is the only technology that can tie performance problems to a single core source in record time. RPI enables enterprises to maintain a quality application experience for every user.
As this technology gains acceptance, network operations personnel can stop being the convenient scapegoat for all network performance problems and can once again focus their attention on strategic network issues.
David Messina is vice president of marketing at Xangati.