In-Depth
SLAs Defined
Enterprises are increasingly dependent on multi-protocol networks as primary links between network users, customers, suppliers and partners. As networks grow in size and complexity, network management becomes more challenging. Given that network budgets and headcount are nearly flat, network managers rely more heavily on tools. As a result, the need for effective network management tools for establishing, monitoring and maintaining response-time service level agreements (SLAs) has become paramount.
One of the most important aspects of SLAs is response times. Response times are important to the enterprise because they directly impact end user productivity, which directly impacts profitability. In some instances, poor response times can cause losses of up to millions of dollars per hour. Even though an investment must be made to monitor and maintain an SLA, the payback for maintaining proper response times can be substantial. One must also consider the soft costs associated with the drop in end user productivity: irate end users can make life difficult for the network operations staff.
SLAs can take at least three different forms. First, there is the SLA internal to a company between the information systems department and the departments it serves. Second, there are SLAs between an enterprise customer and a service provider for some form of a managed service. Third, there is an SLA that is a combination of the first two. For example, an enterprise may rely on a sister company to provide networking services. But, the sister company may outsource a managed Frame Relay service to a service provider for its Wide-Area Network (WAN) with an SLA. This type of SLA is becoming increasingly popular. For each type of SLA, the network manager must be able to accurately track response times and quickly isolate the root cause of poor response times.
Poor response times can be caused by many different factors. Typical causes can range from congestion to processor overloading, from equipment failures to memory shortages, from line errors to software bugs. Regardless of the cause, a response time issue cannot be resolved until the root cause of the problem is isolated. How does one go about isolating problems when a typical enterprise network consists of various end-stations running multiple protocols, over Layer 3 routing devices and Layer 2 WAN and Local Area Network (LAN) switching equipment? In today's multi-vendor, multi-layer, multi-protocol environment, the task is daunting to say the least.
Monitoring and Maintaining Response Times
A major challenge to establishing, monitoring and maintaining response times in a multi-protocol environment is finding a tool that can monitor response times for various types of data traffic. The other key challenge is finding a tool that can provide both an end-to-end view of the multi-protocol network and a granular view of specific areas within the network.
First, let's examine how to establish the SLA response time. When asked what a reasonable response time SLA is, most network operations personnel would have difficulty providing an appropriate answer. A network management tool should aid in determining the SLA by gathering historical information. Historical information is needed to establish a reasonable baseline for response times. Once the baseline is established, SLA thresholds can be set.
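The step from historical data to a baseline and threshold can be sketched as follows. This is a minimal illustration rather than the method of any particular tool: the function name `baseline_and_threshold`, the choice of the 95th percentile as the baseline, and the 20 percent headroom are all assumptions made for the example.

```python
def baseline_and_threshold(samples_ms, percentile=95, headroom=1.2):
    """Derive a response-time baseline and SLA threshold from history.

    Assumptions for illustration only: the baseline is the nearest-rank
    percentile of the historical samples, and the threshold is the
    baseline multiplied by a headroom factor.
    """
    if not samples_ms:
        raise ValueError("historical samples are required")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile using integer ceiling division.
    rank = (percentile * len(ordered) + 99) // 100
    baseline = ordered[max(rank, 1) - 1]
    return baseline, baseline * headroom


# Example: 100 historical samples of 1..100 ms.
history = list(range(1, 101))
baseline, threshold = baseline_and_threshold(history)
print(f"baseline={baseline} ms, threshold={threshold:.1f} ms")
```

Using a high percentile rather than the mean keeps occasional slow transactions from being hidden by many fast ones, but the right statistic depends on the traffic being measured.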
What is needed to maintain a response time SLA? Maintaining response times is a function of good network design. Adhering to good design principles, such as building a single-protocol IP backbone, selecting a single-vendor solution, minimizing the number of connections (such as leased lines or Frame Relay virtual circuits), and properly sizing bandwidth and router processor capacity, goes a long way toward achieving this goal. Selecting and implementing the correct router and switch features, such as quality of service (QoS), contributes as well.
To monitor a response time SLA, network managers need to be notified when the SLA is about to be violated. Next, network managers need effective troubleshooting tools to quickly diagnose which area of the network is causing the violation.
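Being notified before a violation occurs implies a warning level below the SLA threshold itself. A minimal sketch of that idea follows; the function name `sla_status`, the use of a simple average over recent samples, and the 80 percent warning level are assumptions chosen for illustration.

```python
def sla_status(recent_ms, threshold_ms, warn_fraction=0.8):
    """Classify recent response times against an SLA threshold.

    Assumptions for illustration only: the average of the recent
    samples is compared with the threshold, and a warning is raised
    once that average reaches warn_fraction of the threshold.
    """
    if not recent_ms:
        raise ValueError("at least one recent sample is required")
    average = sum(recent_ms) / len(recent_ms)
    if average > threshold_ms:
        return "violation"
    if average >= warn_fraction * threshold_ms:
        return "warning"
    return "ok"


print(sla_status([100, 110, 120], threshold_ms=150))
print(sla_status([130, 140, 150], threshold_ms=150))
print(sla_status([160, 170], threshold_ms=150))
```

The warning state is what gives the operations staff time to act before the SLA is actually breached.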
To attack this problem, network managers must rely on tools that look at the network from a global, end-to-end perspective. These tools must be aware of both the logical network layers and the physical connectivity. The logical network consists of the IP routing topology, the SNA or APPN topology integrated into the IP infrastructure, the protocols in use by end-stations and the Data Link Switching (DLSw) peering structure. The physical connectivity is the Layer 2 connectivity that consists of leased line connections, Frame Relay or ATM circuits and LAN switching media and devices.
Ideally, a network management tool should indicate when response times exceed SLA thresholds and help isolate the problem. An effective tool should be able to pinpoint response time failures between two or more routers, between legacy devices (such as mainframes or SNA controllers), or between logical connections (such as DLSw peers or TN3270 client/server sessions). For a global perspective, a tool should track the specific path that SNA end users traverse to access the mainframe.
If the network management tool indicates that the connectivity between two routing devices is Frame Relay, a managed service, then the network manager will want to contact the service provider immediately. By pinpointing that the problem is in the Frame Relay network, the network manager will not waste time searching for the problem within the internal network.
All efforts to resolve the problem can be focused on the Frame Relay devices. Troubleshooting efforts can be greatly reduced by relying on comprehensive network management tools to isolate problems.
Reporting tools are also needed to track each area of a network over defined periods of time (e.g., weekly, monthly, quarterly) to identify trends. Reports play a key role in network planning. Additionally, reports become the evidence by which you manage service providers to ensure SLAs are met.
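As a rough illustration of period-based reporting, the sketch below groups response-time samples by ISO week and averages each group so that week-over-week trends become visible. The function name `weekly_averages` and the sample data are hypothetical; a real reporting tool would draw on its own stored measurements.

```python
from collections import defaultdict
from datetime import date


def weekly_averages(samples):
    """Average response times per ISO (year, week).

    `samples` is an iterable of (date, response_ms) pairs. The data
    layout is an assumption made for this example.
    """
    buckets = defaultdict(list)
    for day, response_ms in samples:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(response_ms)
    # Return weeks in chronological order for trend reading.
    return {week: sum(v) / len(v) for week, v in sorted(buckets.items())}


# Hypothetical measurements for one network segment.
report = weekly_averages([
    (date(2024, 1, 1), 100),
    (date(2024, 1, 3), 120),
    (date(2024, 1, 8), 150),
])
for (year, week), avg in report.items():
    print(f"{year}-W{week:02d}: {avg:.1f} ms")
```

The same grouping applies to monthly or quarterly reports by changing the bucket key.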
Migration from SNA to IP
In general, the network manager's role can be defined as the one who gathers historical information, establishes a baseline and thresholds for response times and then monitors the network against them. In theory, these tasks sound fairly straightforward. However, in practice, the network manager's job is full of challenges.
Multi-protocol networks are common today. Few enterprises have a pure SNA environment from mainframe to desktop. As Internet Protocol (IP) emerged as the protocol of choice over the last decade, more enterprises have migrated from a pure SNA environment toward pure IP. But, this migration takes time given budget constraints and the need to maintain access to mission-critical mainframe applications and data. Thus, many networks today support the legacy requirements of SNA, as well as the requirements of IP in a combined network.
Figure 1 shows the different phases of migration from a pure SNA network to a pure IP network. All of these phases are likely to coexist within the same network at any given point in time. Enterprises migrate different portions of the network at different rates, depending on organizational requirements, such as budgeting, staffing and training.
In quadrant I, enterprises have SNA at the mainframe, in the backbone network and at the desktop.
In quadrant II, enterprises replace their SNA-centric networks with an IP backbone.
In quadrant III, enterprises are replacing their SNA desktops (e.g., 3270 dumb terminals) with IP desktops such as PCs. Desktop TN3270 emulation is commonly used to access SNA data and applications. Many companies are beginning to use Web browsers as the universal interface to legacy data and applications.
In quadrant IV, enterprises add an IP stack on the mainframe. In this scenario, the mainframe takes on the role of a server to take advantage of IP-based applications such as File Transfer Protocol (FTP) and Web applications.
Management Challenges in a Multi-Protocol Environment
For quadrant I, traditional network management tools are used to monitor network response times for the legacy SNA network. These tools essentially monitor from an application perspective with little visibility of the entire network. They are adequate tools for measuring response times from the desktop to the mainframe for SNA traffic because the network is relatively flat and there is no need to consider intermediate hops. There are no multi-protocol issues to contend with, and these tools are completely unaware of any Layer 2 LAN devices.
In quadrant II, these traditional tools become obsolete because they cannot provide visibility into the entire network. They no longer provide enough information about response times for SNA traffic traveling over an IP backbone.
Figure 2 shows a configuration for a typical enterprise network that extends to branch offices. The network configuration falls into quadrant II with an IP backbone. Traditional network management tools can only measure the overall response time between the desktop and the mainframe for SNA traffic. These tools are completely unaware of the IP infrastructure. As a result, in this scenario, they cannot measure the hop-by-hop performance between branch routers and the distribution routers in the data center. Without this level of information, network managers cannot pinpoint the root of the problem to quickly resolve SLA violations. This scenario highlights the need for network management tools that provide logical visibility into the network from Layer 3 and up.
In this environment, network managers need a granular look at the hop-by-hop response times for the entire network. They need a tool that can measure the response time from the branch office to the mainframe for SNA traffic, from the branch office to the distribution router for IP traffic, as well as each hop from router to router within the IP backbone. Traditional network management tools cannot provide this key measurement of response times for each router hop in the IP network. This hop-by-hop measurement is often referred to as path analysis. Today's network managers need a tool that can effectively perform path analysis to maintain SLAs.
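The idea behind path analysis can be sketched in a few lines: given a response-time measurement for each hop along a path, sum them for the end-to-end figure and identify the hop that contributes most. The `Hop` type, the `analyze_path` function, the device names and the millisecond values below are all hypothetical, chosen only to illustrate the technique.

```python
from typing import NamedTuple


class Hop(NamedTuple):
    source: str
    dest: str
    response_ms: float


def analyze_path(hops, threshold_ms):
    """Summarize a path: end-to-end time, SLA status, and slowest hop.

    Assumption for illustration only: the end-to-end response time is
    the sum of the per-hop measurements.
    """
    if not hops:
        raise ValueError("a path needs at least one hop")
    total = sum(h.response_ms for h in hops)
    worst = max(hops, key=lambda h: h.response_ms)
    return {"total_ms": total, "violated": total > threshold_ms, "worst_hop": worst}


# Hypothetical branch-to-mainframe path.
path = [
    Hop("branch-router", "distribution-router", 180),
    Hop("distribution-router", "channel-attached-router", 15),
    Hop("channel-attached-router", "mainframe", 10),
]
result = analyze_path(path, threshold_ms=200)
print(result["total_ms"], result["violated"])
print("slowest hop:", result["worst_hop"].source, "->", result["worst_hop"].dest)
```

An end-to-end measurement alone would report only the total; the per-hop breakdown is what directs troubleshooting to a specific segment.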
Some network managers have tried using a separate network management tool for each type of network traffic (e.g., one to track SNA traffic and the other for IP traffic). This is not a feasible alternative. First, it doubles the costs for software purchases, installation, configuration and training. More importantly, troubleshooting cannot be done quickly because of the complexity in trying to pinpoint the source of the problem. Additionally, correlating reports from each tool makes establishing the baseline, setting SLA thresholds and planning difficult, because too much human intervention is required.
In quadrant III, network managers need to measure the response time of IP traffic from the desktop to the channel-attached router, SNA traffic from the distribution router to the channel-attached router, SNA traffic from the channel-attached router to the mainframe, as well as IP traffic for each hop from router to router within the IP backbone.
For quadrant III, traditional network management tools also fail to provide the key measurement of response times for each hop between routers. Nor can traditional network management tools measure the response times between the distribution router and the channel-attached router.
In quadrant IV, the pure IP environment, the key network management metrics are the response times from the IP desktop to the IP stack on the mainframe and from the branch router to the mainframe. Traditional network management tools have no application in the pure IP environment because they measure response times for SNA traffic only.
About the Authors:
Jonathan Beck is a Product Line Manager for Cisco Systems' InterWorks Business Unit. He can be reached at [email protected].
Bob Allegretti is a Technical Marketing Engineer for Cisco's InterWorks Business Unit. He can be reached at [email protected].