In-Depth
Drawing the Line: Online Storage and Transparent Network Transfer Boost Productivity
The explosive growth of e-business and e-commerce is triggering new and serious issues for system administrators in the areas of data storage and file transfer management. Offline and nearline are the traditional data backup and storage methods, but each has major drawbacks and limitations. For instance, offline archiving is a time-intensive manual process for moving data to media that are no longer connected to the system environment.
Nearline archiving, or hierarchical storage management (HSM), is faster since it is not a manual process. Yet with nearline, the system administrator still faces problems such as configuring the system for optimum storage and avoiding, or at least minimizing, the lag time incurred when accessing a particular file and bringing it back to the online system.
Network transfer is also posing challenges with this dramatic growth in data usage. Forrester Research reports that data storage is increasing at a rate of 50 percent a year; moreover, 2,500 global companies reportedly store an average of 15 terabytes of data each.
As can be expected, not all of this data remains static in a file: It has to be transferred to other locations. For example, if your company takes orders, that data needs to be transferred out to warehouses, and financial data needs to go to the company’s accounting department and other pertinent groups. If you are with an online company, network transfer is even more intense, with customer information and the orders those customers have placed. In these and other cases, the system administrator needs a fast, transparent network transfer method.
Offline and Nearline Issues
Traditional offline archiving is affecting user productivity more than ever before, due to the enormous amounts of data being generated via networks and the Internet. Consider the time you have to wait for an operator to locate the right tape and then load it. Locating the files can also be a time-consuming challenge, since you’re dealing with hundreds, or even thousands, of on-site and off-site tapes. Once the files are found, the data on the tapes may be corrupted, because tape has a limited shelf life and degrades through oxidation. Also, offline archiving continues to increase management and help desk costs because it is a work-intensive operation.
Nearline archiving, or HSM, involves moving the data to slower media, such as robotic tape libraries and optical or magneto-optical jukeboxes. An HSM system selects files through a policy procedure and archives them. In this case, archiving is a multi-step process that includes compressing the data and transferring the files to the nearline storage device.
Moreover, when a user or application tries to access an archived file, a time lag occurs. The HSM finds the device and media where the file is located and informs the device to load the correct media. Once the file media is loaded, the HSM retrieves the file from the media and decompresses it, at which time the file is made available.
The issues the system administrator faces here include configuring the system for optimum storage, that is, archiving only the least-needed data. The HSM system must also operate as intended without regularly degrading performance.
When an HSM system is configured and files are migrated to nearline devices, a so-called "performance hit", or lag time, is incurred each time a particular file is accessed and brought back to the online system. If the HSM system is not properly configured, one of two situations can occur: Either the system administrator does not archive enough data, because he or she is unsure whether it will be needed or whether the lag time is acceptable, or an overly large amount of data is archived and lag time occurs every time a file is accessed.
A scenario most system administrators are familiar with is an application that needs a nearline-archived file every three months. Each time that file is retrieved from the tape robotics system and brought back into the system, a lag occurs: after 60 days of inactivity the file is migrated off the system, and 30 days later it is recalled. Because of this unproductive movement, most system administrators settle on the first extreme and archive too little data.
There is also the cost of nearline archiving itself, due to its complexity and its expensive hardware and software. However, managing it represents the highest cost incurred with nearline archiving or HSM; it is a complex system to configure and to manage well.
Without archiving data, system administrators will eventually run out of disk space. Each time this occurs, the system is brought down, new hardware is installed and configured, and the data is reloaded. Downtime and management are expensive, and this scenario assumes the hardware was already purchased and delivered; if not, the cost of managing the system skyrockets. Moreover, the more hardware subsystems you have, the greater the possibility of failure. Table 1 shows that the mean time between failures (MTBF) of a single disk drive averages five years. With 60 disks, the array’s MTBF drops to one month, and with 180 disks, to 10 days.
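As a rough illustration of why failure rates scale this way, consider a back-of-the-envelope calculation in Python. It assumes disks fail independently, so the expected time to the first failure in an array is roughly the single-disk MTBF divided by the number of disks; the five-year figure comes from Table 1, and everything else is illustrative.

# Rough MTBF scaling for a disk array, assuming independent failures:
# the expected time to the first failure among N disks is roughly
# MTBF_single / N.

def array_mtbf_days(single_disk_mtbf_years: float, disk_count: int) -> float:
    """Expected days until the first disk in the array fails."""
    single_mtbf_days = single_disk_mtbf_years * 365
    return single_mtbf_days / disk_count

for disks in (1, 60, 180):
    days = array_mtbf_days(5.0, disks)   # 5-year MTBF per disk, as in Table 1
    print(f"{disks:3d} disks -> first failure expected in ~{days:6.1f} days")

# 1 disk -> ~1825 days (5 years), 60 disks -> ~30 days, 180 disks -> ~10 days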
Online Archiving for Backup
Some system administrators are learning that advanced storage technologies, such as online archiving, provide the answer to storage and backup issues. Online archiving is best defined as taking data that’s not being used on a regular basis and storing it efficiently on direct-access systems. These systems can be disk drives or enterprise storage systems connected via SCSI, fiber or other cabling.
An online archiving environment doesn’t require additional hardware, which is a relief to system administrators. In addition to efficient data storage, online archiving gives the system administrator high-speed access. Another major benefit is a performance gain during normal backup procedures, for two reasons: First, compressed files remain compressed on the backup tape, which reduces the time and resources required to move the data back online. Second, less data has to travel from disk through computer memory to the tape devices.
A likely benefit is a decrease in network bottlenecks during network backup or when using Network Attached Storage (NAS). Other key benefits to the system administrator are reduced backup time and reduced hard-drive requirements, which translate into reduced management, maintenance and support expenses.
Cost of ownership is a perennial concern, and it continues to increase. A $10,000 investment in hardware is a good example. According to the experts, the cost of operating a piece of hardware, such as a disk drive, is $5 to $7 annually for every dollar spent on the hardware. Hence, for a $10,000 investment, the annual operating cost is $50,000 to $70,000, and the five-year cost of ownership is roughly $260,000 to $360,000, including the cost of the hardware itself.
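A quick back-of-the-envelope check of those figures, assuming the $5-to-$7 annual operating cost per hardware dollar quoted above, might look like this (the function name and formula are illustrative only):

# Total cost of ownership for a hardware purchase, using the rule of thumb
# of $5-$7 in annual operating cost per dollar of hardware.

def five_year_tco(hardware_cost: float, ops_per_dollar: float, years: int = 5) -> float:
    """Hardware cost plus the cumulative operating cost over the period."""
    return hardware_cost + hardware_cost * ops_per_dollar * years

for ratio in (5, 7):
    total = five_year_tco(10_000, ratio)
    print(f"${ratio}/dollar/year -> 5-year TCO of ${total:,.0f}")

# $5 -> $260,000    $7 -> $360,000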
Online archiving helps the system administrator avoid these high expenses by providing a more efficient way to store data. Files continue to reside on the direct-access disks, so data availability is increased and access time is greatly reduced compared to traditional archiving.
Set Policy and Forget It
Online archiving lets the user set the storage policy within seconds by specifying file characteristics such as extension, size, name, last-modified date or owner. The user can then forget about it. Online archiving dynamically compresses the data according to the preset policy. The file remains online for immediate access and is transparent to users and applications. When the user accesses the file, online archiving decompresses the data roughly twice as fast as it was compressed.
The user can juggle disk space online to make room for new files, emergency tasks or test databases. In addition, files can be compressed to delay moving them to HSM tape storage. Compression ratios of up to 99 percent mean more files can be stored without adding disks, which keeps the operation under budget and a safe lead ahead of today’s growing data.
When data is automatically compressed, all file characteristics remain precisely the same. When users access compressed files, they are automatically decompressed. When users are finished with that data, the file is left uncompressed for better performance; then, once the file again meets the policy, it is re-compressed. The rationale is that users who access a file will likely use it multiple times, whether appending to it, updating it or simply reviewing it.
Online archiving also permits system administrators to tune compression to trade off speed against compressibility. When the archiving policy is set up, part of that procedure is deciding whether to optimize for speed or for compressibility. The user can set the compression policy characteristics to maintain specific performance levels.
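To make the policy idea concrete, the following is a minimal sketch of what a policy-driven archiver could look like, assuming a simple policy of file extension, minimum size and minimum age, with gzip compression levels standing in for the speed-versus-compressibility setting. The ArchivePolicy fields, the directory path and the .gz suffix are illustrative assumptions; a real online archiving product works transparently inside the filesystem, so users and applications never see a renamed file.

# Minimal sketch of a policy-driven online archiver: walk a directory,
# compress files that match a simple policy (extension, minimum size,
# days since last modification), and choose a gzip level that trades
# speed against compressibility. Illustrative only; a real product
# operates transparently, with no visible .gz files.

import gzip
import os
import shutil
import time
from dataclasses import dataclass

@dataclass
class ArchivePolicy:
    extensions: tuple          # e.g. (".log", ".dat")
    min_size_bytes: int        # skip small files
    min_age_days: int          # only archive files untouched this long
    favor_speed: bool          # True -> gzip level 1, False -> level 9

def matches(path: str, policy: ArchivePolicy) -> bool:
    st = os.stat(path)
    age_days = (time.time() - st.st_mtime) / 86400
    return (path.endswith(policy.extensions)
            and st.st_size >= policy.min_size_bytes
            and age_days >= policy.min_age_days)

def archive_tree(root: str, policy: ArchivePolicy) -> None:
    level = 1 if policy.favor_speed else 9
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".gz") or not matches(path, policy):
                continue
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb",
                                                    compresslevel=level) as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)    # keep only the compressed copy online

# Example: compress logs larger than 1 MB that have sat idle for 60 days.
archive_tree("/var/logs", ArchivePolicy((".log",), 1_000_000, 60, favor_speed=False))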
As far as performance goes, files compressed at any ratio take less time to transfer to remote directories (e.g., NFS drives) or to write to tape. Faster restores reduce the time spent on system reloads and disaster recovery.
Methods of Data Transfer
There are three common ways to transfer information over an external network: the file transfer protocol (FTP), e-mail and Web browsers. FTP is the most common and most efficient method; e-mail and Web browsers are used for smaller files. Web browsers simply invoke FTP to download files, but not all browsers can upload. FTP products, by contrast, give you the ability both to download and to upload, as well as additional functionality.
E-mail applications use the simple mail transfer protocol (SMTP) running on TCP/IP. FTP, too, is a protocol written specifically to run on TCP/IP. Of the two, FTP is considerably more efficient and by far the better method for file transfer, since SMTP was originally written for ASCII and only supported seven-bit characters.
Binary files, however, contain full eight-bit bytes. Therefore, when e-mail is sent with an attached binary file, the e-mail program must convert that attachment into ASCII text, using base64 encoding, which increases the file size by about a third. Because this is not an efficient way of sending data, system administrators usually rely on FTP for larger files.
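The overhead is easy to demonstrate. The short Python sketch below base64-encodes a stand-in binary payload and reports the size increase; the 300 KB figure is arbitrary.

# Illustration of the base64 size penalty for e-mailing a binary file:
# every 3 bytes of input become 4 ASCII characters, roughly a 33% increase.

import base64
import os

payload = os.urandom(300_000)              # stand-in for a 300 KB binary attachment
encoded = base64.b64encode(payload)

print(f"original: {len(payload):,} bytes")
print(f"base64:   {len(encoded):,} bytes "
      f"({len(encoded) / len(payload) - 1:.0%} larger)")

# base64 output is ~400,000 bytes, about 33% larger (more once line breaks are added).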
There are a number of FTP products available with varying features (see Table 2). The most commonly used FTP software packages come bundled with either a UNIX or NT server. The bundled packages all have some shortcomings.
Primarily, they are slow. If big files are being transferred, your personnel must essentially monitor the progress for hours while the FTP program sends its files. These FTP programs are also mostly manual, meaning your personnel have to help them along. For example, if a file transfer is interrupted, your personnel must intervene to initiate the transfer process all over again. In addition, the user interface is a command line rather than a graphical user interface (GUI). While command lines may not be difficult for individuals familiar with a UNIX environment, they can be for other people.
A lack of scripting capability is a major issue with these bundled FTP products. FTP scripting gives the system administrator latitude for automatically transferring files at any time. For instance, they can be transferred late at night or very early in the morning when there’s less network congestion. As it stands now, without scripting, personnel must manually oversee file transfers.
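As a rough illustration of what scripting buys you, the following sketch uses Python’s standard ftplib to perform an unattended download that can be scheduled off-hours (for example, from cron) and that resumes an interrupted transfer from the byte where the partial local file ends. The host name, credentials and file names are placeholders, and note that plain FTP still sends the password unencrypted.

# Sketch of a scripted, resumable FTP download using Python's standard
# ftplib: schedule it late at night, and if a previous attempt was
# interrupted, it restarts from the byte offset where the partial local
# file ends (the FTP REST command). Host, login and file names are
# placeholder assumptions.

import os
from ftplib import FTP

HOST, USER, PASSWORD = "ftp.example.com", "backup", "secret"
REMOTE_FILE, LOCAL_FILE = "nightly/orders.dat", "orders.dat"

def fetch_with_resume() -> None:
    offset = os.path.getsize(LOCAL_FILE) if os.path.exists(LOCAL_FILE) else 0
    with FTP(HOST) as ftp:
        ftp.login(USER, PASSWORD)          # note: plain FTP sends this unencrypted
        mode = "ab" if offset else "wb"
        with open(LOCAL_FILE, mode) as out:
            # rest=offset tells the server to skip bytes already downloaded.
            ftp.retrbinary(f"RETR {REMOTE_FILE}", out.write, rest=offset)

if __name__ == "__main__":
    fetch_with_resume()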
Products like these also have little or no security. Passwords are sent unencrypted and anyone with a protocol analyzer can open the packet to see your password.
Anonymous FTP sites, which run bundled FTP programs where information can be freely downloaded, don’t notify the site guardian of attempted or actual security breaches. These system programs are not set up for realtime monitoring of server activities. Hence, security in FTP sites run by bundled FTP products is limited to setting permissions.
Ideally, the system administrator would like to have an FTP product with better performance that uses less bandwidth and takes less time to administer. It must be scalable and include security, such as encoded passwords and hacker-resistant coding. Finally, the product should adhere to open standards so that it operates smoothly with other existing FTP products.
Within UNIX, there are few products meeting these requirements. Some have enhanced processing and secure coding, and in some instances, the ability to do encryption. Generally, however, these products are not universally compatible with other existing FTP products. For an FTP transfer to work properly, the FTP products at both ends must be compatible, so an ideal commercial FTP product must be able to communicate effectively with a server that has only the bundled FTP software.
High-Velocity Network Transfer
In the area of file transfer, software now exists that allows policy-based network file transfers using on-the-fly compression and decompression that is totally transparent to the user. The effect is unprecedented speed combined with a reduced workload for system administration personnel. It is easy to install, and a GUI provides drag-and-drop file transfer. If a file transfer is interrupted, the software automatically restarts the transfer at the point of interruption. This software is also scalable: The larger the file, the greater the time and bandwidth savings.
Data may be transferred in a compressed state. However, if a similar version of the same file already resides on the destination system, this high-velocity network transfer software sends only the difference and updates the file.
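The article does not describe how the product computes those differences, but the general idea can be sketched roughly as follows: split the file into fixed-size blocks, hash each block, and transfer only the blocks whose hashes differ from the copy on the destination. The block size, hash choice and file names below are illustrative assumptions, not the product’s actual algorithm.

# Simplified illustration of difference-only transfer: ship only the
# fixed-size blocks whose digests differ from the copy already on the
# destination. Illustrative only; the real product may work differently.

import hashlib

BLOCK = 64 * 1024   # 64 KB blocks

def block_digests(path: str) -> list:
    digests = []
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK):
            digests.append(hashlib.sha256(chunk).hexdigest())
    return digests

def changed_blocks(old_path: str, new_path: str) -> list:
    """Return (index, data) pairs for blocks that must be sent."""
    old = block_digests(old_path)
    new = block_digests(new_path)
    to_send = []
    with open(new_path, "rb") as f:
        for i, digest in enumerate(new):
            data = f.read(BLOCK)
            if i >= len(old) or old[i] != digest:
                to_send.append((i, data))
    return to_send

# Only the changed 64 KB blocks (plus the new file length) need to cross the wire.
patch = changed_blocks("orders_on_destination.dat", "orders_local.dat")
print(f"{len(patch)} block(s) to transfer")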
In test comparisons, this high-velocity software exhibited remarkable file transfer speeds compared to normal FTP. For example, an Oracle data file transfer that took 2 hours, 14 minutes using conventional FTP took only 67 seconds (roughly 120 times faster) using the high-velocity network transfer software. A regular ASCII text file that took 7 minutes, 24 seconds with normal FTP took only 15 seconds (about 30 times faster) with the newer software. On average, depending on file type, this file transfer software provides a 50 to 70 percent savings or more in bandwidth.
On the security side, the high-velocity network transfer software has its own encoding routine for transferring passwords. No sub-processes are called, so hackers are thwarted from getting into the code. Moreover, the software ensures that all buffers are cleared so that unauthorized individuals cannot break into the code.
Simple network management protocol (SNMP) is used in this high-velocity network transfer software to monitor security on the network. The software uses SNMP traps to give e-mail notification of any attempts to breach security to a selected station. For example, users can set the number of incorrect password attempts or unauthorized access attempts that will trigger an alert. Finally, realtime monitoring of the server is also an integral part of this software’s security attributes.
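The product’s exact trap mechanism isn’t detailed here, but the threshold logic can be sketched roughly as follows: count failed login attempts per client and raise an alert once a configurable limit is crossed. In this illustration, an e-mail sent via Python’s smtplib stands in for the SNMP trap, and the addresses and limit are placeholder assumptions.

# Rough sketch of the alert-threshold idea: count failed password attempts
# per client address and send a notification once a configurable limit is
# crossed. A real implementation like the one described above would raise
# an SNMP trap to the management station; smtplib is only a stand-in here.

import smtplib
from collections import defaultdict
from email.message import EmailMessage

MAX_FAILURES = 3
ADMIN_ADDR = "sysadmin@example.com"
failures = defaultdict(int)

def notify(client_ip: str, count: int) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"FTP security alert: {count} failed logins from {client_ip}"
    msg["From"] = "ftp-monitor@example.com"
    msg["To"] = ADMIN_ADDR
    msg.set_content("Possible break-in attempt; check the FTP server log.")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def record_failed_login(client_ip: str) -> None:
    failures[client_ip] += 1
    if failures[client_ip] == MAX_FAILURES:   # alert once at the threshold
        notify(client_ip, failures[client_ip])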
This high-velocity network transfer software and its SNMP agent are interoperable on any network using standard network management tools such as HP OpenView, IBM Tivoli and CA Unicenter. This compatibility and interoperability offers system administrators simplified data management and system administration.
These two applications, online archiving and transparent file transfer, will eliminate most of the traditional issues encountered in dealing with large amounts of data. File transfers are done in a fraction of the time previously needed, and files that previously would have been moved to cold storage can now have a longer life on disk. These relatively easy-to-grasp principles, combined with the modern monitoring resources mentioned above, will remedy many problems faced by the system administrator.
About the Author: Paul Wang is the President and Founder of Solution-Soft Systems (San Jose, Calif.; www.solution-soft.com).