Measuring IO Performance
What are a few milliseconds between friends?
Issues of storage solution performance seem to have fallen prey to overarching concerns about price and cost of ownership in the current economic climate. Even recent debates between EMC and Hitachi over the relative speeds and feeds of their high-end arrays received little ink in the trade press. The attitude of many buyers today seems to be that performance differences measured in microseconds are less important than the actual capability of an array (or collection of arrays in a fabric) to deliver non-disruptive scaling with minimal increase in labor or management costs.
Yet performance is what determines how much work gets done at the end of the day. Practically speaking, it defines the business value of a storage acquisition. Comparing the performance of what you have with what you intend to buy (that is, weighing the value of the work the new acquisition would enable against the value of the work the status quo environment enables) is key to making any claims about Return on Investment (ROI) or investment payback. Yet this process is too often ignored.
One reason is that measuring IO throughput is hard work. You can’t always trust vendor-supplied numbers. A little over a year ago, one vendor boasted that its new storage array was 100 times faster than established competitors. On closer examination, we reported in this column that the vendor was only measuring how quickly data was retrieved from its memory cache. Its competitors were providing estimates of throughput based on data retrievals, not from cache, but from back end disk drives. It was no wonder that the new array would be 100 times faster: the test compared apples to oranges.
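The arithmetic behind that mismatch can be sketched in a few lines. The example below is only an illustration: the service times (0.1 ms for a cache hit, 10 ms for a read from back end disk) and the function name `effective_access_ms` are assumptions chosen for the example, not figures from any vendor's test.

```python
def effective_access_ms(hit_ratio, cache_ms, disk_ms):
    """Average access time when a fraction `hit_ratio` of reads
    is served from cache and the remainder goes to disk."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * disk_ms


# Hypothetical service times, assumed purely for illustration.
CACHE_MS = 0.1
DISK_MS = 10.0

cache_only = effective_access_ms(1.0, CACHE_MS, DISK_MS)  # every read hits cache
disk_only = effective_access_ms(0.0, CACHE_MS, DISK_MS)   # every read goes to disk
mixed = effective_access_ms(0.5, CACHE_MS, DISK_MS)       # half hit, half miss

print(f"cache only: {cache_only:.2f} ms")
print(f"disk only:  {disk_only:.2f} ms")
print(f"50% hits:   {mixed:.2f} ms")
```

Under these assumed figures, a test that reads only from cache looks 100 times faster than one that reads only from disk, even if the two arrays would perform identically at the same hit ratio. A fair comparison states the hit ratio, or the workload that produces it, for both sides.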
Part of the difficulty in developing good throughput performance estimates is that different server operating systems (which are used by most performance tools to collect throughput data) report throughput using different metrics. For example, Hewlett Packard and the Open Software Foundation traditionally used the number of words transferred from disk to define throughput. Solaris and AIX measured throughput in blocks, and Microsoft used bytes. Moreover, the data collectors provided by the server OS vendors may collect data in different ways and at different levels of granularity. UNIX traditionally collected disk access data using “active time” as the yardstick, while Windows NT used “percent disk active time,” and Windows 2000 uses “percent disk idle time” as its measure.
Normalizing this data for comparative purposes is a small but important hurdle, since, when comparing different solutions, you will need to describe both throughput (transfer times and accesses) and disk utilization (“active” versus “idle” time) characteristics. This is true for any heterogeneous server environment, regardless of whether the back end storage platform is heterogeneous or homogeneous. There are many tools to assist you in collecting and normalizing this data, but tools alone do not necessarily provide the whole picture.
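A minimal sketch of that normalization step follows. It is an illustration under stated assumptions rather than a description of any particular tool: the unit sizes (512-byte blocks, 4-byte words) and the function and key names are hypothetical, and real block and word sizes must be confirmed for each platform.

```python
# Assumed unit sizes in bytes. These are placeholders for illustration;
# actual block and word sizes vary by platform and configuration.
BYTES_PER_UNIT = {
    "bytes": 1,
    "blocks": 512,
    "words": 4,
}


def to_bytes_per_second(count, unit, interval_seconds):
    """Convert a transfer count in words, blocks, or bytes to bytes/second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return count * BYTES_PER_UNIT[unit] / interval_seconds


def to_percent_busy(value, kind, interval_seconds=None):
    """Express disk utilization as percent busy, whatever the collector reports.

    kind is one of:
      "active_seconds" - seconds the disk was active during the interval
      "percent_active" - percent of the interval the disk was active
      "percent_idle"   - percent of the interval the disk was idle
    """
    if kind == "percent_active":
        busy = float(value)
    elif kind == "percent_idle":
        busy = 100.0 - value
    elif kind == "active_seconds":
        if interval_seconds is None or interval_seconds <= 0:
            raise ValueError("active_seconds needs a positive interval_seconds")
        busy = 100.0 * value / interval_seconds
    else:
        raise ValueError(f"unknown utilization kind: {kind}")
    # Clamp to the valid range in case of sampling jitter.
    return max(0.0, min(100.0, busy))


# Three hypothetical servers reporting the same activity in different terms.
print(to_bytes_per_second(2048, "blocks", 1))     # block-based collector
print(to_bytes_per_second(1048576, "bytes", 1))   # byte-based collector
print(to_percent_busy(35, "percent_idle"))        # idle-time collector
print(to_percent_busy(30, "active_seconds", 60))  # active-time collector
```

Once every server's figures are expressed in bytes per second and percent busy, throughput and utilization can be compared across platforms on the same scale.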
In addition to collecting and normalizing data, you need to compare two or more storage solutions using a consistent and correct comparative methodology. The Storage Performance Council has been boasting about its own methodology, called SPC-1, since last year. The council compares SPC-1 to the TPC benchmarking system used by the Transaction Processing Performance Council, claiming that it enables competing vendor products to be compared to each other in a meaningful way.
Last December, I found myself agreeing with EMC when the vendor dropped out of SPC and called SPC-1 an arbitrary measure. EMC was correct in saying that a comparative test is meaningless unless it takes into account the infrastructure configuration and actual workload imposed by customer applications.
The SPC folks did their best to characterize this as sour grapes, implying that EMC’s dismissal of SPC-1 was predicated on the sub-par performance of its technology vis-à-vis competitors under the benchmark program. While this may or may not have been true, the fact is that the TPC benchmark, to which the Storage Performance Council proudly compared its storage benchmark, has long been held in disrepute by anyone concerned with real performance measurement. There is no shortage of specialized consultancies in California that will tune servers so they get high scores in TPC benchmarks – yielding configurations that could never be replicated in the real world.
What needs to be done is to run a standardized test using a real world workload across multiple vendor-proposed solutions. However, this is becoming more difficult to do for a number of reasons.
A key impediment is the increasing tendency, as reported in a major trade press publication this week, of business financial officers to target IT test labs as natural candidates for the budget axe. Those companies that were well-heeled enough to have their own test labs in the first place are finding them gutted in the latest round of “cost-saving cuts.” Without them, IT managers must turn to external sources, often with no funding, to ask for comparative testing.
Sometimes, this testing can be obtained by working it into the contract the company has with a trusted reseller, integrator or vendor. However, the data derived from these sources can easily become colored by their primary objective – to sell you something. Like many IT professionals, they find it easy to overlook minor performance differentiators. What, after all, is a millisecond or two between friends?
There are no easy answers, but one idea is for IT professionals to pool their resources and develop shared testing facilities. Another, though less specific, approach would be for all consumers to report their results with current and future storage solutions in the trade press or in some common open information repository. If these accounts are sufficiently detailed, they may provide a gross measure of what works and how fast it runs.
This column welcomes your reports of performance achieved with storage technology and will dutifully report your results to the rest of the community of IT professionals who comprise our readership.
Jon William Toigo is chairman of The Data Management Institute, the CEO of data management consulting and research firm Toigo Partners International, as well as a contributing editor to Enterprise Systems and its Storage Strategies columnist. Mr. Toigo is the author of 14 books, including Disaster Recovery Planning, 3rd Edition, and The Holy Grail of Network Storage Management, both from Prentice Hall.