A Closer Look: Network Appliance Clustering, Part 2

SANs are out, virtual grids (aka storage clusters) are in.

In my last column, I took a technical dive through Network Appliance’s GX storage clustering architecture. Bruce Moxon, senior director of strategic technology for Network Appliance, and Dave Hitz, CTO, provided detailed answers to our questions about GX and addressed critical comments from competitors.

According to Moxon, in its current manifestation, Network Appliance’s GX cluster is aimed at “high performance/technical computing” applications. In his words, “The initial version, ONTAP GX 10.0, released earlier this year, is targeted at what I call production-oriented technical computing applications. Those applications are typically Network File System (NFS)-centric, with some [Microsoft Common Internet File System] CIFS access required for data pre-processing, post-processing, and visualization.”

He adds that it will probably be a while before mainstream business IT organizations are ready for clustering GX-style, and before GX is ready for them: “In fact, we do currently support scalable CIFS on GX. However, there are some CIFS functions that enterprise customers need [such as] quotas, GPOs, folder redirection, and synchronization, integrated anti-virus, and so forth that are not currently supported in GX CIFS today. As many of those customers are currently very well served with our standard ONTAP 7G offering, we wanted to be sure and clearly indicate to them that 7G is the appropriate platform for them, for now. As some of those additional functions are brought online in future GX releases, we expect enterprise CIFS customers with scalability needs to move to ONTAP GX.”

Given the current limitations of the product and its narrow range of prospective users, why are we dedicating another column to discussing storage clustering and Network Appliance? There are several reasons.

There is a shift in what the leading analyst houses, such as IDC, are saying. Robert Gray, vice president of worldwide storage research at IDC, recently delivered a Webcast with the provocative title, “SAN is Dead: Long Live Virtual Grids.” Long an advocate, some might say an evangelist, of Fibre Channel fabrics (so-called storage area networks), Gray now reports that the thrill is gone with respect to SANs in the minds of the customers IDC consults. In their place, “virtual grids” (which is market-speak for storage clusters) are on their way up the interest curve.

Folks at LeftHand Networks, Isilon Systems, and now Network Appliance are happy that IDC is finally catching up to what they have been saying. Clusters are wired, SANs are tired.

This new interest in storage clustering, in turn, underscores the importance of an in-depth examination of products coming out of leading vendors such as NetApp. Questions about potential limitations on performance and resiliency in scalable storage-clustering approaches are very important to understand and resolve if consumers are to make intelligent buying decisions. If a CIO is looking seriously at storage clustering as a strategic direction for his or her storage infrastructure, the need to understand the language of the vendor and the nuances of test data is paramount.

For example, Moxon acknowledged that any clustering technology introduces a performance hit on storage operational efficiency. In the case of the GX, clustering NetApp Filers appears to reduce their operational efficiency by about 36 percent, as measured in SPECsfs operations per second.

In his words, “Our GX clusters are fashioned from the same building blocks as are our standard filers (i.e. FAS3050 and FAS6070) … [In recently performed SPEC.org SPECsfs tests] you could do the math and find that the overhead [imposed by clustering these Filers together] is about 36 percent on SPEC.ORG operations per second (OPS). Keep in mind that SPEC OPS are skewed towards the short transaction end (especially for “dataless” OPS like getattr) and will thus show a higher overhead than will large sequential OPS. Our engineering target, which we are still striving for and expect to meet, continues to be performance parity between 7G and GX on a per-node basis, with a 15 percent overhead for remote I/O operations.”
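The “math” Moxon refers to is simple enough to sketch. The Python snippet below is only an illustration of the calculation: the function name and all three input values are hypothetical, picked so that the result lands on the 36 percent figure he cites. They are not published SPECsfs results.

```python
def clustering_overhead(standalone_ops: float,
                        clustered_total_ops: float,
                        node_count: int) -> float:
    """Return the fraction of per-node throughput lost to clustering."""
    if standalone_ops <= 0 or node_count <= 0:
        raise ValueError("standalone_ops and node_count must be positive")
    # Throughput each node delivers once it is part of the cluster.
    per_node_clustered = clustered_total_ops / node_count
    return 1.0 - per_node_clustered / standalone_ops


# Hypothetical inputs, chosen only to reproduce the 36 percent figure
# quoted above; they are NOT real benchmark numbers.
standalone_ops = 100_000.0       # OPS from one un-clustered filer
clustered_total_ops = 640_000.0  # OPS from the whole cluster
node_count = 10                  # filers in the cluster

overhead = clustering_overhead(standalone_ops, clustered_total_ops, node_count)
print(f"Per-node clustering overhead: {overhead:.0%}")
```

Anyone comparing vendor results can plug in the per-node figure from a standalone submission and the total from a clustered submission, provided the two tests used the same hardware building blocks.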

Warming Up to Clusters

Moxon’s comments are important for storage architects who are beginning to warm up to clusters. While the hype around storage clustering promises dramatic improvements in speed as more nodes are added, clustering itself will likely reduce the performance of each node relative to its un-clustered equivalent. In Moxon’s words, “One can always get greater performance out of a purely partitioned (shared nothing) architecture, at the cost of application complexity and decreased manageability.”

He underscores this interesting point on his blog at gridguy.net. He writes, “The primary purpose of almost all scale-out storage architectures is to provide scalable AGGREGATE performance for a large number of clients—not to speed single stream performance to a single host. This is true both in technical computing apps and in large scale enterprise deployments.”

To summarize this somewhat confusing position, there’s performance (what you get from a single non-clustered storage platform) and then there’s performance (what you realize from a highly scalable platform). It sounds as though he is arguing that the benefits of clustering may not be improved performance at all but easier management.
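The two meanings of “performance” can be separated with a little arithmetic. The sketch below assumes, purely for illustration, that aggregate throughput grows linearly with node count and that every node pays the same 36 percent overhead; the standalone figure is hypothetical, and real clusters will not scale this cleanly.

```python
def aggregate_ops(node_count: int, standalone_ops: float, overhead: float) -> float:
    """Aggregate cluster throughput under an assumed linear-scaling model."""
    if node_count < 1 or not 0.0 <= overhead < 1.0:
        raise ValueError("need node_count >= 1 and 0 <= overhead < 1")
    return node_count * standalone_ops * (1.0 - overhead)


STANDALONE_OPS = 100_000.0  # hypothetical OPS for one un-clustered filer
OVERHEAD = 0.36             # per-node overhead figure quoted by Moxon

for nodes in (1, 2, 4, 8):
    total = aggregate_ops(nodes, STANDALONE_OPS, OVERHEAD)
    # Per-node throughput stays below the standalone figure at every size,
    # while the aggregate overtakes a single filer from two nodes onward.
    print(f"{nodes:>2} nodes: {total:>9,.0f} aggregate OPS, "
          f"{total / nodes:,.0f} per node")
```

Under these assumptions, no single client stream ever sees more than a clustered node can deliver, yet the cluster as a whole serves more clients than any one filer could. That is the trade Moxon is describing.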

“Customers deploying GX,” he notes, “find the additional functionality—global namespace, capacity and performance load balancing, load sharing mirrors, and striped volumes for enhanced single volume and single file performance—sufficiently compelling that that [performance and cluster overhead] tends not to be an issue.”

His comments also go to the heart of another issue: is SPEC.ORG testing the right thing with its high performance computing (HPC) test? According to Moxon, “We have realized in excess of 1M SpecSFS OPS. I would not necessarily characterize that as a demonstration that GX is capable of supporting the I/O per second required in an HPC environment. As I’m sure you know, many HPC environments require significant sequential I/O performance for large reads/writes—either in addition to, or instead of high aggregate random access performance. The SpecSFS benchmark results clearly don’t address that type of workload. For large sequential I/O workloads, we are happy to have our customers benchmark GX with workloads that are representative of their processing.”

What the SPEC.ORG tests actually show is that, given enough resources and money, a huge disk configuration, and a large computing front-end, a platform can be fielded that is capable of achieving the IOPS required in a high performance compute environment. As a practical matter, I wanted to know the overall cost of such a platform in terms of hardware investment, software costs (including management software, assuming that there is cluster management software capable of managing this configuration), and “soft costs” including labor, electricity, and so forth.

Moxon responded by reemphasizing that GX clusters scale in a more manageable way than do unclustered NAS heads: “ONTAP GX does require the management of aggregates and volumes that are spread across the individual controllers. That capability is provided by our current ONTAP GX management suite. Additionally, we are in the process of developing more comprehensive tools that allow provisioning ‘templates’ to be quickly applied to common provisioning tasks in a data-centric manner. These templates can be applied to quickly create common directory structures (e.g., for different users or different aspects of a project) that instantiate multiple volumes at once, greatly simplifying the task of provisioning storage in GX.”

He went on to add, “Scalability is one of my favorite topics—not only from a storage perspective, but from an application perspective. How do your storage systems help streamline your operations—whether application provisioning, management, or monitoring? I think NetApp’s contributions in this space with ONTAP 7G are second to none. Flexible Volumes, Snapshot and FlexClone technology, and a tremendous variety of replication and data protection products (SnapMirror, SnapVault, and SnapLock and LockVault for retention/compliance requirements), coupled with recent additions in security of data-at-rest (Decru), and in [Virtual Tape Library], provide a strong portfolio of capabilities that our customers rely on to simplify their operations. ONTAP GX is a new “engine,” if you will, for delivering those same features on a scale-out platform. Quite simply, we believe “scale-out” is the way of the future, and that this approach allows us to build more capable storage systems more economically and more quickly than would be afforded otherwise (i.e. scale-up).”

For now, he says, Network Appliance has looked around the market it proposes to serve with GX and has determined that its pricing is competitive. To my knowledge, however, no definitive cost-of-ownership studies exist to suggest that the benefits of improved scalability outweigh the costs of such an infrastructure, or that it provides greater cost savings, risk reduction, or process improvement than an architecture that uses no clustering at all.

The Failure Factor

The other bugaboo of clustering, whether of servers or storage devices, usually has to do with resiliency. Simply put, clusters are complex and prone to failure.

I asked Moxon about this point and he responded, first, with a litany of resiliency features that are already part of the NetApp Filer ONTAP OS and carry over into GX clustering. He said that these standard features of NetApp Filers are augmented by active-active failover capabilities in the clustering components themselves, as well as by cross-nodal volume mirroring capabilities.

Bottom line: Moxon says that ONTAP GX is a new “engine,” if you will, for delivering the same resiliency features customers have come to love on their Filers, but on a scale-out platform. “Quite simply,” he says, “we believe scale-out is the way of the future, and that this approach allows us to build more capable storage systems more economically and more quickly than would be afforded otherwise [via non-clustered approaches].”

In his concluding remarks, Moxon concedes that it will take a while for GX clustering to go mainstream. However, he says, it is the architecture that Sunnyvale is pursuing in earnest and the architecture that they are betting that most storage administrators will pursue in the future. IDC’s Gray apparently agrees.

I am interested in hearing your views. Write to me at jtoigo@toigopartners.com.