Network Appliance Clustering: What’s Going On under the Covers?
Feedback on our column about clustering inefficiency.
A couple of columns back, I interviewed John Spiers, CTO for LeftHand Networks, on his views about the clustered storage products that seem to be entering the market these days in droves. To Spiers, a major criterion in the selection of a storage cluster (once you have determined that you need one at all) is the “hop count” manifested by the platform. You need to know how reads and writes are handled by the system: how the requests for data (or for space to write data) are routed and how these requests are resolved. The more hops, the lower the platform’s performance.
Moreover, Spiers argued (and we agreed), the message hopping required to resolve a read/write request determined much about the resiliency of the overall platform itself. Requiring an extra hop to look up the location of data in a data layout or metadata table, he said, tended to create a single point of failure in the platform, one that might deny all access to data stored on any node in the cluster.
As he critiqued platforms in the market, he took a few competitors to task for their clustering inefficiency, prompting a raft of e-mail responses from the companies he criticized. Everyone wanted a chance to respond to Spiers’ remarks.
One of these was Network Appliance. This summer, the company announced that it would begin selling a clustered file storage platform called GX, based on technology acquired from Spinnaker Networks a few years ago. Network Appliance says that the GX architecture will become a building block in new products from the company. Understandably, the company wanted to dispel any misapprehensions left by Spiers’ comments.
We provided Bruce Moxon, senior director of Strategic Technology for Network Appliance, and Dave Hitz, CTO Emeritus, with a list of detailed questions about the GX architecture. Moxon reported that he encountered some resistance inside Network Appliance to answering our questions. Some objected that sharing the requested information would give competitors knowledge of the technical internals of the GX architecture. Moxon says that he insisted on responding, if for no other reason than that the questions were likely to be the same ones asked by prospective customers.
He began by emphasizing that “the ONTAP GX product currently shipping is a ‘true integration’ of Spinnaker Networks’ clustered NAS architecture and a subset of NetApp’s ONTAP, the microkernel operating system that has been at the core of NetApp’s products for over a decade.” Answering concerns about the “inherent incompatibilities” between NetApp’s Berkeley Fast File System-derived Write Anywhere File Layout (WAFL) and Spinnaker’s Andrew File System-based architecture led the list of clarifications Moxon sought to make to our previous column.
Moxon noted: “AFS itself (i.e., the code) is not at the core of the Spinnaker product. Rather, some of the architectural concepts of AFS (and other parallel/distributed file systems) influenced its design. But the [NetApp] team also improved on the original AFS architecture—for example, the original AFS system required a special client-side protocol stack, whereas the Spinnaker implementation did not. The AFS influence led us to develop a 2-stage file system, where the client-accessible services were implemented in an ‘N-blade,’ and the actual protocol-independent file system was implemented in a ‘D-blade.’ In ONTAP GX (the integrated product), WAFL is used to implement the D-blade.”
Dave Hitz contributed additional clarification. “WAFL is not based on the Berkeley Fast File System. It is very different. We did use a lot of code from the Berkeley release, like the whole TCP/IP stack, lots of low-level boot code and drivers, chunks from a variety of administrative commands and daemons, but WAFL and RAID we implemented from scratch.”
File System Confusion
He added that all the acronyms can be confusing, noting that even the term “file system” is tricky, because it refers to two very different things. “There are disk file systems, whose main job is to convert logical file-based requests into block requests to disk, and there are network file systems, whose job is to transport logical file-based requests from a client to a server somewhere else.”
AFS is a network file system, like NFS. WAFL is a disk file system. The original AFS team also developed a disk file system called Episode. Hitz suggests that Episode was in many ways quite similar to WAFL—while quite different from Berkeley FFS, “so, it's not surprising that the Spinnaker folks would find that [WAFL] had the features they needed to develop AFS-like functionality.”
Bottom line, according to the two spokespersons: Spinnaker and NetApp had more in common to build a clustered solution on than pundits give them credit for. Integration is continuing, but the parts of NetApp’s traditional ONTAP system that have been blended with Spinnaker architecture include the core container architecture (WAFL, on-disk checksums, raid groups, RAID-DP, aggregates, flexible volumes), snapshots, and GX SnapMirror (within a GX cluster). “Over time,” Moxon said, “additional functionality, such as FlexClone, SnapVault, and interoperable SnapMirror (with non-GX NetApp systems) will also be supported. New features, including load-sharing mirrors (read-only replicas) and striped volumes (volumes that span multiple controllers and their disks) have also been added to the system, specifically to support high-performance file system requirements.”
Moxon said that design goals for the integrated NAS cluster architecture are threefold. First is horizontal scaling (scale out), for the development of very large storage systems (in terms of both capacity and performance) from modular components. “This,” he says, “allows us to pull multiple NetApp controllers (i.e., filers) into a single system, with data spread across the individual controllers within that system, and clients spread across many interfaces on the front-end that are logically all part of the same storage system.”
The second goal is “true virtualization from a data access perspective”—providing a common, coherent view of storage to a client regardless of the location where a client connects into the storage system. Moxon says that GX provides a common namespace for NAS protocols, and will do so as a large addressable storage pool and set of LUNs for block protocols when that functionality comes online.
The third goal is “truly transparent data migration.” Moxon describes this as the ability to migrate data among controllers in the cluster without changing the logical path to the data and without disrupting business applications. In the case of NAS, this capability will be supported even while files are open and locks are held.
The Question of Hops
Lofty goals, however, do not address Spiers’ concerns about hop counts. He touted LeftHand Networks’ “in-band” technique for locating data anywhere in the cluster, saying that it surmounted the extra hop required in most storage-clustering schemes, including Network Appliance’s GX, to look up a data location in a layout table stored on one node of the cluster in order to access a file on any other node. Such an operation, he argued, created both inefficiency and a single point of failure in the cluster. We asked Moxon to address this question head on.
His response was an architectural tutorial, recounted here.
Said Moxon, in GX clustering, “clients are distributed across a set of virtual interfaces (VIFs) at the front end of the storage ‘nodes’ in the cluster. A VIF is the GX representation of an IP/MAC address combination that provides client network access. At time of system configuration, VIFs are ‘bound’ to physical storage-controller networking ports. That binding can be dynamically updated, either manually or automatically, in response to a network port or controller failure, or as part of standard storage cluster expansions. Such port migrations are non-disruptive to the client applications.
A client connects to the storage system through a single VIF at time of mount. Mapping of clients to VIFs can be done by any of a number of means, including static (partitioned) assignments, DNS round robin, or through Level 4 load-balancing switches.”
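The mount-time assignment schemes Moxon lists all amount to spreading clients across VIFs. A minimal Python sketch of the round-robin case follows; the VIF addresses and client names here are invented for illustration, not taken from any NetApp configuration:

```python
from itertools import cycle

# Hypothetical VIF IP addresses; in GX these would be bound to
# physical controller ports, per Moxon's description.
vifs = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

def make_assigner(vifs):
    """Hand out VIFs in rotation, mimicking DNS round robin at mount time."""
    pool = cycle(vifs)
    return lambda client: next(pool)

assign = make_assigner(vifs)
mounts = {c: assign(c) for c in ["client-a", "client-b", "client-c", "client-d"]}
# The fourth client wraps around to the first VIF.
```

Each client keeps the single VIF it receives for the life of the mount; rebalancing later means migrating the VIF binding itself, not reassigning the client.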
Moxon provided an illustration to show that data is spread across a “partitioned SAN” on the back end, with each node connected to a subset of the overall storage, and volumes residing in one (and only one) partition. In his words, “The volumes are physically ‘stitched together’ into a global namespace by connecting them into a directory in a ‘parent’ volume at time of creation.
When a client request is made…a quick, in-memory lookup is done (in the VLDB—volume location database) to determine which node has the data for the requested file, and the request is ‘routed’ (through the cluster fabric switch) to the appropriate [node]. (Think of the VLDB as a ‘volume router’—analogous to a network router, which caches its maps in memory for very rapid switching). The target [node] makes the appropriate storage request, and returns the necessary information … to the client. This all occurs over the SpinNP protocol, a lightweight RPC protocol engineered from scratch by the Spinnaker team to ensure low latency and efficiency of storage network operations.
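Moxon’s description of the VLDB lookup reduces to an in-memory table probe followed by, at most, one forwarding step across the cluster fabric. A minimal sketch, assuming a made-up /vol/&lt;volume&gt;/&lt;file&gt; path scheme and invented node names:

```python
# In-memory "volume router": maps each volume to the node that owns it.
# The table contents here are hypothetical.
VLDB = {"vol1": "node-A", "vol2": "node-B", "vol3": "node-C"}

def route(path, receiving_node):
    """Resolve which node owns a file's volume and count forwarding hops."""
    volume = path.split("/")[2]                 # "/vol/vol2/x.txt" -> "vol2"
    owner = VLDB[volume]                        # local, in-memory lookup
    hops = 0 if owner == receiving_node else 1  # at most one fabric hop
    return owner, hops
```

Because the lookup table lives in memory on the node that received the request, finding the data costs no network traffic; only fetching it from another node does.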
“Bottom line: … you spread the clients out the front, the data out the back, and leverage a scalable cluster fabric that allows you to scale controller performance and end-to-end client and storage network bandwidth as capacity scales. Furthermore, because the individual storage partitions are themselves scalable—from both a capacity and performance perspective, the architecture can provide a wide range of storage system characteristics—from ‘wide and thin’ (high performance, low capacity) to ‘narrow and deep’ (higher capacity, lower performance).”
To make sure I understood Moxon’s overview, I repeated the question about an extra hop. His response was straightforward. “Yes, there is an extra hop for data access operations. As mentioned above, the client request is resolved through an in-memory VLDB lookup (‘volume routing’) to the node that ‘owns’ that volume.” He added that sophisticated load balancing is present to alleviate any operational inefficiencies that might accrue to this design.
“Load balancing is accomplished at a couple of levels. First, client sessions are distributed across N-blades and data across D-blades, so that a statistical distribution of activity across [user, data] sets is inherently load-balanced. Next, client sessions can be dynamically (non-disruptively) migrated across controllers (VIFs) to alleviate client-side load imbalances, and data can be dynamically (non-disruptively) migrated across D-Blades to alleviate storage-side load imbalances. These latter two activities are currently manually initiated operations; they will be part of an automated, policy-driven monitoring and management system in the near future. There are additional means of distributing load across the system—especially in high-concurrency, high-throughput applications (e.g., technical computing).”
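The storage-side rebalancing Moxon mentions, migrating data across D-blades when load skews, can be caricatured in a few lines. This is purely illustrative: the load figures, the imbalance threshold, and the blade and volume names are all invented, and the real system performs such migrations non-disruptively:

```python
# Hypothetical per-volume load (requests/sec) on each D-blade.
load = {"D1": {"vol1": 70, "vol2": 20}, "D2": {"vol3": 10}}

def total(dblade):
    """Aggregate load currently served by one D-blade."""
    return sum(load[dblade].values())

def rebalance(threshold=30):
    """Move the busiest volume from the hottest to the coolest D-blade."""
    hot = max(load, key=total)
    cold = min(load, key=total)
    if total(hot) - total(cold) > threshold:
        vol, _ = max(load[hot].items(), key=lambda kv: kv[1])
        load[cold][vol] = load[hot].pop(vol)   # migrate the volume
        return vol, hot, cold
    return None

moved = rebalance()
```

In this toy run the 70-unit volume moves off the overloaded blade; Moxon’s point is that today the operator triggers such moves manually, with policy-driven automation to follow.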
In response to this lucid statement, Hitz contributed the following observation: “If the data lives on the node that received the request, then there are no hops at all. If the data is on a different node, then of course there is one hop, to go get the data, but there is no extra hop above and beyond that first one. GX has a distributed database that pushes the ‘metadata table’ to each node, so the ‘look-up to a metadata table’ is local.”
His next comment was quizzical: “When Bruce said there is an extra ‘hop’ for data access operations, I believe that he was referring to the first, obvious hop that gets the data, not to any sort of extra hop required to find the data. That kind of extra hop would be required if you had an architecture with a centralized meta-data node that all other nodes needed to consult. In a distributed-database architecture, you only need the one hop to get the data. As you observe, the centralized architecture has issues both with scaling and with failure modes.”
“Because GX is a distributed database architecture,” Hitz continued, “there is essentially no impact from adding extra nodes. You have at most one hop no matter how many nodes. You do have to distribute the database to more nodes, but it's a small database that doesn't get updated often, so impact is low. One performance impact of scaling is that data is more likely to be on a different node than the one that received the request. Assuming that requests are completely random, you'll go off-node half the time in a two-node system, but 90 percent of the time in a 10-node system. (Of course, things often are not random, and with smart management you can move data to the nodes that get the most requests.)
“For performance tuning, the engineers generally arrange the tests so that 100 percent of the requests go off-node. When doing the giant SPEC-SFS result, they targeted the system size required by testing two-node systems with 100 percent off-node traffic, and then doing simple math to figure out how big a system would be required. They saw almost zero degradation as they scaled to the full-sized cluster. To me, that linear scaling was probably the most impressive aspect of the whole exercise. Of course, the more nodes you have, the more bandwidth you need between them, so you obviously need to scale the network interconnect appropriately.”
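Hitz’s off-node arithmetic is easy to verify: with n equally loaded nodes and uniformly random requests, a request lands on a node other than the one that received it with probability (n-1)/n. A one-line sketch confirms his figures:

```python
def off_node_fraction(n):
    """Probability a uniformly random request must go off-node,
    given n equal nodes each owning 1/n of the data."""
    return (n - 1) / n

# Matches Hitz's figures: half the time for 2 nodes, 90% for 10.
assert off_node_fraction(2) == 0.5
assert off_node_fraction(10) == 0.9
```

The fraction approaches 1 as the cluster grows, which is why his engineers benchmark the worst case of 100 percent off-node traffic.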
Getting to the Core
Moxon offered no clarifications of Hitz’s contribution, which seemed in some respects contrary to his own analysis. So, I followed up with two questions to Moxon that go to the heart of the issue: Are hop count and hopping methodology important when considering a clustered storage solution? What other criteria should guide consumers in choosing one storage clustering product over another?
His response was straightforward: “It can be, just as client-network latency can be important in delivering appropriate service levels to clients from any networked storage system. A well-engineered and provisioned grid/cluster fabric is important to get the most out of ONTAP GX, just as client networking configurations and SAN configurations are key in their respective implementations.”
Interestingly, GX, which is built on NetApp’s “universal building blocks” (e.g., Network Appliance’s non-clustered Filer products), exhibited about 32 percent slower performance than NetApp Filers not in a GX cluster. Clearly, hop counts count for something, so we dug deeper into the performance test data published by the vendor at SPEC.org.
In part two of this discussion, we will continue the dialog with Moxon to illuminate more mysteries of storage clustering. Your comments are welcome: firstname.lastname@example.org.