Building Storage Part I: A Device-Driver Approach

Addressing the difficulties of both horizontal and vertical scalability has seen the rise of clustering techniques—but the question is where the brains for the cluster should reside.

The appliance model has begun to grate. Many vendors have decided to “purpose-build” arrays to do specific things, such as content addressing, data de-duplication, encryption, or disaster recovery.

The concept is not new, of course. One could argue that Network Appliance pioneered the idea of the storage appliance in the 1990s with the introduction of its first filers, which eventually demonstrated the problems with appliances in general. To scale capacity, you needed to buy more appliances. Horizontal scaling worked up to a point, then you started running into management issues. Though Sunnyvale has sought to address this issue today, it was all too common to chat with enterprise users of NetApp gear—such as Earthlink, which had 800 filers at one point—who would complain that managing capacity on the boxes required visiting the management interface of each box individually. “It’s like surfing the Web,” one fellow told me. “Then, when you reach the last one, you have to start all over again.”

Network Appliance tried over and over to rectify the situation. They released management software that could aggregate all of the information from the boxes, but it didn’t satisfy customers who felt that it lacked sufficient detail to enable single-throat-to-choke management. Since the complaints have died down, I assume NetApp has made some advances that have satisfied customer needs. I wouldn’t really know, however, because I can’t test their gear without running afoul of a software license proviso that restricts customers from speaking publicly about their experience with the product.

“Horizontal scaling equals management pain” was once the bottom line in appliance storage. That contrasted with the vertical scaling issues that plagued big iron vendors. Vertical scaling provides a method for capacity growth inside one chassis. Need more storage? Add more trays of drives. Presumably, management is a non-issue because you are still dealing with a single controller head.

The problem with vertical scaling to massive multi-TB capacities, however, has always been the traffic bottlenecks created by placing data and applications behind one controller (or a redundant pair, if you wanted fault tolerance). EMC and others have tried to resolve that problem with more complex controllers that divvy up their memory caches to various requests. Hitachi Data Systems has put a switch in the controller to try to manage traffic better. Early reports from the field suggest that both approaches have legs … and limitations.

The Rise of Clustering

Recently, in Montreal, a fellow with a leading outsourcing company said that he was using EMC gear only because his customers wanted the “assurance” that his storage infrastructure was “best of breed,” whatever that means, given that everyone is buying a box of Seagate drives anyway. (We laughed about the similarities between such thinking and the idea of buying a Hummer for $40K more than the Yukon, despite the fact that both automobiles have exactly the same chassis.) He said he is restricted to using only about 40 percent of the capacity of each array because of the performance issues that are created if he fills every drive tray. He said he had looked at HDS’ TagmaStore, but found the interface too complex and difficult for performing even minor tasks. Vendors may disagree with these assessments, but they are coming straight from the user’s mouth.

To address the difficulties of both horizontal and vertical scalability, we have seen the rise of clustering techniques that join many smaller arrays together via an application clustering head technology, giving the benefits of vertical scalability without the choke points and of horizontal scalability without the management issues. Lots of vendors are trying it. The idea of head clustering appears to have been the one concept that NetApp took away from its Spinnaker Networks acquisition a few years ago, though the soundness of their implementation, again, cannot be tested publicly. I await feedback from the user community (anonymously, of course).

Clustering, however, continues to have its bugaboos. Clustered storage frames, like their server counterparts, are difficult to deploy and to administer over time and usually require the development of a rarified set of skills among storage administrators. Those who have the coin and the expertise swear by clustering; those who don’t swear at it.

John Spiers, CTO of LeftHand Networks, speaks highly of clustering, which is integral to LeftHand products. He notes that not all clustering is the same, and that a key difference is whether the cluster is managed by a single head (designated as the master) or whether management intelligence is distributed across many heads, usually across a network. The latter, he would argue (and I would agree), is closer to what one might call grid storage.

The question is where the smarts for the cluster should reside. Appliance makers, and I would include LeftHand in this category, might argue that the functionality of the cluster should be placed in hardware, with each node recognizing its peer in the cluster by virtue of some integral identifier. That’s one way to go.

The Device-Driver Approach

Another approach is to use a device driver to connect the storage—leveraging standard DHCP to assign the drives their identifiers—and freestanding software to organize disks or LUNs into volumes. Last week, I visited MIT Media Labs, where just such a strategy has been implemented.

A couple of months ago, the Labs announced that, courtesy of Seagate, Bell Micro, and Zetera, they would be building a multi-petabyte storage infrastructure composed entirely of storage over IP, or “Z-SAN” as Zetera prefers to call it. Bell Micro has put together arrays, the Hammer Z series, using a box of Seagate drives with a Zetera front-end connection. That means you connect the boxes to an IP network, where they receive DHCP addresses either on a per-box or per-drive basis. Servers that must access these drives do so via a UDP-based connection and a specialized Zetera device driver loaded on each server.
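To make the idea of block access over UDP concrete, here is a minimal sketch of what a driver-side read request to an IP-addressed drive might look like. Everything in it is an assumption for illustration: the header layout, the `OP_READ` opcode, and the function names are hypothetical and are not Zetera's actual wire format, which the column does not describe.

```python
import socket
import struct

# Hypothetical request header (NOT Zetera's real protocol):
# opcode (1 byte), logical block address (8 bytes), block count (2 bytes),
# all in network byte order with no padding.
HEADER = struct.Struct("!BQH")
OP_READ = 1  # assumed opcode value, chosen for this sketch


def encode_read(lba: int, count: int) -> bytes:
    """Pack a read request for `count` blocks starting at block `lba`."""
    return HEADER.pack(OP_READ, lba, count)


def decode_request(datagram: bytes) -> tuple[int, int, int]:
    """Unpack (opcode, lba, count) from the front of a datagram."""
    return HEADER.unpack_from(datagram)


def send_read(address: tuple[str, int], lba: int, count: int,
              timeout: float = 1.0) -> bytes:
    """Send one read request to a drive's IP address and wait for one reply.

    UDP gives no delivery guarantee, so a real driver would need its own
    retry and ordering logic; this sketch simply times out.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(encode_read(lba, count), address)
        reply, _sender = sock.recvfrom(65535)
        return reply
```

The point of the sketch is only that each drive, once it has a DHCP-assigned address, can be reached with an ordinary datagram; the driver on the server carries the intelligence that a storage controller head would otherwise provide.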

In this implementation, you have the benefits of storage clustering, but without a lot of extra head engineering. If you want to create RAID configurations, or perhaps more appropriately RAIN (redundant array of independent nodes) configurations, you simply stripe across multiple nodes. Better yet, since the configuration takes advantage of commodity disk prices, which are constantly decreasing, just double up on the disks and write the data to two or more disks simultaneously. That function leverages multicasting, which is available in the IP stack, but only if you use UDP rather than TCP.
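The striping and doubling-up described above can be sketched as a simple block-placement calculation. This is an illustrative layout under my own assumptions (round-robin striping, and mirror groups of adjacent nodes); the names `Placement`, `stripe`, and `mirrored_stripe` are hypothetical and do not describe how the Zetera driver actually places data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    node: int        # index of the node (drive or box) holding the block
    node_block: int  # block offset within that node


def stripe(logical_block: int, num_nodes: int) -> Placement:
    """Plain striping: consecutive logical blocks rotate across nodes."""
    if num_nodes < 1:
        raise ValueError("need at least one node")
    return Placement(logical_block % num_nodes, logical_block // num_nodes)


def mirrored_stripe(logical_block: int, num_nodes: int,
                    copies: int = 2) -> list[Placement]:
    """Striping plus mirroring: nodes are grouped into sets of `copies`,
    blocks are striped across the groups, and every node in the chosen
    group receives the same block at the same offset."""
    if copies < 1 or num_nodes % copies != 0:
        raise ValueError("num_nodes must be a positive multiple of copies")
    groups = num_nodes // copies
    group = logical_block % groups
    offset = logical_block // groups
    return [Placement(group * copies + i, offset) for i in range(copies)]


# With four nodes and two copies, block 5 lands on both nodes of group 1.
targets = mirrored_stripe(5, num_nodes=4, copies=2)
```

In a multicast arrangement, the nodes in one mirror group would presumably share a group address so that a single datagram reaches every copy; that mapping is left out here because the column does not say how it is done.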

To my way of thinking, this is a lot closer to the ideal of clustered storage than are many of the implementations being made in proprietary hardware today. Plus, it doesn’t cost nearly as much as proprietary gear and exposes itself to management, in part at least, using traditional network management systems and protocols, rather than highly specialized and pricey clustered storage management suites.

The Media Labs implementation of Zetera-enabled storage is certainly not the definitive example of the soundness of this strategy. Simply put, Media Labs professors are writing a ton of video data to disk as part of a human language learning study. They parse the video and retrieve sections when they want to look at a point in time. They also search across the video using some neat algorithms for overlaying audio and video data with standardized “schemas” so they can correlate audio/video events and hopefully illuminate language learning precursors and predictors. That’s very worthwhile research, indeed, but not representative of the workload that might be imposed on storage by a transaction-processing system in a bank or a busy e-mail system or 10,000 users simultaneously accessing a Web site.

It will be interesting to speak with users of the Bell Hammer arrays from the world of business, if only to confirm that their approach to clustering and scaling is everything that the Zetera engineers believe it to be. More on this at a later time.

Until then, your comments are welcome.