Why All Data is Small Data
Just as small data has always driven the profitability of disk drive manufacturers (capacity improvements have always resulted in increased sales), it will also drive the development of next-generation data management solutions.
Although everyone is talking about "Big Data" -- as though that term has any meaning beyond what a vendor says it means -- I submit that the real money in the storage business has always been and will always be made in and around an even more revolutionary concept: "small data."
This may sound counterintuitive, but hear me out.
In 2011, analysts pegged the aggregate capacity of external storage arrays worldwide at 21 exabytes. Back then, a leading industry watcher projected an annual data growth rate of 30 percent, a graceful curve that would reach about 46 exabytes by 2014.
When you add to the mix the capacity requirements imposed by virtual server technology, the analyst growth charts curve along a much steeper incline -- with aggregate external disk capacity requirements growing to between 168 and 212 exabytes midway through the second decade of the New Millennium. The difference in the projections seems to be closely linked to which storage array vendor paid the analyst for the report.
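If you want to check the math, the projections are simple compound growth. Here's a quick sketch in Python; the four-year horizon to the 168-212 exabyte range (roughly 2015) is my reading of "midway through the second decade":

```python
# Back-of-envelope check on the analyst projections above.
base_eb = 21.0                      # aggregate external capacity, 2011 (EB)

# 30 percent annual growth, 2011 -> 2014 (three years)
print(base_eb * 1.30 ** 3)          # ~46 EB: the "graceful" curve

# What annual growth rate reaches 168-212 EB in four years?
for target_eb in (168.0, 212.0):
    rate = (target_eb / base_eb) ** (1 / 4) - 1
    print(f"{target_eb:.0f} EB implies ~{rate:.0%} annual growth")
# roughly 68 to 78 percent per year: a very different curve
```

In other words, the virtualization-adjusted forecasts assume data growing more than twice as fast as the "graceful" baseline.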
I know what you're thinking: that's really big storage, so it must hold "Big Data." But that isn't true. All of the data it holds is quite small.
The fundamental unit of data is, of course, a bit. A bit represents a discrete binary state (1 or 0) that is communicated through the detection of the magnetic field manifested by a very small region of a disk platter -- a cell, if you will. Each bit or data cell is so small that you need a sophisticated sensor, such as a GMR read head, plus lots of high-tech signal amplification and background noise filtering, just to detect it.
You can buy a 3.5-inch SATA disk drive today that holds 3TB of data stored at 625 gigabits (625,000,000,000 bits) per square inch. Last month, Seagate claimed bragging rights for a process called Heat-Assisted Magnetic Recording (HAMR) that, used in conjunction with platter media coated with material having very high magnetic coercivity, increases this density to a terabit (a trillion bits) per square inch, setting the stage for much higher capacity drives in the same physical platter size.
Combine this with technology spearheaded by Toshiba about two years ago, called bit-patterned media (BPM), in which bits are stored in mesas and valleys on the media surface, and you may well see drives holding in excess of 40TB (Seagate says 60TB) within the next few years. That's a foundation for Big Storage, to be sure, but fundamentally -- regardless of disk capacity, array capacity, or the capacity of all external arrays combined -- a bit is a bit. All data is small data.
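Since capacity scales more or less linearly with areal density (platter count and diameter held constant), the arithmetic is easy to sketch; the proportionality assumption is mine:

```python
# Rough capacity scaling with areal density, assuming the same platter
# count and diameter so that capacity grows linearly with density.
current_tb = 3.0            # today's 3.5-inch drive, from above
current_density = 625e9     # bits per square inch, from above

def scaled_capacity_tb(new_density):
    """Project drive capacity, assuming linear scaling with density."""
    return current_tb * new_density / current_density

print(scaled_capacity_tb(1e12))     # HAMR at 1 Tbit/sq. in. -> ~4.8 TB

# Working backward: what density does a 60TB drive of this design imply?
print(60.0 / current_tb * current_density / 1e12)   # ~12.5 Tbit/sq. in.
```

Note what the numbers say: HAMR's first terabit-per-square-inch milestone only gets a 3TB design to roughly 5TB. The 40TB-to-60TB drives require densities north of 10 terabits per square inch, which is exactly why HAMR and BPM are expected to be combined.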
When you move outside of the realm of magnetic media storage and into the realm of, say, DNA, bits get even smaller. A gram of DNA is said to contain about 10^8 terabytes (roughly 100 exabytes) of data. Counting just the diploid cells, excluding the microorganisms that share our body mass, and taking an average body weight that is considerably smaller than mine, the total storage capacity in cellular DNA across a human being is about 150 zettabytes, according to genetic scientists.
One more factoid regarding mass storage in DNA: each of us contributes a paltry 750 megabytes of data in order to make a baby -- an activity in which slower IOPS are usually more appreciated than faster IOPS, by the way -- and yet an extraordinary amount of information is replicated by this process. That suggests that the size of the data cell -- whether big or small -- says little about the importance, criticality, value, or impact of the information it stores.
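For the skeptics, those DNA figures hang together under the conventional assumption of two bits per base pair. A sketch, using round numbers of my own choosing for genome size, per-cell DNA mass, and cell count:

```python
# DNA storage arithmetic, assuming the conventional two bits per base
# pair. Genome size, per-cell DNA mass, and cell count are round figures.
BITS_PER_BP = 2

haploid_bp = 3.1e9                   # base pairs in one gamete's genome
haploid_bytes = haploid_bp * BITS_PER_BP / 8
print(haploid_bytes / 1e6)           # ~775 MB: the "paltry 750 megabytes"

diploid_bytes = 2 * haploid_bytes    # ~1.55 GB per ordinary cell
print(diploid_bytes * 1e14 / 1e21)   # ~100 trillion cells -> ~155 ZB

grams_per_cell = 6.5e-12             # ~6.5 picograms of DNA per cell
print(diploid_bytes / grams_per_cell / 1e12)  # ~2.4e8 TB: order 10^8 TB/gram
```

Every figure above checks out to within rounding.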
I don't make the comparison between electronic and biological storage to be humorous or obtuse. Some years back in this very column, I noted that there were efforts afoot to write binary data to animal DNA so that your family dog could become your household media center or portable iTunes repository. (In a rare bit of dark humor, I noted that the equivalent of a head crash might result if the family pet ran out into a busy street.) This time, I want to make a very specific point.
Recently, some vendors have deviated from the original narrative around "Big Data," which was first described (notably by IBM's Jeff Jonas) as an application -- a sort of "data-mining-on-steroids." Instead, they are using the term to describe, well, the amount of data itself -- the enormous and growing quantity of stored bits and the myriad problems that this burgeoning mass is creating. "Big Data," said one CA Technologies spokesperson recently, "refers to the problem of dealing with such an enormous quantity of small files."
In that vein, a few vendors have started using Big Data as shorthand for a new generation of applications, many in the earliest stages of development at start-ups or smaller firms, that will be used to wrangle all of those stored bits into functional or operational units, so that they lend themselves to organized, policy-based actions (replication, migration, etc.) in response to specific inputs.
If I am reading my marketecture tea leaves properly, this new interpretation of Big Data will usher in a new flood of biological metaphors for data management, likely based on underlying principles of "object storage."
Imagine a functional unit of bits that automatically "divides" itself (think cellular division, complete with data duplication) in order to deliver redundancy and protection against, you guessed it, "viruses" and other data threats.
What about an operational unit of data that reorganizes itself in response to changing network conditions or adapts itself for use with a new or different file system? That would certainly help solve knotty problems such as server hypervisor chokepoints or the readability of data stored in a long-term archive.
What would be the value of a self-organizing unit of bits that could work seamlessly with other units of data created by different applications or having different formats? Realizing this capability would enable us to combine all sorts of data effortlessly to support searches or analytical efforts.
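Nothing like this ships today, but the conceptual leap is smaller than it sounds. Here is a deliberately hypothetical sketch -- every class name and policy in it is invented for illustration -- of a data object that describes itself, "divides" to maintain redundancy, and checks itself for "infection":

```python
# A hypothetical "biological" data object: self-describing metadata,
# "cellular division" to satisfy a redundancy policy, and checksum-based
# "virus" (corruption) detection. All names and policies are invented.
import hashlib

class DataCell:
    def __init__(self, payload: bytes, fmt: str, target_copies: int = 3):
        self.payload = payload
        self.metadata = {"format": fmt}   # self-description for foreign apps
        self.target_copies = target_copies
        self.checksum = hashlib.sha256(payload).hexdigest()
        self.replicas = [payload]         # in practice, spread across nodes

    def is_infected(self) -> bool:
        """The 'virus' check: does any replica fail its checksum?"""
        return any(hashlib.sha256(r).hexdigest() != self.checksum
                   for r in self.replicas)

    def divide(self) -> None:
        """'Cellular division': replicate until the redundancy policy is met."""
        while len(self.replicas) < self.target_copies:
            self.replicas.append(self.payload)

cell = DataCell(b"small data", fmt="text/plain")
cell.divide()
print(len(cell.replicas), cell.is_infected())   # 3 False
```

The replication half of this idea already exists in object storage systems; the self-description and self-adaptation are the parts still waiting on the small-data focus I'm arguing for.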
All of these ideas have their analogs in the biological world, of course -- in embryology, immunology, virology, and so on. Developing them into viable computer technologies requires a focus, first, on the nature and organization of small data rather than on some grand notion of a Big Data analytics killer app.
Bottom line: I would argue that, just as small data has always driven the profitability of disk drive manufacturers (capacity improvements have always been rewarded by increased sales), it will also drive the development of next-generation data management solutions. Provided, of course, that the marketecture isn't allowed to overwhelm the architectural work that is already underway.
Your comments are welcome. [email protected]