In-Depth
Managing the Information Lifecycle to Increase Database Performance
ILM is an emerging technology that promises to decrease the size and improve the performance of OLTP databases and data warehouses.
By all accounts, data volumes are going to explode over the next few years, and some researchers have predicted that advances in processing power and storage density won’t keep pace with that growth.
Companies struggling to keep pace with expanding data sets may find relief in an emerging technology practice known as information lifecycle management (ILM). ILM describes the process of moving aged or infrequently accessed information out of an OLTP database or data warehouse and into near-line or offline storage tiers, which are typically based on commodity Serial ATA disk arrays, bulk tape storage devices, and other less-expensive alternatives to online SCSI, Fibre Channel, or Serial Attached SCSI storage. An ILM solution also maintains logical links to data that has been moved offline, so as far as the database is concerned, nothing has changed.
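In practice, the pattern is simple to sketch. The snippet below is a minimal illustration only (the "orders" schema, the 18-month cutoff, and the use of SQLite are assumptions made for the example, not features of any vendor’s product): an archiving job moves aged rows into a separate archive table, and a view preserves the logical link so existing queries still see a single data set.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema: "orders" is the hot OLTP table; "orders_archive"
# stands in for a table that a real ILM deployment would keep on cheaper,
# near-line storage (here both live in one SQLite file for simplicity).
conn = sqlite3.connect("oltp_example.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS orders "
            "(id INTEGER PRIMARY KEY, placed_at TEXT, amount REAL)")
cur.execute("CREATE TABLE IF NOT EXISTS orders_archive "
            "(id INTEGER PRIMARY KEY, placed_at TEXT, amount REAL)")

# Assumed archiving policy: rows older than roughly 18 months are "aged."
cutoff = (datetime.now() - timedelta(days=548)).isoformat()

# Move aged rows out of the hot table and into the archive tier.
cur.execute("INSERT INTO orders_archive "
            "SELECT * FROM orders WHERE placed_at < ?", (cutoff,))
cur.execute("DELETE FROM orders WHERE placed_at < ?", (cutoff,))

# The view is the "logical link": queries against all_orders still return
# every row, even though the cold rows have physically moved.
cur.execute("CREATE VIEW IF NOT EXISTS all_orders AS "
            "SELECT * FROM orders UNION ALL SELECT * FROM orders_archive")
conn.commit()
conn.close()
```

The hot table shrinks, so index maintenance, backups, and most queries get faster, while the archived rows remain reachable when they are needed.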
A lifecycle management approach has long been prevalent in the storage space—e.g., hierarchical storage management (HSM)—but it’s still coming into its own as a growth- and performance-management solution for database management systems. These days, vendors such as Princeton Softech and SAND Technology Corp., among others, market ILM products that tier data across online, near-line, and offline (tape, magneto-optical) storage media. It’s an approach, advocates say, that can benefit even the most demanding of OLTP database systems, resulting in improved performance and increased reliability.
“Database archiving lets you identify relational data that is not used often and gives you the option of storing it in another place, and because you’re reducing the size [of the database], that can result in some real performance benefits for customers,” said Jim Lee, vice-president of marketing with Princeton Softech, in an interview last year. Another upshot of database archiving, says Lee, is a reduction in hardware and management costs: Because you’re no longer keeping all of your data online in your high-performance OLTP database or data warehouse, you can tier it across more affordable Serial ATA near-line devices, or onto inexpensive tape or magneto-optical offline storage devices.
In addition to driving performance improvements, organizations can also tap data archiving to reduce operating costs. For example, smaller databases that don’t need as much high-performance storage can run on smaller servers, and require a smaller staff to effectively manage them. With structured data growing at a 125 percent clip annually (according to META Group), and with an increasing number of data warehouses trending toward or above the once-benchmark 1 TB level, that’s an important consideration. “Most prospective customers recognize the value of what we do and how we do it,” Lee notes. Several trends are driving uptake of data archiving solutions, he says, but two of the most important are regulatory compliance requirements and limited IT budgets.
Not Just Marketing Hype
The coming data crunch isn’t just a marketing theme cooked up by ILM purveyors. Experts say that a “Perfect Storm” of factors—including expanded data retention requirements for regulatory compliance; radio frequency identification (RFID) for supply chain, inventory, and asset management; and the strategic importance of business performance management and other practices that emphasize the agglomeration and analysis of huge volumes of data—will occasion a Malthusian explosion in the size and diversity of enterprise data volumes.
Take data warehousing powerhouse Teradata, a division of NCR Corp. Long the standard-bearer at the high end of the high-end data warehousing market, Teradata has recently begun to talk up a vision of “extreme” data warehousing that encompasses all of these factors.
“If you look at data volumes, we predict that what today people consider to be the detail data will no longer be the detail data in the future. If you are in the telecommunications business, the detail data is typically detail records, one record for each call your customer makes,” explains Teradata CTO Stephen Brobst. “Ten years ago, if you told people you were going to put detail data in your data warehouse, they’d have said, ‘You’re out of your mind! The storage costs, the processing, the memory—it all will be prohibitively expensive.’”
This has already changed, says Brobst, and will change even more dramatically over the next few years. “If you look at FedEx, the details used to be the package, but now, if I send a package from LA to New York, it’s going to go through 12 different scans on the way. So the detail is no longer the package—it’s all of those scans,” he indicates.
So how much data are we talking about? The University of California at Berkeley, for example, has estimated that in the last three years humankind generated more data than in the previous 40,000 years. In 2002 alone, five exabytes of data were created—and, although estimates aren’t yet available for 2003 and 2004, the totals for both years are believed to be much higher.
There’s a sobering upshot to such estimates, of course. For decades, the IT industry has more or less thrived on the ability of Moore’s Law—which says that processing power effectively doubles every 18 months—to roughly track the real-world requirements of information systems (or vice-versa, depending on your perspective). This is to say that if the size of your database grows by x percent from year to year, or the complexity of your database queries increases by y percent, you can plan to tap attendant increases in storage density and processing power to accommodate this growth.
But is there a chance the explosion in data volumes predicted by META Group, U.C. Berkeley, and other researchers will torpedo this accidental equilibrium? Brobst doesn’t think so. “If you believe the Berkeley numbers, which are every 18 months roughly doubling in data, and you believe Moore’s Law, which is every 18 months roughly doubling in processing speed, then there isn’t a problem,” he argues.
But Robert Thompson, vice-president of marketing with ILM software purveyor SAND Technology, isn’t so sure. “Today, data warehouses are growing at 100 percent compound annual growth rate in some cases, and the overhead associated with trying to manage this indexing burden is just exceeding whatever performance improvements you’re getting from [Moore’s Law],” he says. “Times have changed and the warehouse is being accessed by a lot more users than people ever anticipated. The standard ways of solving this [growth], such as reducing storage costs 15 percent per quarter or increasing processing power, really don’t get people ahead of the curve.”
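A quick back-of-the-envelope calculation makes the disagreement concrete. The growth rates below are the ones cited by Brobst and Thompson; the three-year horizon and the simple compounding are illustrative assumptions.

```python
# Compound the growth rates quoted above over an arbitrary three-year horizon.
years = 3
months = years * 12
quarters = years * 4

processing = 2 ** (months / 18)     # Moore's Law: doubles every 18 months  -> 4x
data_doubling = 2 ** (months / 18)  # data doubling every 18 months         -> 4x
data_cagr = 2 ** years              # 100 percent compound annual growth    -> 8x
storage_price = 0.85 ** quarters    # price falling 15 percent per quarter  -> ~0.14x

print(f"Processing power:        {processing:.1f}x")
print(f"Data, 18-month doubling: {data_doubling:.1f}x (tracks processing power)")
print(f"Data, 100 percent CAGR:  {data_cagr:.1f}x (twice the processing growth)")
print(f"Storage price per unit:  {storage_price:.2f}x (cheaper capacity, but cheaper")
print( "                         disks do nothing about indexing and query overhead)")
```

On Brobst’s assumptions the two curves track each other; on Thompson’s, data outgrows processing power by a factor of two every three years, and that is the gap database archiving is meant to close.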
ILM proponents have another trump card up their sleeves, too: regulatory compliance. Regulations such as the Sarbanes-Oxley Act of 2002 impose specific data retention requirements for all publicly traded companies. Instead of keeping this data online in relational databases or data warehouses—and developing a server, storage, and personnel infrastructure sufficient to support truly massive data volumes—organizations could opt to move it to near-line storage (where it can still be accessed by online users) or to offline storage resources, such as write once, read many (WORM) drives, which satisfy stringent data authenticity requirements. If this data is requested by regulators, subpoenaed as part of a lawsuit, or required in the event of a disaster, it can be quickly restored.
About the Author
Stephen Swoyer is a Nashville, TN-based freelance journalist who writes about technology.