Sparse Files

Can a 2 GB disk store a 32 GB file? At first blush, the answer is no. Even with the best compression technologies, that amount of savings in disk space isn’t possible. Remarkably, though, there are situations where extremely large NTFS files can be stored in a very small space.

Some applications create enormous files with unusual characteristics. Large sections of these files are sometimes not significant or can be set to all zeroes. Handmade databases often have the property that rapid insertion and deletion of large amounts of variable-length data leaves entire sections of the database unused. Image processing applications also often have large chunks of files set to a single value or zero. Files that contain large numbers of zeroes -- especially those in which the zeroes vastly outnumber the significant information -- are called sparse files.

The unpleasant reality of sparse files is that they take up more space than they need, yet the effort to compact them and restore them for application use can be expensive and time-consuming. An emerging dilemma is that, as applications grow more complex and multimedia requirements increase, sparse files are becoming an increasingly common feature of our storage environment. For example, a scientific application with an enormous matrix may produce a file a full terabyte in size that contains only a megabyte of meaningful data.

Compression has been built into Windows NT since version 3.51. Unfortunately, compression isn’t the answer for sparse files.

Compression can easily reduce a long sequence of zeroes to a tiny number of bytes -- but the time and power needed to compress and decompress very large files would be overwhelming. Something better is needed.

A little-known feature of the new NT File System (NTFS) is its native support for sparse files. The version of NTFS that ships with Windows 2000 treats sparseness as a built-in property of a file, and Windows 2000 also provides programming interfaces and operating system support for dealing with sparse files.

Typically, a file is a stream of information that is written from a starting point to an ending point. When viewed this way, it’s clear how sparse files can become a problem. NTFS for Windows 2000 takes a slightly new approach: It allows files to be written as if only the significant chunks were placed on disk. Any part of the file that is judged to be insignificant can be marked as such and its space returned to the operating system for reuse. NTFS, however, keeps track of the gaps between the significant chunks so that, when an application needs the file, it appears no different from any other file.
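
To make the bookkeeping concrete, here is a toy in-memory model of the idea. It is only a sketch under my own assumptions: the class name SparseFileModel, the 4 KB block size and the dictionary of blocks are invented for illustration and say nothing about how NTFS actually lays out its on-disk structures. What it does show is the principle: only blocks that have been written are stored, and everything else reads back as zeroes.

```python
BLOCK = 4096  # assumed allocation unit for this toy model, not an NTFS constant


class SparseFileModel:
    """Toy model of a sparse file: only blocks that were written are stored."""

    def __init__(self, logical_size):
        self.logical_size = logical_size
        self.blocks = {}  # block index -> bytearray(BLOCK), stored blocks only

    def write(self, offset, data):
        if offset < 0 or offset + len(data) > self.logical_size:
            raise ValueError("write outside the file")
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, BLOCK)
            n = min(BLOCK - start, len(data) - pos)
            # Allocate a block only when something is actually written to it.
            block = self.blocks.setdefault(index, bytearray(BLOCK))
            block[start:start + n] = data[pos:pos + n]
            pos += n

    def read(self, offset, length):
        length = max(0, min(length, self.logical_size - offset))
        out = bytearray(length)  # starts as zeroes, so gaps need no extra work
        pos = 0
        while pos < length:
            index, start = divmod(offset + pos, BLOCK)
            n = min(BLOCK - start, length - pos)
            block = self.blocks.get(index)
            if block is not None:
                out[pos:pos + n] = block[start:start + n]
            pos += n
        return bytes(out)

    def allocated_bytes(self):
        return len(self.blocks) * BLOCK


# A 32 GB logical file with a few bytes at each end occupies two blocks.
model = SparseFileModel(32 * 1024**3)
model.write(0, b"head")
model.write(32 * 1024**3 - 4, b"tail")
print(model.logical_size, model.allocated_bytes())
```

In this model the application still sees a 32 GB file, but only the two touched blocks consume any storage.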

An advantage for applications is that once a file has its sparseness enabled, the application can read just the nonzero parts of the file. When a sparse file is read, significant data is returned as stored, and unallocated data is returned by the operating system as zeroes. That can speed up processing of the file significantly. Perhaps the only downside of Windows 2000 NTFS sparseness is that there’s no turning back: setting the sparse attribute on a file is an irreversible operation.

So, can you store 32 GB on a 2 GB disk? You sure can! In a great article in Microsoft Systems Journal called "A File System for the 21st Century," Jeff Richter and Felipe Cabrera demonstrate a short, simple piece of code that creates a 32 GB file on almost any disk that supports the new NTFS. Their code is about 10 lines long and is easy to try out on any Windows 2000 system. You can find their article at http://www.microsoft.com/msj/1198/ntfs/ntfs.htm.
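
The Richter and Cabrera listing isn’t reproduced here. As a rough, portable stand-in, the sketch below creates a file whose logical size is far larger than the data written to it by seeking past the unwritten region and writing a few bytes at the end. The function name create_mostly_empty_file and the sizes are my own assumptions. Note one important caveat: on NTFS a file has to be flagged as sparse first (the Win32 route is DeviceIoControl with the FSCTL_SET_SPARSE control code), which this sketch does not do, so on Windows the skipped region may still be allocated. On many UNIX-style file systems the skipped region typically becomes a hole automatically.

```python
import os
import tempfile


def create_mostly_empty_file(path, logical_size, tail=b"end"):
    """Create a file of logical_size bytes with data only at the very end."""
    if logical_size < len(tail):
        raise ValueError("logical_size is smaller than the data to write")
    with open(path, "wb") as f:
        f.seek(logical_size - len(tail))  # skip the region that is never written
        f.write(tail)
    return os.path.getsize(path)


# Demo with a modest 64 MB file in a temporary directory (hypothetical name).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "big.dat")
    demo_size = create_mostly_empty_file(demo_path, 64 * 1024 * 1024)
    print(demo_size)
```

The demo deliberately uses 64 MB rather than 32 GB so that nothing unpleasant happens if the underlying file system turns out not to support holes.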

The result is striking: A 32 GB file stored in 8 KB of space! If you try it out, right-click the resulting file in Windows 2000 Explorer and have a look at its Properties. Along with the traditional Size property, a new property appears: Size on disk. The new property is the only obvious evidence of sparseness in Windows 2000.
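
The same comparison can be made from a program. The sketch below, with function names of my own choosing, reports a file’s logical size alongside the space it actually occupies. On Windows it asks the Win32 function GetCompressedFileSizeW, which reports the allocated size for sparse files; elsewhere it falls back to the st_blocks field of os.stat, counted in 512-byte units. Treat it as an illustration rather than a tested administration tool, and note that the demo file is only sparse if the underlying file system cooperates.

```python
import ctypes
import os
import sys
import tempfile


def size_on_disk(path):
    """Return the number of bytes actually allocated to the file."""
    if sys.platform == "win32":
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        fn = kernel32.GetCompressedFileSizeW
        fn.argtypes = [ctypes.c_wchar_p, ctypes.POINTER(ctypes.c_ulong)]
        fn.restype = ctypes.c_ulong  # DWORD: low 32 bits of the size
        ctypes.set_last_error(0)
        high = ctypes.c_ulong(0)
        low = fn(path, ctypes.byref(high))
        if low == 0xFFFFFFFF and ctypes.get_last_error() != 0:
            raise ctypes.WinError(ctypes.get_last_error())
        return (high.value << 32) | low
    return os.stat(path).st_blocks * 512  # st_blocks counts 512-byte units


def sizes(path):
    """Return (logical size, size on disk) for the file at path."""
    return os.path.getsize(path), size_on_disk(path)


# Demo: a 16 MB file with a single byte written at the end.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "sparse.dat")
    with open(demo_path, "wb") as f:
        f.seek(16 * 1024 * 1024 - 1)
        f.write(b"\x01")
    logical, on_disk = sizes(demo_path)
    print("Size:", logical, "Size on disk:", on_disk)
```

For the figures quoted above, 32 GB over 8 KB works out to a ratio of roughly four million to one.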

Those who live in the world of daily spreadsheets and word processing files may never get to see this amazing feature of Windows 2000’s NTFS, but I think many of us will. Sparse files are likely to be a feature of many storage administrators’ lives as applications become more complex.

--Mark McFadden is a consultant and is communications director for the Commercial Internet eXchange (Washington). Contact him at mcfadden@cix.org.
