In-Depth
Are You Collecting the Wrong Data?
Organizations use as little as 25 percent of the data they collect, but is it the information they need most?
Jim Lair is a grizzled veteran with 43 years of IT experience, 25 of which were spent in the federal service. In the mid-1990s, Lair also worked for data quality pioneer QDB Solutions. These days, Lair is president of the Center for Data Quality (C4DQ, http://www.c4dq.com), a data quality consultancy based in Reston, Va.
Lair’s company provides customers with consulting expertise that demonstrates how data quality practices and methods can fit into their existing operations. For customers that have already undertaken data quality project implementations, C4DQ provides what Lair calls an “over the shoulder” review of their operations.
We spoke with Mr. Lair about the state of data quality in the enterprise today.
Q. How prevalent are data quality programs in enterprise environments today?
A. Some of them have been there for a long time. After we survived Y2K, more and more people began focusing on the quality of their data. Today, there are also a lot of vendors in the marketplace, so there’s a lot more positive noise about data quality, which is good. Part of this problem is just creating awareness; part of it is educating people. It’s not just about buying a tool: if you don’t understand how to use the tool, or don’t have a methodology in your organization for implementing it, a tool may not be the best way to solve your problems.
But data quality is an issue, mainly because organizations have always tried to capture as much data as possible. What’s surprising is how they use the data that they are capturing.
Q. How so?
A. [In] many of my individual consulting engagements and those of the company, it’s interesting that we’re seeing a lot of data that doesn’t appear to be getting used. That doesn’t mean it’s not of value; it just means that over some period of time, it’s not being touched.
I first stumbled into this years ago working with a tax service company that had something over 3,100 data storehouses, but only 700 were being touched by any online or batch activity. We turned on the operating system software to monitor which data elements were being used, and after about a six-month period it was kind of interesting to note that only 25 to 30 percent of the actual data sitting out in these files was actually being used!
Q. And yet, presumably, they still need all of the data that they’re archiving. What does this knowledge do for them?
A. What that usage data gets you, once you analyze the statistics, is that it sets you up to develop some intelligent archiving strategies. There’s a notion in the marketplace that disk space is so cheap that we can keep everything online, but ultimately you can’t, and you don’t need to. So part of our work is to assess usage when customers want to, and to help them implement more intelligent archiving strategies.
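To make the idea concrete, here is a minimal sketch (ours, not C4DQ’s tooling) of how usage statistics can drive an archiving decision. It assumes a last-touched date has already been collected for each data store from monitoring software, and it uses an invented six-month threshold:

```python
from datetime import date, timedelta

# Hypothetical usage statistics gathered from monitoring software:
# data store name -> date it was last touched by any online or batch job.
last_touched = {
    "customer_master": date(2024, 5, 10),
    "tax_lookup_1998": date(2019, 2, 3),
    "daily_orders": date(2024, 6, 1),
    "legacy_addresses": date(2022, 11, 20),
}

ARCHIVE_AFTER = timedelta(days=183)  # roughly six months; illustrative only
today = date(2024, 6, 15)

# Split data stores into "in use" and "archive candidates".
in_use = {name for name, touched in last_touched.items()
          if today - touched <= ARCHIVE_AFTER}
archive_candidates = set(last_touched) - in_use

print(f"{len(in_use) / len(last_touched):.0%} of data stores touched recently")
print("Archive candidates:", sorted(archive_candidates))
```

In practice the threshold and the unit of analysis (file, table, or individual element) would come from the business, not from a constant in a script.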
Q. Does this usage knowledge typically tell them anything else?
A. The whole combination of the data quality analysis and the usage [analysis] frequently reveals that the data people are depending on today may be different from the data they began depending on when the requirements were first developed. We sometimes see differences in the level of quality between mandatory data and optional data, and more often than not this indicates a change in the business itself and in the way that business users are relying on the data. This can tell them a lot about how their business has changed, and sometimes they aren’t even aware that these changes have taken place.
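One simple way to surface that kind of shift is to compare fill rates for fields the original requirements marked mandatory against those marked optional. The sketch below is our illustration with invented field names, not Lair’s method; a mandatory field that is mostly empty while an optional field is fully populated is the sort of mismatch he is describing:

```python
# Hypothetical records; which fields are mandatory vs. optional comes
# from the original requirements, not from the data itself.
records = [
    {"account_id": "A1", "tax_code": None, "mobile_phone": "555-0101"},
    {"account_id": "A2", "tax_code": None, "mobile_phone": "555-0102"},
    {"account_id": "A3", "tax_code": "T9", "mobile_phone": "555-0103"},
]
mandatory = ["account_id", "tax_code"]
optional = ["mobile_phone"]

def fill_rate(field: str) -> float:
    """Fraction of records in which the field is populated."""
    return sum(r[field] is not None for r in records) / len(records)

for field in mandatory + optional:
    kind = "mandatory" if field in mandatory else "optional"
    print(f"{field:14s} ({kind}): {fill_rate(field):.0%} populated")

# A mandatory field at 33 percent and an optional field at 100 percent is
# the kind of mismatch that suggests the business has changed since the
# requirements were written.
```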
Q. Is this another trend that you find when you go into most customer sites?
A. No. It’s not consistent; it’s just an anomaly. It doesn’t mean that anything bad has happened. It just means that people’s reliance on specific kinds of data has changed, which calls for some kind of adjustment in the way that they collect their data.
Q. You spoke earlier of “intelligent” archiving strategies. Are you talking simply about backing up to tape? If so, doesn’t that assume additional custodial costs and other expenses?
A. We’re able to put [their data] into some focus, so we’re kind of a magnifying glass to show people the issues they have surrounding the quality of their data. From there, they can choose to cleanse or repair [it].
Cleansing, by the way, is sometimes misunderstood. Where you have reference data, you can cleanse; where you have business rules, you can cleanse; but for a lot of data requirements you have to go back to the original document and look up the value that’s missing. What we help them do is narrow it down and put it into perspective. Sometimes you find data in files that is very suspect, but it’s also very old. If that old data is not critical to the business at this point, there are several questions you need to consider: why keep it online, number one, and why go fix it, number two, if it’s not going to be used? That’s what intelligent archiving is really about.
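The three cases Lair distinguishes can be sketched in a few lines. The example below is our illustration, with invented reference data and an invented business rule: one field is corrected from a reference table, one is repaired by a rule, and one can only be recovered by going back to the source document:

```python
# Hypothetical reference data: a correction table mapping common bad
# state values to valid codes, plus the set of valid codes themselves.
STATE_CORRECTIONS = {"VIRGINIA": "VA", "TENN": "TN", "MARYLAND": "MD"}
VALID_STATES = {"VA", "TN", "MD"}

def cleanse(record: dict) -> dict:
    """Cleanse one record; note anything that still needs a manual lookup."""
    issues = []

    # Case 1: reference data -- a bad state value is corrected if it appears
    # in the correction table, otherwise it is flagged as suspect.
    state = (record.get("state") or "").upper()
    if state in STATE_CORRECTIONS:
        record["state"] = STATE_CORRECTIONS[state]
    elif state not in VALID_STATES:
        issues.append("state is suspect and has no reference match")

    # Case 2: business rule -- an invented rule for illustration: a missing
    # country defaults to US.
    if not record.get("country"):
        record["country"] = "US"

    # Case 3: neither reference data nor a rule applies -- the missing value
    # has to come from the original source document.
    if not record.get("signature_date"):
        issues.append("signature_date missing; look up the source document")

    record["issues"] = issues
    return record

print(cleanse({"state": "Tenn", "country": "", "signature_date": None}))
```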
Q. Any final thoughts about data quality programs in IT organizations today?
A. Just this: A data quality program or initiative in an organization cannot be a hobby. It needs to be a recognized part of the organization that is funded, staffed, and given some level of authority to get its job done.
I recently had a conversation with a person who said, ‘We know our data is bad; why do we need you?’ My answer was, ‘If you know it’s bad, then why do you keep putting it into the system?’ In that particular case, we had been hired by someone else to provide them with service, and we provided that service and showed them very positive areas of improvement to be made. Our goal was to help them understand this whole process and to identify potential issues related to suspect data. It’s their data; if they want to call it bad, they can call it bad. But it doesn’t have to be bad.
About the Author
Stephen Swoyer is a Nashville, Tenn.-based freelance journalist who writes about technology.