Q&A: Data Profiling for Quality Assurance
Costs associated with poor data quality aren't immediately obvious to many companies, and current approaches (such as data cleansing) to improving quality fall short. Data profiling may be the answer.
The costs associated with poor data quality aren’t immediately obvious to many companies, according to Jack Olson, chief technology officer of data profiling specialist Evoke Software. Olson is also the author of Data Quality: The Accuracy Dimension, published last year by Morgan Kaufmann.
Olson suggests that common approaches to data quality, such as data cleansing, aren’t always enough. He champions a newer approach to data quality assurance, data profiling: an analytical process that effectively regenerates metadata from existing content.
Elsewhere, Olson says, more and more companies are starting to “get” the importance of data quality, and the field of data quality assurance could provide exciting employment opportunities for many IT professionals.
Why is data quality important? What form do data quality problems take, and how does this affect a company’s bottom line?
Fundamentally, there are a lot of ways of looking at it. One is the real dollars that companies lose every day because of flawed transactions, which people like [Larry] English and [Thomas] Redman and others [who have written about data quality] have spent a lot of time on.
In my view, what is becoming a bigger problem is that companies are moving toward using data to make corporate decisions more and more. And the quality of the data is such that either they’re reluctant to use it to make those decisions or they’re making wrong decisions. This is the business intelligence side, the data warehousing side, where they’ve got this wonderful store of data that they hope to get a lot of value from, and that is a very important cost. It’s very difficult, if not impossible, to measure that.
There’s another huge cost that’s not identified very much: how quickly can a company respond to new business models? For example, I’m a bank and I buy another bank. Becoming one bank means merging our big processing systems, and if that takes one year, I get a lot better benefits than if it takes five years. Data quality plays a very important part in frustrating these attempts to finish these big projects.
If you scan the resume postings at Monster.com, you’ll see some folks who are touting ETL experience, report development experience, OLAP and analytic experience. Perhaps I’m not looking hard enough, but I haven’t seen much in the way of resumes trumpeting data quality experience. Does this say anything about how aware corporations are of the problem?
Absolutely. It’s really interesting from our [Evoke Software’s] point of view, in that we still have to fundamentally pitch the productivity side [i.e., data profiling as a tool to reduce the time and cost of projects] when we’re talking with customers. Corporations even today do not put value in data quality until they actually see the errors in their data and sit down and cogitate on it and say, "We’re really losing a lot of money!" But they will put value on a data warehouse project that they’re projecting to spend $150 million on.
So companies aren’t getting the message that poor data quality is costing them money?
They’re starting to. Basically, companies had to fall on their faces before they would become aware that they have some fundamental problems in their operational systems that inhibit them from getting that value out of the data. Data quality is of course a major part of that, and they have to get over that before they start to get major quality out of their data.
You talk about data quality initiatives in which companies actually form groups to determine potential quality issues with their data and propose potential remedies. How widespread is this practice?
It’s gotten much better over the last five years, actually. If you had gone around five years ago and counted the companies that had made improving data quality an important job assignment, you would have found that major corporations, including IBM, did not have anyone assigned to that task. But in the last two years, it’s going on everywhere: [Companies are] bringing in people like DBAs, data architects, and business architects, putting them in [data quality assurance] groups, and saying, "Your job for the next few years is to see if we can get a data quality initiative going." The timing didn’t work out too well because of the recession, which of course affected those groups, so some companies have scaled these efforts back.
Will we see the formation of a distinct job position—say, director of data quality—with attendant sub-positions?
We still haven’t seen the emergence of a title. Larry English and [Richard] Wang have proposed something in that regard, but nothing has come of it yet.
How do organizations populate these groups? What kinds of results are the companies that invest in it [data quality assurance] seeing? Is this a good field for IT professionals to consider pursuing?
Right now, they’re typically people who are loaned to this group. Remember, these groups are one to four years old and they’re not really sure there’s going to be traction there. It’s like quality assurance in software 20 years ago—most people didn’t have such groups then, now everybody does, and now quality assurance is considered a career.
I think that will happen with data quality, too. We have auditors in accounting, we have quality control in manufacturing, we have quality experts in almost every phase of the business except for IT, and we even have that in software development with the quality assurance people. But there’s never been this kind of quality assurance on the data itself. So it will happen, and there will be a career path defined at some point in time.
You’ve written a book about how data profiling can be used as a tool to identify problems with mission-critical data and increase the accuracy of that data. First things first, however: What is data profiling? How is it different from, say, data cleansing? Are the two strategies [mutually] exclusive, or can they be used to complement one another?
I have a standard definition that has been used by virtually every company that’s built a competitive product, and the definition is that [data profiling is] the use of analytical technologies on data for the purpose of developing a thorough understanding of its structure, content, and quality.
Basically, the key to it is that you’re using the data itself; you’re trying to reconstruct the metadata, so you’re not generating anything [in an operational system]. You’re asking questions like, "Is this table normalized or denormalized?" and "What is the true data type of this column?" It may say that it’s an integer, but in fact they’re all dates. It’s going back to the content to reconstruct the metadata, because when you’re viewing the content, you can ask, "Does the actual data model I’m looking at match what the business people say is in there?"
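[Editor's illustration: The "integer column that is really all dates" case can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not how Evoke's product or any other profiling tool works. The function names, the two candidate date layouts, the 95 percent threshold, and the sample column are all hypothetical.]

```python
from datetime import datetime

# Hypothetical candidate date layouts; a real profiler would try many more.
DATE_FORMATS = ("%Y%m%d", "%m%d%Y")


def looks_like_date(value):
    """Return True if the value parses under any candidate date layout."""
    text = str(value)
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def infer_column_type(values, threshold=0.95):
    """Guess a column's type from its content, ignoring the declared type."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"
    # Share of non-null values that parse as dates (threshold is an assumption).
    date_share = sum(looks_like_date(v) for v in non_null) / len(non_null)
    if date_share >= threshold:
        return "date"
    if all(isinstance(v, int) for v in non_null):
        return "integer"
    return "text"


# Illustrative data: a column declared as INTEGER whose values are really dates.
hire_date = [20030115, 20021201, 19991231, None, 20010704]
print(infer_column_type(hire_date))
```

[The sketch only reads the values and reports a guess; nothing in the source column is changed. Comparing that guess with the declared type is one way to spot metadata that no longer matches the content.]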
How is data profiling different from data cleansing then?
Profiling generates only information, it doesn’t change data, it doesn’t fix things, it only tells you about your data. It gives you a good metadata repository. It’ll give you a good business understanding of the data, but that’s it.
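[Editor's illustration: The read-only nature of profiling can be shown with a small Python sketch. It assumes a column held as a plain list; the function name, the report fields, and the sample values are hypothetical and are not taken from any vendor's product.]

```python
from collections import Counter


def profile_column(values):
    """Summarize a column's content without modifying it."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }


# Illustrative data: a state-code column with a null, a lowercase
# variant, and a code that is not a real state.
states = ["TX", "TX", "CA", None, "tx", "ZZ"]
report = profile_column(states)
print(report)
```

[The report surfaces the suspicious values ("tx", "ZZ") and the null, but the list itself is left exactly as it was. Deciding what to do about them is a separate step.]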
Cleansing tools will change your data. Companies such as Trillium, Ascential, and FirstLogic are the ones that have those products: they will find mistakes, they will correct mistakes, they will do de-duplication. The thing about data cleansing is that it is almost exclusively used on name and address fields, which are invaluable to CRM applications, because CRM depends on the accuracy of this data. But people need to be aware that [name and address fields aren’t] the only place where they have bad data. I’m shocked when executives at companies say that they’re cleaning their data, and then tell me that they use something like Trillium, and I say, "Well, Trillium only does name and address cleaning. What are you doing about the rest of your data?"
You’ve described data profiling as an “emerging” choice for dealing with data quality problems. With this in mind, how would you describe the maturity of the data profiling tools that you [Evoke Software] and others provide?
I believe that there’s a lot of headroom left, that there’s a lot of code that needs to be written. Our product suite is the most comprehensive, and I know of no product that doesn’t have the functions we have. But I’ve got a long list of things that I need to write [for inclusion in the suite], because I’ve pulled them out of customer examples. So by no means is it mature. It’s nowhere near as mature as the relational database technology, for example, but [relational databases] didn’t mature until the mid-90s, and we’re like in 1985 for profiling. We got the concept down, we got it accepted, some of the real low-hanging fruit is covered by near everybody, but there’s a lot of function growth that needs to be done. As these [data quality assurance] groups in organizations start maturing, they will discover even more things for us to do.
Let’s say an organization implements a data quality program that keys heavily on data profiling. Is this enough to vouchsafe the integrity and accuracy of their data? Must they do still more?
In my book, I basically said, there are two views. There’s this data quality [assurance] group—where you look at the data and learn what you can from the data—and there's business process analysis, where you study how the data is collected, if it’s a bad [collection] practice, things like that. Neither one alone is going to get you everything you want to know.
A data quality group needs to be able to use both tools side by side and to use them against each other, so when I find quality problems through profiling, I need to go analyze the business processes. If I’m starting with business processes and I find bad practices, I need to go back and analyze that data and see what I can learn. And don’t forget cleansing! Although it isn’t possible [to use data cleansing] in many cases, it should certainly be used wherever possible.