Q&A: Governance Challenges of Unstructured Data

The growth of unstructured data presents special governance challenges, but they can be managed, as can other data governance challenges.

"In the last year or so, governance of unstructured data has probably been the highest-growth area in governance," says Daniel Teachey of DataFlux. Unstructured and semi-structured data can be governed and managed, he explains, but "the technology needs to be as smart as the different types of data it's being forced to handle."

In this interview, the second of two parts, Teachey discusses how unstructured data can be handled, along with how long it takes to realize a return from a governance program.

Teachey is senior director of marketing at DataFlux, where he manages global marketing efforts along with PR, product marketing, and customer relations. He joined DataFlux in 2003, having held positions with IBM, MicroMass Communications, and Datastream Systems.

Along with TDWI research director for data management Philip Russom, Teachey spoke on data governance at a TDWI Webinar on Sept. 14, "Lifecycle Stages for Data Governance: Plan Ahead for Success and Sustainability."

BI This Week: One issue that was discussed in the Webinar was unstructured data. What challenges does unstructured data present when it comes to data governance?

Daniel Teachey: In the last year or so, governance of unstructured data has probably been the highest-growth area in governance. Let's take a financial example. Look at something that's, shall we say, quasi-structured, such as a financial trade between a buyer and seller on the stock market, or a transfer of money between international banks. These typically involve very specialized unstructured or semi-structured documents or files. They are being exchanged back and forth, and they need to include elements that can be used by data quality technology to manage them -- to understand that the routing number here is equal to the routing number there, or that the intended target bank is [correct].

The issue with unstructured and semi-structured data is the extent to which a technology can locate the pieces of information it needs, understand what it has, and then represent that within a governance technology structure. The systems need to be able to flag problems and alert compliance officers, for example, even if we're talking about an XML document that really isn't structured like a database or an application. There are triggers you can look for, such as routing numbers. The technology needs to be as smart as the different types of data it's being forced to handle.
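Editor's illustration: the idea of scanning a loosely structured XML document for a trigger such as a routing number, and flagging a mismatch between two related documents, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of how DataFlux or SAS technology works: the function names, the sample documents, and the rule that any standalone nine-digit run counts as a candidate routing number are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a standalone run of exactly nine digits is a candidate routing number.
ROUTING_PATTERN = re.compile(r"\b\d{9}\b")


def extract_routing_numbers(xml_text):
    """Collect nine-digit candidates from element text and attribute values."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        values = [element.text or ""] + list(element.attrib.values())
        for value in values:
            found.extend(ROUTING_PATTERN.findall(value))
    return found


def flag_mismatches(instruction_xml, confirmation_xml):
    """Return routing numbers that appear in only one of the two documents."""
    sent = set(extract_routing_numbers(instruction_xml))
    confirmed = set(extract_routing_numbers(confirmation_xml))
    return sorted(sent ^ confirmed)


# Hypothetical documents: the number sits in free text in one, in an attribute in the other.
instruction = "<transfer><note>Send to bank routing 021000021 today</note></transfer>"
confirmation = '<ack target="021000012"/>'

for number in flag_mismatches(instruction, confirmation):
    print("Flag for compliance review:", number)
```

The sketch searches every element and attribute rather than a fixed path because, as the answer notes, such documents are not structured like a database; a non-empty result is what would be routed to a compliance officer.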

There are a number of different ways to do that. Our parent company, SAS, has done [extensive work] in text monitoring and text mining, and we're starting to integrate more of that into our technology, because that's where the intelligence is. That's how you can make sense of things. There are these gigantic libraries -- how do you find an invoice in a PDF or Word document? How do you find a signatory in a contract? A lot of that has already been hashed out, and we can use that as pointers within our technology to figure out, OK, that's where the signature was, that's where the invoice number is, and add some additional detail to a record that we already have.

How long before a company sees results from implementing data governance in a pilot area?

You should start to see the first elements within months. If you take the approach of a rigorous assessment and look for the areas that are the highest need, and if you find staff that is truly behind the project, it takes just weeks, maybe months, before you start to realize that something was just really out of whack and now you've nudged it, if you will, into a better state.

What often gets left out -- not because people don't want to do it, but because they just don't know how -- is assigning a monetary or tangible value to the work that they do on a data governance team. Another part of [an initial] assessment is putting some expectations in place, and some metrics. They might say, for example, that it costs x dollars per hour to fix this right now. If you can lower that, it's part of your ROI calculation.

It may take you a bit more time up front, but again, you're then able to demonstrate value or validate the work that you're doing. That can't be shortchanged, especially in this economy. Every dollar that goes out is an expenditure. You have to understand that executives want to know what that's bringing back to the organization.
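Editor's illustration: the cost-per-hour metric described in this answer can be turned into a simple ROI figure. The sketch below is one plausible way to do the arithmetic, not a formula given in the interview; every number in it (hours, hourly cost, program cost) is invented for illustration, and the ROI definition used, net savings divided by program cost, is an assumption.

```python
def annual_fix_cost(hours_per_month, cost_per_hour):
    """Yearly cost of manually fixing data problems."""
    return hours_per_month * cost_per_hour * 12


def governance_roi(baseline_cost, improved_cost, program_cost):
    """Assumed definition: (savings - program cost) / program cost."""
    savings = baseline_cost - improved_cost
    return (savings - program_cost) / program_cost


# Hypothetical figures: 200 hours a month before the pilot, 120 after, at $50 an hour.
before = annual_fix_cost(hours_per_month=200, cost_per_hour=50)
after = annual_fix_cost(hours_per_month=120, cost_per_hour=50)
roi = governance_roi(before, after, program_cost=30_000)

print(f"Annual savings: ${before - after:,}")
print(f"ROI: {roi:.0%}")
```

Capturing the "before" figure during the initial assessment is what makes the comparison possible later.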

Is the goal here to eventually govern all data throughout the enterprise?

I'd say that it's governing the data that makes sense to govern. You may have mainframes full of historical data created back in the '70s or '80s. It serves a historical purpose, but maybe that's not something you concentrate on, or maybe it is -- it depends on the type of company you are.

The key thing is not creating data that is of high quality for the sake of creating high-quality data. There is usually a business goal. Can we optimize the process of bringing a new customer on board? Can we optimize the procurement process, so that we know whether we are within contracts? Good data supports the day-to-day decisions that are the backbone of the business processes that drive every company's invoicing, procurement, sales, and marketing campaigns.

I guess the Shangri-la here would be if data governance provides the intellectual fabric behind all these things, and there is an understanding and a cohesion between sales, marketing, support, procurement, supply chain, and administration that we're all on the same page. If we've all come up with the same definitions for key data assets -- customer, supplier, physical assets, inventory, and so forth -- then all that can be pulled through [data governance] to optimize the processes. It will make you a more nimble company, you'll be more competitive, and you can get products to market faster. Bottom line: you can leverage your customers and treat them better, so you can have a higher lifetime value.

All the fun, buzz-worthy things can flow from that, but you first have to create sort of a foundation of good, clean, reliable, accurate data to support that, and as much as that can fall under the data governance umbrella, the better.

Remember, you don't just collect data as if you're a stamp collector. There's a reason you're collecting it -- to inform the business and make the business more knowledgeable about certain elements. Governance provides a framework to do that more accurately. To whatever extent you want to use it, it can be very valuable.

How useful are software tools in helping to automate data governance and to make it easier to implement? What does DataFlux bring to the table?

The software tools are incredibly important. ... Five or six years ago, a company might hold data governance meetings for 18 to 24 months before they would bring in a technology, because at that point [tools like ours were] typically just monitoring the business rules they had come up with.

What's changed is that our technology now allows you to do things like create a glossary of business terms so that when you specify "derivative trade," a number of elements have to be included. The business user can work directly with our tools to infuse a structure. ...

Providing that framework can inform every other part of the process. Whenever IT creates a specific job for an application using a derivative trade, it will have the same elements pull through from the work done by a business user at the front end. That's a big change.
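Editor's illustration: a business glossary in which a term such as "derivative trade" carries a list of required elements can be modeled as a simple lookup that downstream jobs check records against. This is a hedged sketch only, not DataFlux's data model; the field names (`trade_id`, `counterparty`, `notional_amount`, `settlement_date`) and the function name are hypothetical.

```python
# Assumption: the business user defines each term once, with its required elements.
GLOSSARY = {
    "derivative trade": {
        "trade_id",
        "counterparty",
        "notional_amount",
        "settlement_date",
    },
}


def missing_elements(term, record):
    """List the glossary-required elements that a record lacks or leaves empty."""
    required = GLOSSARY[term]
    present = {key for key, value in record.items() if value not in (None, "")}
    return sorted(required - present)


# A record produced by a hypothetical IT job that uses the same definition.
record = {
    "trade_id": "T-1",
    "counterparty": "Bank A",
    "notional_amount": 1_000_000,
    "settlement_date": "",
}

print(missing_elements("derivative trade", record))
```

Because every job reads the same glossary entry, a change made by the business user at the front end carries through to each check without the rule being redefined in each application.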

What we're seeing more often is DataFlux technology being used early in data governance. During implementation, our data profiling and data analysis capabilities prove out that there is a big problem there, and huge room for improvement.

The technology isn't just fixing the problems, it's showing that you can hold people accountable and validate that processes are, in fact, improving. That's huge. That means data governance can become part of the day-to-day work of an organization, not just a separate thing that they do every other Thursday. That's a huge change and an area in which we've seen a lot of success.