A Prescription for Public-Sector Big Data Woes

Instead of just reporting on their data, shops should be exploring it. This is particularly true in government, where very large data sets can benefit from cutting-edge visual data exploration technologies.

Editor's note: Explore BI and DW in the public sector at TDWI's 2011 Government Summit being held this April in Crystal City, Virginia. For details, visit http://tdwi.org/dc2011.

For Very Big Data shops -- in either the public or private sectors -- data exploration can be a first step in the direction of getting one's data management house in order. It can likewise yield high-value insights along the way.

At a basic level, says Andrew Cardno, chief technology officer (CTO) with data visualization specialist BIS2 Inc., information analysis is a pretty similar proposition in both private- and public-sector shops.

"They're very similar problems, actually. [In both cases] you have a massive data out-load, [while] the analytical techniques that are being applied mostly focus on small pieces of data or [on] known trends in the data and things like that, instead of high-dimensional exploration," he says.

"The comparison I'll often use is that when you have known data and known relationships, you can do reports. When you have known data and unknown relationships, you're doing data exploration."

Reporting, Cardno says, is where most shops are. Data exploration, on the other hand, is where most shops need to be. "When you're doing data exploration, you really need to be able to see more dimensions," he explains, arguing that the ability to intelligently display data in several dimensions should likewise distinguish today's cutting-edge visualization technologies.

Cardno cites a visual exemplar -- namely, Charles Minard's chart of the French Grande Armee's disastrous invasion of Russia -- that's a favorite among data visualization advocates. "In this graphic, there are something like six or seven dimensions of data. You get the whole picture. You see how it works. Any questions you ask after looking at that graphic are in the context of having understood the whole picture," he explains.

"It's a common problem for anyone who has a master set of data: the value is not normally in the things you know about it; the value is in the things you don't know about it, [in] the things that you haven't yet found."

Out of Order

How does a shop -- public- or private-sector -- achieve Minard-like visual displays when its internal data integration plumbing is, in many cases, largely siloed, unprofiled, uncleansed, unstandardized, and often not entirely structured? If private-sector shops are behind the curve when it comes to practices such as enterprise information integration, enterprise-wide data quality, or master data management, how are public-sector organizations doing?

Consider the case of a consultant with a prominent government services firm. This person tells a story about one of his clients: a government agency that's preparing a large multi-year study involving tens of thousands of different "widgets." Right now, this consultant says, this agency is in the process of populating its database of prospective widgets. At the outset, it's collecting data -- via a multi-page information form -- about all of the widgets that have signed up to participate in its study. How could this agency determine that the information it was feeding its database was both consistent and accurate?

Its solution is at once simple and breathtakingly ham-fisted: the agency has tasked two human beings with manually entering the same information into its repository. It plans to compare the two sets of entries to ensure the data is "accurate."
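The agency's approach amounts to old-fashioned double-key data entry. As a rough illustration only -- the agency's actual system isn't described -- the comparison step might look something like the following sketch, which assumes two independently keyed record sets sharing a hypothetical "widget_id" field:

```python
# Minimal sketch of double-key entry verification (all field names are
# hypothetical). Two clerks key the same intake forms; any field-level
# mismatch is flagged for a human reviewer rather than silently trusted.

def find_discrepancies(entries_a, entries_b, key="widget_id"):
    """Compare two independently keyed record sets and report mismatches."""
    by_key_a = {rec[key]: rec for rec in entries_a}
    by_key_b = {rec[key]: rec for rec in entries_b}
    discrepancies = []
    for widget_id in sorted(set(by_key_a) | set(by_key_b)):
        rec_a, rec_b = by_key_a.get(widget_id), by_key_b.get(widget_id)
        if rec_a is None or rec_b is None:
            discrepancies.append((widget_id, "missing in one entry set", None, None))
            continue
        for field in sorted(set(rec_a) | set(rec_b)):
            if rec_a.get(field) != rec_b.get(field):
                discrepancies.append((widget_id, field, rec_a.get(field), rec_b.get(field)))
    return discrepancies

clerk_one = [{"widget_id": 101, "name": "Widget A", "weight_kg": 1.2}]
clerk_two = [{"widget_id": 101, "name": "Widget A", "weight_kg": 1.5}]
for row in find_discrepancies(clerk_one, clerk_two):
    print(row)   # (101, 'weight_kg', 1.2, 1.5) -> route to a reviewer
```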

This is a hyperbolic case, but it gets at a fundamental problem: if cutting-edge analytics or analytic technologies depend on reliable (and increasingly timely) feeds of clean, consistent, and accurate data, don't shops first have to get their data management houses in order before they step up to advanced analytic technologies, to say nothing of cutting-edge data visualization?

The answer, according to Cardno, is both yes and no.

If an organization wants to make effective use of most data visualization tools, he argues, it should expect to have a good understanding of its data.

This isn't necessarily a self-serving statement. According to business intelligence (BI) and data warehousing (DW) thought-leader Mark Madsen, a principal with consultancy Third Nature, Inc., data visualization tools tend to be as sensitive to the timeliness, accuracy, or consistency of data as any other analytic technology. "If you want to work on really large data sets [with most of these tools], you have to summarize them first. They use really smart techniques on the front end married to archaic plumbing from the 1980s and 90s on the back end," he says.
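Madsen's "summarize them first" point is the familiar pre-aggregation step: rather than feeding a charting front end every raw row, the data is grouped and reduced before it ever reaches the tool. A minimal pandas sketch, with illustrative column names (region, event_date, amount) standing in for whatever the real feed contains:

```python
# Sketch of the "summarize first" step: pre-aggregate raw rows before
# handing them to a visualization tool. Columns are assumed for illustration.
import pandas as pd

raw = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "event_date": pd.to_datetime(
        ["2011-01-03", "2011-01-04", "2011-01-03", "2011-02-01", "2011-02-02"]),
    "amount": [120.0, 80.0, 200.0, 50.0, 75.0],
})

# Roll millions of detail rows up to one row per region per month -- a data
# set small enough for a conventional charting front end to render.
summary = (raw
           .assign(month=raw["event_date"].dt.to_period("M"))
           .groupby(["region", "month"])["amount"]
           .agg(["count", "sum", "mean"])
           .reset_index())
print(summary)
```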

On the other hand, Cardno argues, some kinds of visual analytic technologies aren't as sensitive to the consistency or the quality of the data they're consuming. This is often the case with visualizations that involve huge data sets, he says.

Cardno's company, BIS2, specializes in visualization problems involving huge data sets. It positions Cardno's Super Graphics as a prescription for the limitations of so-called "traditional" data visualization technologies. "The nature of those [traditional] graphics -- with some exceptions like maybe scatterplots -- is that they depend on knowledge of the dimensionality of the data. They have a certain kind of expectation or understanding of information that normally comes with them. It's a very broad generalization," he maintains. "The nature of the SuperGraphic is that you can comprehend vast amounts of data in different dimensions at a glance."

Cardno and BIS2 aren't without their detractors. Data visualization thought-leader Stephen Few, for example, has described Super Graphics as "dysfunctional visualizations." Few has argued that Cardno's Super Graphics distort or produce "inaccurate representation[s]" of source data.

Cardno doesn't dispute this. He cites one of Few's criticisms -- the claim that Super Graphics distort the time dimension by depicting (in one example) the most recent calendar year as being of longer duration than other years -- as a case in point. On anything but a quantum scale, 2011 isn't longer or shorter than 2010; but the distortions of the Super Graphic -- which are a product of the circular visual metaphor that Cardno employs -- help inform in part because they distort.

Cardno has his detractors, but he also has plenty of defenders. At last month's TDWI Winter World Conference in Las Vegas, for example, Madsen described BIS2's Super Graphics as "hyper-advanced." He and co-presenter Jos van Dongen (a principal with Dutch BI and DW consultancy Tholis) invited Cardno to demonstrate Super Graphics to attendees of their day-long BI and DW technology course.

Madsen endorses at least one of Few's criticisms -- namely, that Super Graphics aren't immediately intuitive -- even as he suggests that Cardno's critics are missing the point. "I've had my criticisms with his interface in that [Super Graphics are] not necessarily immediately intuitive; you first kind of have to learn what technique is being used before you can apply it," Madsen comments. "[Super Graphics] are geared toward a very specific kind of problem which is more than uni-dimensional or two-dimensional data, and very large sets of data being viewed in their entirety at once. It's really a question of data exploration."

Unlike most BI technologies (and even many data visualization tools), BIS2's Super Graphics are designed to be highly interactive. "It's fully-interactive over the data set, so you're showing on some of those interfaces several hundred million data points and, say, four or five dimensions of data," Madsen notes. "[Proponents of] conventional data visualization [tend] to think of reduced data sets, non-interactive viewing, and fairly simplistic techniques. That's great for the basics, but there are problem sets for which that doesn't work."
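This is not BIS2's technique -- the company doesn't publish its internals -- but one common way to view a very large point set "in its entirety" is to aggregate every point into a fixed grid and encode an additional dimension as color, rather than sampling or summarizing away the detail. A rough sketch using synthetic data:

```python
# Illustrative only: bin every point of a large synthetic data set into a
# 400x400 grid and show a third dimension (mean value per cell) as color.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 2_000_000                                    # stand-in for a very large point set
x, y = rng.normal(size=n), rng.normal(size=n)
value = x * y + rng.normal(scale=0.5, size=n)    # a third, measured dimension

# Aggregate all points: sum of values and count per grid cell, then the mean.
counts, xedges, yedges = np.histogram2d(x, y, bins=400)
sums, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=value)
mean_per_cell = np.divide(sums, counts,
                          out=np.full_like(sums, np.nan), where=counts > 0)

plt.imshow(mean_per_cell.T, origin="lower", cmap="viridis",
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.colorbar(label="mean value per cell")
plt.title("Every point aggregated; third dimension encoded as color")
plt.show()
```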

Tailor-made for the Public Sector?

The most obvious application is data exploration involving very large data sets. At present, BIS2's biggest reference customers are concentrated in two Very Big Data market segments: the gaming industry -- in which casinos are sifting through hundreds of terabytes of data about individual slot machines (involving multiple dimensions such as time, location, volume, and profit/loss), dealers, individual gamblers, and so on -- and the airline industry.

Cardno says his Super Graphics techniques are applicable to other markets, too -- including retail, financial services, and, of course, government.

He'll have a chance to make his case at TDWI's upcoming Government Summit, where he'll be discussing advanced data visualization with friend and frequent collaborator Stephen Brobst -- CTO of Teradata Corp. -- in a presentation geared for public-sector attendees.

Cardno doubts that government shops are any further behind the curve on data management than their private-sector counterparts. At the same time, he argues, different kinds of problems call for different data visualization techniques.

For Very Big Data shops in either sector, he maintains, data exploration itself can be a first step toward getting one's data management house in order -- and it can yield high-value insights along the way.

"No matter how hard you try, when it comes to massive amounts of data entry, the data is always going to be error-prone. It may have large systematic errors, it may have errors in unexpected places. It's actually hard to anticipate just how or where [errors] will occur," he says. "You need the data to have a feedback loop. The day that people can see the data and understand the data and interact with it, is the day that data can start to have quality. Before you've done that, you don't know anything about them."