Data Deluge or Opportunity?

Why data aggregation, data exhaust, and metadata are the fundamental building blocks of tomorrow’s business model for organizations of nearly any size.

by Brian Gentile

The topic of very large (Hadoop-class) data sets comes up more regularly nowadays. I end up having a similar conversation about once per week and it goes something like this: “How many Google-, eBay-, or Amazon-sized data sets are out there anyway? Seems like very big data and Hadoop usage would be pretty limited, right?”

When this question is posed, I respond: “Right now it may be difficult to imagine data sets of this type and size becoming both common and in need of analysis, but let’s consider what’s happening in the world of the connected Web.” Then I proceed to discuss data aggregation, data exhaust, and metadata as the fundamental building blocks of tomorrow’s business model for organizations of nearly any size.

I describe these as three remarkably powerful data forces, and they arise mostly from the vast variety of new, Web-based applications and services. Further, they provide the backdrop to the enormous expansion of digital data that has been underway for more than a decade, with market research firm IDC estimating that 1,200 exabytes of data will be generated this year alone.

A recent special report in The Economist, entitled Data Deluge, helped both to solidify and to extend my observations of these data forces. Like so much of the firm’s research, McKinsey’s newest report, Clouds, Big Data, and Smart Assets: Ten Tech-enabled Business Trends to Watch, provides a grounded summary of some of the possibilities these powerful data forces create. I’ll cite both reports as I consider each data force in turn.

Data Aggregation

In this context, “data aggregation” refers to combining related data into a useful format that reveals patterns and otherwise imperceptible insight. Data is often nearly useless until aggregated. In the case of Web-generated data, volume matters, so aggregation is critical.

My favorite example of data aggregation is eBay because it makes the value created from aggregate data so clear. Before eBay, how would you calculate or even estimate the market value of an old, brass flugelhorn? You couldn’t, of course, because although the data might exist, it wasn’t in any way usable. Today, in about three minutes, you can learn that 10 used brass flugelhorns have sold via auction in the last two days for between $400 and $600. If you learned anecdotally that a neighbor sold his brass flugelhorn three months ago for $300, that single data point wouldn’t tell you much. Knowing that 10 sold in the last two days within a given price range is valuable information. This is data aggregation, and its applicability extends far beyond eBay.
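
As a rough illustration of why ten sales say more than one, the sketch below summarizes a list of completed-sale prices into a count, a range, and a median. The prices are invented for the example, and `summarize_sales` is a hypothetical helper, not anything eBay exposes.

```python
from statistics import median


def summarize_sales(prices):
    """Summarize completed-sale prices for one kind of item."""
    if not prices:
        raise ValueError("need at least one sale to summarize")
    return {
        "count": len(prices),
        "low": min(prices),
        "high": max(prices),
        "median": median(prices),
    }


# Hypothetical closing prices (USD) for ten used brass flugelhorns.
recent_sales = [400, 425, 450, 475, 500, 510, 540, 560, 580, 600]
summary = summarize_sales(recent_sales)
print(summary)

# A single anecdotal sale yields a "range" with no spread at all.
anecdote = summarize_sales([300])
print(anecdote)
```

With one data point the low, high, and median collapse to the same number; only the aggregate gives a usable price range.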

In the normal course of business, many organizations can create, deliver, and even be paid for the value in their aggregate data. The key is efficiency. One of Jaspersoft’s customers, Monolith Software Solutions, provides a SaaS-based analytic application for the quick-serve restaurant industry. Its software delivers analytic insight using a variety of point-of-sale data about the operational health of any given restaurant. Although the individual restaurant information helps the restaurant manager, the franchise-wide information aids the business unit manager in making informed decisions across the entire business.

Monolith combines all the data from individual groups of stores to deliver analysis across commonly owned locations. Even further, if it aggregated information from all of its quick-serve restaurant customers, even in an anonymous format, the result would provide a fascinating view into the trends and influences affecting that particular corner of the restaurant industry.
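
The three levels of roll-up described here (single store, commonly owned group, anonymous industry-wide view) can be sketched in a few lines. Everything below is an assumption made for illustration: the record layout, the store and owner names, and the `roll_up` helper are hypothetical and do not describe Monolith’s actual software.

```python
from collections import defaultdict


def roll_up(records, key):
    """Total the 'sales' field of each record, grouped by the given key."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["sales"]
    return dict(totals)


# Hypothetical daily point-of-sale totals (USD) for three restaurants.
records = [
    {"store": "S1", "owner": "GroupA", "sales": 1200.0},
    {"store": "S2", "owner": "GroupA", "sales": 800.0},
    {"store": "S3", "owner": "GroupB", "sales": 1500.0},
]

by_store = roll_up(records, "store")   # what a restaurant manager sees
by_owner = roll_up(records, "owner")   # what a business unit manager sees
# Anonymous industry view: no store or owner identifiers survive.
industry_total = sum(record["sales"] for record in records)

print(by_store, by_owner, industry_total)
```

The same records feed all three views; only the grouping key changes, which is why the broader views cost little to produce once the store-level data is collected.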

In this sense, Monolith should consider the value it could create by analyzing and making available this aggregate data. To monetize this kind of data, McKinsey suggests taking an inventory of all the data in a company’s business and then asking, “Who might find this information valuable?” The response should provide possibilities for disruption and opportunities for new businesses.

Data Exhaust

Data exhaust is the trail of digital data that people create as they move through their lives, especially the clickstreams and digital artifacts generated through all types of Web activity. It is also sometimes referred to as our “digital shadow.” Often this means a large number of log files, but this exhaust also includes offline or “environmental” data generated by, for example, security cameras or sensors designed to track human activity. The bits of data exhaust left behind, analyzed either individually or in aggregate, can tell a powerful story. According to the Economist article, the amount of data exhaust generated by individuals this year will surpass the amount of data they generate intentionally. It’s almost unbelievable.

In a sense, the Google search engine gains its power and precision from data exhaust. Early on, Google developers realized that search results could be made more useful by tracking the clickthroughs of those who performed the same search in the past. With each search and each link clicked from the search results, future searches become more accurate and useful. This is a clever use of data exhaust and “recursive” analytics for everyone’s benefit.
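
The feedback loop can be sketched in miniature. The code below is a toy under stated assumptions, not Google’s ranking method: the `ClickFeedbackRanker` class, the query, and the URLs are all hypothetical, and real search engines weigh many more signals than raw click counts.

```python
from collections import defaultdict


class ClickFeedbackRanker:
    """Toy re-ranker: results clicked more often for a query move up."""

    def __init__(self):
        # clicks[query][url] -> number of past clickthroughs
        self.clicks = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, url):
        self.clicks[query][url] += 1

    def rerank(self, query, results):
        """Sort by past clicks, descending; ties keep their original order."""
        counts = self.clicks.get(query, {})
        return sorted(results, key=lambda url: -counts.get(url, 0))


ranker = ClickFeedbackRanker()
results = ["a.example", "b.example", "c.example"]

# Earlier searchers leave exhaust behind: which links they clicked.
for url in ["c.example", "c.example", "b.example"]:
    ranker.record_click("flugelhorn", url)

reranked = ranker.rerank("flugelhorn", results)
print(reranked)
```

A query with no click history comes back in its original order, so the exhaust only reshapes results where earlier users have left a trail.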

Also, some major cities are now using “smart assets” to monitor and manage traffic congestion, including sensors in mass transit systems (trains and buses) that enable real-time status reports and optimized routing at critical commute times. In law enforcement, police are using digital video cameras and data analytics to monitor city sectors, looking for hot spots of criminal activity that then trigger additional resources to be deployed. This environmental data, when analyzed properly, holds both great promise and potential problems (for privacy, mainly) as it is generated in larger volumes.

Some of my favorite and most modern examples of cleverly using data exhaust come from online gaming. In particular, Facebook games have garnered an enormous following, with one of the Web’s most popular game environments, Zynga, supporting millions of users through 20 different games every day. Within these social media platforms (such as Facebook), playing the game is free. To advance and win at different levels, though, a player must make purchases, even of small items, to become more successful. These small purchases are the monetization mechanism for the gaming companies.

The thousands of small nuances a gaming company can introduce within the game itself, each of which might make a user more likely to purchase something, can quickly add up. The most successful gaming companies are using the clickstream exhaust from their online gaming audience to learn the patterns, profiles, and pathways taken most frequently through the game, searching (in near real time) for ways to tune the game and encourage more purchases at each step along the way. In effect, the behavior of the morning’s gamers makes it more likely that the afternoon’s gamers will purchase the pick-ax or feed grain necessary to advance within their favorite game. These online gaming companies are more likely to use open source and cloud-based business intelligence because such tools are a natural fit for their development and analytic environments.
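
One simple way to mine that clickstream is to ask, for each in-game event, how often the very next event was a purchase. The sketch below does exactly that; the event names, the sessions, and the `purchase_rate_by_previous_step` helper are hypothetical and stand in for whatever a real game would log.

```python
from collections import Counter


def purchase_rate_by_previous_step(sessions, purchase_event="purchase"):
    """For each event, the share of times the next event was a purchase."""
    seen = Counter()
    followed_by_purchase = Counter()
    for events in sessions:
        # Walk each session as (current event, next event) pairs.
        for current, following in zip(events, events[1:]):
            if current == purchase_event:
                continue
            seen[current] += 1
            if following == purchase_event:
                followed_by_purchase[current] += 1
    return {step: followed_by_purchase[step] / seen[step] for step in seen}


# Hypothetical morning sessions, one list of events per player.
sessions = [
    ["login", "plant_crop", "low_energy_prompt", "purchase"],
    ["login", "low_energy_prompt", "purchase", "plant_crop"],
    ["login", "plant_crop", "logout"],
    ["login", "low_energy_prompt", "logout"],
]

rates = purchase_rate_by_previous_step(sessions)
for step, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{step}: {rate:.0%}")
```

In this invented data the low-energy prompt precedes a purchase two times out of three, which is the kind of pattern a company might act on before the afternoon’s players arrive.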


Metadata

Metadata is contextual information about the data that provides richer possibilities for connections and exposes possible interrelationships among the data. Data aggregation and data exhaust have limited possibilities without metadata. BI systems without metadata would have limited semantic elegance, preventing self-service capabilities from flourishing in any organization.

Without richer semantic definition than exists in practice today, the entire Web is bound by significant constraints that will be exposed more dramatically as Web-based data volume, variety, and velocity increase. Experts agree that more unified metadata standards are necessary to allow connections among the expected future volumes of Web data, transforming its usefulness in powerful ways. In this sense, any discussion about aggregated, exhaust-filled, big data requires lessons in next-generation metadata.

Allen Cho’s article on the semantic Web -- The Semantics of the Dublin Core – Metadata for Knowledge Management -- provides a glimpse into the importance of metadata to next-generation, distributed business intelligence. “[Semantic Web] will be significant in the future of B2B, particularly since metadata plays a critical role in investments in data warehousing, data mining, business intelligence, customer relationship management, enterprise application integration, and knowledge management,” he wrote.

More specifically, linked data and greatly improved open-standards-based metadata provide the possibility for an explosion in the amount and sophistication of reporting and analytics on previously unrelated Web-based information. Forget mash-ups at the presentation level (as useful as they are). Web tools that implement richer, unified metadata standards should allow automatic discovery, interrogation, and integration of data for a variety of reporting and analytic uses.
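
To make “automatic discovery” concrete, here is a minimal sketch that assumes every data set is described with a few Dublin Core-style elements (`identifier`, `title`, `subject`). The catalog entries and the `link_by_subject` helper are invented for the example; a real linked-data system would use richer vocabularies and proper URIs.

```python
from collections import defaultdict


def link_by_subject(catalog):
    """Map each shared subject to the data sets that declare it."""
    index = defaultdict(list)
    for record in catalog:
        for subject in record.get("subject", []):
            index[subject].append(record["identifier"])
    # Keep only subjects that connect at least two data sets.
    return {subject: ids for subject, ids in index.items() if len(ids) > 1}


# Hypothetical catalog from three unrelated publishers.
catalog = [
    {"identifier": "auctions-2010", "title": "Completed auction prices",
     "subject": ["musical instruments", "prices"]},
    {"identifier": "school-bands", "title": "School band enrollment",
     "subject": ["musical instruments", "education"]},
    {"identifier": "transit-delays", "title": "Bus and train delays",
     "subject": ["transportation"]},
]

links = link_by_subject(catalog)
print(links)
```

Because both publishers used the same element and the same subject term, a tool can pair the auction and school-band data sets without anyone arranging the connection in advance. That is the promise of unified metadata standards at small scale.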

Imagine eBay’s aggregated data being discoverable and analyzable by anyone who has the time and interest. Or consider the wealth of government data now being made available via the Web, a perfect candidate for linking and describing richly in unified, standards-based ways. Metadata holds the key to turning the chaos of data on the Web into valuable information.

The Path Forward

Big data is here to stay -- and in increasingly large volumes. Google, Yahoo, and Facebook were early businesses that learned to monetize this deluge, and all three were behind what became Hadoop, a new way to crunch especially massive data sets affordably. Data volumes of a similar scale are coming to businesses of all sizes -- and they represent an opportunity. Now is the time to start thinking about how your business can turn this deluge of data into progress and profit.

Each business should start by understanding the gold mine of data that it may be sitting on today and what value that data may have once aggregated or once the exhaust is understood more fully.

The good news: open source business intelligence is making it possible for organizations of any size to tackle these large data volumes. Before Hadoop and open source BI tools, a business had to invest literally millions of dollars to store and manipulate data of any large scale. Now, it can be done powerfully and affordably by any organization that wants to tackle an “enterprise-class” data analysis problem. With more than three exabytes of new data being created each day, that’s an option every organization needs.

Brian Gentile is the chief executive officer of Jaspersoft.
