
Big Data Killed the Data Modeling Star

Big data offers BI professionals new ways of making information work for the business.

By Paul Sonderegger, Chief Strategist, Endeca Technologies, Inc.

MTV's first video, "Video Killed the Radio Star," captured TV's disruption of the music industry. Big data is disrupting the BI industry in a similar way -- changing what BI teams do and how they do it.

It's not immediately obvious why this should be. Shouldn't big data be like a DBA right-to-work act? If managing data is what BI teams do today, a greater supply of information should mean their skills are in greater demand. That's true up to a point, but big data shoots past that point, inverting the relationship.

The volume of big data is such a change in degree that it's a change in kind. It's like running a zoo where every morning the number of animals you have grows by orders of magnitude. Yesterday you had three lions. Today you have 300.

Big volume isn't even the big story. The big story is the variety of data and its velocity of change. This is like running a zoo where the number of animals shoots up every morning as does the number of kinds of animals. Yesterday you had 300 lions. Today you have 30,000 animals, including lions, hummingbirds, giant squid, and more.

The biggest bottleneck in making this data menagerie productive is labor. In a big data world, data modeling, integration, and performance tuning are governors of data use because they rely on relatively slow manual processes done by relatively expensive specialists. In an ironic twist, the substitution of computing capital for labor that transformed other business processes (such as inventory management, manufacturing, and accounting) will do the same to information management itself.

Take the relatively simple case of a data mart with fast-growing volume. As the volume of data grows, query performance tuning becomes both more important and more difficult. Performance tuning requires trade-offs. For example, pre-aggregating the data improves query response but cuts users off from detailed data that may be valuable for certain investigations. As data volume grows, more aggregation may be required, eliminating levels of detail that used to be available. When users rebel, the BI team has to haggle over remediation and strike a new balance. This time-consuming approach is simply unaffordable in a big-data world.
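To make the trade-off concrete, here's a minimal sketch in Python (using pandas, with a made-up sales table; the column names are illustrative, not taken from any particular product). The rolled-up mart answers the dashboard question quickly, but the follow-up question needs the detail rows that aggregation threw away:

    import pandas as pd

    # Detailed fact table: one row per transaction.
    detail = pd.DataFrame({
        "store":   ["A", "A", "B", "B"],
        "product": ["x", "y", "x", "y"],
        "units":   [3, 5, 2, 7],
        "revenue": [30.0, 75.0, 20.0, 105.0],
    })

    # Pre-aggregated mart: rolled up to store level for fast dashboard queries.
    rollup = detail.groupby("store", as_index=False)[["units", "revenue"]].sum()
    print(rollup)  # fast answer to "total revenue per store"

    # Impossible from the rollup alone: which product drove store B's revenue?
    # That question needs the detail rows the aggregation discarded.
    print(detail[detail["store"] == "B"].sort_values("revenue", ascending=False))

Every time the mart is rolled up one more level to keep queries fast, another class of questions like that second one disappears.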

Removing this bottleneck is what data warehouse appliances are all about, including those from Netezza (now IBM) and Vertica (now HP), plus SAP's HANA and Oracle's Exalytics appliances. Dramatic increases in processing horsepower from in-memory architectures, as well as faster look-ups thanks to the improved compression and organization of columnar stores, make performance tuning through model-tweaking a thing of the past.
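The columnar half of that claim is easy to illustrate with a toy sketch (plain Python, hypothetical order data; a real appliance adds compression, vectorized execution, and in-memory processing on top of the same layout). An aggregation over a column store touches only the attributes it needs, instead of every field of every row:

    # Row store: each record carries every attribute.
    rows = [
        {"order_id": 1, "region": "east", "amount": 120.0},
        {"order_id": 2, "region": "west", "amount": 80.0},
        {"order_id": 3, "region": "east", "amount": 45.0},
    ]

    # Column store: one contiguous array per attribute.
    columns = {
        "order_id": [1, 2, 3],
        "region":   ["east", "west", "east"],
        "amount":   [120.0, 80.0, 45.0],
    }

    # "Total amount for the east region" scans every field of every row here...
    row_total = sum(r["amount"] for r in rows if r["region"] == "east")

    # ...but only two columns here, which is why big scans and aggregations
    # get faster without anyone tweaking the data model.
    col_total = sum(amt for amt, reg in zip(columns["amount"], columns["region"])
                    if reg == "east")

    assert row_total == col_total == 165.0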

Now take the more complicated case of fast-growing volumes of highly diverse data. This could include the data from a warehouse appliance plus enterprise application data, documents from a content management system, and social media feeds (arguably, the giant squid of the data zoo). The huge variety of this data makes it difficult to design a model ahead of time, and the relentless change of multiple, distributed systems almost guarantees the model will be out of date before it's completed.

Solving this problem was the motivation behind post-relational technologies from Cloudera, MongoDB, Endeca, and the OpenDremel project, a little-known open source initiative, inspired by Google's Dremel, for interactive analytics on big data. The genius of these systems is that they break with a key principle of relational technology: that you have to fully organize information before you can do anything with it.

Although these individual technologies are very different, they share a common characteristic: they break data down into its smallest atomic unit -- the attribute-value pair -- and operate on that, independent of any larger, overarching schema. This reduces the need for predetermined models and makes it easier to operate on data as it's found in its natural environment.
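As a rough illustration, here's a small Python sketch (the records are invented) of what operating on attribute-value pairs, rather than a predefined schema, looks like. Each record carries only the attributes it happens to have, and the pairs can still be counted, faceted, and searched:

    from collections import Counter

    # Three very different records: a social media post, an order, a document.
    records = [
        {"type": "tweet", "author": "@zookeeper", "text": "new giant squid!"},
        {"type": "order", "sku": "LION-300", "qty": 2, "region": "east"},
        {"type": "doc",   "title": "Feeding schedule", "author": "staff"},
    ]

    # No up-front model: flatten every record into (attribute, value) pairs
    # and operate on whatever is actually there.
    pairs = [(attr, value) for rec in records for attr, value in rec.items()]

    # Example operation: count how many records expose each attribute --
    # the kind of faceting these systems do without a predefined schema.
    attribute_counts = Counter(attr for attr, _ in pairs)
    print(attribute_counts)  # "author" shows up in 2 records, "sku" in 1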

This has a few big benefits. It reduces the time and effort required to integrate diverse data sets, and it allows fast response to changes in the data or in user requirements. Big data may have it in for the data modeling star, but there's a silver lining for a new generation of BI pros.

Big-data technologies will allow BI teams to deliver more powerful analytic applications on data in the warehouse and beyond, and do so faster and more cost-effectively. The combination of more flexible data models and greater processing power will allow BI teams to support fact-based decision making in departments that need facts from the wider world. They'll also be able to respond more quickly to fast-changing requirements, improving their ability to collaborate with the business and answer unanticipated questions.

Just as MTV ushered in the era of mega-pop stars by offering artists new ways to captivate fans, big data offers BI pros new ways of making information work for the business.

Paul Sonderegger is chief strategist at Endeca Technologies, Inc. He helps global organizations turn big data into better daily decisions to gain competitive advantage. Previously, Paul was a Principal Analyst at Forrester Research, where he published numerous reports on enterprise search and user experience, and advised hundreds of executives at Global 2000 firms. Before that, Paul was an analyst at Strategic Interactive Group, now Digitas. Paul can be reached at [email protected]
