Q&A: Emerging BI Technologies

Emerging tech is the theme of TDWI's Orlando conference. BI expert and conference speaker Steve Dine leads us through the latest in BI innovations and trends.

TDWI's World Conference in Orlando on November 7-12 will focus on emerging technologies. To get a sense of what some of the latest innovations are and how they affect BI, we talked with BI expert Steve Dine, whose company focuses on innovative, lean, and scalable BI implementations and strategic program guidance. Along with Mark Madsen, who is also well-known for his expertise in new technologies, Dine is speaking at the Orlando conference on "Enabling BI for the 21st Century."

BI This Week: What sorts of new and emerging technologies are you planning to cover at the Orlando conference?

Steve Dine: We plan to cover collaboration, visualization, text analytics, predictive analytics, and analytic database technology, which includes MPP [Massively Parallel Processing], columnar, NoSQL, and data warehouse appliances. We're also going to talk about data integration technologies and techniques, such as ELT [Extract, Load, and Transform], streaming, and CEP [Complex Event Processing], as well as program management in this new environment, including agile BI.

What about the cloud? Is that still an emerging technology?

Yes, very much so. Companies are still very reluctant to move their BI assets to the cloud. Part of that is obvious concern over security, part of it is concern over service-level agreements, and part of it is concern over performance.

A big challenge is that companies see it as an all-or-nothing proposition versus the incremental proposition of being able to move certain aspects of their program out to the cloud. … There are a lot of reasons for that.

I also think that there's resistance on the part of BI vendors -- other than SaaS [Software-as-a-Service] vendors -- to get behind the cloud. The more traditional software and hardware vendors are a little bit afraid of losing their traditional licensing revenue and of having to adapt their software to work in a scale-out environment. There's also the question of how to troubleshoot when problems arise.

Security often comes up in relation to the cloud -- is that concern mostly a red herring?

I'm not sure it's a red herring. It depends on the organization. For many organizations, it's a false concern because their data isn't really all that sensitive, or the aspects of their data that are sensitive can easily be masked in the database or encrypted. However, for other organizations, it's very much a challenge.

You simply can't guarantee 100 percent secure data in the cloud. You're reliant upon an external vendor to provide it. Having said that, a lot of companies already outsource their data centers and are comfortable with that environment even though it presents security challenges similar to those of the cloud.
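
To make the masking-or-encrypting point above concrete, here is a minimal sketch in Python. It assumes the cryptography package is installed; the row and column names are hypothetical, and a production system would manage keys in a key-management service rather than in code.

```python
# Minimal sketch: masking vs. encrypting sensitive columns before
# shipping data to a cloud-hosted database. Column names are hypothetical.
import hashlib
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice, fetched from a key-management service
cipher = Fernet(key)

def mask(value: str) -> str:
    """One-way mask: a consistent token, still usable for joins.
    (Illustrative only; real masking would salt or tokenize.)"""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def encrypt(value: str) -> bytes:
    """Two-way: recoverable later with the key, which stays on premises."""
    return cipher.encrypt(value.encode())

row = {"customer_id": 42, "email": "jane@example.com", "ssn": "123-45-6789"}
safe_row = {
    "customer_id": row["customer_id"],   # non-sensitive, ship as-is
    "email": mask(row["email"]),         # masked: joinable but not readable
    "ssn": encrypt(row["ssn"]),          # encrypted: readable only with the key
}
print(safe_row)
```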

What about concerns regarding performance, and about moving large sets of data back and forth into and out of the cloud?

That's a very real concern. Once again, it's dependent upon the organization and whether its network can support large data movement across the Internet. Now, there are techniques for handling large data. The first step is to evaluate your data processing footprint. For instance, a lot of large data warehouses have very large initial or historical data loads, but in those cases, the incremental loads often aren't that large. Even fairly sizeable daily loads can be compressed, encrypted, sent over the wire, decrypted, decompressed, and loaded. Many cloud providers also will facilitate the loading of physical media sent to them, so there are ways to handle the larger loads as well.
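
As a rough illustration of that incremental-load flow, here is a hedged Python sketch: compress, encrypt, "send," then reverse the steps on the cloud side. The extract is a made-up CSV and the wire transfer is stubbed out as a variable.

```python
# Sketch of the compress -> encrypt -> transfer -> decrypt -> decompress
# flow for a daily incremental load. The "wire" here is just a variable;
# in practice this would be an upload to the cloud provider.
import gzip
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # shared with the cloud-side loader
cipher = Fernet(key)

daily_extract = b"order_id,amount\n1001,29.99\n1002,14.50\n"  # hypothetical CSV

# On-premises side: compress first (ciphertext doesn't compress well).
compressed = gzip.compress(daily_extract)
payload = cipher.encrypt(compressed)      # what actually crosses the Internet

# Cloud side: reverse the steps, then bulk-load into the warehouse.
received = cipher.decrypt(payload)
restored = gzip.decompress(received)
assert restored == daily_extract
print(f"{len(daily_extract)} bytes before compression, {len(compressed)} after")
```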

Another aspect to consider is that with any cloud provider, you can't guarantee the architecture that your service will be running on. For instance, if you are truly going to run in a parallel environment -- an MPP environment -- the speed of your backbone between your servers and your storage, and your servers themselves, is very important. You're relying upon the messaging interface; you're relying upon data moving across those nodes.

That's especially true when you're joining data, because chances are your data isn't co-located on the same node as the data you're joining it to. You need high-speed switches and high-speed backbones between your servers and also connecting to your storage. You can't necessarily guarantee that in a cloud environment if you're simply running on public infrastructure.

Having said that, however, cloud providers are starting to provide application-specific architectures and high-speed options. For instance, Amazon is providing a cluster compute instance type where the instances you provision are, in essence, located on the same subnet, with gigabit Ethernet between the nodes and high-speed connections into local storage. They're starting to provide infrastructures for applications such as data warehousing or business intelligence.
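
For readers curious what provisioning such a cluster-style instance group looks like, here is a hedged sketch using boto3, Amazon's Python SDK. The AMI ID, instance type, region, and group name are placeholders, and today's options differ from what was available when cluster compute instances first launched.

```python
# Sketch: launching instances into a "cluster" placement group so the
# nodes share a low-latency, high-bandwidth network segment -- the kind
# of fabric MPP-style workloads care about. All names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A cluster placement group asks EC2 to pack the instances close
# together on the network.
ec2.create_placement_group(GroupName="bi-mpp-cluster", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-00000000",        # placeholder AMI
    InstanceType="c5n.9xlarge",    # a network-optimized type, for example
    MinCount=4,
    MaxCount=4,
    Placement={"GroupName": "bi-mpp-cluster"},
)
```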

That's part of the future of the cloud -- we're going to start seeing more application-specific architectures where, when you provision a server, you're choosing whether it's for Web applications, for a BI application, for a transactional application, or for a game server. I think you'll then have the choice of provisioning on specific fabrics for the type of application you're hosting.

Regarding analytic database technologies, you mentioned MPP as a topic you'll cover in your Orlando presentation. In what sense is MPP a new technology?

MPP in itself isn't necessarily a new technology, but it's an emerging technology with respect to BI, as data volumes grow and there's more demand for deeper analytics. The ability to distribute your computing across multiple nodes in a shared-nothing architecture is what enables databases in today's environment to really scale. Cost is another big factor because scaling up is very costly.

Distributed computing with MPP allows you to really scale out. For instance, a number of vendors are providing pre-packaged, pre-configured hardware along with MPP databases -- these are the appliances. Others are providing MPP databases in which you can choose your own architecture. An [important feature] of MPP is really the ability to scale with your data, so you can query and analyze larger amounts of it.
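
Here is a toy sketch of the shared-nothing idea in plain Python: rows are hash-distributed across nodes by a key, so each "node" scans only its own slice, and tables distributed on the same key can be joined locally on each node. This is purely illustrative, not any vendor's actual engine.

```python
# Toy model of shared-nothing MPP: hash-distribute rows on a key so each
# node holds only its slice, and rows that join on that key are co-located.
from collections import defaultdict

NODES = 4

def node_for(key) -> int:
    return hash(key) % NODES   # same key -> same node, for both tables

orders = [(1, 250.0), (2, 99.0), (3, 410.0), (1, 75.0)]   # (customer_id, amount)
customers = [(1, "Acme"), (2, "Globex"), (3, "Initech")]  # (customer_id, name)

order_slices, customer_slices = defaultdict(list), defaultdict(list)
for row in orders:
    order_slices[node_for(row[0])].append(row)
for row in customers:
    customer_slices[node_for(row[0])].append(row)

# Each node joins only its own slice (in a real system, in parallel);
# the results are simply unioned. Adding nodes adds slices -- scale-out.
result = []
for n in range(NODES):
    names = dict(customer_slices[n])
    for cust_id, amount in order_slices[n]:
        result.append((names[cust_id], amount))
print(sorted(result))
```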

What about NoSQL data stores? That's an interesting emerging technology.

NoSQL is interesting. One of the reasons we're covering it is that there is a lot of buzz around it, and it may, at some point, play a larger role in BI. Only a small percentage of BI programs currently leverage NoSQL; it's mostly software companies with large Web site applications or financial organizations targeting fraud detection.

However, it's a great technology for storing and retrieving enormous amounts of transactional data. Facebook and Twitter, for instance, run on a NoSQL engine.

Another benefit to NoSQL is that it's not in itself a relational database; you don't necessarily create a relational data model for it. We'll talk more about it at the conference, but basically, the benefit is that you're not tied into a very rigid relational structure. It's thus more flexible and faster in terms of implementation.
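
To make the "no rigid relational structure" point concrete, here is a tiny schema-less sketch in plain Python. A plain dict stands in for a distributed key-value or document store; real NoSQL engines add distribution, persistence, and replication, but the modeling flexibility looks like this.

```python
# Tiny illustration of schema-less storage: each record is keyed and can
# carry whatever fields it needs, with no upfront table definition.
# A relational table would force every row into the same columns.
store = {}  # stand-in for a distributed key-value / document store

store["user:1001"] = {"name": "Ada", "followers": 120}
store["user:1002"] = {"name": "Linus", "location": "Portland",
                      "interests": ["databases", "sailing"]}  # new fields, no ALTER TABLE
store["event:9"] = {"type": "login", "user": "user:1001",
                    "ts": "2010-11-07T09:15:00"}

# Retrieval is by key, which is what these engines optimize for;
# ad hoc querying across records is where they get harder than SQL.
print(store["user:1002"]["interests"])
```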

It's an interesting technology that scales out extremely well, [although] there are also many limitations with NoSQL. It's not as easy to use and it's not as easy to query [as a relational database]. I'm not sure how fast it's going to penetrate the broader BI market.

What other technologies fall under analytics in your list of emerging technologies?

That's an interesting question because there are still a lot of … different definitions around analytics. The traditional enterprise BI vendors consider reporting and OLAP to be analytics. Other vendors, such as SAS, consider analytics to be more statistically based analysis. The types of analytics technologies that we'll focus on are advanced visualization and predictive analytics.

Let's talk about visualization. Where are we right now in terms of advanced visualization?

Visual analytics relies on a user's ability to detect patterns in the data visually. Once again, it's an area where there's some confusion over the definition of advanced visualization. If you were to talk to a company like QlikView, they would consider themselves an advanced visualization provider, but when I think of advanced visualization, I'm thinking in terms of the BIS2s of the world. Advanced visualization goes beyond the simple line charts, heat maps, and Pareto charts that we're used to and provides the ability to layer multiple dimensions -- temporal, spatial, pivotal, and so forth -- on one analysis in a visual manner. You can really gain some valuable insights using these new ways of visually presenting the data.

So being able to layer multiple dimensions is key to your definition of advanced visualization?

Yes, combined with some more advanced analytic methods. [The product] doesn't necessarily need to provide statistical analysis. When I say advanced methods, I mean, for instance, splitting your data into revenue quartiles and displaying that, along with spatial data and other attributes that might be related statistically. As you said, it's about the layering, but it's also the ability to provide context about what you're looking at.
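
As a small, hedged example of that kind of layering, the matplotlib sketch below puts spatial position, revenue quartile, and a third attribute on a single chart. The data is synthetic, generated just for illustration.

```python
# Sketch: layering several dimensions on one visual -- x/y as spatial
# position, color as revenue quartile, marker size as customer count.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
lon = rng.uniform(-120, -70, 200)            # synthetic store locations
lat = rng.uniform(25, 48, 200)
revenue = rng.gamma(2.0, 50_000, 200)
customers = rng.integers(50, 2_000, 200)

# Split revenue into quartiles, as described above (0 = lowest, 3 = highest).
quartile = np.searchsorted(np.quantile(revenue, [0.25, 0.5, 0.75]), revenue)

plt.scatter(lon, lat, c=quartile, s=customers / 20, cmap="viridis", alpha=0.7)
plt.colorbar(label="revenue quartile (0 = lowest)")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.title("Stores by location, revenue quartile, and customer count")
plt.show()
```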

Are we still at an early stage regarding that technology?

Yes, but not really from a technical perspective. It's not so much a technical hurdle as a learning curve.

Do you mean in terms of users being able to understand what the capabilities are and appreciate what they're looking at with advanced visualization?

That's part of it. What I consider a larger challenge is understanding how to present data visually. It's much like the challenge with predictive analytics. It's not that the technologies aren't there to do it. It's that you can't just give somebody a data visualization tool and say, "Okay, create a visualization for me," much like you can't just give somebody SAS and say, "Create a predictive model for me." You have to know, going into it, how things are best displayed.

I put those technologies together because I think we face similar challenges with them. … If you don't know what statistical method is required for a given type of analysis, then it doesn't really matter that you know how to use the tool. Data visualization is very similar -- you need to know what types of visualizations work best for what types of data.

That's an interesting point -- that our level of understanding in the BI industry hasn't caught up with some of these technologies.

Yes -- and it's one of the things we're going to talk about in Orlando. One of the challenges we have with enabling BI for the 21st century -- really, enabling BI for the near-term future -- is that our teams, over the years, have become much less technical. It's not that the BI resources we hire aren't as intelligent as in years past. It's that the tools have made it easier to create reports, perform ETL, and create OLAP analyses or event-based notifications. The tools enable us to point and click through very advanced interfaces, so from a hiring perspective we haven't needed people who are extremely technical, or who at least have a programming background or a technical background in database systems or computer science. It's been very easy to bring in people from the business side, train them on the tools, and turn them into report developers, BI developers, or even ETL developers.

The challenge as we move forward [is to realize that] you may need more of a technology background to enable some of [these emerging] technologies today.

That seems to apply to many of the technologies we've been talking about here.

It does. That's where the challenges are as we move forward. BI programs are maturing. The reason we're getting into areas such as predictive analytics or advanced data visualization is because a lot of organizations feel they've already picked the low-hanging fruit and they're looking to get additional value and additional competitive advantage out of their data in new ways.

That's why we're going to talk about things like NoSQL. It [can] apply to unstructured and semi-structured data. Depending on which statistic you choose, 80 percent or more of data is unstructured. That's the next great frontier. What do we analyze next?

Plenty of companies are starting to look at unstructured data in their organizations. Even if your data warehouse is only one or two terabytes, add in unstructured data, and you're looking at approximately 80 times more data. It's going to test BI programs as we move into the future; our existing technologies aren't necessarily well-suited for handling that much data.

Sounds like this will be an interesting presentation.

Yes, I'm excited about it. We're just now approaching some of the more exciting areas within BI.

I think in 10 years, people are going to look back and say, "Can you believe it? Every time we needed to expand our database, we bought a new server! Every time we couldn't get our ETL loads to finish within our load window, we bought another dedicated ETL server!" I think people will say, "Wow, I can't believe we did that. Today, when we run out of capacity, a new server automatically comes up."
