Q&A: Cloud Computing's Pros, Cons, and Potential
Consultant and TDWI instructor Steve Dine talks with BI This Week about some of the ins and outs of cloud computing for business intelligence.
With the term "cloud computing" all over the headlines lately, it seems that every BI vendor is using the phrase. But what is really meant by "the cloud," and what potential does the technology hold for business intelligence?
BI This Week spoke with Steve Dine, founder and president of Datasource Consulting, whose firm has tested BI architectures on cloud computing platforms for the past two years. Dine, who teaches regularly at TDWI conferences, recently spoke at TDWI's San Diego conference on the subject of BI in the cloud.
BI This Week: Your consultancy works extensively on projects involving cloud computing and BI. What's meant by "the cloud" as the term is commonly used?
Steve Dine: There isn't necessarily one definition of the cloud, which is why it's such a challenge for people to understand it and why there's a lot of confusion -- there are so many definitions out there. Some people refer to the cloud and mean platform-as-a-service, others mean infrastructure-as-a-service, and still others are referring to software-as-a-service.
I tend to define the cloud fairly strictly, and based more on infrastructure than software. My definition is based on five characteristics. First, the cloud means dynamically provisioned services. Second, it's utilization-based, so you pay for what you use -- if I'm using it, I get charged for it; if I'm not, I don't, and there is no upfront, long-term commitment. Third, it's multi-tenant, meaning that more than one user can share the same piece of hardware, on the same architecture. The fourth characteristic of the cloud is that it's virtualized, meaning it's essentially an image of a piece of hardware -- which is the foundation that enables multi-tenancy. Finally, it's service-oriented, meaning that you can interact with it via loosely coupled services.
Especially concerning BI, what are some of the benefits the cloud offers?
The best way to look at the cloud is this: it's another option out there. I gravitated toward the cloud concept immediately [a few years ago] because when I was running a data warehouse program, many of the challenges we faced centered around trying to be more flexible and agile. We simply couldn't work on new projects without impacting our existing data warehouse structure. We were severely constrained by the size of our existing data warehouse and resources even in terms of what new software we could test.
With the cloud, on the other hand, you can start small but you can always grow.
One benefit of the cloud is the ability to scale resources with very few barriers. Another big benefit is the ability to shorten implementation windows. To get into a BI program, for example, at a very low cost, is a huge benefit. The ability to reduce costs, in general, is another big benefit. The ability to use the [computing] environment for proofs of concept and upgrades -- that's a big benefit. Finally, the ability to scale geographically can be an enormous benefit.
What about potential drawbacks and concerns with the cloud?
Based on a recent InformationWeek survey, the No. 1 concern out there is security. Most companies are challenged with establishing and maintaining proper data security, and so one way they feel better about security is to try and keep all of their data within their corporate walls. As we've seen with the number of high-profile security breaches, that clearly isn't always effective.
That's not to say that there aren't valid security concerns with cloud computing, although a lot of it is just perception, in my opinion. This is a technology that lowers the barrier to entry, so, conceivably, someone could go out and provision a server, load up a BI software program on it, use some freely available ETL tool, and start moving data out into the public network -- and there are some very real concerns about that.
However, if you look at the security of an Amazon data center versus some large companies in your geographic area, chances are that Amazon is a lot more secure than many of them.
What about performance as a potential issue with the cloud, especially transfer speeds?
Yes, that's absolutely an issue. If you have a 10-terabyte data warehouse and you're loading 250 gigabytes a day, you're going to have some challenges. There are technologies available, such as Hadoop and MapReduce, that can help you scale, but they work best with certain types of data. There are also methods for transferring large data volumes, but they can add additional overhead to your daily processes.
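To put the 250-gigabyte daily load in perspective, the following is a rough back-of-the-envelope sketch in Python. The link speeds are hypothetical values chosen only for illustration (they are not figures from the interview), the function name `transfer_hours` is made up for this example, and the calculation ignores protocol overhead, compression, and contention, all of which would change the result in practice.

```python
def transfer_hours(gigabytes: float, megabits_per_second: float) -> float:
    """Hours needed to move `gigabytes` over a link sustaining `megabits_per_second`."""
    megabits = gigabytes * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    seconds = megabits / megabits_per_second
    return seconds / 3600


DAILY_LOAD_GB = 250  # daily load volume mentioned in the answer above

# Hypothetical sustained link speeds, for illustration only.
for mbps in (50, 100, 1000):
    hours = transfer_hours(DAILY_LOAD_GB, mbps)
    print(f"{mbps:>5} Mbps: {hours:.1f} hours")
```

Under these assumptions, a link that sustains 100 Mbps would need roughly five and a half hours for the daily load alone, which is why the size of the load window matters so much.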
A related issue around performance in the cloud is that the ability to scale up is limited. Scaling up means adding a bigger computer with more processing power and a larger drive. When you do that, you've just made your computing environment much larger. There are many applications on the market that perform better in a scaled-up environment, such as a 32-way, 256-gigabyte box. Other software, on the other hand, works better in a scaled-out environment. An example is an MPP [massively parallel processing] database. As you add nodes to it, an MPP database scales with the additional CPU and memory.
Some applications are much better at scaling up; some are better at scaling out. In the cloud, because of the state of virtualization today, and because you have multi-tenant architecture, it's very difficult to have large instances of virtual machines. In other words, it's very hard to scale up.
Another concern around performance is the scalability of physical data access. When you're dynamically bringing up instances of virtualized servers, you don't know where those are located within a data center. You can put them within the same zone, but you can't necessarily co-locate those instances on the same box or rack. Therefore, you don't know what the network throughput is between your different cloud-located servers.
Also, in most cases you can't really control the architecture of your data storage layer. You will likely be limited to software RAID and won't be able to choose the type of communication backbone between your CPU and storage. You're essentially locked into how the cloud vendor's storage is architected.
What else should people be aware of regarding the cloud?
Pricing can be a challenge simply because it can be complex. Calculating your "spend" is sometimes difficult because so many factors go into it, such as the size of the server that you're provisioning, the amount of I/O in and out of the cloud, the amount of memory, and so forth. Each vendor seems to have a different way of charging.
A related issue with cloud pricing is that it can vary by month or quarter. In the world of corporate budgeting, it can be difficult to create a variable budget item that allows for the way pricing from cloud vendors often works.
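As a minimal sketch of why the "spend" is hard to pin down, the Python below adds up three usage-based charges for two different months. Every rate and usage figure is hypothetical, as are the names `UsageRates` and `monthly_spend`; real vendors each meter and price these factors differently, and often include more of them.

```python
from dataclasses import dataclass


@dataclass
class UsageRates:
    """Hypothetical per-unit prices, for illustration only."""
    per_instance_hour: float
    per_gb_transferred: float
    per_gb_stored_month: float


def monthly_spend(rates: UsageRates, instance_hours: float,
                  gb_transferred: float, gb_stored: float) -> float:
    """Sum the compute, data-transfer, and storage charges for one month."""
    return (instance_hours * rates.per_instance_hour
            + gb_transferred * rates.per_gb_transferred
            + gb_stored * rates.per_gb_stored_month)


# Assumed rates and usage; none of these numbers come from a real price list.
rates = UsageRates(per_instance_hour=0.50,
                   per_gb_transferred=0.10,
                   per_gb_stored_month=0.15)

quiet = monthly_spend(rates, instance_hours=200, gb_transferred=500, gb_stored=1000)
busy = monthly_spend(rates, instance_hours=720, gb_transferred=3000, gb_stored=1000)

print(f"Quiet month: ${quiet:.2f}")
print(f"Busy month:  ${busy:.2f}")
```

Even with only three factors, the busy month in this example costs well over twice the quiet one, which is the kind of swing a fixed budget line has trouble absorbing.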
Looking into your crystal ball, where do you see cloud computing headed? How important will it be to BI?
From my perspective, cloud computing may actually lead to what BI vendors have touted for many years, "pervasive BI." The reason is that we're starting to see licensing models change to meet the utilization-based computing model. In the near future, companies may no longer be constrained by large, upfront, user-based licensing fees. Software-as-a-service vendors are also leveraging the cloud, making it easier for small businesses to load and analyze their data with very little upfront cost and administrative overhead.
Another possibility is this: cloud computing opens up the ability to implement managed BI at lower costs than has traditionally been the case. Several years ago, a number of consulting companies said, let's create managed data warehousing -- essentially data warehouses that the vendor would manage for the customer. Many companies ended up going under because of that, basically because the cost structure was such that it wasn't viable -- you couldn't build and manage a data center, and bring clients in, and still cover your costs. They also didn't have the resources to support a large data center.
Given the state of communication networks today, along with the ability to manage large data centers "in the cloud," we may finally have the ability to implement true managed BI.
What is your company, Datasource Consulting, working on now?
We've been focusing on cloud computing for the last couple of years. We've spent a lot of time researching and testing; this is still a fairly new concept and there aren't a lot of easy tools to manage what you're doing. It takes time to understand everything from auto configuration to data management (how do you back up the cloud?) to disaster recovery (how do you recover if things get lost or servers go down?). Clients want to be able to take advantage of the flexible nature of the cloud. If they don't need something from 5 p.m. to 8 a.m., we can just shut the servers down, then bring them up the next day, and they don't have to pay for the down time.
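The effect of the overnight shutdown described above can be sketched with a little arithmetic in Python. This assumes compute is billed per hour only while a server is running; storage and other charges may continue while servers are stopped, depending on the vendor. The function name `billed_hours_per_day` is made up for this example.

```python
def billed_hours_per_day(start_hour: int, stop_hour: int) -> int:
    """Hours billed per day when servers run from start_hour to stop_hour (24-hour clock)."""
    return (stop_hour - start_hour) % 24


# Servers come up at 8 a.m. and are shut down at 5 p.m.
running = billed_hours_per_day(start_hour=8, stop_hour=17)
savings = 1 - running / 24

print(f"Billed hours per day: {running}")
print(f"Compute hours avoided: {savings:.1%}")
```

With servers down from 5 p.m. to 8 a.m., only 9 of every 24 hours are billed, so under these assumptions 62.5 percent of the compute hours are avoided compared with running around the clock.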
We've spent considerable time testing different data integration architectures and different BI implementations in the cloud. We've built out a number of AMIs [Amazon Machine Images] that we've preconfigured with BI software. We've also created scripts so we can bring those instances up and have them pre-configure themselves.
One interesting current project is with the BeyeNetwork, in which we're implementing a custom data warehouse and front-end BI solution in the cloud, working with an open source stack. Unlike most on-premise data integration projects we've implemented, our development environment was available immediately for our team, at minimal cost and overhead to the client. We also were able to scale our environment as the project required. The BeyeNetwork will be able to report and analyze integrated data with no on-premise server hardware, and will be required to pay only for what it uses.