Q&A: Data Modeling Gives Structure for Move to the Cloud

Why data modeling is important in cloud-based BI.

As cloud computing gains steam in data warehousing and BI, the importance of data modeling should grow with it. In this interview, well-known data modeling expert Steve Hoberman explains that data modeling, often overlooked as a component of cloud computing, can provide structure and help companies decide what data should reside in the cloud.

Hoberman, a widely recognized innovator and thought leader in the field of data modeling, taught his first data modeling class in 1992; since then, he has educated many thousands of people about data modeling and BI techniques. A frequent presenter at industry conferences, including TDWI, Hoberman is also a columnist and contributor to a number of industry publications. He is the author of the books Data Modeling Made Simple, Data Modeler's Workbench, and Data Modeling for the Business. He spoke at a TDWI Webinar on "Better BI Through Data Modeling" in February.

BI This Week: We're going to talk about how data modeling and cloud computing intersect, so let's start with your definition of cloud computing.

Steve Hoberman: There are lots of different definitions out there for the cloud, but here's what I think cloud computing is: the outsourcing of software, platform, or infrastructure. By infrastructure, I mean the underlying servers, network, and so forth. By platform, I mean the entire development environment that teams use to build applications. Software, the third tier, means any of the applications used to run the business.

One thing that's very interesting about the cloud is that it allows organizations to get at their data through thin clients (such as Web browsers) without knowledge of, or control over, the exact location of their data.

Another thing that is fascinating about cloud computing is that, based on my definition, organizations are still accountable for their data but not necessarily responsible for it. It's almost as though cloud computing comes down to accountability without responsibility. An organization is still accountable for making sure its sales numbers are correct. A health insurance company is still accountable for making sure it follows HIPAA regulations and that no one else sees any sensitive data, but it's not responsible for making sure that happens. Another organization is responsible instead.

Do you see that as a positive with the cloud -- the fact that you share responsibility for your data with someone else, presumably with expertise in safeguarding it?

I see it as a caution, not as a positive or a negative per se. All of a sudden, somebody outside your company has access to the data, and they're responsible for keeping it up and making sure that it's protected. They're not necessarily accountable for it. They're responsible.

I often give presentations to audiences of data professionals, and many of them focus on issues such as stewardship and governance. They take data very seriously. This is a caution to them that with the cloud, there's probably more work involved to make sure that the data is still protected.

You've broken cloud computing into three areas: the backend servers (or infrastructure), a middle tier platform (including the development environment), and the applications (or software). Can we look at each of those areas in terms of data modeling?

Yes. Data modeling is evolving with each of those. When you think about those three layers, it's extremely important to get the infrastructure protected and right because if there are any issues there, it spreads to anything above it -- the platform and the software.

Data modeling plays a number of interesting roles here. One of them is determining what data it makes sense to send out onto the cloud. Should we send our general ledger data? Should we send our marketing data? Should we send something that's not sensitive data at all -- country codes or currency codes or other publicly available data?

With data modeling, someone has to do an analysis to first understand all the data in the organization and then decide what should be sent to the cloud. You can't just pick and choose the data to send, though -- you need to understand the whole organization.

Let's say an organization decides to send all of their prospects somewhere in the cloud to store and retrieve, and they're using some kind of CRM lead-generation software that's cloud-based. Well, how do all their contacts, phone numbers, and so on integrate with all their existing customers? When Bob the prospect becomes Robert the customer, who's responsible for making sure it all fits together?

I see a serious responsibility for the data architect -- outside of data warehousing -- to make sure all the pieces still fit together if something moves to the cloud.

There's now extra responsibility -- if we take part of our organization and send it somewhere else, what's the impact? If we do that with prospects, maybe we should do it with our whole customer area. By the way, if we do it with customers, what about orders and that sort of thing?

It's almost like a spider web, in the sense that we're not talking about isolated pieces of data.
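Editor's illustration: the "spider web" point above can be sketched as a small dependency walk. This is only a rough sketch, not a method Hoberman describes. The entity names and the relationships between them are hypothetical, chosen to echo the prospect, customer, and order example in the interview.

```python
from collections import deque

# Hypothetical relationships between data entities (illustrative only).
# Each key lists the entities that depend on, or integrate with, it.
RELATIONSHIPS = {
    "Prospect": ["Customer"],
    "Customer": ["Order", "Contact"],
    "Order": ["Invoice"],
    "Contact": [],
    "Invoice": [],
    "CountryCode": [],
}


def impacted_entities(start, relationships):
    """Return every entity reachable from `start`, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    impacted = []
    while queue:
        entity = queue.popleft()
        for neighbor in relationships.get(entity, []):
            if neighbor not in seen:
                seen.add(neighbor)
                impacted.append(neighbor)
                queue.append(neighbor)
    return impacted


# Moving prospects to the cloud touches more than the prospect data itself.
print(impacted_entities("Prospect", RELATIONSHIPS))
```

Under these assumed relationships, moving prospects also raises questions about customers, orders, contacts, and invoices, while a standalone entity such as country codes affects nothing else.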

So with the cloud, there's now another element that needs to be considered -- something else that comes into the data modeling equation.

Yes, and what's really fascinating is that I think people are missing this part of it.

Recently, I went to the Cloud Computing Expo. It's a world event -- a big conference right in New York. I expected to see lots of vendors talking about issues such as, how do you manage data governance and standards in the cloud?

What I found instead were little slices of cloud computing: here's a company that handles security encryption if you're in the cloud, here's a company that offers servers on a real-time basis. Not one organization there offered anything at all to do with data management or architecture [in the cloud]. I was surprised.

Is that because we're just too early along the adoption curve?

I think you're right. I was talking to a few people at the show, and that answer was a recurring theme. It's just the beginning. Cloud computing is only now catching on, even though it's not a new concept.

In terms of the three tiers of cloud computing and how they affect data modeling, are there specific considerations with applications, the top tier?

Yes, there are. There are different types of modeling an analyst can do. One is data modeling, in which they largely do what they've always done -- work with users, understand what they want, and then take their requirements and build the blueprint, the model, from it. That's the same layer and that doesn't change that much if you're using the cloud.

What does start to change is the next layer, which is almost an assessment piece. Now that I understand the requirements, now that I know the application I want to build, what parts, if any, of this design are eligible to be put in the cloud? What cloud vendors are the best candidates to handle this kind of information?

In that capacity, I see the data modeler playing the role of a cloud service broker, I believe the term is -- somebody who is a liaison between an organization and all the cloud vendors.

I also see the data modeler playing an additional role -- not just understanding requirements, but knowing what, if any, data belongs in the cloud. If something belongs in the cloud, which vendor is the best for that area? Should we go with a vendor that specializes? What type of cloud is the best? Is it a public cloud, meaning everybody shares all the resources? Or maybe a private cloud, which is just for one organization. Or maybe lots of neighborhood clouds, which are groupings of similar organizations that can share resources, such as medical trials in the medical field. You can also have hybrids of these types of clouds.

I think either the data modeler or the data architect is in the best position to understand what the best decisions are and where to move the data.

So the modeler or architect is playing that cloud service broker role you mentioned?

Yes. Also, it's important to mention this: I see many organizations with a desire to skip the whole data piece and jump right to development and rolling out applications. "We'll get to the data stuff later." In many organizations, that's curtailed when, sooner or later, somebody has to ask for a database, a server, the user IDs. IT then becomes aware of an application out there, and they bring in the data people. I've seen this happen in many shops; it's reactive: "Gee, you didn't do any data modeling, you didn't do any data analysis, but you have a database so you need to get IT involved."

With cloud computing, you actually don't have to bother going through IT at all. In an extreme case, you could just go to Amazon, get your servers and software, and IT never has to know there's an application out there. In some shops, data professionals are therefore sometimes fighting to get involved; otherwise, they won't even know about these applications.

In terms of the middle tier in your description of the cloud -- the platform -- are tool vendors offering products yet in the cloud?

That middle tier is platform-as-a-service, and I don't see data modeling tools there yet, which is interesting. If you look at the big two data modeling tools vendors, CA and Sybase, I think there are plans to do so, but there's nothing out there yet.

That goes back to your comment, I suppose, that there are no data management tools out there yet for the cloud.

There really aren't. It's something of a niche. I would think that an organization could make an incredible investment diving into that area, because everyone is fighting over CPUs and renting servers out. There's this whole area that's just untapped.

Where do you see data modeling and the cloud going from here? Where might we be in 18 months to two years?

It's going to depend. There's going to be a very strong need for data modeling in any shops that use cloud computing, but I think it's either going to be proactive or reactive. The organizations that have a very rich methodology and very strong emphasis on data, enterprise quality, and so forth, are going to be positioned to reap the rewards of cloud computing. That's because they know their data and where it is.

In those shops, I think that data modeling will continue to play a very big, proactive role. On the other side, if organizations build those stealth applications that we talked about all over the place, I still see data modeling being needed, just later and reactively. Maybe a year or two down the road, when there's complete chaos and nobody knows where their data is, data modeling will be needed to help rope everything back in.

Sometimes people ask me, "Are there fewer data modelers? Are there more data modelers?" Really, it's an area that's always been growing because there's always more and more data. Either an organization does things the right way early on and they get data modelers involved in the beginning, or they go a few years without any modelers and then they need an incredible number of modelers to restore order where there is chaos.

You've mentioned some interesting cautions for companies around cloud computing. Where should an organization go, in terms of its data modeling, if it is interested in pursuing cloud computing? It sounds like you're saying, "Get your house in order first."

Yes, exactly. There was a study done a couple of years ago by IDC. They found that the No. 1 concern among executives about cloud computing is, of course, security. The No. 4 concern -- almost as high as security -- is the difficulty of integrating with IT.

Organizations should not jump to the cloud until they first get their shop in order or their house in order.

I also think there's a need here. Because security is such a big concern, data modelers need to better understand security and maybe have an internal security data model that everything is connected to. There really isn't anything like that out there today.

There are certain standards around things such as how to model an address, and how to model time. There's no standard that I know of on how to model security, however. That's something that would be interesting to work on, and something for companies to give some thought to.
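Editor's illustration: as a rough idea of what an internal security data model might contain, here is a minimal sketch. It is not a standard and not Hoberman's design. The sensitivity levels, the class and function names, and the level assigned to each entity are all assumptions made for illustration; only the idea that country codes are not sensitive comes from the interview.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity levels; higher means more sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2


@dataclass(frozen=True)
class DataEntity:
    name: str
    sensitivity: Sensitivity


@dataclass(frozen=True)
class Role:
    name: str
    clearance: Sensitivity


def can_read(role, entity):
    """A role may read an entity at or below its clearance level."""
    return role.clearance >= entity.sensitivity


def cloud_candidates(entities, max_sensitivity):
    """List entities no more sensitive than the chosen threshold."""
    return [e.name for e in entities if e.sensitivity <= max_sensitivity]


# Assumed classifications, for illustration only.
ENTITIES = [
    DataEntity("CountryCode", Sensitivity.PUBLIC),
    DataEntity("MarketingCampaign", Sensitivity.INTERNAL),
    DataEntity("GeneralLedger", Sensitivity.RESTRICTED),
]

analyst = Role("MarketingAnalyst", Sensitivity.INTERNAL)
print(cloud_candidates(ENTITIES, Sensitivity.INTERNAL))
print(can_read(analyst, ENTITIES[2]))
```

Tying every entity to one shared classification lets the same model answer both who may see a piece of data and whether it is a candidate for the cloud at all.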

How excited are you about data modeling and cloud computing and where it's going?

One of the things I do is review data models; I've been doing that for over 10 years, and I'm in the process of building my own tool to review data models. To prove that data modeling and the cloud can work, I'm building a tool in the cloud called the Data Model Scorecard. If you talk cloud computing, you have to walk it, too. You have to really go through it and that's what I've been doing.

The application goes live on October 10. If readers want to stay on top of this, let me know via e-mail at me@stevehoberman.com.
