In-Depth
Enterprise Grid Computing: Usage Models (Part 4 of 7)
One benefit of the hierarchical framework for grids introduced last week is that it helps us discover (and provides insight into) key constituencies in a grid ecosystem and its dynamics. Understanding these dynamics is fundamental to building a sound grid strategy.
One of the challenges in deploying a grid infrastructure is that a large-scale grid deployment touches multiple constituencies within an organization. Naturally, the members of each of these audiences believe they have inalienable rights over how a grid should be budgeted, architected, and deployed. Ironically, the least powerful constituency may actually be the application end user: the person who uses a grid facility day to day to run a grid-enabled application and get his or her job done. That person constantly experiences the limitations of the available resources, yet can do little beyond complaining, because relief won’t come before the next procurement cycle, which may be a year or two away. This may change in the future because grid computing is at the forefront of implementing dynamic behaviors.
We will discuss well-known usage models (such as cycle scavenging) and explore how they fit as particular cases in the framework. The hierarchical framework also points to usage models that were not obvious before, such as digital motion picture distribution.
Usage models are useful in the analysis of complex systems because they allow the discovery of system requirements, which in turn define an abstract architecture that meets those requirements. The architecture can then be instantiated into a usable solution for a particular situation. These relationships are depicted in Figure 1.
The mapping of these concepts is eminently bi-directional. For instance, an application end user looking for a particular solution can usually make inferences about which specific grid architectures will (or will not) fit that solution. Likewise, for a given architecture, it is possible to carry out a formal analysis to find the range of problems for which the architecture yields effective solutions. One trivial example is the class of problems known as “embarrassingly parallel,” such as Monte Carlo simulations, which can be run in a distributed environment with little communication. A grid designed to run these problems need not allocate a large portion of the capital investment to the communication subsystem.

When we apply the hierarchical grid framework to a particular grid, we discover that unique usage models exist for each layer in the framework, as illustrated in Figure 2. Usage models in one layer have users, constituencies, and use cases particular to that layer. In a large organization it is not uncommon to see constituencies in different layers at odds with each other.
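The embarrassingly parallel case can be made concrete with a minimal sketch: a Monte Carlo estimate of π in which each hypothetical grid node samples independently, and the only communication is a single sum at the end (the node counts and sample sizes below are illustrative, not from any particular deployment):

```python
import random

def node_hits(samples, seed):
    """One independent grid node: count random points inside the unit quarter-circle."""
    rng = random.Random(seed)
    return sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

def estimate_pi(nodes=10, samples_per_node=100_000):
    # Each node works on its own slice with no inter-node communication;
    # the only coordination is this single reduction at the end.
    total = sum(node_hits(samples_per_node, seed=n) for n in range(nodes))
    return 4.0 * total / (nodes * samples_per_node)

print(estimate_pi())  # approximately 3.14
```

Because the nodes never exchange data while sampling, a cheap communication subsystem suffices, which is exactly the point made above.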
For instance, in the Business Grid, one user is the organization’s chief financial officer (CFO). The usage model could be the procurement model, with specific use cases such as out-sourcing and in-sourcing grid services. The CFO is interested in minimizing the overall investment cost, which may be in conflict with the technical staff, who want a powerful, possibly expensive system on which to run their applications. Yet another player is the IT staff making the actual purchases and providing maintenance services; their CIO would like to select equipment with the highest ROI and lowest operating costs.
Figure 2 shows one example of these ideas using the physical grid. CPUs sit at the lowest layer in the grid framework. We observe that ordinary application users have little interaction with CPUs as physical devices. The actual user at this level is a hardware design engineer working on specific 1-, 2-, or 4-processor solutions. Also note that although a processor has a certain memory bandwidth capability, the actual bandwidth is determined by the chipset selected (a selection that is also the job of a design engineer).
The requirement for actual memory performance is thus ultimately determined by the chipset; that is, the layer above, comprising chipsets, imposes requirements on the CPU layer. Without these requirements, one could arrive at the erroneous conclusion that because programs run faster out of cache, main memory is an unnecessary complication. This is not the case, because applications running at the node level have specific memory space requirements, and those requirements get passed down to the lower layers. Logical connections exist between layers: requirements are passed from one level to the next.
A certain balance needs to exist across layers for the whole system to be viable. A high performance system cannot be built if it is so expensive that it breaks the budget. Any system represents a compromise that is acceptable to all constituencies involved.
The classical usage model of cycle scavenging for grids—where middleware is used to harvest otherwise unused cycles in workstations and servers—applies to the visible grid layer. The grid application user community benefits from this resource, but these cycles tend to be of low quality: the resource can be preempted at any time by the owner of the workstation. The owner, in turn, experiences deterioration in the responsiveness of the workstation, which will likely need to be provisioned with more memory and storage, negating some of the economic benefit of dual use. Likewise, the IT department providing service will likely spend more on labor and software licenses to maintain the equipment.
A cost/benefit analysis across two or three levels may indicate that a better solution is to implement a grid with dedicated hardware, that is, hardware that is shared as a grid but fully dedicated to grid use.
We hinted earlier at some of the Business Grid usage models. Data grids are more than virtualized storage systems and could lead the way to new usage and business models. For instance, a grid with 10,000 data nodes is effectively a device with up to 10,000-fold redundancy. The aggregation of so many nodes can be exploited to reduce the probability of data loss: an application could implement a file system provably designed to meet virtually any level of reliability. It can spread the data in its files so widely that the system behaves like a hologram; even if many of the nodes are lost, it is still possible to recover all of the original data from the remaining ones. Storage need not be in-sourced; it could be outsourced and brokered to different data service providers.
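The reliability argument can be sketched quantitatively. Assume, for illustration, that each data node fails independently with probability p during some period, and that the file system spreads each file so that any k of n stored fragments suffice to reconstruct it, in the spirit of erasure coding (the figures below are assumed, not measurements from any real grid):

```python
from math import comb

def loss_probability(n, k, p):
    """Probability a file is unrecoverable: more than n - k of its n
    fragments are lost, assuming independent node failures with probability p."""
    return sum(comb(n, f) * (p ** f) * ((1 - p) ** (n - f))
               for f in range(n - k + 1, n + 1))

# Assumed figure for illustration: 1% chance a given node fails in the period.
single_copy = loss_probability(1, 1, 0.01)   # no redundancy: 1% loss risk
spread_out = loss_probability(20, 10, 0.01)  # any 10 of 20 fragments suffice
```

With the data spread over 20 nodes, more than ten simultaneous failures are required to lose the file, which drives the loss probability to a vanishingly small value — the “hologram” behavior described above.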
The geographical spread is not always disadvantageous; consider the following analogy in the movie industry: The traditional method for a movie studio to release a motion picture is by physically shipping film cans to movie theatres; this is the film industry version of the sneaker net. It is only a matter of time until the entire distribution process becomes digital, where a movie is digitally transmitted and exhibited with a digital projector. The storage required for a theater-quality motion picture can span several terabytes. Using a central server for sending copies to every movie theater in the world is obviously an inefficient way of using long-haul Internet connectivity. Instead, the servers in each theater can be conceived of as nodes in a data grid. Using a tree topology, the studio could send copies of the file to a few designated distribution points in each country or state. Copies would then be sent from the distribution points to a local distribution point in each city, and then locally to all theaters within a city.
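The fan-out arithmetic behind this scheme can be sketched as follows (the theater count and fan-out are illustrative figures, not industry data): with a fan-out of f, the studio itself sends only f long-haul copies, and the number of forwarding rounds grows logarithmically with the number of theaters.

```python
def forwarding_rounds(theaters, fanout):
    """Rounds needed for a complete fan-out tree to cover all theaters,
    with every node that holds the file forwarding it to `fanout` peers."""
    rounds, covered = 0, 1
    while covered < theaters:
        covered *= fanout
        rounds += 1
    return rounds

# Central model: the studio pushes 10,000 long-haul copies itself.
# Tree model: the studio sends only 10 copies; the rest are relayed
# regionally by the distribution points.
print(forwarding_rounds(10_000, 10))  # 4 rounds
```

The total number of transfers is the same either way; the advantage of the tree is that almost all of them traverse short regional links instead of the long-haul connection back to the studio.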
The quintessential usage model for the Visible Grid is parallel computation. For instance, if it takes one server-node 10 minutes to update 100,000 records, 10 nodes working together (i.e., in “parallel”) could theoretically do the same job in one minute. In practice, of course, the time required would be somewhat more than a minute due to overhead: the I/O subsystem may experience interference with 10 nodes doing simultaneous updates, there might be data dependencies, and one processor might have to wait until another is finished. Nevertheless, parallel processing would reduce the time required to perform the work.
In some cases, wall-clock time is of primary importance; for instance, a weather simulation for forecasting needs to be completed on a deadline. If these calculations can be accelerated by applying more CPUs, it is worth doing so even if the CPUs interfere with each other.
A similar dynamic applies to simulation and analysis jobs in engineering shops, albeit less dramatically. Because design is an iterative process, being able to run jobs that take several CPU hours in a few minutes of clock time means flaws are detected more quickly, which translates into significant savings in workers’ time and increased productivity. In the late phases of a design cycle, parametric runs (i.e., similar runs with slightly different data) may be necessary. With a job that takes eight hours to run on one CPU, a one-CPU workstation running for an entire month will yield about 100 data points. If an unexpected flaw is discovered in the data at the end of that month, and the run needs to be repeated, the project essentially slips by a month.
If, instead of one CPU, 100 CPUs can be applied to the same problem in a grid environment, it is very likely that the computation will not be done 100 times faster, perhaps just 25 times faster. Thus, the grid system might yield one data point every 20 minutes (at 25 percent efficiency). Furthermore, let us assume that a grid with 4,000 nodes is available. In this case, 40 such 100-CPU jobs can be launched in parallel, and hence the team might be able to deliver the 100 data points in one hour.
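The arithmetic above can be checked with a short sketch using the article’s own figures (8-hour serial job, 100 CPUs per job at 25 percent efficiency, 4,000 CPUs in total):

```python
def points_per_hour(serial_hours, cpus_per_job, efficiency, total_cpus):
    """Throughput of parametric runs on the grid."""
    speedup = cpus_per_job * efficiency           # 100 CPUs at 25% -> 25x
    hours_per_job = serial_hours / speedup        # 8 h / 25 ~ 19 minutes
    concurrent_jobs = total_cpus // cpus_per_job  # 4,000 / 100 = 40 at once
    return concurrent_jobs / hours_per_job

rate = points_per_hour(8, 100, 0.25, 4000)  # 125 data points per hour
hours_for_100 = 100 / rate                  # under one hour, as claimed
```

At 125 points per hour, the month’s worth of 100 data points is indeed delivered in under an hour, consistent with the estimate in the text.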
The productivity implications of being able to do a month’s work in one hour are epochal. It might mean shaving the production time of a $100M movie by a few weeks and a few million dollars through the use of parallel rendering engines, or the ability to base real-time quotes on complex derivative securities calculations.
In this article we discussed usage models that have created significant interest in the deployment of grids in the enterprise, including cycle scavenging, application and data grids, and parallel-distributed computation, as well as real-time job provisioning. These usage models need to be considered in a multi-level context: each level can have very particular usage models and stakeholder constituencies. Deploying a grid is not exclusively a technical challenge; organizational issues need to be addressed as well, and any practical grid represents a compromise that all stakeholders involved are willing to accept.