Cellular MultiProcessing Architecture
In this follow-up to "Cellular MultiProcessing: An Introduction," Charlie Young delivers the technical details of the CMP architecture.
This article delves into the technical details of the Cellular MultiProcessing (CMP) architecture. It is a follow-on to the previous CMP article that introduced you to the Unisys CMP technology (see November Unisphere, page 16). That article first looked at four related technologies: symmetric multiprocessing (SMP), clustering, massively parallel processing (MPP) and cache-coherent non-uniform memory access (ccNUMA).
Symmetric multiprocessing, by far the most prevalent technology available today, is based on multiple processors in a cabinet sharing the same memory and I/O subsystem connected by a system bus (or switching technology). Each processor has uniform memory access (UMA) through a system bus that supervises the electronic message traffic of the processors, memory and I/O. However, the system bus hardware becomes a bottleneck, constraining the number of processors in a system.
In clustering, multiple SMP computers are tied together through a communication or I/O port, using a high-speed networking interconnect that is nonetheless slower than an SMP bus. Since each clustered node is separate from the others, architectural awareness is forced into the application.
Massively parallel processing systems are similar to clusters in that an MPP node, usually a uni- rather than multi-processor, is connected with scalable bandwidth. However, as with clustering, awareness of the architecture is pushed into the application.
A ccNUMA system (see Figure 1) can typically scale to a relatively large number of processors, up to 64 and beyond. It connects the busses of multiple SMP nodes, where access to an SMP’s memory from another processor is potentially non-uniform. But to achieve optimum performance when accessing the non-uniform memory, as with clustering and MPP, applications are required to understand the architecture. This forces switching between sending messages and using shared memory, as required, for communicating between nodes. And not all ccNUMA memory accesses are identical, potentially requiring unique code for different implementations.
Cellular MultiProcessing Design
The overarching design goal of the CMP architecture is to bring a mainframe-class NT processing system to the marketplace, providing an industrial-strength platform to run the complete suite of off-the-shelf Windows applications – while still permitting the use of the other prevalent operating systems, including Unisys enterprise server operating systems and associated applications. To do so, the limit on the number of SMP processors must be removed while offering high-speed, fault-tolerant and highly configurable clustering. Added also is the constraint to use plug-and-play components wherever possible, such as standard PCI cards, combined with high-performance, mainframe-class technologies, as required, to satisfy the design goals.
Several technologies are used to achieve these objectives, including processor "pods," cache memory, directory-based cache coherency and a crossbar memory interconnect. We will look at these in detail, then follow up with the architectural impacts of availability, clustering and manageability.
The CMP platform is based on Intel’s processors. It is designed for 4 to 32 IA-64 (Merced) processors, and supports the Pentium II Xeon chip in initial configurations. It also allows configurations incorporating Unisys CMOS processors with no other changes to the hardware. All processor types can co-exist in a single system. As you can imagine, this alone requires complex engineering.
The 32 CMP processors share the same main memory, which is partitionable. This structure allows the system to be configured as a single 32-way SMP system or configured within the cabinet into multiple variable cluster combinations of up to eight partitions, or processing elements.
Each processing element, called a sub-pod, consists of four processors and a shared cache. Each element can then run an independent operating environment. One or more elements can be combined to run an operating system. The configuration can be adjusted from the console.
A significant difference between CMP and other architectures is its main memory design. The design supports memory that can be private to a partition, while simultaneously allowing a portion of memory to be shared among two or more partitions. This blend combines the advantages of the fully shared memory of the SMP architecture with the completely disconnected memory structure of clusters.
When a single CMP system is configured as multiple separate systems that are clustered, the CMP shared memory architecture enables the server partitions to transfer information at memory speeds, rather than at the slower, conventional, clustering network-interconnect speed. And since standard Windows NT APIs can be used, the nature of the CMP system may be hidden from the application.
The main memory is divided into four memory storage units. In total, up to 64 GB of shared main memory is possible (16 GB per memory storage unit), upgradable in increments of 128 MB or multiples thereof.
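The capacity figures above can be checked with a little arithmetic. The Python sketch below is illustrative only: the constant and function names are hypothetical, and it assumes that 1 GB equals 1024 MB and that the 128 MB upgrade step applies to each memory storage unit individually.

```python
# Illustrative arithmetic only; names are hypothetical, figures come from the text.
MSU_COUNT = 4            # memory storage units in the system
MSU_MAX_MB = 16 * 1024   # 16 GB per unit, assuming 1 GB = 1024 MB
INCREMENT_MB = 128       # smallest upgrade step


def max_system_memory_gb() -> int:
    """Total shared main memory when every storage unit is fully populated."""
    return MSU_COUNT * MSU_MAX_MB // 1024


def increments_per_msu() -> int:
    """Number of 128 MB steps needed to fill one storage unit."""
    return MSU_MAX_MB // INCREMENT_MB


def is_valid_msu_size(size_mb: int) -> bool:
    """True if size_mb is a positive multiple of 128 MB within one unit's limit."""
    return 0 < size_mb <= MSU_MAX_MB and size_mb % INCREMENT_MB == 0
```

Under these assumptions, a fully populated system reaches the 64 GB limit, and a single storage unit can be filled in 128 steps of 128 MB.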
To fully understand CMP, it is necessary to get a firm grasp of the system memory architecture and the application of cache to enhance system performance.
Main memory is accessed by a processor across a system bus. For example, refer to Figure 1 on SMP. Main memory is much slower than the processor. When main memory is accessed, the processor sits idle for a number of clock cycles, a delay called memory latency or wait states, while the correct address and data are accessed and retrieved from the memory. For example, it can take as many as 60 cycles to access data in a typical local main memory. As the number of processors goes up, latency rises as well, because contention for the system bus and memory unit increases.
Cache memory is memory located closer to the processors, whose sole purpose is to reduce memory latency. If the processor and the memory ran at the same speeds, there would be no need for cache memory. The cache logic pre-fetches into cache instructions and data anticipated to be required by the processor; it also holds updated data in its cache. There are volumes of text written on methods of properly selecting the data to be placed in cache to keep latency at a minimum; this will not be covered here.
The closer cache memory is to the processor, the faster it is accessed and the lower the memory latency. As a rule of thumb, the closer the cache is to the processor, the more expensive it is and the smaller its capacity.
The fastest memory resides on the same chip as the processor, in the processor’s registers themselves. Also residing on the processor chip is a cache memory, L1 (level one). It’s from this cache that the instructions executed by the processor and the data operated on by the instructions are retrieved.
By way of example, the Pentium II processor contains 32 KB of L1 cache, consisting of 16 KB of instruction cache and 16 KB of data cache. When the required information is not available in L1 cache, a cache miss occurs and the needed information must be retrieved from elsewhere in the memory hierarchy.
The next-fastest cache memory, L2, is on a chip separate from the processor, but contained in the same package module as the processor. To give you an idea of memory latency, it can take up to 10 clock cycles for a typical processor to access L2 storage – and that’s even considering its nearness to the processor and its speed. The Pentium II processor has 512 KB of L2 cache and the Xeon has a 512 KB, 1 MB, or 2 MB L2 cache.
With the sub-pod design of the CMP system, a shared cache, called L3 (third-level cache), has been added. The size of the third-level cache is 8 or 16 MB for the Pentium II Xeon and 16 or 32 MB for the follow-on Merced technology. The L3 cache holds the most recently accessed and to-be-accessed instructions and data for all four processors in the sub-pod. So although every sub-pod has access to main memory, the L3 cache further reduces the requirement to access that slower main memory when the instructions and data are not in their L1 and L2 caches (see Figure 2).
Cache coherency means that all of the caches in the system reflect the latest modifications made by any of the processors or I/O in the system. Coherency is lost when an entry in a cache becomes "stale," meaning that some processor or I/O has modified a cache line, but the modification was not made visible to all of the caches in the system that currently contain this cache line. A cache line is the smallest element of data that flows among the various caches and main memory.
Tracking the data in a cache, monitoring its flow from main memory to and from cache, and ensuring cache memory is coherent is a significant challenge. It is particularly challenging when there are multiple processors each independently operating on information contained in main memory, each potentially modifying the memory.
In typical SMP designs, monitoring of the system bus (bus-snooping) is used to maintain cache coherency. Snooping itself can become a scalability bottleneck even before memory bandwidth becomes the constraint.
A clustering approach does not use transparent hardware cache coherency, relying instead on the application software performing data locking to guarantee data integrity.
For ccNUMA systems, the Scalable Coherent Interface (SCI) linked list coherency protocol is used. This approach also seems to demonstrate significant overheads when scaling to large systems.
CMP adopted a directory-based coherency scheme, yielding optimal performance. A directory that keeps the status of each cache entry is maintained, which is a fast way to ensure coherence of multiple cache entries.
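To make the idea of a coherency directory concrete, here is a minimal Python sketch of a generic invalidate-on-write directory. It is not the Unisys protocol, whose details this article does not describe; the class, its methods and the message counter are all hypothetical, and each cache is identified by a simple integer.

```python
from collections import defaultdict


class Directory:
    """Toy directory: for each cache line, the set of caches holding a copy."""

    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> ids of caches with a valid copy
        self.invalidations = 0           # invalidate messages sent so far

    def read(self, cache_id, line):
        """A cache fetches a line; the directory records it as a sharer."""
        self.sharers[line].add(cache_id)

    def write(self, cache_id, line):
        """A cache modifies a line; every other copy is invalidated.

        Returns the set of caches whose copies would otherwise be stale.
        """
        stale = self.sharers[line] - {cache_id}
        self.invalidations += len(stale)
        self.sharers[line] = {cache_id}  # the writer now holds the only valid copy
        return stale

    def holders(self, line):
        """Caches that currently hold a valid copy of the line."""
        return set(self.sharers.get(line, set()))
```

The point of the sketch is that the directory knows exactly which caches hold a line, so a write triggers messages only to those caches, rather than requiring every cache to watch every bus transaction as in snooping.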
To summarize the flow of instructions and data, the processor first looks in the L1 cache on the processor chip. If the information is not in the L1 cache, it then checks the L2 cache tightly coupled in the processor chip’s package, then the L3 cache for the sub-pod, and finally the local main memory.
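The lookup order summarized above can be sketched in a few lines of Python. This is a simplified model, not a description of the actual hardware: the function names are hypothetical, the L2 and main-memory latencies echo the "up to" figures quoted earlier, and the L1 and L3 latencies are assumptions chosen purely for illustration.

```python
# Hypothetical latencies in processor clock cycles. The L2 and main-memory
# values echo figures quoted in the text; the L1 and L3 values are assumptions.
LEVELS = [("L1", 1), ("L2", 10), ("L3", 25), ("main memory", 60)]


def lookup(address, contents):
    """Return (level_name, cycles) for the first level that holds the address.

    contents maps a level name to the set of addresses currently held there.
    Main memory is assumed to hold every address.
    """
    for name, cycles in LEVELS:
        if name == "main memory" or address in contents.get(name, set()):
            return name, cycles
    raise AssertionError("unreachable: main memory always hits")


def average_latency(addresses, contents):
    """Mean access cost in cycles over a sequence of addresses."""
    return sum(lookup(a, contents)[1] for a in addresses) / len(addresses)
```

For example, with `contents = {"L1": {0x10}, "L2": {0x10, 0x20}, "L3": {0x10, 0x20, 0x30}}`, address `0x30` is found in L3, while address `0x99` falls through to main memory. The more accesses the L3 cache can satisfy, the lower the average latency.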
One of the aspects of an SMP bus architecture is that every time the bus is used by one device, it locks out the use of the bus for other devices connected to the bus. These devices are prevented from completing their tasks until the current controlling device releases the bus.
One of the biggest bottlenecks and constraints in an SMP system is the system bus. In an SMP architecture, the bus often limits the number of processors to four for most manufacturers, with Unisys being one of the exceptions. And similarly with a clustering scheme, as the number of nodes increases, so does the number of communication links, exponentially increasing complexity. The CMP architecture eliminates these performance bottlenecks by using a mainframe-class, high-performance crossbar switching technology.
All memory is visible uniformly to all processors via crossbars. Each pair of sub-pods has its own crossbar interconnect logic (see Figure 3). The crossbar is a non-blocking electronic switch similar to those used in many mainframe-class systems. Also connected to each crossbar are up to 24 PCI (I/O) cards.
The CMP system is designed around a 4x4 crossbar architecture. The 4x4 comes from four crossbar interconnects (one for each pair of sub-pods) connecting to the four memory modules. As shown in the figure, it’s as if each crossbar has its own direct connection to every memory module. In addition, the CMP system is considered four-in and four-out, as if each connection were bi-directional.
Because of its parallelism, a crossbar interconnect improves performance and eliminates the bottlenecks found in bus-based SMP architectures. Access is unimpeded between the local components, removing potential bottlenecks between the processors, memory and PCI devices.
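The benefit of that parallelism can be illustrated with a toy model. In the Python sketch below, a request is a pair made up of a source crossbar port and a destination memory module. The function names are hypothetical, every transfer is assumed to take exactly one cycle, and the crossbar uses a simple greedy schedule rather than whatever arbitration the real hardware applies.

```python
def bus_cycles(requests):
    """A shared bus carries one transfer at a time, so requests are serialized."""
    return len(requests)


def crossbar_cycles(requests):
    """Greedy schedule: in each cycle, grant every request whose source port
    and destination memory module are both still free in that cycle."""
    pending = list(requests)
    cycles = 0
    while pending:
        busy_sources, busy_modules, deferred = set(), set(), []
        for source, module in pending:
            if source in busy_sources or module in busy_modules:
                deferred.append((source, module))  # conflict: wait for a later cycle
            else:
                busy_sources.add(source)
                busy_modules.add(module)
        pending = deferred
        cycles += 1
    return cycles
```

With four requests that each target a different memory module, such as `[(0, 0), (1, 1), (2, 2), (3, 3)]`, the bus model needs four cycles and the crossbar model needs one. When every request targets the same module, the crossbar offers no advantage in this model, because the module itself becomes the point of contention.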
The crossbar architecture offers significantly greater bandwidth and lower latencies than a conventional system bus-based architecture. And the crossbar facilitates system partitioning.
There are resiliency features associated with the crossbar architecture. In a system bus-based architecture, a failure in the bus interconnect affects all units attached to the bus (processors, memory and I/O modules), rendering the entire system unusable. With a crossbar interconnect, where there are dedicated point-to-point connections between individual processor, memory and I/O modules, any failure within the crossbar impacts only those distinct modules attached to that connection. All other independent system modules continue to function unaffected.
If CMP high availability features were geese, we would have a gaggle of them – incorporated into the electronic and physical architecture to increase fault tolerance. Many of these features are similar to those historically provided as standard in mainframe-class systems, only now making their way into Intel-NT-based systems.
The architecture includes full data path checking and extensive use of error checking and correction (ECC). ECC is included on the caches and main memory. The Integrated Management System (see below) remembers detected errors, and logs and reports them.
The physical CMP architecture provides "hot swap" (live insertion) capabilities. Hot swap refers to the ability to remove and insert a field replaceable unit (FRU) without powering down the system, minimizing the loss of system availability. The processor, memory, I/O, power, cooling, intra-connect and management modules can all be hot swapped. This same facility is available in other Unisys products, such as dynamically adding RAID disks to the Unisys Network-Attached Storage (NAS) product.
The CMP system has an integrated environmental monitor to supervise power supply voltage levels, system temperatures, and cooling blower speeds. Upon detecting a failure, the environmental monitor reports the failure to the management system and takes corrective action.
The CMP architecture includes a built-in integrated management system. It runs independently from the application processors. It monitors the health of the system, reports events and takes corrective actions where appropriate. The management system also controls the system partitioning and reconfiguration.
Being independent of the rest of the system, the management system can provide status and take corrective actions, such as electronically isolating a failed component, regardless of the state of the host. Corrective actions may be automated and include unit or system isolation, reconfiguration, and reboot. Automatic fault and/or event reporting, such as requesting maintenance, can be directed via modem to a service center or pager, to a LAN for remote operator attention, or to the management system display for local operator attention.
The CMP system is designed to support both local and remote operation and control. Any standard remote control capability for the operating environment can be used. The management system additionally supports local operations via the management display, or remote operation while utilizing a remote copy of the management system’s user interface.
The CMP system is architected to avoid a single point of failure: no single failure, either in the hardware or software, will result in a prolonged loss of processing capability. In addition, disruption caused by a failure and its isolation is minimized, the duration being operating system dependent.
CMP systems can be configured with 100 percent redundancy, such that no single failure results in a repair being required to continue operation. When taken in conjunction with clustering and/or partitioning, availability is further increased. All of the components of a CMP system can be duplicated.
The CMP design includes N+1 power and cooling. This means that there is one additional power supply and cooling impeller included above what is required for normal operation. An impeller has higher availability and life than simple fans. Thus, server operation continues uninterrupted in the event of a power supply or impeller failure. And the environmental monitor can automatically request maintenance.
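The N+1 rule reduces to a simple comparison, sketched here in Python. The function names and the unit counts in the example are hypothetical; the article does not state how many power supplies or impellers a CMP system actually requires.

```python
def surviving_units(installed: int, failed: int) -> int:
    """Units still working after some have failed (never negative)."""
    return max(installed - failed, 0)


def still_operational(required: int, installed: int, failed: int) -> bool:
    """True if the units left after failures still cover the normal load."""
    return surviving_units(installed, failed) >= required
```

If, say, three power supplies are required and four are installed (N+1), a single failure leaves operation uninterrupted, while a second failure before the first is repaired does not. This is why the automatic maintenance request matters.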
The CMP system supports all forms of RAID for protection against loss or corruption of data. Redundant host adapters (controllers) and cables can be configured for RAID, and redundant Network Interconnect Cards (NICs) can be configured for network-attached storage, as well as to ensure uninterrupted network access.
Software DLLs (High-Availability Transaction Assistant – HA-Tx) are provided for easy, yet integrated, recovery of applications running with Oracle or SQL Server, over Tuxedo over Microsoft Cluster Server (MSCS, also known as Wolfpack). Additional middleware (Open Transaction Integration – OpenTI) is provided to allow Microsoft Transaction Server (MTS) to interact with Tuxedo or any other X/Open application transaction management interface (XATMI) compliant online transaction processing (OLTP) system, including the Unisys TransIT Open/OLTP products.
The design of the CMP system supports several varieties of clustering. MSCS is supported and provides fail-over of failed nodes to healthy nodes within the cluster. In such an environment, CMP can be clustered with other systems through PCI 2.1-compliant I/O cards, such as SCSI or Fibre.
A virtually unique capability allows the CMP system itself to be internally partitioned into a number of independent nodes. Unisys has historically built partitioning into their proprietary systems, and this same capability is designed into the architecture of the CMP.
Because of its sub-pod design, CMP systems can support up to eight partitions (one partition for each sub-pod). This is called intra-node partitioning and clustering. The system can be partitioned to operate as a single 32X SMP, two 16X SMPs, four 8X SMPs, eight 4X SMPs, or other combinations. NDIS (Network Driver Interface Specification) and Winsock (Windows Socket) drivers, for access to shared memory using open APIs, are also provided to allow existing applications to interact over shared memory without change to the application. This enables applications running in CMP partitions to communicate at high speeds via standard clustering software (e.g., MSCS) through the shared memory without change.
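The partitioning combinations listed above follow directly from the sub-pod counts. The Python sketch below shows one way a proposed layout might be checked; it is a hypothetical helper, not part of any Unisys management software, and it assumes a fully configured system of eight sub-pods in which some sub-pods may be left unassigned.

```python
SUB_PODS = 8                # sub-pods in a fully configured system
PROCESSORS_PER_SUB_POD = 4  # processors in each sub-pod


def partition_sizes(sub_pods_per_partition):
    """Return the processor count of each partition in a proposed layout.

    sub_pods_per_partition lists how many sub-pods each partition receives,
    for example [4, 4] for two 16-way partitions. Raises ValueError if the
    layout is empty, gives a partition no sub-pods, or uses more than eight.
    """
    if not sub_pods_per_partition:
        raise ValueError("at least one partition is required")
    if any(n < 1 for n in sub_pods_per_partition):
        raise ValueError("every partition needs at least one sub-pod")
    if sum(sub_pods_per_partition) > SUB_PODS:
        raise ValueError("layout uses more sub-pods than the system has")
    return [n * PROCESSORS_PER_SUB_POD for n in sub_pods_per_partition]
```

For instance, `partition_sizes([4, 2, 1, 1])` describes one of the "other combinations": a 16-way partition, an 8-way partition and two 4-way partitions.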
Each partition can support its own instance of an operating system. These now independent nodes may be interconnected to form a clustered system through I/O cards or through the shared memory capability of the CMP system.
The CMP partitioning guarantees that a failure in a component utilized by one partition does not affect another partition not utilizing that component. And a software failure in one partition does not affect any other partition (unless the failure occurs in shared memory).
CMP partitioning is controlled by its Integrated Management System (IMS). The CMP system is architected to support both static and dynamic partitioning. Static partitioning requires that any component redistribution is made with the operating environment stopped. Dynamic partitioning allows the redistribution to occur during operation if and only if supported by the operating system.
One of the challenges of clustering is administering multiple interconnected, though separate, systems. Administration is immensely simplified using CMP intra-node clustering, since it is a single system with a single console. Additional systems management software exists to allow the viewing of the various partitions as a single system. The intra-node clustering of CMP virtually eliminates the administration problems of complex networking issues associated with a typical clustered environment.
The resiliency provided by clustering is also inherent in CMP intra-node system clustering. In fact, fail-over is even easier with a CMP because of its shared memory.
Unisys provides the ability to manage multiple NT partitions from a single console. A single-point security administration and single logon to all the systems in an enterprise are available for CMP systems. Also available is an enterprise-level printing subsystem that utilizes an NT server to reformat and enhance print files from various hosts prior to printing, mailing or faxing.
CMP instrumentation is included so that in a networked environment the system and its partitions and clusters can be managed through standard third-party enterprise management products. These management capabilities interface with Computer Associates’ Unicenter TNG, Tivoli, and HP OpenView via, for example, SNMP (Simple Network Management Protocol) and other standard protocols.
Maintenance processors, as well as hardware registers that keep statistics about the utilization of each resource in the system, are also provided.
The CMP architecture has received positive press since its announcement. By incorporating mainframe-class technology in an Intel-based open platform, Unisys achieves its design goals of delivering an enterprise-capable system supporting Windows NT as well as Unisys operating systems and UnixWare. Using the technologies of directory-based cache coherency with a crossbar memory interconnect, and a sub-pod and extensively redundant design, the CMP architecture reduces or eliminates the performance and manageability challenges associated with other technologies, including SMP, clustering and ccNUMA.
About the Author:
Charlie Young is Director of U.S. Network Enable Solution Programs in the Global Customer Services (GCS) organization of the Unisys Corporation (Blue Bell, PA), where he has worked for 24 years. He can be contacted at email@example.com.