The manner in which you architect your cloud computing infrastructure can have a direct impact on its resistance to failure.
Adapted from “Cloud Computing: Theory and Practice” (Elsevier Science & Technology books)
Public and private clouds can be affected by both malicious attacks and infrastructure failures such as power outages. Such events can affect Internet domain name servers, prevent access to clouds or directly affect cloud operations.
For example, an attack at Akamai Technologies on June 15, 2004, caused a domain name outage and a major blackout that affected Google Inc., Yahoo! Inc. and many other sites. In May 2009, Google was the target of a serious denial-of-service (DoS) attack that took down services such as Google News and Gmail for several days.
Lightning caused prolonged downtime at Amazon.com Inc. on June 29 and 30, 2012. The Amazon Web Services (AWS) cloud in the Eastern region of the United States, which consists of 10 datacenters across four availability zones, was initially troubled by utility power fluctuations, probably caused by an electrical storm on the East Coast. The storm took down some Virginia-based Amazon facilities and affected companies running systems exclusively in this region. Instagram, a photo-sharing service, was reportedly one of the victims of the outage.
Recovery from these events took a long time and exposed a range of problems. For example, one of the 10 centers failed to switch to backup generators before exhausting the power that could be supplied by uninterruptible power supply (UPS) units. AWS uses “control planes” to let users switch to resources in a different region, and this software component also failed.
The booting process was faulty and extended the time required to restart Elastic Compute Cloud (EC2) and Elastic Block Store (EBS) services. Another critical problem was a bug in Elastic Load Balancing (ELB), which is used to route traffic to servers with available capacity. A similar bug affected the recovery process of the Relational Database Service (RDS). This event brought to light “hidden” problems that occur only under special circumstances.
A cloud application provider, a cloud storage provider and a network provider could implement different policies, and the unpredictable interactions between load-balancing and other reactive mechanisms could lead to dynamic instabilities. The unintended coupling of independent controllers that manage the load, power consumption and elements of the infrastructure could lead to undesirable feedback and instability similar to those experienced by policy-based routing in the Internet's Border Gateway Protocol (BGP).
For example, the load balancer of an application provider could interact with the power optimizer of the infrastructure provider. Some of these couplings may only manifest under extreme conditions and may be very hard to detect under normal operating conditions. They could have disastrous consequences when the system attempts to recover from a hard failure, as in the case of the 2012 AWS outage.
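The kind of unintended coupling described above can be sketched with a toy simulation. Everything here is hypothetical: the controllers, thresholds and multipliers are made up for illustration and don't correspond to AWS or any real system. Two independent controllers, a load balancer and a power optimizer, each react to the same utilization signal without coordinating, and the system oscillates instead of settling:

```python
def simulate(steps=20):
    capacity = 100.0   # servers kept powered on by the power optimizer
    load = 80.0        # traffic routed here by the load balancer
    history = []
    for _ in range(steps):
        utilization = load / capacity
        # Load balancer: routes more traffic when utilization looks low.
        load *= 1.5 if utilization < 0.7 else 0.6
        # Power optimizer: powers servers down when utilization looks low,
        # acting on the same (now stale) measurement the balancer just
        # reacted to -- the two controllers never see each other's moves.
        capacity *= 0.8 if utilization < 0.7 else 1.2
        history.append((round(load, 1), round(capacity, 1)))
    return history
```

Running the sketch shows utilization repeatedly crossing the 0.7 threshold rather than converging, the kind of feedback loop that may stay hidden until a recovery event pushes the system into an extreme regime.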
Clustering resources in datacenters located in different geographical areas is one of the means used to lower the probability of catastrophic failures. This geographic dispersion of resources can have additional positive side effects. It can reduce communication traffic and energy costs by dispatching the computations to sites where the electric energy is cheaper. It can also improve performance with an intelligent and efficient load-balancing strategy.
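A minimal sketch of such energy-aware dispatching follows. The site names, prices and slot counts are invented for illustration; the greedy strategy (cheapest electricity first, spilling over when a site fills up) is one simple heuristic, not any provider's actual algorithm:

```python
# Hypothetical sites: electricity price per kWh and spare compute slots.
sites = {
    "us-east": {"price_kwh": 0.11, "free_slots": 40},
    "us-west": {"price_kwh": 0.09, "free_slots": 10},
    "eu-west": {"price_kwh": 0.14, "free_slots": 80},
}

def dispatch(n_jobs):
    """Place jobs greedily at the sites with the cheapest energy."""
    placement = {}
    for name, s in sorted(sites.items(), key=lambda kv: kv[1]["price_kwh"]):
        take = min(n_jobs, s["free_slots"])
        if take:
            placement[name] = take
            n_jobs -= take
        if n_jobs == 0:
            break
    return placement
```

For example, 45 jobs would fill the 10 cheap slots at us-west and place the remaining 35 at us-east. A real dispatcher would also have to weigh data locality and WAN latency against the energy savings, which is exactly the added complexity the next paragraph discusses.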
In establishing a cloud infrastructure, you have to carefully balance system objectives such as maximizing throughput, resource utilization and financial benefits with user needs such as low cost, fast response time and maximum availability. The price to pay for any system optimization is increased system complexity. For example, the latency of communication over a wide area network (WAN) is considerably larger than that over a local area network (LAN) and requires the development of new algorithms for global decision making.
Cloud computing inherits some of the challenges of parallel and distributed computing. It also faces many major challenges of its own. The specific challenges differ for the three cloud delivery models, but in all cases the difficulties are created by the very nature of utility computing, which is based on resource sharing and resource virtualization and requires a different trust model than the ubiquitous user-centric model that has been the standard for a long time.
The most significant challenge is security. Gaining the trust of a large user base is critical for the future of cloud computing. It’s unrealistic to expect that a public cloud will provide a suitable environment for all applications. Highly sensitive applications related to critical infrastructure management, health-care applications and others will most likely be hosted by private clouds.
Many real-time applications will probably still be confined to private clouds. Some applications may be best served by a hybrid cloud setup. Such applications could keep sensitive data on a private cloud and use a public cloud for some of the processing.
The Software as a Service (SaaS) model faces challenges similar to those of other online services required to protect private information, such as financial or health-care services. In this case, a user interacts with cloud services through a well-defined interface. In principle, therefore, it's less challenging for the services provider to close some of the attack channels.
Still, such services are vulnerable to DoS attacks and malicious insiders. Data in storage is most vulnerable to attack, so devote special attention to protecting storage servers. The data replication necessary to ensure continuity of service in case of storage system failure increases vulnerability. Data encryption may protect data in storage, but eventually data must be decrypted for processing. Then it’s exposed to attack.
The Infrastructure as a Service (IaaS) model is by far the most challenging to defend against attacks. Indeed, an IaaS user has much more freedom than users of the other two cloud delivery models. An additional source of concern is that the considerable cloud resources could be used to initiate attacks against the network and the computing infrastructure.
Virtualization is a critical design option for this model, but it exposes the system to new sources of attack. The trusted computing base (TCB) of a virtual environment includes not only the hardware and the hypervisor but also the management OS. You can save the entire state of a virtual machine (VM) to a file to allow migration and recovery, both highly desirable operations.
Yet this possibility challenges the strategies to bring the servers belonging to an organization to a desirable and stable state. Indeed, an infected VM can be inactive when the systems are cleaned up. Then it can wake up later and infect other systems. This is another example of the deep intertwining of desirable and undesirable effects of basic cloud computing technologies.
The next major challenge is related to resource management on a cloud. Any systematic (rather than ad hoc) resource management strategy requires the existence of controllers tasked to implement several classes of policies: admission control, capacity allocation, load balancing, energy optimization and, last but not least, the provision of quality of service (QoS) guarantees.
To implement these policies, the controllers need accurate information about the global state of the system. Determining the state of a complex system with 10^6 servers or more, distributed over a large geographic area, isn't feasible. Indeed, the external load, as well as the state of individual resources, changes very rapidly. Thus, controllers must be able to function with incomplete or approximate knowledge of the system state.
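One classic way a controller can cope with approximate state is randomized sampling: instead of scanning the load of every server, it samples two at random and sends the task to the less loaded one (the "power of two choices" heuristic). The sketch below is illustrative only; the server counts and the comparison against blind random placement are assumptions for the demonstration, not a real provider's balancer:

```python
import random

def assign(loads, rng):
    # Sample two servers at random and pick the less loaded one,
    # so the controller never needs a global view of the system.
    a, b = rng.sample(range(len(loads)), 2)
    chosen = a if loads[a] <= loads[b] else b
    loads[chosen] += 1
    return chosen

def run(n_servers=1000, n_tasks=10000, seed=1):
    rng = random.Random(seed)
    two_choice = [0] * n_servers   # balanced with two sampled choices
    rand_only = [0] * n_servers    # baseline: one blind random choice
    for _ in range(n_tasks):
        assign(two_choice, rng)
        rand_only[rng.randrange(n_servers)] += 1
    # Return the worst-case (maximum) load under each strategy.
    return max(two_choice), max(rand_only)
```

With 10,000 tasks on 1,000 servers, the two-choice strategy keeps the maximum load close to the mean of 10 while blind random placement produces noticeably hotter spots, even though neither controller ever inspects global state.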
It seems reasonable to expect that such a complex system can only function based on self-management principles. But self-management and self-organization raise the bar for the implementation of logging and auditing procedures critical to the security and trust in a provider of cloud computing services.
Under self-management, it becomes next to impossible to determine why the system took a particular action that resulted in a security breach.
The last major challenge I’ll address is related to interoperability and standardization. Vendor lock-in—the fact that a user is tied to a particular cloud services provider—is a major concern for cloud users. Standardization would support interoperability and thus alleviate some of the fears that a service critical for a large organization may not be available for an extended period of time.
Imposing standards at a time when a technology is still evolving is challenging, and it can be counterproductive because it may stifle innovation. It’s important to realize the complexity of the problems posed by cloud computing and to understand the wide range of technical and social problems cloud computing raises. The effort to migrate IT activities to public and private clouds will have a lasting effect.
Dan C. Marinescu was a professor of computer science at Purdue University from 1984 to 2001. Then he joined the Computer Science Department at the University of Central Florida. He has held visiting faculty positions at the IBM T. J. Watson Research Center, the Institute of Information Sciences in Beijing, the Scalable Systems Division of Intel Corp., Deutsche Telekom AG and INRIA Rocquencourt in France. His research interests cover parallel and distributed systems, cloud computing, scientific computing, quantum computing, and quantum information theory.
For more on this and other Elsevier titles, check out Elsevier Science & Technology books.