Infrastructure Role Cluster Risk Management

The Infrastructure Role Cluster connects the knowledge, people, processes, technology, space, partners, and customers in many ways. Infrastructure management looks at the evolving enterprise architecture and ensures that plans are in place to meet the new and changing requirements of running the business from a networking, telecommunications, hardware, and software perspective.

The Capacity Management service management function (SMF) is most commonly associated with this role cluster. Capacity management is the process of planning, sizing, and controlling service or component capacity to satisfy business needs and user demand at an acceptable cost. Capacity management focuses on procedures and systems, including specification, implementation, monitoring, analysis, and tuning of IT resources and their resulting service performance.

For example, suppose someone is doing capacity management work at an application service provider (ASP). This person spends considerable time analyzing statistics generated by various tools. Everyone in the group is impressed by the volume of detail that a new tool provides, but there is so much detail that the most important measurements can be hard to find. What if they become too hard to spot? One consequence might be that outages and bottlenecks seem to occur without warning, which severely damages customer satisfaction. Mitigations include reconfiguring the user interface, upgrading the tool, or replacing it with one that does not pose this risk. If none of these is an option, or if they will take time to implement, a contingency is to add capacity in hopes of staying ahead of demand.

The context for this risk is that someone acting in the Infrastructure Role Cluster at an ASP is performing capacity management and speculating about the risks related to a new tool. The tool may eliminate some risks related to older tools, but it may introduce the following risk as well.

The following table shows various examples of risk components pertinent to a scenario involving the Infrastructure Role Cluster.

Table: Infrastructure Role Cluster risk components

Risk component | Statement
Root cause: | People
Condition: The following event occurs ... | The capacity management staff uses monitoring tools whose user interfaces are so complex that it is easy to overlook relevant information.
Operations consequence: ... operations will be hurt in this manner ... | Capacity management is faced with outages and bottlenecks that seem to occur without warning.
Downstream effect: ... and the business as a whole will be hurt in this manner ... | Customers are dissatisfied because of the ASP's inability to support the demand for service, and the customers react by switching to a competing ASP.
Mitigation: Prior to the condition occurring, we will try to reduce the impact and/or probability by ... | Simplify the user interface by reconfiguring the existing tools, by installing an upgraded version of the tool, or by replacing the current tool with a better one from a different vendor.
Trigger: If the condition is imminent (but has not yet occurred), we will know because this happens ... | The ASP finds itself unable to meet service level agreements because of inaccurate capacity-utilization forecasts.
Contingency: If we are unable to prevent the condition, we will respond to the trigger in this way ... | Add capacity in hopes of staying ahead of demand.
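
One way to see how the components in the table fit together is to record them as a single structured entry. The following is a minimal sketch, not part of any tool described here: the `Risk` class, the `RootCause` enumeration, and the `empty_components` helper are hypothetical names chosen for illustration, and the field values are copied from the table above.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List


class RootCause(Enum):
    """Sources of risk discussed in this section (names are illustrative)."""
    PEOPLE = "People"
    PROCESS = "Process"
    TECHNOLOGY = "Technology"


@dataclass(frozen=True)
class Risk:
    """One risk, with one field per risk component in the table."""
    root_cause: RootCause
    condition: str
    operations_consequence: str
    downstream_effect: str
    mitigation: str
    trigger: str
    contingency: str


def empty_components(risk: Risk) -> List[str]:
    """Return the names of text components that were left blank."""
    return [
        f.name
        for f in fields(risk)
        if isinstance(getattr(risk, f.name), str)
        and not getattr(risk, f.name).strip()
    ]


# Values taken from the table; only the structure is an assumption.
tool_risk = Risk(
    root_cause=RootCause.PEOPLE,
    condition=(
        "The capacity management staff uses monitoring tools whose user "
        "interfaces are so complex that it is easy to overlook relevant "
        "information."
    ),
    operations_consequence=(
        "Capacity management is faced with outages and bottlenecks that "
        "seem to occur without warning."
    ),
    downstream_effect=(
        "Customers are dissatisfied and switch to a competing ASP."
    ),
    mitigation=(
        "Simplify the user interface by reconfiguring, upgrading, or "
        "replacing the tool."
    ),
    trigger=(
        "The ASP is unable to meet service level agreements because of "
        "inaccurate capacity-utilization forecasts."
    ),
    contingency="Add capacity in hopes of staying ahead of demand.",
)

print(empty_components(tool_risk))  # prints [] because every component is filled in
```

Keeping every component as a required field makes an incomplete risk statement visible as soon as it is written down, instead of at review time.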

Everything in this risk hinges on the source of the risk: people. Presumably, the problem stems from people misusing a tool that is functioning correctly. Under different circumstances, the operations team managing this risk might have decided that this is a technology issue, especially if the tool cannot be reconfigured to reduce the volume of data it presents. Taking it one step further, the risk management team might have asked whether the people knew that the tool could be reconfigured. If they did not know because training never covered that topic, the gap in training would make "process" the source of the risk.

The distinctions are relevant for four reasons:

  • The real source of the problem (people, process, technology) greatly affects the mitigation. For example, altering the training will not prevent problems caused by defects in the tool.
  • The team may analyze current risks by grouping them according to source of risk. This might, for example, expose a set of risks related to poor training or defective tools from a certain vendor.
  • The example illustrates how valuable diverse viewpoints can be during the identification step. Many people who consider the condition by itself would focus on one particular source of risk, potentially missing the others.
  • The example also illustrates why precision is important when documenting a risk. If the wording of the condition were changed even slightly, the other elements of the risk might no longer make sense.
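
The grouping described in the second point can be sketched in a few lines. This is an illustrative example only: the `group_by_source` function and the short risk titles are hypothetical, and the three entries simply restate the people, technology, and process readings of the scenario discussed above.

```python
from collections import defaultdict


def group_by_source(risks):
    """Group (title, source) pairs into a dict of source -> list of titles."""
    grouped = defaultdict(list)
    for title, source in risks:
        grouped[source].append(title)
    return dict(grouped)


# Hypothetical register entries; titles are shorthand, not official wording.
register = [
    ("Staff overlook key measurements in a complex tool interface", "People"),
    ("Tool cannot be reconfigured to reduce data volume", "Technology"),
    ("Training does not cover reconfiguring the tool", "Process"),
]

for source, titles in group_by_source(register).items():
    print(f"{source}: {len(titles)} risk(s)")
    for title in titles:
        print(f"  - {title}")
```

With a larger register, a source that collects many entries (for example, several risks that trace back to training) points to where a single mitigation might address more than one risk.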