Operating a Global Messaging Environment by Using Exchange Server 2007
Technical White Paper
Published: November 27, 2007
The Messaging Operations group within Microsoft IT's Exchange Messaging team must fulfill the dual goals of operating the messaging environment and performing product readiness validation with the Exchange Server product group.
By using the right combination of people, processes and technology, the Messaging Operations group meets the performance and availability targets defined in the SLA.
Enterprise IT organizations, including the Microsoft Information Technology (Microsoft IT) group, deal with service level agreements (SLAs) and power users accustomed to high levels of performance, availability, and responsiveness. The 130,000-plus users at Microsoft send over 3 million internal e-mail messages a day from more than 150 offices worldwide, as well as from home and while on the road. At Microsoft, many business-critical communication processes depend on the availability of messaging services provided through Microsoft® Exchange Server 2007.
Managing the complex Microsoft IT infrastructure is a team effort that involves many different groups, such as the Datacenter team, the Network Infrastructure team, the Active Directory® team, and the Exchange Messaging team. Overall, Microsoft IT manages three distinct environments: a pre-release production environment to test new product versions and upgrades prior to their release to manufacturing (RTM), a corporate production environment to provide IT services to Microsoft users, and the Microsoft Managed Services (MMS) environment to provide managed IT services to Microsoft customers. Within these three environments, the Microsoft IT Exchange Messaging team handles all Exchange-related operation, management, administration, and optimization processes. In that role, the Exchange Messaging team works with many other peer teams at Microsoft IT, sharing its operations and process optimization expertise to help those teams implement efficient and reliable operations processes.
The Messaging Operations group, managed by Microsoft IT Group Manager Gary Baxter, within the Exchange Management team, must meet several reliability, availability, and performance targets (such as 99.99 percent availability of Exchange services). To meet these targets, the Messaging Operations group makes use of industry-standard methodologies such as Microsoft Operations Framework (MOF), Microsoft Solutions Framework (MSF), and Information Technology Infrastructure Library (ITIL). For example, the operations model that the Messaging Operations group implemented based on the ITIL framework relies on structured incident management, problem handling, configuration management, and change control processes. These processes enable the Messaging Operations group to capitalize on Exchange Server 2007 administrative features, such as the Exchange Management Shell, to reduce operations costs and ensure efficiencies.
The key to success in daily operations is the right combination of technology, people, and processes. For example, the Messaging Operations group uses technical tools, such as the built-in product features of Exchange Server 2007 and Microsoft Operations Manager, combined with a clear team structure and work processes that facilitate collaboration. Built-in product features of Exchange Server 2007, such as cluster continuous replication (CCR), help the Messaging Operations group meet 99.99 percent availability and performance targets. New tools and software features, and optimization opportunities gained through customer feedback, enable the Messaging Operations group to analyze and implement changes when necessary to keep pace with the innovative and agile business landscape at Microsoft.
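To put the 99.99 percent availability target in concrete terms, the following minimal sketch converts an availability percentage into a downtime budget. The function name and the choice of measurement periods are illustrative assumptions; the paper does not state the period over which Microsoft IT measures availability.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Return the maximum downtime, in minutes, that still meets the target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Measurement periods are assumptions chosen for illustration only.
for label, days in (("30-day month", 30), ("365-day year", 365)):
    budget = downtime_budget_minutes(99.99, days)
    print(f"{label}: {budget:.2f} minutes of allowable downtime")
```

Under these assumptions, a 99.99 percent target leaves roughly 4.32 minutes of downtime in a 30-day month and 52.56 minutes in a 365-day year, which shows why service and data redundancy matter for meeting it.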
This white paper is for business decision makers, technical decision makers, and operations managers. It assumes that the reader has a working knowledge of Microsoft Windows Server® 2003, Active Directory, Exchange Server 2007, and Microsoft Operations Manager. Because many of the principles and procedures discussed in this paper are based on standard operations methodologies, a high-level understanding of the MOF, MSF, and ITIL models is also helpful.
Note: For security reasons, the sample names of forests, domains, internal resources, organizations, and internally developed security file names that are used in this paper do not represent real resource names used within Microsoft and are for illustration purposes only.
Since the earliest days of Microsoft Exchange Server, the Exchange Messaging team operated an enterprise-messaging environment with an emphasis on using Microsoft technologies wherever possible to keep total cost of ownership (TCO) as low as possible. However, Exchange Messaging team operations practices changed, driven by product advancements such as Active Directory integration, which became available with Microsoft Exchange 2000 Server and later versions. For example, almost a decade ago, the Exchange Messaging team managed a Microsoft Exchange Server 5.5 based environment from the viewpoint of individual servers that provided messaging services. Exchange Server 5.5 included its own directory, which made it convenient to treat Exchange Server 5.5 servers as centralized messaging islands in the IT environment. However, this paradigm changed with Exchange 2000 Server and all later versions. These product versions require not only a functioning operating system, but also a reliable TCP/IP network infrastructure, Domain Name System (DNS) configuration, and Active Directory environment. Managing messaging services from the viewpoint of individual servers was no longer an option for the Exchange Messaging team. The advent of Exchange 2000 Server marked a rigorous shift toward managing messaging as an end-to-end service with no exceptions.
The shift toward a service-oriented IT organization became apparent during the Microsoft Exchange Server 2003 time frame. During this period, the Exchange Messaging team oversaw a site and server consolidation initiative, creating a centralized environment comprising four datacenters with Mailbox servers. Cost savings were $20 million in the fiscal year of 2003 alone. Yet perhaps even more importantly, the consolidation initiative also yielded significant intangible benefits. Consolidating the messaging environment enabled the Exchange Messaging team to measure and enforce performance and availability SLAs across the entire corporate production environment on a global scale.
Today, the Exchange Messaging team operates with both formal and informal SLAs. Formal SLAs represent overall targets for performance and availability, whereas informal SLAs measure internal statistics for metrics and process improvement. For example, the formal SLA for internal e-mail message delivery is 90 seconds for 99 percent of the messages, measured from the moment any Exchange server processes the message. Informal SLAs measure finer aspects of the environment and often contribute to formal SLAs. For example, the Exchange Messaging team measures the availability SLA for Client Access servers individually (on an informal basis) and all Exchange servers as a whole (formally).
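The formal delivery SLA above can be expressed as a simple check over measured delivery times. The following is a minimal sketch, not the measurement method that Microsoft IT uses. The function name, the inclusive treatment of the 90-second threshold, and the sample data are all assumptions made for illustration.

```python
def meets_delivery_sla(latencies_seconds, threshold_seconds=90.0, required_percent=99):
    """Return (percent_within_threshold, sla_met) for a list of delivery times."""
    if not latencies_seconds:
        raise ValueError("at least one delivery time is required")
    total = len(latencies_seconds)
    # Assumption: a message delivered in exactly 90 seconds counts as on time.
    within = sum(1 for s in latencies_seconds if s <= threshold_seconds)
    # Compare with integer arithmetic to avoid floating-point edge cases.
    sla_met = within * 100 >= required_percent * total
    return 100.0 * within / total, sla_met


# Invented sample: 990 messages delivered in 12 seconds, 10 in 120 seconds.
sample = [12.0] * 990 + [120.0] * 10
percent, ok = meets_delivery_sla(sample)
print(f"{percent:.1f} percent within 90 seconds; SLA met: {ok}")
```

In this invented sample, exactly 99 percent of messages arrive within the threshold, so the check passes. One more late message would cause it to fail.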
Within the three environments of Microsoft IT (pre-release, corporate production, and MMS), the Exchange Messaging team is responsible for meeting targets and SLAs, but with different emphasis during operations. In the pre-release environment, the team works with developers to sign off on final product versions before products are released, whereas in the corporate production and MMS environments, the team primarily works to ensure that performance targets are met. The formal and informal SLAs can be more or less rigid to accommodate the purpose of each environment. For example, the pre-release production environment helps developers improve product versions and service packs by testing them in a production-type environment. Meeting SLAs in the pre-release environment is not as crucial as meeting SLAs in the corporate production environment.
To efficiently operate Exchange Server 2007 environments at Microsoft, the Exchange Messaging team developed work processes based on common IT operations frameworks such as ITIL, MOF, and MSF. According to these frameworks, messaging operations involve managing people, processes, and technology, and more often than not these elements span the boundaries of individual teams and their respective areas of jurisdiction. For example, an important aspect of an incident and its resolution is that the Exchange Messaging team must collaborate with other teams also involved in handling the incident, such as front-line operators who identify the incident, analysts who resolve or escalate it, and technical leads and managers responsible for change management and process improvement.
Exchange Server 2007 provides enabling technologies that help the Exchange Messaging team meet its performance and availability SLAs. Exchange Server 2007 represents the next generation messaging system, designed to simplify and streamline operations tasks. Among other things, Exchange Server 2007 provides new features such as improved storage design, data replication capabilities (local and clustered) for high availability, modular setup and server provisioning based on server roles, and new graphical and command-line interfaces for improved manageability and increased automation. The seamless integration of Exchange Server 2007 with other Microsoft products, such as Microsoft Internet Security and Acceleration (ISA) Server 2006, also helps to reduce operational complexities.
The new features in Exchange Server 2007 provide the following advantages for the Exchange Messaging team:
- Less delivery overhead The integration of Exchange Server 2007 with ISA Server 2006 enables Microsoft IT to establish a scalable, load-balancing infrastructure for external messaging clients and to avoid complicated load distribution and client session affinity issues. Specifically, the Web Publishing Load Balancing feature that is available with ISA Server 2006 has significant operational impact. ISA Server 2006 can detect unavailable Client Access servers, and direct client connections to available Client Access servers automatically. This means that Microsoft IT does not need to apply ISA server configuration changes during maintenance cycles to exclude temporarily unavailable Client Access servers.
- Improved delivery time In previous versions of Exchange Server, the Mailbox server was responsible for the final delivery of messages to local mailboxes. On a server with a large number of mailboxes, the transport subsystem, including categorizer and local delivery queue, can become a bottleneck. In Exchange Server 2007, Hub Transport servers perform local delivery via Messaging API (MAPI) calls. By using multiple Hub Transport servers, the Exchange Messaging team can balance the load from delivery queues more evenly. The Messaging Operations group appreciates this capability, especially in situations when executive officers send messages to all employees.
- More levels of control over spam and viruses With Edge Transport servers and the Forefront Security for Exchange feature, the Exchange Messaging team can fine-tune antivirus and anti-spam settings to deliver legitimate e-mail while stopping unwanted messages. Among the anti-spam features with the most significant operational impact are safe sender awareness of content filtering, the automated maintenance of IP block list entries based on Simple Mail Transfer Protocol (SMTP) protocol analysis and IP reputation updates from Microsoft Update Service, and support for auto-expiring allow list entries that lower the administrative overhead associated with maintaining IP allow lists.
- Greater resource specialization Exchange Server 2007 makes it possible to separate overall messaging services into individual services that each server role provides, and establish specialists for each server role. Single-role server deployments help to establish reliable, flexible, and scalable middle-tier services in the messaging environment, and to enable systems analysts to focus on being experts in one server role or technology area. The operational impact is that knowledge and resolution of issues can take place with greater efficiency than when all-purpose generalists research and solve issues.
- Increased high availability options Exchange Server 2007 enables Microsoft IT to eliminate all crucial single points of failure in the messaging environment. This was not possible with previous versions of Exchange Server because replication of mailbox data was unavailable. The CCR features in Exchange Server 2007 provide redundancy of both services and data for Mailbox servers. Exchange Server 2007 provides enhanced high availability options for other server roles as well. Microsoft IT uses Network Load Balancing (NLB) with Client Access servers internally, Web Publishing Load Balancing with ISA Server 2006 servers externally, and multiple Hub Transport servers to provide redundancy and load balancing for message delivery.
Note: Detailed information about high availability with Exchange Server 2007 is available on Microsoft TechNet at http://technet.microsoft.com/en-us/library/bb123523.aspx.
- More control over change management tasks Exchange Management Shell provides the Exchange Messaging team with the means to create scripted and automated initial server configurations, configuration changes, and auditing. This is a key feature for the Exchange Messaging team to empower front-line operators. By running tested and approved scripts, front-line operators can respond to known issues without needing detailed product knowledge. This helps to lower the number of escalated issues. The internal goal is to have front-line operators handle 80 percent of all issues, so that members of the Messaging Operations group can focus the majority of their time on proactive initiatives rather than reactive incident response.
- Easier unified messaging service expansion Prior to the deployment of Microsoft Exchange Server 2007 Unified Messaging, Microsoft IT maintained Unified Messaging servers in the main office of each regional location. Exchange Server 2007 enabled Microsoft IT to consolidate these Unified Messaging server locations into the four datacenters that also contain the Mailbox servers, and to integrate the Unified Messaging servers with Microsoft Office Communications Server 2007. The positive impact on operations is centralized administration via Active Directory and automated monitoring of Unified Messaging servers and voice over IP (VoIP) gateways as part of the messaging environment. Another benefit is that Microsoft IT was able to reduce the complexity of the IT environment by eliminating third-party unified messaging systems.
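The operational benefit described in the first item of this list, where unavailable Client Access servers are skipped without a configuration change, can be illustrated with a small health-aware round-robin selector. This is a conceptual sketch only. It does not represent how ISA Server 2006 implements Web Publishing Load Balancing, it ignores client session affinity, and the class, method, and server names are hypothetical.

```python
class HealthAwareBalancer:
    """Round-robin selection that skips servers currently marked unavailable."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark(self, server, is_up):
        """Record the result of a health check for one server."""
        if is_up:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)

    def pick(self):
        """Return the next available server, or raise if none is available."""
        for _ in range(len(self.servers)):
            candidate = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no Client Access server is available")


# Hypothetical server names; cas-02 is taken down for maintenance.
balancer = HealthAwareBalancer(["cas-01", "cas-02", "cas-03"])
balancer.mark("cas-02", is_up=False)
print([balancer.pick() for _ in range(4)])
```

Because the selector consults health state on every request, taking a server out for maintenance requires no change to the published server list, which mirrors the maintenance benefit that the list item describes.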
The Microsoft IT messaging environment now consists of 62 Mailbox servers (each in a two-node CCR cluster configuration), 10 Edge Transport servers, 15 Hub Transport servers, 11 Unified Messaging servers with supporting VoIP gateways, 26 Client Access servers, and two multiple-role servers in Sao Paulo for the Hub Transport, Client Access, and Unified Messaging roles. There are approximately 130,000 mailboxes in the corporate production environment, 10,000 in the pre-release environment, and 30,000-plus in the MMS environment. Operating these environments requires structured human resources with clearly defined roles.
Some teams at Microsoft IT are responsible for a specific service, and some teams are responsible for a specific function. For example, the Exchange Messaging, Collaboration Services, and Communications teams provide end-to-end operations for their respective services, as shown in Figure 1. Each team makes use of its dedicated management resources, yet the teams often work together because of the interdependent nature of the Microsoft IT environment. The services these teams support rely on a common infrastructure that includes the physical TCP/IP network and the Active Directory environment.
Figure 1. Microsoft IT service teams' structure as of September 1, 2007
Microsoft IT includes the following teams:
- Collaboration Services The Microsoft environment has many Microsoft Office SharePoint® Server 2007 sites, which require significant coordination to operate. The Collaboration Services team manages all aspects related to Office SharePoint Server 2007, including planning for and performing upgrades.
- Communications This team handles all aspects of design, deployment, operations, administration, and management dealing with Office Communications Server 2007.
- Exchange Messaging This team handles the entire messaging environment at Microsoft. For more details about the functions and structure of the team, see the section below titled "Messaging Service Structure and Functions."
- Exchange Center of Excellence As part of its commitment to customers to share knowledge about running Exchange Server 2007, the Exchange Messaging team established the Microsoft Exchange Center of Excellence (ECoE), which is a task force inside Microsoft that helps customers get the most out of their Exchange Server 2007 deployments. For more information about the ECoE, see the section in this paper titled "Operations Process Improvement."
- Shared Services Microsoft IT created the Shared Services team to reduce overlapping responsibilities and cut costs. Before the Shared Services team existed, each service team had its own human resources for managing the tasks that the Shared Services team now assumes. These tasks include common monitoring and other front-line services for all operations teams within the messaging and collaboration-related service teams. The Shared Services team consists of the following groups:
  - Process Engineering This group looks at the processes of the Shared Services team to ensure that they meet the requirements of all peer teams that the Shared Services team supports.
  - Client Support This is a tier 2 support group for Microsoft users. The Client Support group focuses on issues related to end-user connectivity and productivity.
  - Monitoring This group performs all the front-line monitoring for other service groups, such as the Exchange Messaging group. The performance goal for the Shared Services team's Monitoring group is to resolve at least 80 percent of incidents. Therefore, the people in this group must have general server administration and resolution knowledge, and must follow product-specific resolution instructions to resolve incidents. The Messaging Operations group within the Exchange Messaging team creates the necessary Exchange Server specific knowledge base and resolution instructions, and provides training on general resolution and response processes.
Just as multiple teams handle the overall design, deployment, and operations functions of Microsoft IT, the functions within the Exchange Messaging team are similarly distributed. Exchange Messaging team members manage the messaging service from end to end. This entails monitoring messaging-related incidents; coordinating changes; collaborating with the Exchange Server product group and with other Microsoft IT teams; and managing the MMS environment for customers. Within the Exchange Messaging team, people have specialized roles and work together in specialized groups, each of which handles a portion of the overall responsibilities, as shown in Figure 2.
Figure 2. Exchange Messaging team organizational structure as of September 2007
Exchange Server 2007 not only enables IT organizations to capitalize on expert knowledge according to individual server roles; it also provides cost-efficient opportunities to cover basic and general operational aspects via Shared Services teams. As shown in Figure 2, Microsoft IT takes advantage of this possibility with Exchange Server 2007 by using a Shared Services team to perform all front-line monitoring tasks. This frees the Exchange Messaging team to focus on escalated issues and complex tasks such as root cause analysis.
In order to carry out Exchange Server 2007 specific monitoring and incident response, the Shared Services team must have specific resolution steps, which the Exchange Messaging team provides. Specialists in the Messaging Operations group use their expert knowledge to create these detailed resolution steps. If an incident arises that these detailed steps do not cover, then the Shared Services team escalates the incident to the Messaging Operations group. Because the Shared Services team handles the vast majority of incidents without escalation, the Exchange Messaging team can apply expert knowledge in an increasingly targeted way.
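The relationship between documented resolution steps and escalation can be sketched as a lookup: an incident with documented steps is resolved at the front line, and anything else is escalated. The following is a minimal illustration rather than the tooling that Microsoft IT uses. The incident types, the step descriptions, and the function names are hypothetical, and the 80 percent figure is the front-line resolution goal cited elsewhere in this paper.

```python
# Hypothetical knowledge base: incident type -> documented resolution steps.
RUNBOOK = {
    "queue_backlog": "Run the approved queue remediation script.",
    "service_stopped": "Restart the service by using the approved script.",
}


def triage(incident_type):
    """Resolve at the front line when steps exist; otherwise escalate."""
    steps = RUNBOOK.get(incident_type)
    if steps is None:
        return ("escalate", "Messaging Operations group")
    return ("resolve", steps)


def front_line_target_met(incident_types, target_percent=80):
    """Check whether the share of front-line resolutions meets the target."""
    if not incident_types:
        raise ValueError("at least one incident is required")
    resolved = sum(1 for t in incident_types if triage(t)[0] == "resolve")
    return resolved * 100 >= target_percent * len(incident_types)


# Invented sample: four documented incidents and one undocumented incident.
incidents = ["queue_backlog"] * 3 + ["service_stopped", "unknown_fault"]
print(triage("unknown_fault"))
print("80 percent goal met:", front_line_target_met(incidents))
```

In this model, every resolution procedure that specialists add to the knowledge base moves one more incident type from the escalation path to the front line, which is how documentation raises the front-line resolution rate.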
Messaging Engineering Team Functions
The Messaging Engineering team within the Exchange Messaging team designs the messaging systems in the corporate production environment. This broad goal includes many complementary tasks, such as interacting with the developers in the Exchange Server product group, analyzing the performance and scalability of server designs, evaluating technology, and performing research to accomplish these tasks. To design and deploy the messaging environment, the Messaging Engineering team verifies the recommended system parameters and configuration options set by the Exchange Server product group as well as the initial performance and configuration recommendations from the pre-release production environment. As part of designing the corporate production environment, the Messaging Engineering team also creates and maintains documentation that details overall environment design aspects, messaging topology, server specifications, and Exchange Server 2007 configuration settings.
Note: The Messaging Engineering team does not design the pre-release production environment. That design evolves from recommendations from the Exchange Server product group. However, it is the task of the Leads team in the Messaging Operations group within the Exchange Messaging team to deploy and operate the pre-release production environment. In this way, the Messaging Operations group verifies performance and functionality based on the default settings before making any customizations and changes for settings in the production environment.
The Messaging Engineering team designs the corporate production environment. Its design is based on the results of capacity planning; enterprise design and architecture practices; the results of lab evaluations and testing; and proven and verified results from the pre-release production environment. The latter entails collaborating with the product group to transfer knowledge such as the configuration and hardware settings, deployment steps, and best practices. Because of this close collaboration, members of the Messaging Operations group participate in many engineering projects and even get the chance to own some aspects of the corporate production environment design. By gaining real-world experience in collaboration with the Messaging Engineering team, members of the Messaging Operations group can move on with their careers as messaging engineers.
Service Management Functions
The Service Manager role within the Exchange Management team owns the end-to-end, user-focused aspects of the messaging environment. The Service Manager's primary responsibility is to serve as the interface between users and the Messaging Operations group. This entails responding to user questions, performing weekly service reviews, and communicating with regional Microsoft IT managers about any scheduled and unscheduled outages. Service managers take user requests for changes, and oversee service improvement in messaging to meet SLAs. Additionally, service managers are responsible for the TCO of their service.
Program Manager Functions
The Project Manager role within the Exchange Management team owns the physical hardware deployment for the messaging environment. The Project Manager's primary responsibility is to serve as the interface between the Exchange Server product group and the Messaging Operations group. By acting as the intermediary, a project manager oversees feature and product implementation across all dependent teams. To accomplish this, a project manager attends weekly meetings with developers, coordinates schedules for corporate production environment deployment, and owns the overall hardware budgeting and maintenance tasks. This role is the strategic arm of the Messaging Operations group.
Messaging Operations Group
The Messaging Operations group is responsible for the dual goals of product improvement and providing highly reliable and available messaging services. The Messaging Operations group works very closely with the Exchange Server product group, and runs a pre-release production environment dedicated to trying new builds, verifying functionality, and discovering improvement opportunities before the Exchange Server product is released to manufacturing. The in-depth knowledge gained from this close collaboration enables the Messaging Operations group to create thorough incident response documentation for front-line Monitoring group operators, discover problem root causes, and oversee changes to the environment and product releases.
As the group manager for the Messaging Operations group, Gary Baxter is responsible for two teams that provide end-to-end operations for all mailboxes in the corporate production infrastructure, including unified messaging. In addition to his many responsibilities, Gary must also meet SLAs and decrease the cost of supporting and operating the environment through process improvement. Gary carries out the following specific tasks:
- Enabling customers Gary faces customers inside Microsoft and in the messaging community at large. The responsibilities vary with each customer, based on specific needs. For example, he is responsible for providing technical guidance, supporting findings, and providing process improvement assistance. Gary regularly interacts with the following peers:
  - Service and project managers Gary enables service managers by providing best practices and process engineering for messaging operations, which in turn enables other managers to run their respective services more efficiently.
  - Exchange Server product group Product improvement and validation is the primary goal for the corporate production environment. Gary's teams provide the Exchange Server product group with incident results to change the Exchange Server code during beta time frames. Additionally, Gary's team creates configuration templates that include the desired configuration of servers running Exchange Server 2007. These configuration templates enable the Exchange Server product group to provide final code with Exchange Server 2007 optimized with best practices and settings.
  - Sales, marketing, and product support teams Gary provides expert advice and documentation to assist sales, marketing, and support activities. For example, Gary participates in industry conferences such as TechEd and internal conferences such as TechReady, and helps to create documentation about how Microsoft does IT.
  - Microsoft IT service teams Gary assists peer service and engineering teams by providing and supporting environments for effective pre-release production verification and scenario validation of a variety of products, such as Exchange Server, Windows Server, ISA Server, and Microsoft Retail Management System.
  - Industry customers Gary also assumes responsibilities concerning Microsoft customers. For example, Gary is responsible for demonstrating product readiness to Microsoft customers by running the corporate messaging environment to mission-critical standards. According to the Microsoft IT vision of being the first and best customer, the Messaging Operations group also helps the Exchange Server product group proactively identify and address potential issues before external Microsoft customers notice these issues.
- Overseeing efficiencies in IT processes At Microsoft IT, the Messaging Operations group sets the process optimization standard for other service groups. Gary is responsible for making his team more efficient, and also for condensing service-agnostic process improvements and sharing them with other teams.
- Ensuring SLA compliance Gary is also responsible for messaging SLAs. Although multi-team contributions establish SLAs at both technical and managerial levels, Gary develops and executes strategies to ensure that the combination of people, process, and technology meets SLA targets in the messaging environment.
- Managing people Gary works with engineering and other teams to provide his group members with additional cross-team projects. In this way, Gary develops the skills of his team members, making it possible for them to advance in their careers.
- Driving down the cost of support and operations Gary is responsible for reducing the cost of support and operations by ensuring process efficiency, such as setting a target of 80 percent incident resolution by the Monitoring group. To accomplish this goal, the Messaging Operations group develops and provides the necessary tools and resources that enable the Monitoring group to reach this performance target.
- Designing process intelligence Gary's challenge in creating processes also includes ensuring that they can be re-used, automated, and easily modified to deal with changes. This is especially important in the MMS environment because customers have different needs, yet the processes for operating the messaging environment must be uniform.
Although all members of the Exchange Management team share the common goals of ensuring performance, availability, and reliability, the team structure enables members to specialize in a specific aspect of Exchange Server 2007 to accomplish the overall goals. The modular design of Exchange Server 2007, based on server roles, promotes specialization, yet an efficient IT organization also requires experts with a broader scope. Accordingly, Gary structures his teams with different specialization focuses to provide maximum operational process efficiency and to enable prompt incident handling, problem resolution, and changes to the environment.
There are other advantages to using teams with separate specialization focuses, such as reducing costs via documentation and balancing workload across specialists. A key part of reducing operation costs involves systematizing the knowledge of experts and creating documentation so a generalist can use it for messaging operations. However, creating thorough documentation that everyone can understand requires in-depth knowledge of Exchange Server 2007 and operations, which specialization provides.
The Messaging Operations group consists of two teams, led by Ryan McDonald and Jim Leigh, respectively, which perform specialized, yet sometimes overlapping tasks. The broad Exchange Server product knowledge that the members of each team possess enables team members to help each other when there is an unexpected workload, such as during an unscheduled outage. The two teams perform the following specific functions:
- Tier 3 Incident Management team This is Jim Leigh's team. It is composed of Exchange Server specialists who focus on a technology, rather than a specific server role, to handle any unresolved issues passed on to them by the Monitoring group. Team members also possess knowledge of IT operations and server administration. Their responsibilities include incident handling, escalation, and change management. The members of this team also manage ancillary services such as Forefront and Public Folders. Jim's team works closely with the Leads team in its specific areas of expertise to gain new knowledge, support career growth, and provide as-needed resources.
- Leads team This is Ryan McDonald's team. It represents the last place for issue resolution. If the Tier 3 Incident Management team cannot resolve an issue, the issue escalates and Ryan's team resolves it. The team consists of six people who all specialize in a specific server role and work on operational projects such as automation work and scripting work. Two people specialize in Mailbox servers, and the remaining four cover Client Access, Hub/Edge Transport, and Unified Messaging (UM) roles. Ryan's team is the strategic team in messaging operations that handles deployment and feature sets, including feature validation and new product functionality verification. The Leads team ensures operational efficiency; manages service deployment readiness, readiness and fine-tuning of the Microsoft Operations Manager Exchange Management Pack, operational training for the team, and documentation for operational staff; and works with the product group to ensure the high quality of the product. Additionally, the team works closely with the Messaging Engineering team to transfer knowledge learned during product and feature verification. Ryan's team is specifically responsible for performing deployment consultation and automation work for new deployments in the MMS environment.
The Leads team performs additional operations work that is not 100 percent related to the Exchange Server product. For example, team members act as configuration management advisors for Desired Configuration Monitoring (DCM), serve as change approvers for each service, and interact with customers by presenting in the Computer Information Technology/Foundation (CITF) program.
Note: It is the task of the Messaging Operations group to resolve all issues escalated by the Monitoring group within Shared Services. The Messaging Operations group involves the Exchange Server product group when dealing with very complex issues.
"By meeting stringent SLAs that demand 99.99 percent availability across the entire corporate production environment, we prove to our customers the enterprise readiness of Exchange Server 2007. It takes the right mix of technology, processes, and people to achieve this high level of availability. Exchange Server 2007 provides the technological foundation, and tried and proven operational processes, driven by MOF principles, make it possible for us to have the right people do the right job at the right time in order to achieve high availability at Microsoft in a repeatable and measurable way."
Gary Baxter Group Manager
Microsoft IT pursues an end-to-end approach to SLAs, which means being responsible not only for meeting the SLAs, but also for all the services and components that contribute to them. For example, if a network connectivity issue prevents users from accessing their mailboxes, Microsoft IT considers the messaging service unavailable even if all Exchange Server 2007 servers are up and running, because the end-user experience is an unavailable messaging service. Exchange Server 2007 integrates tightly with Active Directory and depends on the TCP/IP network infrastructure and other technologies to deliver its features, so the Exchange Messaging team becomes aware of incidents that are not specific to Exchange Server 2007. In fact, many teams at Microsoft IT, such as the Windows Server team and the Active Directory team, proactively report incidents to the Exchange Messaging team before anyone else does.
End-to-end operations provide Microsoft IT with many advantages over the previous server-centric approach. With end-to-end operations, incidents are resolved faster because teams own their incident tickets until an incident is resolved; processes are more flexible; costs are reduced through selective usage of specialists; and the overall performance of the Microsoft IT organization becomes the shared responsibility of all teams. The Exchange Messaging team is ultimately a heavy user of the underlying infrastructure, such as Active Directory, DNS, and firewalls, and it can only meet its SLAs if the other teams also meet theirs. At the core of no-excuses SLAs are individual teams that are responsible for specific areas but share overall accountability.
The no-excuses SLA policy came about when Microsoft IT management examined its organizational hierarchy and realized that users only see an outage, and not its causes. From a user's point of view, if a service is unavailable the user is witnessing a service outage. The cause of the outage may be an issue with the TCP/IP network, telecommunications provider, or underlying Active Directory infrastructure. None of these causes is Exchange-specific, yet each causes a messaging service outage, which counts against the availability SLA. With the no-excuses SLA, the source of the issue does not matter; the Exchange Messaging team owns the incident and its resolution, and has the responsibility to introduce changes to prevent issue recurrence.
The ambitious SLAs that the Exchange Messaging teams set, as discussed in the next section, "Performance Goals," are not strictly necessary to meet business needs and user expectations at Microsoft. Historically, Microsoft IT did not always have the 99.99 percent availability SLA, and the business still functioned profitably even at the performance level of 99.9 percent availability.
The Exchange Messaging team moved to SLAs that are more aggressive as a way to push the envelope and prove the possibilities with Exchange Server for even the most demanding customers. For many years previously, the team had gathered performance statistics and reviewed them weekly as scorecards that listed the SLAs and performance. The scorecards had a green indicator for met SLAs and a red indicator for unmet SLAs. According to Gary, there were too many green lights on these scorecards prior to the move to end-to-end operations and the adoption of the no-excuses SLA. All categories showed green with 99.9 percent targets. To avoid complacency and to foster an attitude of continuous performance improvement, the Exchange Messaging team moved to more rigorous SLAs that challenged all team members and promoted exceptional performance.
The transition to end-to-end operations developed in parallel with a transition of operating the Microsoft IT environment with all Microsoft customers in mind, not just Microsoft internal users. As already mentioned, the Messaging Operations group manages the pre-release production environment, which helps to verify performance and functionality in a real-world setting. Demonstrating product performance and readiness for even the most demanding customers is one key mission of Microsoft IT. Communicating feedback and requirements from external customers back to the Exchange Server product group is another.
Note: In addition to operating a pre-release production environment, Microsoft runs beta testing programs, and programs with partners where pre-release code is deployed in partner IT environments. This focus on partners means that Microsoft IT does not just look to its internal needs, but is always mindful of the needs of external customers.
With each product release, the Messaging Operations group has spent increasingly more time running Exchange Server in pre-release production environments. The pre-release verification period for Exchange 2000 Server started three weeks before the RTM date; the pre-release verification period for Exchange Server 2003 lasted six months. In comparison, the Messaging Operations group used the beta versions of Exchange Server 2007 for more than 22 months before the product shipped. In fact, the Exchange Server product group does not ship product versions or service packs until the Messaging Operations group signs off on enterprise readiness. The release criteria state that the Messaging Operations group must demonstrate 99.99 percent availability for at least three weeks. The high-availability technology of Exchange Server 2007 and real-time monitoring tools such as Microsoft Operations Manager, as well as motivated people, clear communication paths, and efficient operations processes, enabled the Exchange Messaging team to deploy Exchange Server 2007 at full scale throughout the corporate production environment prior to RTM. The Messaging Operations group was able to report to the Exchange Server product group the achievement of the high-availability sign-off criteria across the entire corporate production environment on November 26, 2006. Exchange Server 2007 shipped on December 7, 2006.
Microsoft has an e-mail-centric culture. In a typical week, there are over 30 million messages processed by Hub Transport servers for internal and Internet messages, and over 14 million ActiveSync connections from mobile devices. Additionally, the environment supports trusted partner connections, multiple forests, and the global presence of Microsoft. Therefore, performance optimization and improvement is a vital task directly related to SLAs.
After setting overall organization-wide SLAs, the Messaging Operations group analyzed the various dependencies and developed targeted, team-specific SLAs. These team-specific SLAs yield a finer granularity in controlling and gathering statistics about outages for reports, which enables more accurate self-assessment and trend analysis. Although the team-specific SLA targets are not as strict, they enable a closer inspection of the environment and ensure achievement of the organization-wide SLAs.
Organization-wide SLAs represent broad performance goals in the Microsoft IT messaging environment. These SLAs represent a commitment to users, customers, and the Exchange Server product group that Exchange Server 2007 can deliver mission-critical results. The SLAs cover the important messaging aspects, such as delivery times and availability. More specifically, Microsoft IT defined the following SLAs:
- Delivery Delivery of 99 percent of all internal messages to their final destination must take 90 seconds or less.
- Availability Overall availability of messaging services must be 99.99 percent or greater.
- Business continuance Messaging service will be restored, and business will continue, in one hour or less.
- Deleted items retention There is a 14-day minimum retention of deleted items.
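To make the availability and delivery targets above concrete, the following minimal sketch converts an availability percentage into a downtime budget and checks a list of delivery times against the 90-second rule. It is an illustration only, not a tool that Microsoft IT describes; the function names and the sample inputs are assumptions.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_percent: float, period_days: float) -> float:
    """Return the minutes of downtime allowed in period_days at the given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * MINUTES_PER_DAY


def meets_delivery_sla(delivery_seconds, limit_seconds=90.0, required_fraction=0.99):
    """Return True if at least required_fraction of messages arrived within limit_seconds."""
    if not delivery_seconds:
        return True  # no traffic in the sample, so nothing violates the target
    on_time = sum(1 for seconds in delivery_seconds if seconds <= limit_seconds)
    return on_time / len(delivery_seconds) >= required_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99):
        budget = downtime_budget_minutes(target, 365)
        print(f"{target} percent over 365 days: {budget:.1f} minutes of downtime")
    three_weeks = downtime_budget_minutes(99.99, 21)
    print(f"99.99 percent over 21 days: {three_weeks:.2f} minutes of downtime")
```

At 99.99 percent, the budget is roughly 52.6 minutes per year, compared with about 525.6 minutes at 99.9 percent. Over a three-week window, 99.99 percent allows only about three minutes of downtime.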
Team-Specific SLAs and Metrics
Team-specific SLAs and metrics focus on specific server roles, technologies, physical locations, or similar criteria to provide a convenient means for reporting and analysis. These SLAs and metrics address not only the technical aspects, but also "soft" factors such as user satisfaction. The Messaging Operations group tracks the following SLAs and metrics:
- Core e-mail: weighted overall mailbox availability
- Core e-mail: weighted unplanned mailbox availability
- Client availability
- Client performance
- Mobile Messaging: overall availability
- Mobile Messaging: unplanned availability
- Unified Messaging: overall availability
- Unified Messaging: unplanned availability
- Fax: overall availability
- Fax: unplanned availability
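The paper does not define how the "weighted" mailbox availability metrics are calculated. One plausible reading, sketched below purely as an assumption, weights each server's downtime by the number of mailboxes it hosts, and treats the "unplanned" variant as the same calculation with planned maintenance excluded. The `ServerDowntime` type and all sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerDowntime:
    mailboxes: int            # mailboxes hosted on the server (assumed weighting factor)
    planned_minutes: float    # downtime from scheduled maintenance
    unplanned_minutes: float  # downtime from incidents


def weighted_availability(servers, period_minutes, include_planned=True):
    """Return mailbox-weighted availability in percent for the reporting period."""
    total_mailbox_minutes = sum(s.mailboxes for s in servers) * period_minutes
    if total_mailbox_minutes <= 0:
        raise ValueError("need at least one mailbox and a positive period")
    lost_mailbox_minutes = sum(
        s.mailboxes * (s.unplanned_minutes + (s.planned_minutes if include_planned else 0.0))
        for s in servers
    )
    return 100.0 * (1.0 - lost_mailbox_minutes / total_mailbox_minutes)


if __name__ == "__main__":
    month = 30 * 24 * 60  # a 30-day reporting period in minutes
    sample = [ServerDowntime(3000, 30.0, 0.0), ServerDowntime(1000, 0.0, 10.0)]
    print(f"overall:   {weighted_availability(sample, month):.4f} percent")
    print(f"unplanned: {weighted_availability(sample, month, include_planned=False):.4f} percent")
```

Under this assumed weighting, an outage on a server with many mailboxes lowers the metric more than the same outage on a lightly loaded server.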
"Structured incident handling processes as well as tools and documentation from my team equip front-line operators with the necessary guidance to resolve over 80 percent of messaging incidents immediately. This frees my team to focus on specialized projects that drive process improvement. New Exchange Server 2007 features, such as Exchange Management Shell, enable us to develop efficient tools that front-line operators can use without requiring detailed product knowledge to achieve consistent incident resolution."
The Microsoft IT environment is global and has regional IT teams that are responsible for managing the site-specific hardware. Overall, more than 6,000 IT experts work at Microsoft IT; 50 percent of those IT experts are vendors. Running this enterprise IT organization requires established workflow processes, communication paths, and coordination to respond to incidents and resolve them in a timely manner.
The Messaging Operations group is a leader within Microsoft IT in overseeing cost-cutting and process-improvement measures for incident management. The group accomplishes this by decreasing the workload on specialists and transferring their knowledge to the front-line Monitoring group operators who respond to incidents. For Microsoft IT, this means that the Messaging Operations group can focus specialist resources on more involved processes.
Incident management provides the following advantages:
- Increased specialization opportunities Part of the method of increasing efficiency and lowering costs is to create specialists within a specific body of messaging knowledge. However, using specialists to solve all operational issues can be expensive. To maintain cost-efficiency, there also must be people who can take over some of the systematic aspects of operations, such as responding to an incident and following prescribed and documented resolution steps. With the Shared Services team acting as front-line monitoring operators for multiple services, each service group can develop service specialists.
- Fast response by front-line Monitoring group operators Front-line monitoring operators work in a 24-hour, 7-days-per-week datacenter where an operator watches the monitoring screen at all times. The Messaging Operations group takes this very seriously: if an operator wants a break, there must be another person monitoring the console. During incident reviews, one of the aspects of the review involves verifying that an operator was indeed present and watching the monitors when the incident occurred.
- Uniform and standardized handling of incidents With scripted and prescribed resolution steps that are tested and verified, front-line monitoring operators can follow identical resolution paths no matter the level of personal experience or expertise.
- Decreased support requirements for product experts If specialists transfer their knowledge of how to resolve incidents to front-line monitoring operators who do not have deep, Exchange-specific product knowledge, then the specialists can focus their energies on other tasks. To accomplish this, the specialists in the Messaging Operations group must perform the knowledge transfer and documentation tasks so that front-line monitoring operators have clear instructions for how to resolve Exchange-specific incidents.
- Measurable results By separating the overall operational processes according to individual components, the Messaging Operations group can measure each component to gather performance statistics. Having accurate statistics is important because it enables management to have an accurate picture of the environment, and they can therefore spot trends or process inefficiencies. Additionally, assigning primary incident response work to the front-line operators is more cost-efficient than having senior-level specialists resolve incidents.
- Focus on product validation Well-defined processes and roles help the Messaging Operations group focus on the dual goals of maintaining high standards for the messaging environment and providing product validation to the Exchange Server product group and Microsoft customers. By freeing up specialist resources to work in the pre-release production environment, the Messaging Operations group can devote more time to checking features and functionality of beta builds before the product reaches the marketplace. This results in the Messaging Operations group identifying over 90 percent of product issues in beta code before anyone else does.
The Messaging Operations group follows a structured framework for dealing with and resolving incidents. The people involved in responding to and resolving incidents follow a scripted series of processes. As discussed below, the life cycle of an incident involves both front-line monitoring operators from the Shared Services team and members of the Messaging Operations group, with defined roles, processes, and tools used from the initial response to final resolution. The teams go through the following incident life cycle:
- Awareness/notification Front-line monitoring operators become aware of incidents via several sources. Most incidents originate with Microsoft Operations Manager, which acts as the monitoring and detection system. Microsoft Operations Manager includes rules to check the status of thousands of individual factors, such as queue length and service status, in the Exchange organization at various levels of depth. Microsoft Operations Manager accomplishes this via an Exchange Management Pack, which includes thousands of rules specifically for monitoring Exchange Server 2007. For example, there are rules to verify that the server is up through a heartbeat ping, and there are rules to perform synthetic mailbox logons to verify that transport and delivery mechanisms function as expected. The Messaging Operations group customizes the Exchange Management Pack by modifying rule settings and alert triggers, as the examples in Table 1 show. Another way the Monitoring group becomes aware of incidents is via users who report to the Helpdesk. While the Helpdesk resolves most user issues, some incidents require escalation to the front-line monitoring operators. If they cannot resolve an incident, then it is escalated to the Messaging Operations group.
Note: Because Microsoft Operations Manager delivers alerts in real time, and because alerts are proactive, front-line monitoring operators can resolve most incidents before users ever report them to Helpdesk. The Messaging Operations group resolves most of the incidents related to Exchange S
- Response Microsoft IT uses an incident tracking system that integrates with Microsoft Operations Manager. Alerts generated by rules in the Exchange Management Pack also automatically create a ticket in the tracking database. The alerts include resolution guidance from two knowledge bases: the default knowledge base that comes with the Exchange Management Pack, and a messaging-specific knowledge base. The Exchange Messaging team created the messaging-specific knowledge base to gather very detailed information about incidents, in order to help with product improvement. A goal of the Messaging Operations group is to create scripted resolution guidance detailed enough for any front-line monitoring operator to follow the procedure and resolve the incident. To clarify resolution procedures, members of the Messaging Operations group routinely update the knowledge base with the latest guidance, based on their experiences.
- Management As already mentioned, Microsoft Operations Manager and the incident tracking system provide a way for operations personnel to view details about incidents, such as incident type, occurrence, existing knowledge for resolution guidance, and status. The tracking system enables front-line operators to escalate incidents for resolution if the knowledge base instructions do not resolve an incident. In addition to these tools, the Exchange Messaging team uses OpsWeb, an internal line-of-business (LOB) application that is available to the Helpdesk for viewing tickets, grouping them in selected views, appending Helpdesk tickets to a master ticket, and checking for existing issues in order to avoid repeat escalations.
- Resolution The Messaging Operations group seeks to resolve at least 80 percent of incidents at the Shared Services team level by the Monitoring group, and at least 90 percent of the remaining incidents at Jim Leigh's Tier 3 Incident Management team. After resolving an incident, the members of the Monitoring group mark the ticket as complete in the ticket tracking system and archive tickets older than three months. Only the most difficult incidents, or those flagged for further investigation and detailed root cause analysis, reach Ryan McDonald's Leads team. If the Messaging Operations group does not resolve an incident, then the incident further escalates to the Exchange Server product group, which has additional resources such as developers who can perform a live debug. During debugging, developers examine memory dumps to check for causes. Incidents that require additional research, product updates, or a major change generate another ticket in the development database as a change request. In this way, Microsoft IT helps the Exchange Server product group by providing developers with a real-world environment for deep debugging and detecting product issues that require code changes.
- Review As part of incident management within the incident life cycle, the Messaging Operations group performs monthly reviews of the incidents. The purpose of the review is to identify repeat incidents or trends of incident types for closer inspection. After identifying an incident for closer inspection, members of the Messaging Operations group analyze it to determine whether the incident is indicative of a larger underlying problem. At this point, the Messaging Operations group may decide to investigate it further and determine the root cause by using problem-handling processes, as discussed in the next section, "Problem Handling."
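The resolution targets in the life cycle above imply how few incidents should reach each later tier. The following sketch works through that arithmetic; the function name and the sample volume of 1,000 incidents are assumptions for illustration, and the rates are the "at least" targets stated in the Resolution step.

```python
def escalation_funnel(total_incidents, tier_resolution_rates):
    """Apply each tier's resolution rate to the incidents that reach it.

    Returns (resolved_per_tier, remaining), where remaining is the number of
    incidents that escalate past the last tier in the list.
    """
    remaining = float(total_incidents)
    resolved_per_tier = []
    for rate in tier_resolution_rates:
        resolved = remaining * rate
        resolved_per_tier.append(resolved)
        remaining -= resolved
    return resolved_per_tier, remaining


if __name__ == "__main__":
    # 80 percent at the Monitoring group, then 90 percent of the rest at Tier 3.
    resolved, to_leads = escalation_funnel(1000, [0.80, 0.90])
    print(f"Monitoring group: {resolved[0]:.0f}")
    print(f"Tier 3 Incident Management: {resolved[1]:.0f}")
    print(f"Escalated to the Leads team: {to_leads:.0f}")
```

If both targets are just met, about 2 percent of all incidents (20 of every 1,000) reach the Leads team; exceeding either target lowers that share further.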
Table 1. Exchange Management Pack Customizations

| Rule | Description | Customization |
|---|---|---|
| | Number of messages in transport queue. | Alert if there are more than 250 messages, which Exchange delivers in approximately 5 seconds. |
| Outlook Web Access connections. | Number of users connecting using Outlook Web Access. | This is a threshold rule. Microsoft IT decreases the threshold to see if decreased connections indicate users cannot connect. |
| Submission Queue Length, sustained for 5 minutes on Hub Transport. | Length of submission queue on Hub Transport servers. | Red alert if queue size is greater than 250. |
| Retry Mailbox Delivery Queue Length, sustained for 5 minutes on Hub Transport. | Size of retry queue on Hub Transport servers. | Changed to alert if queue size is greater than 250. |
| Largest Delivery Queue Length, sustained for 5 minutes on Hub Transport. | Largest length of delivery queue on Hub Transport servers. | Changed to alert if queue size is greater than 250. |
| Aggregate Delivery Queue Length (all queues), sustained for 5 minutes on Hub Transport servers. | Length of all queues on Hub Transport servers. | Changed to alert if queue size is greater than 500. |
| CCR Service Verification Script and Exchange 2007 CCR Service Verification. | This is specific to checking that the CCR service is running on both nodes. | Alert if Service State is stopped. |
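Several rules in Table 1 fire only when a queue length stays above its threshold for five minutes. The following sketch shows one way such a sustained-threshold check can work. It is not how Microsoft Operations Manager or the Exchange Management Pack implements these rules; the function name and the sample data are assumptions.

```python
def sustained_breach(samples, threshold, sustain_seconds=300):
    """Return True if the value stays above threshold for at least sustain_seconds.

    samples is a list of (timestamp_seconds, value) pairs sorted by timestamp.
    Any sample at or below the threshold resets the timer.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of a new breach
            if timestamp - breach_start >= sustain_seconds:
                return True
        else:
            breach_start = None  # queue drained, so start over
    return False


if __name__ == "__main__":
    # One sample per minute; submission queue threshold of 250 from Table 1.
    steady = [(minute * 60, 300) for minute in range(6)]
    dipped = [(0, 300), (60, 300), (120, 300), (180, 100), (240, 300), (300, 300)]
    print(sustained_breach(steady, threshold=250))
    print(sustained_breach(dipped, threshold=250))
```

Requiring the breach to persist filters out short spikes, such as a burst of messages that the Hub Transport servers clear within seconds, so operators see alerts only for queues that are not draining.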
Whereas Microsoft IT incident management deals with restoring service as quickly as possible, problem handling deals with minimizing the impact of an incident and preventing recurring incidents by seeking to discover incident root causes. The problem-handling discipline focuses on the resolution of the underlying causes of the problem, rather than the speed of the resolution.
"Strict operational, proactive monitoring, and problem management processes enable us to solve many incidents before users ever become aware of a service issue. We prevent incident recurrence by systematically determining the root cause of incidents using problem management techniques and introducing timely changes to the messaging environment. By using Microsoft Operations Manager as the enterprise-monitoring tool, we provide front-line operators with proactive alerting accompanied by the latest resolution steps. Our problem-handling processes are repeatable in any Exchange enterprise environment because they are based on industry-standard operations frameworks."
For the Messaging Operations group, problem handling involves helping the Exchange Server product group ship the best software possible. This means not only solving issues as they arise, but also finding the root cause of an issue, documenting it, and making sure that there is either a published workaround or a permanent change in the product that addresses the issue. Problem handling fosters change because it takes into account the people, processes, and technology involved in a particular incident. Evaluating how an incident arose, how it was resolved, and digging down to the root cause means considering what contributed to the problem: people, processes, technology, or a mix of these factors.
The Messaging Operations group uses three environments in its problem-handling processes:
- Pre-release This environment provides the Messaging Operations group with great flexibility in determining incident root causes because it is set up for the express purpose of product improvement and validation. Therefore, meeting rigorous SLAs and resolving incidents as quickly as possible is secondary to determining an incident's root cause, working with developers to replicate and understand product behaviors, and trying workarounds or product updates to rectify an incident. In this environment, it is acceptable to take longer to analyze and resolve an incident, because the analysis should determine the root cause of the incident.
- Corporate production environment In the corporate production environment, problem handling complements incident management by finding, if possible, the root cause of an incident. The Messaging Operations group allows extended downtime only if an incident is not reproducible in another environment, so that developers can implement necessary fixes in the product code before Microsoft releases Exchange Server 2007 to customers. Because of the rigorous SLAs, and because the Messaging Operations group must demonstrate product readiness, the group documents the settings and configurations that led to an incident, in order to use that information when working to discover the root cause of the incident.
- Microsoft Managed Services The Messaging Operations group maintains this environment with the goal of restoring services as quickly as possible. Availability is very important because Microsoft has contractual agreements with clients to provide a specified level of service. This environment includes a dedicated Centralized Infrastructure team that informs the customer of tickets related to problem handling. However, the Messaging Operations group creates, manages, and owns the tickets. To accommodate problem handling, analysis, and change management in the MMS environment, there are scheduled change windows available every other Saturday.
The Messaging Operations group selects which incidents to investigate based on two factors: a list of incident types that require mandatory investigation and as-needed inquiries for remaining incidents. The list of events that require the creation of a problem ticket includes serious incidents, such as a queue size of more than 10,000 messages, UM service disruption, server outage, and so on. The Messaging Operations group creates a problem ticket for any incident that severely affects any service they support.
To select other tickets for investigation, the Messaging Operations group conducts weekly incident review meetings to discuss all outstanding problem tickets and review incidents from the previous week to determine whether any require the opening of a new problem ticket. The most important criteria used to select as-needed incidents for further investigation are incident frequency and trends. When the same type of incident repeatedly occurs, it often signals an underlying problem that is not isolated to just a few servers. Trend analysis helps to evaluate frequency over a longer period to help determine whether to further investigate some incidents.
After narrowing down the list of incidents to investigate, the Messaging Operations group performs a sanity check on the incident summary to ensure that all the necessary components are present to open a problem ticket. For example, the incident report must include a full set of notes that document the incident through every step, from initial alert to final resolution. Without these components, the Messaging Operations group cannot select what to investigate because there is not enough data. After opening a ticket, the Messaging Operations group uses its tools to assign the ticket to a team member, who is responsible for following up at least once a week until the issue is resolved.
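The selection logic described above (a mandatory list, as-needed selection by frequency, and a completeness check on the incident notes) can be sketched as follows. This is an illustration only: the incident field names, the type labels, and the repeat threshold of three are assumptions, because the paper does not state how many recurrences trigger an investigation.

```python
from collections import Counter

# Hypothetical labels for incident types on the mandatory-investigation list.
MANDATORY_TYPES = {"um_service_disruption", "server_outage"}
QUEUE_SIZE_LIMIT = 10_000  # queue sizes above this always require a problem ticket


def select_for_problem_ticket(incidents, repeat_threshold=3):
    """Return the ids of incidents that qualify for a problem ticket.

    Each incident is a dict with "id" and "type", plus optional "queue_size"
    and "notes_complete" keys. Incidents without complete notes are skipped.
    """
    type_counts = Counter(incident["type"] for incident in incidents)
    selected = []
    for incident in incidents:
        mandatory = (
            incident["type"] in MANDATORY_TYPES
            or incident.get("queue_size", 0) > QUEUE_SIZE_LIMIT
        )
        recurring = type_counts[incident["type"]] >= repeat_threshold
        if (mandatory or recurring) and incident.get("notes_complete", False):
            selected.append(incident["id"])
    return selected


if __name__ == "__main__":
    week = [
        {"id": 1, "type": "server_outage", "notes_complete": True},
        {"id": 2, "type": "queue_backlog", "queue_size": 12_500, "notes_complete": True},
        {"id": 3, "type": "owa_logon_failure", "notes_complete": True},
        {"id": 4, "type": "owa_logon_failure", "notes_complete": True},
        {"id": 5, "type": "owa_logon_failure", "notes_complete": False},
        {"id": 6, "type": "disk_warning", "notes_complete": True},
    ]
    print(select_for_problem_ticket(week))
```

In the sample week, the outage and the oversized queue qualify through the mandatory list, two of the three repeated logon failures qualify through frequency, and the third is held back because its notes are incomplete.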
The Messaging Operations group maintains its own database tool to create and manage problem tickets. In addition, the team uses a custom Office SharePoint Server 2007 site for problem ticket review. These tools track progress, help assign resources to a problem, and facilitate managing problem status. Figure 3 shows a sample problem ticket in the major problem review (MPR) tool:
Figure 3. MPR ticket
Problem management also includes problem review and metrics. After finding the root cause of an issue, the team member responsible for handling the issue must create a corresponding entry in the knowledge base or similar documentation within seven days of resolution of the issue. Through experience, the Messaging Operations group discovered that if a problem occurs once on a specific server or group of servers, it often recurs with other servers. Therefore, it is most efficient to provide comprehensive resolution steps for front-line monitoring operators as soon as possible. The Messaging Operations group reviews problems at weekly meetings to provide status updates and to open new problem tickets.
At monthly meetings, the Messaging Operations group tracks metrics and trends, which presents an opportunity to view a scorecard of statistics and trends for the month. The scorecard consists of various MPR data viewed through pivot tables, including the number of tickets opened and the number of tickets resolved, with the root cause determined.
The Messaging Operations group uses configuration management processes as an opportunity to identify, record, and report on configuration items. These processes enable the Messaging Operations group to control the messaging infrastructure by maintaining information on the resources required to deliver messaging services.
The Messaging Operations group uses the following processes for configuration management:
- Configuration item identification Many software and hardware settings affect messaging services. Configuration items come from Windows® Registry, Active Directory, Windows Management Instrumentation (WMI) providers, the Internet Information Services (IIS) metabase, and the file system. These sources specify settings relevant to messaging, such as specific registry keys and network settings. Microsoft IT maintains a Configuration Management Database (CMDB) in the form of Excel spreadsheets for specific configuration data. Microsoft IT modifies this information according to performance measures and best practices.
- Baseline configuration template creation Microsoft IT analyzes the configuration items in the CMDB Excel spreadsheets to create baseline configuration templates. These templates vary for each server role and hardware type. They enable Microsoft IT to document standard collections of settings and rapidly make changes across multiple servers. Microsoft Systems Management Server (SMS) includes the capability to define templates and deploy them to servers, as well as to run audits that check for settings not in compliance with the configuration templates. Combined with Microsoft Operations Manager, SMS also enables front-line monitoring operators to check for noncompliance. Front-line operators take action proactively upon discovering configuration noncompliance, and thereby avoid service interruptions.
- Configuration item and template management Microsoft IT uses Microsoft Systems Management Server (SMS) Desired Configuration Monitoring (DCM) 2.0 to audit the configuration settings based on template definitions. Based on SMS and the Microsoft Exchange Server Best Practices Analyzer (ExBPA), SMS DCM 2.0 compiles reports that can help detect noncompliant configuration settings. The Messaging Operations group configures SMS DCM 2.0 to audit template compliance every eight hours. In the event of noncompliance, Microsoft Operations Manager raises an alert to the front-line operators, who then respond to the incident.
- Risk analysis of changes and releases Prior to implementing changes, product updates, and new releases, the Messaging Operations group considers the impact of the changes. The Messaging Operations group notes the required changes to configuration items and maintains rollback procedures and backup templates to ensure that SMS applies the correct template version regardless of changes to the messaging environment.
- Best practices industry knowledge sharing The Messaging Operations group continuously shares its knowledge with Microsoft customers and consultants. Microsoft consultants who design and support solutions for clients use the knowledge from the Messaging Operations group to customize configuration templates for use in client messaging environments. Additionally, when the Messaging Operations group makes changes to configuration items that are relevant to other customers, these specific changes flow into ExBPA. In this way, customers have access to the latest best practices configurations from real-world operations of an enterprise-messaging environment.
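The audit step described above, comparing each server's settings with the baseline template for its role, can be illustrated with a minimal sketch. This is not SMS DCM 2.0 or its template format; the setting names, server names, and values below are hypothetical and serve only to show the comparison logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One noncompliant configuration item on one server."""
    server: str
    item: str
    expected: object
    actual: object


def audit_server(server, settings, template):
    """Compare a server's settings with its baseline template.

    An item missing from the server counts as noncompliant (actual is None).
    """
    findings = []
    for item, expected in sorted(template.items()):
        actual = settings.get(item)
        if actual != expected:
            findings.append(Finding(server, item, expected, actual))
    return findings


def audit_all(collected, template):
    """Audit every server and return all findings in server-name order."""
    findings = []
    for server in sorted(collected):
        findings.extend(audit_server(server, collected[server], template))
    return findings


# Hypothetical baseline template for one server role (illustrative values only).
MAILBOX_TEMPLATE = {
    "registry/ExampleKey/MaxThreads": 64,
    "network/nic_speed": "1Gbps",
    "driver/example_filter_driver": "disabled",
}

# Hypothetical collected settings, keyed by server name.
COLLECTED = {
    "MBX-01": {
        "registry/ExampleKey/MaxThreads": 64,
        "network/nic_speed": "1Gbps",
        "driver/example_filter_driver": "disabled",
    },
    "MBX-02": {
        "registry/ExampleKey/MaxThreads": 32,
        "network/nic_speed": "1Gbps",
    },
}

if __name__ == "__main__":
    for f in audit_all(COLLECTED, MAILBOX_TEMPLATE):
        print(f"{f.server}: {f.item} expected {f.expected!r}, found {f.actual!r}")
```

In a real deployment, the findings would feed an alerting system rather than being printed. Keeping one template per server role and hardware type, as the text describes, means that the same comparison applies unchanged to every server that shares the template.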
For most IT organizations, introducing changes to a production environment typically involves a source for inputting changes (such as user feedback) and a way to design and verify the changes in a sandbox environment. Then they roll out changes in a software update or similar mechanism to the production environment. Additionally, change control incorporates review processes and management tools to ensure that teams working on changes track and complete change requests.
For the Messaging Operations group specifically, change control encompasses the traditional processes of accepting change requests from multiple sources and then designing, verifying, and rolling out changes. The underlying goal is to handle more of each change process through prescribed, documented steps. The Messaging Operations group, in working through its change control processes, attempts to reduce the workload on specialists and distribute work to others by thoroughly documenting steps and procedures for implementing changes. This enables those who are not Exchange Server 2007 specialists to apply changes uniformly to all computers in the messaging environment.
A key success factor that enables the Messaging Operations group to manage all change requests centrally is the forward schedule of change (FSC) tool. The FSC tool is a custom Office SharePoint Server 2007 Web site with the list of change requests displayed in a calendar view. It enables the Messaging Operations group to create requests for changes (internally known as RFCs, not to be confused with Request for Comments [RFC]), approve them, schedule implementation, and manage changes. Typically, project managers and service managers create RFCs based on requests from the Exchange Server product group, whereas the Messaging Operations group enters emergency RFCs when responding to an incident. Before entering RFCs, the service managers and project managers perform an initial feasibility analysis by working with the Exchange Server product group to provide feedback regarding change implementation feasibility. Figure 4 shows the FSC tool:
Figure 4. FSC tool for change control
The Messaging Operations group uses the FSC tool to facilitate the following change control processes:
- Request for change creation Any internal user can create an RFC in the change management tool, such as a request for additional product features or a fix for a discovered root cause.
- Selection The Messaging Operations group accepts RFCs if they meet approval criteria such as completeness of information. Information completeness includes detailed rollback and rollout instructions, severity, and expected turnaround time frames. In dealing with RFCs, the Messaging Operations group maintains a staged set of instructions that a non-specialist can follow to implement the change.
- Severity and impact analysis The change advisory board conducts weekly meetings to select RFCs, update status, and evaluate the impact of new and ongoing changes. The board categorizes RFCs based on SLA impact, capacity, security, and disaster recovery readiness, and assigns minor, major, or automatic severity status. The automatic status is used for small changes that can be implemented without further investigation because they are either critical to performance and availability or pose no significant risk.
- Prioritization The RFC urgency complements its severity status. Some changes represent emergency solutions and require implementation in 24 hours or less, whereas others may be moderately urgent and can be scheduled for completion over one or more weeks.
- Implementation As part of the implementation process, the Messaging Operations group maintains a known change type list that details change categories for RFCs and the permitted change window.
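The selection and prioritization steps above can be sketched in a few lines. The following is a minimal illustration, not the FSC tool's actual data model: the `Rfc` class, its field names, and the ordering rule (emergency changes first, then shortest turnaround) are assumptions chosen to mirror the completeness criteria and urgency levels that the text describes.

```python
from dataclasses import dataclass
from typing import Optional

# Fields an RFC must supply before it can be accepted (hypothetical names).
REQUIRED_FIELDS = (
    "rollout_instructions",
    "rollback_instructions",
    "severity",
    "turnaround_hours",
)


@dataclass
class Rfc:
    """A simplified request for change."""
    title: str
    rollout_instructions: str = ""
    rollback_instructions: str = ""
    severity: str = ""  # "minor", "major", or "automatic"
    turnaround_hours: Optional[int] = None
    emergency: bool = False


def missing_fields(rfc):
    """Return the names of required fields that the RFC leaves empty."""
    return [name for name in REQUIRED_FIELDS if getattr(rfc, name) in ("", None)]


def schedule_order(rfcs):
    """Order complete RFCs: emergency changes first, then shortest turnaround."""
    accepted = [r for r in rfcs if not missing_fields(r)]
    return sorted(accepted, key=lambda r: (not r.emergency, r.turnaround_hours))


if __name__ == "__main__":
    queue = [
        Rfc("Add feature", "steps", "undo steps", "minor", 168),
        Rfc("Disable driver", "steps", "undo steps", "major", 24, emergency=True),
        Rfc("Incomplete request", rollout_instructions="steps"),
    ]
    for rfc in queue:
        gaps = missing_fields(rfc)
        print(rfc.title, "->", "complete" if not gaps else f"missing {gaps}")
    print([r.title for r in schedule_order(queue)])
```

Rejecting incomplete RFCs before they reach the schedule reflects the point made in the text: a change is accepted only when it carries enough rollout and rollback detail for a non-specialist to implement it.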
In Microsoft IT, many people contribute during change control tasks, but the ultimate responsibility for resolving issues rests with the Messaging Operations group, which oversees the performance of Exchange Server 2007 across all Microsoft IT environments. Especially in the pre-release environment, the Messaging Operations group has many opportunities to verify specific product functionality of builds.
The Messaging Operations group is the first in the line of contributors that submit change requests to the Exchange Server product group. After identifying an issue that requires changes to the Exchange Server product code, the Messaging Operations group works with developers to create a design change request in the developer database, which automatically becomes the responsibility of the Exchange Server product group and developers. Although others may create product updates and other changes, the Messaging Operations group is responsible for requesting, verifying, and approving builds and updates for rollout to production environments. The Messaging Operations group provides a disciplined process for introducing required changes into a complex IT environment with minimal disruption to ongoing operations. The Messaging Operations group remains closely aligned with the release management process, and manages the release and deployment of changes into the production environment.
Another aspect of change control processes involves product validation, which the Messaging Operations group performs in collaboration with the Exchange Server product group. During beta testing and pre-release partner deployments, the Exchange Server product group may decide to implement changes to the code based on tester and partner feedback. Although the Exchange Server product group controls the code, the Messaging Operations group is responsible for validating the functionality of changed features and proving enterprise readiness by using it in the pre-release production environment after the Exchange Server product group completes typical quality assurance tasks.
Within the incident management, problem handling, change control, and configuration management processes that the Messaging Operations group performs, there is a constant effort to improve processes and thereby realize new levels of efficiency, scalability, repeatability, and cost savings.
A key source of process improvement comes from end users. Although the Helpdesk at Microsoft deals with first-tier support issues related to messaging, the Messaging Operations group participates in a satisfied-user initiative, which gathers feedback from users regarding functionality and performance in the messaging environment. The Messaging Operations group uses surveys to request feedback from users on satisfaction in a particular messaging service area, such as response times and availability, as well as to check users' general satisfaction. Some of the internal SLAs cover user satisfaction; meeting those SLAs and analyzing sources of dissatisfaction lead to an analysis of the people, processes, and technology that are used to deliver messaging services. When this analysis results in the discovery of better processes or different combinations of people, processes, and technology, the Messaging Operations group makes appropriate changes to enact these improvements.
Microsoft IT deployed its first Exchange server in the corporate production environment more than ten years ago. Since that time, the Exchange Messaging team has amassed a wealth of experience and best practices around Exchange Server designs, operations, and troubleshooting. As part of its commitment to customers to share knowledge about running Exchange Server 2007, the Exchange Messaging team established the Microsoft Exchange Center of Excellence (ECoE), which is a task force inside Microsoft aimed at helping customers get the most out of their Exchange Server deployments.
The ECoE is a cross-organization team that also includes the Exchange Server product group and Microsoft Consulting Services (MCS). Its mission is to help customers better manage Exchange Server, taking advantage of expertise gained by Microsoft IT employees running the product in-house. The ECoE also administers the Microsoft Exchange Server Risk Assessment and Health Check Program (ExRAP) at Microsoft. An ExRAP engagement provides detailed on-site technical and operational analysis of large Exchange Server 2007 deployments.
The Messaging Operations group actively shares its knowledge with the messaging community and uses this interaction as a method to gather feedback and use that knowledge to improve operations processes. There are many ways the Messaging Operations group shares knowledge. For example, members participate in industry conferences, conduct seminars and presentations, and share operational knowledge with MCS, which then uses it for specific customers. They also participate in partner programs to perform product validation during alpha and beta releases of Exchange Server.
Another way the Messaging Operations group engages with customers is through the IT Fellowship series. Customers can talk with Microsoft IT about IT operations and specific services, and discuss general best practices during this two-week program.
As previously mentioned, the Messaging Operations group interacts with many Microsoft IT service teams as well as the Exchange Server product group to accomplish its dual goals of meeting SLAs and providing product validation to customers. Because of the volume of activities and work, the group follows a structured model of operations based on MOF and ITIL. These frameworks provide structure and guidance for messaging operations, yet operations architects must ultimately make decisions based on what works in real-world IT environments. As Figure 5 shows, the Messaging Operations group follows an orderly operations workflow with straightforward escalation paths and clear task assignments:
Figure 5. Messaging Operations group workflow
In its workflow, the Messaging Operations group defined the day-to-day tasks of operating the messaging environment, including responding to incidents, resolving incidents, determining the root cause of incidents, changing and improving the environment, and enforcing consistent hardware and software configurations across all servers. The following example demonstrates how all these processes fit together. It shows how the Messaging Operations group resolved a specific performance issue in the messaging environment.
The situation arose on October 1, 2006, when a front-line monitoring operator from the Shared Services team noticed an alert that Microsoft Operations Manager issued to the monitoring console. The alert indicated that an Exchange server was experiencing poor performance, as shown by slow server response times. The following performance counters had exceeded their alert thresholds:
- Average latency over 50 for five minutes
- Processor load greater than 90 percent for five minutes
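Both alert conditions follow the same pattern: a counter must stay above a threshold for a sustained period before an alert is raised. The sketch below illustrates that pattern under stated assumptions. It is not how Microsoft Operations Manager evaluates rules; the function name and sample format are hypothetical, and the latency threshold of 50 is used without a unit because the source does not give one.

```python
def sustained_breach(samples, threshold, window_seconds=300):
    """Return True if the counter stayed above threshold for the whole window.

    samples: list of (timestamp_seconds, value) pairs in ascending time order.
    The window is the last `window_seconds` before the newest sample.
    Simplification: coverage is judged only by whether any sample exists at
    or before the start of the window.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - window_seconds
    if samples[0][0] > window_start:
        return False  # not enough history to cover the full window
    return all(value > threshold for ts, value in samples if ts >= window_start)


if __name__ == "__main__":
    # One sample per minute over five minutes (illustrative values only).
    latency = [(t, 70) for t in range(0, 301, 60)]
    cpu_percent = [(0, 95), (60, 96), (120, 85), (180, 97), (240, 98), (300, 99)]

    print("latency alert:", sustained_breach(latency, threshold=50))
    print("processor alert:", sustained_breach(cpu_percent, threshold=90))
```

Requiring the breach to persist for the full window keeps short spikes from raising alerts, so that front-line operators respond only to sustained degradation.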
The front-line monitoring operator followed the suggested resolution steps to try to resolve the incident, including restarting services and monitoring individual service resource utilization. However, system performance did not improve. Because the front-line monitoring operator could not resolve the incident with the suggested knowledge base resolution steps, the operator escalated the incident to Messaging Operations' Tier 3 Incident Management team by assigning the incident ticket to the Tier 3 Incident Management team alias in the ticket-tracking database.
Working with the Leads team, the Exchange Server product group followed up on the unresolved incident by analyzing captured system and memory data. The Exchange Server product group investigated all the submitted data, including similar occurrences in the past. This was not a one-time issue that suddenly occurred; front-line operators had previously noticed sporadic performance degradations. By comparing the various incidents, the Messaging Operations group recognized a pattern pointing to a third-party driver as the component responsible for the performance degradation. After determining that the driver was the root cause, the Tier 3 Incident Management team member created an emergency RFC in the FSC tool in order to immediately implement a remedy, and later follow it up with a permanent change. The risk analysis determined that this change did not pose a significant risk because it did not affect Exchange Server settings.
The change advisory board approved the emergency RFC quickly due to the critical nature of the issue, which enabled the Messaging Operations group to disable the driver across all affected servers rapidly. To ensure that the driver was disabled, the Messaging Operations group modified the SMS DCM 2.0 template used on the servers to include checking for the driver during SMS configuration compliance audits.
As part of following up on the incident and implementing a permanent solution, the Messaging Operations group notified the provider of the third-party driver about the incident, and requested that, to ensure stability, the vendor provide a pre-production driver for use in the pre-release production environment. After the group received the new driver and verified its stability, the resolution process returned to the problem-handling discipline to review the incident and implement a permanent change.
In this particular incident, the problem-handling processes overlapped with many other processes such as change management and configuration management. Such overlap often occurs in the real world, where the structured processes serve as a guide to accomplish operational goals. It is important to note, however, that even though multiple teams contributed to the root cause analysis and emergency change implementation, the Messaging Operations group owned and was responsible for handling the incident and associated issue until the final resolution.
In combination with verifying the functionality of the new driver that the third-party vendor supplied, the Messaging Operations group created an MPR ticket during a weekly incident review meeting. (At this point in the incident resolution process, the person who owns the MPR must provide at least weekly updates until the problem is resolved.) The MPR included the incident notes, action steps taken, and other details common to MPRs as previously mentioned in the section titled "Problem Handling." The MPR also included an action item to ensure the installation of the new driver across all servers in the latest cycle of driver updates.
After verifying the new driver's functionality, the Messaging Operations group created a change request in the FSC tool with normal priority, in order to ensure a uniformly applied permanent fix. During the next available time frame for rolling out updates, the Messaging Operations group installed the new driver in the environment.
It is not sufficient for the Messaging Operations group to implement an emergency fix followed by a permanent fix in the messaging environment. The Messaging Operations group must also ensure that the incident does not occur in the future, both in the Microsoft messaging environment and in customer environments.
The Messaging Operations group handles both aspects of the issue resolution process separately. To prevent issue recurrence in the Microsoft IT environment, the Messaging Operations group works with the Infrastructure team to make the new driver the default driver for all future builds and deployments. To provide the solution to all customers, the Messaging Operations group works with the Exchange Server product group to include a corresponding driver check in the ExBPA tool. This is one example of how the Exchange Messaging team translates the Microsoft IT vision of being the first and best customer of Microsoft into concrete proactive help for every Microsoft customer running Exchange Server 2007.
By adopting an operations framework based on industry standards such as MOF and ITIL, the Messaging Operations group was able to identify best practices that cover daily tasks of operations and provide guidance for IT professionals for designing and operating an enterprise-messaging environment based on Exchange Server 2007. Some of these best practices apply to all operations, such as adopting scalable and flexible processes, whereas others apply only to specific disciplines, such as incident management or change management.
The Messaging Operations group relies on the following best practices:
- Use tools for tracking and management The Exchange Messaging team uses many tools as part of its Exchange Server operations. For example, Microsoft Operations Manager and SMS combined with a ticket-tracking database provide the capability to monitor the environment in real time, including configuration data, and track incidents from inception to resolution. The Messaging Operations group uses specialized tools, such as custom LOB applications for problem review and change implementation, in addition to custom scorecards for metrics and diagnostic and troubleshooting tools.
- Implement review processes for each discipline The Messaging Operations group specifically includes review processes for incident review, problem handling, configuration management, and change control. This enables the group to optimize its processes on an ongoing basis and foster a culture that embraces change and emphasizes improvements.
- Create baseline configurations To be able to analyze risks, it is important to know risk dependencies and effects in terms of configuration items. Before making changes, the Messaging Operations group evaluates change impact and enforces changes across all servers by using templates that specify the required configuration settings.
- Perform monitoring centrally Microsoft IT relies on a Shared Services team for all monitoring in order to gain the benefit of centralized monitoring without the cost of using a distinct team for each service. With Microsoft Operations Manager, centralized monitoring provides at-a-glance summaries of system status as well as detailed reports and alerts.
- Conduct weekly and monthly reviews The Messaging Operations group reviews incidents, requests for changes (RFCs) and changes, and problem-handling tickets on a weekly basis. Additionally, there is a monthly review to identify trends and summarize performance.
- Systematize resolution steps and transfer knowledge to front-line operators With each new incident, the Messaging Operations group has an opportunity to improve the resolution guidance for front-line operators. The Messaging Operations group both reviews existing steps to improve guidance and documents resolution steps for new incidents from data gathered during problem handling and root-cause analysis processes.
- Measure statistics The Messaging Operations group measures not only overall SLAs, but also specific internal SLAs, which enables easier trend spotting and targeted performance improvement.
From the earliest days of operating the corporate messaging environment with Exchange Server to the present, Microsoft IT has continually increased its performance and availability targets and the scope of its goals for the messaging environment. By providing messaging services with a no-exception, end-to-end policy from a user's point of view and by achieving 99.99 percent availability, Microsoft IT consistently demonstrates to even the most demanding customers that Exchange Server technology can meet rigorous availability and performance requirements.
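To put the 99.99 percent figure in perspective, an availability target can be converted into a downtime budget. The following sketch shows the arithmetic only. The function names are hypothetical, and the calculation assumes continuous 24-hour service over the period, which the source does not specify.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_percent, period_days):
    """Minutes of downtime allowed in the period at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (100 - availability_percent) / 100


def measured_availability_percent(downtime_minutes, period_days):
    """Availability achieved in the period, given the observed downtime."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100 * (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    for days in (30, 365):
        budget = downtime_budget_minutes(99.99, days)
        print(f"{days} days at 99.99 percent: {budget:.2f} minutes of downtime")
```

Under these assumptions, a 99.99 percent target leaves roughly 53 minutes of downtime in a 365-day year, or a little over four minutes in a 30-day month.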
For the Messaging Operations group, delivering consistent results requires using the right mix of technology, people, and processes. Exchange Server 2007, Microsoft Operations Manager, and other Microsoft server products provide a sound technological foundation upon which people and processes can rely. The modular and flexible product design of Exchange Server 2007, based on server roles, promotes technical specialization within the Exchange Messaging team. The MOF and ITIL frameworks are the bases for implementing clear communication paths, team hierarchies, and escalation procedures.
Among other things, Exchange Server 2007 helps the Messaging Operations group to decrease operational costs through improved management and administration tools, while offering new technologies such as CCR for increased performance and availability. These benefits are not specific to Microsoft IT because they are repeatable in other environments that follow proven operational processes based on industry-standard frameworks.
The Messaging Operations group continues to drive forward process improvement and knowledge sharing with other service teams as well as Microsoft customers. This includes all levels of IT operations: incident handling and response, problem handling, change management, configuration management, and even showing other IT organizations how to improve processes through reviews and improvement initiatives. The Exchange Messaging team is the originator of many change requests submitted to the Exchange Server product group for implementation in product updates, service packs, and future versions of Exchange Server. Internal experiences and customer feedback are the main sources of these requests. The close collaboration between the Exchange Messaging team and the Exchange Server product group ensures that Exchange Server technology continues to meet the present and future needs of real-world customers.
For more information about Microsoft products or services, call the Microsoft Sales Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact your local Microsoft subsidiary. To access information through the World Wide Web, go to