Operating a Global Messaging Environment by Using Exchange Server 2007
Technical White Paper
Published: November 27, 2007
Situation
The Messaging Operations group within Microsoft IT's Exchange Messaging team must
fulfill the dual goals of operating the messaging environment and performing product
readiness validation with the Exchange Server product group.
Solution
By using the right combination of people, processes and technology, the Messaging
Operations group meets the performance and availability targets defined in the SLA.
Benefits
- Operational effectiveness and consistency by using principles and procedures
based on Microsoft Operations Framework (MOF)
- Significant improvements in availability by using Exchange Server 2007 features
such as cluster continuous replication (CCR)
- Improved end-user productivity via new features in Exchange Server products
and in Office Outlook 2007
- More efficient planning, scheduling, and reviewing of deployment changes
Products & Technologies
- Active Directory
- Clustered servers
- Microsoft Exchange Server 2007
- Microsoft Office SharePoint Server 2007
- Microsoft Operations Manager
- Microsoft Systems Management Server
Executive Summary
Enterprise IT organizations, including the Microsoft Information
Technology (Microsoft IT) group, deal with service level agreements (SLAs) and power
users accustomed to high levels of performance, availability, and responsiveness.
The 130,000-plus users at Microsoft send over 3 million internal e-mail messages
a day from more than 150 offices worldwide, as well as from home and while on the
road. At Microsoft, many business-critical communication processes depend on the
availability of messaging services provided through Microsoft® Exchange Server 2007.
Managing the complex Microsoft IT infrastructure is a team effort that involves
many different groups, such as the Datacenter team, the Network Infrastructure team,
the Active Directory® team, and the Exchange Messaging team. Overall, Microsoft
IT manages three distinct environments: a pre-release production environment to
test new product versions and upgrades prior to their release to manufacturing (RTM),
a corporate production environment to provide IT services to Microsoft users, and
the Microsoft Managed Services (MMS) environment to provide managed IT services
to Microsoft customers. Within these three environments, the Microsoft IT Exchange
Messaging team handles all Exchange-related operation, management, administration,
and optimization processes. In that role, the Exchange Messaging team works with
many other peer teams at Microsoft IT, sharing its operations and process optimization
expertise to help those teams implement efficient and reliable operations processes.
The Messaging Operations group, managed by Microsoft IT Group Manager Gary Baxter,
within the Exchange Messaging team, must meet several reliability, availability,
and performance targets (such as 99.99 percent availability of Exchange services).
To meet these targets, the Messaging Operations group makes use of industry-standard
methodologies such as Microsoft Operations Framework (MOF), Microsoft Solutions
Framework (MSF), and Information Technology Infrastructure Library (ITIL). For example,
the operations model that the Messaging Operations group implemented based on the
ITIL framework relies on structured incident management, problem handling, configuration
management, and change control processes. These processes enable the Messaging Operations
group to capitalize on Exchange Server 2007 administrative features, such as
the Exchange Management Shell, to reduce operations costs and ensure efficiencies.
The key to success in daily operations is the right combination of technology, people,
and processes. For example, the Messaging Operations group uses technical tools,
such as the built-in product features of Exchange Server 2007 and Microsoft
Operations Manager, combined with a clear team structure and work processes that
facilitate collaboration. Built-in product features of Exchange Server 2007,
such as cluster continuous replication (CCR), help the Messaging Operations group
meet 99.99 percent availability and performance targets. New tools and software
features, and optimization opportunities gained through customer feedback, enable
the Messaging Operations group to analyze and implement changes when necessary to
keep pace with the innovative and agile business landscape at Microsoft.
This white paper is for business decision makers, technical decision makers, and
operations managers. It assumes that the reader has a working knowledge of Microsoft
Windows Server® 2003, Active Directory, Exchange Server 2007, and
Microsoft Operations Manager. Because many of the principles and procedures discussed
in this paper are based on standard operations methodologies, a high-level understanding
of the MOF, MSF, and ITIL models is also helpful.
Note: For security reasons, the sample names of forests, domains, internal
resources, organizations, and internally developed security file names that are
used in this paper do not represent real resource names used within Microsoft and
are for illustration purposes only.
Introduction
Since the earliest days of Microsoft Exchange Server, the Exchange Messaging team
has operated an enterprise messaging environment with an emphasis on using Microsoft
technologies wherever possible to keep total cost of ownership (TCO) as low as possible.
However, Exchange Messaging team operations practices changed, driven by product
advancements such as Active Directory integration, which became available with Microsoft
Exchange 2000 Server and later versions. For example, almost a decade ago,
the Exchange Messaging team managed a Microsoft Exchange Server 5.5 based environment
from the viewpoint of individual servers that provided messaging services. Exchange
Server 5.5 included its own directory, which made it convenient to treat Exchange
Server 5.5 servers as isolated messaging islands in the IT environment.
However, this paradigm changed with Exchange 2000 Server and all later versions.
These product versions require not only a functioning operating system, but also
a reliable TCP/IP network infrastructure, Domain Name System (DNS) configuration,
and Active Directory environment. Managing messaging services from the viewpoint
of individual servers was no longer an option for the Exchange Messaging team. The
advent of Exchange 2000 Server marked a rigorous shift toward managing messaging
as an end-to-end service with no exceptions.
The shift toward a service-oriented IT organization became apparent during the Microsoft
Exchange Server 2003 time frame. During this period, the Exchange Messaging
team oversaw a site and server consolidation initiative, creating a centralized
environment comprising four datacenters with Mailbox servers. Cost savings were
$20 million in the fiscal year of 2003 alone. Yet perhaps even more importantly,
the consolidation initiative also yielded significant intangible benefits. Consolidating
the messaging environment enabled the Exchange Messaging team to measure and enforce
performance and availability SLAs across the entire corporate production environment
on a global scale.
Today, the Exchange Messaging team operates with both formal and informal SLAs.
Formal SLAs represent overall targets for performance and availability, whereas
informal SLAs measure internal statistics for metrics and process improvement. For
example, the formal SLA for internal e-mail message delivery is 90 seconds for 99
percent of the messages, measured from the moment any Exchange server processes
the message. Informal SLAs measure finer aspects of the environment and often contribute
to formal SLAs. For example, the Exchange Messaging team measures the availability
SLA for Client Access servers individually (on an informal basis) and all Exchange
servers as a whole (formally).
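The delivery SLA above can be checked mechanically. The following is a minimal sketch in Python, assuming delivery latencies are already available as a list of seconds per message. The function name, its parameters, and the sample data are hypothetical and are not taken from any Microsoft IT tool.

```python
from typing import Sequence


def delivery_sla_met(latencies_s: Sequence[float],
                     threshold_s: float = 90.0,
                     required_fraction: float = 0.99) -> bool:
    """Return True if enough messages were delivered within the threshold."""
    if not latencies_s:
        raise ValueError("no delivery samples to evaluate")
    # Count the messages delivered within the threshold (90 seconds by default).
    on_time = sum(1 for t in latencies_s if t <= threshold_s)
    return on_time / len(latencies_s) >= required_fraction


# Hypothetical samples: 0.5 percent slow messages versus 2 percent slow messages.
good_day = [5.0] * 995 + [120.0] * 5
bad_day = [5.0] * 980 + [120.0] * 20
print(delivery_sla_met(good_day), delivery_sla_met(bad_day))
```

The same function could serve an informal SLA by passing a stricter threshold or fraction for a single server role.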
Within the three environments of Microsoft IT (pre-release, corporate production,
and MMS), the Exchange Messaging team is responsible for meeting targets and SLAs,
but with different emphasis during operations. In the pre-release environment, the
team works with developers to sign off on final product versions before products
are released, whereas in the corporate production and MMS environments, the team
primarily works to ensure that performance targets are met. The formal and informal
SLAs can be more or less rigid to accommodate the purpose of each environment. For
example, the pre-release production environment helps developers improve product
versions and service packs by testing them in a production-type environment. Meeting
SLAs in the pre-release environment is not as crucial as meeting SLAs in the corporate
production environment.
To efficiently operate Exchange Server 2007 environments at Microsoft, the
Exchange Messaging team developed work processes based on common IT operations frameworks
such as ITIL, MOF, and MSF. According to these frameworks, messaging operations
involve managing people, processes, and technology, and more often than not these
elements span the boundaries of individual teams and their respective areas of jurisdiction.
For example, an important aspect of an incident and its resolution is that the Exchange
Messaging team must collaborate with other teams also involved in handling the incident,
such as front-line operators who identify the incident, analysts who resolve or
escalate it, and technical leads and managers responsible for change management
and process improvement.
Exchange Server 2007 provides enabling technologies that help the Exchange
Messaging team meet its performance and availability SLAs. Exchange Server 2007
represents the next generation messaging system, designed to simplify and streamline
operations tasks. Among other things, Exchange Server 2007 provides new features
such as improved storage design, data replication capabilities (local and clustered)
for high availability, modular setup and server provisioning based on server roles,
and new graphical and command-line interfaces for improved manageability and increased
automation. The seamless integration of Exchange Server 2007 with other Microsoft
products, such as Microsoft Internet Security and Acceleration (ISA) Server 2006,
also helps to reduce operational complexities.
The new features in Exchange Server 2007 provide the following advantages for
the Exchange Messaging team:
Overview of the Exchange Messaging Team
The Microsoft IT messaging environment now consists of 62 Mailbox servers (each
in a two-node CCR cluster configuration), 10 Edge Transport servers, 15 Hub Transport
servers, 11 Unified Messaging servers with supporting VoIP gateways, 26 Client Access
servers, and two multiple-role servers in Sao Paulo for the Hub Transport, Client
Access, and Unified Messaging roles. There are approximately 130,000 mailboxes in
the corporate production environment, 10,000 in the pre-release environment, and
30,000-plus in the MMS environment. Operating these environments requires structured
human resources with clearly defined roles.
Operation Service Interdependence
Some teams at Microsoft IT are responsible for a specific service, and some teams
are responsible for a specific function. For example, the Exchange Messaging, Collaboration
Services, and Communications teams provide end-to-end operations for their respective
services, as shown in Figure 1. Each team makes use of its dedicated management
resources, yet the teams often work together because of the interdependent nature
of the Microsoft IT environment. The services these teams support rely on a common
infrastructure that includes the physical TCP/IP network and the Active Directory
environment.
Figure 1. Microsoft IT service teams' structure as of September 1, 2007
Microsoft IT includes the following teams:
- Collaboration Services: The Microsoft environment has many
Microsoft Office SharePoint® Server 2007 sites, which require significant coordination
to operate. The Collaboration Services team manages all aspects related to Office
SharePoint Server 2007, including planning for and performing upgrades.
- Communications: This team handles all aspects of design,
deployment, operations, administration, and management dealing with Office Communications
Server 2007.
- Exchange Messaging: This team handles the entire messaging
environment at Microsoft. For more details about the functions and structure of
the team, see the section below titled "Messaging Service Structure and Functions."
- Exchange Center of Excellence: As part of its
commitment to customers to share knowledge about running Exchange Server 2007,
the Exchange Messaging team established the Microsoft Exchange Center of Excellence
(ECoE), which is a task force inside Microsoft that helps customers get the most
out of their Exchange Server 2007 deployments. For more information about the
ECoE, see the section in this paper titled "Operations Process Improvement."
- Shared Services: Microsoft IT created the Shared Services
team to reduce overlapping responsibilities and cut costs. Before the Shared Services
team existed, each service team had its own human resources for managing the tasks
that the Shared Services team now assumes. These tasks include common monitoring
and other front-line services for all operations teams within the messaging and
collaboration-related service teams. The Shared Services team consists of the following
groups:
  - Process Engineering: This group looks at the processes of
  the Shared Services team to ensure that they meet the requirements of all peer teams
  that the Shared Services team supports.
  - Client Support: This is a tier 2 support group for Microsoft
  users. The Client Support group focuses on issues related to end-user connectivity
  and productivity.
  - Monitoring: This group performs all the front-line monitoring
  for other service groups, such as the Exchange Messaging group. The performance
  goal for the Shared Services team's Monitoring group is to resolve at least 80 percent
  of incidents. Therefore, the people in this group must have general server administration
  and resolution knowledge, and must follow product-specific resolution instructions
  to resolve incidents. The Messaging Operations group within the Exchange Messaging
  team creates the necessary Exchange Server specific knowledge base and resolution
  instructions, and provides training on general resolution and response processes.
Messaging Service Structure and Functions
Just as multiple teams handle the overall design, deployment, and operations functions
of Microsoft IT, the functions within the Exchange Messaging team are similarly
distributed. Exchange Messaging team members manage the messaging service from end
to end. This entails monitoring messaging-related incidents; coordinating changes;
collaborating with the Exchange Server product group and with other Microsoft IT
teams; and managing the MMS environment for customers. Within the Exchange Messaging
team, people have specialized roles and work together in specialized groups, each
of which handles a portion of the overall responsibilities, as shown in Figure 2.
Figure 2. Exchange Messaging team organizational structure as of September 2007
Exchange Server 2007 not only enables IT organizations to capitalize on expert
knowledge according to individual server roles; it also provides cost-efficient
opportunities to cover basic and general operational aspects via Shared Services
teams. As shown in Figure 2, Microsoft IT takes advantage of this possibility with
Exchange Server 2007 by using a Shared Services team to perform all front-line
monitoring tasks. This frees the Exchange Messaging team to focus on escalated issues
and complex tasks such as root cause analysis.
In order to carry out Exchange Server 2007 specific monitoring and incident
response, the Shared Services team must have specific resolution steps, which the
Exchange Messaging team provides. Specialists in the Messaging Operations group
use their expert knowledge to create these detailed resolution steps. If an incident
arises that these detailed steps do not cover, then the Shared Services team escalates
the incident to the Messaging Operations group. Because the Shared Services team
handles the vast majority of incidents without escalation, the Exchange Messaging
team can apply expert knowledge in an increasingly targeted way.
Messaging Engineering Team Functions
The Messaging Engineering team within the Exchange Messaging team designs the messaging
systems in the corporate production environment. This broad goal includes many complementary
tasks, such as interacting with the developers in the Exchange Server product group,
analyzing performance and scalability of server designs, evaluating technology,
and performing research to accomplish these tasks. To design and deploy the messaging
environment, the Messaging Engineering team verifies the recommended system parameters
and configuration options set by the Exchange Server product group as well as the
initial performance and configuration recommendations from the pre-release production
environment. As part of designing the corporate production environment, the Messaging
Engineering team also creates and maintains documentation that details overall environment
design aspects, messaging topology, server specifications, and Exchange Server 2007
configuration settings.
Note: The Messaging Engineering team does not design the pre-release production
environment. That design evolves from recommendations from the Exchange Server product
group. However, it is the task of the Leads team in the Messaging Operations group
within the Exchange Messaging team to deploy and operate the pre-release production
environment. In this way, the Messaging Operations group verifies performance and
functionality based on the default settings before making any customizations and
changes for settings in the production environment.
The Messaging Engineering team designs the corporate production environment. Its
design is based on the results of capacity planning; enterprise design and architecture
practices; the results of lab evaluations and testing; and proven and verified results
from the pre-release production environment. The latter entails collaborating with
the product group to transfer knowledge such as the configuration and hardware settings,
deployment steps, and best practices. Because of this close collaboration, members
of the Messaging Operations group participate in many engineering projects and even
get the chance to own some aspects of the corporate production environment design.
By gaining real-world experience in collaboration with the Messaging Engineering
team, members of the Messaging Operations group can move on with their careers as
messaging engineers.
Service Management Functions
The Service Manager role within the Exchange Messaging team owns the end-to-end,
user-focused aspects of the messaging environment. The Service Manager's primary
responsibility is to serve as the interface between users and the Messaging Operations
group. This entails responding to user questions, performing weekly service reviews,
and communicating with regional Microsoft IT managers about any scheduled and unscheduled
outages. Service managers take user requests for changes, and oversee service improvement
in messaging to meet SLAs. Additionally, service managers are responsible for the
TCO of their service.
Project Manager Functions
The Project Manager role within the Exchange Messaging team owns the physical hardware
deployment for the messaging environment. The Project Manager's primary responsibility
is to serve as the interface between the Exchange Server product group and the Messaging
Operations group. By acting as the intermediary, a project manager oversees
feature and product implementation across all dependent teams. To accomplish this,
a project manager attends weekly meetings with developers, coordinates schedules
for corporate production environment deployment, and owns the overall hardware budgeting
and maintenance tasks. This role is the strategic arm of the Messaging Operations
group.
Messaging Operations Group
The Messaging Operations group is responsible for the dual goals of product improvement
and providing highly reliable and available messaging services. The Messaging Operations
group works very closely with the Exchange Server product group, and runs a pre-release
production environment dedicated to trying new builds, verifying functionality,
and discovering improvement opportunities before the Exchange Server product is
released to manufacturing. The in-depth knowledge gained from this close collaboration
enables the Messaging Operations group to create thorough incident response documentation
for front-line Monitoring group operators, discover problem root causes, and oversee
changes to the environment and product releases.
As the group manager for the Messaging Operations group, Gary Baxter is responsible
for two teams that provide end-to-end operations for all mailboxes in the corporate
production infrastructure, including unified messaging. In addition to his many
responsibilities, Gary must also meet SLAs and decrease the cost of supporting and
operating the environment through process improvement. Gary carries out the following
specific tasks:
- Enabling customers: Gary works with customers inside Microsoft
and in the messaging community at large. The responsibilities vary with each customer,
based on specific needs. For example, he is responsible for providing technical
guidance, supporting findings, and providing process improvement assistance. Gary
regularly interacts with the following peers:
  - Service and project managers: Gary enables service managers
  by providing best practices and process engineering for messaging operations, which
  in turn enables other managers to run their respective services more efficiently.
  - Exchange Server product group: Product improvement and validation
  is the primary goal for the pre-release production environment. Gary's teams provide
  the Exchange Server product group with incident results to change the Exchange Server
  code during beta time frames. Additionally, Gary's teams create configuration templates
  that include the desired configuration of servers running Exchange Server 2007.
  These configuration templates enable the Exchange Server product group to provide
  final code with Exchange Server 2007 optimized with best practices and settings.
  - Sales, marketing, and product support teams: Gary provides
  expert advice and documentation to assist sales, marketing, and support activities.
  For example, Gary participates in industry conferences such as TechEd and internal
  conferences such as TechReady, and helps to create documentation about how Microsoft
  does IT.
  - Microsoft IT service teams: Gary assists peer service and
  engineering teams by providing and supporting environments for effective pre-release
  production verification and scenario validation of a variety of products, such as
  Exchange Server, Windows Server, ISA Server, and Microsoft Retail Management System.
  - Industry customers: Gary also assumes responsibilities concerning
  Microsoft customers. For example, Gary is responsible for demonstrating product
  readiness to Microsoft customers by running the corporate messaging environment
  to mission-critical standards. According to the Microsoft IT vision of being the
  first and best customer, the Messaging Operations group also helps the Exchange
  Server product group proactively identify and address potential issues before external
  Microsoft customers notice these issues.
- Overseeing efficiencies in IT processes: At Microsoft IT,
the Messaging Operations group sets the process optimization standard for other
service groups. Gary is responsible for making his team more efficient, and also
for condensing service-agnostic process improvements and sharing them with other
teams.
- Ensuring SLA compliance: Gary is also responsible for messaging
SLAs. Although multi-team contributions establish SLAs at both technical and managerial
levels, Gary develops and executes strategies to ensure that the combination of
people, process, and technology meets SLA targets in the messaging environment.
- Managing people: Gary works with engineering and other teams
to provide his group members with additional cross-team projects. In this way, Gary
develops the skills of his team members, making it possible for them to advance
in their careers.
- Driving down the cost of support and operations: Gary is
responsible for reducing the cost of support and operations by ensuring process
efficiency, such as setting a target of 80 percent incident resolution by the Monitoring
group. To accomplish this goal, the Messaging Operations group develops and provides
the necessary tools and resources that enable the Monitoring group to reach this
performance target.
- Designing process intelligence: Gary's challenge in creating
processes also includes ensuring that they can be re-used, automated, and easily
modified to deal with changes. This is especially important in the MMS environment
because customers have different needs, yet the processes for operating the messaging
environment must be uniform.
Although all members of the Exchange Messaging team share the common goals of ensuring
performance, availability, and reliability, the team structure enables members to
specialize in a specific aspect of Exchange Server 2007 to accomplish the overall
goals. The modular design of Exchange Server 2007, based on server roles, promotes
specialization, yet an efficient IT organization also requires experts with a broader
scope. Accordingly, Gary structures his teams with different specialization focuses
to provide maximum operational process efficiency and to enable prompt incident
handling, problem resolution, and changes to the environment.
There are other advantages to using teams with separate specialization focuses,
such as reducing costs via documentation and balancing workload across specialists.
A key part of reducing operation costs involves systematizing the knowledge of experts
and creating documentation so a generalist can use it for messaging operations.
However, creating thorough documentation that everyone can understand requires in-depth
knowledge of Exchange Server 2007 and operations, which specialization provides.
The Messaging Operations group consists of two teams, led by Ryan McDonald and Jim
Leigh, respectively, which perform specialized, yet sometimes overlapping tasks.
The broad Exchange Server product knowledge that the members of each team possess
enables team members to help each other when there is an unexpected workload, such
as during an unscheduled outage. The two teams perform the following specific functions:
- Tier 3 Incident Management team: This is Jim Leigh's team.
It is composed of Exchange Server specialists who focus on a technology, rather
than a specific server role, to handle any unresolved issues passed on to them by
the Monitoring group. Team members also possess knowledge of IT operations and server
administration. Their responsibilities include incident handling, escalation, and
change management. The members of this team also manage ancillary services such
as Forefront and Public Folders. Jim's team works closely with the Leads team in
its specific areas of expertise in order to gain new knowledge, to support career
growth, and to provide as-needed resources.
- Leads team: This is Ryan McDonald's team. It is the
last escalation point for issue resolution. If the Tier 3 Incident Management team
cannot resolve an issue, the issue escalates and Ryan's team resolves it. The team
consists of six people who each specialize in a specific server role and work on
operational projects such as automation and scripting. Two people specialize in Mailbox
servers, and the remaining four cover Client Access, Hub/Edge Transport, and Unified
Messaging (UM) roles. Ryan's team is the strategic team in messaging operations
that handles deployment and feature sets, including feature validation and new product
functionality verification. The Leads team ensures operational efficiency and works
with the product group to ensure the high quality of the product. It also manages
service deployment readiness, the readiness and fine-tuning of the Microsoft Operations
Manager Exchange Management Pack, operational training for the team, and documentation
for operational staff. Additionally, the team works closely with the Messaging Engineering
team to transfer knowledge learned during product and feature verification. Ryan's
team is specifically responsible for performing deployment consultation and automation
work for new deployments in the MMS environment.
The Leads team performs additional operations work that is not 100 percent related
to the Exchange Server product. For example, team members act as configuration management
advisors for Desired Configuration Monitoring (DCM), serve as change approvers for
each service, and interact with customers by presenting in the Computer Information
Technology/Foundation (CITF) program.
Note: It is the task of the Messaging Operations group to resolve all issues
escalated by the Monitoring group within Shared Services. The Messaging Operations
group involves the Exchange Server product group when dealing with very complex
issues.
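The escalation path described in this section (Monitoring group, then the Tier 3 Incident Management team, then the Leads team, with the product group involved only for very complex issues) can be sketched as a simple chain. This is an illustrative Python model only. The incident category names and the contents of each tier's skill set are hypothetical, and the 80 percent figure is the Monitoring group target quoted earlier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Tier:
    name: str
    can_resolve: Callable[[str], bool]


def route_incident(incident: str, tiers: List[Tier]) -> Optional[str]:
    """Return the first tier able to resolve the incident, or None if no tier can."""
    for tier in tiers:
        if tier.can_resolve(incident):
            return tier.name
    # None means the Messaging Operations group involves the product group.
    return None


def first_line_rate(incidents: List[str], tiers: List[Tier]) -> float:
    """Fraction of incidents resolved by the first tier without escalation."""
    resolved = sum(1 for i in incidents if route_incident(i, tiers) == tiers[0].name)
    return resolved / len(incidents)


# Hypothetical incident categories covered at each tier.
documented_steps = {"disk_space_low", "service_restart", "queue_backlog"}
tier3_skills = {"public_folder_replication"}
leads_skills = {"ccr_replication_stall"}

tiers = [
    Tier("Monitoring (Shared Services)", lambda i: i in documented_steps),
    Tier("Tier 3 Incident Management", lambda i: i in tier3_skills),
    Tier("Leads", lambda i: i in leads_skills),
]

incidents = ["disk_space_low"] * 8 + ["public_folder_replication", "ccr_replication_stall"]
print(f"first-line resolution rate: {first_line_rate(incidents, tiers):.0%}")
```

In this model, adding a category to `documented_steps` corresponds to the Messaging Operations group publishing new resolution instructions, which is what raises the first-line resolution rate.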
Operational SLAs and Targets
"By meeting stringent SLAs that demand 99.99 percent availability across the entire
corporate production environment, we prove to our customers the enterprise readiness
of Exchange Server 2007. It takes the right mix of technology, processes, and
people to achieve this high level of availability. Exchange Server 2007 provides
the technological foundation, and tried and proven operational processes, driven
by MOF principles, make it possible for us to have the right people do the right
job at the right time in order to achieve high availability at Microsoft in a repeatable
and measurable way."
Gary Baxter
Group Manager
Microsoft Corporation
Microsoft IT pursues an end-to-end approach to SLAs, which means being responsible
for meeting the SLAs, and also for all the services and components that contribute
to SLAs. For example, if a network connectivity issue prevents users from accessing
their mailboxes, Microsoft IT considers the messaging service unavailable even if
all Exchange Server 2007 servers are up and running because the end-user experience
is an unavailable messaging service. Exchange Server 2007 integrates tightly
with Active Directory and depends on the TCP/IP network infrastructure and other
technologies to deliver its features, so the Exchange Messaging team must be
aware of incidents beyond those specific to Exchange Server 2007. In fact, many
teams at Microsoft IT, such as the Windows Server team and the Active Directory
team, proactively report incidents to the Exchange Messaging team before anyone
else does.
End-to-end operations provide Microsoft IT with many advantages over the previous
server-centric approach. With end-to-end operations, incidents are resolved faster
because teams own their incident tickets until an incident is resolved; processes
are more flexible; costs are reduced through selective usage of specialists; and
the overall performance of the Microsoft IT organization becomes the shared responsibility
of all teams. The Exchange Messaging team is ultimately a heavy user of the underlying
infrastructure, such as the physical network, Active Directory, DNS, and firewalls.
Therefore, the team can meet its SLAs only if the other teams also meet theirs. At the core
of no-excuses SLAs are individual teams that are responsible for specific areas
but share overall accountability.
The no-excuses SLA policy came about when Microsoft IT management examined its organizational
hierarchy and realized that users only see an outage, and not its causes. From a
user's point of view, if a service is unavailable the user is witnessing a service
outage. The cause of the outage may be an issue with the TCP/IP network, telecommunications
provider, or underlying Active Directory infrastructure. None of these causes is
Exchange-specific, yet each causes a messaging service outage, which counts against
the availability SLA. With the no-excuses SLA, the source of the issue does not
matter; the Exchange Messaging team owns the incident and its resolution, and has
the responsibility to introduce changes to prevent issue recurrence.
Business Drivers for SLAs
The ambitious SLAs that the Exchange Messaging team set, as discussed in the next
section, "Performance Goals," are not strictly necessary to meet business needs
and user expectations at Microsoft. Historically, Microsoft IT did not always have
the 99.99 percent availability SLA, and the business still functioned profitably
even at the performance level of 99.9 percent availability.
The Exchange Messaging team moved to more aggressive SLAs as a way to push
the envelope and prove what is possible with Exchange Server for even the most
demanding customers. For many years previously, the team had gathered performance
statistics and reviewed them weekly as scorecards that listed the SLAs and performance.
The scorecards had a green indicator for met SLAs and a red indicator for unmet
SLAs. According to Gary, there were too many green lights on these scorecards prior
to the move to end-to-end operations and the adoption of the no-excuses SLA. All
categories showed green with 99.9 percent targets. To avoid complacency and to foster
an attitude of continuous performance improvement, the Exchange Messaging team moved
to more rigorous SLAs that challenged all team members and promoted exceptional
performance.
The transition to end-to-end operations developed in parallel with a shift toward
operating the Microsoft IT environment with all Microsoft customers in mind,
not just Microsoft internal users. As already mentioned, the Messaging Operations
group manages the pre-release production environment, which helps to verify performance
and functionality in a real-world setting. Demonstrating product performance and
readiness for even the most demanding customers is one key mission of Microsoft
IT. Communicating feedback and requirements from external customers back to the
Exchange Server product group is another.
Note: In addition to operating a pre-release production environment, Microsoft
runs beta testing programs, and programs with partners where pre-release code is
deployed in partner IT environments. This focus on partners means that Microsoft
IT does not just look to its internal needs, but is always mindful of the needs
of external customers.
With each product release, the Messaging Operations group has spent more time
running Exchange Server in pre-release production environments. The pre-release
verification period for Exchange 2000 Server started three weeks before the
RTM date; the pre-release verification period for Exchange Server 2003 lasted
only six months. In comparison, the Messaging Operations group used the beta versions
of Exchange Server 2007 for more than 22 months before the product shipped. In fact,
the Exchange Server product group does not ship product versions or service packs
until the Messaging Operations group signs off on enterprise readiness. The release
criteria state that the Messaging Operations group must demonstrate 99.99 percent
availability for at least three weeks. The high-availability technology of Exchange Server 2007
and real-time monitoring tools such as Microsoft Operations Manager,
as well as motivated people, clear communication paths, and efficient operations
processes, enabled the Exchange Messaging team to deploy Exchange Server 2007
at full scale throughout the corporate production environment prior to RTM. The
Messaging Operations group was able to report to the Exchange Server product group
the achievement of the high-availability sign-off criteria across the entire corporate
production environment on November 26, 2006. Exchange Server 2007 shipped on
December 7, 2006.
Performance Goals
Microsoft has an e-mail-centric culture. In a typical week, there are over 30 million
messages processed by Hub Transport servers for internal and Internet messages,
and over 14 million ActiveSync connections from mobile devices. Additionally, the
environment supports trusted partner connections, multiple forests, and the global
presence of Microsoft. Therefore, performance optimization and improvement is a
vital task directly related to SLAs.
After setting overall organization-wide SLAs, the Messaging Operations group analyzed
the various dependencies and developed targeted, team-specific SLAs. These team-specific
SLAs yield a finer granularity in controlling and gathering statistics about outages
for reports, which enables more accurate self-assessment and trend analysis. Although
the team-specific SLA targets are not as strict, they enable a closer inspection
of the environment and ensure achievement of the organization-wide SLAs.
Organization-wide SLAs
Organization-wide SLAs represent broad performance goals in the Microsoft IT messaging
environment. These SLAs represent a commitment to users, customers, and the Exchange
Server product group that Exchange Server 2007 can deliver mission-critical
results. The SLAs cover the important messaging aspects, such as delivery times
and availability. More specifically, Microsoft IT defined the following SLAs:
- Delivery Delivery of 99 percent of all internal messages to their final
destination must take 90 seconds or less.
- Availability Overall availability of messaging services
must be 99.99 percent or greater.
- Business continuance Messaging service must resume within one hour
or less so that business can continue.
- Deleted items retention There is a 14-day minimum retention
of deleted items.
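The availability and delivery targets above translate into concrete budgets. The following Python sketch is illustrative only: the function names and the 30-day reporting period are assumptions, not part of the Microsoft IT tooling. It shows how much outage time an availability target permits and how the 99 percent in 90 seconds delivery target could be checked against measured delivery times.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, days: int) -> float:
    """Minutes of outage permitted in `days` days at the given availability target."""
    return (1 - availability_pct / 100) * days * MINUTES_PER_DAY


def delivery_sla_met(delivery_seconds, limit_seconds=90.0, required_fraction=0.99):
    """True if at least `required_fraction` of messages arrived within the limit."""
    if not delivery_seconds:
        return True  # no traffic in the period, so nothing violated the target
    on_time = sum(1 for s in delivery_seconds if s <= limit_seconds)
    return on_time / len(delivery_seconds) >= required_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99):
        budget = downtime_budget_minutes(target, days=30)
        print(f"{target} percent over 30 days allows {budget:.2f} minutes of outage")
```

Under these assumptions, a 99.99 percent target allows roughly 4.3 minutes of outage in a 30-day period, compared with about 43 minutes at 99.9 percent.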
Team-Specific SLAs and Metrics
Team-specific SLAs and metrics focus on specific server roles, technologies, physical
locations, or similar criteria to provide a convenient means for reporting and analysis.
These SLAs and metrics address not only the technical aspects, but also "soft" factors
such as user satisfaction. The Messaging Operations group tracks the following SLAs
and metrics:
- Core e-mail: weighted overall mailbox availability
- Core e-mail: weighted unplanned mailbox availability
- Client availability
- Client performance
- Mobile Messaging: overall availability
- Mobile Messaging: unplanned availability
- Unified Messaging: overall availability
- Unified Messaging: unplanned availability
- Fax: overall availability
- Fax: unplanned availability
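The source does not describe how the "weighted" mailbox availability metrics are calculated. One plausible reading, shown here purely as an assumption, is that each server's downtime is weighted by the number of mailboxes it hosts, and that the unplanned metric ignores scheduled maintenance. All names in this Python sketch are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerPeriod:
    name: str
    mailboxes: int              # assumed weighting factor
    planned_down_min: float     # scheduled maintenance, in minutes
    unplanned_down_min: float   # outages, in minutes


def weighted_availability(servers, period_min, include_planned=True):
    """Mailbox-weighted availability in percent over a period of `period_min` minutes."""
    total_mailboxes = sum(s.mailboxes for s in servers)
    if total_mailboxes == 0:
        raise ValueError("no mailboxes to weight by")
    lost_mailbox_minutes = 0.0
    for s in servers:
        down = s.unplanned_down_min
        if include_planned:
            down += s.planned_down_min
        lost_mailbox_minutes += s.mailboxes * down
    return 100.0 * (1 - lost_mailbox_minutes / (total_mailboxes * period_min))
```

Calling `weighted_availability(servers, period_min, include_planned=False)` would give the unplanned variant of the metric under the same assumption.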
Incident Management and Response
"Structured incident handling processes
as well as tools and documentation from my team equip front-line operators with
the necessary guidance to resolve over 80 percent of messaging incidents immediately.
This frees my team to focus on specialized projects that drive process improvement.
New Exchange Server 2007 features, such as Exchange Management Shell, enable
us to develop efficient tools that front-line operators can use without requiring
detailed product knowledge to achieve consistent incident resolution."
Jim Leigh
Operations Manager
Microsoft Corporation
The Microsoft IT environment is global and has regional IT teams that are responsible
for managing the site-specific hardware. Overall, more than 6,000 IT experts work
at Microsoft IT; 50 percent of those IT experts are vendors. Running this enterprise
IT organization requires established workflow processes, communication paths, and
coordination to respond to incidents and resolve them in a timely manner.
The Messaging Operations group is a leader within Microsoft IT in overseeing cost-cutting
and process-improvement measures for incident management. The group accomplishes
this by decreasing the workload on specialists and transferring their knowledge to
front-line Monitoring group operators who respond to incidents. For Microsoft IT,
this means the Messaging Operations group can focus specialist resources on more involved
processes.
Incident management provides the following advantages:
- Increased specialization opportunities Part of the method
of increasing efficiency and lowering costs is to create specialists within a specific
body of messaging knowledge. However, using specialists to solve all operational
issues can be expensive. To maintain cost-efficiency, there also must be people
who can take over some of the systematic aspects of operations, such as responding
to an incident and following prescribed and documented resolution steps. With the
Shared Services team acting as front-line monitoring operators for multiple services,
each service group can develop service specialists.
- Fast response by front-line Monitoring group operators Front-line
monitoring operators work in a 24-hour, 7-days-per-week datacenter where an operator
watches the monitoring screen at all times. The Messaging Operations group takes
this very seriously: if an operator wants a break, there must be another person
monitoring the console. During incident reviews, one of the aspects of the review
involves verifying that an operator was indeed present and watching the monitors
when the incident occurred.
- Uniform and standardized handling of incidents With scripted
and prescribed resolution steps that are tested and verified, front-line monitoring
operators can follow identical resolution paths no matter the level of personal
experience or expertise.
- Decreased support requirements for product experts If specialists
transfer their knowledge of how to resolve incidents to front-line monitoring operators
who do not have deep, Exchange-specific product knowledge, then the specialists
can focus their energies on other tasks. To accomplish this, the specialists in
the Messaging Operations group must perform the knowledge transfer and documentation
tasks so that front-line monitoring operators have clear instructions for how to
resolve Exchange-specific incidents.
- Measurable results By separating the overall operational
processes according to individual components, the Messaging Operations group can
measure each component to gather performance statistics. Having accurate statistics
is important because it enables management to have an accurate picture of the environment,
and they can therefore spot trends or process inefficiencies. Additionally, assigning
primary incident response work to the front-line operators is more cost-efficient
than having senior-level specialists resolve incidents.
- Focus on product validation Well-defined processes and roles
help the Messaging Operations group focus on the dual goals of maintaining high
standards for the messaging environment and providing product validation to the
Exchange Server product group and Microsoft customers. By freeing up specialist
resources to work in the pre-release production environment, the Messaging Operations
group can devote more time to checking features and functionality of beta builds
before the product reaches the marketplace. This results in the Messaging Operations
group identifying over 90 percent of product issues in beta code before anyone else
does.
Incident Life Cycle
The Messaging Operations group follows a structured framework for dealing with and
resolving incidents. The people involved in responding to and resolving incidents
follow a scripted series of processes. As discussed below, the life cycle of an
incident involves both front-line monitoring operators from the Shared Services
team and members of the Messaging Operations group, with defined roles, processes,
and tools used from the initial response to final resolution. The teams go through
the following incident life cycle:
- Awareness/notification Front-line monitoring operators become
aware of incidents via several sources. Most incidents originate with Microsoft
Operations Manager, which acts as the monitoring and detection system. Microsoft
Operations Manager includes rules to check the status of thousands of individual
factors, such as queue length and service status in the Exchange organization at
various levels of depth. Microsoft Operations Manager accomplishes this via an Exchange
Management Pack, which includes thousands of rules specifically for monitoring Exchange
Server 2007. For example, there are rules to verify that the server is up through
a heartbeat ping, and there are rules to perform synthetic mailbox logons to verify
that transport and delivery mechanisms function as expected. The Messaging Operations
group customizes the Exchange Management Pack by modifying rule settings and alert
triggers, as the examples in Table 1 show. Another way the Monitoring group
becomes aware of incidents is via users who report to the Helpdesk. While the Helpdesk
resolves most user issues, some incidents require escalation to the front-line monitoring
operators. If they cannot resolve an incident, then it is escalated to the Messaging
Operations group.
Note: Because Microsoft Operations Manager delivers alerts in real time,
and because alerts are proactive, front-line monitoring operators can resolve most
incidents before users ever report them to Helpdesk. The Messaging Operations group
resolves most of the incidents related to Exchange S
- Response Microsoft IT uses an incident tracking system that
integrates with Microsoft Operations Manager. Alerts generated by rules in
the Exchange Management Pack also automatically create a ticket in the tracking
database. Each alert includes guidance from two knowledge bases on how to resolve it:
the default knowledge base that comes with the Exchange Management Pack, and a messaging-specific
knowledge base. The Exchange Messaging team created the messaging-specific knowledge
base to gather very detailed information about incidents, in order to help with
product improvement. A goal of the Messaging Operations group is to create scripted
resolution guidance detailed enough for any front-line monitoring operator to follow
the procedure and resolve the incident. To clarify resolution procedures, members
of the Messaging Operations group routinely update the knowledge database with the
latest guidance, based on their experiences.
- Management As already mentioned, Microsoft Operations Manager
and the incident tracking system provide a way for operations personnel to view
details about incidents, such as incident type, occurrence, existing knowledge for
resolution guidance, and status. The tracking system enables front-line operators
to escalate incidents for resolution if the knowledge base instructions do not resolve
an incident. In addition to these tools, the Exchange Messaging team uses OpsWeb,
an internal line-of-business (LOB) application that is available to the Helpdesk
for viewing tickets, grouping them in selected views, appending Helpdesk tickets
to a master ticket, and checking for existing issues in order to avoid repeat escalations.
- Resolution The Messaging Operations group seeks to resolve
at least 80 percent of incidents at the Shared Services team level by the Monitoring
group, and at least 90 percent of the remaining incidents at Jim Leigh's Tier 3
Incident Management team. After resolving an incident, the members of the Monitoring
group mark the ticket as complete in the ticket tracking system and archive tickets
older than three months. Only the most difficult incidents, or those flagged for
further investigation and detailed root cause analysis, reach Ryan McDonald's Leads
team. If the Messaging Operations group does not resolve an incident, then the incident
further escalates to the Exchange Server product group, which has additional resources
such as developers who can perform a live debug. During debugging, developers examine
memory dumps to check for causes. Incidents that require additional research, product
updates, or a major change generate another ticket in the development database as
a change request. In this way, Microsoft IT helps the Exchange Server product group
by providing developers with a real-world environment for deep debugging and detecting
product issues that require code changes.
- Review As part of incident management within the incident
life cycle, the Messaging Operations group performs monthly reviews of the incidents.
The purpose of the review is to identify repeat incidents or trends of incident
types for closer inspection. After identifying an incident for closer inspection,
members of the Messaging Operations group analyze it to determine whether the incident
is indicative of a larger underlying problem. At this point, the Messaging Operations
group may decide to investigate it further and determine the root cause by using
problem-handling processes, as discussed in the next section, "Problem Handling."
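The resolution targets in the life cycle above imply how few incidents should reach the highest escalation tiers. The following Python sketch works through that arithmetic, assuming both targets are met exactly; the function name is hypothetical.

```python
def escalation_shares(first_line_rate=0.80, tier3_rate=0.90):
    """Split all incidents into the share resolved at each level.

    first_line_rate: fraction resolved by front-line monitoring operators.
    tier3_rate: fraction of the *remaining* incidents resolved at Tier 3.
    Returns (first_line, tier3, beyond_tier3) as fractions of all incidents.
    """
    first_line = first_line_rate
    tier3 = (1 - first_line_rate) * tier3_rate
    beyond_tier3 = 1 - first_line - tier3
    return first_line, tier3, beyond_tier3


if __name__ == "__main__":
    first_line, tier3, beyond = escalation_shares()
    print(f"front line: {first_line:.0%}, Tier 3: {tier3:.0%}, beyond: {beyond:.0%}")
```

With the stated targets, about 18 percent of all incidents are resolved at Tier 3, and roughly 2 percent remain for the Leads team or the Exchange Server product group.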
Table 1. Exchange Management Pack Customizations
| Rule | Description | Customization |
| Queue size. | Number of messages in transport queue. | Alert if there are more than 250 messages, which Exchange delivers in approx. 5 seconds. |
| Outlook Web Access connections. | Number of users connecting using Outlook Web Access. | This is a threshold rule. Microsoft IT decreases the threshold to see if decreased connections indicate users cannot connect. |
| Submission Queue Length--sustained for 5 minutes on Hub Transport. | Length of submission queue on Hub Transport servers. | Red alert if queue size is greater than 250. |
| Retry Mailbox Delivery Queue Length--sustained for 5 minutes on Hub Transport. | Size of retry queue on Hub Transport servers. | Changed to alert if queue size is greater than 250. |
| Largest Delivery Queue Length--sustained for 5 minutes on Hub Transport. | Largest length of delivery queue on Hub Transport servers. | Changed to alert if queue size is greater than 250. |
| Aggregate Delivery Queue Length (all queues)--sustained for 5 minutes on Hub Transport servers. | Length on all queues on Hub Transport servers. | Changed to alert if queue size is greater than 500. |
| CCR Service Verification Script and Exchange 2007 CCR Service Verification. | This is specific to checking that the CCR service is running on both nodes. | Alert if Service State is stopped. |
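Several rules in Table 1 fire only when a queue length stays above its threshold for five minutes. The Python sketch below illustrates that "sustained" logic in general terms. It is not how Microsoft Operations Manager implements its rules; the class, the function, and the sampling format are assumptions, and only the thresholds and the five-minute window come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SustainedRule:
    name: str
    threshold: int
    sustain_seconds: int = 300  # "sustained for 5 minutes"


# Thresholds taken from Table 1; rule names shortened for readability.
RULES = [
    SustainedRule("Submission Queue Length", 250),
    SustainedRule("Retry Mailbox Delivery Queue Length", 250),
    SustainedRule("Largest Delivery Queue Length", 250),
    SustainedRule("Aggregate Delivery Queue Length", 500),
]


def breached(rule, samples):
    """Return True if the value stays above the threshold long enough.

    samples: iterable of (timestamp_seconds, value) pairs in time order.
    A single sample at or below the threshold resets the window.
    """
    run_start = None
    for timestamp, value in samples:
        if value > rule.threshold:
            if run_start is None:
                run_start = timestamp
            if timestamp - run_start >= rule.sustain_seconds:
                return True
        else:
            run_start = None
    return False
```

Requiring a sustained breach rather than a single reading avoids alerting on short spikes that the transport servers clear on their own.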
Problem Handling
Whereas Microsoft IT incident management deals with restoring service as quickly
as possible, problem handling deals with minimizing the impact of an incident and
preventing recurring incidents by seeking to discover incident root causes. The
problem-handling discipline focuses on the resolution of the underlying causes of
the problem, rather than the speed of the resolution.
"Strict operational, proactive monitoring, and problem management processes enable
us to solve many incidents before users ever become aware of a service issue. We
prevent incident recurrence by systematically determining the root cause of incidents
using problem management techniques and introducing timely changes to the messaging
environment. By using Microsoft Operations Manager as the enterprise-monitoring
tool, we provide front-line operators with proactive alerting accompanied by the
latest resolution steps. Our problem-handling processes are repeatable in any Exchange
enterprise environment because they are based on industry-standard operations frameworks."
Ryan McDonald
Operations Manager
Microsoft Corporation
For the Messaging Operations group, problem handling involves helping the Exchange
Server product group ship the best software possible. This means not only solving
issues as they arise, but also finding the root cause of an issue, documenting it,
and making sure that there is either a published workaround, or permanent change
in the product, that addresses the issue. Problem handling fosters change because
it takes into account the people, processes, and technology involved in a particular
incident. Evaluating how an incident arose, how it was resolved, and digging down
to the root cause means considering what contributed to the problem: people, processes,
technology, or a mix of these factors.
The Messaging Operations group uses three environments in its problem-handling processes:
- Pre-release This environment provides the Messaging Operations
group with great flexibility in determining incident root causes because it is set
up for the express purpose of product improvement and validation. Therefore, meeting
rigorous SLAs and resolving incidents as quickly as possible is secondary to determining
an incident's root cause, working with developers to replicate and understand product
behaviors, and trying workarounds or product updates to rectify an incident. In
this environment, it is acceptable to take longer to analyze and resolve an incident,
because the analysis should determine the root cause of the incident.
- Corporate production environment In the corporate production
environment, problem handling complements incident management by finding, if possible,
the root cause of an incident. The Messaging Operations group allows extended downtime
only if an incident is not reproducible in another environment to ensure that the
developers implement necessary fixes in the product code before Microsoft releases
Exchange Server 2007 to customers. Because of the rigorous SLAs, and because
the Messaging Operations group must demonstrate product readiness, the Messaging
Operations group documents the settings and configurations that led to an incident,
in order to use that information when working to discover the root cause of the
incident.
- Microsoft Managed Services The Messaging Operations group
maintains this environment with the goal of restoring services as quickly as possible.
Availability is very important because Microsoft has contractual agreements with
clients to provide a specified level of service. This environment includes a dedicated
Centralized Infrastructure team that informs the customer of tickets related to
problem handling. However, the Messaging Operations group creates, manages, and
owns the tickets. To accommodate problem handling, analysis, and change management
in the MMS environment, there are scheduled change windows available every other
Saturday.
Selection Process
The Messaging Operations group selects which incidents to investigate based on two
factors: a list of incident types that require mandatory investigation and as-needed
inquiries for remaining incidents. The list of events that require the creation
of a problem ticket includes serious incidents, such as a queue size of more than
10,000 messages, UM service disruption, server outage, and so on. The Messaging
Operations group creates a problem ticket for any incident that severely affects
any service they support.
To select other tickets for investigation, the Messaging Operations group conducts
weekly incident review meetings to discuss all outstanding problem tickets and review
incidents from the previous week to determine whether any require the opening of
a new problem ticket. The most important criteria used to select as-needed incidents
for further investigation are incident frequency and trends. When the same type of
incident repeatedly occurs, it often signals an underlying problem that is not isolated
to just a few servers. Trend analysis helps to evaluate frequency over a larger
period to help determine whether to further investigate some incidents.
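The two selection factors described above can be sketched as a simple filter. This Python example is a hypothetical illustration: the type labels, the `Incident` class, and the repeat limit of three are assumptions. Only the 10,000-message queue size, UM service disruption, and server outage come from the text.

```python
from collections import Counter
from dataclasses import dataclass

# Incident types that always require a problem ticket (labels are hypothetical).
MANDATORY_TYPES = {"um_service_disruption", "server_outage"}
QUEUE_SIZE_LIMIT = 10_000
REPEAT_LIMIT = 3  # assumed cutoff for "repeatedly occurs"


@dataclass
class Incident:
    incident_type: str
    queue_size: int = 0


def needs_problem_ticket(incident):
    """Mandatory investigation: a listed type or an oversized queue."""
    return (incident.incident_type in MANDATORY_TYPES
            or incident.queue_size > QUEUE_SIZE_LIMIT)


def recurring_types(incidents, repeat_limit=REPEAT_LIMIT):
    """As-needed investigation: incident types seen at least `repeat_limit` times."""
    counts = Counter(i.incident_type for i in incidents)
    return sorted(t for t, n in counts.items() if n >= repeat_limit)
```

In a weekly review, `needs_problem_ticket` would flag the mandatory cases, and `recurring_types` would surface candidates for trend-based investigation.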
After narrowing down the list of incidents to investigate, the
Messaging Operations group performs a sanity check on the
incident summary to ensure that all the necessary components are present to open
a problem ticket. For example, the incident report must include a full set of notes
that document the incident through every step, from initial alert to final resolution.
Without these components, the Messaging Operations group cannot select what to investigate
because there is not enough data. After opening a ticket, the Messaging Operations
group uses its tools to assign the ticket to a team member, who is responsible for
following up at least once a week until the issue is resolved.
The Messaging Operations group maintains its own database tool to create and manage
problem tickets. In addition, the team uses a custom Office SharePoint Server 2007
site for problem ticket review. These tools track progress, help assign resources
to a problem, and facilitate managing problem status. Figure 3 shows a sample problem
ticket in the major problem review (MPR) tool:
Figure 3. MPR ticket
Problem Review
Problem management also includes problem review and metrics. After finding the root
cause of an issue, the team member responsible for handling the issue must create
a corresponding entry in the knowledge base or similar documentation within seven
days of resolution of the issue. Through experience, the Messaging Operations group
discovered that if a problem occurs once on a specific server or group of servers,
it often recurs with other servers. Therefore, it is most efficient to provide comprehensive
resolution steps for front-line monitoring operators as soon as possible. The Messaging
Operations group reviews problems at weekly meetings to provide status updates and
to open new problem tickets.
At monthly meetings, the Messaging Operations group tracks metrics and trends, which
presents an opportunity to view a scorecard of statistics and trends for the month.
The scorecard consists of various MPR data viewed through pivot tables, including
the number of tickets opened and the number of tickets resolved, with the root cause
determined.
Configuration Management
The Messaging Operations group uses configuration management processes as an opportunity
to identify, record, and report on configuration items. These processes enable the
Messaging Operations group to control the messaging infrastructure by maintaining
information on the resources required to deliver messaging services.
The Messaging Operations group uses the following processes for configuration management:
- Configuration item identification Many software and hardware
settings affect messaging services. Configuration items come from Windows® Registry,
Active Directory, Windows Management Instrumentation (WMI) providers, the Internet
Information Services (IIS) metabase, and the file system. These sources specify
settings relevant to messaging, such as specific registry keys and network settings.
Microsoft IT maintains a Configuration Management Database (CMDB) in the form of
Excel spreadsheets for specific configuration data. Microsoft IT modifies this information
according to performance measures and best practices.
- Baseline configuration template creation Microsoft IT analyzes
the configuration items in the CMDB Excel spreadsheets to create baseline configuration
templates. These templates vary for each server role and hardware type. They enable
Microsoft IT to document standard collections of settings and rapidly make changes
across multiple servers. Microsoft Systems Management Server (SMS) includes the
capability to define templates and deploy them to servers as well as run audits
to check for settings not in compliance with configuration templates. SMS also enables
front-line monitoring operators to check for noncompliance with Microsoft Operations
Manager. Front-line operators take action proactively upon discovering configuration
noncompliance, and thereby avoid service interruptions.
- Configuration item and template management Microsoft IT
uses Microsoft Systems Management Server (SMS) Desired Configuration Monitoring
(DCM) 2.0 to audit the configuration settings based on template definitions. Based
on SMS and the Microsoft Exchange Server Best Practices Analyzer (ExBPA), SMS DCM 2.0
compiles reports that can help detect noncompliant configuration settings. The Messaging
Operations group configures SMS DCM 2.0 to audit template compliance every
eight hours. In the event of noncompliance, Microsoft Operations Manager raises
an alert to the front-line operators, who then respond to the incident.
- Risk analysis of changes and releases Prior to implementing
changes, product updates, and new releases, the Messaging Operations group considers
the impact of the changes. The Messaging Operations group notes the required changes
to configuration items and maintains rollback procedures and backup templates to
ensure that SMS applies the correct template version regardless of changes to the
messaging environment.
- Best practices industry knowledge sharing The Messaging
Operations group continuously shares its knowledge with Microsoft customers and
consultants. Microsoft consultants who design and support solutions for clients
use the knowledge from the Messaging Operations group to customize configuration
templates for use in client messaging environments. Additionally, when the Messaging
Operations group makes changes to configuration items that are relevant to other
customers, these specific changes flow into ExBPA. In this way, customers have access
to the latest best practices configurations from real-world operations of an enterprise-messaging
environment.
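The audit step described above compares live settings with a baseline template and reports deviations. The following Python sketch shows that comparison in its simplest form. It does not use or represent the SMS DCM 2.0 interfaces; the function, the sentinel value, and the setting names in the usage note are hypothetical.

```python
MISSING = "<missing>"  # sentinel for a setting that is absent on the server


def audit(template, actual):
    """Compare a server's settings with its baseline template.

    template: dict of setting name -> expected value for the server role.
    actual:   dict of setting name -> value found on the server.
    Returns a dict of setting name -> (expected, found) for each deviation.
    Settings present on the server but not in the template are ignored.
    """
    drift = {}
    for key, expected in template.items():
        found = actual.get(key, MISSING)
        if found != expected:
            drift[key] = (expected, found)
    return drift
```

For example, `audit({"registry:ExampleKey": 1}, {"registry:ExampleKey": 0})` returns `{"registry:ExampleKey": (1, 0)}`. An empty result means the server complies with its template, and a non-empty result is the kind of finding that would raise an alert to front-line operators.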
Change Control
For most IT organizations, introducing changes to a production environment typically
involves a source for inputting changes (such as user feedback) and a way to design
and verify the changes in a sandbox environment. Then they roll out changes in a
software update or similar mechanism to the production environment. Additionally,
change control incorporates review processes and management tools to ensure that
teams working on changes track and complete change requests.
For the Messaging Operations group specifically, change control encompasses the
traditional processes of accepting change requests from multiple sources and then
designing, verifying, and rolling out changes. The underlying goal is to make
the handling of the change processes more prescriptive. The Messaging Operations group,
in working through its change control processes, attempts to reduce the workload
on specialists and distribute work to others by thoroughly documenting steps and
procedures for implementing changes. This enables those who are not Exchange Server 2007
specialists to apply changes uniformly to all computers in the messaging environment.
A key success factor that enables the Messaging Operations group to manage all change
requests centrally is the forward schedule of change (FSC) tool. The FSC tool is
a custom Office SharePoint Server 2007 Web site with the list of change requests
displayed in a calendar view. It enables the Messaging Operations group to create
requests for changes (internally known as RFCs, not to be confused with Request for
Comments [RFC]), approve them, schedule implementation, and manage changes. Typically,
project managers and service managers create RFCs based on requests from the Exchange
Server product group, whereas the Messaging Operations group enters emergency RFCs
when responding to an incident. The service managers and project managers perform
an initial feasibility analysis before entering RFCs by working with the Exchange
Server product group to provide feedback regarding change implementation feasibility.
Figure 4 shows the FSC tool:
Figure 4. FSC tool for change control
The Messaging Operations group uses the FSC tool to facilitate the following change
control processes:
- Request for change creation Any internal user can create
an RFC in the change management tool, such as a request for additional product features
or a fix for a discovered root cause.
- Selection The Messaging Operations group accepts RFCs if
they meet approval criteria such as completeness of information. Information completeness
includes detailed rollback and rollout instructions, severity, and expected turnaround
time frames. In dealing with RFCs, the Messaging Operations group maintains a staged
set of instructions that a non-specialist can follow to implement the change.
- Severity and impact analysis: The change advisory board conducts
weekly meetings to select RFCs, update status, and evaluate the impact of new and
ongoing changes. The board categorizes RFCs based on SLA impact, capacity, security,
and disaster recovery readiness, and assigns minor, major, or automatic severity
status. The automatic status is used for small changes that can be implemented without
further investigation because they are either critical to performance and availability
or pose no significant risk.
- Prioritization: The RFC urgency complements its severity
status. Some changes represent emergency solutions and require implementation in
24 hours or less, whereas others may be moderately urgent and can be scheduled for
completion over one or more weeks.
- Implementation: As part of the implementation process, the
Messaging Operations group maintains a known change type list that details change
categories for RFCs and the permitted change window.
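The selection and prioritization steps above can be pictured as checks on a simple record. The following Python sketch is illustrative only and does not describe the FSC tool itself: the `Rfc` class, its field names, and the helper functions are hypothetical. Only the completeness items, the three severity statuses, and the 24-hour emergency limit come from the text.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    """Severity statuses named in the text."""
    MINOR = "minor"
    MAJOR = "major"
    AUTOMATIC = "automatic"


@dataclass
class Rfc:
    """Hypothetical RFC record; every field name here is an assumption."""
    title: str
    rollout_instructions: str = ""
    rollback_instructions: str = ""
    severity: Optional[Severity] = None
    expected_turnaround: Optional[timedelta] = None
    emergency: bool = False


def missing_items(rfc: Rfc) -> List[str]:
    """List the completeness items an RFC still lacks (selection step)."""
    missing = []
    if not rfc.rollout_instructions.strip():
        missing.append("rollout_instructions")
    if not rfc.rollback_instructions.strip():
        missing.append("rollback_instructions")
    if rfc.severity is None:
        missing.append("severity")
    if rfc.expected_turnaround is None:
        missing.append("expected_turnaround")
    return missing


def is_complete(rfc: Rfc) -> bool:
    """An RFC passes the completeness criterion when nothing is missing."""
    return not missing_items(rfc)


def latest_completion(rfc: Rfc) -> Optional[timedelta]:
    """Emergency RFCs are capped at 24 hours; others keep their stated turnaround."""
    if not rfc.emergency:
        return rfc.expected_turnaround
    cap = timedelta(hours=24)
    if rfc.expected_turnaround is None:
        return cap
    return min(cap, rfc.expected_turnaround)
```

In this sketch, an RFC that lacks rollback instructions is reported as incomplete rather than silently rejected, so the requester can see what to add before resubmitting.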
Team Involvement
In Microsoft IT, many people contribute during change control tasks, but the ultimate
responsibility for resolving issues rests with the Messaging Operations group, which
oversees the performance of Exchange Server 2007 across all Microsoft IT environments.
Especially in the pre-release environment, the Messaging Operations group has many
opportunities to verify specific product functionality of builds.
The Messaging Operations group is the first in the line of contributors that submit
change requests to the Exchange Server product group. After identifying an issue
that requires changes to the Exchange Server product code, the Messaging Operations
group works with developers to create a design change request in the developer database,
which automatically becomes the responsibility of the Exchange Server product group
and developers. Although others may create product updates and other changes, the
Messaging Operations group is responsible for requesting, verifying, and approving
builds and updates for rollout to production environments. The Messaging Operations
group provides a disciplined process for introducing required changes into a complex
IT environment with minimal disruption to ongoing operations. The Messaging Operations
group remains closely aligned with the release management process, and manages the
release and deployment of changes into the production environment.
Code and Product Improvement
Another aspect of change control processes involves product validation, which the
Messaging Operations group performs in collaboration with the Exchange Server product
group. During beta testing and pre-release partner deployments, the Exchange Server
product group may decide to implement changes to the code based on tester and partner
feedback. Although the Exchange Server product group controls the code, the Messaging
Operations group is responsible for validating the functionality of changed features
and proving enterprise readiness by running the changed code in the pre-release production environment
after the Exchange Server product group completes typical quality assurance tasks.
Operations Process Improvement
Within the incident management, problem handling, change control, and configuration
management processes that the Messaging Operations group performs, there is a constant
effort to improve processes and thereby realize new levels of efficiency, scalability,
repeatability, and cost savings.
User Feedback
A key source of process improvement is feedback from end users. Although the Helpdesk
at Microsoft deals with first-tier support issues related to messaging, the Messaging
Operations group participates in a satisfied-user initiative, which gathers
feedback from users regarding functionality and performance in the messaging environment.
The Messaging Operations group uses surveys to request feedback from users on satisfaction
in a particular messaging service area, such as response times and availability,
as well as to check users' general satisfaction. Some of the internal SLAs cover
user satisfaction; meeting those SLAs and analyzing sources of dissatisfaction lead
to an analysis of the people, processes, and technology that are used to deliver
messaging services. When this analysis results in the discovery of better processes
or different combinations of people, processes, and technology, the Messaging Operations
group makes appropriate changes to enact these improvements.
Exchange Center of Excellence
Microsoft IT deployed the first server running Exchange Server in the corporate production
environment more than ten years ago. Since that time, the Exchange Messaging team
has amassed a wealth of experience and best practices around Exchange Server designs,
operations, and troubleshooting. As part of its commitment to customers to share
knowledge about running Exchange Server 2007, the Exchange Messaging team established
the Microsoft Exchange Center of Excellence (ECoE), which is a task force inside
Microsoft aimed at helping customers get the most out of their Exchange Server deployments.
The ECoE is a cross-organization team that also includes the Exchange Server product
group and Microsoft Consulting Services (MCS). Its mission is to help customers
better manage Exchange Server, taking advantage of expertise gained by Microsoft
IT employees running the product in-house. The ECoE also administers the Microsoft
Exchange Server Risk Assessment and Health Check Program (ExRAP) at Microsoft. An
ExRAP engagement provides detailed on-site technical and operational analysis of
large Exchange Server 2007 deployments.
Interaction with Microsoft Customers
The Messaging Operations group actively shares its knowledge with the messaging
community and uses this interaction to gather feedback that helps
improve operations processes. There are many ways the Messaging Operations
group shares knowledge. For example, members participate in industry conferences,
conduct seminars and presentations, and share operational knowledge with MCS, which
then uses it for specific customers. They also participate in partner programs to
perform product validation during alpha and beta releases of Exchange Server.
Another way the Messaging Operations group engages with customers is through the
IT Fellowship series. Customers can talk with Microsoft IT about IT operations and
specific services, and discuss general best practices during this two-week program.
Real-Life Operations Scenario
As previously mentioned, the Messaging Operations group interacts with many Microsoft
IT service teams as well as the Exchange Server product group to accomplish its
dual goals of meeting SLAs and providing product validation to customers. Because
of the volume of activities and work, the group follows a structured model of operations
based on MOF and ITIL. These frameworks provide structure and guidance for messaging
operations, yet operations architects must ultimately make decisions based on what
works in real-world IT environments. As Figure 5 shows, the Messaging Operations
group follows an orderly operations workflow with straightforward escalation paths
and clear task assignments:
Figure 5. Messaging Operations group workflow
In its workflow, the Messaging Operations group defined the day-to-day tasks of
operating the messaging environment, including responding to incidents, resolving
incidents, determining the root cause of incidents, changing and improving the environment,
and enforcing consistent hardware and software configurations across all servers.
The following example demonstrates how all these processes fit together. It shows
how the Messaging Operations group resolved a specific performance issue in the
messaging environment.
Incident Response
The situation arose on October 1, 2006, when a front-line monitoring operator from
the Shared Services team noticed an alert that Microsoft Operations Manager issued
to the monitoring console. The alert reported that an Exchange server was experiencing
poor performance, as shown by slow server response times. The following performance
counters had exceeded their alert thresholds:
- Averaged latency over 50 for five minutes
- Processor load greater than 90 percent for five minutes
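Alert conditions of this kind fire only when a counter stays past its threshold for a sustained period, not on a single spike. The Python sketch below shows one way to express such a rule; it is not how Microsoft Operations Manager evaluates alerts. The function name, the sample list format, and the 60-second sampling interval are assumptions. The thresholds (50 for latency, 90 percent for processor load) and the five-minute window come from the text, which does not state the latency unit.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float,
                     window_seconds: int = 300,
                     interval_seconds: int = 60) -> bool:
    """Return True if every sample in the most recent window exceeds threshold.

    `samples` is ordered oldest to newest, one reading per `interval_seconds`
    (an assumed sampling interval). With the defaults, the last five readings
    must all be above the threshold.
    """
    needed = window_seconds // interval_seconds
    if needed <= 0 or len(samples) < needed:
        # Not enough history to cover the whole window yet.
        return False
    return all(value > threshold for value in samples[-needed:])


# Hypothetical readings, one per minute.
processor_load = [95, 97, 93, 96, 99]   # above 90 for the full window
latency = [40, 60, 70, 55, 65]          # first reading is below 50

cpu_alert = sustained_breach(processor_load, threshold=90)
latency_alert = sustained_breach(latency, threshold=50)
```

Here `cpu_alert` is true because all five readings exceed 90, while `latency_alert` is false because one reading in the window falls below 50. Requiring the whole window to breach is one interpretation of "for five minutes"; averaging over the window would be another.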
The front-line monitoring operator followed the suggested resolution steps to try
to resolve the incident, including restarting services and monitoring individual
service resource utilization. However, system performance did not improve. Because
the front-line monitoring operator could not resolve the incident with the suggested
knowledgebase resolution steps, the operator escalated the incident to Messaging
Operations' Tier 3 Incident Management team by assigning the incident ticket to
the Tier 3 Incident Management team alias in the ticket-tracking database.
Change Control and Configuration Management
Working with the Leads team, the Exchange Server product group followed up on the
unresolved incident by analyzing captured system and memory data. The Exchange Server
product group investigated all the submitted data, including similar occurrences
in the past. This was not a one-time issue that suddenly occurred; front-line operators
had previously noticed sporadic performance degradations. By comparing the various
incidents, the Messaging Operations group recognized a pattern pointing to a third-party
driver as the component responsible for the performance degradation. After determining
that the driver was the root cause, the Tier 3 Incident Management team member created
an emergency RFC in the FSC tool in order to immediately implement a remedy, and
later follow it up with a permanent change. The risk analysis determined that this
change did not pose a significant risk because it did not affect Exchange Server
settings.
The change advisory board approved the emergency RFC quickly due to the critical
nature of the issue, which enabled the Messaging Operations group to disable the
driver across all affected servers rapidly. To ensure that the driver was disabled,
the Messaging Operations group modified the SMS DCM 2.0 template used
on the servers to include checking for the driver during SMS configuration compliance
audits.
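Conceptually, such a compliance audit compares each server's observed state against the desired baseline and reports the exceptions. The Python sketch below shows that comparison in the abstract. It does not use the SMS DCM template format or any SMS interface; the inventory dictionary, the server names, the driver name `thirdparty.sys`, and the function name are all hypothetical.

```python
from typing import Dict, List, Set


def noncompliant_servers(enabled_drivers: Dict[str, Set[str]],
                         banned_driver: str) -> List[str]:
    """Return the sorted names of servers where the banned driver is still enabled.

    `enabled_drivers` maps a server name to the set of drivers currently
    enabled on it (an assumed inventory format).
    """
    return sorted(
        server
        for server, drivers in enabled_drivers.items()
        if banned_driver in drivers
    )


# Hypothetical inventory gathered during an audit.
inventory = {
    "mbx-01": {"disk.sys", "net.sys"},
    "mbx-02": {"disk.sys", "net.sys", "thirdparty.sys"},
    "mbx-03": {"disk.sys"},
}

exceptions = noncompliant_servers(inventory, "thirdparty.sys")
```

In this example `exceptions` contains only `mbx-02`, the one server where the driver is still enabled. An empty list would mean the emergency change has been applied everywhere.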
As part of following up on the incident and implementing a permanent solution, the
Messaging Operations group notified the provider of the third-party driver about
the incident, and requested that, to ensure stability, the vendor provide a pre-production
driver for use in the pre-release production environment. After the group received the new
driver and verified its stability, the resolution process returned to the problem-handling
discipline to review the incident and implement a permanent change.
Problem Handling
In this particular incident, the problem-handling processes overlapped with many
other processes, such as change management and configuration management. Such overlap
is common in the real world; the structured processes serve as a guide to
accomplishing operational goals. It is important to note, however, that even though
multiple teams contributed to the root cause analysis and emergency change implementation,
the Messaging Operations group owned and was responsible for handling the incident
and associated issue until the final resolution.
In combination with verifying the functionality of the new driver that the third-party
vendor supplied, the Messaging Operations group created an MPR ticket during a weekly
incident review meeting. (At this point in the incident resolution process, the
person who owns the MPR must provide at least weekly updates until the problem is
resolved.) The MPR included the incident notes, action steps taken, and other details
common to MPRs as previously mentioned in the section titled "Problem Handling."
The MPR also included an action item to ensure the installation of the new driver
across all servers in the latest cycle of driver updates.
After verifying the new driver's functionality, the Messaging Operations group created
a change request in the FSC tool with normal priority, in order to ensure a uniformly
applied permanent fix. During the next available time frame for rolling out updates,
the Messaging Operations group installed the new driver in the environment.
Resolution
It is not sufficient for the Messaging Operations group to implement an emergency
fix followed by a permanent fix in the messaging environment. The Messaging Operations
group must also ensure that the incident does not occur in the future, both in the
Microsoft messaging environment and in customer environments.
The Messaging Operations group handles both aspects of the issue resolution process
separately. To prevent issue recurrence in the Microsoft IT environment, the Messaging
Operations group works with the Infrastructure team to make the new driver the default
driver for all future builds and deployments. To provide the solution to all customers,
the Messaging Operations group works with the Exchange Server product group to include
a corresponding driver check in the ExBPA tool. This is one example of how the Exchange
Messaging team translates the Microsoft IT vision of being the first and best customer
of Microsoft into concrete proactive help for every Microsoft customer running Exchange
Server 2007.
Best Practices
By adopting an operations framework based on industry standards such as MOF and
ITIL, the Messaging Operations group was able to identify best practices that cover
daily tasks of operations and provide guidance for IT professionals for designing
and operating an enterprise-messaging environment based on Exchange Server 2007.
Some of these best practices apply to all operations, such as adopting scalable
and flexible processes, whereas others apply only to specific disciplines,
such as incident management or change management.
The Messaging Operations group relies on the following best practices:
- Use tools for tracking and management: The Exchange Messaging
team uses many tools as part of its Exchange Server operations. For example, Microsoft
Operations Manager and SMS combined with a ticket-tracking database provide the
capability to monitor the environment in real time, including configuration data,
and track incidents from inception to resolution. The Messaging Operations group
uses specialized tools, such as custom LOB applications for problem review and change
implementation, in addition to custom scorecards for metrics and diagnostic and
troubleshooting tools.
- Implement review processes for each discipline: The Messaging
Operations group specifically includes review processes for incident review, problem
handling, configuration management, and change control. This enables the group to
optimize its processes on an ongoing basis and foster a culture that embraces change
and emphasizes improvements.
- Create baseline configurations: To be able to analyze risks,
it is important to know risk dependencies and effects in terms of configuration
items. Before making changes, the Messaging Operations group evaluates change impact
and enforces changes across all servers by using templates that specify configuration
specifics.
- Perform monitoring centrally: Microsoft IT relies on a Shared
Services team for all monitoring in order to gain the benefit of centralized monitoring
without the cost of using a distinct team for each service. With Microsoft Operations
Manager, centralized monitoring provides at-a-glance summaries of system status
as well as detailed reports and alerts.
- Conduct weekly and monthly reviews: The Messaging Operations
group reviews incidents, requests for changes (RFCs) and changes, and problem-handling
tickets on a weekly basis. Additionally, there is a monthly review to identify trends
and summarize performance.
- Systematize resolution steps and transfer knowledge to front-line operators: With
each new incident, the Messaging Operations group has an opportunity to improve
the resolution guidance for front-line operators. The Messaging Operations group
both reviews existing steps to improve guidance and documents resolution steps for
new incidents from data gathered during problem handling and root-cause analysis
processes.
- Measure statistics: The Messaging Operations group measures
performance not only against overall SLAs but also against specific internal SLAs,
which makes it easier to spot trends and target performance improvements.
Conclusion
From the earliest days of operating the corporate messaging environment with Exchange
Server to the present, Microsoft IT has continually increased its performance and
availability targets and the scope of its goals for the messaging environment. Microsoft
IT delivers consistent verification to even the most demanding customers that Exchange
Server technology can meet rigorous availability and performance requirements by
providing messaging services with a no-exception, end-to-end policy from a user's
point of view and by achieving 99.99 percent availability.
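To put the 99.99 percent figure in perspective, an availability target can be converted into a downtime budget with simple arithmetic. The Python sketch below is a generic calculation, not the measurement method that Microsoft IT uses; the function name and the choice of a 365-day year are assumptions.

```python
def downtime_budget_minutes(availability_percent: float,
                            period_days: float = 365) -> float:
    """Return the minutes of downtime allowed in a period at a given availability.

    The budget is the total minutes in the period multiplied by the
    unavailable fraction (1 - availability).
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# 99.99 percent over a 365-day year and over a 30-day month.
yearly_budget = downtime_budget_minutes(99.99)
monthly_budget = downtime_budget_minutes(99.99, period_days=30)
```

At 99.99 percent, the budget works out to roughly 52.6 minutes per 365-day year, or a little over 4 minutes in a 30-day month, which shows how little room such a target leaves for unplanned outages.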
For the Messaging Operations group, delivering consistent results requires using
the right mix of technology, people, and processes. Exchange Server 2007, Microsoft
Operations Manager, and other Microsoft server products provide a sound technological
foundation upon which people and processes can rely. The modular and flexible product
design of Exchange Server 2007, based on server roles, promotes technical specialization
within the Exchange Messaging team. The MOF and ITIL frameworks are the bases for
implementing clear communication paths, team hierarchies, and escalation procedures.
Among other things, Exchange Server 2007 helps the Messaging Operations group
to decrease operational costs through improved management and administration tools,
while offering new technologies such as CCR for increased performance and availability.
These benefits are not specific to Microsoft IT; they are repeatable in other
environments that follow proven operational processes based on industry-standard
frameworks.
The Messaging Operations group continues to drive forward process improvement and
knowledge sharing with other service teams as well as Microsoft customers. This
includes all levels of IT operations: incident handling and response, problem handling,
change management, configuration management, and even showing other IT organizations
how to improve processes through reviews and improvement initiatives. The Exchange
Messaging team is the originator of many change requests submitted to the Exchange
Server product group for implementation in product updates, service packs, and future
versions of Exchange Server. Internal experiences and customer feedback are the
main sources of these change requests. The close collaboration between the Exchange Messaging team and the
Exchange Server product group ensures that Exchange Server technology continues
to meet the present and future needs of real-world customers.
For More Information
For more information about Microsoft products or services, call the Microsoft Sales
Information Center at (800) 426-9400. In Canada, call the Microsoft Canada Information
Centre at (800) 563-9048. Outside the 50 United States and Canada, please contact
your local Microsoft subsidiary. To access information through the World Wide Web,
go to
http://www.microsoft.com
http://www.microsoft.com/technet/itshowcase