Operations Guide

Article
12/05/2007

This section of the Microsoft® Hotmail® Migration Technical Case Study covers the deployment aspects of successfully migrating from FreeBSD and the Apache server to Windows® 2000 and Internet Information Services (IIS).

Preface

The operations section of this technical case study focuses on the "how-to" of the migration. However, only selected, pertinent areas are covered. The following items are intentionally not covered because there was little or no change in the support method between the pre-migration and post-migration environments:

Inventory/Asset Management
Configuration Management. However, the next phase of this effort is to use Group Policy objects (GPOs) to provide this functionality. Other partners have gained significant functionality by using GPOs.
Change Management
Migrating legacy code that depends on UNIX facilities to native Windows 2000 facilities (for example, "cron" jobs migrating to Scheduler/AT jobs)

For best practices, see the Microsoft Operations Framework material at https://www.microsoft.com/mof/.

Fault Isolation, Recovery, Notification, and Remote Management

As with many predominantly UNIX shops, Hotmail had developed a sophisticated but proprietary set of monitoring tools over the years. These tools assisted operations in the day-to-day tasks of supporting the FreeBSD environment.

Statistics, reports, and metrics about the current and historical health of the system could be viewed at any time through a Web interface. A homegrown agent ran on the FreeBSD machines. For all intents and purposes, it was a simplified Simple Network Management Protocol (SNMP) tool that collected per-machine statistics and turned them into alarms or trouble tickets when a machine was outside regular operating parameters, so that the operations staff could take corrective action. See below for additional detail.

One of the critical success factors for the migration was minimizing the impact on operations. This is one of the areas where the Microsoft Interix subsystem for Windows 2000 played an important role. The client agent, which had run under FreeBSD, was rewritten to run as a Windows 2000 service. However, the vast majority of the UNIX scripts continued to work unchanged. The Web site that reported the health of the environment used the syslog feature to gather statistics. Microsoft Interix provides the same syslog-compatible file format and function as FreeBSD, so the Web reporting functionality continued to work.

As part of the migration to Windows 2000, the client agent, now a Windows 2000 service, was extended to return arbitrary performance counters. These counters have been very useful because most performance measurements can be retrieved from them. The application was also extended to expose multiple performance counters of its own, which provide granular, instantaneous feedback about the state of the application. Instrumenting the application in this way allowed the developers and system administrators to focus on what constituted a valid alarm condition and thereby generate an event only when something required action.

Common challenges that system administrators face when identifying the thresholds that should generate an alarm are:

Determining which performance counters to collect.
Determining what constitutes the trigger level for reporting an issue (for example, do you alert at 85 percent CPU or 90 percent CPU).

Bottom line: You will not get these things right the first time, so being able to update the agent easily (within limits) during one of the regular content pushes is a good thing.
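The two challenges above, which counters to collect and where to set the trigger level, can be illustrated with a minimal sketch. It is not the Hotmail agent. The counter names, the limits, and the idea of requiring several consecutive breaching samples before alerting are all assumptions chosen for illustration. The point it shows is that thresholds kept as data are easy to retune in a regular push.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    counter: str     # performance counter name (hypothetical)
    limit: float     # a sample above this value counts as a breach
    sustained: int   # consecutive breaching samples required to alert


def breached(samples, threshold):
    """Return True if the last `sustained` samples all exceed the limit."""
    if len(samples) < threshold.sustained:
        return False
    recent = samples[-threshold.sustained:]
    return all(value > threshold.limit for value in recent)


def evaluate(history, thresholds):
    """Return the names of counters that currently warrant an alarm."""
    return [t.counter for t in thresholds
            if breached(history.get(t.counter, []), t)]


# Thresholds are plain data, so they can be changed without touching the logic.
thresholds = [
    Threshold("cpu_percent", limit=85.0, sustained=3),
    Threshold("queue_length", limit=100.0, sustained=2),
]
# Hypothetical recent samples per counter, oldest first.
history = {
    "cpu_percent": [70.0, 88.0, 91.0, 93.0],
    "queue_length": [20.0, 150.0, 40.0],
}
alarms = evaluate(history, thresholds)
print(alarms)
```

Here the CPU counter alerts because its last three samples all exceed 85, while the single queue-length spike does not, which is one way to avoid generating events that require no action.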

The daily management of the machines was always envisioned to be a difficult change for the administration staff. The ratio of administrators to servers at Hotmail is very low (approximately 15 administrators for 3,800 machines), and a per-machine management solution was not realistic. (At the time of this writing, there are still 15 administrators, but the number of machines has risen to over 5,000.) The administrative staff is geographically remote from the machines. Therefore, almost everything must be automatic, and everything must be manageable remotely (aside from physical install/replace). In general, machines should look after themselves. They are either serving pages, or they are out of the loop. When the number of machines out of the loop gets beyond a certain threshold (normally 5% of a cluster), an administrator will take steps to repair or replace the machines. Terminal Services is used for access to the machines. However, many of the administrators still have to manage the UNIX storage servers, so ssh/telnet is the primary management interface. Rsh and rdist scripts are still used extensively to configure the machines and deploy code to the site. For more information, see Software Distribution.
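The "out of the loop" rule described above can be sketched as a per-cluster check. This is an illustration only. The cluster names, the machine names, and the data layout are hypothetical, and the 5% default simply mirrors the threshold quoted in the text.

```python
def clusters_needing_attention(cluster_status, max_out_fraction=0.05):
    """Return clusters whose out-of-rotation share exceeds the threshold.

    cluster_status maps a cluster name to a dict of machine name -> bool,
    where True means the machine is serving pages.
    """
    flagged = {}
    for cluster, machines in cluster_status.items():
        if not machines:
            continue  # an empty cluster has no meaningful fraction
        out = sum(1 for serving in machines.values() if not serving)
        fraction = out / len(machines)
        if fraction > max_out_fraction:
            flagged[cluster] = fraction
    return flagged


# Hypothetical data: two clusters of 40 machines each.
status = {
    "web-a": {f"a{i:02d}": i >= 1 for i in range(40)},  # 1 of 40 out
    "web-b": {f"b{i:02d}": i >= 3 for i in range(40)},  # 3 of 40 out
}
flagged = clusters_needing_attention(status)
print(flagged)
```

With one machine in 40 out of rotation (2.5%), the first cluster is left alone. With three out (7.5%), the second is reported so that an administrator can schedule repair or replacement.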

Software Distribution

Scripts to implement the new version of the site code or other software components are pushed out to each machine (through RDIST). Usually, a distribution release is not a full site code release; in that case, the files are placed directly in the production directories on the server. A full release of the site code is stored in a separate directory instead. A "cron" job, set up in advance, wakes up and checks whether there is anything to execute. If there is, it stops the IIS service, renames the old and new directories, and then restarts the IIS service. That way, it is possible to synchronize the release of a new version of Hotmail code instantly across many thousands of machines. It also means that rollback is easier because the previous version of the code remains on the servers. Rdist provides checksum functionality for the scripts and other checks to make sure that all the machines have all the right bits on them.
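The stop, rename, and restart sequence for a full release can be sketched as follows. This is a minimal illustration, not the Hotmail scripts. The directory names are hypothetical, and stopping and starting the Web server are represented by caller-supplied functions rather than real service-control calls.

```python
import os
import tempfile


def swap_release(live_dir, staged_dir, previous_dir, stop_service, start_service):
    """Swap a staged release into place; return True if a swap happened.

    stop_service and start_service are caller-supplied callables
    (hypothetical stand-ins for stopping and restarting the Web server).
    """
    if not os.path.isdir(staged_dir):
        return False  # nothing staged, so there is nothing to execute
    if os.path.exists(previous_dir):
        raise RuntimeError("previous release still present; clear it first")
    stop_service()
    try:
        os.rename(live_dir, previous_dir)       # keep old code for rollback
        try:
            os.rename(staged_dir, live_dir)     # new code goes live
        except OSError:
            os.rename(previous_dir, live_dir)   # put the old code back
            raise
    finally:
        start_service()                         # always bring the service back
    return True


# Demonstration in a scratch directory with hypothetical names.
root = tempfile.mkdtemp()
live = os.path.join(root, "site")
staged = os.path.join(root, "site.new")
previous = os.path.join(root, "site.old")
for path, version in ((live, "v1"), (staged, "v2")):
    os.mkdir(path)
    with open(os.path.join(path, "VERSION"), "w") as handle:
        handle.write(version)

events = []
swapped = swap_release(live, staged, previous,
                       stop_service=lambda: events.append("stop"),
                       start_service=lambda: events.append("start"))
print(swapped, events)
```

Because the swap is two renames rather than a copy, the service is down only briefly, and rolling back amounts to renaming the directories in the opposite direction.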

Note: In other scenarios, we have successfully used Robocopy.exe (a resource kit utility) in place of RDIST. RDIST was the logical choice at Hotmail because the storage machines are still running UNIX and RDIST was already in use.
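Whichever copy tool is used, the check that every machine has the right bits can be illustrated with a checksum manifest. The sketch below is an assumption-laden example, not a description of how RDIST or Robocopy work internally. SHA-256 and the manifest layout are choices made here for illustration, and the file names are hypothetical.

```python
import hashlib
import os
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(base):
    """Map each file's path relative to base to its digest."""
    manifest = {}
    for folder, _dirs, files in os.walk(base):
        for name in files:
            full = os.path.join(folder, name)
            rel = os.path.relpath(full, base).replace(os.sep, "/")
            manifest[rel] = file_digest(full)
    return manifest


def differences(expected, actual):
    """Return sorted relative paths that are missing, extra, or changed."""
    paths = set(expected) | set(actual)
    return sorted(p for p in paths if expected.get(p) != actual.get(p))


# Demonstration: a source tree and a target that is out of date.
source = tempfile.mkdtemp()
target = tempfile.mkdtemp()
for base, body in ((source, "new code"), (target, "old code")):
    with open(os.path.join(base, "page.txt"), "w") as handle:
        handle.write(body)
with open(os.path.join(source, "extra.txt"), "w") as handle:
    handle.write("only on source")

stale = differences(build_manifest(source), build_manifest(target))
print(stale)
```

An empty result means the target matches the source file for file. Anything listed is either missing, unexpected, or has different contents.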

Quality Assurance

The Quality Assurance (QA) team has had to change some of its practices, specifically as they relate to testing the Hotmail application, not only because the application is running on a different OS but also because the application is now multithreaded rather than per-process. It is necessary to be more diligent about testing for memory and handle leaks, blocked threads, context switches, and so on. This is not because these problems did not exist prior to migration. They did. However, for lack of tools, it was very difficult to identify the problems short of getting the application into a production mix. Also, in the context of CGI, memory leaks are less significant because the memory is freed after each request; therefore, testing for leaks had a low priority. Windows 2000 provides a much richer tool set. Performance counters are used extensively to monitor sustained performance, and the Windows Application Stress Tool (see footnote F1 at the end of this page), formerly known as Homer, is used for emulating and generating user interactions. Gathering performance counters is relatively nonintrusive, so counters can be collected on the live site. Even though there is a great lab environment in place, the edge cases sometimes show up only on the live site.
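One simple way to use sustained counter data when testing for leaks is to look for steady growth in a series of samples. The sketch below is an illustration under stated assumptions, not the QA team's method. The sample values are invented, the samples are assumed to be evenly spaced, and the minimum slope is an arbitrary figure that would need tuning per counter.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def looks_like_leak(samples, min_slope):
    """Flag a counter whose fitted growth per sample reaches min_slope."""
    return slope(samples) >= min_slope


# Hypothetical handle counts sampled at a fixed interval.
steady = [500, 503, 498, 501, 499, 502]
leaking = [500, 520, 541, 559, 580, 601]

print(looks_like_leak(steady, 5.0), looks_like_leak(leaking, 5.0))
```

A fitted slope is less sensitive to a single noisy sample than comparing the first and last values, which matters when the counters come from a live site under varying load.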

F1 - The Windows Application Stress Tool is available for download at https://homer.rte.microsoft.com.

