Building, Maintaining, and Tuning the Box

By Sean Deuby

Chapter 5 from Windows 2000 Server: Planning And Migration, published by New Riders

After you've selected the server hardware, you must deal with the other half of the equation: the server software. You must build the operating system, maintain it properly to maximize the server's availability, and be able to diagnose and head off performance problems.

Understanding Partition Types and Requirements

Partition sizing and configuration for a server is one of those planning tasks that, although mundane, will affect your entire Windows NT network. The ability to run utilities, perform automated upgrades, service applications, and store system dumps depends on how much free space you have on your operating system partition. It's very important to get it right ahead of time because it's very difficult and intrusive to reconfigure a partition. Before you can make intelligent decisions on how many partitions to have and how large they should be, you must understand the different partition types and their requirements.

The system partition contains the hardware-specific files, such as Ntldr, Boot.ini, and Ntdetect.com, needed to load Windows NT. Some hardware manufacturers also use the term system partition for a small utility partition (less than 50MB) that keeps system hardware configuration utilities and diagnostics in a fast, always-available location so that you don't have to load a CD-ROM or a set of disks. That manufacturer's partition has no drive letter and requires a special BIOS, which makes it accessible before the Windows NT loader takes control.

The OS partition, sometimes called the boot partition, should contain the Windows NT operating system, critical system utilities such as the backup program and logs, and little else. The system utilities are on this partition so that full system functionality can be recovered by restoring the C: partition. This partition needs to be big, much bigger than you'd expect. The partition size you choose now must keep pace with Windows 2000, which is vastly larger than its predecessor. The partition can also incorporate many different services, each taking varying amounts of space.

Some of the unexpected disk hogs are listed here:

  • Page files—You can move the page file off the OS partition, but if you want a system crash to dump the contents of memory to a disk file, a page file at least as large as physical memory must remain on the OS partition.

  • Dump files—The partition must be large enough to hold the dump file in addition to the page file. Its default location is %Systemroot%\Memory.dmp, where %Systemroot% is the environment variable assigned to the operating system directory (usually C:\Winnt). On a system with 512MB of RAM, these two components have already chewed through 1GB—and that's not even counting the operating system. If you want to store more than one copy of the dump file, add another chunk of space equal to the system's memory size. (A rough sizing sketch follows this list.)

  • Large SAM support—A domain controller in a large Windows NT 4 domain requires a significant boost in the Registry Size Limit setting to accommodate the resulting large Security Account Manager (SAM) database. This, in turn, boosts the paged pool requirements, which require an increase in page file size.

  • Active Directory—Large Windows 2000 domains require a large directory service database, %Systemroot%\Ntds\Ntds.Dit. A large company with 60,000 user accounts, 54,000 workstations, and the full complement of attributes associated with these can have an Ntds.Dit of almost 800MB. You can choose on which partition to locate this, but you should definitely allocate plenty of room.

  • Spool files—The directory where temporary printing files are stored is in %Systemroot%\System32\Spool\Printers, on the OS partition. If you print large files or have a large number of active printers attached to the server, you need to account for this occupied space.

  • The operating system—The Windows NT 4 Server operating system directory takes a minimum of 130MB, excluding the page file. Windows 2000 disk requirements vary, from merely voracious for a member server to monstrous for a domain controller in a large enterprise. Big surprise, eh? The Windows 2000 %Systemroot% directory alone takes 600MB installed. This doesn't include the files it creates in \Program Files, and the installation process requires significantly more.
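
To put rough numbers on the items above, here is a minimal back-of-the-envelope sizing sketch in Python; the spool and "other" reserves are illustrative assumptions, not figures from the chapter.

    # Rough OS partition sizing, using the figures discussed in the list above.
    # All sizes are in megabytes; the reserve values are illustrative assumptions.

    def os_partition_estimate_mb(ram_mb,
                                 os_files_mb=600,        # Windows 2000 %Systemroot%, per the text
                                 dump_copies=1,          # copies of Memory.dmp to keep
                                 spool_reserve_mb=500,   # assumed headroom for print spooling
                                 other_reserve_mb=500):  # utilities, logs, growth (assumed)
        page_file_mb = ram_mb              # page file at least the size of RAM for crash dumps
        dump_mb = ram_mb * dump_copies     # each Memory.dmp is roughly the size of RAM
        return os_files_mb + page_file_mb + dump_mb + spool_reserve_mb + other_reserve_mb

    for ram in (256, 512, 1024):
        print(f"{ram}MB RAM -> ~{os_partition_estimate_mb(ram) / 1024:.1f}GB OS partition")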

Table 5.1 is a quick reference of conservative disk space recommendations for a server's OS partition.

Table 5.1 Recommended Server OS Partition Sizes

Server Type                                                                  OS Partition Space Requirement

Windows NT 4 servers that will be retired for Windows 2000                   2GB

Windows 2000 server, moderate domain size                                    4GB

Windows 2000 large enterprise domain controller, including Global Catalog    9GB

Note that these partition requirements don't say whether a server is a domain controller or a member server. That's because in Windows 2000 it's a simple task to promote a member server to a domain controller. If you're parsimoniously allocating your disk space to a member server, the time will certainly come when you need to promote one in an emergency and can't because you don't have the free space.

Tip Remember that hardware—especially disk space—is cheap compared to support costs. Disk space is an area where it's much better to err on the conservative side. And these estimates don't include room for expansion.

FAT or NTFS?

When Windows NT Advanced Server 3.1 appeared, administrators argued with fervor over the benefits of formatting the operating system partition with the venerable FAT file system versus NTFS. There are valid points for both sides, but without getting into too much gory detail, the battle has been won. If you spend any time at all looking at the future of Windows NT and examine the space requirements previously discussed, it's pretty obvious that only NTFS can meet the space, speed, and security requirements of a Windows NT server partition of any size. A FAT partition of a decent size would have a cluster size of 32KB, which means that every file would occupy at least 32KB of disk space, regardless of its size.
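
A minimal sketch of the cluster-size arithmetic; the file count, average file size, and the 4KB NTFS cluster size used for comparison are assumptions chosen for illustration.

    import math

    def slack_mb(file_sizes_kb, cluster_kb):
        """Disk space lost to cluster rounding, in megabytes."""
        wasted_kb = sum(math.ceil(size / cluster_kb) * cluster_kb - size
                        for size in file_sizes_kb)
        return wasted_kb / 1024

    # Assume 20,000 small files averaging 6KB each (an illustrative workload).
    files = [6] * 20000
    print(f"FAT,  32KB clusters: ~{slack_mb(files, 32):.0f}MB wasted")
    print(f"NTFS,  4KB clusters: ~{slack_mb(files, 4):.0f}MB wasted")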

The FAT file system is best for drives or partitions under approximately 200MB and is faster for sequential I/O. FAT32 can address much larger disks, but it still has none of the security features of NTFS. NTFS works best for file systems above 400MB. NTFS has its own log file for internal consistency; FAT has none. Additionally, an NTFS Version 5 volume is required for the Windows 2000 SYSVOL. Finally, NTFS has detailed security built into it. This is important now for user data, and its granular security is becoming more important as many different Windows NT components that are managed by different groups are being incorporated into the %SYSTEMROOT% directory (usually C:\WINNT).

The FAT32 file system fixes a lot of FAT's problems, especially for larger disk sizes, but it's no more appropriate for a large Windows NT server than FAT is.

Tip The single most important first step you can take in securing your servers against security attacks is to use NTFS on all partitions.

Windows 2000 will have FAT32 support to ensure compatibility with Windows 98 and the OSR2 release of Windows 95. You'll finally be able to dual boot Windows 98 and Windows 2000, or to actually install Windows 2000 Professional on a system that has an OEM install of Windows 98 on a FAT32 partition. Windows NT 4, however, is incompatible with FAT32. You could get around this incompatibility, if you're a little daring, with a utility called FAT32 from http://www.syssolutions.com. FAT32 is a driver for Windows NT 4 that allows the operating system to access a FAT32 volume. The freeware version allows read-only access, and the $39 full version grants full access. This would allow you to dual boot between Windows 98 and Windows NT 4—but why would you want that on a server anyway?

Windows 2000 Startup Options and Automated System Recovery

Windows 2000 now comes with a Win9x-like boot menu, called Advanced Setup, that allows you to choose from a number of startup options:

Note: The following options are documented in Windows NT 5.0 Server Beta 2 ADVSETUP.TXT.

  • Safe mode—Starts Windows 2000 using only basic files and drivers, without networking. Use this mode if a problem is preventing your computer from starting normally.

  • Safe mode with networking—Starts Windows 2000 using only basic files and drivers—as with safe mode—but also includes network support.

  • Safe mode with command prompt—Starts Windows 2000 using only basic files and drivers, and displays only the command prompt.

  • Last-known good configuration—Starts Windows 2000 using the last-known good configuration. (Important: System configuration changes made after the last successful startup are lost.)

To start your computer using an Advanced Startup option, follow these steps:

  1. Restart your computer.

  2. When the OS Loader V5.0 screen appears, press F8.

  3. Select the Advanced Startup option you want to use, and then press Enter.

Partition Recommendations

Here's a list of recommendations that apply to most partitions:

  • To ensure that you have enough room to grow (and my, how it does grow!), your OS partition should be no less than 2GB, under any circumstances.

  • Keep all partition sizes down to what you can fit on one backup job (and, therefore, one restore job). If a partition takes 16 hours to back up, either it's too big or you need to upgrade your backup hardware.

  • If your fault tolerance setup allows it, keep the page file on the fastest disk possible.

  • Don't use Windows NT volume sets. A Windows NT volume set is a collection of partitions over a number of disks that allows you to create a large, contiguous volume out of several smaller ones. Once it spans more than one disk, however, you'll have jeopardized all your carefully planned fault tolerance. If one component partition of this volume set is on a non–fault-tolerant volume and it dies, the volume set will fail and your only recourse will be to recover from backups.

  • Seriously consider a disk defragmenter, and run it from the beginning of the server's life. It has much lower user impact to continuously correct small amounts of fragmentation than to schedule massive defrag sessions. If your fragmentation is bad enough that the defragger can't do a good job, your only choice is to back up, reformat, and restore to start clean before you enable the defrag service. Windows 2000 has built-in disk defragmentation courtesy of Executive Software. It's only manual, however, and is the equivalent of their Diskeeper Lite freeware product. To get fully automated defragmentation—which is obviously a useful feature—you must buy Diskeeper, just as with your Windows NT 4 systems.

Building the Box

Now that you have unboxed the hardware, put it together, and performed initial partitioning and formatting, you're ready to install the operating system. This isn't a thorough treatise on how to install Windows 2000; as with everything related to this new version, it's a very big subject so I'll leave that to the doorstop books (so named because they make good doorstops on a windy day). Instead, I want to point out some issues on naming the box and cover some highlights of the SYSPREP utility for automated builds in Windows 2000.

In both Windows NT 4 and Windows 2000, all servers and workstations must have a unique name of up to 15 characters to identify them on the network. This is called the NetBIOS name, computer name, or machine name. In Windows NT 4, this name is the primary way clients locate a Windows NT system; Windows 2000 doesn't care much about it, but you choose one for compatibility with downlevel systems. Unlike DNS, which has a hierarchy separated by periods (myserver.mydepartment.mycompany.com), NetBIOS is a flat namespace. This means that the name myserver must be unique across the entire Windows NT network. This isn't a big deal for a small company whose naming convention can follow the Marx Brothers, but ensuring uniqueness can be a big headache for a large corporation.

Some pointers in choosing workstation and server naming conventions include these:

  • Choose a unique identifier by which each node on the network can be mapped to a person responsible for it. The simplest way to do this is by naming workstations with an employee number, which can then be looked up in the company's HR database. The key is to be able to look up the name; a workstation named JIMBOB might be acceptable as long as you can make a directory services query to discover that the owner is James Robert Worthington III. Perhaps the simplest method is to name the workstation after the owner's Windows NT account name.

  • Name your servers alphabetically ahead of your workstations. In Windows NT 4, the domain master browser has a buffer size of 64KB. This limits the number of workstations and servers displayed in the browse list to 2,000–3,000 computers, easily reached in a large Windows NT domain. When this happens, names beyond the limit simply won't appear on the list. Because servers are more heavily browsed than workstations, define a naming convention so that servers are alphabetically ahead of workstations. That way, in a large domain, workstations will drop off the browse list instead of servers.

Warning: Beware of the forward-compatibility problem between NetBIOS names and DNS names. In Windows 2000, the NetBIOS restriction will be lifted, and network nodes such as servers and workstations will use DNS names to identify themselves. Remember that in Windows NT 4, NetBIOS names may contain special characters. In general, DNS domain and host names are restricted to the characters a–z, A–Z, 0–9, and "-" (hyphen or minus sign). Characters such as "/", ".", and "_" (slash, period, and underscore) are not allowed.

Servers or workstations with NetBIOS names that don't fit into the DNS naming convention will have problems under Windows 2000. Windows NT 4 will warn you before attempting to create or change a NetBIOS name with a slash (/) or period (.), but it will allow an underscore. The underscore is allowed in Microsoft's implementation of DNS, but it doesn't comply with RFC (Request for Comments) 1035, "Domain names—implementation and specification." This means that if you use an underscore in your naming convention, it will work with Microsoft but probably won't work with other industry-standard DNS servers—and Microsoft could clamp down on this loophole in the future. Even this rudimentary character checking isn't done if you've upgraded from a previous version of Windows NT. The most common name violation is use of the underscore in NetBIOS names. Start stamping out incompatible names now!
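
If you want to audit existing computer names against the character rules above, here is a minimal sketch; the strict mode enforces the a–z, A–Z, 0–9, hyphen rule, while the relaxed mode tolerates the underscore the way Microsoft's DNS does. The example names are made up.

    import re

    def dns_compatible(netbios_name, allow_underscore=False):
        """Check whether a NetBIOS computer name also works as a DNS host name."""
        allowed = r"A-Za-z0-9\-" + ("_" if allow_underscore else "")
        return (1 <= len(netbios_name) <= 15
                and re.fullmatch(f"[{allowed}]+", netbios_name) is not None
                and not netbios_name.startswith("-")
                and not netbios_name.endswith("-"))

    for name in ("JIMBOB", "PAYROLL_SRV", "WEB.SERVER"):
        print(name, dns_compatible(name), dns_compatible(name, allow_underscore=True))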

Automating a Windows NT build has historically been very time-consuming and error-prone. Automating the final 20% was very tedious—and sometimes almost impossible. Microsoft has had plenty of feedback from customers asking it to make automating a Windows NT rollout easier. A very popular method has been to build a "template" Windows NT system and clone it to hundreds of other systems by copying the template system's hard disk to the new systems. Cloning, as it's known, is no longer a no-no, but you should be aware of two issues.

Because you've exactly duplicated the template system, you've also duplicated its security identifier (SID). I won't get into SIDs here, but strictly speaking they are supposed to be unique on a Windows NT network. If you've cloned hundreds of workstations, you'll have hundreds of duplicate SIDs on your network.

Duplicate SIDs cause security problems in a Windows NT network in two situations: workgroups (no Windows NT domains) and removable media formatted with NTFS. Problems with the former happen because user SIDs on one workstation, with no domain component, can be identical to user SIDs on another. For instance, this allows the first-created user account on one workstation to access files of the first-created user account on another workstation. The removable-drive problem occurs because NTFS ACLs identify accounts by SID. If the system is cloned and the ACLs reference a local account instead of a domain account, a removable drive can be moved to another cloned machine, where a hacker logs onto the local account with the same name, such as Administrator (the password needn't be the same), and reads the data.

So, duplicate SIDs aren't as big a problem in a domain-based environment as everyone thought. There is a basic requirement for cloned systems, though. Windows NT Setup goes through all sorts of hardware detection, so a cloned machine must have the same hardware as the template machine, or funny things start happening.

Note: This hardware exactness applies especially to hard disk controllers and the HAL. In one situation, cloned machines began breaking for no discernible reason. After many weeks of troubleshooting, the team discovered that all the broken machines had a slightly newer version of a hard disk BIOS.

Third-party cloning tools such as Ghost from Symantec have been very popular, and now that GhostWalker (a SID-scrambling tool) is also available, systems can be cloned and still have unique SIDs.

For Windows 2000, Microsoft has given in and provided the SYSPREP utility. SYSPREP is used after the template system has been configured just the way you like it. The utility is run as the last act before shutting down the system. The template system's hard drive is then cloned by a third-party tool. When a target machine first boots, SYSPREP generates a unique security ID for the machine and then pops up a short wizard to collect a few last items such as the computer name and domain membership. SYSPREP has switches to do useful things such as run in unattended mode, re-run Plug and Play detection on first startup, avoid generating a new SID on the computer, and automatically reboot the system when the final configuration is complete.

For system rollouts that will use different kinds of hardware, the old method of unattended setup with a customized answer file is still around. It has been expanded to take care of the differences in Windows 2000, but it's essentially the same as Windows NT 4.

The final, most sophisticated, and most restricted method of rollout is available via the Remote Installation Service of Windows 2000 Server. The "empty" workstation boots, contacts a Remote Installation server, and downloads and installs the OS. Pretty slick, eh? Before you start jumping up and down, you need to know that you must have your Windows 2000 server infrastructure (including Active Directory) in place and Remote Installation Service up and running; only then can you install Windows 2000 Professional this way. This option could be useful down the road, most likely after you've upgraded your clients to Windows 2000 Professional and when you're ready to begin replacing existing systems or installing new ones.

The situations where you can automate a rollout fall into three categories: cloning systems with exact hardware, building systems with unattended setup and answer files, and using the new Remote Installation Service to remotely install the Windows 2000 operating system.

Maintaining the Box

Four of the most important areas of maintaining a Windows NT server are backing it up, scanning for viruses, performing hard disk defragmentation, and maintaining software. All of these are discussed in the following sections.

Backing Up Windows NT Servers

Backing up server data and coming up with a system recovery strategy have always been low on a network administrator's list of fun things to do. If you do the job right, people complain about the money spent on tape drives, tapes, and operators. If you don't do the job right—or if you do it right but don't constantly monitor the backup and tune the strategy—you can jeopardize the company, lose your job, or at least get yelled at.

Your Windows NT network's system recovery strategy should be designed at the same time you determine the type of backup hardware and backup media. This is because the implementation of a system recovery strategy depends on the media, while the amount the media is used (therefore influencing the choice of media type) depends on the strategy. This section does not include a thorough analysis of disaster recovery because much fine material has already been written about it (for example, John McMains and Bob Chronister's Windows NT Backup & Recovery). This section does cover the basics of putting a good strategy in place.

Using Storage Management Software

One of the first and most important choices you must make when putting together a backup strategy is that of the storage management software. (We all refer to these things as backup software, but the biggest packages do much more than just backups; storage management software is really more accurate.) After storage management software is in place, it becomes deeply entrenched in the environment because of the software expense, the operator training, the customer training (if customers can perform their own restores), and certainly the backups themselves if the software uses a proprietary format.

The storage management software must be capable of growing with your company's needs. You may not initially need enterprise management utilities for multiple servers, but with good fortune you'll need them in the future. You don't want to be forced to switch storage management software because yours couldn't stretch to fit your growing company's needs.

Another advantage to fully featured storage management software from a major vendor is consistency. Whenever possible, you should have one vendor for all your storage management needs, simplifying management, taking advantage of the integration within the vendor's product line, and reducing your total cost of ownership.

Following these guidelines, you should choose your storage management software from one of these four vendors:

  • ADSTAR Distributed Storage Manager (ADSM), by IBM

  • ARCserve, by Computer Associates

  • Backup Exec, by Seagate Software

  • Networker, by Legato

I haven't listed these products in any particular order. All are very good and offer a wide range of utilities; they've been the market leaders ever since Windows NT's inception. Ntbackup, the backup utility included with the operating system, is a lobotomized version of Backup Exec, and most of these companies' products have existed for many years on other operating systems.

The following are recommendations to keep in mind when choosing your storage management software:

  • Clearly define your requirements. Do you want to back up your clients as well as your servers? Do you want your clients to be able to do their own restores? Do you need Web-based administration?

  • Think big. Choose a solution once that will do all you'll ever need so that you never have to do it again. And do your research carefully: If you haven't looked at storage management solutions recently, you'll be astounded at the depth and breadth of what they can do. Of course, this makes you continually re-examine your requirements. When you realize what these suites can do, you'll be able to think of new ways to handle data that you didn't know were possible.

  • Choose a vendor that supports the widest variety of client operating systems possible—not just Windows NT. You want to be able to back up all your business's clients, not just your Microsoft-based ones.

  • Make sure the software you're considering also supports all the databases you're using—and those you may ever possibly use in the future. Look carefully into the support so that you're sure you understand what it can and can't do. SQL Server, Oracle, SAP, and Exchange all may require additional software that not every solution may support. For example, if you use the OpenIngre database, of the four mentioned only ARCserve for Windows NT has an agent to back it up.

  • The storage management software must support the broadest range of backup devices, from 4mm DAT up to tape libraries with terabyte capacities. Again, think about your company's growth and a single-vendor solution.

Choosing a Backup Hardware Format

It's worth noting that when you're perusing the brochures for different backup types, most have split numbers such as 4/8, 7/14, 15/30, 35/70, and so on. This is a capacity description of each tape, uncompressed and with maximum 2:1 hardware compression with the right data (such as bitmaps). Obviously, your mileage will vary depending on the type of data you're backing up. If you have a good mix of data in your backup stream, splitting the difference between the two extremes will probably yield a valid estimate. I've seen a solid 27GB per tape on a 15/30 DLT when backing up standard file server data: Word documents, Excel spreadsheets, Dilbert cartoon archives, and so on.
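
Here is a minimal sketch of that estimating rule; the compression ratios are guesses you would replace with figures for your own data mix, not measurements.

    def effective_tape_gb(native_gb, expected_ratio):
        """Estimate usable tape capacity from the native size on the brochure.

        expected_ratio is the compression you expect from your data mix
        (1.0 = incompressible, 2.0 = the brochure's best case).
        """
        return native_gb * min(max(expected_ratio, 1.0), 2.0)

    # A 15/30 DLT with a typical office-document mix (assumed ~1.8:1) lands near
    # the 27GB per tape reported above.
    for ratio in (1.0, 1.5, 1.8, 2.0):
        print(f"15/30 DLT at {ratio:.1f}:1 -> ~{effective_tape_gb(15, ratio):.0f}GB per tape")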

Note: There's a very important trend to be aware of in data storage: The ability to back up disk drives has in no way kept up with the ability to put more data on them. Unless you use a network backup server with autoloaders and tape libraries, file servers of enterprise scale cannot be adequately backed up without tape autoloaders. You must take this into consideration when you're putting together a purchase order for servers. Don't cut corners on your backup solution; if anything happens—and, of course, it will—you'll be reviled more for not having adequate backups than for the redundant power supply you just had to have.

A number of tape formats are available. You can quickly narrow your choices, however, to one or two formats based on how much you must back up, how much automation you need, what speed you need, and how much you're willing to pay. The next few sections talk about the most popular formats and their strengths and drawbacks.

QIC

The QIC (quarter-inch cartridge) format has evolved over the years. Originally limited to 100MB or 200MB per tape, it's now capable of up to 8GB using Travan cartridges. QIC has been a mainstay of PC backups for many years. Its throughput can rival that of 4mm, but because it won't scale to larger systems through the use of magazines, it's limited to workstations and workgroups.

8mm

8mm tapes are about the size of a deck of cards and use the same helical scan technology found in the family VCR. Backing up between 7GB and 14GB per tape, 8mm drives can be found with autofeeders to increase their capacity. As with your family VCR, however, the drive demands regular cleaning when subjected to heavy use.

4mm DAT

DAT (digital audio tape) is only about two-thirds the size of an audio cassette. Relatively inexpensive and fast, DAT also uses helical scan technology, in which data is recorded in diagonal stripes with a rotating drum head while a relatively slow tape motor draws the media past the recording head. DAT tapes hold 2GB–24GB depending on the compression achieved by the backup hardware, and they have some popularity in the IA server world. To increase their capacity, DAT drives are available with loader magazines that can exchange up to 12 tapes without operator intervention. DATs can back up data at about 1GB–3GB per hour. (These are conservative numbers based on actual experience rather than product brochures.)

DLT

DLT (digital linear tape) is a fast and reliable backup medium that uses advanced linear recording technology. DLT technology segments the tape medium into parallel horizontal tracks and records data by streaming the tape across a single stationary head at 100–150 inches per second during read/write operations. Its path design maintains a low constant tension between the tape and read/write head, which minimizes head wear and head clogging. This extends the operational efficiency of the drive as well as its useful life by as much as five times over helical scan devices such as the 4mm and 8mm formats. DLT is the most expensive, per unit, of standard tape backup solutions, but for your money you get the fastest throughput (3GB–9GB/hour real-world), greatest capacity (35GB–70GB in the newest drives), and best reliability (minimum life expectancy of 15,000 hours under worst-case temperature and humidity conditions; the tapes have a life expectancy of 500,000 passes) of all drive types. The current standard is 35GB uncompressed or 70GB with 2:1 compression, and it's been around for a while. On the horizon is 50/100, but it still isn't keeping up with the increase in disk capacity.

DLT cartridges look disconcertingly like old 8-track cartridges. Loaders to handle multiple tapes and increase capacity without operator intervention are also available from many vendors, and these are a necessity if you have a number of servers to back up and limited operators to swap tapes.
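
Given the throughput and capacity figures above, it's easy to sanity-check whether a volume can be backed up overnight on a single drive. Here is a minimal sketch; the 100GB volume, 6GB/hr throughput, and ~50GB effective tape capacity are assumptions plugged in for illustration.

    import math

    def backup_window(volume_gb, throughput_gb_per_hr, tape_capacity_gb):
        """Hours and tapes needed for one full backup of a volume."""
        hours = volume_gb / throughput_gb_per_hr
        tapes = math.ceil(volume_gb / tape_capacity_gb)
        return hours, tapes

    # Assumed example: 100GB file server, mid-range DLT throughput, ~50GB effective
    # capacity per tape (35GB native with modest compression).
    hours, tapes = backup_window(100, 6, 50)
    print(f"~{hours:.0f} hours and {tapes} tape(s); more than one tape means an "
          f"autoloader or an operator changing tapes mid-job")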

Network Backup

One problem with traditional local tape backups is that when you have a lot of servers, you have a lot of tapes and tape drives to manage. An unattended network backup solution moves the backup media to one server specifically designed to back up massive amounts of data as quickly as possible. Truly awesome amounts of storage can be built into supported devices: an ADIC Scalar 1000 DLT library integrated into ADSM can hold up to 5.5TB. Whatever their advantages and disadvantages for server backups, network backups are a huge win for lowering the total cost of ownership in a network. All backup operations are automated, which means that the operators themselves—the single most expensive overhead item for operations—aren't needed for changing tapes at all hours or for performing tedious data restores.

Tape Format Recommendations

Table 5.2 shows a comparison of backup methods. I've listed the advantages and disadvantages in relative order of importance, based on a goal of high availability. Your business priorities for your Windows NT network may be different, perhaps compromising on longer restoration times to save hardware money. Keep in mind, however, that mainstream backup hardware and tapes almost always cost far less than the price of tens or hundreds of workers sitting on their hands because the server's down.

My recommendation hands-down is DLT for a shop of any size. Besides speed and reliability, it comes up the winner in an area most people haven't thought about: longevity of the format. If you're backing up design data to be archived for 5 years, or company financials for 10 years, you have to think of the environment at the time of restore. Will a tape drive of the right format still be around? I know of a large company that kept a Digital Equipment Corporation RV20 write once-read many (WORM) optical drive around (and had to pay for maintenance) for years after the technology was obsolete because it was the one piece of hardware that could read their archives.

Table 5.2 A Comparison of Backup Methods

Network Backups

Advantages:
  • Lower support costs; very little operator intervention is needed for regular backups.
  • No "end of tape" to deal with; extremely large storage capacity.
  • Full remote control of backups (you don't need an operator to change tapes).
  • Built-in disaster recovery if the tape silo is offsite.
  • Optional DRM (Disaster Recovery Manager) to assist in removing media from the silo for offsite storage.
  • Backup policy changes are easy to enact.

Disadvantages:
  • Very expensive for large tape storage devices.
  • High-speed network required.
  • Problematic for remote sites with slow links.
  • Must have guaranteed network connectivity during the (potentially long) backup or restore process, or else the backup/restore job fails.
  • Restoration time may be unacceptable for large volumes, or the network may be too unstable for restoration even to be possible (some products, such as Networker, use parallelism for increased speed).
  • Disaster recovery is compromised if the silo is onsite.
  • Total costs are not as well understood as with standard tape.

Tape Backups

Advantages:
  • Dramatically lower local hardware cost: $1,000–$5,000 per server.
  • Faster backup and restore.
  • Tapes can be carried offsite in all cases for disaster recovery.
  • Not dependent on network bandwidth or reliability, if backing up the local server only.
  • Good remote management.

Disadvantages:
  • Higher operations costs.
  • Remote management is limited to the tape or tape magazine in the drive (you need an operator to change tapes).

4mm DAT

Advantages:
  • Inexpensive.
  • Good capacity, approaching 72GB with a 12-tape magazine.
  • Good throughput, 1GB–3GB/hr.

Disadvantages:
  • Higher maintenance—heads must be cleaned frequently, magazines jam, and so on. This can be a problem with high duty-cycle backups (8–12hrs/night).
  • Tapes wear out relatively fast.

DLT

Advantages:
  • High capacity: 35GB–70GB per tape. Densities as high as 100GB achievable when backing up design data.
  • Highest throughput, 3GB–9GB/hr.
  • Low maintenance; rarely needs cleaning. Better for high duty-cycle backups than 4mm.

Disadvantages:
  • More expensive than DAT.
  • Autoloaders are much more expensive than comparable DAT autoloaders.

Building a System Recovery Strategy

Before you begin designing a system recovery strategy, you must ask your customers, your managers, your operations staff, and yourself a number of questions.

Gathering Customer Requirements

Here are some questions to ask your customers and their managers:

  • How long should a file exist before it gets backed up? In a typical backup scenario, any file that has existed more than 24 hours gets backed up. This doesn't mean, however, that it will stay backed up for months or even weeks. If a document is created on a Tuesday and erased on a Thursday, its chances of being recovered after several weeks are less than if it was created on a Thursday and erased on a Tuesday. This seems arbitrary, but due to standard business hours, full backups are best done over the weekend. This period, during which files more than 24 hours old on any day of the week are backed up, can be described as the high-detail backup period. So be aware: Depending on when it is created and erased, a file's life on backup media will vary. Fortunately, most people's habits keep them from quickly erasing any data they deem valuable. When's the last time you saw disk space utilization on a server volume go down?

  • How many copies of a file should be kept? Is it important that users can retrieve a specific version of a file? In most file server environments, the number of versions created by several weeks of full backups is often enough, but special applications may require more.

  • How far back in time should a file be recoverable? Certainly, everyone would like to be able to recover that year-old weekly report of accomplishments for the following year's annual review. In reality, however, most people ask for restores within the last two or three weeks—and most of these ask for recovery from yesterday's backup! Increasing backup media storage time, known as the retention period, beyond three to six months dramatically increases the cost of media.

  • Should the customer be able to restore the file himself? This would be a nice feature, but server backups don't currently allow users to restore their own files. It's a server backup, so it must be restored by server operations. This could be done in an indirect manner, however, if you have a network backup system such as ADSM. ADSM can back up client workstations with a simple user interface that allows workstation users to restore their own files. If the client includes a personal share on the server as part of the workstation's backup, he will also be able to recover the server share. Of course, this raises a number of other problems, one of which is multiple backups of the same data. The server may back up the same client share as the workstation backups because it has no knowledge of the workstation backups, and vice versa. Another problem is network shares on the user's workstation. After backing up his personal share, the simple marking of a checkbox may allow the user to back up a 1GB department share (which is already being backed up by the server backups).

  • How long may restoration of the user data to the server take? If the integrity of the data is more important than its restoration speed—and especially if you have lots of data—a network backup product may be acceptable. As availability requirements increase, the backup data storage possibilities move from local tape to online copies, to network mirrored data, to clustering.

Gathering Data Center Requirements

Here are some questions to ask your management and other IS managers:

  • How long may the server itself be down? If the Windows NT operating system is dead but the user data is fine, you need to be able to recover the operating system quickly and with as little pain as possible. This is a balance between your business needs of server availability and what you're willing to pay for it. If you require instant availability, you should have a clustering or network mirroring solution. Data on the OS partition changes slowly, so if you have partitioned it correctly, a weekly full backup to tape will ensure that the operating system can be rebuilt within 30 minutes or so. If recovery time isn't important and the server receives its data from replication, backups of the operating system partition may not even be necessary.

  • Do you have SLAs (service-level agreements) signed with your customers stating the maximum recovery time? If you do, you must base your calculations on this. If it turns out that you were hopelessly optimistic in your recovery time estimates, you'll have to crawl back to the negotiating table for more money or renegotiate the recovery times based on realistic expectations with the business.

  • What level of disaster recovery does the company believe should be implemented? Despite your belief in the importance and irreplaceability of your Windows NT systems and data, the company may not share your convictions. Assuming that your company has an existing disaster recovery plan for its existing systems, the decision needs to be made whether to include the Windows NT systems in it. Certainly the simplest solution is to incorporate Windows NT into an existing DR plan. Disaster recovery can imply several levels, too; when you say "disaster recovery" to a server operator, she may think of two hard drives failing simultaneously. When you say it to a DR planner, he may think of a 747 crash-landing on the roof. These require different levels of planning, and the disaster scenarios must be thought out ahead of time. If your company doesn't have a disaster recovery plan for its computers, run—don't walk—to the bookstore where you bought this, and buy a book on disaster recovery planning! Do something, even if it means carrying one set of tapes home weekly.

  • Is there an operations staff at each server site? You need operators to change tapes. This is where network backups shine. No operators are needed for network backups to a tape library, only for assistance in catastrophic system recovery. If operators are unavailable or must make limited trips to the site, consider an autoloader tape magazine that can hold a number of tapes. You can schedule various jobs to run on different tapes in the magazine, and you can even include a cleaning tape that runs at scheduled intervals. Of course, in this scenario there's limited disaster recovery potential because the backup tapes are sitting next to the computer!

  • What are their hours and how busy are they? Even if the computer room or server area is staffed 24x7, you need to consider operator availability when scheduling tape changes, tape drive cleaning, offsite disaster recovery shipments, and so on. Balance this with the need to run backups at off-peak hours, and you have defined certain time windows in which tapes must be handled.

  • What is the network bandwidth at each server computer room and between these locations? This factor determines whether network backups are practical. In large installations, at least a 100Mbps network backbone is necessary to provide enough bandwidth to back up multiple servers in a practical amount of time and without saturating the network.

  • Has the recovery plan been tested, and have all operators been trained on it? Too often, the recovery plan isn't really tested until a failure happens.

  • Have you factored performance degradation into your recovery times? Be very conservative in estimating the amount of time required to recover a system, especially if it's to be restored from a network backup system. If you don't have a dedicated backup network, the length of time required to restore the server depends on the network traffic. Indeed, you may have a service agreement in place that requires data restores to wait until the evening if network traffic is above a certain level. If you run this kind of risk, you shouldn't be using network backups.

  • If chargeback (billing customers for your services) is not used, how much is the information systems department willing to spend on backups? In most shops I've encountered, the IS department simply eats the cost because the amount billed isn't worth the overhead for journal entries. In other words, what's your budget? Be prepared to go back for more after you've done your research. You could also consider offering a tiered pricing structure based on the level of service being offered.

  • Do you have any way to automate the review of backup results? You must have a way to notify your operators if a backup has failed. Enterprise-scale storage management software now has automated alerting functions, but unless you're prepared to write your own log-viewing facility, you also have to buy the coordinating systems management software.

  • How important are all these to you and your customers (i.e., what are they willing to pay for these things)? Every one of these costs money—some much more than others. A 4mm DAT drive can be acquired for $1,000 and will hold between 4GB and 8GB, while an automated tape silo costs well into five figures and holds terabytes of storage. You and your managers need to establish where the balance lies between high server availability and the cost to keep it high. Unfortunately, braggin' rights tend to focus on the 99.9% availability your servers averaged last month; it's harder to crow about how much less your servers cost to maintain!

Types of Backups

Regardless of your choice of backup hardware or media, the type of backup being performed usually falls into one of a few standard categories (a small sketch of the archive-bit logic follows this list):

  • Full—Everything on the selection list is backed up, period. The archive bit, a property of all PC-based files that indicates whether the contents of the file have changed since the bit was last cleared, is unconditionally reset to 0. Full backups are often broken into two categories: weekly and monthly. They perform the same function; the difference is that weekly tapes are reused after roughly a month, while monthly tapes are cycled for the length of the tape retention period. For an example, take a standard configuration where a month equals five weekly cycles and tapes will be retained for one year. In this case, a weekly job will be reused every five weeks, and a monthly job will be reused once a year. Any file that exists during a weekly job and that is less than five weeks old may be recovered. Any file that exists during a monthly job and that is less than a year old may also be recovered.

    The advantage is that because all data that was selected was backed up, this is a snapshot in time of an entire disk volume or server (if you built the selection list correctly). With this set of tapes, you should be able to restore a server to some level of service. Full backups are the foundation of the remaining backup types.

    The disadvantage is that because everything gets backed up, this chews up a lot of tape. Full backups are the most time-consuming backups. If backups are run across a network, they consume a lot of network bandwidth during this long backup time.

  • Incremental—Everything on the selection list with the archive bit turned on is backed up. After the backup completes successfully, the archive bit is reset to 0. From a practical point of view, this means that every file that has changed since the last backup (full or incremental) gets backed up. In backup documents, this is usually (but not always) the type of job defined as a "daily."

    The main advantage of this type of backup is that it is much speedier than a full backup because, if incrementals are run daily, relatively few files have changed compared to all files on a volume (typically 5% or less on a file server). Incrementals allow versioning: If they run daily and the contents of a file are also changed daily, the file's archive bit gets set to 1 and the incremental job backs up a new version every day.

    The main disadvantage of an incremental backup is that it usually cannot be used by itself to restore a volume or server; it must depend on the data from a full backup being restored first. On a system such as a database server, where data files are interrelated and depend on each other's versions, all incrementals must be applied to a restore job. This can be a time-consuming process, and the time required for restoration will eventually outweigh the convenience of a speedy backup. For example: An SQL server's database files are backed up with a full backup on Sunday and incrementals early every morning during the week. If the database becomes corrupted Friday afternoon, the restoration process requires 1) restoring Sunday's full backup, and 2) restoring the five incremental backups taken Monday through Friday morning.

  • Differential—Everything on the selection list with the archive bit turned on is backed up. Unlike an incremental, the archive bit is not reset to 0. As with an incremental, every file that has changed since the last backup gets backed up. This type of backup job is occasionally used as a "daily" and is much speedier than a full backup.

    Differential backups also offer versioning. Unlike an incremental backup, a volume or server can be restored more quickly to a "snapshot in time" with a full backup plus a differential. This is because the backup contains all the changes made since the last full backup; an incremental may contain only changes since the last incremental. (Though it's possible to run differentials after incrementals, it gets very complex and isn't recommended.)

    Because the archive bit isn't reset to 0 after a differential is run, the number of files to be backed up grows every day. A system recovery strategy that uses daily differentials consumes more tape than one that uses daily incrementals, but unlike incrementals, you don't have to restore multiple differentials to rebuild a snapshot of the system.

    As with an incremental backup, a differential backup usually cannot be used by itself to restore a volume or server; it must depend on the data from a full backup being restored first.

  • Copy—A copy job is identical to a full backup job, with the exception that the archive bit isn't reset. It is usually used for special jobs such as disaster recovery, where a complete copy of the system is desired but won't normally be available for file restoration.
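
Here is a toy sketch of how the archive bit drives file selection for these job types; it models the bookkeeping described above, not the behavior of any particular backup product.

    def run_backup(files, job_type):
        """Toy model: files maps name -> archive bit (True = changed since last reset)."""
        if job_type in ("full", "copy"):
            selected = list(files)                       # everything on the selection list
        else:                                            # incremental or differential
            selected = [name for name, bit in files.items() if bit]
        if job_type in ("full", "incremental"):          # only these jobs reset the bit to 0
            for name in selected:
                files[name] = False
        return selected                                  # differential and copy leave bits alone

    files = {"report.doc": True, "budget.xls": True, "old.txt": False}
    print("full:        ", run_backup(files, "full"))          # all three files, bits cleared
    files["report.doc"] = True                                  # the report changes after the full
    print("differential:", run_backup(files, "differential"))  # report.doc, bit left set
    print("incremental: ", run_backup(files, "incremental"))   # report.doc again, bit cleared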

In a world with unlimited backup storage and tremendous backup speeds, a full backup every day would be the simplest system recovery strategy. Because this would require a tremendous amount of time and tape, a good system recovery strategy balances retention period, high-detail backup period, disaster recovery, and operator availability while minimizing tape use. It's no wonder that a good system recovery strategy is hard to find.

A Backup Example

As a real-life example, let's answer many of the questions ourselves:

  • How long should a file exist before it gets backed up? 24 hours.

  • How many copies of a file should be kept? This isn't as important to the customer as recovery time—let's say three copies.

  • How far back in time should a file be recoverable? Six months.

  • How long may restoration of the user data take? Within four hours.

  • How long may the server be down? Two hours.

  • What kind of disaster recovery should be implemented? Data will be stored offsite. Because of operational considerations, it will be between three weeks and six months old.

  • What are the site operations staff's hours? Staff are available 7AM–4PM, Monday through Friday.

  • What is the network bandwidth at each site and between these sites? The servers are on FDDI rings at each site, and the major sites are linked by an ATM network. Remote sites are connected to the main campus by T1 WAN circuits with apparent bandwidth of 10Mbps.

  • What kind of backup hardware will be used? 35-70GB DLT, single tape (not changers).

Armed with this information and data on the available backup options, we can put together a system recovery strategy. For the sake of an interesting example, let's assume that ADSM is already being used for workstation backups across most of the company, including remote sites.

Figure 5.1 is an example of a backup schedule for systems with a dedicated DLT tape drive. It covers an entire month and handles full and incremental backups, cleaning, and disaster recovery jobs. The letters A–E in Figure 5.1 stand for the following tape operator duties:

  A. Dismount Daily tape. If needed, mount cleaning tape and allow drive to clean heads and eject tape. Mount corresponding coded Weekly tape into drive.

  B. Ship third-oldest Weekly tape offsite for storage as Disaster Recovery tape.

  C. Dismount Weekly tape. Relabel Weekly tape as Monthly tape, and store. Mount corresponding coded Daily tape into drive.

  D. Dismount Weekly tape. Mount corresponding coded Daily tape into drive.

  E. Manually clean drive, if necessary.

Figure 5.1: A sample backup schedule: a calendar for a file server with DLT drives.

The following are more complete descriptions of the individual jobs:

  • Weekly and monthly jobs—Full backups will be taken weekly and be kept for five weeks. Ideally, these long-running backups would be taken over the weekend when there is little user activity. In this case, however, the operator's schedule dictates that they must run on a weekday. (Let's choose Wednesday.) The full backup jobs are called weeklies; monthly jobs will simply be weekly jobs that are removed from the five-week backup cycle and kept onsite.

  • Daily jobs—Incrementals will run on the remaining nights of the week. They will be kept for two weeks, and they'll be called daily jobs. These jobs will be of two types: Daily Append and Daily Replace. Daily Replace will run once a week after the weekly/monthly and will overwrite (replace) the contents of the tape. Daily Append will run on the remaining days of the week and append its data to the tape. The combination of these two jobs creates a single tape with a week's worth of incremental backups.

  • Disaster recovery jobs—Disaster recovery jobs can be created in two ways. If you have 24x7 or 24x5 operator support, a DR job can be run one night a week after the nightly job has completed. This requires one more intervention by the operator than described on the calendar, but it ensures that disaster recovery tapes are no more than one week old. If you don't have the luxury of 24-hour operator support or an autoloader, unless you are able to run backup jobs during the day (not recommended due to open files and server load), you must use the simpler method indicated in the backup schedule. In this case, the disaster recovery job will be the third-oldest weekly job, rotated offsite until the next weekly job is run. The obvious drawback to this method is the age of the data; it's always at least three weeks old. The reason for this method is that a single tape drive doesn't give you a way to change tapes without operator intervention, and your operators' schedule ensures that no one will be around to perform the change required for the DR job after the regular nightly job. Other possibilities are to substitute a DR job for a daily (losing 24-hour recoverability for one day of the week) or to pay an operator overtime one night a week to drive in and swap tapes. Only one set of DR tapes is kept offsite; you may want to increase that to two sets for redundancy.

  • Cleaning—DLT drives don't need to be cleaned nearly as often as 4mm tapes, and an indicator light tells the operator when it's needed. The CLEAN job on Wednesday that uses a special cleaning tape is optional and needn't be performed unless the cleaning light is on.

  • Open files—Any file on a Windows NT system that has an exclusive lock on it—whether it's an open Word document or the SQL Server master database—may not be backed up. I say may because most storage management software offers optional open-file backup modules; if you don't buy one and a file is open, the backup program will skip it. This has very large and unfortunate consequences (especially for database systems) if you aren't aware of it. That SQL server you've been backing up for six months with the native Windows NT backup tool actually has no worthwhile data on tape, because SQL Server wasn't shut down before backups and keeps all its data structures locked open. As a result, the master database and all the database devices have been skipped by the backup program. This is an excellent reason to examine your backup logs on a daily basis, because these errors show up there.

If you must have high availability on your databases and you either won't buy an open file backup module or your software doesn't offer it, a simple process can provide data integrity:

  1. On a nightly basis, just before the scheduled backup time, dump the databases to dump devices. The scheduled job will then be capable of backing up these files.

  2. If possible, perform a full backup of the server with the database shut down. This should ideally be done whenever a physical database device has changed on the system, because those are the files that the operating system sees. To minimize the number of database shutdowns, you'll want to change these devices as little as possible. The advantage of a full physical backup of the database devices is that in the event of a failure, rebuilding the database system will be much faster: SQL Server won't have to be reinstalled, and the database devices won't have to be re-created manually.

  3. If you can't bring the database system down for the occasional backup, keep printed copies of database device information. After a failure, you'll have to rebuild them manually before you can load the database dumps from tape.

Restoration of service will be slower with this method because you must reinstall SQL Server and rebuild the database structure before loading your database dumps. If you shut down the database before backups and backed up everything at once, a simple restore job would bring back the entire database and its executables and data structures at once. It's a compromise you must make between availability and restoration speed.

To determine the right number of tapes to purchase for this backup scenario, we need to add up the number of tapes needed for each kind of job and the number of times it runs, and then factor in the various retention periods. Let's figure it out by job type:

Daily:

  (# of tapes needed for dailies) = (Total # of daily jobs) x (# of tapes per daily job)
                                  = (6 dailies/week x 2 weeks) x (1/6 tape per daily job)
                                  = 2 tapes

Note: "1/6 tape per daily job" really means that the same tape is used for dailies all week.

Weekly:

  (# of tapes needed for weeklies) = (Total # of weekly jobs) x (# of tapes per weekly job)
                                   = (5 weekly jobs) x (4 tapes per weekly job)
                                   = 20 tapes

Monthly:

  (# of tapes needed for monthlies) = (# of tapes per weekly job) x [(# of tape sets pulled from rotation) - (# of months that weeklies are in rotation)]
                                    = (4 tapes per weekly) x [(6 months) - (1 month of weeklies)]
                                    = 20 tapes

Note: In this case, "# of tape sets pulled from rotation" is equal to the retention period of six months because a set of weekly tapes is set aside every month for six months. Likewise, "# of months that weeklies are in rotation" means that you don't need to account for special monthly tapes when you can pull a valid weekly tape (one that's in current backup rotation) off the rack. In most cases, this will be equal to 1.

Disaster Recovery:

  (# of tapes needed for DR) = (# of tapes per weekly job) x (# of tape sets pulled from rotation)
                             = (4 tapes per weekly) x (6 months)
                             = 5 tapes

These formulas may seem like overkill when you can probably work it out with a little head scratching, but they pay off when the rotation gets too large or complicated for seat-of-the-pants reckoning. Having said that, don't forget to add Finagle's Constant of about 10% to cover bad tapes, underestimating how many tapes you'll need, and so on.
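
If you'd rather let the machine do the head scratching, the arithmetic is easy to drop into a batch stream. This is just an illustrative sketch using cmd.exe integer math; the subtotals are the ones worked out above, and the 10% cushion is rounded down by integer division.

@echo off
rem Tape order estimate for the example rotation above; the subtotals are illustrative.
set /a DAILY=2
set /a WEEKLY=20
set /a MONTHLY=20
set /a DR=5
set /a SUBTOTAL=DAILY+WEEKLY+MONTHLY+DR
rem Finagle's Constant: pad the order by roughly 10 percent (integer math rounds down).
set /a TOTAL=SUBTOTAL+SUBTOTAL/10
echo Order %TOTAL% tapes (%SUBTOTAL% plus a 10%% cushion)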

Virus Scanning

It's well known by now that Windows NT is susceptible to viruses. The boot partition and users' files are especially at risk, with much resulting pain and anguish. A server can be infected with a boot sector virus through infected service disks such as system configuration utilities, Windows NT boot disks, BIOS updates, and others. It's vital that you practice safe hex by scanning these disks regularly with an up-to-date virus scanner.

Building a comprehensive virus policy for Windows NT servers is more complicated than it might first appear. The most obvious requirements of a virus scanner are listed here:

  • Detection and correction—The scanner must be capable of finding and fixing as many viruses as possible.

  • Unobtrusive operation—The scanner must interfere with normal server operations as little as possible.

  • Comprehensive notification and reporting—It's very important to have flexibility in how virus alerts are sent. A scanner with a highly configurable alert utility will allow you to distribute file disinfection to local administrators by partition or share.

Selecting and configuring a Windows NT Server virus scanner is the easy part of building a virus scanning policy; figuring out what to do with a virus when you find it is the hard part. Before you sneer, "Automatically clean it off—Duh!", think ahead to the consequences.

Whose responsibility should it be to disinfect viruses? It's pretty clear that the administrations/operations group should keep the boot sector and OS partition clean. What about the user partitions?

What is the process for cleaning a virus off the server? The only way some viruses can be cleaned requires erasing the infected file. How will that be handled? "Dear sir: We erased your critical spreadsheet from the server because it was infected with a virus (even though you could read the data). Have a nice day, MIS." If the virus is on a personal share, the owner can spread it by emailing the file to others. If it's on a group share, it may be infecting many other people. The moral: Automatic correction on user partitions is okay if the file is not deleted, but get the alert about the virus out to the user right away so the user can notify others who may have been infected. If the virus requires that the file be deleted, move it to a special directory and notify the user immediately.

Who gets notified if a virus is detected on a user partition? The server administration staff? If the infected file is on a group share, where does the alert go?

Don't forget that here economies of scale work against you. If you have 1,000 users on each of 5 servers, and 35% of those users have at least one virus in their files, that's 1,750 cases that need to be dealt with. If you roll out a virus scanner across multiple servers in a short period, the help desk will be swamped with calls.

Suppose that one day you get all the viruses off the server. You now have a clean server with hundreds or thousands of dirty workstations reinfecting it every day. To work effectively, a comprehensive server virus policy must dovetail with a workstation virus policy. Fortunately, workstation virus scanners have become quite sophisticated and can catch many viruses the instant they're loaded into memory. You also need a process of regularly updating the virus signature database on all servers.

Fragmentation

Yes, Virginia, there is disk fragmentation on NTFS volumes, and it can affect your system's performance. Although they don't fragment as quickly as FAT partitions, NTFS partitions can be badly fragmented by normal operations. Take print spooling, for example. Printing a file to a network printer requires that a spool file be created and almost immediately deleted. This process alone, repeated hundreds or thousands of times in the course of a normal working day, can fragment any type of partition.

Disk fragmentation is measured by how many fragments a file is broken into. A completely defragmented file has a fragments per file ratio of 1.0: The file consists of one fragment. Executive Software, the company that literally wrote the book on the subject, believes that any partition with a fragments per file ratio greater than 1.5 is badly fragmented and severely impacts performance. An analysis I did of a Windows NT file server whose disk array had been in service for three years yielded a fragments per file ratio of 3.64! Historically, the only remedy for fragmentation was to back up the volume, reformat it, and restore the data. With the advent of Windows NT 4, however, hooks have been added to the operating system's microkernel to allow real-time defragmentation, and several products are on the market.

Software Maintenance

The task of keeping Microsoft operating system software up-to-date is pretty straightforward once you understand the concepts of hotfixes and Service Packs. Equally important, you should understand how much Microsoft itself trusts each of these updates.

Using Hotfixes

As problems are reported with Windows NT, Microsoft develops fixes for them. (We'll leave the subject of how quickly and how well for another time.) These fixes are called hotfixes, and not much regression and integration testing is done on them.

Hotfixes can be installed in one of two ways. The simplest way is to save the hotfix executable in a temp directory and run it. A more organized way, especially if you have a number of hotfixes, is to run the executable with the /x switch. This will extract the hotfix and its symbol files. You can then install the hotfix with the HOTFIX command. Table 5.3 lists the switches (that is, options) for the HOTFIX command.

Table 5.3 Switches for the HOTFIX Command

/y   Perform an uninstall (only with the /m or /q switch)
/f   Force applications closed at shutdown
/n   Don't create an uninstall directory (not recommended)
/z   Don't reboot when the update completes
/q   Use quiet mode; don't prompt the user for anything
/m   Use unattended mode (very similar to quiet mode)
/l   List the hotfixes already installed on this system

There are two other ways to quickly check whether hotfixes have been applied to a system. The first is to check in the %systemroot% directory for hidden directories:

$NTUninstallQxxxxxx$

Here, xxxxxx is the hotfix number.

The second method is to look directly in the registry:

HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Hotfix

If the Hotfix key isn't there, or if it's empty, no hotfixes have been installed.
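
Both checks are easy to script when you're auditing a room full of servers. The sketch below assumes the REG utility from the Windows NT Resource Kit is on the path for the registry half (it ships with Windows 2000, and its exact syntax varies slightly between versions); the DIR half needs nothing special.

@echo off
rem List the hidden hotfix uninstall directories under the system root.
dir /a:dh "%systemroot%\$NTUninstall*$"
rem If REG.EXE is available, list the Hotfix registry key as well.
reg query "HKLM\Software\Microsoft\Windows NT\CurrentVersion\Hotfix"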

Tip If you come to Microsoft operating systems from a different background, be careful about your assumptions on software maintenance. Unlike IBM mainframe maintenance (which recommends that you keep up with monthly fixes and considers these program update tapes [PUTs] safe), Microsoft takes a different position. Don't apply hotfixes unless they fix a specific, critical problem you're encountering, and even then be prepared for something else that depends on the hotfixed files to break. Microsoft doesn't hide the fact that most hotfixes are only minimally tested and certainly aren't tested against all the dependencies in the OS. So don't apply a hotfix unless you really need it, and then test it thoroughly in your environment before releasing it to your production servers. As always, make sure you have a good backup of your server before you apply any maintenance of this type.

Using Service Packs

As a large number of hotfixes accumulates, they'll be rolled up into an overall maintenance package called a Service Pack. Service Packs are infrequent events; there were only five for Windows NT 3.51, and there have been four so far for Windows NT 4. Service Packs have also grown in size and sophistication over time. Besides becoming easier to install, since Windows NT 4 SP3 they've featured an uninstall option; the original files are stored in a hidden directory named %systemroot%\$NTServicePackUninstall$. Unlike individual hotfixes, the updates in a Service Pack are regression and integration tested to ensure that they're more stable than an ad hoc collection of individual hotfixes. Service Packs are cumulative, which means that updates from previous Service Packs and hotfixes are rolled up into later ones. Service Pack 4 is 100MB in size and contains more than 1,200 files.

Service Packs are more thoroughly tested and theoretically more stable than hotfixes, but again, Microsoft's policy has long been, "If you aren't having problems, don't apply it." There are several reasons to cautiously change that mindset. The first is that as the world has begun to pay attention to Windows NT, the security attacks against it have dramatically increased, and so have Microsoft's security hotfixes. Test carefully before you install an individual security hotfix, but applying a Service Pack with a tested suite of these updates is a good idea.

The second reason is that Service Packs have evolved to provide new features as well as fixes. Password filtering in Windows NT 4 Service Pack 3 is a good example; it provides a higher level of password security if you choose to implement it.

You should read the release notes before installing a Service Pack, and carefully consider all the implications of what you read. For example, Service Pack 4 updates the NTFS file system driver so that a Windows NT 4 system can coexist with NTFS Version 5. Does this affect your disk utilities, such as your defragmenter or your virus scanner? You need answers to questions like these before you apply the newest Service Pack.

The Windows NT Service Pack site (for all releases of Windows NT) can be found at this address:

ftp://ftp.microsoft.com/bussys/winnt/winnt-public/fixes/

Using System Dump Examinations

When a system bugchecks and generates the infamous blue screen of death (BSOD) and a dump file, symbol files are needed during dump analysis to determine where in the operating system code the problem occurred. The symbol files can be found on the Windows NT Server CD at \support\debug\<processor type>\symbols; the dump analysis tools are at \support\debug\<processor type>. However, you don't have to keep all this straight, because the batch stream \support\debug\expndsym.cmd will install the symbol files in the correct place. The syntax is as follows:

Expndsym <CD-ROM_drive_letter> %systemroot%

Here's an example:

expndsym d: c:\winnt

In the example, you could just as easily use %systemroot% instead of c:\winnt, or even hard-code it into any copies you make of the batch stream. This works as long as you're actually running under the operating system on which you want to use the symbol files.

Symbol files take up a lot of space (approximately 100MB) and must be kept up-to-date. To accurately debug a dump with symbol files, every Service Pack or hotfix applied to the OS must have its corresponding symbol file updated in %systemroot%\system32\symbols. When repeated over hundreds of servers, this can obviously be a configuration management nightmare. Fortunately, you don't need to install symbol files on every server. Instead, install them on a single troubleshooting server that runs the OS release and fix level of your production baseline. If you have more than one OS baseline, install that baseline's symbols in another directory. When a bugcheck occurs on a production server, send the dump to the troubleshooting server (you may want to zip it if you have slow links) and perform the dump analysis from there. The DUMPEXAM command is the starting point for reducing the dump:

dumpexam -v -f output_filename -y symbol_search_path crashdumpfile

Here, symbol_search_path is usually c:\winnt\system32\symbols.

Obviously, it's easier to write a batch program to take care of the dump file processing. Listing 5.1 is a simple batch stream called DUMPSHOOT that runs DUMPEXAM, writes the results to a text file, and then displays the file for you.

Listing 5.1. A Batch Stream to Simplify the Dump Processing

@Echo off
rem DUMPSHOOT: run DUMPEXAM against a crash dump in C:\DUMPSTER and display the results.
if "%1"=="" goto a
if "%1"=="/?" goto help
if "%1"=="?" goto help
set dumpfile=%1
goto exam
:a
set dumpfile=memory.dmp
:exam
Echo Attempting to extract information from c:\dumpster\%dumpfile%...
dumpexam -v -f c:\dumpster\dumpshoot.txt -y %systemroot%\symbols c:\dumpster\%dumpfile%
Echo Would you like to view the crash dump analysis? (Ctrl+C if not!)
pause
notepad c:\dumpster\dumpshoot.txt
goto exit
:help
Echo "DUMPSHOOT <dump file>"
Echo DUMPSHOOT condenses NT crash dumps into a usable format.
Echo DUMPSHOOT invokes DUMPEXAM with the right parameters.
Echo If no dump file is specified, the file is assumed to be
Echo C:\DUMPSTER\MEMORY.DMP.
Echo If you specify a dump file, I still assume it's in C:\DUMPSTER.
Echo The output is C:\DUMPSTER\DUMPSHOOT.TXT
:exit

The output from DUMPSHOOT is dumpshoot.txt. Most of the time it contains all the information that Microsoft product support needs to move forward in problem resolution, without having to ship a 128MB dump file to their ftp server.

Software Maintenance Recommendations

Here's a summary of my software maintenance recommendations:

  • Only apply hotfixes if you really need them, and test them in your environment before putting them in production.

  • As with anything else brand new, don't be in a rush to install a new Service Pack. Microsoft has had a mixed record on the reliability of its Service Packs; unless you're dying for the updates, wait until they've aged just a little.

  • Be sure you pull down the hotfix for the right server architecture. The IA architecture hotfixes end in i, and the Alpha ones end in a.

  • For Windows NT 4, apply maintenance such as hotfixes and Service Packs only after you've installed and configured all the system's software. If you install a Service Pack and then install a software component, the installation process will overwrite the updated components with the original media components. This recommendation therefore leads into the following one.

  • If you have installed software after applying maintenance on a Windows NT 4 system, reapply the maintenance. This unfortunately means that more OS partition disk space is chewed up because a new uninstall directory is created every time a Service Pack is applied.

  • Service Packs in Windows 2000 are supposed to be intelligent enough that if you install software after the Service Pack has been applied, you won't have to re-apply the Service Pack.

  • Print and read the documentation very carefully. This is not some program you want to just install without looking! Even though it has an uninstall option, a service pack often changes the basic structure of the SAM database or the registry so that it's not possible to completely back out without restoring from backups.

  • Take a full backup of your boot partition, and update your Repair Disk, before you install a hotfix.

  • Don't keep hotfixes older than the most recent Service Pack. At this point, they've been rolled into that pack; if they haven't, it's because the hotfix has been withdrawn. If that's the case, you don't want it on your system anyway!

Monitoring Performance and Tuning the Box

Windows NT is the most self-tuning operating system ever devised for the commodity server market. As a result, there are very few knobs the Windows NT administrator can turn to alter the performance of a Windows NT server—and turning them without restraint will probably degrade the system more than if you had left it alone. However, it's important to understand the performance characteristics of Windows NT and learn where it's most likely to get clogged up. I'll just hit the high points of detecting bottlenecks and tuning a Windows NT server by subsystem, sprinkled with some general rules. For a complete treatment of Windows NT performance, look in the Optimizing Windows NT volume from the Windows NT 3.51 Resource Kit (a similar volume doesn't exist in the Windows NT 4 Server Resource Kit).

Note: A note on Windows 2000: Even though performance documentation may say 3.51 instead of 5.0 or Windows 2000, 98% of it is perfectly relevant. At all but the deepest level of detail, performance characteristics and bottlenecks of Windows NT are the same from versions 3.5 to 5.0. Indeed, the basics apply almost exactly across any virtual memory operating system, whether IBM, Sun, Compaq, or Microsoft is on the box.

What follows are four of the basic tenets of system performance diagnosis and tuning for all computer systems. The fifth (Task Manager) is specifically for Windows NT systems:

  • Thou shalt not change more than one system parameter at a time. It's important to look at performance problems logically because these four dynamic variables (processor, memory, disk, and network) always interact with one another in a system—and it's easy to lose track of where you are in a four-variable equation. If, in your hurry to get your boss off your neck, you tune several system parameters at one time to correct a problem, you'll never know exactly what fixed the problem and what didn't. Ask your boss whether he really wants to see the problem appear again because a good problem analysis wasn't done the first time, or whether he'd rather spend a little extra time and fix it just once.

  • There is always a bottleneck; tuning just minimizes it and moves it around. One subsystem will always have more load than another, even if just a little. A bottleneck occurs when a task on the system must wait for a system resource (processor, memory, disk, or network) because it's tied up with another task. Bottleneck equals wait.

  • One bottleneck may mask another. A heavily loaded system may have several bottlenecks, but until the first bottleneck is corrected, often only one shows. A common example is a database system without enough CPU resources. The processor is pegged (old analog gauge slang, for you new technocrats) at 100%, but disk I/O is at manageable levels. Upgrade the processor or add a second processor, and that bottleneck is removed, allowing the database engine to make I/O requests to its database unhindered by a slow processor. Suddenly the disk I/O goes through the roof! This is something you need to warn your boss about before it happens so that you don't look like an idiot.

  • The Heisenberg Uncertainty Principle also applies to performance monitoring. To paraphrase an important tenet of quantum mechanics: "You can interfere with the performance of a system simply by monitoring it too closely." (Mr. Heisenberg was specifically referring to the momentum and location of subatomic particles.) Performance Monitor, when recording lots of data over short intervals, can impact the performance of a Windows NT system. The Perfmon utility uses CPU, memory, and disk I/O. If you monitor the server remotely, you reduce these three, but you increase network I/O as the performance data is sent over the network to the computer running Perfmon. This isn't normally enough to worry about, but it's good to be aware of. If, for example, you're remotely collecting log information and have selected the process, memory, logical disk, and network interface objects, a moderate but continuous load has been put on your network interface.

  • Use the Task Manager. In Windows NT 4 and Windows 2000, the Task Manager (shown in Figure 5.2) has been greatly expanded from its original role as a simple way to shut down unruly applications. Launched from the three-finger salute (Ctrl+Alt+Delete) or by simply clicking an empty spot of the taskbar with the secondary mouse button, it now has Processes and Performance property sheets that can provide a great deal of detail on the current system status. The Processes property sheet allows you to quickly view process information that was previously more time-consuming to reach; by clicking the column headers, you can sort for the highest values in each field. The menu item View, Select Columns allows you to add up to 13 objects to monitor and sort. A limitation of this expanded tool is that it can be run only locally. To view system processes remotely, you must use Performance Monitor.


    Figure 5.2: The Task Manager.

Because Windows NT is such a self-tuning operating system, performance and tuning often distill into two steps. Step 1 is finding the Windows NT subsystem(s) with the bottleneck, and step 2 is throwing hardware at it! To use an old Detroit saying, "There ain't no substitute for cubic inches." Most of us don't have unlimited hardware budgets, however, so a detailed performance analysis will tell you exactly where the problem lies, will offer the best course of action to fix it, and will provide documentation to support your conclusion when the bean counters get upset.

What about performance and tuning for Windows 2000? There's no need to get worked up over the new release in this area because performance basics are the same for any server. The user interface to find the right knobs has definitely been pushed around, however. Where there's a difference, I'll show how to get there. For example, Performance Monitor from Windows NT 4 has been moved to the MMC as a snap-in (Perfmon.Msc). It functions pretty much the same as its predecessor (see Figure 5.3).


Figure 5.3: Performance Monitor in Windows 2000.

A Windows NT system can be analyzed in four sections: processor, memory, disk I/O, and network I/O. Use this organization to logically investigate any performance problem you encounter on a Windows NT server.

Tuning the Processor

People who don't know much about Windows NT performance always seem to focus on the CPU as the source of, and the solution to, server performance problems. Although that isn't true, it's pretty easy to spot CPU bottlenecks.

The following are recommendations of the Performance Monitor processor-related counters to watch:

  • System % Total Processor Time (a Performance Monitor counter) consistently near 100%. A snapshot can also be seen from the Windows NT 4 Task Manager Performance property sheet, CPU Usage History section.

  • Here's an easy way to see what processes are hogging the CPU: In Perfmon, select the Process object. Select the % Processor Time instance associated with the Process object. To the right of these, click the second instance (below "_Total"), and drag the mouse down to include every instance. Now either scroll up and Ctrl+click to remove the Idle instance, or delete it later. Click the Add button. You're now tracking every process on the system by percentage of processor utilization. To make the chart easier to read, click Options, Chart, or the rightmost button on the display. Change the Gallery section from Graph to Histogram, and click OK. Hit the backspace key to enable process highlighting. Perfmon now displays a histogram of all the active processes on the system. You can scroll through them with the up and down arrows, and the instance that's in focus will be highlighted in white. If you haven't deleted the Idle instance, you can do it now by selecting it from the list and pressing the Delete key.

The following are recommendations about how to tune your processor:

  • Take the doctor's advice: "If it hurts when you do that, don't do that." At least not during prime time. Schedule CPU-hungry jobs for off-hours, when possible. For example, programs that read the SAM of a large Windows NT 4 domain to process user accounts for expired passwords can peg the primary domain controller for quite a while.

  • Upgrade the processor. If you're considering whether to switch from an Intel architecture to an Alpha, look at the System Context Switches/Sec counter. Don't switch if this counter is the primary source of processor activity; relatively speaking, an Alpha takes as long as an Intel to do context switches. (A context switch occurs when the operating system switches the processor from one thread to another.) And, of course, you shouldn't make big decisions like this based solely on the System Context Switches counter!

  • Add processors if the application in question is multithreaded and can take advantage of multiple processors.

  • Use fast network cards. A 16-bit network interface card (NIC) uses more CPU than a 32-bit card.

  • Use bus-mastering SCSI controllers. A bus-mastering controller takes over the process of an I/O request, thus freeing the CPU.

  • Use the START command. This command has /low, /normal, /high, and /realtime switches to start programs with varying levels of priority. This is the only way to externally influence the priority of individual programs.

  • You can also tune the foreground application response time with the Performance property sheet, found in the System applet of the Control Panel. In Windows NT 4, it's a three-position slider. Figure 5.4 shows how it looks in Windows 2000.


    Figure 5.4: Foreground application response in Windows 2000.

This alters the following:

SYSTEM\CurrentControlSet\Control\PriorityControl\Win32PrioritySeparation

This value ranges from 2 (highest foreground priority) to 0 (all programs have equal processor priority). A scripted way to set it follows the Tip below.

Tip It's interesting to note that Windows NT Server 4 ships with this setting tuned to give foreground applications priority over the background services that are, after all, the main business of a server. The reasoning may be that if you do run a program from the console, it's not done casually, so you want good response time. You should consider setting this to None. In Windows NT Server 5.0, it's correctly set to Background Services.
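
The Control Panel slider is the supported way to change this setting, but if you want to push the None value to many servers at once, it can also be done by importing a small REGEDIT4 file. Treat the following as a hedged sketch rather than an endorsed procedure: the temporary file name is invented, and you should have a verified backup before scripting any registry edit.

@echo off
rem Build a REGEDIT4 file that sets foreground boost to None (value 0), then import it silently.
rem The temporary file name is an example only.
> "%TEMP%\boost.reg" echo REGEDIT4
>> "%TEMP%\boost.reg" echo.
>> "%TEMP%\boost.reg" echo [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\PriorityControl]
>> "%TEMP%\boost.reg" echo "Win32PrioritySeparation"=dword:00000000
regedit /s "%TEMP%\boost.reg"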

Understanding Memory Performance

Memory, not processor utilization, is the first thing administrators should look at when a Windows NT system is experiencing performance problems.

Paging is a necessary evil—bad, but unavoidable. Okay, to be fair, it's an integral part of memory management, so "bad" may be an overstatement, but avoid it as much as possible. If you're reading this book, you've probably heard the term "paging" for a while, have seen it occurring with Perfmon, and can even convince your boss that you know what it means—but you probably would hate to be cornered into defining it or explaining the concept to a new hire. Here's a (hopefully) simple explanation.

Windows NT uses a demand-paged virtual memory model. That's four adjectives attached to one noun, so it deserves explanation. Windows NT is a 32-bit operating system, so programs that run on it can see 2^32 bytes, or 4GB, in a flat address space. ("Flat" means that there are no segments or other compromises to worry about, as in previous versions of Windows.) The upper 2GB is reserved for system code, so the lower 2GB is available for user programs. The basic problem is obvious: You can run a program that may try to load data up near a 4GB memory address (location), but you probably don't have 4GB of physical RAM shoehorned inside your servers. This is where the term "virtual" comes in. Windows NT juggles its limited memory resources by pulling data into main memory when it's asked for, writing it out to disk when it has been modified in memory, and reclaiming memory by writing the least recently used data to a page file. This process is called paging. For efficiency's sake, this data is moved around in chunks called page frames that are 4KB in size for Intel systems and 8KB for Alpha systems. Virtual memory is how the operating system lies to everyone and everything that asks for memory by saying, "Sure, no problem, I have room in memory for you!" and then scurrying around under the covers to page data in and out of main memory to provide it. (A good definition I heard for virtual memory is that the word that follows it is a lie.) Good virtual memory managers are masterful at maximizing the use of system memory and automatically adjusting their actions to outside conditions.

So, the page file (PAGEFILE.SYS) is the space on your hard disk that Windows NT uses as if it were actually memory. Why is it a problem if the system pages out to the page file? (Paging to get data from disk is unavoidable, so it doesn't matter in this discussion.) Isn't that how it's designed? Well, yes, it is, but it's slow. How slow is it, you may ask? Average computer memory today has an access time of 50 nanoseconds, or 5 x 10^-8 seconds. Very fast disk access time today is about 6 milliseconds, or 6 x 10^-3 seconds. This means that memory is 120,000 times as fast as disk!

A good analogy is to increase the time scale to something we're more comfortable with. A Windows NT program executing in the CPU asks for data. If that data is already in main memory, let's say it takes 1 second to return it. If the data it needs is out on disk, it will have to wait almost a day and a half to get the data it needs to continue. Now, the virtual memory manager mitigates this wait by passing control to other programs that don't have to wait, but it's obviously a tremendous performance impact. When paging rates go too high, the system gets caught in a vicious cycle of declining resources and begins "thrashing."

The two best ways to avoid paging are to add physical memory and to tune your applications (especially database applications) carefully to balance their needs with the operating system's needs. Unfortunately, the most obvious Performance Monitor counter, Memory Pages/Sec, can be misleading, as explained here.

The following are recommendations of what Performance Monitor memory-related counters to watch:

  • Memory Available Bytes consistently less than 4MB (Intel) or less than 8MB (Alpha). A snapshot of this can also be seen from the Windows NT 4 Task Manager Performance property sheet, Physical Memory section, Available counter. This counter indicates the number of free memory pages in the system; when it drops below 1,000 pages (4MB in an Intel system using 4K pages), the system paging rate increases in an attempt to free up memory. This was also seen in Windows NT 3.51 from the WINMSD utility, Memory section, as memory load. In that utility, a memory load of 0 corresponds to 1,100 or more available pages, and a load of 100 corresponds to 100 or fewer available pages; values in between scale proportionally. For example, a memory load of 25 indicates that 3MB is available, and a memory load of 75 means that only 1MB is available. Several shareware or freeware memory load monitors can be found to watch this important indicator.

  • Memory Available bytes decreasing over time. This indicates a memory leak condition, where a process requests memory but never releases it—there's a bug in an application. To determine the culprit, monitor the Private Bytes counter of the Process object, and watch for an increasing value that never goes down. (Actually, the term "memory leak" is a misnomer; memory isn't leaking out of the system—it's being kept by a process.)

  • Paging File: % Usage, % Usage Peak is near 100%. Don't let the page file grow, as it will have a significant impact on system performance. All disk I/O ceases during page file growth. Not only that, but page file growth very likely causes fragmentation of the page file. This means that during normal paging operations, the operating system will have to jump the physical read/write heads all over the disk instead of one contiguous area. The simplest way to avoid this is to make the page file larger than its default size of physical memory plus 12MB, especially on memory-constrained systems. The next simplest way, after the page file has already become fragmented, is to move it to another partition, reboot, defrag the original partition, and move the page file back. This will create a contiguous page file.

  • Memory Committed Bytes is greater than RAM. Memory Committed Bytes is the amount of virtual memory actually being used. If the system is using more virtual memory than it has physical memory, it may be paging heavily. Watch paging counters such as Memory Pages/Sec and Memory Page Faults/Sec for heavy usage. The Task Manager equivalent of Memory Committed Bytes can be found in its Performance property sheet, Commit Charge section, Total counter.

  • Memory Committed Bytes approaches Memory Commit Limit. If the page file has already reached the maximum size defined in Control Panel, System, there are simply no more pages available in memory or in the page file. If the system reaches this point, it's already paging like a banshee in an attempt to service its memory demands. The Task Manager equivalent of Memory Commit Limit can be found in its Performance property sheet, Commit Charge section, Limit counter. The ratio of these two counters is the Memory % Committed Bytes In Use counter; a value of less than 80% is good.

  • Memory Pages/Sec can be a misleading counter. For performance reasons, in NT 4 Memory Pages/Sec was moved from the memory subsystem to the file subsystem. Instead of detecting actual page faults in memory, it simply increments every time a non-cached read (that is, a read from disk) occurs. This makes the counter somewhat unreliable on a file server where many open-file activities take place, and very unreliable on a database server (which manages its own memory) that may be doing a great deal of database I/O.

The following are recommendations on how to optimize your memory performance:

  • Add physical memory. Generally, the best thing you can do to boost Windows NT performance is to add memory. Lack of memory is by far the most common cause of performance problems on Windows NT systems. If your boss corners you in your office and asks why server XYZ is so slow—and you didn't even know XYZ existed—answer, "It's low on memory," and you'll probably be right. You can approximate (or guess) how much memory you need by looking at the page file(s) and using the following line of reasoning: If you had a system with no memory constraints, it would almost never page, and page file utilization would approach zero. You don't, so the operating system needs some number of megabytes in the page file to back its memory requests. The worst-case amount can be found in the Paging File % Usage Max counter of the Paging File object. So, if the system in question has a page file of 100MB and the Paging File % Usage Max counter is 75%, at its most heavily loaded point the system required 75MB more than it had available in physical memory. Therefore, adding 75MB of physical memory would be a good guess. Of course, the Paging File % Usage Max counter measures an instantaneous maximum, so if an operator quickly launched and then canceled three big utilities from the server console during your monitoring period, the value will be too high. On the other hand, if you already have a processor or I/O bottleneck, the value may be too low. As I said, it's just a guess.

  • If one application is the troublemaker, run it during off-peak hours. Remember that it will have to share time with long-running system utilities such as backups, anti-virus scanners, and defragmenters.

  • If the page file utilization hits 100% and its size is less than the maximum set in Control Panel, System, Performance, Virtual Memory, Paging File, the page file will extend itself. You don't want this to happen, for several reasons. First, all system I/O will halt while the page file extension occurs. Second, the odds are good that no contiguous space will be available after the page file, so it will become fragmented. This means that whenever the system becomes heavily loaded enough to use the extra space, the disk heads must jump around the disk simply to page, adding extra baggage to a system already in trouble. Set the initial page file size sufficiently large when the system is built or recently defragmented so that it won't need to extend. Disk space is cheaper than a fragmented page file.

  • If the system in question is a BDC of a large Windows NT 4 domain, consider converting it to a member server. All Windows NT 4 domain controllers have a SAM database that is stored in paged pool memory. This means that when a domain controller authenticates a user's logon, it must page the entire contents of its SAM into main memory to get the account's credentials. If it doesn't perform any more authentications for a while, the dirty pages will get reused for other programs; however, authentications on a domain controller usually happen frequently enough that this doesn't occur. So, a domain controller has a chunk of main memory semi-permanently allocated for user authentications. How much memory is used depends on the size of the SAM.

Tuning Disk I/O

Because of the mechanical nature of hard disk drives, the mass storage subsystem is always the slowest of the four subsystems in a computer. As we've seen so far in this section, it's 120,000 times slower than memory. As a result, all sorts of elaborate data caching and buffering schemes have been devised to minimize the disk's performance penalty. This subsystem can be the most important area you tune. If you have a 500MHz processor but your hard drives came from a salvage sale, you've effectively put a ball and chain around its leg whenever the system has to page!

When working with hard disk drives, a good analogy to use is that of an LP-playing jukebox. Inside the drive's case are one or more constantly spinning aluminum-alloy platters, arranged one on top of another in a stack. When you are at work on your computer, you enter commands through your keyboard or mouse. The hard drive's actuator arm—much like a jukebox's tone arm—responds to these commands and moves to the proper place on the platter. When it arrives, the drive's read/write head—like the needle on the tone arm—locates the information you've requested and reads or writes data.

The following are recommendations of which Performance Monitor disk I/O-related counters to watch:

  • Physical Disk % Disk Time counter consistently at or near 67%. This is the percentage of time that this particular disk drive is busy servicing read or write requests.

  • Physical Disk Queue Length > 2. Any time the queue length on an I/O device is greater than 1 or 2, it indicates significant congestion for the device.

The following are recommendations of how to tune your disk I/O:

  • Minimize head movement. The slowest actions of a hard drive are waiting for the disk's actuator arm to move its read/write heads to the correct track (the seek time) and, once it's there, waiting for the correct sector to come under the heads (rotational latency) so that data can be read or written. There isn't much you can do about rotational latency, but you can minimize head movement. The most effective way to minimize head movement is to defragment your disk and to make sure that your page file(s) are contiguous. You can tell whether the page file is fragmented by looking at the text mode results of a disk analysis from Diskeeper. The operating system won't allow disk defragmenters to defragment the page file, so you must do it yourself. The technique is simple: Create a page file on another partition and remove the original, reboot, defragment the original partition, and then re-create the original configuration.

  • The second way to minimize head movement is to think about what kind of data is on the disk. Place large, heavily accessed files on different physical disks to prevent the heads from jumping back and forth between two tracks. For example, let's say that you create an SQL server with the operating system on disk 0, the database device on disk 1, and the transaction log on disk 2. A little later, you discover that the database device is heavily accessed and isn't large enough, so you extend it with a second database device on disk 2. You now have a case of head contention on disk 2. The read/write heads focus on the heavily accessed database device (at the inside of the physical disk platters, because it was created last), with constant interruptions from the transaction log (at the outside of the physical disk platter because it was created first). The transaction log is written to in small bursts whenever a transaction is made to the database device.

    The heads continuously bounce back and forth across the full extent of the disk. For the same reasons, you shouldn't install Office components on the same physical disk if you're not using RAID.

    This won't apply in systems where the disk subsystem has been striped in a RAID 0 or RAID 5 configuration. Data is evenly striped across the physical RAID set regardless of where it appears to be on a logical partition.

  • Use NTFS compression sparingly. Disk compression is a great way to squeeze more data onto a disk. It's also a great way to increase the average percentage of processor utilization and to fragment the disk. I recommend that compression be used for low-access document shares and to temporarily buy back disk space when a server's data drive is almost full. In Windows NT 4, compressed files on disk must be decompressed by the server before the data is sent to the client. Windows 2000 Professional will support compression over the wire, which keeps the data compressed until it reaches the client where it is then decompressed. This offers two big benefits: It offloads the CPU cycles required for decompression from the server to the client, and it decreases network bandwidth. Compression load on a processor will be less of an issue if you're buying a new server with the latest high-speed processors.

  • Use fast disks, controllers, and technology. Almost all modern disks and controllers supplied with servers are SCSI. Fibre Channel technology (133MB/sec) is faster than Wide Ultra-2 SCSI (80MB/sec), which is faster than Wide UltraSCSI (40MB/sec), which is faster than Fast Wide SCSI (20MB/sec), which is faster than Fast SCSI (10MB/sec), which is faster than SCSI-2 (5MB/sec), which, finally, is faster than IDE (2.5MB/sec).

  • Use mirroring to speed up read requests. The I/O subsystem can issue concurrent reads to a pair of mirrored disks, assuming your disk controller can handle asynchronous I/O.

  • If you are using a RAID 5 array, increasing the number of drives in the array will increase its performance.

Tuning Network I/O

Network I/O is the subsystem through which the server moves data to its users. This is the server's window to the world. You may have spent money on the fastest server in the world, but if you use a cheap NIC, it will look just as slow as your oldest servers. Here are some recommendations to help your network I/O:

  • The more bits, the better. The number of bits in a NIC's description refers to the size of the data path, so more is better. 32-bit NICs are faster than 16-bit NICs, which are faster than 8-bit NICs. A caution to this is that you should match the NIC to the bus. If you have a PCI (32-bit) bus, you should use a 32-bit card. An EISA bus will support 8-, 16-, and 32-bit NICs, but if you follow the previously stated rule, a 32-bit NIC will be the best performer.

  • Install the Network Monitor Agent service—but leave it in Manual mode. If you have Network Monitor Agent installed, a very useful Network Interface object will be added to Performance Monitor. This provides 17 different counters on the virtual interface to the network. I say "virtual" because, in addition to any physical NICs you have installed, it also includes an instance for every virtual RAS adapter you have defined on your system. For the NIC that you're probably interested in, however, it monitors the physical network interface. Leave the service turned off until you need it to reduce system overhead.

  • The server should always have a faster network interface than its clients. A server is a focal point of network traffic and should therefore have the bandwidth to service many clients at one time. This means that if your clients all have 10BaseT, the server should have 100BaseT. If the clients are running at 100, the server should have an FDDI interface.

The following are recommendations for what Performance Monitor network I/O-related performance counters to watch:

  • Network Interface Bytes Total/sec is useful for figuring out how much throughput the card is getting compared to its theoretical maximum. For instance, the math for a Bytes Total/sec value of 655,360 on a 10BaseT NIC on standard Ethernet is shown here:

    (655,360 bytes/sec) x (8 bits/byte) / (1,024 bits/Kbit) / (1,024 Kbits/Mbit) = 5Mbps

    Because the theoretical bandwidth of standard Ethernet is 10Mbits/sec, this card is running at 50% of its theoretical maximum. In reality, it's much closer to its practical maximum, because the Ethernet collision rate begins to rise dramatically when network utilization climbs above 66%. (A small batch sketch after this list automates the conversion.)

  • Broadcasts/sec or Multicasts/sec is greater than 100/sec. A certain number of network broadcasts or multicasts are normal; for example, DHCP requests from clients are broadcasts. However, excessive broadcasts or multicasts are bad because every card on the network segment must examine the broadcast/multicast packet to see whether it's destined for its client. This means that the NIC must generate an interrupt on its clients' CPU and allow the packet to be passed up to the transport for examination. This can cause serious processor utilization problems.

  • Network Segment % Network Utilization should be considered when things start slowing down to the point at which they are no longer acceptable. Some say that this point is around 40%–50%. Then the network is the bottleneck.
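
If you're reading Bytes Total/sec values out of a Performance Monitor log and don't want to reach for a calculator, cmd.exe can do the conversion. This is a throwaway sketch; the sample value is the one from the first bullet above, and integer math truncates any fraction.

@echo off
rem Convert a Network Interface Bytes Total/sec sample to megabits per second.
set BYTESPERSEC=655360
set /a MBPS=BYTESPERSEC*8/1024/1024
echo %BYTESPERSEC% bytes/sec is roughly %MBPS%Mbps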

The following are recommendations for how to tune your network I/O:

  • Analyze network I/O based on the OSI model. (For more information on the 7-layer OSI model, see http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/Default.asp?url=/technet/prodtechnol/windows2000serv/reskit/cnet/cnfh_osi_OWSV.asp.) This allows you to look at network I/O performance problems from the bottom up.

  • Consider the following at Layer 1 (the Physical Layer): Is the network overloaded? Is the NIC handling too much data? Are there excessive network broadcasts that the NIC must receive and analyze?

  • Consider the following at Layer 4 (the Transport Layer): Is your primary protocol first in the network binding order? If it isn't, you've unnecessarily increased the average connection time to other network nodes. Figure 5.5 shows the most common situation on a Windows NT 4 system. When you request a connection to shared resources on a remote station, the local workstation redirector submits a TDI connect request to all transports simultaneously; when any one of the transport drivers completes the request successfully, the redirector waits until all higher-priority transports return. For example: The primary protocol for your network is TCP/IP, and that's the only protocol most of your workstations are running. You have TCP/IP and also NetBEUI installed on your server because you must still service the occasional NetBEUI workstation. NetBEUI is first in the network binding order. When the server attempts a session setup with another network resource, the server must wait for NetBEUI to time out before completing the TCP/IP session setup.


    Figure 5.5: Poor binding order for a TCP/IP network.
  • Consider the following at Layer 5 (the Session Layer): The Server service's responsibility is to establish sessions with remote stations and receive server message block (SMB) request messages from those stations. SMB requests are typically used to request the Server service to perform I/O—such as open, read, or write on a device or file located on the Windows NT Server station.

    You can configure the Server service's resource allocation and associated nonpaged memory pool usage. In Windows 2000, it's buried in the Network Control Panel applet, Local Area Connection properties, then File And Print Sharing For Microsoft Networks Properties (see Figure 5.6).


Figure 5.6: Server memory optimization.

You may want to consider a specific setting, depending on factors such as how many users will be accessing the system and the amount of memory in the system. The amount of memory allocated to the Server service differs dramatically based on your choice:

  • The Minimize Memory Used level is meant to accommodate up to 10 remote users simultaneously using Windows NT Server.

  • The Balance option supports up to 64 remote users.

  • Maximize Throughput for File Sharing allocates the maximum memory for file-sharing applications. You should use this setting for Windows NT servers on a large network. With this option set, file cache access takes priority over user application access to memory. This is the default setting.

  • Maximize Throughput for Network Applications optimizes server memory for distributed applications that do their own memory caching, such as Microsoft SQL Server. With this option set, user application access to memory takes priority over file cache access.

Tuning Database Servers

Database servers deserve special mention here because they are so often accused of poor performance. Using the following rules will help you understand the performance characteristics of a database server.

Most of the time, poor performance isn't the server's fault; the fault lies in the application's design. It's much easier to write inefficient relational database queries than to mess up the tuning of a Windows NT system. Unfortunately, you will almost always have to prove beyond the shadow of a doubt that the system is performing adequately before the application developers will go back and begin optimizing their code. It's all too common to be forced into throwing hardware at a poorly designed application.

Allocate enough memory for Windows NT, and then give the rest to the database. It may seem obvious, but after the operating system, the most important entity in a database server is the database engine. Most databases designed for Windows NT have a parameter to reserve physical memory for their own use—and most databases voraciously gobble up every byte you can give them. Exactly how many bytes to give them is an inexact science, but the general process for Windows NT 4 is listed here:

  • Give 24MB to Windows NT and the rest to the database.

  • Watch Windows NT's paging rate. If under normal conditions Windows NT pages excessively (consistently more than 30–40 pages/sec), give it more memory by reserving less for the database. Keep doing this until the average paging rate is manageable. Because the Performance Monitor Memory Pages/Sec counter can't separate database paging from operating system paging, to get the Windows NT paging rate you must subtract from that number the sum of SQL Server Page Reads/Sec and SQL Server Single Page Writes/Sec (a worked sketch follows this list).

  • Watch the database's paging rate. Again, you want to minimize database paging because it incurs a high performance penalty. If the database paging rate is too high (consistently more than 10–20 pages/sec), you must add physical memory to the system.

  • Database servers are the biggest beneficiaries of multiple processors. By adding a second processor to a uniprocessor database server that's a bit processor-bound, you may almost double your throughput. Adding additional processors will continue to boost performance, but the single biggest gain will come from adding a second processor. Don't forget: In addition to running the system setup utility, in Windows NT 4 you must update the OS to a multiprocessor kernel and HAL before the second processor will be recognized. The UPTOMP.EXE utility in the Windows NT Resource Kit automates this process.

  • Watch the database server subsystem loads. The load on database server subsystems is listed here, in order:

    Processor

    Memory

    I/O subsystem

    Network I/O

Processor, memory, and disk I/O are heavily used by a database, while network I/O is relatively low. This is because a well-designed client/server database passes only the database query and the query results over the wire. The operations to form the query are done on the client, and the execution of the query is done on the server.
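
The paging-rate subtraction described in the second bullet above is easy to automate once you've sampled the three counters. The values below are invented purely for illustration; in practice you'd plug in numbers from your own Performance Monitor log.

@echo off
rem Separate operating system paging from SQL Server's own page I/O.
rem The sample counter values are made up; substitute figures from your Perfmon log.
set /a MEMORYPAGES=120
set /a SQLPAGEREADS=70
set /a SQLSINGLEPAGEWRITES=15
set /a NTPAGING=MEMORYPAGES-(SQLPAGEREADS+SQLSINGLEPAGEWRITES)
echo Approximate Windows NT paging rate: %NTPAGING% pages/sec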

Even though you use a logical approach to performance analysis, there are so many variables out of your control that, in the end, there's still some black art to it. You must look at your systems regularly, understand the applications they are running, and be able to read the tea leaves to come up with a feel for a system's performance problems.

Tuning Control Panel Settings

The Control Panel is the place to go for 90% of a Windows NT 4 server's general tuning. The other 10% are sketchily documented or undocumented keys and values in the Registry. In Windows 2000, you can forget almost everything you learned about where controls are located in the user interface; most have changed out of recognition. Fortunately, beta feedback has pointed this out to Microsoft, so the help system has a specific section on how to find the new ways to do old tasks. What follows are tips on Control Panel settings to make managing a server a bit easier:

  • The Console—In the Layout tab, change the screen buffer size height to 999. This will give you a scrollable command prompt window that will display the last 999 lines of data or commands. In Windows 2000, the easiest way to reach this is to launch a command prompt from the start menu, click the icon in the upper-left corner, choose Properties, and then select the Layout tab.

Tip In a command prompt window, you can view the buffer of your previously entered commands by pressing F7.

  • Network—Review your bindings to be sure that you have removed or disabled all unnecessary protocols.

  • Server—Update the description field with pertinent information about the server. This might include the server model, the owning organization, the purpose, and the location. In Windows 2000, the Description field is buried in Control Panel, Computer Management. Right-click the uppermost icon labeled Computer Management, choose Properties, and then choose Networking ID.

  • Services—Review the services. Do they all need to be started? For example, the Messenger service can be disabled on most servers because they rarely need to receive a message sent via NET SEND to the console. In Windows 2000, there are a ton of new services; services administration has also moved to the Computer Management tool. (A small sketch at the end of this list shows one way to script disabling a service.)

  • System—The System applet controls basic functionality (such as startup and shutdown options), the paging file, and general performance options. In the System applet, you'll find the following tabs:

    • Startup/Shutdown tab—Set the Show List timer to 5 seconds. On a dedicated Windows NT server, there's no choice to be made other than the base video mode. Ensure that all check boxes in the Recovery section are checked.

    • Performance tab—In Windows NT 4, consider sliding the Foreground Application Boost slider to None (see Figure 5.7). The setting for this control can be argued two ways. The first theory is simpler: A server's primary purpose is to serve its network customers, so foreground applications should always take a back seat to the customers' needs. The second one proposes that if an operator does need to do something on a server console, it's for a very good reason and is worth taking cycles away from paying customers to get good response time. Foreground boost set to None on a heavily loaded, bottlenecked server could result in very slow response time for a console operator. In Windows 2000, this boost control changes to a radio button, shown previously in Figure 5.4.


    Figure 5.7: Foreground application boost.
  • Display—Don't use a screen saver. If you do, set it to a simple one such as Marquee or a blank screen. I know it doesn't look nearly as cool as a row of monitors running 3D textured flags, but elaborate screen savers chew up CPU for no good reason. If you must have some kind of high-tech screen saver to impress your boss when he visits the computer room, choose Beziers and back the speed down a bit.
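
As an example of trimming services across many servers, stopping the Messenger service and keeping it from restarting can be scripted. The sketch below assumes the SC utility (a Resource Kit tool on Windows NT 4, included with Windows 2000); if you don't have it, set the startup type in the Services applet instead.

@echo off
rem Stop the Messenger service now and prevent it from starting at the next boot.
net stop messenger
rem SC.EXE syntax requires the space after "start=".
sc config messenger start= disabled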

Summary

This chapter covers many of the basic practical matters in assembling a server and then keeping it in good working order. It's a really big subject, so I've skimmed over some intimate details. Instead, I've included lots of important points in these areas to help keep you on a straight course as you wade through all those intimate details. Server performance, backup media and jobs, software maintenance—there are hundreds of pitfalls you can encounter. This chapter has laid out principles you can use to avoid them.

About the Author

Sean Deuby is a Senior Systems Engineer with Intel Corporation, where he focuses on large-scale Windows 2000 and Windows NT Server issues. Before joining Intel, he was a technical lead in the Information Systems & Services NT Server Engineering Group of Texas Instruments (TI). In that role he was a principal architect of TI's 17-country, 40,000 account enterprise NT network. Sean has been a charter member of the Technical Review Board of Windows NT Magazine, has published several articles in the magazine, and is a contributing author to the Windows NT Magazine Administrator's Survival Guide. He speaks on NT Server and Windows 2000 topics at computer conferences around the world. His domain design white paper for TI, "MS Windows NT Server Domain Strategy" has been published monthly on the Microsoft TechNet CD since 1996. Sean has been a Microsoft Certified Systems Engineer since 1996 and a Certified Professional in Microsoft Windows NT Server and Windows NT Workstation since 1993.

We at Microsoft Corporation hope that the information in this work is valuable to you. Your use of the information contained in this work, however, is at your sole risk. All information in this work is provided "as is", without any warranty, whether express or implied, of its accuracy, completeness, fitness for a particular purpose, title or non-infringement, and none of the third-party products or information mentioned in the work are authored, recommended, supported or guaranteed by Microsoft Corporation. Microsoft Corporation shall not be liable for any damages you may sustain by using this information, whether direct, indirect, special, incidental or consequential, even if it has been advised of the possibility of such damages. All prices for products mentioned in this document are subject to change without notice.
