Wes Miller

Contents

So What Can Happen?
Tools of the Trade
Operating System Is Missing
Boot Loader Is Missing
System Boots—then Hangs
Windows Safe Mode and Restore Points
System Crashes before Windows Finishes Loading
Last Known Good Configuration
System Crashes after Windows Starts

If you've worked with Windows for any length of time, chances are you've seen it fail at one time or another. Though Windows has grown increasingly reliable with each release, there are things that neither you nor Windows can control: drivers fail, systems get hit by power outages, files get corrupted, and disks crash. When these things happen, you may worry that you need a new machine. In this column, I'll look at the things that can go wrong in a Windows system and how you can troubleshoot them to get the system working again.

So What Can Happen?

After nearly 20 years of working with Windows, I've seen my fair share of systems that won't boot. The culprit can usually be found pretty quickly—depending on exactly what the system is doing. Let's categorize the symptoms that Windows systems will show and build up some cases to walk through for troubleshooting.

  • System won't perform a BIOS POST (Power On Self Test; it doesn't beep when you power it on).
  • System POSTs but says Missing Operating System or Operating System not found.
  • System POSTs but fails with either NTLDR not found or BOOTMGR not found.
  • System begins to boot but hangs during startup.
  • System begins to boot but crashes before the Windows desktop appears (and is stuck in a loop).
  • System makes it to the Windows desktop but then crashes while Windows is running (and is stuck in a loop).

These scenarios may sound altogether different, but, in fact, there are only a few common problems that cause most of them, and with some troubleshooting, you can figure out what went wrong and what you might have to do to fix it. One of the most difficult situations is when a system will no longer POST at all.

Unfortunately, this is almost always a hardware issue. It may require something as minor as a CMOS battery replacement or as complicated as a new motherboard or power supply. But if your system won't POST, you should probably grab the support phone number for your system OEM, as you're not likely to be able to solve this one on your own.

Tools of the Trade

As long as your system is POSTing, you stand a good chance of recovering, because the problem is probably not purely a hardware failure (though, of course, it may involve both hardware and software). Depending on how far your system gets into the boot process, you should be able to gradually remove various "suspects" from your list and get the system working again.

Before you start, here are some tools you'll want to keep handy:

  • Ideally, you should have access to another computer running Windows, with the Debugging Tools for Windows installed, to use for crash analysis.
  • You should have a copy of the Microsoft Diagnostics and Recovery Toolset (DaRT), which is part of the Microsoft Desktop Optimization Pack (MDOP); a 30-day evaluation copy is available online. Alternatively, you could use a Windows PE CD (ideally version 2.1, especially if you are recovering a Windows Vista or Windows Server 2008 system).
  • You should have a USB flash drive large enough to hold any crash dump from your problematic system.
  • You should have any tools necessary to remove hardware from your problematic system.

Operating System Is Missing

If you receive an error regarding a missing operating system or the operating system not found (the text may vary depending on the BIOS used on your PC), the problem is that, effectively, your system is missing the boot sector—the section on the disk that says where to find the boot loader. I've only once had this issue crop up unexpectedly—an executive's machine was hit with a surge during a power outage—and, surprisingly, this was the only issue that was immediately apparent.

Unfortunately, upon deeper investigation, we found that her system appeared to have completely lost all of its partitions. Ironically, this was at Winternals, and we then created a tool called Disk Commander (see Figure 1) that is now part of the Diagnostics and Recovery Toolset.

Figure 1 The Disk Commander recovery tool

Disk Commander can also be useful for recovering entire deleted directories. In this instance, it was just what the doctor ordered, since it was able to scan the disk for recently deleted partitions and completely recover them (see Figure 2).

Figure 2 Recovering a disk partition with Disk Commander

Partition recovery is not foolproof, but on recently deleted or lost partitions—and using a tool like Disk Commander—it stands a decent chance of success. Of course, this error may be caused by other problems (usually hardware). The Knowledge Base article "'Operating System Not Found' or 'Missing operating system' error message when you start your Windows XP-based computer" discusses the topic a bit further. As with a BIOS problem, if you have a hardware issue that prevents the disk from even showing up, there isn't much you can do with DaRT or any other tool.

Boot Loader Is Missing

If your system is running Windows Server 2003 or earlier, a missing boot loader error will reference NTLDR; if you've upgraded to, or dual boot with, Windows Vista or later, it will reference BOOTMGR. Basically, the message depends on which boot loader the boot sector points to.

This error rarely appears out of nowhere, though I have heard of it occurring in much the same manner as the lost partition I described. The main thing to remember is that all you need to do is boot to Windows PE and replace the NTLDR and NTDETECT.COM files from a CD or a network share that holds a copy of Windows. Make sure the replacement files come from a Windows release as new as or newer than the files you are replacing (use the copies from the latest available service pack; these files are backward compatible). In the case of Windows Vista or Windows Server 2008, copy the BOOTMGR file instead, and make sure that your Boot directory (hidden by default) is also present.
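Before copying anything, it can help to confirm exactly what is missing from the root of the system partition. The following Python sketch is my own illustration, not a Microsoft tool: the function name missing_boot_files is hypothetical, it checks only for the files named above, and it assumes a Python interpreter is available wherever the disk is mounted (stock Windows PE does not include one, so this is more practical with the disk attached to another machine).

```python
from pathlib import Path

# Boot files named in this column; matched case-insensitively, as Windows would.
LEGACY_FILES = ("NTLDR", "NTDETECT.COM")  # Windows Server 2003 and earlier
VISTA_FILES = ("BOOTMGR",)                # Windows Vista / Windows Server 2008
VISTA_DIRS = ("Boot",)                    # hidden by default


def missing_boot_files(root: str, vista: bool) -> list[str]:
    """Return the boot-loader files and directories absent from `root`."""
    present = {entry.name.upper(): entry for entry in Path(root).iterdir()}
    missing = []
    for name in VISTA_FILES if vista else LEGACY_FILES:
        entry = present.get(name.upper())
        if entry is None or not entry.is_file():
            missing.append(name)
    for name in VISTA_DIRS if vista else ():
        entry = present.get(name.upper())
        if entry is None or not entry.is_dir():
            missing.append(name)
    return missing
```

For example, missing_boot_files("D:\\", vista=True) lists what is absent on a volume that happens to be mounted as D:; an empty list means the loader files are present and the problem lies elsewhere.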

The article "An NTLDR or NTDETECT.COM Not Found Error" provides more information. You may notice that many Microsoft Knowledge Base articles suggest using the Windows Recovery Console (or, on Windows Vista, the Windows Recovery Environment), and some, such as "How to use the Bootrec.exe tool in the Windows Recovery Environment to troubleshoot and repair startup issues in Windows Vista," can be quite useful.

My recommendation, however, is to use Windows PE in this scenario. Though there are a few things the Recovery Console can do more easily than Windows PE, the overall power and flexibility of Windows PE generally make it easier for you to get things up and running.

I recommend that you run chkdsk after completing either the partition or boot loader recovery to ensure that you don't have any further disk damage that might surface later.

In the case of file damage due to a power outage or other issues, I also like to make sure that no Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) disk errors have been logged to the event log. Anything that causes enough data loss to wipe out a partition or files can easily lead to bigger problems down the road.

System Boots—then Hangs

This scenario comes with a symptom that many people encounter, but the symptom rarely reflects the actual cause. Very often, you will see agp440.sys named as the culprit for the hang. But that driver is merely an innocent bystander: it's simply the last driver listed before the system stops, as you can see in Figure 3, which shows a system booting into Windows Safe Mode right before Windows starts.

Figure 3 Drivers loading in Windows Safe Mode

Some articles I've read—including some Knowledge Base articles—suggest that disabling that driver is the best way to get a system working again. Not so. To get your system working again, here's what I suggest:

  • Start by removing any new hardware you have recently installed. Do this selectively, one piece at a time.
  • Try booting into Windows Safe Mode to see if that works. If it does, most likely a third-party driver is causing the problem (since Safe Mode loads only a minimal set of drivers and services, and most third-party drivers aren't among them).
  • Use either Windows PE or ERD Commander (from the DaRT) to disable any newly installed drivers. See the sidebar "Disabling Drivers or Services" for more information.
  • Try using the Last Known Good Configuration (discussed below) to find out if a recent driver change, not reflected in the last working configuration set, might be the source of the problem.

If these four steps do not get you up and running, you may need to perform a Windows repair installation (booting from a CD whose service pack level matches your malfunctioning Windows installation), or you may need to reinstall Windows.

Disabling Drivers or Services

ERD Commander and Windows PE both allow you to disable drivers or services that are hanging Windows at boot, albeit with different levels of difficulty.

To use Windows PE to enable and disable problematic services, first boot to Windows PE (be sure you have the necessary storage controller drivers for the system being booted). Start the Windows Registry Editor (regedit.exe) and select the HKEY_LOCAL_MACHINE key.

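From the File menu, select Load Hive, browse to C:\Windows\System32\Config\system (adjusting to fit your own Windows path; under Windows PE, the installation may not be on drive C:), and specify a name to use for the hive while editing (the name doesn't matter). Note that a hive loaded this way has no CurrentControlSet key, because that key exists only on a running system. Instead, check the Current value under the new key's Select subkey (1 means ControlSet001, 2 means ControlSet002, and so on), then browse into ControlSet00n\Services\servicename and take note of the Start value, which can be 0-4: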

0—Boot start: started by the operating system loader first.

1—System start: loaded during kernel startup after boot start drivers.

2—Auto start: Service Control Manager (SCM) starts these next.

3—Demand start: started on demand by the SCM.

4—Disabled: will not load.
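If you're checking the Start values of several drivers, it helps to decode them the same way each time. The following Python sketch simply encodes the list above; the class and function names are my own hypothetical labels (the values themselves correspond to the Win32 SERVICE_BOOT_START through SERVICE_DISABLED constants).

```python
from enum import IntEnum


class StartType(IntEnum):
    """Start values from the list above."""
    BOOT = 0      # started by the operating system loader first
    SYSTEM = 1    # loaded during kernel startup after boot-start drivers
    AUTO = 2      # started next by the Service Control Manager (SCM)
    DEMAND = 3    # started on demand by the SCM
    DISABLED = 4  # will not load


def describe(start_value: int) -> str:
    """Turn a raw Start value read in Registry Editor into a readable label."""
    try:
        return StartType(start_value).name.lower()
    except ValueError:
        return f"unknown ({start_value})"


def started_before_scm(start_value: int) -> bool:
    """True for drivers the loader or kernel starts before the SCM runs."""
    return start_value in (StartType.BOOT, StartType.SYSTEM)
```

Values 0 and 1 identify drivers loaded before the SCM runs, so disabling one of them affects the earliest stage of startup.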

Set the Start value of the service or driver you need to disable to 4. Do this carefully: some drivers have interdependencies, and if you disable a driver that other drivers or services depend on, you may crash your system rather than just hang it. When you're finished, select the hive's key, choose Unload Hive from the File menu, and reboot your system. The driver or service should no longer hold up the boot process.
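If you prefer the command line, the same edit can be scripted with reg.exe. Keep in mind that a hive loaded offline has no CurrentControlSet key (that key exists only on a running system); the Current value under its Select key tells you which ControlSet00n was in use. The Python sketch below only builds the command strings for you to review and run in order; it assumes reg.exe is available in your Windows PE image and that reg query prints a DWORD as a line such as "Current    REG_DWORD    0x1". The mount name OfflineSystem, the function names, and the example service name are hypothetical.

```python
def load_command(windows_dir: str, mount: str = "OfflineSystem") -> str:
    """Command that loads the offline SYSTEM hive under HKLM\\<mount>."""
    hive_file = windows_dir.rstrip("\\") + "\\System32\\config\\SYSTEM"
    return f'reg load HKLM\\{mount} "{hive_file}"'


def select_query_command(mount: str = "OfflineSystem") -> str:
    """Command that shows which control set the hive was last booted with."""
    return f"reg query HKLM\\{mount}\\Select /v Current"


def parse_current_set(reg_query_output: str) -> int:
    """Pull the Current DWORD out of reg query output (format assumed above)."""
    for line in reg_query_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "Current" and parts[1] == "REG_DWORD":
            return int(parts[2], 16)  # base 16 accepts the 0x prefix
    raise ValueError("no Current value found in reg query output")


def disable_commands(service: str, current_set: int,
                     mount: str = "OfflineSystem") -> list[str]:
    """Commands that record, then set, Start=4 for one service or driver."""
    service_key = "\\".join(
        [f"HKLM\\{mount}", f"ControlSet{current_set:03d}", "Services", service])
    return [
        f'reg query "{service_key}" /v Start',  # note the original value first
        f'reg add "{service_key}" /v Start /t REG_DWORD /d 4 /f',
    ]


def unload_command(mount: str = "OfflineSystem") -> str:
    """Command that unloads the hive so the changes are written back."""
    return f"reg unload HKLM\\{mount}"
```

A typical sequence is load_command("D:\\Windows"), then select_query_command(), then the two commands from disable_commands("badfilter", current_set) using the number parse_current_set returns, and finally unload_command().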

Now, if you have the MDOP, DaRT makes this truly easy. You can simply boot it up, connect to a Windows installation, and via the Services and Drivers application shown in Figure A, you can enable or disable services or drivers through a very simple UI.

Figure A An easy way to disable drivers or services

Windows Safe Mode and Restore Points

Of course, if your system is hanging or crashing at startup, booting into Safe Mode may help, especially if you don't have a copy of Windows PE handy and Safe Mode will start. On client versions of Windows, you may also be able to use System Restore points to recover your system, assuming you have them enabled and that they protect the files you need to restore (on Windows XP, they don't protect everything).

If you use Windows PE to recover a Windows Vista system, be sure to boot only Windows PE 2.0 or later. Booting with earlier versions will corrupt your restore points, rendering them unusable. (This is due to the way System Restore on Windows Vista watches for disk writes. Windows PE 1.x does not know how to interact with the system without causing writes that corrupt the restore points.)

System Crashes before Windows Finishes Loading

There are two classes of problems that often lead to system crashes before Windows finishes loading. The first is registry corruption. This is more common with earlier versions of Windows, as well as on systems that have restarted abruptly. Typically, if registry corruption is preventing your system from starting, only a small section of the registry is damaged. With Windows Vista, you have two copies of the registry located in the \windows\system32\config\regback folder, both of which are, in most cases, less than 24 hours old. You can try replacing the damaged registry files in \windows\system32\config with these copies.
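If you do swap in the RegBack copies, keep the damaged originals so you can go back. The Python sketch below illustrates the idea; it is my own example, the function name is hypothetical, it assumes each RegBack copy uses the same file name as the live hive file (SYSTEM, SOFTWARE, and so on), which you should confirm by looking in the folder first, and it assumes a Python interpreter is available wherever the disk is mounted.

```python
import shutil
from pathlib import Path

# Hive files normally found in %SystemRoot%\System32\config.
HIVES = ("SYSTEM", "SOFTWARE", "SAM", "SECURITY", "DEFAULT")


def restore_from_regback(config_dir: str, hives=HIVES,
                         suffix: str = ".damaged") -> list[str]:
    """Replace live hives with their RegBack copies, keeping the originals.

    Assumes each RegBack copy has the same file name as the live hive.
    Returns the names of the hives that were replaced.
    """
    config = Path(config_dir)
    regback = config / "RegBack"
    restored = []
    for hive in hives:
        backup = regback / hive
        if not backup.is_file():
            continue  # no backup copy for this hive; leave the live file alone
        live = config / hive
        if live.exists():
            live.replace(config / (hive + suffix))  # keep the damaged hive
        shutil.copy2(backup, live)
        restored.append(hive)
    return restored
```

Passing hives=("SYSTEM",) restores only the SYSTEM hive, which is usually the one involved when drivers and services fail to start.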

Alternatively, simply loading the registry offline under Windows PE, as described in the sidebar "Disabling Drivers or Services," will sometimes fix the problem. The Windows Registry Editor has logic built into it that attempts to repair corruption when it encounters it. This corruption repair was good in Windows XP, and it's even better in Windows Vista and Windows Server 2008.

If you can't fix the problem with Windows PE 1.6, try either using Windows PE 2.x or copying the registry files to a Windows Vista system, loading them there as described earlier so the newer repair logic can run, and then copying them back. Unfortunately, there is often not much you can do if this doesn't recover the system. You can try performing a repair installation on earlier versions of Windows, but in general I would recommend reinstalling. Again, run chkdsk to verify that there are no other problems.

The second class of problem, and probably the most common cause of system crashes (blue screens) before Windows has started, is faulty drivers. I recommend preparing a DaRT CD and using the Crash Analyzer (see Figure 4), an easy-to-use tool for analyzing Windows crash dumps, even on a system that won't boot.

Figure 4 Analyzing a dump file with Crash Analyzer

If you don't have the DaRT, or don't have it handy, you can also use Windows PE and the Debugging Tools for Windows to see where the general point of failure may be. Note that a crash dump may be corrupt or may be inconclusive, but quite often it will point you in the right direction. Here are the steps to follow:

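  1. Copy the most recent *.dmp file(s) from the crashed system via Windows PE. These files are located in the Minidump folder of the crashed installation's Windows directory (for example, C:\Windows\Minidump, although the drive letter may differ under Windows PE). Note that under Windows PE, %windir% points to Windows PE's own directory, not to the installation you are recovering. If you have full dumps enabled, the dump is written to MEMORY.DMP in the Windows directory itself and will be at least as large as the memory on your system (so it can be quite sizable).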

  2. Start WinDbg (part of the Debugging Tools for Windows) and, via File | Open Crash Dump, select the dump file you just copied.

  3. Set the symbol path so the debugger can download debugging information for the Windows binaries you are diagnosing. Type:

    .sympath SRV*C:\Symbols*https://msdl.microsoft.com/download/symbols

    and then press Enter.

  4. Type .reload and press Enter.

  5. Type !analyze -v and press Enter.

  6. The result will most likely point out the driver (or drivers) involved.
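Step 1 is easy to get wrong when you're juggling a Windows PE drive letter and a small flash drive. The Python sketch below is my own illustration (the function names are hypothetical, and it assumes a Python interpreter is available wherever the disk is mounted). It gathers the minidumps and MEMORY.DMP from the crashed installation, newest first, skips files that don't begin with the signature I'm assuming for kernel dumps ("PAGEDUMP" for 32-bit, "PAGEDU64" for 64-bit), and skips anything that won't fit in the flash drive's free space.

```python
import shutil
from pathlib import Path

# Assumed first 8 bytes of a kernel crash dump (32-bit and 64-bit layouts).
DUMP_SIGNATURES = (b"PAGEDUMP", b"PAGEDU64")


def is_kernel_dump(path: Path) -> bool:
    """Cheap sanity check: does the file start with a dump signature?"""
    with path.open("rb") as f:
        return f.read(8) in DUMP_SIGNATURES


def find_dumps(windows_dir: str) -> list[Path]:
    """Minidumps plus MEMORY.DMP under the crashed install, newest first."""
    win = Path(windows_dir)
    candidates = list((win / "Minidump").glob("*.dmp"))
    full_dump = win / "MEMORY.DMP"
    if full_dump.is_file():
        candidates.append(full_dump)
    return sorted(candidates, key=lambda p: p.stat().st_mtime, reverse=True)


def copy_dumps(windows_dir: str, usb_dir: str, limit: int = 3) -> list[Path]:
    """Copy up to `limit` recent, plausible-looking dumps that fit on the drive."""
    dest = Path(usb_dir)
    copied = []
    for dump in find_dumps(windows_dir)[:limit]:
        if not is_kernel_dump(dump):
            continue  # truncated or corrupt; not worth the space
        if dump.stat().st_size > shutil.disk_usage(dest).free:
            continue  # too large for the remaining free space
        copied.append(Path(shutil.copy2(dump, dest / dump.name)))
    return copied
```

A file that passes the signature check can still be inconclusive, as noted above; the check only weeds out dumps that were never completely written.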

Note that you can often get a false positive that points to one driver when the cause is actually another, so it's worth searching the Web for reports that match your scenario; others have often encountered the same problem with that driver. You can then use the steps described earlier to disable the driver(s), or boot into Safe Mode to see if that makes a difference.

Last Known Good Configuration

The Last Known Good Configuration, shown in Figure 5, can often help if your system is having problems. It contains a copy of the last set of services and drivers (the control set) that booted successfully. But it isn't guaranteed to work, and it won't help if you have registry corruption. It can also make a difference only if Windows is crashing before Win32 starts (before you see the Windows desktop begin initializing). Otherwise, that startup will be considered "good," and if Windows crashes after that point, it won't be recoverable this way.
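Under the hood, the Last Known Good Configuration is one of the numbered control sets in the SYSTEM hive, and the hive's Select key records which is which: Default is the set used for a normal boot, LastKnownGood the set used when you choose this option, Current the set used for the most recent boot, and Failed (when nonzero) the set that was abandoned the last time Last Known Good was chosen. If you have loaded the hive offline under Windows PE, the hypothetical Python helper below turns the numbers you read in Registry Editor into key names, which tells you which ControlSet00n to inspect or edit.

```python
def control_set(number: int) -> str:
    """Map a Select value such as 2 to its key name, ControlSet002."""
    return f"ControlSet{number:03d}"


def describe_select(select: dict[str, int]) -> dict[str, str]:
    """Summarize the DWORD values under SYSTEM\\Select.

    `select` holds Current, Default, LastKnownGood and (optionally) Failed,
    as read by hand from Registry Editor.
    """
    summary = {
        "normal boot": control_set(select["Default"]),
        "Last Known Good": control_set(select["LastKnownGood"]),
        "most recent boot": control_set(select["Current"]),
    }
    if select.get("Failed", 0):
        summary["marked failed by a Last Known Good boot"] = control_set(select["Failed"])
    return summary
```

For instance, describe_select({"Current": 1, "Default": 1, "LastKnownGood": 2, "Failed": 0}) reports that choosing Last Known Good would boot from ControlSet002.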

Figure 5 Choosing the Last Known Good Configuration may allow your system to start

System Crashes after Windows Starts

If your system is crashing after Windows has started, the cause may be a driver, but it can just as easily be hardware. To diagnose, retrieve the .dmp file, as described earlier, and see what kind of clues you find.
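When you have several dumps to look at, you can run the analysis steps without the WinDbg user interface by using kd.exe, the console kernel debugger included with the Debugging Tools for Windows: its -z option opens a dump file, -y sets the symbol path, and -c runs initial commands. The Python sketch below builds that command line and pulls the IMAGE_NAME line out of the !analyze -v output. It assumes kd.exe is on your PATH and that the output contains an "IMAGE_NAME:" line, as the debugger versions I've used print; the function names are hypothetical.

```python
import subprocess
from typing import Optional

SYMBOL_PATH = r"SRV*C:\Symbols*https://msdl.microsoft.com/download/symbols"


def kd_analyze_args(dump_path: str, symbol_path: str = SYMBOL_PATH) -> list[str]:
    """Argument list for a non-interactive kd.exe analysis of one dump."""
    return [
        "kd.exe",
        "-z", dump_path,                  # open the crash dump
        "-y", symbol_path,                # same effect as .sympath
        "-c", ".reload; !analyze -v; q",  # reload, analyze, then quit
    ]


def probable_image(analysis_text: str) -> Optional[str]:
    """Return the module named on the IMAGE_NAME: line, if there is one."""
    for line in analysis_text.splitlines():
        if line.startswith("IMAGE_NAME:"):
            return line.split(":", 1)[1].strip()
    return None


def analyze(dump_path: str) -> Optional[str]:
    """Run kd.exe on one dump and return the image it blames (Windows only)."""
    result = subprocess.run(kd_analyze_args(dump_path),
                            capture_output=True, text=True)
    return probable_image(result.stdout)
```

Treat the name this returns as the same kind of lead that !analyze gives you interactively, subject to the false-positive caveat above.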

Depending on where WinDbg points, you may try disabling services or drivers, or you may want to examine any newly added hardware. Memory (particularly newly added memory) often causes crashes, as can a disk that is not working as it should. You may need to contact your OEM or your software vendors (ISVs) to see whether they have seen crashes like yours, or search the Web for similar situations.

If you suspect memory is the source of the problem, make sure you try the Windows Memory Diagnostic (included in Windows Vista and Windows Server 2008), which tests your system's RAM very thoroughly.

While it's certainly frustrating when Windows doesn't boot correctly, the causes generally fall into a fairly narrow group of problems. Understanding where and how to look when this occurs can often help you get Windows back up and running without resorting to a full reimage.

Wes Miller is a Senior Technical Product Manager at CoreTrace (CoreTrace.com) in Austin, Texas. Previously, he worked at Winternals Software and as a Program Manager at Microsoft. Wes can be reached at technet@getwired.com.