Troubleshooting “Primary CSS down” scenario

Author

Arpad Banyai - ISA server Support Engineer, Microsoft CSS Forefront Edge Team

Technical Reviewers:

Thomas Detzner - Escalation Engineer, Microsoft CSS Forefront Edge Team

Jim Harrison - Program Manager, FF Edge CS

Yuri Diogenes - Sr Security Support Escalation Engineer, Microsoft CSS Forefront Edge Team

David Mosier - Senior Consultant

This article provides a recovery plan for the case where the Primary CSS database becomes corrupted or otherwise unavailable. It covers the scenario in which a replica (backup) CSS acts as the alternate CSS server in the deployment, as shown in Figure 1:

Figure 1 – Topology used in this Scenario

Although the firewall service can be installed on the same machine as a CSS (primary or backup), in this scenario the three server roles (firewall/proxy, Primary Configuration Storage server, Replica CSS) run on separate servers.

Disaster Recovery Scenario

The question most ISA Server administrators ask about CSS is: what happens if the Primary CSS database is corrupted or otherwise lost?

Figure 2 – Losing the primary CSS.

The normal sequence of events in this situation is the following:

  • The Firewall service on both nodes detects that the primary CSS is unavailable.

  • If the primary Configuration Storage server is unavailable for 30 minutes (the default fallback delay), the array member switches to the other Configuration Storage server, if an alternate Configuration Storage server is specified.

Note

To change the default CSS polling interval, see https://technet.microsoft.com/en-us/library/cc302469.aspx
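
The fallback rule described above can be sketched in a few lines. This is only an illustrative model, not ISA Server code: the function name select_css and its parameters are hypothetical, and the only value taken from the text is the 30-minute default fallback delay.

```python
from datetime import datetime, timedelta

# Default fallback delay quoted in the article.
FALLBACK_DELAY = timedelta(minutes=30)


def select_css(primary, alternate, primary_down_since, now,
               fallback_delay=FALLBACK_DELAY):
    """Return the Configuration Storage server an array member would use.

    primary_down_since is None while the primary CSS is reachable,
    otherwise the time at which it was first seen as unavailable.
    """
    if primary_down_since is None:
        return primary
    # Switch only if an alternate is specified and the delay has elapsed.
    if alternate is not None and now - primary_down_since >= fallback_delay:
        return alternate
    return primary


if __name__ == "__main__":
    down = datetime(2009, 2, 4, 11, 0)
    print(select_css("isa1-contoso", "isa2-contoso", down,
                     down + timedelta(minutes=45)))
```

The model makes one point explicit: without an alternate Configuration Storage server, the array member has nothing to fall back to, however long the primary stays down.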

It might seem that the easiest fix is to reinstall the CSS database on the Primary server: because we already have a healthy configuration on the Alternate CSS, it should simply replicate to the reinstalled one. This assumption is false. The reinstall process is not so straightforward, because the first CSS node installed within an ISA Enterprise has a privileged role, and, not surprisingly, we get the following error message during installation.

Note

We will run into this error message only if we install the CSS on the same server (with the same name) where we previously lost the database.

Figure 3 – Error message that appears during the installation process.

Error 0x80074e46 appears as follows in the installation log:

10:29:47 ISA setup CA ERROR  : Failed to run ADAM setup. Error: 0x80074e46
10:29:47 ISA setup CA ERROR  : ExecuteAdamSetup: Adam_Install(1) failed, hr=0x80074e46
10:29:47 ISA setup CA ERROR  : ExecuteAdamSetup(NOT FIRST_ADAM_IN_ENTERPRISE) failed
10:29:47 ISA setup CA ERROR  : DisplayPopup: Setup failed to install ADAM in replica mode.

Note

ISA Server installation logs are located in the %systemroot%\Temp directory. For troubleshooting help, see the Setup logs (ISA*.log). For more information on the Setup logs, see Microsoft Knowledge Base article 837347 (https://support.microsoft.com/kb/837347).
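
When a setup log is long, it can help to pull out only the error lines and their result codes. The following is a minimal sketch that assumes every error line has the same shape as the four lines quoted above (time, source, the word ERROR, a colon, then the message). The function name find_setup_errors is hypothetical, and real ISA*.log files may contain other line formats that this pattern does not cover.

```python
import re

# Shape assumed from the extract above: "<hh:mm:ss> <source> ERROR  : <message>"
LOG_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<source>.+?)\s+ERROR\s*:\s*(?P<message>.*)$"
)
HRESULT = re.compile(r"0x[0-9a-fA-F]{8}")

SAMPLE = """\
10:29:47 ISA setup CA ERROR  : Failed to run ADAM setup. Error: 0x80074e46
10:29:47 ISA setup CA ERROR  : ExecuteAdamSetup: Adam_Install(1) failed, hr=0x80074e46
10:29:47 ISA setup CA ERROR  : ExecuteAdamSetup(NOT FIRST_ADAM_IN_ENTERPRISE) failed
10:29:47 ISA setup CA ERROR  : DisplayPopup: Setup failed to install ADAM in replica mode.
"""


def find_setup_errors(lines):
    """Return (time, message, [hex codes]) for each ERROR line."""
    errors = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            codes = HRESULT.findall(match.group("message"))
            errors.append((match.group("time"), match.group("message"), codes))
    return errors


if __name__ == "__main__":
    for time, message, codes in find_setup_errors(SAMPLE.splitlines()):
        print(time, codes, message)
```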

To be able to fix it, let’s try to understand the reason behind this error.

For other disaster recovery scenarios, please refer to the following flowchart:

Figure 4 – Backup and restore operations for different ISA server disaster recovery scenarios

The Theory

CSS is an ADAM-based application, similar to Windows Active Directory but simpler. By corrupting the CSS database on the primary node we also broke replication with the other CSS servers. In order to restore healthy functionality we have to restore replication first.

In the following example we will work through the steps to fix it:

Figure 5 – Shows the lost CSS and the broken replication

In our example the CSS database on isa1-contoso gets corrupted or deleted. A “healthy” replica still remains on isa2-contoso, but this server does not own any FSMO roles. In order to install a new CSS instance as a replica, a server that owns the FSMO Naming and Schema master roles must be running. Because that server is not available, the replica CSS installation fails.

Note

For more details about FSMO roles please see the following KB: https://support.microsoft.com/kb/197132 and also check out the blog at https://blogs.technet.com/isablog/archive/2009/03/31/transferring-configuration-storage-server-fsmo-roles.aspx
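
The precondition for installing a replica can be written down as a small model. This is a sketch for illustration only: the CssServer class and can_install_replica function are hypothetical names, and the model captures nothing beyond the rule stated above, namely that the Schema and Naming master roles must be held by a server that is running.

```python
from dataclasses import dataclass

REQUIRED_ROLES = frozenset({"schema", "naming"})


@dataclass
class CssServer:
    name: str
    online: bool
    roles: frozenset  # subset of {"schema", "naming"}


def can_install_replica(servers):
    """True if both FSMO roles are held by servers that are online."""
    available = set()
    for server in servers:
        if server.online:
            available |= server.roles
    return REQUIRED_ROLES <= available


if __name__ == "__main__":
    # State after the primary CSS database is lost.
    broken = [
        CssServer("isa1-contoso", False, REQUIRED_ROLES),
        CssServer("isa2-contoso", True, frozenset()),
    ]
    print(can_install_replica(broken))
```

In the broken state the check fails because isa1-contoso, the only role holder, is offline. Moving both roles to isa2-contoso makes it pass, which is what the rest of the article works toward.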

ADAM is a little different from Active Directory. While Active Directory uses a multi-master model with 5 FSMO roles, ISA Server CSS uses a single-master model with only 2 FSMO roles: Schema Master and Naming Master. The ADAM database consists of 3 partitions: the schema and naming partitions, and an application partition where CSS stores the entire ISA Server Enterprise configuration.

When we install a Replica CSS, a replication partnership is created automatically for these 3 partitions; Figure 6 below illustrates this relationship. The first CSS server installed in the ISA Enterprise holds the 2 FSMO roles, Schema Master and Naming Master. If a server with these roles is not available, we cannot make schema changes (for example, installing ADAM SP1 when it is necessary; see https://support.microsoft.com/kb/934608) or add CSS replicas. To see this state, we can use the repadmin tool.

Note

repadmin is part of the ADAM installation and is located in the %systemroot%\ADAM directory.

Our Example

- The Primary CSS is called isa1-contoso. It was the first CSS of this enterprise and holds the schema and naming roles.

- isa2-contoso is the Replica (note that in a real-world scenario this can also be a CSS server located at another site, such as a branch office).

The repadmin /showreps command shows the partition structure and each partition's replication partner.

The repadmin /showconn command displays the replication connections between partitions.

This is illustrated with the command line extracts in the figure below. Because the connection objects on ISA2-contoso are no longer valid when ISA1-contoso fails, ISA2-contoso keeps trying to replicate with its replication partner (ISA1-contoso), and each attempt fails. That failure generates error messages in the “ADAM [ISASTGCTRL] Replication” log, detailed below.

Note

ADAM [ISASTGCTRL] Replication is a separate log type that can be found in Event Viewer on each machine that has the CSS role installed.

Figure 6 – ADAM Partitions

CSS replication creates 2 connections, one inbound and one outbound, each of which serves all 3 replicating partitions as shown in Figure 6. The partition highlighted in green (CN=FPC2) contains the ISA configuration data. This is the partition we connect to when we open the ISA MMC console of an Enterprise Edition of ISA 2006.

Error messages in ADAM [ISASTGCTRL] log:

Event Type:Error
Event Source:ADAM [ISASTGCTRL] Replication
Event Category:Replication 
Event ID:1864
Date:2/4/2009
Time:11:21:41 PM
User:NT AUTHORITY\ANONYMOUS LOGON
Computer:ISA2-CONTOSO
Description:
This is the replication status for the following directory partition on this directory server. 
 
Directory partition:
CN=Schema,CN=Configuration,CN={234666A7-6F39-410B-BEE7-D3338CED0C89} 
 
This directory server has not recently received replication information from a number of directory servers.  The count of directory servers is shown, divided into the following intervals. 
 
More than 24 hours:
1 
More than a week:
1 
More than one month:
0 
More than two months:
0 
More than a tombstone lifetime:
0 
Tombstone lifetime (days):
180 
 
Directory servers that do not replicate in a timely manner may encounter errors. They may miss password changes and be unable to authenticate. A DC that has not replicated in a tombstone lifetime may have missed the deletion of some objects, and may be automatically blocked from future replication until it is reconciled. 
 
To identify the directory servers by name, use the dsdiag.exe tool. 
You can also use the support tool repadmin.exe to display the replication latencies of the directory servers.   The command is "repadmin /showvector /latency <partition-dn>".

Event Type:Warning
Event Source:ADAM [ISASTGCTRL] Replication
Event Category:Replication 
Event ID:2093
Date:2/4/2009
Time:11:21:41 PM
User:NT AUTHORITY\ANONYMOUS LOGON
Computer:ISA2-CONTOSO
Description:

The remote server which is the owner of a FSMO role (Operations Master role) is not responding.  This server has not replicated with the FSMO role owner recently. 
 
Operations which require contacting a FSMO role owner will fail until this condition is corrected. 
 
FSMO Role (Operations Master role): CN=Schema,CN=Configuration,CN={234666A7-6F39-410B-BEE7-D3338CED0C89} 
FSMO Server DN: CN=NTDS Settings,CN=ISA1-CONTOSO$ISASTGCTRL,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,CN={234666A7-6F39-410B-BEE7-D3338CED0C89} 
Latency threshold (hours): 24 
Elapsed time since last successful replication (hours): 198 
 
User Action: 
 
This server has not replicated successfully with the FSMO role holder server. 
1. The FSMO role holder server may be down or not responding. Please address the problem with this server. 
2. Determine whether the role is set properly on the FSMO role holder server. If the role needs to be adjusted, utilize ADAMUTIL.EXE to transfer or seize the role. This may be done using the steps provided in KB articles 255504 and 324801 on https://support.microsoft.com. 
3. If the FSMO role holder server used to be a directory server, but was not demoted successfully, then the objects representing that server are still in the configuration set. This can occur if a directory server has its operating system reinstalled or if a forced removal is performed.  These lingering state objects should be removed using the ADAMUTIL.EXE metadata cleanup function. 
4. The FSMO role holder may not be a direct replication partner. If it is an indirect or transitive partner, then there are one or more intermediate replication partners through which replication data must flow. The total end to end replication latency should be smaller than the replication latency threshold, or else this warning may be reported prematurely. 
5. Replication is blocked somewhere along the path of servers between the FSMO role holder server and this server.  Consult your configuration set topology plan to determine the likely route for replication between these servers. Check the status of replication using repadmin /showrepl at each of these servers. 
 
The following operations may be impacted: 
Schema Master: You will no longer be able to modify the schema for this configuration set. 
Naming Master: You will no longer be able to add or remove partitions from this configuration set.
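
The key facts in the event 2093 description are the name of the role owner and how long replication has been failing. A minimal sketch of extracting them follows. It assumes the description text has been copied out of Event Viewer in the same layout as above. The function name parse_fsmo_warning is hypothetical, and the sample holds only the four relevant lines of the event.

```python
import re

SAMPLE = """\
FSMO Role (Operations Master role): CN=Schema,CN=Configuration,CN={234666A7-6F39-410B-BEE7-D3338CED0C89}
FSMO Server DN: CN=NTDS Settings,CN=ISA1-CONTOSO$ISASTGCTRL,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,CN={234666A7-6F39-410B-BEE7-D3338CED0C89}
Latency threshold (hours): 24
Elapsed time since last successful replication (hours): 198
"""


def parse_fsmo_warning(description):
    """Extract the role owner and replication latency from event 2093 text."""
    server_dn = re.search(r"FSMO Server DN:\s*(.+)", description)
    threshold = re.search(r"Latency threshold \(hours\):\s*(\d+)", description)
    elapsed = re.search(
        r"Elapsed time since last successful replication \(hours\):\s*(\d+)",
        description,
    )
    if not (server_dn and threshold and elapsed):
        return None
    # The server object is named <COMPUTER>$<ADAM instance name>.
    owner = re.search(r"CN=NTDS Settings,CN=([^,$]+)\$([^,]+)", server_dn.group(1))
    if owner is None:
        return None
    return {
        "owner": owner.group(1),
        "instance": owner.group(2),
        "threshold_hours": int(threshold.group(1)),
        "elapsed_hours": int(elapsed.group(1)),
        "overdue": int(elapsed.group(1)) > int(threshold.group(1)),
    }


if __name__ == "__main__":
    print(parse_fsmo_warning(SAMPLE))
```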

Seize FSMO roles to the replica CSS

The last error message suggests that we need to transfer or seize the roles to a functioning server. In other words, we must move the roles held by ISA1-contoso to the working database on ISA2-contoso.

Using the dsmgmt tool we can list the FSMO role assignments. You may also use the tool provided at https://isatools.org/tools/findcssfsmo.zip.

Note

dsmgmt.exe is located in the %systemroot%\ADAM directory. You can read more about this tool at: https://technet.microsoft.com/en-us/library/cc781970.aspx

C:\WINDOWS\ADAM>dsmgmt.exe
dsmgmt.exe: roles
fsmo maintenance: connections=> first we connect to one of the ADAM servers 
server connections: connect to server isa2-contoso:2171
Binding to isa2-contoso:2171 ...
Connected to isa2-contoso:2171 using credentials of locally logged on user.
server connections: quit
fsmo maintenance: select operation target
select operation target: list roles for connected server
Server "isa2-contoso:2171" knows about 2 roles
Schema - CN=NTDS Settings,CN=ISA1-CONTOSO$ISASTGCTRL,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,CN={4542EC6D-7EE3-4BE0-B181-0FD60525A4F3}
Naming Master - CN=NTDS Settings,CN=ISA1-CONTOSO$ISASTGCTRL,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,CN={4542EC6D-7EE3-4BE0-B181-0FD60525A4F3}
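
The role listing above can also be read by a script, for example when checking several CSS servers. The sketch below assumes the output of "list roles for connected server" has been captured as text in the layout shown. The function name role_owners is hypothetical. Because the console may wrap long distinguished names, the text is joined into one string before matching.

```python
import re

# Matches "<role> - CN=NTDS Settings,CN=<COMPUTER>$<instance>,..."
ROLE = re.compile(r"(Schema|Naming Master) - CN=NTDS Settings,CN=([^,$]+)\$")

SAMPLE = """\
Server "isa2-contoso:2171" knows about 2 roles
Schema - CN=NTDS Settings,CN=ISA1-CONTOSO$ISASTGCTRL,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,CN=
{4542EC6D-7EE3-4BE0-B181-0FD60525A4F3}
Naming Master - CN=NTDS Settings,CN=ISA1-CONTOSO$ISASTGCTRL,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configurat
ion,CN={4542EC6D-7EE3-4BE0-B181-0FD60525A4F3}
"""


def role_owners(output):
    """Map each FSMO role name to the computer name that owns it."""
    flat = "".join(output.splitlines())  # undo console line wrapping
    return {role: owner for role, owner in ROLE.findall(flat)}


if __name__ == "__main__":
    for role, owner in role_owners(SAMPLE).items():
        print(f"{role}: {owner}")
```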

According to the highlighted section, the corrupted server still owns all FSMO roles.

Transferring naming master role from ISA1-contoso to ISA2-contoso

Let’s continue with the FSMO role seizure process. For more detail refer to the following articles:

https://blogs.technet.com/isablog/archive/2009/03/31/transferring-configuration-storage-server-fsmo-roles.aspx

https://technet.microsoft.com/en-us/library/cc758598.aspx

After completing this process, ISA2-contoso will function as the Primary CSS server because it owns both FSMO roles. The error messages in the ADAM [ISASTGCTRL] log will also cease.
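
As a reminder of the order of operations, the sketch below builds the list of commands one would type at the dsmgmt prompts. The connection commands are taken from the session shown earlier. The "seize schema master" and "seize naming master" commands are an assumption based on the dsmgmt roles menu and are not shown in this article, so confirm them against the linked articles before use. The function name seizure_session is hypothetical, and the script only prints text; it does not run dsmgmt.

```python
def seizure_session(target, port=2171):
    """Return the dsmgmt commands, in order, to seize both roles to target.

    The two "seize ..." commands are an assumption (not shown in the
    article); the rest mirrors the role-listing session above.
    """
    return [
        "roles",
        "connections",
        f"connect to server {target}:{port}",
        "quit",                 # back to "fsmo maintenance"
        "seize schema master",
        "seize naming master",
        "quit",                 # leave "fsmo maintenance"
        "quit",                 # leave dsmgmt
    ]


if __name__ == "__main__":
    for command in seizure_session("isa2-contoso"):
        print(command)
```

Seizing rather than transferring is the relevant operation here because the current role owner, ISA1-contoso, is not reachable.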

Reinstall CSS on the corrupted node

Now if we try to reinstall the CSS on isa1-contoso (the server where the database got corrupted), we can receive the following error message during installation:

Setup failed to install ADAM in replica mode

(0x80074e46)

Note

This error message appears because we tried to re-install CSS on a machine which already contained a copy of the three partitions it wants to create. Installing a new CSS replication partner will not generate this error.

In order to reinstall CSS on ISA1-contoso, we have to delete the damaged CSS’s reference from the configuration partition using the ADAM ADSI Edit tool.

Note

We need to complete the following steps on the current primary CSS server, the one holding the Schema and Naming master roles.

To connect to the configuration partition:

  1. Start the ADAM ADSI Edit MMC console on the primary CSS server. This console can be found in the ADAM folder of the Start menu, or in the ADAM installation folder at %systemroot%\ADAM\ADAM-adsiedit.msc

  2. Go to the Action menu and select Connect to…

  3. Leave localhost in the server name field and change the port to 2171

  4. If we are logged on to the server with an account that is an ISA Server Enterprise Administrator, leave the account setting at the currently logged-on user

  5. Now we are able to browse the CSS’s Configuration partition.

  6. The old reference of the replication connection object can be found under CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration (see Figure 7 below)

  7. Find the damaged CSS server’s name, then right-click it and select Delete

    Figure 7 – Deleting the old reference of the corrupted database
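
To double-check which object is being deleted in step 7, its distinguished name can be composed from the parts seen in the event log earlier. This is a sketch under assumptions: the function name css_server_dn is hypothetical, the default instance name ISASTGCTRL and site Default-First-Site-Name are taken from the extracts in this article, and the configuration GUID differs per enterprise, so it must be read from your own ADAM instance.

```python
def css_server_dn(computer, config_guid,
                  instance="ISASTGCTRL", site="Default-First-Site-Name"):
    """Build the DN of a CSS server object in the configuration partition.

    The naming pattern <COMPUTER>$<instance> follows the "FSMO Server DN"
    shown in the event log; config_guid is specific to each enterprise.
    """
    return (
        f"CN={computer.upper()}${instance},CN=Servers,CN={site},"
        f"CN=Sites,CN=Configuration,CN={{{config_guid}}}"
    )


if __name__ == "__main__":
    print(css_server_dn("isa1-contoso", "234666A7-6F39-410B-BEE7-D3338CED0C89"))
```

Comparing this string with the object selected in ADSI Edit helps avoid deleting the entry of the healthy server by mistake.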

After that we can successfully install the CSS in replica mode. ISA2-contoso becomes the primary node owning the FSMO roles, and ISA1-contoso becomes the replica.

If we want the CSS configuration to be identical to the original, we can now transfer the FSMO roles back to the original primary server, ISA1-contoso, using the same dsmgmt.exe tool shown in the example above.

Note

For further details on rebuilding the CSS server and transferring the FSMO roles back, refer to https://blogs.technet.com/isablog/archive/2009/03/31/transferring-configuration-storage-server-fsmo-roles.aspx

Summary

When the database on the first CSS server becomes unusable, the array remains operational, but we will see error messages in the ADAM [ISASTGCTRL] log. We won’t be able to reinstall CSS on the same machine because of its old reference in ADAM, and error message (0x80074e46) will appear during installation. To resolve this state, we need to seize the Schema and Naming master roles of ADAM to a replica CSS and then delete the invalid data of the former primary CSS if it still exists. Then we will be able to reinstall the Configuration Storage server on the rebuilt server. Optionally (for example, if we want to keep all FSMO master roles on a CSS server that is based in the main office) we can transfer the Schema and Naming master roles back to it.