Windows Server Failover Clustering (WSFC) with SQL Server
Topic Status: Some information in this topic is preview and subject to change in future releases. Preview information describes new features or changes to existing features in Microsoft SQL Server 2016 Community Technology Preview 2 (CTP2).
A Windows Server Failover Clustering (WSFC) cluster is a group of independent servers that work together to increase the availability of applications and services. SQL Server 2016 takes advantage of WSFC services and capabilities to support AlwaysOn Availability Groups and SQL Server Failover Cluster Instances.
Windows Server Failover Clustering provides infrastructure features that support the high-availability and disaster recovery scenarios of hosted server applications such as Microsoft SQL Server and Microsoft Exchange. If a cluster node or service fails, the services that were hosted on that node can be automatically or manually transferred to another available node in a process known as failover.
The nodes in the WSFC cluster work together to collectively provide these types of capabilities:
Distributed metadata and notifications. WSFC service and hosted application metadata is maintained on each node in the cluster. This metadata includes WSFC configuration and status in addition to hosted application settings. Changes to a node's metadata or status are automatically propagated to the other nodes in the cluster.
Resource management. Individual nodes in the cluster may provide physical resources such as direct-attached storage, network interfaces, and access to shared disk storage. Hosted applications register themselves as a cluster resource, and may configure startup and health dependencies upon other resources.
Health monitoring. Inter-node and primary node health detection is accomplished through a combination of heartbeat-style network communications and resource monitoring. The overall health of the cluster is determined by the votes of a quorum of nodes in the cluster.
Failover coordination. Each resource is configured to be hosted on a primary node, and each can be automatically or manually transferred to one or more secondary nodes. A health-based failover policy controls automatic transfer of resource ownership between nodes. Nodes and hosted applications are notified when failover occurs so that they may react appropriately.
For more information, see: Failover Clusters in Windows Server 2008 R2
SQL Server 2016 AlwaysOn is a new high availability and disaster recovery solution that takes advantage of WSFC. AlwaysOn provides an integrated, flexible solution that increases application availability, provides better returns on hardware investments, and simplifies high availability deployment and management.
Both AlwaysOn Availability Groups and AlwaysOn Failover Cluster Instances use WSFC as a platform technology, registering components as WSFC cluster resources. Related resources are combined into a resource group, which can be made dependent upon other WSFC cluster resources. The WSFC cluster service can then sense and signal the need to restart the SQL Server instance or automatically fail it over to a different server node in the WSFC cluster.
To take full advantage of SQL Server AlwaysOn technologies, you should meet several WSFC-related prerequisites.
For more information, see: Prerequisites, Restrictions, and Recommendations for AlwaysOn Availability Groups (SQL Server)
Instance-level High Availability with AlwaysOn Failover Cluster Instances
An AlwaysOn Failover Cluster Instance (FCI) is a SQL Server instance that is installed across nodes in a WSFC cluster. This type of instance has resource dependencies on shared disk storage (via Fibre Channel or iSCSI SAN) and on a virtual network name. The virtual network name has a resource dependency on one or more virtual IP addresses, each in a different subnet. The SQL Server service and the SQL Server Agent service are registered as resources, and both are made dependent upon the virtual network name resource.
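The dependency chain described above implies an order in which the resources can be brought online. The following Python sketch only illustrates that ordering and is not how WSFC is implemented. The resource names are hypothetical, and the assumption that the SQL Server service (rather than the instance as a whole) holds the shared disk dependency is made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical resource names. Each key maps to the set of resources
# it depends on, following the FCI description in the text.
dependencies = {
    "ip_address": set(),
    "shared_disk": set(),
    "network_name": {"ip_address"},
    # Assumption: the disk dependency is attached to the SQL Server service.
    "sql_server": {"network_name", "shared_disk"},
    "sql_server_agent": {"network_name"},
}


def online_order(deps):
    """Return one valid order in which the resources could be brought online."""
    return list(TopologicalSorter(deps).static_order())


order = online_order(dependencies)
print(order)
```

Any order returned places a resource after everything it depends on, so the IP address always precedes the network name, and the network name always precedes both services.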
In the event of a failover, the WSFC service transfers ownership of the instance's resources to a designated failover node. The SQL Server instance is then restarted on the failover node, and databases are recovered as usual. At any given moment, only a single node in the cluster can host the FCI and underlying resources.
An AlwaysOn Failover Cluster Instance requires symmetrical shared disk storage such as a storage area network (SAN) or SMB file share. The shared disk storage volumes must be available to all potential failover nodes in the WSFC cluster.
For more information, see: AlwaysOn Failover Cluster Instances (SQL Server)
Database-level High Availability with AlwaysOn Availability Groups
An availability group is a set of user databases that fail over together. An availability group consists of a primary availability replica and one to eight secondary replicas that are maintained through SQL Server log-based data movement for data protection without the need for shared storage. Each replica is hosted by an instance of SQL Server on a different node of the WSFC cluster. The availability group and a corresponding virtual network name are registered as resources in the WSFC cluster.
An availability group listener on the primary replica's node responds to incoming client requests to connect to the virtual network name, and based on attributes in the connection string, it redirects each request to the appropriate SQL Server instance.
In the event of a failover, instead of transferring ownership of shared physical resources to another node, WSFC is leveraged to reconfigure a secondary replica on another SQL Server instance to become the availability group's primary replica. The availability group's virtual network name resource is then transferred to that instance.
At any given moment, only a single SQL Server instance may host the primary replica of an availability group's databases. All associated secondary replicas must each reside on a separate instance, and each instance must reside on a separate physical node.
AlwaysOn Availability Groups do not require deployment of a Failover Cluster Instance or use of symmetric shared storage (SAN or SMB).
A Failover Cluster Instance (FCI) may be used together with an availability group to enhance the availability of an availability replica. However, to prevent potential race conditions in the WSFC cluster, automatic failover of the availability group is not supported to or from an availability replica that is hosted on an FCI.
For more information, see: Overview of AlwaysOn Availability Groups (SQL Server)
High availability for an AlwaysOn solution is accomplished through proactive health monitoring of physical and logical WSFC cluster resources, together with automatic failover onto, and reconfiguration of, redundant hardware. A system administrator can also initiate a manual failover of an availability group or SQL Server instance from one node to another.
Failover Policies for Nodes, Failover Cluster Instances, and Availability Groups
A failover policy is configured at the WSFC cluster node, the SQL Server Failover Cluster Instance (FCI), and the availability group levels. These policies, based on the severity, duration, and frequency of unhealthy cluster resource status and node responsiveness, can trigger a service restart or an automatic failover of cluster resources from one node to another, or can trigger the move of an availability group primary replica from one SQL Server instance to another.
Failover of an availability group replica does not affect the underlying SQL Server instance. Failover of an FCI moves the hosted availability group replicas with the instance.
For more information, see: Failover Policy for Failover Cluster Instances
WSFC Resource Health Detection
Each resource in a WSFC cluster node can report its status and health, periodically or on demand. A variety of circumstances may indicate resource failure, for example power failure, disk or memory errors, network communication errors, or non-responsive services.
WSFC cluster resources such as networks, storage, or services can be made dependent upon one another. The cumulative health of a resource is determined by successively rolling up its health with the health of each of its resource dependencies.
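The roll-up of health across dependencies can be illustrated with a short Python sketch. This is a simplified model, not the WSFC algorithm: it assumes health is a plain healthy/unhealthy flag, that the dependency graph has no cycles, and it uses hypothetical resource names.

```python
# Hypothetical resources: each one's own health, and what it depends on.
own_health = {
    "disk": True,
    "ip_address": True,
    "network_name": True,
    "sql_server": True,
}
depends_on = {
    "disk": [],
    "ip_address": [],
    "network_name": ["ip_address"],
    "sql_server": ["disk", "network_name"],
}


def cumulative_health(resource, own_health, depends_on):
    """A resource is healthy only if it and all of its dependencies are healthy.

    Assumes the dependency graph is acyclic.
    """
    if not own_health[resource]:
        return False
    return all(
        cumulative_health(dep, own_health, depends_on)
        for dep in depends_on[resource]
    )


# Simulate an IP address failure and observe how it propagates upward.
degraded = dict(own_health, ip_address=False)
print(cumulative_health("sql_server", own_health, depends_on))
print(cumulative_health("sql_server", degraded, depends_on))
print(cumulative_health("disk", degraded, depends_on))
```

In this model a failed IP address makes the network name and the SQL Server resource unhealthy, while the disk, which does not depend on the IP address, is unaffected.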
WSFC Inter-node Health Detection and Quorum Voting
Each node in a WSFC cluster participates in periodic heartbeat communication to share the node's health status with the other nodes. Unresponsive nodes are considered to be in a failed state.
A quorum node set is a majority of the voting nodes and witnesses in the WSFC cluster. The overall health and status of a WSFC cluster is determined by a periodic quorum vote. The presence of a quorum means that the cluster is healthy and able to provide node-level fault tolerance.
A quorum mode, configured at the WSFC cluster level, dictates the methodology used for quorum voting and determines when to perform an automatic failover or take the cluster offline.
It is best practice to always have an odd number of quorum votes in a WSFC cluster. For the purposes of quorum voting, SQL Server does not have to be installed on all nodes in the cluster. An additional server can act as a quorum member, or the WSFC quorum model can be configured to use a remote file share as a tie-breaker.
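The arithmetic behind the odd-number recommendation can be sketched in a few lines of Python. This assumes a simple majority rule over a fixed number of votes and does not model specific quorum modes or any dynamic vote adjustment; the function names are hypothetical.

```python
def has_quorum(votes_present, total_votes):
    """True when strictly more than half of all configured votes are present."""
    return votes_present > total_votes // 2


def tolerated_failures(total_votes):
    """Largest number of votes that can be lost while a majority remains."""
    return (total_votes - 1) // 2


for total in (3, 4, 5):
    print(total, "votes tolerate", tolerated_failures(total), "lost vote(s)")
```

Under this rule, four votes tolerate no more lost votes than three do, and an even split (two of four) leaves neither half with a majority. Adding a tie-breaker vote, such as a file share witness, restores an odd total.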
For more information, see: WSFC Quorum Modes and Voting Configuration (SQL Server)
Disaster Recovery Through Forced Quorum
Depending upon operational practices and WSFC cluster configuration, you can incur both automatic and manual failovers, and still maintain a robust, fault-tolerant SQL Server AlwaysOn solution. However, if a quorum of the eligible voting nodes in the WSFC cluster cannot communicate with one another, or if the WSFC cluster otherwise fails health validation, then the WSFC cluster may go offline.
If the WSFC cluster goes offline because of an unplanned disaster, or due to a persistent hardware or communications failure, then manual administrative intervention is required to force a quorum and bring the surviving cluster nodes back online in a non-fault-tolerant configuration.
Afterwards, a series of steps must also be taken to reconfigure the WSFC cluster, recover the affected database replicas, and re-establish a new quorum.
For more information, see: WSFC Disaster Recovery through Forced Quorum (SQL Server)
Several layers of relationships exist between SQL Server AlwaysOn and WSFC features and components.