Configuring MPIO Timers

Applies To: Windows Server 2008 R2

The following MPIO timer values can be configured to tune MPIO behavior to meet operational requirements. In most cases the default values are adequate; however, you may need to adjust them to obtain optimal performance in your environment.

Consider the following scenarios:

Scenario 1   A two-node Windows Server failover cluster is configured with multiple connections to storage for each node by using MPIO, and employs multiple active paths to maximize throughput. Because of application Service Level Agreement (SLA) requirements, the customer requires short timeout values so that, in the event of path failures, resources fail over to the other cluster node more quickly. In this case, timer values such as PDORemovePeriod are set to a low value and then tested to ensure compliance with customer requirements.

Scenario 2   A single server that is not configured as a failover cluster is configured with MPIO and multiple connections to storage to provide both increased throughput and tolerance of path failures. In this case, timers such as PDORemovePeriod are increased to allow additional time for path recovery to occur. Additionally, testing is performed to ensure that the chosen timer values strike a balance between two needs: allowing ample time for path recovery under realistic production load, and minimizing the time that passes before the disk objects are removed and I/O failures are exposed to upper-level applications, which alerts support personnel to the issue.

For any customer scenario, the best timer values depend on a number of variables, such as the following, any of which can affect whether the current settings meet SLA or Operating Level Agreement (OLA) requirements:

  • The number of paths that exist at the time a problem is encountered

  • The number of paths that are failed

  • The number of paths that are already attempting to recover

  • The number of in-flight I/Os

  • The CPU load on the system at the time an issue occurs

Note

The five settings in the first table that follows (PathVerifyEnabled through RetryInterval) can be set through the user interface. The information provided on these settings is specific to the use of Microsoft DSMs. When using a vendor-provided DSM, refer to the vendor documentation for information about the recommended timer values.

Important

Although it is possible to set the following values to a very large number, we recommend that you use caution when doing so, and that you test the values for applicability prior to using them in a production environment.
For example, the value MAXULONG is 0xFFFFFFFF (4,294,967,295). If this value were applied to a setting such as PDORemovePeriod (where it represents seconds), it would equate to approximately 49,710 days (more than 136 years) of delay before an error would be reported.
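A quick calculation shows how large such a delay is. The following is a minimal Python sketch of the conversion; the helper name `seconds_to_days` is hypothetical and exists only for this illustration.

```python
# MAXULONG is the largest 32-bit unsigned value, as stated in the text.
MAXULONG = 0xFFFFFFFF

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_to_days(seconds: int) -> float:
    """Convert a timer value expressed in seconds to days (illustrative helper)."""
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    days = seconds_to_days(MAXULONG)
    print(f"{MAXULONG} seconds is about {days:,.0f} days")
```

Running the same conversion on any candidate timer value before deployment is a cheap way to catch a value that is far larger than intended.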

Setting Definition

PathVerifyEnabled

Flag that enables path verification by MPIO on all paths every N seconds (where N is the value set in PathVerificationPeriod).

The type is Boolean; the value must be either 0 (disabled) or 1 (enabled). The default is 0 (disabled).

PathVerificationPeriod

This setting specifies the interval (in seconds) at which MPIO performs path verification. It is honored only if PathVerifyEnabled is set to 1 (enabled).

This timer is specified in seconds. The default is 30 seconds. The maximum allowed is MAXULONG.

PDORemovePeriod

This setting controls the amount of time (in seconds) that the multipath pseudo-LUN remains in system memory after all paths to the device have been lost.

When this timer expires, pending I/O operations are failed and the failure is exposed to the application, rather than MPIO continuing to attempt to recover active paths.

This timer is specified in seconds. The default is 20 seconds. The maximum allowed is MAXULONG.

RetryCount

This setting specifies the number of times a failed I/O is retried if the DSM determines that the failing request must be retried. It applies when DsmInterpretError() returns Retry = TRUE. The default is 3.

RetryInterval

This setting specifies the interval (in seconds) after which a failed request is retried (once the DSM has decided that it must be retried, and assuming that the I/O has been retried fewer times than RetryCount).

This value is specified in seconds. The default is 1 second.
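The settings above can be summarized as a small data structure. The following Python sketch is illustrative only: the class `MpioTimers`, its field names, and the `validate` method are hypothetical and are not part of MPIO. The defaults and the MAXULONG limits are taken from the definitions above; the non-negative check on RetryCount and RetryInterval is an assumption, because the text states no range for them.

```python
from dataclasses import dataclass
from typing import List

MAXULONG = 0xFFFFFFFF  # largest 32-bit unsigned value


@dataclass
class MpioTimers:
    """Hypothetical container for the five settings, with the documented defaults."""
    path_verify_enabled: int = 0        # 0 = disabled (default), 1 = enabled
    path_verification_period: int = 30  # seconds
    pdo_remove_period: int = 20         # seconds
    retry_count: int = 3
    retry_interval: int = 1             # seconds

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; an empty list means none found."""
        problems = []
        if self.path_verify_enabled not in (0, 1):
            problems.append("PathVerifyEnabled must be 0 or 1")
        # The text gives MAXULONG as the maximum for these two timers.
        for label, value in (
            ("PathVerificationPeriod", self.path_verification_period),
            ("PDORemovePeriod", self.pdo_remove_period),
        ):
            if not 0 <= value <= MAXULONG:
                problems.append(f"{label} must be between 0 and MAXULONG")
        # Assumption: retry values are simply required to be non-negative.
        for label, value in (
            ("RetryCount", self.retry_count),
            ("RetryInterval", self.retry_interval),
        ):
            if value < 0:
                problems.append(f"{label} must not be negative")
        # PathVerificationPeriod is honored only when verification is enabled.
        if self.path_verify_enabled == 0 and self.path_verification_period != 30:
            problems.append("PathVerificationPeriod is ignored while PathVerifyEnabled is 0")
        return problems


if __name__ == "__main__":
    print(MpioTimers().validate())
    print(MpioTimers(path_verification_period=60).validate())
```

A check of this kind could run before values are applied, so that a period set without enabling verification is flagged rather than silently ignored.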

The following two registry key settings are new in Windows Server 2008 R2:

Setting Definition

HKLM\System\CurrentControlSet\Services\mpio\Parameters\UseCustomPathRecoveryInterval

If this key exists and is set to 1, it allows the use of PathRecoveryInterval.

HKLM\System\CurrentControlSet\Services\mpio\Parameters\PathRecoveryInterval

Represents the period after which PathRecovery is attempted. This setting is only used if it is not set to 0 and UseCustomPathRecoveryInterval is set to 1.
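As an illustration of how the two values relate, the following Python sketch builds the `reg add` command lines that would create them. It only produces strings and changes nothing on the system. The function name `reg_add_commands` is hypothetical, and the `REG_DWORD` value type is an assumption, because the text above does not state the type.

```python
# Registry key path as given in the table above.
MPIO_PARAMETERS_KEY = r"HKLM\System\CurrentControlSet\Services\mpio\Parameters"


def reg_add_commands(path_recovery_interval: int) -> list:
    """Build (but do not run) reg add commands for the two settings.

    Assumption: both values are stored as REG_DWORD.
    """
    # PathRecoveryInterval is only used when it is not 0.
    if path_recovery_interval <= 0:
        raise ValueError("PathRecoveryInterval must be greater than 0 to take effect")
    values = {
        "UseCustomPathRecoveryInterval": 1,  # must be 1 for the interval to be used
        "PathRecoveryInterval": path_recovery_interval,
    }
    return [
        f'reg add "{MPIO_PARAMETERS_KEY}" /v {name} /t REG_DWORD /d {data} /f'
        for name, data in values.items()
    ]


if __name__ == "__main__":
    for command in reg_add_commands(30):
        print(command)
```

Printing the commands first, instead of applying them directly, leaves room to review the values and test them outside production.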

The two new settings were introduced to account for the following scenario:

  • A transient error somewhere causes a path to briefly fail and recover.

  • MPIO detects that the path has failed and thus performs a failover.

  • The failed path was the last path for a particular pseudo-LUN, so its PDO Remove Timer started ticking down.

  • The error was brief enough and PnP was busy enough that PnP missed the fact that the path went away and came back. Thus, there are no PnP events generated to indicate that the path is back online.

  • The pseudo-LUN never sees the path come back online and it gets removed after its PDO Remove Timer runs out.

The end result is that the system now has at least one path and one device online, but no pseudo-LUN to represent that device.

MPIO has a path recovery mechanism that can be used to avoid this issue. However, by default, the period at which path recovery is attempted is set to twice the PDORemovePeriod, so the pseudo-LUN is removed before the first recovery attempt takes place. In the majority of cases, the default is acceptable, but it does not solve the problem in this particular scenario. This is where the two registry settings listed in the previous table come into play. They allow you to configure the timer that determines the period at which path recovery attempts are made. Thus, by setting the PathRecoveryInterval to less than the PDORemovePeriod, the path recovery attempt happens before the pseudo-LUN gets removed, the path is detected as back online, and the pseudo-LUN is saved from removal.

We recommend that you test the use of this value before widespread deployment in production to ensure that path recovery attempts are not happening so frequently that they have a significant impact on regular I/O.

For example, if the PDORemovePeriod is set to 60 seconds, a good starting point for the PathRecoveryInterval may be 30 seconds. This interval causes path recovery to be attempted every 30 seconds.
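The relationship between these timers can be expressed as a short check. The following Python sketch is a simplified model of the rules described in this section, not MPIO's actual implementation; both function names are hypothetical.

```python
def effective_path_recovery_interval(pdo_remove_period: int,
                                     use_custom: int = 0,
                                     path_recovery_interval: int = 0) -> int:
    """Return the path recovery period in seconds under the rules described above.

    The custom interval applies only if UseCustomPathRecoveryInterval is 1 and
    PathRecoveryInterval is not 0; otherwise the default of twice the
    PDORemovePeriod applies.
    """
    if use_custom == 1 and path_recovery_interval != 0:
        return path_recovery_interval
    return 2 * pdo_remove_period


def recovery_runs_before_removal(pdo_remove_period: int,
                                 use_custom: int = 0,
                                 path_recovery_interval: int = 0) -> bool:
    """True if a recovery attempt is due before the PDO remove timer expires."""
    interval = effective_path_recovery_interval(
        pdo_remove_period, use_custom, path_recovery_interval
    )
    return interval < pdo_remove_period


if __name__ == "__main__":
    # Defaults: recovery every 120 s, removal after 60 s, so the pseudo-LUN is lost.
    print(recovery_runs_before_removal(60))
    # Custom interval of 30 s with PDORemovePeriod of 60 s, as in the example above.
    print(recovery_runs_before_removal(60, use_custom=1, path_recovery_interval=30))
```

With the default settings the check reports that no recovery attempt falls inside the removal window, which is the failure scenario described earlier; with the 30-second custom interval it reports that one does.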

Important

Use caution when setting the PathRecoveryInterval to small values. Decreasing this value generates more path verification traffic. The traffic grows as the number of LUNs available on the host increases and as the interval decreases.