Storage-class Memory (NVDIMM-N) Health Management in Windows

Jason Gerend | Last Updated: 10/12/2016

Applies To: Windows Server 2016, Windows 10 (version 1607)

This article provides system administrators and IT Pros with information about error handling and health management specific to storage-class memory (NVDIMM-N) devices in Windows, highlighting the differences between storage-class memory and traditional storage devices.

If you aren't familiar with Windows’ support for storage-class memory devices, these short videos provide an overview:

JEDEC-compliant NVDIMM-N storage-class memory devices are supported in Windows with native drivers, starting in Windows Server 2016 and Windows 10 (version 1607). While these devices behave similarly to other disks (HDDs and SSDs), there are some differences.

All conditions listed here are expected to be very rare, although their likelihood depends on the conditions under which the hardware is used.

The various cases below may refer to Storage Spaces configurations. The particular configuration of interest is one where two NVDIMM-N devices are utilized as a mirrored write-back cache in a storage space. To set up such a configuration, see Configuring Storage Spaces with a NVDIMM-N write-back cache.

Checking the health of storage-class memory

To query the health of storage-class memory, use the following commands in a Windows PowerShell session.

PS C:\> Get-PhysicalDisk | where BusType -eq "SCM" | select SerialNumber, HealthStatus, OperationalStatus, OperationalDetails

Doing so yields this example output:

SerialNumber          HealthStatus OperationalStatus  OperationalDetails
802c-01-1602-117cb5fc Healthy      OK
802c-01-1602-117cb64f Warning      Predictive Failure {Threshold Exceeded,NVDIMM_N Error}

For help understanding the various health conditions, see the following sections.
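
As a rough aid for finding the right section, the status values in the example outputs can be mapped to section titles. The following is a minimal sketch only: the function name Get-ScmConditionHint is hypothetical, and the matching rules are inferred from the example outputs in this article, so real devices may report combinations that it does not cover.

```powershell
# Hypothetical helper: maps a PhysicalDisk-like object to the section of this
# article that describes its condition. The matching rules are assumptions
# inferred from the example outputs in this article.
function Get-ScmConditionHint {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [object]$Disk
    )
    process {
        # OperationalStatus may be a single value or a list; flatten it to text.
        $status = @($Disk.OperationalStatus) -join ', '

        # Check OperationalStatus before HealthStatus, because a device that
        # lost communication is also reported with a Warning health status.
        $section = if ($status -match 'Lost Communication') {
            'NVDIMM-N is shown with a capacity of 0 bytes or as a Generic Physical Disk'
        } elseif ($status -match 'Unrecognized Metadata') {
            'NVDIMM-N is shown as a RAW or empty disk after a reboot'
        } elseif ($status -match 'IO Error') {
            'Writes to an NVDIMM-N fail'
        } elseif ($Disk.HealthStatus -eq 'Warning') {
            'Warning Health Status'
        } elseif ($Disk.HealthStatus -eq 'Healthy') {
            'No action needed'
        } else {
            'Not covered by this article'
        }

        [pscustomobject]@{
            SerialNumber = $Disk.SerialNumber
            HealthStatus = $Disk.HealthStatus
            Section      = $section
        }
    }
}
```

A possible use is to pipe the filtered disks into the helper: Get-PhysicalDisk | where BusType -eq "SCM" | Get-ScmConditionHint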

“Warning” Health Status

This condition is when you check the health of a storage-class memory device and see that its Health Status is listed as Warning, as shown in this example output:

SerialNumber          HealthStatus OperationalStatus  OperationalDetails
802c-01-1602-117cb5fc Healthy      OK
802c-01-1602-117cb64f Warning      Predictive Failure {Threshold Exceeded,NVDIMM_N Error}

The following table lists some info about this condition.

Likely condition: NVDIMM-N warning threshold breached
Root cause: NVDIMM-N devices track various thresholds, such as temperature, NVM lifetime, and energy source lifetime. When one of those thresholds is exceeded, the operating system is notified.
General behavior: Device remains fully operational. This is a warning, not an error.
Storage Spaces behavior: Device remains fully operational. This is a warning, not an error.
More info: OperationalStatus field of the PhysicalDisk object; event log Microsoft-Windows-ScmDisk0101/Operational
What to do: Depending on which warning threshold was breached, it may be prudent to replace the entire NVDIMM-N or certain parts of it. For example, if the NVM lifetime threshold is breached, replacing the NVDIMM-N may make sense.
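
The event log named above can be read from PowerShell with the Get-WinEvent cmdlet. The following is a minimal sketch: the wrapper name Get-ScmDiskEvent and the default of 20 events are assumptions made for illustration, and the log is only expected to exist on systems where the storage-class memory driver is present.

```powershell
# Hypothetical helper: returns the most recent entries from the
# storage-class memory disk log named in this article.
function Get-ScmDiskEvent {
    param(
        [int]$MaxEvents = 20   # arbitrary default, chosen for illustration
    )
    Get-WinEvent -LogName 'Microsoft-Windows-ScmDisk0101/Operational' -MaxEvents $MaxEvents |
        Select-Object TimeCreated, Id, LevelDisplayName, Message
}
```

Comparing the TimeCreated values with the time the health status changed can help narrow down which threshold was reported.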

Writes to an NVDIMM-N fail

This condition is when you check the health of a storage-class memory device and see the Health Status listed as Unhealthy, and Operational Status mentions an IO Error, as shown in this example output:

SerialNumber          HealthStatus OperationalStatus                           OperationalDetails
802c-01-1602-117cb5fc Healthy      OK
802c-01-1602-117cb64f Unhealthy    {Stale Metadata, IO Error, Transient Error} {Lost Data Persistence, Lost Data, NV...}

The following table lists some info about this condition.

Likely condition: Loss of persistence or backup power
Root cause: NVDIMM-N devices rely on a backup power source for their persistence, usually a battery or supercapacitor. If this backup power source is unavailable or the device cannot perform a backup for any reason (controller or flash error), data is at risk and Windows will prevent any further writes to the affected devices. Reads are still possible, so data can be evacuated.
General behavior: The NTFS volume will be dismounted. The PhysicalDisk Health Status field will show "Unhealthy" for all affected NVDIMM-N devices.
Storage Spaces behavior: The storage space will remain operational as long as only one NVDIMM-N is affected. If multiple devices are affected, writes to the storage space will fail. The PhysicalDisk Health Status field will show "Unhealthy" for all affected NVDIMM-N devices.
More info: OperationalStatus field of the PhysicalDisk object; event log Microsoft-Windows-ScmDisk0101/Operational
What to do: We recommend backing up the affected NVDIMM-N's data. To gain read access, you can manually bring the disk online (it will surface as a read-only NTFS volume).

To fully clear this condition, the root cause must be resolved (for example, by servicing the power supply or replacing the NVDIMM-N, depending on the issue), and the volume on the NVDIMM-N must either be taken offline and brought online again, or the system must be restarted.

To make the NVDIMM-N usable in Storage Spaces again, use the Reset-PhysicalDisk cmdlet, which re-integrates the device and starts the repair process.
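
The re-integration step could be sketched as follows. This is an illustration only: the function name Reset-ScmDiskBySerialNumber is hypothetical, and it assumes the device can be identified by the serial number that Get-PhysicalDisk reports. It is meant to be run only after the data has been backed up and the root cause has been resolved.

```powershell
# Hypothetical helper: re-integrates an NVDIMM-N into Storage Spaces after the
# root cause has been resolved and the data has been backed up.
function Reset-ScmDiskBySerialNumber {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        [Parameter(Mandatory)]
        [string]$SerialNumber
    )
    # Assumption: the serial number matches the value shown by Get-PhysicalDisk.
    $disk = Get-PhysicalDisk | Where-Object SerialNumber -eq $SerialNumber
    if (-not $disk) {
        throw "No physical disk found with serial number '$SerialNumber'."
    }
    if ($PSCmdlet.ShouldProcess($SerialNumber, 'Reset-PhysicalDisk')) {
        $disk | Reset-PhysicalDisk
    }
    # Show the resulting state so that the outcome can be checked.
    Get-PhysicalDisk | Where-Object SerialNumber -eq $SerialNumber |
        Select-Object SerialNumber, HealthStatus, OperationalStatus
}
```

Because the function declares SupportsShouldProcess, it can be called with -WhatIf first to confirm which device would be reset, for example: Reset-ScmDiskBySerialNumber -SerialNumber "802c-01-1602-117cb64f" -WhatIf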

NVDIMM-N is shown with a capacity of ‘0’ Bytes or as a "Generic Physical Disk"

This condition is when a storage-class memory device is shown with a capacity of 0 bytes and cannot be initialized, or is exposed as a "Generic Physical Disk" object with an Operational Status of Lost Communication, as shown in this example output:

SerialNumber          HealthStatus OperationalStatus  OperationalDetails
802c-01-1602-117cb5fc Healthy      OK
                      Warning      Lost Communication

The following table lists some info about this condition.

Likely condition: BIOS did not expose the NVDIMM-N to the OS
Root cause: NVDIMM-N devices are DRAM based. When a corrupt DRAM address is referenced, most CPUs will initiate a machine check and restart the server. Some server platforms then un-map the NVDIMM, preventing the OS from accessing it and from potentially triggering another machine check. This may also occur if the BIOS detects that the NVDIMM-N has failed and needs to be replaced.
General behavior: The NVDIMM-N is shown as uninitialized, with a capacity of 0 bytes, and cannot be read or written.
Storage Spaces behavior: The storage space remains operational (provided only one NVDIMM-N is affected). The NVDIMM-N PhysicalDisk object is shown with a Health Status of Warning and as a "Generic Physical Disk".
More info: OperationalStatus field of the PhysicalDisk object; event log Microsoft-Windows-ScmDisk0101/Operational
What to do: The NVDIMM-N device must be replaced or sanitized so that the server platform exposes it to the host OS again. Replacing the device is recommended, as additional uncorrectable errors could occur. A replacement device can be added to a Storage Spaces configuration with the Add-PhysicalDisk cmdlet.
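
Adding a replacement device with Add-PhysicalDisk might look like the following minimal sketch. The function name Add-ScmDiskToPool is hypothetical, the pool name must be supplied by the caller, and the sketch assumes the replacement device reports a bus type of SCM and is eligible for pooling.

```powershell
# Hypothetical helper: adds poolable storage-class memory devices to an
# existing storage pool identified by its friendly name.
function Add-ScmDiskToPool {
    param(
        [Parameter(Mandatory)]
        [string]$StoragePoolFriendlyName
    )
    # Assumption: the replacement device reports BusType SCM and can be pooled.
    $candidates = Get-PhysicalDisk -CanPool $true | Where-Object BusType -eq 'SCM'
    if (-not $candidates) {
        throw 'No poolable storage-class memory device was found.'
    }
    Add-PhysicalDisk -StoragePoolFriendlyName $StoragePoolFriendlyName -PhysicalDisks $candidates
}
```

Note that this sketch adds every poolable SCM device it finds, so on a system with more than one unpooled device the candidate list would need to be narrowed first, for example by serial number.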

NVDIMM-N is shown as a RAW or empty disk after a reboot

This condition is when you check the health of a storage-class memory device and see a Health Status of Unhealthy and Operational Status of Unrecognized Metadata, as shown in this example output:

SerialNumber          HealthStatus OperationalStatus                       OperationalDetails
802c-01-1602-117cb5fc Healthy      OK                                      {Unknown}
802c-01-1602-117cb64f Unhealthy    {Unrecognized Metadata, Stale Metadata} {Unknown}

The following table lists some info about this condition.

Likely condition: Backup or restore failure
Root cause: A failure in the backup or restore procedure will likely result in the loss of all data on the NVDIMM-N. When the operating system loads, the device will appear as a brand new NVDIMM-N without a partition or file system and will surface as RAW.
General behavior: The NVDIMM-N will be in read-only mode. Explicit user action is needed to begin using it again.
Storage Spaces behavior: The storage space remains operational if only one NVDIMM-N is affected. The NVDIMM-N physical disk object will be shown with the Health Status "Unhealthy" and is not used by Storage Spaces.
More info: OperationalStatus field of the PhysicalDisk object; event log Microsoft-Windows-ScmDisk0101/Operational
What to do: If the user doesn't want to replace the affected device, they can use the Reset-PhysicalDisk cmdlet to clear the read-only condition on the affected NVDIMM-N. In Storage Spaces environments, this will also attempt to re-integrate the NVDIMM-N into the storage space and start the repair process.

Interleaved Sets

Interleaved sets can often be created in a platform's BIOS to make multiple NVDIMM-N devices appear as a single device to the host operating system.

Windows Server 2016 and Windows 10 Anniversary Update (version 1607) do not support interleaved sets of NVDIMM-Ns.

At the time of this writing, there is no mechanism for the host operating system to correctly identify individual NVDIMM-Ns in such a set and clearly communicate to the user which particular device may have caused an error or needs to be serviced.

© 2017 Microsoft