No redundancy reported for a Cluster Virtual Disk

I ran into an interesting issue recently. We attempted to update the second node in a two-node cluster, which appeared to cause the Cluster Virtual Disk behind our Cluster Shared Volume to report "No Redundancy".

I want to preface this by saying that with proper procedures and good disk reporting, this entire situation could likely have been avoided completely. Unfortunately, neither of those things is present at the place this happened, so now I get to write about it.

First, some context

We have a Windows Failover Cluster consisting of 2 nodes, both running Windows Server 2016 Datacenter. Both servers have a bunch of identical hard drives added to a Storage Spaces Direct pool with a Cluster Virtual Disk and Cluster Shared Volume housing all of our virtual machine vhdx files. Those of you intimately familiar with Windows' Failover Clustering might have spotted a potential issue here already, but for everyone else who's like me, let me explain what happened.
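For context, the volume was set up more or less the way the Microsoft docs describe. The sketch below is illustrative rather than the exact commands that were used here; the pool and volume names are taken from the output later in this post, and the size is made up.

# Claims the local drives on every node into a single S2D pool (run once for the cluster)
Enable-ClusterStorageSpacesDirect

# Carves a volume out of that pool and exposes it as a Cluster Shared Volume.
# With only two nodes, S2D uses the classic two-way mirror: one copy of the data
# per server, which matters a lot for the rest of this story.
New-Volume -StoragePoolFriendlyName "S2D on ExampleCluster" `
    -FriendlyName "Cluster Virtual Disk" `
    -FileSystem CSVFS_ReFS -Size 4TB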

It was around the time to apply updates to them so we drained the roles from one of them, which I will call "Node 1", and updated it. There were... complications with one of the virtual machines that I won't go into here (they were entirely stemming from a poorly written application and some of the dumbest solutions you've ever seen to try and work around it), but other than that it updated just fine.

Day 2 comes around and we're on track to update the other server, which I will call "Node 2". The roles drain just fine, updates seem to apply, and the server reboots. However, after starting up again it decides that it doesn't like the updates and rolls them back. There was probably another reboot here, but I was not the one dealing with updating it so I can't say for certain.

The problem presents itself

What I can say for certain though, is that somewhere during the update and reboot process of Node 2, a disk on Node 1 started throwing uncorrected read errors.

I suspect it was slightly after the initial reboot, maybe when S2D tried to resynchronize after regaining connection. But I'm also no expert and just going off gut feeling and the number of ReadErrorsUncorrected I saw.

When Node 2 started back up, S2D attempted to resynchronize any changes, but was likely getting a bunch of those read errors from some sectors on one of the disks in Node 1 and couldn't properly sync that part. This tracks with the symptoms of just the Cluster Virtual Disk being in a state of "no redundancy" while the actual cluster and S2D disk pool all reported as fine.
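Had we been watching while Node 2 came back, the rebuild jobs would probably have told the story. Something along these lines would show the resync struggling (a hindsight sketch, not something I actually ran at the time):

# Lists the storage jobs S2D kicks off, including the mirror repair/resync after a node returns
Get-StorageJob | Format-Table Name, JobState, PercentComplete, BytesProcessed, BytesTotal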

Basically, one of the disks on the non-restarted node is about to fail but hasn't quite done so yet. In light of this, I spent my weekend babysitting virtual machine replication and ensuring there were backups (because there aren't always those) for everything on the cluster.

Detecting the problem

Detecting the problem was actually surprisingly tricky. Partially because I was not the one who did the actual updates (I am basically the "tier 3 troubleshooter" for whatever issues stump tiers 1 and 2), but mostly because I'm a blind fool who somehow managed to be turned away from the screen every single time the Storage Spaces notification popped up to warn that the pool had a disk that might have failed.
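For anyone else who keeps missing the toast notifications, the S2D Health Service can be asked for the same warnings directly. This is probably what I should have started with (the wildcard just matches the clustered storage subsystem's friendly name):

# Lists the current health faults for the clustered storage subsystem,
# e.g. a physical disk with failures or a volume with reduced resiliency
Get-StorageSubSystem Cluster* | Debug-StorageSubSystem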

Anyway, I got started on this track by finding this article, which linked to another article on the same blog that was amazingly informative and described the exact situation I was in.

I started with the obvious commands to get some information on the storage pool and Cluster Virtual Disk.

Get-StoragePool -IsPrimordial $False | Format-Table FriendlyName, OperationalStatus, HealthStatus, IsPrimordial, IsReadOnly

FriendlyName          OperationalStatus HealthStatus IsPrimordial IsReadOnly
------------          ----------------- ------------ ------------ ----------
S2D on ExampleCluster OK                Healthy      False        False

Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus, DetachedReason

FriendlyName         HealthStatus OperationalStatus DetachedReason
------------         ------------ ----------------- --------------
Cluster Virtual Disk Unhealthy    No Redundancy     None

Neither of these was a ton of help, really. They just confirmed what I had been told after I'd been called in to help. Next up I checked the physical disks installed across both servers.

Get-PhysicalDisk | Format-Table FriendlyName, MediaType, Size, CanPool, CannotPoolReason

This gave me a big old list of drives, but no further indication of anything being wrong, as all the drives were there and accounted for. But the second blog post linked above is what introduced me to the Get-StorageReliabilityCounter cmdlet. This cmdlet accepts one or more physical disk objects over the pipeline and attempts to read various drive health and SMART statistics for them.

Get-PhysicalDisk | Get-StorageReliabilityCounter

DeviceId Temperature ReadErrorsUncorrected Wear PowerOnHours
-------- ----------- --------------------- ---- ------------
1009                 0                     5    15185
2010                 0                     5    15301
1010                 0                     5    15185
1                    0                     0    31315
0                    0                     0    32197
2009                 0                     5    15301
2005                 0                     0    50260
2002                 0                     0    54068
1002                 0                     0    50199
1006                 0                     0    5437
1007                 0                     0    5438
2008                 0                     0    5462
1003                 2                     0    52483
2004                 0                     0    50315
1004                 0                     0    49798
2003                 0                     0    21424
2007                 0                     0    5462
2006                 0                     0    5462
1008                 0                     0    5438
1005                 650                   0    49972

By default this cmdlet already shows the information we're most interested in, which is that one line down at the bottom. Looks like drive 1005 has been throwing a lot of uncorrected read errors!

Those few drives with a wear level of 5 look a little out of place, but those are the SSD cache drives. As best I can tell, the reported wear value is the flash wear percentage.
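On a bigger cluster you probably don't want to eyeball that whole table, so the same data can be filtered down to just the drives that are actually reporting problems. Something like this should do it:

Get-PhysicalDisk | Get-StorageReliabilityCounter |
    Where-Object ReadErrorsUncorrected -GT 0 |
    Sort-Object ReadErrorsUncorrected -Descending |
    Format-Table DeviceId, ReadErrorsUncorrected, ReadErrorsCorrected, Wear, PowerOnHours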

Now unfortunately, the Get-PhysicalDisk cmdlet doesn't seem to support filtering by DeviceId so I had to select the whole list and figure out which disk it was manually.

Get-PhysicalDisk | Select-Object FriendlyName, SerialNumber, DeviceId
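If you already know which DeviceId you're after, though, a quick pipeline filter saves the manual scanning (1005 being the suspect disk from above):

Get-PhysicalDisk | Where-Object DeviceId -EQ '1005' |
    Select-Object FriendlyName, SerialNumber, DeviceId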

Once I had the serial number, I could look it up in our asset management tool and find exactly which server (and even which drive bay, thanks to whoever set up that NetBox server back in the day) the problematic drive was in.

Fixing the Cluster Virtual Disk

Yeah, uh, I haven't done this part yet. I'm currently babysitting the VM replication to a warm spare server in case something goes horribly wrong when we replace the drive. I also need to do a fair amount of reading on replacing drives in S2D pools before we do, since I seem to be one of the few people here who actually searches for information about a thing before doing it.
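That said, going purely off Microsoft's documentation rather than experience, the rough shape of the replacement should be something like the following. The serial number is a placeholder for the one found earlier, and I'd verify every step against the docs before running any of it against a production cluster.

# Grab the failing disk by the serial number identified earlier (placeholder value)
$bad = Get-PhysicalDisk -SerialNumber "XYZ123"

# Retire it so Storage Spaces stops allocating new data to it
Set-PhysicalDisk -InputObject $bad -Usage Retired

# Rebuild the virtual disk onto the remaining healthy drives, then watch the repair job
Repair-VirtualDisk -FriendlyName "Cluster Virtual Disk"
Get-StorageJob

# Once everything reports Healthy again, remove the retired disk from the pool
# and physically swap the hardware
Remove-PhysicalDisk -PhysicalDisks $bad -StoragePoolFriendlyName "S2D on ExampleCluster"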

What if I don't want this situation again?

Fortunately for you, Microsoft has been continually improving support for two-node Failover Clusters for those of us stupid enough to use them. One of these improvements is nested resiliency, which is used in place of the classic two-way mirroring and is designed for exactly this scenario: one node down for maintenance plus a failing drive on the surviving node. Unfortunately, this feature is only available in Windows Server 2019.
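For completeness, creating a nested two-way mirror volume on a 2019 cluster looks roughly like this per Microsoft's documentation. I have not run this anywhere, and the pool, tier, and volume names plus the size are just placeholders:

# Define a nested two-way mirror tier: four copies of the data in total, two per server
New-StorageTier -StoragePoolFriendlyName "S2D on ExampleCluster" `
    -FriendlyName NestedMirror -ResiliencySettingName Mirror `
    -MediaType HDD -NumberOfDataCopies 4

# Create the volume on that tier instead of the default two-way mirror
New-Volume -StoragePoolFriendlyName "S2D on ExampleCluster" `
    -FriendlyName "Cluster Virtual Disk" -FileSystem CSVFS_ReFS `
    -StorageTierFriendlyNames NestedMirror -StorageTierSizes 2TB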

Not that we would be able to use it anyway as I'm fairly certain changing the mirroring type of an S2D pool requires rebuilding the entire thing.