I haven’t had the time lately to get a fairly in-depth post going. I’ve been focusing on a bit of study work because I had to nail down my final VCAP-DCV test to acquire my VCIX-DCV (BOOM!), and start drilling down for the VCDX-DCV. Yes, you heard it here first. I plan to make an attempt at the VCDX. That’s not what we’re here for today though, so more on that later. A few weeks back I decided to evolve the lab here at dirmann.tech. I have a little story that I want to share about the experience.
We had a handful of spindle drives lying around, not doing anything. Coincidentally, they were all the same (even decent) capacities. If you know me, then you know I’m a huge proponent of vSAN, so I’ll give you one guess at what we did. If you guessed “decided to convert two of the four clusters that we have to vSAN, order the necessary SSDs for the cache tier, wait anxiously for them to arrive, place all the drives in their bays, and get to configuring vSAN”, then by the wings of Odin’s ravens you are 100% correct! Damn, you’re good. The day the SSDs arrived at the office I was like a kid at Christmas, only much more geeky.
Well, I started off with our Management cluster. Beforehand, we were running iSCSI across the cluster and booted off of SD-cards so I had to modify the RAID controller settings. I always put my controllers in passthrough/JBOD mode. I checked and verified that all drives were being seen without issue and booted the host back up. Rinse and repeat for the second host. Since they were previously used, I went through each and every spinning disk and wiped any existing partitions and began my journey into vSAN-hood.
VMkernel adapters added, DRS/HA temporarily disabled, vSAN witness deployed and configured, vSAN cluster configuration initiated. I marked all my capacity disks accordingly, as well as my cache tier, and BAM! I got hit with the following warning:
Some hosts are claiming less disks than others.
How? How is this possible? Why would you do this to me? My first thought was immediately the worst-case scenario: “We have a bad drive and there are definitely no more available to try.” Once I snapped out of it I started troubleshooting. I tried moving the drive to a different bay, then to a different server. The problem followed the drive, which somewhat reinforced my original thought. In order to just get vSAN going, I removed a capacity drive from the second host. Now both hosts were claiming equal capacity and cache. vSAN up!
It’s Not Over
The battle may have been won, but the war was not over. Time to figure out what’s the deally with this HDD. To be sure, I wiped the partitions again via vCenter. No dice. I dropped into an SSH session on the host to execute some commands:
[@:~] esxcli ssacli cmd -q "ctrl slot=0 pd all show"
Smart Array P440ar in Slot 0 (Embedded) (HBA Mode)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS HDD, 900 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS HDD, 900 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS HDD, 900 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS HDD, 900 GB, OK)
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS HDD, 900 GB, OK)
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS HDD, 900 GB, OK)
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SATA SSD, 480 GB, OK)
[@:~] esxcli ssacli cmd -q "ctrl slot=0 pd 2I:1:6 show detail"
Smart Array P440ar in Slot 0 (Embedded) (HBA Mode)
Drive Type: HBA Mode Drive
Interface Type: SAS
Size: 900 GB
Drive exposed to OS: True
Logical/Physical Block Size: 512/512
Rotational Speed: 10000
Firmware Revision: HPDC
Serial Number: [rem]
Model: HP EG0900FBVFQ
Current Temperature (C): 38
Maximum Temperature (C): 50
PHY Count: 2
PHY Transfer Rate: 6.0Gbps, Unknown
PHY Physical Link Rate: Unknown, Unknown
PHY Maximum Link Rate: Unknown, Unknown
Drive Authentication Status: OK
Carrier Application Version: 11
Carrier Bootloader Version: 6
Sanitize Erase Supported: False
Shingled Magnetic Recording Support: None
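Since the original warning was about hosts claiming different numbers of disks, it can help to boil the `pd all show` listing down to a per-type count and compare that between hosts. The following is only a rough sketch: `pd_summary` is a made-up helper name, and it assumes the controller prints drives in the `physicaldrive ... (port ..., TYPE, SIZE, STATUS)` layout shown above.

```shell
# Hypothetical helper: summarize `pd all show` output read from stdin
# as "<count> x <drive type>, <size>", one line per kind of drive.
# Assumes each drive line is laid out as
#   physicaldrive <id> (port ..., TYPE, SIZE, STATUS)
pd_summary() {
    awk -F', ' '
        /physicaldrive/ { n[$2 ", " $3]++ }            # key on "TYPE, SIZE"
        END { for (k in n) print n[k] " x " k }
    ' | sort
}

# On an ESXi host (not run here), pipe the controller listing into it:
#   esxcli ssacli cmd -q "ctrl slot=0 pd all show" | pd_summary
```

Running it on each host and comparing the summaries side by side should show whether the controllers themselves see the same set of drives, which separates a hardware or controller problem from a vSAN eligibility problem.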
According to the RAID controller, the HDD is present and clean. Still no dice. I tried to format the disk with VMFS and make it a datastore to see if I could even write to the thing. No….dice! Being that I was fried out from studying, I decided to reach out to the community and I pinged the forums.
@TheBobkin reached out to me in response to my post. I’ve seen him in the forums a bunch. Very knowledgeable. He asked me to execute the following (where the NAA is the WWID of my problematic HDD):
[@:~] vdq -q | grep naa.5000cca05739f1bc
"Name" : "naa.5000cca05739f1bc",
"VSANUUID" : "",
"State" : "Ineligible for use by VSAN",
"Reason" : "Has partitions",
"IsSSD" : "0",
"IsPDL" : "0",
"Size(MB)" : "858483",
"FormatType" : "512n",
"IsVsanDirectDisk" : "0"
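Grepping for a single NAA works when you already know the culprit. To check every disk on a host in one pass, a small filter along these lines might do the job. Treat it as a sketch under assumptions: `vdq_ineligible` is a hypothetical helper name, and it assumes `vdq -q` prints one `"Key" : "Value"` pair per line with plain double quotes, with "Name" appearing before "State" and "Reason" for each disk, as in the output above.

```shell
# Hypothetical helper: read `vdq -q` style output on stdin and print
# "<disk name>: <reason>" for every disk reported as ineligible.
# Assumes one "Key" : "Value" pair per line, and that "Name" comes
# before "State" and "Reason" within each disk entry.
vdq_ineligible() {
    awk -F'"' '
        $2 == "Name"          { name = $4; bad = 0 }        # new disk entry
        $2 == "State"         { bad = ($4 ~ /^Ineligible/) }
        $2 == "Reason" && bad { print name ": " $4 }
    '
}

# On an ESXi host (not run here):
#   vdq -q | vdq_ineligible
```

It uses plain text matching rather than a JSON parser so that it does not depend on the output being strictly valid JSON.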
WHAT?! How did this drive escape my partition-wiping lasers of doom?? I’ll get you this time!
[@:~] dd if=/dev/zero of=/vmfs/devices/disks/naa.5000cca05739f1bc bs=1M count=50 conv=notrunc
[@:~] partedUtil mklabel /vmfs/devices/disks/naa.5000cca05739f1bc msdos
…and the result is…
Big shout out to @TheBobkin for his help. If you want to take a look at the forum post, you can find it here. Thanks for reading. If you enjoyed the post make sure you check us out at dirmann.tech and follow us on LinkedIn, Twitter, Instagram, and Facebook!
Paul Dirmann (vExpert PRO*, vExpert**, VCIX-DCV, VCAP-DCV Design, VCAP-DCV Deploy, VCP-DCV, VCA-DBT, C|EH, MCSA, MCTS, MCP, CIOS, Network+, A+) is the owner and current Lead Consultant at Dirmann Technology Consultants. A technology evangelist, Dirmann has held both leadership positions, as well as technical ones architecting and engineering solutions for multiple multi-million dollar enterprises. While knowledgeable in the majority of the facets involved in the information technology realm, Dirmann honed his expertise in VMware’s line of solutions with a primary focus in hyper-converged infrastructure (HCI) and software-defined data centers (SDDC), server infrastructure, and automation. Read more about Paul Dirmann here, or visit his LinkedIn profile.