A new feature in ESXi 5.1 is the ability to check SSD health from the command line. Once you have SSH’d into the ESXi box, you can check a drive's health with the following command:
esxcli storage core device smart get -d [drive]
…where [drive] takes the form t10.ATA?????????. To find the right drive name, list the device directory:
ls -l /dev/disks/
This returns output like the following:
mpx.vmhba32:C0:T0:L0
mpx.vmhba32:C0:T0:L0:1
mpx.vmhba32:C0:T0:L0:5
mpx.vmhba32:C0:T0:L0:6
mpx.vmhba32:C0:T0:L0:7
mpx.vmhba32:C0:T0:L0:8
t10.ATA_____M42DCT064M4SSD2__________________________000000001147032121AB
t10.ATA_____M42DCT064M4SSD2__________________________000000001147032121AB:1
t10.ATA_____M42DCT064M4SSD2__________________________0000000011470321ADA4
t10.ATA_____M42DCT064M4SSD2__________________________0000000011470321ADA4:1
Here the t10.xxx names without the :1 at the end are the two SSDs available (the :1 entries are partitions on those drives). Copy and paste an entire t10.xxx name as the [drive]. The command output should look like:
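If the listing is long, the whole-drive names can be picked out with a small filter. This is only a sketch: the function name whole_t10_disks is made up for illustration, and it assumes that partition entries always end in a colon followed by a number, as they do in the listing above. It is fed a few names from that listing so it can be tried anywhere; on the host you would pipe ls /dev/disks/ into it instead.

```shell
# Hypothetical helper: keep t10.* names and drop entries ending in ":<digits>"
# (assumed to be partitions, as in the listing above).
whole_t10_disks() {
    grep '^t10\.' | grep -v ':[0-9][0-9]*$'
}

# On the ESXi host: ls /dev/disks/ | whole_t10_disks
# Here the filter is fed a few names copied from the listing above.
printf '%s\n' \
    'mpx.vmhba32:C0:T0:L0' \
    'mpx.vmhba32:C0:T0:L0:1' \
    't10.ATA_____M42DCT064M4SSD2__________________________000000001147032121AB' \
    't10.ATA_____M42DCT064M4SSD2__________________________000000001147032121AB:1' \
    | whole_t10_disks
```

With the sample input this prints only the one t10.xxx name that has no :1 suffix, which is the form the smart get command expects.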
~ # esxcli storage core device smart get -d t10.ATA_____M42DCT064M4SSD2__________________________000000001147032121AB
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             N/A    N/A        N/A
Read Error Count              100    50         100
Power-on Hours                100    1          100
Power Cycle Count             100    1          100
Reallocated Sector Count      100    10         100
Raw Read Error Rate           100    50         100
Drive Temperature             100    0          100
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       100    1          100
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       100    50         100
One figure to keep an eye on is the Reallocated Sector Count, which reflects the drive's reserve of spare sectors: it should be around 100, and diminishes as the SSD replaces bad sectors with ones from this reservoir. The above statistics are updated every 30 minutes. As a point of interest, in this case ESXi isn’t reporting raw figures: the SSD doesn’t actually have exactly 100 power-on hours or 100 power cycles. These appear to be normalized SMART values rather than raw counts, so they are best read against the Threshold column.
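Rather than reading the table by eye, the output can be scanned for any attribute whose Value has dropped to or below its Threshold. The following is a rough sketch, not a tested monitoring tool: the function name smart_at_threshold is made up, it assumes awk is available where you run it (on the host or on a workstation receiving the output over SSH), and it assumes the column layout shown above. The last sample row is invented purely to show what a flagged attribute looks like.

```shell
# Hypothetical helper: read "smart get" output on stdin and print every
# attribute whose Value is at or below its Threshold.
# Parameter names contain spaces, so columns are counted from the right:
# Value is third from last, Threshold second from last, Worst is last.
smart_at_threshold() {
    awk '
        NF >= 4 {
            value = $(NF - 2)
            threshold = $(NF - 1)
            # Skip the header, the separator row and OK / N/A entries.
            if (value !~ /^[0-9]+$/ || threshold !~ /^[0-9]+$/) next
            if (value + 0 <= threshold + 0) {
                name = $1
                for (i = 2; i <= NF - 3; i++) name = name " " $i
                printf "%s: value %d, threshold %d\n", name, value, threshold
            }
        }
    '
}

# On the host: esxcli storage core device smart get -d [drive] | smart_at_threshold
# Sample input: the first three rows come from the output above,
# the last row is invented for illustration.
printf '%s\n' \
    'Parameter Value Threshold Worst' \
    'Health Status OK N/A N/A' \
    'Read Error Count 100 50 100' \
    'Reallocated Sector Count 8 10 8' \
    | smart_at_threshold
```

With the sample input this prints "Reallocated Sector Count: value 8, threshold 10". Run against the real output shown earlier it prints nothing, since every numeric Value there is above its Threshold.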
Assuming it works for your SSDs, this is quite a useful tool: knowing when a drive is likely to fail gives you the opportunity for early replacement and less downtime from unexpected failures.