vSphere Troubleshooting Series: Part 5 – Storage Troubleshooting

If a virtual machine cannot access its virtual disks, the cause of the problem might be anywhere from the virtual machine to physical storage.

There are multiple types of storage, so it’s important to determine which type you’re troubleshooting before starting. A “datastore” can be multiple things with different types of connectivity.

Storage Troubleshooting Scenario #1 – Storage is not reachable from ESXi host.

This problem is typically noticed when a datastore goes offline. The ESXi host itself appears fine, but something has cut it off from the datastore.

Typically, the best method to start with would be:

  • Verify that the ESXi host can see the LUN by running: “esxcli storage core path list” from the host.
  • Check to see if a rescan of storage resolves it by running: “esxcli storage core adapter rescan -A vmhba##”

If the rescan does not resolve it, it is likely that something else is causing the issue. Have there been any other recent changes to the ESXi host?

Some other possible causes:

  • A firewall is blocking storage traffic.
  • The VMkernel interface is misconfigured.
  • IP storage is not configured properly.
  • NFS storage is not configured properly.
  • The iSCSI port (3260) is unreachable.
  • The storage device itself is not functioning properly.
  • LUN masking is in place, or the LUN is no longer presented.
  • The array is not supported.
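One of the causes above, an unreachable iSCSI port, can be tested with a plain TCP connection attempt. The following is a minimal sketch, not a VMware tool: the function name `tcp_port_reachable` and the placeholder address are assumptions for illustration. It only proves reachability from the machine that runs it, which may take a different network path than the host's VMkernel interface.

```python
import socket


def tcp_port_reachable(host: str, port: int = 3260, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection performs the full TCP handshake, so success means
        # the port is open and no firewall in between dropped the traffic.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example call (192.0.2.10 is a documentation placeholder, not a real target):
# tcp_port_reachable("192.0.2.10")
```

A False result does not say which of the causes is responsible; it only narrows the problem to the network path or the target itself.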

Check your adapter settings. Are the network port bindings set up properly? Is the target name spelled correctly? Is the initiator name correct? Are any required CHAP settings configured? Do you see your storage devices under the Devices tab?

If the storage device is online but performing poorly, check your latency metrics as well. Your goal is to avoid oversubscribing your links. Try to isolate iSCSI and NFS traffic.

Use the esxtop or resxtop command and press d:

Column  Description
CMDS/s This is the total number of commands per second. It includes IOPS (Input/Output Operations Per Second) and other SCSI commands, such as SCSI reservations, locks, vendor string requests, and unit attention commands, being sent to or coming from the device or virtual machine being monitored.
DAVG/cmd This is the average response time in milliseconds per command being sent to the device.
KAVG/cmd This is the average amount of time, in milliseconds, that a command spends in the VMkernel.
GAVG/cmd This is the response time as it is perceived by the guest operating system. It is roughly the sum of DAVG and KAVG.
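To show how these columns relate, here is a small sketch that takes one DAVG/KAVG reading and points at the likely layer. The class and function names are hypothetical, and the thresholds (25 ms for DAVG, 2 ms for KAVG) are commonly quoted rules of thumb used here as assumptions, so adjust them for your environment.

```python
from dataclasses import dataclass

# Rule-of-thumb thresholds in milliseconds (assumptions, not hard limits).
DAVG_WARN_MS = 25.0
KAVG_WARN_MS = 2.0


@dataclass
class LatencySample:
    davg_ms: float  # latency at the device (array, fabric, HBA)
    kavg_ms: float  # time spent in the VMkernel

    @property
    def gavg_ms(self) -> float:
        # Guest-perceived latency is roughly device latency plus kernel latency.
        return self.davg_ms + self.kavg_ms


def diagnose(sample: LatencySample) -> list:
    """Return a list of findings for one esxtop reading."""
    findings = []
    if sample.davg_ms > DAVG_WARN_MS:
        findings.append("high device latency: check the array, fabric, or HBA")
    if sample.kavg_ms > KAVG_WARN_MS:
        findings.append("high kernel latency: check queuing on the host")
    return findings
```

For example, `diagnose(LatencySample(davg_ms=30.0, kavg_ms=0.5))` returns only the device latency finding, which suggests looking beyond the host first.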

Storage Troubleshooting Scenario #2 – NFS Connectivity Issues

If you have virtual machines that are on NFS datastores, verify that the configuration is correct.

  • Check NFS server name or IP address.
  • Is the ESXi host mapped to the virtual switch?
  • Does the VMkernel port have the right IP configuration?
  • On the NFS server, are the ACLs correct? (read/write or read only)

VMware supports both NFS v3 and v4.1, but it’s important to remember that they use different locking mechanisms:

  • NFS v3 uses proprietary client-side cooperative locking.
  • NFS v4.1 uses server-side locking.

Configure an NFS array to allow only one NFS protocol. Use either NFS v3 or NFS v4.1 to mount the same NFS share across all ESXi hosts, and do not mix the two. Data corruption might occur if hosts access the same NFS share with different client versions.
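If you keep an inventory of how each host mounts its NFS shares, a mixed-version mount is easy to spot. The sketch below assumes you have already collected (host, share, version) tuples by some means; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def find_mixed_nfs_versions(mounts):
    """Return {share: [versions]} for every share mounted with more than one NFS version.

    `mounts` is an iterable of (host, share, version) tuples.
    """
    versions_by_share = defaultdict(set)
    for _host, share, version in mounts:
        versions_by_share[share].add(version)
    # A share is a problem only if hosts disagree on the protocol version.
    return {
        share: sorted(versions)
        for share, versions in versions_by_share.items()
        if len(versions) > 1
    }


# Hypothetical inventory: esx02 mounts the same share with a different version.
inventory = [
    ("esx01", "nas01:/vol/vmware", "3"),
    ("esx02", "nas01:/vol/vmware", "4.1"),
    ("esx01", "nas01:/vol/iso", "3"),
]
print(find_mixed_nfs_versions(inventory))
```

Any share that appears in the result should be remounted so that every host uses the same version.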

NFS 4.1 also does not currently support Storage DRS, vSphere Storage I/O Control, Site Recovery Manager or Virtual Volumes.

Storage Troubleshooting Scenario #3 – One or more paths to a LUN are lost.

If an ESXi host had storage connectivity at one point but the LUN is now dead, here are a few esxcli commands to run when troubleshooting the issue.

  • esxcli storage core path list

  • esxcli storage nmp device list
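The output of “esxcli storage core path list” can be long on hosts with many LUNs, so it can help to filter it for dead paths. The sketch below assumes the usual layout: one unindented identifier line per path, followed by indented “Key: Value” lines that include Runtime Name, Device and State. The sample text is made up for illustration, and the function names are hypothetical.

```python
def parse_path_list(text):
    """Parse path-list style output into a list of dicts, one per path."""
    paths = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate path blocks
        if not line[0].isspace():
            # An unindented line starts a new path block (the path identifier).
            current = {"UID": line.strip()}
            paths.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only; values such as vmhba1:C0:T0:L0 contain colons.
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return paths


def dead_paths(text):
    """Return only the paths whose State field is 'dead'."""
    return [p for p in parse_path_list(text) if p.get("State") == "dead"]


# Illustrative sample, not real output.
SAMPLE = """\
fc.example-uid-1
   Runtime Name: vmhba1:C0:T0:L0
   Device: naa.example0001
   State: active

fc.example-uid-2
   Runtime Name: vmhba2:C0:T0:L0
   Device: naa.example0001
   State: dead
"""

for path in dead_paths(SAMPLE):
    print(path["Runtime Name"], path["Device"])
```

If only some paths to a device are dead, the device is still reachable and the problem is likely on one adapter, cable, or fabric. If every path is dead, continue with the APD and PDL checks.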

A path to a storage/LUN device can be marked as Dead in these situations:

  • The ESXi storage stack determines a path is Dead due to the TEST_UNIT_READY command failing on probing
  • The ESXi storage stack marks paths as Dead after a permanent device loss (PDL)
  • The ESXi storage stack receives a Host Status of 0x1 from an HBA driver

For iSCSI storage, verify that NIC teaming is not misconfigured. Next, verify that your path selection policy is set up properly.

Check for Permanent Device Loss or All Paths Down. A device can be in one of two distinct states when storage connectivity is lost: All Paths Down (APD) or Permanent Device Loss (PDL). The device is handled differently in each state.

All Paths Down (APD) is a condition where all paths to the storage device are lost or the storage device is removed. Because the change happened in an uncontrolled manner, the VMkernel storage stack does not know how long the loss of access to the device will last. APD is therefore treated as a temporary (transient) condition, since the storage device might come back online. If the loss is permanent, it is referred to as a Permanent Device Loss (PDL).

Permanent Device Loss (PDL):

  • A datastore is shown as unavailable in the Storage view
  • A storage adapter indicates the Operational State of the device as Lost Communication
  • A planned PDL occurs when there is an intent to remove a device presented to the ESXi host.
  • An unplanned PDL occurs when the storage device is unexpectedly unpresented from the storage array without the unmount and detach being run on the ESXi host.

All Paths Down (APD):

  • You are unable to connect directly to the ESXi host using the vSphere Client
  • All paths to the device are marked as Dead
  • The ESXi host shows as Disconnected in vCenter Server

Storage all paths down (APD) handling is enabled by default on the ESXi host. When it is enabled, the host continues to retry non-virtual machine I/O commands to a storage device in the APD state for a limited time frame. When the time frame expires, the host stops the retry attempts and terminates any non-virtual machine I/O. You can disable the APD handling feature on your host. If you disable it, the host will continue indefinitely to retry issued commands in an attempt to reconnect to the APD device. In that case, virtual machines on the host could exceed their internal I/O timeout and become unresponsive.

You might want to increase the value of the timeout if there are storage devices connected to your ESXi host that might take longer than the default of 140 seconds to recover from a connection loss. You can enter a value between 20 and 99999 seconds for the Misc.APDTimeout parameter.

To disable APD handling or change the timeout:

  • Browse to the host in the vSphere Web Client.
  • Click the Manage tab, and click Settings.
  • Under System, click Advanced System Settings.
  • Under Advanced System Settings, select the Misc.APDHandlingEnable parameter and click the Edit icon.
  • Change the value to 0 to disable APD handling.
  • Edit the Misc.APDTimeout value if desired.
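The same timeout can usually be changed from the command line as well. The sketch below validates a value against the 20 to 99999 second range and builds the argument list for the command. The option path /Misc/APDTimeout and the -o and -i flags are assumptions to confirm on your ESXi version before running anything, and the function name is hypothetical.

```python
APD_TIMEOUT_MIN = 20
APD_TIMEOUT_MAX = 99999


def apd_timeout_command(seconds: int) -> list:
    """Build (but do not run) the esxcli argument list for setting the APD timeout."""
    if not APD_TIMEOUT_MIN <= seconds <= APD_TIMEOUT_MAX:
        raise ValueError(
            f"APD timeout must be between {APD_TIMEOUT_MIN} and "
            f"{APD_TIMEOUT_MAX} seconds, got {seconds}"
        )
    # Assumed command form; verify the option name on the host first.
    return [
        "esxcli", "system", "settings", "advanced", "set",
        "-o", "/Misc/APDTimeout", "-i", str(seconds),
    ]


print(" ".join(apd_timeout_command(180)))
```

Building the command separately from running it keeps the range check testable and lets you review the exact command before it touches a host.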

Storage Troubleshooting Scenario #4 – vSAN Troubleshooting

Before you begin, it is important to realize that vSAN is a software-based storage product that is entirely dependent on properly functioning underlying hardware components, such as the network, the storage I/O controller and the individual storage devices. You always need to follow the vSAN Compatibility Guide for all deployments.

Many vSAN errors can be traced back to faulty VMkernel ports, mismatched MTU sizes, etc. It’s far more than simple TCP/IP.
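A mismatched MTU is simple to detect once you have the MTU of each vSAN VMkernel port in one place. The sketch below assumes you have gathered those values yourself; the function name, port labels and values are hypothetical. It treats the most common MTU as the expected one and reports everything else.

```python
from collections import Counter


def find_mtu_outliers(mtu_by_port):
    """Return {port: mtu} for ports whose MTU differs from the most common value."""
    if not mtu_by_port:
        return {}
    # The most common MTU is assumed to be the intended setting.
    expected, _count = Counter(mtu_by_port.values()).most_common(1)[0]
    return {port: mtu for port, mtu in mtu_by_port.items() if mtu != expected}


# Hypothetical values for the vSAN VMkernel port on three hosts.
vsan_ports = {"esx01:vmk2": 9000, "esx02:vmk2": 9000, "esx03:vmk2": 1500}
print(find_mtu_outliers(vsan_ports))
```

This only compares host-side settings. The physical switch ports along the path need a matching MTU too, which this check cannot see.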

Some of the tools you can use to troubleshoot vSAN are:

  • vSphere Web Client
    • The primary tool to troubleshoot vSAN.
    • Provides overviews of individual virtual machine performance.
    • Can inspect underlying disk devices and how they are being used by vSAN.
  • esxcli vsan
    • Get information and manage the vSAN cluster.
    • Clear vSAN network configuration.
    • Verify which VMkernel network adapters are used for vSAN communication.
    • List the vSAN storage configuration.
  • Ruby vSphere Console
    • Fully implemented since vSphere 5.5
    • Commands to apply licenses, check limits, check state, change auto claim mechanisms, etc.
  • vSAN Observer
    • This tool is included within the Ruby vSphere Console.
    • Can be used for performance troubleshooting, with metrics for areas such as CPU, memory or disks.
  • Third Party Tools