vSphere Troubleshooting Series: Part 5 – Storage Troubleshooting

If a virtual machine cannot access its virtual disks, the cause of the problem might be anywhere from the virtual machine to physical storage.

There are multiple types of storage, so it’s important to determine which type you’re troubleshooting before starting. A “datastore” can be multiple things with different types of connectivity.

Storage Troubleshooting Scenario #1 – Storage is not reachable from ESXi host.

This problem typically will be noticed when a datastore falls offline. The ESXi host itself appears fine but something has caused the datastore to fall offline.

Typically, the best method to start with would be:

  • Verify that the ESXi host can see the LUN by running: “esxcli storage core path list” from the host.
  • Check to see if a rescan of storage resolves it by running: “esxcli storage core adapter rescan -A vmhba##”

If the rescan does not resolve it, it is likely that something else is causing the issue. Have there been any other recent changes to the ESXi host?

Some other possible causes:

  • A firewall is blocking traffic.
  • The VMkernel interface is misconfigured.
  • IP storage is not configured properly.
  • NFS storage is not configured properly.
  • The iSCSI port (3260) is unreachable.
  • The storage device itself is not functioning properly.
  • LUN masking is in place, or the LUN is no longer presented.
  • The array is not supported.

Check your adapter settings. Are the network port bindings set up properly? Is the target name spelled properly? Is the initiator name correct? Are any CHAP settings required? Do you see your storage devices under the devices tab?

If the storage device is online but functioning poorly, check your latency metrics as well. Your goal is to not oversubscribe your links. Try to isolate iSCSI and NFS.

Use the esxtop or resxtop command and press d:

Column  Description
CMDS/s This is the total amount of commands per second and includes IOPS (Input/Output Operations Per Second) and other SCSI commands such as SCSI reservations, locks, vendor string requests, unit attention commands etc. being sent to or coming from the device or virtual machine being monitored.
DAVG/cmd This is the average response time in milliseconds per command being sent to the device.
KAVG/cmd This is the amount of time the command spends in the VMkernel.
GAVG/cmd This is the response time as it is perceived by the guest operating system.
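
The columns above are related: the guest-perceived latency (GAVG) is the device latency (DAVG) plus the time spent in the VMkernel (KAVG). The following is a minimal Python sketch of that relationship. The `LatencySample` class, the `triage` helper, and both threshold values are illustrative assumptions, not values taken from this post, so adjust them to your environment.

```python
from dataclasses import dataclass


@dataclass
class LatencySample:
    """One esxtop reading for a device, in milliseconds per command."""
    device: str
    davg_ms: float  # DAVG/cmd: time spent at the device or array
    kavg_ms: float  # KAVG/cmd: time spent in the VMkernel

    @property
    def gavg_ms(self) -> float:
        # GAVG/cmd is what the guest sees: device time plus kernel time.
        return self.davg_ms + self.kavg_ms


def triage(sample, davg_limit_ms=25.0, kavg_limit_ms=2.0):
    """Return hints about where latency is accumulating.

    Both limits are assumed defaults for illustration only.
    """
    hints = []
    if sample.davg_ms > davg_limit_ms:
        hints.append("high DAVG: look at the array, fabric, or HBA")
    if sample.kavg_ms > kavg_limit_ms:
        hints.append("high KAVG: look at queuing inside the ESXi host")
    return hints


sample = LatencySample(device="naa.example", davg_ms=31.5, kavg_ms=0.5)
print(sample.gavg_ms, triage(sample))
```

Splitting the numbers this way helps decide whether to look at the storage side or at the host first.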

Storage Troubleshooting Scenario #2 – NFS Connectivity Issues

If you have virtual machines that are on NFS datastores, verify that the configuration is correct.

  • Check NFS server name or IP address.
  • Is the ESXi host mapped to the virtual switch?
  • Does the VMkernel port have the right IP configuration?
  • On the NFS server, are the ACLs correct? (read/write or read only)

VMware supports both NFS v3 and v4.1, but it’s important to remember that they use different locking mechanisms:

  • NFS v3 uses proprietary client-side cooperative locking.
  • NFS v4.1 uses server-side locking.

Configure an NFS array to allow only one NFS protocol. Use either NFS v3 or NFS v4.1 to mount the same NFS share across all ESXi hosts. It is not a good idea to mix. Data corruption might occur if they try to access the same NFS share with different client versions.
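
As a rough illustration of the "do not mix versions" rule, here is a small Python sketch that flags any share mounted with more than one NFS version. The inventory format, host names, and share names are hypothetical. How you collect the mount list from your hosts is left open.

```python
from collections import defaultdict


def find_mixed_nfs_shares(mounts):
    """Return shares that are mounted with more than one NFS version.

    `mounts` is an iterable of (host, share, version) tuples.
    """
    versions_by_share = defaultdict(set)
    for host, share, version in mounts:
        versions_by_share[share].add(version)
    return {
        share: sorted(versions)
        for share, versions in versions_by_share.items()
        if len(versions) > 1
    }


# Hypothetical inventory: ds1 is mounted as v3 on one host and v4.1 on another.
mounts = [
    ("esx01", "nas01:/vol/ds1", "3"),
    ("esx02", "nas01:/vol/ds1", "4.1"),
    ("esx01", "nas01:/vol/ds2", "4.1"),
    ("esx02", "nas01:/vol/ds2", "4.1"),
]
print(find_mixed_nfs_shares(mounts))
```

Any share that shows up in the result should be remounted so every host uses the same version.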

NFS 4.1 also does not currently support Storage DRS, vSphere Storage I/O Control, Site Recovery Manager or Virtual Volumes.

Storage Troubleshooting Scenario #3 – One or more paths to a LUN is lost.

If an ESXi host had storage connectivity at one point but the LUN is now dead, here are a few esxcli commands to use when troubleshooting the issue.

  • esxcli storage core path list

  • esxcli storage nmp device list
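
If you save the output of the path list command to a file, a short script can pick out the dead paths. The Python sketch below assumes the output is a series of blocks, each starting with an unindented path identifier followed by indented "Key: Value" lines. The sample text is made up and trimmed for illustration, so compare it with what your host prints before relying on it.

```python
def parse_paths(output):
    """Split path-list style text into one dict per path."""
    paths = []
    for line in output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new path block.
            paths.append({"Path": line.strip()})
        elif paths and ":" in line:
            # Split on the first colon only; values may contain colons.
            key, _, value = line.strip().partition(":")
            paths[-1][key.strip()] = value.strip()
    return paths


def dead_paths(paths):
    """Return only the paths whose State field is 'dead'."""
    return [p for p in paths if p.get("State", "").lower() == "dead"]


# Illustrative sample, not real host output.
SAMPLE = """\
iqn.example:host-a,iqn.example:target,t,1-naa.example0001
   Runtime Name: vmhba33:C0:T0:L0
   Device: naa.example0001
   Adapter: vmhba33
   State: active

iqn.example:host-b,iqn.example:target,t,1-naa.example0001
   Runtime Name: vmhba33:C1:T0:L0
   Device: naa.example0001
   Adapter: vmhba33
   State: dead
"""

for path in dead_paths(parse_paths(SAMPLE)):
    print(path["Runtime Name"], path["Device"])
```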

A path to a storage/LUN device can be marked as Dead in these situations:

  • The ESXi storage stack determines a path is Dead due to the TEST_UNIT_READY command failing on probing
  • The ESXi storage stack marks paths as Dead after a permanent device loss (PDL)
  • The ESXi storage stack receives a Host Status of 0x1 from an HBA driver

For iSCSI storage, verify that NIC teaming is not misconfigured. Next, verify that your path selection policy is set up properly.

Check for Permanent Device Loss or All Paths Down. A device can be in one of two distinct states when storage connectivity is lost: All Paths Down (APD) or Permanent Device Loss (PDL). The device is handled differently in each state. APD is a condition where all paths to the storage device are lost or the storage device is removed. It arises because the change happened in an uncontrolled manner, and the VMkernel storage stack does not know how long the loss of access to the device will last. APD is treated as temporary (transient), since the storage device might come back online. If the loss turns out to be permanent, it is referred to as a Permanent Device Loss (PDL).

Permanent Device Loss (PDL):

  • A datastore is shown as unavailable in the Storage view
  • A storage adapter indicates the Operational State of the device as Lost Communication
  • A planned PDL occurs when there is an intent to remove a device presented to the ESXi host.
  • An unplanned PDL occurs when the storage device is unexpectedly unpresented from the storage array without the unmount and detach being run on the ESXi host.

All Paths Down (APD):

  • You are unable to connect directly to the ESXi host using the vSphere Client
  • All paths to the device are marked as Dead
  • The ESXi host shows as Disconnected in vCenter Server

Storage all paths down (APD) handling is enabled by default on the ESXi host. When it is enabled, the host continues to retry non-virtual machine I/O commands to a storage device in the APD state for a limited time frame. When the time frame expires, the host stops the retry attempts and terminates any non-virtual machine I/O. You can disable the APD handling feature on your host. If you do, the host retries issued commands indefinitely in an attempt to reconnect to the APD device, and it could exceed its internal I/O timeout and become unresponsive.

You might want to increase the value of the timeout if there are storage devices connected to your ESXi host which might take longer than 140 seconds to recover from a connection loss. You can enter a value between 20 and 99999 seconds for the Misc.APDTimeout value.

  • Browse to the host in the vSphere Web Client.
  • Click the Manage tab, and click settings.
  • Under System, click Advanced System Settings.
  • Under Advanced System Settings select the Misc.APDHandlingEnable parameter and click the Edit icon.
  • Change the value to 0 to disable APD handling.
  • Edit the Misc.APDTimeout value if desired.
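
If you script changes to the timeout, it is worth checking the value against the allowed range before applying it. Here is a minimal Python sketch of such a check, using the 20 to 99999 second range and the 140 second figure mentioned above. The function and constant names are hypothetical, and the sketch only validates the number; it does not change any host setting.

```python
APD_TIMEOUT_MIN = 20       # lowest accepted value, in seconds
APD_TIMEOUT_MAX = 99999    # highest accepted value, in seconds
APD_TIMEOUT_DEFAULT = 140  # value mentioned above as the usual timeout


def validate_apd_timeout(seconds):
    """Return the value if it is acceptable, otherwise raise ValueError."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise ValueError("timeout must be a whole number of seconds")
    if not APD_TIMEOUT_MIN <= seconds <= APD_TIMEOUT_MAX:
        raise ValueError(
            "timeout must be between {} and {} seconds".format(
                APD_TIMEOUT_MIN, APD_TIMEOUT_MAX
            )
        )
    return seconds


print(validate_apd_timeout(300))
```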

Storage Troubleshooting Scenario #4 – vSAN Troubleshooting

Before you begin, it is important to realize that vSAN is a software-based storage product that is entirely dependent on the proper functioning of the underlying hardware components, such as the network, the storage I/O controller, and the individual storage devices. You always need to follow the vSAN Compatibility Guide for all deployments.

Many vSAN errors can be traced back to faulty VMkernel ports, mismatched MTU sizes, etc. It’s far more than simple TCP/IP.

Some of the tools you can use to troubleshoot vSAN are:

  • vSphere Web Client
    • The primary tool to troubleshoot vSAN.
    • Provides overviews of individual virtual machine performance.
    • Can inspect underlying disk devices and how they are being used by vSAN.
  • esxcli vsan
    • Get information and manage the vSAN cluster.
    • Clear vSAN network configuration.
    • Verify which VMkernel network adapters are used for vSAN communication.
    • List the vSAN storage configuration.
  • Ruby vSphere Console
    • Fully implemented since vSphere 5.5
    • Commands to apply licenses, check limits, check state, change auto claim mechanisms, etc.
  • vSAN Observer
    • This tool is included within the Ruby vSphere Console.
    • Can be used for performance troubleshooting and examined from many different metrics like CPU, Memory or disks.
  • Third Party Tools

Posted in Troubleshooting, vSphere

vSphere Troubleshooting Series: Part 4 – Virtual Machine Troubleshooting

Virtual Machine Troubleshooting

Before we jump into troubleshooting virtual machines, let’s review some of the typical virtual machine files you will run into.

Typical Virtual Machine Files
File Usage Description
.vmx <VM-Name>.vmx Virtual machine configuration file
.vmxf <VM-Name>.vmxf Extended configuration file
.vmdk <VM-Name>.vmdk Virtual disk characteristics
-flat.vmdk <VM-Name>-flat.vmdk Virtual machine data disk
.nvram <VM-Name>.nvram Virtual machine BIOS or EFI configuration
.vmsd <VM-Name>.vmsd Virtual machine snapshot database file
.vmsn <VM-Name>.vmsn Virtual machine snapshot data file
.vswp <VM-Name>.vswp Virtual machine swap file
.vmss <VM-Name>.vmss Virtual machine suspend file
.log vmware.log Current virtual machine log file
-#.log vmware-#.log Older virtual machine log entries
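
When you are looking through a VM folder, the table above can be turned into a quick lookup. The Python sketch below maps a file name to the description from the table. The `describe` helper is hypothetical, and the patterns cover only the file types listed here.

```python
import re

# Ordered most specific first, so "-flat.vmdk" wins over ".vmdk"
# and ".vmxf" is checked before ".vmx".
FILE_TYPES = [
    (re.compile(r"-flat\.vmdk$"), "Virtual machine data disk"),
    (re.compile(r"\.vmdk$"), "Virtual disk characteristics"),
    (re.compile(r"\.vmxf$"), "Extended configuration file"),
    (re.compile(r"\.vmx$"), "Virtual machine configuration file"),
    (re.compile(r"\.nvram$"), "Virtual machine BIOS or EFI configuration"),
    (re.compile(r"\.vmsd$"), "Virtual machine snapshot database file"),
    (re.compile(r"\.vmsn$"), "Virtual machine snapshot data file"),
    (re.compile(r"\.vswp$"), "Virtual machine swap file"),
    (re.compile(r"\.vmss$"), "Virtual machine suspend file"),
    (re.compile(r"^vmware-\d+\.log$"), "Older virtual machine log entries"),
    (re.compile(r"^vmware\.log$"), "Current virtual machine log file"),
]


def describe(filename):
    """Return the table description for a VM file name, or 'Unknown'."""
    for pattern, description in FILE_TYPES:
        if pattern.search(filename):
            return description
    return "Unknown"


for name in ("web01.vmx", "web01-flat.vmdk", "vmware-3.log", "notes.txt"):
    print(name, "->", describe(name))
```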

Virtual Machine Troubleshooting Scenario #1 – Content ID Mismatch

One of the most frustrating issues that comes up with VMs can be snapshots. In fact, our first few virtual machine troubleshooting scenarios will be focused on snapshots. Occasionally you will receive a content ID mismatch error like the one below.

Cannot open the disk ‘/vmfs/volumes/4a496b4g-eceda1-19-542b-000cfc0097g5/virtualmachine/virtualmachine-000002.vmdk’ or one of the snapshot disks it depends on. Reason: The parent virtual disk has been modified since the child was created.

Content ID mismatch conditions are triggered by interruptions to major virtual machine migrations such as Storage vMotion or Migration, VMware software error, or user action.

The Content ID (CID) value of a virtual machine disk descriptor file aids in the goal of ensuring content in a parent virtual disk file, such as a flat or base disk, is retained in a consistent state. The child delta disks that derive from that base disk’s snapshot contain all further writes and changes. These changes depend on the source disk to remain intact.

To resolve, open the latest vmware.log and locate the specific disk chain affected. You will see a line or warning similar to: Content ID mismatch (parentCID ed06b3ce != 0cb205b1)

In this case, change the parentCID in the affected disk descriptor file from ed06b3ce to 0cb205b1. Then overwrite the existing vmdk descriptor file with the edited copy and power the machine back on.
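
The edit described above can be sketched in Python as follows. This is only an illustration: the sample descriptor is made up, the function name is hypothetical, and the code works on text in memory rather than touching files. Back up the real descriptor file before changing it.

```python
import re

MISMATCH_RE = re.compile(
    r"Content ID mismatch \(parentCID ([0-9a-fA-F]{8}) != ([0-9a-fA-F]{8})\)"
)
PARENT_CID_RE = re.compile(r"^parentCID=([0-9a-fA-F]{8})[ \t]*$", re.MULTILINE)


def fix_parent_cid(log_line, descriptor_text):
    """Return descriptor text with parentCID set to the value the log expects."""
    match = MISMATCH_RE.search(log_line)
    if match is None:
        raise ValueError("no Content ID mismatch found in log line")
    stale, expected = match.group(1), match.group(2)

    current = PARENT_CID_RE.search(descriptor_text)
    if current is None or current.group(1).lower() != stale.lower():
        # Refuse to edit a descriptor that does not match the logged error.
        raise ValueError("descriptor parentCID does not match the logged value")
    return PARENT_CID_RE.sub("parentCID=" + expected, descriptor_text, count=1)


LOG_LINE = "Content ID mismatch (parentCID ed06b3ce != 0cb205b1)"

# Made-up descriptor contents for illustration.
DESCRIPTOR = """\
# Disk DescriptorFile
version=1
CID=a1b2c3d4
parentCID=ed06b3ce
createType="vmfsSparse"
parentFileNameHint="virtualmachine-000001.vmdk"
"""

print(fix_parent_cid(LOG_LINE, DESCRIPTOR))
```

Checking that the descriptor really contains the stale value before rewriting it guards against editing the wrong disk in the chain.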

Virtual Machine Troubleshooting Scenario #2 – Snapshot Issues

Taking a snapshot of a machine fails. The user cannot create or commit a snapshot to the VM. Typical errors will say something like:

  • Cannot create a quiesced snapshot because the snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.
  • An error occurred while quiescing the virtual machine. The error code was: 4 The error was: Quiesce aborted.

Quiescing is done by two technologies.

  • Microsoft Volume Shadow Copy Service
  • VMware Tools Sync Driver

When taking snapshots be sure the following occur:

  • VSS prerequisites are met. (See VMware KB 1007696)
  • VMware Tools is running.
  • The VSS provider is used.
  • None of the VSS writers are showing errors.

When taking snapshots, be sure you do not reach 32 levels. Once you reach that limit, you cannot create more snapshots. Generally, it’s a recommended practice to keep as few snapshots as possible on a virtual machine. They can be a performance hit and difficult to troubleshoot.

If snapshot creation still fails, check that the user has permission to take a snapshot. Then check that the disk is supported. RDMs in physical mode, independent disks, and VMs with bus sharing are not supported.

Snapshots will grow based on delta files. You cannot create or commit a snapshot if a snapshot (delta) does not have a descriptor file.

Additional Snapshot Machine Files
<vm name>-00000n-delta.vmdk A delta vmdk is created whenever a snapshot is taken. The pre-snapshot vmdk in use is locked for writing. Any changes from there on are written to the vm’s delta disk. This allows a vm to be restored to any state prior to a specific snapshot being taken.
<vm name>-00000n.vmdk The descriptor file for the delta vmdk file.

If the –delta.vmdk has no descriptor file, you will need to create one before doing anything:

    1. Copy the base disk descriptor file and give the copy the name of the missing descriptor file.
    2. Edit the new descriptor file. Change the format from a base disk to a snapshot delta disk descriptor.

Another possible issue when troubleshooting snapshots is insufficient space on a datastore to commit all the snapshots. Be sure to check the Summary tab of your datastore or run the command “df -h” to determine if you have enough space. If you do not, you’ll need to increase the size of the datastore or move virtual machines to other datastores with enough space.

Virtual Machine Troubleshooting Scenario #3 – Virtual Machine Power On Issues

Typically when a virtual machine does not power on, it is recommended to start by creating a test virtual machine and powering it on. Does the test VM power on successfully? If the test VM does not power on, check your ESXi host resources to make sure sufficient resources exist and that the host is responsive. If the test VM does power on, the problem is more than likely specific to the individual virtual machine. The log files are the best place to start from there.

Browse to the location of the VM and determine that all the virtual machine files are there. Look for vmx, vmdks, etc. Restore the file if you see anything missing.

A virtual machine will also not power on if one of the virtual machine’s files is locked.

Perform these steps to find a locked file:

  1. Power on a virtual machine.
    • If the power-on fails, look for the affected file.
  2. Determine whether the file can be locked.
    • touch filename
  3. Determine which ESXi host has locked the file.
    • vmkfstools -D /vmfs/volumes/Shared/VM02/VM02-flat.vmdk
    • Check the owner field in the output. Its last segment is the MAC address of the host that holds the lock.
    • If you see all zeros for the owner that means the owner is the current ESXi server.
  4. Login to the host that has the locked file and identify the process.
  5. Kill the process that is locking the file.
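
To make step 3 less error-prone, the owner field can be pulled out of the vmkfstools -D output with a short script. The Python sketch below assumes the output contains a field of the form "owner xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx" whose last segment is the MAC address. The sample line and the function name are illustrative only.

```python
import re

OWNER_RE = re.compile(
    r"owner ([0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f]{12}))"
)


def lock_owner_mac(output):
    """Return the lock owner's MAC address.

    Returns None when the owner is all zeros, which the steps above
    treat as the current ESXi server.
    """
    match = OWNER_RE.search(output.lower())
    if match is None:
        raise ValueError("no owner field found")
    owner, mac_hex = match.group(1), match.group(2)
    if set(owner.replace("-", "")) == {"0"}:
        return None
    # Format the last 12 hex digits as a colon-separated MAC address.
    return ":".join(mac_hex[i:i + 2] for i in range(0, 12, 2))


# Illustrative line, not captured from a real host.
SAMPLE = "gen 532, mode 1, owner 45feb537-9c52009b-e812-00137266e200 mtime 1174669462"
print(lock_owner_mac(SAMPLE))
```

You can then match the returned MAC address against the management NICs of your hosts to find the one to log in to.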

Virtual Machine Troubleshooting Scenario #4 – Orphaned Virtual Machines

When a virtual machine is orphaned, you should begin by trying to determine if a vCenter reboot has occurred. Occasionally, if vCenter is rebooted while a machine is being moved to another host through a vMotion migration, the machine can become orphaned. Virtual machines can also become orphaned if a host failover is unsuccessful, or when the virtual machine is unregistered directly on the host. Some additional symptoms:

  • Virtual Machines show as invalid or orphaned after a VMware High Availability (VMware HA) host failure occurs
  • Virtual Machines show as invalid or orphaned after an ESX host comes out of maintenance mode
  • Virtual Machines show as invalid or orphaned after a failed DRS migration
  • Virtual Machines show as invalid or orphaned after a storage failure
  • Virtual Machines show as invalid or orphaned after the connection is lost between the vCenter Server and the host where the virtual machine resides

To fix, follow the steps below:

  1. Determine the datastore where the virtual machine configuration (.vmx) file is located.
  2. Return to the virtual machine in the vSphere Web Client, right-click, and select:
    1. All Virtual Infrastructure Actions > Remove from Inventory.
  3. Click Yes to confirm the removal of the virtual machine.

If you were looking to recreate and not just remove the virtual machine try the following:

  1. Browse to the datastore and verify that the virtual machine files exist.
  2. If the vmx configuration file was deleted or removed and the disk files are still there, attach the old disk files to a newly created machine.
  3. If the disk files were deleted, restore from a backup.

Next post in this series: http://www.ryanbirk.com/vsphere-troubleshooting-series-part-5-storage-troubleshooting/

Posted in Troubleshooting

Vembu Webinar: How to address Data Center Challenges

Hi everyone, here is a webinar link that one of my blog sponsors Vembu is running.

The webinar will run on July 4th at 12:00pm GST.

  • How fast can I recover the data center during the disaster?
  • How can I avoid data loss during DR?
  • How can I rely on my backups?
  • Exponential data growth
  • Migration Plans – P2V and V2V

There also will be discussion on use-cases that have been developed with Vembu.

Posted in Backup, Webinars

vSphere Troubleshooting Series: Part 3 – vSphere Installation Troubleshooting

This part of the series will cover some of the common issues with vSphere deployments. We will split this section into two sections. The first will cover ESXi host troubleshooting during installation, the second will cover vCenter deployments at installation.

ESXi Host Troubleshooting

It’s common to think that ESXi will just install on any hardware, but it’s important to know a few details before you decide to get started. First, VMware will only support hardware that is listed on the VMware Hardware Compatibility List. Specific drivers are tested and chosen. If it’s not on the list, don’t expect support. VMware has a large partner ecosystem, and both hardware and software go through rigorous testing and are signed off on for official support.

VMware also has various community driver support. What this means is that even though your hardware can work with ESXi, it’s not running in a fully supported mode. This is a nice feature for users who build homelabs for practice.

Another important note to remember during installation is that not all of your drivers might install automatically. It’s possible that your hardware could be newer and you might have to download a vSphere Installation Bundle, also called a VIB. A VIB is somewhat like a tarball or ZIP archive in that it is a collection of files packaged into a single archive to make software deployments easier.

A VIB is comprised of three parts:

  • A file archive
  • An XML descriptor file
  • A signature file

The signature file is the electronic signature used to verify the level of trust. The trust level will be one of the four listed below:

  • VMwareCertified:  VIBs created and tested by VMware.  VMware Certified VIBs undergo thorough testing by VMware.
  • VMwareAccepted:  VIBs created by VMware partners that are approved by VMware.  VMware relies on partners to perform the testing, but VMware verifies the results.
  • PartnerSupported:  VIBs created and tested by a trusted VMware partner.  The partner performs all testing.  VMware does not verify the results.
  • CommunitySupported:  VIBs created by individuals or partners outside of the VMware partner program.  These VIBs do not undergo any VMware or trusted partner testing and are not supported by VMware or its partners.
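
The four levels above form an ordering from least to most trusted, which can be captured in a small Python sketch. The idea of comparing a VIB against a minimum level that you choose is an assumption made for illustration, and the names `AcceptanceLevel` and `meets_minimum` are hypothetical.

```python
from enum import IntEnum


class AcceptanceLevel(IntEnum):
    """Trust levels from the list above, least trusted first."""
    CommunitySupported = 1
    PartnerSupported = 2
    VMwareAccepted = 3
    VMwareCertified = 4


def meets_minimum(vib_level, minimum_level):
    """Return True if a VIB's level is at least as trusted as the minimum."""
    return AcceptanceLevel[vib_level] >= AcceptanceLevel[minimum_level]


print(meets_minimum("VMwareCertified", "PartnerSupported"))
print(meets_minimum("CommunitySupported", "PartnerSupported"))
```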

If installation was successful and you have all the right VIBs and software configured, but other issues have come up, you should always check the hostd.log file first. The hostd management service is the main communication channel between ESXi hosts and VMkernel. If hostd fails, the ESXi host disconnects from vCenter and cannot be easily managed.

  • Try restarting hostd by running /etc/init.d/hostd restart

Occasionally, an ESXi host will crash and display a purple diagnostic screen (PSOD). A host can crash for several reasons: CPU exceptions, driver issues, machine check exceptions, a hardware fault, or a software bug.

To recover from a PSOD, you should try following these steps:

  1. Take a screenshot of the screen.
  2. Restart the host, get the VMs up and running on another host if possible. If using HA, this should happen on its own if configured properly.
  3. Contact VMware support if you can’t find any information online. Occasionally others have the same issue and the fix can be implemented easily through firmware or software updates.

Another possible issue is that the ESXi host simply hangs during the boot process. You never get a PSOD, it just sits there and the entire system becomes unresponsive. Typically hangs happen during a power cycle of a system during the boot process. It’s caused by VMkernel being too busy or a possible hardware lockup.

To determine that the host has locked up:

  1. Ping the VMkernel (Management) network interface.
  2. Try to login to the host with the client.
  3. Monitor network traffic from the ESXi host.

If you can ping the host, that’s a good sign. Next connect to the DCUI to display any messages on the screen. Press Alt-F12 at the host console to do that.

To recover from a host that has hung, try rebooting the ESXi host, review logs and gather performance statistics. If you determine it’s a hardware issue, fix the hardware and if required reinstall or reconfigure ESXi. Lastly update the host with the most recent patches.

vCenter Troubleshooting

When installing the vCSA, VMware has split the install into two different stages. Stage 1 is the appliance deployment. Stage 2 is the configuration of the appliance.

Stage 1 is, in most cases, a very straightforward install. Stage 2 is where users have traditionally had issues with deployments, and these can generally be resolved by verifying your DNS and NTP settings.

Some deployments seem successful but upon login, the authentication fails if the NTP server on the ESXi host and the newly created appliance are not synced to the same source.

Occasionally you might run into issues replacing certificates with the Certificate Manager. It can hang at 0% and then perform an automatic rollback. This issue can be caused by using non-Base64 certificates. To resolve, manually publish the full chain to the certificate store.

Next post in this series: vSphere Troubleshooting Part #4 – Virtual Machine Troubleshooting

Posted in Troubleshooting, vSphere

vSphere Troubleshooting Series: Part 2 – vSphere Troubleshooting Tools

Before you can troubleshoot issues, you need to understand the various tools that are out there. In this section, we will discuss some of the tools that VMware provides and how to identify the log locations for additional troubleshooting.

VMware Command Line Tools

You can run command-line tools on an ESXi host in several ways:

  • The vSphere ESXi shell itself, which includes:
    • esxcli commands (esxcli network, esxcli storage, esxcli vm, etc)
    • A set of esxcfg-* commands: The esxcfg commands are deprecated but you will likely still see some older documentation with them. The recommendation today is to use esxcli.
    • The host shell can be accessed a couple of different ways, either by using the local DCUI (Direct Console User Interface) or via SSH.
      • Local access by using the Direct Console User Interface (DCUI):
        1. Enable the vSphere ESXi Shell service, either in the DCUI or vSphere Web Client. This service is disabled by default.
        2. Access the ESXi Shell from the DCUI by pressing Alt-F1 after logging in.
        3. When finished, disable the ESXi Shell service.
      • Remote access by using PuTTY or an SSH client.
        1. Enable the SSH service on your ESXi host, either in the DCUI or through the vSphere Web Client.
        2. Use PuTTY or your preferred SSH Client to access the ESXi host.
        3. Disable the SSH Service when finished.
  • vSphere Management Assistant (This tool has been deprecated. 6.5 is final release):
    • A virtual appliance that includes components for running vSphere commands:
    • esxcli
    • vmware-cmd
    • vicfg-* commands
    • vi-fastpass authentication components for automated authentication to vCenter or ESXi hosts. This saves you from having to type your name and password with every command that is run.
  • VMware PowerCLI:
    • VMware PowerCLI provides an easy-to-use Windows PowerShell interface for command-line access to administration tasks or for creating executable scripts.

VMware Log Locations for Troubleshooting

VMware stores logs for its products in various locations. It’s important to know where to look so you can find problems quickly and efficiently.

  • vCenter Log Locations:
    • Location for vCenter Server on Windows 2008/2012:
      • %ALLUSERSPROFILE%\VMWare\vCenterServer\logs
    • Location for VMware vCenter Server Appliance:
      • /var/log/vmware/
        • Includes logs for SSO, Inventory Service and the Web Client.
      • Useful ESXi Host Logs:
        • log: Host management service logs.
        • log: Service initialization, watchdogs, scheduled tasks, DCUI.
        • log: Core VMkernel logs. Storage and Networking device events.
        • log: Warning and alert log messages.
        • log: ESXi host startup and shutdown, uptime details, resource consumption.
      • vCenter vpxd.log
        • This log file is the main vCenter Server log file. If you ever contact VMware for support, it is highly likely that they will ask you for this file. Don’t confuse this with vpxa, that is the vCenter agent and runs on the ESXi hosts.
        • You can monitor and view the logs easily through the vSphere Web Client, under the Monitor tab (Figure 1), with an SSH session at /var/log (Figure 2) or in the DCUI under “View System Logs” under System Customization (Figure 3).

The vSphere Syslog Collector

You can gather logs at the above locations or set up a single location for all of your ESXi hosts to point to. The Syslog Collector uses port 514 for TCP and UDP, and port 1514 for SSL. It is installed on both the Windows-based vCenter and the vCenter Appliance.
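
When pointing many hosts at a collector, it helps to build the target string consistently. The Python sketch below uses the ports listed above and assumes a protocol://host:port format for the remote log destination. That format, the function name, and the host name are assumptions, so check them against your own host configuration.

```python
# Default ports from the paragraph above.
DEFAULT_PORTS = {"udp": 514, "tcp": 514, "ssl": 1514}


def syslog_target(host, protocol="udp", port=None):
    """Build a protocol://host:port string for a remote syslog destination."""
    protocol = protocol.lower()
    if protocol not in DEFAULT_PORTS:
        raise ValueError(
            "protocol must be one of: " + ", ".join(sorted(DEFAULT_PORTS))
        )
    return "{}://{}:{}".format(protocol, host, port or DEFAULT_PORTS[protocol])


# Hypothetical collector host name.
print(syslog_target("syslog.example.com"))
print(syslog_target("syslog.example.com", "ssl"))
```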

The vm-support Command

In addition to the Syslog Collector, you can also gather logs with the vm-support command. It collects data from the ESXi hosts and compresses the following data:

  • Log files
  • System status
  • Configuration files

The tool does not require any arguments, and it creates a compressed archive named using the host name and time stamp.

The vCenter Bash Shell

You can also use the vCenter Bash Shell from the vCenter Appliance console under troubleshooting options. From the Bash shell, you can verify the status of a service and start or restart services.

Part 3 of this troubleshooting series can be found here: http://www.ryanbirk.com/vsphere-troubleshooting-series-part-3-vsphere-installation-troubleshooting/

Posted in Troubleshooting, vSphere

What’s New in Performance: vSphere 6.5

Underlying each release of VMware vSphere are many performance and scalability improvements. The VMware vSphere 6.5 platform continues to provide industry-leading performance and features to ensure the successful virtualization and management of your entire software-defined datacenter.

This whitepaper is broken into various subsections that show increases and improvements around performance.

Posted in vSphere 6.5, Whitepapers