What’s New in vSphere 6.7: Whitepaper

VMware vSphere 6.7 delivers key capabilities to enable IT organizations to address the following notable trends that are putting new demands on their IT infrastructure:

  • Explosive growth in quantity and variety of applications, from business-critical applications to new intelligent workloads
  • Rapid increase in hybrid cloud environments and use cases
  • Global expansion of on-premises data centers, including at the edge
  • Heightened importance of security relating to infrastructure and applications

Download the Technical White Paper: What’s New in vSphere 6.7

Posted in vSphere, Whitepapers

Altaro VM Backup: 7.6 Review

Hi everyone. I wanted to get a quick post out there about one of my blog sponsors, Altaro. They’re a great partner of mine and I also happen to write content for their VMware blog over here. With that little tidbit out of the way, let’s get to the good stuff. They have released a new 7.6 version and I thought I’d write a bit about some of my favorite new features.

  • With Altaro VM Backup 7.6, users can switch from running daily backups to a continuous data protection model yielding an improved Recovery Point Objective (RPO) of up to 5 minutes.
  • Altaro VM Backup 7.6 introduces Grandfather-Father-Son (GFS) archiving, enabling users to archive backup versions over and above their continuous and daily backups instead of deleting them (local backups only). Now you can easily set up separate backup cycles to store a new backup version every week, every month and every year.
  • In previous Altaro VM Backup versions, only one operation could be performed on a virtual machine at a time. This caused the following pain points:
    • If a retention policy took a long time to complete, backup and restore operations were queued until retention was complete.
    • If an Offsite Copy to Azure took days to complete, especially for the initial backup, backup and restore operations for that VM were queued until it was complete.
    • If a Restore, File Level Restore or Boot from Backup operation was active, no backups for that virtual machine could take place until it completed.

    Each of these limitations has been addressed in v7.6, allowing users to restore and take Offsite Copies without delaying any scheduled or CDP backups.

Altaro is still very competitive on price for the feature set you get. Pricing is per host, with unlimited sockets. You can check their pricing calculator here and see for yourself.

I would also recommend checking out the video Andy Syrewicze did that demos some of the new v7 features. Myself, I find the interface very easy to use and to set up. I had no trouble navigating the client and setting things up without having to read the entire manual. 🙂 Within 15 minutes I had multiple hosts and VMs set up and backup jobs running. I have also tested the restore and sandbox functions and they have worked each time I have tried them. My upgrade was very smooth as well.

Posted in Backup

vSphere Troubleshooting Series: Part 6 – Network Troubleshooting

In vSphere, networking problems can occur at many different levels. It is important to know which level to start with. Is it a virtual machine problem or a host problem? Did the issue arise when you migrated the machine to a new host?

  • Virtual switch connectivity can be managed in two ways:
    • Standard switches
    • Distributed switches

You also must determine if it’s a virtual machine or a host management issue.

Network Troubleshooting Scenario #1 – No network connectivity to other systems.

One of the first things you need to do is a simple ping. From the ESXi host, ping a system that you know is up and that should be reachable from the host.

Starting at the ESXi host, verify these possible configuration problems:

  • Does the ESXi host network configuration appear correct? IP, subnet mask, gateway?
  • Is the uplink plugged in? Yes, that had to be said.
    • esxcli network nic list
  • If using VLANs, does the VLAN ID of the port group look correct?
    • esxcli network vswitch standard portgroup list
  • Check the trunk port configuration on the switch. Have there been any recent changes?
  • Does the physical uplink adapter have all settings configured properly? (speed, duplex, etc.)
    • vicfg-nics -d duplex -s speed vmnic#
  • If using NIC teaming, is it set up and configured properly?
  • Are you using supported hardware? Any driver issues?
  • If all of the above check out, verify that you don’t have a physical adapter failure.
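
The first item in the list above (IP, subnet mask, gateway) is easy to get wrong by a single digit. Here is a minimal Python sketch of that check, assuming you have copied the three values off the host by hand. The function name and the addresses are made up for illustration and are not part of any VMware tooling.

```python
import ipaddress


def gateway_in_subnet(ip: str, netmask: str, gateway: str) -> bool:
    """Return True when the gateway sits inside the host's own subnet."""
    # strict=False lets us pass a host address instead of the network address.
    network = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    return ipaddress.ip_address(gateway) in network


if __name__ == "__main__":
    # Hypothetical management network settings, for illustration only.
    print(gateway_in_subnet("192.168.10.25", "255.255.255.0", "192.168.10.1"))
    print(gateway_in_subnet("192.168.10.25", "255.255.255.0", "192.168.20.1"))
```

If the function returns False, the host cannot reach its default gateway directly, which usually points to a typo in one of the three values.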

If you recently moved the VM to a new host, also verify that an equivalent port group exists on the host and that the network adapter is connected in the virtual machine settings. The firewall in the guest operating system might be blocking traffic. Ensure that the firewall does not block required ports.

Network Troubleshooting Scenario #2 – ESXi hosts dropping from vCenter

Occasionally an ESXi host is added to the vCenter Server inventory with no issues at all, but disconnects 60 seconds after the task completes.

Typically, this issue is because of lost heartbeat packets between vCenter (vpxd) and an ESXi host (vpxa).

The first thing you should check is that no firewall is in place blocking the vCenter communication ports. Then verify that network congestion is not occurring on the network. This issue is more prevalent with Windows based vCenter systems.

Adjust the Windows Firewall settings:

  • If ports are not configured, disable Windows Firewall.
  • If the firewall is configured with the proper ports, ensure that Windows Firewall is not blocking UDP port 902.

By default vpxa uses UDP port 902, but it is possible to change the ports to something else. Check the /etc/vmware/vpxa/vpxa.cfg file <ServerPort> setting.
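
If you would rather not eyeball the file, the lookup can be sketched in a few lines of Python. This is only a sketch: the XML snippet below is a hypothetical, heavily trimmed stand-in for vpxa.cfg (a real file has many more settings), and the tag name is matched case-insensitively because I am not assuming the exact capitalization.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for /etc/vmware/vpxa/vpxa.cfg.
SAMPLE_CFG = """<config>
  <vpxa>
    <serverIp>192.0.2.10</serverIp>
    <serverPort>902</serverPort>
  </vpxa>
</config>"""


def find_server_port(cfg_text: str):
    """Return the configured server port as an int, or None if it is absent."""
    root = ET.fromstring(cfg_text)
    for element in root.iter():
        # Case-insensitive match so ServerPort and serverPort both work.
        if element.tag.lower() == "serverport" and element.text:
            return int(element.text.strip())
    return None


if __name__ == "__main__":
    port = find_server_port(SAMPLE_CFG)
    print("heartbeat port:", port, "(default)" if port == 902 else "(changed)")
```

Whatever port comes back is the one your firewall rules need to allow.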

When it comes to network congestion, dropped heartbeats can happen as well. Some tools you can use to troubleshoot:

  • You can use the resxtop utility or graphical views to analyze traffic.
  • The pktcap-uw command is an enhanced packet capture and analysis tool.
    • pktcap-uw is unidirectional and defaults to the inbound direction only.
    • Direction of traffic is specified using --dir 0 for inbound and --dir 1 for outbound.
    • Two (or more) separate traces can be run in parallel but need to be merged later in Wireshark.
  • Wireshark
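
Since each pktcap-uw trace covers one direction, you end up running two captures side by side. The Python sketch below only builds the two command lines and does not run anything. The uplink name and output paths are placeholders I picked for the example.

```python
def capture_command(uplink: str, direction: int, outfile: str) -> list:
    """Build the argument list for a one-direction pktcap-uw capture."""
    if direction not in (0, 1):
        raise ValueError("direction must be 0 (inbound) or 1 (outbound)")
    return ["pktcap-uw", "--uplink", uplink, "--dir", str(direction), "-o", outfile]


def both_directions(uplink: str, prefix: str) -> list:
    """Return one command per direction, each writing to its own pcap file."""
    return [
        capture_command(uplink, 0, f"{prefix}-in.pcap"),
        capture_command(uplink, 1, f"{prefix}-out.pcap"),
    ]


if __name__ == "__main__":
    # vmnic0 and /tmp/vmnic0 are placeholder values.
    for argv in both_directions("vmnic0", "/tmp/vmnic0"):
        print(" ".join(argv))
```

Once both files are copied off the host, they can be combined into a single trace, for example with the mergecap tool that ships with Wireshark.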

Network Troubleshooting Scenario #3 – No Management Connectivity on ESXi Host

VMware Management networks are configured using VMkernel port groups. Typically, when a host loses connectivity to vCenter and was working prior, a recent change to that port group has caused the issue.

One VMware feature that helps in this case is network rollback. Several different types of events can trigger a network rollback:

  • Updating the speed or duplex of a physical NIC
  • Updating teaming and failover for the management VMkernel adapter
  • Updating DNS and routing settings on the ESXi host
  • Changing the IP settings of a management VMkernel adapter

If any of the above are changed and it fails, the host rolls back to the last known good configuration.

You can also restore the network configuration from the DCUI. Select “Network Restore Options” and you can choose to restore either standard switches or distributed switches. If you’re looking to start with a new configuration, the Restore Network Settings option deletes all the current network settings except for the Management network.

Posted in Troubleshooting

Altaro Webinar: 5 Performance-boosting vSphere Features You’re Missing Out On

Altaro is a blog sponsor of mine and occasionally I work with them on VMware webinars. We’d be happy to have you join us!

When: Tuesday Sep 19th 2017

Are you running your vSphere environment to its full potential? Have you overlooked features you already have access to but didn’t know could make a major difference?

Not sure?

Many organizations make use of only the most commonly used features in vSphere such as vMotion, HA, and DRS, but there are many ways to get more performance out of your setup. Even if you’re part of a small or medium-sized organization, these performance-boosters can significantly enhance your IT productivity.

This is also not to mention that you’ll likely want to fully leverage your investment in the vSphere platform. You wouldn’t buy a supercar and only stay in first gear, would you?

With that idea in mind, join us for our upcoming webinar, and learn from VMware vExperts Ryan Birk and Andy Syrewicze who will show you how to use some of the lesser-known features and capabilities of vSphere to unleash your full potential.

In this webinar you’ll learn about:

  • How to leverage the full power of vSphere
  • Lesser known features that can bring great benefits
  • Best practices for the features mentioned

At the end of this session we’ll also run a Q & A on the topic where you can ask Ryan and Andy your questions!

For more info and to register, check out the registration link: https://www.altaro.com/vmware-backup/webinars/5-vsphere-features.php

Posted in Webinars

vSphere Troubleshooting Series: Part 5 – Storage Troubleshooting

If a virtual machine cannot access its virtual disks, the cause of the problem might be anywhere from the virtual machine to physical storage.

There are multiple types of storage, so it’s important to determine what type you’re troubleshooting before starting. A “datastore” can be multiple things with different types of connectivity.

Storage Troubleshooting Scenario #1 – Storage is not reachable from ESXi host.

This problem typically will be noticed when a datastore falls offline. The ESXi host itself appears fine but something has caused the datastore to fall offline.

Typically, the best method to start with would be:

  • Verify that the ESXi host can see the LUN by running: “esxcli storage core path list” from the host.
  • Check to see if a rescan of storage resolves it by running: “esxcli storage core adapter rescan -A vmhba##”

If the rescan does not resolve it, it is likely that something else is causing the issue. Have there been any other recent changes to the ESXi host?

Some other things to check:

  • Is a firewall blocking traffic?
  • Is the VMkernel interface misconfigured?
  • Is IP storage configured properly?
  • Is NFS storage configured properly?
  • Is the iSCSI port (3260) reachable?
  • Is the actual storage device functioning properly?
  • Is LUN masking in place? Is the LUN still presented?
  • Is the array supported?

Check your adapter settings. Are the network port bindings set up properly? Is the target name spelled properly? Is the initiator name correct? Are there any required CHAP settings? Do you see your storage devices under the devices tab?

If the storage device is online but functioning poorly, check your latency metrics as well. Your goal is to not oversubscribe your links. Try to isolate iSCSI and NFS.

Use the esxtop or resxtop command and press d:

Column  Description
CMDS/s This is the total number of commands per second and includes IOPS (Input/Output Operations Per Second) and other SCSI commands such as SCSI reservations, locks, vendor string requests, unit attention commands etc. being sent to or coming from the device or virtual machine being monitored.
DAVG/cmd This is the average response time in milliseconds per command being sent to the device.
KAVG/cmd This is the amount of time the command spends in the VMkernel.
GAVG/cmd This is the response time as it is perceived by the guest operating system.
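
The three latency columns are related: the latency the guest sees (GAVG) is roughly the device latency (DAVG) plus the time spent in the VMkernel (KAVG). Here is a small Python sketch of how I would read them together. The default limits of 25 ms and 2 ms are assumed rule-of-thumb values, not official thresholds, so adjust them for your environment.

```python
def guest_latency(davg_ms: float, kavg_ms: float) -> float:
    """Approximate GAVG as device latency plus VMkernel latency."""
    return davg_ms + kavg_ms


def diagnose(davg_ms: float, kavg_ms: float,
             davg_limit: float = 25.0, kavg_limit: float = 2.0) -> str:
    """Name the layer that most likely causes the latency.

    The limits are assumed rule-of-thumb values, not official thresholds.
    """
    if kavg_ms > kavg_limit:
        return "kernel"  # commands are waiting inside the VMkernel
    if davg_ms > davg_limit:
        return "device"  # the HBA, the fabric, or the array is slow
    return "ok"


if __name__ == "__main__":
    # Sample readings, invented for the example.
    print(guest_latency(30.0, 0.5), diagnose(30.0, 0.5))
```

A high KAVG usually means queuing on the host, while a high DAVG points past the host toward the fabric or the array.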

Storage Troubleshooting Scenario #2 – NFS Connectivity Issues

If you have virtual machines that are on NFS datastores, verify that the configuration is correct.

  • Check NFS server name or IP address.
  • Is the ESXi host mapped to the virtual switch?
  • Does the VMkernel port have the right IP configuration?
  • On the NFS server, are the ACLs correct? (read/write or read only)

VMware supports both NFS v3 and v4.1, but it’s important to remember that they use different locking mechanisms:

  • NFS v3 uses proprietary client-side cooperative locking.
  • NFS v4.1 uses server-side locking.

Configure an NFS array to allow only one NFS protocol. Use either NFS v3 or NFS v4.1 to mount the same NFS share across all ESXi hosts. It is not a good idea to mix. Data corruption might occur if they try to access the same NFS share with different client versions.
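
To spot a mixed setup before it bites, you can compare how each host mounts each share. The Python sketch below assumes you have already collected that information into a simple list of (host, share, version) tuples. The host and share names are invented for the example.

```python
from collections import defaultdict


def mixed_version_shares(mounts):
    """Return {share: sorted versions} for shares mounted with more than one NFS version.

    mounts is an iterable of (host, share, nfs_version) tuples.
    """
    versions = defaultdict(set)
    for _host, share, version in mounts:
        versions[share].add(version)
    return {share: sorted(found) for share, found in versions.items() if len(found) > 1}


if __name__ == "__main__":
    # Hypothetical inventory gathered by hand from each host.
    inventory = [
        ("esx01", "nas01:/vol/vms", "3"),
        ("esx02", "nas01:/vol/vms", "4.1"),
        ("esx01", "nas01:/vol/iso", "3"),
        ("esx02", "nas01:/vol/iso", "3"),
    ]
    print(mixed_version_shares(inventory))
```

Any share that shows up in the result is mounted with both protocol versions and should be remounted consistently.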

NFS 4.1 also does not currently support Storage DRS, vSphere Storage I/O Control, Site Recovery Manager or Virtual Volumes.

Storage Troubleshooting Scenario #3 – One or more paths to a LUN are lost.

If an ESXi host at one point had storage connectivity but the LUN is now dead, here are a few esxcli commands to run when troubleshooting this issue.

  • esxcli storage core path list

  • esxcli storage nmp device list

A path to a storage/LUN device can be marked as Dead in these situations:

  • The ESXi storage stack determines a path is Dead due to the TEST_UNIT_READY command failing on probing
  • The ESXi storage stack marks paths as Dead after a permanent device loss (PDL)
  • The ESXi storage stack receives a Host Status of 0x1 from an HBA driver

For iSCSI storage, verify that NIC teaming is not misconfigured. Next, verify your path selection policy is set up properly.

Check for Permanent Device Loss or All Paths Down. A device can be in one of two distinct states when storage connectivity is lost: All Paths Down (APD) or Permanent Device Loss (PDL). The device is handled differently in each state. APD is a condition where all paths to the storage device are lost or the storage device is removed. It arises when the change happens in an uncontrolled manner and the VMkernel storage stack does not know how long the loss of access to the device will last. APD is treated as a temporary (transient) condition, since the storage device might come back online. If the loss is permanent, it is referred to as a Permanent Device Loss (PDL).

Permanent Device Loss (PDL):

  • A datastore is shown as unavailable in the Storage view
  • A storage adapter indicates the Operational State of the device as Lost Communication
  • A planned PDL occurs when there is an intent to remove a device presented to the ESXi host.
  • An unplanned PDL occurs when the storage device is unexpectedly unpresented from the storage array without the unmount and detach being run on the ESXi host.

All Paths Down (APD):

  • You are unable to connect directly to the ESXi host using the vSphere Client
  • All paths to the device are marked as Dead
  • The ESXi host shows as Disconnected in vCenter Server

Storage all paths down (APD) handling is enabled by default on the ESXi host. When it is enabled, the host continues to retry non-virtual-machine I/O commands to a storage device in the APD state for a limited time frame. When the time frame expires, the host stops the retry attempts and terminates any non-virtual-machine I/O. You can disable the APD handling feature on your host. If you do, the host retries issued commands indefinitely in an attempt to reconnect to the APD device, and it could exceed its internal I/O timeout and become unresponsive.

You might want to increase the value of the timeout if there are storage devices connected to your ESXi host that might take longer than 140 seconds to recover from a connection loss. You can enter a value between 20 and 99999 seconds for the Misc.APDTimeout value.

  • Browse to the host in the vSphere Web Client.
  • Click the Manage tab, and click Settings.
  • Under System, click Advanced System Settings.
  • Under Advanced System Settings select the Misc.APDHandlingEnable parameter and click the Edit icon.
  • Change the value to 0.
  • Edit the Misc.APDTimeout value if desired.
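
Before typing a new timeout into the Advanced System Settings dialog, it can help to sanity check the number. The Python sketch below encodes the range and default described above. The function names are my own, and nothing here talks to a host.

```python
APD_TIMEOUT_DEFAULT = 140  # seconds, the default mentioned above
APD_TIMEOUT_MIN = 20
APD_TIMEOUT_MAX = 99999


def validate_apd_timeout(seconds: int) -> int:
    """Return the value unchanged if it falls inside the accepted range."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("timeout must be a whole number of seconds")
    if not APD_TIMEOUT_MIN <= seconds <= APD_TIMEOUT_MAX:
        raise ValueError(
            f"timeout must be between {APD_TIMEOUT_MIN} and {APD_TIMEOUT_MAX} seconds"
        )
    return seconds


def needs_longer_timeout(expected_recovery_seconds: int,
                         current_timeout: int = APD_TIMEOUT_DEFAULT) -> bool:
    """True when the storage is expected to recover more slowly than the timeout allows."""
    return expected_recovery_seconds > current_timeout


if __name__ == "__main__":
    # 300 seconds is an invented recovery time for the example.
    print(needs_longer_timeout(300), validate_apd_timeout(300))
```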

Storage Troubleshooting Scenario #4 – vSAN Troubleshooting

Before you begin, it is important to realize that vSAN is a software-based storage product that is entirely dependent on the proper functioning of the underlying hardware components, like the network, the storage I/O controller and the individual storage devices. You always need to follow the vSAN Compatibility Guide for all deployments.

Many vSAN errors can be traced back to faulty VMkernel ports, mismatched MTU sizes, etc. It’s far more than simple TCP/IP.

Some of the tools you can use to troubleshoot vSAN are:

  • vSphere Web Client
    • The primary tool to troubleshoot vSAN.
    • Provides overviews of individual virtual machine performance.
    • Can inspect underlying disk devices and how they are being used by vSAN.
  • esxcli vsan
    • Get information and manage the vSAN cluster.
    • Clear vSAN network configuration.
    • Verify which VMkernel network adapters are used for vSAN communication.
    • List the vSAN storage configuration.
  • Ruby vSphere Console
    • Fully implemented since vSphere 5.5
    • Commands to apply licenses, check limits, check state, change auto claim mechanisms, etc.
  • vSAN Observer
    • This tool is included within the Ruby vSphere Console.
    • Can be used for performance troubleshooting and examined from many different metrics like CPU, Memory or disks.
  • Third Party Tools
Posted in Troubleshooting, vSphere

vSphere Troubleshooting Series: Part 4 – Virtual Machine Troubleshooting

Virtual Machine Troubleshooting

Before we jump into troubleshooting virtual machines, let’s review some of the typical virtual machine files you will run into.

Typical Virtual Machine Files
File Usage Description
.vmx <VM-Name>.vmx Virtual machine configuration file
.vmxf <VM-Name>.vmxf Extended configuration file
.vmdk <VM-Name>.vmdk Virtual disk characteristics
-flat.vmdk <VM-Name>-flat.vmdk Virtual machine data disk
.nvram <VM-Name>.nvram Virtual machine BIOS or EFI configuration
.vmsd <VM-Name>.vmsd Virtual machine snapshot database file
.vmsn <VM-Name>.vmsn Virtual machine snapshot data file
.vswp <VM-Name>.vswp Virtual machine swap file
.vmss <VM-Name>.vmss Virtual machine suspend file
.log vmware.log Current virtual machine log file
-#.log vmware-#.log Older virtual machine log entries

Virtual Machine Troubleshooting Scenario #1 – Content ID Mismatch

One of the most frustrating issues that comes up with VMs can be snapshots. In fact, our first few virtual machine troubleshooting scenarios will be focused on snapshots. Occasionally you will receive a content ID mismatch error like the one below.

Cannot open the disk ‘/vmfs/volumes/4a496b4g-eceda1-19-542b-000cfc0097g5/virtualmachine/virtualmachine-000002.vmdk’ or one of the snapshot disks it depends on. Reason: The parent virtual disk has been modified since the child was created.

Content ID mismatch conditions are triggered by interruptions to major virtual machine migrations such as Storage vMotion or Migration, VMware software error, or user action.

The Content ID (CID) value of a virtual machine disk descriptor file aids in the goal of ensuring content in a parent virtual disk file, such as a flat or base disk, is retained in a consistent state. The child delta disks that derive from that base disk’s snapshot contain all further writes and changes. These changes depend on the source disk to remain intact.

To resolve, open the latest vmware.log and locate the specific disk chain affected. You will see a line or warning similar to: Content ID mismatch (parentCID ed06b3ce != 0cb205b1)

In our case, change the parentCID in the child disk’s descriptor file from ed06b3ce to 0cb205b1 so that it matches the parent’s CID. Then save the descriptor over the existing vmdk file and power the machine back on.
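
The edit itself is a one-line change, so it is easy to sketch. The Python below pulls the two CIDs out of the log line and rewrites the parentCID in a descriptor held in memory. The descriptor text is a hypothetical, trimmed example (the CID and file name in it are made up), and the sketch deliberately stops short of writing any file. Always work on a backup copy of the real descriptor.

```python
import re

MISMATCH = re.compile(r"parentCID\s+([0-9a-fA-F]{8})\s*!=\s*([0-9a-fA-F]{8})")


def parse_mismatch(log_line: str):
    """Return (stale_parent_cid, actual_parent_cid) from a mismatch log line."""
    match = MISMATCH.search(log_line)
    if match is None:
        raise ValueError("no Content ID mismatch found in this line")
    return match.group(1).lower(), match.group(2).lower()


def fix_descriptor(descriptor: str, stale: str, actual: str) -> str:
    """Rewrite parentCID, but only if it still holds the stale value."""
    pattern = re.compile(rf"^parentCID={re.escape(stale)}[ \t]*$",
                         re.MULTILINE | re.IGNORECASE)
    fixed, count = pattern.subn(f"parentCID={actual}", descriptor)
    if count != 1:
        raise ValueError("expected exactly one parentCID line with the stale value")
    return fixed


# Hypothetical, trimmed child descriptor for illustration only.
SAMPLE_DESCRIPTOR = """# Disk DescriptorFile
version=1
CID=7a3b9c2d
parentCID=ed06b3ce
createType="vmfsSparse"
parentFileNameHint="virtualmachine-000001.vmdk"
"""

if __name__ == "__main__":
    stale, actual = parse_mismatch("Content ID mismatch (parentCID ed06b3ce != 0cb205b1)")
    print(fix_descriptor(SAMPLE_DESCRIPTOR, stale, actual))
```

The check on the replacement count is there on purpose: if the descriptor does not contain the stale value exactly once, you are probably looking at the wrong file in the chain.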

Virtual Machine Troubleshooting Scenario #2 – Snapshot Issues

Taking a snapshot of a machine fails. The user cannot create or commit a snapshot to the VM. Typical errors will say something like:

  • Cannot create a quiesced snapshot because the snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.
  • An error occurred while quiescing the virtual machine. The error code was: 4 The error was: Quiesce aborted.

Quiescing is done by two technologies.

  • Microsoft Volume Shadow Copy Service
  • VMware Tools Sync Driver

When taking snapshots be sure the following occur:

  • VSS prerequisites are met. (See VMware KB 1007696)
  • VMware Tools is running.
  • The VSS provider is used.
  • None of the VSS writers are showing errors.

When taking snapshots, be sure you do not reach 32 levels. Once you hit that limit, you cannot create more snapshots. Generally, it’s a recommended practice to keep as few snapshots as possible on a virtual machine. They can be a performance hit and difficult to troubleshoot.

If snapshot creation still fails, check that the user has permission to take a snapshot. Then check that the disk is supported. RDMs in physical mode, independent disks, and VMs with bus sharing are not supported.

Snapshots will grow based on delta files. You cannot create or commit a snapshot if a snapshot (delta) does not have a descriptor file.

Additional Snapshot Machine Files
<vm name>-00000n-delta.vmdk A delta vmdk is created whenever a snapshot is taken. The pre-snapshot vmdk in use is locked for writing. Any changes from there on are written to the vm’s delta disk. This allows a vm to be restored to any state prior to a specific snapshot being taken.
<vm name>-00000n.vmdk The descriptor file for the delta vmdk file.

If the -delta.vmdk has no descriptor file, you will need to create one before doing anything:

    1. Copy the base disk descriptor file, using the name of the missing descriptor file.
    2. Edit the new descriptor file. Change the format from a base disk to a snapshot delta disk descriptor.

Another possible issue that might arise when troubleshooting snapshots is insufficient space on a datastore to commit all the snapshots. Be sure to check the Summary tab of your datastore or run the command “df -h” to determine if you have enough space. If you don’t, you’ll need to increase the size of the datastore or move virtual machines to other datastores with enough space.

Virtual Machine Troubleshooting Scenario #3 – Virtual Machine Power On Issues

Typically, when a virtual machine does not power on, it is recommended to start by creating a test virtual machine and powering it on. Does the test VM power on successfully? If the test VM does not power on, check your ESXi host to make sure sufficient resources exist and that the host is responsive. If the test VM does power on, the problem is more than likely with the individual virtual machine. Log files are your best place to start from there.

Browse to the location of the VM and determine that all the virtual machine files are there. Look for vmx, vmdks, etc. Restore the file if you see anything missing.
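
Checking a folder listing by eye gets tedious with several VMs, so here is a rough Python sketch of the same check. It assumes a single virtual disk named after the VM, which is often but not always the case, and it works on a plain list of file names that you supply. The VM name and listing are invented for the example.

```python
def missing_vm_files(vm_name: str, filenames) -> list:
    """Return the core files that are absent from a VM folder listing.

    Assumes one virtual disk named after the VM, which may not match your setup.
    """
    present = set(filenames)
    required = [f"{vm_name}.vmx", f"{vm_name}.vmdk", f"{vm_name}-flat.vmdk"]
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Hypothetical folder listing copied from the datastore browser.
    listing = ["VM02.vmx", "VM02.vmdk", "VM02.nvram", "vmware.log"]
    print(missing_vm_files("VM02", listing))
```

Anything the function returns is a file you would need to restore before the VM can power on.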

A virtual machine will also not power on if one of the virtual machine’s files is locked.

Perform these steps to find a locked file:

  1. Power on a virtual machine.
    • If the power-on fails, look for the affected file.
  2. Determine whether the file can be locked.
    • touch filename
  3. Determine which ESXi host has locked the file.
    • vmkfstools -D /vmfs/volumes/Shared/VM02/VM02-flat.vmdk
    • Check the MAC address in the owner field of the output (it is the last segment of the owner value).
    • If you see all zeros for the owner that means the owner is the current ESXi server.
  4. Log in to the host that has the locked file and identify the process.
  5. Kill the process that is locking the file.
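
Step 3 is the fiddly part, because the MAC address has to be picked out of a long line of output. The Python sketch below does that extraction. The sample line is only an illustration of what the owner field tends to look like, not output captured from a real host, so treat the exact layout as an assumption.

```python
import re

OWNER = re.compile(
    r"owner\s+([0-9a-fA-F]{8}-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-([0-9a-fA-F]{12}))"
)


def lock_owner_mac(output: str):
    """Return the owning host's MAC address, or None when the owner is all zeros."""
    match = OWNER.search(output)
    if match is None:
        raise ValueError("no owner field found in the output")
    if set(match.group(1)) <= {"0", "-"}:
        return None  # all zeros: the lock is held by the current host
    mac = match.group(2).lower()
    return ":".join(mac[i:i + 2] for i in range(0, 12, 2))


if __name__ == "__main__":
    # Illustrative line only; real vmkfstools -D output contains more fields.
    sample = "gen 532, mode 1, owner 45feb537-9c52009b-e812-00137266e200 mtime 1174669462"
    print(lock_owner_mac(sample))
```

You can then match the returned MAC address against the physical NICs of your hosts to find out which one is holding the lock.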

Virtual Machine Troubleshooting Scenario #4 – Orphaned Virtual Machines

When a virtual machine is orphaned, you should begin by trying to determine if a vCenter reboot has occurred. Occasionally, if vCenter is rebooted while a vMotion migration to another host is in progress, the virtual machine can become orphaned. Virtual machines can also become orphaned if a host failover is unsuccessful, or when the virtual machine is unregistered directly on the host. Some additional symptoms:

  • Virtual Machines show as invalid or orphaned after a VMware High Availability (VMware HA) host failure occurs
  • Virtual Machines show as invalid or orphaned after an ESX host comes out of maintenance mode
  • Virtual Machines show as invalid or orphaned after a failed DRS migration
  • Virtual Machines show as invalid or orphaned after a storage failure
  • Virtual Machines show as invalid or orphaned after the connection is lost between the vCenter Server and the host where the virtual machine resides

To fix, follow the steps below:

  1. Determine the datastore where the virtual machine configuration (.vmx) file is located.
  2. Return to the virtual machine in the vSphere Web Client, right-click, and select:
    1. All Virtual Infrastructure Actions > Remove from Inventory.
  3. Click Yes to confirm the removal of the virtual machine.

If you were looking to recreate and not just remove the virtual machine try the following:

  1. Browse to the datastore and verify that the virtual machine files exist.
  2. If the vmx configuration file was deleted or removed and the disk files are still there, attach the old disk files to a newly created machine.
  3. If the disk files were deleted, restore from a backup.

Next post in this series: http://www.ryanbirk.com/vsphere-troubleshooting-series-part-5-storage-troubleshooting/

Posted in Troubleshooting