markjramos

Stretched vSAN Cluster

Updated: Apr 24, 2023


What is a Stretched Cluster?


A stretched cluster, sometimes called a metro cluster, is a deployment model in which two or more host servers are part of the same logical cluster but are located in separate geographical locations, usually two sites. For the hosts to be part of the same cluster, the shared storage must be reachable from both sites.


Stretched clusters are usually used to provide high availability (HA) and load-balancing features and capabilities, and to build active-active sites.


The vSphere Metro Storage Cluster (vMSC) is simply a configuration option for a vSphere cluster in which half of the virtualization hosts are in one site and the other half are in a second site. Both sites work in an active-active way, and common vSphere features, like vMotion or vSphere HA, can be used.


Requirements and limitations

There are some technical constraints related to the online migration of VMs, so the following specific requirements must be met before considering a stretched cluster implementation:

  • To support higher latency in vMotion, vSphere Enterprise Plus is required (although this requirement is no longer explicitly stated for vSphere 6.x).

  • The ESXi vMotion network must have a minimum link bandwidth of 250 Mbps.

  • The maximum supported network latency between sites is around 10 ms round-trip time (RTT). Note that vSphere vMotion and vSphere Storage vMotion support a maximum of 150 ms latency as of vSphere 6.0, but this is not intended for stretched cluster usage.

  • VM networks should be on a stretched L2 network (or some network virtualization technique must be used). Note that the ESXi management network and the vMotion network can also be L3.

For the storage part we need:

  • Storage must be certified for vMSC.

  • Supported storage protocols are Fibre Channel, iSCSI, NFS, and FCoE. Also, vSAN is supported.

  • The maximum supported latency for synchronous storage replication links is 10 ms RTT.

The storage requirements can, of course, be slightly more complex, depending on the storage vendor, the storage architecture and the specific storage product, but there are usually both specific VMware KB articles and storage vendor papers that cover them. A vSphere Metro Storage Cluster requires what is in effect a single storage subsystem that spans both sites and permits ESXi hosts to access datastores from either array transparently and with no impact on ongoing storage operations, potentially with simultaneous active-active read/write from both sites.
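As a quick illustration of these thresholds, here is a minimal Python sketch (hypothetical helper and link names; only the 250 Mbps and 10 ms figures come from the requirements above) that compares measured inter-site values against the vMSC constraints.

# Hypothetical pre-check of inter-site links against the vMSC constraints
# listed above: 250 Mbps minimum for vMotion, ~10 ms RTT for vMotion and
# for synchronous storage replication. Measured values are placeholders.
from dataclasses import dataclass

@dataclass
class InterSiteLink:
    name: str
    bandwidth_mbps: float  # measured usable bandwidth
    rtt_ms: float          # measured round-trip time

VMOTION_MIN_BANDWIDTH_MBPS = 250
VMOTION_MAX_RTT_MS = 10
SYNC_REPLICATION_MAX_RTT_MS = 10

def check_vmsc_requirements(vmotion, replication):
    """Return human-readable violations; an empty list means the requirements are met."""
    issues = []
    if vmotion.bandwidth_mbps < VMOTION_MIN_BANDWIDTH_MBPS:
        issues.append(f"{vmotion.name}: {vmotion.bandwidth_mbps} Mbps < {VMOTION_MIN_BANDWIDTH_MBPS} Mbps")
    if vmotion.rtt_ms > VMOTION_MAX_RTT_MS:
        issues.append(f"{vmotion.name}: {vmotion.rtt_ms} ms RTT > {VMOTION_MAX_RTT_MS} ms")
    if replication.rtt_ms > SYNC_REPLICATION_MAX_RTT_MS:
        issues.append(f"{replication.name}: {replication.rtt_ms} ms RTT > {SYNC_REPLICATION_MAX_RTT_MS} ms")
    return issues

# Example measurements (made up for illustration).
vmotion_link = InterSiteLink("vMotion link", bandwidth_mbps=1000, rtt_ms=4.2)
repl_link = InterSiteLink("Replication link", bandwidth_mbps=10000, rtt_ms=3.1)
print(check_vmsc_requirements(vmotion_link, repl_link) or "All vMSC link requirements met.")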


Uniform vs. non-uniform




There are two main vMSC architectures, based on a fundamental difference in how hosts access storage:

  • Uniform host access configuration

  • Nonuniform host access configuration

In a uniform architecture, ESXi hosts from both sites are all connected to a storage node in the storage cluster across all sites. Paths presented to ESXi hosts are stretched across a distance.


This tends to be a more common design, which provides full storage access, can handle local storage failure without interruption and helps provide a better level of uptime for virtual machines.


For this type of architecture, we need a way to provide the notion of “site affinity” for a VM, so that storage I/O stays local to each site (defining a “site bias”) and unnecessary cross-site traffic is minimized.


In a non-uniform architecture, ESXi hosts at each site are connected only to the storage node(s) at the same site. Paths presented to ESXi hosts from storage nodes are limited to the local site. Each cluster node has read/write access to the storage in one site, but not to the other.


A key point with this configuration is that each datastore has implicit “site affinity,” due to the “LUN locality” in each site. Also, if anything happens to the link between the sites, the storage system on the preferred site for a given datastore will be the only one remaining with read/write access to it. This prevents any data corruption in case of a failure scenario.


vSphere Metro Storage Cluster supports uniform or non-uniform storage presentation and fabric designs. Within the vMSC entity, one presentation model should be implemented consistently for all vSphere hosts in the stretched cluster. Mixing storage presentation models within a stretched cluster is not recommended. It may work but it has not been formally or thoroughly tested.


Note that a vSAN stretched cluster configuration is only available in a uniform model, but with some specific concepts and improvements for data locality and data affinity.
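To make the difference between the two models more concrete, here is a small illustrative Python sketch (not a VMware API; host and array names are made up) of which storage nodes each ESXi host can reach, and how site affinity keeps I/O on the local array in the uniform case.

# Illustrative model of uniform vs. non-uniform storage presentation.
# Host and storage array names are made up for the example.
SITES = {
    "Site-A": {"hosts": ["esxi-a1", "esxi-a2"], "array": "array-A"},
    "Site-B": {"hosts": ["esxi-b1", "esxi-b2"], "array": "array-B"},
}

def visible_arrays(host_site, model):
    """Arrays a host in host_site can reach under each presentation model."""
    if model == "uniform":
        # Every host sees the arrays of all sites (paths stretched across sites).
        return [site["array"] for site in SITES.values()]
    if model == "non-uniform":
        # Hosts see only the array(s) of their own site.
        return [SITES[host_site]["array"]]
    raise ValueError(f"unknown model: {model}")

def preferred_array(host_site, model):
    """With site affinity, I/O stays on the local array whenever it is visible."""
    local = SITES[host_site]["array"]
    visible = visible_arrays(host_site, model)
    return local if local in visible else visible[0]

for model in ("uniform", "non-uniform"):
    print(f"{model}: a host in Site-A sees {visible_arrays('Site-A', model)}, "
          f"prefers {preferred_array('Site-A', model)}")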


Synchronous vs. Asynchronous

A stretched cluster usually uses synchronous replication to guarantee data consistency (zero RPO) between both sites. The write I/O pattern sequence with synchronous replication is:

  1. The application or server sends a write request to the source.

  2. The write I/O is mirrored to the destination.

  3. The mirrored write I/O is committed to the destination.

  4. The write commit at the destination is acknowledged back to the source.

  5. The write I/O is committed to the source.

  6. Finally, the write acknowledgment is sent to the application or server.

The process is repeated for each write I/O requested by the application or server.

Asynchronous replication tries to accomplish a similar data protection goal, but with a non-zero RPO. Usually, a replication frequency defines how often data is replicated.

The write I/O pattern sequence with asynchronous replication is:

  1. The application or server sends a write request to the source volume.

  2. The write I/O is committed to the source volume.

  3. Finally, the write acknowledgment is sent to the application or server.

The process is repeated for each write I/O requested by the application or server.

  1. Periodically, a batch of write I/Os that have already been committed to the source volume are transferred to the destination volume.

  2. The write I/Os are committed to the destination volume.

  3. A batch acknowledgment is sent to the source.
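The latency and RPO trade-off between the two sequences can be sketched in a few lines of Python (a toy model; all latency figures are arbitrary examples, not measurements):

# Toy model of the two write sequences above. The latencies are arbitrary
# example values: local commit time on an array and one-way inter-site delay.
LOCAL_COMMIT_MS = 0.5   # time to commit a write on one array
ONE_WAY_LINK_MS = 2.5   # one-way inter-site latency (5 ms RTT)

def sync_write_latency_ms():
    # The source waits for the mirrored write to be committed at the
    # destination and acknowledged back before committing locally and
    # acknowledging the application (steps 2-6 above).
    return ONE_WAY_LINK_MS + LOCAL_COMMIT_MS + ONE_WAY_LINK_MS + LOCAL_COMMIT_MS

def async_write_latency_ms():
    # The application is acknowledged as soon as the source commits (steps 1-3);
    # the periodic batch transfer only affects RPO, not write latency.
    return LOCAL_COMMIT_MS

print(f"Synchronous write latency : {sync_write_latency_ms():.1f} ms (RPO = 0)")
print(f"Asynchronous write latency: {async_write_latency_ms():.1f} ms (RPO = replication interval)")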

Resiliency and availability

Of course, the overall infrastructure must provide better resiliency and availability compared to a single site. By default, a stretched cluster provides at least +1 redundancy (for an entire site), but more can be provided with a proper design.


This must start from the storage layer, where you must be able to tolerate the total failure of the storage in one site without service interruption (with uniform access) or with minimal service interruption (with non-uniform access). But site resiliency is only part of the picture: what about local resiliency for the storage? For external storage, that means redundant arrays and local data redundancy. For hyper-converged storage, it means local data redundancy with a recommended 3 nodes per site (or 5 if erasure coding is used). Also keep maintenance windows and activities in mind (for this reason the recommended number of nodes is increased by one).
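A minimal sketch that simply reproduces the per-site sizing arithmetic in the paragraph above (hosts needed for local data redundancy plus one spare host for maintenance; the exact vSAN minimums depend on the storage policy in use):

# Reproduces the sizing arithmetic from the paragraph above: hosts needed
# for local data redundancy, plus one spare host so a node can enter
# maintenance mode without losing that redundancy. Check the vSAN
# documentation for the exact minimums of your storage policy.
LOCAL_REDUNDANCY_HOSTS = {"mirroring": 2, "erasure coding": 4}
MAINTENANCE_SPARE = 1

for scheme, hosts in LOCAL_REDUNDANCY_HOSTS.items():
    print(f"{scheme}: {hosts + MAINTENANCE_SPARE} hosts recommended per site")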


Disaster recovery vs. Stretched cluster



Although a stretched cluster can also be used for disaster recovery, and not only for disaster avoidance, there are some possible limitations in using a stretched cluster as a disaster recovery solution:

  • A stretched cluster can’t protect you from site link failures and can be affected by split-brain scenarios.

  • A stretched cluster usually works with synchronous replication, which means limited distance, but also difficulty in providing multiple restore points at different times.

  • Bandwidth requirements are very high in order to minimize storage latency, so you need lines that are not only reliable but also large.

  • A stretched cluster can be costlier than a DR solution but, of course, it can also provide disaster avoidance in some cases.

In most cases where a stretched cluster is used, there can also be a third site acting as a traditional DR site, providing a multi-level protection approach.


Stretched Cluster in SDDC Design

VMware Validated Design provides alternative guidance for implementing an SDDC that contains two availability zones. You configure vSAN stretched clusters in the management domain and the workload domains to create a second availability zone. The SDDC continues operating during host maintenance or if one availability zone is lost.


In a stretched cluster configuration, both availability zones are active. If a failure in either availability zone occurs, the virtual machines are restarted in the operational availability zone because virtual machine writes occur to both availability zones synchronously.


Regions and Availability Zones

In the multi-availability zone version of the VMware Validated Design, you have two availability zones in Region A.


Physical Infrastructure

You must use homogeneous physical servers across availability zones. You replicate the hosts of the first cluster in the management domain and of the shared edge and workload cluster in a workload domain, and you place them in the same rack.


Infrastructure Architecture for Multiple Availability Zones


Component Layout with Two Availability Zones

The management components of the SDDC run in Availability Zone 1. They can be migrated to Availability Zone 2 when an outage or overload occurs in Availability Zone 1.


You can start deploying the SDDC in a single availability zone configuration, and then extend the environment with the second availability zone.


vSphere Logical Cluster Layout for Multiple Availability Zones for the Management Domain


Network Configuration

NSX-T Edge nodes connect to top-of-rack switches in each data center to support northbound uplinks and route peering for SDN network advertisement. This connection is specific to the top-of-rack switch that the node is connected to.


Dynamic Routing in Multiple Availability Zones

If an outage of an availability zone occurs, vSphere HA fails over the edge appliances to the other availability zone. Availability Zone 2 must provide an analog of the network infrastructure that the edge node is connected to in Availability Zone 1.


The management network in the primary availability zone, and the Uplink 01, Uplink 02, and Edge Overlay networks in each availability zone must be stretched to facilitate failover of the NSX-T Edge appliances between availability zones. The Layer 3 gateway for the management and Edge Overlay networks must be highly available across the availability zones.


The network between the availability zones should support jumbo frames, and its latency must be less than 5 ms. Use a 25-GbE connection with vSAN for the best and most predictable performance (IOPS) of the environment.


Networks That Are Stretched Across Availability Zones


Witness Appliance

When using two availability zones, deploy a vSAN witness appliance in a location that is not local to the ESXi hosts in any of the availability zones.

VMware Validated Design uses vSAN witness traffic separation where you can use a VMkernel adapter for vSAN witness traffic that is different from the adapter for vSAN data traffic. In this design, you configure vSAN witness traffic in the following way:

  • On each management ESXi host in both availability zones, the vSAN witness traffic is placed on the management VMkernel adapter.

  • On the vSAN witness appliance, you use the same VMkernel adapter for both management and witness traffic.
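As a sketch of what that tagging could look like in practice (host names are hypothetical, and the esxcli witness traffic tagging command shown is the witness traffic separation mechanism described in the referenced vSAN stretched cluster guide; verify the exact syntax for your ESXi version), a small helper that prints the per-host commands:

# Prints, per management host, the command that would tag witness traffic on
# the management VMkernel adapter (vmk0), following the design above.
# Host names are hypothetical; "esxcli vsan network ip add -T=witness" is the
# witness traffic separation mechanism described in the referenced vSAN
# stretched cluster guide - confirm it for your ESXi version before use.
MGMT_HOSTS = ["esxi-mgmt-01", "esxi-mgmt-02", "esxi-mgmt-03", "esxi-mgmt-04"]
MGMT_VMK = "vmk0"  # management VMkernel adapter that also carries witness traffic

def witness_tagging_commands(hosts, vmk):
    return [f"[{host}] esxcli vsan network ip add -i {vmk} -T=witness" for host in hosts]

for cmd in witness_tagging_commands(MGMT_HOSTS, MGMT_VMK):
    print(cmd)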

vSAN Witness Network Design in the Management Domain


vSAN Witness Network Design in a Virtual Infrastructure Workload Domain






Ref:

https://core.vmware.com/resource/vsan-stretched-cluster-guide
