VCF 9.0 GA Mental Model Part 6: Topology and Identity Boundaries for Single Site, Dual Site, and Multi-Region

TL;DR

  • Scope: VMware Cloud Foundation 9.0.0.0 GA (primary platform build 24703748) and the associated 9.0 GA BOM levels for key components:
    • SDDC Manager: 9.0.0.0 build 24703751
    • vCenter: 9.0.0.0 build 24755230
    • ESXi: 9.0.0.0 build 24755229
    • NSX: 9.0.0.0 build 24752083
    • VCF Operations: 9.0.0.0 build 24705084
    • VCF Operations Fleet Management: 9.0.0.0 build 24704881
    • VCF Automation: 9.0.0.0 build 24786202
    • VCF Identity Broker: 9.0.0.0 build 24786209
  • Your topology decision is really about failure domains:
    Single site -> simplest operations.
    Two sites in one region -> availability engineering (stretched networking and usually stretched storage).
    Multi-region -> disaster recovery engineering (asynchronous replication + runbooks).
  • Your identity decision is a blast radius decision:
    Fleet-wide Single Sign-On (SSO) maximizes convenience, but centralizes login impact.
    Instance-level SSO shrinks blast radius, but increases operational overhead.
  • Operational punchline: Choose topology and SSO model as day-0 decisions, because your day-2 posture (change windows, incident scope, and who gets paged) is set by those boundaries.

Architecture Diagram

Table of Contents

  • Scope and terminology guardrails
  • Assumptions
  • Decision criteria
  • Challenge
  • Solutions
  • Identity boundaries
  • Who owns what
  • Version compatibility matrix
  • Architecture tradeoff matrix
  • Failure domain analysis
  • Day-0, day-1, day-2 action map
  • Operational runbook snapshot
  • Validation
  • Troubleshooting workflow
  • Anti-patterns
  • Summary and takeaways
  • Conclusion

Scope and terminology guardrails

You will move faster as an organization if you treat these as non-negotiable guardrails:

  • Fleet is your centralized governance and lifecycle scope for fleet-level services (for example, VCF Operations and VCF Automation).
  • Instance is a discrete VCF deployment unit with its own instance-level management components.
  • Domains (management domain and VI workload domains) are lifecycle and isolation boundaries inside an instance.
  • Clusters are the scaling unit inside a domain.

For topology conversations, you also need consistent physical vocabulary:

  • Region is one or more physical sites in a single metro area, typically aligned to synchronous replication latencies.
  • Single site is a single fault domain at some layer (power, HVAC, core network, etc.), even if you have multiple racks.
  • Multiple sites in a single region is an availability pattern, usually implemented with stretched clusters.
  • Multiple sites across multiple regions is a disaster recovery pattern. Treat it as DR engineering, not “metro HA, but farther away”.

Assumptions

  • You are designing for VCF 9.0.0.0 GA (not 9.0.x maintenance releases).
  • You are greenfield for VCF bring-up.
  • You plan to deploy both VCF Operations and VCF Automation from day-1.
  • You want to support three topology postures:
    • Single site
    • Two sites in one region
    • Multi-region
  • You need to support two identity postures:
    • Shared identity and shared SSO boundary where appropriate
    • Separate SSO boundaries for regulated isolation where required

Decision criteria

Use these criteria to keep topology and identity debates grounded in operational outcomes:

  • Availability objective
    • Are you trying to survive host/rack failures, or a full site loss?
    • Do you need “continue running” vs “recover quickly”?
  • Latency reality
    • Two sites in one region implies tight latency constraints and resilient inter-site networking.
    • Multi-region implies you are in DR territory, not synchronous HA territory.
  • Isolation and compliance
    • Do you need separate admin planes and authentication boundaries for regulated workloads?
  • Operational model
    • Can your teams support stretched designs (storage, networking, failure testing)?
    • Do you have the maturity to run parallel instances and DR runbooks?
  • Scale and growth
    • Will you scale by adding clusters, adding domains, or adding instances?
    • Are you trying to cap blast radius for lifecycle events?

Challenge

You need a topology and identity posture that:

  • Matches real failure domains (host, rack, site, region)
  • Keeps lifecycle operations predictable (patching, certificates, identity changes)
  • Makes ownership clear (platform team vs VI admins vs app/platform teams)
  • Avoids accidental coupling (shared services that turn into shared outages)

Solutions

Solution A: Single site

When it fits

  • You want the fastest path to a stable VCF 9.0 platform.
  • Your highest-probability failures are host and rack, not full-site loss.
  • You want to minimize “distributed systems” complexity in your management components.

What it looks like operationally

  • One fleet, one instance, one site.
  • You still separate management domain and workload domains early so lifecycle and security boundaries stay clean.
  • If you adopt “minimal footprint” patterns, validate whether your VCF Automation tenancy model requires a second cluster for scale and availability.

Failure posture

  • You can engineer strong resilience for component and host failures.
  • Site loss is usually an outage unless you build a separate recovery site (which becomes Solution C).

Day-2 characteristics

  • Lowest overhead for upgrades and identity changes.
  • Lowest number of moving parts to test during maintenance windows.

Solution B: Two sites in one region

When it fits

  • You need resilience across two facilities in the same metro area.
  • You can meet the networking and storage requirements to operate stretched designs reliably.
  • You accept more complex failure testing and more disciplined change management.

What it looks like operationally

  • Usually one fleet and one instance spanning two sites in a single region.
  • Stretched clusters are used to increase availability across sites.
  • Expect “site affinity” considerations for key components and edge services, plus explicit failover capacity planning.

Failure posture

  • Well-designed two-site patterns can tolerate a single site loss for some tiers of workloads.
  • Your success depends on:
    • Inter-site link design (bandwidth, latency, convergence)
    • First-hop gateway failover behavior
    • Your storage model (stretched vs replicated)
    • A tested operational runbook

Day-2 characteristics

  • Higher operational toil:
    • More health dependencies (link stability, witness placement, routing)
    • Higher change risk if you treat the stretched fabric casually
  • Upgrade impact can be broader if maintenance touches shared stretched components.

Solution C: Multi-region

When it fits

  • You need regional survivability and a credible DR story.
  • You accept asynchronous replication and DR orchestration as first-class requirements.
  • You can operationalize regular failover testing.

What it looks like operationally

  • One fleet with multiple instances, typically aligning instances to regions.
  • Each region runs its own instance-level management components for that instance.
  • You add replication and failover solutions on top (data replication is not “free” just because you have two regions).

Failure posture

  • Region loss becomes a recovery process, not an HA event.
  • Your RPO/RTO is determined by:
    • Replication technology and mode (async, periodic)
    • Runbook execution time (automation maturity)
    • DNS, identity, and access dependencies

Day-2 characteristics

  • More upgrade surface area:
    • More instance-level stacks to patch and validate
    • More compatibility and sequencing to track
  • More change management work:
    • Cross-region DR testing, runbook maintenance, replication monitoring

Identity boundaries

VCF 9.0 gives you flexibility in how far you extend SSO convenience. Your decision should be explicit, because it determines operational coupling.

Identity design-time decisions that matter

  • Do you want fleet-wide login convenience or per-instance blast radius control?
  • Do you need one identity provider across the fleet, or separate identity sources for isolation?
  • Do you need a highly available Identity Broker deployment model for scale and resilience?

Challenge

You want a clean login experience for operators and consumers, without turning identity into a single point of operational failure.

Solutions

Solution A: Fleet-wide Single Sign-On

Best for

  • A single platform operations team supporting multiple instances
  • Environments where cross-instance operations are common
  • Organizations optimizing for ease of use and consistent access patterns

Operational reality

  • One Identity Broker scope can service all instances within a fleet (large convenience scope).
  • This can create a larger login blast radius if the Identity Broker service is unhealthy or unavailable.

Day-2 implications

  • Identity changes become high-impact changes.
  • You must run solid backup, restore, and certificate practices for identity components.

Solution B: Instance-level Single Sign-On

Best for

  • Regulated isolation
  • Multi-tenant environments where identity boundaries must map to tenant boundaries
  • Organizations optimizing for smaller incident scope

Operational reality

  • You accept more overhead (more identity configurations to manage).
  • You gain containment: login impact is limited to the instance.

Day-2 implications

  • More repeated work during identity provider changes
  • More places to validate role mappings and permissions

Solution C: Cross-instance Single Sign-On segmentation

Best for

  • A practical middle ground
  • You want to group instances by risk domain (for example, production vs regulated)

Operational reality

  • Multiple Identity Broker instances serve defined subsets of instances in the same fleet.
  • You reduce login blast radius versus a single shared Identity Broker, but still gain some cross-instance convenience.

Rollback and safety notes for identity

Identity changes are rarely “undo-able” in a clean way.

Operational behaviors to plan for:

  • Resetting or deregistering SSO can remove provisioned users and groups and may be irreversible for the removed identities.
  • Even after configuring SSO centrally, you often still need to log in to individual components and assign roles and permissions for users and groups.

Treat identity changes as:

  • A change window item with defined blast radius
  • A runbook with an explicit backout plan (often “restore from backup” rather than “click undo”)

Who owns what

Use this chart to stop ownership drift before it becomes incident fuel.

Capability / Task Area Platform team (fleet) VI admin (instance + domains) App/platform teams (consumers)
Fleet topology decisions (fleet count, instance strategy) Own Consult Inform
VCF Operations + Fleet Management lifecycle Own Consult Inform
VCF Automation lifecycle and platform guardrails Own Consult Consult
Identity Broker and SSO model selection Own Consult Inform
Identity provider integration and federation policy Own Consult Inform
Instance bring-up, SDDC Manager health Consult Own Inform
Management domain operations (vCenter/NSX for mgmt) Consult Own Inform
Workload domain lifecycle (create/expand/delete) Consult Own Inform
Network services consumption (projects, VPCs, templates) Guardrails Provide capacity Own
Workload placement, sizing, app RTO/RPO Guardrails Provide platform SLAs Own
DR runbooks for workloads Provide platform primitives Support infra failover Own (execute + validate)

Version compatibility matrix

This matrix is here to reduce ambiguity in architecture reviews and incident calls.

Component Role in the model 9.0 GA version 9.0 GA build
VMware Cloud Foundation Platform level 9.0.0.0 24703748
SDDC Manager Instance mgmt 9.0.0.0 24703751
vCenter Domain mgmt 9.0.0.0 24755230
ESXi Host layer 9.0.0.0 24755229
NSX Network virtualization 9.0.0.0 24752083
VCF Operations Fleet-level ops 9.0.0.0 24705084
VCF Operations Fleet Management Fleet lifecycle plane 9.0.0.0 24704881
VCF Automation Fleet-level consumption 9.0.0.0 24786202
VCF Identity Broker Identity plane 9.0.0.0 24786209

Architecture tradeoff matrix

Use this table in design boards to turn opinions into tradeoffs.

Attribute Single site Two sites in one region Multi-region
Primary goal Operational simplicity Site resilience (metro) Regional survivability (DR)
Typical instance count 1 1 2+
Data protection posture Local HA + backups Often synchronous within region Asynchronous replication + DR
Network demands Standard DC Stretched, resilient inter-site L3 between regions + DR routing/DNS
Change risk Lowest Medium to high High (more components)
Upgrade impact Smallest Broader (shared stretched deps) Broadest (multiple instances)
Identity blast radius Depends on SSO model Depends on SSO model Higher if identity is centralized
Best for Getting started, most orgs Metro availability Regulated DR, geo resilience

Failure domain analysis

You need a shared language for “what breaks what”:

  • Fleet service incident (Operations/Automation/Identity Broker)
    Impacts governance, provisioning workflows, centralized observability, and potentially login flows (depending on your SSO model).
    It does not automatically mean instance-level vCenter or NSX is down.
  • Instance incident (SDDC Manager, management domain services)
    Impacts domain lifecycle operations and management workflows for that instance. Workloads may keep running, but lifecycle and orchestration stop being safe.
  • Domain incident (a workload domain vCenter/NSX, or cluster issues)
    Impacts workloads in that domain. Other domains and instances can remain healthy.

Now map that to topology:

  • Single site: Failure domains are clean, but “site loss” is still a hard stop unless you add DR.
  • Two sites in one region: Link failure and split-brain conditions become first-class failure modes.
  • Multi-region: DR orchestration and identity dependencies become the most common hidden risk.

Day-0, day-1, day-2 action map

Day-0 decisions

These are the “you will regret not deciding early” items:

  • Topology pattern: single site vs dual site vs multi-region
  • Scaling strategy: add clusters vs add domains vs add instances
  • SSO model: fleet-wide vs instance-level vs segmented cross-instance
  • Where fleet services live, and how you protect them
  • Certificate authority strategy and renewal model
  • Backup and restore posture for:
    • VCF Operations + Fleet Management
    • VCF Automation
    • Identity Broker
    • SDDC Manager and management vCenter/NSX

Day-1 actions

Day-1 is “build the platform safely”:

  • Deploy and configure VCF Installer (new in VCF 9.x lineage vs older Cloud Builder workflows).
  • Bring up the first instance and management domain.
  • Deploy fleet services (Operations and Automation) to match your desired HA footprint.
  • Configure Identity Broker and SSO model.
  • Create initial workload domains and attach them to the consumption model you plan to support.
  • For anything beyond baseline wizard-driven deployment (for example, specific network constructs), plan on JSON spec-driven deployment where required.

Day-2 operations

Day-2 is where topology decisions become either leverage or pain:

  • Lifecycle management:
    • Fleet services lifecycle
    • Instance and domain lifecycle
  • Governance and drift:
    • Out-of-band changes are the fastest way to break day-2 workflows
  • Capacity and scale:
    • Add clusters to domains
    • Add domains to instances
    • Add instances to fleets (most often for geographic dispersal and isolation)
  • Identity and certificates:
    • Role mapping validation after identity changes
    • Certificate renewal to avoid service disruption
  • DR and resilience:
    • Regular restoration testing for fleet services
    • Runbook execution practice for multi-region

Operational runbook snapshot

Use this as a starting point and adjust to your org’s risk model.

Minimum viable backup posture

  • Back up fleet services and identity:
    • VCF Operations + Fleet Management
    • VCF Automation
    • Identity Broker
  • Back up instance-level management:
    • SDDC Manager and management vCenter/NSX

Starting targets you can use when leadership asks “what’s good enough”:

  • Fleet services (Operations/Automation/Identity):
    • RPO: 24 hours (starter), 4 hours (mature), 1 hour (high-critical)
    • RTO: 4-8 hours (starter), 2-4 hours (mature), under 2 hours (high-critical)
  • Workload domains (apps):
    • RPO and RTO should be app-tier driven, not “platform averages”

Identity provider change runbook

  • Pre-change:
    • Confirm break-glass access to each component
    • Export role mappings and admin group membership
    • Confirm backups exist for identity components
  • Change:
    • Implement identity provider change in the selected SSO model scope
    • Re-validate role mapping per component (vCenter, NSX, Operations, Automation)
  • Post-change:
    • Validate login across:
      • Fleet UI
      • Instance components
      • Automation portals
    • Update documentation and on-call procedures

Validation

Use validation as your “trust but verify” step after topology or identity work.

Before you declare success, validate:

  • DNS resolution for all management endpoints
  • NTP sync consistency across fleet services and instance services
  • Login paths:
    • Fleet services
    • vCenter and NSX
    • Automation portals
  • Health and connectivity:
    • VCF Operations cluster health
    • Automation cluster health
    • Identity Broker health

Use the following commands to validate the basics from a jump host.

Run these DNS and connectivity checks:

# DNS resolution
nslookup vcf-ops-fqdn.example.com
nslookup vcf-automation-fqdn.example.com
nslookup vcenter-mgmt-fqdn.example.com
nslookup nsx-mgmt-fqdn.example.com

# TLS reachability (headers only)
curl -kI https://vcf-ops-fqdn.example.com/
curl -kI https://vcf-automation-fqdn.example.com/
curl -kI https://vcenter-mgmt-fqdn.example.com/
curl -kI https://nsx-mgmt-fqdn.example.com/

Troubleshooting workflow

When something breaks, your first job is to identify which boundary you are in.

Step-by-step triage

  • Step 1: Is this a login issue or a lifecycle issue?
    • Login failures often point to identity scope or identity broker health.
    • Lifecycle failures often point to fleet management services or instance manager state.
  • Step 2: Is impact fleet-wide, instance-wide, or domain-only?
    • Fleet-wide symptoms: multiple instances show the same governance or login issues.
    • Instance-wide symptoms: one instance fails lifecycle tasks across its domains.
    • Domain-only symptoms: a workload domain is isolated while other domains operate normally.
  • Step 3: Validate time and certificates
    • Time drift and certificate issues are repeat offenders in management plane failures.
    • Fixing time and trust chains often restores otherwise “mysterious” behavior.

Common issues

  • SSO works in one UI, fails in another
    • Usually role mappings are incomplete in individual components even though SSO is configured centrally.
  • Automation provisioning failures after identity changes
    • Often stale user/group bindings or missing project/organization role bindings.
  • Stretched design instability
    • Often inter-site routing and gateway failover behavior, not “vSphere problems”.

Anti-patterns

Avoid these and you avoid most self-inflicted outages.

  • Treating dual-site in one region like “simple HA”
    • It is not simple. It is distributed systems engineering.
  • Treating multi-region as “active-active by default”
    • Multi-region is a DR posture unless you intentionally architect otherwise.
  • Choosing fleet-wide SSO without an identity resilience plan
    • Convenience without resilience becomes a fleet-wide login incident.
  • Mixing regulated tenants into a shared identity boundary “for simplicity”
    • That is an audit finding waiting to happen.
  • Out-of-band changes without drift detection and an operational reconciliation practice
    • This creates silent divergence between actual state and expected state.

Summary and takeaways

  • Topology is a failure domain decision. Identity is a blast radius decision.
  • Single site is the fastest path to a stable platform and clean day-2 operations.
  • Two sites in one region is an availability posture that requires disciplined engineering and testing.
  • Multi-region is a DR posture that requires replication, orchestration, and practiced runbooks.
  • Fleet-wide SSO is about user experience. Instance-level SSO is about containment.
  • Put the ownership model on paper early, or your incident bridge will do it for you.

Conclusion

VCF 9.0 becomes dramatically easier to operate when you explicitly separate topology decisions (site, region, instance placement) from governance decisions (fleet services) and then choose identity boundaries that match your isolation and resilience goals. Once you standardize these mental models, your teams can scale the platform without scaling confusion.

Sources

VMware Cloud Foundation 9.0 Documentation (TechDocs landing page): https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0.html
VMware Cloud Foundation 9.0 Release Notes – Bill of Materials: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/release-notes/vmware-cloud-foundation-90-release-notes/vmware-cloud-foundation-bill-of-materials.html
Design Blueprints for VMware Cloud Foundation 9.0: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/design/blueprints.html
VCF Fleet-Wide Single Sign-On Model: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/design/design-library/single-sign-on-models/-fleet.html
VCF Single Sign-On Models (Design Library index): https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/design/design-library/single-sign-on-models.html
VCF Installer Product Support Notes (VCF 9.0 Release Notes): https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/release-notes/vmware-cloud-foundation-90-release-notes/platform-product-support-notes/product-support-notes-installer.html
VMware Cloud Foundation Installer API Reference Guide: https://developer.broadcom.com/xapis/vcf-installer-api/latest
VMware Cloud Foundation API Reference Guide: https://developer.broadcom.com/xapis/vmware-cloud-foundation-api/latest

Similar Posts