VXLAN, MBGP EVPN with Ingress Replication – Part 1 – Basic Facts, Design Considerations and Security

I found too many reference docs on VXLAN, most of them cover early solutions that do not use MP-BGP EVPN and manage advertisement of BUM traffic (broadcast, unknown unicast and multicast) via multicast. I know people who do not want to run mcast in their network!

Here, I focus on VXLAN with MP-BGP EVPN with ingress replication to manage BUM traffic (VXLAN + MPBGP EVPN + Ingress Replication).

MP-BGP EVPN is the next generation solution becoming widely popular in Data Center networks (VXLAN EVPN) and Service Provider networks (MPLS PBB-EVPN).

My plan is to create following step-by-step reference documents for VXLAN EVPN with ingress replication.

  • VXLAN, MBGP EVPN with ingress replication – Part 1 – Basic Facts, Design Considerations and Security
  • VXLAN, MBGP EVPN with ingress replication – Part 2 – Configure VXLAN on a single POD – L2VNI
  • VXLAN, MBGP EVPN with ingress replication – Part 3 – Configure VXLAN on multi PODs – L2VNI
  • VXLAN, MBGP EVPN with ingress replication – Part 4 – Configure VXLAN on multi PODs – L3VNI
  • VXLAN, MBGP EVPN with ingress replication – Part 5 – Configure VXLAN on multi PODs including a collapsed POD besides Spine and Leaf PODs

This is the Part 1.


Let’s Get Some Basic Facts About VXLAN –

  • the initial specification of VXLAN described in RFC 7348; this describes the need for overlay networks within virtualized data centers accommodating high density tenants (4096++) as traditional VLAN based segmentation can go max up to 4096
  • so based on RFC7348, VXLAN is the solution to get rid of classical ethernet (CE) in a data center and extend VLAN boundaries from 4096 to above; ah! NO more VLAN and STP!
  • similar alternative are TRILL, NVGRE, Cisco OTV; however, none are widely accepted except VXLAN
  • VXLAN header size is 8-byte; this includes a layer 2 virtual network identifier (VNI), which is 24-bit long
  • VNI represents a broadcast domain; traditional VLANs are associated with a unique VNI number; you often see the term “bridge-domain” which are in a sense similar to VLANs (or multiple VLAN/subnets of a tenent); a VNI can represent a bridge-domain
  • since maximum number of IEEE 802.1Q VLAN is 4096 – you can have max 4096 VNIs per POD (point of delivery); yes – you sill use VLANs for segregation!
  • in a large DC environment, you have multiple PODs inter-connected together per data center or across multiple data centers; thus you can have a high density multi-tenancy network that goes beyond 4096! Here you need VXLAN VNIs that gives you segments up to 24-bit (16,777,216 unique network segments in decimal)
  • so, VLANs are still there! ah! YES, they are! but VLANs are now local to per POD and/or per switch only; VLAN extension to intra-switches and intra-PODs are done via VXLAN VNIs; switch-to-switch connections are L3 for VXLAN instead of L2 in a typical VLAN based network
  • VXLAN use UDP instead of TCP and use port number 4789
  • L3 connectivity leverage equal cost multi-paths (ECMP) and use all inter-connect links that provide max throughput and redundancy compared to L2 STP that do not forward traffic over all links because of STP block ports
  • since VXLAN header size is 8-byte; VXLAN adds extra overhead to traditional 1500 MTU; VXLAN MTU size is “1500 payload with original IP header + 14 byte Ethernet header + 8 byte VXLAN header + 8 byte UDP header + 8 byte IP header”
  • in a real world – VXLAN deployments are done on 9K MTU size end-to-end; none use 1500 MTU
  • VXLAN requires end-to-end L3 reachability in the underlay network; underlay reachability is done via IGP, most cases OSPF or IS-IS
  • VXLAN encapsulate MAC address into IP packet and transport over L3 network
  • VXLAN is the “overlay networking” that runs on the top of underlay that use local “VXLAN Tunnel End Point – VTEP” interfaces to encapsulate packages into VXLAN
  • VTEP is the interface where VXLAN traffic encapsulation and de-encapsulation happen (origination and termination of VXLAN traffic)
  • VTEP can be hardware based – which is a dedicated network device capable of encapsulation and de-encapsulation of VXLAN packets; Cisco Nexus/Juniper QFX/Arista are good example
  • VTEP can be software based – VXLAN encapsulation and de-encapsulation happen on software based virtual network appliance within a virtualization “hypervisor servers”; underlaying physical network is totally unaware of VXLAN; VMware NSX is an example of software based VXLAN VTEP
  • software based VTEP handles only traffic those traverse via the hypervisor host machine; whereas hardware based VTEP can handle VXLAN traffic in a much broader space
  • VXLAN EVPN involves a “control plane” that handle the MAC address learning (BUM traffic)
  • VXLAN EVPN support “ARP suppression” which can reduce arp flood for “silent” hosts/clients (most hosts send GARP/RARP to the network when they come online; silent hosts dont do that)
  • VXLAN L3VNI requires “anycast gateway” on the Leaf switches which has a shared IP address across all the participating Leaf switches; very similar to other FHRP (VRRP/HSRP/etc…)


(VXLAN header details – picture copied from cisco.com)


Security in VXLAN MP-BGP EVPN based VTEP

  • previous multicast based VTEP peer discovery didn’t have a mechanism or a method for authenticating VTEP peers; in plain English there was “NO” whitelist for VTEP peers!
  • the above limitations present major security risks in real-world VXLAN deployments because it allows insertion of a rogue VTEP into a VNI segment!
  • if a rogue VTEP has been inserted into the segment, it can send and receive VXLAN traffic! ah! goccha!
  • MP-BGP EVPN based VTEP peers are pre-authenticated and whitelisted by BGP; BGP sessions must be established first for a VTEP device to discover remote VTEP peers
  • in addition to an established BGP session requirements – BGP session authentication can be added to BGP peers (MD5 3DES)
  • in addition to BGP session security – IGP security (aka. auth) can be added to the “underlay” routing protocols


Few Quick Notes on VXLAN Network Design

  • VXLAN network design doesn’t follow traditional “three layers” network design approach (core – dist – access)
  • VXLAN network design typically has two tiers – Spine and Leaf; this design can grow horizontally “pay as you go”; you can add more Spine and Leaf anytime! No more fixed number of switch ports per POD!
  • you can have “super spine” on the top of Spine switches
  • Leaf switches are connected to Spine switches within the same POD
  • Spine to Spine direct network connections are “not” necessary but they can be connected
  • underlay IGP ensure end-to-end L3 connectivity within Leaf and Spine switches
  • clients are connected to the Leaf switches (servers, hypervisors, routers etc…)
  • in a multi-POD DC scenario – Spine switches need to be inter-connected (same EVPN control plane across multi-POD); intra-site DCI
  • in a multi-site data centre inter-connect (DCI) scenario “segmented” VXLAN “control plane” are deployed to minimise BUM per data center; inter-DC traffic are handled by VXLAN Border Gateway (BGW) routers
  • in a multi-site DCI scenario, the Border Gateway router (BGW) can be configured on the Spine switches (there are many other connectivity model/scenarios available for BGW); in this case Spine DC-x to Spine DC-y are connected back to back over via L3 link which is very similar to multi-POD Spine to Spine connectivity




VXLAN traffic flow diagram – inter-switch VLAN traffic follow L3 path.



Few Notes While Configuring VXLAN on Cisco Nexus NXOS

  • VXLAN EVPN is based on MP-BGP; this is just an extension to MP-BGP which is very similar to MPLS VPNV4 or VPLS l2vpn
  • if you have configured MP-BGP MPLS before – you will find VXLAN EVPN configuration is super easy
  • VXLAN VTEP switches are much like “PE” router in a typical MPLS network



Cisco Nexus vPC and non-vPC VLANs together on the same platform (Nexus hybrid setup) – NXOS v7.0(3)4(1)

Cisco discontinued “spanning-tree pseudo-information” starting from NXOS version 7.0(3)4(1) on the 9000 platforms. So what is the solution for Nexus vPC and non-vPC VLANS on the same platform (hybrid)? Is it no longer going to be supported on NXOS/9000 platforms?

Although hybrid is not recommended vPC design for “aggregation layer” but you find a lot scenarios where you need to have both vPC and non-vPC within the same platform (mostly in mid-size data centres; where you have a lot reasons you can’t deploy traditional ethernet switches; hence you have considered the Cisco Nexus platforms).

If you carry everything (both vPC VLANs and non-vPC VLANs) over the vPC peer-links (yes, vPC carry orphaned VLANs as well!) – in this case if there is any issue happen on the vPC peer links and that stops the vPC working, the downstream switches those are connected to the Nexus via non-vPC/non-vPC VLANs will stop forwarding frames due to STP (any STP – STP/RSTP/PVSTP/MSTP) blockage as they don’t have any information about what happened between the vPC peer Nexus switches in a vPC peer failed scenario.

Experts suggest that you should have an additional Layer 2 trunk port-channel alongside your vPC peer-link; this Layer 2 port-channel will carry non-vPC VLANs in case vPC stops working (whatever reason; could be a Nexus reboot during maintenance!).

Well, now you have setup a seperate Layer 2 trunk port channel for non-vPC VLANs and shutdown the vPC peer link to test it – but you found it is still not working as expected! STP is blocking the Layer 2 trunk link, whats the problem! Cisco used to have a solution for this called “spanning-tree pseudo-information“; as I have mentioned in the beginning Cisco discontinued this starting from version 7.0(3)l4(1) on the Nexus 9000 platform. So what is your option for this? Should you stop using Nexus if you don’t follow spine and leaf design?

Yes – there is an answer to this, there is a small trick to make this working! By discontinuing psuedo-information, Cisco basically makes it ever easier to configure; you need to do the following –

i. First, you need to set STP root priority for the VLANs to a lower priority number on one of Nexus switch (default priority is 32768, lower is prefer; this should be done on the primary vPC role Nexus switch) and leave STP untouched on the other Nexus switch (this is the vPC secondary role switch).

ii. Second, on the non-vPC trunk port-channel set “spanning-tree port type normal” on “both” the switches; (“spanning-tree port type network” is recommended for vPC peer-link).

Here is an example config -Here is an example config –

On the vPC primary role switch, apply the following -

(config)#spanning-tree vlan 10,20,30,40-50 priority 8192

On both the switches, apply the following -

(config)#interface port-channel10
(config-if)#description "non-vPC trunk port" 
(config-if)#switchport mode trunk 
(config-if)#switchport trunk allowed vlan 10,20,30,40-50 
(config-if)#spanning-tree port type normal

Without having the above configuration applied you will find STP is blocking the non-vPC Layer 2 trunk link even if the vPC peer-link is shutdown. Also in this example – the vPC primary role switch will be the STP “root bridge” because of lower priority configured (8192).

To test your configuration – shutdown the vPC peer-link, run “show spanning-tree vlan xxx” – you should see STP put the L2 trunk interfaces in forwarding state immediately.

Here is a vPC and non-vPC VLANs on the same platform diagram –