VXLAN, MBGP EVPN with Ingress Replication – Part 1 – Basic Facts, Design Considerations and Security

I found plenty of reference docs on VXLAN, but most of them cover early solutions that do not use MP-BGP EVPN and that handle BUM traffic (broadcast, unknown unicast and multicast) via multicast. I know people who do not want to run multicast in their network!

Here, I focus on VXLAN with MP-BGP EVPN, using ingress replication to handle BUM traffic (VXLAN + MP-BGP EVPN + Ingress Replication).

So, What Is “Ingress Replication” Compared To “Multicast” Based VXLAN Solution?

The answer is – ingress replication, also called head-end replication, performs unicast delivery of VXLAN-encapsulated packets to remote VTEPs. Unicast replication requires the source VTEP to deliver the same data to every single remote VTEP in “one-to-one” fashion – whereas in a multicast-based solution the VTEPs join a multicast group in the underlay (typically PIM-SM with a rendezvous point) and receive VXLAN-encapsulated data in “one-to-many” fashion. Multicast puts less replication load on the source VTEP and can deliver faster when there are many VTEPs; however, multicast-based VTEP discovery is less secure (see the security section below).
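To make the “one-to-one” vs “one-to-many” difference concrete, here is a minimal Python sketch. It is a toy model, not how any switch implements it: the `Frame` class, the function names, the VTEP addresses, the VNI and the multicast group are all made up for illustration. It only counts how many copies the source VTEP itself has to send.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """A BUM frame (broadcast, unknown unicast or multicast) in one VNI."""
    vni: int
    payload: bytes


def ingress_replicate(frame, local_vtep, remote_vteps):
    """Head-end replication: the source VTEP sends one unicast copy per remote VTEP."""
    return [(local_vtep, remote, frame) for remote in remote_vteps]


def multicast_send(frame, local_vtep, group):
    """Multicast underlay: the source VTEP sends one copy to the group address."""
    return [(local_vtep, group, frame)]


# Hypothetical addresses, chosen only for the example
remote_vteps = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]
frame = Frame(vni=10010, payload=b"ARP who-has 192.168.10.20")

unicast_copies = ingress_replicate(frame, "10.0.0.1", remote_vteps)
mcast_copies = multicast_send(frame, "10.0.0.1", "239.1.1.10")

print(len(unicast_copies))  # 3 copies leave the source VTEP
print(len(mcast_copies))    # 1 copy leaves the source VTEP
```

With ingress replication the number of copies grows with the number of remote VTEPs in the VNI, which is the price you pay for not running multicast in the underlay.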

MP-BGP EVPN is the next generation solution becoming widely popular in Data Center networks (VXLAN EVPN) and Service Provider networks (MPLS PBB-EVPN).

My plan is to create the following step-by-step reference documents for VXLAN EVPN with ingress replication.

VXLAN, MBGP EVPN with ingress replication – Part 1 – Basic Facts, Design Considerations and Security
VXLAN, MBGP EVPN with ingress replication – Part 2 – Configure VXLAN on a single POD – L2 VNI
VXLAN, MBGP EVPN with ingress replication – Part 3 – Configure VXLAN on multi PODs – L2 VNI
VXLAN, MBGP EVPN with ingress replication – Part 4 – Configure VXLAN on multi PODs – L3 VNI
VXLAN, MBGP EVPN with ingress replication – Part 5 – Configure VXLAN on multi PODs including a collapsed POD besides Spine and Leaf PODs

This is Part 1.

Let’s Get Some Basic Facts About VXLAN

the initial VXLAN specification is RFC 7348; it describes the need for overlay networks within virtualized data centers that accommodate a high density of tenants (more than 4096), as traditional VLAN-based segmentation tops out at 4096 VLAN IDs
so, based on RFC 7348, VXLAN is the solution to get rid of classical Ethernet (CE) in a data center and extend the segment count beyond 4096; ah! NO more VLAN and STP!
similar alternatives are TRILL, NVGRE and Cisco OTV; however, none is as widely accepted as VXLAN
the VXLAN header is 8 bytes long; it includes a 24-bit VXLAN network identifier (VNI)
a VNI represents a broadcast domain; each traditional VLAN is mapped to a unique VNI number; you often see the term “bridge-domain”, which is in a sense similar to a VLAN (or multiple VLANs/subnets of a tenant); a VNI can represent a bridge-domain
since the maximum number of IEEE 802.1Q VLAN IDs is 4096 – you can have max 4096 VLAN-mapped VNIs per POD (point of delivery); yes – you still use VLANs for segregation!
in a large DC environment, you have multiple PODs inter-connected per data center or across multiple data centers; thus you can have a high-density multi-tenancy network that goes beyond 4096! Here you need VXLAN VNIs, whose 24-bit ID space gives you up to 16,777,216 unique network segments
so, VLANs are still there! ah! YES, they are! but VLANs are now local to a POD and/or a switch only; VLAN extension between switches and between PODs is done via VXLAN VNIs; switch-to-switch connections are L3 for VXLAN instead of L2 as in a typical VLAN-based network
VXLAN uses UDP instead of TCP; the destination port number is 4789
L3 connectivity leverages equal-cost multi-path (ECMP) and uses all inter-connect links, which provides max throughput and redundancy compared to L2 STP, which does not forward traffic over all links because of STP blocked ports
VXLAN encapsulation adds overhead on top of the traditional 1500 MTU; the encapsulated packet size is “1500-byte payload including the original IP header + 14-byte inner Ethernet header + 8-byte VXLAN header + 8-byte UDP header + 20-byte outer IPv4 header” = 1550 bytes, i.e. 50 bytes of overhead
in the real world – VXLAN deployments are usually done with 9K MTU (jumbo frames) end-to-end rather than 1500 MTU
VXLAN requires end-to-end L3 reachability in the underlay network; underlay reachability is provided by an IGP, in most cases OSPF or IS-IS
VXLAN encapsulates the original Ethernet frame (MAC-in-UDP) into an IP packet and transports it over the L3 network
VXLAN is the “overlay network” that runs on top of the underlay; it uses local “VXLAN Tunnel End Point – VTEP” interfaces to encapsulate packets into VXLAN
a VTEP is the interface where VXLAN encapsulation and de-encapsulation happen (origination and termination of VXLAN traffic)
a VTEP can be hardware based – a dedicated network device capable of encapsulating and de-encapsulating VXLAN packets; Cisco Nexus/Juniper QFX/Arista are good examples
a VTEP can be software based – VXLAN encapsulation and de-encapsulation happen on a software-based virtual network appliance within the virtualization “hypervisor servers”; the underlying physical network is totally unaware of VXLAN; VMware NSX is an example of a software-based VXLAN VTEP
a software-based VTEP handles only the traffic that traverses the hypervisor host machine; whereas a hardware-based VTEP can handle VXLAN traffic in a much broader space
VXLAN EVPN involves a “control plane” that handles MAC address learning; MAC/IP reachability is advertised via BGP instead of being learned by flooding BUM traffic
VXLAN EVPN supports “ARP suppression”, which reduces ARP flooding: the Leaf answers ARP requests locally for hosts it already knows from EVPN; requests for “silent” hosts still have to be flooded (most hosts send GARP/RARP to the network when they come online; silent hosts don’t do that)
VXLAN L3VNI typically uses an “anycast gateway” on the Leaf switches, with the same gateway IP address shared across all the participating Leaf switches; similar in purpose to an FHRP (VRRP/HSRP/etc…)
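The numbers in the facts above are easy to check with a few lines of Python. This is just arithmetic under stated assumptions: an untagged inner Ethernet frame and an IPv4 underlay (an inner 802.1Q tag or an IPv6 underlay would add more bytes). The constant and function names are mine, not from any vendor tool.

```python
# Size of each ID space
VLAN_ID_BITS = 12  # IEEE 802.1Q VLAN ID field
VNI_BITS = 24      # VXLAN VNI field

vlan_ids = 2 ** VLAN_ID_BITS
vni_ids = 2 ** VNI_BITS

# Encapsulation overhead in bytes.
# Assumptions: untagged inner frame, IPv4 underlay.
OVERHEAD = {
    "inner_ethernet": 14,
    "vxlan": 8,
    "udp": 8,
    "outer_ipv4": 20,
}


def underlay_ip_mtu(inner_ip_mtu: int = 1500) -> int:
    """IP MTU the underlay needs to carry an inner IP packet of the given size."""
    return inner_ip_mtu + sum(OVERHEAD.values())


print(vlan_ids)           # 4096
print(vni_ids)            # 16777216
print(underlay_ip_mtu())  # 1550
```

So a 1500-byte host packet needs at least a 1550-byte IP MTU in the underlay, which is why jumbo frames end-to-end are the usual choice.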


(VXLAN header details – picture copied from cisco.com)
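If you prefer code to pictures, the 8-byte header can be sketched in Python with the standard `struct` module. The layout follows RFC 7348: 8 flag bits (the “I” flag `0x08` marks a valid VNI), 24 reserved bits, the 24-bit VNI, then 8 reserved bits. The function names and the example VNI are my own choices for illustration.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, 24 reserved bits set to zero.
    # Word 2: VNI in the top 24 bits, 8 reserved bits set to zero.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vxlan_header(header: bytes) -> int:
    """Return the VNI carried in an 8-byte VXLAN header."""
    if len(header) != 8:
        raise ValueError("VXLAN header must be exactly 8 bytes")
    word1, word2 = struct.unpack("!II", header)
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word2 >> 8


header = pack_vxlan_header(10010)
print(header.hex())                 # 0800000000271a00
print(unpack_vxlan_header(header))  # 10010
```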

Security in VXLAN MP-BGP EVPN based VTEP

the earlier multicast-based VTEP peer discovery didn’t have a mechanism for authenticating VTEP peers; in plain English there was “NO” whitelist for VTEP peers!
this limitation presents a major security risk in real-world VXLAN deployments because it allows insertion of a rogue VTEP into a VNI segment!
once a rogue VTEP has been inserted into the segment, it can send and receive VXLAN traffic! ah! gotcha!
MP-BGP EVPN based VTEP peers are pre-authenticated and whitelisted by BGP; a BGP session must be established first for a VTEP device to discover remote VTEP peers
in addition to the established BGP session requirement – BGP session authentication can be added to the BGP peers (TCP MD5; on NX-OS the password can be stored 3DES-encrypted in the config)
in addition to BGP session security – IGP security (aka. auth) can be added to the “underlay” routing protocols
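The whitelisting idea can be pictured with a small Python toy model. It is a sketch of the logic only, under my own assumptions: the `Vtep` class and its methods are hypothetical, and the final packet check stands in for whatever a real platform does with traffic from unknown VTEPs. In EVPN with ingress replication, remote VTEPs are learned from Type-3 (Inclusive Multicast Ethernet Tag) routes, which can only arrive over an established BGP session.

```python
class Vtep:
    """Toy model: remote VTEPs are learned only via established BGP sessions."""

    def __init__(self, bgp_neighbors):
        # Explicitly configured BGP neighbors act as the whitelist
        self.bgp_neighbors = set(bgp_neighbors)
        self.established = set()
        self.flood_list = {}  # VNI -> set of remote VTEP IPs

    def session_up(self, neighbor):
        """A session can only come up with a configured neighbor."""
        if neighbor not in self.bgp_neighbors:
            return False
        self.established.add(neighbor)
        return True

    def receive_imet_route(self, from_neighbor, vni, vtep_ip):
        """Accept a Type-3 route only from an established session."""
        if from_neighbor not in self.established:
            return False
        self.flood_list.setdefault(vni, set()).add(vtep_ip)
        return True

    def accept_vxlan_packet(self, vni, source_ip):
        """Assumed policy: accept VXLAN traffic only from learned VTEPs."""
        return source_ip in self.flood_list.get(vni, set())


# Hypothetical addresses: 10.0.0.11 is a Spine acting as route reflector
leaf = Vtep(bgp_neighbors=["10.0.0.11"])
leaf.session_up("10.0.0.11")
leaf.receive_imet_route("10.0.0.11", 10010, "10.0.0.2")

print(leaf.accept_vxlan_packet(10010, "10.0.0.2"))  # True
print(leaf.accept_vxlan_packet(10010, "10.9.9.9"))  # False
```

A rogue device that cannot bring up a BGP session never gets its VTEP address into the flood list, which is the point of the BGP-based whitelist.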

Few Quick Notes on VXLAN Network Design

VXLAN network design doesn’t follow the traditional “three-layer” network design approach (core – dist – access)
VXLAN network design typically has two tiers – Spine and Leaf; this design can grow horizontally “pay as you go”; you can add more Spine and Leaf switches anytime! No more fixed number of switch ports per POD!
you can have a “super spine” on top of the Spine switches
Leaf switches are connected to Spine switches within the same POD
Spine to Spine direct network connections are “not” necessary but they “can be” connected
the underlay IGP ensures end-to-end L3 connectivity between Leaf and Spine switches
clients are connected to the Leaf switches (servers, hypervisors, routers etc…)
in a multi-POD DC scenario – Spine switches need to be inter-connected (same EVPN control plane across multi-POD); intra-site DCI
in a multi-site data centre inter-connect (DCI) scenario, a “segmented” VXLAN “control plane” is deployed to limit BUM traffic to each data center; inter-DC traffic is handled by VXLAN Border Gateway (BGW) routers
in a multi-site DCI scenario, the Border Gateway router (BGW) can be configured on the Spine switches (there are many other connectivity models/scenarios available for BGW); in this case Spine DC-x and Spine DC-y are connected back to back via an L3 link, which is very similar to multi-POD Spine to Spine connectivity
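The “grow horizontally” point is easy to see in a small Python sketch. It assumes every Leaf connects to every Spine in the POD and that there are no Spine-to-Spine links; switch names and counts are made up. Under that assumption, each Spine you add is one more equal-cost path between any two Leaf switches.

```python
from itertools import product


def build_fabric(num_spines, num_leaves):
    """Return spines, leaves and links for a POD where every Leaf connects to every Spine."""
    spines = [f"spine{i}" for i in range(1, num_spines + 1)]
    leaves = [f"leaf{i}" for i in range(1, num_leaves + 1)]
    links = set(product(leaves, spines))
    return spines, leaves, links


def equal_cost_paths(src_leaf, dst_leaf, spines, links):
    """Two-hop Leaf -> Spine -> Leaf paths, one per Spine both Leafs attach to."""
    return [
        (src_leaf, spine, dst_leaf)
        for spine in spines
        if (src_leaf, spine) in links and (dst_leaf, spine) in links
    ]


spines, leaves, links = build_fabric(num_spines=2, num_leaves=4)
print(len(links))                                             # 8
print(len(equal_cost_paths("leaf1", "leaf4", spines, links)))  # 2

# Add one more Spine: more links, one more ECMP path between any two Leafs
spines, leaves, links = build_fabric(num_spines=3, num_leaves=4)
print(len(links))                                             # 12
print(len(equal_cost_paths("leaf1", "leaf4", spines, links)))  # 3
```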

Typical VXLAN Design Diag

VXLAN traffic flow diagram – inter-switch VLAN traffic follows the L3 path.


Few Notes While Configuring VXLAN on Cisco Nexus NXOS

VXLAN EVPN is based on MP-BGP; it is just another MP-BGP address family (l2vpn evpn), very similar to MPLS VPNv4 or VPLS l2vpn
if you have configured MP-BGP MPLS before – you will find that VXLAN EVPN configuration is super easy
VXLAN VTEP switches are much like “PE” routers in a typical MPLS network

17 thoughts on “VXLAN, MBGP EVPN with Ingress Replication – Part 1 – Basic Facts, Design Considerations and Security”

Stephen on August 27, 2019 at 5:43 am said:

Nice Work Jamster!!!!

- infojami on August 27, 2019 at 6:28 am said:
  
  thank you mate.
  
Lane Lu on August 27, 2019 at 11:24 pm said:

Nice share as always Jami, looks you slipped into networking much deeper now 😉

- infojami on August 27, 2019 at 11:26 pm said:
  
  hehe…indeed. thanks mate.
  
Loopy on September 3, 2019 at 4:02 am said:

That is an amazing, simple to understand article, seriously kudos to the author. Following Blog! 🙂

- infojami on September 3, 2019 at 4:15 am said:
  
  thanks for your nice words mate! i am glad i was able to simplify this!
  
arsenalytic on September 7, 2020 at 9:14 am said:

Concise!

Is it possible to write a blog on how to design and configure “Configure VXLAN on multi PODs and multi DC environment – L2 VNI” as data centers are in active active mode these days ?

- infojami on September 8, 2020 at 4:09 am said:
  
  Yes – I have a plan to write up articles on multi-POD and also on multi-site (dual control-plane EVPN/VXLAN). I will let you know once I publish them. But multi-POD is super easy – it’s just an extension of BGP peers (underlay + overlay) and VTEPs to other PODs.
  
  - arsenalytic on September 8, 2020 at 4:41 am said:
    
    Thank you so much !
    
tester on March 24, 2021 at 8:41 am said:

very good explanation

- infojami on March 26, 2021 at 1:17 am said:
  
  I am glad you liked it; thank you.
  
BB on March 25, 2021 at 6:16 pm said:

Thanks for this article, very useful. How would you connect to this POD to another POD across the WAN so for instance to extend some VLANs -L2 stretching?

- infojami on March 26, 2021 at 1:39 am said:
  
  I am glad that you liked this article; multi-POD is not too hard to set up; “POD-X SPINEs” will have BGP peering to “POD-Y SPINEs” and they exchange “l2vpn evpn as address family” with “retain all route-targets”. Connectivity between POD-X SPINEs and POD-Y SPINEs in my cases has always been dark-fiber DCI connectivity; however, this can be as Juniper mentioned in their doco for DCI – “WAN underlay can be a Layer 3 VPN from a third-party, dark fiber, IPSec over Internet, etc.”; here is a link https://www.juniper.net/documentation/en_US/release-independent/nce/topics/concept/nce-211-ott-dci-overview.html; since EVPN is not Cisco/Juniper proprietary – what the Juniper doco states should be applicable to most scenarios.
  
  - BB on March 26, 2021 at 3:35 pm said:
    
    Thanks so much for your reply and great work with this site.
    
Peter on March 26, 2021 at 12:39 am said:

Best overview I think I’ve seen! brilliant.. thanks a mil 👍

- infojami on March 26, 2021 at 1:39 am said:
  
  thank you – I am glad you liked it.
  