L3VPN with FRR and EVPN VXLAN on Linux

FRRouting is a fully-featured IP routing stack which runs on a variety of Unix-like operating systems.

In this post I will show how to configure FRR on Linux to create L3 VPNs across a common IP underlay network. BGP EVPN is used between end hosts to distribute VRF routing tables and encapsulation information. VXLAN is used on the wire to encapsulate VRF traffic across a common IP network.

BGP EVPN / VXLAN

BGP EVPN signalled networks, with VXLAN transport, have become very popular in datacentre networks and beyond in recent years. The original and primary use-case for such networks is layer-2 virtualization across IP fabrics.

The EVPN standard, however, also includes a pure layer-3 "prefix route" (type 5). This route type can be used to build L3VPNs similar to those made with the VPNv4 or VPNv6 SAFIs.

One nice feature of using EVPN to signal a network with VXLAN transport is that end devices ("PEs", "VTEPs", or whatever you want to call them) do not need to share any state or knowledge of each other beyond the information that EVPN carries. This is unlike MPLS L3VPNs, which generally require a mechanism to signal labels, or to establish GRE tunnels, in addition to the information in BGP.

With EVPN, end systems simply need to be able to send IP traffic to the remote next-hop IPs (VTEPs) encoded in the BGP updates. The underlay routing on end devices could be as simple as a few static routes (although using a dynamic protocol to indicate reachability is better). The underlay might run an IGP or unicast BGP for that, and can stretch across any combination of such domains. The public internet could also serve as an underlay (MTU issues notwithstanding).

Route Reflectors can be deployed to scale and build out the control plane to suit the hierarchy of the underlay.

Software performance probably won’t be great, but it should be possible to leverage hardware acceleration for VXLAN encap on supporting NICs.

Network Setup

The setup I'll show here is a lot simpler than any of my ramblings above!

Three machines are configured. The two edge machines (PE1/PE2) each have two Linux VRFs defined, RED and BLUE. The VRFs simply have dummy/loopback interfaces in them (no externally connected interfaces), just so we can send some test pings.

PE1 and PE2 each have a second IP configured on their loopback interface (1.1.1.1 and 2.2.2.2 respectively). This is the IP that we will bind VXLAN devices to and use on each device as VTEP source.

Each has a static route to the loopback of the other, via an intermediary VM, "CORE1". That VM just has two static routes configured, one for each PE loopback.

NOTE: I've built this all with IPv4 but it should work equally well with IPv6. EVPN can carry either type of prefix in a type-5 route.

NOTE 2: My terminology isn't exact; the term PE is more associated with MPLS networks than anything else. The devices I've called "PEs" are basically the edge devices with all the tenant VRFs defined. In a data centre Clos network they would be leaf switches, and the "CORE" machine would be a spine.

Basic VM Configuration

Basic Linux Network Config

All devices have forwarding enabled, in /etc/sysctl.conf, something like:

net.ipv4.ip_forward=1

As I've only used loopback interfaces within the VRFs, in theory this doesn't need to be set on the PEs. But in a real-world scenario, where you will have access interfaces belonging to particular VRFs on the PEs, it should be there. Also note this was just a quick test, so I only configured IPv4 routing; as noted above, IPv6 should work just as well.
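
To apply the setting immediately without a reboot, sysctl can write it live. A minimal sketch (the IPv6 key is shown for completeness but is not used in this lab):

sudo sysctl -w net.ipv4.ip_forward=1
# IPv6 equivalent if extending the lab to v6 routing (not configured here):
sudo sysctl -w net.ipv6.conf.all.forwarding=1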

All machines are running Ubuntu 18.04, updated to Linux kernel version 5.3. In addition, all have had netplan removed and ifupdown2 installed in its place. ifupdown2 supports Linux VRFs and some other nice things, and generally I'm a fan. Installing is as simple as:

sudo apt install ifupdown2
sudo apt remove netplan.io libnetplan0
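
After editing files under /etc/network/interfaces.d/, ifupdown2 can apply the changes in one shot with its reload command:

sudo ifreload -a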

CORE Device Config

In addition to having forwarding enabled, the CORE machine has very little going on: the interfaces are configured and static routes for the PE loopbacks are defined:

root@core:/etc/network/interfaces.d# cat ens192.cfg 
allow-hotplug ens192
auto ens192
iface ens192 inet static
    address 198.18.3.1/24
    post-up ip route add 1.1.1.1/32 via 198.18.3.2

root@core:/etc/network/interfaces.d# cat ens224.cfg 
allow-hotplug ens224
auto ens224
iface ens224 inet static
    address 198.18.4.1/24
    post-up ip route add 2.2.2.2/32 via 198.18.4.2
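
A quick sanity check on CORE1 is to ask the kernel how it would route each PE loopback; the output should reflect the static next-hops configured above:

ip route get 1.1.1.1
ip route get 2.2.2.2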

PE Reachability Config

On "PE" devices the configuration of interfaces is fairly straightforward. Interfaces lo and ens224 are configured with the appropriate IPs, and a static route is added for the loopback of the remote PE:

## PE1

auto lo
iface lo inet loopback
    address 1.1.1.1/32

allow-hotplug ens224
auto ens224
iface ens224 inet static
    address 198.18.3.2/24
    post-up ip route add 2.2.2.2/32 via 198.18.3.1

## PE2

auto lo
iface lo inet loopback
    address 2.2.2.2/32

allow-hotplug ens224
auto ens224
iface ens224 inet static
    address 198.18.4.2/24
    post-up ip route add 1.1.1.1/32 via 198.18.4.1

With that done we can now ping between the two loopback interfaces:

root@pe1:~# ping -I 1.1.1.1 2.2.2.2
PING 2.2.2.2 (2.2.2.2) from 1.1.1.1 : 56(84) bytes of data.
64 bytes from 2.2.2.2: icmp_seq=1 ttl=63 time=0.823 ms
64 bytes from 2.2.2.2: icmp_seq=2 ttl=63 time=1.37 ms
64 bytes from 2.2.2.2: icmp_seq=3 ttl=63 time=1.35 ms

PE VRF Config

Ok so now the two "PE" VMs can send traffic to each other. What we want to do is to create isolated Linux VRFs on both machines, and exchange routes between PEs using a single BGP EVPN session.

EVPN type-5 routes will be announced for all routes in each VRF. These routes carry a route distinguisher, to ensure uniqueness within the EVPN RIB, and route-targets, which control local VRF import/export. Additionally each route has a VNID, specifying the identifier used for VXLAN encapsulation, and a "Router MAC", which defines the destination MAC address to use in the inner Ethernet header of the encapsulated frame (VXLAN can only carry Ethernet, so every encapsulated IP packet needs a MAC header inside the UDP VXLAN transport).

We need to create several interfaces to make this work. Firstly we will create a local VRF device (l3mdev) for each VRF. We will create a loopback/dummy interface and place it in the VRF as our test interface.

Additionally, to support VXLAN transport, we will create a VXLAN device for each VRF. We allocate this interface a VNID when it is created, and bind it to the local loopback IP in the global table (i.e. 1.1.1.1/2.2.2.2). We also create a Linux bridge device, to which we enslave the VXLAN interface, and place into the VRF.

The bridge device supplies the MAC address that will be used as the Router-MAC in the EVPN type-5 prefix announcements. It also stores the RMACs of remote devices, learnt from EVPN routes, in its FDB table.

That's all a little complicated just to get layer-3 working, and it mostly reflects the origins of VXLAN as a purely layer-2 overlay technique. We need the layer-2 devices and information because VXLAN packets can only carry Ethernet frames, so we need to build a MAC header for encapsulated VRF traffic, even if that traffic comes from a non-Ethernet device like the dummy loopbacks we created.
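
For reference, here is a sketch of the equivalent plumbing in raw iproute2, i.e. roughly what ifupdown2 drives on our behalf for VRF red on PE1 (details such as the dstport are assumptions - 4789 is the standard VXLAN port):

ip link add red type vrf table 1001    # the VRF (l3mdev) device
ip link set red up
ip link add vxlan1001 type vxlan id 1001 local 1.1.1.1 dstport 4789 nolearning
ip link add br1001 type bridge         # supplies the Router-MAC and FDB
ip link set br1001 master red          # bridge sits inside the VRF
ip link set vxlan1001 master br1001    # VXLAN device enslaved to the bridge
ip link set vxlan1001 up
ip link set br1001 up
ip link add lo1001 type dummy          # test loopback inside the VRF
ip link set lo1001 master red
ip addr add 1.0.0.1/32 dev lo1001
ip link set lo1001 up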

All in all the VRF config ends up looking like below. I chose to make a single file in /etc/network/interfaces.d/ for each VRF, and put the definitions for all the required devices into that file:

# PE1 - VRF RED

root@pe1:/etc/network/interfaces.d# cat vrf_red.cfg 
auto red
iface red
    address 127.0.0.1/8
    vrf-table 1001

auto vxlan1001
iface vxlan1001
    vxlan-id 1001
    vxlan-local-tunnelip 1.1.1.1
    bridge-learning off
    bridge-arp-nd-suppress on
    mtu 1450

auto br1001
iface br1001
    vrf red
    bridge_ports vxlan1001
    bridge_stp off
    bridge_fd 0
    mtu 1450

auto lo1001
iface lo1001 inet loopback
    vrf red
    address 1.0.0.1/32
    pre-up ip link add name lo1001 type dummy

# PE1 - VRF BLUE

auto blue
iface blue
    address 127.0.0.1/8
    vrf-table 1002

auto vxlan1002
iface vxlan1002
    vxlan-id 1002
    vxlan-local-tunnelip 1.1.1.1
    bridge-learning off
    bridge-arp-nd-suppress on
    mtu 1450

auto br1002
iface br1002
    vrf blue
    bridge_ports vxlan1002
    bridge_stp off
    bridge_fd 0
    mtu 1450

auto lo1002
iface lo1002 inet loopback
    vrf blue
    address 2.0.0.1/32
    pre-up ip link add name lo1002 type dummy

# PE2 - VRF RED

auto red
iface red
    address 127.0.0.1/8
    vrf-table 1001

auto vxlan1001
iface vxlan1001
    vxlan-id 1001
    vxlan-local-tunnelip 2.2.2.2
    bridge-learning off
    bridge-arp-nd-suppress on
    mtu 1450

auto br1001
iface br1001
    vrf red
    bridge_ports vxlan1001
    bridge_stp off
    bridge_fd 0
    mtu 1450

auto lo1001
iface lo1001 inet loopback
    vrf red
    address 1.0.0.2/32
    pre-up ip link add name lo1001 type dummy

# PE2 - VRF BLUE

auto blue
iface blue
    address 127.0.0.1/8
    vrf-table 1002

auto vxlan1002
iface vxlan1002
    vxlan-id 1002
    vxlan-local-tunnelip 2.2.2.2
    bridge-learning off
    bridge-arp-nd-suppress on
    mtu 1450

auto br1002
iface br1002
    vrf blue
    bridge_ports vxlan1002
    bridge_stp off
    bridge_fd 0
    mtu 1450

auto lo1002
iface lo1002 inet loopback
    vrf blue
    address 2.0.0.2/32
    pre-up ip link add name lo1002 type dummy

I've chosen the table numbers for each VRF based on what the system normally auto-creates if you let it do that (i.e. the first one is 1001). I used the same number in the loopback, bridge, and VXLAN device names to keep things consistent.

With this configured, one can see (for instance on PE1) that the VRFs exist, have routes, and that we can ping the lo100x loopback IPs:

root@pe1:~# ip vrf list
Name              Table
-----------------------
blue              1002
red               1001

root@pe1:~# ip route show vrf red
127.0.0.0/8 dev red proto kernel scope link src 127.0.0.1 
root@pe1:~# 
root@pe1:~# ip route show table 1001
local 1.0.0.1 dev lo1001 proto kernel scope host src 1.0.0.1 
broadcast 127.0.0.0 dev red proto kernel scope link src 127.0.0.1 
127.0.0.0/8 dev red proto kernel scope link src 127.0.0.1 
local 127.0.0.1 dev red proto kernel scope host src 127.0.0.1 
broadcast 127.255.255.255 dev red proto kernel scope link src 127.0.0.1 
root@pe1:~# 
root@pe1:~# ip vrf exec red ping 1.0.0.1
PING 1.0.0.1 (1.0.0.1) 56(84) bytes of data.
64 bytes from 1.0.0.1: icmp_seq=1 ttl=64 time=0.015 ms
64 bytes from 1.0.0.1: icmp_seq=2 ttl=64 time=0.077 ms

Interestingly, as can be seen, the local interface IPs, such as 1.0.0.1, only appear when "ip route show table X" is issued; "ip route show vrf red" shows only the unicast routes, not the local and broadcast entries. I believe this is down to how VRF works on Linux, and the fact that the l3mdev only affects routing and forwarding decisions.
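
If you do want to see those entries through the VRF view, a couple of alternatives exist (assuming a reasonably recent iproute2):

ip address show vrf red                 # interface addresses, grouped by VRF
ip route show table 1001 type local     # just the local entries in the table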

FRRouting

All well and good so far. But our VRFs on PE1 and PE2 are effectively isolated islands with no connectivity - how do we learn about remote networks and populate VRF routing tables with the routes at the remote side? Enter FRR.

FRR Basics

FRR was installed from apt. Once installed, a few things are needed to get up and running. I've tended to use the separate per-daemon config files rather than a single integrated frr.conf, although I'm not quite sure why I started that way. To enable that we delete /etc/frr/frr.conf and add this line to /etc/frr/vtysh.conf:

no service integrated-vtysh-config

In /etc/frr/daemons there is one change to be made - enable bgpd by setting it to 'yes':

bgpd=yes

That should be enough to get going; we can then enable and restart FRR if that hasn't happened already:

systemctl enable frr
systemctl restart frr

We can then run the "vtysh" command to get to the FRR CLI:

root@pe1:~# vtysh

Hello, this is FRRouting (version 7.3).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

pe1# 
pe1# 

FRR VRF / IPv4 Unicast BGP config

Ok so the first thing we want to do is bind the VRF tables in FRR to the VNIs of the VXLAN devices we created earlier. For instance, the VXLAN device we created in VRF red on both PEs was assigned VNI 1001. So we add this configuration (the same on both devices):

vrf red
 vni 1001
 exit-vrf
!
vrf blue
 vni 1002
 exit-vrf
!

We need to enable BGP unicast in every VRF, which creates separate per-VRF BGP unicast tables to hold their routes. Ultimately we will export all these routes to EVPN, adding route-distinguishers, route-targets, and other attributes/communities as we do so. We use the same ASN (in this example, private AS 65000) in the configuration of each VRF:

router bgp 65000 vrf blue
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
!
router bgp 65000 vrf red
 address-family ipv4 unicast
  redistribute connected
 exit-address-family
!

Notice I also issued "redistribute connected" to redistribute local networks into BGP (in this case that just means the lo100x interface subnets, as the VRFs have no other interfaces).

Once configured, local routes should be present in the per-VRF IPv4 unicast tables:

pe1# show bgp vrf red ipv4 unicast 
BGP table version is 1, local router ID is 1.0.0.1, vrf id 12
Default local pref 100, local AS 65000
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 1.0.0.1/32       0.0.0.0                  0         32768 ?

Displayed  1 routes and 1 total paths
pe1# show bgp vrf blue ipv4 unicast 
BGP table version is 1, local router ID is 2.0.0.1, vrf id 8
Default local pref 100, local AS 65000
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 2.0.0.1/32       0.0.0.0                  0         32768 ?

Displayed  1 routes and 1 total paths
pe1# 

EVPN Config

So far we just have local routes. We need to create a BGP peering between the two PE devices, and enable it for the EVPN SAFI:

# PE1

router bgp 65000
 neighbor 2.2.2.2 remote-as 65000
 neighbor 2.2.2.2 update-source 1.1.1.1
 !
 address-family ipv4 unicast
  no neighbor 2.2.2.2 activate
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor 2.2.2.2 activate
  advertise-all-vni
  advertise ipv4 unicast
 exit-address-family
!

# PE2

router bgp 65000
 neighbor 1.1.1.1 remote-as 65000
 neighbor 1.1.1.1 update-source 2.2.2.2
 !
 address-family ipv4 unicast
  no neighbor 1.1.1.1 activate
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor 1.1.1.1 activate
  advertise-all-vni
  advertise ipv4 unicast
 exit-address-family
!

Note, as can be seen, that I have disabled the default IPv4 unicast address-family for this peering - we only want to exchange EVPN routes. Also note the peering is a multihop iBGP session between the loopbacks of the PE devices.

Finally we need to configure each VRF to export its unicast BGP routes to the global EVPN table, so that they can be announced over the EVPN peering between the PEs:

router bgp 65000 vrf blue
 address-family l2vpn evpn
  advertise ipv4 unicast
 exit-address-family
!
router bgp 65000 vrf red
 address-family l2vpn evpn
  advertise ipv4 unicast
 exit-address-family
!

Astute observers may note that I have not manually configured route-distinguishers or route-target imports/exports for the VRFs. These can be set manually, but if left unspecified FRR will auto-derive them, which works just fine, so there is no need to adjust this unless required.
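
If you do need control over them (for example to match another vendor's route-target scheme), FRR accepts them under each VRF's l2vpn evpn address-family. A sketch with illustrative values:

router bgp 65000 vrf red
 address-family l2vpn evpn
  rd 1.1.1.1:1001
  route-target import 65000:1001
  route-target export 65000:1001
 exit-address-family
!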

Test session

Once the above has been configured we should see the BGP session come up:

pe2# show bgp l2vpn evpn summary 
BGP router identifier 2.2.2.2, local AS number 65000 vrf-id 0
BGP table version 0
RIB entries 7, using 1288 bytes of memory
Peers 1, using 20 KiB of memory

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
1.1.1.1         4      65000     213     211        0    0    0 00:00:09            2

Total number of neighbors 1

Given that there are 2 VRFs configured on each of the 2 PEs, and each VRF has a single loopback interface redistributed into BGP, we end up with 4 loopback interfaces and thus 4 EVPN routes:

# PE1

pe1# show bgp l2vpn evpn 
BGP table version is 1, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 1.0.0.1:2
*> [5]:[0]:[32]:[1.0.0.1]
                    1.1.1.1                  0         32768 ?
                    ET:8 RT:65000:1001 Rmac:f6:34:1d:eb:36:ee
Route Distinguisher: 1.0.0.2:2
*>i[5]:[0]:[32]:[1.0.0.2]
                    2.2.2.2                  0    100      0 ?
                    RT:65000:1001 ET:8 Rmac:62:d2:01:db:a1:ed
Route Distinguisher: 2.0.0.1:3
*> [5]:[0]:[32]:[2.0.0.1]
                    1.1.1.1                  0         32768 ?
                    ET:8 RT:65000:1002 Rmac:de:3a:88:6b:c7:b3
Route Distinguisher: 2.0.0.2:3
*>i[5]:[0]:[32]:[2.0.0.2]
                    2.2.2.2                  0    100      0 ?
                    RT:65000:1002 ET:8 Rmac:be:99:d1:fd:b3:a4

Displayed 4 out of 4 total prefixes


pe1# show bgp l2vpn evpn 1.0.0.2
BGP routing table entry for 1.0.0.2:2:[5]:[0]:[32]:[1.0.0.2]
Paths: (1 available, best #1)
  Not advertised to any peer
  Route [5]:[0]:[32]:[1.0.0.2] VNI 1001
  Local
    2.2.2.2 from 2.2.2.2 (2.2.2.2)
      Origin incomplete, metric 0, localpref 100, valid, internal, best (First path received)
      Extended Community: RT:65000:1001 ET:8 Rmac:62:d2:01:db:a1:ed
      Last update: Mon Jun 15 20:33:23 2020


pe1# show bgp l2vpn evpn 2.0.0.2
BGP routing table entry for 2.0.0.2:3:[5]:[0]:[32]:[2.0.0.2]
Paths: (1 available, best #1)
  Not advertised to any peer
  Route [5]:[0]:[32]:[2.0.0.2] VNI 1002
  Local
    2.2.2.2 from 2.2.2.2 (2.2.2.2)
      Origin incomplete, metric 0, localpref 100, valid, internal, best (First path received)
      Extended Community: RT:65000:1002 ET:8 Rmac:be:99:d1:fd:b3:a4
      Last update: Mon Jun 15 20:33:24 2020

# PE2

pe2# show bgp l2vpn evpn
BGP table version is 7, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 1.0.0.1:2
*>i[5]:[0]:[32]:[1.0.0.1]
                    1.1.1.1                  0    100      0 ?
                    RT:65000:1001 ET:8 Rmac:f6:34:1d:eb:36:ee
Route Distinguisher: 1.0.0.2:2
*> [5]:[0]:[32]:[1.0.0.2]
                    2.2.2.2                  0         32768 ?
                    ET:8 RT:65000:1001 Rmac:62:d2:01:db:a1:ed
Route Distinguisher: 2.0.0.1:3
*>i[5]:[0]:[32]:[2.0.0.1]
                    1.1.1.1                  0    100      0 ?
                    RT:65000:1002 ET:8 Rmac:de:3a:88:6b:c7:b3
Route Distinguisher: 2.0.0.2:3
*> [5]:[0]:[32]:[2.0.0.2]
                    2.2.2.2                  0         32768 ?
                    ET:8 RT:65000:1002 Rmac:be:99:d1:fd:b3:a4

Displayed 4 out of 4 total prefixes


pe2# show bgp l2vpn evpn 1.0.0.1
BGP routing table entry for 1.0.0.1:2:[5]:[0]:[32]:[1.0.0.1]
Paths: (1 available, best #1)
  Not advertised to any peer
  Route [5]:[0]:[32]:[1.0.0.1] VNI 1001
  Local
    1.1.1.1 from 1.1.1.1 (1.1.1.1)
      Origin incomplete, metric 0, localpref 100, valid, internal, best (First path received)
      Extended Community: RT:65000:1001 ET:8 Rmac:f6:34:1d:eb:36:ee
      Last update: Mon Jun 15 20:33:24 2020


pe2# show bgp l2vpn evpn 2.0.0.1
BGP routing table entry for 2.0.0.1:3:[5]:[0]:[32]:[2.0.0.1]
Paths: (1 available, best #1)
  Not advertised to any peer
  Route [5]:[0]:[32]:[2.0.0.1] VNI 1002
  Local
    1.1.1.1 from 1.1.1.1 (1.1.1.1)
      Origin incomplete, metric 0, localpref 100, valid, internal, best (First path received)
      Extended Community: RT:65000:1002 ET:8 Rmac:de:3a:88:6b:c7:b3
      Last update: Mon Jun 15 20:20:13 2020

In turn, based on the route-targets attached to the routes, FRR will import them into the appropriate local VRF BGP unicast tables:

pe1# show bgp vrf blue ipv4 unicast 
BGP table version is 2, local router ID is 2.0.0.1, vrf id 8
Default local pref 100, local AS 65000
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 2.0.0.1/32       0.0.0.0                  0         32768 ?
*>i2.0.0.2/32       2.2.2.2                  0    100      0 ?

Displayed  2 routes and 2 total paths
pe1# show bgp vrf red ipv4 unicast 
BGP table version is 2, local router ID is 1.0.0.1, vrf id 12
Default local pref 100, local AS 65000
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 1.0.0.1/32       0.0.0.0                  0         32768 ?
*>i1.0.0.2/32       2.2.2.2                  0    100      0 ?

Displayed  2 routes and 2 total paths

Testing

Leaving the FRR CLI (do a 'write' first to save the config!), we should see the routes imported into the kernel VRF tables:

root@pe1:~# ip route show vrf red
1.0.0.2 via 2.2.2.2 dev br1001 proto bgp metric 20 onlink 
127.0.0.0/8 dev red proto kernel scope link src 127.0.0.1 
root@pe1:~# ip route show vrf blue
2.0.0.2 via 2.2.2.2 dev br1002 proto bgp metric 20 onlink 
127.0.0.0/8 dev blue proto kernel scope link src 127.0.0.1 

And voila - if I issue a ping from the local loopback in the 'red' VRF towards the remote one, it works!

root@pe1:~# ip vrf exec red ping -I 1.0.0.1 1.0.0.2
PING 1.0.0.2 (1.0.0.2) from 1.0.0.1 : 56(84) bytes of data.
64 bytes from 1.0.0.2: icmp_seq=1 ttl=64 time=0.564 ms
64 bytes from 1.0.0.2: icmp_seq=2 ttl=64 time=1.46 ms
64 bytes from 1.0.0.2: icmp_seq=3 ttl=64 time=1.38 ms

If I look at a packet capture taken on the intermediate 'core' machine, I can see the traffic is encapsulated with VXLAN and uses the identifiers we expect:
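
Such a capture can be taken on CORE1 with tcpdump, a sketch (ens192 is CORE1's PE1-facing interface from earlier; 4789 is the standard VXLAN UDP port):

tcpdump -ni ens192 -w vxlan.pcap udp port 4789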

[Wireshark packet capture screenshot]
[Full PCAP file download]

Speed Test

A very quick iperf3 test, on relatively modest machines with one core per VM and no tuning whatsoever, did just over 400 Mbit/second with a small (512-byte) write size.

root@pe1:~# ip vrf exec blue iperf3 -Z -l 512 -B 2.0.0.1 -c 2.0.0.2 -i 4 -t 20
Connecting to host 2.0.0.2, port 5201
[  5] local 2.0.0.1 port 33503 connected to 2.0.0.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  5]   0.00-4.00   sec   183 MBytes   385 Mbits/sec   74   1.33 MBytes       
[  5]   4.00-8.00   sec   197 MBytes   414 Mbits/sec    2   1.14 MBytes       
[  5]   8.00-12.00  sec   196 MBytes   411 Mbits/sec    0   1.19 MBytes       
[  5]  12.00-16.00  sec   196 MBytes   411 Mbits/sec    0   1.21 MBytes       
[  5]  16.00-20.00  sec   198 MBytes   415 Mbits/sec    0   1.40 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  5]   0.00-20.00  sec   971 MBytes   407 Mbits/sec   76             sender
[  5]   0.00-20.00  sec   969 MBytes   406 Mbits/sec                  receiver

Investigation showed that CPU on the CORE1 machine in the middle, which does no VXLAN encapsulation and just forwards between interfaces, hit 100% during this test. As the machines all have the same spec, I took that as a good sign that the encapsulation wasn't a complete performance killer. CPU usage on PE1, generating the traffic, was also 100%; CPU on PE2, receiving it, was approximately 70%.
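
To check whether a given NIC advertises any tunnel offloads, the driver feature flags can be queried; a sketch (feature names vary by driver):

ethtool -k ens224 | grep -E 'udp|tnl'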

Observations

A few notes on what happens:

The RMACs on type-5 routes in EVPN come from the MAC address assigned to the bridge device in each VRF.

The MAC address of that same bridge device is used as the source MAC in the encapsulated Ethernet frame on the wire.

Locally, when an EVPN type-5 route is learnt by FRR, Zebra creates an entry with the prefix's RMAC in the FDB table of vxlan100x:

root@pe1:~# bridge fdb show dev vxlan1001
62:d2:01:db:a1:ed vlan 1 extern_learn master br1001 
62:d2:01:db:a1:ed extern_learn master br1001 
f6:34:1d:eb:36:ee vlan 1 master br1001 permanent
f6:34:1d:eb:36:ee master br1001 permanent
62:d2:01:db:a1:ed dst 2.2.2.2 self extern_learn 

In the VRF routing table a route like this is inserted:

root@pe1:~# ip route show vrf red
1.0.0.2 via 2.2.2.2 dev br1001 proto bgp metric 20 onlink 

The key thing in the above route is that it is "onlink", which tells the kernel to accept the route even though the next-hop is not on any directly connected network. The fact that the route is via an Ethernet bridge causes the kernel to build a layer-2 frame for any packet using the VRF route entry. The source MAC address will be the bridge's own MAC address.

The 'onlink' route, in an Ethernet context anyway, tells the kernel to use the layer-2 address listed against the destination address on the given interface when creating the layer-2 frame (as opposed to trying to ARP for the destination IP). Zebra inserts a static neighbour entry in the kernel's neighbour table for the remote VTEP IP (the BGP next-hop), listed against the bridge associated with the L3VNI. The MAC address it lists for the VTEP is the RMAC from the EVPN-learnt route.
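
In principle that entry can be inspected directly (though, as per the note further down, it wasn't visible in my own tests):

ip neigh show dev br1001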

Typically a single VTEP will use a different RMAC per VRF. This doesn't cause a conflict, however, as the VTEP IP can be listed multiple times in the neighbour table, each time against a different interface. As the VRF route specifies both the VTEP IP and the bridge device, the kernel knows which neighbour entry to use when routing a given packet.

The commit below gives some sense of what is going on (it's about IPv6, but the general approach applies to pure IPv4 too):

https://github.com/FRRouting/frr/commit/f50dc5e6070383e803dc3441aedd5a435974c762

NOTE: In my tests I could not see the neighbour entries for the remote VTEP IPs in the neighbour table. They must be there, however, as this worked, and several people in the know confirmed that is how it works. So there is some bug, or something not fully explained, as to why the entries were not visible to me.

The bridge device shows the RMAC learnt against port 1:

root@pe1:~# brctl showmacs br1001
port no mac addr        is local?   ageing timer
  1 62:d2:01:db:a1:ed   no      1920.62
  1 62:d2:01:db:a1:ed   no      1116.62
  1 f6:34:1d:eb:36:ee   yes        0.00
  1 f6:34:1d:eb:36:ee   yes        0.00

Port 1 is the VXLAN device, the only member of the bridge:

root@pe1:~# brctl show br1001
bridge name bridge id       STP enabled interfaces
br1001      8000.f6341deb36ee   no      vxlan1001
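
brctl comes from the older bridge-utils package; the same information is available from iproute2, a sketch:

bridge link show dev vxlan1001    # bridge membership of the VXLAN port
bridge fdb show br br1001         # FDB entries, including extern_learn RMACs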

Quite a lot of additional information is present in FRR if queried:

pe1# show interface vxlan1001
Interface vxlan1001 is up, line protocol is up
  Link ups:       0    last: (never)
  Link downs:     0    last: (never)
  vrf: default
  index 10 metric 0 mtu 1450 speed 0 
  flags: <UP,BROADCAST,RUNNING,MULTICAST>
  Type: Ethernet
  HWaddr: f6:34:1d:eb:36:ee
  Interface Type Vxlan
  VxLAN Id 1001 VTEP IP: 1.1.1.1 Access VLAN Id 1

  Master interface: br1001
pe1# 
pe1# show interface vxlan1002
Interface vxlan1002 is up, line protocol is up
  Link ups:       0    last: (never)
  Link downs:     0    last: (never)
  vrf: default
  index 6 metric 0 mtu 1450 speed 0 
  flags: <UP,BROADCAST,RUNNING,MULTICAST>
  Type: Ethernet
  HWaddr: de:3a:88:6b:c7:b3
  Interface Type Vxlan
  VxLAN Id 1002 VTEP IP: 1.1.1.1 Access VLAN Id 1

  Master interface: br1002
pe1# show evpn vni 
VNI        Type VxLAN IF              # MACs   # ARPs   # Remote VTEPs  Tenant VRF                           
1002       L3   vxlan1002             1        1        n/a             blue                                 
1001       L3   vxlan1001             1        1        n/a             red   
pe1# show evpn vni 1001
VNI: 1001
  Type: L3
  Tenant VRF: red
  Local Vtep Ip: 1.1.1.1
  Vxlan-Intf: vxlan1001
  SVI-If: br1001
  State: Up
  VNI Filter: none
  System MAC: f6:34:1d:eb:36:ee
  Router MAC: f6:34:1d:eb:36:ee
  L2 VNIs: 
pe1# 
pe1# 
pe1# show evpn vni 1002
VNI: 1002
  Type: L3
  Tenant VRF: blue
  Local Vtep Ip: 1.1.1.1
  Vxlan-Intf: vxlan1002
  SVI-If: br1002
  State: Up
  VNI Filter: none
  System MAC: de:3a:88:6b:c7:b3
  Router MAC: de:3a:88:6b:c7:b3
  L2 VNIs: 
pe1#
pe1# show bgp l2vpn evpn vni 1001
VNI: 1001 (known to the kernel)
  Type: L3
  Tenant VRF: red
  RD: 1.0.0.1:2
  Originator IP: 1.1.1.1
  Advertise-gw-macip : n/a
  Advertise-svi-macip : n/a
  Advertise-pip: Yes
  System-IP: 1.1.1.1
  System-MAC: f6:34:1d:eb:36:ee
  Router-MAC: f6:34:1d:eb:36:ee
  Import Route Target:
    31984:1001
  Export Route Target:
    31984:1001
pe1# 
pe1# 
pe1# show bgp l2vpn evpn vni 1002
VNI: 1002 (known to the kernel)
  Type: L3
  Tenant VRF: blue
  RD: 2.0.0.1:3
  Originator IP: 1.1.1.1
  Advertise-gw-macip : n/a
  Advertise-svi-macip : n/a
  Advertise-pip: Yes
  System-IP: 1.1.1.1
  System-MAC: de:3a:88:6b:c7:b3
  Router-MAC: de:3a:88:6b:c7:b3
  Import Route Target:
    31984:1002
  Export Route Target:
    31984:1002
pe1# 
pe1# show evpn rmac vni 1001
Number of Remote RMACs known for this VNI: 1
MAC               Remote VTEP          
62:d2:01:db:a1:ed 2.2.2.2    
pe1# show interface br1001 
Interface br1001 is up, line protocol is up
  Link ups:       0    last: (never)
  Link downs:     0    last: (never)
  vrf: red
  index 11 metric 0 mtu 1450 speed 0 
  flags: <UP,BROADCAST,RUNNING,MULTICAST>
  Type: Ethernet
  HWaddr: f6:34:1d:eb:36:ee
  Interface Type Bridge
  Bridge VLAN-aware: no

I tried to understand further what was going on when the EVPN routes were received, and how they were installed into the Linux kernel.

Using "ip monitor all" I could see this:

[NEIGH]1.1.1.1 dev br1001 lladdr f6:34:1d:eb:36:ee NOARP
[NEIGH]dev vxlan1001 lladdr f6:34:1d:eb:36:ee REACHABLE
[NEIGH]dev vxlan1001 lladdr f6:34:1d:eb:36:ee REACHABLE
[NEIGH]??? dev vxlan1001 lladdr f6:34:1d:eb:36:ee REACHABLE
[NEIGH]dev vxlan1001 lladdr f6:34:1d:eb:36:ee REACHABLE
[ROUTE]1.0.0.1 via 1.1.1.1 dev br1001 table red proto bgp metric 20 onlink 

What's interesting to me is the first line, adding a neighbour entry for 1.1.1.1 set to the RMAC from the BGP route. I feel there has to be a way, in the VRF, to resolve that MAC address for the next-hop IP. Once the MAC is known it's looked up in the bridge device in the VRF, which reveals it's reachable via the VXLAN interface, and the packet is built and sent. Without the link from IP address to RMAC I can't quite work out how this would work.
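
One way to probe the kernel's resolution here (assuming an iproute2 new enough to accept the vrf keyword on route get) is to ask it directly:

ip route get 1.0.0.2 vrf red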

"brigde monitor all" showed this:

[NEIGH][NEIGH][NEIGH][NEIGH]f6:34:1d:eb:36:ee dev vxlan1001 extern_learn master br1001 
[NEIGH]f6:34:1d:eb:36:ee dev vxlan1001 vlan 1 extern_learn master br1001 
[NEIGH]f6:34:1d:eb:36:ee dev vxlan1001 dst 1.1.1.1 self extern_learn 
[NEIGH]f6:34:1d:eb:36:ee dev vxlan1001 self

Conclusions

If you need to do quick and dirty L3 VPNs at modest scale, this could be a very attractive option. I'd like to see how it would perform with hardware offload of the encapsulation.

Contact me @toprankinrez on twiteroo

