L3VPN with FRR and EVPN VXLAN on Linux
June 14, 2020
FRRouting is a fully-featured IP routing stack which runs on a variety of Unix-like operating systems.
In this post I will show how to configure FRR on Linux to create L3 VPNs across a common IP underlay network. BGP EVPN is used between end hosts to distribute VRF routing tables and encapsulation information. VXLAN is used on the wire to encapsulate VRF traffic across a common IP network.
BGP EVPN / VXLAN
BGP EVPN signalled networks, with VXLAN transport, have become very popular in datacentre networks and beyond in recent years. The original and primary use-case for such networks is layer-2 virtualization across IP fabrics.
The EVPN standard, however, also includes a pure layer-3 "prefix route" (type 5). This route-type can be used to build L3VPNs similar to those built using the VPN-IPv4 or VPN-IPv6 SAFIs.
One nice feature of using EVPN to signal a network with VXLAN transport is that end devices ("PEs", "VTEPs", or whatever you want to call them) do not need to have any shared state or knowledge of each other beyond the information that EVPN carries. This is unlike MPLS L3VPNs, which generally require a mechanism to signal labels, or to establish GRE tunnels, in addition to the information in BGP.
With EVPN, end systems simply need to be able to send IP traffic to the remote next-hop IPs (VTEPs) encoded in the BGP updates. The underlay routing on end devices could be as simple as a few static routes (although using a dynamic protocol to indicate reachability is better.) The underlay might run an IGP or BGP unicast protocol for that, and can stretch across any combination of such domains. The public internet could also serve as an underlay (MTU issues notwithstanding).
Route Reflectors can be deployed to scale and build out the control plane to suit the hierarchy of the underlay.
Software performance probably won’t be great, but it should be possible to leverage hardware acceleration for VXLAN encap on supporting NICs.
Network Setup
The setup I'll show here is a lot simpler than any of my ramblings above!
3 machines are configured. The two edge machines (PE1/PE2) each have 2 Linux VRFs defined, RED and BLUE. The VRFs simply have dummy/loopback interfaces in them (no externally connected interfaces) just so we can send some test pings.
PE1 and PE2 each have a second IP configured on their loopback interface (1.1.1.1 and 2.2.2.2 respectively). This is the IP that we will bind VXLAN devices to and use on each device as VTEP source.
Each has a static route to the loopback of the other, via an intermediary VM - "CORE1". That VM just has two static routes configured, one pointing at each PE loopback.
NOTE: I've built this all with IPv4 but it should work equally well with IPv6. EVPN can carry either type of prefix in a type-5 route.
NOTE 2: My terminology isn’t exact, the term PE is more associated with MPLS networks than anything else. The devices I’ve called “PEs” are basically the edge devices with all the tenant VRFs defined. In a data centre Clos network they would be LEAF switches and the “Core” machine would be SPINEs.
Basic Linux Network Config
All devices have forwarding enabled, in /etc/sysctl.conf, something like:
net.ipv4.ip_forward=1
As I've only used loopback interfaces within the VRFs, in theory this doesn't need to be set on the PEs. But in a real-world scenario, where you will have access interfaces belonging to particular VRFs on the PEs, it should be there. Also note this was just a quick test, so I only configured IPv4 routing; IPv6 should work just as well, and EVPN can carry such routes in a type-5 update.
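To apply the change without a reboot, the same setting can also be pushed at runtime (the /etc/sysctl.conf entry is what makes it persistent):
sudo sysctl -w net.ipv4.ip_forward=1
# or simply re-read the file:
sudo sysctl -p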
All machines are running Ubuntu 18.04. All have been updated to Linux kernel version 5.3. In addition all have had netplan removed and ifupdown2 installed in its place. Ifupdown2 supports Linux VRF and some other nice things and generally I'm a fan. Installing is very easy, as simple as:
sudo apt install ifupdown2
sudo apt remove netplan.io libnetplan0
CORE Device Config
In addition to having forwarding enabled the CORE machine has very little going on, the interfaces are configured and static routes for PE loopbacks are defined:
root@core:/etc/network/interfaces.d# cat ens192.cfg
allow-hotplug ens192
auto ens192
iface ens192 inet static
address 198.18.3.1/24
post-up ip route add 1.1.1.1/32 via 198.18.3.2
root@core:/etc/network/interfaces.d# cat ens224.cfg
allow-hotplug ens224
auto ens224
iface ens224 inet static
address 198.18.4.1/24
post-up ip route add 2.2.2.2/32 via 198.18.4.2
PE Reachability Config
On "PE" devices the configuration of interfaces is fairly straightforward. Interfaces lo and ens224 are configured with the appropriate IPs, and a static route is added for the loopback of the remote PE:
## PE1
auto lo
iface lo inet loopback
address 1.1.1.1/32
allow-hotplug ens224
auto ens224
iface ens224 inet static
address 198.18.3.2/24
post-up ip route add 2.2.2.2/32 via 198.18.3.1
## PE2
auto lo
iface lo inet loopback
address 2.2.2.2/32
allow-hotplug ens224
auto ens224
iface ens224 inet static
address 198.18.4.2/24
post-up ip route add 1.1.1.1/32 via 198.18.4.1
With that done we can now ping between the two loopback interfaces:
root@pe1:~# ping -I 1.1.1.1 2.2.2.2
PING 2.2.2.2 (2.2.2.2) from 1.1.1.1 : 56(84) bytes of data.
64 bytes from 2.2.2.2: icmp_seq=1 ttl=63 time=0.823 ms
64 bytes from 2.2.2.2: icmp_seq=2 ttl=63 time=1.37 ms
64 bytes from 2.2.2.2: icmp_seq=3 ttl=63 time=1.35 ms
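An optional sanity check at this point: the VXLAN interfaces configured later use an MTU of 1450, which assumes the underlay path can carry full 1500-byte packets (VXLAN encapsulation adds roughly 50 bytes). A non-fragmenting, full-size ping between the loopbacks is a rough way to confirm that; this is a suggestion rather than something from the original test:
ping -I 1.1.1.1 -M do -s 1472 2.2.2.2    # 1472 + 8 (ICMP) + 20 (IP) = 1500 bytes on the wire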
PE VRF Config
Ok so now the two "PE" VMs can send traffic to each other. What we want to do is to create isolated Linux VRFs on both machines, and exchange routes between PEs using a single BGP EVPN session.
EVPN type-5 routes will be announced for all routes in each VRF. These routes use a route distinguisher to ensure uniqueness within the EVPN RIB, and route-targets to control VRF import/export. Additionally each route carries a VNID, specifying the identifier used for VXLAN encapsulation, and a "Router MAC", which defines the destination MAC address to use in the Ethernet header of the encapsulated frame (VXLAN can only carry Ethernet, so all encapsulated IP packets need a MAC header inside the UDP VXLAN transport).
We need to create several interfaces to make this work. Firstly we will create a local VRF device (l3mdev) for each VRF. We will create a loopback/dummy interface and place it in the VRF as our test interface.
Additionally, to support VXLAN transport, we will create a VXLAN device for each VRF. We allocate this interface a VNID when it is created, and bind it to the local loopback IP in the global table (i.e. 1.1.1.1/2.2.2.2). We also create a Linux bridge device, to which we enslave the VXLAN interface, and which we place into the VRF.
The bridge device supplies the MAC address that will be used as the Router-MAC in the EVPN type-5 prefix announcements. It also stores the MAC addresses of remote device(s) RMACs announced in EVPN routes in its FDB table.
That's all a little complicated to get layer-3 working, and for the most part represents the origins of VXLAN as a purely layer-2 overlay technique. We need the layer-2 devices and information because VXLAN packets can only carry Ethernet frames. So we need to build a MAC header for encapsulated VRF traffic, even if that traffic comes from a non-Ethernet device like the dummy loopback we created.
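For reference, a rough iproute2 equivalent of what the ifupdown2 configuration below sets up for VRF red on PE1 would look something like the following. This is a sketch only, not the method actually used in this post:
# VRF device bound to kernel routing table 1001
ip link add red type vrf table 1001
ip link set red up
# L3VNI VXLAN device, sourced from the global-table loopback IP
ip link add vxlan1001 type vxlan id 1001 local 1.1.1.1 dstport 4789 nolearning
ip link set vxlan1001 mtu 1450 up
# Bridge that supplies the router MAC; enslave the VXLAN port and put the bridge in the VRF
ip link add br1001 type bridge
ip link set vxlan1001 master br1001
ip link set br1001 mtu 1450 master red up
# Dummy loopback in the VRF for testing
ip link add lo1001 type dummy
ip link set lo1001 master red up
ip addr add 1.0.0.1/32 dev lo1001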
All in all the VRF config ends up looking like below. I chose to make a single file in /etc/network/interfaces.d/ for each VRF, and put the definitions for all the required devices into that file:
# PE1 - VRF RED
root@pe1:/etc/network/interfaces.d# cat vrf_red.cfg
auto red
iface red
address 127.0.0.1/8
vrf-table 1001
auto vxlan1001
iface vxlan1001
vxlan-id 1001
vxlan-local-tunnelip 1.1.1.1
bridge-learning off
bridge-arp-nd-suppress on
mtu 1450
auto br1001
iface br1001
vrf red
bridge_ports vxlan1001
bridge_stp off
bridge_fd 0
mtu 1450
auto lo1001
iface lo1001 inet loopback
vrf red
address 1.0.0.1/32
pre-up ip link add name lo1001 type dummy
# PE1 - VRF BLUE
auto blue
iface blue
address 127.0.0.1/8
vrf-table 1002
auto vxlan1002
iface vxlan1002
vxlan-id 1002
vxlan-local-tunnelip 1.1.1.1
bridge-learning off
bridge-arp-nd-suppress on
mtu 1450
auto br1002
iface br1002
vrf blue
bridge_ports vxlan1002
bridge_stp off
bridge_fd 0
mtu 1450
auto lo1002
iface lo1002 inet loopback
vrf blue
address 2.0.0.1/32
pre-up ip link add name lo1002 type dummy
# PE2 - VRF RED
auto red
iface red
address 127.0.0.1/8
vrf-table 1001
auto vxlan1001
iface vxlan1001
vxlan-id 1001
vxlan-local-tunnelip 2.2.2.2
bridge-learning off
bridge-arp-nd-suppress on
mtu 1450
auto br1001
iface br1001
vrf red
bridge_ports vxlan1001
bridge_stp off
bridge_fd 0
mtu 1450
auto lo1001
iface lo1001 inet loopback
vrf red
address 1.0.0.2/32
pre-up ip link add name lo1001 type dummy
# PE2 - VRF BLUE
auto blue
iface blue
address 127.0.0.1/8
vrf-table 1002
auto vxlan1002
iface vxlan1002
vxlan-id 1002
vxlan-local-tunnelip 2.2.2.2
bridge-learning off
bridge-arp-nd-suppress on
mtu 1450
auto br1002
iface br1002
vrf blue
bridge_ports vxlan1002
bridge_stp off
bridge_fd 0
mtu 1450
auto lo1002
iface lo1002 inet loopback
vrf blue
address 2.0.0.2/32
pre-up ip link add name lo1002 type dummy
I've chosen the table numbers for each VRF based on what the system normally auto-creates if you let it do that (i.e. first one is 1001). I used the same number in the loopback, bridge and vxlan device names, to keep things consistent.
With this configured, one can see (for instance on PE1) that the VRFs exist, have routes, and that we can ping the lo100x loopback IPs:
root@pe1:~# ip vrf list
Name Table
-----------------------
blue 1002
red 1001
root@pe1:~# ip route show vrf red
127.0.0.0/8 dev red proto kernel scope link src 127.0.0.1
root@pe1:~#
root@pe1:~# ip route show table 1001
local 1.0.0.1 dev lo1001 proto kernel scope host src 1.0.0.1
broadcast 127.0.0.0 dev red proto kernel scope link src 127.0.0.1
127.0.0.0/8 dev red proto kernel scope link src 127.0.0.1
local 127.0.0.1 dev red proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev red proto kernel scope link src 127.0.0.1
root@pe1:~#
root@pe1:~# ip vrf exec red ping 1.0.0.1
PING 1.0.0.1 (1.0.0.1) 56(84) bytes of data.
64 bytes from 1.0.0.1: icmp_seq=1 ttl=64 time=0.015 ms
64 bytes from 1.0.0.1: icmp_seq=2 ttl=64 time=0.077 ms
Interestingly, as can be seen, the local interface IPs, such as 1.0.0.1, only appear when "ip route show table X" is issued; "ip route show vrf red" only shows the other routes, not the local/connected entries. I believe this may be down to how VRF works on Linux, and the fact that the l3mdev device only affects routing and forwarding decisions.
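The addresses are still there; they show up if you list the interfaces enslaved to the VRF device, for example:
ip link show master red
ip addr show vrf red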
FRRouting
All well and good so far. But our VRFs on PE1 and PE2 are effectively isolated islands with no connectivity - how do we learn about remote networks and populate VRF routing tables with the routes at the remote side? Enter FRR.
FRR Basics
FRR was installed from apt. Once installed, a few things are needed to get up and running. I've tended to use the separate per-daemon config files rather than a single integrated frr.conf, although I'm not quite sure why I started this way. To enable that, we delete /etc/frr/frr.conf and add this line to /etc/frr/vtysh.conf:
no service integrated-vtysh-config
In /etc/frr/daemons there is one change to be made - enable bgpd by setting it to 'yes':
bgpd=yes
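For convenience, the same preparation can be scripted, something along these lines (assuming the stock Ubuntu package layout under /etc/frr/):
# Use per-daemon config files instead of an integrated frr.conf
sudo rm -f /etc/frr/frr.conf
echo "no service integrated-vtysh-config" | sudo tee -a /etc/frr/vtysh.conf
# Turn on bgpd
sudo sed -i 's/^bgpd=no/bgpd=yes/' /etc/frr/daemons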
That should be enough to get going; we can then enable and restart FRR if that hasn't been done already:
systemctl enable frr
systemctl restart frr
We can then run the "vtysh" command to get to the FRR CLI:
root@pe1:~# vtysh
Hello, this is FRRouting (version 7.3).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
pe1#
pe1#
FRR VRF / IPv4 Unicast BGP config
Ok, so the first thing we want to do is bind the VRF tables in FRR to the VNIs of the VXLAN devices we created earlier. For instance, the VXLAN device we created for VRF red on both PEs was assigned VNI 1001. We add this configuration (the same on both devices):
vrf red
vni 1001
exit-vrf
!
vrf blue
vni 1002
exit-vrf
!
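Once that is in place, the VRF-to-VNI bindings can be checked from vtysh; zebra should list both mappings (output omitted here):
show vrf vni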
We need to enable BGP unicast in every VRF, which creates separate BGP unicast tables to hold the routes from each. Ultimately we will export all these routes to EVPN, adding route-distinguishers, targets and other attributes/communities as we do so. We use the same ASN (in this example private AS 65000) in the configuration for each VRF:
router bgp 65000 vrf blue
address-family ipv4 unicast
redistribute connected
exit-address-family
!
router bgp 65000 vrf red
address-family ipv4 unicast
redistribute connected
exit-address-family
!
Notice I also issued "redistribute connected" to redistribute local networks into BGP (in this case that just means the lo100x interface subnets, as the VRFs have no other interfaces).
Once configured local routes should be present in the per-vrf IPv4 unicast tables:
pe1# show bgp vrf red ipv4 unicast
BGP table version is 1, local router ID is 1.0.0.1, vrf id 12
Default local pref 100, local AS 65000
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> 1.0.0.1/32 0.0.0.0 0 32768 ?
Displayed 1 routes and 1 total paths
pe1# show bgp vrf blue ipv4 unicast
BGP table version is 1, local router ID is 2.0.0.1, vrf id 8
Default local pref 100, local AS 65000
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> 2.0.0.1/32 0.0.0.0 0 32768 ?
Displayed 1 routes and 1 total paths
pe1#
EVPN Config
So far we just have local routes. We need to create a BGP peering between the two PE devices, and enable it for the EVPN SAFI:
# PE1
router bgp 65000
neighbor 2.2.2.2 remote-as 65000
neighbor 2.2.2.2 update-source 1.1.1.1
!
address-family ipv4 unicast
no neighbor 2.2.2.2 activate
exit-address-family
!
address-family l2vpn evpn
neighbor 2.2.2.2 activate
advertise-all-vni
advertise ipv4 unicast
exit-address-family
!
# PE2
router bgp 65000
neighbor 1.1.1.1 remote-as 65000
neighbor 1.1.1.1 update-source 2.2.2.2
!
address-family ipv4 unicast
no neighbor 1.1.1.1 activate
exit-address-family
!
address-family l2vpn evpn
neighbor 1.1.1.1 activate
advertise-all-vni
advertise ipv4 unicast
exit-address-family
!
Note that, as can be seen, I have disabled the default IPv4 unicast address family for this peering; we only want to exchange EVPN routes. Also note the peering is a multihop iBGP session between the loopbacks of the PE devices.
Finally we need to configure each VRF to export their unicast BGP routes to the global EVPN table, so that they can be announced over EVPN peering between PEs:
router bgp 65000 vrf blue
address-family l2vpn evpn
advertise ipv4 unicast
exit-address-family
!
router bgp 65000 vrf red
address-family l2vpn evpn
advertise ipv4 unicast
exit-address-family
!
Astute observers may note that I have not manually configured route-distinguishers or route-target imports/exports for the VRFs. These can be configured manually, but if left unspecified FRR will auto-derive them, which works just fine, so there is no need to set them unless you have a reason to.
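If you did want to pin these values rather than rely on the auto-derived ones, FRR accepts them under the per-VRF l2vpn evpn address-family, along these lines (a hedged example with made-up values, not part of the configuration used here):
router bgp 65000 vrf red
 address-family l2vpn evpn
  rd 1.1.1.1:1001
  route-target import 65000:1001
  route-target export 65000:1001
 exit-address-family
!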
Test session
Once the above has been configured we should see the BGP session come up:
pe2# show bgp l2vpn evpn summary
BGP router identifier 2.2.2.2, local AS number 65000 vrf-id 0
BGP table version 0
RIB entries 7, using 1288 bytes of memory
Peers 1, using 20 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
1.1.1.1 4 65000 213 211 0 0 0 00:00:09 2
Total number of neighbors 1
Given that there are 2 VRFs configured on each of the 2 PEs, and each VRF has a single loopback interface redistributed into BGP, we end up with 4 loopback interfaces and thus 4 EVPN routes:
# PE1
pe1# show bgp l2vpn evpn
BGP table version is 1, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 1.0.0.1:2
*> [5]:[0]:[32]:[1.0.0.1]
1.1.1.1 0 32768 ?
ET:8 RT:65000:1001 Rmac:f6:34:1d:eb:36:ee
Route Distinguisher: 1.0.0.2:2
*>i[5]:[0]:[32]:[1.0.0.2]
2.2.2.2 0 100 0 ?
RT:65000:1001 ET:8 Rmac:62:d2:01:db:a1:ed
Route Distinguisher: 2.0.0.1:3
*> [5]:[0]:[32]:[2.0.0.1]
1.1.1.1 0 32768 ?
ET:8 RT:65000:1002 Rmac:de:3a:88:6b:c7:b3
Route Distinguisher: 2.0.0.2:3
*>i[5]:[0]:[32]:[2.0.0.2]
2.2.2.2 0 100 0 ?
RT:65000:1002 ET:8 Rmac:be:99:d1:fd:b3:a4
Displayed 4 out of 4 total prefixes
pe1# show bgp l2vpn evpn 1.0.0.2
BGP routing table entry for 1.0.0.2:2:[5]:[0]:[32]:[1.0.0.2]
Paths: (1 available, best #1)
Not advertised to any peer
Route [5]:[0]:[32]:[1.0.0.2] VNI 1001
Local
2.2.2.2 from 2.2.2.2 (2.2.2.2)
Origin incomplete, metric 0, localpref 100, valid, internal, best (First path received)
Extended Community: RT:65000:1001 ET:8 Rmac:62:d2:01:db:a1:ed
Last update: Mon Jun 15 20:33:23 2020
pe1# show bgp l2vpn evpn 2.0.0.2
BGP routing table entry for 2.0.0.2:3:[5]:[0]:[32]:[2.0.0.2]
Paths: (1 available, best #1)
Not advertised to any peer
Route [5]:[0]:[32]:[2.0.0.2] VNI 1002
Local
2.2.2.2 from 2.2.2.2 (2.2.2.2)
Origin incomplete, metric 0, localpref 100, valid, internal, best (First path received)
Extended Community: RT:65000:1002 ET:8 Rmac:be:99:d1:fd:b3:a4
Last update: Mon Jun 15 20:33:24 2020
# PE2
pe2# show bgp l2vpn evpn
BGP table version is 7, local router ID is 2.2.2.2
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 1.0.0.1:2
*>i[5]:[0]:[32]:[1.0.0.1]
1.1.1.1 0 100 0 ?
RT:65000:1001 ET:8 Rmac:f6:34:1d:eb:36:ee
Route Distinguisher: 1.0.0.2:2
*> [5]:[0]:[32]:[1.0.0.2]
2.2.2.2 0 32768 ?
ET:8 RT:65000:1001 Rmac:62:d2:01:db:a1:ed
Route Distinguisher: 2.0.0.1:3
*>i[5]:[0]:[32]:[2.0.0.1]
1.1.1.1 0 100 0 ?
RT:65000:1002 ET:8 Rmac:de:3a:88:6b:c7:b3
Route Distinguisher: 2.0.0.2:3
*> [5]:[0]:[32]:[2.0.0.2]
2.2.2.2 0 32768 ?
ET:8 RT:65000:1002 Rmac:be:99:d1:fd:b3:a4
Displayed 4 out of 4 total prefixes
pe2# show bgp l2vpn evpn 1.0.0.1
BGP routing table entry for 1.0.0.1:2:[5]:[0]:[32]:[1.0.0.1]
Paths: (1 available, best #1)
Not advertised to any peer
Route [5]:[0]:[32]:[1.0.0.1] VNI 1001
Local
1.1.1.1 from 1.1.1.1 (1.1.1.1)
Origin incomplete, metric 0, localpref 100, valid, internal, best (First path received)
Extended Community: RT:65000:1001 ET:8 Rmac:f6:34:1d:eb:36:ee
Last update: Mon Jun 15 20:33:24 2020
pe2# show bgp l2vpn evpn 2.0.0.1
BGP routing table entry for 2.0.0.1:3:[5]:[0]:[32]:[2.0.0.1]
Paths: (1 available, best #1)
Not advertised to any peer
Route [5]:[0]:[32]:[2.0.0.1] VNI 1002
Local
1.1.1.1 from 1.1.1.1 (1.1.1.1)
Origin incomplete, metric 0, localpref 100, valid, internal, best (First path received)
Extended Community: RT:65000:1002 ET:8 Rmac:de:3a:88:6b:c7:b3
Last update: Mon Jun 15 20:20:13 2020
In turn, based on the route-targets attached to the routes, FRR will import them into the appropriate local VRF BGP unicast tables:
pe1# show bgp vrf blue ipv4 unicast
BGP table version is 2, local router ID is 2.0.0.1, vrf id 8
Default local pref 100, local AS 65000
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> 2.0.0.1/32 0.0.0.0 0 32768 ?
*>i2.0.0.2/32 2.2.2.2 0 100 0 ?
Displayed 2 routes and 2 total paths
pe1# show bgp vrf red ipv4 unicast
BGP table version is 2, local router ID is 1.0.0.1, vrf id 12
Default local pref 100, local AS 65000
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> 1.0.0.1/32 0.0.0.0 0 32768 ?
*>i1.0.0.2/32 2.2.2.2 0 100 0 ?
Displayed 2 routes and 2 total paths
Testing
Leaving the FRR CLI (do a 'write' to save config!) we should see the routes imported into the kernel VRF tables:
root@pe1:~# ip route show vrf red
1.0.0.2 via 2.2.2.2 dev br1001 proto bgp metric 20 onlink
127.0.0.0/8 dev red proto kernel scope link src 127.0.0.1
root@pe1:~# ip route show vrf blue
2.0.0.2 via 2.2.2.2 dev br1002 proto bgp metric 20 onlink
127.0.0.0/8 dev blue proto kernel scope link src 127.0.0.1
And voila, if I issue a ping from the local loopback in the 'red' VRF towards the remote one it works!
root@pe1:~# ip vrf exec red ping -I 1.0.0.1 1.0.0.2
PING 1.0.0.2 (1.0.0.2) from 1.0.0.1 : 56(84) bytes of data.
64 bytes from 1.0.0.2: icmp_seq=1 ttl=64 time=0.564 ms
64 bytes from 1.0.0.2: icmp_seq=2 ttl=64 time=1.46 ms
64 bytes from 1.0.0.2: icmp_seq=3 ttl=64 time=1.38 ms
If I look at a packet capture, taken on the intermediate 'core' machine, I can see the traffic is encapsulated with VXLAN and uses the identifiers we expect.
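The capture itself isn't reproduced here, but a filter like the one below, run on the core machine, is enough to see the outer UDP/4789 VXLAN header and the VNIs (assuming the ens192 interface name):
tcpdump -ni ens192 -vv udp port 4789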
Speed Test
A very quick iperf3 test, on my relatively modest machine with 1 core per VM and no tuning whatsoever, did just over 400Mbit/second with a 512-byte write size.
root@pe1:~# ip vrf exec blue iperf3 -Z -l 512 -B 2.0.0.1 -c 2.0.0.2 -i 4 -t 20
Connecting to host 2.0.0.2, port 5201
[ 5] local 2.0.0.1 port 33503 connected to 2.0.0.2 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 5] 0.00-4.00 sec 183 MBytes 385 Mbits/sec 74 1.33 MBytes
[ 5] 4.00-8.00 sec 197 MBytes 414 Mbits/sec 2 1.14 MBytes
[ 5] 8.00-12.00 sec 196 MBytes 411 Mbits/sec 0 1.19 MBytes
[ 5] 12.00-16.00 sec 196 MBytes 411 Mbits/sec 0 1.21 MBytes
[ 5] 16.00-20.00 sec 198 MBytes 415 Mbits/sec 0 1.40 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 5] 0.00-20.00 sec 971 MBytes 407 Mbits/sec 76 sender
[ 5] 0.00-20.00 sec 969 MBytes 406 Mbits/sec receiver
Investigation showed that CPU on the 'Core1' machine in the middle, which is doing no VXLAN encapsulation and just forwarding between interfaces, hit 100% during this test. As the machines all had the same spec I took that as a good sign that the encapsulation wasn't a complete performance killer. CPU usage on PE1, generating the traffic, was also 100%; CPU on PE2, receiving the traffic, was approximately 70%.
Observations
A few notes on what happens:
The RMACs on type-5 routes in EVPN come from the MAC address assigned to the bridge device in each VRF.
The MAC address of that same bridge device is used as the source MAC in the encapsulated Ethernet frame on the wire.
Locally, when an EVPN type-5 route is learnt by FRR, Zebra creates an entry with the prefix's RMAC in the FDB table of vxlan100x:
root@pe1:~# bridge fdb show dev vxlan1001
62:d2:01:db:a1:ed vlan 1 extern_learn master br1001
62:d2:01:db:a1:ed extern_learn master br1001
f6:34:1d:eb:36:ee vlan 1 master br1001 permanent
f6:34:1d:eb:36:ee master br1001 permanent
62:d2:01:db:a1:ed dst 2.2.2.2 self extern_learn
In the VRF routing table a route like this is inserted:
root@pe1:~# ip route show vrf red
1.0.0.2 via 2.2.2.2 dev br1001 proto bgp metric 20 onlink
The key thing in the above route is that it is "onlink", which tells the kernel to accept the route even though the next-hop is not on any directly connected network. The fact that the route is via an Ethernet bridge causes the kernel to build a layer-2 frame for any packet using the VRF route entry; the source MAC address will be the bridge's own MAC address.
The 'onlink' flag, in an Ethernet context anyway, tells the kernel to use the layer-2 address listed against the next-hop address, on the given interface, when creating the layer-2 frame (as opposed to trying to ARP for the destination IP). Zebra inserts a static ARP/neighbour entry in the kernel's ARP table for the remote VTEP IP (the BGP next-hop), listed against the bridge associated with the L3VNI. The MAC address it lists for the VTEP is the RMAC from the EVPN-learnt route.
Typically a single VTEP will use a different RMAC per VRF. This doesn't cause a conflict, however, as the VTEP IP can be listed multiple times in the ARP table, each time against a different interface. Because the VRF route specifies both the VTEP IP and the bridge device, the kernel knows which ARP entry for the VTEP IP to use when routing a given packet.
The below gives some sense of what is going on (it's about IPv6 but the general approach is used for pure IPv4 too):
https://github.com/FRRouting/frr/commit/f50dc5e6070383e803dc3441aedd5a435974c762
NOTE: In my tests I could not see the ARP/neighbour entries for the remote VTEP IPs in the neighbour table. They must be there, however, as this worked, and several people in the know confirmed that is how it works. So there is some bug, or something not fully explained, as to why the ARP/neighbour entries were not visible to me.
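For anyone who wants to look for those entries themselves, the commands below are where I would expect them to show up (they did not appear for me, as noted above):
ip neigh show dev br1001
ip neigh show vrf red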
The bridge device shows the RMAC learnt against port-1:
root@pe1:~# brctl showmacs br1001
port no mac addr is local? ageing timer
1 62:d2:01:db:a1:ed no 1920.62
1 62:d2:01:db:a1:ed no 1116.62
1 f6:34:1d:eb:36:ee yes 0.00
1 f6:34:1d:eb:36:ee yes 0.00
Port 1 is the VXLAN device, the only member of the bridge:
root@pe1:~# brctl show br1001
bridge name bridge id STP enabled interfaces
br1001 8000.f6341deb36ee no vxlan1001
Quite a lot of additional information is present in FRR if queried:
pe1# show interface vxlan1001
Interface vxlan1001 is up, line protocol is up
Link ups: 0 last: (never)
Link downs: 0 last: (never)
vrf: default
index 10 metric 0 mtu 1450 speed 0
flags: <UP,BROADCAST,RUNNING,MULTICAST>
Type: Ethernet
HWaddr: f6:34:1d:eb:36:ee
Interface Type Vxlan
VxLAN Id 1001 VTEP IP: 1.1.1.1 Access VLAN Id 1
Master interface: br1001
pe1#
pe1# show interface vxlan1002
Interface vxlan1002 is up, line protocol is up
Link ups: 0 last: (never)
Link downs: 0 last: (never)
vrf: default
index 6 metric 0 mtu 1450 speed 0
flags: <UP,BROADCAST,RUNNING,MULTICAST>
Type: Ethernet
HWaddr: de:3a:88:6b:c7:b3
Interface Type Vxlan
VxLAN Id 1002 VTEP IP: 1.1.1.1 Access VLAN Id 1
Master interface: br1002
pe1# show evpn vni
VNI Type VxLAN IF # MACs # ARPs # Remote VTEPs Tenant VRF
1002 L3 vxlan1002 1 1 n/a blue
1001 L3 vxlan1001 1 1 n/a red
pe1# show evpn vni 1001
VNI: 1001
Type: L3
Tenant VRF: red
Local Vtep Ip: 1.1.1.1
Vxlan-Intf: vxlan1001
SVI-If: br1001
State: Up
VNI Filter: none
System MAC: f6:34:1d:eb:36:ee
Router MAC: f6:34:1d:eb:36:ee
L2 VNIs:
pe1#
pe1#
pe1# show evpn vni 1002
VNI: 1002
Type: L3
Tenant VRF: blue
Local Vtep Ip: 1.1.1.1
Vxlan-Intf: vxlan1002
SVI-If: br1002
State: Up
VNI Filter: none
System MAC: de:3a:88:6b:c7:b3
Router MAC: de:3a:88:6b:c7:b3
L2 VNIs:
pe1#
pe1# show bgp l2vpn evpn vni 1001
VNI: 1001 (known to the kernel)
Type: L3
Tenant VRF: red
RD: 1.0.0.1:2
Originator IP: 1.1.1.1
Advertise-gw-macip : n/a
Advertise-svi-macip : n/a
Advertise-pip: Yes
System-IP: 1.1.1.1
System-MAC: f6:34:1d:eb:36:ee
Router-MAC: f6:34:1d:eb:36:ee
Import Route Target:
31984:1001
Export Route Target:
31984:1001
pe1#
pe1#
pe1# show bgp l2vpn evpn vni 1002
VNI: 1002 (known to the kernel)
Type: L3
Tenant VRF: blue
RD: 2.0.0.1:3
Originator IP: 1.1.1.1
Advertise-gw-macip : n/a
Advertise-svi-macip : n/a
Advertise-pip: Yes
System-IP: 1.1.1.1
System-MAC: de:3a:88:6b:c7:b3
Router-MAC: de:3a:88:6b:c7:b3
Import Route Target:
31984:1002
Export Route Target:
31984:1002
pe1#
pe1# show evpn rmac vni 1001
Number of Remote RMACs known for this VNI: 1
MAC Remote VTEP
62:d2:01:db:a1:ed 2.2.2.2
pe1# show interface br1001
Interface br1001 is up, line protocol is up
Link ups: 0 last: (never)
Link downs: 0 last: (never)
vrf: red
index 11 metric 0 mtu 1450 speed 0
flags: <UP,BROADCAST,RUNNING,MULTICAST>
Type: Ethernet
HWaddr: f6:34:1d:eb:36:ee
Interface Type Bridge
Bridge VLAN-aware: no
I tried to understand further what was going on when the EVPN routes were received, and how they were installed into the Linux kernel.
Using "ip monitor all" I could see this:
[NEIGH]1.1.1.1 dev br1001 lladdr f6:34:1d:eb:36:ee NOARP
[NEIGH]dev vxlan1001 lladdr f6:34:1d:eb:36:ee REACHABLE
[NEIGH]dev vxlan1001 lladdr f6:34:1d:eb:36:ee REACHABLE
[NEIGH]??? dev vxlan1001 lladdr f6:34:1d:eb:36:ee REACHABLE
[NEIGH]dev vxlan1001 lladdr f6:34:1d:eb:36:ee REACHABLE
[ROUTE]1.0.0.1 via 1.1.1.1 dev br1001 table red proto bgp metric 20 onlink
What's interesting to me is the first line: a neighbour entry is added for 1.1.1.1, set to the RMAC from the BGP route. There has to be a way, in the VRF, to resolve that MAC address for the next-hop IP. Once the MAC is known it's looked up in the bridge device in the VRF, which reveals it's reachable via the VXLAN interface, and the packet is built and sent. Without the link from IP address to RMAC I can't quite work out how this is working.
"brigde monitor all" showed this:
[NEIGH][NEIGH][NEIGH][NEIGH]f6:34:1d:eb:36:ee dev vxlan1001 extern_learn master br1001
[NEIGH]f6:34:1d:eb:36:ee dev vxlan1001 vlan 1 extern_learn master br1001
[NEIGH]f6:34:1d:eb:36:ee dev vxlan1001 dst 1.1.1.1 self extern_learn
[NEIGH]f6:34:1d:eb:36:ee dev vxlan1001 self
Conclusions
If you need to do quick and dirty VRFs at modest scale this could be a very attractive option. I'd like to see how it would perform with hardware offload of the encapsulation.
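Whether a particular NIC advertises VXLAN/UDP tunnel offloads should be visible in its feature flags; as a rough first check (exact feature names vary by driver):
ethtool -k ens224 | grep -i udp_tnl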
Contact me @toprankinrez on twiteroo