
Solve network fragmentation with MTU

When implementing OpenStack workloads, a common problem is fragmentation throughout the network, causing unforeseen performance issues. Fragmentation is normally difficult to address because networks can get complex, so the path of packets can be hard to trace or predict.

OpenStack initiates the network interface card (NIC) configuration during the initial setup of the cluster or when new nodes are added. The Maximum Transmission Unit (MTU) configuration is also generated at this stage. Changing the configuration after the cluster is deployed is not recommended. Normally, the system integrator expects that the end-to-end path is properly configured before deploying and configuring the network for the stack, to avoid constant MTU changes just for testing.
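As a quick sanity check after deployment, you can list the MTU that every interface on a node actually ended up with. This is a generic iproute2 one-liner, not an OpenStack-specific command, and interface names vary per deployment:

[compute2]$ ip -o link show | awk '{print $2, $4, $5}'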

Neutron networks are created after OSP is deployed. This allows administrators to create 1500 MTU networks for the instances. However, the compute node itself is still set to the larger MTU, so fragmentation may still occur. In telco workloads, for example, the most common MTU value for all instances is 9000, so it is easy to inadvertently cause fragmentation after networks and instances have been created.
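If you want to confirm what MTU Neutron has recorded for a given network, you can query it directly; the network name below is just an example:

(overcloud)[director]$ openstack network show provider-net -c mtu -f value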

Jumbo frames

Here’s an example of an instance (deployed in OSP 16.1.5) configured with jumbo frames (8996), but you can see that the network path doesn’t actually have jumbo frames configured. This causes fragmentation because system packets use 8996 as the MTU.

$ ping 10.169.252.1 -M do -s 8968
PING 10.169.252.1 (10.169.252.1) 8968(8996) bytes of data.

--- 10.169.252.1 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 5999ms

This shows 100% packet loss when no fragmentation is allowed. The output effectively identifies the issue and reveals a problem with the MTU in the network path. If you allow fragmentation, you can see there’s a successful ping.

$ ping 10.169.252.1 -M dont -s 8968
PING 10.169.252.1 (10.169.252.1) 8968(8996) bytes of data.
8976 bytes from 10.169.252.1: icmp_seq=1 ttl=255 time=3.66 ms
8976 bytes from 10.169.252.1: icmp_seq=2 ttl=255 time=2.94 ms
8976 bytes from 10.169.252.1: icmp_seq=3 ttl=255 time=2.88 ms
8976 bytes from 10.169.252.1: icmp_seq=4 ttl=255 time=2.56 ms
8976 bytes from 10.169.252.1: icmp_seq=5 ttl=255 time=2.91 ms

--- 10.169.252.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 2.561/2.992/3.663/0.368 ms
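Because the ping succeeds when fragmentation is allowed but fails with the DF bit set, the bottleneck is somewhere along the path. To locate roughly where the MTU drops, tracepath (from the iputils package) reports the path MTU hop by hop:

$ tracepath -n 10.169.252.1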

Having confirmed the issue, you might need to wait until the network team resolves the problem. In the meantime, fragmentation exists and impacts your system. You shouldn’t update the stack just to check whether the issue has been fixed, so in this article, I share one safe way to lower the end-to-end MTU inside the compute node.
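Before touching anything, it can also help to gauge how large a payload actually fits end to end. This is a minimal bash sweep reusing the gateway address and payload sizes from the example above:

[localhost]$ for s in 1472 4424 8968; do ping -c1 -W1 -M do -s $s 10.169.252.1 >/dev/null 2>&1 && echo "payload $s fits" || echo "payload $s blocked"; done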

Adjusting the MTU

Step 1: Identify the hypervisor your instance is running on

First, you need to obtain information about your instance. Do this from the Overcloud using the openstack command:

(overcloud)[director]$ openstack server show 2795221e-f0f7-4518-a5c5-85977357eeec -f json
{
  "OS-DCF:diskConfig": "MANUAL",
  "OS-EXT-AZ:availability_zone": "srvrhpb510-compute-2",
  "OS-EXT-SRV-ATTR:host": "srvrhpb510-compute-2.localdomain",
  "OS-EXT-SRV-ATTR:hostname": "server-2",
  "OS-EXT-SRV-ATTR:hypervisor_hostname": "srvrhpb510-compute-2.localdomain",
  "OS-EXT-SRV-ATTR:instance_name": "instance-00000248",
  "OS-EXT-SRV-ATTR:kernel_id": "",
  "OS-EXT-SRV-ATTR:launch_index": 0,
  "OS-EXT-SRV-ATTR:ramdisk_id": "",
  "OS-EXT-SRV-ATTR:reservation_id": "r-ms2ep00g",
  "OS-EXT-SRV-ATTR:root_device_name": "/dev/vda",
  "OS-EXT-SRV-ATTR:user_data": null,
  "OS-EXT-STS:power_state": "Running",
  "OS-EXT-STS:task_state": null,
  "OS-EXT-STS:vm_state": "active",
  "OS-SRV-USG:launched_at": "2021-12-16T18:57:24.000000",
  <...>
  "volumes_attached": ""
}
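If you only need the hypervisor, you can ask for that single field instead of the full JSON:

(overcloud)[director]$ openstack server show 2795221e-f0f7-4518-a5c5-85977357eeec -c "OS-EXT-SRV-ATTR:host" -f value
srvrhpb510-compute-2.localdomain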

Step 2: Connect to the hypervisor and dump the XML of the instance

Next, you need a dump of the XML (using the virsh dumpxml command) that defines your instance. So that you can filter it in the next step, redirect the output into a file:

[compute2]$ sudo podman exec -it nova_libvirt bash

(pod)[compute2]# virsh list --all
 Id   Name                State
-----------------------------------
 6    instance-00000245   running
 7    instance-00000248   running

(pod)[compute2]# virsh dumpxml instance-00000245 | tee inst245.xml
<domain type='kvm' id='6'>
  <name>instance-00000245</name>
  <uuid>1718c7d4-520a-4366-973d-d421555295b0</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="20.4.1-1.20201114041747.el8ost"/>
      <nova:name>server-1</nova:name>
      <nova:creationTime>2021-12-16 18:57:03</nova:creationTime>
[...]
</domain>

Step 3: Examine the XML output

After you have the XML output, use your favorite pager or text editor to get the network interface information for the instance.
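For example, a quick grep over the file saved in the previous step pulls out the interface definition (12 lines of trailing context covers the whole element here):

(pod)[compute2]# grep -A 12 '<interface' inst245.xml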

<interface type='bridge'>
      <mac address='fa:16:3e:f7:15:db'/>
      <source bridge='br-int'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='da128923-84c7-435e-9ec1-5a000ecdc163'/>
      </virtualport>
      <target dev='tap123'/>
      <model type='virtio'/>
      <driver name='vhost' rx_queue_size='1024'/>
      <mtu size='8996'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>

From this output, filter the source bridge (on the compute node) and the target device (the physical interface in the compute node).

This output can change depending on the firewall type you’re using, or if you’re using security groups, where the flow is a bit different, but all the host interfaces are displayed, and the next steps apply to all of them.

Step 4: Look at the target device

In this case, tap123 on the compute node is the target device, so examine it with the ifconfig command:

[compute2]$ ifconfig tap123

tap123: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 8996
        inet6 fe80::fc16:3eff:fef7:15db  prefixlen 64  scopeid 0x20<link>
        ether fe:16:3e:f7:15:db  txqueuelen 10000  (Ethernet)
       [...]

You can see that the MTU is 8996, as expected. You might also notice the MAC address (fe:16:3e:f7:15:db), so you can optionally check the port using the OpenStack port commands.
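For example, you can look up the Neutron port by MAC from the Overcloud. Note that the port carries the instance-side address (fa:16:3e:f7:15:db); the fe:16:3e: prefix on the tap is the host-side counterpart of the same address:

(overcloud)[director]$ openstack port list --mac-address fa:16:3e:f7:15:db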

You can also check that this interface is in the br-int bridge, for example with sudo ovs-vsctl show on the compute node (output trimmed):

Bridge br-int
       [...]
        Port tap123
            tag: 1
            Interface tap123

That’s also as expected because this allows north-south traffic for this instance using the external network.

Step 5: Change the MTU

Apply a standard MTU change on the host, specifically on your target interface (tap123 in this example).

[compute2]$ sudo ifconfig tap123 mtu 1500
[compute2]$ ifconfig tap123 | grep mtu
tap123: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
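If you prefer the iproute2 tooling over the legacy ifconfig, the equivalent command is:

[compute2]$ sudo ip link set dev tap123 mtu 1500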

Step 6: Repeat

Now repeat the procedure inside the instance to move the MTU from 8996 to 1500. This covers the hypervisor part, as Neutron is still configured with jumbo frames.

[localhost]$ sudo ip link set dev eth0 mtu 1500
[localhost]$ ifconfig eth0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.169.252.186  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::f816:3eff:fef7:15db  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:f7:15:db  txqueuelen 1000  (Ethernet)
        RX packets 1226  bytes 242462 (236.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 401  bytes 292332 (285.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Validation

Now the path inside the local network has an MTU of 1500. If you try to send a packet bigger than this, an error should be displayed:

[localhost]$ ping 10.169.252.1 -M do -s 1500
PING 10.169.252.1 (10.169.252.1) 1500(1528) bytes of data.
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500

--- 10.169.252.1 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3000ms

Ping adds 28 bytes of headers to the payload (20 bytes for the IP header plus 8 bytes for the ICMP header), so -s 1500 tries to send a 1528-byte packet. The system can’t send it because it exceeds the MTU. Once you lower the payload to 1472, you can successfully send the ping in a single frame.
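As a quick check of that arithmetic, shell arithmetic gives the largest ICMP payload that fits in a 1500-byte MTU:

[localhost]$ echo $((1500 - 20 - 8))
1472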

[localhost]$ ping 10.169.252.1 -M do -s 1472
PING 10.169.252.1 (10.169.252.1) 1472(1500) bytes of data.
1480 bytes from 10.169.252.1: icmp_seq=1 ttl=255 time=1.37 ms
1480 bytes from 10.169.252.1: icmp_seq=2 ttl=255 time=1.11 ms
1480 bytes from 10.169.252.1: icmp_seq=3 ttl=255 time=1.02 ms
1480 bytes from 10.169.252.1: icmp_seq=4 ttl=255 time=1.12 ms

--- 10.169.252.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 1.024/1.160/1.378/0.131 ms

This is how to end fragmentation problems when the platform sends 9000-byte packets to the network but fragmentation still occurs in some network elements. You have now resolved retransmission issues, packet loss, jitter, latency, and other related problems.

When the network team resolves the network issues, you can revert the MTU commands back to the previous value. This is how you fix network issues without needing to redeploy the stack.
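Reverting is the same one-liner with the original value, run in both places (the values mirror this article’s example):

[compute2]$ sudo ip link set dev tap123 mtu 8996
[localhost]$ sudo ip link set dev eth0 mtu 8996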

End-to-end simulation

Here’s how to simulate the issue in an end-to-end scenario to see how it works. Instead of pinging the gateway, you can ping a second instance. You should observe how an MTU mismatch causes issues, especially when an application is marking packets as Don’t Fragment.

Assume your servers have the following specs:

Server 1:
Hostname: server1
IP: 10.169.252.186/24
MTU: 1500

Server 2:
Hostname: server2
IP: 10.169.252.184/24
MTU: 8996

Connect to server1 and ping server2:

[server1]$ ping 10.169.252.184
PING 10.169.252.184 (10.169.252.184) 56(84) bytes of data.
64 bytes from 10.169.252.184: icmp_seq=1 ttl=64 time=0.503 ms
64 bytes from 10.169.252.184: icmp_seq=2 ttl=64 time=0.193 ms
64 bytes from 10.169.252.184: icmp_seq=3 ttl=64 time=0.213 ms

--- 10.169.252.184 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.193/0.303/0.503/0.141 ms

Connect to server1 and ping server2 without fragmentation, using a payload that fills the 1500-byte MTU:

[server1]$ ping 10.169.252.184 -M do -s 1472
PING 10.169.252.184 (10.169.252.184) 1472(1500) bytes of data.
1480 bytes from 10.169.252.184: icmp_seq=1 ttl=64 time=0.512 ms
1480 bytes from 10.169.252.184: icmp_seq=2 ttl=64 time=0.293 ms
1480 bytes from 10.169.252.184: icmp_seq=3 ttl=64 time=0.230 ms
1480 bytes from 10.169.252.184: icmp_seq=4 ttl=64 time=0.268 ms
1480 bytes from 10.169.252.184: icmp_seq=5 ttl=64 time=0.230 ms
1480 bytes from 10.169.252.184: icmp_seq=6 ttl=64 time=0.208 ms
1480 bytes from 10.169.252.184: icmp_seq=7 ttl=64 time=0.219 ms
1480 bytes from 10.169.252.184: icmp_seq=8 ttl=64 time=0.229 ms
1480 bytes from 10.169.252.184: icmp_seq=9 ttl=64 time=0.228 ms

--- 10.169.252.184 ping statistics ---
9 packets transmitted, 9 received, 0% packet loss, time 8010ms
rtt min/avg/max/mdev = 0.208/0.268/0.512/0.091 ms

The MTU of server1 is 1500, and server2 has an MTU size larger than that, so an application running on server1 sending packets to server2 has no fragmentation issues. What happens if server2’s application is also set to Don’t Fragment but uses an MTU of 9000?

[localhost]$ ping 10.169.252.186 -M do -s 8968
PING 10.169.252.186 (10.169.252.186) 8968(8996) bytes of data.

--- 10.169.252.186 ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 8999ms

The packets exceed server1’s 1500-byte MTU and the Don’t Fragment flag prevents fragmentation, so they are lost.

To correct this, repeat the MTU fix so that both servers have the same MTU. As a test, revert server1:

[compute2]$ sudo ip link set dev tap123 mtu 8996
[compute2]$ ifconfig tap123 | grep mtu
tap123: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 8996

[server1]$ sudo ip link set dev eth0 mtu 8996
[server1]$ ifconfig eth0 | grep mtu
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 8996
[...]

Now repeat the jumbo payload ping without fragmentation allowed:

[server2]$ ping 10.169.252.186 -M do -s 8968
PING 10.169.252.186 (10.169.252.186) 8968(8996) bytes of data.
8976 bytes from 10.169.252.186: icmp_seq=1 ttl=64 time=1.60 ms
8976 bytes from 10.169.252.186: icmp_seq=2 ttl=64 time=0.260 ms
8976 bytes from 10.169.252.186: icmp_seq=3 ttl=64 time=0.257 ms
8976 bytes from 10.169.252.186: icmp_seq=4 ttl=64 time=0.210 ms
8976 bytes from 10.169.252.186: icmp_seq=5 ttl=64 time=0.249 ms
8976 bytes from 10.169.252.186: icmp_seq=6 ttl=64 time=0.250 ms

--- 10.169.252.186 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5001ms
rtt min/avg/max/mdev = 0.210/0.472/1.607/0.507 ms

Troubleshooting MTU

This is a simple workaround to help network administrators address MTU issues without needing a stack update to move MTUs back and forth. All these MTU configurations are also temporary: an instance or system reboot causes all interfaces to revert to the original, configured value.

It also takes only a few minutes to perform, so I hope you find this helpful.
