Debugging eBPF XDP Drops on Mellanox ConnectX-6

A ConnectX-6 Dx running mlx5_core on a 6.6 LTS kernel will happily attach an XDP program, report success, and then silently drop every packet the program passes with XDP_PASS. The counter that tells you this is rx_xdp_drop under ethtool -S, and it is almost always caused by one of three things: a headroom miscalculation, a channel/queue mismatch after you reshaped the NIC, or a ring that is too shallow to absorb the extra descriptors XDP demands. If you are chasing XDP packet drops on a Mellanox ConnectX-6, those three suspects catch the overwhelming majority of cases before you ever need to open bpftool prog dump xlated.

This guide walks through the counters, the sysfs knobs, and the tracepoints I reach for first, in the order I reach for them. Every command here is copy-pasteable against a Linux 6.6+ host with an OFED or in-tree mlx5_core driver and a ConnectX-6 / ConnectX-6 Dx card.

Read the mlx5 drop counters before touching the program

The mlx5_core driver exposes a surprisingly detailed set of XDP counters through ethtool. Before you suspect your own BPF code, dump them and look for the non-zero ones:

ethtool -S enp1s0f0np0 | grep -Ei 'xdp|drop|cache|err'

On a healthy XDP workload you should see rx_xdp_redirect or rx_xdp_tx_xmit climbing and rx_xdp_drop either zero or matching your program’s deliberate XDP_DROP verdicts. The counters that actually indicate trouble are these:

  • rx_xdp_drop — the program returned XDP_DROP or the driver could not deliver the frame to the program. These two cases are indistinguishable from the counter alone, which is the first gotcha.
  • rx_xdp_tx_full / rx_xdp_tx_err — the TX ring used for XDP_TX redirect is backed up. Usually means you need more TX descriptors or a deeper queue.
  • rx_cache_full / rx_cache_reuse — the per-queue page-pool is saturated. XDP needs page-per-packet allocation on mlx5, so this climbs fast if rings are undersized.
  • rx_xdp_redirect_err — the bpf_redirect_map target was invalid or the devmap slot was empty. Classic after a veth or AF_XDP socket was torn down without updating the map.
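Absolute values are less useful than deltas: a non-zero counter may be ancient history, but one that is climbing is your live problem. A minimal sketch of a snapshot-diff helper, assuming two saved ethtool -S dumps (the counter_delta name and the file names are mine):

```shell
# counter_delta BEFORE AFTER — print every counter whose value changed
# between two saved `ethtool -S` snapshots, with the delta.
# Capture the snapshots yourself, e.g.:
#   ethtool -S enp1s0f0np0 > before.txt; sleep 5; ethtool -S enp1s0f0np0 > after.txt
counter_delta() {
    awk -F': *' '
        { gsub(/^ +/, "", $1) }                     # strip ethtool indentation
        NR == FNR { before[$1] = $2; next }         # first file: record values
        ($1 in before) && ($2 + 0 != before[$1] + 0) {
            printf "%s %+d\n", $1, $2 - before[$1]  # second file: print deltas
        }' "$1" "$2"
}
```

Run it across a few seconds of load; whatever it prints is the counter worth chasing.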

The official mlx5 ethtool counter reference lists every one of these alongside the exact code path that increments them, and it is the single most useful page for diagnosing these drops — cross-check the counter name against the driver source if anything looks ambiguous, because a few of the names shifted between MLNX_OFED 5.8 and the upstream 6.x naming.

The headroom trap: why XDP_PASS silently drops

XDP on mlx5_core requires 256 bytes of headroom in front of every packet, matching XDP_PACKET_HEADROOM. The driver allocates one page per descriptor in XDP mode, and if your configured MTU plus headroom plus skb_shared_info padding exceeds what fits in a single page frag, the driver refuses to attach in native mode and silently falls back — or worse, attaches and then drops oversize frames into rx_xdp_drop.

On a 4K page x86_64 box, the practical ceiling for native XDP on ConnectX-6 is an MTU of about 3498 bytes. Jumbo frames at 9000 bytes will not work in native mode without multi-buffer XDP, which requires both kernel 6.3+ and a program compiled with xdp.frags in the section name. If you try to attach a non-frags program with an MTU above the single-page ceiling, the bpf syscall returns -EOPNOTSUPP and dmesg prints mlx5_core ... MTU ... is too big for non-linear XDP.
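The 3498 figure falls out of simple page arithmetic. A back-of-envelope sketch — the exact split of the L2 overhead is my reading of the driver's accounting, so treat the breakdown as an assumption even though the total matches:

```shell
# Single-page XDP ceiling on a 4K-page x86_64 box, roughly:
#   4096-byte page
# -  256 bytes of XDP headroom (XDP_PACKET_HEADROOM)
# -  320 bytes of struct skb_shared_info at the tail
# -   22 bytes of L2 overhead the driver budgets for
#        (Ethernet header plus two VLAN tags — this split is an assumption)
page=4096 headroom=256 shinfo=320 l2=22
echo $(( page - headroom - shinfo - l2 ))   # → 3498
```

Anything that pushes the frame past that remainder forces multi-buffer XDP or a lower MTU.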

The fix is either to lower the MTU:

ip link set dev enp1s0f0np0 mtu 3498

or to rebuild your program with multi-buffer support. With libbpf 1.4+, the only source change is the section name — compile as usual with clang -O2 -g -target bpf -c xdp_filter.c -o xdp_filter.o:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("xdp.frags")
int xdp_filter(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    /* Bounds check before touching the header. */
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

The .frags suffix is what libbpf uses to set BPF_F_XDP_HAS_FRAGS at load time. Without it, a 9K jumbo on ConnectX-6 dies at the first packet and never increments anything except rx_xdp_drop. The mlx5 multi-buffer XDP support landed in upstream kernel 6.3 via commit ea5d49bdae8b, and MLNX_OFED exposes it from 23.10 onward.

Channels, queues, and why half your cores see drops

XDP programs attach per-channel on mlx5. If you reshape channels while a program is loaded, the driver tears the program off and re-attaches it — but only on channels that exist afterwards. A common failure mode: you run ethtool -L enp1s0f0np0 combined 16 on a 32-core box while an XDP program is running, and suddenly half your traffic starts dropping because RSS is still spreading flows across all 32 hardware queues, but only 16 of them have the XDP hook attached. The driver logs this to dmesg but the message is terse:

mlx5_core 0000:01:00.0: mlx5e_xdp_set: channels 16, xdp 16
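Before reshaping, confirm what the NIC currently has. A small sketch that pulls the current Combined count out of a saved ethtool -l dump (the current_combined name is mine; ethtool -l prints pre-set maximums first and current settings second, so the last Combined line wins):

```shell
# current_combined FILE — extract the *current* Combined channel count
# from saved `ethtool -l` output. "Pre-set maximums" comes first,
# "Current hardware settings" second, so keep the last match.
current_combined() {
    awk '/^Combined:/ { v = $2 } END { print v }' "$1"
}
```

Compare the result against the channel count your XDP deployment assumes before you touch ethtool -L.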

Always detach XDP, reshape channels, then re-attach:

bpftool net detach xdp dev enp1s0f0np0
ethtool -L enp1s0f0np0 combined 16
ethtool -G enp1s0f0np0 rx 4096 tx 4096
bpftool net attach xdpdrv id 42 dev enp1s0f0np0

The xdpdrv verb forces native mode; use xdpoffload only if you actually have a Smart NIC with HW offload, which ConnectX-6 Dx does not support for generic BPF — only the BlueField variants do. If you see xdpgeneric in bpftool net show, you are in the SKB fallback path and every metric including latency will be wrong.

Ring depth matters because XDP recycles pages through the driver’s page-pool. The default 1024 RX descriptors is too shallow for sustained 100 Gbps line-rate workloads; push it to 4096 or 8192 with ethtool -G. The rx_cache_full counter is your thermometer — if it is climbing, the ring is starving the page-pool and you will see rx_xdp_drop tick up in lockstep.

[Chart: XDP drop rate vs packet size on ConnectX-6 — benchmark figure]

The chart shows something most operators do not expect: drop rate is nearly flat from 64B up to about 512B per packet, then falls off a cliff above 1024B. That is the page-pool pressure signature — small packets pack more descriptors per second into the same ring, so a ring sized for MTU-1500 line rate will be chronically short at 64B. The practical takeaway is to size the ring against your smallest expected packet, not your MTU.
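The descriptor-pressure argument is easy to quantify. A sketch of the packets-per-second math at 100 Gbps line rate (20 bytes is standard on-wire overhead per frame: 8B preamble plus 12B inter-frame gap; the 64B minimum and 1518B maximum frames already include the FCS):

```shell
# pps FRAME_BYTES — packets per second at 100 Gbps line rate, counting
# the 20 bytes of preamble + inter-frame gap each frame carries on the wire.
pps() { echo $(( 100000000000 / (($1 + 20) * 8) )); }

pps 64     # → 148809523  (~148.8 Mpps: a 4096-deep ring drains in ~28 µs)
pps 1518   # → 8127438    (~8.1 Mpps: the same ring lasts ~500 µs)
```

An 18x swing in descriptor consumption for the same bit rate is why a ring sized for MTU traffic starves at 64B.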

Tracepoints and perf — when counters are not enough

When rx_xdp_drop is incrementing but you cannot tell whether the program or the driver is to blame, attach to the xdp:xdp_exception tracepoint. It fires every time the kernel rejects an XDP verdict, and the act field tells you exactly which verdict was returned:

perf record -e xdp:xdp_exception -a -g -- sleep 10
perf script | head -40

Typical output looks like prog_id=42 act=XDP_ABORTED. XDP_ABORTED means the program hit a verifier-invisible runtime error — a null deref on a map lookup return value is the classic case. The cure is always to check the return of bpf_map_lookup_elem for NULL before touching it, even when you think the key must exist.
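When the exception stream is busy, aggregating by verdict beats eyeballing perf script line by line. A sketch that tallies the act= field from saved perf script output (the count_acts name is mine, and the field format follows the sample output above — your perf version's exact field names may differ):

```shell
# count_acts FILE — tally xdp:xdp_exception events by returned verdict,
# given saved `perf script` output. Looks for one act=... token per line.
count_acts() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^act=/) acts[substr($i, 5)]++   # strip the "act=" prefix
    }
    END { for (a in acts) print a, acts[a] }' "$1"
}
```

A wall of XDP_ABORTED points at a runtime bug in the program; a mix of verdicts usually means a driver-side reject.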

For deeper work, bpftool prog profile id 42 duration 10 cycles instructions gives you per-program CPU cost on kernels 5.7+ with BTF enabled (CONFIG_DEBUG_INFO_BTF — the profiler attaches fentry/fexit probes to your program). If your program is burning more than about 200 cycles per packet, you are probably doing a map lookup per packet that you could precompute or hoist. The bpftool source tree documents every subcommand and it is the right reference for anything profiling-related.

[Diagram: packet path from the ConnectX-6 receive queue through the XDP program to its four terminal verdicts]

The diagram traces a packet from the ConnectX-6 receive queue, through the mlx5e page-pool allocator, into the XDP program, and onto one of four terminal paths: XDP_PASS up to the stack, XDP_TX back out the same queue, XDP_REDIRECT through a devmap or cpumap, or XDP_DROP into oblivion. The rx_xdp_drop counter sits on the drop branch — but it also increments on the far-left branch, before the program runs, if the driver cannot build a valid xdp_buff. That is the source of the ambiguity you hit in step one: the same counter covers a driver-side reject and a program-side verdict.

CQE compression and striding RQ — the subtle ones

Two mlx5-specific features interact badly with XDP and trip people up. The first is CQE compression, which packs multiple completions into a single cache line to save PCIe bandwidth. It is enabled by default on 100G+ firmware. XDP in native mode is incompatible with compressed CQEs in some firmware/driver combinations — the driver quietly disables compression when you attach a program, but if your firmware is older than 22.33.1048 you may see a stall at attach time. Check with:

ethtool --show-priv-flags enp1s0f0np0 | grep -i cqe
mstflint -d 01:00.0 q | grep FW

If rx_cqe_compress is on and firmware is below 22.33, upgrade firmware before you spend another hour on the program. NVIDIA’s mlx5 firmware changelog calls out the specific XDP + CQE compression interactions and is worth bookmarking.
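If you script this check across a fleet, a minimal version gate helps. A sketch, assuming the mstflint version-string format shown above and the 22.33 floor from this section (the fw_ok name is mine):

```shell
# fw_ok VERSION — succeed when an mlx5 firmware version string (e.g.
# "22.33.1048") meets the 22.33 floor this section recommends.
fw_ok() {
    echo "$1" | awk -F. '{ exit !($1 > 22 || ($1 == 22 && $2 >= 33)) }'
}

fw_ok 22.33.1048 && echo "ok to proceed"
fw_ok 22.31.2006 || echo "upgrade firmware first"
```

Wire it into provisioning so a stale-firmware node never gets an XDP program attached in the first place.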

The second is Striding RQ, which lets the NIC post a single large buffer and fill it with multiple packets. It is great for the non-XDP path and mandatory for some RoCE workloads, but native XDP disables it automatically and reverts to linear RQ. You can verify which mode you are in with ethtool --show-priv-flags — look for rx_striding_rq: off when XDP is attached. If it stays on after attach, you are in generic XDP, not native, and every measurement you take will be off by 3-5x on latency.

A minimal debugging checklist

When a new XDP program misbehaves on ConnectX-6, this is the sequence that catches 90% of problems in under ten minutes:

  1. Confirm you are in native mode: bpftool net show dev enp1s0f0np0 must show xdp/drv, not xdp/generic.
  2. Check MTU against the page-frag ceiling. Anything above 3498 needs xdp.frags and kernel 6.3+.
  3. Dump ethtool -S and look for rx_xdp_drop, rx_cache_full, and rx_xdp_tx_full. Non-zero values point at very different root causes.
  4. Confirm channel count and ring depth. ethtool -l and ethtool -g. Detach-reshape-reattach if you need to change either.
  5. Attach perf record -e xdp:xdp_exception to distinguish program bugs from driver rejects.
  6. If firmware is older than 22.33, upgrade it before anything else.

The single most common root cause I see reported on the xdp-project tracker for ConnectX-6 is still the headroom/MTU mismatch — someone sets MTU 9000 expecting jumbo support, loads a program without xdp.frags, and watches rx_xdp_drop climb at line rate. The fix is thirty seconds of work once you know where to look. Everything else on this checklist is rarer but more subtle, which is exactly why you want to eliminate the obvious cases first before you start reading disassembled BPF bytecode at 3am.
