Stop Trying to Kill TCP/IP (Unless You’re Building a Supercomputer)

The Cockroach of Protocols

Every three to five years, like clockwork, someone announces they’ve finally killed TCP/IP. It’s usually a hardware vendor, sometimes a hyperscaler, and recently, it’s the AI crowd. I saw the headlines again last week—another “revolutionary” protocol promising to replace the internet’s backbone because TCP is apparently too slow, too fat, and too old for the exascale era.

And look, they aren’t entirely wrong. But they aren’t right, either.

I’ve been tuning networks since the days when 10Mbps was considered “broadband,” and if I had a dollar for every time a proprietary fabric promised to bury TCP, I’d have enough money to buy a strictly average house in the Bay Area. Remember ATM? Remember when InfiniBand was going to replace Ethernet everywhere? Yet here we are in 2026. If you’re reading this, your packets almost certainly rode the TCP train to get to your screen.

But the context has shifted. We aren’t just moving web pages anymore. We’re moving trillion-parameter model weights across thousands of GPUs. And in that specific, sweaty, high-pressure environment, TCP is starting to look less like a reliable workhorse and more like a bottleneck.

Why AI Hates Your Kernel

The problem isn’t really TCP itself—the protocol is mathematically beautiful in its resilience. The problem is where it lives. For decades, the TCP stack has lived comfortably inside the OS kernel. You open a socket, the kernel handles the handshake, the window scaling, the congestion control, and the retransmissions. It’s safe. It’s standard.

It’s also excruciatingly slow when you need microsecond latency.

Every time a packet lands on your NIC, the CPU has to stop what it’s doing, context switch into kernel mode, copy data from kernel space to user space, and then switch back. Doing this at 1Gbps is trivial. Doing it at 400Gbps or 800Gbps—which is standard for AI training clusters now—burns so many CPU cycles that your expensive H100s or Dojo chips end up waiting on the network.

I ran into this wall myself last year while debugging a distributed storage cluster. The disks were fast (NVMe), the network was fast (100GbE), but the throughput was garbage. Why? Because the CPU was pegged at 100% just processing interrupt requests (IRQs) for TCP packets.
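If you suspect the same thing on your own box, you don’t need fancy tooling to confirm it. Here’s a minimal sketch of the check (not my exact debugging script, just the same idea): count NET_RX softirqs per second straight out of /proc/softirqs, which any modern Linux kernel exposes.

import time

def net_rx_softirqs():
    # Sum the NET_RX softirq counters across all CPUs
    with open("/proc/softirqs") as f:
        for line in f:
            line = line.strip()
            if line.startswith("NET_RX:"):
                return sum(int(v) for v in line.split()[1:])
    return 0

before = net_rx_softirqs()
time.sleep(1)
print(f"NET_RX softirqs/sec: {net_rx_softirqs() - before}")
# Hundreds of thousands per second means the CPU is spending its life
# servicing the network stack instead of doing useful work.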


We had to tune the hell out of the Linux kernel just to make it usable. If you’ve never looked at sysctl.conf with a sense of existential dread, you haven’t lived.

# The "Please don't die" TCP tuning starter pack
# I keep this snippet handy for every high-throughput box I touch

# Increase buffer sizes to ridiculous levels (for 100G+ links)
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728

# Enable BBR (Bottleneck Bandwidth and RTT) - usually better than Cubic for high throughput
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

# Don't cache metrics, start fresh every connection
net.ipv4.tcp_no_metrics_save = 1

# If you don't touch these, your 400Gbps link acts like a 10Gbps link
net.core.netdev_max_backlog = 250000

See that? That’s just to get TCP to wake up on a modern link. And even with that, you’re still bound by the CPU.

Enter the “TCP Killers” (TTPoE and Friends)

This is why things like RDMA (Remote Direct Memory Access) and the newer TTPoE (Tesla Transport Protocol over Ethernet) exist. They want to bypass the kernel entirely. The idea is to let the NIC write data directly into the application’s memory. No CPU interrupts, no context switches, no copies.

TTPoE is particularly interesting because it embraces lossiness.

Traditional HPC networking (like InfiniBand) tries to be “lossless.” It uses flow control (PAUSE frames) to tell the sender “Whoa, stop, I’m full!” so that no packet is ever dropped. This sounds great until one switch gets congested and screams “STOP!” to its neighbor, which screams “STOP!” to its neighbor, and suddenly your entire million-dollar cluster is gridlocked. We call this “congestion spreading,” and it’s a nightmare to debug.
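If that sounds abstract, here’s a toy model of the failure mode in Python. Five lossless “switches” with four-packet buffers; stall the last egress port and watch the backpressure crawl all the way back to the sender. The numbers are made up, but the mechanics are real.

from collections import deque

QUEUE_DEPTH = 4
switches = [deque() for _ in range(5)]  # a chain of lossless hops

def step(sink_stalled):
    last = len(switches) - 1
    if not sink_stalled and switches[last]:
        switches[last].popleft()  # the destination drains a packet
    # A hop may forward only if the next hop has buffer space --
    # that's the PAUSE-frame semantics in one line
    for i in range(last - 1, -1, -1):
        if switches[i] and len(switches[i + 1]) < QUEUE_DEPTH:
            switches[i + 1].append(switches[i].popleft())
    if len(switches[0]) < QUEUE_DEPTH:
        switches[0].append("pkt")  # the sender keeps injecting

for _ in range(30):
    step(sink_stalled=True)  # one congested port...

print([len(q) for q in switches])  # ...and every buffer in the chain is full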

TCP handles loss by retransmitting. It assumes the network is unreliable. But retransmitting takes time—milliseconds of it. In AI training, if one GPU is waiting for a packet retransmission, thousands of other GPUs might stall. It’s the “tail latency” problem.
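The arithmetic behind that is brutal. With hypothetical (but not crazy) numbers, a tiny per-flow retransmission rate becomes a near-certainty at cluster scale:

# Hypothetical: 10,000 synchronized flows, each with a 0.01% chance
# of needing a retransmission on any given training step
N, p = 10_000, 0.0001
stall_prob = 1 - (1 - p) ** N  # the step stalls if ANY flow retransmits
print(f"P(step stalls on a retransmit) = {stall_prob:.1%}")  # ~63.2%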

The new wave of protocols, TTPoE included, tries to find a middle ground. They run over standard Ethernet (cheap, ubiquitous) but strip out the heavy TCP state machine. They push the complexity to the hardware (the NIC or the switch) or simplify the protocol so much that it’s just a dumb pipe dumping data into memory.

The “Dumb Switch” Philosophy

I’ve always preferred dumb networks. Give me a switch that just forwards packets and doesn’t try to be smart. When you put complex logic in the network (like InfiniBand’s credit-based flow control), the network becomes a failure domain.

If TTPoE or RoCEv2 (RDMA over Converged Ethernet) runs on standard Ethernet switches, that’s a win. You can use off-the-shelf Arista or Cisco gear. You don’t need a proprietary InfiniBand subnet manager that crashes at 2 AM on a Saturday.


Why TCP Won’t Die (Yet)

So, is TCP dead? No.

Go try to run TTPoE over the public internet. It won’t work. These high-performance protocols rely on low RTT (Round Trip Time) and often assume a relatively clean, flat Layer 2 topology. They are designed for the fabric—the controlled environment inside a data center where you own every cable and switch.

TCP is designed for the wild. It survives bad WiFi, congested undersea cables, and misconfigured ISP routers. It is the universal adapter of the digital world.

What we are seeing in 2026 is a bifurcation of the network stack:

  • The Control Plane & WAN: TCP/IP remains king. It’s robust, compatible, and “fast enough” for moving web requests, database queries, and Netflix streams.
  • The Data Plane (AI/HPC): Specialized protocols like TTPoE, RoCE, or proprietary NVLink fabrics take over. Here, raw throughput and microsecond latency matter more than compatibility.

It’s not a murder; it’s a specialization. TCP is retiring from the heavy lifting of exascale compute to focus on management and wide-area transport.

A Quick Python Reality Check


To visualize why kernel bypass matters, think about how Python handles sockets. I wrote a quick test script last week to feel how much overhead the kernel adds to a plain TCP send loop.

import socket
import time

# A simple benchmark to feel the kernel overhead
def stress_test(target_ip, port, duration=5):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # This option alone can change throughput by 20%
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    try:
        sock.connect((target_ip, port))
        start_time = time.time()

        # Sending zeroes in 4KB chunks: every chunk is one send() syscall,
        # i.e. one context switch into the kernel and back
        chunk = b'\x00' * 4096
        bytes_sent = 0

        while time.time() - start_time < duration:
            sock.sendall(chunk)  # sendall() retries until the whole chunk is out
            bytes_sent += len(chunk)

        print(f"Throughput: {(bytes_sent / 1024 / 1024) / duration:.2f} MB/s")

    except Exception as e:
        print(f"Broken: {e}")
    finally:
        sock.close()

# If you run this, watch 'top' or 'htop'.
# You'll see high "sy" (system) usage. That's the kernel overhead.
# stress_test('192.168.1.50', 8080)

When I run this on a standard Linux box, I might get 2-3 GB/s if I'm lucky. The bottleneck isn't the 100Gb NIC; it's the send() syscall. Every time that loop runs, the CPU halts. A kernel-bypass protocol (which you can't easily write in vanilla Python) would keep that loop in userspace and blast data at line rate.
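You can’t get true kernel bypass from vanilla Python, but you can at least kill the user-space copy. Here’s a hedged sketch using socket.sendfile(), which rides the kernel’s zero-copy sendfile(2) path; the address and file path are placeholders, same as above.

import socket

def sendfile_test(target_ip, port, path="/tmp/payload.bin"):
    # sendfile() hands the open file to the kernel once; the payload
    # never round-trips through user space, so the per-chunk copy and
    # most of the syscall churn from the loop above disappear
    with socket.create_connection((target_ip, port)) as sock:
        with open(path, "rb") as f:
            sent = sock.sendfile(f)
    print(f"Sent {sent / 1024 / 1024:.1f} MB via zero-copy sendfile")

# sendfile_test('192.168.1.50', 8080)

It’s not RDMA, but it makes the point: every copy you remove from the hot path buys you throughput.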

The Verdict

I’m skeptical of anyone claiming they’ve "solved" networking. Physics is stubborn. You can trade reliability for speed (UDP/TTPoE), or speed for reliability (TCP), but you can't have both for free.

The rise of these new protocols for the Dojo supercomputer and similar clusters is necessary. We are hitting the physical limits of how much packet processing general-purpose CPUs can handle. But don’t go ripping out your TCP stack just yet. Unless you have 10,000 GPUs trying to learn the entire internet by Tuesday, TCP is still your best friend.

It’s boring, it’s old, and it works. In my book, that’s a feature, not a bug.

