
Kernel Bypass Techniques: DPDK, RDMA, Solarflare/OpenOnload

Kernel bypass techniques eliminate the operating system kernel from the data path, allowing applications to directly access network hardware. This dramatically reduces latency by eliminating syscalls, context switches, and kernel overhead.

Why Kernel Bypass?

Traditional networking involves multiple layers of abstraction:

cpp
Application
    ↓ (syscall)
Kernel Network Stack
    ↓ (driver call)
Network Driver
    ↓ (hardware access)
Network Hardware

Each layer adds latency:

  • System calls: 100-1000 cycles
  • Context switches: 1-30 microseconds
  • Kernel processing: Variable overhead
  • Memory copying: Additional cycles

Kernel bypass eliminates these layers:

cpp
Application
    ↓ (direct access)
Network Hardware

DPDK: Data Plane Development Kit

DPDK is an open-source framework for fast packet processing in user space.

How DPDK Works

cpp
// Traditional networking with kernel
int sock = socket(AF_INET, SOCK_STREAM, 0);  // syscall
bind(sock, &addr, sizeof(addr));             // syscall
listen(sock, 10);                            // syscall
int client = accept(sock, NULL, NULL);       // syscall
recv(client, buffer, size, 0);               // syscall

// DPDK networking - direct hardware access
struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
    "MBUF_POOL", 8192 /* mbufs */, 0 /* cache */, 0 /* priv size */,
    RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
struct rte_eth_conf port_conf = {0};
rte_eth_dev_configure(port_id, 1, 1, &port_conf);  // Direct hardware config
rte_eth_rx_queue_setup(port_id, 0, 512, rte_eth_dev_socket_id(port_id), NULL, mbuf_pool);
rte_eth_tx_queue_setup(port_id, 0, 512, rte_eth_dev_socket_id(port_id), NULL);
rte_eth_dev_start(port_id);

// Polling loop - no syscalls
while (true) {
    struct rte_mbuf *pkts[32];
    uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, 32);
    for (int i = 0; i < nb_rx; i++) {
        process_packet(pkts[i]);
        rte_pktmbuf_free(pkts[i]);
    }
}

Key DPDK Features

1. Polling Mode

cpp
// Instead of interrupts, continuously poll for packets
while (true) {
    struct rte_mbuf *pkts[32];
    uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, 32);
    if (nb_rx > 0) {
        // Process packets
    }
    // No context switches, no interrupts
}

2. Huge Pages

cpp
// DPDK uses huge pages to reduce TLB misses
// Traditional: 4KB pages
// DPDK: 2MB or 1GB huge pages

3. NUMA Awareness

cpp
// Allocate the packet pool on the local NUMA node (rte_socket_id()) so
// polling cores never cross the socket interconnect for buffer memory
struct rte_mempool *pool = rte_pktmbuf_pool_create("pool", 8192, 0, 0,
    RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

DPDK Performance

  • Latency: 1-10 microseconds (vs 50+ microseconds with kernel)
  • Throughput: 10-100 Gbps per core
  • CPU Usage: Dedicated cores for packet processing

RDMA: Remote Direct Memory Access

RDMA allows one computer to access another computer's memory directly without involving the CPU or operating system.

How RDMA Works

cpp
// Traditional networking with copying
// Sender
char data[1024];
memcpy(data, source_data, 1024);  // Copy to send buffer
send(socket, data, 1024, 0);      // syscall

// Receiver  
char buffer[1024];
recv(socket, buffer, 1024, 0);    // syscall
memcpy(dest_data, buffer, 1024);  // Copy from receive buffer

// RDMA - direct memory access
// Sender
ibv_post_send(qp, &wr, &bad_wr);  // Direct memory access

// Receiver
// Data appears directly in destination memory
// No syscalls, no copying

RDMA Operations

1. RDMA Read

cpp
// Read remote memory directly
// The local buffer is described by a scatter/gather entry; the remote
// buffer by its address plus the rkey granted at registration time
struct ibv_sge sge = {
    .addr   = (uintptr_t)local_buffer,
    .length = size,
    .lkey   = local_key
};
struct ibv_send_wr wr = {
    .wr_id      = (uintptr_t)context,
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_RDMA_READ,
    .send_flags = IBV_SEND_SIGNALED,
    .wr.rdma    = {
        .remote_addr = (uintptr_t)remote_buffer,
        .rkey        = remote_key
    }
};

2. RDMA Write

cpp
// Write directly to remote memory
struct ibv_sge sge = {
    .addr   = (uintptr_t)local_buffer,
    .length = size,
    .lkey   = local_key
};
struct ibv_send_wr wr = {
    .wr_id      = (uintptr_t)context,
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_RDMA_WRITE,
    .send_flags = IBV_SEND_SIGNALED,
    .wr.rdma    = {
        .remote_addr = (uintptr_t)remote_buffer,
        .rkey        = remote_key
    }
};

3. RDMA Atomic Operations

cpp
// Atomic compare-and-swap on a remote 64-bit word; the original remote
// value is returned into the local buffer described by the SGE
struct ibv_sge sge = {
    .addr   = (uintptr_t)local_result,   // receives the old remote value
    .length = sizeof(uint64_t),          // atomics operate on 8 bytes
    .lkey   = local_key
};
struct ibv_send_wr wr = {
    .wr_id      = (uintptr_t)context,
    .sg_list    = &sge,
    .num_sge    = 1,
    .opcode     = IBV_WR_ATOMIC_CMP_AND_SWP,
    .send_flags = IBV_SEND_SIGNALED,
    .wr.atomic  = {
        .remote_addr = (uintptr_t)remote_addr,
        .rkey        = remote_key,
        .compare_add = compare_value,    // swap happens only if remote == this
        .swap        = swap_value
    }
};

RDMA Performance

  • Latency: 0.5-2 microseconds (vs 50+ microseconds with TCP)
  • Throughput: 100+ Gbps
  • CPU Usage: Near zero CPU overhead

Solarflare OpenOnload

OpenOnload provides kernel bypass performance without requiring application code changes.

How OpenOnload Works

cpp
// Application code remains unchanged
int sock = socket(AF_INET, SOCK_STREAM, 0);
bind(sock, &addr, sizeof(addr));
listen(sock, 10);
int client = accept(sock, NULL, NULL);
recv(client, buffer, size, 0);

// OpenOnload intercepts these calls and provides kernel bypass
// No code changes required!

OpenOnload Architecture

cpp
Application (unchanged)
    ↓ (intercepted libc calls)
OpenOnload Library (user-space TCP/IP stack)
    ↓
Onload Driver
    ↓ (hardware access)
Network Hardware

Key Features:

  • Transparent Acceleration: No code changes required
  • Kernel Bypass: Eliminates syscalls and context switches
  • User-Space TCP Stack: TCP/IP protocol processing runs inside the application's address space, not in the kernel
  • Zero-Copy: Direct memory access

Performance Benefits

  • Latency: 1-5 microseconds (vs 50+ microseconds)
  • Throughput: 10-100 Gbps
  • Compatibility: Works with existing applications

Comparison of Kernel Bypass Techniques

Technique    Latency   Throughput   Code Changes  Complexity
Traditional  50+ μs    1-10 Gbps    None          Low
DPDK         1-10 μs   10-100 Gbps  Major         High
RDMA         0.5-2 μs  100+ Gbps    Major         Very High
OpenOnload   1-5 μs    10-100 Gbps  None          Low

When to Use Each Technique

DPDK

  • Use when: You need maximum performance and can rewrite networking code
  • Examples: Network appliances, packet processing, custom protocols
  • Trade-offs: High complexity, requires dedicated cores

RDMA

  • Use when: You need the lowest possible latency between two machines
  • Examples: HFT systems, distributed databases, high-performance computing
  • Trade-offs: Very high complexity, requires special hardware

OpenOnload

  • Use when: You want performance without code changes
  • Examples: Existing applications, rapid deployment, legacy systems
  • Trade-offs: Vendor lock-in, hardware requirements

Implementation Considerations

Hardware Requirements

  • Network Cards: Must support kernel bypass
  • CPU: Dedicated cores for polling
  • Memory: Huge pages for DPDK
  • NUMA: Proper memory placement

Software Requirements

  • DPDK: Custom networking stack
  • RDMA: InfiniBand or RoCE hardware
  • OpenOnload: Solarflare network cards

Development Complexity

  • DPDK: High - requires networking expertise
  • RDMA: Very High - requires deep systems knowledge
  • OpenOnload: Low - transparent acceleration

The Bottom Line

Kernel bypass techniques eliminate the operating system from the data path, achieving microsecond or even sub-microsecond latencies. The choice between DPDK, RDMA, and OpenOnload depends on your specific requirements for performance, development complexity, and hardware constraints.

Key Takeaways:

  • Kernel bypass eliminates syscalls and context switches
  • DPDK provides maximum performance with high complexity
  • RDMA provides lowest latency but requires special hardware
  • OpenOnload provides performance without code changes
  • All three techniques require compatible hardware and careful tuning

Questions

Q: What is the main goal of kernel bypass techniques?

Kernel bypass techniques eliminate syscalls and context switches by allowing applications to directly access network hardware, dramatically reducing latency.

Q: What does DPDK stand for?

DPDK stands for Data Plane Development Kit, a set of libraries and drivers for fast packet processing in user space.

Q: How does DPDK achieve high performance?

DPDK achieves high performance by polling network cards continuously instead of using interrupts, eliminating interrupt overhead and context switches.

Q: What is RDMA?

RDMA (Remote Direct Memory Access) allows one computer to access another computer's memory directly without involving the CPU or operating system.

Q: What is the main advantage of Solarflare's OpenOnload?

OpenOnload provides kernel bypass performance without requiring application code changes, making it easier to adopt than DPDK or RDMA.