Zero-Copy Messaging

Zero-Copy Messaging is a technique used to improve the throughput of messaging systems. It refers to a mode of operation in which data is presented to an application without first being copied into intermediate data structures.

A Networked Application

Consider an application that receives messages from a network socket. These messages will be encoded in some format suitable for transmission. The format may be something “general purpose” such as the popular JSON format, or something more application-specific, such as an efficient binary representation. One component of the application is responsible for copying data off the network and storing it in a buffer. Another part of the application processes the buffered messages as they arrive. The buffering approach allows the application as a whole to respond to bursts in the arrival rate.
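The receiving component described above can be sketched in a few lines of Java. This is a minimal, hypothetical example (the `Receiver` class and `poll` method are not from any particular library); a `Pipe` stands in for the network socket so that the sketch is runnable without a real network:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

// A minimal sketch of the message-receiving component: one side copies
// bytes off a channel into a reusable buffer; the processing side drains
// that buffer later. A Pipe stands in for the network socket here.
public final class Receiver
{
    private final ByteBuffer inboundBuffer = ByteBuffer.allocateDirect(4096);

    // Copy whatever is available on the channel into the inbound buffer,
    // returning the number of bytes buffered for later processing.
    public int poll(final Pipe.SourceChannel channel) throws IOException
    {
        return channel.read(inboundBuffer);
    }

    public static void main(final String[] args) throws IOException
    {
        final Pipe pipe = Pipe.open();
        pipe.sink().write(ByteBuffer.wrap(new byte[] {1, 2, 3, 4}));

        final Receiver receiver = new Receiver();
        System.out.println(receiver.poll(pipe.source())); // 4 bytes buffered
    }
}
```

Because the buffer is reused across reads, the same region of memory is written to repeatedly, which (as discussed below) keeps it warm in the CPU caches.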

When the application comes to process an inbound message, the message must be converted into a format suitable for the runtime environment in use. In a low-level language such as C, it would be possible to simply overlay a “struct” onto a chunk of memory, assuming the data was formatted in a compatible layout.

Managed Runtimes

In an object-oriented runtime such as Java, it is necessary to convert the data from the buffer into an object on the stack or heap. This conversion process requires compute cycles, and the allocation of extra memory.
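A sketch of that conversion step, using hypothetical names (`PojoDecoder`, `Trade`, and the two-field layout are illustrative, not from the article's order type): every decode copies the fields out of the buffer and allocates a fresh object.

```java
import java.nio.ByteBuffer;

// Illustrates the copying conversion described above: each inbound message
// is read out of the buffer into a newly allocated object before use.
public final class PojoDecoder
{
    // A simple value object; every decode allocates a fresh instance.
    public record Trade(int accountId, long quantity) {}

    // Copy the fields out of the buffer into a heap-allocated object.
    public static Trade decode(final ByteBuffer buffer, final int offset)
    {
        final int accountId = buffer.getInt(offset);
        final long quantity = buffer.getLong(offset + Integer.BYTES);
        return new Trade(accountId, quantity); // allocation + field copy
    }
}
```

The allocation and copy are cheap individually, but at millions of messages per second they add up, and each new object occupies memory distinct from the already-cached buffer.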

Recall the last issue, where we discussed memory caches and how re-using the same regions of memory could result in performance gains. In this case, the buffered data is likely to be present in a cache (since it has been recently written to by the message receiving component of the application), so access to it can be very fast. If that data must be copied from the buffer into another piece of memory on the heap, then the runtime may encounter cache-misses, in addition to the extra time spent copying bytes from one area of memory to another.

So how could we use zero-copy techniques in a managed runtime environment? The answer is to use flyweight objects, which overlay structure onto raw data.

The Order Object

This is best illustrated with an example. Let’s define a type that might be in use in a trading system:

public interface Order
{
    enum OrderType
    {
        // illustrative values
        BID,
        ASK
    }

    int accountId();
    long quantity();
    long price();
    OrderType orderType();
}

We can see that we need 24 bytes to store this message efficiently (if using a 32-bit int to store the enum ordinal). If the message is encoded using the following layout:

    offset  0: accountId (int,  4 bytes)
    offset  4: quantity  (long, 8 bytes)
    offset 12: price     (long, 8 bytes)
    offset 20: orderType (int,  4 bytes)

Then we can design a flyweight object that knows how to index into the data:

public final class OrderFlyweight
    implements Order
{
    private static final OrderType[]
        ORDER_TYPES = OrderType.values();
    private static final int
        ACCOUNT_ID_OFFSET = 0;
    private static final int
        QUANTITY_OFFSET = 4;
    private static final int
        PRICE_OFFSET = 12;
    private static final int
        ORDER_TYPE_OFFSET = 20;
    static final int
        MESSAGE_LENGTH = 24;

    private final UnsafeBuffer
        buffer = new UnsafeBuffer();
    private int bufferPosition;

    public void wrap(final ByteBuffer byteBuffer)
    {
        buffer.wrap(byteBuffer);
        this.bufferPosition = byteBuffer.position();
    }

    public int accountId()
    {
        return buffer.getInt(
            bufferPosition + ACCOUNT_ID_OFFSET);
    }

    public long quantity()
    {
        return buffer.getLong(
            bufferPosition + QUANTITY_OFFSET);
    }

    public long price()
    {
        return buffer.getLong(
            bufferPosition + PRICE_OFFSET);
    }

    public OrderType orderType()
    {
        return ORDER_TYPES[buffer.getInt(
                bufferPosition + ORDER_TYPE_OFFSET)];
    }
}
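To see the flyweight idea end-to-end, here is a self-contained sketch built on the JDK's ByteBuffer rather than Agrona's UnsafeBuffer, so it runs without dependencies. The `OrderFlyweightDemo` class name and `encode` helper are illustrative; the offsets follow the 24-byte layout above:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Demonstrates the flyweight technique with a plain ByteBuffer: fields are
// written at fixed offsets, then read back in place with no intermediate
// object allocation and no copying.
public final class OrderFlyweightDemo
{
    static final int ACCOUNT_ID_OFFSET = 0;
    static final int QUANTITY_OFFSET = 4;
    static final int PRICE_OFFSET = 12;
    static final int ORDER_TYPE_OFFSET = 20;

    // Encode one order at the buffer's current position, as the message
    // receiver might after copying bytes off the network.
    static void encode(
        final ByteBuffer buffer, final int accountId,
        final long quantity, final long price, final int orderType)
    {
        final int position = buffer.position();
        buffer.putInt(position + ACCOUNT_ID_OFFSET, accountId);
        buffer.putLong(position + QUANTITY_OFFSET, quantity);
        buffer.putLong(position + PRICE_OFFSET, price);
        buffer.putInt(position + ORDER_TYPE_OFFSET, orderType);
    }

    public static void main(final String[] args)
    {
        final ByteBuffer buffer = ByteBuffer.allocateDirect(24)
            .order(ByteOrder.LITTLE_ENDIAN);
        encode(buffer, 42, 100L, 2_500L, 1);

        // Read fields directly out of the buffer at their offsets:
        // no intermediate object, no byte copying.
        final int position = buffer.position();
        System.out.println(buffer.getInt(position + ACCOUNT_ID_OFFSET)); // 42
        System.out.println(buffer.getLong(position + QUANTITY_OFFSET));  // 100
    }
}
```

Note that the reads use absolute (indexed) accessors, so the buffer's position is never mutated; the same buffer region can be read repeatedly or handed to multiple readers.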

We can use a JMH benchmark to show the difference in performance between using a flyweight and constructing a simple POJO:

@Benchmark
public void dispatchFlyweight(final Blackhole blackhole)
{
    // read fields in place; no allocation (reconstructed body)
    flyweight.wrap(buffer);
    blackhole.consume(flyweight);
}

@Benchmark
public void dispatchPojo(final Blackhole blackhole)
{
    // copy fields out of the buffer, then allocate (reconstructed body)
    flyweight.wrap(buffer);
    final int accountId = flyweight.accountId();
    final long quantity = flyweight.quantity();
    final long price = flyweight.price();
    final Order.OrderType orderType = flyweight.orderType();

    blackhole.consume(
        new OrderPojo(
            accountId, quantity, price, orderType));
}

Measuring using the throughput mode of JMH, we can see that the flyweight method provides better throughput than creating a POJO for each invocation:

Benchmark                     Mode    Score   Error Units
Benchmark.dispatchFlyweight  thrpt  105.872 ± 1.381 ops/us
Benchmark.dispatchPojo       thrpt   80.312 ± 2.257 ops/us

We have already hypothesized that the flyweight method will have better memory locality, as the data is used straight from the buffer, rather than being copied to an object on the heap or stack.  

This can be seen when using JMH’s perf profiler and looking at the difference in cache usage between the two benchmark methods:

dispatchFlyweight:
         7,160,070      L1-dcache-load-misses
            76,685      LLC-load-misses

dispatchPojo:
       991,470,241      L1-dcache-load-misses
         1,825,240      LLC-load-misses

The POJO method encounters over 100 times more L1 cache misses. This is due to objects being allocated in the thread-local allocation buffer (TLAB) by the JVM. As the TLAB is written to by the application, that memory must be loaded into the CPU’s L1 cache. In a real application, this will cause other data to be evicted from the cache to make way for the newly allocated object, possibly causing further cache misses.


We have seen that using flyweights to map over binary data can lead to significant throughput improvements. Does this mean we should always avoid allocation? Not at all; modern runtimes go to great lengths to make allocation cheap (the TLAB for example), or to remove allocation altogether using techniques such as escape analysis. As always, it is important to measure the impact of using something like flyweights, and to make a decision based on the needs of the application.
