Zero-copy messaging is a technique used to improve the throughput of messaging systems. It refers to a mode of operation in which data is presented to an application without first being copied into intermediate data structures.
A Networked Application
Consider an application that receives messages from a network socket. These messages will be encoded in some format suitable for transmission. The format may be something “general purpose”, such as the popular JSON format, or something more application-specific, such as an efficient binary representation. One component of the application is responsible for copying data off the network and storing it in a buffer. Another part of the application processes the buffered messages as they arrive. The buffering approach allows the application as a whole to absorb bursts in the arrival rate.
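As a minimal sketch (the class name, buffer size, and use of NIO here are illustrative assumptions, not taken from a particular implementation), the receiving component might look something like this:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public final class Receiver {
    // illustrative capacity; a real application would size this for its expected burst profile
    private final ByteBuffer inbound = ByteBuffer.allocateDirect(64 * 1024);

    // copy any available bytes off the network into the buffer;
    // returns the number of bytes read, or -1 at end-of-stream
    public int poll(final SocketChannel channel) throws IOException {
        return channel.read(inbound);
    }

    public ByteBuffer buffer() {
        return inbound;
    }
}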
When the application comes to process an inbound message, the message must be converted into a format suitable for the runtime environment in use. When using a low-level language such as C, it would be possible to simply overlay a “struct” over a chunk of memory, assuming that the data was formatted in a compatible manner.
Managed Runtimes
In an object-oriented runtime such as Java, it is necessary to convert the data from the buffer into an object on the stack or heap. This conversion requires compute cycles and the allocation of extra memory.
Recall the last issue, where we discussed memory caches and how re-using the same regions of memory could result in performance gains. In this case, the buffered data is likely to be present in a cache (since it has been recently written by the message-receiving component of the application), so access to it can be very fast. If that data must be copied from the buffer into another piece of memory on the heap, then the runtime may encounter cache misses, in addition to the extra time spent copying bytes from one area of memory to another.
So how could we use zero-copy techniques in a managed runtime environment? The answer is to use flyweight objects, which overlay structure onto raw data.
The Order Object
This is best illustrated with an example. Let’s define a type that might be in use in a trading system:
public interface Order {
    enum OrderType { LIMIT, MARKET }

    int accountId();

    long quantity();

    long price();

    OrderType orderType();
}
We can see that we need 24 bytes to store this message efficiently (if using a 32-bit int to store the enum ordinal). If the message is encoded using the following layout:
Offset    Length    Field
0         4         accountId (int)
4         8         quantity (long)
12        8         price (long)
20        4         orderType ordinal (int)
Then we can design a flyweight object that knows how to index into the data:
import org.agrona.concurrent.UnsafeBuffer;

import java.nio.ByteBuffer;

public final class OrderFlyweight implements Order {
    private static final OrderType[] ORDER_TYPES = OrderType.values();

    private static final int ACCOUNT_ID_OFFSET = 0;
    private static final int QUANTITY_OFFSET = ACCOUNT_ID_OFFSET + 4;
    private static final int PRICE_OFFSET = QUANTITY_OFFSET + 8;
    private static final int ORDER_TYPE_OFFSET = PRICE_OFFSET + 8;
    static final int LENGTH = ORDER_TYPE_OFFSET + 4;

    // UnsafeBuffer is the Agrona library's flyweight-friendly buffer abstraction
    private final UnsafeBuffer buffer = new UnsafeBuffer();
    private int bufferPosition;

    public void wrap(final ByteBuffer buffer) {
        this.buffer.wrap(buffer);
        this.bufferPosition = buffer.position();
    }

    @Override
    public int accountId() {
        return buffer.getInt(bufferPosition + ACCOUNT_ID_OFFSET);
    }

    @Override
    public long quantity() {
        return buffer.getLong(bufferPosition + QUANTITY_OFFSET);
    }

    @Override
    public long price() {
        return buffer.getLong(bufferPosition + PRICE_OFFSET);
    }

    @Override
    public OrderType orderType() {
        return ORDER_TYPES[buffer.getInt(bufferPosition + ORDER_TYPE_OFFSET)];
    }
}
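A single flyweight instance can then be re-used across messages. A sketch of its use, assuming payload is a ByteBuffer positioned at the start of an encoded order:

// one flyweight instance can serve every inbound message
final OrderFlyweight order = new OrderFlyweight();

// point the flyweight at the buffered message; no bytes are copied
order.wrap(payload);

// accessors read directly from the underlying buffer on demand
final long notional = order.quantity() * order.price();

Note that the flyweight is only valid while the underlying buffer contents remain unchanged; if a message must outlive the buffer, a copy is unavoidable.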
We can use a JMH benchmark to show the difference between using a flyweight and constructing a simple POJO for each message:
@Benchmark
public void dispatchFlyweight() {
    flyweight.wrap(payload);
    orderHandler.onOrder(flyweight);
}

@Benchmark
public void dispatchPojo() {
    final int accountId = payload.getInt(0);
    final long quantity = payload.getLong(4);
    final long price = payload.getLong(12);
    final Order.OrderType orderType = ORDER_TYPES[payload.getInt(20)];

    orderHandler.onOrder(new OrderPojo(accountId, quantity, price, orderType));
}
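The supporting state referenced above (payload, flyweight, orderHandler, ORDER_TYPES) is not shown in the listing. A sketch of how it might be wired up follows, with OrderPojo as a straightforward field-holding implementation of Order; the class names OrderHandler, OrderPojo, and ZeroCopyBenchmark are assumptions for illustration, not taken from the original benchmark source:

import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.nio.ByteBuffer;

@State(Scope.Thread)
public class ZeroCopyBenchmark {
    interface OrderHandler {
        void onOrder(Order order);
    }

    static final Order.OrderType[] ORDER_TYPES = Order.OrderType.values();

    final ByteBuffer payload = ByteBuffer.allocateDirect(OrderFlyweight.LENGTH);
    final OrderFlyweight flyweight = new OrderFlyweight();
    // a real benchmark should consume the order (e.g. via a JMH Blackhole)
    // to prevent dead-code elimination
    final OrderHandler orderHandler = order -> { };

    @Setup
    public void setup() {
        // encode a sample message using the layout described above
        payload.putInt(0, 42);                                 // accountId
        payload.putLong(4, 100L);                              // quantity
        payload.putLong(12, 995L);                             // price
        payload.putInt(20, Order.OrderType.LIMIT.ordinal());   // orderType
    }

    // the @Benchmark methods from the listing above would live here
}

// a simple immutable implementation of Order, allocated per message
final class OrderPojo implements Order {
    private final int accountId;
    private final long quantity;
    private final long price;
    private final OrderType orderType;

    OrderPojo(final int accountId, final long quantity,
              final long price, final OrderType orderType) {
        this.accountId = accountId;
        this.quantity = quantity;
        this.price = price;
        this.orderType = orderType;
    }

    @Override public int accountId() { return accountId; }
    @Override public long quantity() { return quantity; }
    @Override public long price() { return price; }
    @Override public OrderType orderType() { return orderType; }
}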
Measuring using the throughput mode of JMH, we can see that the flyweight method provides better throughput than creating a POJO for each invocation:
Benchmark                      Mode      Score     Error   Units
Benchmark.dispatchFlyweight    thrpt   105.872   ± 1.381   ops/us
Benchmark.dispatchPojo         thrpt    80.312   ± 2.257   ops/us
We have already hypothesized that the flyweight method will have better memory locality, as the data is used straight from the buffer, rather than being copied to an object on the heap or stack.
This can be seen when using JMH’s perf profiler and looking at the difference in cache usage between the two benchmark methods:
Flyweight:
      7,160,070   L1-dcache-load-misses
         76,685   LLC-load-misses

Pojo:
    991,470,241   L1-dcache-load-misses
      1,825,240   LLC-load-misses
The POJO method encounters over 100 times more L1 cache misses. This is due to objects being allocated in the thread-local allocation buffer (TLAB) by the JVM. As the TLAB is written to by the application, that memory must be pulled into the CPU’s L1 cache. In a real application, this will cause other data to be evicted from the cache to make way for the newly allocated object, possibly causing further cache misses.
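For reference, a sketch of attaching the profiler through the JMH runner API (the “perf” profiler requires Linux with the perf tool installed; the include pattern below is illustrative):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public final class BenchmarkRunner {
    public static void main(final String[] args) throws Exception {
        final Options opt = new OptionsBuilder()
                .include("ZeroCopyBenchmark")   // illustrative benchmark name
                .addProfiler("perf")            // Linux perf_events hardware counters
                .build();

        new Runner(opt).run();
    }
}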
Summary
We have seen that using flyweights to map over binary data can lead to significant throughput improvements. Does this mean we should always avoid allocation? Not at all; modern runtimes go to great lengths to make allocation cheap (the TLAB for example), or to remove allocation altogether using techniques such as escape analysis. As always, it is important to measure the impact of using something like flyweights, and to make a decision based on the needs of the application.