ByteArrayOutputStream is really, really slow sometimes in JDK6

TLDR: Yesterday I mentioned on Twitter that I’d found a bad performance problem when writing to a large ByteArrayOutputStream in Java. After some digging, it appears to be the case that there’s a bad bug in JDK6 that doesn’t affect correctness, but does cause performance to nosedive when a ByteArrayOutputStream gets large. This post explains why.

Two of Impala’s server processes have both C++ and Java components (for reasons both historic and pragmatic). We often need to pass data structures from C++ to Java and vice versa, and mapping the C++ representation onto a Java one via JNI is too painful to contemplate. So instead we take advantage of the fact that Thrift is very good at generating equivalent data structures in different languages, and make every parameter to methods on the JNI boundary a serialised Thrift structure. That is, it’s a byte array that Thrift on both sides knows how to convert into a Thrift structure. So we pass byte arrays back and forth, and use Thrift to convert them to language-readable data structures. This works pretty well. (To see exactly how, start by reading frontend.cc and JniFrontend.java). We pay an extra copy or two, plus the CPU overhead of the serialisation, but the benefits in terms of usability and maintainability of the interface vastly outweigh some pretty nominal performance hits.

If the performance hit isn’t nominal, however, we have a problem. And this is what we observed earlier this week: one of the JNI methods was trying to pass a huge data structure back from Java to C++. Doing so was taking a long time – on the order of minutes. What was particularly of interest was that the performance dropped off a cliff: a data structure half the size was happily serialising in about 500ms. So we have a non-linear relationship between the size of the input and the cost of serialising it. We can’t really absorb that cost, so we had to understand the problem.

So how did we get there? Thrift’s Java serialisation implementation works by having a TSerializer object, which contains a ByteArrayOutputStream, call write() on a Thrift structure with its ByteArrayOutputStream as an argument. The Thrift structure then walks its members and writes object headers and then serialised data for each field in turn. The result is lots of small write() calls to the ByteArrayOutputStream.

The first thing was to connect a profiler (YourKit, but honestly repeated SIGHUP to get the stack trace would have worked). During the long serialisation period, almost all the time was spent inside java.util.Arrays.copyOf, inside a method to write a byte[] to a ByteArrayOutputStream. Progress was being made – the item being written to the ByteArrayOutputStream was changing – but it was taking an unreasonably long time to write each field.

A ByteArrayOutputStream is not necessarily initialised with any estimate of the ultimate size of the byte array it wraps. So it needs a mechanism to resize when more space is required. The source for ByteArrayOutputStream.write(byte[], int, int) in JDK6 shows the (very standard) strategy it uses.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
public synchronized void write(byte b[], int off, int len) {
    if ((off < 0) || (off > b.length) || (len < 0) ||
        ((off + len) > b.length) || ((off + len) < 0)) {
        throw new IndexOutOfBoundsException();
    } else if (len == 0) {
        return;
    }
    int newcount = count + len;
    if (newcount > buf.length) {
        buf = Arrays.copyOf(buf, Math.max(buf.length << 1, newcount));
    }
    System.arraycopy(b, off, buf, count, len);
    count = newcount;
}

The first six lines just deal with parameter validation; they can be ignored from here on. Lines 8-9 are interesting: we compute the new size of the array after the write completes, and then, if that size is larger than the current size, we need to do something to compensate.

Line 10 is where that compensation happens. Arrays.copyOf() creates a new array containing all the bytes from the original array, but with a larger size. The size of the new array is the maximum of twice the current length (buf.length << 1) and the requested size of the array after the write completes (this is so that a large write that more than doubles the current size of the array can be accommodated). Performing this copy is expensive, but since the size of the array should grow exponentially, frequent copies are hopefully unlikely. C++'s vector does the same thing.

After that (lines 12-13) we copy in the argument, and update the tracked number of bytes in the array.

My working hypothesis was that copyOf() was being called on every write() (since that matched up with what the profiler was telling us). The source code tells us the only way that can happen is if newcount is always larger than buf.length. This leads to two possibilities: newcount is getting large quickly, or buf.length is getting large slowly. The former seems unlikely - Thrift serialisation works by writing many small byte arrays - so to support my hypothesis, buf.length had to be growing slowly so that the copyOf() branch was being taken much more frequently than we expected.

A session with JDB (a terrible, terrible debugger) confirmed this. During the slow serialisation period, the size of the array increased on every write only by the amount required to contain the write in progress. On every write of say 2 bytes, the array size would increase by exactly those 2 bytes and a copy would be taken. The array itself was about 1GB in size, so the copy was really expensive.

This leads us to the bug. The size of the array is determined by Math.max(buf.length << 1, newcount). Ordinarily, buf.length << 1 returns double buf.length, which would always be much larger than newcount for a 2 byte write. Why was it not?

The problem is that for all integers larger than Integer.MAX_INTEGER / 2, shifting left by one place causes overflow, setting the sign bit. The result is a _negative_ integer, which is always less than newcount. So for all byte arrays larger than 1073741824 bytes (i.e. one GB), any write will cause the array to resize, and only to exactly the size required.

You could argue that this is by design for the following reason: the maximum size of any array in Java is Integer.MAX_INTEGER (minus a few bytes for preamble). Any array larger than Integer.MAX_INTEGER / 2 bytes long would become larger than that limit when doubling in size. However, the source for ByteArrayOutputStream.write() could handle this case by setting the new length to Integer.MAX_INTEGER if buf.length > Integer.MAX_INTEGER / 2 to give the array the maximum chance to grow with few copies.

The true fix is for us to cut down the size of the object we want to marshal, or to come up with some less expensive way of doing so (we could use a different TSerializer implementation, for example). Still, it's an unfortunate degradation an a fairly commonly used class, even if there are other, better ways of achieving the same thing.

Postscript

In fact, JDK7 'fixed' the issue by correctly dealing with overflow, but if the resulting doubled array-length was larger than Integer.MAX_INTEGER, an exception is thrown. You can check by running this code on both JDK6 and JDK7:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
import java.io.ByteArrayOutputStream;
 
public class TestByteArray {
 
  static byte[] chunk = new byte[1024 * 1024];
  public static void main(String[] args) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    int numChunks = 2 * 1024 * 1024;
    for (int  i = 0; i < numChunks; ++i) {
      long start = System.currentTimeMillis();
      baos.write(chunk, 0, chunk.length);
      long end = System.currentTimeMillis();
      System.out.println("Chunk " + i + " of " + numChunks + " took: "
          + ((end - start) / 1000.0) + "s, total written: " + (i * chunk.length) + " bytes");
    }
  }
}

On JDK6:

...
Chunk 1015 of 2097152 took: 0.0010s, total written: 1064304640 bytes
Chunk 1016 of 2097152 took: 0.0s, total written: 1065353216 bytes
Chunk 1017 of 2097152 took: 0.0010s, total written: 1066401792 bytes
Chunk 1018 of 2097152 took: 0.0s, total written: 1067450368 bytes
Chunk 1019 of 2097152 took: 0.0s, total written: 1068498944 bytes
Chunk 1020 of 2097152 took: 0.0010s, total written: 1069547520 bytes
Chunk 1021 of 2097152 took: 0.0s, total written: 1070596096 bytes
Chunk 1022 of 2097152 took: 0.0s, total written: 1071644672 bytes
Chunk 1023 of 2097152 took: 0.0010s, total written: 1072693248 bytes
Chunk 1024 of 2097152 took: 1.163s, total written: 1073741824 bytes <-- >1s per write!
Chunk 1025 of 2097152 took: 0.979s, total written: 1074790400 bytes
Chunk 1026 of 2097152 took: 0.948s, total written: 1075838976 bytes
Chunk 1027 of 2097152 took: 1.053s, total written: 1076887552 bytes
Chunk 1028 of 2097152 took: 1.033s, total written: 1077936128 bytes
Chunk 1029 of 2097152 took: 1.123s, total written: 1078984704 bytes
Chunk 1030 of 2097152 took: 0.723s, total written: 1080033280 bytes
Chunk 1031 of 2097152 took: 0.603s, total written: 1081081856 bytes
...

On JDK7:

...
Chunk 1015 of 2097152 took: 0.0s, total written: 1064304640 bytes
Chunk 1016 of 2097152 took: 0.0s, total written: 1065353216 bytes
Chunk 1017 of 2097152 took: 0.0s, total written: 1066401792 bytes
Chunk 1018 of 2097152 took: 0.0s, total written: 1067450368 bytes
Chunk 1019 of 2097152 took: 0.0s, total written: 1068498944 bytes
Chunk 1020 of 2097152 took: 0.0s, total written: 1069547520 bytes
Chunk 1021 of 2097152 took: 0.001s, total written: 1070596096 bytes
Chunk 1022 of 2097152 took: 0.0s, total written: 1071644672 bytes
Chunk 1023 of 2097152 took: 0.001s, total written: 1072693248 bytes
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at TestByteArray.main(TestByteArray.java:11)