June 18, 1999

Strange behavior in java.io.InputStream

The multi-byte read() method in java.io.InputStream and its subclasses has this signature:

public int read(byte[] input, int offset, int length) throws IOException

This method reads up to length bytes of data from the input stream into the array input, beginning at the index offset. Now consider this code fragment. Pay special attention to the array indexes. The in variable is an InputStream from the standard class library.

    byte[] input = new byte[5];
    
    try { // read all five bytes from data into input
      in.read(input, 0, 5);
      System.out.println("in.read(input, 0, 5) succeeded");
    }
    catch (Exception e) {
      System.out.println("in.read(input, 0, 5) failed");
    }


    try { // read into the fifth byte of input
      in.read(input, 5, 0);
      System.out.println("in.read(input, 5, 0) succeeded");
    }
    catch (Exception e) {
      System.out.println("in.read(input, 5, 0) failed");
    }

    try { // read into the sixth byte of input
      in.read(input, 6, 0);
      System.out.println("in.read(input, 6, 0) succeeded");
    }
    catch (Exception e) {
      System.out.println("in.read(input, 6, 0) failed");
    }  

There are three reads here: one read of five bytes starting at 0, one read of 0 bytes starting at 5, and one read of 0 bytes starting at 6. The input array has five bytes with indices 0 through 4. Thus both 5 and 6 are out of bounds for this array. Assuming in can provide at least five bytes of data, which of these reads succeed and which fail?

Most people's first reaction is that the first read succeeds and the second two fail with ArrayIndexOutOfBoundsExceptions. Most people's second reaction, on further reflection, is that maybe all three reads succeed because the second two don't actually read any bytes and don't need to store anything in the input array. In fact, the truth is stranger still. The first two reads succeed; the third fails. If you don't believe me, run the code and see. All multi-byte read() methods in the Sun-supplied input stream classes behave like this.
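If you'd like to run the experiment yourself, here's a minimal, self-contained sketch. It assumes a java.io.ByteArrayInputStream wrapped around five bytes stands in for in (any InputStream with at least five bytes available will do); the class and helper-method names are mine:

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;

    public class ReadBoundsDemo {

      // Attempts the given read and reports whether it threw an exception
      static String tryRead(InputStream in, byte[] buf, int offset, int length) {
        try {
          in.read(buf, offset, length);
          return "succeeded";
        }
        catch (Exception e) {
          return "failed";
        }
      }

      public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5});
        byte[] input = new byte[5];

        System.out.println("in.read(input, 0, 5) " + tryRead(in, input, 0, 5));
        System.out.println("in.read(input, 5, 0) " + tryRead(in, input, 5, 0));
        System.out.println("in.read(input, 6, 0) " + tryRead(in, input, 6, 0));
        // On the JDKs discussed here this prints:
        //   in.read(input, 0, 5) succeeded
        //   in.read(input, 5, 0) succeeded
        //   in.read(input, 6, 0) failed
      }
    }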

How this happens is not today's question. That's easy to answer with a quick peek at the source code, which looks a lot like this. (The fragment below is actually taken from Example 6-9 in Java I/O, DumpFilter, since I don't want to reveal Sun's source code and contaminate anyone doing a clean-room implementation; but all the Sun classes behave the same.)

  public int read(byte[] data, int offset, int length) throws IOException {
  
    if (data == null) {
      throw new NullPointerException();
    } 
    else if ((offset < 0) || (offset > data.length) || (length < 0) 
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();
    } 
    else if (length == 0) {
      return 0;
    }

    // check for end of stream
    int datum = this.read();
    if (datum == -1) {
      return -1;
    }
    
    data[offset] = (byte) datum;

    int bytesRead = 1;
    try {
      for (; bytesRead < length ; bytesRead++) {
      
        datum = this.read();
        
        // in case of end of stream, return as much as we've got,
        //  then wait for the next call to read to return -1
        if (datum == -1) break;
        data[offset + bytesRead] = (byte) datum;
      }
    }
    catch (IOException e) {
      // return what's already in the data array
    }
    
    return bytesRead;   
    
  }
The strange behavior is all a result of this if-else if-else construct:

    if (data == null) {
      throw new NullPointerException();
    } 
    else if ((offset < 0) || (offset > data.length) || (length < 0) 
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();
    } 
    else if (length == 0) {
      return 0;
    }
In particular, it's a result of using ((offset + length) > data.length) instead of ((offset + length) >= data.length). If >= were used instead of >, the ArrayIndexOutOfBoundsException would be thrown whenever the offset was out of bounds for the array. However, as matters stand now, it's only thrown if the offset is out of bounds by at least 2. Now, after all that setup, here's the question of the week:
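You can see the boundary concretely by evaluating the test itself for the two zero-length reads above. This sketch (class and method names are mine) reproduces the condition from the if-else if-else construct as a boolean, with data.length == 5:

    public class BoundsCheckDemo {

      // Reproduces the bounds test from the read() fragment above; returns true
      // when an ArrayIndexOutOfBoundsException would be thrown
      static boolean wouldThrow(int offset, int length, int arrayLength) {
        return (offset < 0) || (offset > arrayLength) || (length < 0)
            || ((offset + length) > arrayLength) || ((offset + length) < 0);
      }

      public static void main(String[] args) {
        // offset 5, length 0: 5 > 5 is false and 5 + 0 > 5 is false, so no exception
        System.out.println(wouldThrow(5, 0, 5)); // false
        // offset 6, length 0: 6 > 5 is true, so the exception is thrown
        System.out.println(wouldThrow(6, 0, 5)); // true
      }
    }

Index 5 is already out of bounds for a five-element array, but it slips through because both length-dependent comparisons use strict greater-than; index 6 is caught only by the offset > data.length test.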

Why is read(byte[] data, int offset, int length) implemented in this fashion? What was going through Sun's heads when they designed this? Is this a bug? An oversight? Or is there a deliberate reason to allow zero-bytes reads into the last-plus-one element of the array, but not into subsequent non-existent elements? What do you think?

A free signed copy of Java I/O goes to the best answer to this question. Thanks to David Vriend for suggesting this question.

Answer

Before trying to answer this, I'll attempt to correct a misconception that several people had, one that led to the suggestion of this question in the first place. Alberto Squassabia (and several others) noted:

Worse, there is a test for (offset<0) || (length<0) || ((offset+length)<0) that makes no sense. For (offset+length) to be less than 0, either or both must be less than 0, and the third test in the OR is redundant.

That test is actually not redundant. In an era of 38 gigabyte hard drives, 6 gigabyte movie DVD-ROMs, and terabyte databases, there is a very real possibility that while offset and length are each separately greater than 0, their sum exceeds the maximum size of an int (about 2.1 billion) and wraps around into the negative numbers. Even if that seems unlikely today, it will only become more likely as time passes. It's never a good idea to make assumptions about the maximum size of files or file systems if you can help it. I recently encountered a known bug in Dantz Retrospect 3.0 where a programmer had mistakenly assumed that no volume would ever have more than 32,000 files. I'm sure that seemed like a lot at the time.
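A quick sketch shows the wraparound in action. The particular values here are hypothetical, but both are individually positive ints whose sum exceeds Integer.MAX_VALUE:

    public class OverflowDemo {
      public static void main(String[] args) {
        int offset = 2000000000; // positive
        int length = 2000000000; // positive
        // int arithmetic wraps around Integer.MAX_VALUE (2,147,483,647),
        // so the sum of two large positive ints comes out negative
        System.out.println(offset + length);       // -294967296
        System.out.println((offset + length) < 0); // true
      }
    }

Without the ((offset + length) < 0) test, such a pair would sail past the ((offset + length) > data.length) check, since a negative sum is less than any array length.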

The best answer came from Remko Popma, who explains:

In the question it was stated that: "In particular, it's a result of using ((offset + length) > data.length) instead of ((offset + length) >= data.length). If >= were used instead of >, the ArrayIndexOutOfBoundsException would be thrown whenever the offset was out of bounds for the array."

However, if >= were used instead of >, the ArrayIndexOutOfBoundsException would be thrown whenever you tried to fill the array completely:

byte[] input = new byte[5];
in.read(input, 0, 5); // throws an ArrayIndexOutOfBoundsException

So, the question is not whether ">" or ">=" should be used, the question is why the check "else if (length == 0)" is performed last.

What would happen if we changed the order:

public int read(byte[] data, int offset, int length) throws IOException {
  if (data == null) {
    throw new NullPointerException();
  } 
  else if (length == 0) {
    return 0;    
  }
  else if ((offset < 0) || (offset > data.length) || (length < 0) 
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();
  } 

This would produce the following behaviour:

int bytesread;
byte[] input = new byte[5];
bytesread = in.read(input, 0, 5); // succeeds, bytesread = 5
bytesread = in.read(input, 5, 0); // succeeds, bytesread = 0
bytesread = in.read(input, 6, 0); // succeeds, bytesread = 0

In my opinion, this behaviour would be more consistent than having the last read throw an ArrayIndexOutOfBoundsException.

So, to rephrase the question again: Should an ArrayIndexOutOfBoundsException be thrown at all when the array is not used?

This depends on the semantics of the read(byte[] data, int offset, int length) method. A look at the API documentation gives:

(from the 1.2 java.io.InputStream apidocs)

public int read(byte[] b,
                int off,
                int len)
         throws IOException
(...)

If b is null, a NullPointerException is thrown.

If off is negative, or len is negative, or off+len is greater than the length of the array b, then an IndexOutOfBoundsException is thrown.

If len is zero, then no bytes are read and 0 is returned; otherwise, there is an attempt to read at least one byte. If no byte is available because the stream is at end of file, the value -1 is returned; otherwise, at least one byte is read and stored into b. (...)

According to these semantics, the answer to the last question is yes, an IndexOutOfBoundsException should be thrown if off+len > b.length, even when len is zero and the array is not actually used.

To answer the original question: Is this a bug? No, it works exactly as advertised. An oversight? Perhaps. I can imagine Sun didn't have the time to think about what would be the most consistent behaviour when people are reading zero bytes into the array. (And even if they did notice, imagine the time and effort it would cost to fix this code, test this class (and all subclasses) to make sure the fix didn't introduce any new bugs, and update the documentation, just to fix something that already works...)

At least it is a documented feature...

There were several other interesting responses. Erwin Moedersheim noted that not all VMs exhibit this behavior; in particular, IBM's JDK 1.1.7 for Windows handles things quite differently. John Wilson noted that the test offset > data.length is redundant and unnecessary. Complete responses and the original question are available on the question page.

Remko gets a free copy of Java I/O for his detailed answer. There'll be a new question here just as soon as I can think of a good one.

Other Responses


From: "John Wilson" <tug@wilson.co.uk>
To: <elharo@ibiblio.org>
Subject: Question of the Week: Strange behavior in java.io.InputStream
Date: Thu, 24 Jun 1999 15:30:53 +0100
Organization: The Wilson Partnership


    else if ((offset < 0) || (offset > data.length) || (length < 0)
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();


Firstly, as it stands,  there is an unnecessary test in this code:

If offset >= 0 and length >= 0 and (offset + length) >= 0  and  (offset +
length) <= data.length then offset must be  less than or equal to
data.length.

The first three tests tells us that both offset and length are zero or
greater and their sum is zero or greater (i.e. does not overflow). In this
case, if offset is greater than data.length then the sum of offset and
length must be greater than data.length.

We can rewrite the test as follows:

    else if ((offset < 0) || (length < 0) || ((offset + length) < 0)
     || ((offset + length) > data.length)) {
      throw new ArrayIndexOutOfBoundsException();


Now the reason that (offset + length) > data.length is used rather than
(offset + length) >= data.length is that you would get spurious exceptions
otherwise. Consider the case when offset == 0, length == 1 and data.length
== 1.  (0 + 1) >= 1 is true, but you don't want to throw an
ArrayIndexOutOfBoundsException.

If you wanted to throw the exception when the offset was out of bounds even
though (offset + length) <= data.length then you need to put the "redundant"
test of offset back, but you have to get it right this time;)


    else if ((offset < 0) || (offset >= data.length) || (length < 0)
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();

That is to say that you have to explicitly test that the offset is not out
of bounds when length is 0. So the Sun programmer got this test wrong, but
then so did you so he's in good company;)))

Personally, I'm happy with the current behaviour (though not happy with the
code that produces it). It allows the edge condition:

    byte[] input = new byte[0];

    try { // read no bytes from data into input
      in.read(input, 0, 0);
      System.out.println("in.read(input, 0, 0) succeeded");
    }
    catch (Exception e) {
      System.out.println("in.read(input, 0, 0) failed");
    }

which can arise in practical programs.



John Wilson
The Wilson Partnership
5 Market Hill, Whitchurch, Aylesbury, Bucks HP22 4JB, UK
+44 1296 641072, +44 976 611010(mobile), +44 1296 641874(fax)
Mailto: tug@wilson.co.uk


From: m9m@hotmail.com
To: elharo@ibiblio.org
Date: Tue, 22 Jun 1999 22:41:10 +0200
Subject: Question of the Week: Strange behavior in java.io.InputStream

Rusty,

I don't have access to the Sun source here at home, so I took a peek at the
implementation that comes with the IBM JDK 1.1.7, available separately or
integrated in VisualAge for Java, Enterprise Update. It appears to behave
differently from what you describe. This scares me because it means that
using the Sun or IBM JDKs will result in different behavior!!!

Some comments on the implementation that I have here:

1) The problem you cite is not present.

2) No ArrayIndexOutOfBoundsException is thrown if the indicated length is
< 0, while Sun appears to throw one!

3) The byte array is NEVER checked; instead the stream is read, and the Java
null and bounds checking mechanism is used to throw an
ArrayIndexOutOfBoundsException. Since the number of bytes read is not
returned when such an exception is thrown, the stream is left in an
inconsistent state!

4) java.io.IOExceptions are properly propagated when the first byte is read,
but are caught and NOT HANDLED for all subsequent bytes, which effectively
means that IOExceptions could be ignored!

I think that both implementations simply contain bugs. If you really want to
see why Java is no good for nuclear installations, take a peek at the
non-public class java.lang.FloatingDecimal: it is plain scary.

Erwin Moedersheim
Consultant Engineer, e-Business Services
IBM Global Services, Paris
From: "Anjum S. Naseer" <A.Naseer@global-communications.com>
To: <elharo@ibiblio.org>
Subject: Question of the Week: Strange behavior in java.io.InputStream
Date: Fri, 25 Jun 1999 10:33:16 +0100

I think this behaviour was designed in for efficiency reasons when reading a
byte stream in chunks. An example of this is shown below:

    do {
      ...
      bytesLeft = bytesLeft - in.read(buffer, bufferSize - bytesLeft, bytesLeft);
      ...
    } while (bytesLeft > 0);

If the read method did not allow reading into offset 'bufferSize' then the
above code would have to be modified to read as follows:

    do {
      ...
      if (bytesLeft > 0) {
        bytesLeft = bytesLeft - in.read(buffer, bufferSize - bytesLeft, bytesLeft);
      }
      ...
    } while (bytesLeft > 0);

which is less efficient. The above could have been rewritten as:

    while (bytesLeft > 0) {
      ...
      bytesLeft = bytesLeft - in.read(buffer, bufferSize - bytesLeft, bytesLeft);
      ...
    }

But there are occasions where you want to execute the loop body at least
once, and for these cases I think Sun's decision to allow the read method to
behave in the manner it does was wise.

Hope this helps,

Anjum S. Naseer
Global Communications Limited
http://www.globalcomm.co.uk
From: alsq@alsqhpw.cnd.hp.com (Alberto Squassabia)
Date: Mon, 21 Jun 1999 11:02:13 -0600
Subject: QOW: behavior of InputStream

The offset is not an index into the destination array until there is some
actual data read. The behavior suffers from muddled semantics in a corner
case. I'm going to second-guess a justification, or the lack of one, for
this behavior.

There are a few questions to answer when attempting to access a stream: Does
the stream exist? Do I really want to access it? Can I get what I want out
of it? Can I get all of it?

Tests (data==null) and (length==0) address the first two, but only in part.
It would make more sense to test for both _before_ accessing the stream. If
so, all 0-byte reads would succeed, which may be questionable when trying to
read past the end of the stream. So, (length==0) may [to support intuitive
behavior], but not necessarily must [for correctness], also add a check for
a potential out-of-bounds read.

When (length<0) is true, the semantics of "Do I really want to access it?"
are called into question again. The test for (length<0) is, in my opinion,
misplaced. This test throws an out-of-bounds exception that is inappropriate
at least some of the time. Some, however, may draw on the overpowering
consideration that most accesses will have length>0, so performance may
justify cascading the (length==0) test and shortcutting it. Too bad the
argument does not hold water, since (length<0) is tested too.

Worse, there is a test for (offset<0) || (length<0) || ((offset+length)<0)
that makes no sense. For (offset+length) to be less than 0, either or both
must be less than 0, and the third test in the OR is redundant.

In sum, when the predicates are true:

(1) (data==null) is properly handled;
(2) (length==0) -> if also (offset > data.length) may optionally throw
    out-of-bounds; if not, behavior must be documented -> else must return 0
(3) (length<0) should throw some exception
(4) (offset<0), ((offset+length)>data.length) should throw out-of-bounds

Since (offset+length)==(offset) when length is 0, it may make sense to test
for that only once (length may or may not be 0), and throw out-of-bounds.
And since I'd be throwing out-of-bounds, I may as well test for (offset<0)
at the same time. Which means I'd have to test for (length==0) afterwards.
But when I test for (offset+length), I must make sure that length is not
negative, so I need to test for (length<0) when I'm considering throwing
out-of-bounds. This logic justifies in part the present structure of the
source.

I'm unable to justify the test for (offset<0)||(length<0)||((offset+length)<0)
as explained above, for which (offset<0)||(length<0) is enough. And testing
for ((offset+length)>data.length)||(offset>data.length) is redundant, since
it is possible to make sure that length is non-negative when
(offset+length)>data.length. As written, the code is muddled, not tuned for
performance, and lacks intuitive behavior. I can provide only partial
justification for the implementation, which I believe may be due to too much
coffee and yet another late late night. The counterintuitive behavior is
probably an oversight.

--
Alberto Squassabia
(970)898-7705 alsq@cnd.hp.com HP TMD
" We are Pentium of Borg. Division is futile. You "
" will be approximated. (ca. AD 1995) "
From: "Mark Wutka" <mark@wutka.com>
To: <elharo@ibiblio.org>
Subject: Question of the week
Date: Fri, 18 Jun 1999 12:13:11 -0400

Hi, Rusty

When I first saw the code, I thought it was just an oversight. The more I
think about it, however, the more it makes sense. The way it is written, the
expression read(theArray, numReadSoFar, arrayLength-numReadSoFar) is valid
as long as you aren't trying to read more bytes than the array can hold, and
the array length is correct. It would be useful for a loop like this:

    byte b[] = new byte[30];
    int pos = 0;
    int numRead;
    while ((numRead = read(b, pos, b.length - pos)) != 0) pos += numRead;

The loop would break out in two situations. First, if you were reading from
a socket and the connection closed, read would return 0 and you'd break out
early. Second, you'd break out when you finished reading.

If you had done something like this:

    while ((pos += read(b, pos, b.length - pos)) <= b.length);

you would stop when you read the full amount without trying to read 0 bytes,
but your loop could spin indefinitely on a closed socket connection because
pos would never increment.

Mark
From: Christopher_Jones@baynetworks.com (Christopher Jones)
To: <elharo@ibiblio.org>
Subject: Question of the Week
Date: Tue, 22 Jun 1999 15:07:22 -0400

Why wasn't the comparison (offset > data.length) just written as
(offset == data.length)? Since length must be positive (or else also cause
an exception) and ((offset + length) > data.length) covers the same
condition plus some, your case would be covered, I think. I guess the same
could be said of the additional comparison ((offset + length) < 0) when
there are already comparisons for (offset < 0) and (length < 0).

Just with respect to why ((offset + length) > data.length) was used instead
of ">=", I'm just guessing that it might be an optimization in bytecode,
though that's pure conjecture.

    if (data == null) {
      throw new NullPointerException();
    }
    else if ((offset < 0) || (offset > data.length) || (length < 0)
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();
    }
    else if (length == 0) {
      return 0;
    }

Brian
cbj@gnu.org
From: Si Ly <sly@uswebcks.com>
To: elharo@ibiblio.org
Subject: Question of the Week

Elliotte,

I think ((offset + length) > data.length) is correct. Replacing > with >=
would be incorrect because then it would reject expressions like these:

    in.read(buffer, 0, buffer.length)
    in.read(buffer, 1, buffer.length-1)

The off-by-one behavior (no exception being thrown when offset ==
buffer.length) occurs only when the length parameter is 0.

Your DumpFilter example will even accept negative ints for length. In that
case, it would incorrectly return 1 (bytesRead) as long as
(offset + length) >= 0. I can't verify this behavior using Sun's
implementation.

--
Si

From: S R <gsekar@hotmail.com>
To: elharo@ibiblio.org
Subject: Question of the Week: Strange behavior in java.io.InputStream
Date: Mon, 21 Jun 1999 15:15:51 PDT

Why is read(byte[] data, int offset, int length) implemented in this
fashion? What was going through Sun's heads when they designed this? Is this
a bug? An oversight? Or is there a deliberate reason to allow zero-byte
reads into the last-plus-one element of the array, but not into subsequent
non-existent elements? What do you think?

Answer: I think this is not a bug but an improper implementation of the
logic. The key is the fact that the offset parameter is the inclusive
beginning index. So let us take an input stream with the following
characters: qwert

    byte[] data = new byte[3];

A read(data, 2, 3) would be expected to return the int 3 and the data should
contain ert. If, as suggested, the comparison were
(offset + length) >= data.length, then in this case the read would throw an
ArrayIndexOutOfBoundsException, which is incorrect.

Having said that, if I were to implement this, the logic I would use is:

    if (data == null) {
      throw new NullPointerException();
    }
    else if ((offset < 0) || (offset >= data.length) || (length < 0)
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();
    }
    else if (length == 0) {
      return 0;
    }

All I have done is modified (offset > data.length) to
(offset >= data.length). This would do the trick.

Thanks,
Sekar Ranganathan
From: Ling Siu Shian <siushian@TP.EDU.SG>
To: "'Elliotte R Harold'" <elharo@ibiblio.org>
Subject: Question of the Week: Strange behavior in java.io.InputStream
Date: Tue, 22 Jun 1999 14:51:17 +0800

Hi Elliotte,

This is my understanding:

    if (data == null) {          // making sure that the sink is valid
      throw new NullPointerException();
    }
    else if ((offset < 0) ||     // ensure index to array is valid
      (offset > data.length) ||  // this is a bug; it should have been
                                 // offset >= data.length --
                                 // it ensures index to array is valid
      (length < 0) ||            // ensure a valid request, otherwise
                                 // there is no way to deal with it
      ((offset + length) > data.length) || // this is OK; ensure the array
                                 // is able to keep data as asked
      ((offset + length) < 0)) { // this one serves no purpose,
                                 // and should not be needed
      throw new ArrayIndexOutOfBoundsException();
    }
    else if (length == 0) {      // simply verify that there is nothing to do
      return 0;
    }

My 2 cents only.

Yours,
Siu-Shian LING
Temasek Polytechnic
21 Tampines Ave 1
Singapore 529757
voice (65)7805457


Copyright 1999 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified June 26, 1999