June 18, 1999

Strange behavior in java.io.InputStream

The multi-byte read() method in java.io.InputStream and its subclasses has this signature:

public int read(byte[] input, int offset, int length) throws IOException

This method reads up to length bytes of data from the input stream into the array input, beginning at the index offset. Now consider this code fragment. Pay special attention to the array indexes. The in variable is an InputStream from the standard class library.

    byte[] input = new byte[5];
    
    try { // read all five bytes from data into input
      in.read(input, 0, 5);
      System.out.println("in.read(input, 0, 5) succeeded");
    }
    catch (Exception e) {
      System.out.println("in.read(input, 0, 5) failed");
    }


    try { // read into the fifth byte of input
      in.read(input, 5, 0);
      System.out.println("in.read(input, 5, 0) succeeded");
    }
    catch (Exception e) {
      System.out.println("in.read(input, 5, 0) failed");
    }

    try { // read into the sixth byte of input
      in.read(input, 6, 0);
      System.out.println("in.read(input, 6, 0) succeeded");
    }
    catch (Exception e) {
      System.out.println("in.read(input, 6, 0) failed");
    }  

There are three reads here: one read of five bytes starting at 0, one read of 0 bytes starting at 5, and one read of 0 bytes starting at 6. The input array has five bytes with indices 0 through 4. Thus both 5 and 6 are out of bounds for this array. Assuming in can provide at least five bytes of data, which of these reads succeed and which fail?

Most people's first reaction is that the first read succeeds and the second two fail with ArrayIndexOutOfBoundsExceptions. Most people's second reaction, on further reflection, is that maybe all three reads succeed because the second two don't actually read any bytes and don't need to store anything in the input array. In fact, the truth is stranger still. The first two reads succeed; the third fails. If you don't believe me, run the code and see. All multi-byte read() methods in the Sun-supplied input stream classes behave like this.
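If you'd like to run the experiment yourself, here's a minimal, self-contained sketch. It assumes a java.io.ByteArrayInputStream wrapped around five bytes stands in for in (any InputStream with at least five bytes available will do); the class and helper-method names are mine:

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;

    public class ReadBoundsDemo {

      // Attempts the given read and reports whether it threw an exception
      static String tryRead(InputStream in, byte[] buf, int offset, int length) {
        try {
          in.read(buf, offset, length);
          return "succeeded";
        }
        catch (Exception e) {
          return "failed";
        }
      }

      public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5});
        byte[] input = new byte[5];

        System.out.println("in.read(input, 0, 5) " + tryRead(in, input, 0, 5));
        System.out.println("in.read(input, 5, 0) " + tryRead(in, input, 5, 0));
        System.out.println("in.read(input, 6, 0) " + tryRead(in, input, 6, 0));
        // On the JDKs discussed here this prints:
        //   in.read(input, 0, 5) succeeded
        //   in.read(input, 5, 0) succeeded
        //   in.read(input, 6, 0) failed
      }
    }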

How this happens is not today's question. That's easy to answer with a quick peek at the source code, which looks a lot like this. (The fragment below is actually taken from Example 6-9 in Java I/O, DumpFilter, since I don't want to reveal Sun's source code and contaminate anyone doing a clean-room implementation; but all the Sun classes behave the same.)

  public int read(byte[] data, int offset, int length) throws IOException {
  
    if (data == null) {
      throw new NullPointerException();
    } 
    else if ((offset < 0) || (offset > data.length) || (length < 0) 
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();
    } 
    else if (length == 0) {
      return 0;
    }

    // check for end of stream
    int datum = this.read();
    if (datum == -1) {
      return -1;
    }
    
    data[offset] = (byte) datum;

    int bytesRead = 1;
    try {
      for (; bytesRead < length ; bytesRead++) {
      
        datum = this.read();
        
        // in case of end of stream, return as much as we've got,
        //  then wait for the next call to read to return -1
        if (datum == -1) break;
        data[offset + bytesRead] = (byte) datum;
      }
    }
    catch (IOException e) {
      // return what's already in the data array
    }
    
    return bytesRead;   
    
  }
The strange behavior is all a result of this if-else if-else construct:

    if (data == null) {
      throw new NullPointerException();
    } 
    else if ((offset < 0) || (offset > data.length) || (length < 0) 
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();
    } 
    else if (length == 0) {
      return 0;
    }
In particular, it's a result of using ((offset + length) > data.length) instead of ((offset + length) >= data.length). If >= were used instead of >, the ArrayIndexOutOfBoundsException would be thrown whenever the offset was out of bounds for the array. However, as matters stand now, it's only thrown if the offset is out of bounds by at least 2. Now, after all that setup, here's the question of the week:
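You can see the boundary concretely by evaluating the test itself for the two zero-length reads above. This sketch (class and method names are mine) reproduces the condition from the if-else if-else construct as a boolean, with data.length == 5:

    public class BoundsCheckDemo {

      // Reproduces the bounds test from the read() fragment above; returns true
      // when an ArrayIndexOutOfBoundsException would be thrown
      static boolean wouldThrow(int offset, int length, int arrayLength) {
        return (offset < 0) || (offset > arrayLength) || (length < 0)
            || ((offset + length) > arrayLength) || ((offset + length) < 0);
      }

      public static void main(String[] args) {
        // offset 5, length 0: 5 > 5 is false and 5 + 0 > 5 is false, so no exception
        System.out.println(wouldThrow(5, 0, 5)); // false
        // offset 6, length 0: 6 > 5 is true, so the exception is thrown
        System.out.println(wouldThrow(6, 0, 5)); // true
      }
    }

Index 5 is already out of bounds for a five-element array, but it slips through because both length-dependent comparisons use strict greater-than; index 6 is caught only by the offset > data.length test.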

Why is read(byte[] data, int offset, int length) implemented in this fashion? What was going through Sun's heads when they designed this? Is this a bug? An oversight? Or is there a deliberate reason to allow zero-bytes reads into the last-plus-one element of the array, but not into subsequent non-existent elements? What do you think?

A free signed copy of Java I/O goes to the best answer to this question. Thanks to David Vriend for suggesting this question.

Answer

Before trying to answer this, I'll attempt to correct a misconception that several people had, one that led to the suggestion of this question in the first place. Alberto Squassabia (and several others) noted:

Worse, there is a test for (offset<0) || (length<0) || ((offset+length)<0) that makes no sense. For (offset+length) to be less than 0, either or both must be less than 0, and the third test in the OR is redundant.

That test is actually not redundant. In an era of 38 gigabyte hard drives, 6 gigabyte movie DVD-ROMs, and terabyte databases, there is a very real possibility that while offset and length are each separately greater than 0, their sum exceeds the maximum size of an int (about 2.1 billion) and wraps around into the negative numbers. Even if that seems unlikely today, it will only become more likely as time passes. It's never a good idea to make assumptions about the maximum size of files or file systems if you can help it. I recently encountered a known bug in Dantz Retrospect 3.0 where a programmer had mistakenly assumed that no volume would ever have more than 32,000 files. I'm sure that seemed like a lot at the time.
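A quick sketch shows the wraparound in action. The particular values here are hypothetical, but both are individually positive ints whose sum exceeds Integer.MAX_VALUE:

    public class OverflowDemo {
      public static void main(String[] args) {
        int offset = 2000000000; // positive
        int length = 2000000000; // positive
        // int arithmetic wraps around Integer.MAX_VALUE (2,147,483,647),
        // so the sum of two large positive ints comes out negative
        System.out.println(offset + length);       // -294967296
        System.out.println((offset + length) < 0); // true
      }
    }

Without the ((offset + length) < 0) test, such a pair would sail past the ((offset + length) > data.length) check, since a negative sum is less than any array length.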

The best answer came from Remko Popma, who explains:

In the question it was stated that: "In particular, it's a result of using ((offset + length) > data.length) instead of ((offset + length) >= data.length). If >= were used instead of >, the ArrayIndexOutOfBoundsException would be thrown whenever the offset was out of bounds for the array."

However, if >= were used instead of >, the ArrayIndexOutOfBoundsException would be thrown whenever you tried to fill the array completely:

byte[] input = new byte[5];
in.read(input, 0, 5); // throws an ArrayIndexOutOfBoundsException

So, the question is not whether ">" or ">=" should be used, the question is why the check "else if (length == 0)" is performed last.

What would happen if we changed the order:

public int read(byte[] data, int offset, int length) throws IOException {
  if (data == null) {
    throw new NullPointerException();
  } 
  else if (length == 0) {
    return 0;    
  }
  else if ((offset < 0) || (offset > data.length) || (length < 0) 
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();
  } 

This would produce the following behaviour:

int bytesread;
byte[] input = new byte[5];
bytesread = in.read(input, 0, 5); // succeeds, bytesread = 5
bytesread = in.read(input, 5, 0); // succeeds, bytesread = 0
bytesread = in.read(input, 6, 0); // succeeds, bytesread = 0

In my opinion, this behaviour would be more consistent than having the last read throw an ArrayIndexOutOfBoundsException.

So, to rephrase the question again: Should an ArrayIndexOutOfBoundsException be thrown at all when the array is not used?

This depends on the semantics of the read(byte[] data, int offset, int length) method. A look at the API documentation gives:

(from the 1.2 java.io.InputStream apidocs)

public int read(byte[] b,
                int off,
                int len)
         throws IOException
(...)

If b is null, a NullPointerException is thrown.

If off is negative, or len is negative, or off+len is greater than the length of the array b, then an IndexOutOfBoundsException is thrown.

If len is zero, then no bytes are read and 0 is returned; otherwise, there is an attempt to read at least one byte. If no byte is available because the stream is at end of file, the value -1 is returned; otherwise, at least one byte is read and stored into b. (...)

According to these semantics, the answer to the last question is yes, an IndexOutOfBoundsException should be thrown if off+len > b.length, even when len is zero and the array is not actually used.

To answer the original question: Is this a bug? No, it works exactly as advertised. An oversight? Perhaps. I can imagine Sun didn't have the time to think about what would be the most consistent behaviour when people are reading zero bytes into the array. (And even if they did notice, imagine the time and effort it would cost to fix this code, test this class (and all subclasses) to make sure the fix didn't introduce any new bugs, and update the documentation, just to fix something that already works...)

At least it is a documented feature...

There were several other interesting responses. Erwin Moedersheim noted that not all VMs exhibit this behavior; in particular, IBM's JDK 1.1.7 for Windows handles things quite differently. John Wilson noted that the test offset > data.length is redundant and unnecessary. Complete responses and the original question are available on the question page.

Remko gets a free copy of Java I/O for his detailed answer. There'll be a new question here just as soon as I can think of a good one.

Other Responses


From: "John Wilson" <tug@wilson.co.uk>
To: <elharo@ibiblio.org>
Subject: Question of the Week: Strange behavior in java.io.InputStream
Date: Thu, 24 Jun 1999 15:30:53 +0100
Organization: The Wilson Partnership


    else if ((offset < 0) || (offset > data.length) || (length < 0)
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();


Firstly, as it stands,  there is an unnecessary test in this code:

If offset >= 0 and length >= 0 and (offset + length) >= 0  and  (offset +
length) <= data.length then offset must be  less than or equal to
data.length.

The first three tests tells us that both offset and length are zero or
greater and their sum is zero or greater (i.e. does not overflow). In this
case, if offset is greater than data.length then the sum of offset and
length must be greater than data.length.

We can rewrite the test as follows:

    else if ((offset < 0) || (length < 0) || ((offset + length) < 0)
     || ((offset + length) > data.length)) {
      throw new ArrayIndexOutOfBoundsException();


Now the reason that (offset + length) > data.length is used rather than
(offset + length) >= data.length is that you would get spurious exceptions
otherwise. Consider the case when offset == 0, length == 1 and data.length
== 1.  (0 + 1) >= 1 is true, but you don't want to throw an
ArrayIndexOutOfBoundsException.

If you wanted to throw the exception when the offset was out of bounds even
though (offset + length) <= data.length then you need to put the "redundant"
test of offset back, but you have to get it right this time;)


    else if ((offset < 0) || (offset >= data.length) || (length < 0)
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();

That is to say that you have to explicitly test that the offset is not out
of bounds when length is 0. So the Sun programmer got this test wrong, but
then so did you so he's in good company;)))

Personally, I'm happy with the current behaviour (though not happy with the
code that produces it). It allows the edge condition:

    byte[] input = new byte[0];

    try { // read no bytes from data into input
      in.read(input, 0, 0);
      System.out.println("in.read(input, 0, 0) succeeded");
    }
    catch (Exception e) {
      System.out.println("in.read(input, 0, 0) failed");
    }

which can arise in practical programs.



John Wilson
The Wilson Partnership
5 Market Hill, Whitchurch, Aylesbury, Bucks HP22 4JB, UK
+44 1296 641072, +44 976 611010(mobile), +44 1296 641874(fax)
Mailto: tug@wilson.co.uk


From: m9m@hotmail.com
To: elharo@ibiblio.org
Date: Tue, 22 Jun 1999 22:41:10 +0200
Subject: Question of the Week: Strange behavior in java.io.InputStream

Rusty,

I don't have access to the Sun source here at home, so I took a peek at the
implementation that comes with the IBM JDK 1.1.7, available separately or
integrated in VisualAge for Java, Enterprise Update. It appears to behave
differently from what you describe. This scares me because it means that
using the Sun or IBM JDKs will result in different behavior!!!

Some comments on the implementation that I have here:

1) The problem you cite is not present.

2) No ArrayIndexOutOfBoundsException is thrown if the indicated length is
< 0, while Sun appears to throw one!

3) The byte array is NEVER checked; instead the stream is read, and the Java
null and bounds checking mechanism is used to throw an
ArrayIndexOutOfBoundsException. Since the number of bytes read is not
returned when such an exception is thrown, the stream is left in an
inconsistent state!

4) java.io.IOExceptions are properly propagated when the first byte is read,
but are caught and NOT HANDLED for all subsequent bytes, which effectively
means that IOExceptions could be ignored!

I think that both implementations simply contain bugs. If you really want to
see why Java is no good for nuclear installations, take a peek at the
non-public class java.lang.FloatingDecimal: it is plain scary.

Erwin Moedersheim
Consultant Engineer, e-Business Services
IBM Global Services, Paris
From: "Anjum S. Naseer" <A.Naseer@global-communications.com>
To: <elharo@ibiblio.org>
Subject: Question of the Week: Strange behavior in java.io.InputStream
Date: Fri, 25 Jun 1999 10:33:16 +0100

I think this behaviour was designed in for efficiency reasons when reading a
byte stream in chunks. An example of this is shown below:

    do {
      ...
      bytesLeft = bytesLeft - in.read(buffer, bufferSize - bytesLeft, bytesLeft);
      ...
    } while (bytesLeft > 0);

If the read method did not allow reading into offset 'bufferSize' then the
above code would have to be modified to read as follows:

    do {
      ...
      if (bytesLeft > 0) {
        bytesLeft = bytesLeft - in.read(buffer, bufferSize - bytesLeft, bytesLeft);
      }
      ...
    } while (bytesLeft > 0);

which is less efficient. The above could have been rewritten as:

    while (bytesLeft > 0) {
      ...
      bytesLeft = bytesLeft - in.read(buffer, bufferSize - bytesLeft, bytesLeft);
      ...
    }

But there are occasions where you want to execute the loop body at least
once, and for these cases I think Sun's decision to allow the read method to
behave in the manner it does was wise.

Hope this helps,

Anjum S. Naseer
Global Communications Limited
http://www.globalcomm.co.uk
From: alsq@alsqhpw.cnd.hp.com (Alberto Squassabia)
Date: Mon, 21 Jun 1999 11:02:13 -0600
Subject: QOW: behavior of InputStream

The offset is not an index into the destination array until there is some
actual data read. The behavior suffers from muddled semantics in a corner
case. I'm going to second-guess a justification, or the lack of one, for
this behavior.

There are a few questions to answer when attempting to access a stream: Does
the stream exist? Do I really want to access it? Can I get what I want out
of it? Can I get all of it?

Tests (data==null) and (length==0) address the first two, but only in part.
It would make more sense to test for both _before_ accessing the stream. If
so, all 0-byte reads would succeed, which may be questionable when trying to
read past the end of the stream. So, (length==0) may [to support intuitive
behavior], but not necessarily must [for correctness], also add a check for
a potential out-of-bounds read.

When (length<0) is true, the semantics of "Do I really want to access it?"
are called into question again. The test for (length<0) is, in my opinion,
misplaced. This test throws an out-of-bounds exception that is inappropriate
at least some of the time. Some, however, may draw on the overpowering
consideration that most accesses will have length>0, so performance may
justify cascading the (length==0) test and shortcutting it. Too bad the
argument does not hold water, since (length<0) is tested too.

Worse, there is a test for (offset<0) || (length<0) || ((offset+length)<0)
that makes no sense. For (offset+length) to be less than 0, either or both
must be less than 0, and the third test in the OR is redundant.

In sum, when the predicates are true:

(1) (data==null) is properly handled;
(2) (length==0) -> if also (offset > data.length) may optionally throw
    out-of-bounds; if not, behavior must be documented -> else must return 0
(3) (length<0) should throw some exception
(4) (offset<0), ((offset+length)>data.length) should throw out-of-bounds

Since (offset+length)==(offset) when length is 0, it may make sense to test
for that only once (length may or may not be 0), and throw out-of-bounds.
And since I'd be throwing out-of-bounds, I may as well test for (offset<0)
at the same time. Which means I'd have to test for (length==0) afterwards.
But when I test for (offset+length), I must make sure that length is not
negative, so I need to test for (length<0) when I'm considering throwing
out-of-bounds. This logic justifies in part the present structure of the
source.

I'm unable to justify the test for (offset<0)||(length<0)||((offset+length)<0)
as explained above, for which (offset<0)||(length<0) is enough. And testing
for ((offset+length)>data.length)||(offset>data.length) is redundant, since
it is possible to make sure that length is non-negative when
(offset+length)>data.length. As written, the code is muddled, not tuned for
performance, and lacks intuitive behavior. I can provide only partial
justification for the implementation, which I believe may be due to too much
coffee and yet another late late night. The counterintuitive behavior is
probably an oversight.

--
Alberto Squassabia
(970)898-7705 alsq@cnd.hp.com HP TMD
" We are Pentium of Borg. Division is futile. You "
" will be approximated. (ca. AD 1995) "
From: "Mark Wutka" <mark@wutka.com>
To: <elharo@ibiblio.org>
Subject: Question of the week
Date: Fri, 18 Jun 1999 12:13:11 -0400

Hi, Rusty

When I first saw the code, I thought it was just an oversight. The more I
think about it, however, the more it makes sense. The way it is written, the
expression read(theArray, numReadSoFar, arrayLength-numReadSoFar) is valid
as long as you aren't trying to read more bytes than the array can hold, and
the array length is correct. It would be useful for a loop like this:

    byte b[] = new byte[30];
    int pos = 0;
    int numRead;
    while ((numRead = read(b, pos, b.length - pos)) != 0) pos += numRead;

The loop would break out in two situations. First, if you were reading from
a socket and the connection closed, read would return 0 and you'd break out
early. Second, you'd break out when you finished reading.

If you had done something like this:

    while ((pos += read(b, pos, b.length - pos)) <= b.length);

you would stop when you read the full amount without trying to read 0 bytes,
but your loop could spin indefinitely on a closed socket connection because
pos would never increment.

Mark
From: Christopher_Jones@baynetworks.com (Christopher Jones)
To: <elharo@ibiblio.org>
Subject: Question of the Week
Date: Tue, 22 Jun 1999 15:07:22 -0400

Why wasn't the comparison (offset > data.length) just written as
(offset == data.length)? Since length must be positive (or else also cause
an exception) and ((offset + length) > data.length) covers the same
condition plus some, your case would be covered, I think. I guess the same
could be said of the additional comparison ((offset + length) < 0) when
there are already comparisons for (offset < 0) and (length < 0).

Just with respect to why ((offset + length) > data.length) was used instead
of ">=", I'm just guessing that it might be an optimization in bytecode,
though that's pure conjecture.

    if (data == null) {
      throw new NullPointerException();
    }
    else if ((offset < 0) || (offset > data.length) || (length < 0)
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();
    }
    else if (length == 0) {
      return 0;
    }

Brian
cbj@gnu.org
From: Si Ly <sly@uswebcks.com>
To: elharo@ibiblio.org
Subject: Question of the Week

Elliotte,

I think ((offset + length) > data.length) is correct. Replacing > with >=
would be incorrect because then it would reject expressions like these:

    in.read(buffer, 0, buffer.length)
    in.read(buffer, 1, buffer.length-1)

The off-by-one behavior (no exception being thrown when offset ==
buffer.length) occurs only when the length parameter is 0.

Your DumpFilter example will even accept negative ints for length. In that
case, it would incorrectly return 1 (bytesRead) as long as
(offset + length) >= 0. I can't verify this behavior using Sun's
implementation.

--
Si

From: S R <gsekar@hotmail.com>
To: elharo@ibiblio.org
Subject: Question of the Week: Strange behavior in java.io.InputStream
Date: Mon, 21 Jun 1999 15:15:51 PDT

Why is read(byte[] data, int offset, int length) implemented in this
fashion? What was going through Sun's heads when they designed this? Is this
a bug? An oversight? Or is there a deliberate reason to allow zero-byte
reads into the last-plus-one element of the array, but not into subsequent
non-existent elements? What do you think?

Answer: I think this is not a bug but an improper implementation of the
logic. The key is the fact that the offset parameter is the inclusive
beginning index. So let us take an input stream with the following
characters: qwert

    byte[] data = new byte[3];

A read(data, 2, 3) would be expected to return the int 3 and the data should
contain ert. If, as suggested, the comparison were
(offset + length) >= data.length, then in this case the read would throw an
ArrayIndexOutOfBoundsException, which is incorrect.

Having said that, if I were to implement this, the logic I would use is:

    if (data == null) {
      throw new NullPointerException();
    }
    else if ((offset < 0) || (offset >= data.length) || (length < 0)
     || ((offset + length) > data.length) || ((offset + length) < 0)) {
      throw new ArrayIndexOutOfBoundsException();
    }
    else if (length == 0) {
      return 0;
    }

All I have done is modified (offset > data.length) to
(offset >= data.length). This would do the trick.

Thanks,
Sekar Ranganathan
From: Ling Siu Shian <siushian@TP.EDU.SG>
To: "'Elliotte R Harold'" <elharo@ibiblio.org>
Subject: Question of the Week: Strange behavior in java.io.InputStream
Date: Tue, 22 Jun 1999 14:51:17 +0800

Hi Elliotte,

This is my understanding:

    if (data == null) {          // making sure that the sink is valid
      throw new NullPointerException();
    }
    else if ((offset < 0) ||     // ensure index to array is valid
      (offset > data.length) ||  // this is a bug; it should have been
                                 // offset >= data.length --
                                 // it ensures index to array is valid
      (length < 0) ||            // ensure a valid request, otherwise
                                 // there is no way to deal with it
      ((offset + length) > data.length) || // this is OK; ensure the array
                                 // is able to keep data as asked
      ((offset + length) < 0)) { // this one serves no purpose,
                                 // and should not be needed
      throw new ArrayIndexOutOfBoundsException();
    }
    else if (length == 0) {      // simply verify that there is nothing to do
      return 0;
    }

My 2 cents only.

Yours,
Siu-Shian LING
Temasek Polytechnic
21 Tampines Ave 1
Singapore 529757
voice (65)7805457


Copyright 1999 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified June 26, 1999