diff mbox

DM Regression in 4.16-rc1 - read() returns data when it shouldn't

Message ID 87y3jvp13k.fsf@notabene.neil.brown.name (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

NeilBrown Feb. 15, 2018, 12:07 a.m. UTC
On Wed, Feb 14 2018, Mike Snitzer wrote:

> On Wed, Feb 14 2018 at  3:39pm -0500,
> NeilBrown <neilb@suse.com> wrote:
>
>> On Wed, Feb 14 2018, Milan Broz wrote:
>> 
>> > Hi,
>> >
>> > the commit (found by bisect)
>> >
>> >   commit 18a25da84354c6bb655320de6072c00eda6eb602
>> >   Author: NeilBrown <neilb@suse.com>
>> >   Date:   Wed Sep 6 09:43:28 2017 +1000
>> >
>> >     dm: ensure bio submission follows a depth-first tree walk
>> >
>> > cause serious regression while reading from DM device.
>> >
>> > The reproducer is below, basically it tries to simulate failure we see in cryptsetup
>> > regression test: we have DM device with error and zero target and try to read
>> > "passphrase" from it (it is test for 64 bit offset error path):
>> >
>> > Test device:
>> > # dmsetup table test
>> > 0 10000000 error 
>> > 10000000 1000000 zero 
>> >
>> > We try to run this operation:
>> >   lseek64(fd, 5119999988, SEEK_CUR); // this should seek to error target sector
>> >   read(fd, buf, 13); // this should fail, if we seek to error part of the device
>> >
>> > While on 4.15 the read properly fails:
>> >   Seek returned 5119999988.
>> >   Read returned -1.
>> >
>> > for 4.16 it actually succeeds returning some random data
>> > (perhaps kernel memory, so this bug is even more dangerous):
>> >   Seek returned 5119999988.
>> >   Read returned 13.
>> >
>> > Full reproducer below:
>> >
>> > #define _GNU_SOURCE
>> > #define _LARGEFILE64_SOURCE
>> > #include <stdio.h>
>> > #include <stddef.h>
>> > #include <stdint.h>
>> > #include <stdlib.h>
>> > #include <unistd.h>
>> > #include <fcntl.h>
>> > #include <inttypes.h>
>> >
>> > int main (int argc, char *argv[])
>> > {
>> >         char buf[13];
>> >         int fd;
>> >         //uint64_t offset64 = 5119999999;
>> >         uint64_t offset64 =   5119999988;
>> >         off64_t r;
>> >         ssize_t bytes;
>> >
>> >         system("echo -e \'0 10000000 error\'\\\\n\'10000000 1000000 zero\' | dmsetup create test");
>> >
>> >         fd = open("/dev/mapper/test", O_RDONLY);
>> >         if (fd == -1) {
>> >                 printf("open fail\n");
>> >                 return 1;
>> >         }
>> >
>> >         r = lseek64(fd, offset64, SEEK_CUR);
>> >         printf("Seek returned %" PRIu64 ".\n", r);
>> >         if (r < 0) {
>> >                 printf("seek fail\n");
>> >                 close(fd);
>> >                 return 2;
>> >         }
>> >
>> >         bytes = read(fd, buf, 13);
>> >         printf("Read returned %d.\n", (int)bytes);
>> >
>> >         close(fd);
>> >         return 0;
>> > }
>> >
>> >
>> > Please let me know if you need more info to reproduce it.
>> 
>> Thanks for the detailed report.  I haven't tried to reproduce, but the
>> code looks very weird.
>> The patch I posted "Date: Wed, 06 Sep 2017 09:43:28 +1000" had:
>>       +				struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
>>       +								 md->queue->bio_split);
>>       +				bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
>>       +				bio_chain(b, bio);
>>       +				generic_make_request(b);
>>       +				break;
>> 
>> The code in Linux has:
>> 
>> 				struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
>> 								 md->queue->bio_split);
>> 				ci.io->orig_bio = b;
>> 				bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
>> 				bio_chain(b, bio);
>> 				ret = generic_make_request(bio);
>> 				break;
>> 
>> So the wrong bio is sent to generic_make_request().
>> Mike: do you remember how that change happened?  I think there were
>> discussions following the patch, but I cannot find anything about making
>> that change.
>
> Mikulas had this feedback:
> https://www.redhat.com/archives/dm-devel/2017-November/msg00159.html
>
> You replied with:
> https://www.redhat.com/archives/dm-devel/2017-November/msg00165.html
>
> Where you said "Yes, you are right something like that would be better."
>
> And you then provided a follow-up patch:
> https://www.redhat.com/archives/dm-devel/2017-November/msg00175.html
>
> That we discussed and I said I'd just fold it into the original, and you
> agreed and thanked me for checking with you ;)
> https://www.redhat.com/archives/dm-devel/2017-November/msg00208.html
>
> Anyway, I'm just about to switch to Daddy Daycare mode (need to get my
> daughter up from her nap, feed her dinner, etc) so: I'll circle back to
> this tomorrow morning.
>

Thanks for the reminder - it's coming back to me now.

And looking again at the code, it doesn't seem so bad.  I was worried
about reassigning ci.io->orig_bio after calling
__split_and_process_non_flush(), but the important thing is to set it
before the last dec_pending() call, and the code gets that right.

I think I've found the problem though:

			/* done with normal IO or empty flush */
			bio->bi_status = io_error;

When we have chained bios, there are multiple sources for error which
need to be merged into bi_status:  all bios that were chained to this
one, and this bio itself.

__bio_chain_endio uses:

	if (!parent->bi_status)
		parent->bi_status = bio->bi_status;

so the new status won't over-ride the old status (unless there are
races....), but dm doesn't do that - I guess it didn't have to worry
about chains before.

So this:


to remove the chance that a race will hide a real error.

Milan: could you please check that the patch fixes the dm-crypt bug?
I've confirmed that it fixes your test case.
When you confirm, I'll send a proper patch.

I guess I should audit all code that assigns to ->bi_status and make
sure nothing ever assigns 0 to it.

Thanks,
NeilBrown
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Milan Broz Feb. 15, 2018, 7:37 a.m. UTC | #1
On 02/15/2018 01:07 AM, NeilBrown wrote:
> 
> And looking again at the code, it doesn't seem so bad.  I was worried
> about reassigning ci.io->orig_bio after calling
> __split_and_process_non_flush(), but the important thing is to set it
> before the last dec_pending() call, and the code gets that right.
> 
> I think I've found the problem though:
> 
> 			/* done with normal IO or empty flush */
> 			bio->bi_status = io_error;
> 
> When we have chained bios, there are multiple sources for error which
> need to be merged into bi_status:  all bios that were chained to this
> one, and this bio itself.
> 
> __bio_chain_endio uses:
> 
> 	if (!parent->bi_status)
> 		parent->bi_status = bio->bi_status;
> 
> so the new status won't over-ride the old status (unless there are
> races....), but dm doesn't do that - I guess it didn't have to worry
> about chains before.
> 
> So this:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index d6de00f367ef..68136806d365 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -903,7 +903,8 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
>  			queue_io(md, bio);
>  		} else {
>  			/* done with normal IO or empty flush */
> -			bio->bi_status = io_error;
> +			if (io_error)
> +				bio->bi_status = io_error;
>  			bio_endio(bio);
>  		}
>  	}
> 

Hi,

well, it indeed fixes the reported problem, thanks.
But I just tested the first (dm.c) change above.

The important part is also that if the error is hits bio later, it will produce
short read (and not fail of the whole operation), this can be easily tested if we switch
the error and the zero target in that reproducer
  //system("echo -e \'0 10000000 error\'\\\\n\'10000000 1000000 zero\' | dmsetup create test");
  system("echo -e \'0 10000000 zero\'\\\\n\'10000000 1000000 error\' | dmsetup create test");

So it seems your patch works.

I am not sure about this part though:

> should fix it.  And __bio_chain_endio should probably be
> 
> diff --git a/block/bio.c b/block/bio.c
> index e1708db48258..ad77140edc6f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
>  {
>  	struct bio *parent = bio->bi_private;
>  
> -	if (!parent->bi_status)
> +	if (bio->bi_status)
>  		parent->bi_status = bio->bi_status;

Shouldn't it be
	if (!parent->bi_status && bio->bi_status)
  		parent->bi_status = bio->bi_status;
?

Maybe one rephrased question: what about priorities for errors (bio statuses)?

For example, part of a bio fails with EILSEQ (data integrity check error, BLK_STS_PROTECTION),
part with EIO (media error, BLK_STS_IOERR). What should be reported for parent bio?
(I would expect I always get EIO here because this is hard, uncorrectable error.)

Anyway, thanks for the fix!

Milan


>  	bio_put(bio);
>  	return parent;
> 
> to remove the chance that a race will hide a real error.
> 
> Milan: could you please check that the patch fixes the dm-crypt bug?
> I've confirmed that it fixes your test case.
> When you confirm, I'll send a proper patch.
> 
> I guess I should audit all code that assigns to ->bi_status and make
> sure nothing ever assigns 0 to it.
> 
> Thanks,
> NeilBrown
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
NeilBrown Feb. 15, 2018, 8:52 a.m. UTC | #2
On Thu, Feb 15 2018, Milan Broz wrote:

> On 02/15/2018 01:07 AM, NeilBrown wrote:
>> 
>> And looking again at the code, it doesn't seem so bad.  I was worried
>> about reassigning ci.io->orig_bio after calling
>> __split_and_process_non_flush(), but the important thing is to set it
>> before the last dec_pending() call, and the code gets that right.
>> 
>> I think I've found the problem though:
>> 
>> 			/* done with normal IO or empty flush */
>> 			bio->bi_status = io_error;
>> 
>> When we have chained bios, there are multiple sources for error which
>> need to be merged into bi_status:  all bios that were chained to this
>> one, and this bio itself.
>> 
>> __bio_chain_endio uses:
>> 
>> 	if (!parent->bi_status)
>> 		parent->bi_status = bio->bi_status;
>> 
>> so the new status won't over-ride the old status (unless there are
>> races....), but dm doesn't do that - I guess it didn't have to worry
>> about chains before.
>> 
>> So this:
>> 
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index d6de00f367ef..68136806d365 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -903,7 +903,8 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
>>  			queue_io(md, bio);
>>  		} else {
>>  			/* done with normal IO or empty flush */
>> -			bio->bi_status = io_error;
>> +			if (io_error)
>> +				bio->bi_status = io_error;
>>  			bio_endio(bio);
>>  		}
>>  	}
>> 
>
> Hi,
>
> well, it indeed fixes the reported problem, thanks.
> But I just tested the first (dm.c) change above.
>
> The important part is also that if the error is hits bio later, it will produce
> short read (and not fail of the whole operation), this can be easily tested if we switch
> the error and the zero target in that reproducer
>   //system("echo -e \'0 10000000 error\'\\\\n\'10000000 1000000 zero\' | dmsetup create test");
>   system("echo -e \'0 10000000 zero\'\\\\n\'10000000 1000000 error\' | dmsetup create test");
>
> So it seems your patch works.

Excellent - thanks.

>
> I am not sure about this part though:
>
>> should fix it.  And __bio_chain_endio should probably be
>> 
>> diff --git a/block/bio.c b/block/bio.c
>> index e1708db48258..ad77140edc6f 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
>>  {
>>  	struct bio *parent = bio->bi_private;
>>  
>> -	if (!parent->bi_status)
>> +	if (bio->bi_status)
>>  		parent->bi_status = bio->bi_status;
>
> Shouldn't it be
> 	if (!parent->bi_status && bio->bi_status)
>   		parent->bi_status = bio->bi_status;
> ?

That wouldn't hurt, but I don't see how it would help.
It would still be racy as two chained bios could complete at
the same time, so ->bi_status much change between test and set.

>
> Maybe one rephrased question: what about priorities for errors (bio statuses)?
>
> For example, part of a bio fails with EILSEQ (data integrity check error, BLK_STS_PROTECTION),
> part with EIO (media error, BLK_STS_IOERR). What should be reported for parent bio?
> (I would expect I always get EIO here because this is hard, uncorrectable error.)

If there was a well defined ordering, then we could easily write to code
ensure the highest priority status wins, but I don't think there is any
such ordering.
Certainly dec_pending() in dm.c already faces the possibility that it will
be called multiple times with different errors, but doesn't consider any
ordering (except relating to BLK_STS_DM_REQUEUE), when assigning a
status code to io->status.

>
> Anyway, thanks for the fix!

And thanks for the report.

NeilBrown

>
> Milan
>
>
>>  	bio_put(bio);
>>  	return parent;
>> 
>> to remove the chance that a race will hide a real error.
>> 
>> Milan: could you please check that the patch fixes the dm-crypt bug?
>> I've confirmed that it fixes your test case.
>> When you confirm, I'll send a proper patch.
>> 
>> I guess I should audit all code that assigns to ->bi_status and make
>> sure nothing ever assigns 0 to it.
>> 
>> Thanks,
>> NeilBrown
>> 
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff mbox

Patch

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d6de00f367ef..68136806d365 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -903,7 +903,8 @@  static void dec_pending(struct dm_io *io, blk_status_t error)
 			queue_io(md, bio);
 		} else {
 			/* done with normal IO or empty flush */
-			bio->bi_status = io_error;
+			if (io_error)
+				bio->bi_status = io_error;
 			bio_endio(bio);
 		}
 	}

should fix it.  And __bio_chain_endio should probably be

diff --git a/block/bio.c b/block/bio.c
index e1708db48258..ad77140edc6f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -312,7 +312,7 @@  static struct bio *__bio_chain_endio(struct bio *bio)
 {
 	struct bio *parent = bio->bi_private;
 
-	if (!parent->bi_status)
+	if (bio->bi_status)
 		parent->bi_status = bio->bi_status;
 	bio_put(bio);
 	return parent;