diff mbox

block: be more careful about status in __bio_chain_endio

Message ID 87eflmpqkb.fsf@notabene.neil.brown.name (mailing list archive)
State New, archived
Headers show

Commit Message

NeilBrown Feb. 15, 2018, 9:09 a.m. UTC
If two bios are chained under the one parent (with bio_chain())
it is possible that one will succeed and the other will fail.
__bio_chain_endio must ensure that the failure error status
is reported for the whole, rather than the success.

It currently tries to be careful, but this test is racy.
If both children finish at the same time, they might both see that
parent->bi_status as zero, and so will assign their own status.
If the assignment to parent->bi_status by the successful bio happens
last, the error status will be lost which can lead to silent data
corruption.

Instead, __bio_chain_endio should only assign a non-zero status
to parent->bi_status.  There is then no need to test the current
value of parent->bi_status - a test that would be racy anyway.

Note that this bug hasn't been seen in practice.  It was only discovered
by examination after a similar bug was found in dm.c

Signed-off-by: NeilBrown <neilb@suse.com>
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Mike Snitzer Feb. 22, 2019, 9:10 p.m. UTC | #1
On Thu, Feb 15 2018 at  4:09am -0500,
NeilBrown <neilb@suse.com> wrote:

> 
> If two bios are chained under the one parent (with bio_chain())
> it is possible that one will succeed and the other will fail.
> __bio_chain_endio must ensure that the failure error status
> is reported for the whole, rather than the success.
> 
> It currently tries to be careful, but this test is racy.
> If both children finish at the same time, they might both see that
> parent->bi_status as zero, and so will assign their own status.
> If the assignment to parent->bi_status by the successful bio happens
> last, the error status will be lost which can lead to silent data
> corruption.
> 
> Instead, __bio_chain_endio should only assign a non-zero status
> to parent->bi_status.  There is then no need to test the current
> value of parent->bi_status - a test that would be racy anyway.
> 
> Note that this bug hasn't been seen in practice.  It was only discovered
> by examination after a similar bug was found in dm.c
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  block/bio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index e1708db48258..ad77140edc6f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
>  {
>  	struct bio *parent = bio->bi_private;
>  
> -	if (!parent->bi_status)
> +	if (bio->bi_status)
>  		parent->bi_status = bio->bi_status;
>  	bio_put(bio);
>  	return parent;
> -- 
> 2.14.0.rc0.dirty
> 

Reviewed-by: Mike Snitzer <snitzer@redhat.com>

Jens, this one slipped through the crack just over a year ago.
It is available in patchwork here:
https://patchwork.kernel.org/patch/10220727/
Jens Axboe Feb. 22, 2019, 10:46 p.m. UTC | #2
On 2/22/19 2:10 PM, Mike Snitzer wrote:
> On Thu, Feb 15 2018 at  4:09am -0500,
> NeilBrown <neilb@suse.com> wrote:
> 
>>
>> If two bios are chained under the one parent (with bio_chain())
>> it is possible that one will succeed and the other will fail.
>> __bio_chain_endio must ensure that the failure error status
>> is reported for the whole, rather than the success.
>>
>> It currently tries to be careful, but this test is racy.
>> If both children finish at the same time, they might both see that
>> parent->bi_status as zero, and so will assign their own status.
>> If the assignment to parent->bi_status by the successful bio happens
>> last, the error status will be lost which can lead to silent data
>> corruption.
>>
>> Instead, __bio_chain_endio should only assign a non-zero status
>> to parent->bi_status.  There is then no need to test the current
>> value of parent->bi_status - a test that would be racy anyway.
>>
>> Note that this bug hasn't been seen in practice.  It was only discovered
>> by examination after a similar bug was found in dm.c
>>
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  block/bio.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/block/bio.c b/block/bio.c
>> index e1708db48258..ad77140edc6f 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
>>  {
>>  	struct bio *parent = bio->bi_private;
>>  
>> -	if (!parent->bi_status)
>> +	if (bio->bi_status)
>>  		parent->bi_status = bio->bi_status;
>>  	bio_put(bio);
>>  	return parent;
>> -- 
>> 2.14.0.rc0.dirty
>>
> 
> Reviewed-by: Mike Snitzer <snitzer@redhat.com>
> 
> Jens, this one slipped through the crack just over a year ago.
> It is available in patchwork here:
> https://patchwork.kernel.org/patch/10220727/

Should this be:

	if (!parent->bi_status && bio->bi_status)
		parent->bi_status = bio->bi_status;

perhaps?
Mike Snitzer Feb. 22, 2019, 11:55 p.m. UTC | #3
On Fri, Feb 22 2019 at  5:46pm -0500,
Jens Axboe <axboe@kernel.dk> wrote:

> On 2/22/19 2:10 PM, Mike Snitzer wrote:
> > On Thu, Feb 15 2018 at  4:09am -0500,
> > NeilBrown <neilb@suse.com> wrote:
> > 
> >>
> >> If two bios are chained under the one parent (with bio_chain())
> >> it is possible that one will succeed and the other will fail.
> >> __bio_chain_endio must ensure that the failure error status
> >> is reported for the whole, rather than the success.
> >>
> >> It currently tries to be careful, but this test is racy.
> >> If both children finish at the same time, they might both see that
> >> parent->bi_status as zero, and so will assign their own status.
> >> If the assignment to parent->bi_status by the successful bio happens
> >> last, the error status will be lost which can lead to silent data
> >> corruption.
> >>
> >> Instead, __bio_chain_endio should only assign a non-zero status
> >> to parent->bi_status.  There is then no need to test the current
> >> value of parent->bi_status - a test that would be racy anyway.
> >>
> >> Note that this bug hasn't been seen in practice.  It was only discovered
> >> by examination after a similar bug was found in dm.c
> >>
> >> Signed-off-by: NeilBrown <neilb@suse.com>
> >> ---
> >>  block/bio.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/block/bio.c b/block/bio.c
> >> index e1708db48258..ad77140edc6f 100644
> >> --- a/block/bio.c
> >> +++ b/block/bio.c
> >> @@ -312,7 +312,7 @@ static struct bio *__bio_chain_endio(struct bio *bio)
> >>  {
> >>  	struct bio *parent = bio->bi_private;
> >>  
> >> -	if (!parent->bi_status)
> >> +	if (bio->bi_status)
> >>  		parent->bi_status = bio->bi_status;
> >>  	bio_put(bio);
> >>  	return parent;
> >> -- 
> >> 2.14.0.rc0.dirty
> >>
> > 
> > Reviewed-by: Mike Snitzer <snitzer@redhat.com>
> > 
> > Jens, this one slipped through the crack just over a year ago.
> > It is available in patchwork here:
> > https://patchwork.kernel.org/patch/10220727/
> 
> Should this be:
> 
> 	if (!parent->bi_status && bio->bi_status)
> 		parent->bi_status = bio->bi_status;
> 
> perhaps?

Yeap, even better.  Not seeing any reason to have the last error win,
the first in the chain is likely the most important.
John Dorminy Feb. 23, 2019, 2:02 a.m. UTC | #4
I am perhaps not understanding the intricacies here, or not seeing a
barrier protecting it, so forgive me if I'm off base. I think reading
parent->bi_status here is unsafe.
Consider the following sequence of events on two threads.

Thread 0                                 Thread 1
In __bio_chain_endio:                    In __bio_chain_endio:
[A] Child 0 reads parent->bi_status,
    no error.
                                         Child bio 1 reads parent, no error seen
                                         It sets parent->bi_status to an error
                                         It calls bio_put.
Child bio 0 calls bio_put
[end __bio_chain_endio]                  [end __bio_chain_endio]
                                         In bio_chain_endio(), bio_endio(parent)
                                         is called, calling bio_remaining_done()
                                         which decrements __bi_remaining to 1
                                         and returns false, so no further endio
                                         stuff is done.
In bio_chain_endio(), bio_endio(parent)
is called, calling bio_remaining_done(),
decrementing parent->__bi_remaining to
 0, and continuing to finish parent.
Either for block tracing or for parent's
bi_end_io(), this thread tries to read
parent->bi_status again.

The compiler or the CPU may cache the read from [A], and since there
are no intervening barriers, parent->bi_status is still believed on
thread 0 to be success. Thus the bio may still be falsely believed to
have completed successfully, even though child 1 set an error in it.

Am I missing a subtlety here?
Mike Snitzer Feb. 23, 2019, 2:44 a.m. UTC | #5
On Fri, Feb 22 2019 at  9:02pm -0500,
John Dorminy <jdorminy@redhat.com> wrote:

> I am perhaps not understanding the intricacies here, or not seeing a
> barrier protecting it, so forgive me if I'm off base. I think reading
> parent->bi_status here is unsafe.
> Consider the following sequence of events on two threads.
> 
> Thread 0                                 Thread 1
> In __bio_chain_endio:                    In __bio_chain_endio:
> [A] Child 0 reads parent->bi_status,
>     no error.
>                                          Child bio 1 reads parent, no error seen
>                                          It sets parent->bi_status to an error
>                                          It calls bio_put.
> Child bio 0 calls bio_put
> [end __bio_chain_endio]                  [end __bio_chain_endio]
>                                          In bio_chain_endio(), bio_endio(parent)
>                                          is called, calling bio_remaining_done()
>                                          which decrements __bi_remaining to 1
>                                          and returns false, so no further endio
>                                          stuff is done.
> In bio_chain_endio(), bio_endio(parent)
> is called, calling bio_remaining_done(),
> decrementing parent->__bi_remaining to
>  0, and continuing to finish parent.
> Either for block tracing or for parent's
> bi_end_io(), this thread tries to read
> parent->bi_status again.
> 
> The compiler or the CPU may cache the read from [A], and since there
> are no intervening barriers, parent->bi_status is still believed on
> thread 0 to be success. Thus the bio may still be falsely believed to
> have completed successfully, even though child 1 set an error in it.
> 
> Am I missing a subtlety here?

Either neilb's original or even Jens' suggestion would be fine though.

>       if (!parent->bi_status && bio->bi_status)
>               parent->bi_status = bio->bi_status;

Even if your scenario did play out (which I agree it looks possible)
it'd just degenerate to neilb's version:

>       if (bio->bi_status)
>               parent->bi_status = bio->bi_status;

Which also accomplishes fixing what Neil originally detailed in his
patch header.
John Dorminy Feb. 23, 2019, 3:10 a.m. UTC | #6
I'm also worried about the other two versions, though:

memory-barriers.txt#1724:

1724 (*) The compiler is within its rights to invent stores to a variable,

i.e. the compiler is free to decide __bio_chain_endio looks like this:

static struct bio *__bio_chain_endio(struct bio *bio)
{
  struct bio *parent = bio->bi_private;
  blk_status_t tmp = parent->bi_status;
  parent->bi_status = bio->bi_status;
  if (!bio->bi_status)
    parent->bi_status = tmp;
  bio_put(bio);
  return parent;
}

In which case, the read and later store on the two different threads
may overlap in such a way that bio_endio sometimes sees success, even
if one child had an error.

As a result, I believe the setting of parent->bi_status needs to be a
WRITE_ONCE() and the later reading needs to be a READ_ONCE()
[although, since the later reading happens in many different
functions, perhaps some other barrier to make sure all readers get the
correct value is in order.]
John Dorminy June 12, 2019, 2:56 a.m. UTC | #7
Having studied the code in question more thoroughly, my first email's
scenario can't occur, I believe. bio_put() contains a
atomic_dec_and_test(), which (according to
Documentation/atomic_t.txt), having a return value, are fully ordered
and therefore impose a general memory barrier. A general memory
barrier means that no value set before bio_put() may be observed
afterwards, if I accurately understand
Documentation/memory_barriers.txt. Thus, the scenario I imagined, with
a load from inside __bio_chain_endio() being visible in bi_end_io(),
cannot occur.

However, the second email's scenario, of a compiler making two stores
out of a conditional store, is still accurate according to my
understanding. I continue to believe setting parent->bi_status needs
to be a WRITE_ONCE() and any reading of parent->bi_status before
bio_put() needs to be at least a READ_ONCE(). The last thing in that
email is wrong for the same reason that the first email couldn't
happen: the bio_put() general barrier means later accesses to the
field from a single thread will freshly read the field and thereby not
get an incorrectly cached value.

As a concrete proposal, I believe either of the following work and fix
the race NeilB described, as well as the compiler (or CPU) race I
described:

 -     if (!parent->bi_status)
 -               parent->bi_status = bio->bi_status;
 +     if (bio->bi_status)
 +               WRITE_ONCE(parent->bi_status, bio->bi_status);

--or--

 -     if (!parent->bi_status)
 -               parent->bi_status = bio->bi_status;
 +     if (!READ_ONCE(parent->bi_status) && bio->bi_status)
 +               WRITE_ONCE(parent->bi_status, bio->bi_status);

I believe the second of these might, but is not guaranteed to,
preserve the first error observed in a child; I believe if you want to
definitely save the first error you need an atomic.
Christoph Hellwig June 12, 2019, 7:01 a.m. UTC | #8
On Tue, Jun 11, 2019 at 10:56:42PM -0400, John Dorminy wrote:
> I believe the second of these might, but is not guaranteed to,
> preserve the first error observed in a child; I believe if you want to
> definitely save the first error you need an atomic.

Is there any reason not to simply use a cmpxchg?  Yes, it is a
relatively expensive operation, but once we are chaining bios we are out
of the super hot path anyway.  We do something similar in xfs and iomap
already.
Hannes Reinecke June 17, 2019, 7:32 a.m. UTC | #9
On 6/12/19 9:01 AM, Christoph Hellwig wrote:
> On Tue, Jun 11, 2019 at 10:56:42PM -0400, John Dorminy wrote:
>> I believe the second of these might, but is not guaranteed to,
>> preserve the first error observed in a child; I believe if you want to
>> definitely save the first error you need an atomic.
> 
> Is there any reason not to simply use a cmpxchg?  Yes, it is a
> relatively expensive operation, but once we are chaining bios we are out
> of the super hot path anyway.  We do something similar in xfs and iomap
> already.
> 
Agree.
Thing is, we need to check if the parent status is NULL, _and_ the
parent status might be modified asynchronously.
So even a READ_ONCE() wouldn't cut it, as it would tell us that the
parent status _was_ NULL, not that the parent status _is_ NULL by the
time we're setting it.
So cmpxchg() is it.

Cheers,

Hannes
diff mbox

Patch

diff --git a/block/bio.c b/block/bio.c
index e1708db48258..ad77140edc6f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -312,7 +312,7 @@  static struct bio *__bio_chain_endio(struct bio *bio)
 {
 	struct bio *parent = bio->bi_private;
 
-	if (!parent->bi_status)
+	if (bio->bi_status)
 		parent->bi_status = bio->bi_status;
 	bio_put(bio);
 	return parent;