
Issue #5876: assertion failure in rbd_img_obj_callback()

Message ID 53325F09.7000306@ieee.org (mailing list archive)
State New, archived

Commit Message

Alex Elder March 26, 2014, 5 a.m. UTC
I think I've got it.  I'll explain, and we'll have to look at
the explanation more closely in the morning.

Bottom line is that I think the assertion is bogus.  Like
the other assertion that was not done under protection of
a lock, this one is being done in a way that's not safe.

First, here's a patch that I think might avoid the problem:

----------------------------------
Index: b/drivers/block/rbd.c
===================================================================
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2130,9 +2130,8 @@ static void rbd_img_obj_callback(struct
 	rbd_assert(which < img_request->obj_request_count);

 	spin_lock_irq(&img_request->completion_lock);
-	if (which > img_request->next_completion)
+	if (which != img_request->next_completion)
 		goto out;
-	rbd_assert(which == img_request->next_completion);

 	for_each_obj_request_from(img_request, obj_request) {
 		rbd_assert(more);
----------------------------------

Here's what I think is happening.  I'll annotate the function,
below.

First, we enter this function when an object request has
been marked "done."  In the case we're looking at, we
have an image request with two (or more) object requests.

static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
{
	struct rbd_img_request *img_request;
	u32 which = obj_request->which;
	bool more = true;

	rbd_assert(obj_request_img_data_test(obj_request));
	img_request = obj_request->img_request;

	dout("%s: img %p obj %p\n", __func__, img_request, obj_request);
	rbd_assert(img_request != NULL);
	rbd_assert(img_request->obj_request_count > 0);
	rbd_assert(which != BAD_WHICH);
	rbd_assert(which < img_request->obj_request_count);

Up to here, all is fine.  Well, *maybe*...  Anyway, I'll
work through that in the morning.  In any case, we are
examining fairly static fields in the object request.

	spin_lock_irq(&img_request->completion_lock);
	if (which > img_request->next_completion)
		goto out;

At this point we decide whether the object request now
completing is the next one.  If it's not, we're done;
whenever an earlier request (the first one) completes
it will ensure this one will get completed as well,
below.

But that's a problem.  In the loop below, the only condition
we're testing in order to account for the completion of the
current *and subsequent* requests is whether each request
is marked "done."

What that means is that while the second request is waiting
to get the spinlock, the first one might be concurrently
going through the loop below.  It finds the second one "done"
and records that fact.  When it finishes the loop, it
updates the next_completion value and releases the lock.

The second request then enters the protected area, and
finds that its "which" value is *not* greater than the
next completion value.  It's in fact *less* than the
next_completion value, because the completion of this
second request has already been processed.

The problem got worse when we moved the spinlock (i.e.,
added protection around checking the next_completion
field) because now the second thread is actually waiting
before checking, so it's pretty much guaranteed it will
trigger the failure.
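
To make the interleaving concrete, here is a minimal single-threaded
replay of the scenario (an illustrative sketch only; the simplified
state and loop are assumptions, not driver code):

/*
 * Illustrative replay of the race described above -- not driver code.
 * Two object requests (which == 0 and which == 1) are both marked done
 * before either callback runs its completion check.
 */
#include <stdbool.h>
#include <stdio.h>

#define COUNT 2U

static bool done[COUNT];
static unsigned int next_completion;

static void img_obj_callback(unsigned int which)
{
	if (which > next_completion) {		/* old check */
		printf("which=%u: not next yet, deferring\n", which);
		return;
	}
	/*
	 * The old rbd_assert(which == next_completion) fires here for
	 * which == 1, because the first call has already advanced
	 * next_completion past it.
	 */
	if (which != next_completion) {		/* proposed check */
		printf("which=%u: already completed by an earlier caller\n",
		       which);
		return;
	}
	while (which < COUNT && done[which]) {
		printf("completing obj_request %u\n", which);
		which++;
	}
	next_completion = which;
}

int main(void)
{
	done[0] = done[1] = true;	/* both OSD acks have arrived */

	img_obj_callback(0);	/* wins the lock: completes 0 and 1           */
	img_obj_callback(1);	/* runs next: which (1) < next_completion (2) */
	return 0;
}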

OK, back to bed.

					-Alex

	rbd_assert(which == img_request->next_completion);

	for_each_obj_request_from(img_request, obj_request) {
		rbd_assert(more);
		rbd_assert(which < img_request->obj_request_count);

		if (!obj_request_done_test(obj_request))
			break;
		more = rbd_img_obj_end_request(obj_request);
		which++;
	}

	rbd_assert(more ^ (which == img_request->obj_request_count));
	img_request->next_completion = which;
out:
	spin_unlock_irq(&img_request->completion_lock);

	if (!more)
		rbd_img_request_complete(img_request);
}

Comments

Alex Elder March 26, 2014, 11:13 a.m. UTC | #1
On 03/26/2014 12:00 AM, Alex Elder wrote:
> I think I've got it.  I'll explain, and we'll have to look at
> the explanation more closely in the morning.
> 
> Bottom line is that I think the assertion is bogus.  Like
> the other assertion that was not done under protection of
> a lock, this one is being done in a way that's not safe.
> 
> First, here's a patch that I think might avoid the problem:
> 
> ----------------------------------
> Index: b/drivers/block/rbd.c
> ===================================================================
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2130,9 +2130,8 @@ static void rbd_img_obj_callback(struct
>  	rbd_assert(which < img_request->obj_request_count);
> 
>  	spin_lock_irq(&img_request->completion_lock);
> -	if (which > img_request->next_completion)
> +	if (which != img_request->next_completion)
>  		goto out;
> -	rbd_assert(which == img_request->next_completion);
> 
>  	for_each_obj_request_from(img_request, obj_request) {
>  		rbd_assert(more);
> ----------------------------------


After some sleep (though not a lot), I remain convinced
the above will fix the problem, for now.  But it's not
quite enough.

The problem arises out of the fact that the thread of
control that marks an object request done is not necessarily
the thread of control that processes the completion of
that object (nor its associated image) request.  There are
no duplicate requests and no duplicate responses.


Object request completions are handled sequentially, based
on the range of address space involved in the requests.
This is the model required by the Linux blk_end_request()
interface.  Let's assume we have a single image request
that has two contiguous object requests, with the "lower"
object request involving bytes ending at the point the
"higher" one begins.

These object requests can be completed by their OSDs in
any order, and we will handle their completion in the
order their acknowledgement is received.  If we process
the higher request's acknowledgement first, that will
be the first to enter rbd_img_obj_callback().  Since
we will not have completed the lower one yet, nothing
more is done.

When the lower object request completes, it will enter
rbd_img_obj_callback() and find that it *is* the
next one we want completed.  So we'll enter a loop
that calls rbd_img_obj_end_request() on this request,
followed by its successor, doing completion processing
on any request that has by now been marked done.  In
the case I've described, the higher request will be
marked done, so it is here that its completion processing
is performed.

Once the last object request has been completed, we
call rbd_img_request_complete(), which will ultimately
lead to dropping a reference on the image request,
which leads to rbd_img_request_destroy().


What's happening in Olivier's case is that the
requests are actually completing *in order*, but
nearly simultaneously.  The lower and higher request
acknowledgements from their OSDs arrive, and their
consequent calls to rbd_img_obj_callback() occur
at the same time.  The lower request wins the race
to get the spinlock before checking if it is the
next one to complete.  So the higher request waits;
but note that both of them have been marked "done"
at this point.

The lower request enters the completion loop, and
calls rbd_img_obj_end_request() on both requests,
and updates the next_completion value before releasing
the spinlock.  Now the higher request acquires the
lock and gets its chance to check its position against
next_completion.  Since completion of this higher
request has already been done while processing
the lower request, next_completion has now gone
past the higher request.  Therefore its "which"
value will not be higher than "next_completion"
but neither will it be equal to it, so the old
assertion would fail.

So my proposed fix just recognizes that, and allows
only the object request matching next_completion
to enter the completion loop, without assuming any
other request must have a greater "which" value.

Hopefully that's a more complete and clear
explanation of the problem.


Now, onto why that's not sufficient.

Basically, at the instant an image object request is
marked done, it is eligible to be completed by a
call to rbd_img_obj_end_request() by a concurrent
thread processing a lower-numbered object request in
the same image request.  And just after that
instant (essentially, because of the protection
of the completion lock), the image request itself will
be completed by a call to rbd_img_request_complete().
It's likely this will drop the last reference to the
image request, so rbd_img_request_destroy() will be
called.  And that calls rbd_img_obj_request_del() on
each of its object requests.

In other words, conceivably the higher object
request that was stuck waiting for the spinlock
in rbd_img_obj_callback() may now be getting
deleted from its image's request list.  (It
could be worse, it may have *just* finished
getting marked done and may not have even entered
rbd_img_obj_callback() yet.)  And in the
process of getting deleted from its image request's
object request list, an object request gets
partially reinitialized, freed, and may get reused.


So...  in order to have a complete fix we need to
do some careful work with reference counting of
image requests and object requests, and avoid
having any of their content get disturbed until
the last reference is dropped.

Right now it's not quite right.  In particular, each
object request needs to have a *reference* to its
image request (not just a pointer to it), and it
should not be dropped until the object request
gets destroyed.  And because one thread can destroy
an object request being actively used by another,
each object request should have a reference that gets
held until it is no longer needed (probably at the
end of rbd_osd_req_callback()).
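
A rough sketch of that last point (assuming the existing
rbd_obj_request_get()/rbd_obj_request_put() helpers; the placement is
a guess, not a finished patch) might look like:

/*
 * Sketch only: pin the object request for the duration of its own
 * completion processing, so a concurrent rbd_img_obj_request_del()
 * cannot pull it out from under us.
 */
static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
				 struct ceph_msg *msg)
{
	struct rbd_obj_request *obj_request = osd_req->r_priv;

	rbd_obj_request_get(obj_request);

	/*
	 * ... existing handling of the osd op result, which may mark
	 * the request done and call rbd_obj_request_complete() ...
	 */

	rbd_obj_request_put(obj_request);
}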


I think for the time being, the simple fix I
described last night will do, and it's somewhat
unlikely--though not impossible--that the problems
due to concurrent destruction will arise.  But
we do need a fix.  I'm prepared to create it, but
for now I'd like to have another opinion on my
treatise above.

					-Alex
Ilya Dryomov March 26, 2014, 11:43 a.m. UTC | #2
On Wed, Mar 26, 2014 at 1:13 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/26/2014 12:00 AM, Alex Elder wrote:
>> I think I've got it.  I'll explain, and we'll have to look at
>> the explanation more closely in the morning.
>>
>> Bottom line is that I think the assertion is bogus.  Like
>> the other assertion that was not done under protection of
>> a lock, this one is being done in a way that's not safe.
>>
>> First, here's a patch that I think might avoid the problem:
>>
>> ----------------------------------
>> Index: b/drivers/block/rbd.c
>> ===================================================================
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -2130,9 +2130,8 @@ static void rbd_img_obj_callback(struct
>>       rbd_assert(which < img_request->obj_request_count);
>>
>>       spin_lock_irq(&img_request->completion_lock);
>> -     if (which > img_request->next_completion)
>> +     if (which != img_request->next_completion)
>>               goto out;
>> -     rbd_assert(which == img_request->next_completion);
>>
>>       for_each_obj_request_from(img_request, obj_request) {
>>               rbd_assert(more);
>> ----------------------------------
>
>
> After some sleep (though not a lot), I remain convinced
> the above will fix the problem, for now.  But it's not
> quite enough.
>
> The problem arises out of the fact that the thread of
> control that marks an object request done is not necessarily
> the thread of control that processes the completion of
> that object (nor its associated image) request.  There are
> no duplicate requests and no duplicate responses.
>
>
> Object request completions are handled sequentially, based
> on the range of address space involved in the requests.
> This is the model required by the Linux blk_end_request()
> interface.  Let's assume we have a single image request
> that has two contiguous object requests, with the "lower"
> object request involving bytes ending at the point the
> "higher" one begins.
>
> These object requests can be completed by their OSDs in
> any order, and we will handle their completion in the
> order their acknowledgement is received.  If we process
> the higher request's acknowledgement first, that will
> be the first to enter rbd_img_obj_callback().  Since
> we will not have completed the lower one yet, nothing
> more is done.
>
> When the lower object request completes, it will enter
> rbd_img_obj_callback() and find that it *is* the
> next one we want completed.  So we'll enter a loop
> that calls rbd_img_obj_end_request() on this request,
> followed by its successor, doing completion processing
> on any request that has by now been marked done.  In
> the case I've described, the higher request will be
> marked done, so it is here that its completion processing
> is performed.
>
> Once the last object request has been completed, we
> call rbd_img_request_complete(), which will ultimately
> lead to dropping a reference on the image request,
> which leads to rbd_img_request_destroy().
>
>
> What's happening in Olivier's case is that the
> requests are actually completing *in order*, but
> nearly simultaneously.  The lower and higher request
> acknowledgements from their OSDs arrive, and their
> consequent calls to rbd_img_obj_callback() occur
> at the same time.  The lower request wins the race
> to get the spinlock before checking if it is the
> next one to complete.  So the higher request waits;
> but note that both of them have been marked "done"
> at this point.
>
> The lower request enters the completion loop, and
> calls rbd_img_obj_end_request() on both requests,
> and updates the next_completion value before releasing
> the spinlock.  Now the higher request acquires the
> lock and gets its chance to check its position against
> next_completion.  Since completion of this higher
> request has already been done while processing
> the lower request, next_completion has now gone
> past the higher request.  Therefore its "which"
> value will not be higher than "next_completion"
> but neither will it be equal to it, so the old
> assertion would fail.
>
> So my proposed fix just recognizes that, and allows
> only the object request matching next_completion
> to enter the completion loop, without assuming any
> other request must have a greater "which" value.
>
> Hopefully that's a more complete and clear
> explanation of the problem.
>
>
> Now, onto why that's not sufficient.
>
> Basically, at the instant an image object request is
> marked done, it is eligible to be completed by a
> call to rbd_img_obj_end_request() by a concurrent
> thread processing a lower-numbered object request in
> the same image request.  And just after that
> instant (essentially, because of the protection
> of the completion lock), the image request itself will
> be completed by a call to rbd_img_request_complete().
> It's likely this will drop the last reference to the
> image request, so rbd_img_request_destroy() will be
> called.  And that calls rbd_img_obj_request_del() on
> each of its object requests.
>
> In other words, conceivably the higher object
> request that was stuck waiting for the spinlock
> in rbd_img_obj_callback() may now be getting
> deleted from its image's request list.  (It
> could be worse, it may have *just* finished
> getting marked done and may not have even entered
> rbd_img_obj_callback() yet.)  And in the
> process of getting deleted from its image request's
> object request list, an object request gets
> partially reinitialized, freed, and may get reused.
>
>
> So...  in order to have a complete fix we need to
> do some careful work with reference counting of
> image requests and object requests, and avoid
> having any of their content get disturbed until
> the last reference is dropped.
>
> Right now it's not quite right.  In particular, each
> object request needs to have a *reference* to its
> image request (not just a pointer to it), and it
> should not be dropped until the object request
> gets destroyed.  And because one thread can destroy
> an object request being actively used by another,
> each object request should have a reference that gets
> held until it is no longer needed (probably at the
> end of rbd_osd_req_callback()).
>
>
> I think for the time being, the simple fix I
> described last night will do, and it's somewhat
> unlikely--though not impossible--that the problems
> due to concurrent destruction will arise.  But
> we do need a fix.  I'm prepared to create it, but
> for now I'd like to have another opinion on my
> treatise above.

This all makes sense; I think reference counting is the right thing to
do.  We definitely do need a real fix -- with the simple fix we still
have a potential use-after-free on img_request itself, not to mention
obj_requests.

It looks like img_request kref currently exists for posterity only.
Unless I'm missing something, its counter is set to 1 in
rbd_img_request_create() and is not incremented anywhere else, which
means that the instant rbd_img_request_put() is called, img_request is
freed.  I naively assumed it was incremented and decremented in
rbd_img_obj_request_add() and rbd_img_obj_request_del() respectively..
Maybe that's something we should look at first?
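
For reference, the pattern described above would look roughly like this
(a sketch only: rbd_img_request_get() does not exist yet and is assumed
here as the counterpart to rbd_img_request_put(), and the existing
bodies of the add/del helpers are elided):

static void rbd_img_request_get(struct rbd_img_request *img_request)
{
	kref_get(&img_request->kref);	/* hypothetical helper */
}

static void rbd_img_obj_request_add(struct rbd_img_request *img_request,
				    struct rbd_obj_request *obj_request)
{
	/* Each obj_request would pin its image request ... */
	rbd_img_request_get(img_request);
	/* ... existing list insertion and which/obj_request_count updates ... */
}

static void rbd_img_obj_request_del(struct rbd_img_request *img_request,
				    struct rbd_obj_request *obj_request)
{
	/* ... existing list removal and bookkeeping ... */
	rbd_img_request_put(img_request);	/* dropped only on removal */
}

That way img_request would outlive every one of its object requests,
which is what the use-after-free scenario described earlier calls for.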

Thanks,

                Ilya
Alex Elder March 26, 2014, 11:47 a.m. UTC | #3
On 03/26/2014 06:43 AM, Ilya Dryomov wrote:
> It looks like img_request kref currently exists for posterity only.
> Unless I'm missing something, its counter is set to 1 in
> rbd_img_request_create() and is not incremented anywhere else, which
> means that the instant rbd_img_request_put() is called, img_request is
> freed.  I naively assumed it was incremented and decremented in
> rbd_img_obj_request_add() and rbd_img_obj_request_del() respectively..
> Maybe that's something we should look at first?

I believe it was something that I never finished implementing.

So yes, the kref is there to be used, it just isn't, really.

If you want to give it a try, be my guest.  I'll review it
carefully.  (I can't really test my changes very well anyway.)

If you want me to do it, let me know.

					-Alex
Ilya Dryomov March 26, 2014, 12:05 p.m. UTC | #4
On Wed, Mar 26, 2014 at 1:47 PM, Alex Elder <elder@ieee.org> wrote:
> On 03/26/2014 06:43 AM, Ilya Dryomov wrote:
>> It looks like img_request kref currently exists for posterity only.
>> Unless I'm missing something, its counter is set to 1 in
>> rbd_img_request_create() and is not incremented anywhere else, which
>> means that the instant rbd_img_request_put() is called, img_request is
>> freed.  I naively assumed it was incremented and decremented in
>> rbd_img_obj_request_add() and rbd_img_obj_request_del() respectively..
>> Maybe that's something we should look at first?
>
> I believe it was something that I never finished implementing.
>
> So yes, the kref is there to be used, it just isn't, really.
>
> If you want to give it a try, be my guest.  I'll review it
> carefully.  (I can't really test my changes very well anyway.)
>
> If you want me to do it, let me know.

I think you should do it ;)  If time is not a big issue, that is.  You
have a much more complete picture in your head.

Thanks,

                Ilya
Alex Elder March 26, 2014, 8:58 p.m. UTC | #5
Olivier reports that with the simple patch I provided
(which changed a ">" to a "!=" and removed an assertion)
he is running successfully.

To me this is fantastic news, and you can see I posted
a patch with the fix.

There remains a race condition though, one which I described
in a separate message earlier today.  I don't think it will
prove to be a problem in practice, but I agreed to work on
a fix to ensure the race condition is eliminated.  It will
require some work with reference counting image and object
requests.

The fix won't be coming today.  But I aim to provide it
in a matter of several days.

					-Alex
Olivier Bonvalet March 27, 2014, 7:48 a.m. UTC | #6
On Wednesday, March 26, 2014, at 15:58 -0500, Alex Elder wrote:
> Olivier reports that with the simple patch I provided
> (which changed a ">" to a "!=" and removed an assertion)
> he is running successfully.
> 
> To me this is fantastic news, and you can see I posted
> a patch with the fix.
> 
> There remains a race condition though, one which I described
> in a separate message earlier today.  I don't think it will
> prove to be a problem in practice, but I agreed to work on
> a fix to ensure the race condition is eliminated.  It will
> require some work with reference counting image and object
> requests.
> 
> The fix won't be coming today.  But I aim to provide it
> in a matter of several days.
> 
> 					-Alex
> 

One question from one of my customers: why am I the only one to
complain about this problem?
I know that Ceph users often use qemu/librbd instead of the kernel client,
but what is the trigger of this "race condition"?  Having "multiple
requests" per RBD image?  That should be normal use, no?

If someone can help me with an explanation, thanks :)

Olivier

Ilya Dryomov March 27, 2014, 8:45 a.m. UTC | #7
On Thu, Mar 27, 2014 at 9:48 AM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> On Wednesday, March 26, 2014, at 15:58 -0500, Alex Elder wrote:
>> Olivier reports that with the simple patch I provided
>> (which changed a ">" to a "!=" and removed an assertion)
>> he is running successfully.
>>
>> To me this is fantastic news, and you can see I posted
>> a patch with the fix.
>>
>> There remains a race condition though, one which I described
>> in a separate message earlier today.  I don't think it will
>> prove to be a problem in practice, but I agreed to work on
>> a fix to ensure the race condition is eliminated.  It will
>> require some work with reference counting image and object
>> requests.
>>
>> The fix won't be coming today.  But I aim to provide it
>> in a matter of several days.
>>
>>                                       -Alex
>>
>
> One question from one of my customers: why am I the only one to
> complain about this problem?
> I know that Ceph users often use qemu/librbd instead of the kernel client,
> but what is the trigger of this "race condition"?  Having "multiple
> requests" per RBD image?  That should be normal use, no?
>
> If someone can help me with an explanation, thanks :)

We've had a couple more similar reports in the last few months.
However, you are the first reporter who was able to trigger this race
often enough to track it down.  This race condition (read: bug) is
kernel client specific; qemu/librbd is unaffected.  Having an rbd
request that spans multiple RADOS objects and therefore results in
multiple object requests is normal use; it's just that this particular
piece of code turned out to be prone to a subtle race.  You have to keep
in mind that races are all about timing and relative order of events, so
simply issuing a multi-object rbd request is not enough to trigger it --
the stars have to align too ;)

Thanks,

                Ilya
Olivier Bonvalet March 27, 2014, 8:49 a.m. UTC | #8
On Thursday, March 27, 2014, at 10:45 +0200, Ilya Dryomov wrote:
> On Thu, Mar 27, 2014 at 9:48 AM, Olivier Bonvalet <ceph.list@daevel.fr> wrote:
> > On Wednesday, March 26, 2014, at 15:58 -0500, Alex Elder wrote:
> >> Olivier reports that with the simple patch I provided
> >> (which changed a ">" to a "!=" and removed an assertion)
> >> he is running successfully.
> >>
> >> To me this is fantastic news, and you can see I posted
> >> a patch with the fix.
> >>
> >> There remains a race condition though, one which I described
> >> in a separate message earlier today.  I don't think it will
> >> prove to be a problem in practice, but I agreed to work on
> >> a fix to ensure the race condition is eliminated.  It will
> >> require some work with reference counting image and object
> >> requests.
> >>
> >> The fix won't be coming today.  But I aim to provide it
> >> in a matter of several days.
> >>
> >>                                       -Alex
> >>
> >
> > One question from one of my customers: why am I the only one to
> > complain about this problem?
> > I know that Ceph users often use qemu/librbd instead of the kernel client,
> > but what is the trigger of this "race condition"?  Having "multiple
> > requests" per RBD image?  That should be normal use, no?
> >
> > If someone can help me with an explanation, thanks :)
> 
> We've had a couple more similar reports in the last few months.
> However, you are the first reporter who was able to trigger this race
> often enough to track it down.  This race condition (read: bug) is
> kernel client specific; qemu/librbd is unaffected.  Having an rbd
> request that spans multiple RADOS objects and therefore results in
> multiple object requests is normal use; it's just that this particular
> piece of code turned out to be prone to a subtle race.  You have to keep
> in mind that races are all about timing and relative order of events, so
> simply issuing a multi-object rbd request is not enough to trigger it --
> the stars have to align too ;)
> 
> Thanks,
> 
>                 Ilya
> 

Great, thanks!

Olivier


Patch

Index: b/drivers/block/rbd.c
===================================================================
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2130,9 +2130,8 @@  static void rbd_img_obj_callback(struct
 	rbd_assert(which < img_request->obj_request_count);

 	spin_lock_irq(&img_request->completion_lock);
-	if (which > img_request->next_completion)
+	if (which != img_request->next_completion)
 		goto out;
-	rbd_assert(which == img_request->next_completion);

 	for_each_obj_request_from(img_request, obj_request) {
 		rbd_assert(more);