
dm-snap deadlock in pending_complete()

Message ID 20150812114631.15268065@noble (mailing list archive)
State Rejected, archived
Delegated to: Mike Snitzer

Commit Message

NeilBrown Aug. 12, 2015, 1:46 a.m. UTC
On Tue, 11 Aug 2015 05:14:33 -0400 (EDT) Mikulas Patocka
<mpatocka@redhat.com> wrote:

> Hi
> 
> On Mon, 10 Aug 2015, NeilBrown wrote:
> 
> > 
> > Hi Mikulas,
> >  I have a customer hitting the deadlock you described over a year ago
> >  in:
> > 
> > Subject: [dm-devel] [PATCH] block: flush queued bios when the process
> >          blocks
> 
> Ask block layer maintainers to accept that patch.

Unfortunately I don't really like the patch ... or the bioset rescue
workqueues that it is based on.   Sorry.

So I might keep looking for a more localised approach....

> 
> > I notice that patch never went upstream.
> > 
> > Has anything else been done to fix this deadlock?
> > 
> > My thought was that something like the below would be sufficient.  Do
> > you see any problem with that?  It avoids the deadlock by dropping the
> > lock while sleeping.
> 
> The patch below introduces a bug - if the user is continuously reading the 
> same chunk (for example, by issuing multiple direct i/o requests to the 
> same chunk), the function pending_complete never finishes.
> 
> We need to hold the lock while waiting to make sure that no new read 
> requests are submitted.

Ahh - thanks for the explanation.  That makes sense.
Presumably you need to wait for old reads so that you don't have a read
returning old data after a write has confirmed that the new data is written?
Or is there some other reason?

As soon as the new "proper" exception is installed, no new reads will
use the old address, so it should be safe to wait without holding the
lock.
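
(For reference, the wait in question is __check_for_conflicting_io(), which
as far as I can see is just a polling loop, roughly:

static void __check_for_conflicting_io(struct dm_snapshot *s, chunk_t chunk)
{
	/* poll until no reads to 'chunk' are tracked any more */
	while (__chunk_is_tracked(s, chunk))
		msleep(1);
}

Once the exception is in the complete table, nothing new should start being
tracked for old_chunk, so the loop should terminate even without the lock.)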

So what about this?

Thanks for your time,
NeilBrown





Comments

Mikulas Patocka Aug. 12, 2015, 2:16 p.m. UTC | #1
On Wed, 12 Aug 2015, NeilBrown wrote:

> Ahh - thanks for the explanation.  That makes sense.
> Presumably you need to wait for old reads so that you don't have a read
> returning old data after a write has confirmed that the new data is written?
> Or is there some other reason?
> 
> As soon as the new "proper" exception is installed, no new reads will
> use the old address, so it should be safe to wait without holding the
> lock.
> 
> So what about this?

This is wrong too - when you add the exception to the complete hash table 
and drop the lock, writes to the chunk in the origin target can modify 
the origin volume. These writes race with pending reads to the snapshot 
(__chunk_is_tracked tests for those reads). If the I/O scheduler schedules 
the write to the origin before the read from the snapshot, the read from 
the snapshot returns corrupted data.
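
To spell the window out (rough ordering sketch; the exact code paths are
from memory):

  1. A read to the snapshot is mapped while no exception exists, so it is
     redirected to the origin and tracked; it sits in the I/O scheduler.
  2. pending_complete() inserts the exception and drops s->lock.
  3. A write to the origin arrives; the origin path sees a completed
     exception for the chunk, so it does not defer and overwrites the data.
  4. The scheduler finally issues the read from step 1 - it returns the
     new data, so the snapshot read is corrupted.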

There can be more problems with snapshot merging - if the chunk is on the 
complete hash table, the merging code assumes that it can be safely 
merged.

Mikulas

> Thanks for your time,
> NeilBrown
> 
> 
> 
> diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
> index 7c82d3ccce87..c7a7ed2cfd6d 100644
> --- a/drivers/md/dm-snap.c
> +++ b/drivers/md/dm-snap.c
> @@ -1461,14 +1461,22 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
>  		goto out;
>  	}
>  
> -	/* Check for conflicting reads */
> -	__check_for_conflicting_io(s, pe->e.old_chunk);
> +	/* Add proper exception.  Now all reads will be
> +	 * redirected, so no new reads can be started on
> +	 * 'old_chunk'.
> +	 */
> +	dm_insert_exception(&s->complete, e);
> +
> +	/* Drain conflicting reads */
> +	if (__chunk_is_tracked(s, pe->e.old_chunk)) {
> +		up_write(&s->lock);
> +		__check_for_conflicting_io(s, pe->e.old_chunk);
> +		down_write(&s->lock);
> +	}
>  
>  	/*
> -	 * Add a proper exception, and remove the
> -	 * in-flight exception from the list.
> +	 * And now we can remove the temporary exception.
>  	 */
> -	dm_insert_exception(&s->complete, e);
>  
>  out:
>  	dm_remove_exception(&pe->e);
> 

Mikulas Patocka Aug. 12, 2015, 4:25 p.m. UTC | #2
On Wed, 12 Aug 2015, NeilBrown wrote:

> On Tue, 11 Aug 2015 05:14:33 -0400 (EDT) Mikulas Patocka
> <mpatocka@redhat.com> wrote:
> 
> > Hi
> > 
> > On Mon, 10 Aug 2015, NeilBrown wrote:
> > 
> > > 
> > > Hi Mikulas,
> > >  I have a customer hitting the deadlock you described over a year ago
> > >  in:
> > > 
> > > Subject: [dm-devel] [PATCH] block: flush queued bios when the process
> > >          blocks
> > 
> > Ask block layer maintainers to accept that patch.
> 
> Unfortunately I don't really like the patch ... or the bioset rescue
> workqueues that it is based on.   Sorry.
> 
> So I might keep looking for a more localised approach....

The problem here is that other dm targets may deadlock in a similar way 
too - for example, dm-thin could deadlock on pmd->pool_lock.

The cause of the bug is bio queuing on current->bio_list. There is an 
assumption that if a dm target submits a bio to a lower-level target, the 
bio finishes in finite time. Queuing on current->bio_list breaks the 
assumption: bios can be held indefinitely on current->bio_list.

The patch that flushes current->bio_list is the correct way to fix it - it 
makes sure that a bio can't be held indefinitely.

Another way to fix it would be to abandon current->bio_list --- but then, 
there would be a problem with stack overflow on deeply nested targets.
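
The core of the flush-on-block approach is small. A sketch of the idea (the
helper name below is made up; the real patch punts the bios to the biosets'
rescuer workqueues, so the details differ):

/*
 * Sketch only: called just before the task blocks.  Everything still
 * parked on current->bio_list is handed to another context so those
 * bios can make progress while we sleep.
 */
static void flush_current_bio_list(void)
{
	struct bio_list list;
	struct bio *bio;

	if (!current->bio_list || bio_list_empty(current->bio_list))
		return;

	bio_list_init(&list);
	bio_list_merge(&list, current->bio_list);
	bio_list_init(current->bio_list);

	while ((bio = bio_list_pop(&list)))
		punt_bio_to_rescuer(bio);	/* hypothetical helper */
}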

Mikulas

NeilBrown Aug. 13, 2015, 12:43 a.m. UTC | #3
On Wed, 12 Aug 2015 12:25:42 -0400 (EDT) Mikulas Patocka
<mpatocka@redhat.com> wrote:

> 
> 
> On Wed, 12 Aug 2015, NeilBrown wrote:
> 
> > On Tue, 11 Aug 2015 05:14:33 -0400 (EDT) Mikulas Patocka
> > <mpatocka@redhat.com> wrote:
> > 
> > > Hi
> > > 
> > > On Mon, 10 Aug 2015, NeilBrown wrote:
> > > 
> > > > 
> > > > Hi Mikulas,
> > > >  I have a customer hitting the deadlock you described over a year ago
> > > >  in:
> > > > 
> > > > Subject: [dm-devel] [PATCH] block: flush queued bios when the process
> > > >          blocks
> > > 
> > > Ask block layer maintainers to accept that patch.
> > 
> > Unfortunately I don't really like the patch ... or the bioset rescue
> > workqueues that it is based on.   Sorry.
> > 
> > So I might keep looking for a more localised approach....
> 
> The problem here is that other dm targets may deadlock in a similar way 
> too - for example, dm-thin could deadlock on pmd->pool_lock.
> 
> The cause of the bug is bio queuing on current->bio_list. There is an 
> assumption that if a dm target submits a bio to a lower-level target, the 
> bio finishes in finite time. Queuing on current->bio_list breaks the 
> assumption: bios can be held indefinitely on current->bio_list.
> 
> The patch that flushes current->bio_list is the correct way to fix it - it 
> makes sure that a bio can't be held indefinitely.
> 
> Another way to fix it would be to abandon current->bio_list --- but then, 
> there would be a problem with stack overflow on deeply nested targets.
> 

I think it is a bit simplistic to say that current->bio_list is the
"cause" of the problem.  It is certainly a part of the problem.  The
assumption you mention is another part - and the two conflict.

As you say, we could abandon current->bio_list, but then we risk stack
overflows again.
Or we could hand the bio_list to a work-queue whenever the make_request
function needs to schedule.... but then if handling one of those bios
needs to schedule...  not sure, it might work.

Or we could change the assumption and never block in a make_request
function after calling generic_make_request().  Maybe that is difficult.

Or we could change __split_and_process_bio to use bio_split() to split
the bio, then handle the first and call generic_make_request on the
second.  That would effectively put the second half on the end of
bio_list so it wouldn't be tried until all requests derived from the
first half have been handled.

None of these is completely straightforward, but I suspect all are
possible.

I'm leaning towards the last one:  when you want to split a bio, use
bio_split and handle the two halves separately.
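
Roughly like this (sketch only - it assumes the target owns a bio_set
'split_bs' for the splits and that 'sectors' is how much we can handle now;
error handling omitted):

	if (sectors < bio_sectors(bio)) {
		struct bio *split;

		split = bio_split(bio, sectors, GFP_NOIO, split_bs);
		bio_chain(split, bio);
		/* the remainder goes to the back of current->bio_list ... */
		generic_make_request(bio);
		/* ... and only the front piece is processed here */
		bio = split;
	}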

Do you have thoughts on that?

Thanks,
NeilBrown

Mike Snitzer Aug. 13, 2015, 2:02 p.m. UTC | #4
On Wed, Aug 12 2015 at  8:43pm -0400,
NeilBrown <neilb@suse.com> wrote:

> On Wed, 12 Aug 2015 12:25:42 -0400 (EDT) Mikulas Patocka
> <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Wed, 12 Aug 2015, NeilBrown wrote:
> > 
> > > On Tue, 11 Aug 2015 05:14:33 -0400 (EDT) Mikulas Patocka
> > > <mpatocka@redhat.com> wrote:
> > > 
> > > > Hi
> > > > 
> > > > On Mon, 10 Aug 2015, NeilBrown wrote:
> > > > 
> > > > > 
> > > > > Hi Mikulas,
> > > > >  I have a customer hitting the deadlock you described over a year ago
> > > > >  in:
> > > > > 
> > > > > Subject: [dm-devel] [PATCH] block: flush queued bios when the process
> > > > >          blocks
> > > > 
> > > > Ask block layer maintainers to accept that patch.
> > > 
> > > Unfortunately I don't really like the patch ... or the bioset rescue
> > > workqueues that it is based on.   Sorry.
> > > 
> > > So I might keep looking for a more localised approach....
> > 
> > The problem here is that other dm targets may deadlock in a similar way 
> > too - for example, dm-thin could deadlock on pmd->pool_lock.
> > 
> > The cause of the bug is bio queuing on current->bio_list. There is an 
> > assumption that if a dm target submits a bio to a lower-level target, the 
> > bio finishes in finite time. Queuing on current->bio_list breaks the 
> > assumption: bios can be held indefinitely on current->bio_list.
> > 
> > The patch that flushes current->bio_list is the correct way to fix it - it 
> > makes sure that a bio can't be held indefinitely.
> > 
> > Another way to fix it would be to abandon current->bio_list --- but then, 
> > there would be a problem with stack overflow on deeply nested targets.
> > 
> 
> I think it is a bit simplistic to say that current->bio_list is the
> "cause" of the problem.  It is certainly a part of the problem.  The
> assumption you mention is another part - and the two conflict.
> 
> As you say, we could abandon current->bio_list, but then we risk stack
> overflows again.
> Or we could hand the bio_list to a work-queue whenever the make_request
> function needs to schedule.... but then if handling one of those bios
> needs to schedule...  not sure, it might work.
> 
> Or we could change the assumption and never block in a make_request
> function after calling generic_make_request().  Maybe that is difficult.
> 
> Or we could change __split_and_process_bio to use bio_split() to split
> the bio, then handle the first and call generic_make_request on the
> second.  That would effectively put the second half on the end of
> bio_list so it wouldn't be tried until all requests derived from the
> first half have been handled.
> 
> None of these is completely straightforward, but I suspect all are
> possible.
> 
> I'm leaning towards the last one:  when you want to split a bio, use
> bio_split and handle the two halves separately.

Sounds worth pursuing.

> Do you have thoughts on that?

Once the late bio splitting work is merged for 4.3, I (along with Joe,
yourself, and anyone interested) hope to review/audit/fix DM core to make
proper use of bio_split() et al.

So any work in this area needs to be based on the late-bio-splitting
patchset/branch that mlin has been pushing.

Mikulas Patocka Aug. 17, 2015, 1:48 p.m. UTC | #5
On Thu, 13 Aug 2015, NeilBrown wrote:

> On Wed, 12 Aug 2015 12:25:42 -0400 (EDT) Mikulas Patocka
> <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Wed, 12 Aug 2015, NeilBrown wrote:
> > 
> > > On Tue, 11 Aug 2015 05:14:33 -0400 (EDT) Mikulas Patocka
> > > <mpatocka@redhat.com> wrote:
> > > 
> > > > Hi
> > > > 
> > > > On Mon, 10 Aug 2015, NeilBrown wrote:
> > > > 
> > > > > 
> > > > > Hi Mikulas,
> > > > >  I have a customer hitting the deadlock you described over a year ago
> > > > >  in:
> > > > > 
> > > > > Subject: [dm-devel] [PATCH] block: flush queued bios when the process
> > > > >          blocks
> > > > 
> > > > Ask block layer maintainers to accept that patch.
> > > 
> > > Unfortunately I don't really like the patch ... or the bioset rescue
> > > workqueues that it is based on.   Sorry.
> > > 
> > > So I might keep looking for a more localised approach....
> > 
> > The problem here is that other dm targets may deadlock in a similar way 
> > too - for example, dm-thin could deadlock on pmd->pool_lock.
> > 
> > The cause of the bug is bio queuing on current->bio_list. There is an 
> > assumption that if a dm target submits a bio to a lower-level target, the 
> > bio finishes in finite time. Queuing on current->bio_list breaks the 
> > assumption: bios can be held indefinitely on current->bio_list.
> > 
> > The patch that flushes current->bio_list is the correct way to fix it - it 
> > makes sure that a bio can't be held indefinitely.
> > 
> > Another way to fix it would be to abandon current->bio_list --- but then, 
> > there would be a problem with stack overflow on deeply nested targets.
> > 
> 
> I think it is a bit simplistic to say that current->bio_list is the
> "cause" of the problem.  It is certainly a part of the problem.  The
> assumption you mention is another part - and the two conflict.
> 
> As you say, we could abandon current->bio_list, but then we risk stack
> overflows again.
> Or we could hand the bio_list to a work-queue whenever the make_request
> function needs to schedule.... but then if handling one of those bios
> needs to schedule...  not sure, it might work.
> 
> Or we could change the assumption and never block in a make_request
> function after calling generic_make_request().  Maybe that is difficult.
> 
> Or we could change __split_and_process_bio to use bio_split() to split
> the bio, then handle the first and call generic_make_request on the
> second.  That would effectively put the second half on the end of
> bio_list so it wouldn't be tried until all requests derived from the
> first half have been handled.

I don't think it will fix the bug - even if you put the second half of the 
bio at the end of bio_list, it will still wait until other entries on the 
bio list are processed.

For example - device 1 gets a bio, splits it into bio1 and bio2, forwards 
them to device 2 and puts them on current->bio_list.

Device 2 receives bio1 (popped from current->bio_list), splits it into bio3 
and bio4, forwards them to device 3 and puts them at the end of 
current->bio_list.

Device 2 receives bio2 (popped from current->bio_list), waits until bio1 
finishes, but bio1 won't ever finish because it depends on bio3 and bio4 
that are on current->bio_list.
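
A toy userspace model of that ordering (a plain FIFO standing in for
current->bio_list, nothing dm-specific) shows the cycle:

#include <stdio.h>
#include <string.h>

static const char *queue[16];		/* stands in for current->bio_list */
static int head, tail;

static void push(const char *bio) { queue[tail++] = bio; }
static const char *pop(void)      { return head < tail ? queue[head++] : NULL; }

int main(void)
{
	const char *bio;
	int i;

	/* device 1 splits the original bio into bio1 + bio2 and queues both */
	push("bio1");
	push("bio2");

	while ((bio = pop()) != NULL) {
		if (strcmp(bio, "bio1") == 0) {
			/* device 2 splits bio1 into bio3 + bio4, queued at the back */
			push("bio3");
			push("bio4");
		} else if (strcmp(bio, "bio2") == 0) {
			/*
			 * device 2 must wait for bio1, but bio1 completes only
			 * after bio3 and bio4 - and they are still behind us:
			 */
			printf("blocked waiting for bio1; still queued:");
			for (i = head; i < tail; i++)
				printf(" %s", queue[i]);
			printf("\n");
			return 0;	/* in the kernel, this is the deadlock */
		}
	}
	return 0;
}

It prints "blocked waiting for bio1; still queued: bio3 bio4" - bio2's
handler can never be woken because the bios it transitively depends on are
stuck behind it on the same list.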

> None of these is completely straightforward, but I suspect all are
> possible.
> 
> I'm leaning towards the last one:  when you want to split a bio, use
> bio_split and handle the two halves separately.
> 
> Do you have thoughts on that?
> 
> Thanks,
> NeilBrown

Mikulas


Patch

diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index 7c82d3ccce87..c7a7ed2cfd6d 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1461,14 +1461,22 @@  static void pending_complete(struct dm_snap_pending_exception *pe, int success)
 		goto out;
 	}
 
-	/* Check for conflicting reads */
-	__check_for_conflicting_io(s, pe->e.old_chunk);
+	/* Add proper exception.  Now all reads will be
+	 * redirected, so no new reads can be started on
+	 * 'old_chunk'.
+	 */
+	dm_insert_exception(&s->complete, e);
+
+	/* Drain conflicting reads */
+	if (__chunk_is_tracked(s, pe->e.old_chunk)) {
+		up_write(&s->lock);
+		__check_for_conflicting_io(s, pe->e.old_chunk);
+		down_write(&s->lock);
+	}
 
 	/*
-	 * Add a proper exception, and remove the
-	 * in-flight exception from the list.
+	 * And now we can remove the temporary exception.
 	 */
-	dm_insert_exception(&s->complete, e);
 
 out:
 	dm_remove_exception(&pe->e);