
RFC for multipath queue_if_no_path timeout.

Message ID 52453C76.6060403@suse.de (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer

Commit Message

Hannes Reinecke Sept. 27, 2013, 8:06 a.m. UTC
On 09/27/2013 08:07 AM, Hannes Reinecke wrote:
> On 09/27/2013 01:49 AM, Mike Snitzer wrote:
>> On Thu, Sep 26 2013 at  7:22pm -0400,
>> Alasdair G Kergon <agk@redhat.com> wrote:
>>
>>> On Thu, Sep 26, 2013 at 10:47:13AM -0700, Frank Mayhar wrote:
>>>> Launching it from ramdisk won't help, particularly, since it still goes
>>>> through the block layer.  The other stuff won't help if a (potentially
>>>> unrelated) bug in the daemon happens to be being tickled at the same
>>>> time, or if some dependency happens to be broken and _that's_ what's
>>>> preventing the daemon from making progress.
>>>  
>>> Then put more effort into debugging your daemon so it doesn't have
>>> bugs that make it die?  Implement the timeout in a robust independent
>>> daemon if it's other code there that's unreliable?
>>>
>>>> And as far as lvm2 and multipath-tools, yeah, they cope okay in the kind
>>>> of environments most people have, but that's not the kind of environment
>>>> (or scale) we have to deal with.
>>>
>>> In what way are your requirements so different that a locked-into-memory
>>> monitoring daemon cannot implement this timeout?
>>
>> Frank, I had a look at your patch.  It leaves a lot to be desired, I was
>> starting to clean it up but ultimately found myself agreeing with
>> Alasdair's original point: that this policy should be implemented in the
>> userspace daemon.
>>
> _Actually_ there is a way this could be implemented properly:
> implement a blk_timeout function.
> 
> The thing is, every request_queue may have a timeout function
> implemented, whose goal is to abort requests which have exceeded
> that timeout. E.g. SCSI uses this for the dev_loss_tmo mechanism.
> 
> Multipath, being request-based, could easily implement the same
> mechanism, namely a blk_timeout function which would just re-arm
> the timer in the default case, but abort any queued I/O (after a
> timeout) if all paths are down.
> 
> Hmm. Let me draft up a PoC.
> 
And indeed, here it is.

Completely untested, just to give you an idea of what I was going on
about. Let's see if I can put this to the test somewhere...
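
For reference, a minimal sketch of the block-layer timeout hook being
described above, assuming the (pre-blk-mq) API of this era; the
example_* names are hypothetical:

#include <linux/blkdev.h>

/* Hypothetical rq_timed_out_fn: the block layer calls this for each
 * in-flight request that exceeds the queue's timeout. */
static enum blk_eh_timer_return example_timed_out(struct request *req)
{
	/*
	 * BLK_EH_RESET_TIMER re-arms the timer and lets the request
	 * keep running; BLK_EH_NOT_HANDLED means the driver has dealt
	 * with (or will terminate) the request itself.
	 */
	return BLK_EH_RESET_TIMER;
}

/* Hypothetical queue setup: register the handler and arm the timer. */
static void example_init_queue(struct request_queue *q)
{
	blk_queue_rq_timed_out(q, example_timed_out);
	blk_queue_rq_timeout(q, 30 * HZ);	/* expire after 30 seconds */
}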

Cheers,

Hannes

Comments

Alasdair G Kergon Sept. 27, 2013, 8:37 a.m. UTC | #1
But this still dodges the fundamental problem:

  What is the right value to use for the timeout?
  - How long should you wait for a path to (re)appear?
    - In the current model, reinstating a path is a userspace 
      responsibility.

The timeout, as proposed, is being used in two conflicting ways:
  - How long to wait for path recovery when all paths went down
  - How long to wait when the system locks without enough free
    memory even to reinstate a path (because of broken userspace
    code) before having multipath fail queued I/O in a desperate
    attempt at releasing memory to assist recovery
 
The second case should point to a very short timeout.
The first case probably wants a longer one.

In my view the correct approach for the case Frank is discussing is to
use a different trigger to detect the (approaching?) locking up of the
system.   E.g.  should something related to the handling of an out
of memory condition have a hook to instruct multipath to release such
queued I/O?

Alasdair

Hannes Reinecke Sept. 27, 2013, 1:52 p.m. UTC | #2
On 09/27/2013 10:37 AM, Alasdair G Kergon wrote:
> But this still dodges the fundamental problem:
> 
>   What is the right value to use for the timeout?
>   - How long should you wait for a path to (re)appear?
>     - In the current model, reinstating a path is a userspace 
>       responsibility.
> 
And with my proposed patch it would still be userspace which is
setting the timeout.
Currently, no_path_retry is not a proper measure anyway, as it
depends on the time multipathd takes to complete a path-check
round, which in turn depends on the number of devices, their state, etc.
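
To put numbers on that (assuming multipath-tools' default
polling_interval of 5 seconds): no_path_retry 12 means "keep queueing
for 12 checker rounds", nominally about 60 seconds, but only if
multipathd actually completes a round every 5 seconds, which it may
not when checking many devices or when starved of memory.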

> The timeout, as proposed, is being used in two conflicting ways:
>   - How long to wait for path recovery when all paths went down

That would be configured via the new 'no_path_timeout' feature,
which would be used instead of the (multipath-internal)
no_path_retry setting.
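
For illustration, a hypothetical features fragment of such a
multipath table (syntax derived from the parse_features() hunk in the
patch below; the surrounding table arguments are omitted):

	3 queue_if_no_path no_path_timeout 60

i.e. queue I/O while all paths are down, but have the kernel itself
fail everything still queued after 60 seconds.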

>   - How long to wait when the system locks without enough free
>     memory even to reinstate a path (because of broken userspace
>     code) before having multipath fail queued I/O in a desperate
>     attempt at releasing memory to assist recovery
>  
Do we even handle that case currently?
Methinks this is precisely the use-case this is supposed to address.
When 'no_path_retry' is set _and_ we're running under a low-memory
condition, there is quite a large likelihood that the multipath
daemon will be killed by the OOM killer, or will be unable to send
any dm messages down to the kernel, as the latter most likely
requires some memory allocations.

So in the current 'no_path_retry' scenario the maps would have been
created with 'queue_if_no_path', and the daemon would have to reset
the 'queue_if_no_path' flag when the no_path_retry value expires.
Which it might not be able to do, due to the above scenario.

So with the proposed 'no_path_timeout' we would enable the dm-mpath
module to terminate all outstanding I/O, irrespective of any
userland conditions. Which seems like an improvement to me ...
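
For context, the daemon-side reset in question is a dm message;
paraphrased from multipath_message() in drivers/md/dm-mpath.c of this
era, it boils down to:

	/* "fail_if_no_path" arrives as a dm message from multipathd and
	 * clears queue_if_no_path -- exactly the call that may never
	 * arrive if the daemon is OOM-killed or cannot allocate the
	 * memory needed to send it. */
	if (!strcasecmp(argv[0], "fail_if_no_path"))
		r = queue_if_no_path(m, 0, 0);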

> The second case should point to a very short timeout.
> The first case probably wants a longer one.
> 
> In my view the correct approach for the case Frank is discussing is to
> use a different trigger to detect the (approaching?) locking up of the
> system.   E.g.  should something related to the handling of an out
> of memory condition have a hook to instruct multipath to release such
> queued I/O?
> 
Yeah, that was what I had planned for quite some time.
But thinking it over, the no_path_timeout seems like a better
approach here.

(Plus we're hooking into the generic 'blk_timeout' mechanism, which
would then allow blk_abort_request() to work.)
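
For reference, blk_abort_request() (block/blk-timeout.c) expires a
single in-flight request immediately, causing the queue's
rq_timed_out_fn to be invoked without waiting for the timer; a
minimal sketch of calling it:

	#include <linux/blkdev.h>

	/* Sketch: force an immediate timeout of one request; the block
	 * layer then runs the queue's rq_timed_out_fn for it. */
	static void example_fast_fail(struct request *req)
	{
		blk_abort_request(req);
	}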

Cheers,

Hannes
Frank Mayhar Sept. 27, 2013, 4:32 p.m. UTC | #3
On Fri, 2013-09-27 at 10:06 +0200, Hannes Reinecke wrote:
> On 09/27/2013 08:07 AM, Hannes Reinecke wrote:
> > On 09/27/2013 01:49 AM, Mike Snitzer wrote:
> >> On Thu, Sep 26 2013 at  7:22pm -0400,
> >> Alasdair G Kergon <agk@redhat.com> wrote:
> >>
> >>> On Thu, Sep 26, 2013 at 10:47:13AM -0700, Frank Mayhar wrote:
> >>>> Launching it from ramdisk won't help, particularly, since it still goes
> >>>> through the block layer.  The other stuff won't help if a (potentially
> >>>> unrelated) bug in the daemon happens to be being tickled at the same
> >>>> time, or if some dependency happens to be broken and _that's_ what's
> >>>> preventing the daemon from making progress.
> >>>  
> >>> Then put more effort into debugging your daemon so it doesn't have
> >>> bugs that make it die?  Implement the timeout in a robust independent
> >>> daemon if it's other code there that's unreliable?
> >>>
> >>>> And as far as lvm2 and multipath-tools, yeah, they cope okay in the kind
> >>>> of environments most people have, but that's not the kind of environment
> >>>> (or scale) we have to deal with.
> >>>
> >>> In what way are your requirements so different that a locked-into-memory
> >>> monitoring daemon cannot implement this timeout?
> >>
> >> Frank, I had a look at your patch.  It leaves a lot to be desired, I was
> >> starting to clean it up but ultimately found myself agreeing with
> >> Alasdair's original point: that this policy should be implemented in the
> >> userspace daemon.
> >>
> > _Actually_ there is a way this could be implemented properly:
> > implement a blk_timeout function.
> > 
> > The thing is, every request_queue may have a timeout function
> > implemented, whose goal is to abort requests which have exceeded
> > that timeout. E.g. SCSI uses this for the dev_loss_tmo mechanism.
> > 
> > Multipath, being request-based, could easily implement the same
> > mechanism, namely a blk_timeout function which would just re-arm
> > the timer in the default case, but abort any queued I/O (after a
> > timeout) if all paths are down.
> > 
> > Hmm. Let me draft up a PoC.
> > 
> And indeed, here it is.
> 
> Completely untested, just to give you an idea of what I was going on
> about. Let's see if I can put this to the test somewhere...

Thanks, Hannes!  I'll grab this and test it today.  I clearly don't know
enough about the block layer, since using blk_timeout never even crossed
my mind.
Frank Mayhar Sept. 27, 2013, 4:37 p.m. UTC | #4
On Fri, 2013-09-27 at 15:52 +0200, Hannes Reinecke wrote:
> On 09/27/2013 10:37 AM, Alasdair G Kergon wrote:
> > But this still dodges the fundamental problem:
> > 
> >   What is the right value to use for the timeout?
> >   - How long should you wait for a path to (re)appear?
> >     - In the current model, reinstating a path is a userspace 
> >       responsibility.
> > 
> And with my proposed patch it would still be userspace which is
> setting the timeout.
> Currently, no_path_retry is not a proper measure anyway, as it
> depends on the time multipathd takes to complete a path-check
> round, which in turn depends on the number of devices, their state, etc.
> 
> > The timeout, as proposed, is being used in two conflicting ways:
> >   - How long to wait for path recovery when all paths went down
> 
> That would be configured via the new 'no_path_timeout' feature,
> which would be used instead of the (multipath-internal)
> no_path_retry setting.

Yes, this matches our setup as well.

> >   - How long to wait when the system locks without enough free
> >     memory even to reinstate a path (because of broken userspace
> >     code) before having multipath fail queued I/O in a desperate
> >     attempt at releasing memory to assist recovery
> Do we even handle that case currently?

My understanding is that the current code doesn't, no, but if it does I
would love to know how.

> Methinks this is precisely the use-case this is supposed to address.

Yes, exactly.

> When 'no_path_retry' is set _and_ we're running under a low-memory
> condition, there is quite a large likelihood that the multipath
> daemon will be killed by the OOM killer, or will be unable to send
> any dm messages down to the kernel, as the latter most likely
> requires some memory allocations.
> 
> So in the current 'no_path_retry' scenario the maps would have been
> created with 'queue_if_no_path', and the daemon would have to reset
> the 'queue_if_no_path' flag when the no_path_retry value expires.
> Which it might not be able to do, due to the above scenario.
> 
> So with the proposed 'no_path_timeout' we would enable the dm-mpath
> module to terminate all outstanding I/O, irrespective of any
> userland conditions. Which seems like an improvement to me ...

And to me, which is why I went in this direction in the first place.  I
could see no dependable way to deal with this outside of the kernel; if I
had, I would have taken it, since userspace changes are _much_ easier
for us to deal with than kernel changes.

Patch

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 5adede1..6801ac3 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -444,6 +444,61 @@  static int queue_if_no_path(struct multipath *m, unsigned queue_if_no_path,
 	return 0;
 }
 
+/*
+ * Block timeout callback, called from the block layer
+ *
+ * request_queue lock is held on entry.
+ *
+ * Return values:
+ * BLK_EH_RESET_TIMER if the request should be left running
+ * BLK_EH_NOT_HANDLED if the request is handled or terminated
+ *                    by the driver.
+ */
+static enum blk_eh_timer_return abort_if_no_path(struct request *req)
+{
+	union map_info *info;
+	struct dm_mpath_io *mpio;
+	struct multipath *m;
+	unsigned long flags;
+	enum blk_eh_timer_return rc = BLK_EH_RESET_TIMER;
+	int flush_ios = 0;
+
+	info = dm_get_rq_mapinfo(req);
+	if (!info || !info->ptr)
+		return BLK_EH_NOT_HANDLED;
+
+	mpio = info->ptr;
+	m = mpio->pgpath->pg->m;
+	/*
+	 * Only abort request if:
+	 * - queued_ios is not empty
+	 *   (protect against races with process_queued_ios)
+	 * - queue_io is not set
+	 * - no valid paths are found
+	 */
+	spin_lock_irqsave(&m->lock, flags);
+	if (!list_empty(&m->queued_ios) &&
+	    !m->queue_io &&
+	    !m->nr_valid_paths) {
+		list_del_init(&req->queuelist);
+		m->queue_size--;
+		m->queue_if_no_path = 0;
+		if (m->queue_size)
+			flush_ios = 1;
+		rc = BLK_EH_NOT_HANDLED;
+	}
+	spin_unlock_irqrestore(&m->lock, flags);
+
+	if (rc == BLK_EH_NOT_HANDLED) {
+		mempool_free(mpio, m->mpio_pool);
+		dm_kill_unmapped_request(req, -ETIMEDOUT);
+	}
+	if (flush_ios)
+		queue_work(kmultipathd, &m->process_queued_ios);
+
+	return rc;
+}
+
 /*-----------------------------------------------------------------
  * The multipath daemon is responsible for resubmitting queued ios.
  *---------------------------------------------------------------*/
@@ -790,6 +845,7 @@  static int parse_features(struct dm_arg_set *as, struct multipath *m)
 		{0, 6, "invalid number of feature args"},
 		{1, 50, "pg_init_retries must be between 1 and 50"},
 		{0, 60000, "pg_init_delay_msecs must be between 0 and 60000"},
+		{0, 65535, "no_path_timeout must be between 0 and 65535"},
 	};
 
 	r = dm_read_arg_group(_args, as, &argc, &ti->error);
@@ -827,6 +883,13 @@  static int parse_features(struct dm_arg_set *as, struct multipath *m)
 			continue;
 		}
 
+		if (!strcasecmp(arg_name, "no_path_timeout") &&
+		    (argc >= 1)) {
+			r = dm_read_arg(_args + 3, as, &m->no_path_timeout, &ti->error);
+			argc--;
+			continue;
+		}
+
 		ti->error = "Unrecognised multipath feature request";
 		r = -EINVAL;
 	} while (argc && !r);
@@ -905,6 +968,12 @@  static int multipath_ctr(struct dm_target *ti, unsigned int argc,
 		goto bad;
 	}
 
+	if (m->no_path_timeout)
+		dm_set_queue_timeout(ti, abort_if_no_path,
+				     m->no_path_timeout * HZ);
+	else
+		dm_set_queue_timeout(ti, NULL, 0);
+
 	ti->num_flush_bios = 1;
 	ti->num_discard_bios = 1;
 	ti->num_write_same_bios = 1;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9e39d2b..26bfad6 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1881,6 +1881,16 @@  static void dm_init_md_queue(struct mapped_device *md)
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
 }
 
+static void dm_set_md_timeout(struct mapped_device *md,
+			      rq_timed_out_fn *fn, unsigned int timeout)
+{
+	if (dm_get_md_type(md) != DM_TYPE_REQUEST_BASED)
+		return;
+
+	blk_queue_rq_timed_out(md->queue, fn);
+	blk_queue_rq_timeout(md->queue, timeout);
+}
+
 /*
  * Allocate and initialise a blank device with a given minor.
  */
@@ -2790,6 +2800,13 @@  int dm_noflush_suspending(struct dm_target *ti)
 }
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
+void dm_set_queue_timeout(struct dm_target *ti, rq_timed_out_fn *fn,
+			 unsigned int timeout)
+{
+	dm_set_md_timeout(dm_table_get_md(ti->table), fn, timeout);
+}
+EXPORT_SYMBOL_GPL(dm_set_queue_timeout);
+
 struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity, unsigned per_bio_data_size)
 {
 	struct dm_md_mempools *pools = kzalloc(sizeof(*pools), GFP_KERNEL);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 45b97da..c8df1ef 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -80,6 +80,8 @@  void dm_unlock_md_type(struct mapped_device *md);
 void dm_set_md_type(struct mapped_device *md, unsigned type);
 unsigned dm_get_md_type(struct mapped_device *md);
 struct target_type *dm_get_immutable_target_type(struct mapped_device *md);
+void dm_set_queue_timeout(struct dm_target *ti, rq_timed_out_fn *fn,
+			  unsigned int timeout);
 
 int dm_setup_md_queue(struct mapped_device *md);