Message ID: 20110411145022.710c30e9@notabene.brown (mailing list archive)
State: Not Applicable, archived
On 2011-04-11 06:50, NeilBrown wrote:
> On Tue, 5 Apr 2011 13:05:41 +1000 NeilBrown <neilb@suse.de> wrote:
>
>> On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer <snitzer@redhat.com> wrote:
>>
>>> Also, in your MD changes, you removed all calls to md_unplug() but
>>> didn't remove md_unplug(). Seems it should be removed along with the
>>> 'plug' member of 'struct mddev_t'? Neil?
>>
>> I've been distracted by other things and only just managed to have a look at
>> this.
>>
>> The new plugging code seems to completely ignore the needs of stacked devices
>> - or at least my needs in md.
>>
>> For RAID1 with a write-intent-bitmap, I queue all write requests and then on
>> an unplug I update the write-intent-bitmap to mark all the relevant blocks
>> and then release the writes.
>>
>> With the new code there is no way for an unplug event to wake up the raid1d
>> thread to start the writeout - I haven't tested it but I suspect it will just
>> hang.
>>
>> Similarly for RAID5 I gather write bios (long before they become 'struct
>> request', which is what the plugging code understands) and on an unplug event
>> I release the writes - hopefully with enough bios per stripe so that we don't
>> need to pre-read.
>>
>> Possibly the simplest fix would be to have a second list_head in 'struct
>> blk_plug' which contained callbacks (a function pointer and a list_head in a
>> struct which is passed as an arg to the function!).
>> blk_finish_plug could then walk the list and call the call-backs.
>> It would be quite easy to hook into that.
>
> I've implemented this and it seems to work.
> Jens: could you please review and hopefully ack the patch below, and let
> me know if you will submit it or should I?
>
> My testing of this combined with some other patches which cause various md
> personalities to use it shows up a bug somewhere.
>
> The symptoms are crashes in various places in blk-core and sometimes
> elevator.c; list_sort occurs fairly often in the stack trace, but not always.
>
> This patch
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 273d60b..903ce8d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
>  	struct request_queue *q;
>  	unsigned long flags;
>  	struct request *rq;
> +	struct list_head head;
>
>  	BUG_ON(plug->magic != PLUG_MAGIC);
>
>  	if (list_empty(&plug->list))
>  		return;
> +	list_add(&head, &plug->list);
> +	list_del_init(&plug->list);
>
>  	if (plug->should_sort)
> -		list_sort(NULL, &plug->list, plug_rq_cmp);
> +		list_sort(NULL, &head, plug_rq_cmp);
> +	plug->should_sort = 0;
>
>  	q = NULL;
>  	local_irq_save(flags);
> -	while (!list_empty(&plug->list)) {
> -		rq = list_entry_rq(plug->list.next);
> +	while (!list_empty(&head)) {
> +		rq = list_entry_rq(head.next);
>  		list_del_init(&rq->queuelist);
>  		BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG));
>  		BUG_ON(!rq->q);
>
> makes the symptom go away. It simply moves the plug list onto a separate
> list head before sorting and processing it.
> My test was simply writing to a RAID1 with dd:
>
>   while true; do dd if=/dev/zero of=/dev/md0 size=4k ; done
>
> Obviously all writes go to two devices so the plug list will always need
> sorting.
>
> The only explanation I can come up with is that very occasionally schedule on
> 2 separate cpus calls blk_flush_plug for the same task. I don't understand
> the scheduler nearly well enough to know if or how that can happen.
> However with this patch in place I can write to a RAID1 constantly for half
> an hour, and without it, the write rarely lasts for 3 minutes.

Or perhaps if the request_fn blocks, that would be problematic. So the
patch is likely a good idea even for that case.

I'll merge it, changing it to list_splice_init() as I think that would
be more clear.

> From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Thu, 7 Apr 2011 13:16:59 +1000
> Subject: [PATCH] Enhance new plugging support to support general callbacks.
>
> md/raid requires an unplug callback, but as it does not use
> requests the current code cannot provide one.
>
> So allow arbitrary callbacks to be attached to the blk_plug.
>
> Cc: Jens Axboe <jaxboe@fusionio.com>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  block/blk-core.c       |   13 +++++++++++++
>  include/linux/blkdev.h |    7 ++++++-
>  2 files changed, 19 insertions(+), 1 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 725091d..273d60b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
>
>  	plug->magic = PLUG_MAGIC;
>  	INIT_LIST_HEAD(&plug->list);
> +	INIT_LIST_HEAD(&plug->cb_list);
>  	plug->should_sort = 0;
>
>  	/*
> @@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
>  	local_irq_restore(flags);
>  }
>
> +static void flush_plug_callbacks(struct blk_plug *plug)
> +{
> +	while (!list_empty(&plug->cb_list)) {
> +		struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
> +							  struct blk_plug_cb,
> +							  list);
> +		list_del(&cb->list);
> +		cb->callback(cb);
> +	}
> +}
> +
>  static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
>  {
>  	flush_plug_list(plug);
> +	flush_plug_callbacks(plug);
>
>  	if (plug == tsk->plug)
>  		tsk->plug = NULL;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 32176cc..3e5e604 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
>  struct blk_plug {
>  	unsigned long magic;
>  	struct list_head list;
> +	struct list_head cb_list;
>  	unsigned int should_sort;
>  };
> +struct blk_plug_cb {
> +	struct list_head list;
> +	void (*callback)(struct blk_plug_cb *);
> +};
>
>  extern void blk_start_plug(struct blk_plug *);
>  extern void blk_finish_plug(struct blk_plug *);
> @@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>  {
>  	struct blk_plug *plug = tsk->plug;
>
> -	return plug && !list_empty(&plug->list);
> +	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
>  }
>
>  /*

Maybe I'm missing something, but why do you need those callbacks? If
it's to use plugging yourself, perhaps we can just ensure that those
don't get assigned in the task - so it would have to be used with care.

It's not that I disagree with these callbacks, I just want to ensure I
understand why you need them.
On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-04-11 06:50, NeilBrown wrote:
>
>> The only explanation I can come up with is that very occasionally schedule on
>> 2 separate cpus calls blk_flush_plug for the same task. I don't understand
>> the scheduler nearly well enough to know if or how that can happen.
>> However with this patch in place I can write to a RAID1 constantly for half
>> an hour, and without it, the write rarely lasts for 3 minutes.
>
> Or perhaps if the request_fn blocks, that would be problematic. So the
> patch is likely a good idea even for that case.
>
> I'll merge it, changing it to list_splice_init() as I think that would
> be more clear.

OK - though I'm not 100% sure the patch fixes the problem - just that it
hides the symptom for me.
I might try instrumenting the code a bit more and see if I can find exactly
where it is re-entering flush_plug_list - as that seems to be what is
happening.

And yeah - list_split_init is probably better. I just never remember exactly
what list_split means and have to look it up every time, whereas
list_add/list_del are very clear to me.

>> [Neil's callback patch quoted in full - snipped]
>
> Maybe I'm missing something, but why do you need those callbacks? If
> it's to use plugging yourself, perhaps we can just ensure that those
> don't get assigned in the task - so it would have to be used with care.
>
> It's not that I disagree with these callbacks, I just want to ensure I
> understand why you need them.

I'm sure one of us is missing something (probably both) but I'm not
sure what.

The callback is central.

It is simply to use plugging in md.
Just like blk-core, md will notice that a blk_plug is active and will put
requests aside. I then need something to call in to md when blk_finish_plug
is called so that put-aside requests can be released.
As md can be built as a module, that call must be a call-back of some sort.
blk-core doesn't need to register blk_plug_flush because that is never in a
module, so it can be called directly. But the md equivalent could be in a
module, so I need to be able to register a call back.

Does that help?

Thanks,
NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On 2011-04-11 12:59, NeilBrown wrote:
> On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
>
>> On 2011-04-11 06:50, NeilBrown wrote:
>>
>>> The only explanation I can come up with is that very occasionally schedule on
>>> 2 separate cpus calls blk_flush_plug for the same task. [...]
>>
>> Or perhaps if the request_fn blocks, that would be problematic. So the
>> patch is likely a good idea even for that case.
>>
>> I'll merge it, changing it to list_splice_init() as I think that would
>> be more clear.
>
> OK - though I'm not 100% sure the patch fixes the problem - just that it
> hides the symptom for me.
> I might try instrumenting the code a bit more and see if I can find exactly
> where it is re-entering flush_plug_list - as that seems to be what is
> happening.

It's definitely a good thing to add, to avoid the list fudging on
schedule. Whether it's your exact problem, I can't tell.

> And yeah - list_split_init is probably better. I just never remember exactly
> what list_split means and have to look it up every time, whereas
> list_add/list_del are very clear to me.

splice, no split :-)

>>> [Neil's callback patch quoted in full - snipped]
>>
>> Maybe I'm missing something, but why do you need those callbacks? If
>> it's to use plugging yourself, perhaps we can just ensure that those
>> don't get assigned in the task - so it would have to be used with care.
>>
>> It's not that I disagree with these callbacks, I just want to ensure I
>> understand why you need them.
>
> I'm sure one of us is missing something (probably both) but I'm not
> sure what.
>
> The callback is central.
>
> It is simply to use plugging in md.
> Just like blk-core, md will notice that a blk_plug is active and will put
> requests aside. I then need something to call in to md when blk_finish_plug

But this is done in __make_request(), so md devices should not be
affected at all. This is the part of your explanation that I do not
connect with the code.

If md itself is putting things on the plug list, why is it doing that?

> is called so that put-aside requests can be released.
> As md can be built as a module, that call must be a call-back of some sort.
> blk-core doesn't need to register blk_plug_flush because that is never in a
> module, so it can be called directly. But the md equivalent could be in a
> module, so I need to be able to register a call back.
>
> Does that help?

Not really. Is the problem that _you_ would like to stash things aside,
not the fact that __make_request() puts things on a task plug list?
On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

>> I'm sure one of us is missing something (probably both) but I'm not
>> sure what.
>>
>> The callback is central.
>>
>> It is simply to use plugging in md.
>> Just like blk-core, md will notice that a blk_plug is active and will put
>> requests aside. I then need something to call in to md when blk_finish_plug
>
> But this is done in __make_request(), so md devices should not be
> affected at all. This is the part of your explanation that I do not
> connect with the code.
>
> If md itself is putting things on the plug list, why is it doing that?

Yes. Exactly. md itself wants to put things aside on some list.
e.g. in RAID1 when using a write-intent bitmap I want to gather as many write
requests as possible so I can update the bits for all of them at once.
So when a plug is in effect I just queue the bios somewhere and record the
bits that need to be set.
Then when the unplug happens I write out the bitmap updates in a single write
and when that completes, I write out the data (to all devices).

Also in RAID5 it is good if I can wait for lots of write requests to arrive
before committing any of them, to increase the possibility of getting a
full-stripe write.

Previously I used ->unplug_fn to release the queued requests. Now that has
gone I need a different way to register a callback when an unplug happens.

>> is called so that put-aside requests can be released.
>> As md can be built as a module, that call must be a call-back of some sort.
>> blk-core doesn't need to register blk_plug_flush because that is never in a
>> module, so it can be called directly. But the md equivalent could be in a
>> module, so I need to be able to register a call back.
>>
>> Does that help?
>
> Not really. Is the problem that _you_ would like to stash things aside,
> not the fact that __make_request() puts things on a task plug list?

Yes, exactly. I (in md) want to stash things aside.

(I don't actually put the stashed things on the blk_plug, though it might
make sense to do that later in some cases - I'm not sure. Currently I stash
things in my own internal lists and just need a call back to say "ok, flush
those lists now").

Thanks,
NeilBrown
On 2011-04-11 13:26, NeilBrown wrote:
> On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
>
> [...]
>
>> If md itself is putting things on the plug list, why is it doing that?
>
> Yes. Exactly. md itself wants to put things aside on some list.
> e.g. in RAID1 when using a write-intent bitmap I want to gather as many write
> requests as possible so I can update the bits for all of them at once.
> So when a plug is in effect I just queue the bios somewhere and record the
> bits that need to be set.
> Then when the unplug happens I write out the bitmap updates in a single write
> and when that completes, I write out the data (to all devices).
>
> Also in RAID5 it is good if I can wait for lots of write requests to arrive
> before committing any of them, to increase the possibility of getting a
> full-stripe write.
>
> Previously I used ->unplug_fn to release the queued requests. Now that has
> gone I need a different way to register a callback when an unplug happens.

Ah, so this is what I was hinting at. But why use the task->plug for
that? Seems a bit counter intuitive. Why can't you just store these
internally?

>> Not really. Is the problem that _you_ would like to stash things aside,
>> not the fact that __make_request() puts things on a task plug list?
>
> Yes, exactly. I (in md) want to stash things aside.
>
> (I don't actually put the stashed things on the blk_plug, though it might
> make sense to do that later in some cases - I'm not sure. Currently I stash
> things in my own internal lists and just need a call back to say "ok, flush
> those lists now").

So we are making some progress... The thing I then don't understand is
why you want to make it associated with the plug? Seems you don't have
any scheduling restrictions, in which case just storing them in md
seems like a much better option.
On Mon, 11 Apr 2011 20:59:28 +1000 NeilBrown <neilb@suse.de> wrote:

> On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
>
>> On 2011-04-11 06:50, NeilBrown wrote:
>>
>>> The only explanation I can come up with is that very occasionally schedule on
>>> 2 separate cpus calls blk_flush_plug for the same task. I don't understand
>>> the scheduler nearly well enough to know if or how that can happen.
>>> However with this patch in place I can write to a RAID1 constantly for half
>>> an hour, and without it, the write rarely lasts for 3 minutes.
>>
>> Or perhaps if the request_fn blocks, that would be problematic. So the
>> patch is likely a good idea even for that case.
>>
>> I'll merge it, changing it to list_splice_init() as I think that would
>> be more clear.
>
> OK - though I'm not 100% sure the patch fixes the problem - just that it
> hides the symptom for me.
> I might try instrumenting the code a bit more and see if I can find exactly
> where it is re-entering flush_plug_list - as that seems to be what is
> happening.

OK, I found how it re-enters.

The request_fn doesn't exactly block, but when scsi_request_fn calls
spin_unlock_irq, this calls preempt_enable which can call schedule, which is
a recursive call.

The patch I provided will stop that from recursing again as the blk_plug.list
will be empty.

So it is almost what you suggested; however the request_fn doesn't block, it
just enables preempt.

So the comment I would put at the top of that patch would be something like:

  From: NeilBrown <neilb@suse.de>

  As the request_fn called by __blk_run_queue is allowed to 'schedule()'
  (after dropping the queue lock of course), it is possible to get a
  recursive call:

    schedule -> blk_flush_plug -> __blk_finish_plug -> flush_plug_list
      -> __blk_run_queue -> request_fn -> schedule

  We must make sure that the second schedule does not call into
  blk_flush_plug again. So instead of leaving the list of requests on
  blk_plug->list, move them to a separate list leaving blk_plug->list
  empty.

  Signed-off-by: NeilBrown <neilb@suse.de>

Thanks,
NeilBrown
On Mon, 11 Apr 2011 13:37:20 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> On 2011-04-11 13:26, NeilBrown wrote:
>
> [...]
>
>> (I don't actually put the stashed things on the blk_plug, though it might
>> make sense to do that later in some cases - I'm not sure. Currently I stash
>> things in my own internal lists and just need a call back to say "ok, flush
>> those lists now").
>
> So we are making some progress... The thing I then don't understand is
> why you want to make it associated with the plug? Seems you don't have
> any scheduling restrictions, and in which case just storing them in md
> seems like a much better option.

Yes. But I need to know when to release the requests that I have stored.
I need to know when ->write_pages or ->read_pages or whatever has finished
submitting a pile of pages so that I can start processing the requests that I
have put aside. So I need a callback from blk_finish_plug.
(And I also need to know if a thread that was plugging schedules, for the
same reason that you do.)

NeilBrown
On 2011-04-11 14:05, NeilBrown wrote:
> On Mon, 11 Apr 2011 13:37:20 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:
>
> [...]
>
>> So we are making some progress... The thing I then don't understand is
>> why you want to make it associated with the plug? Seems you don't have
>> any scheduling restrictions, and in which case just storing them in md
>> seems like a much better option.
>
> Yes. But I need to know when to release the requests that I have stored.
> I need to know when ->write_pages or ->read_pages or whatever has finished
> submitting a pile of pages so that I can start processing the requests that I
> have put aside. So I need a callback from blk_finish_plug.

OK fair enough, I'll add your callback patch.
On 2011-04-11 13:55, NeilBrown wrote:
> [...]
>
> OK, I found how it re-enters.
>
> The request_fn doesn't exactly block, but when scsi_request_fn calls
> spin_unlock_irq, this calls preempt_enable which can call schedule, which is
> a recursive call.
>
> The patch I provided will stop that from recursing again as the blk_plug.list
> will be empty.
>
> So it is almost what you suggested; however the request_fn doesn't block, it
> just enables preempt.
>
> So the comment I would put at the top of that patch would be something like:

Ah, so it was pretty close. That does explain it. I've already queued up
the patch; I'll amend the commit message.
On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > > Yes. But I need to know when to release the requests that I have stored. > > I need to know when ->write_pages or ->read_pages or whatever has finished > > submitting a pile of pages so that I can start processing the request that I > > have put aside. So I need a callback from blk_finish_plug. > > OK fair enough, I'll add your callback patch. > Thanks. I'll queue up my md fixes to follow it once it gets to -linus. NeilBrown -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On 2011-04-11 14:36, NeilBrown wrote: > On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > >>> Yes. But I need to know when to release the requests that I have stored. >>> I need to know when ->write_pages or ->read_pages or whatever has finished >>> submitting a pile of pages so that I can start processing the request that I >>> have put aside. So I need a callback from blk_finish_plug. >> >> OK fair enough, I'll add your callback patch. >> > > Thanks. I'll queue up my md fixes to follow it once it gets to -linus. Great, once you do that and XFS kills the blk_flush_plug() calls too, then we can remove that export and make it internal only.
On Mon, Apr 11, 2011 at 02:50:22PM +1000, NeilBrown wrote: > diff --git a/block/blk-core.c b/block/blk-core.c > index 273d60b..903ce8d 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug) > struct request_queue *q; > unsigned long flags; > struct request *rq; > + struct list_head head; > > BUG_ON(plug->magic != PLUG_MAGIC); > > if (list_empty(&plug->list)) > return; > + list_add(&head, &plug->list); > + list_del_init(&plug->list); > > if (plug->should_sort) > - list_sort(NULL, &plug->list, plug_rq_cmp); > + list_sort(NULL, &head, plug_rq_cmp); > + plug->should_sort = 0; As Jens mentioned this should be list_splice_init. But looking over flush_plug_list the code there seems strange to me. What does the local_irq_save in flush_plug_list protect? Why don't we need it over the list_sort? And do we still need it when first splicing the list to a local one? It's one of these cases where I'd really like to see more comments explaining why the code is doing what it's doing.
On Mon, 11 Apr 2011 12:59:23 -0400 "hch@infradead.org" <hch@infradead.org> wrote: > On Mon, Apr 11, 2011 at 02:50:22PM +1000, NeilBrown wrote: > > diff --git a/block/blk-core.c b/block/blk-core.c > > index 273d60b..903ce8d 100644 > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug) > > struct request_queue *q; > > unsigned long flags; > > struct request *rq; > > + struct list_head head; > > > > BUG_ON(plug->magic != PLUG_MAGIC); > > > > if (list_empty(&plug->list)) > > return; > > + list_add(&head, &plug->list); > > + list_del_init(&plug->list); > > > > if (plug->should_sort) > > - list_sort(NULL, &plug->list, plug_rq_cmp); > > + list_sort(NULL, &head, plug_rq_cmp); > > + plug->should_sort = 0; > > As Jens mentioned this should be list_splice_init. But looking over > flush_plug_list the code there seems strange to me. > > What does the local_irq_save in flush_plug_list protect? Why don't > we need it over the list_sort? And do we still need it when first > splicing the list to a local one? > > It's one of these cases where I'd really like to see more comments > explaining why the code is doing what it's doing. My understanding of that was that the calling requirement of __elv_add_request is that the queue spinlock is held and that interrupts are disabled. So rather than possibly enabling and disabling interrupts several times as different queues are handled, the code just disables interrupts once, and then takes the spinlock once for each different queue. The whole point of the change to plugging was to take locks less often. Disabling interrupts less often is presumably an analogous goal. Though I agree that a comment would help. q = NULL; + /* Disable interrupts just once rather than using spin_lock_irq/spin_unlock_irq * variants */ local_irq_save(flags); assuming my analysis is correct.
NeilBrown
Looking at the patch (http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=761e433f3de6fb8e369af9e5c08beb86286d023f) I'm not sure it's an optimal design. The flush callback really is a per-queue thing. Why isn't it a function pointer in the request queue when doing the blk_run_queue call once we're done with a given queue before moving on to the next one?
On Tue, Apr 12, 2011 at 07:14:28AM +1000, NeilBrown wrote: > > My understanding of that was that the calling requirement of > __elv_add_request is that the queue spinlock is held and that interrupts are > disabled. > So rather than possible enabling and disabling interrupts several times as > different queue are handled, the code just disabled interrupts once, and > then just take the spinlock once for each different queue. > > The whole point of the change to plugging was to take locks less often. > Disabling interrupts less often is presumably an analogous goal. > > Though I agree that a comment would help. > > q = NULL; > + /* Disable interrupts just once rather than using spin_lock_irq/sin_unlock_irq > * variants > */ > local_irq_save(flags); > > > assuming my analysis is correct. Your explanation does make sense to me now that you explain it. I didn't even think of that variant before.
On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: > Great, once you do that and XFS kills the blk_flush_plug() calls too, > then we can remove that export and make it internal only. Linus pulled the tree, so they are gone now. Btw, there are still some bits in the area that confuse me: - what's the point of the queue_sync_plugs? It has a lot of comments that seem to pre-date the onstack plugging, but except for that it's a trivial wrapper around blk_flush_plug, with an argument that is not used. - is there a good reason for the existence of __blk_flush_plug? You'd get one additional instruction in the inlined version of blk_flush_plug when opencoding, but avoid the need for chained function calls. - Why is having a plug in blk_flush_plug marked unlikely? Note that unlikely is the static branch prediction hint to mark the case extremely unlikely and is even used for hot/cold partitioning. But when we call it we usually check beforehand if we actually have plugs, so it's actually likely to happen. - what is the point of blk_finish_plug? All callers have the plug on stack, and there's no good reason for adding the NULL check. Note that blk_start_plug doesn't have the NULL check either. - Why does __blk_flush_plug call __blk_finish_plug which might clear tsk->plug, just to set it back after the call? When manually inlining __blk_finish_plug into __blk_flush_plug it looks like: void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug) { flush_plug_list(plug); if (plug == tsk->plug) tsk->plug = NULL; tsk->plug = plug; } it would seem much smarter to just call flush_plug_list directly. In fact it seems like the tsk->plug is not necessary at all and all remaining __blk_flush_plug callers could be replaced with flush_plug_list. - and of course the remaining issue of why io_schedule needs an explicit blk_flush_plug when schedule() already does one in case it actually needs to schedule.
On 2011-04-11 23:14, NeilBrown wrote: > On Mon, 11 Apr 2011 12:59:23 -0400 "hch@infradead.org" <hch@infradead.org> > wrote: > >> On Mon, Apr 11, 2011 at 02:50:22PM +1000, NeilBrown wrote: >>> diff --git a/block/blk-core.c b/block/blk-core.c >>> index 273d60b..903ce8d 100644 >>> --- a/block/blk-core.c >>> +++ b/block/blk-core.c >>> @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug) >>> struct request_queue *q; >>> unsigned long flags; >>> struct request *rq; >>> + struct list_head head; >>> >>> BUG_ON(plug->magic != PLUG_MAGIC); >>> >>> if (list_empty(&plug->list)) >>> return; >>> + list_add(&head, &plug->list); >>> + list_del_init(&plug->list); >>> >>> if (plug->should_sort) >>> - list_sort(NULL, &plug->list, plug_rq_cmp); >>> + list_sort(NULL, &head, plug_rq_cmp); >>> + plug->should_sort = 0; >> >> As Jens mentioned this should be list_splice_init. But looking over >> flush_plug_list the code there seems strange to me. >> >> What does the local_irq_save in flush_plug_list protect? Why don't >> we need it over the list_sort? And do we still need it when first >> splicing the list to a local one? >> >> It's one of these cases where I'd really like to see more comments >> explaining why the code is doing what it's doing. > > My understanding of that was that the calling requirement of > __elv_add_request is that the queue spinlock is held and that interrupts are > disabled. > So rather than possible enabling and disabling interrupts several times as > different queue are handled, the code just disabled interrupts once, and > then just take the spinlock once for each different queue. > > The whole point of the change to plugging was to take locks less often. > Disabling interrupts less often is presumably an analogous goal. > > Though I agree that a comment would help. > > q = NULL; > + /* Disable interrupts just once rather than using spin_lock_irq/sin_unlock_irq > * variants > */ > local_irq_save(flags); > > > assuming my analysis is correct. 
Yep that is correct, it's to avoid juggling irq on and off for multiple queues. I will put a comment there.
On 2011-04-12 00:58, hch@infradead.org wrote: > Looking at the patch > (http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=761e433f3de6fb8e369af9e5c08beb86286d023f) > > I'm not sure it's an optimal design. The flush callback really > is a per-queue thing. Why isn't it a function pointer in the request > queue when doing the blk_run_queue call once we're done with a given > queue before moving on to the next one? I was thinking about this yesterday as well, the design didn't quite feel just right. Additionally the user now must track this state too, and whether he's plugged on that task or not. I'll rewrite this.
On 2011-04-12 03:12, hch@infradead.org wrote: > On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: >> Great, once you do that and XFS kills the blk_flush_plug() calls too, >> then we can remove that export and make it internal only. > > Linus pulled the tree, so they are gone now. Btw, there's still some > bits in the area that confuse me: Great! > - what's the point of the queue_sync_plugs? It has a lot of comment > that seem to pre-data the onstack plugging, but except for that > it's trivial wrapper around blk_flush_plug, with an argument > that is not used. There's really no point to it anymore. Its existence was due to the older revision that had to track write requests for serializing around a barrier. I'll kill it, since we don't do that anymore. > - is there a good reason for the existance of __blk_flush_plug? You'd > get one additional instruction in the inlined version of > blk_flush_plug when opencoding, but avoid the need for chained > function calls. > - Why is having a plug in blk_flush_plug marked unlikely? Note that > unlikely is the static branch prediction hint to mark the case > extremly unlikely and is even used for hot/cold partitioning. But > when we call it we usually check beforehand if we actually have > plugs, so it's actually likely to happen. The existence and out-of-line version are for the schedule() hook. It should be an unlikely event to schedule with a plug held; normally the plug should have been explicitly unplugged before that happens. > - what is the point of blk_finish_plug? All callers have > the plug on stack, and there's no good reason for adding the NULL > check. Note that blk_start_plug doesn't have the NULL check either. That one can probably go, I need to double check that part since some things changed. > - Why does __blk_flush_plug call __blk_finish_plug which might clear > tsk->plug, just to set it back after the call?
When manually inlining > __blk_finish_plug ino __blk_flush_plug it looks like: > > void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug) > { > flush_plug_list(plug); > if (plug == tsk->plug) > tsk->plug = NULL; > tsk->plug = plug; > } > > it would seem much smarted to just call flush_plug_list directly. > In fact it seems like the tsk->plug is not nessecary at all and > all remaining __blk_flush_plug callers could be replaced with > flush_plug_list. It depends on whether this was an explicit unplug (eg blk_finish_plug()), or whether it was an implicit event (eg on schedule()). If we do it on schedule(), then we retain the plug after the flush. Otherwise we clear it. > - and of course the remaining issue of why io_schedule needs an > expliciy blk_flush_plug when schedule() already does one in > case it actually needs to schedule. Already answered in other email.
On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: > On 2011-04-12 03:12, hch@infradead.org wrote: > > On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: > >> Great, once you do that and XFS kills the blk_flush_plug() calls too, > >> then we can remove that export and make it internal only. > > > > Linus pulled the tree, so they are gone now. Btw, there's still some > > bits in the area that confuse me: > > Great! > > > - what's the point of the queue_sync_plugs? It has a lot of comment > > that seem to pre-data the onstack plugging, but except for that > > it's trivial wrapper around blk_flush_plug, with an argument > > that is not used. > > There's really no point to it anymore. It's existance was due to the > older revision that had to track write requests for serializaing around > a barrier. I'll kill it, since we don't do that anymore. > > > - is there a good reason for the existance of __blk_flush_plug? You'd > > get one additional instruction in the inlined version of > > blk_flush_plug when opencoding, but avoid the need for chained > > function calls. > > - Why is having a plug in blk_flush_plug marked unlikely? Note that > > unlikely is the static branch prediction hint to mark the case > > extremly unlikely and is even used for hot/cold partitioning. But > > when we call it we usually check beforehand if we actually have > > plugs, so it's actually likely to happen. > > The existance and out-of-line is for the scheduler() hook. It should be > an unlikely event to schedule with a plug held, normally the plug should > have been explicitly unplugged before that happens. Though if it does, haven't you just added a significant amount of depth to the worst case stack usage? 
I'm seeing this sort of thing from io_schedule(): Depth Size Location (40 entries) ----- ---- -------- 0) 4256 16 mempool_alloc_slab+0x15/0x20 1) 4240 144 mempool_alloc+0x63/0x160 2) 4096 16 scsi_sg_alloc+0x4c/0x60 3) 4080 112 __sg_alloc_table+0x66/0x140 4) 3968 32 scsi_init_sgtable+0x33/0x90 5) 3936 48 scsi_init_io+0x31/0xc0 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 7) 3856 112 sd_prep_fn+0x150/0xa90 8) 3744 48 blk_peek_request+0x6a/0x1f0 9) 3696 96 scsi_request_fn+0x60/0x510 10) 3600 32 __blk_run_queue+0x57/0x100 11) 3568 80 flush_plug_list+0x133/0x1d0 12) 3488 32 __blk_flush_plug+0x24/0x50 13) 3456 32 io_schedule+0x79/0x80 (This is from a page fault on ext3 that is doing page cache readahead and blocking on a locked buffer.) I've seen traces where mempool_alloc_slab enters direct reclaim which adds another 1.5k of stack usage to this path. So I'm extremely concerned that you've just reduced the stack available to every thread by at least 2.5k of space... Cheers, Dave.
On 2011-04-12 14:22, Dave Chinner wrote: > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: >> On 2011-04-12 03:12, hch@infradead.org wrote: >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: >>>> Great, once you do that and XFS kills the blk_flush_plug() calls too, >>>> then we can remove that export and make it internal only. >>> >>> Linus pulled the tree, so they are gone now. Btw, there's still some >>> bits in the area that confuse me: >> >> Great! >> >>> - what's the point of the queue_sync_plugs? It has a lot of comment >>> that seem to pre-data the onstack plugging, but except for that >>> it's trivial wrapper around blk_flush_plug, with an argument >>> that is not used. >> >> There's really no point to it anymore. It's existance was due to the >> older revision that had to track write requests for serializaing around >> a barrier. I'll kill it, since we don't do that anymore. >> >>> - is there a good reason for the existance of __blk_flush_plug? You'd >>> get one additional instruction in the inlined version of >>> blk_flush_plug when opencoding, but avoid the need for chained >>> function calls. >>> - Why is having a plug in blk_flush_plug marked unlikely? Note that >>> unlikely is the static branch prediction hint to mark the case >>> extremly unlikely and is even used for hot/cold partitioning. But >>> when we call it we usually check beforehand if we actually have >>> plugs, so it's actually likely to happen. >> >> The existance and out-of-line is for the scheduler() hook. It should be >> an unlikely event to schedule with a plug held, normally the plug should >> have been explicitly unplugged before that happens. > > Though if it does, haven't you just added a significant amount of > depth to the worst case stack usage? 
I'm seeing this sort > from io_schedule(): > > Depth Size Location (40 entries) > ----- ---- -------- > 0) 4256 16 mempool_alloc_slab+0x15/0x20 > 1) 4240 144 mempool_alloc+0x63/0x160 > 2) 4096 16 scsi_sg_alloc+0x4c/0x60 > 3) 4080 112 __sg_alloc_table+0x66/0x140 > 4) 3968 32 scsi_init_sgtable+0x33/0x90 > 5) 3936 48 scsi_init_io+0x31/0xc0 > 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > 7) 3856 112 sd_prep_fn+0x150/0xa90 > 8) 3744 48 blk_peek_request+0x6a/0x1f0 > 9) 3696 96 scsi_request_fn+0x60/0x510 > 10) 3600 32 __blk_run_queue+0x57/0x100 > 11) 3568 80 flush_plug_list+0x133/0x1d0 > 12) 3488 32 __blk_flush_plug+0x24/0x50 > 13) 3456 32 io_schedule+0x79/0x80 > > (This is from a page fault on ext3 that is doing page cache > readahead and blocking on a locked buffer.) > > I've seen traces where mempool_alloc_slab enters direct reclaim > which adds another 1.5k of stack usage to this path. So I'm > extremely concerned that you've just reduced the stack available to > every thread by at least 2.5k of space... Yeah, that does not look great. If this turns out to be problematic, we can turn the queue runs from the unlikely case into out-of-line from kblockd. But this really isn't that new, you could enter the IO dispatch path when doing IO already (when submitting it). So we better be able to handle that. If it's a problem from the schedule()/io_schedule() path, then let's ensure that those are truly unlikely events so we can punt them to kblockd.
On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: > On 2011-04-12 14:22, Dave Chinner wrote: > > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: > >> On 2011-04-12 03:12, hch@infradead.org wrote: > >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: > >>>> Great, once you do that and XFS kills the blk_flush_plug() calls too, > >>>> then we can remove that export and make it internal only. > >>> > >>> Linus pulled the tree, so they are gone now. Btw, there's still some > >>> bits in the area that confuse me: > >> > >> Great! > >> > >>> - what's the point of the queue_sync_plugs? It has a lot of comment > >>> that seem to pre-data the onstack plugging, but except for that > >>> it's trivial wrapper around blk_flush_plug, with an argument > >>> that is not used. > >> > >> There's really no point to it anymore. It's existance was due to the > >> older revision that had to track write requests for serializaing around > >> a barrier. I'll kill it, since we don't do that anymore. > >> > >>> - is there a good reason for the existance of __blk_flush_plug? You'd > >>> get one additional instruction in the inlined version of > >>> blk_flush_plug when opencoding, but avoid the need for chained > >>> function calls. > >>> - Why is having a plug in blk_flush_plug marked unlikely? Note that > >>> unlikely is the static branch prediction hint to mark the case > >>> extremly unlikely and is even used for hot/cold partitioning. But > >>> when we call it we usually check beforehand if we actually have > >>> plugs, so it's actually likely to happen. > >> > >> The existance and out-of-line is for the scheduler() hook. It should be > >> an unlikely event to schedule with a plug held, normally the plug should > >> have been explicitly unplugged before that happens. > > > > Though if it does, haven't you just added a significant amount of > > depth to the worst case stack usage? 
I'm seeing this sort > > from io_schedule(): > > > > Depth Size Location (40 entries) > > ----- ---- -------- > > 0) 4256 16 mempool_alloc_slab+0x15/0x20 > > 1) 4240 144 mempool_alloc+0x63/0x160 > > 2) 4096 16 scsi_sg_alloc+0x4c/0x60 > > 3) 4080 112 __sg_alloc_table+0x66/0x140 > > 4) 3968 32 scsi_init_sgtable+0x33/0x90 > > 5) 3936 48 scsi_init_io+0x31/0xc0 > > 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > > 7) 3856 112 sd_prep_fn+0x150/0xa90 > > 8) 3744 48 blk_peek_request+0x6a/0x1f0 > > 9) 3696 96 scsi_request_fn+0x60/0x510 > > 10) 3600 32 __blk_run_queue+0x57/0x100 > > 11) 3568 80 flush_plug_list+0x133/0x1d0 > > 12) 3488 32 __blk_flush_plug+0x24/0x50 > > 13) 3456 32 io_schedule+0x79/0x80 > > > > (This is from a page fault on ext3 that is doing page cache > > readahead and blocking on a locked buffer.) > > > > I've seen traces where mempool_alloc_slab enters direct reclaim > > which adds another 1.5k of stack usage to this path. So I'm > > extremely concerned that you've just reduced the stack available to > > every thread by at least 2.5k of space... > > Yeah, that does not look great. If this turns out to be problematic, we > can turn the queue runs from the unlikely case into out-of-line from > kblockd. > > But this really isn't that new, you could enter the IO dispatch path > when doing IO already (when submitting it). So we better be able to > handle that. The problem I see is that IO submitted when there's plenty of stack available would previously have been fine. However, now it hits the plug, and then later on, after the thread consumes a lot more stack, it, say, waits for a completion. We then schedule, it unplugs the queue, and we add the IO stack at a point where there isn't much space available. So effectively we are moving the places where stack is consumed around, and it's completely unpredictable where that stack is going to land now.
> If it's a problem from the schedule()/io_schedule() path, then > lets ensure that those are truly unlikely events so we can punt > them to kblockd. Rather than wait for an explosion to be reported before doing this, why not just punt unplugs to kblockd unconditionally? Cheers, Dave.
On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: > On 2011-04-12 14:22, Dave Chinner wrote: > > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: > >> On 2011-04-12 03:12, hch@infradead.org wrote: > >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: > >>> function calls. > >>> - Why is having a plug in blk_flush_plug marked unlikely? Note that > >>> unlikely is the static branch prediction hint to mark the case > >>> extremly unlikely and is even used for hot/cold partitioning. But > >>> when we call it we usually check beforehand if we actually have > >>> plugs, so it's actually likely to happen. > >> > >> The existance and out-of-line is for the scheduler() hook. It should be > >> an unlikely event to schedule with a plug held, normally the plug should > >> have been explicitly unplugged before that happens. > > > > Though if it does, haven't you just added a significant amount of > > depth to the worst case stack usage? I'm seeing this sort of thing > > from io_schedule(): > > > > Depth Size Location (40 entries) > > ----- ---- -------- > > 0) 4256 16 mempool_alloc_slab+0x15/0x20 > > 1) 4240 144 mempool_alloc+0x63/0x160 > > 2) 4096 16 scsi_sg_alloc+0x4c/0x60 > > 3) 4080 112 __sg_alloc_table+0x66/0x140 > > 4) 3968 32 scsi_init_sgtable+0x33/0x90 > > 5) 3936 48 scsi_init_io+0x31/0xc0 > > 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > > 7) 3856 112 sd_prep_fn+0x150/0xa90 > > 8) 3744 48 blk_peek_request+0x6a/0x1f0 > > 9) 3696 96 scsi_request_fn+0x60/0x510 > > 10) 3600 32 __blk_run_queue+0x57/0x100 > > 11) 3568 80 flush_plug_list+0x133/0x1d0 > > 12) 3488 32 __blk_flush_plug+0x24/0x50 > > 13) 3456 32 io_schedule+0x79/0x80 > > > > (This is from a page fault on ext3 that is doing page cache > > readahead and blocking on a locked buffer.) 
FYI, the next step in the allocation chain adds >900 bytes to that stack: $ cat /sys/kernel/debug/tracing/stack_trace Depth Size Location (47 entries) ----- ---- -------- 0) 5176 40 zone_statistics+0xad/0xc0 1) 5136 288 get_page_from_freelist+0x2cf/0x840 2) 4848 304 __alloc_pages_nodemask+0x121/0x930 3) 4544 48 kmem_getpages+0x62/0x160 4) 4496 96 cache_grow+0x308/0x330 5) 4400 80 cache_alloc_refill+0x21c/0x260 6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0 7) 4256 16 mempool_alloc_slab+0x15/0x20 8) 4240 144 mempool_alloc+0x63/0x160 9) 4096 16 scsi_sg_alloc+0x4c/0x60 10) 4080 112 __sg_alloc_table+0x66/0x140 11) 3968 32 scsi_init_sgtable+0x33/0x90 12) 3936 48 scsi_init_io+0x31/0xc0 13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 14) 3856 112 sd_prep_fn+0x150/0xa90 15) 3744 48 blk_peek_request+0x6a/0x1f0 16) 3696 96 scsi_request_fn+0x60/0x510 17) 3600 32 __blk_run_queue+0x57/0x100 18) 3568 80 flush_plug_list+0x133/0x1d0 19) 3488 32 __blk_flush_plug+0x24/0x50 20) 3456 32 io_schedule+0x79/0x80 That's close to 1800 bytes now, and that's not entering the reclaim path. If i get one deeper than that, I'll be sure to post it. :) Cheers, Dave.
On 2011-04-12 15:40, Dave Chinner wrote: > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: >> On 2011-04-12 14:22, Dave Chinner wrote: >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: >>>> On 2011-04-12 03:12, hch@infradead.org wrote: >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: >>>>> function calls. >>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that >>>>> unlikely is the static branch prediction hint to mark the case >>>>> extremly unlikely and is even used for hot/cold partitioning. But >>>>> when we call it we usually check beforehand if we actually have >>>>> plugs, so it's actually likely to happen. >>>> >>>> The existance and out-of-line is for the scheduler() hook. It should be >>>> an unlikely event to schedule with a plug held, normally the plug should >>>> have been explicitly unplugged before that happens. >>> >>> Though if it does, haven't you just added a significant amount of >>> depth to the worst case stack usage? I'm seeing this sort of thing >>> from io_schedule(): >>> >>> Depth Size Location (40 entries) >>> ----- ---- -------- >>> 0) 4256 16 mempool_alloc_slab+0x15/0x20 >>> 1) 4240 144 mempool_alloc+0x63/0x160 >>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60 >>> 3) 4080 112 __sg_alloc_table+0x66/0x140 >>> 4) 3968 32 scsi_init_sgtable+0x33/0x90 >>> 5) 3936 48 scsi_init_io+0x31/0xc0 >>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 >>> 7) 3856 112 sd_prep_fn+0x150/0xa90 >>> 8) 3744 48 blk_peek_request+0x6a/0x1f0 >>> 9) 3696 96 scsi_request_fn+0x60/0x510 >>> 10) 3600 32 __blk_run_queue+0x57/0x100 >>> 11) 3568 80 flush_plug_list+0x133/0x1d0 >>> 12) 3488 32 __blk_flush_plug+0x24/0x50 >>> 13) 3456 32 io_schedule+0x79/0x80 >>> >>> (This is from a page fault on ext3 that is doing page cache >>> readahead and blocking on a locked buffer.) 
> > FYI, the next step in the allocation chain adds >900 bytes to that > stack: > > $ cat /sys/kernel/debug/tracing/stack_trace > Depth Size Location (47 entries) > ----- ---- -------- > 0) 5176 40 zone_statistics+0xad/0xc0 > 1) 5136 288 get_page_from_freelist+0x2cf/0x840 > 2) 4848 304 __alloc_pages_nodemask+0x121/0x930 > 3) 4544 48 kmem_getpages+0x62/0x160 > 4) 4496 96 cache_grow+0x308/0x330 > 5) 4400 80 cache_alloc_refill+0x21c/0x260 > 6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0 > 7) 4256 16 mempool_alloc_slab+0x15/0x20 > 8) 4240 144 mempool_alloc+0x63/0x160 > 9) 4096 16 scsi_sg_alloc+0x4c/0x60 > 10) 4080 112 __sg_alloc_table+0x66/0x140 > 11) 3968 32 scsi_init_sgtable+0x33/0x90 > 12) 3936 48 scsi_init_io+0x31/0xc0 > 13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > 14) 3856 112 sd_prep_fn+0x150/0xa90 > 15) 3744 48 blk_peek_request+0x6a/0x1f0 > 16) 3696 96 scsi_request_fn+0x60/0x510 > 17) 3600 32 __blk_run_queue+0x57/0x100 > 18) 3568 80 flush_plug_list+0x133/0x1d0 > 19) 3488 32 __blk_flush_plug+0x24/0x50 > 20) 3456 32 io_schedule+0x79/0x80 > > That's close to 1800 bytes now, and that's not entering the reclaim > path. If i get one deeper than that, I'll be sure to post it. :) Do you have traces from 2.6.38, or are you just doing them now? The path you quote above should not go into reclaim, it's a GFP_ATOMIC allocation.
On Tue, Apr 12, 2011 at 03:48:10PM +0200, Jens Axboe wrote: > On 2011-04-12 15:40, Dave Chinner wrote: > > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: > >> On 2011-04-12 14:22, Dave Chinner wrote: > >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: > >>>> On 2011-04-12 03:12, hch@infradead.org wrote: > >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: > >>>>> function calls. > >>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that > >>>>> unlikely is the static branch prediction hint to mark the case > >>>>> extremly unlikely and is even used for hot/cold partitioning. But > >>>>> when we call it we usually check beforehand if we actually have > >>>>> plugs, so it's actually likely to happen. > >>>> > >>>> The existance and out-of-line is for the scheduler() hook. It should be > >>>> an unlikely event to schedule with a plug held, normally the plug should > >>>> have been explicitly unplugged before that happens. > >>> > >>> Though if it does, haven't you just added a significant amount of > >>> depth to the worst case stack usage? I'm seeing this sort of thing > >>> from io_schedule(): > >>> > >>> Depth Size Location (40 entries) > >>> ----- ---- -------- > >>> 0) 4256 16 mempool_alloc_slab+0x15/0x20 > >>> 1) 4240 144 mempool_alloc+0x63/0x160 > >>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60 > >>> 3) 4080 112 __sg_alloc_table+0x66/0x140 > >>> 4) 3968 32 scsi_init_sgtable+0x33/0x90 > >>> 5) 3936 48 scsi_init_io+0x31/0xc0 > >>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > >>> 7) 3856 112 sd_prep_fn+0x150/0xa90 > >>> 8) 3744 48 blk_peek_request+0x6a/0x1f0 > >>> 9) 3696 96 scsi_request_fn+0x60/0x510 > >>> 10) 3600 32 __blk_run_queue+0x57/0x100 > >>> 11) 3568 80 flush_plug_list+0x133/0x1d0 > >>> 12) 3488 32 __blk_flush_plug+0x24/0x50 > >>> 13) 3456 32 io_schedule+0x79/0x80 > >>> > >>> (This is from a page fault on ext3 that is doing page cache > >>> readahead and blocking on a locked buffer.) 
> > > > FYI, the next step in the allocation chain adds >900 bytes to that > > stack: > > > > $ cat /sys/kernel/debug/tracing/stack_trace > > Depth Size Location (47 entries) > > ----- ---- -------- > > 0) 5176 40 zone_statistics+0xad/0xc0 > > 1) 5136 288 get_page_from_freelist+0x2cf/0x840 > > 2) 4848 304 __alloc_pages_nodemask+0x121/0x930 > > 3) 4544 48 kmem_getpages+0x62/0x160 > > 4) 4496 96 cache_grow+0x308/0x330 > > 5) 4400 80 cache_alloc_refill+0x21c/0x260 > > 6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0 > > 7) 4256 16 mempool_alloc_slab+0x15/0x20 > > 8) 4240 144 mempool_alloc+0x63/0x160 > > 9) 4096 16 scsi_sg_alloc+0x4c/0x60 > > 10) 4080 112 __sg_alloc_table+0x66/0x140 > > 11) 3968 32 scsi_init_sgtable+0x33/0x90 > > 12) 3936 48 scsi_init_io+0x31/0xc0 > > 13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > > 14) 3856 112 sd_prep_fn+0x150/0xa90 > > 15) 3744 48 blk_peek_request+0x6a/0x1f0 > > 16) 3696 96 scsi_request_fn+0x60/0x510 > > 17) 3600 32 __blk_run_queue+0x57/0x100 > > 18) 3568 80 flush_plug_list+0x133/0x1d0 > > 19) 3488 32 __blk_flush_plug+0x24/0x50 > > 20) 3456 32 io_schedule+0x79/0x80 > > > > That's close to 1800 bytes now, and that's not entering the reclaim > > path. If i get one deeper than that, I'll be sure to post it. :) > > Do you have traces from 2.6.38, or are you just doing them now? I do stack checks like this all the time. I generally don't keep them around, just pay attention to the path and depth. ext3 is used for / on my test VMs, and has never shown up as the worse case stack usage when running xfstests. As of the block plugging code, this trace is the top stack user for the first ~130 tests, and often for the entire test run on XFS.... > The path you quote above should not go into reclaim, it's a GFP_ATOMIC > allocation. Right. I'm still trying to produce a trace that shows more stack usage in the block layer. It's random chance as to what pops up most of the time. 
However, some of the stacks that are showing up in 2.6.39 are quite
different from any I've ever seen before...

Cheers,

Dave.
Btw, "block: move queue run on unplug to kblockd" currently moves the
__blk_run_queue call to kblockd unconditionally. But I'm not sure
that's correct - if we do an explicit blk_finish_plug there's no point
in forcing the context switch.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff --git a/block/blk-core.c b/block/blk-core.c
index 273d60b..903ce8d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
 	struct request_queue *q;
 	unsigned long flags;
 	struct request *rq;
+	struct list_head head;
 
 	BUG_ON(plug->magic != PLUG_MAGIC);
 
 	if (list_empty(&plug->list))
 		return;
+	list_add(&head, &plug->list);
+	list_del_init(&plug->list);
 
 	if (plug->should_sort)
-		list_sort(NULL, &plug->list, plug_rq_cmp);
+		list_sort(NULL, &head, plug_rq_cmp);
+	plug->should_sort = 0;
 
 	q = NULL;
 	local_irq_save(flags);
-	while (!list_empty(&plug->list)) {
-		rq = list_entry_rq(plug->list.next);
+	while (!list_empty(&head)) {
+		rq = list_entry_rq(head.next);
 		list_del_init(&rq->queuelist);
 		BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG));
 		BUG_ON(!rq->q);

makes the symptom go away. It simply moves the plug list onto a
separate list head before sorting and processing it.

My test was simply writing to a RAID1 with dd:

  while true; do dd if=/dev/zero of=/dev/md0 bs=4k; done

Obviously all writes go to two devices so the plug list will always
need sorting.

The only explanation I can come up with is that, very occasionally,
schedule on two separate CPUs calls blk_flush_plug for the same task.
I don't understand the scheduler nearly well enough to know if or how
that can happen. However, with this patch in place I can write to a
RAID1 constantly for half an hour, and without it the write rarely
lasts for three minutes.

If you want to reproduce my experiment, you can pull from

  git://neil.brown.name/md plug-test

to get my patches for plugging in md (which are not quite ready for
submission but do seem to work), create a RAID1 using e.g.
  mdadm -C /dev/md0 --level=1 --raid-disks=2 /dev/device1 /dev/device2

and then write to it with:

  while true; do dd if=/dev/zero of=/dev/md0 bs=4K ; done

Thanks,
NeilBrown

From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Thu, 7 Apr 2011 13:16:59 +1000
Subject: [PATCH] Enhance new plugging support to support general callbacks.

md/raid requires an unplug callback, but as it does not use requests
the current code cannot provide one.

So allow arbitrary callbacks to be attached to the blk_plug.

Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 block/blk-core.c       |   13 +++++++++++++
 include/linux/blkdev.h |    7 ++++++-
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 725091d..273d60b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
 
 	plug->magic = PLUG_MAGIC;
 	INIT_LIST_HEAD(&plug->list);
+	INIT_LIST_HEAD(&plug->cb_list);
 	plug->should_sort = 0;
 
 	/*
@@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
 	local_irq_restore(flags);
 }
 
+static void flush_plug_callbacks(struct blk_plug *plug)
+{
+	while (!list_empty(&plug->cb_list)) {
+		struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
+							  struct blk_plug_cb,
+							  list);
+		list_del(&cb->list);
+		cb->callback(cb);
+	}
+}
+
 static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
 {
 	flush_plug_list(plug);
+	flush_plug_callbacks(plug);
 
 	if (plug == tsk->plug)
 		tsk->plug = NULL;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32176cc..3e5e604 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
 struct blk_plug {
 	unsigned long magic;
 	struct list_head list;
+	struct list_head cb_list;
 	unsigned int should_sort;
 };
+struct blk_plug_cb {
+	struct list_head list;
+	void (*callback)(struct blk_plug_cb *);
+};
 
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
@@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
-	return plug && !list_empty(&plug->list);
+	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
 }
 
 /*