diff mbox

cfq: priority boost on meta/prio marked IO

Message ID 20160608204347.GA30146@kernel.dk (mailing list archive)
State New, archived
Headers show

Commit Message

Jens Axboe June 8, 2016, 8:43 p.m. UTC
At Facebook, we have a number of cases where people use ionice to set a
lower priority, then end up having tasks stuck for a long time because
eg meta data updates from an idle priority tasks is blocking out higher
priority processes. It's bad enough that it will trigger the softlockup
warning.

This patch adds code to CFQ that bumps the priority class and data for
an idle task, if is doing IO marked as PRIO or META. With this, we no
longer see the softlockups.

Signed-off-by: Jens Axboe <axboe@fb.com>

Comments

Jeff Moyer June 9, 2016, 3:55 p.m. UTC | #1
Hi, Jens,

Jens Axboe <axboe@fb.com> writes:

> At Facebook, we have a number of cases where people use ionice to set a
> lower priority, then end up having tasks stuck for a long time because
> eg meta data updates from an idle priority tasks is blocking out higher
> priority processes. It's bad enough that it will trigger the softlockup
> warning.

I expect a higher prio process could be blocked on a lower prio process
reading the same metadata, too.  I had a hard time tracking down where
REQ_META WRITE I/O was issued outside of the journal or writeback paths
(and I hope you're not ionice-ing those!).  Eventually, with the help of
sandeen, I found some oddball cases that I doubt you're running into.
Can you enlighten me as to where this (REQ_META write I/O) is happening?
I don't disagree that it's a problem, I just would like to understand
your problem case better.

Anyway, it seems to me you could just set REQ_PRIO in the code paths you
care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
the same thing, which essentially undoes commit 65299a3b788bd ("block:
separate priority boosting from REQ_META") from Christoph.

Thanks!
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig June 9, 2016, 4 p.m. UTC | #2
On Thu, Jun 09, 2016 at 11:55:56AM -0400, Jeff Moyer wrote:
> I expect a higher prio process could be blocked on a lower prio process
> reading the same metadata, too.  I had a hard time tracking down where
> REQ_META WRITE I/O was issued outside of the journal or writeback paths
> (and I hope you're not ionice-ing those!).  Eventually, with the help of
> sandeen, I found some oddball cases that I doubt you're running into.
> Can you enlighten me as to where this (REQ_META write I/O) is happening?
> I don't disagree that it's a problem, I just would like to understand
> your problem case better.

XFS does lots of REQ_META writes from _xfs_buf_ioapply().  But none
of those should be in the critical path as the all metadata is logged
first and then written back later.

> Anyway, it seems to me you could just set REQ_PRIO in the code paths you
> care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
> the same thing, which essentially undoes commit 65299a3b788bd ("block:
> separate priority boosting from REQ_META") from Christoph.

And I'm still waiting for someone to explain when exactly REQ_PRIO
should be used..
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Moyer June 9, 2016, 4:05 p.m. UTC | #3
Christoph Hellwig <hch@infradead.org> writes:

> On Thu, Jun 09, 2016 at 11:55:56AM -0400, Jeff Moyer wrote:
>> I expect a higher prio process could be blocked on a lower prio process
>> reading the same metadata, too.  I had a hard time tracking down where
>> REQ_META WRITE I/O was issued outside of the journal or writeback paths
>> (and I hope you're not ionice-ing those!).  Eventually, with the help of
>> sandeen, I found some oddball cases that I doubt you're running into.
>> Can you enlighten me as to where this (REQ_META write I/O) is happening?
>> I don't disagree that it's a problem, I just would like to understand
>> your problem case better.
>
> XFS does lots of REQ_META writes from _xfs_buf_ioapply().  But none
> of those should be in the critical path as the all metadata is logged
> first and then written back later.
>
>> Anyway, it seems to me you could just set REQ_PRIO in the code paths you
>> care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
>> the same thing, which essentially undoes commit 65299a3b788bd ("block:
>> separate priority boosting from REQ_META") from Christoph.
>
> And I'm still waiting for someone to explain when exactly REQ_PRIO
> should be used..

I think Jens' bug report is exactly that explanation, no?  To avoid
priority inversion?

-Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe June 9, 2016, 4:20 p.m. UTC | #4
On 06/09/2016 09:55 AM, Jeff Moyer wrote:
> Hi, Jens,
>
> Jens Axboe <axboe@fb.com> writes:
>
>> At Facebook, we have a number of cases where people use ionice to set a
>> lower priority, then end up having tasks stuck for a long time because
>> eg meta data updates from an idle priority tasks is blocking out higher
>> priority processes. It's bad enough that it will trigger the softlockup
>> warning.
>
> I expect a higher prio process could be blocked on a lower prio process
> reading the same metadata, too.  I had a hard time tracking down where
> REQ_META WRITE I/O was issued outside of the journal or writeback paths
> (and I hope you're not ionice-ing those!).  Eventually, with the help of
> sandeen, I found some oddball cases that I doubt you're running into.
> Can you enlighten me as to where this (REQ_META write I/O) is happening?
> I don't disagree that it's a problem, I just would like to understand
> your problem case better.
>
> Anyway, it seems to me you could just set REQ_PRIO in the code paths you
> care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
> the same thing, which essentially undoes commit 65299a3b788bd ("block:
> separate priority boosting from REQ_META") from Christoph.

I personally don't care too much about boosting for just PRIO, or for
both PRIO and META. For the important paths, we basically have both set
anyway. But here's a trace for one such hang:


[478381.198925] ------------[ cut here ]------------
[478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
[478381.201324]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
[478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[478381.203462] ionice          D ffff8803692736a8     0 1168369      1 
0x00000080
[478381.203466]  ffff8803692736a8 ffff880399c21300 ffff880276adcc00 
ffff880369273698
[478381.204589]  ffff880369273fd8 0000000000000000 7fffffffffffffff 
0000000000000002
[478381.205752]  ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 
0000000000000000
[478381.206874] Call Trace:
[478381.207253]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
[478381.208175]  [<ffffffff8177cea7>] schedule+0x37/0x90
[478381.208932]  [<ffffffff8177f5fc>] schedule_timeout+0x1dc/0x250
[478381.209805]  [<ffffffff81421c17>] ? __blk_run_queue+0x37/0x50
[478381.210706]  [<ffffffff810ca1c5>] ? ktime_get+0x45/0xb0
[478381.211489]  [<ffffffff8177c407>] io_schedule_timeout+0xa7/0x110
[478381.212402]  [<ffffffff810a8c2b>] ? prepare_to_wait+0x5b/0x90
[478381.213280]  [<ffffffff8177d616>] bit_wait_io+0x36/0x50
[478381.214063]  [<ffffffff8177d325>] __wait_on_bit+0x65/0x90
[478381.214961]  [<ffffffff8177d5e0>] ? bit_wait_io_timeout+0x80/0x80
[478381.215872]  [<ffffffff8177d47c>] out_of_line_wait_on_bit+0x7c/0x90
[478381.216806]  [<ffffffff810a89f0>] ? wake_atomic_t_function+0x40/0x40
[478381.217773]  [<ffffffff811f03aa>] __wait_on_buffer+0x2a/0x30
[478381.218641]  [<ffffffff8123c557>] ext4_bread+0x57/0x70
[478381.219425]  [<ffffffff8124498c>] __ext4_read_dirblock+0x3c/0x380
[478381.220467]  [<ffffffff8124665d>] ext4_dx_find_entry+0x7d/0x170
[478381.221357]  [<ffffffff8114c49e>] ? find_get_entry+0x1e/0xa0
[478381.222208]  [<ffffffff81246bd4>] ext4_find_entry+0x484/0x510
[478381.223090]  [<ffffffff812471a2>] ext4_lookup+0x52/0x160
[478381.223882]  [<ffffffff811c401d>] lookup_real+0x1d/0x60
[478381.224675]  [<ffffffff811c4698>] __lookup_hash+0x38/0x50
[478381.225697]  [<ffffffff817745bd>] lookup_slow+0x45/0xab
[478381.226941]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
[478381.227880]  [<ffffffff811c6a42>] path_init+0xc2/0x430
[478381.228677]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
[478381.229776]  [<ffffffff811c8c57>] path_openat+0x77/0x620
[478381.230767]  [<ffffffff81185c6e>] ? page_add_file_rmap+0x2e/0x70
[478381.232019]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
[478381.233016]  [<ffffffff8108c4a9>] ? creds_are_invalid+0x29/0x70
[478381.234072]  [<ffffffff811c0cb0>] do_open_execat+0x70/0x170
[478381.235039]  [<ffffffff811c1bf8>] do_execveat_common.isra.36+0x1b8/0x6e0
[478381.236051]  [<ffffffff811c214c>] do_execve+0x2c/0x30
[478381.236809]  [<ffffffff811ca392>] ? getname+0x12/0x20
[478381.237564]  [<ffffffff811c23be>] SyS_execve+0x2e/0x40
[478381.238338]  [<ffffffff81780a1d>] stub_execve+0x6d/0xa0
[478381.239126] ------------[ cut here ]------------
[478381.239915] ------------[ cut here ]------------
[478381.240606] INFO: task python2.7:1168375 blocked for more than 120 
seconds.
[478381.242673]       Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
[478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[478381.244902] python2.7       D ffff88005cf8fb98     0 1168375 1168248 
0x00000080
[478381.244904]  ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 
ffff88016c1f11a0
[478381.246023]  ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 
00000000ffffffff
[478381.247138]  ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 
ffff88005cf8fcc8
[478381.248252] Call Trace:
[478381.248630]  [<ffffffff8177cea7>] schedule+0x37/0x90
[478381.249382]  [<ffffffff8177d08e>] schedule_preempt_disabled+0xe/0x10
[478381.250465]  [<ffffffff8177e892>] __mutex_lock_slowpath+0x92/0x100
[478381.251409]  [<ffffffff8177e91b>] mutex_lock+0x1b/0x2f
[478381.252199]  [<ffffffff817745ae>] lookup_slow+0x36/0xab
[478381.253023]  [<ffffffff811c690e>] link_path_walk+0x7ae/0x820
[478381.253877]  [<ffffffff811aeb41>] ? try_charge+0xc1/0x700
[478381.254690]  [<ffffffff811c6a42>] path_init+0xc2/0x430
[478381.255525]  [<ffffffff813e6e26>] ? security_file_alloc+0x16/0x20
[478381.256450]  [<ffffffff811c8c57>] path_openat+0x77/0x620
[478381.257256]  [<ffffffff8115b2fb>] ? 
lru_cache_add_active_or_unevictable+0x2b/0xa0
[478381.258390]  [<ffffffff8117b623>] ? handle_mm_fault+0x13f3/0x1720
[478381.259309]  [<ffffffff811cb253>] do_filp_open+0x43/0xa0
[478381.260139]  [<ffffffff811d7ae2>] ? __alloc_fd+0x42/0x120
[478381.260962]  [<ffffffff811b95ac>] do_sys_open+0x13c/0x230
[478381.261779]  [<ffffffff81011393>] ? 
syscall_trace_enter_phase1+0x113/0x170
[478381.262851]  [<ffffffff811b96c2>] SyS_open+0x22/0x30
[478381.263598]  [<ffffffff81780532>] system_call_fastpath+0x12/0x17
[478381.264551] ------------[ cut here ]------------
[478381.265377] ------------[ cut here ]------------

These are reads. It's a classic case of starting some operation that
ends up holding a file system resource, then making very slow progress
(due to task being scheduled as idle IO), and hence blocking out
potentially much important tasks.

The IO is marked META|PRIO, so if folks don't are for boosting on meta,
we can just kill that. "Normal" meta data could be scheduled as idle,
but anything scheduled under a fs lock or similar should be marked PRIO
and get boosted.
Jeff Moyer June 9, 2016, 6:31 p.m. UTC | #5
Jens Axboe <axboe@kernel.dk> writes:

> On 06/09/2016 09:55 AM, Jeff Moyer wrote:
>> Hi, Jens,
>>
>> Jens Axboe <axboe@fb.com> writes:
>>
>>> At Facebook, we have a number of cases where people use ionice to set a
>>> lower priority, then end up having tasks stuck for a long time because
>>> eg meta data updates from an idle priority tasks is blocking out higher
>>> priority processes. It's bad enough that it will trigger the softlockup
>>> warning.
>>
>> I expect a higher prio process could be blocked on a lower prio process
>> reading the same metadata, too.  I had a hard time tracking down where
>> REQ_META WRITE I/O was issued outside of the journal or writeback paths
>> (and I hope you're not ionice-ing those!).  Eventually, with the help of
>> sandeen, I found some oddball cases that I doubt you're running into.
>> Can you enlighten me as to where this (REQ_META write I/O) is happening?
>> I don't disagree that it's a problem, I just would like to understand
>> your problem case better.
>>
>> Anyway, it seems to me you could just set REQ_PRIO in the code paths you
>> care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
>> the same thing, which essentially undoes commit 65299a3b788bd ("block:
>> separate priority boosting from REQ_META") from Christoph.
>
> I personally don't care too much about boosting for just PRIO, or for
> both PRIO and META. For the important paths, we basically have both set
> anyway.

Yeah, I had considered that.  I like that REQ_META is separated out for
logging purposes.  I'm not too concerned about whether we imply boosted
priority from that or not.

> These are reads.

I was curious about writes.  ;-)  Anyway, it's good to validate that the
read case is also relevant.

> It's a classic case of starting some operation that ends up holding a
> file system resource, then making very slow progress (due to task
> being scheduled as idle IO), and hence blocking out potentially much
> important tasks.
>
> The IO is marked META|PRIO, so if folks don't [care] for boosting on meta,
> we can just kill that. "Normal" meta data could be scheduled as idle,
> but anything scheduled under a fs lock or similar should be marked PRIO
> and get boosted.

Interesting.  I would have thought that the cfqd->active_queue would
have been preempted by a request marked with REQ_PRIO.  But you're
suggesting that did not happen?

Specifically, cfq_insert_request would call cfq_rq_enqueued, and that
would call cfq_preempt_queue if cfq_should_preempt (and I think it
should, since the new request has REQ_PRIO set).

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe June 9, 2016, 8:14 p.m. UTC | #6
On 06/09/2016 12:31 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> On 06/09/2016 09:55 AM, Jeff Moyer wrote:
>>> Hi, Jens,
>>>
>>> Jens Axboe <axboe@fb.com> writes:
>>>
>>>> At Facebook, we have a number of cases where people use ionice to set a
>>>> lower priority, then end up having tasks stuck for a long time because
>>>> eg meta data updates from an idle priority tasks is blocking out higher
>>>> priority processes. It's bad enough that it will trigger the softlockup
>>>> warning.
>>>
>>> I expect a higher prio process could be blocked on a lower prio process
>>> reading the same metadata, too.  I had a hard time tracking down where
>>> REQ_META WRITE I/O was issued outside of the journal or writeback paths
>>> (and I hope you're not ionice-ing those!).  Eventually, with the help of
>>> sandeen, I found some oddball cases that I doubt you're running into.
>>> Can you enlighten me as to where this (REQ_META write I/O) is happening?
>>> I don't disagree that it's a problem, I just would like to understand
>>> your problem case better.
>>>
>>> Anyway, it seems to me you could just set REQ_PRIO in the code paths you
>>> care about instead of modifying CFQ to treat REQ_META and REQ_PRIO as
>>> the same thing, which essentially undoes commit 65299a3b788bd ("block:
>>> separate priority boosting from REQ_META") from Christoph.
>>
>> I personally don't care too much about boosting for just PRIO, or for
>> both PRIO and META. For the important paths, we basically have both set
>> anyway.
>
> Yeah, I had considered that.  I like that REQ_META is separated out for
> logging purposes.  I'm not too concerned about whether we imply boosted
> priority from that or not.

Not a huge deal to me either, and most of the interesting REQ_META sites 
already set REQ_PRIO as well.

>> These are reads.
>
> I was curious about writes.  ;-)  Anyway, it's good to validate that the
> read case is also relevant.

You mean O_DIRECT writes? Most of the buffered writes will come out of 
the associated threads, so I don't think it's a big of an issue there. 
We've only seen it for reads.

>> It's a classic case of starting some operation that ends up holding a
>> file system resource, then making very slow progress (due to task
>> being scheduled as idle IO), and hence blocking out potentially much
>> important tasks.
>>
>> The IO is marked META|PRIO, so if folks don't [care] for boosting on meta,
>> we can just kill that. "Normal" meta data could be scheduled as idle,
>> but anything scheduled under a fs lock or similar should be marked PRIO
>> and get boosted.
>
> Interesting.  I would have thought that the cfqd->active_queue would
> have been preempted by a request marked with REQ_PRIO.  But you're
> suggesting that did not happen?
>
> Specifically, cfq_insert_request would call cfq_rq_enqueued, and that
> would call cfq_preempt_queue if cfq_should_preempt (and I think it
> should, since the new request has REQ_PRIO set).

We seem to handily mostly ignore prio_pending for the idle class. If the 
new queue is idle, then we don't look at prio pending. I'd rather make 
this more explicit, the patch is pretty similar to what we had in the 
past. It's somewhat of a regression caused by commit 4aede84b33d, except 
I like using the request flags for this a lot more than the old 
current->fs_excl. REQ_PRIO should always be set for cases where we hold 
fs (or even directory) specific resources.
Jeff Moyer June 9, 2016, 9:08 p.m. UTC | #7
Jens Axboe <axboe@kernel.dk> writes:

>> I was curious about writes.  ;-)  Anyway, it's good to validate that the
>> read case is also relevant.
>
> You mean O_DIRECT writes? Most of the buffered writes will come out of
> the associated threads, so I don't think it's a big of an issue
> there. We've only seen it for reads.

Well, you had me confused with your initial report:

"... because eg meta data updates..."

So I assumed that meant REQ_META WRITES.  My bad.

[snip]

>> Interesting.  I would have thought that the cfqd->active_queue would
>> have been preempted by a request marked with REQ_PRIO.  But you're
>> suggesting that did not happen?

[snip]

> We seem to handily mostly ignore prio_pending for the idle class. If

Right, I forgot we were talking about idle class.  Sorry.

> the new queue is idle, then we don't look at prio pending. I'd rather
> make this more explicit, the patch is pretty similar to what we had in
> the past. It's somewhat of a regression caused by commit 4aede84b33d,
> except I like using the request flags for this a lot more than the old
> current->fs_excl. REQ_PRIO should always be set for cases where we
> hold fs (or even directory) specific resources.

Ah, thanks for the reference!  Now I'll go back and finish reviewing the
actual patch.

-Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Moyer June 9, 2016, 9:28 p.m. UTC | #8
Jens Axboe <axboe@fb.com> writes:

> At Facebook, we have a number of cases where people use ionice to set a
> lower priority, then end up having tasks stuck for a long time because
> eg meta data updates from an idle priority tasks is blocking out higher
> priority processes. It's bad enough that it will trigger the softlockup
> warning.
>
> This patch adds code to CFQ that bumps the priority class and data for
> an idle task, if is doing IO marked as PRIO or META. With this, we no
> longer see the softlockups.
>
> Signed-off-by: Jens Axboe <axboe@fb.com>
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 32a283eb7274..3cfd67d006fb 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1781,6 +1781,11 @@ get_rq:
>  		rw_flags |= REQ_SYNC;
>  
>  	/*
> +	 * Add in META/PRIO flags, if set, before we get to the IO scheduler
> +	 */
> +	rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));
> +
> +	/*

This needs a docbook update.  It now reads:

 * @rw_flags: RW and SYNC flags

so whatever flags we're adding should be specified, I guess.

Speaking of which, after much waffling, I think I've decided it would be
cleaner to limit the priority boost to REQ_PRIO requests only.

Other than that, I think this looks fine.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe June 9, 2016, 9:36 p.m. UTC | #9
On 06/09/2016 03:28 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@fb.com> writes:
>
>> At Facebook, we have a number of cases where people use ionice to set a
>> lower priority, then end up having tasks stuck for a long time because
>> eg meta data updates from an idle priority tasks is blocking out higher
>> priority processes. It's bad enough that it will trigger the softlockup
>> warning.
>>
>> This patch adds code to CFQ that bumps the priority class and data for
>> an idle task, if is doing IO marked as PRIO or META. With this, we no
>> longer see the softlockups.
>>
>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 32a283eb7274..3cfd67d006fb 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1781,6 +1781,11 @@ get_rq:
>>   		rw_flags |= REQ_SYNC;
>>
>>   	/*
>> +	 * Add in META/PRIO flags, if set, before we get to the IO scheduler
>> +	 */
>> +	rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));
>> +
>> +	/*
>
> This needs a docbook update.  It now reads:
>
>   * @rw_flags: RW and SYNC flags
>
> so whatever flags we're adding should be specified, I guess.
>
> Speaking of which, after much waffling, I think I've decided it would be
> cleaner to limit the priority boost to REQ_PRIO requests only.

I went and checked, but I don't see it. Where is this?

> Other than that, I think this looks fine.

Thanks!
Jeff Moyer June 9, 2016, 9:47 p.m. UTC | #10
Jens Axboe <axboe@kernel.dk> writes:

> On 06/09/2016 03:28 PM, Jeff Moyer wrote:
>> Jens Axboe <axboe@fb.com> writes:
>>
>>> At Facebook, we have a number of cases where people use ionice to set a
>>> lower priority, then end up having tasks stuck for a long time because
>>> eg meta data updates from an idle priority tasks is blocking out higher
>>> priority processes. It's bad enough that it will trigger the softlockup
>>> warning.
>>>
>>> This patch adds code to CFQ that bumps the priority class and data for
>>> an idle task, if is doing IO marked as PRIO or META. With this, we no
>>> longer see the softlockups.
>>>
>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 32a283eb7274..3cfd67d006fb 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -1781,6 +1781,11 @@ get_rq:
>>>   		rw_flags |= REQ_SYNC;
>>>
>>>   	/*
>>> +	 * Add in META/PRIO flags, if set, before we get to the IO scheduler
>>> +	 */
>>> +	rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));
>>> +
>>> +	/*
>>
>> This needs a docbook update.  It now reads:
>>
>>   * @rw_flags: RW and SYNC flags
>>
>> so whatever flags we're adding should be specified, I guess.
>>
>> Speaking of which, after much waffling, I think I've decided it would be
>> cleaner to limit the priority boost to REQ_PRIO requests only.
>
> I went and checked, but I don't see it. Where is this?

Oops, sorry.  I meant that get_request and __get_request need updates to
their documentation.

On the second part (in case there was confusion on what I meant there),
what I meant was only do the prio boost for REQ_PRIO requests instead
of also doing it for REQ_META.  The way I arrived at that conclusion was
when I was going to ask you to update the documentation for REQ_META to
state that it implied REQ_PRIO, at which point, one has to wonder why we
need two flags.

There are cases where REQ_PRIO is used without REQ_META.
There are cases where REQ_META is used withoug REQ_PRIO.
And of course, there are cases where they're both sent down.

REQ_META itself is useful for tracing, and also makes the code
self-documenting.

REQ_PRIO pretty clearly means that we should boost priority for this
request.  And I think Christoph was making a case for REQ_META that
doesn't require a priority boost (if I read what he said correctly).

So, I think they serve different purposes.  Have I convinced you?  It'll
make your patch smaller!  ;-)

-Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jens Axboe June 9, 2016, 9:51 p.m. UTC | #11
On 06/09/2016 03:47 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> On 06/09/2016 03:28 PM, Jeff Moyer wrote:
>>> Jens Axboe <axboe@fb.com> writes:
>>>
>>>> At Facebook, we have a number of cases where people use ionice to set a
>>>> lower priority, then end up having tasks stuck for a long time because
>>>> eg meta data updates from an idle priority tasks is blocking out higher
>>>> priority processes. It's bad enough that it will trigger the softlockup
>>>> warning.
>>>>
>>>> This patch adds code to CFQ that bumps the priority class and data for
>>>> an idle task, if is doing IO marked as PRIO or META. With this, we no
>>>> longer see the softlockups.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@fb.com>
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 32a283eb7274..3cfd67d006fb 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1781,6 +1781,11 @@ get_rq:
>>>>    		rw_flags |= REQ_SYNC;
>>>>
>>>>    	/*
>>>> +	 * Add in META/PRIO flags, if set, before we get to the IO scheduler
>>>> +	 */
>>>> +	rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));
>>>> +
>>>> +	/*
>>>
>>> This needs a docbook update.  It now reads:
>>>
>>>    * @rw_flags: RW and SYNC flags
>>>
>>> so whatever flags we're adding should be specified, I guess.
>>>
>>> Speaking of which, after much waffling, I think I've decided it would be
>>> cleaner to limit the priority boost to REQ_PRIO requests only.
>>
>> I went and checked, but I don't see it. Where is this?
>
> Oops, sorry.  I meant that get_request and __get_request need updates to
> their documentation.

See followup email, that no longer applies!

> On the second part (in case there was confusion on what I meant there),
> what I meant was only do the prio boost for REQ_PRIO requests instead
> of also doing it for REQ_META.  The way I arrived at that conclusion was
> when I was going to ask you to update the documentation for REQ_META to
> state that it implied REQ_PRIO, at which point, one has to wonder why we
> need two flags.
>
> There are cases where REQ_PRIO is used without REQ_META.
> There are cases where REQ_META is used withoug REQ_PRIO.
> And of course, there are cases where they're both sent down.
>
> REQ_META itself is useful for tracing, and also makes the code
> self-documenting.
>
> REQ_PRIO pretty clearly means that we should boost priority for this
> request.  And I think Christoph was making a case for REQ_META that
> doesn't require a priority boost (if I read what he said correctly).
>
> So, I think they serve different purposes.  Have I convinced you?  It'll
> make your patch smaller!  ;-)

I got what you meant, the updated patch in that same followup email has 
it cut down to just REQ_PRIO and drops RQ_PRIO_MASK as a result.
diff mbox

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 32a283eb7274..3cfd67d006fb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1781,6 +1781,11 @@  get_rq:
 		rw_flags |= REQ_SYNC;
 
 	/*
+	 * Add in META/PRIO flags, if set, before we get to the IO scheduler
+	 */
+	rw_flags |= (bio->bi_rw & (REQ_META | REQ_PRIO));
+
+	/*
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4e5978426ee7..7969882e0a2a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -72,6 +72,8 @@  static struct kmem_cache *cfq_pool;
 #define CFQ_WEIGHT_LEGACY_DFL	500
 #define CFQ_WEIGHT_LEGACY_MAX	1000
 
+#define RQ_PRIO_MASK		(REQ_META | REQ_PRIO)
+
 struct cfq_ttime {
 	u64 last_end_request;
 
@@ -141,7 +143,7 @@  struct cfq_queue {
 
 	/* io prio of this group */
 	unsigned short ioprio, org_ioprio;
-	unsigned short ioprio_class;
+	unsigned short ioprio_class, org_ioprio_class;
 
 	pid_t pid;
 
@@ -1114,8 +1116,8 @@  cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2,
 	if (rq_is_sync(rq1) != rq_is_sync(rq2))
 		return rq_is_sync(rq1) ? rq1 : rq2;
 
-	if ((rq1->cmd_flags ^ rq2->cmd_flags) & REQ_PRIO)
-		return rq1->cmd_flags & REQ_PRIO ? rq1 : rq2;
+	if ((rq1->cmd_flags ^ rq2->cmd_flags) & RQ_PRIO_MASK)
+		return rq1->cmd_flags & RQ_PRIO_MASK ? rq1 : rq2;
 
 	s1 = blk_rq_pos(rq1);
 	s2 = blk_rq_pos(rq2);
@@ -2530,7 +2532,7 @@  static void cfq_remove_request(struct request *rq)
 
 	cfqq->cfqd->rq_queued--;
 	cfqg_stats_update_io_remove(RQ_CFQG(rq), req_op(rq), rq->cmd_flags);
-	if (rq->cmd_flags & REQ_PRIO) {
+	if (rq->cmd_flags & RQ_PRIO_MASK) {
 		WARN_ON(!cfqq->prio_pending);
 		cfqq->prio_pending--;
 	}
@@ -3700,6 +3702,7 @@  static void cfq_init_prio_data(struct cfq_queue *cfqq, struct cfq_io_cq *cic)
 	 * elevate the priority of this queue
 	 */
 	cfqq->org_ioprio = cfqq->ioprio;
+	cfqq->org_ioprio_class = cfqq->ioprio_class;
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
@@ -4012,7 +4015,7 @@  cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	 * So both queues are sync. Let the new request get disk time if
 	 * it's a metadata request and the current queue is doing regular IO.
 	 */
-	if ((rq->cmd_flags & REQ_PRIO) && !cfqq->prio_pending)
+	if ((rq->cmd_flags & RQ_PRIO_MASK) && !cfqq->prio_pending)
 		return true;
 
 	/* An idle queue should not be idle now for some reason */
@@ -4073,7 +4076,7 @@  cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	struct cfq_io_cq *cic = RQ_CIC(rq);
 
 	cfqd->rq_queued++;
-	if (rq->cmd_flags & REQ_PRIO)
+	if (rq->cmd_flags & RQ_PRIO_MASK)
 		cfqq->prio_pending++;
 
 	cfq_update_io_thinktime(cfqd, cfqq, cic);
@@ -4295,6 +4298,20 @@  static void cfq_completed_request(struct request_queue *q, struct request *rq)
 		cfq_schedule_dispatch(cfqd);
 }
 
+static void cfqq_boost_on_meta(struct cfq_queue *cfqq, int op_flags)
+{
+	if (!(op_flags & RQ_PRIO_MASK)) {
+		cfqq->ioprio_class = cfqq->org_ioprio_class;
+		cfqq->ioprio = cfqq->org_ioprio;
+		return;
+	}
+
+	if (cfq_class_idle(cfqq))
+		cfqq->ioprio_class = IOPRIO_CLASS_BE;
+	if (cfqq->ioprio > IOPRIO_NORM)
+		cfqq->ioprio = IOPRIO_NORM;
+}
+
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
 	if (cfq_cfqq_wait_request(cfqq) && !cfq_cfqq_must_alloc_slice(cfqq)) {
@@ -4325,6 +4342,7 @@  static int cfq_may_queue(struct request_queue *q, int op, int op_flags)
 	cfqq = cic_to_cfqq(cic, rw_is_sync(op, op_flags));
 	if (cfqq) {
 		cfq_init_prio_data(cfqq, cic);
+		cfqq_boost_on_meta(cfqq, op_flags);
 
 		return __cfq_may_queue(cfqq);
 	}