diff mbox series

[3/3] block: fix default IO priority handling again

Message ID 20220601145110.18162-3-jack@suse.cz (mailing list archive)
State New, archived
Headers show
Series block: Fix IO priority mess | expand

Commit Message

Jan Kara June 1, 2022, 2:51 p.m. UTC
Commit e70344c05995 ("block: fix default IO priority handling")
introduced an inconsistency in get_current_ioprio() that tasks without
IO context return IOPRIO_DEFAULT priority while tasks with freshly
allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
Tasks without IO context used to be rare before 5a9d041ba2f6 ("block:
move io_context creation into where it's needed") but after this commit
they became common because now only BFQ IO scheduler setups task's IO
context. Similar inconsistency is there for get_task_ioprio() so this
inconsistency is now exposed to userspace and userspace will see
different IO priority for tasks operating on devices with BFQ compared
to devices without BFQ. Furthemore the changes done by commit
e70344c05995 change the behavior when no IO priority is set for BFQ IO
scheduler which is also documented in ioprio_set(2) manpage - namely
that tasks without set IO priority will use IO priority based on their
nice value.

So make sure we default to IOPRIO_CLASS_NONE as used to be the case
before commit e70344c05995. Also cleanup alloc_io_context() to
explicitely set this IO priority for the allocated IO context.

Fixes: e70344c05995 ("block: fix default IO priority handling")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk-ioc.c        | 2 ++
 include/linux/ioprio.h | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

Comments

Damien Le Moal June 1, 2022, 3:08 p.m. UTC | #1
On 2022/06/01 23:51, Jan Kara wrote:
> Commit e70344c05995 ("block: fix default IO priority handling")
> introduced an inconsistency in get_current_ioprio() that tasks without
> IO context return IOPRIO_DEFAULT priority while tasks with freshly
> allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
> Tasks without IO context used to be rare before 5a9d041ba2f6 ("block:
> move io_context creation into where it's needed") but after this commit
> they became common because now only BFQ IO scheduler setups task's IO
> context. Similar inconsistency is there for get_task_ioprio() so this
> inconsistency is now exposed to userspace and userspace will see
> different IO priority for tasks operating on devices with BFQ compared
> to devices without BFQ. Furthemore the changes done by commit
> e70344c05995 change the behavior when no IO priority is set for BFQ IO
> scheduler which is also documented in ioprio_set(2) manpage - namely
> that tasks without set IO priority will use IO priority based on their
> nice value.
> 
> So make sure we default to IOPRIO_CLASS_NONE as used to be the case
> before commit e70344c05995. Also cleanup alloc_io_context() to
> explicitely set this IO priority for the allocated IO context.
> 
> Fixes: e70344c05995 ("block: fix default IO priority handling")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  block/blk-ioc.c        | 2 ++
>  include/linux/ioprio.h | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index df9cfe4ca532..63fc02042408 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -247,6 +247,8 @@ static struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
>  	INIT_HLIST_HEAD(&ioc->icq_list);
>  	INIT_WORK(&ioc->release_work, ioc_release_fn);
>  #endif
> +	ioc->ioprio = IOPRIO_DEFAULT;
> +
>  	return ioc;
>  }
>  
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index 774bb90ad668..d9dc78a15301 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -11,7 +11,7 @@
>  /*
>   * Default IO priority.
>   */
> -#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
> +#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0)

"man ioprio_set" says:

IOPRIO_CLASS_BE (2)
This is the best-effort scheduling class, which is the default for any process
that hasn't set  a  specific I/O  priority.

Which is why patch e70344c05995 introduced IOPRIO_DEFAULT definition using the
BE class, to have the kernel in sync with the manual.

The different ioprio leading to no BIO merging is definitely a problem but this
patch is not really fixing anything in my opinion. It simply gets back to the
previous "all 0s" ioprio initialization, which do not show the places where we
actually have missing ioprio initialization to IOPRIO_DEFAULT.

Isn't it simply that IOPRIO_DEFAULT should be set as the default for any bio
being allocated (in bio_alloc ?) before it is setup and inherits the user io
priority ? Otherwise, the bio io prio is indeed IOPRIO_CLASS_NONE/0 and changing
IOPRIO_DEFAULT to that value removes the differences you observed.

>  
>  /*
>   * Check that a priority value has a valid class.
Jan Kara June 1, 2022, 4:04 p.m. UTC | #2
On Thu 02-06-22 00:08:28, Damien Le Moal wrote:
> On 2022/06/01 23:51, Jan Kara wrote:
> > Commit e70344c05995 ("block: fix default IO priority handling")
> > introduced an inconsistency in get_current_ioprio() that tasks without
> > IO context return IOPRIO_DEFAULT priority while tasks with freshly
> > allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
> > Tasks without IO context used to be rare before 5a9d041ba2f6 ("block:
> > move io_context creation into where it's needed") but after this commit
> > they became common because now only BFQ IO scheduler setups task's IO
> > context. Similar inconsistency is there for get_task_ioprio() so this
> > inconsistency is now exposed to userspace and userspace will see
> > different IO priority for tasks operating on devices with BFQ compared
> > to devices without BFQ. Furthemore the changes done by commit
> > e70344c05995 change the behavior when no IO priority is set for BFQ IO
> > scheduler which is also documented in ioprio_set(2) manpage - namely
> > that tasks without set IO priority will use IO priority based on their
> > nice value.
> > 
> > So make sure we default to IOPRIO_CLASS_NONE as used to be the case
> > before commit e70344c05995. Also cleanup alloc_io_context() to
> > explicitely set this IO priority for the allocated IO context.
> > 
> > Fixes: e70344c05995 ("block: fix default IO priority handling")
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  block/blk-ioc.c        | 2 ++
> >  include/linux/ioprio.h | 2 +-
> >  2 files changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> > index df9cfe4ca532..63fc02042408 100644
> > --- a/block/blk-ioc.c
> > +++ b/block/blk-ioc.c
> > @@ -247,6 +247,8 @@ static struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
> >  	INIT_HLIST_HEAD(&ioc->icq_list);
> >  	INIT_WORK(&ioc->release_work, ioc_release_fn);
> >  #endif
> > +	ioc->ioprio = IOPRIO_DEFAULT;
> > +
> >  	return ioc;
> >  }
> >  
> > diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> > index 774bb90ad668..d9dc78a15301 100644
> > --- a/include/linux/ioprio.h
> > +++ b/include/linux/ioprio.h
> > @@ -11,7 +11,7 @@
> >  /*
> >   * Default IO priority.
> >   */
> > -#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
> > +#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0)
> 
> "man ioprio_set" says:
> 
> IOPRIO_CLASS_BE (2)
> This is the best-effort scheduling class, which is the default for any process
> that hasn't set  a  specific I/O  priority.
> 
> Which is why patch e70344c05995 introduced IOPRIO_DEFAULT definition using the
> BE class, to have the kernel in sync with the manual.

Yes, but it also has:

       If no I/O scheduler has been set for a thread, then by default the  I/O
       priority  will  follow  the  CPU nice value (setpriority(2)).  In Linux
       kernels before version 2.6.24, once an I/O priority had been set  using
       ioprio_set(),  there was no way to reset the I/O scheduling behavior to
       the default.  Since Linux 2.6.24, specifying ioprio as 0 can be used to
       reset to the default I/O scheduling behavior.

So apparently even the manpage itself is inconsistent ;) (and I will
neglect the mistake that the text says "no I/O scheduler" instead of "no
I/O priority"). And there is actually code in BFQ (and there used to be
code in CFQ) that does:
        switch (ioprio_class) {
	...
        case IOPRIO_CLASS_NONE:
                /*
                 * No prio set, inherit CPU scheduling settings.
                 */
                bfqq->new_ioprio = task_nice_ioprio(tsk);
                bfqq->new_ioprio_class = task_nice_ioclass(tsk);
                break;
	
So IOPRIO_CLASS_NONE indeed has a meaning and it used to be the default
one until Christoph's 5a9d041ba2f6 in most cases. Your change e70344c05995
happening before that actually didn't change much practically because IO
contexts were initialized with 0 priority anyway and that was what
get_current_ioprio() was returning.

> The different ioprio leading to no BIO merging is definitely a problem
> but this patch is not really fixing anything in my opinion. It simply
> gets back to the previous "all 0s" ioprio initialization, which do not
> show the places where we actually have missing ioprio initialization to
> IOPRIO_DEFAULT.

So I agree we should settle on how to treat IOs with unset IO priority. The
behavior to use task's CPU priority when IO priority is unset is there for
a *long* time and so I think we should preserve that. The question is where
in the stack should the switch from "unset" value to "effective ioprio"
happen. Switching in IO context is IMO too early since userspace needs to
be able to differentiate "unset" from "set to
IOPRIO_CLASS_BE,IOPRIO_BE_NORM". But we could have a helper like
current_effective_ioprio() that will do the magic with mangling unset IO
priority based on task's CPU priority.

The fact is that bio->bi_ioprio gets set to anything only for direct IO in
iomap_dio_rw(). The rest of the IO has priority unset (BFQ fetches the
priority from task's IO context and ignores priority on bios BTW). And the
only place where req->ioprio (inherited from bio->bi_ioprio) gets used is
in a few drivers to set HIGHPRI flag for IOPRIO_CLASS_RT IO and then the
relatively new code in mq-deadline.

So it all is very inconsistent mess :-|

> Isn't it simply that IOPRIO_DEFAULT should be set as the default for any bio
> being allocated (in bio_alloc ?) before it is setup and inherits the user io
> priority ? Otherwise, the bio io prio is indeed IOPRIO_CLASS_NONE/0 and changing
> IOPRIO_DEFAULT to that value removes the differences you observed.

Yes, I think that would make sence although as I explain above this is
somewhat independent to what the default IO priority behavior should be.

									Honza
Damien Le Moal June 2, 2022, 1:53 a.m. UTC | #3
On 2022/06/02 1:04, Jan Kara wrote:
> On Thu 02-06-22 00:08:28, Damien Le Moal wrote:
>> On 2022/06/01 23:51, Jan Kara wrote:
>>> Commit e70344c05995 ("block: fix default IO priority handling")
>>> introduced an inconsistency in get_current_ioprio() that tasks without
>>> IO context return IOPRIO_DEFAULT priority while tasks with freshly
>>> allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
>>> Tasks without IO context used to be rare before 5a9d041ba2f6 ("block:
>>> move io_context creation into where it's needed") but after this commit
>>> they became common because now only BFQ IO scheduler setups task's IO
>>> context. Similar inconsistency is there for get_task_ioprio() so this
>>> inconsistency is now exposed to userspace and userspace will see
>>> different IO priority for tasks operating on devices with BFQ compared
>>> to devices without BFQ. Furthemore the changes done by commit
>>> e70344c05995 change the behavior when no IO priority is set for BFQ IO
>>> scheduler which is also documented in ioprio_set(2) manpage - namely
>>> that tasks without set IO priority will use IO priority based on their
>>> nice value.
>>>
>>> So make sure we default to IOPRIO_CLASS_NONE as used to be the case
>>> before commit e70344c05995. Also cleanup alloc_io_context() to
>>> explicitely set this IO priority for the allocated IO context.
>>>
>>> Fixes: e70344c05995 ("block: fix default IO priority handling")
>>> Signed-off-by: Jan Kara <jack@suse.cz>
>>> ---
>>>  block/blk-ioc.c        | 2 ++
>>>  include/linux/ioprio.h | 2 +-
>>>  2 files changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
>>> index df9cfe4ca532..63fc02042408 100644
>>> --- a/block/blk-ioc.c
>>> +++ b/block/blk-ioc.c
>>> @@ -247,6 +247,8 @@ static struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
>>>  	INIT_HLIST_HEAD(&ioc->icq_list);
>>>  	INIT_WORK(&ioc->release_work, ioc_release_fn);
>>>  #endif
>>> +	ioc->ioprio = IOPRIO_DEFAULT;
>>> +
>>>  	return ioc;
>>>  }
>>>  
>>> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
>>> index 774bb90ad668..d9dc78a15301 100644
>>> --- a/include/linux/ioprio.h
>>> +++ b/include/linux/ioprio.h
>>> @@ -11,7 +11,7 @@
>>>  /*
>>>   * Default IO priority.
>>>   */
>>> -#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
>>> +#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0)
>>
>> "man ioprio_set" says:
>>
>> IOPRIO_CLASS_BE (2)
>> This is the best-effort scheduling class, which is the default for any process
>> that hasn't set  a  specific I/O  priority.
>>
>> Which is why patch e70344c05995 introduced IOPRIO_DEFAULT definition using the
>> BE class, to have the kernel in sync with the manual.
> 
> Yes, but it also has:
> 
>        If no I/O scheduler has been set for a thread, then by default the  I/O
>        priority  will  follow  the  CPU nice value (setpriority(2)).  In Linux
>        kernels before version 2.6.24, once an I/O priority had been set  using
>        ioprio_set(),  there was no way to reset the I/O scheduling behavior to
>        the default.  Since Linux 2.6.24, specifying ioprio as 0 can be used to
>        reset to the default I/O scheduling behavior.
> 
> So apparently even the manpage itself is inconsistent ;) (and I will
> neglect the mistake that the text says "no I/O scheduler" instead of "no
> I/O priority"). And there is actually code in BFQ (and there used to be
> code in CFQ) that does:

Arg, indeed, it is a mess. We need to fix this manpage too.


>         switch (ioprio_class) {
> 	...
>         case IOPRIO_CLASS_NONE:
>                 /*
>                  * No prio set, inherit CPU scheduling settings.
>                  */
>                 bfqq->new_ioprio = task_nice_ioprio(tsk);
>                 bfqq->new_ioprio_class = task_nice_ioclass(tsk);
>                 break;
> 	
> So IOPRIO_CLASS_NONE indeed has a meaning and it used to be the default
> one until Christoph's 5a9d041ba2f6 in most cases. Your change e70344c05995
> happening before that actually didn't change much practically because IO
> contexts were initialized with 0 priority anyway and that was what
> get_current_ioprio() was returning.

5a9d041ba2f6 was from Jens, but agreed, io priority initialization never has
been correct.

> 
>> The different ioprio leading to no BIO merging is definitely a problem
>> but this patch is not really fixing anything in my opinion. It simply
>> gets back to the previous "all 0s" ioprio initialization, which do not
>> show the places where we actually have missing ioprio initialization to
>> IOPRIO_DEFAULT.
> 
> So I agree we should settle on how to treat IOs with unset IO priority. The
> behavior to use task's CPU priority when IO priority is unset is there for
> a *long* time and so I think we should preserve that. The question is where
> in the stack should the switch from "unset" value to "effective ioprio"
> happen. Switching in IO context is IMO too early since userspace needs to
> be able to differentiate "unset" from "set to
> IOPRIO_CLASS_BE,IOPRIO_BE_NORM". But we could have a helper like
> current_effective_ioprio() that will do the magic with mangling unset IO
> priority based on task's CPU priority.

I agree that the task's CPU priority is a more sensible default. However, I do
not understand your point about the IO context being too early to set the
effective priority. If we do not do that, getting the issuer CPU priority will
not be easily possible, right ?

> 
> The fact is that bio->bi_ioprio gets set to anything only for direct IO in
> iomap_dio_rw(). The rest of the IO has priority unset (BFQ fetches the
> priority from task's IO context and ignores priority on bios BTW). And the
> only place where req->ioprio (inherited from bio->bi_ioprio) gets used is
> in a few drivers to set HIGHPRI flag for IOPRIO_CLASS_RT IO and then the
> relatively new code in mq-deadline.

Yes, only IOPRIO_CLASS_RT has an effect at the hardware level right now.
Everything else is only for the IO scheduler to play with. I am preparing a
patch series to support scsi/ata command duration limits though. That adds a new
IOPRIO_CLASS_DL and that one will also have an effect on the hardware.

Note that if a process/thread use ioprio_set(), we should also honor the ioprio
set for buffered reads, at least those page IOs that are issued directly from
the IO issuer context (which may include readahead... hmmm).

> 
> So it all is very inconsistent mess :-|

Indeed it is.

> 
>> Isn't it simply that IOPRIO_DEFAULT should be set as the default for any bio
>> being allocated (in bio_alloc ?) before it is setup and inherits the user io
>> priority ? Otherwise, the bio io prio is indeed IOPRIO_CLASS_NONE/0 and changing
>> IOPRIO_DEFAULT to that value removes the differences you observed.
> 
> Yes, I think that would make sence although as I explain above this is
> somewhat independent to what the default IO priority behavior should be.
I am OK with the use of the task CPU priority/ionice value as the default when
no other ioprio is set for a bio using the user aio_reqprio or ioprio_set(). If
this relies on task_nice_ioclass() as it is today (I see no reason to change
that), then the default class for regular tasks remains IOPRIO_CLASS_BE as is
defined by IOPRIO_DEFAULT.

But to avoid the performance regression you observed, we really need to be 100%
sure that all bios have their ->bi_ioprio field correctly initialized. Something
like:

void bio_set_effective_ioprio(struct *bio)
{
	switch (IOPRIO_PRIO_CLASS(bio->bi_ioprio)) {
	case IOPRIO_CLASS_RT:
	case IOPRIO_CLASS_BE:
	case IOPRIO_CLASS_IDLE:
		/*
		 * the bio ioprio was already set from an aio kiocb ioprio
		 * (aio->aio_reqprio) or from the issuer context ioprio if that
		 * context used ioprio_set().
		 */;
		return;
	case IOPRIO_CLASS_NONE:
	default:
		/* Use the current task CPU priority */
		bio->ioprio =
			IOPRIO_PRIO_VALUE(task_nice_ioclass(current),
					  task_nice_ioprio(current));
		return;
	}
}

being called before a bio is inserted in a scheduler or bypass inserted in the
dispatch queues should result in all BIOs having an ioprio that is set to
something other than IOPRIO_CLASS_NONE. And the obvious place may be simply at
the beginning of submit_bio(), before submit_bio_noacct() is called.

I am tempted to argue that block device drivers should never see any req with an
ioprio set to IOPRIO_CLASS_NONE, which means that no bio should ever enter the
block stack with that ioprio either. With the above solution, bios from DM
targets submitted with submit_bio_noacct() could still have IOPRIO_CLASS_NONE...
So would submit_bio_noacct() be the better place to call the effective ioprio
helper ?
Jan Kara June 6, 2022, 10:42 a.m. UTC | #4
On Thu 02-06-22 10:53:29, Damien Le Moal wrote:
> On 2022/06/02 1:04, Jan Kara wrote:
> > On Thu 02-06-22 00:08:28, Damien Le Moal wrote:
> >> The different ioprio leading to no BIO merging is definitely a problem
> >> but this patch is not really fixing anything in my opinion. It simply
> >> gets back to the previous "all 0s" ioprio initialization, which do not
> >> show the places where we actually have missing ioprio initialization to
> >> IOPRIO_DEFAULT.
> > 
> > So I agree we should settle on how to treat IOs with unset IO priority. The
> > behavior to use task's CPU priority when IO priority is unset is there for
> > a *long* time and so I think we should preserve that. The question is where
> > in the stack should the switch from "unset" value to "effective ioprio"
> > happen. Switching in IO context is IMO too early since userspace needs to
> > be able to differentiate "unset" from "set to
> > IOPRIO_CLASS_BE,IOPRIO_BE_NORM". But we could have a helper like
> > current_effective_ioprio() that will do the magic with mangling unset IO
> > priority based on task's CPU priority.
> 
> I agree that the task's CPU priority is a more sensible default. However,
> I do not understand your point about the IO context being too early to
> set the effective priority. If we do not do that, getting the issuer CPU
> priority will not be easily possible, right ?

I just meant that in the IO context we need to keep the information whether
IO priority is set to a particular value, or whether it is set to 0 meaning
"inherit from CPU priority". So we cannot just store the effective IO
priority immediately in the IO context instead of 0.

> > The fact is that bio->bi_ioprio gets set to anything only for direct IO in
> > iomap_dio_rw(). The rest of the IO has priority unset (BFQ fetches the
> > priority from task's IO context and ignores priority on bios BTW). And the
> > only place where req->ioprio (inherited from bio->bi_ioprio) gets used is
> > in a few drivers to set HIGHPRI flag for IOPRIO_CLASS_RT IO and then the
> > relatively new code in mq-deadline.
> 
> Yes, only IOPRIO_CLASS_RT has an effect at the hardware level right now.
> Everything else is only for the IO scheduler to play with. I am preparing a
> patch series to support scsi/ata command duration limits though. That adds a new
> IOPRIO_CLASS_DL and that one will also have an effect on the hardware.
> 
> Note that if a process/thread use ioprio_set(), we should also honor the ioprio
> set for buffered reads, at least those page IOs that are issued directly from
> the IO issuer context (which may include readahead... hmmm).

Yes, that would make sense as well.

> >> Isn't it simply that IOPRIO_DEFAULT should be set as the default for any bio
> >> being allocated (in bio_alloc ?) before it is setup and inherits the user io
> >> priority ? Otherwise, the bio io prio is indeed IOPRIO_CLASS_NONE/0 and changing
> >> IOPRIO_DEFAULT to that value removes the differences you observed.
> > 
> > Yes, I think that would make sence although as I explain above this is
> > somewhat independent to what the default IO priority behavior should be.
> I am OK with the use of the task CPU priority/ionice value as the default when
> no other ioprio is set for a bio using the user aio_reqprio or ioprio_set(). If
> this relies on task_nice_ioclass() as it is today (I see no reason to change
> that), then the default class for regular tasks remains IOPRIO_CLASS_BE as is
> defined by IOPRIO_DEFAULT.

Yes, good.

> But to avoid the performance regression you observed, we really need to be 100%
> sure that all bios have their ->bi_ioprio field correctly initialized. Something
> like:
> 
> void bio_set_effective_ioprio(struct *bio)
> {
> 	switch (IOPRIO_PRIO_CLASS(bio->bi_ioprio)) {
> 	case IOPRIO_CLASS_RT:
> 	case IOPRIO_CLASS_BE:
> 	case IOPRIO_CLASS_IDLE:
> 		/*
> 		 * the bio ioprio was already set from an aio kiocb ioprio
> 		 * (aio->aio_reqprio) or from the issuer context ioprio if that
> 		 * context used ioprio_set().
> 		 */;
> 		return;
> 	case IOPRIO_CLASS_NONE:
> 	default:
> 		/* Use the current task CPU priority */
> 		bio->ioprio =
> 			IOPRIO_PRIO_VALUE(task_nice_ioclass(current),
> 					  task_nice_ioprio(current));
> 		return;
> 	}
> }
> 
> being called before a bio is inserted in a scheduler or bypass inserted in the
> dispatch queues should result in all BIOs having an ioprio that is set to
> something other than IOPRIO_CLASS_NONE. And the obvious place may be simply at
> the beginning of submit_bio(), before submit_bio_noacct() is called.
> 
> I am tempted to argue that block device drivers should never see any req
> with an ioprio set to IOPRIO_CLASS_NONE, which means that no bio should
> ever enter the block stack with that ioprio either. With the above
> solution, bios from DM targets submitted with submit_bio_noacct() could
> still have IOPRIO_CLASS_NONE...  So would submit_bio_noacct() be the
> better place to call the effective ioprio helper ?

Yes, I also think it would be the cleanest if we made sure bio->ioprio is
always set to some value other than IOPRIO_CLASS_NONE. I'll see how we can
make that happen in the least painful way :). Thanks for your input!

								Honza
Jan Kara June 6, 2022, 2:21 p.m. UTC | #5
On Mon 06-06-22 12:42:02, Jan Kara wrote:
> On Thu 02-06-22 10:53:29, Damien Le Moal wrote:
> > But to avoid the performance regression you observed, we really need to be 100%
> > sure that all bios have their ->bi_ioprio field correctly initialized. Something
> > like:
> > 
> > void bio_set_effective_ioprio(struct *bio)
> > {
> > 	switch (IOPRIO_PRIO_CLASS(bio->bi_ioprio)) {
> > 	case IOPRIO_CLASS_RT:
> > 	case IOPRIO_CLASS_BE:
> > 	case IOPRIO_CLASS_IDLE:
> > 		/*
> > 		 * the bio ioprio was already set from an aio kiocb ioprio
> > 		 * (aio->aio_reqprio) or from the issuer context ioprio if that
> > 		 * context used ioprio_set().
> > 		 */;
> > 		return;
> > 	case IOPRIO_CLASS_NONE:
> > 	default:
> > 		/* Use the current task CPU priority */
> > 		bio->ioprio =
> > 			IOPRIO_PRIO_VALUE(task_nice_ioclass(current),
> > 					  task_nice_ioprio(current));
> > 		return;
> > 	}
> > }
> > 
> > being called before a bio is inserted in a scheduler or bypass inserted in the
> > dispatch queues should result in all BIOs having an ioprio that is set to
> > something other than IOPRIO_CLASS_NONE. And the obvious place may be simply at
> > the beginning of submit_bio(), before submit_bio_noacct() is called.
> > 
> > I am tempted to argue that block device drivers should never see any req
> > with an ioprio set to IOPRIO_CLASS_NONE, which means that no bio should
> > ever enter the block stack with that ioprio either. With the above
> > solution, bios from DM targets submitted with submit_bio_noacct() could
> > still have IOPRIO_CLASS_NONE...  So would submit_bio_noacct() be the
> > better place to call the effective ioprio helper ?
> 
> Yes, I also think it would be the cleanest if we made sure bio->ioprio is
> always set to some value other than IOPRIO_CLASS_NONE. I'll see how we can
> make that happen in the least painful way :). Thanks for your input!

When looking into this I've hit a snag: ioprio rq_qos policy relies on the
fact that bio->bi_ioprio remains at 0 (unless explicitely set to some other
value by userspace) until we call rq_qos_track() in blk_mq_submit_bio().
BTW this happens after we have attempted to merge the bio to existing
requests so ioprio rq_qos policy is going to show strange behavior wrt
merging - most of the bios will not be able to merge to existing queued
requests due to ioprio mismatch.

I'd say .track hook gets called too late to properly set bio->bi_ioprio.
Shouldn't we set the io priority much earlier - I'd be tempted to use
bio_associate_blkg_from_css() for this... What do people think?

								Honza
Niklas Cassel June 7, 2022, 12:13 p.m. UTC | #6
On Mon, Jun 06, 2022 at 04:21:36PM +0200, Jan Kara wrote:
> On Mon 06-06-22 12:42:02, Jan Kara wrote:
> > On Thu 02-06-22 10:53:29, Damien Le Moal wrote:
> > > But to avoid the performance regression you observed, we really need to be 100%
> > > sure that all bios have their ->bi_ioprio field correctly initialized. Something
> > > like:
> > > 
> > > void bio_set_effective_ioprio(struct *bio)
> > > {
> > > 	switch (IOPRIO_PRIO_CLASS(bio->bi_ioprio)) {
> > > 	case IOPRIO_CLASS_RT:
> > > 	case IOPRIO_CLASS_BE:
> > > 	case IOPRIO_CLASS_IDLE:
> > > 		/*
> > > 		 * the bio ioprio was already set from an aio kiocb ioprio
> > > 		 * (aio->aio_reqprio) or from the issuer context ioprio if that
> > > 		 * context used ioprio_set().
> > > 		 */;
> > > 		return;
> > > 	case IOPRIO_CLASS_NONE:
> > > 	default:
> > > 		/* Use the current task CPU priority */
> > > 		bio->ioprio =
> > > 			IOPRIO_PRIO_VALUE(task_nice_ioclass(current),
> > > 					  task_nice_ioprio(current));
> > > 		return;
> > > 	}
> > > }
> > > 
> > > being called before a bio is inserted in a scheduler or bypass inserted in the
> > > dispatch queues should result in all BIOs having an ioprio that is set to
> > > something other than IOPRIO_CLASS_NONE. And the obvious place may be simply at
> > > the beginning of submit_bio(), before submit_bio_noacct() is called.
> > > 
> > > I am tempted to argue that block device drivers should never see any req
> > > with an ioprio set to IOPRIO_CLASS_NONE, which means that no bio should
> > > ever enter the block stack with that ioprio either. With the above
> > > solution, bios from DM targets submitted with submit_bio_noacct() could
> > > still have IOPRIO_CLASS_NONE...  So would submit_bio_noacct() be the
> > > better place to call the effective ioprio helper ?
> > 
> > Yes, I also think it would be the cleanest if we made sure bio->ioprio is
> > always set to some value other than IOPRIO_CLASS_NONE. I'll see how we can
> > make that happen in the least painful way :). Thanks for your input!
> 
> When looking into this I've hit a snag: ioprio rq_qos policy relies on the
> fact that bio->bi_ioprio remains at 0 (unless explicitely set to some other
> value by userspace) until we call rq_qos_track() in blk_mq_submit_bio().
> BTW this happens after we have attempted to merge the bio to existing
> requests so ioprio rq_qos policy is going to show strange behavior wrt
> merging - most of the bios will not be able to merge to existing queued
> requests due to ioprio mismatch.
> 
> I'd say .track hook gets called too late to properly set bio->bi_ioprio.
> Shouldn't we set the io priority much earlier - I'd be tempted to use
> bio_associate_blkg_from_css() for this... What do people think?

Hello Jan,

bio_associate_blkg_from_css() is just an empty stub if CONFIG_BLK_CGROUP
is not set.

Having the effective ioprio set should correctly shouldn't depend on if
CONFIG_BLK_CGROUP is set or not, no?

The function name bio_associate_blkg_from_css() (css - cgroup_subsys_state)
also seems to imply that it should only perform cgroup related things, no?

AFAICT, both bfq and mq-deadline can currently prioritize requests without
CONFIG_BLK_CGROUP enabled.


Kind regards,
Niklas
Jan Kara June 7, 2022, 12:59 p.m. UTC | #7
On Tue 07-06-22 12:13:48, Niklas Cassel wrote:
> On Mon, Jun 06, 2022 at 04:21:36PM +0200, Jan Kara wrote:
> > On Mon 06-06-22 12:42:02, Jan Kara wrote:
> > > On Thu 02-06-22 10:53:29, Damien Le Moal wrote:
> > > > But to avoid the performance regression you observed, we really need to be 100%
> > > > sure that all bios have their ->bi_ioprio field correctly initialized. Something
> > > > like:
> > > > 
> > > > void bio_set_effective_ioprio(struct *bio)
> > > > {
> > > > 	switch (IOPRIO_PRIO_CLASS(bio->bi_ioprio)) {
> > > > 	case IOPRIO_CLASS_RT:
> > > > 	case IOPRIO_CLASS_BE:
> > > > 	case IOPRIO_CLASS_IDLE:
> > > > 		/*
> > > > 		 * the bio ioprio was already set from an aio kiocb ioprio
> > > > 		 * (aio->aio_reqprio) or from the issuer context ioprio if that
> > > > 		 * context used ioprio_set().
> > > > 		 */;
> > > > 		return;
> > > > 	case IOPRIO_CLASS_NONE:
> > > > 	default:
> > > > 		/* Use the current task CPU priority */
> > > > 		bio->ioprio =
> > > > 			IOPRIO_PRIO_VALUE(task_nice_ioclass(current),
> > > > 					  task_nice_ioprio(current));
> > > > 		return;
> > > > 	}
> > > > }
> > > > 
> > > > being called before a bio is inserted in a scheduler or bypass inserted in the
> > > > dispatch queues should result in all BIOs having an ioprio that is set to
> > > > something other than IOPRIO_CLASS_NONE. And the obvious place may be simply at
> > > > the beginning of submit_bio(), before submit_bio_noacct() is called.
> > > > 
> > > > I am tempted to argue that block device drivers should never see any req
> > > > with an ioprio set to IOPRIO_CLASS_NONE, which means that no bio should
> > > > ever enter the block stack with that ioprio either. With the above
> > > > solution, bios from DM targets submitted with submit_bio_noacct() could
> > > > still have IOPRIO_CLASS_NONE...  So would submit_bio_noacct() be the
> > > > better place to call the effective ioprio helper ?
> > > 
> > > Yes, I also think it would be the cleanest if we made sure bio->ioprio is
> > > always set to some value other than IOPRIO_CLASS_NONE. I'll see how we can
> > > make that happen in the least painful way :). Thanks for your input!
> > 
> > When looking into this I've hit a snag: ioprio rq_qos policy relies on the
> > fact that bio->bi_ioprio remains at 0 (unless explicitely set to some other
> > value by userspace) until we call rq_qos_track() in blk_mq_submit_bio().
> > BTW this happens after we have attempted to merge the bio to existing
> > requests so ioprio rq_qos policy is going to show strange behavior wrt
> > merging - most of the bios will not be able to merge to existing queued
> > requests due to ioprio mismatch.
> > 
> > I'd say .track hook gets called too late to properly set bio->bi_ioprio.
> > Shouldn't we set the io priority much earlier - I'd be tempted to use
> > bio_associate_blkg_from_css() for this... What do people think?
> 
> Hello Jan,
> 
> bio_associate_blkg_from_css() is just an empty stub if CONFIG_BLK_CGROUP
> is not set.
> 
> Having the effective ioprio set should correctly shouldn't depend on if
> CONFIG_BLK_CGROUP is set or not, no?
> 
> The function name bio_associate_blkg_from_css() (css - cgroup_subsys_state)
> also seems to imply that it should only perform cgroup related things, no?
> 
> AFAICT, both bfq and mq-deadline can currently prioritize requests without
> CONFIG_BLK_CGROUP enabled.

Correct on all points. However the ioprio rq_qos policy very much depends
on the cgroup support. So at least the update of bio->bi_ioprio based on
that policy would make sense in bio_associate_blkg_from_css(). OTOH
thinking about it now, we would have a problem that if bio blkcg
association changes, we don't have enough information to update the
bi_ioprio accordingly (since we don't know whether the resulting ioprio is
set by userspace or by ioprio rq_qos policy). So probably we need make
ioprio rq_qos policy to set the priority later (likely at bio submission
time when bio blkcg association is stable) but we need to do it earlier
than after we try to merge the bio...

								Honza
diff mbox series

Patch

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index df9cfe4ca532..63fc02042408 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -247,6 +247,8 @@  static struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 	INIT_HLIST_HEAD(&ioc->icq_list);
 	INIT_WORK(&ioc->release_work, ioc_release_fn);
 #endif
+	ioc->ioprio = IOPRIO_DEFAULT;
+
 	return ioc;
 }
 
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 774bb90ad668..d9dc78a15301 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -11,7 +11,7 @@ 
 /*
  * Default IO priority.
  */
-#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
+#define IOPRIO_DEFAULT	IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0)
 
 /*
  * Check that a priority value has a valid class.