
[block-5.14] Revert "block/mq-deadline: Add cgroup support"

Message ID YRQL2dlLsQ6mGNtz@slm.duckdns.org (mailing list archive)
State New, archived
Series [block-5.14] Revert "block/mq-deadline: Add cgroup support"

Commit Message

Tejun Heo Aug. 11, 2021, 5:41 p.m. UTC
From e150c6478e453fe27b5cf83ed5d03b7582b6d35e Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 11 Aug 2021 07:29:20 -1000

This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a
WARN_ON_ONCE() call"). The added cgroup support has the following issues:

* It breaks the cgroup interface file format rules by adding custom elements to a
  nested key-value file.

* It registers mq-deadline as a cgroup-aware policy even though all it's
  doing is collecting per-cgroup stats. Even if we need these stats, this
  isn't the right way to add them.

* It hasn't been reviewed from the cgroup side.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/Kconfig.iosched                       |   6 -
 block/Makefile                              |   2 -
 block/mq-deadline-cgroup.c                  | 126 --------------------
 block/mq-deadline-cgroup.h                  | 114 ------------------
 block/{mq-deadline-main.c => mq-deadline.c} |  73 +++---------
 5 files changed, 14 insertions(+), 307 deletions(-)
 delete mode 100644 block/mq-deadline-cgroup.c
 delete mode 100644 block/mq-deadline-cgroup.h
 rename block/{mq-deadline-main.c => mq-deadline.c} (95%)

Comments

Bart Van Assche Aug. 11, 2021, 6:49 p.m. UTC | #1
On 8/11/21 10:41 AM, Tejun Heo wrote:
>  From e150c6478e453fe27b5cf83ed5d03b7582b6d35e Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 11 Aug 2021 07:29:20 -1000
> 
> This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
> and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a
> WARN_ON_ONCE() call"). The added cgroup support has the following issues:
> 
> * It breaks the cgroup interface file format rules by adding custom elements to a
>    nested key-value file.
> 
> * It registers mq-deadline as a cgroup-aware policy even though all it's
>    doing is collecting per-cgroup stats. Even if we need these stats, this
>    isn't the right way to add them.
> 
> * It hasn't been reviewed from the cgroup side.

Agreed that I should have Cc-ed you on the cgroup patches. But where 
were you while my mq-deadline patch series was out for review? The first 
version of that patch series was published on May 27 and the patch 
series was merged on June 21, so there was almost a month to post 
review feedback.

Additionally, the above description is not very helpful. If adding custom 
elements via more pd_stat_fn callbacks is not allowed, why does that 
callback even exist? Why does the cgroup core not complain if a new 
policy is registered that defines a pd_stat_fn callback?

You write that this isn't the right way to collect per cgroup stats. 
What is the "right way"? Has this been documented somewhere?

Bart.
Tejun Heo Aug. 11, 2021, 7:14 p.m. UTC | #2
Hello, Bart.

On Wed, Aug 11, 2021 at 11:49:10AM -0700, Bart Van Assche wrote:
> Agreed that I should have Cc-ed you on the cgroup patches. But where were
> you while my mq-deadline patch series was out for review? The first version
> of that patch series was published on May 27 and the patch series was merged
> on June 21, so there was almost a month to post review feedback.

Regardless of where I've been, I can't really review things which
don't show up on my radar. The patches didn't even Cc the cgroups
mailing list. How would I know that I needed to review the patches?

> Additionally, the above description is not very helpful. If adding custom
> elements via more pd_stat_fn callbacks is not allowed, why does that callback
> even exist? Why does the cgroup core not complain if a new policy is
> registered that defines a pd_stat_fn callback?

That part of the comment was about the specific formatting that you used.
cgroup interface files follow a few styles to stay consistent and ease
parsing. Please refer to Documentation/admin-guide/cgroup-v2.rst.
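
For example, io.stat is a nested keyed file: one record per device, each
record being a flat list of KEY=VAL pairs (values illustrative):

  8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0

The reverted dd_pd_stat() appended custom " [CLASS] dispatched=...
inserted=... merged=..." elements to those records, which is the format
breakage in question.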

> You write that this isn't the right way to collect per cgroup stats. What is
> the "right way"? Has this been documented somewhere?

Well, there's nothing specific to mq-deadline or any other elevator or
controller about the stats that your patch collected and showed. That
seems like a pretty straightforward sign that it likely doesn't
belong there.

Thanks.
Jens Axboe Aug. 11, 2021, 7:48 p.m. UTC | #3
On 8/11/21 11:41 AM, Tejun Heo wrote:
> From e150c6478e453fe27b5cf83ed5d03b7582b6d35e Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 11 Aug 2021 07:29:20 -1000
> 
> This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
> and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a
> WARN_ON_ONCE() call"). The added cgroup support has the following issues:
> 
> * It breaks the cgroup interface file format rules by adding custom elements to a
>   nested key-value file.
> 
> * It registers mq-deadline as a cgroup-aware policy even though all it's
>   doing is collecting per-cgroup stats. Even if we need these stats, this
>   isn't the right way to add them.
> 
> * It hasn't been reviewed from the cgroup side.

I missed that the cgroup side hadn't seen or signed off on this one. I
have applied this revert for 5.14.
Bart Van Assche Aug. 11, 2021, 8:22 p.m. UTC | #4
On 8/11/21 12:14 PM, Tejun Heo wrote:
> On Wed, Aug 11, 2021 at 11:49:10AM -0700, Bart Van Assche wrote:
>> You write that this isn't the right way to collect per cgroup stats. What is
>> the "right way"? Has this been documented somewhere?
> 
> Well, there's nothing specific to mq-deadline or any other elevator or
> controller about the stats that your patch collected and showed. That
> seems like a pretty straightforward sign that it likely doesn't
> belong there.

Do you perhaps want these statistics to be reported via read-only cgroup 
attributes of a new cgroup policy that is independent of any particular 
I/O scheduler?

Thanks,

Bart.
Oleksandr Natalenko Aug. 12, 2021, 2:14 p.m. UTC | #5
Hi.

On Wednesday, August 11, 2021 21:48:19 CEST Jens Axboe wrote:
> On 8/11/21 11:41 AM, Tejun Heo wrote:
> > From e150c6478e453fe27b5cf83ed5d03b7582b6d35e Mon Sep 17 00:00:00 2001
> > From: Tejun Heo <tj@kernel.org>
> > Date: Wed, 11 Aug 2021 07:29:20 -1000
> > 
> > This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
> > and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a
> > WARN_ON_ONCE() call"). The added cgroup support has the following issues:
> > 
> > * It breaks the cgroup interface file format rules by adding custom elements to
> >   a nested key-value file.
> > 
> > * It registers mq-deadline as a cgroup-aware policy even though all it's
> >   doing is collecting per-cgroup stats. Even if we need these stats, this
> >   isn't the right way to add them.
> > 
> > * It hasn't been reviewed from the cgroup side.
> 
> I missed that the cgroup side hadn't seen or signed off on this one. I
> have applied this revert for 5.14.

Should my commit [1] be reverted too?

Thanks.

[1] https://git.kernel.dk/cgit/linux-block/commit/?h=block-5.14&id=ec645dc96699ea6c37b6de86c84d7288ea9a4ddf
Jens Axboe Aug. 12, 2021, 3:50 p.m. UTC | #6
On 8/12/21 8:14 AM, Oleksandr Natalenko wrote:
> Hi.
> 
> On Wednesday, August 11, 2021 21:48:19 CEST Jens Axboe wrote:
>> On 8/11/21 11:41 AM, Tejun Heo wrote:
>>> From e150c6478e453fe27b5cf83ed5d03b7582b6d35e Mon Sep 17 00:00:00 2001
>>> From: Tejun Heo <tj@kernel.org>
>>> Date: Wed, 11 Aug 2021 07:29:20 -1000
>>>
>>> This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
>>> and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a
>>> WARN_ON_ONCE() call"). The added cgroup support has the following issues:
>>>
>>> * It breaks the cgroup interface file format rules by adding custom elements to
>>>   a nested key-value file.
>>>
>>> * It registers mq-deadline as a cgroup-aware policy even though all it's
>>>   doing is collecting per-cgroup stats. Even if we need these stats, this
>>>   isn't the right way to add them.
>>>
>>> * It hasn't been reviewed from the cgroup side.
>>
>> I missed that the cgroup side hadn't seen or signed off on this one. I
>> have applied this revert for 5.14.
> 
> Should my commit [1] be reverted too?
> 
> Thanks.
> 
> [1] https://git.kernel.dk/cgit/linux-block/commit/?h=block-5.14&id=ec645dc96699ea6c37b6de86c84d7288ea9a4ddf

It could be, but may as well leave it for now as we're super close to
the limit anyway.
Tejun Heo Aug. 12, 2021, 5:51 p.m. UTC | #7
On Wed, Aug 11, 2021 at 01:22:20PM -0700, Bart Van Assche wrote:
> On 8/11/21 12:14 PM, Tejun Heo wrote:
> > On Wed, Aug 11, 2021 at 11:49:10AM -0700, Bart Van Assche wrote:
> > > You write that this isn't the right way to collect per cgroup stats. What is
> > > the "right way"? Has this been documented somewhere?
> > 
> > Well, there's nothing specific to mq-deadline or any other elevator or
> > controller about the stats that your patch collected and showed. That
> seems like a pretty straightforward sign that it likely doesn't
> > belong there.
> 
> Do you perhaps want these statistics to be reported via read-only cgroup
> attributes of a new cgroup policy that is independent of any particular I/O
> scheduler?

There's an almost fundamental conflict between ioprio and cgroup IO
control. bfq layers it so that ioprio classes define the global
priority above weights and then ioprio modifies the weights within
each class. mq-deadline isn't cgroup aware and who knows what kind of
priority inversions it's creating when its ioprio enforcement is
interacting with other cgroup controllers.

The problem is that as currently used, they're specifying the same
things - how IO should be distributed globally in the system, and
there's no right way to make the two configuration
regimes agree on what should happen on the system.

I can see two paths forward:

1. Accept that ioprio isn't something which makes sense with cgroup IO
   control in a generic manner and approach it in a per-configuration
   manner, either by doing whatever the specific combination decided
   to do with ioprio or ignoring it.

2. The only generic way to integrate ioprio and cgroup IO control
   would be nesting ioprio inside cgroup IO control, so that ioprio
   can express per-process priority within each cgroup. While this
   makes semantic sense and can be useful in certain scenarios, this
   is also a departure from how people have been using ioprio and it'd
   be involve quite a bit of effort and complexity, likely too much to
   be justified by its inherent usefulness.

Jens, what do you think?

Thanks.
Bart Van Assche Aug. 12, 2021, 6:16 p.m. UTC | #8
On 8/12/21 10:51 AM, Tejun Heo wrote:
> There's an almost fundamental conflict between ioprio and cgroup IO
> control. bfq layers it so that ioprio classes define the global
> priority above weights and then ioprio modifies the weights within
> each class. mq-deadline isn't cgroup aware and who knows what kind of
> priority inversions it's creating when its ioprio enforcement is
> interacting with other cgroup controllers.

Are you perhaps referring to the iocost and iolatency cgroup 
controllers? Is it ever useful to combine these controllers with the 
ioprio controller? The ioprio controller was introduced with the goal of 
providing information to the storage controller about which I/O 
request to handle first. My understanding of the iocost and iolatency 
controllers is that these cgroup controllers decide in which order to 
process I/O requests. Neither controller has the last word over I/O 
order if the queue depth is larger than one, something that is essential 
to achieve reasonable performance.

Bart.
Jens Axboe Aug. 12, 2021, 6:56 p.m. UTC | #9
On 8/12/21 11:51 AM, Tejun Heo wrote:
> On Wed, Aug 11, 2021 at 01:22:20PM -0700, Bart Van Assche wrote:
>> On 8/11/21 12:14 PM, Tejun Heo wrote:
>>> On Wed, Aug 11, 2021 at 11:49:10AM -0700, Bart Van Assche wrote:
>>>> You write that this isn't the right way to collect per cgroup stats. What is
>>>> the "right way"? Has this been documented somewhere?
>>>
>>> Well, there's nothing specific to mq-deadline or any other elevator or
>>> controller about the stats that your patch collected and showed. That
> seems like a pretty straightforward sign that it likely doesn't
>>> belong there.
>>
>> Do you perhaps want these statistics to be reported via read-only cgroup
>> attributes of a new cgroup policy that is independent of any particular I/O
>> scheduler?
> 
> There's an almost fundamental conflict between ioprio and cgroup IO
> control. bfq layers it so that ioprio classes define the global
> priority above weights and then ioprio modifies the weights within
> each class. mq-deadline isn't cgroup aware and who knows what kind of
> priority inversions it's creating when its ioprio enforcement is
> interacting with other cgroup controllers.
> 
> The problem is that as currently used, they're specifying the same
> things - how IO should be distributed globally in the system, and
> there's no right way to make the two configuration
> regimes agree on what should happen on the system.
> 
> I can see two paths forward:
> 
> 1. Accept that ioprio isn't something which makes sense with cgroup IO
>    control in a generic manner and approach it in a per-configuration
>    manner, either by doing whatever the specific combination decided
>    to do with ioprio or ignoring it.
> 
> 2. The only generic way to integrate ioprio and cgroup IO control
>    would be nesting ioprio inside cgroup IO control, so that ioprio
>    can express per-process priority within each cgroup. While this
>    makes semantic sense and can be useful in certain scenarios, this
>    is also a departure from how people have been using ioprio and it'd
>    involve quite a bit of effort and complexity, likely too much to
>    be justified by its inherent usefulness.
> 
> Jens, what do you think?

On the surface, #2 makes the most sense. But you'd then have to apply
some scaling before it reaches the hardware side or is factored in by
the underlying scheduler, or a high-priority request from a cgroup with a
small share of the total resources could end up being regarded as more
important than a lower-priority request from a cgroup with a much higher
share of the total resources.
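
Roughly, any such scaling would have to fold the cgroup share into the
comparison before ioprio is considered. A toy sketch (the struct, the
helper and the scaling rule are all invented for illustration):

#include <stdbool.h>
#include <stdint.h>

struct rq_rank {
	uint32_t cgroup_share;	/* hypothetical share derived from io.weight */
	uint32_t ioprio_level;	/* 0 (highest) .. 7 (lowest) */
};

/* Should request a be dispatched before request b? */
static bool rank_before(struct rq_rank a, struct rq_rank b)
{
	/*
	 * Comparing raw ioprio alone would let a high-priority request
	 * from a tiny-share cgroup outrank everything from a cgroup
	 * holding most of the device, so scale the share in first.
	 */
	uint64_t ka = (uint64_t)a.cgroup_share * (8 - a.ioprio_level);
	uint64_t kb = (uint64_t)b.cgroup_share * (8 - b.ioprio_level);

	return ka > kb;
}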

Hence not really sure it makes a lot of sense... We could probably come
up with some heuristics that make some sense, but they'd still just be
heuristics.
Tejun Heo Aug. 12, 2021, 7:10 p.m. UTC | #10
Hello,

On Thu, Aug 12, 2021 at 12:56:47PM -0600, Jens Axboe wrote:
> On the surface, #2 makes the most sense. But you'd then have to apply
> some scaling before it reaches the hardware side or is factored in by
> the underlying scheduler, or a high-priority request from a cgroup with a
> small share of the total resources could end up being regarded as more
> important than a lower-priority request from a cgroup with a much higher
> share of the total resources.

I don't think hardware side support is all that useful for mainstream
use cases. Whatever is controlling IO has to own the queue by being
the most throttling portion of the pipeline anyway and at least my
experience has been that what matters a lot more is the overall rate
of throttling rather than how specific IO is handled - i.e. you can try
to feed one IO at a time to a SSD with the right priority marking but
it won't do much good if the sequence includes a lot of back-to-back
writes which can saturate some portion of the SSD in terms of
buffering and GC. Also, the configuration granularity is too coarse to
adapt to generic use cases and the history is pretty clear on optional
advanced features on mainstream storage devices.

> Hence not really sure it makes a lot of sense... We could probably come
> up with some heuristics that make some sense, but they'd still just be
> heuristics.

Yeah, I have a hard time imagining a sane and logical mapping between
cgroup IO control and hardware prioritization features. For use cases
and hardware configuration which can benefit from prioritization on
the hardware side, I think the right solution is doing it outside the
cgroup framework, or at least the usual IO controllers.

Thanks.
Tejun Heo Aug. 12, 2021, 7:23 p.m. UTC | #11
Hello,

On Thu, Aug 12, 2021 at 11:16:37AM -0700, Bart Van Assche wrote:
> Are you perhaps referring to the iocost and iolatency cgroup controllers? Is

and blk-throtl

> it ever useful to combine these controllers with the ioprio controller? The

So, I used controller as in cgroup IO controller, something which can
distribute IO resources hierarchically according to either weights or
limits. In that sense, there is no such thing as an ioprio controller.
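
For example, with the cgroup v2 interface a weight-based distribution is
configured through io.weight and limits through io.max (device numbers
and values illustrative):

  # echo "default 100" > io.weight
  # echo "8:16 riops=4000 wbps=2097152" > io.max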

> ioprio controller was introduced with the goal of providing information to
> the storage controller about which I/O request to handle first. My

My experience has been that this isn't all that useful for generic IO
control scenarios involving cgroups. The configuration is too flexible
and granular to map to hardware priorities and there are way more
significant factors than how a controller manages its internal ordering
on most commodity SSDs. For situations where such a feature is useful,
cgroup might be able to help with tagging the associated priorities
but I don't think there's much beyond that and I have a hard time
seeing why the existing ioprio interface wouldn't be enough.

> understanding of the iocost and iolatency controllers is that these cgroup
> controllers decide in which order to process I/O requests. Neither
> controller has the last word over I/O order if the queue depth is larger
> than one, something that is essential to achieve reasonable performance.

Well, whoever owns the queue can control the latencies and it isn't
difficult to mess up while issuing one command at a time, so if the
strategy is stuffing the device queue as much as possible and telling the
device what to do, the end result is gonna be pretty sad.

Thanks.
Damien Le Moal Aug. 13, 2021, 2:18 a.m. UTC | #12
On 2021/08/13 4:23, Tejun Heo wrote:
> Hello,
> 
> On Thu, Aug 12, 2021 at 11:16:37AM -0700, Bart Van Assche wrote:
>> Are you perhaps referring to the iocost and iolatency cgroup controllers? Is
> 
> and blk-throtl
> 
>> it ever useful to combine these controllers with the ioprio controller? The
> 
> So, I used controller as in cgroup IO controller, something which can
> distribute IO resources hierarchically according to either weights or
> limits. In that sense, there is no such thing as an ioprio controller.
> 
>> ioprio controller was introduced with the goal of providing information to
>> the storage controller about which I/O request to handle first. My
> 
> My experience has been that this isn't all that useful for generic IO
> control scenarios involving cgroups. The configuration is too flexible
> and granular to map to hardware priorities and there are way more
> significant factors than how a controller manages its internal ordering
> on most commodity SSDs. For situations where such a feature is useful,
> cgroup might be able to help with tagging the associated priorities
> but I don't think there's much beyond that and I have a hard time
> seeing why the existing ioprio interface wouldn't be enough.
> 
>> understanding of the iocost and iolatency controllers is that these cgroup
>> controllers decide in which order to process I/O requests. Neither
>> controller has the last word over I/O order if the queue depth is larger
>> than one, something that is essential to achieve reasonable performance.
> 
> Well, whoever owns the queue can control the latencies and it isn't
> difficult to mess up while issuing one command at a time, so if the
> strategy is stuffing the device queue as much as possible and telling the
> device what to do, the end result is gonna be pretty sad.

Let me throw in more information related to this.

The Command duration limits (CDL) and Sequestered commands features are being
drafted in SPC/SBC and ACS to give the device better hints than just an on/off
high priority bit. I am currently prototyping these features and I am reusing
the ioprio interface for that. Here is how this works:
1) The drive exposes a set of command duration limit descriptors (up to 7 for
reads and 7 for writes) that define duration limits for a command's execution:
overall processing time, queuing time and execution time. Each duration limit has
a policy associated with it that is applied if a command's processing exceeds one
of the defined time limits: continue, continue but signal limit exceeded, abort.
2) Users can change the drive command duration limits to whatever they need
(e.g. change the policies for the limits to get a fast-fail behavior for
commands instead of having the drive retry for a long time)
3) When issuing IOs, users (or FSes) can apply a command duration limit
descriptor by specifying the IOPRIO_CLASS_DL priority class. The priority level
for that class indicates the descriptor to apply to the command.
4) At SCSI/ATA level, read and write commands have 3 bits defined to specify the
command descriptor to apply to the command (1 to 7 or 0 for no limit)

With that in place, the disk firmware can now make more intelligent decisions on
command scheduling to keep performance high at high queue depth without
increasing latency for commands that have low duration limits. And based on the
policy defined for a limit, this can be a "soft" best-effort optimization by the
disk, or a hard one with aborts if the drive decides that what the user is
asking for is not possible.
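
On the submitter side the prototype just reuses the existing ioprio
plumbing. A minimal sketch of tagging a process's I/O with a CDL
descriptor (IOPRIO_CLASS_DL and its value are from my prototype, not
mainline; the other constants are from linux/ioprio.h):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT		13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_WHO_PROCESS		1
#define IOPRIO_CLASS_DL			4	/* prototype-only duration limit class */

int main(void)
{
	/* Select CDL descriptor 3 for all subsequent I/O of this process. */
	int prio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, 3);

	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio) < 0) {
		perror("ioprio_set");
		return 1;
	}
	/* Reads and writes issued from here on carry descriptor 3 (0 = no limit). */
	return 0;
}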

CDL can completely replace the existing binary on/off NCQ priority in a more
flexible manner as the user can set different duration limits for high vs low
priority. E.g. high priority is equivalent to a very short limit while low
priority is equivalent to longer or no limits.

I think that CDL has the potential for better interactions with cgroups as
cgroup controllers can install a set of limits on the drive that fits the
controller target policy. E.g., the latency controller can set duration limits
and use the IOPRIO_CLASS_DL class to tell the drive the exact latency target to use.

In my implementation, I have not yet looked into cgroups integration for CDL
though. I am still wondering what the best approach is: defining a new
controller or integrating into existing controllers. The former is likely easier
than the latter, but having hardware support for existing controllers has the
potential to improve them seamlessly without forcing the user to change anything
in their application setup.

CDL is still in draft state in the specs though. So I will not be sending this yet.

> 
> Thanks.
>
Tejun Heo Aug. 13, 2021, 4:29 p.m. UTC | #13
Hello, Damien.

On Fri, Aug 13, 2021 at 02:18:04AM +0000, Damien Le Moal wrote:
> Command duration limits (CDL) and Sequestered commands features are being
> drafted in SPC/SBC and ACS to give the device better hints than just a on/off
> high priority bit. I am currently prototyping these features and I am reusing
> the ioprio interface for that. Here is how this works:

And I think ioprio would be the right interface for it, combined with
some mechanism which can limit which class can take up how many of the
available command slots. Without that, it'd be really easy for anyone
to saturate the command queues and expressing priorities inside the
device won't do much.

> CDL can completely replace the existing binary on/off NCQ priority in a more
> flexible manner as the user can set different duration limits for high vs low
> priority. E.g. high priority is equivalent to a very short limit while low
> priority is equivalent to longer or no limits.

The problem with complex optional hardware features is often the
accompanying variability in terms of availability, reliability and
behavior. The track record has been pretty sad. That isn't to say this
won't be useful for anybody but it'd need careful coordination in
terms of picking hardware vendor and model and ensuring vendor
support, which kinda severely limits the usefulness.

Thanks.
Bart Van Assche Aug. 13, 2021, 5:15 p.m. UTC | #14
On 8/12/21 7:18 PM, Damien Le Moal wrote:
> Let me throw in more information related to this.
> 
> The Command duration limits (CDL) and Sequestered commands features are being
> drafted in SPC/SBC and ACS to give the device better hints than just an on/off
> high priority bit. I am currently prototyping these features and I am reusing
> the ioprio interface for that. Here is how this works:
> 1) The drive exposes a set of command duration limit descriptors (up to 7 for
> reads and 7 for writes) that define duration limits for a command's execution:
> overall processing time, queuing time and execution time. Each duration limit has
> a policy associated with it that is applied if a command's processing exceeds one
> of the defined time limits: continue, continue but signal limit exceeded, abort.
> 2) Users can change the drive command duration limits to whatever they need
> (e.g. change the policies for the limits to get a fast-fail behavior for
> commands instead of having the drive retry for a long time)
> 3) When issuing IOs, users (or FSes) can apply a command duration limit
> descriptor by specifying the IOPRIO_CLASS_DL priority class. The priority level
> for that class indicates the descriptor to apply to the command.
> 4) At SCSI/ATA level, read and write commands have 3 bits defined to specify the
> command descriptor to apply to the command (1 to 7 or 0 for no limit)
> 
> With that in place, the disk firmware can now make more intelligent decisions on
> command scheduling to keep performance high at high queue depth without
> increasing latency for commands that have low duration limits. And based on the
> policy defined for a limit, this can be a "soft" best-effort optimization by the
> disk, or a hard one with aborts if the drive decides that what the user is
> asking for is not possible.
> 
> CDL can completely replace the existing binary on/off NCQ priority in a more
> flexible manner as the user can set different duration limits for high vs low
> priority. E.g. high priority is equivalent to a very short limit while low
> priority is equivalent to longer or no limits.
> 
> I think that CDL has the potential for better interactions with cgroups as
> cgroup controllers can install a set of limits on the drive that fits the
> controller target policy. E.g., the latency controller can set duration limits
> and use the IOPRIO_CLASS_DL class to tell the drive the exact latency target to use.
> 
> In my implementation, I have not yet looked into cgroups integration for CDL
> though. I am still wondering what the best approach is: defining a new
> controller or integrating into existing controllers. The former is likely easier
> than the latter, but having hardware support for existing controllers has the
> potential to improve them seamlessly without forcing the user to change anything
> in their application setup.
> 
> CDL is still in draft state in the specs though. So I will not be sending this yet.

Thanks, Damien, for providing this additional information. This is 
very helpful. I see this as a welcome evolution since the disk firmware 
has more information than the CPU (e.g. about the disk head position) 
and hence can make a better decision than an I/O scheduler or cgroup policy.

For the cloud use case, are all disks used to implement disaggregated 
storage? I'm asking this because in a disaggregated storage setup the 
I/O submitter runs on another server than the server to which the disks 
are connected. In such a setup I expect that the I/O priority will be 
provided from user space instead of being provided by a cgroup.

Bart.
Bart Van Assche Aug. 13, 2021, 5:17 p.m. UTC | #15
On 8/13/21 9:29 AM, Tejun Heo wrote:
> The problem with complex optional hardware features is often the
> accompanying variability in terms of availability, reliability and
> behavior. The track record has been pretty sad. That isn't to say this
> won't be useful for anybody but it'd need careful coordination in
> terms of picking hardware vendor and model and ensuring vendor
> support, which kinda severely limits the usefulness.

I think the above view is too negative. Companies that store large 
amounts of data have the power to make this happen by only buying 
storage devices that support I/O prioritization well enough.

Bart.
Tejun Heo Aug. 13, 2021, 9:43 p.m. UTC | #16
On Fri, Aug 13, 2021 at 10:17:42AM -0700, Bart Van Assche wrote:
> On 8/13/21 9:29 AM, Tejun Heo wrote:
> > The problem with complex optional hardware features is often the
> > accompanying variability in terms of availability, reliability and
> > behavior. The track record has been pretty sad. That isn't to say this
> > won't be useful for anybody but it'd need careful coordination in
> > terms of picking hardware vendor and model and ensuring vendor
> > support, which kinda severely limits the usefulness.
> 
> I think the above view is too negative. Companies that store large amounts
> of data have the power to make this happen by only buying storage devices
> that support I/O prioritization well enough.

The problem usually is that there always are other ways to skin that
cat which don't depend on having complex optional features. So, the
comparison isn't just about or among devices that support such extra
feature but with other solutions which don't need them in the first
place. Throw in the many inherent problems of expanding the hardware
interface, such as variability and timescale mismatch (hardware changes
a lot faster than the software stack), and the long-term result tends to
skew pretty clearly.

Thanks.

Patch

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 64053d67a97b7..2f2158e05a91c 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -9,12 +9,6 @@  config MQ_IOSCHED_DEADLINE
 	help
 	  MQ version of the deadline IO scheduler.
 
-config MQ_IOSCHED_DEADLINE_CGROUP
-       tristate
-       default y
-       depends on MQ_IOSCHED_DEADLINE
-       depends on BLK_CGROUP
-
 config MQ_IOSCHED_KYBER
 	tristate "Kyber I/O scheduler"
 	default y
diff --git a/block/Makefile b/block/Makefile
index bfbe4e13ca1ef..1e1afa10f869d 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -22,8 +22,6 @@  obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
-mq-deadline-y += mq-deadline-main.o
-mq-deadline-$(CONFIG_MQ_IOSCHED_DEADLINE_CGROUP)+= mq-deadline-cgroup.o
 obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
 bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
 obj-$(CONFIG_IOSCHED_BFQ)	+= bfq.o
diff --git a/block/mq-deadline-cgroup.c b/block/mq-deadline-cgroup.c
deleted file mode 100644
index 3b4bfddec39f3..0000000000000
--- a/block/mq-deadline-cgroup.c
+++ /dev/null
@@ -1,126 +0,0 @@ 
-// SPDX-License-Identifier: GPL-2.0
-
-#include <linux/blk-cgroup.h>
-#include <linux/ioprio.h>
-
-#include "mq-deadline-cgroup.h"
-
-static struct blkcg_policy dd_blkcg_policy;
-
-static struct blkcg_policy_data *dd_cpd_alloc(gfp_t gfp)
-{
-	struct dd_blkcg *pd;
-
-	pd = kzalloc(sizeof(*pd), gfp);
-	if (!pd)
-		return NULL;
-	pd->stats = alloc_percpu_gfp(typeof(*pd->stats),
-				     GFP_KERNEL | __GFP_ZERO);
-	if (!pd->stats) {
-		kfree(pd);
-		return NULL;
-	}
-	return &pd->cpd;
-}
-
-static void dd_cpd_free(struct blkcg_policy_data *cpd)
-{
-	struct dd_blkcg *dd_blkcg = container_of(cpd, typeof(*dd_blkcg), cpd);
-
-	free_percpu(dd_blkcg->stats);
-	kfree(dd_blkcg);
-}
-
-static struct dd_blkcg *dd_blkcg_from_pd(struct blkg_policy_data *pd)
-{
-	return container_of(blkcg_to_cpd(pd->blkg->blkcg, &dd_blkcg_policy),
-			    struct dd_blkcg, cpd);
-}
-
-/*
- * Convert an association between a block cgroup and a request queue into a
- * pointer to the mq-deadline information associated with a (blkcg, queue) pair.
- */
-struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio)
-{
-	struct blkg_policy_data *pd;
-
-	pd = blkg_to_pd(bio->bi_blkg, &dd_blkcg_policy);
-	if (!pd)
-		return NULL;
-
-	return dd_blkcg_from_pd(pd);
-}
-
-static size_t dd_pd_stat(struct blkg_policy_data *pd, char *buf, size_t size)
-{
-	static const char *const prio_class_name[] = {
-		[IOPRIO_CLASS_NONE]	= "NONE",
-		[IOPRIO_CLASS_RT]	= "RT",
-		[IOPRIO_CLASS_BE]	= "BE",
-		[IOPRIO_CLASS_IDLE]	= "IDLE",
-	};
-	struct dd_blkcg *blkcg = dd_blkcg_from_pd(pd);
-	int res = 0;
-	u8 prio;
-
-	for (prio = 0; prio < ARRAY_SIZE(blkcg->stats->stats); prio++)
-		res += scnprintf(buf + res, size - res,
-			" [%s] dispatched=%u inserted=%u merged=%u",
-			prio_class_name[prio],
-			ddcg_sum(blkcg, dispatched, prio) +
-			ddcg_sum(blkcg, merged, prio) -
-			ddcg_sum(blkcg, completed, prio),
-			ddcg_sum(blkcg, inserted, prio) -
-			ddcg_sum(blkcg, completed, prio),
-			ddcg_sum(blkcg, merged, prio));
-
-	return res;
-}
-
-static struct blkg_policy_data *dd_pd_alloc(gfp_t gfp, struct request_queue *q,
-					    struct blkcg *blkcg)
-{
-	struct dd_blkg *pd;
-
-	pd = kzalloc(sizeof(*pd), gfp);
-	if (!pd)
-		return NULL;
-	return &pd->pd;
-}
-
-static void dd_pd_free(struct blkg_policy_data *pd)
-{
-	struct dd_blkg *dd_blkg = container_of(pd, typeof(*dd_blkg), pd);
-
-	kfree(dd_blkg);
-}
-
-static struct blkcg_policy dd_blkcg_policy = {
-	.cpd_alloc_fn		= dd_cpd_alloc,
-	.cpd_free_fn		= dd_cpd_free,
-
-	.pd_alloc_fn		= dd_pd_alloc,
-	.pd_free_fn		= dd_pd_free,
-	.pd_stat_fn		= dd_pd_stat,
-};
-
-int dd_activate_policy(struct request_queue *q)
-{
-	return blkcg_activate_policy(q, &dd_blkcg_policy);
-}
-
-void dd_deactivate_policy(struct request_queue *q)
-{
-	blkcg_deactivate_policy(q, &dd_blkcg_policy);
-}
-
-int __init dd_blkcg_init(void)
-{
-	return blkcg_policy_register(&dd_blkcg_policy);
-}
-
-void __exit dd_blkcg_exit(void)
-{
-	blkcg_policy_unregister(&dd_blkcg_policy);
-}
diff --git a/block/mq-deadline-cgroup.h b/block/mq-deadline-cgroup.h
deleted file mode 100644
index 0143fd74f3cea..0000000000000
--- a/block/mq-deadline-cgroup.h
+++ /dev/null
@@ -1,114 +0,0 @@ 
-/* SPDX-License-Identifier: GPL-2.0 */
-
-#if !defined(_MQ_DEADLINE_CGROUP_H_)
-#define _MQ_DEADLINE_CGROUP_H_
-
-#include <linux/blk-cgroup.h>
-
-struct request_queue;
-
-/**
- * struct io_stats_per_prio - I/O statistics per I/O priority class.
- * @inserted: Number of inserted requests.
- * @merged: Number of merged requests.
- * @dispatched: Number of dispatched requests.
- * @completed: Number of I/O completions.
- */
-struct io_stats_per_prio {
-	local_t inserted;
-	local_t merged;
-	local_t dispatched;
-	local_t completed;
-};
-
-/* I/O statistics per I/O cgroup per I/O priority class (IOPRIO_CLASS_*). */
-struct blkcg_io_stats {
-	struct io_stats_per_prio stats[4];
-};
-
-/**
- * struct dd_blkcg - Per cgroup data.
- * @cpd: blkcg_policy_data structure.
- * @stats: I/O statistics.
- */
-struct dd_blkcg {
-	struct blkcg_policy_data cpd;	/* must be the first member */
-	struct blkcg_io_stats __percpu *stats;
-};
-
-/*
- * Count one event of type 'event_type' and with I/O priority class
- * 'prio_class'.
- */
-#define ddcg_count(ddcg, event_type, prio_class) do {			\
-if (ddcg) {								\
-	struct blkcg_io_stats *io_stats = get_cpu_ptr((ddcg)->stats);	\
-									\
-	BUILD_BUG_ON(!__same_type((ddcg), struct dd_blkcg *));		\
-	BUILD_BUG_ON(!__same_type((prio_class), u8));			\
-	local_inc(&io_stats->stats[(prio_class)].event_type);		\
-	put_cpu_ptr(io_stats);						\
-}									\
-} while (0)
-
-/*
- * Returns the total number of ddcg_count(ddcg, event_type, prio_class) calls
- * across all CPUs. No locking or barriers since it is fine if the returned
- * sum is slightly outdated.
- */
-#define ddcg_sum(ddcg, event_type, prio) ({				\
-	unsigned int cpu;						\
-	u32 sum = 0;							\
-									\
-	BUILD_BUG_ON(!__same_type((ddcg), struct dd_blkcg *));		\
-	BUILD_BUG_ON(!__same_type((prio), u8));				\
-	for_each_present_cpu(cpu)					\
-		sum += local_read(&per_cpu_ptr((ddcg)->stats, cpu)->	\
-				  stats[(prio)].event_type);		\
-	sum;								\
-})
-
-#ifdef CONFIG_BLK_CGROUP
-
-/**
- * struct dd_blkg - Per (cgroup, request queue) data.
- * @pd: blkg_policy_data structure.
- */
-struct dd_blkg {
-	struct blkg_policy_data pd;	/* must be the first member */
-};
-
-struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio);
-int dd_activate_policy(struct request_queue *q);
-void dd_deactivate_policy(struct request_queue *q);
-int __init dd_blkcg_init(void);
-void __exit dd_blkcg_exit(void);
-
-#else /* CONFIG_BLK_CGROUP */
-
-static inline struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio)
-{
-	return NULL;
-}
-
-static inline int dd_activate_policy(struct request_queue *q)
-{
-	return 0;
-}
-
-static inline void dd_deactivate_policy(struct request_queue *q)
-{
-}
-
-static inline int dd_blkcg_init(void)
-{
-	return 0;
-}
-
-static inline void dd_blkcg_exit(void)
-{
-}
-
-#endif /* CONFIG_BLK_CGROUP */
-
-#endif /* _MQ_DEADLINE_CGROUP_H_ */
diff --git a/block/mq-deadline-main.c b/block/mq-deadline.c
similarity index 95%
rename from block/mq-deadline-main.c
rename to block/mq-deadline.c
index 6f612e6dc82b6..a09761cbdf12e 100644
--- a/block/mq-deadline-main.c
+++ b/block/mq-deadline.c
@@ -25,7 +25,6 @@ 
 #include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
 #include "blk-mq-sched.h"
-#include "mq-deadline-cgroup.h"
 
 /*
  * See Documentation/block/deadline-iosched.rst
@@ -57,6 +56,14 @@  enum dd_prio {
 
 enum { DD_PRIO_COUNT = 3 };
 
+/* I/O statistics per I/O priority. */
+struct io_stats_per_prio {
+	local_t inserted;
+	local_t merged;
+	local_t dispatched;
+	local_t completed;
+};
+
 /* I/O statistics for all I/O priorities (enum dd_prio). */
 struct io_stats {
 	struct io_stats_per_prio stats[DD_PRIO_COUNT];
@@ -79,9 +86,6 @@  struct deadline_data {
 	 * run time data
 	 */
 
-	/* Request queue that owns this data structure. */
-	struct request_queue *queue;
-
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
 	/* Data direction of latest dispatched request. */
@@ -234,10 +238,8 @@  static void dd_merged_requests(struct request_queue *q, struct request *req,
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const u8 ioprio_class = dd_rq_ioclass(next);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
-	struct dd_blkcg *blkcg = next->elv.priv[0];
 
 	dd_count(dd, merged, prio);
-	ddcg_count(blkcg, merged, ioprio_class);
 
 	/*
 	 * if next expires before rq, assign its expire time to rq
@@ -375,7 +377,6 @@  static struct request *__dd_dispatch_request(struct deadline_data *dd,
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
-	struct dd_blkcg *blkcg;
 	enum dd_prio prio;
 	u8 ioprio_class;
 
@@ -474,8 +475,6 @@  static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	ioprio_class = dd_rq_ioclass(rq);
 	prio = ioprio_class_to_prio[ioprio_class];
 	dd_count(dd, dispatched, prio);
-	blkcg = rq->elv.priv[0];
-	ddcg_count(blkcg, dispatched, ioprio_class);
 	/*
 	 * If the request needs its target zone locked, do it.
 	 */
@@ -569,8 +568,6 @@  static void dd_exit_sched(struct elevator_queue *e)
 	struct deadline_data *dd = e->elevator_data;
 	enum dd_prio prio;
 
-	dd_deactivate_policy(dd->queue);
-
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -584,7 +581,7 @@  static void dd_exit_sched(struct elevator_queue *e)
 }
 
 /*
- * Initialize elevator private data (deadline_data) and associate with blkcg.
+ * initialize elevator private data (deadline_data).
  */
 static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 {
@@ -593,12 +590,6 @@  static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	enum dd_prio prio;
 	int ret = -ENOMEM;
 
-	/*
-	 * Initialization would be very tricky if the queue is not frozen,
-	 * hence the warning statement below.
-	 */
-	WARN_ON_ONCE(!percpu_ref_is_zero(&q->q_usage_counter));
-
 	eq = elevator_alloc(q, e);
 	if (!eq)
 		return ret;
@@ -614,8 +605,6 @@  static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (!dd->stats)
 		goto free_dd;
 
-	dd->queue = q;
-
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -635,17 +624,9 @@  static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
 
-	ret = dd_activate_policy(q);
-	if (ret)
-		goto free_stats;
-
-	ret = 0;
 	q->elevator = eq;
 	return 0;
 
-free_stats:
-	free_percpu(dd->stats);
-
 free_dd:
 	kfree(dd);
 
@@ -718,7 +699,6 @@  static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
 	struct dd_per_prio *per_prio;
 	enum dd_prio prio;
-	struct dd_blkcg *blkcg;
 	LIST_HEAD(free);
 
 	lockdep_assert_held(&dd->lock);
@@ -729,18 +709,8 @@  static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 */
 	blk_req_zone_write_unlock(rq);
 
-	/*
-	 * If a block cgroup has been associated with the submitter and if an
-	 * I/O priority has been set in the associated block cgroup, use the
-	 * lowest of the cgroup priority and the request priority for the
-	 * request. If no priority has been set in the request, use the cgroup
-	 * priority.
-	 */
 	prio = ioprio_class_to_prio[ioprio_class];
 	dd_count(dd, inserted, prio);
-	blkcg = dd_blkcg_from_bio(rq->bio);
-	ddcg_count(blkcg, inserted, ioprio_class);
-	rq->elv.priv[0] = blkcg;
 
 	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
 		blk_mq_free_requests(&free);
@@ -789,10 +759,12 @@  static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 	spin_unlock(&dd->lock);
 }
 
-/* Callback from inside blk_mq_rq_ctx_init(). */
+/*
+ * Nothing to do here. This is defined only to ensure that .finish_request
+ * method is called upon request completion.
+ */
 static void dd_prepare_request(struct request *rq)
 {
-	rq->elv.priv[0] = NULL;
 }
 
 /*
@@ -815,13 +787,11 @@  static void dd_finish_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 	struct deadline_data *dd = q->elevator->elevator_data;
-	struct dd_blkcg *blkcg = rq->elv.priv[0];
 	const u8 ioprio_class = dd_rq_ioclass(rq);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 	struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
 	dd_count(dd, completed, prio);
-	ddcg_count(blkcg, completed, ioprio_class);
 
 	if (blk_queue_is_zoned(q)) {
 		unsigned long flags;
@@ -1144,26 +1114,11 @@  MODULE_ALIAS("mq-deadline-iosched");
 
 static int __init deadline_init(void)
 {
-	int ret;
-
-	ret = elv_register(&mq_deadline);
-	if (ret)
-		goto out;
-	ret = dd_blkcg_init();
-	if (ret)
-		goto unreg;
-
-out:
-	return ret;
-
-unreg:
-	elv_unregister(&mq_deadline);
-	goto out;
+	return elv_register(&mq_deadline);
 }
 
 static void __exit deadline_exit(void)
 {
-	dd_blkcg_exit();
 	elv_unregister(&mq_deadline);
 }