
[RFC,0/3] cgroup: fsio throttle controller

Message ID 20190118103127.325-1-righi.andrea@gmail.com (mailing list archive)

Message

Andrea Righi Jan. 18, 2019, 10:31 a.m. UTC
This is a redesign of my old cgroup-io-throttle controller:
https://lwn.net/Articles/330531/

I'm resuming this old patch to point out a problem that I think is still
not solved completely.

= Problem =

The io.max controller works really well at limiting synchronous I/O
(READs), but a lot of I/O requests are initiated outside the context of
the process that is ultimately responsible for their creation (e.g.,
WRITEs).

Throttling at the block layer in some cases is too late and we may end
up slowing down processes that are not responsible for the I/O that
is being processed at that level.

= Proposed solution =

The main idea of this controller is to split I/O measurement and I/O
throttling: I/O is measured at the block layer for READs and at the page
cache (dirty pages) for WRITEs, and processes are throttled at the VFS
level, while they are generating I/O, based on the measured I/O.
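
In pseudo-code the throttling logic boils down to something like this
(just a simplified sketch of the idea rather than the actual
implementation; see the patches for details. Names are illustrative and
overflow/error handling is omitted):

  struct fsio_bucket {
          s64 rate_bps;   /* configured limit, bytes per second */
          s64 tokens;     /* bytes currently available (can go negative) */
          u64 last_ns;    /* timestamp of the last refill */
  };

  /* Charge 'bytes' to the bucket and return how long the caller (the
   * task reading or dirtying pages) should sleep, in nanoseconds. */
  static u64 fsio_charge(struct fsio_bucket *b, s64 bytes, u64 now_ns)
  {
          /* refill the tokens accumulated since the last charge */
          b->tokens += b->rate_bps * (s64)(now_ns - b->last_ns) / NSEC_PER_SEC;
          if (b->tokens > b->rate_bps)
                  b->tokens = b->rate_bps;        /* cap bursts at ~1s worth */
          b->last_ns = now_ns;

          b->tokens -= bytes;
          if (b->tokens >= 0)
                  return 0;                       /* still under the limit */
          /* over the limit: sleep long enough to pay the debt back */
          return (u64)(-b->tokens) * NSEC_PER_SEC / (u64)b->rate_bps;
  }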

= Example =

Here's a trivial example: create 2 cgroups, set an io.max limit of
10MB/s on each, run a write-intensive workload in both and, after a
while, run "sync" from the root cgroup.

 # cat /proc/self/cgroup
 0::/cg1
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30

 # cat /proc/self/cgroup
 0::/cg2
 # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30

 - io.max controller:

 # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg1/io.max
 # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg2/io.max

 # cat /proc/self/cgroup
 0::/
 # time sync

 real	0m51,241s
 user	0m0,000s
 sys	0m0,113s

Ideally "sync" should complete almost immediately, because the root
cgroup is unlimited and it's not doing any I/O at all, but instead it's
blocked for more than 50 sec with io.max, because the writeback is
throttled to satisfy the io.max limits.

- fsio controller:

 # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg1/fsio.max_mbs
 # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg2/fsio.max_mbs

 [you can find details about the syntax in the documentation patch]

 # cat /proc/self/cgroup
 0::/
 # time sync

 real	0m0,146s
 user	0m0,003s
 sys	0m0,001s

= Questions =

Q: Do we need another controller?
A: Probably not; I think it would be better to integrate this policy (or
   something similar) in the current blkio controller. This series is just
   meant to highlight the problem and gather ideas on how to address it.

Q: What about proportional limits / latency?
A: It should be trivial to add latency-based limits if we integrate this in the
   current I/O controller. Proportional limits (weights) are strictly related
   to I/O scheduling, and since this controller doesn't touch I/O dispatching
   policies they are not trivial to implement (bandwidth limiting is definitely
   more straightforward).

Q: Applying delays at the VFS layer doesn't prevent I/O spikes during
   writeback, right?
A: Correct, the tradeoff here is to tolerate I/O bursts during writeback to
   avoid priority inversion problems in the system.

Andrea Righi (3):
  fsio-throttle: documentation
  fsio-throttle: controller infrastructure
  fsio-throttle: instrumentation

 Documentation/cgroup-v1/fsio-throttle.txt | 142 +++++++++
 block/blk-core.c                          |  10 +
 include/linux/cgroup_subsys.h             |   4 +
 include/linux/fsio-throttle.h             |  43 +++
 include/linux/writeback.h                 |   7 +-
 init/Kconfig                              |  11 +
 kernel/cgroup/Makefile                    |   1 +
 kernel/cgroup/fsio-throttle.c             | 501 ++++++++++++++++++++++++++++++
 mm/filemap.c                              |  20 +-
 mm/page-writeback.c                       |  14 +-
 10 files changed, 749 insertions(+), 4 deletions(-)

Comments

Paolo Valente Jan. 18, 2019, 11:04 a.m. UTC | #1
> Il giorno 18 gen 2019, alle ore 11:31, Andrea Righi <righi.andrea@gmail.com> ha scritto:
> 
> This is a redesign of my old cgroup-io-throttle controller:
> https://lwn.net/Articles/330531/
> 
> I'm resuming this old patch to point out a problem that I think is still
> not solved completely.
> 
> = Problem =
> 
> The io.max controller works really well at limiting synchronous I/O
> (READs), but a lot of I/O requests are initiated outside the context of
> the process that is ultimately responsible for its creation (e.g.,
> WRITEs).
> 
> Throttling at the block layer in some cases is too late and we may end
> up slowing down processes that are not responsible for the I/O that
> is being processed at that level.
> 
> = Proposed solution =
> 
> The main idea of this controller is to split I/O measurement and I/O
> throttling: I/O is measured at the block layer for READS, at page cache
> (dirty pages) for WRITEs, and processes are limited while they're
> generating I/O at the VFS level, based on the measured I/O.
> 

Hi Andrea,
what about the case where two processes are dirtying the same
pages?  Which one will be charged?

Thanks,
Paolo

> = Example =
> 
> Here's a trivial example: create 2 cgroups, set an io.max limit of
> 10MB/s, run a write-intensive workload on both and after a while, from a
> root cgroup, run "sync".
> 
> # cat /proc/self/cgroup
> 0::/cg1
> # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30
> 
> # cat /proc/self/cgroup
> 0::/cg2
> # fio --rw=write --bs=1M --size=32M --numjobs=16 --name=seeker --time_based --runtime=30
> 
> - io.max controller:
> 
> # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg1/io.max
> # echo "259:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/unified/cg2/io.max
> 
> # cat /proc/self/cgroup
> 0::/
> # time sync
> 
> real	0m51,241s
> user	0m0,000s
> sys	0m0,113s
> 
> Ideally "sync" should complete almost immediately, because the root
> cgroup is unlimited and it's not doing any I/O at all, but instead it's
> blocked for more than 50 sec with io.max, because the writeback is
> throttled to satisfy the io.max limits.
> 
> - fsio controller:
> 
> # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg1/fsio.max_mbs
> # echo "259:0 10 10" > /sys/fs/cgroup/unified/cg2/fsio.max_mbs
> 
> [you can find details about the syntax in the documentation patch]
> 
> # cat /proc/self/cgroup
> 0::/
> # time sync
> 
> real	0m0,146s
> user	0m0,003s
> sys	0m0,001s
> 
> = Questions =
> 
> Q: Do we need another controller?
> A: Probably no, I think it would be better to integrate this policy (or
>   something similar) in the current blkio controller, this is just to
>   highlight the problem and get some ideas on how to address it.
> 
> Q: What about proportional limits / latency?
> A: It should be trivial to add latency-based limits if we integrate this in the
>   current I/O controller. About proportional limits (weights), they're
>   strictly related to I/O scheduling and since this controller doesn't touch
>   I/O dispatching policies it's not trivial to implement proportional limits
>   (bandwidth limiting is definitely more straightforward).
> 
> Q: Applying delays at the VFS layer doesn't prevent I/O spikes during
>   writeback, right?
> A: Correct, the tradeoff here is to tolerate I/O bursts during writeback to
>   avoid priority inversion problems in the system.
> 
> Andrea Righi (3):
>  fsio-throttle: documentation
>  fsio-throttle: controller infrastructure
>  fsio-throttle: instrumentation
> 
> Documentation/cgroup-v1/fsio-throttle.txt | 142 +++++++++
> block/blk-core.c                          |  10 +
> include/linux/cgroup_subsys.h             |   4 +
> include/linux/fsio-throttle.h             |  43 +++
> include/linux/writeback.h                 |   7 +-
> init/Kconfig                              |  11 +
> kernel/cgroup/Makefile                    |   1 +
> kernel/cgroup/fsio-throttle.c             | 501 ++++++++++++++++++++++++++++++
> mm/filemap.c                              |  20 +-
> mm/page-writeback.c                       |  14 +-
> 10 files changed, 749 insertions(+), 4 deletions(-)
>
Andrea Righi Jan. 18, 2019, 11:10 a.m. UTC | #2
On Fri, Jan 18, 2019 at 12:04:17PM +0100, Paolo Valente wrote:
> 
> 
> > Il giorno 18 gen 2019, alle ore 11:31, Andrea Righi <righi.andrea@gmail.com> ha scritto:
> > 
> > This is a redesign of my old cgroup-io-throttle controller:
> > https://lwn.net/Articles/330531/
> > 
> > I'm resuming this old patch to point out a problem that I think is still
> > not solved completely.
> > 
> > = Problem =
> > 
> > The io.max controller works really well at limiting synchronous I/O
> > (READs), but a lot of I/O requests are initiated outside the context of
> > the process that is ultimately responsible for its creation (e.g.,
> > WRITEs).
> > 
> > Throttling at the block layer in some cases is too late and we may end
> > up slowing down processes that are not responsible for the I/O that
> > is being processed at that level.
> > 
> > = Proposed solution =
> > 
> > The main idea of this controller is to split I/O measurement and I/O
> > throttling: I/O is measured at the block layer for READS, at page cache
> > (dirty pages) for WRITEs, and processes are limited while they're
> > generating I/O at the VFS level, based on the measured I/O.
> > 
> 
> Hi Andrea,
> what the about the case where two processes are dirtying the same
> pages?  Which will be charged?
> 
> Thanks,
> Paolo

Hi Paolo,

in this case only the first one will be charged for the I/O activity
(the one that changes a page from clean to dirty). This is probably not
totally fair in some cases, but I think it's a good compromise: in the
end, rewriting the same page over and over while it's already dirty
doesn't actually generate I/O activity until the page is flushed back
to disk.
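
Conceptually it looks like this (just a sketch to show the rule; the
helper name is made up, the real hooks live in the page-dirtying paths
in mm/, and it reuses the fsio_charge() sketch from the cover letter):

  /* Charge the cgroup only on the clean -> dirty transition: a second
   * writer touching an already-dirty page is not charged. */
  void fsio_account_page_dirty(struct page *page, struct fsio_bucket *b)
  {
          if (PageDirty(page))
                  return;
          fsio_charge(b, PAGE_SIZE, ktime_get_ns());
  }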

Obviously I'm open to other better ideas and suggestions.

Thanks!
-Andrea
Paolo Valente Jan. 18, 2019, 11:11 a.m. UTC | #3
> Il giorno 18 gen 2019, alle ore 12:10, Andrea Righi <righi.andrea@gmail.com> ha scritto:
> 
> On Fri, Jan 18, 2019 at 12:04:17PM +0100, Paolo Valente wrote:
>> 
>> 
>>> Il giorno 18 gen 2019, alle ore 11:31, Andrea Righi <righi.andrea@gmail.com> ha scritto:
>>> 
>>> This is a redesign of my old cgroup-io-throttle controller:
>>> https://lwn.net/Articles/330531/
>>> 
>>> I'm resuming this old patch to point out a problem that I think is still
>>> not solved completely.
>>> 
>>> = Problem =
>>> 
>>> The io.max controller works really well at limiting synchronous I/O
>>> (READs), but a lot of I/O requests are initiated outside the context of
>>> the process that is ultimately responsible for its creation (e.g.,
>>> WRITEs).
>>> 
>>> Throttling at the block layer in some cases is too late and we may end
>>> up slowing down processes that are not responsible for the I/O that
>>> is being processed at that level.
>>> 
>>> = Proposed solution =
>>> 
>>> The main idea of this controller is to split I/O measurement and I/O
>>> throttling: I/O is measured at the block layer for READS, at page cache
>>> (dirty pages) for WRITEs, and processes are limited while they're
>>> generating I/O at the VFS level, based on the measured I/O.
>>> 
>> 
>> Hi Andrea,
>> what the about the case where two processes are dirtying the same
>> pages?  Which will be charged?
>> 
>> Thanks,
>> Paolo
> 
> Hi Paolo,
> 
> in this case only the first one will be charged for the I/O activity
> (the one that changes a page from clean to dirty). This is probably not
> totally fair in some cases, but I think it's a good compromise,

Absolutely, I just wanted to better understand this point.

> at the
> end rewriting the same page over and over while it's already dirty
> doesn't actually generate I/O activity, until the page is flushed back
> to disk.
> 

Right.

Thanks,
Paolo

> Obviously I'm open to other better ideas and suggestions.
> 
> Thanks!
> -Andrea
Josef Bacik Jan. 18, 2019, 4:35 p.m. UTC | #4
On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> This is a redesign of my old cgroup-io-throttle controller:
> https://lwn.net/Articles/330531/
> 
> I'm resuming this old patch to point out a problem that I think is still
> not solved completely.
> 
> = Problem =
> 
> The io.max controller works really well at limiting synchronous I/O
> (READs), but a lot of I/O requests are initiated outside the context of
> the process that is ultimately responsible for its creation (e.g.,
> WRITEs).
> 
> Throttling at the block layer in some cases is too late and we may end
> up slowing down processes that are not responsible for the I/O that
> is being processed at that level.

How so?  The writeback threads are per-cgroup and have the cgroup stuff set
properly.  So if you dirty a bunch of pages, they are associated with your
cgroup, and then writeback happens and it's done in the writeback thread
associated with your cgroup and then that is throttled.  Then you are throttled
at balance_dirty_pages() because the writeout is taking longer.

I introduced the blk_cgroup_congested() stuff for paths where it's not easy to
clearly tie IO to the thing generating it, such as readahead.  If
you are running into this case that may be something worth using.  Of course it
only works for io.latency now, but there's no reason you can't add support to it
for io.max or whatever.
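
For reference, this is roughly how it's consumed today (simplified from
the readahead/swap-in paths; the surrounding function and helper below
are made up, only blk_cgroup_congested() is real):

  static void maybe_readahead(struct readahead_request *req)
  {
          /* if the current task's blkcg is already congested, skip the
           * optional/speculative I/O instead of queueing more of it */
          if (blk_cgroup_congested())
                  return;
          submit_readahead(req);          /* hypothetical helper */
  }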

> 
> = Proposed solution =
> 
> The main idea of this controller is to split I/O measurement and I/O
> throttling: I/O is measured at the block layer for READS, at page cache
> (dirty pages) for WRITEs, and processes are limited while they're
> generating I/O at the VFS level, based on the measured I/O.
> 

This is what blk_cgroup_congested() is meant to accomplish, I would suggest
looking into that route and simply changing the existing io controller you are
using to take advantage of that so it will actually throttle things.  Then just
sprinkle it around the areas where we indirectly generate IO.  Thanks,

Josef
Paolo Valente Jan. 18, 2019, 5:07 p.m. UTC | #5
> Il giorno 18 gen 2019, alle ore 17:35, Josef Bacik <josef@toxicpanda.com> ha scritto:
> 
> On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
>> This is a redesign of my old cgroup-io-throttle controller:
>> https://lwn.net/Articles/330531/
>> 
>> I'm resuming this old patch to point out a problem that I think is still
>> not solved completely.
>> 
>> = Problem =
>> 
>> The io.max controller works really well at limiting synchronous I/O
>> (READs), but a lot of I/O requests are initiated outside the context of
>> the process that is ultimately responsible for its creation (e.g.,
>> WRITEs).
>> 
>> Throttling at the block layer in some cases is too late and we may end
>> up slowing down processes that are not responsible for the I/O that
>> is being processed at that level.
> 
> How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> properly.  So if you dirty a bunch of pages, they are associated with your
> cgroup, and then writeback happens and it's done in the writeback thread
> associated with your cgroup and then that is throttled.  Then you are throttled
> at balance_dirty_pages() because the writeout is taking longer.
> 

IIUC, Andrea described this problem: certain processes in a certain group dirty a
lot of pages, causing writeback to start.  Then some other blameless
process in the same group experiences very high latency, in spite of
the fact that it has to do little I/O.

Does your blk_cgroup_congested() stuff solve this issue?

Or simply I didn't get what Andrea meant at all :)

Thanks,
Paolo

> I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> clearly tie IO to the thing generating the IO, such as readahead and such.  If
> you are running into this case that may be something worth using.  Course it
> only works for io.latency now but there's no reason you can't add support to it
> for io.max or whatever.
> 
>> 
>> = Proposed solution =
>> 
>> The main idea of this controller is to split I/O measurement and I/O
>> throttling: I/O is measured at the block layer for READS, at page cache
>> (dirty pages) for WRITEs, and processes are limited while they're
>> generating I/O at the VFS level, based on the measured I/O.
>> 
> 
> This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> looking into that route and simply changing the existing io controller you are
> using to take advantage of that so it will actually throttle things.  Then just
> sprinkle it around the areas where we indirectly generate IO.  Thanks,
> 
> Josef
Josef Bacik Jan. 18, 2019, 5:12 p.m. UTC | #6
On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote:
> 
> 
> > Il giorno 18 gen 2019, alle ore 17:35, Josef Bacik <josef@toxicpanda.com> ha scritto:
> > 
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> >> This is a redesign of my old cgroup-io-throttle controller:
> >> https://lwn.net/Articles/330531/
> >> 
> >> I'm resuming this old patch to point out a problem that I think is still
> >> not solved completely.
> >> 
> >> = Problem =
> >> 
> >> The io.max controller works really well at limiting synchronous I/O
> >> (READs), but a lot of I/O requests are initiated outside the context of
> >> the process that is ultimately responsible for its creation (e.g.,
> >> WRITEs).
> >> 
> >> Throttling at the block layer in some cases is too late and we may end
> >> up slowing down processes that are not responsible for the I/O that
> >> is being processed at that level.
> > 
> > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > properly.  So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled.  Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> > 
> 
> IIUC, Andrea described this problem: certain processes in a certain group dirty a
> lot of pages, causing write back to start.  Then some other blameless
> process in the same group experiences very high latency, in spite of
> the fact that it has to do little I/O.
> 

In that case the io controller isn't doing its job and needs to be fixed (or
reconfigured).  io.latency guards against this, and I assume io.max would keep this
from happening if it were configured properly.

> Does your blk_cgroup_congested() stuff solves this issue?
> 
> Or simply I didn't get what Andrea meant at all :)
> 

I _think_ Andrea is talking about the fact that we can generate IO indirectly
and never get throttled for it, which is what blk_cgroup_congested() is meant to
address.  I added it specifically because some low prio task was just allocating
all of the memory on the system and causing a lot of pressure because of
swapping, but there was no direct feedback loop there.  blk_cgroup_congested()
provides that feedback loop.

Course I could be wrong too and we're all just talking past each other ;).
Thanks,

Josef
Andrea Righi Jan. 18, 2019, 6:44 p.m. UTC | #7
On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > This is a redesign of my old cgroup-io-throttle controller:
> > https://lwn.net/Articles/330531/
> > 
> > I'm resuming this old patch to point out a problem that I think is still
> > not solved completely.
> > 
> > = Problem =
> > 
> > The io.max controller works really well at limiting synchronous I/O
> > (READs), but a lot of I/O requests are initiated outside the context of
> > the process that is ultimately responsible for its creation (e.g.,
> > WRITEs).
> > 
> > Throttling at the block layer in some cases is too late and we may end
> > up slowing down processes that are not responsible for the I/O that
> > is being processed at that level.
> 
> How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> properly.  So if you dirty a bunch of pages, they are associated with your
> cgroup, and then writeback happens and it's done in the writeback thread
> associated with your cgroup and then that is throttled.  Then you are throttled
> at balance_dirty_pages() because the writeout is taking longer.

Right, writeback is per-cgroup and slowing down writeback affects only
that specific cgroup, but there are cases where processes from
other cgroups may need to wait on that writeback to complete before
doing I/O (for example an fsync() to a file shared among different
cgroups). In this case we may end up blocking cgroups that shouldn't be
blocked, which looks like a priority-inversion problem. This is the
problem that I'm trying to address.

> 
> I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> clearly tie IO to the thing generating the IO, such as readahead and such.  If
> you are running into this case that may be something worth using.  Course it
> only works for io.latency now but there's no reason you can't add support to it
> for io.max or whatever.

IIUC blk_cgroup_congested() is used for readahead I/O (and swap with
memcg), something like this: if the cgroup is already congested, don't
generate extra I/O due to readahead. Am I right?

> 
> > 
> > = Proposed solution =
> > 
> > The main idea of this controller is to split I/O measurement and I/O
> > throttling: I/O is measured at the block layer for READS, at page cache
> > (dirty pages) for WRITEs, and processes are limited while they're
> > generating I/O at the VFS level, based on the measured I/O.
> > 
> 
> This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> looking into that route and simply changing the existing io controller you are
> using to take advantage of that so it will actually throttle things.  Then just
> sprinkle it around the areas where we indirectly generate IO.  Thanks,

Absolutely, I can probably use blk_cgroup_congested() as a method to
determine when a cgroup should be throttled (instead of doing my own
I/O accounting), but to prevent the "slow writeback slowing down other
cgroups" issue I still need to apply throttling when pages are dirtied
in the page cache.

Thanks,
-Andrea
Andrea Righi Jan. 18, 2019, 7:02 p.m. UTC | #8
On Fri, Jan 18, 2019 at 06:07:45PM +0100, Paolo Valente wrote:
> 
> 
> > Il giorno 18 gen 2019, alle ore 17:35, Josef Bacik <josef@toxicpanda.com> ha scritto:
> > 
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> >> This is a redesign of my old cgroup-io-throttle controller:
> >> https://lwn.net/Articles/330531/
> >> 
> >> I'm resuming this old patch to point out a problem that I think is still
> >> not solved completely.
> >> 
> >> = Problem =
> >> 
> >> The io.max controller works really well at limiting synchronous I/O
> >> (READs), but a lot of I/O requests are initiated outside the context of
> >> the process that is ultimately responsible for its creation (e.g.,
> >> WRITEs).
> >> 
> >> Throttling at the block layer in some cases is too late and we may end
> >> up slowing down processes that are not responsible for the I/O that
> >> is being processed at that level.
> > 
> > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > properly.  So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled.  Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> > 
> 
> IIUC, Andrea described this problem: certain processes in a certain group dirty a
> lot of pages, causing write back to start.  Then some other blameless
> process in the same group experiences very high latency, in spite of
> the fact that it has to do little I/O.
> 
> Does your blk_cgroup_congested() stuff solves this issue?
> 
> Or simply I didn't get what Andrea meant at all :)
> 
> Thanks,
> Paolo

Yes, there is also this problem: providing fairness among processes
running inside the same cgroup.

This is a tough one, because once the I/O limit is reached, whichever
process comes next gets punished, even if it hasn't done any I/O before.

Well, my proposal doesn't solve this problem. To solve this one in the
"throttling" scenario, we should probably save some information about
the previously generated I/O activity and apply a delay proportional to
that (like a dynamic weight for each process inside each cgroup).
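
Just to make the idea concrete, something along these lines (nothing of
this sort is in the series, it's only a sketch):

  /* Scale the cgroup-level delay by the task's share of the I/O
   * recently generated inside the cgroup, so a process that did
   * almost no I/O is barely delayed when the cgroup hits its limit. */
  static u64 task_delay_ns(u64 cgroup_delay_ns, u64 task_bytes, u64 cgroup_bytes)
  {
          if (!cgroup_bytes)
                  return 0;
          return cgroup_delay_ns * task_bytes / cgroup_bytes;
  }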

AFAICS the io.max controller doesn't solve this problem either.

-Andrea
Josef Bacik Jan. 18, 2019, 7:46 p.m. UTC | #9
On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote:
> On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > > This is a redesign of my old cgroup-io-throttle controller:
> > > https://lwn.net/Articles/330531/
> > > 
> > > I'm resuming this old patch to point out a problem that I think is still
> > > not solved completely.
> > > 
> > > = Problem =
> > > 
> > > The io.max controller works really well at limiting synchronous I/O
> > > (READs), but a lot of I/O requests are initiated outside the context of
> > > the process that is ultimately responsible for its creation (e.g.,
> > > WRITEs).
> > > 
> > > Throttling at the block layer in some cases is too late and we may end
> > > up slowing down processes that are not responsible for the I/O that
> > > is being processed at that level.
> > 
> > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > properly.  So if you dirty a bunch of pages, they are associated with your
> > cgroup, and then writeback happens and it's done in the writeback thread
> > associated with your cgroup and then that is throttled.  Then you are throttled
> > at balance_dirty_pages() because the writeout is taking longer.
> 
> Right, writeback is per-cgroup and slowing down writeback affects only
> that specific cgroup, but, there are cases where other processes from
> other cgroups may require to wait on that writeback to complete before
> doing I/O (for example an fsync() to a file shared among different
> cgroups). In this case we may end up blocking cgroups that shouldn't be
> blocked, that looks like a priority-inversion problem. This is the
> problem that I'm trying to address.

Well, this case is a misconfiguration; you shouldn't be sharing files between
cgroups.  But even if you are, fsync() is synchronous, so we should be getting the
context from the process itself and thus have its own rules applied.
There's nothing we can do for outstanding IO, but that shouldn't be that much.
That would need to be dealt with on a per-controller basis.

> 
> > 
> > I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> > clearly tie IO to the thing generating the IO, such as readahead and such.  If
> > you are running into this case that may be something worth using.  Course it
> > only works for io.latency now but there's no reason you can't add support to it
> > for io.max or whatever.
> 
> IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> memcg), something like this: if the cgroup is already congested don't
> generate extra I/O due to readahead. Am I right?

Yeah, but that's just how it's currently used; it can be used any which way we
feel like.

> 
> > 
> > > 
> > > = Proposed solution =
> > > 
> > > The main idea of this controller is to split I/O measurement and I/O
> > > throttling: I/O is measured at the block layer for READS, at page cache
> > > (dirty pages) for WRITEs, and processes are limited while they're
> > > generating I/O at the VFS level, based on the measured I/O.
> > > 
> > 
> > This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> > looking into that route and simply changing the existing io controller you are
> > using to take advantage of that so it will actually throttle things.  Then just
> > sprinkle it around the areas where we indirectly generate IO.  Thanks,
> 
> Absolutely, I can probably use blk_cgroup_congested() as a method to
> determine when a cgroup should be throttled (instead of doing my own
> I/O measuring), but to prevent the "slow writeback slowing down other
> cgroups" issue I still need to apply throttling when pages are dirtied
> in page cache.

Again, this is just a fuckup from a configuration standpoint.  The argument
could be made that sync() is probably broken here, but I think the right
solution is to just pass the cgroup context along with the writeback
information and use that if it's set.  Thanks,

Josef
Andrea Righi Jan. 19, 2019, 10:08 a.m. UTC | #10
On Fri, Jan 18, 2019 at 02:46:53PM -0500, Josef Bacik wrote:
> On Fri, Jan 18, 2019 at 07:44:03PM +0100, Andrea Righi wrote:
> > On Fri, Jan 18, 2019 at 11:35:31AM -0500, Josef Bacik wrote:
> > > On Fri, Jan 18, 2019 at 11:31:24AM +0100, Andrea Righi wrote:
> > > > This is a redesign of my old cgroup-io-throttle controller:
> > > > https://lwn.net/Articles/330531/
> > > > 
> > > > I'm resuming this old patch to point out a problem that I think is still
> > > > not solved completely.
> > > > 
> > > > = Problem =
> > > > 
> > > > The io.max controller works really well at limiting synchronous I/O
> > > > (READs), but a lot of I/O requests are initiated outside the context of
> > > > the process that is ultimately responsible for its creation (e.g.,
> > > > WRITEs).
> > > > 
> > > > Throttling at the block layer in some cases is too late and we may end
> > > > up slowing down processes that are not responsible for the I/O that
> > > > is being processed at that level.
> > > 
> > > How so?  The writeback threads are per-cgroup and have the cgroup stuff set
> > > properly.  So if you dirty a bunch of pages, they are associated with your
> > > cgroup, and then writeback happens and it's done in the writeback thread
> > > associated with your cgroup and then that is throttled.  Then you are throttled
> > > at balance_dirty_pages() because the writeout is taking longer.
> > 
> > Right, writeback is per-cgroup and slowing down writeback affects only
> > that specific cgroup, but, there are cases where other processes from
> > other cgroups may require to wait on that writeback to complete before
> > doing I/O (for example an fsync() to a file shared among different
> > cgroups). In this case we may end up blocking cgroups that shouldn't be
> > blocked, that looks like a priority-inversion problem. This is the
> > problem that I'm trying to address.
> 
> Well this case is a misconfiguration, you shouldn't be sharing files between
> cgroups.  But even if you are, fsync() is synchronous, we should be getting the
> context from the process itself and thus should have its own rules applied.
> There's nothing we can do for outstanding IO, but that shouldn't be that much.
> That would need to be dealt with on a per-contoller basis.

OK, fair point. We shouldn't be sharing files between cgroups.

I'm still not sure if we can have similar issues with metadata I/O (which
may introduce latencies like the sync() scenario); I have to investigate
more and do more tests.

> 
> > 
> > > 
> > > I introduced the blk_cgroup_congested() stuff for paths that it's not easy to
> > > clearly tie IO to the thing generating the IO, such as readahead and such.  If
> > > you are running into this case that may be something worth using.  Course it
> > > only works for io.latency now but there's no reason you can't add support to it
> > > for io.max or whatever.
> > 
> > IIUC blk_cgroup_congested() is used in readahead I/O (and swap with
> > memcg), something like this: if the cgroup is already congested don't
> > generate extra I/O due to readahead. Am I right?
> 
> Yeah, but that's just how it's currently used, it can be used any which way we
> feel like.

I think it'd be very interesting to have the possibility to throttle I/O
either before writeback or during writeback. Right now we can only
throttle writeback. Maybe we can try to introduce some kind of dirty
page throttling controller using blk_cgroup_congested()... Opinions?

> 
> > 
> > > 
> > > > 
> > > > = Proposed solution =
> > > > 
> > > > The main idea of this controller is to split I/O measurement and I/O
> > > > throttling: I/O is measured at the block layer for READS, at page cache
> > > > (dirty pages) for WRITEs, and processes are limited while they're
> > > > generating I/O at the VFS level, based on the measured I/O.
> > > > 
> > > 
> > > This is what blk_cgroup_congested() is meant to accomplish, I would suggest
> > > looking into that route and simply changing the existing io controller you are
> > > using to take advantage of that so it will actually throttle things.  Then just
> > > sprinkle it around the areas where we indirectly generate IO.  Thanks,
> > 
> > Absolutely, I can probably use blk_cgroup_congested() as a method to
> > determine when a cgroup should be throttled (instead of doing my own
> > I/O measuring), but to prevent the "slow writeback slowing down other
> > cgroups" issue I still need to apply throttling when pages are dirtied
> > in page cache.
> 
> Again this is just a fuckup from a configuration stand point.  The argument
> could be made that sync() is probably broken here, but I think the right
> solution here is to just pass the cgroup context along with the writeback
> information and use that if it's set instead.  Thanks,

Alright, let's skip the root cgroup for now. I think the point here is
whether we want to provide sync() isolation among cgroups or not.

According to the manpage:

       sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
       written to the underlying filesystems.

And:
       According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
       may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
       thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
       tem or filesystem respectively.

Excluding the root cgroup, do you think a sync() issued inside a
specific cgroup should wait for I/O completions only for the writes that
have been generated by that cgroup?

Thanks,
-Andrea
Vivek Goyal Jan. 21, 2019, 9:47 p.m. UTC | #11
On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:

[..]
> Alright, let's skip the root cgroup for now. I think the point here is
> if we want to provide sync() isolation among cgroups or not.
> 
> According to the manpage:
> 
>        sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
>        written to the underlying filesystems.
> 
> And:
>        According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
>        may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
>        thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
>        tem or filesystem respectively.
> 
> Excluding the root cgroup, do you think a sync() issued inside a
> specific cgroup should wait for I/O completions only for the writes that
> have been generated by that cgroup?

Can we account I/O towards the cgroup which issued "sync" only if the write
rate of the sync cgroup is higher than that of the cgroup to which the page
belongs? Would that solve the problem, assuming it's doable?

Vivek
Andrea Righi Jan. 28, 2019, 5:41 p.m. UTC | #12
Hi Vivek,

sorry for the late reply.

On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> 
> [..]
> > Alright, let's skip the root cgroup for now. I think the point here is
> > if we want to provide sync() isolation among cgroups or not.
> > 
> > According to the manpage:
> > 
> >        sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
> >        written to the underlying filesystems.
> > 
> > And:
> >        According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> >        may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
> >        thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
> >        tem or filesystem respectively.
> > 
> > Excluding the root cgroup, do you think a sync() issued inside a
> > specific cgroup should wait for I/O completions only for the writes that
> > have been generated by that cgroup?
> 
> Can we account I/O towards the cgroup which issued "sync" only if write
> rate of sync cgroup is higher than cgroup to which page belongs to. Will
> that solve problem, assuming its doable?

Maybe this would mitigate the problem, in part, but it doesn't solve it.

The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
issues "sync", the fast cgroup needs to wait a lot, because writeback is
happening at the speed of the slow cgroup.

Ideally in this case we should bump up the writeback speed, maybe even
temporarily inherit the write rate of the sync cgroup (similar to
priority inheritance in a locking scenario), but I think it's not doable
at the moment without applying big changes.
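
In other words, something along these lines (pure hand-waving, none of
this exists today and the name is invented):

  /* While a task with a higher limit is blocked waiting for our
   * writeback, never write back slower than that waiter's rate. */
  static u64 wb_effective_rate(u64 own_rate_bps, u64 waiter_rate_bps)
  {
          return own_rate_bps > waiter_rate_bps ? own_rate_bps : waiter_rate_bps;
  }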

Or we could isolate the sync domain, meaning that a cgroup issuing a
sync will only wait for the syncing of the pages that belong to that
sync cgroup. But probably this method also requires big changes...

-Andrea
Vivek Goyal Jan. 28, 2019, 7:26 p.m. UTC | #13
On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> Hi Vivek,
> 
> sorry for the late reply.
> 
> On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> > 
> > [..]
> > > Alright, let's skip the root cgroup for now. I think the point here is
> > > if we want to provide sync() isolation among cgroups or not.
> > > 
> > > According to the manpage:
> > > 
> > >        sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
> > >        written to the underlying filesystems.
> > > 
> > > And:
> > >        According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> > >        may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
> > >        thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
> > >        tem or filesystem respectively.
> > > 
> > > Excluding the root cgroup, do you think a sync() issued inside a
> > > specific cgroup should wait for I/O completions only for the writes that
> > > have been generated by that cgroup?
> > 
> > Can we account I/O towards the cgroup which issued "sync" only if write
> > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > that solve problem, assuming its doable?
> 
> Maybe this would mitigate the problem, in part, but it doesn't solve it.
> 
> The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> issues "sync", the fast cgroup needs to wait a lot, because writeback is
> happening at the speed of the slow cgroup.

Hi Andrea,

But that's true only for I/O which has already been submitted to the block
layer, right? Any new I/O yet to be submitted could still be attributed
to the faster cgroup requesting the sync.

Unless the cgroup limits are absurdly low, it should not take very
long for already-submitted I/O to finish. If so, then in practice it
might not be a big problem?

Vivek

> 
> Ideally in this case we should bump up the writeback speed, maybe even
> temporarily inherit the write rate of the sync cgroup, similarly to a
> priority-inversion locking scenario, but I think it's not doable at the
> moment without applying big changes.
> 
> Or we could isolate the sync domain, meaning that a cgroup issuing a
> sync will only wait for the syncing of the pages that belong to that
> sync cgroup. But probably also this method requires big changes...
> 
> -Andrea
Andrea Righi Jan. 29, 2019, 6:39 p.m. UTC | #14
On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote:
> On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> > Hi Vivek,
> > 
> > sorry for the late reply.
> > 
> > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> > > 
> > > [..]
> > > > Alright, let's skip the root cgroup for now. I think the point here is
> > > > if we want to provide sync() isolation among cgroups or not.
> > > > 
> > > > According to the manpage:
> > > > 
> > > >        sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
> > > >        written to the underlying filesystems.
> > > > 
> > > > And:
> > > >        According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> > > >        may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
> > > >        thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
> > > >        tem or filesystem respectively.
> > > > 
> > > > Excluding the root cgroup, do you think a sync() issued inside a
> > > > specific cgroup should wait for I/O completions only for the writes that
> > > > have been generated by that cgroup?
> > > 
> > > Can we account I/O towards the cgroup which issued "sync" only if write
> > > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > > that solve problem, assuming its doable?
> > 
> > Maybe this would mitigate the problem, in part, but it doesn't solve it.
> > 
> > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> > issues "sync", the fast cgroup needs to wait a lot, because writeback is
> > happening at the speed of the slow cgroup.
> 
> Hi Andrea,
> 
> But that's true only for I/O which has already been submitted to block
> layer, right? Any new I/O yet to be submitted could still be attributed
> to faster cgroup requesting sync.

Right. If we could bump up the new I/O yet to be submitted, I think we
could effectively prevent the priority-inversion problem (the ongoing
writeback I/O should be negligible).

> 
> Until and unless cgroups limits are absurdly low, it should not take very
> long for already submitted I/O to finish. If yes, then in practice, it
> might not be a big problem?

I was actually doing my tests with a very low limit (1MB/s both for rbps
and wbps), but I think this shows the problem very well.

Here's what I'm doing:

 [ slow cgroup (1Mbps read/write) ]

   $ cat /sys/fs/cgroup/unified/cg1/io.max
   259:0 rbps=1048576 wbps=1048576 riops=max wiops=max
   $ cat /proc/self/cgroup
   0::/cg1

   $ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based --runtime=30

 [ fast cgroup (root cgroup, no limitation) ]

   # cat /proc/self/cgroup
   0::/

   # time sync
   real	9m32,618s
   user	0m0,000s
   sys	0m0,018s

With this simple test I can easily trigger hung task timeout warnings
and make the whole system totally sluggish (even the processes running
in the root cgroup).

When fio ends, writeback is still taking forever to complete, as you can
see from the insane amount of time that sync takes to complete.

-Andrea
Josef Bacik Jan. 29, 2019, 6:50 p.m. UTC | #15
On Tue, Jan 29, 2019 at 07:39:38PM +0100, Andrea Righi wrote:
> On Mon, Jan 28, 2019 at 02:26:20PM -0500, Vivek Goyal wrote:
> > On Mon, Jan 28, 2019 at 06:41:29PM +0100, Andrea Righi wrote:
> > > Hi Vivek,
> > > 
> > > sorry for the late reply.
> > > 
> > > On Mon, Jan 21, 2019 at 04:47:15PM -0500, Vivek Goyal wrote:
> > > > On Sat, Jan 19, 2019 at 11:08:27AM +0100, Andrea Righi wrote:
> > > > 
> > > > [..]
> > > > > Alright, let's skip the root cgroup for now. I think the point here is
> > > > > if we want to provide sync() isolation among cgroups or not.
> > > > > 
> > > > > According to the manpage:
> > > > > 
> > > > >        sync()  causes  all  pending  modifications  to filesystem metadata and cached file data to be
> > > > >        written to the underlying filesystems.
> > > > > 
> > > > > And:
> > > > >        According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but
> > > > >        may  return  before  the actual writing is done.  However Linux waits for I/O completions, and
> > > > >        thus sync() or syncfs() provide the same guarantees as fsync called on every file in the  sys‐
> > > > >        tem or filesystem respectively.
> > > > > 
> > > > > Excluding the root cgroup, do you think a sync() issued inside a
> > > > > specific cgroup should wait for I/O completions only for the writes that
> > > > > have been generated by that cgroup?
> > > > 
> > > > Can we account I/O towards the cgroup which issued "sync" only if write
> > > > rate of sync cgroup is higher than cgroup to which page belongs to. Will
> > > > that solve problem, assuming its doable?
> > > 
> > > Maybe this would mitigate the problem, in part, but it doesn't solve it.
> > > 
> > > The thing is, if a dirty page belongs to a slow cgroup and a fast cgroup
> > > issues "sync", the fast cgroup needs to wait a lot, because writeback is
> > > happening at the speed of the slow cgroup.
> > 
> > Hi Andrea,
> > 
> > But that's true only for I/O which has already been submitted to block
> > layer, right? Any new I/O yet to be submitted could still be attributed
> > to faster cgroup requesting sync.
> 
> Right. If we could bump up the new I/O yet to be submitted I think we
> could effectively prevent the priority inversion problem (the ongoing
> writeback I/O should be negligible).
> 
> > 
> > Until and unless cgroups limits are absurdly low, it should not take very
> > long for already submitted I/O to finish. If yes, then in practice, it
> > might not be a big problem?
> 
> I was actually doing my tests with a very low limit (1MB/s both for rbps
> and wbps), but this shows the problem very well I think.
> 
> Here's what I'm doing:
> 
>  [ slow cgroup (1Mbps read/write) ]
> 
>    $ cat /sys/fs/cgroup/unified/cg1/io.max
>    259:0 rbps=1048576 wbps=1048576 riops=max wiops=max
>    $ cat /proc/self/cgroup
>    0::/cg1
> 
>    $ fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based --runtime=30
> 
>  [ fast cgroup (root cgroup, no limitation) ]
> 
>    # cat /proc/self/cgroup
>    0::/
> 
>    # time sync
>    real	9m32,618s
>    user	0m0,000s
>    sys	0m0,018s
> 
> With this simple test I can easily trigger hung task timeout warnings
> and make the whole system totally sluggish (even the processes running
> in the root cgroup).
> 
> When fio ends, writeback is still taking forever to complete, as you can
> see by the insane amount that sync takes to complete.
> 

Yeah, sync() needs to be treated differently, but it's kind of special too.  We
don't want a slow cgroup running sync() to back up a fast cgroup doing sync(),
because we make all of the io go based on the submitting cgroup.  The problem
here is we don't know who's more important until we get to the blk cgroup layer,
and even then sometimes we can't tell (different hierarchies would make this
tricky with io.weight or io.latency).

We could treat it like REQ_META and just let everything go through and
back-charge it.  This feels like a way for the slow group to cheat though, unless
we just throttle the shit out of it before returning to user space.  I'll have to
think about this some more.  Thanks,

Josef