Message ID | 20210617095309.3542373-1-stapelberg+linux@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | backing_dev_info: introduce min_bw/max_bw limits | expand |
On Thu, 17 Jun 2021 at 11:53, Michael Stapelberg <stapelberg+linux@google.com> wrote: > > These new knobs allow e.g. FUSE file systems to guide kernel memory > writeback bandwidth throttling. > > Background: > > When using mmap(2) to read/write files, the page-writeback code tries to > measure how quick file system backing devices (BDI) are able to write data, > so that it can throttle processes accordingly. > > Unfortunately, certain usage patterns, such as linkers (tested with GCC, > but also the Go linker) seem to hit an unfortunate corner case when writing > their large executable output files: the kernel only ever measures > the (non-representative) rising slope of the starting bulk write, but the > whole file write is already over before the kernel could possibly measure > the representative steady-state. > > As a consequence, with each program invocation hitting this corner case, > the FUSE write bandwidth steadily sinks in a downward spiral, until it > eventually reaches 0 (!). This results in the kernel heavily throttling > page dirtying in programs trying to write to FUSE, which in turn manifests > itself in slow or even entirely stalled linker processes. > > Change: > > This commit adds 2 knobs which allow avoiding this situation entirely on a > per-file-system basis by restricting the minimum/maximum bandwidth. This looks like a bug in the dirty throttling heuristics, that may be effecting multiple fuse based filesystems. Ideally the solution should be a fix to those heuristics, not adding more knobs. Is there a fundamental reason why that can't be done? Maybe the heuristics need to detect the fact that steady state has not been reached, and not modify the bandwidth in that case, or something along those lines. Thanks, Miklos
Hey Miklos Thanks for taking a look! On Fri, 18 Jun 2021 at 10:18, Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Thu, 17 Jun 2021 at 11:53, Michael Stapelberg > <stapelberg+linux@google.com> wrote: > > > > These new knobs allow e.g. FUSE file systems to guide kernel memory > > writeback bandwidth throttling. > > > > Background: > > > > When using mmap(2) to read/write files, the page-writeback code tries to > > measure how quick file system backing devices (BDI) are able to write data, > > so that it can throttle processes accordingly. > > > > Unfortunately, certain usage patterns, such as linkers (tested with GCC, > > but also the Go linker) seem to hit an unfortunate corner case when writing > > their large executable output files: the kernel only ever measures > > the (non-representative) rising slope of the starting bulk write, but the > > whole file write is already over before the kernel could possibly measure > > the representative steady-state. > > > > As a consequence, with each program invocation hitting this corner case, > > the FUSE write bandwidth steadily sinks in a downward spiral, until it > > eventually reaches 0 (!). This results in the kernel heavily throttling > > page dirtying in programs trying to write to FUSE, which in turn manifests > > itself in slow or even entirely stalled linker processes. > > > > Change: > > > > This commit adds 2 knobs which allow avoiding this situation entirely on a > > per-file-system basis by restricting the minimum/maximum bandwidth. > > > This looks like a bug in the dirty throttling heuristics, that may be > effecting multiple fuse based filesystems. > > Ideally the solution should be a fix to those heuristics, not adding more knobs. Agreed. > > > Is there a fundamental reason why that can't be done? Maybe the > heuristics need to detect the fact that steady state has not been > reached, and not modify the bandwidth in that case, or something along > those lines. Maybe, but I don’t have the expertise, motivation or time to investigate this any further, let alone commit to get it done. During our previous discussion I got the impression that nobody else had any cycles for this either: https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@mail.gmail.com/ Have you had a look at the China LSF report at http://bardofschool.blogspot.com/2011/? The author of the heuristic has spent significant effort and time coming up with what we currently have in the kernel: """ Fengguang said he draw more than 10K performance graphs and read even more in the past year. """ This implies that making changes to the heuristic will not be a quick fix. I think adding these limit knobs could be useful regardless of the specific heuristic behavior. The knobs are certainly easy to understand, safe to introduce (no regressions), and can be used to fix the issue at hand as well as other issues (if any, now or in the future). Thanks Best regards Michael
On Fri, 18 Jun 2021 10:31:35 +0200 Michael Stapelberg wrote: >Hey Miklos > >Thanks for taking a look! > >On Fri, 18 Jun 2021 at 10:18, Miklos Szeredi <miklos@szeredi.hu> wrote: >> >> On Thu, 17 Jun 2021 at 11:53, Michael Stapelberg >> <stapelberg+linux@google.com> wrote: >> > >> > These new knobs allow e.g. FUSE file systems to guide kernel memory >> > writeback bandwidth throttling. >> > >> > Background: >> > >> > When using mmap(2) to read/write files, the page-writeback code tries t= >o >> > measure how quick file system backing devices (BDI) are able to write d= >ata, >> > so that it can throttle processes accordingly. >> > >> > Unfortunately, certain usage patterns, such as linkers (tested with GCC= >, >> > but also the Go linker) seem to hit an unfortunate corner case when wri= >ting >> > their large executable output files: the kernel only ever measures >> > the (non-representative) rising slope of the starting bulk write, but t= >he >> > whole file write is already over before the kernel could possibly measu= >re >> > the representative steady-state. >> > >> > As a consequence, with each program invocation hitting this corner case= >, >> > the FUSE write bandwidth steadily sinks in a downward spiral, until it >> > eventually reaches 0 (!). This results in the kernel heavily throttling >> > page dirtying in programs trying to write to FUSE, which in turn manife= >sts >> > itself in slow or even entirely stalled linker processes. >> > >> > Change: >> > >> > This commit adds 2 knobs which allow avoiding this situation entirely o= >n a >> > per-file-system basis by restricting the minimum/maximum bandwidth. >> >> >> This looks like a bug in the dirty throttling heuristics, that may be >> effecting multiple fuse based filesystems. >> >> Ideally the solution should be a fix to those heuristics, not adding more= > knobs. > > >Agreed. +1 > >> >> >> Is there a fundamental reason why that can't be done? Maybe the >> heuristics need to detect the fact that steady state has not been >> reached, and not modify the bandwidth in that case, or something along >> those lines. > >Maybe, but I don=E2=80=99t have the expertise, motivation or time to >investigate this any further, let alone commit to get it done. >During our previous discussion I got the impression that nobody else >had any cycles for this either: >https://lore.kernel.org/linux-fsdevel/CANnVG6n=3DySfe1gOr=3D0ituQidp56idGAR= >DKHzP0hv=3DERedeMrMA@mail.gmail.com/ Its timestamp is Mon, 9 Mar 2020 16:11:41 +0100 > >Have you had a look at the China LSF report at >http://bardofschool.blogspot.com/2011/? >The author of the heuristic has spent significant effort and time >coming up with what we currently have in the kernel: > >""" >Fengguang said he draw more than 10K performance graphs and read even >more in the past year. >""" > >This implies that making changes to the heuristic will not be a quick fix. The 2019 attempt [01] IIRC was trying to cut the heuristics. > >I think adding these limit knobs could be useful regardless of the >specific heuristic behavior. >The knobs are certainly easy to understand, safe to introduce (no regressio= >ns), >and can be used to fix the issue at hand as well as other issues (if >any, now or in the future). > >Thanks >Best regards >Michael [01] https://lore.kernel.org/lkml/20191118082559.GJ6910@shao2-debian/
On Fri, 18 Jun 2021 at 10:31, Michael Stapelberg <stapelberg+linux@google.com> wrote: > Maybe, but I don’t have the expertise, motivation or time to > investigate this any further, let alone commit to get it done. > During our previous discussion I got the impression that nobody else > had any cycles for this either: > https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@mail.gmail.com/ > > Have you had a look at the China LSF report at > http://bardofschool.blogspot.com/2011/? > The author of the heuristic has spent significant effort and time > coming up with what we currently have in the kernel: > > """ > Fengguang said he draw more than 10K performance graphs and read even > more in the past year. > """ > > This implies that making changes to the heuristic will not be a quick fix. Having a piece of kernel code sitting there that nobody is willing to fix is certainly not a great situation to be in. And introducing band aids is not going improve the above situation, more likely it will prolong it even further. Thanks, Miklos
Hey Miklos On Fri, 18 Jun 2021 at 16:42, Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Fri, 18 Jun 2021 at 10:31, Michael Stapelberg > <stapelberg+linux@google.com> wrote: > > > Maybe, but I don’t have the expertise, motivation or time to > > investigate this any further, let alone commit to get it done. > > During our previous discussion I got the impression that nobody else > > had any cycles for this either: > > https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@mail.gmail.com/ > > > > Have you had a look at the China LSF report at > > http://bardofschool.blogspot.com/2011/? > > The author of the heuristic has spent significant effort and time > > coming up with what we currently have in the kernel: > > > > """ > > Fengguang said he draw more than 10K performance graphs and read even > > more in the past year. > > """ > > > > This implies that making changes to the heuristic will not be a quick fix. > > Having a piece of kernel code sitting there that nobody is willing to > fix is certainly not a great situation to be in. Agreed. > > And introducing band aids is not going improve the above situation, > more likely it will prolong it even further. Sounds like “Perfect is the enemy of good” to me: you’re looking for a perfect hypothetical solution, whereas we have a known-working low risk fix for a real problem. Could we find a solution where medium-/long-term, the code in question is improved, perhaps via a Summer Of Code project or similar community efforts, but until then, we apply the patch at hand? As I mentioned, I think adding min/max limits can be useful regardless of how the heuristic itself changes. If that turns out to be incorrect or undesired, we can still turn the knobs into a no-op, if removal isn’t an option. Thanks Best regards Michael
On Mon 21-06-21 11:20:10, Michael Stapelberg wrote: > Hey Miklos > > On Fri, 18 Jun 2021 at 16:42, Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Fri, 18 Jun 2021 at 10:31, Michael Stapelberg > > <stapelberg+linux@google.com> wrote: > > > > > Maybe, but I don’t have the expertise, motivation or time to > > > investigate this any further, let alone commit to get it done. > > > During our previous discussion I got the impression that nobody else > > > had any cycles for this either: > > > https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@mail.gmail.com/ > > > > > > Have you had a look at the China LSF report at > > > http://bardofschool.blogspot.com/2011/? > > > The author of the heuristic has spent significant effort and time > > > coming up with what we currently have in the kernel: > > > > > > """ > > > Fengguang said he draw more than 10K performance graphs and read even > > > more in the past year. > > > """ > > > > > > This implies that making changes to the heuristic will not be a quick fix. > > > > Having a piece of kernel code sitting there that nobody is willing to > > fix is certainly not a great situation to be in. > > Agreed. > > > > > And introducing band aids is not going improve the above situation, > > more likely it will prolong it even further. > > Sounds like “Perfect is the enemy of good” to me: you’re looking for a > perfect hypothetical solution, > whereas we have a known-working low risk fix for a real problem. > > Could we find a solution where medium-/long-term, the code in question > is improved, > perhaps via a Summer Of Code project or similar community efforts, > but until then, we apply the patch at hand? > > As I mentioned, I think adding min/max limits can be useful regardless > of how the heuristic itself changes. > > If that turns out to be incorrect or undesired, we can still turn the > knobs into a no-op, if removal isn’t an option. Well, removal of added knobs is more or less out of question as it can break some userspace. Similarly making them no-op is problematic unless we are pretty certain it cannot break some existing setup. That's why we have to think twice (or better three times ;) before adding any knobs. Also honestly the knobs you suggest will be pretty hard to tune when there are multiple cgroups with writeback control involved (which can be affected by the same problems you observe as well). So I agree with Miklos that this is not the right way to go. Speaking of tunables, did you try tuning /sys/devices/virtual/bdi/<fuse-bdi>/min_ratio? I suspect that may workaround your problems... Looking into your original report and tracing you did (thanks for that, really useful), it seems that the problem is that writeback bandwidth is updated at most every 200ms (more frequent calls are just ignored) and are triggered only from balance_dirty_pages() (happen when pages are dirtied) and inode writeback code so if the workload tends to have short spikes of activity and extended periods of quiet time, then writeback bandwidth may indeed be seriously miscomputed because we just won't update writeback throughput after most of writeback has happened as you observed. I think the fix for this can be relatively simple. We just need to make sure we update writeback bandwidth reasonably quickly after the IO finishes. I'll write a patch and see if it helps. Honza
Thanks for taking a look! Comments inline: On Tue, 22 Jun 2021 at 14:12, Jan Kara <jack@suse.cz> wrote: > > On Mon 21-06-21 11:20:10, Michael Stapelberg wrote: > > Hey Miklos > > > > On Fri, 18 Jun 2021 at 16:42, Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > > On Fri, 18 Jun 2021 at 10:31, Michael Stapelberg > > > <stapelberg+linux@google.com> wrote: > > > > > > > Maybe, but I don’t have the expertise, motivation or time to > > > > investigate this any further, let alone commit to get it done. > > > > During our previous discussion I got the impression that nobody else > > > > had any cycles for this either: > > > > https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@mail.gmail.com/ > > > > > > > > Have you had a look at the China LSF report at > > > > http://bardofschool.blogspot.com/2011/? > > > > The author of the heuristic has spent significant effort and time > > > > coming up with what we currently have in the kernel: > > > > > > > > """ > > > > Fengguang said he draw more than 10K performance graphs and read even > > > > more in the past year. > > > > """ > > > > > > > > This implies that making changes to the heuristic will not be a quick fix. > > > > > > Having a piece of kernel code sitting there that nobody is willing to > > > fix is certainly not a great situation to be in. > > > > Agreed. > > > > > > > > And introducing band aids is not going improve the above situation, > > > more likely it will prolong it even further. > > > > Sounds like “Perfect is the enemy of good” to me: you’re looking for a > > perfect hypothetical solution, > > whereas we have a known-working low risk fix for a real problem. > > > > Could we find a solution where medium-/long-term, the code in question > > is improved, > > perhaps via a Summer Of Code project or similar community efforts, > > but until then, we apply the patch at hand? > > > > As I mentioned, I think adding min/max limits can be useful regardless > > of how the heuristic itself changes. > > > > If that turns out to be incorrect or undesired, we can still turn the > > knobs into a no-op, if removal isn’t an option. > > Well, removal of added knobs is more or less out of question as it can > break some userspace. Similarly making them no-op is problematic unless we > are pretty certain it cannot break some existing setup. That's why we have > to think twice (or better three times ;) before adding any knobs. Also > honestly the knobs you suggest will be pretty hard to tune when there are > multiple cgroups with writeback control involved (which can be affected by > the same problems you observe as well). So I agree with Miklos that this is > not the right way to go. Speaking of tunables, did you try tuning > /sys/devices/virtual/bdi/<fuse-bdi>/min_ratio? I suspect that may > workaround your problems... Back then, I did try the various tunables (vm.dirty_ratio and vm.dirty_background_ratio on the global level, /sys/class/bdi/<bdi>/{min,max}_ratio on the file system level), and they have had no observable effect on the problem at all in my tests. > > Looking into your original report and tracing you did (thanks for that, > really useful), it seems that the problem is that writeback bandwidth is > updated at most every 200ms (more frequent calls are just ignored) and are > triggered only from balance_dirty_pages() (happen when pages are dirtied) and > inode writeback code so if the workload tends to have short spikes of activity > and extended periods of quiet time, then writeback bandwidth may indeed be > seriously miscomputed because we just won't update writeback throughput > after most of writeback has happened as you observed. > > I think the fix for this can be relatively simple. We just need to make > sure we update writeback bandwidth reasonably quickly after the IO > finishes. I'll write a patch and see if it helps. Thank you! Please keep us posted.
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 1d7edad9914f..e34797bb62a1 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -175,6 +175,8 @@ struct backing_dev_info { unsigned int capabilities; /* Device capabilities */ unsigned int min_ratio; unsigned int max_ratio, max_prop_frac; + u64 min_bw; + u64 max_bw; /* * Sum of avg_write_bw of wbs with dirty inodes. > 0 if there are diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 44df4fcef65c..bb812a4df3a1 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -107,6 +107,9 @@ static inline unsigned long wb_stat_error(void) int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio); int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); +int bdi_set_min_bw(struct backing_dev_info *bdi, u64 min_bw); +int bdi_set_max_bw(struct backing_dev_info *bdi, u64 max_bw); + /* * Flags in backing_dev_info::capability * diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 271f2ca862c8..0201345d41f2 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -197,6 +197,44 @@ static ssize_t max_ratio_store(struct device *dev, } BDI_SHOW(max_ratio, bdi->max_ratio) +static ssize_t min_bw_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + unsigned long long limit; + ssize_t ret; + + ret = kstrtoull(buf, 10, &limit); + if (ret < 0) + return ret; + + ret = bdi_set_min_bw(bdi, limit); + if (!ret) + ret = count; + + return ret; +} +BDI_SHOW(min_bw, bdi->min_bw) + +static ssize_t max_bw_store(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) +{ + struct backing_dev_info *bdi = dev_get_drvdata(dev); + unsigned long long limit; + ssize_t ret; + + ret = kstrtoull(buf, 10, &limit); + if (ret < 0) + return ret; + + ret = bdi_set_max_bw(bdi, limit); + if (!ret) + ret = count; + + return ret; +} +BDI_SHOW(max_bw, bdi->max_bw) + static ssize_t stable_pages_required_show(struct device *dev, struct device_attribute *attr, char *buf) @@ -211,6 +249,8 @@ static struct attribute *bdi_dev_attrs[] = { &dev_attr_read_ahead_kb.attr, &dev_attr_min_ratio.attr, &dev_attr_max_ratio.attr, + &dev_attr_min_bw.attr, + &dev_attr_max_bw.attr, &dev_attr_stable_pages_required.attr, NULL, }; diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 9f63548f247c..1ee9636e6088 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -701,6 +701,22 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio) } EXPORT_SYMBOL(bdi_set_max_ratio); +int bdi_set_min_bw(struct backing_dev_info *bdi, u64 min_bw) +{ + spin_lock_bh(&bdi_lock); + bdi->min_bw = min_bw; + spin_unlock_bh(&bdi_lock); + return 0; +} + +int bdi_set_max_bw(struct backing_dev_info *bdi, u64 max_bw) +{ + spin_lock_bh(&bdi_lock); + bdi->max_bw = max_bw; + spin_unlock_bh(&bdi_lock); + return 0; +} + static unsigned long dirty_freerun_ceiling(unsigned long thresh, unsigned long bg_thresh) { @@ -1068,6 +1084,15 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc) dtc->pos_ratio = pos_ratio; } +static u64 clamp_bw(struct backing_dev_info *bdi, u64 bw) +{ + if (bdi->min_bw > 0 && bw < bdi->min_bw) + bw = bdi->min_bw; + if (bdi->max_bw > 0 && bw > bdi->max_bw) + bw = bdi->max_bw; + return bw; +} + static void wb_update_write_bandwidth(struct bdi_writeback *wb, unsigned long elapsed, unsigned long written) @@ -1091,12 +1116,15 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb, bw *= HZ; if (unlikely(elapsed > period)) { bw = div64_ul(bw, elapsed); + bw = clamp_bw(wb->bdi, bw); avg = bw; goto out; } bw += (u64)wb->write_bandwidth * (period - elapsed); bw >>= ilog2(period); + bw = clamp_bw(wb->bdi, bw); + /* * one more level of smoothing, for filtering out sudden spikes */
These new knobs allow e.g. FUSE file systems to guide kernel memory writeback bandwidth throttling. Background: When using mmap(2) to read/write files, the page-writeback code tries to measure how quick file system backing devices (BDI) are able to write data, so that it can throttle processes accordingly. Unfortunately, certain usage patterns, such as linkers (tested with GCC, but also the Go linker) seem to hit an unfortunate corner case when writing their large executable output files: the kernel only ever measures the (non-representative) rising slope of the starting bulk write, but the whole file write is already over before the kernel could possibly measure the representative steady-state. As a consequence, with each program invocation hitting this corner case, the FUSE write bandwidth steadily sinks in a downward spiral, until it eventually reaches 0 (!). This results in the kernel heavily throttling page dirtying in programs trying to write to FUSE, which in turn manifests itself in slow or even entirely stalled linker processes. Change: This commit adds 2 knobs which allow avoiding this situation entirely on a per-file-system basis by restricting the minimum/maximum bandwidth. There are no negative effects expected from applying this patch. At Google, we have been running this patch for about 1 year on many thousands of developer PCs without observing any issues. Our in-house FUSE filesystems pin the BDI write rate at its default setting of 100 MB/s, which successfully prevents the bug described above. Usage: To inspect the measured bandwidth, check the BdiWriteBandwidth field in e.g. /sys/kernel/debug/bdi/0:93/stats. To pin the measured bandwidth to its default of 100 MB/s, use: echo 25600 > /sys/class/bdi/0:42/min_bw echo 25600 > /sys/class/bdi/0:42/max_bw Notes: For more discussion, including a test program for reproducing the issue, see the following discussion thread on the Linux Kernel Mailing List: https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@mail.gmail.com/ Why introduce these knobs instead of trying to tweak the throttling/measurement algorithm? The effort required to systematically analyze, improve and land such an algorithm change exceeds the time budget I have available. For comparison, check out this quote about the original patch set from 2011: “Fengguang said he draw more than 10K performance graphs and read even more in the past year.” (from http://bardofschool.blogspot.com/2011/). Given that nobody else has stepped up, despite the problem being known since 2016, my suggestion is to add the knobs until someone can spend significant time on a revision to the algorithm. Signed-off-by: Michael Stapelberg <stapelberg+linux@google.com> --- include/linux/backing-dev-defs.h | 2 ++ include/linux/backing-dev.h | 3 +++ mm/backing-dev.c | 40 ++++++++++++++++++++++++++++++++ mm/page-writeback.c | 28 ++++++++++++++++++++++ 4 files changed, 73 insertions(+)