From patchwork Thu Jun 17 09:53:03 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Michael Stapelberg
X-Patchwork-Id: 12327225
Date: Thu, 17 Jun 2021 11:53:03 +0200
Message-Id: <20210617095309.3542373-1-stapelberg+linux@google.com>
X-Mailer: git-send-email 2.32.0.288.g62a8d224e6-goog
Subject: [PATCH] backing_dev_info: introduce min_bw/max_bw limits
From: Michael Stapelberg
To: Andrew Morton
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, Michael Stapelberg, Tejun Heo,
 Dennis Zhou, Jens Axboe, Roman Gushchin, Johannes Thumshirn, Jan Kara,
 Song Liu, David Sterba

These new knobs allow e.g. FUSE file systems to guide kernel memory
writeback bandwidth throttling.

Background:

When using mmap(2) to read/write files, the page-writeback code tries to
measure how quickly file system backing devices (BDI) are able to write
data, so that it can throttle processes accordingly.

Unfortunately, certain usage patterns, such as linkers (tested with GCC,
but also the Go linker) seem to hit an unfortunate corner case when
writing their large executable output files: the kernel only ever
measures the (non-representative) rising slope of the starting bulk
write, but the whole file write is already over before the kernel could
possibly measure the representative steady-state.

As a consequence, with each program invocation hitting this corner case,
the FUSE write bandwidth steadily sinks in a downward spiral, until it
eventually reaches 0 (!). This results in the kernel heavily throttling
page dirtying in programs trying to write to FUSE, which in turn
manifests itself in slow or even entirely stalled linker processes.

Change:

This commit adds 2 knobs which allow avoiding this situation entirely on
a per-file-system basis by restricting the minimum/maximum bandwidth.

There are no negative effects expected from applying this patch.

At Google, we have been running this patch for about 1 year on many
thousands of developer PCs without observing any issues. Our in-house
FUSE filesystems pin the BDI write rate at its default setting of
100 MB/s, which successfully prevents the bug described above.

Usage:

To inspect the measured bandwidth, check the BdiWriteBandwidth field in
e.g. /sys/kernel/debug/bdi/0:93/stats.

To pin the measured bandwidth to its default of 100 MB/s, use:

    echo 25600 > /sys/class/bdi/0:42/min_bw
    echo 25600 > /sys/class/bdi/0:42/max_bw

Notes:

For more discussion, including a test program for reproducing the issue,
see the following discussion thread on the Linux Kernel Mailing List:

https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@mail.gmail.com/

Why introduce these knobs instead of trying to tweak the
throttling/measurement algorithm? The effort required to systematically
analyze, improve and land such an algorithm change exceeds the time
budget I have available.

For comparison, check out this quote about the original patch set from
2011: “Fengguang said he draw more than 10K performance graphs and read
even more in the past year.” (from
http://bardofschool.blogspot.com/2011/).

Given that nobody else has stepped up, despite the problem being known
since 2016, my suggestion is to add the knobs until someone can spend
significant time on a revision to the algorithm.

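A note on the value 25600 used above: the commit message does not spell
out the unit of min_bw/max_bw. The number is consistent with pages per
second, the unit the writeback code uses for its bandwidth estimate,
because 100 MiB/s divided by a 4 KiB page size gives 25600. The sketch
below shows that conversion under this assumption; the helper name
mib_per_sec_to_pages is hypothetical and not part of the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * Converts a bandwidth given in MiB/s into pages per second for a given
 * page size in bytes. This assumes that min_bw/max_bw use the same
 * pages-per-second unit as the kernel's write bandwidth estimate.
 */
static uint64_t mib_per_sec_to_pages(uint64_t mib_per_sec, uint64_t page_size)
{
	/* 1 MiB = 2^20 bytes */
	return (mib_per_sec << 20) / page_size;
}
```

On a system with a different page size the value to write would differ
accordingly; sysconf(_SC_PAGESIZE) reports the page size from user space.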
Signed-off-by: Michael Stapelberg
---
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/backing-dev.h      |  3 +++
 mm/backing-dev.c                 | 40 ++++++++++++++++++++++++++++++++
 mm/page-writeback.c              | 28 ++++++++++++++++++++++
 4 files changed, 73 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1d7edad9914f..e34797bb62a1 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -175,6 +175,8 @@ struct backing_dev_info {
 	unsigned int capabilities; /* Device capabilities */
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
+	u64 min_bw;
+	u64 max_bw;
 
 	/*
 	 * Sum of avg_write_bw of wbs with dirty inodes. > 0 if there are
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 44df4fcef65c..bb812a4df3a1 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -107,6 +107,9 @@ static inline unsigned long wb_stat_error(void)
 int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio);
 int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 
+int bdi_set_min_bw(struct backing_dev_info *bdi, u64 min_bw);
+int bdi_set_max_bw(struct backing_dev_info *bdi, u64 max_bw);
+
 /*
  * Flags in backing_dev_info::capability
  *
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 271f2ca862c8..0201345d41f2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -197,6 +197,44 @@ static ssize_t max_ratio_store(struct device *dev,
 }
 BDI_SHOW(max_ratio, bdi->max_ratio)
 
+static ssize_t min_bw_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned long long limit;
+	ssize_t ret;
+
+	ret = kstrtoull(buf, 10, &limit);
+	if (ret < 0)
+		return ret;
+
+	ret = bdi_set_min_bw(bdi, limit);
+	if (!ret)
+		ret = count;
+
+	return ret;
+}
+BDI_SHOW(min_bw, bdi->min_bw)
+
+static ssize_t max_bw_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned long long limit;
+	ssize_t ret;
+
+	ret = kstrtoull(buf, 10, &limit);
+	if (ret < 0)
+		return ret;
+
+	ret = bdi_set_max_bw(bdi, limit);
+	if (!ret)
+		ret = count;
+
+	return ret;
+}
+BDI_SHOW(max_bw, bdi->max_bw)
+
 static ssize_t stable_pages_required_show(struct device *dev,
 					  struct device_attribute *attr,
 					  char *buf)
@@ -211,6 +249,8 @@ static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
+	&dev_attr_min_bw.attr,
+	&dev_attr_max_bw.attr,
 	&dev_attr_stable_pages_required.attr,
 	NULL,
 };
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9f63548f247c..1ee9636e6088 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -701,6 +701,22 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 }
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
+int bdi_set_min_bw(struct backing_dev_info *bdi, u64 min_bw)
+{
+	spin_lock_bh(&bdi_lock);
+	bdi->min_bw = min_bw;
+	spin_unlock_bh(&bdi_lock);
+	return 0;
+}
+
+int bdi_set_max_bw(struct backing_dev_info *bdi, u64 max_bw)
+{
+	spin_lock_bh(&bdi_lock);
+	bdi->max_bw = max_bw;
+	spin_unlock_bh(&bdi_lock);
+	return 0;
+}
+
 static unsigned long dirty_freerun_ceiling(unsigned long thresh,
 					   unsigned long bg_thresh)
 {
@@ -1068,6 +1084,15 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
 	dtc->pos_ratio = pos_ratio;
 }
 
+static u64 clamp_bw(struct backing_dev_info *bdi, u64 bw)
+{
+	if (bdi->min_bw > 0 && bw < bdi->min_bw)
+		bw = bdi->min_bw;
+	if (bdi->max_bw > 0 && bw > bdi->max_bw)
+		bw = bdi->max_bw;
+	return bw;
+}
+
 static void wb_update_write_bandwidth(struct bdi_writeback *wb,
 				      unsigned long elapsed,
 				      unsigned long written)
@@ -1091,12 +1116,15 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
 	bw *= HZ;
 	if (unlikely(elapsed > period)) {
 		bw = div64_ul(bw, elapsed);
+		bw = clamp_bw(wb->bdi, bw);
 		avg = bw;
 		goto out;
 	}
 	bw += (u64)wb->write_bandwidth * (period - elapsed);
 	bw >>= ilog2(period);
 
+	bw = clamp_bw(wb->bdi, bw);
+
 	/*
 	 * one more level of smoothing, for filtering out sudden spikes
 	 */
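
For readers who want to try the clamping semantics outside the kernel,
here is a minimal user-space sketch. It mirrors clamp_bw() from the patch
and models one smoothing step of the elapsed <= period branch, based on
the context lines of the last hunk. The struct bdi_limits, the function
names, and the hz/period_shift parameters are hypothetical stand-ins; the
kernel derives its period from HZ and works on struct backing_dev_info.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two fields the patch adds. */
struct bdi_limits {
	uint64_t min_bw;	/* 0 means no lower limit */
	uint64_t max_bw;	/* 0 means no upper limit */
};

/* Same logic as clamp_bw() in the patch, applied to the stand-in struct. */
static uint64_t clamp_bw_model(const struct bdi_limits *l, uint64_t bw)
{
	if (l->min_bw > 0 && bw < l->min_bw)
		bw = l->min_bw;
	/* The upper limit is checked last, so it wins if min_bw > max_bw. */
	if (l->max_bw > 0 && bw > l->max_bw)
		bw = l->max_bw;
	return bw;
}

/*
 * One estimate update for the case elapsed <= period.
 *
 * old_bw:       previous estimate, in pages per second
 * written:      pages written during this interval (assumption)
 * elapsed:      interval length in ticks, must not exceed the period
 * hz:           ticks per second
 * period_shift: log2 of the averaging period in ticks
 */
static uint64_t update_bw_model(const struct bdi_limits *l, uint64_t old_bw,
				uint64_t written, uint64_t elapsed,
				uint64_t hz, unsigned int period_shift)
{
	uint64_t period = (uint64_t)1 << period_shift;
	uint64_t bw = written * hz;

	/* Blend the new sample with the old estimate over the period. */
	bw += old_bw * (period - elapsed);
	bw >>= period_shift;

	return clamp_bw_model(l, bw);
}
```

With no limits set, an interval in which nothing was written pulls the
estimate below its previous value, which is the direction of the spiral
described in the commit message. With min_bw and max_bw both set to the
previous value, the same step leaves the estimate unchanged.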