From patchwork Thu Aug 15 19:57:19 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tejun Heo X-Patchwork-Id: 11096421 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 93C7013B1 for ; Thu, 15 Aug 2019 19:57:24 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 809DE2898C for ; Thu, 15 Aug 2019 19:57:24 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 745682899D; Thu, 15 Aug 2019 19:57:24 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C0E7C2898C for ; Thu, 15 Aug 2019 19:57:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730018AbfHOT5X (ORCPT ); Thu, 15 Aug 2019 15:57:23 -0400 Received: from mail-qt1-f194.google.com ([209.85.160.194]:34698 "EHLO mail-qt1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728956AbfHOT5X (ORCPT ); Thu, 15 Aug 2019 15:57:23 -0400 Received: by mail-qt1-f194.google.com with SMTP id q4so3673873qtp.1; Thu, 15 Aug 2019 12:57:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=/LFxVzI9MepcAl4Lnqm6SEvMCLIi+u2ZqmJKlF6ic0o=; b=jjTNHUJ+CSG6+ZUiOmqZD7r0T3PUVCgfjcmbNMfGfPR8DHnctfki8PyRgCaQiyB90S UWKnZoGdq8+gWKBRhfYYqTZFxccfzAN8droy/uuQAn/QQj/WXaMpDNOoUqAixilN6XWZ Sk53sgnFEFB4AG7XxGGusOOxgwx2THclF1wyqs+7Qb2gzDgFcnq4EhL3ld/Ls5TSy4Q5 LIUIN5BhMM68p6SVE3RIkRWEwox+dN2CEbkaavpnNzDMj+eCiq2B1UGpU7K14QSpw51w 1SO27McqrcnIjLscZNURHZZeQ+IdchCWOkYO6u9avMv8pnjE+yYMDugbMHQaNhDFI9JN y/EA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:date:from:to:cc:subject:message-id :references:mime-version:content-disposition:in-reply-to:user-agent; bh=/LFxVzI9MepcAl4Lnqm6SEvMCLIi+u2ZqmJKlF6ic0o=; b=Xk18+Gw0nyCbCzE8omgSUbQ9XQ7cFzO5CeFyB6g+LUCoXlzyMiPgzIwcNTEryoXA2+ LVfxCSC+1Pd6sc2LVs6Umu3kJ7OQuH+5yNyS0gUj47Xqaef/or1vsrPIs998eg1fb+71 7aIic4DpgNvlNlopHeZ4UnvgAca8RVMadveXp7CHyk45uWXwEJFdIEhg2YuzxQxk9t0w m5JcE/9hOOrr69clf7MxhmU83DQWSI7MqT+2JVhu85DNDu2kii3Z5ouKWE3i7uoJzMCV iIdRivtIaTdm75nQvB0UlZo09yFYKR0gOoBKMxlGAv25yvDoJlbQnF2vwscUN7+HFNhu ehAw== X-Gm-Message-State: APjAAAVed38oAX/HKB519ZZKYA1oosx8Kd5JRtv/YEObtVfqlfFToJVV HEX+1lrPGpazaG7DEvrc8wE= X-Google-Smtp-Source: APXvYqwvGoz95kfeW/o6YJ39eK/95vaFIXp94ZgS9UtlL7U+AC31LouqDrLwGVqyFtoh8BCsILyYSA== X-Received: by 2002:ac8:289b:: with SMTP id i27mr5552503qti.67.1565899041355; Thu, 15 Aug 2019 12:57:21 -0700 (PDT) Received: from localhost ([2620:10d:c091:500::1:25cd]) by smtp.gmail.com with ESMTPSA id h1sm2042852qkh.101.2019.08.15.12.57.20 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 15 Aug 2019 12:57:20 -0700 (PDT) Date: Thu, 15 Aug 2019 12:57:19 -0700 From: Tejun Heo To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com, akpm@linux-foundation.org Subject: [PATCH 1/5] writeback: Generalize and expose wb_completion Message-ID: <20190815195719.GB2263813@devbig004.ftw2.facebook.com> References: <20190815195619.GA2263813@devbig004.ftw2.facebook.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20190815195619.GA2263813@devbig004.ftw2.facebook.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP wb_completion is used to track writeback completions. We want to use it from memcg side for foreign inode flushes. This patch updates it to remember the target waitq instead of assuming bdi->wb_waitq and expose it outside of fs-writeback.c. Signed-off-by: Tejun Heo Reviewed-by: Jan Kara --- fs/fs-writeback.c | 47 +++++++++++---------------------------- include/linux/backing-dev-defs.h | 20 ++++++++++++++++ include/linux/backing-dev.h | 2 + 3 files changed, 36 insertions(+), 33 deletions(-) --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -36,10 +36,6 @@ */ #define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_SHIFT - 10)) -struct wb_completion { - atomic_t cnt; -}; - /* * Passed into wb_writeback(), essentially a subset of writeback_control */ @@ -61,19 +57,6 @@ struct wb_writeback_work { }; /* - * If one wants to wait for one or more wb_writeback_works, each work's - * ->done should be set to a wb_completion defined using the following - * macro. Once all work items are issued with wb_queue_work(), the caller - * can wait for the completion of all using wb_wait_for_completion(). Work - * items which are waited upon aren't freed automatically on completion. - */ -#define DEFINE_WB_COMPLETION_ONSTACK(cmpl) \ - struct wb_completion cmpl = { \ - .cnt = ATOMIC_INIT(1), \ - } - - -/* * If an inode is constantly having its pages dirtied, but then the * updates stop dirtytime_expire_interval seconds in the past, it's * possible for the worst case time between when an inode has its @@ -182,7 +165,7 @@ static void finish_writeback_work(struct if (work->auto_free) kfree(work); if (done && atomic_dec_and_test(&done->cnt)) - wake_up_all(&wb->bdi->wb_waitq); + wake_up_all(done->waitq); } static void wb_queue_work(struct bdi_writeback *wb, @@ -206,20 +189,18 @@ static void wb_queue_work(struct bdi_wri /** * wb_wait_for_completion - wait for completion of bdi_writeback_works - * @bdi: bdi work items were issued to * @done: target wb_completion * * Wait for one or more work items issued to @bdi with their ->done field - * set to @done, which should have been defined with - * DEFINE_WB_COMPLETION_ONSTACK(). This function returns after all such - * work items are completed. Work items which are waited upon aren't freed + * set to @done, which should have been initialized with + * DEFINE_WB_COMPLETION(). This function returns after all such work items + * are completed. Work items which are waited upon aren't freed * automatically on completion. */ -static void wb_wait_for_completion(struct backing_dev_info *bdi, - struct wb_completion *done) +void wb_wait_for_completion(struct wb_completion *done) { atomic_dec(&done->cnt); /* put down the initial count */ - wait_event(bdi->wb_waitq, !atomic_read(&done->cnt)); + wait_event(*done->waitq, !atomic_read(&done->cnt)); } #ifdef CONFIG_CGROUP_WRITEBACK @@ -854,7 +835,7 @@ static void bdi_split_work_to_wbs(struct restart: rcu_read_lock(); list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) { - DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done); + DEFINE_WB_COMPLETION(fallback_work_done, bdi); struct wb_writeback_work fallback_work; struct wb_writeback_work *work; long nr_pages; @@ -901,7 +882,7 @@ restart: last_wb = wb; rcu_read_unlock(); - wb_wait_for_completion(bdi, &fallback_work_done); + wb_wait_for_completion(&fallback_work_done); goto restart; } rcu_read_unlock(); @@ -2373,7 +2354,8 @@ static void wait_sb_inodes(struct super_ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, enum wb_reason reason, bool skip_if_busy) { - DEFINE_WB_COMPLETION_ONSTACK(done); + struct backing_dev_info *bdi = sb->s_bdi; + DEFINE_WB_COMPLETION(done, bdi); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_NONE, @@ -2382,14 +2364,13 @@ static void __writeback_inodes_sb_nr(str .nr_pages = nr, .reason = reason, }; - struct backing_dev_info *bdi = sb->s_bdi; if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) return; WARN_ON(!rwsem_is_locked(&sb->s_umount)); bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); - wb_wait_for_completion(bdi, &done); + wb_wait_for_completion(&done); } /** @@ -2451,7 +2432,8 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb */ void sync_inodes_sb(struct super_block *sb) { - DEFINE_WB_COMPLETION_ONSTACK(done); + struct backing_dev_info *bdi = sb->s_bdi; + DEFINE_WB_COMPLETION(done, bdi); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_ALL, @@ -2461,7 +2443,6 @@ void sync_inodes_sb(struct super_block * .reason = WB_REASON_SYNC, .for_sync = 1, }; - struct backing_dev_info *bdi = sb->s_bdi; /* * Can't skip on !bdi_has_dirty() because we should wait for !dirty @@ -2475,7 +2456,7 @@ void sync_inodes_sb(struct super_block * /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ bdi_down_write_wb_switch_rwsem(bdi); bdi_split_work_to_wbs(bdi, &work, false); - wb_wait_for_completion(bdi, &done); + wb_wait_for_completion(&done); bdi_up_write_wb_switch_rwsem(bdi); wait_sb_inodes(sb); --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -67,6 +67,26 @@ enum wb_reason { WB_REASON_MAX, }; +struct wb_completion { + atomic_t cnt; + wait_queue_head_t *waitq; +}; + +#define __WB_COMPLETION_INIT(_waitq) \ + (struct wb_completion){ .cnt = ATOMIC_INIT(1), .waitq = (_waitq) } + +/* + * If one wants to wait for one or more wb_writeback_works, each work's + * ->done should be set to a wb_completion defined using the following + * macro. Once all work items are issued with wb_queue_work(), the caller + * can wait for the completion of all using wb_wait_for_completion(). Work + * items which are waited upon aren't freed automatically on completion. + */ +#define WB_COMPLETION_INIT(bdi) __WB_COMPLETION_INIT(&(bdi)->wb_waitq) + +#define DEFINE_WB_COMPLETION(cmpl, bdi) \ + struct wb_completion cmpl = WB_COMPLETION_INIT(bdi) + /* * For cgroup writeback, multiple wb's may map to the same blkcg. Those * wb's can operate mostly independently but should share the congested --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -44,6 +44,8 @@ void wb_start_background_writeback(struc void wb_workfn(struct work_struct *work); void wb_wakeup_delayed(struct bdi_writeback *wb); +void wb_wait_for_completion(struct wb_completion *done); + extern spinlock_t bdi_lock; extern struct list_head bdi_list; From patchwork Thu Aug 15 19:57:50 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tejun Heo X-Patchwork-Id: 11096425 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 558BD1395 for ; Thu, 15 Aug 2019 19:57:55 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 439BB2898C for ; Thu, 15 Aug 2019 19:57:55 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 379DC2899C; Thu, 15 Aug 2019 19:57:55 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A98042898C for ; Thu, 15 Aug 2019 19:57:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730132AbfHOT5y (ORCPT ); Thu, 15 Aug 2019 15:57:54 -0400 Received: from mail-qk1-f196.google.com ([209.85.222.196]:43857 "EHLO mail-qk1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728956AbfHOT5x (ORCPT ); Thu, 15 Aug 2019 15:57:53 -0400 Received: by mail-qk1-f196.google.com with SMTP id m2so2801865qkd.10; Thu, 15 Aug 2019 12:57:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=NK0PK6csvakTjylMz8jDV7zgrirY2g6A6SS9gcMGP38=; b=XEiIETwR/iRc/lXiBJkeGJWLPwdP3OTZNZTIZNNbaUyYnuuBUMWk76hAjCesVsGRvR DVIXJNeKmQfqKXW702SF1PUeP9sMQBQHvirBaKtwjRZq+JeR2Bqiqj8abtcJSMCd1jzQ 70f4yD1051JXCEGH8Jf4XgvdEGl7z6LtJqxDO8gRemoNlh+IeCfl8mEaqjzVNU4JR8CL rQRPxZoa84hi4tZtGQDIy5KO8uZu8JaNvj9NPih9o3iaXsNOSBCo6teWiZ/GbCdlfiuX p4mSuilXdKhg1WbG1k1KNt5O/wZddMSeF2o9jyOlaEWhRejaiNoKnop+vbB73UpmszkK DYkw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:date:from:to:cc:subject:message-id :references:mime-version:content-disposition:in-reply-to:user-agent; bh=NK0PK6csvakTjylMz8jDV7zgrirY2g6A6SS9gcMGP38=; b=JSiPhVl6cYU2AppTF3Y2khVQgFb1256EC58comULgg8Pw2ubbLCeu2O8q4raJra8pr mJIIdAxaYLmCQshCImx+biolu6tsKPx/7JMBEiEWjD2yyoY+KCYTStjobiUb8ewyI7Lf 6CeiMbLxINoNNIqn6l6hbuQCRnHWROkfLTD5LEyXvCyufvy9Y48c7/+3nyXhDtrgbFFI pkZfJzcGXrCUKeLiA51DeZF99+HepuSRXejBDbNVBsSAchVN7cVnkAcDV06F6XqonD52 fsvZR6GLJXIRhqBebAMUl4JJnPOyasUeT09irldpufzItqnPWFGgRqPTYTkMjNaUG9Nw vWUQ== X-Gm-Message-State: APjAAAVpDnWertv07XkjopAyu0WuhotGztzmfczwjyJlDykxBd2ub1ew r1RPG7CEbtw4BAhk3XZxXUQ= X-Google-Smtp-Source: APXvYqxgqkXzG03UW+nQ2wzQV8lJWn1lUi6WbNWPWMtyk1a9ytxbt9vAVv4HDVD3hmWenqlqAUYrlw== X-Received: by 2002:ae9:e411:: with SMTP id q17mr5416347qkc.465.1565899072231; Thu, 15 Aug 2019 12:57:52 -0700 (PDT) Received: from localhost ([2620:10d:c091:500::1:25cd]) by smtp.gmail.com with ESMTPSA id o21sm1768914qkk.100.2019.08.15.12.57.51 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 15 Aug 2019 12:57:51 -0700 (PDT) Date: Thu, 15 Aug 2019 12:57:50 -0700 From: Tejun Heo To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com, akpm@linux-foundation.org Subject: [PATCH 2/5] bdi: Add bdi->id Message-ID: <20190815195750.GC2263813@devbig004.ftw2.facebook.com> References: <20190815195619.GA2263813@devbig004.ftw2.facebook.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20190815195619.GA2263813@devbig004.ftw2.facebook.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP There currently is no way to universally identify and lookup a bdi without holding a reference and pointer to it. This patch adds an non-recycling bdi->id and implements bdi_get_by_id() which looks up bdis by their ids. This will be used by memcg foreign inode flushing. I left bdi_list alone for simplicity and because while rb_tree does support rcu assignment it doesn't seem to guarantee lossless walk when walk is racing aginst tree rebalance operations. Signed-off-by: Tejun Heo Reviewed-by: Jan Kara --- include/linux/backing-dev-defs.h | 2 + include/linux/backing-dev.h | 1 mm/backing-dev.c | 65 +++++++++++++++++++++++++++++++++++++-- 3 files changed, 66 insertions(+), 2 deletions(-) --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -185,6 +185,8 @@ struct bdi_writeback { }; struct backing_dev_info { + u64 id; + struct rb_node rb_node; /* keyed by ->id */ struct list_head bdi_list; unsigned long ra_pages; /* max readahead in PAGE_SIZE units */ unsigned long io_pages; /* max allowed IO size */ --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -24,6 +24,7 @@ static inline struct backing_dev_info *b return bdi; } +struct backing_dev_info *bdi_get_by_id(u64 id); void bdi_put(struct backing_dev_info *bdi); __printf(2, 3) --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only #include +#include #include #include #include @@ -22,10 +23,12 @@ EXPORT_SYMBOL_GPL(noop_backing_dev_info) static struct class *bdi_class; /* - * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side - * locking. + * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU + * reader side locking. */ DEFINE_SPINLOCK(bdi_lock); +static u64 bdi_id_cursor; +static struct rb_root bdi_tree = RB_ROOT; LIST_HEAD(bdi_list); /* bdi_wq serves all asynchronous writeback tasks */ @@ -859,9 +862,58 @@ struct backing_dev_info *bdi_alloc_node( } EXPORT_SYMBOL(bdi_alloc_node); +static struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp) +{ + struct rb_node **p = &bdi_tree.rb_node; + struct rb_node *parent = NULL; + struct backing_dev_info *bdi; + + lockdep_assert_held(&bdi_lock); + + while (*p) { + parent = *p; + bdi = rb_entry(parent, struct backing_dev_info, rb_node); + + if (bdi->id > id) + p = &(*p)->rb_left; + else if (bdi->id < id) + p = &(*p)->rb_right; + else + break; + } + + if (parentp) + *parentp = parent; + return p; +} + +/** + * bdi_get_by_id - lookup and get bdi from its id + * @id: bdi id to lookup + * + * Find bdi matching @id and get it. Returns NULL if the matching bdi + * doesn't exist or is already unregistered. + */ +struct backing_dev_info *bdi_get_by_id(u64 id) +{ + struct backing_dev_info *bdi = NULL; + struct rb_node **p; + + spin_lock_bh(&bdi_lock); + p = bdi_lookup_rb_node(id, NULL); + if (*p) { + bdi = rb_entry(*p, struct backing_dev_info, rb_node); + bdi_get(bdi); + } + spin_unlock_bh(&bdi_lock); + + return bdi; +} + int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) { struct device *dev; + struct rb_node *parent, **p; if (bdi->dev) /* The driver needs to use separate queues per device */ return 0; @@ -877,7 +929,15 @@ int bdi_register_va(struct backing_dev_i set_bit(WB_registered, &bdi->wb.state); spin_lock_bh(&bdi_lock); + + bdi->id = ++bdi_id_cursor; + + p = bdi_lookup_rb_node(bdi->id, &parent); + rb_link_node(&bdi->rb_node, parent, p); + rb_insert_color(&bdi->rb_node, &bdi_tree); + list_add_tail_rcu(&bdi->bdi_list, &bdi_list); + spin_unlock_bh(&bdi_lock); trace_writeback_bdi_register(bdi); @@ -918,6 +978,7 @@ EXPORT_SYMBOL(bdi_register_owner); static void bdi_remove_from_list(struct backing_dev_info *bdi) { spin_lock_bh(&bdi_lock); + rb_erase(&bdi->rb_node, &bdi_tree); list_del_rcu(&bdi->bdi_list); spin_unlock_bh(&bdi_lock); From patchwork Thu Aug 15 19:58:23 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tejun Heo X-Patchwork-Id: 11096429 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 34AC81395 for ; Thu, 15 Aug 2019 19:58:28 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2438528998 for ; Thu, 15 Aug 2019 19:58:28 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 155592899E; Thu, 15 Aug 2019 19:58:28 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A221C28998 for ; Thu, 15 Aug 2019 19:58:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729668AbfHOT60 (ORCPT ); Thu, 15 Aug 2019 15:58:26 -0400 Received: from mail-qk1-f196.google.com ([209.85.222.196]:37367 "EHLO mail-qk1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728956AbfHOT60 (ORCPT ); Thu, 15 Aug 2019 15:58:26 -0400 Received: by mail-qk1-f196.google.com with SMTP id s14so2900424qkm.4; Thu, 15 Aug 2019 12:58:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=No9QVbYlU2sgvqpbXkCw9UwcBjO6u+aBDjKZ2hew4hU=; b=Gt2lVOaCaz027V8blAT2ErzUdzTPXVxQRyH5tybpqHiiG25cOe/Ls1rrBq0oQhKq4p d7wDqqbJ1bM5gobzk77hGHVjDxsDbsTDq+3z5hGLL7VRGAHBX54T8poZTuJbJHT6/wIS n+3ZywqbhJ8Uy8k680B4ItSJ85EZjWnkScl1y0M11RkI5WI1vV26GfrlBoAOgyn8ZImy MXwv9CZiXjeSbA4ctU2yD4jwsbkz8PujxOymrX0SWDi+4cEk8cYfE//WGxqaDLTNEBGl XLQPaEG9+Gi8NYgFRTQ2Qqk+WJv08PpnQoMPdw8hU66i8SPUqA1LZE/L+3O+hb/w8cN8 s+9A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:date:from:to:cc:subject:message-id :references:mime-version:content-disposition:in-reply-to:user-agent; bh=No9QVbYlU2sgvqpbXkCw9UwcBjO6u+aBDjKZ2hew4hU=; b=I1jSDxk2anaDI6SVT992tvOccLi9/Xujbra4dn3JJl/QU8HSd40WdKWLzGNJZurzHX 8daHzUFupc/5wKD8umDjQEY16aDDpiI3Io057ldcMd5nSjIYpcnWGRmGzsdFGL7h0IA9 5ufirkEXAhfigS3dF3+71onnMexdbNf7ZQna4k27PJ3rpPzbBvIKWGkkiz0JdCUM8eFO MvM9MXRJCew8ShkgV++wTPfGDtO3v+nuAiG9R8UnPyO54FZF5sAWmmnvuA0rO1PGA+EZ WxfAKwUD3QUSj/5kwVMPHccbQuqibbXBDw7mI+HmQw2/Jv3zunfCLNrESFjy2YMoBtVl V3rA== X-Gm-Message-State: APjAAAXnbX6sx9BzAtw1bwKa0I+zEjxjiyU+wUOrfk4x3HSLM1uKPsiM Bn67sw9XYkzDjAeOoFpfsbo= X-Google-Smtp-Source: APXvYqwYWkovkIDA9ykxqDtjrl7m1KWzJV4ngQIYl24GN2FpAUGCMjzO1WndSIPov9pFcxFYLJIpYA== X-Received: by 2002:ae9:e216:: with SMTP id c22mr5746003qkc.192.1565899105305; Thu, 15 Aug 2019 12:58:25 -0700 (PDT) Received: from localhost ([2620:10d:c091:500::1:25cd]) by smtp.gmail.com with ESMTPSA id 23sm1855599qkk.121.2019.08.15.12.58.24 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 15 Aug 2019 12:58:24 -0700 (PDT) Date: Thu, 15 Aug 2019 12:58:23 -0700 From: Tejun Heo To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com, akpm@linux-foundation.org Subject: [PATCH 3/5] writeback: Separate out wb_get_lookup() from wb_get_create() Message-ID: <20190815195823.GD2263813@devbig004.ftw2.facebook.com> References: <20190815195619.GA2263813@devbig004.ftw2.facebook.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20190815195619.GA2263813@devbig004.ftw2.facebook.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Separate out wb_get_lookup() which doesn't try to create one if there isn't already one from wb_get_create(). This will be used by later patches. Signed-off-by: Tejun Heo Reviewed-by: Jan Kara --- include/linux/backing-dev.h | 2 + mm/backing-dev.c | 55 +++++++++++++++++++++++++++++--------------- 2 files changed, 39 insertions(+), 18 deletions(-) --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -230,6 +230,8 @@ static inline int bdi_sched_wait(void *w struct bdi_writeback_congested * wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp); void wb_congested_put(struct bdi_writeback_congested *congested); +struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, + struct cgroup_subsys_state *memcg_css); struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, struct cgroup_subsys_state *memcg_css, gfp_t gfp); --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -618,13 +618,12 @@ out_put: } /** - * wb_get_create - get wb for a given memcg, create if necessary + * wb_get_lookup - get wb for a given memcg * @bdi: target bdi * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref) - * @gfp: allocation mask to use * - * Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to - * create one. The returned wb has its refcount incremented. + * Try to get the wb for @memcg_css on @bdi. The returned wb has its + * refcount incremented. * * This function uses css_get() on @memcg_css and thus expects its refcnt * to be positive on invocation. IOW, rcu_read_lock() protection on @@ -641,6 +640,39 @@ out_put: * each lookup. On mismatch, the existing wb is discarded and a new one is * created. */ +struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, + struct cgroup_subsys_state *memcg_css) +{ + struct bdi_writeback *wb; + + if (!memcg_css->parent) + return &bdi->wb; + + rcu_read_lock(); + wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + if (wb) { + struct cgroup_subsys_state *blkcg_css; + + /* see whether the blkcg association has changed */ + blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &io_cgrp_subsys); + if (unlikely(wb->blkcg_css != blkcg_css || !wb_tryget(wb))) + wb = NULL; + css_put(blkcg_css); + } + rcu_read_unlock(); + + return wb; +} + +/** + * wb_get_create - get wb for a given memcg, create if necessary + * @bdi: target bdi + * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref) + * @gfp: allocation mask to use + * + * Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to + * create one. See wb_get_lookup() for more details. + */ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, struct cgroup_subsys_state *memcg_css, gfp_t gfp) @@ -653,20 +685,7 @@ struct bdi_writeback *wb_get_create(stru return &bdi->wb; do { - rcu_read_lock(); - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); - if (wb) { - struct cgroup_subsys_state *blkcg_css; - - /* see whether the blkcg association has changed */ - blkcg_css = cgroup_get_e_css(memcg_css->cgroup, - &io_cgrp_subsys); - if (unlikely(wb->blkcg_css != blkcg_css || - !wb_tryget(wb))) - wb = NULL; - css_put(blkcg_css); - } - rcu_read_unlock(); + wb = wb_get_lookup(bdi, memcg_css); } while (!wb && !cgwb_create(bdi, memcg_css, gfp)); return wb; From patchwork Thu Aug 15 19:59:02 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tejun Heo X-Patchwork-Id: 11096435 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5623013A0 for ; Thu, 15 Aug 2019 19:59:09 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 468CC2899D for ; Thu, 15 Aug 2019 19:59:09 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 3AF102899F; Thu, 15 Aug 2019 19:59:09 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D9AD82899D for ; Thu, 15 Aug 2019 19:59:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731303AbfHOT7F (ORCPT ); Thu, 15 Aug 2019 15:59:05 -0400 Received: from mail-qk1-f194.google.com ([209.85.222.194]:44005 "EHLO mail-qk1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727814AbfHOT7F (ORCPT ); Thu, 15 Aug 2019 15:59:05 -0400 Received: by mail-qk1-f194.google.com with SMTP id m2so2804807qkd.10; Thu, 15 Aug 2019 12:59:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=LTjjxYDcINav7DiTg3r+K5uyvclxft6ApAzOA1thajw=; b=ReQZ+Ab0hATpZtDXOA2ejtpPxUouA+/1xRBT9pTb1ePOZTo2VSyRKjTL2j3RzOyJ7S mlcrNaILH3zXXtOaaBVm+e5yQG01QhwfI90xWoTHI3jXtQ8RPEbodPybOVmvjN9wSaPt EaJaVv1bmqhFDi+MaB4fvT5nUovo3hGXJ52wdQU56ZxArTR0gW5bs5CDYoWb8QHWBjEq +xWRzvOeKJ1HPiDh+Zawly8QwcsWoCzCpWbL/+Ca7PTd1evz6glff1DdWU4lNTP+/Gm6 utFTGkkDos6KzH8N0RhsbOYj9FshB3Uth+gddwFR0l77/gvDZnCzRj2R/DXN380JqxnN 7XWg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:date:from:to:cc:subject:message-id :references:mime-version:content-disposition:in-reply-to:user-agent; bh=LTjjxYDcINav7DiTg3r+K5uyvclxft6ApAzOA1thajw=; b=kFgLMxGMuEgeY+xO8aU/5etxfUh+N8TdedA4WyO5XgrPWk8px7Pdxs2Mid+weGhDE6 c8+jv1FvLgO6FKBhynUC08w0UFgJY4Iu8POwHpT8MxCZ2aT86v1F+14lhw6N8ZvY5Dvu Wo55MEf3gGR+5mQjN0BfMPNEB3ri2fFIkq98rP4wndQOxc3oacU9NCKlmOgENpsM5MTO Iube4uk4l2mGGfMjRg86j9NkKW0JABQi7vc4FukgHqali3KjzyI7FjUyvTlcd/uFPf+5 U5trQAGQjzqJScKJevp5Tu9PryITYl8r1utG9in0wlJ1QRVJirEWOfyzt4rDdEqtMNEp 58PA== X-Gm-Message-State: APjAAAU074DJZgm1k7674qr6P/O/yHBwbdU7x7IwtsAorZc/ESVTK6aL 6mE4cEDRip5QAOGvN6mNAB4= X-Google-Smtp-Source: APXvYqz/lIOxVe5nWX7K5sBLu6KzXtCB0IqgPNj3VAKGjq1GOg0HMbdZXFb4fh326h7juBrH2FnOjw== X-Received: by 2002:a05:620a:342:: with SMTP id t2mr5109727qkm.283.1565899144144; Thu, 15 Aug 2019 12:59:04 -0700 (PDT) Received: from localhost ([2620:10d:c091:500::1:25cd]) by smtp.gmail.com with ESMTPSA id n46sm2333045qtk.14.2019.08.15.12.59.03 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 15 Aug 2019 12:59:03 -0700 (PDT) Date: Thu, 15 Aug 2019 12:59:02 -0700 From: Tejun Heo To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com, akpm@linux-foundation.org Subject: [PATCH 4/5] writeback, memcg: Implement cgroup_writeback_by_id() Message-ID: <20190815195902.GE2263813@devbig004.ftw2.facebook.com> References: <20190815195619.GA2263813@devbig004.ftw2.facebook.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20190815195619.GA2263813@devbig004.ftw2.facebook.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Implement cgroup_writeback_by_id() which initiates cgroup writeback from bdi and memcg IDs. This will be used by memcg foreign inode flushing. Signed-off-by: Tejun Heo Reviewed-by: Jan Kara --- fs/fs-writeback.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++ include/linux/writeback.h | 2 + 2 files changed, 69 insertions(+) --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -892,6 +892,73 @@ restart: } /** + * cgroup_writeback_by_id - initiate cgroup writeback from bdi and memcg IDs + * @bdi_id: target bdi id + * @memcg_id: target memcg css id + * @nr_pages: number of pages to write + * @reason: reason why some writeback work initiated + * @done: target wb_completion + * + * Initiate flush of the bdi_writeback identified by @bdi_id and @memcg_id + * with the specified parameters. + */ +int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr, + enum wb_reason reason, struct wb_completion *done) +{ + struct backing_dev_info *bdi; + struct cgroup_subsys_state *memcg_css; + struct bdi_writeback *wb; + struct wb_writeback_work *work; + int ret; + + /* lookup bdi and memcg */ + bdi = bdi_get_by_id(bdi_id); + if (!bdi) + return -ENOENT; + + rcu_read_lock(); + memcg_css = css_from_id(memcg_id, &memory_cgrp_subsys); + if (memcg_css && !css_tryget(memcg_css)) + memcg_css = NULL; + rcu_read_unlock(); + if (!memcg_css) { + ret = -ENOENT; + goto out_bdi_put; + } + + /* + * And find the associated wb. If the wb isn't there already + * there's nothing to flush, don't create one. + */ + wb = wb_get_lookup(bdi, memcg_css); + if (!wb) { + ret = -ENOENT; + goto out_css_put; + } + + /* issue the writeback work */ + work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN); + if (work) { + work->nr_pages = nr; + work->sync_mode = WB_SYNC_NONE; + work->reason = reason; + work->done = done; + work->auto_free = 1; + wb_queue_work(wb, work); + ret = 0; + } else { + ret = -ENOMEM; + } + + wb_put(wb); +out_css_put: + css_put(memcg_css); +out_bdi_put: + bdi_put(bdi); + return ret; +} + +/** * cgroup_writeback_umount - flush inode wb switches for umount * * This function is called when a super_block is about to be destroyed and --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -217,6 +217,8 @@ void wbc_attach_and_unlock_inode(struct void wbc_detach_inode(struct writeback_control *wbc); void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page, size_t bytes); +int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages, + enum wb_reason reason, struct wb_completion *done); void cgroup_writeback_umount(void); /** From patchwork Thu Aug 15 19:59:30 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tejun Heo X-Patchwork-Id: 11096439 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E70811395 for ; Thu, 15 Aug 2019 19:59:38 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D0C1D2898B for ; Thu, 15 Aug 2019 19:59:38 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C1BAA2899C; Thu, 15 Aug 2019 19:59:38 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.7 required=2.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D71862898B for ; Thu, 15 Aug 2019 19:59:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733097AbfHOT7e (ORCPT ); Thu, 15 Aug 2019 15:59:34 -0400 Received: from mail-qt1-f174.google.com ([209.85.160.174]:46889 "EHLO mail-qt1-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727814AbfHOT7e (ORCPT ); Thu, 15 Aug 2019 15:59:34 -0400 Received: by mail-qt1-f174.google.com with SMTP id j15so3588646qtl.13; Thu, 15 Aug 2019 12:59:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=ud07z5/psxOSblwwNPdaBzV4BiEQea+btrLNqweJWnU=; b=fTUAJHmuWefuhAPMBB40xKzqRQvfZGdI5MSSx6kqGIf7gsV/T4U348k1z68knsQH6L Iu+r0xrblgbGagjwASJnb64jKFKE95AvsfKlufI0kDhF1C71zfuh6wJ2RmSWH47a6Ehm X/u/cyuChrpjLgwyWxFiHKZLPfJ53zBwwHqz1fsoH9meT7WMDbHNG9kxAem1YFlvWSdj srVCnHILR7v8ZgI848qGgf0VBsDF9SM4ftGyQ6c1GuT8x4tWidfnsBzg5jitemFhjzov GI5LaBziSWIT06uQt4/71Tud45I6LnwoWxH4NtYV6Dj4h0na/ClottBDdc9nFsZqf836 paEg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:date:from:to:cc:subject:message-id :references:mime-version:content-disposition:in-reply-to:user-agent; bh=ud07z5/psxOSblwwNPdaBzV4BiEQea+btrLNqweJWnU=; b=IeOIbmYjOERILAGznjjxa0txOBJkFhIY4LbgPMng/+WaKBLtM0u3WK93Lz0rrCCFPv Fk0pwa/nGloqhsHlAUsAyBIoDiuEIxlftx04iegRv40XtE+akXOJiuVyqppw6x8fWhbi Wx9o4MFHyTewAVUYdfXQPYUq3yeM6MdSaSd7cdGEAtEzUkZt80bFZwr3K9Bf6RIFMhmM QQjjy20wOFrR6IQ+/iV1U8yhQlAIePbJVbkLBv/Z1Tg0mbBX8srMiE7lKFu6dpZykZT7 7V5tbvviwRJoUHI8KcT7LY1hEl+StWWJQ85ORE4irb3t2reDgFVpzxtVFHWq+2TDTTqv pq6g== X-Gm-Message-State: APjAAAVWhH9M7bNw7Wl64KFoIC3Cf/TmkCnhhX7vshhqE6FY+0Hmx48V bo5s71Z8ifpeYZrB5uJOkW4= X-Google-Smtp-Source: APXvYqyzw9nynFYELzR4kvkflWQBz/g9lbMuvJ70B+XSdD7MvLeKCxsDFr9wopZOwF2ouzGK1o5wJg== X-Received: by 2002:ac8:358e:: with SMTP id k14mr3954849qtb.83.1565899172696; Thu, 15 Aug 2019 12:59:32 -0700 (PDT) Received: from localhost ([2620:10d:c091:500::1:25cd]) by smtp.gmail.com with ESMTPSA id m9sm1773171qtp.27.2019.08.15.12.59.32 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 15 Aug 2019 12:59:32 -0700 (PDT) Date: Thu, 15 Aug 2019 12:59:30 -0700 From: Tejun Heo To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com, akpm@linux-foundation.org Subject: [PATCH 5/5] writeback, memcg: Implement foreign dirty flushing Message-ID: <20190815195930.GF2263813@devbig004.ftw2.facebook.com> References: <20190815195619.GA2263813@devbig004.ftw2.facebook.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20190815195619.GA2263813@devbig004.ftw2.facebook.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP There's an inherent mismatch between memcg and writeback. The former trackes ownership per-page while the latter per-inode. This was a deliberate design decision because honoring per-page ownership in the writeback path is complicated, may lead to higher CPU and IO overheads and deemed unnecessary given that write-sharing an inode across different cgroups isn't a common use-case. Combined with inode majority-writer ownership switching, this works well enough in most cases but there are some pathological cases. For example, let's say there are two cgroups A and B which keep writing to different but confined parts of the same inode. B owns the inode and A's memory is limited far below B's. A's dirty ratio can rise enough to trigger balance_dirty_pages() sleeps but B's can be low enough to avoid triggering background writeback. A will be slowed down without a way to make writeback of the dirty pages happen. This patch implements foreign dirty recording and foreign mechanism so that when a memcg encounters a condition as above it can trigger flushes on bdi_writebacks which can clean its pages. Please see the comment on top of mem_cgroup_track_foreign_dirty_slowpath() for details. A reproducer follows. write-range.c:: #include #include #include #include #include static const char *usage = "write-range FILE START SIZE\n"; int main(int argc, char **argv) { int fd; unsigned long start, size, end, pos; char *endp; char buf[4096]; if (argc < 4) { fprintf(stderr, usage); return 1; } fd = open(argv[1], O_WRONLY); if (fd < 0) { perror("open"); return 1; } start = strtoul(argv[2], &endp, 0); if (*endp != '\0') { fprintf(stderr, usage); return 1; } size = strtoul(argv[3], &endp, 0); if (*endp != '\0') { fprintf(stderr, usage); return 1; } end = start + size; while (1) { for (pos = start; pos < end; ) { long bread, bwritten = 0; if (lseek(fd, pos, SEEK_SET) < 0) { perror("lseek"); return 1; } bread = read(0, buf, sizeof(buf) < end - pos ? sizeof(buf) : end - pos); if (bread < 0) { perror("read"); return 1; } if (bread == 0) return 0; while (bwritten < bread) { long this; this = write(fd, buf + bwritten, bread - bwritten); if (this < 0) { perror("write"); return 1; } bwritten += this; pos += bwritten; } } } } repro.sh:: #!/bin/bash set -e set -x sysctl -w vm.dirty_expire_centisecs=300000 sysctl -w vm.dirty_writeback_centisecs=300000 sysctl -w vm.dirtytime_expire_seconds=300000 echo 3 > /proc/sys/vm/drop_caches TEST=/sys/fs/cgroup/test A=$TEST/A B=$TEST/B mkdir -p $A $B echo "+memory +io" > $TEST/cgroup.subtree_control echo $((1<<30)) > $A/memory.high echo $((32<<30)) > $B/memory.high rm -f testfile touch testfile fallocate -l 4G testfile echo "Starting B" (echo $BASHPID > $B/cgroup.procs pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) & echo "Waiting 10s to ensure B claims the testfile inode" sleep 5 sync sleep 5 sync echo "Starting A" (echo $BASHPID > $A/cgroup.procs pv < /dev/urandom | ./write-range testfile 0 $((2<<30))) v2: Added comments explaining why the specific intervals are being used. Signed-off-by: Tejun Heo --- include/linux/backing-dev-defs.h | 1 include/linux/memcontrol.h | 39 +++++++++++ mm/memcontrol.c | 132 +++++++++++++++++++++++++++++++++++++++ mm/page-writeback.c | 4 + 4 files changed, 176 insertions(+) --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -63,6 +63,7 @@ enum wb_reason { * so it has a mismatch name. */ WB_REASON_FORKER_THREAD, + WB_REASON_FOREIGN_FLUSH, WB_REASON_MAX, }; --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -184,6 +184,23 @@ struct memcg_padding { #endif /* + * Remember four most recent foreign writebacks with dirty pages in this + * cgroup. Inode sharing is expected to be uncommon and, even if we miss + * one in a given round, we're likely to catch it later if it keeps + * foreign-dirtying, so a fairly low count should be enough. + * + * See mem_cgroup_track_foreign_dirty_slowpath() for details. + */ +#define MEMCG_CGWB_FRN_CNT 4 + +struct memcg_cgwb_frn { + u64 bdi_id; /* bdi->id of the foreign inode */ + int memcg_id; /* memcg->css.id of foreign inode */ + u64 at; /* jiffies_64 at the time of dirtying */ + struct wb_completion done; /* tracks in-flight foreign writebacks */ +}; + +/* * The memory controller data structure. The memory controller controls both * page cache and RSS per cgroup. We would eventually like to provide * statistics based on the statistics developed by Rik Van Riel for clock-pro, @@ -307,6 +324,7 @@ struct mem_cgroup { #ifdef CONFIG_CGROUP_WRITEBACK struct list_head cgwb_list; struct wb_domain cgwb_domain; + struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT]; #endif /* List of events which userspace want to receive */ @@ -1237,6 +1255,18 @@ void mem_cgroup_wb_stats(struct bdi_writ unsigned long *pheadroom, unsigned long *pdirty, unsigned long *pwriteback); +void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, + struct bdi_writeback *wb); + +static inline void mem_cgroup_track_foreign_dirty(struct page *page, + struct bdi_writeback *wb) +{ + if (unlikely(&page->mem_cgroup->css != wb->memcg_css)) + mem_cgroup_track_foreign_dirty_slowpath(page, wb); +} + +void mem_cgroup_flush_foreign(struct bdi_writeback *wb); + #else /* CONFIG_CGROUP_WRITEBACK */ static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) @@ -1252,6 +1282,15 @@ static inline void mem_cgroup_wb_stats(s { } +static inline void mem_cgroup_track_foreign_dirty(struct page *page, + struct bdi_writeback *wb) +{ +} + +static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb) +{ +} + #endif /* CONFIG_CGROUP_WRITEBACK */ struct sock; --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -87,6 +87,10 @@ int do_swap_account __read_mostly; #define do_swap_account 0 #endif +#ifdef CONFIG_CGROUP_WRITEBACK +static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq); +#endif + /* Whether legacy memory+swap accounting is active */ static bool do_memsw_account(void) { @@ -4184,6 +4188,125 @@ void mem_cgroup_wb_stats(struct bdi_writ } } +/* + * Foreign dirty flushing + * + * There's an inherent mismatch between memcg and writeback. The former + * trackes ownership per-page while the latter per-inode. This was a + * deliberate design decision because honoring per-page ownership in the + * writeback path is complicated, may lead to higher CPU and IO overheads + * and deemed unnecessary given that write-sharing an inode across + * different cgroups isn't a common use-case. + * + * Combined with inode majority-writer ownership switching, this works well + * enough in most cases but there are some pathological cases. For + * example, let's say there are two cgroups A and B which keep writing to + * different but confined parts of the same inode. B owns the inode and + * A's memory is limited far below B's. A's dirty ratio can rise enough to + * trigger balance_dirty_pages() sleeps but B's can be low enough to avoid + * triggering background writeback. A will be slowed down without a way to + * make writeback of the dirty pages happen. + * + * Conditions like the above can lead to a cgroup getting repatedly and + * severely throttled after making some progress after each + * dirty_expire_interval while the underyling IO device is almost + * completely idle. + * + * Solving this problem completely requires matching the ownership tracking + * granularities between memcg and writeback in either direction. However, + * the more egregious behaviors can be avoided by simply remembering the + * most recent foreign dirtying events and initiating remote flushes on + * them when local writeback isn't enough to keep the memory clean enough. + * + * The following two functions implement such mechanism. When a foreign + * page - a page whose memcg and writeback ownerships don't match - is + * dirtied, mem_cgroup_track_foreign_dirty() records the inode owning + * bdi_writeback on the page owning memcg. When balance_dirty_pages() + * decides that the memcg needs to sleep due to high dirty ratio, it calls + * mem_cgroup_flush_foreign() which queues writeback on the recorded + * foreign bdi_writebacks which haven't expired. Both the numbers of + * recorded bdi_writebacks and concurrent in-flight foreign writebacks are + * limited to MEMCG_CGWB_FRN_CNT. + * + * The mechanism only remembers IDs and doesn't hold any object references. + * As being wrong occasionally doesn't matter, updates and accesses to the + * records are lockless and racy. + */ +void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, + struct bdi_writeback *wb) +{ + struct mem_cgroup *memcg = page->mem_cgroup; + struct memcg_cgwb_frn *frn; + u64 now = jiffies_64; + u64 oldest_at = now; + int oldest = -1; + int i; + + /* + * Pick the slot to use. If there is already a slot for @wb, keep + * using it. If not replace the oldest one which isn't being + * written out. + */ + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { + frn = &memcg->cgwb_frn[i]; + if (frn->bdi_id == wb->bdi->id && + frn->memcg_id == wb->memcg_css->id) + break; + if (frn->at < oldest_at && atomic_read(&frn->done.cnt) == 1) { + oldest = i; + oldest_at = frn->at; + } + } + + if (i < MEMCG_CGWB_FRN_CNT) { + /* + * Re-using an existing one. Update timestamp lazily to + * avoid making the cacheline hot. We want them to be + * reasonably up-to-date and significantly shorter than + * dirty_expire_interval as that's what expires the record. + * Use the shorter of 1s and dirty_expire_interval / 8. + */ + unsigned long update_intv = + min_t(unsigned long, HZ, + msecs_to_jiffies(dirty_expire_interval * 10) / 8); + + if (frn->at < now - update_intv) + frn->at = now; + } else if (oldest >= 0) { + /* replace the oldest free one */ + frn = &memcg->cgwb_frn[oldest]; + frn->bdi_id = wb->bdi->id; + frn->memcg_id = wb->memcg_css->id; + frn->at = now; + } +} + +/* issue foreign writeback flushes for recorded foreign dirtying events */ +void mem_cgroup_flush_foreign(struct bdi_writeback *wb) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); + unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10); + u64 now = jiffies_64; + int i; + + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { + struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i]; + + /* + * If the record is older than dirty_expire_interval, + * writeback on it has already started. No need to kick it + * off again. Also, don't start a new one if there's + * already one in flight. + */ + if (frn->at > now - intv && atomic_read(&frn->done.cnt) == 1) { + frn->at = 0; + cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id, + LONG_MAX, WB_REASON_FOREIGN_FLUSH, + &frn->done); + } + } +} + #else /* CONFIG_CGROUP_WRITEBACK */ static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) @@ -4700,6 +4823,7 @@ static struct mem_cgroup *mem_cgroup_all struct mem_cgroup *memcg; unsigned int size; int node; + int __maybe_unused i; size = sizeof(struct mem_cgroup); size += nr_node_ids * sizeof(struct mem_cgroup_per_node *); @@ -4743,6 +4867,9 @@ static struct mem_cgroup *mem_cgroup_all #endif #ifdef CONFIG_CGROUP_WRITEBACK INIT_LIST_HEAD(&memcg->cgwb_list); + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) + memcg->cgwb_frn[i].done = + __WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq); #endif idr_replace(&mem_cgroup_idr, memcg, memcg->id.id); return memcg; @@ -4872,7 +4999,12 @@ static void mem_cgroup_css_released(stru static void mem_cgroup_css_free(struct cgroup_subsys_state *css) { struct mem_cgroup *memcg = mem_cgroup_from_css(css); + int __maybe_unused i; +#ifdef CONFIG_CGROUP_WRITEBACK + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) + wb_wait_for_completion(&memcg->cgwb_frn[i].done); +#endif if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket) static_branch_dec(&memcg_sockets_enabled_key); --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1667,6 +1667,8 @@ static void balance_dirty_pages(struct b if (unlikely(!writeback_in_progress(wb))) wb_start_background_writeback(wb); + mem_cgroup_flush_foreign(wb); + /* * Calculate global domain's pos_ratio and select the * global dtc by default. @@ -2427,6 +2429,8 @@ void account_page_dirtied(struct page *p task_io_account_write(PAGE_SIZE); current->nr_dirtied++; this_cpu_inc(bdp_ratelimits); + + mem_cgroup_track_foreign_dirty(page, wb); } }