From patchwork Thu Aug 15 19:57:19 2019
X-Patchwork-Submitter: Tejun Heo
X-Patchwork-Id: 11096423
Date: Thu, 15 Aug 2019 12:57:19 -0700
From: Tejun Heo
To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org,
	vdavydov.dev@gmail.com
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com,
	akpm@linux-foundation.org
Subject: [PATCH 1/5] writeback: Generalize and expose wb_completion
Message-ID: <20190815195719.GB2263813@devbig004.ftw2.facebook.com>
In-Reply-To: <20190815195619.GA2263813@devbig004.ftw2.facebook.com>

wb_completion is used to track writeback completions.  We want to use it
from the memcg side for foreign inode flushes.  This patch updates it to
remember the target waitq instead of assuming bdi->wb_waitq and exposes it
outside of fs-writeback.c.

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
---
 fs/fs-writeback.c                | 47 +++++++++++----------------------
 include/linux/backing-dev-defs.h | 20 ++++++++++++++++
 include/linux/backing-dev.h      |  2 +
 3 files changed, 36 insertions(+), 33 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -36,10 +36,6 @@
  */
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_SHIFT - 10))
 
-struct wb_completion {
-	atomic_t		cnt;
-};
-
 /*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
@@ -61,19 +57,6 @@ struct wb_writeback_work {
 };
 
 /*
- * If one wants to wait for one or more wb_writeback_works, each work's
- * ->done should be set to a wb_completion defined using the following
- * macro.  Once all work items are issued with wb_queue_work(), the caller
- * can wait for the completion of all using wb_wait_for_completion().  Work
- * items which are waited upon aren't freed automatically on completion.
- */
-#define DEFINE_WB_COMPLETION_ONSTACK(cmpl)				\
-	struct wb_completion cmpl = {					\
-		.cnt		= ATOMIC_INIT(1),			\
-	}
-
-
-/*
  * If an inode is constantly having its pages dirtied, but then the
  * updates stop dirtytime_expire_interval seconds in the past, it's
  * possible for the worst case time between when an inode has its
@@ -182,7 +165,7 @@ static void finish_writeback_work(struct
 	if (work->auto_free)
 		kfree(work);
 	if (done && atomic_dec_and_test(&done->cnt))
-		wake_up_all(&wb->bdi->wb_waitq);
+		wake_up_all(done->waitq);
 }
 
 static void wb_queue_work(struct bdi_writeback *wb,
@@ -206,20 +189,18 @@ static void wb_queue_work(struct bdi_wri
 
 /**
  * wb_wait_for_completion - wait for completion of bdi_writeback_works
- * @bdi: bdi work items were issued to
  * @done: target wb_completion
  *
  * Wait for one or more work items issued to @bdi with their ->done field
- * set to @done, which should have been defined with
- * DEFINE_WB_COMPLETION_ONSTACK().  This function returns after all such
- * work items are completed.  Work items which are waited upon aren't freed
+ * set to @done, which should have been initialized with
+ * DEFINE_WB_COMPLETION().  This function returns after all such work items
+ * are completed.  Work items which are waited upon aren't freed
  * automatically on completion.
  */
-static void wb_wait_for_completion(struct backing_dev_info *bdi,
-				   struct wb_completion *done)
+void wb_wait_for_completion(struct wb_completion *done)
 {
 	atomic_dec(&done->cnt);		/* put down the initial count */
-	wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));
+	wait_event(*done->waitq, !atomic_read(&done->cnt));
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -854,7 +835,7 @@ static void bdi_split_work_to_wbs(struct
 restart:
 	rcu_read_lock();
 	list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
-		DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
+		DEFINE_WB_COMPLETION(fallback_work_done, bdi);
 		struct wb_writeback_work fallback_work;
 		struct wb_writeback_work *work;
 		long nr_pages;
@@ -901,7 +882,7 @@ restart:
 			last_wb = wb;
 
 		rcu_read_unlock();
-		wb_wait_for_completion(bdi, &fallback_work_done);
+		wb_wait_for_completion(&fallback_work_done);
 		goto restart;
 	}
 	rcu_read_unlock();
@@ -2373,7 +2354,8 @@ static void wait_sb_inodes(struct super_
 static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
 				     enum wb_reason reason, bool skip_if_busy)
 {
-	DEFINE_WB_COMPLETION_ONSTACK(done);
+	struct backing_dev_info *bdi = sb->s_bdi;
+	DEFINE_WB_COMPLETION(done, bdi);
 	struct wb_writeback_work work = {
 		.sb			= sb,
 		.sync_mode		= WB_SYNC_NONE,
@@ -2382,14 +2364,13 @@ static void __writeback_inodes_sb_nr(str
 		.nr_pages		= nr,
 		.reason			= reason,
 	};
-	struct backing_dev_info *bdi = sb->s_bdi;
 
 	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy);
-	wb_wait_for_completion(bdi, &done);
+	wb_wait_for_completion(&done);
 }
 
 /**
@@ -2451,7 +2432,8 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb
  */
 void sync_inodes_sb(struct super_block *sb)
 {
-	DEFINE_WB_COMPLETION_ONSTACK(done);
+	struct backing_dev_info *bdi = sb->s_bdi;
+	DEFINE_WB_COMPLETION(done, bdi);
 	struct wb_writeback_work work = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_ALL,
@@ -2461,7 +2443,6 @@ void sync_inodes_sb(struct super_block *
 		.reason		= WB_REASON_SYNC,
 		.for_sync	= 1,
 	};
-	struct backing_dev_info *bdi = sb->s_bdi;
 
 	/*
	 * Can't skip on !bdi_has_dirty() because we should wait for !dirty
@@ -2475,7 +2456,7 @@ void sync_inodes_sb(struct super_block *
 	/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
 	bdi_down_write_wb_switch_rwsem(bdi);
 	bdi_split_work_to_wbs(bdi, &work, false);
-	wb_wait_for_completion(bdi, &done);
+	wb_wait_for_completion(&done);
 	bdi_up_write_wb_switch_rwsem(bdi);
 
 	wait_sb_inodes(sb);
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -67,6 +67,26 @@ enum wb_reason {
 	WB_REASON_MAX,
 };
 
+struct wb_completion {
+	atomic_t		cnt;
+	wait_queue_head_t	*waitq;
+};
+
+#define __WB_COMPLETION_INIT(_waitq)	\
+	(struct wb_completion){ .cnt = ATOMIC_INIT(1), .waitq = (_waitq) }
+
+/*
+ * If one wants to wait for one or more wb_writeback_works, each work's
+ * ->done should be set to a wb_completion defined using the following
+ * macro.  Once all work items are issued with wb_queue_work(), the caller
+ * can wait for the completion of all using wb_wait_for_completion().  Work
+ * items which are waited upon aren't freed automatically on completion.
+ */
+#define WB_COMPLETION_INIT(bdi)		__WB_COMPLETION_INIT(&(bdi)->wb_waitq)
+
+#define DEFINE_WB_COMPLETION(cmpl, bdi)	\
+	struct wb_completion cmpl = WB_COMPLETION_INIT(bdi)
+
 /*
  * For cgroup writeback, multiple wb's may map to the same blkcg.  Those
  * wb's can operate mostly independently but should share the congested
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -44,6 +44,8 @@ void wb_start_background_writeback(struc
 void wb_workfn(struct work_struct *work);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
 
+void wb_wait_for_completion(struct wb_completion *done);
+
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
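With this change a wb_completion can be driven from code outside
fs-writeback.c.  A minimal usage sketch of the two initialization forms
(illustrative only, not part of the patch; the waitq and function names are
made up):

wb-completion-sketch.c (illustrative)::

  #include <linux/backing-dev.h>
  #include <linux/wait.h>

  /* a completion tied to a caller-provided waitq, as the memcg side does
   * later in this series */
  static DECLARE_WAIT_QUEUE_HEAD(my_flush_waitq);

  static void my_wait_on_private_waitq(struct wb_completion *done)
  {
  	*done = __WB_COMPLETION_INIT(&my_flush_waitq);
  	/* ... writeback works are queued with ->done pointing at *done ... */
  	wb_wait_for_completion(done);	/* puts the initial count, waits for queued works, if any */
  }

  /* or tied to a bdi's wb_waitq, as the existing fs-writeback.c callers are */
  static void my_wait_on_bdi(struct backing_dev_info *bdi)
  {
  	DEFINE_WB_COMPLETION(done, bdi);

  	/* ... writeback works are queued with ->done = &done ... */
  	wb_wait_for_completion(&done);
  }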
From patchwork Thu Aug 15 19:57:50 2019
X-Patchwork-Submitter: Tejun Heo
X-Patchwork-Id: 11096427
Date: Thu, 15 Aug 2019 12:57:50 -0700
From: Tejun Heo
To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org,
	vdavydov.dev@gmail.com
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com,
	akpm@linux-foundation.org
Subject: [PATCH 2/5] bdi: Add bdi->id
Message-ID: <20190815195750.GC2263813@devbig004.ftw2.facebook.com>
In-Reply-To: <20190815195619.GA2263813@devbig004.ftw2.facebook.com>

There currently is no way to universally identify and look up a bdi
without holding a reference and pointer to it.  This patch adds a
non-recycling bdi->id and implements bdi_get_by_id() which looks up bdis
by their ids.  This will be used by memcg foreign inode flushing.

I left bdi_list alone for simplicity and because, while rb_tree does
support RCU assignment, it doesn't seem to guarantee lossless walk when
the walk is racing against tree rebalance operations.

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
---
 include/linux/backing-dev-defs.h |  2 +
 include/linux/backing-dev.h      |  1 
 mm/backing-dev.c                 | 65 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 66 insertions(+), 2 deletions(-)

--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -185,6 +185,8 @@ struct bdi_writeback {
 };
 
 struct backing_dev_info {
+	u64 id;
+	struct rb_node rb_node; /* keyed by ->id */
 	struct list_head bdi_list;
 	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
 	unsigned long io_pages;	/* max allowed IO size */
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -24,6 +24,7 @@ static inline struct backing_dev_info *b
 	return bdi;
 }
 
+struct backing_dev_info *bdi_get_by_id(u64 id);
 void bdi_put(struct backing_dev_info *bdi);
 
 __printf(2, 3)
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
 
 #include
+#include <linux/rbtree.h>
 #include
 #include
 #include
@@ -22,10 +23,12 @@ EXPORT_SYMBOL_GPL(noop_backing_dev_info)
 static struct class *bdi_class;
 
 /*
- * bdi_lock protects updates to bdi_list. bdi_list has RCU reader side
- * locking.
+ * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU
+ * reader side locking.
 */
 DEFINE_SPINLOCK(bdi_lock);
+static u64 bdi_id_cursor;
+static struct rb_root bdi_tree = RB_ROOT;
 LIST_HEAD(bdi_list);
 
 /* bdi_wq serves all asynchronous writeback tasks */
@@ -859,9 +862,58 @@ struct backing_dev_info *bdi_alloc_node(
 }
 EXPORT_SYMBOL(bdi_alloc_node);
 
+static struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp)
+{
+	struct rb_node **p = &bdi_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct backing_dev_info *bdi;
+
+	lockdep_assert_held(&bdi_lock);
+
+	while (*p) {
+		parent = *p;
+		bdi = rb_entry(parent, struct backing_dev_info, rb_node);
+
+		if (bdi->id > id)
+			p = &(*p)->rb_left;
+		else if (bdi->id < id)
+			p = &(*p)->rb_right;
+		else
+			break;
+	}
+
+	if (parentp)
+		*parentp = parent;
+	return p;
+}
+
+/**
+ * bdi_get_by_id - lookup and get bdi from its id
+ * @id: bdi id to lookup
+ *
+ * Find bdi matching @id and get it.  Returns NULL if the matching bdi
+ * doesn't exist or is already unregistered.
+ */
+struct backing_dev_info *bdi_get_by_id(u64 id)
+{
+	struct backing_dev_info *bdi = NULL;
+	struct rb_node **p;
+
+	spin_lock_bh(&bdi_lock);
+	p = bdi_lookup_rb_node(id, NULL);
+	if (*p) {
+		bdi = rb_entry(*p, struct backing_dev_info, rb_node);
+		bdi_get(bdi);
+	}
+	spin_unlock_bh(&bdi_lock);
+
+	return bdi;
+}
+
 int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
 {
 	struct device *dev;
+	struct rb_node *parent, **p;
 
 	if (bdi->dev)	/* The driver needs to use separate queues per device */
 		return 0;
@@ -877,7 +929,15 @@ int bdi_register_va(struct backing_dev_i
 	set_bit(WB_registered, &bdi->wb.state);
 
 	spin_lock_bh(&bdi_lock);
+
+	bdi->id = ++bdi_id_cursor;
+
+	p = bdi_lookup_rb_node(bdi->id, &parent);
+	rb_link_node(&bdi->rb_node, parent, p);
+	rb_insert_color(&bdi->rb_node, &bdi_tree);
+
 	list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
+
 	spin_unlock_bh(&bdi_lock);
 
 	trace_writeback_bdi_register(bdi);
@@ -918,6 +978,7 @@ EXPORT_SYMBOL(bdi_register_owner);
 static void bdi_remove_from_list(struct backing_dev_info *bdi)
 {
 	spin_lock_bh(&bdi_lock);
+	rb_erase(&bdi->rb_node, &bdi_tree);
 	list_del_rcu(&bdi->bdi_list);
 	spin_unlock_bh(&bdi_lock);
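bdi_get_by_id() returns the bdi with its refcount elevated, so callers pair
it with bdi_put().  A minimal lookup sketch (illustrative only;
my_bdi_still_registered() is a made-up helper, not part of the patch):

bdi-id-sketch.c (illustrative)::

  #include <linux/backing-dev.h>

  static bool my_bdi_still_registered(u64 id)
  {
  	struct backing_dev_info *bdi;

  	bdi = bdi_get_by_id(id);	/* NULL if no such id or already unregistered */
  	if (!bdi)
  		return false;

  	/* ... the bdi can be used here under the acquired reference ... */

  	bdi_put(bdi);
  	return true;
  }

Because the id comes from a non-recycling cursor, a remembered id can never
alias a different, later bdi; a stale id simply fails the lookup.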
From patchwork Thu Aug 15 19:58:23 2019
X-Patchwork-Submitter: Tejun Heo
X-Patchwork-Id: 11096431
Date: Thu, 15 Aug 2019 12:58:23 -0700
From: Tejun Heo
To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org,
	vdavydov.dev@gmail.com
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com,
	akpm@linux-foundation.org
Subject: [PATCH 3/5] writeback: Separate out wb_get_lookup() from wb_get_create()
Message-ID: <20190815195823.GD2263813@devbig004.ftw2.facebook.com>
In-Reply-To: <20190815195619.GA2263813@devbig004.ftw2.facebook.com>

Separate out wb_get_lookup() from wb_get_create().  wb_get_lookup() only
returns an already existing wb and doesn't try to create one if it is
missing.  This will be used by later patches.

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
---
 include/linux/backing-dev.h |  2 +
 mm/backing-dev.c            | 55 +++++++++++++++++++++++++++++---------------
 2 files changed, 39 insertions(+), 18 deletions(-)

--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -230,6 +230,8 @@ static inline int bdi_sched_wait(void *w
 struct bdi_writeback_congested *
 wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp);
 void wb_congested_put(struct bdi_writeback_congested *congested);
+struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi,
+				    struct cgroup_subsys_state *memcg_css);
 struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 				    struct cgroup_subsys_state *memcg_css,
 				    gfp_t gfp);
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -618,13 +618,12 @@ out_put:
 }
 
 /**
- * wb_get_create - get wb for a given memcg, create if necessary
+ * wb_get_lookup - get wb for a given memcg
  * @bdi: target bdi
 * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref)
- * @gfp: allocation mask to use
 *
- * Try to get the wb for @memcg_css on @bdi.  If it doesn't exist, try to
- * create one.  The returned wb has its refcount incremented.
+ * Try to get the wb for @memcg_css on @bdi.  The returned wb has its
+ * refcount incremented.
 *
 * This function uses css_get() on @memcg_css and thus expects its refcnt
 * to be positive on invocation.  IOW, rcu_read_lock() protection on
@@ -641,6 +640,39 @@ out_put:
 * each lookup.  On mismatch, the existing wb is discarded and a new one is
 * created.
 */
+struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi,
+				    struct cgroup_subsys_state *memcg_css)
+{
+	struct bdi_writeback *wb;
+
+	if (!memcg_css->parent)
+		return &bdi->wb;
+
+	rcu_read_lock();
+	wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+	if (wb) {
+		struct cgroup_subsys_state *blkcg_css;
+
+		/* see whether the blkcg association has changed */
+		blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &io_cgrp_subsys);
+		if (unlikely(wb->blkcg_css != blkcg_css || !wb_tryget(wb)))
+			wb = NULL;
+		css_put(blkcg_css);
+	}
+	rcu_read_unlock();
+
+	return wb;
+}
+
+/**
+ * wb_get_create - get wb for a given memcg, create if necessary
+ * @bdi: target bdi
+ * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref)
+ * @gfp: allocation mask to use
+ *
+ * Try to get the wb for @memcg_css on @bdi.  If it doesn't exist, try to
+ * create one.  See wb_get_lookup() for more details.
+ */
 struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 				    struct cgroup_subsys_state *memcg_css,
 				    gfp_t gfp)
@@ -653,20 +685,7 @@ struct bdi_writeback *wb_get_create(stru
 		return &bdi->wb;
 
 	do {
-		rcu_read_lock();
-		wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
-		if (wb) {
-			struct cgroup_subsys_state *blkcg_css;
-
-			/* see whether the blkcg association has changed */
-			blkcg_css = cgroup_get_e_css(memcg_css->cgroup,
-						     &io_cgrp_subsys);
-			if (unlikely(wb->blkcg_css != blkcg_css ||
-				     !wb_tryget(wb)))
-				wb = NULL;
-			css_put(blkcg_css);
-		}
-		rcu_read_unlock();
+		wb = wb_get_lookup(bdi, memcg_css);
 	} while (!wb && !cgwb_create(bdi, memcg_css, gfp));
 
 	return wb;
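The split matters for flush-style callers which only care about wbs that
already exist and must not instantiate new ones.  A sketch of that calling
pattern (illustrative only; requires CONFIG_CGROUP_WRITEBACK, and the
helper name is made up):

wb-lookup-sketch.c (illustrative)::

  #include <linux/backing-dev.h>
  #include <linux/cgroup.h>

  static struct bdi_writeback *my_existing_wb(struct backing_dev_info *bdi,
  					      struct cgroup_subsys_state *memcg_css)
  {
  	struct bdi_writeback *wb;

  	wb = wb_get_lookup(bdi, memcg_css);	/* never allocates; may return NULL */
  	if (!wb)
  		return NULL;			/* no wb yet, so nothing to flush */

  	/* the caller now owns a reference and must drop it with wb_put() */
  	return wb;
  }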
From patchwork Thu Aug 15 19:59:02 2019
X-Patchwork-Submitter: Tejun Heo
X-Patchwork-Id: 11096433
Date: Thu, 15 Aug 2019 12:59:02 -0700
From: Tejun Heo
To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org,
	vdavydov.dev@gmail.com
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com,
	akpm@linux-foundation.org
Subject: [PATCH 4/5] writeback, memcg: Implement cgroup_writeback_by_id()
Message-ID: <20190815195902.GE2263813@devbig004.ftw2.facebook.com>
In-Reply-To: <20190815195619.GA2263813@devbig004.ftw2.facebook.com>

Implement cgroup_writeback_by_id() which initiates cgroup writeback from
bdi and memcg IDs.  This will be used by memcg foreign inode flushing.

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
---
 fs/fs-writeback.c         | 67 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/writeback.h |  2 +
 2 files changed, 69 insertions(+)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -892,6 +892,73 @@ restart:
 }
 
 /**
+ * cgroup_writeback_by_id - initiate cgroup writeback from bdi and memcg IDs
+ * @bdi_id: target bdi id
+ * @memcg_id: target memcg css id
+ * @nr_pages: number of pages to write
+ * @reason: reason why some writeback work initiated
+ * @done: target wb_completion
+ *
+ * Initiate flush of the bdi_writeback identified by @bdi_id and @memcg_id
+ * with the specified parameters.
+ */
+int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr,
+			   enum wb_reason reason, struct wb_completion *done)
+{
+	struct backing_dev_info *bdi;
+	struct cgroup_subsys_state *memcg_css;
+	struct bdi_writeback *wb;
+	struct wb_writeback_work *work;
+	int ret;
+
+	/* lookup bdi and memcg */
+	bdi = bdi_get_by_id(bdi_id);
+	if (!bdi)
+		return -ENOENT;
+
+	rcu_read_lock();
+	memcg_css = css_from_id(memcg_id, &memory_cgrp_subsys);
+	if (memcg_css && !css_tryget(memcg_css))
+		memcg_css = NULL;
+	rcu_read_unlock();
+	if (!memcg_css) {
+		ret = -ENOENT;
+		goto out_bdi_put;
+	}
+
+	/*
+	 * And find the associated wb.  If the wb isn't there already
+	 * there's nothing to flush, don't create one.
+	 */
+	wb = wb_get_lookup(bdi, memcg_css);
+	if (!wb) {
+		ret = -ENOENT;
+		goto out_css_put;
+	}
+
+	/* issue the writeback work */
+	work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN);
+	if (work) {
+		work->nr_pages = nr;
+		work->sync_mode = WB_SYNC_NONE;
+		work->reason = reason;
+		work->done = done;
+		work->auto_free = 1;
+		wb_queue_work(wb, work);
+		ret = 0;
+	} else {
+		ret = -ENOMEM;
+	}
+
+	wb_put(wb);
+out_css_put:
+	css_put(memcg_css);
+out_bdi_put:
+	bdi_put(bdi);
+	return ret;
+}
+
+/**
 * cgroup_writeback_umount - flush inode wb switches for umount
 *
 * This function is called when a super_block is about to be destroyed and
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -217,6 +217,8 @@ void wbc_attach_and_unlock_inode(struct
 void wbc_detach_inode(struct writeback_control *wbc);
 void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
 			      size_t bytes);
+int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
+			   enum wb_reason reason, struct wb_completion *done);
 void cgroup_writeback_umount(void);
 
 /**
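cgroup_writeback_by_id() only queues work; completion is reported through
the optional @done.  The next patch uses it fire-and-forget style, but a
synchronous caller could look like the sketch below (made-up names;
WB_REASON_VMSCAN merely stands in for whatever reason the caller has):

cgwb-flush-sketch.c (illustrative)::

  #include <linux/writeback.h>
  #include <linux/backing-dev.h>
  #include <linux/wait.h>
  #include <linux/kernel.h>

  static DECLARE_WAIT_QUEUE_HEAD(my_flush_waitq);

  static int my_flush_memcg_on_bdi(u64 bdi_id, int memcg_id)
  {
  	struct wb_completion done = __WB_COMPLETION_INIT(&my_flush_waitq);
  	int ret;

  	ret = cgroup_writeback_by_id(bdi_id, memcg_id, LONG_MAX,
  				     WB_REASON_VMSCAN, &done);

  	/* safe even on failure: only the initial count is left to put */
  	wb_wait_for_completion(&done);
  	return ret;
  }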
From patchwork Thu Aug 15 19:59:30 2019
X-Patchwork-Submitter: Tejun Heo
X-Patchwork-Id: 11096437
Date: Thu, 15 Aug 2019 12:59:30 -0700
From: Tejun Heo
To: axboe@kernel.dk, jack@suse.cz, hannes@cmpxchg.org, mhocko@kernel.org,
	vdavydov.dev@gmail.com
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com, guro@fb.com,
	akpm@linux-foundation.org
Subject: [PATCH 5/5] writeback, memcg: Implement foreign dirty flushing
Message-ID: <20190815195930.GF2263813@devbig004.ftw2.facebook.com>
In-Reply-To: <20190815195619.GA2263813@devbig004.ftw2.facebook.com>

There's an inherent mismatch between memcg and writeback.  The former
tracks ownership per-page while the latter per-inode.  This was a
deliberate design decision because honoring per-page ownership in the
writeback path is complicated, may lead to higher CPU and IO overheads and
was deemed unnecessary given that write-sharing an inode across different
cgroups isn't a common use-case.

Combined with inode majority-writer ownership switching, this works well
enough in most cases but there are some pathological cases.  For example,
let's say there are two cgroups A and B which keep writing to different
but confined parts of the same inode.  B owns the inode and A's memory is
limited far below B's.  A's dirty ratio can rise enough to trigger
balance_dirty_pages() sleeps but B's can be low enough to avoid triggering
background writeback.  A will be slowed down without a way to make
writeback of the dirty pages happen.

This patch implements foreign dirty recording and a foreign flush
mechanism so that when a memcg encounters a condition as above it can
trigger flushes on the bdi_writebacks which can clean its pages.  Please
see the comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
details.

A reproducer follows.

write-range.c::

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <errno.h>

  static const char *usage = "write-range FILE START SIZE\n";

  int main(int argc, char **argv)
  {
  	int fd;
  	unsigned long start, size, end, pos;
  	char *endp;
  	char buf[4096];

  	if (argc < 4) {
  		fprintf(stderr, usage);
  		return 1;
  	}

  	fd = open(argv[1], O_WRONLY);
  	if (fd < 0) {
  		perror("open");
  		return 1;
  	}

  	start = strtoul(argv[2], &endp, 0);
  	if (*endp != '\0') {
  		fprintf(stderr, usage);
  		return 1;
  	}

  	size = strtoul(argv[3], &endp, 0);
  	if (*endp != '\0') {
  		fprintf(stderr, usage);
  		return 1;
  	}

  	end = start + size;

  	while (1) {
  		for (pos = start; pos < end; ) {
  			long bread, bwritten = 0;

  			if (lseek(fd, pos, SEEK_SET) < 0) {
  				perror("lseek");
  				return 1;
  			}

  			bread = read(0, buf, sizeof(buf) < end - pos ?
  				     sizeof(buf) : end - pos);
  			if (bread < 0) {
  				perror("read");
  				return 1;
  			}
  			if (bread == 0)
  				return 0;

  			while (bwritten < bread) {
  				long this;

  				this = write(fd, buf + bwritten,
  					     bread - bwritten);
  				if (this < 0) {
  					perror("write");
  					return 1;
  				}

  				bwritten += this;
  				pos += bwritten;
  			}
  		}
  	}
  }

repro.sh::

  #!/bin/bash

  set -e
  set -x

  sysctl -w vm.dirty_expire_centisecs=300000
  sysctl -w vm.dirty_writeback_centisecs=300000
  sysctl -w vm.dirtytime_expire_seconds=300000

  echo 3 > /proc/sys/vm/drop_caches

  TEST=/sys/fs/cgroup/test
  A=$TEST/A
  B=$TEST/B

  mkdir -p $A $B
  echo "+memory +io" > $TEST/cgroup.subtree_control
  echo $((1<<30)) > $A/memory.high
  echo $((32<<30)) > $B/memory.high

  rm -f testfile
  touch testfile
  fallocate -l 4G testfile

  echo "Starting B"

  (echo $BASHPID > $B/cgroup.procs
   pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &

  echo "Waiting 10s to ensure B claims the testfile inode"
  sleep 5
  sync
  sleep 5
  sync

  echo "Starting A"

  (echo $BASHPID > $A/cgroup.procs
   pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))

v2: Added comments explaining why the specific intervals are being used.

Signed-off-by: Tejun Heo
---
 include/linux/backing-dev-defs.h |   1 
 include/linux/memcontrol.h       |  39 +++++++++++
 mm/memcontrol.c                  | 132 +++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c              |   4 +
 4 files changed, 176 insertions(+)

--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -63,6 +63,7 @@ enum wb_reason {
	 * so it has a mismatch name.
	 */
	WB_REASON_FORKER_THREAD,
+	WB_REASON_FOREIGN_FLUSH,
 
	WB_REASON_MAX,
 };
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -184,6 +184,23 @@ struct memcg_padding {
 #endif
 
 /*
+ * Remember four most recent foreign writebacks with dirty pages in this
+ * cgroup.  Inode sharing is expected to be uncommon and, even if we miss
+ * one in a given round, we're likely to catch it later if it keeps
+ * foreign-dirtying, so a fairly low count should be enough.
+ *
+ * See mem_cgroup_track_foreign_dirty_slowpath() for details.
+ */
+#define MEMCG_CGWB_FRN_CNT	4
+
+struct memcg_cgwb_frn {
+	u64 bdi_id;			/* bdi->id of the foreign inode */
+	int memcg_id;			/* memcg->css.id of foreign inode */
+	u64 at;				/* jiffies_64 at the time of dirtying */
+	struct wb_completion done;	/* tracks in-flight foreign writebacks */
+};
+
+/*
 * The memory controller data structure. The memory controller controls both
 * page cache and RSS per cgroup. We would eventually like to provide
 * statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -307,6 +324,7 @@ struct mem_cgroup {
 #ifdef CONFIG_CGROUP_WRITEBACK
	struct list_head cgwb_list;
	struct wb_domain cgwb_domain;
+	struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT];
 #endif
 
	/* List of events which userspace want to receive */
@@ -1237,6 +1255,18 @@ void mem_cgroup_wb_stats(struct bdi_writ
			 unsigned long *pheadroom, unsigned long *pdirty,
			 unsigned long *pwriteback);
 
+void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
+					     struct bdi_writeback *wb);
+
+static inline void mem_cgroup_track_foreign_dirty(struct page *page,
+						  struct bdi_writeback *wb)
+{
+	if (unlikely(&page->mem_cgroup->css != wb->memcg_css))
+		mem_cgroup_track_foreign_dirty_slowpath(page, wb);
+}
+
+void mem_cgroup_flush_foreign(struct bdi_writeback *wb);
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
@@ -1252,6 +1282,15 @@ static inline void mem_cgroup_wb_stats(s
 {
 }
 
+static inline void mem_cgroup_track_foreign_dirty(struct page *page,
+						  struct bdi_writeback *wb)
+{
+}
+
+static inline void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
+{
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 struct sock;
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,10 @@ int do_swap_account __read_mostly;
 #define do_swap_account		0
 #endif
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
+#endif
+
 /* Whether legacy memory+swap accounting is active */
 static bool do_memsw_account(void)
 {
@@ -4184,6 +4188,125 @@ void mem_cgroup_wb_stats(struct bdi_writ
	}
 }
 
+/*
+ * Foreign dirty flushing
+ *
+ * There's an inherent mismatch between memcg and writeback.  The former
+ * tracks ownership per-page while the latter per-inode.  This was a
+ * deliberate design decision because honoring per-page ownership in the
+ * writeback path is complicated, may lead to higher CPU and IO overheads
+ * and deemed unnecessary given that write-sharing an inode across
+ * different cgroups isn't a common use-case.
+ *
+ * Combined with inode majority-writer ownership switching, this works well
+ * enough in most cases but there are some pathological cases.  For
+ * example, let's say there are two cgroups A and B which keep writing to
+ * different but confined parts of the same inode.  B owns the inode and
+ * A's memory is limited far below B's.  A's dirty ratio can rise enough to
+ * trigger balance_dirty_pages() sleeps but B's can be low enough to avoid
+ * triggering background writeback.  A will be slowed down without a way to
+ * make writeback of the dirty pages happen.
+ *
+ * Conditions like the above can lead to a cgroup getting repeatedly and
+ * severely throttled after making some progress after each
+ * dirty_expire_interval while the underlying IO device is almost
+ * completely idle.
+ *
+ * Solving this problem completely requires matching the ownership tracking
+ * granularities between memcg and writeback in either direction.  However,
+ * the more egregious behaviors can be avoided by simply remembering the
+ * most recent foreign dirtying events and initiating remote flushes on
+ * them when local writeback isn't enough to keep the memory clean enough.
+ *
+ * The following two functions implement such mechanism.  When a foreign
+ * page - a page whose memcg and writeback ownerships don't match - is
+ * dirtied, mem_cgroup_track_foreign_dirty() records the inode owning
+ * bdi_writeback on the page owning memcg.  When balance_dirty_pages()
+ * decides that the memcg needs to sleep due to high dirty ratio, it calls
+ * mem_cgroup_flush_foreign() which queues writeback on the recorded
+ * foreign bdi_writebacks which haven't expired.  Both the numbers of
+ * recorded bdi_writebacks and concurrent in-flight foreign writebacks are
+ * limited to MEMCG_CGWB_FRN_CNT.
+ *
+ * The mechanism only remembers IDs and doesn't hold any object references.
+ * As being wrong occasionally doesn't matter, updates and accesses to the
+ * records are lockless and racy.
+ */
+void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
+					     struct bdi_writeback *wb)
+{
+	struct mem_cgroup *memcg = page->mem_cgroup;
+	struct memcg_cgwb_frn *frn;
+	u64 now = jiffies_64;
+	u64 oldest_at = now;
+	int oldest = -1;
+	int i;
+
+	/*
+	 * Pick the slot to use.  If there is already a slot for @wb, keep
+	 * using it.  If not replace the oldest one which isn't being
+	 * written out.
+	 */
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+		frn = &memcg->cgwb_frn[i];
+		if (frn->bdi_id == wb->bdi->id &&
+		    frn->memcg_id == wb->memcg_css->id)
+			break;
+		if (frn->at < oldest_at && atomic_read(&frn->done.cnt) == 1) {
+			oldest = i;
+			oldest_at = frn->at;
+		}
+	}
+
+	if (i < MEMCG_CGWB_FRN_CNT) {
+		/*
+		 * Re-using an existing one.  Update timestamp lazily to
+		 * avoid making the cacheline hot.  We want them to be
+		 * reasonably up-to-date and significantly shorter than
+		 * dirty_expire_interval as that's what expires the record.
+		 * Use the shorter of 1s and dirty_expire_interval / 8.
+		 */
+		unsigned long update_intv =
+			min_t(unsigned long, HZ,
+			      msecs_to_jiffies(dirty_expire_interval * 10) / 8);
+
+		if (frn->at < now - update_intv)
+			frn->at = now;
+	} else if (oldest >= 0) {
+		/* replace the oldest free one */
+		frn = &memcg->cgwb_frn[oldest];
+		frn->bdi_id = wb->bdi->id;
+		frn->memcg_id = wb->memcg_css->id;
+		frn->at = now;
+	}
+}
+
+/* issue foreign writeback flushes for recorded foreign dirtying events */
+void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
+	unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10);
+	u64 now = jiffies_64;
+	int i;
+
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+		struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i];
+
+		/*
+		 * If the record is older than dirty_expire_interval,
+		 * writeback on it has already started.  No need to kick it
+		 * off again.  Also, don't start a new one if there's
+		 * already one in flight.
+		 */
+		if (frn->at > now - intv && atomic_read(&frn->done.cnt) == 1) {
+			frn->at = 0;
+			cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id,
+					       LONG_MAX, WB_REASON_FOREIGN_FLUSH,
+					       &frn->done);
+		}
+	}
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
@@ -4700,6 +4823,7 @@ static struct mem_cgroup *mem_cgroup_all
	struct mem_cgroup *memcg;
	unsigned int size;
	int node;
+	int __maybe_unused i;
 
	size = sizeof(struct mem_cgroup);
	size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
@@ -4743,6 +4867,9 @@ static struct mem_cgroup *mem_cgroup_all
 #endif
 #ifdef CONFIG_CGROUP_WRITEBACK
	INIT_LIST_HEAD(&memcg->cgwb_list);
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
+		memcg->cgwb_frn[i].done =
+			__WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq);
 #endif
	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
	return memcg;
@@ -4872,7 +4999,12 @@ static void mem_cgroup_css_released(stru
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	int __maybe_unused i;
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
+		wb_wait_for_completion(&memcg->cgwb_frn[i].done);
+#endif
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
		static_branch_dec(&memcg_sockets_enabled_key);
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1667,6 +1667,8 @@ static void balance_dirty_pages(struct b
		if (unlikely(!writeback_in_progress(wb)))
			wb_start_background_writeback(wb);
 
+		mem_cgroup_flush_foreign(wb);
+
		/*
		 * Calculate global domain's pos_ratio and select the
		 * global dtc by default.
@@ -2427,6 +2429,8 @@ void account_page_dirtied(struct page *p
		task_io_account_write(PAGE_SIZE);
		current->nr_dirtied++;
		this_cpu_inc(bdp_ratelimits);
+
+		mem_cgroup_track_foreign_dirty(page, wb);
	}
 }
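For a sense of the intervals chosen above: with the default
vm.dirty_expire_centisecs of 3000, a record's timestamp is refreshed at
most about once a second and the record stays flushable for 30 seconds;
repro.sh raises the expiry knobs to 300000 so that periodic expiry
writeback stays out of the picture and the foreign-flush path is what makes
progress.  A small userspace sketch of that arithmetic (illustrative only,
mirroring the min(HZ, dirty_expire_interval / 8) and dirty_expire_interval
expressions from the patch):

frn-intervals.c (illustrative)::

  #include <stdio.h>

  int main(void)
  {
  	/* default and repro.sh values of vm.dirty_expire_centisecs */
  	unsigned long settings[] = { 3000, 300000 };

  	for (int i = 0; i < 2; i++) {
  		unsigned long expire_ms = settings[i] * 10;
  		/* "shorter of 1s and dirty_expire_interval / 8" */
  		unsigned long update_ms =
  			expire_ms / 8 < 1000 ? expire_ms / 8 : 1000;

  		printf("dirty_expire_centisecs=%lu: refresh interval <= %lu ms, "
  		       "flush window = %lu ms\n",
  		       settings[i], update_ms, expire_ms);
  	}
  	return 0;
  }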