From patchwork Wed Jun 13 10:43:47 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tetsuo Handa X-Patchwork-Id: 10461991 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id E04BF60329 for ; Wed, 13 Jun 2018 10:44:53 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C1BB4288E7 for ; Wed, 13 Jun 2018 10:44:53 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B3426288ED; Wed, 13 Jun 2018 10:44:53 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00, MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5AA14288E7 for ; Wed, 13 Jun 2018 10:44:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934466AbeFMKov (ORCPT ); Wed, 13 Jun 2018 06:44:51 -0400 Received: from www262.sakura.ne.jp ([202.181.97.72]:17204 "EHLO www262.sakura.ne.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934436AbeFMKou (ORCPT ); Wed, 13 Jun 2018 06:44:50 -0400 Received: from fsav305.sakura.ne.jp (fsav305.sakura.ne.jp [153.120.85.136]) by www262.sakura.ne.jp (8.15.2/8.15.2) with ESMTP id w5DAhoBX084010; Wed, 13 Jun 2018 19:43:50 +0900 (JST) (envelope-from penguin-kernel@i-love.sakura.ne.jp) Received: from www262.sakura.ne.jp (202.181.97.72) by fsav305.sakura.ne.jp (F-Secure/fsigk_smtp/530/fsav305.sakura.ne.jp); Wed, 13 Jun 2018 19:43:50 +0900 (JST) X-Virus-Status: clean(F-Secure/fsigk_smtp/530/fsav305.sakura.ne.jp) Received: from [192.168.1.8] (softbank126074194044.bbtec.net [126.74.194.44]) (authenticated bits=0) by www262.sakura.ne.jp (8.15.2/8.15.2) with ESMTPSA id w5DAhi85083966 (version=TLSv1.2 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 13 Jun 2018 19:43:49 +0900 (JST) (envelope-from penguin-kernel@i-love.sakura.ne.jp) Subject: Re: [PATCH] bdi: Fix another oops in wb_workfn() To: Jan Kara , Tejun Heo Cc: Dmitry Vyukov , Jens Axboe , syzbot , syzkaller-bugs , linux-fsdevel , LKML , Al Viro , Dave Chinner , linux-block@vger.kernel.org, Linus Torvalds References: <2b437c6f-3e10-3d83-bdf3-82075d3eaa1a@i-love.sakura.ne.jp> <3cf4b0e3-31b6-8cdc-7c1e-15ba575a7879@i-love.sakura.ne.jp> <20180611091248.2i6nt27h5mxrodm2@quack2.suse.cz> <20180611160131.GQ1351649@devbig577.frc2.facebook.com> <20180611162920.mwapvuqotvhkntt3@quack2.suse.cz> <20180611172053.GR1351649@devbig577.frc2.facebook.com> <20180612155754.x5k2yndh5t6wlmpy@quack2.suse.cz> From: Tetsuo Handa Message-ID: Date: Wed, 13 Jun 2018 19:43:47 +0900 User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0 MIME-Version: 1.0 In-Reply-To: <20180612155754.x5k2yndh5t6wlmpy@quack2.suse.cz> Content-Language: en-US Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On 2018/06/13 0:57, Jan Kara wrote: > On Mon 11-06-18 10:20:53, Tejun Heo wrote: >> Hello, >> >> On Mon, Jun 11, 2018 at 06:29:20PM +0200, Jan Kara wrote: >>>> Would something like the following work or am I missing the point >>>> entirely? >>> >>> I was pondering the same solution for a while but I think it won't work. >>> The problem is that e.g. wb_memcg_offline() could have already removed >>> wb from the radix tree but it is still pending in bdi->wb_list >>> (wb_shutdown() has not run yet) and so we'd drop reference we didn't get. >> >> Yeah, right, so the root cause is that we're walking the wb_list while >> holding lock and expecting the object to stay there even after lock is >> released. Hmm... we can use a mutex to synchronize the two >> destruction paths. It's not like they're hot paths anyway. > > Hmm, do you mean like having a per-bdi or even a global mutex that would > protect whole wb_shutdown()? Yes, that should work and we could get rid of > WB_shutting_down bit as well with that. Just it seems a bit strange to > introduce a mutex only to synchronize these two shutdown paths - usually > locks protect data structures and in this case we have cgwb_lock for > that so it looks like a duplication from a first look. > Can't we utilize RCU grace period (like shown below) ? If wb_shutdown(wb) by cgwb_release_workfn() was faster than wb_shutdown(wb) by cgwb_bdi_unregister(): cgwb_bdi_unregister(bdi) cgwb_release_workfn(work) wb = container_of(work, struct bdi_writeback, release_work); spin_lock_irq(&cgwb_lock); wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Same wb here */ rcu_read_lock(); /* Prevent kfree_rcu() from invoking kfree() */ spin_unlock_irq(&cgwb_lock); wb_shutdown(wb); spin_lock_bh(&wb->work_lock); !test_and_clear_bit(WB_registered, &wb->state) is "false". set_bit(WB_shutting_down, &wb->state); spin_unlock_bh(&wb->work_lock); mod_delayed_work(bdi_wq, &wb->dwork, 0); flush_delayed_work(&wb->dwork); cgwb_remove_from_bdi_list(wb); spin_lock_irq(&cgwb_lock); list_del_rcu(&wb->bdi_node); spin_unlock_irq(&cgwb_lock); wb_exit(wb); kfree_rcu(wb, rcu); /* Won't call kfree() because of rcu_read_lock() */ wb_shutdown(wb); spin_lock_bh(&wb->work_lock); /* Safe to access because kfree() cannot be called */ !test_and_clear_bit(WB_registered, &wb->state) is "true". spin_unlock_bh(&wb->work_lock); rcu_read_unlock(); kfree(wb); schedule_timeout_uninterruptible(HZ / 10); /* Try to wait in case list_del_rcu() is not yet called */ spin_lock_irq(&cgwb_lock); wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Different wb if list_del_rcu() was already called, same wb otherwise */ rcu_read_lock(); /* Prevent kfree_rcu() from invoking kfree() if still same wb here */ spin_unlock_irq(&cgwb_lock); If wb_shutdown(wb) by cgwb_bdi_unregister() was faster than wb_shutdown(wb) by cgwb_release_workfn(): cgwb_bdi_unregister(bdi) cgwb_release_workfn(work) wb = container_of(work, struct bdi_writeback, release_work); spin_lock_irq(&cgwb_lock); wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Same wb here */ rcu_read_lock(); spin_unlock_irq(&cgwb_lock); wb_shutdown(wb); spin_lock_bh(&wb->work_lock); !test_and_clear_bit(WB_registered, &wb->state) is "false". set_bit(WB_shutting_down, &wb->state); spin_unlock_bh(&wb->work_lock); rcu_read_unlock(); mod_delayed_work(bdi_wq, &wb->dwork, 0); flush_delayed_work(&wb->dwork); cgwb_remove_from_bdi_list(wb); spin_lock_irq(&cgwb_lock); list_del_rcu(&wb->bdi_node); spin_unlock_irq(&cgwb_lock); spin_lock_irq(&cgwb_lock); wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); /* Different wb here */ rcu_read_lock(); spin_unlock_irq(&cgwb_lock); wb_shutdown(wb); spin_lock_bh(&wb->work_lock); /* Safe to access because kfree() cannot be called */ !test_and_clear_bit(WB_registered, &wb->state) is "true". spin_unlock_bh(&wb->work_lock); wb_exit(wb); kfree_rcu(wb, rcu); mm/backing-dev.c | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 347cc83..6886cea 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -353,13 +353,24 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, /* * Remove bdi from the global list and shutdown any threads we have running */ -static void wb_shutdown(struct bdi_writeback *wb) +static void wb_shutdown(struct bdi_writeback *wb, const bool in_rcu) { /* Make sure nobody queues further work */ spin_lock_bh(&wb->work_lock); if (!test_and_clear_bit(WB_registered, &wb->state)) { spin_unlock_bh(&wb->work_lock); /* + * Try to wait for cgwb_remove_from_bdi_list() without + * using wait_on_bit(), for kfree_rcu(wb, rcu) from + * cgwb_release_workfn() might invoke kfree() before we + * recheck WB_shutting_down bit. + */ + if (in_rcu) { + rcu_read_unlock(); + schedule_timeout_uninterruptible(HZ / 10); + return; + } + /* * Wait for wb shutdown to finish if someone else is just * running wb_shutdown(). Otherwise we could proceed to wb / * bdi destruction before wb_shutdown() is finished. @@ -369,8 +380,9 @@ static void wb_shutdown(struct bdi_writeback *wb) } set_bit(WB_shutting_down, &wb->state); spin_unlock_bh(&wb->work_lock); + if (in_rcu) + rcu_read_unlock(); - cgwb_remove_from_bdi_list(wb); /* * Drain work list and shutdown the delayed_work. !WB_registered * tells wb_workfn() that @wb is dying and its work_list needs to @@ -379,6 +391,7 @@ static void wb_shutdown(struct bdi_writeback *wb) mod_delayed_work(bdi_wq, &wb->dwork, 0); flush_delayed_work(&wb->dwork); WARN_ON(!list_empty(&wb->work_list)); + cgwb_remove_from_bdi_list(wb); /* * Make sure bit gets cleared after shutdown is finished. Matches with * the barrier provided by test_and_clear_bit() above. @@ -508,7 +521,7 @@ static void cgwb_release_workfn(struct work_struct *work) struct bdi_writeback *wb = container_of(work, struct bdi_writeback, release_work); - wb_shutdown(wb); + wb_shutdown(wb, false); css_put(wb->memcg_css); css_put(wb->blkcg_css); @@ -721,8 +734,9 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi) while (!list_empty(&bdi->wb_list)) { wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); + rcu_read_lock(); spin_unlock_irq(&cgwb_lock); - wb_shutdown(wb); + wb_shutdown(wb, true); spin_lock_irq(&cgwb_lock); } spin_unlock_irq(&cgwb_lock); @@ -944,7 +958,7 @@ void bdi_unregister(struct backing_dev_info *bdi) { /* make sure nobody finds us on the bdi_list anymore */ bdi_remove_from_list(bdi); - wb_shutdown(&bdi->wb); + wb_shutdown(&bdi->wb, false); cgwb_bdi_unregister(bdi); if (bdi->dev) { -- Or, equivalent one but easier to read: mm/backing-dev.c | 42 +++++++++++++++++++++++++++++++----------- 1 file changed, 31 insertions(+), 11 deletions(-) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 347cc83..77088a3 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -350,15 +350,25 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb); -/* - * Remove bdi from the global list and shutdown any threads we have running - */ -static void wb_shutdown(struct bdi_writeback *wb) +static bool wb_start_shutdown(struct bdi_writeback *wb) { /* Make sure nobody queues further work */ spin_lock_bh(&wb->work_lock); if (!test_and_clear_bit(WB_registered, &wb->state)) { spin_unlock_bh(&wb->work_lock); + return false; + } + set_bit(WB_shutting_down, &wb->state); + spin_unlock_bh(&wb->work_lock); + return true; +} + +/* + * Remove bdi from the global list and shutdown any threads we have running + */ +static void wb_shutdown(struct bdi_writeback *wb, const bool started) +{ + if (!started && !wb_start_shutdown(wb)) { /* * Wait for wb shutdown to finish if someone else is just * running wb_shutdown(). Otherwise we could proceed to wb / @@ -367,10 +377,6 @@ static void wb_shutdown(struct bdi_writeback *wb) wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE); return; } - set_bit(WB_shutting_down, &wb->state); - spin_unlock_bh(&wb->work_lock); - - cgwb_remove_from_bdi_list(wb); /* * Drain work list and shutdown the delayed_work. !WB_registered * tells wb_workfn() that @wb is dying and its work_list needs to @@ -379,6 +385,7 @@ static void wb_shutdown(struct bdi_writeback *wb) mod_delayed_work(bdi_wq, &wb->dwork, 0); flush_delayed_work(&wb->dwork); WARN_ON(!list_empty(&wb->work_list)); + cgwb_remove_from_bdi_list(wb); /* * Make sure bit gets cleared after shutdown is finished. Matches with * the barrier provided by test_and_clear_bit() above. @@ -508,7 +515,7 @@ static void cgwb_release_workfn(struct work_struct *work) struct bdi_writeback *wb = container_of(work, struct bdi_writeback, release_work); - wb_shutdown(wb); + wb_shutdown(wb, false); css_put(wb->memcg_css); css_put(wb->blkcg_css); @@ -711,6 +718,7 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi) struct radix_tree_iter iter; void **slot; struct bdi_writeback *wb; + bool started; WARN_ON(test_bit(WB_registered, &bdi->wb.state)); @@ -721,8 +729,20 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi) while (!list_empty(&bdi->wb_list)) { wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); + rcu_read_lock(); spin_unlock_irq(&cgwb_lock); - wb_shutdown(wb); + started = wb_start_shutdown(wb); + rcu_read_unlock(); + if (started) + wb_shutdown(wb, true); + else + /* + * Try to wait for cgwb_remove_from_bdi_list() without + * using wait_on_bit(), for kfree_rcu(wb, rcu) from + * cgwb_release_workfn() might invoke kfree() before we + * recheck WB_shutting_down bit. + */ + schedule_timeout_uninterruptible(HZ / 10); spin_lock_irq(&cgwb_lock); } spin_unlock_irq(&cgwb_lock); @@ -944,7 +964,7 @@ void bdi_unregister(struct backing_dev_info *bdi) { /* make sure nobody finds us on the bdi_list anymore */ bdi_remove_from_list(bdi); - wb_shutdown(&bdi->wb); + wb_shutdown(&bdi->wb, false); cgwb_bdi_unregister(bdi); if (bdi->dev) { -- Or, toggling a dedicated flag using test_and_change_bit(): include/linux/backing-dev-defs.h | 1 + mm/backing-dev.c | 8 +++++++- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 0bd432a..93ff83c 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -26,6 +26,7 @@ enum wb_state { WB_writeback_running, /* Writeback is in progress */ WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */ WB_start_all, /* nr_pages == 0 (all) work pending */ + WB_postpone_kfree, /* cgwb_bdi_unregister() will access later */ }; enum wb_congested_state { diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 347cc83..422d7a7 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -516,7 +516,10 @@ static void cgwb_release_workfn(struct work_struct *work) fprop_local_destroy_percpu(&wb->memcg_completions); percpu_ref_exit(&wb->refcnt); wb_exit(wb); - kfree_rcu(wb, rcu); + spin_lock_irq(&cgwb_lock); + if (!test_and_change_bit(WB_postpone_kfree, &wb->state)) + kfree_rcu(wb, rcu); + spin_unlock_irq(&cgwb_lock); } static void cgwb_release(struct percpu_ref *refcnt) @@ -721,9 +724,12 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi) while (!list_empty(&bdi->wb_list)) { wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, bdi_node); + set_bit(WB_postpone_kfree, &wb->state); spin_unlock_irq(&cgwb_lock); wb_shutdown(wb); spin_lock_irq(&cgwb_lock); + if (!test_and_change_bit(WB_postpone_kfree, &wb->state)) + kfree_rcu(wb, rcu); } spin_unlock_irq(&cgwb_lock); }