diff mbox

bdi: Fix another oops in wb_workfn()

Message ID 20180618134658.32302-1-jack@suse.cz (mailing list archive)
State New, archived
Headers show

Commit Message

Jan Kara June 18, 2018, 1:46 p.m. UTC
syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
WB_shutting_down after wb->bdi->dev became NULL. This indicates that
unregister_bdi() failed to call wb_shutdown() on one of wb objects.

The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
drops bdi's reference to wb structures before going through the list of
wbs again and calling wb_shutdown() on each of them. This way the loop
iterating through all wbs can easily miss a wb if that wb has already
passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
from cgwb_release_workfn() and as a result fully shutdown bdi although
wb_workfn() for this wb structure is still running. In fact there are
also other ways cgwb_bdi_unregister() can race with
cgwb_release_workfn() leading e.g. to use-after-free issues:

CPU1                            CPU2
                                cgwb_bdi_unregister()
                                  cgwb_kill(*slot);

cgwb_release()
  queue_work(cgwb_release_wq, &wb->release_work);
cgwb_release_workfn()
                                  wb = list_first_entry(&bdi->wb_list, ...)
                                  spin_unlock_irq(&cgwb_lock);
  wb_shutdown(wb);
  ...
  kfree_rcu(wb, rcu);
                                  wb_shutdown(wb); -> oops use-after-free

We solve these issues by synchronizing writeback structure shutdown from
cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
way we also no longer need synchronization using WB_shutting_down as the
mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
bdi_unregister().

Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h |  2 +-
 mm/backing-dev.c                 | 20 +++++++-------------
 2 files changed, 8 insertions(+), 14 deletions(-)

Comments

Tetsuo Handa June 18, 2018, 2:38 p.m. UTC | #1
On 2018/06/18 22:46, Jan Kara wrote:
> syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to

[1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206

line is missing.

> wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> unregister_bdi() failed to call wb_shutdown() on one of wb objects.
> 
> The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
> drops bdi's reference to wb structures before going through the list of
> wbs again and calling wb_shutdown() on each of them. This way the loop
> iterating through all wbs can easily miss a wb if that wb has already
> passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
> from cgwb_release_workfn() and as a result fully shutdown bdi although
> wb_workfn() for this wb structure is still running. In fact there are
> also other ways cgwb_bdi_unregister() can race with
> cgwb_release_workfn() leading e.g. to use-after-free issues:
> 
> CPU1                            CPU2
>                                 cgwb_bdi_unregister()
>                                   cgwb_kill(*slot);
> 
> cgwb_release()
>   queue_work(cgwb_release_wq, &wb->release_work);
> cgwb_release_workfn()
>                                   wb = list_first_entry(&bdi->wb_list, ...)
>                                   spin_unlock_irq(&cgwb_lock);
>   wb_shutdown(wb);
>   ...
>   kfree_rcu(wb, rcu);
>                                   wb_shutdown(wb); -> oops use-after-free
> 
> We solve these issues by synchronizing writeback structure shutdown from
> cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
> way we also no longer need synchronization using WB_shutting_down as the
> mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
> CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
> bdi_unregister().

Wow, this patch removes WB_shutting_down.

A bit of worry for me is how long will this mutex_lock() sleep, for
if there are a lot of wb objects to shutdown, sequentially doing
wb_shutdown() might block someone's mutex_lock() for longer than
khungtaskd's timeout period (typically 120 seconds) ?

> 
> Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/backing-dev-defs.h |  2 +-
>  mm/backing-dev.c                 | 20 +++++++-------------
>  2 files changed, 8 insertions(+), 14 deletions(-)
Tejun Heo June 18, 2018, 5:40 p.m. UTC | #2
On Mon, Jun 18, 2018 at 03:46:58PM +0200, Jan Kara wrote:
> syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
> wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> unregister_bdi() failed to call wb_shutdown() on one of wb objects.
> 
> The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
> drops bdi's reference to wb structures before going through the list of
> wbs again and calling wb_shutdown() on each of them. This way the loop
> iterating through all wbs can easily miss a wb if that wb has already
> passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
> from cgwb_release_workfn() and as a result fully shutdown bdi although
> wb_workfn() for this wb structure is still running. In fact there are
> also other ways cgwb_bdi_unregister() can race with
> cgwb_release_workfn() leading e.g. to use-after-free issues:
> 
> CPU1                            CPU2
>                                 cgwb_bdi_unregister()
>                                   cgwb_kill(*slot);
> 
> cgwb_release()
>   queue_work(cgwb_release_wq, &wb->release_work);
> cgwb_release_workfn()
>                                   wb = list_first_entry(&bdi->wb_list, ...)
>                                   spin_unlock_irq(&cgwb_lock);
>   wb_shutdown(wb);
>   ...
>   kfree_rcu(wb, rcu);
>                                   wb_shutdown(wb); -> oops use-after-free
> 
> We solve these issues by synchronizing writeback structure shutdown from
> cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
> way we also no longer need synchronization using WB_shutting_down as the
> mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
> CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
> bdi_unregister().
> 
> Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
Jan Kara June 19, 2018, 8:41 a.m. UTC | #3
On Mon 18-06-18 23:38:12, Tetsuo Handa wrote:
> On 2018/06/18 22:46, Jan Kara wrote:
> > syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
> 
> [1] https://syzkaller.appspot.com/bug?id=e0818ccb7e46190b3f1038b0c794299208ed4206
> 
> line is missing.
> 
> > wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> > WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> > unregister_bdi() failed to call wb_shutdown() on one of wb objects.
> > 
> > The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
> > drops bdi's reference to wb structures before going through the list of
> > wbs again and calling wb_shutdown() on each of them. This way the loop
> > iterating through all wbs can easily miss a wb if that wb has already
> > passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
> > from cgwb_release_workfn() and as a result fully shutdown bdi although
> > wb_workfn() for this wb structure is still running. In fact there are
> > also other ways cgwb_bdi_unregister() can race with
> > cgwb_release_workfn() leading e.g. to use-after-free issues:
> > 
> > CPU1                            CPU2
> >                                 cgwb_bdi_unregister()
> >                                   cgwb_kill(*slot);
> > 
> > cgwb_release()
> >   queue_work(cgwb_release_wq, &wb->release_work);
> > cgwb_release_workfn()
> >                                   wb = list_first_entry(&bdi->wb_list, ...)
> >                                   spin_unlock_irq(&cgwb_lock);
> >   wb_shutdown(wb);
> >   ...
> >   kfree_rcu(wb, rcu);
> >                                   wb_shutdown(wb); -> oops use-after-free
> > 
> > We solve these issues by synchronizing writeback structure shutdown from
> > cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
> > way we also no longer need synchronization using WB_shutting_down as the
> > mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
> > CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
> > bdi_unregister().
> 
> Wow, this patch removes WB_shutting_down.

Yes.

> A bit of worry for me is how long will this mutex_lock() sleep, for
> if there are a lot of wb objects to shutdown, sequentially doing
> wb_shutdown() might block someone's mutex_lock() for longer than
> khungtaskd's timeout period (typically 120 seconds) ?

That's a good question but since the bdi is going away in this case I
don't think the flusher work should take long to complete - the device is
removed from the system at this point so it won't do any IO.

								Honza
Jan Kara June 22, 2018, 8:52 a.m. UTC | #4
On Mon 18-06-18 10:40:14, Tejun Heo wrote:
> On Mon, Jun 18, 2018 at 03:46:58PM +0200, Jan Kara wrote:
> > syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
> > wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
> > WB_shutting_down after wb->bdi->dev became NULL. This indicates that
> > unregister_bdi() failed to call wb_shutdown() on one of wb objects.
> > 
> > The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
> > drops bdi's reference to wb structures before going through the list of
> > wbs again and calling wb_shutdown() on each of them. This way the loop
> > iterating through all wbs can easily miss a wb if that wb has already
> > passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
> > from cgwb_release_workfn() and as a result fully shutdown bdi although
> > wb_workfn() for this wb structure is still running. In fact there are
> > also other ways cgwb_bdi_unregister() can race with
> > cgwb_release_workfn() leading e.g. to use-after-free issues:
> > 
> > CPU1                            CPU2
> >                                 cgwb_bdi_unregister()
> >                                   cgwb_kill(*slot);
> > 
> > cgwb_release()
> >   queue_work(cgwb_release_wq, &wb->release_work);
> > cgwb_release_workfn()
> >                                   wb = list_first_entry(&bdi->wb_list, ...)
> >                                   spin_unlock_irq(&cgwb_lock);
> >   wb_shutdown(wb);
> >   ...
> >   kfree_rcu(wb, rcu);
> >                                   wb_shutdown(wb); -> oops use-after-free
> > 
> > We solve these issues by synchronizing writeback structure shutdown from
> > cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
> > way we also no longer need synchronization using WB_shutting_down as the
> > mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
> > CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
> > bdi_unregister().
> > 
> > Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Acked-by: Tejun Heo <tj@kernel.org>

OK, Jens, can you please pick up the fix? Thanks!

								Honza
Jens Axboe June 22, 2018, 6:08 p.m. UTC | #5
On 6/22/18 2:52 AM, Jan Kara wrote:
> On Mon 18-06-18 10:40:14, Tejun Heo wrote:
>> On Mon, Jun 18, 2018 at 03:46:58PM +0200, Jan Kara wrote:
>>> syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
>>> wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
>>> WB_shutting_down after wb->bdi->dev became NULL. This indicates that
>>> unregister_bdi() failed to call wb_shutdown() on one of wb objects.
>>>
>>> The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
>>> drops bdi's reference to wb structures before going through the list of
>>> wbs again and calling wb_shutdown() on each of them. This way the loop
>>> iterating through all wbs can easily miss a wb if that wb has already
>>> passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
>>> from cgwb_release_workfn() and as a result fully shutdown bdi although
>>> wb_workfn() for this wb structure is still running. In fact there are
>>> also other ways cgwb_bdi_unregister() can race with
>>> cgwb_release_workfn() leading e.g. to use-after-free issues:
>>>
>>> CPU1                            CPU2
>>>                                 cgwb_bdi_unregister()
>>>                                   cgwb_kill(*slot);
>>>
>>> cgwb_release()
>>>   queue_work(cgwb_release_wq, &wb->release_work);
>>> cgwb_release_workfn()
>>>                                   wb = list_first_entry(&bdi->wb_list, ...)
>>>                                   spin_unlock_irq(&cgwb_lock);
>>>   wb_shutdown(wb);
>>>   ...
>>>   kfree_rcu(wb, rcu);
>>>                                   wb_shutdown(wb); -> oops use-after-free
>>>
>>> We solve these issues by synchronizing writeback structure shutdown from
>>> cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
>>> way we also no longer need synchronization using WB_shutting_down as the
>>> mutex provides it for CONFIG_CGROUP_WRITEBACK case and without
>>> CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
>>> bdi_unregister().
>>>
>>> Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
>>> Signed-off-by: Jan Kara <jack@suse.cz>
>>
>> Acked-by: Tejun Heo <tj@kernel.org>
> 
> OK, Jens, can you please pick up the fix? Thanks!

Applied, thanks!
diff mbox

Patch

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 0bd432a4d7bd..24251762c20c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -22,7 +22,6 @@  struct dentry;
  */
 enum wb_state {
 	WB_registered,		/* bdi_register() was done */
-	WB_shutting_down,	/* wb_shutdown() in progress */
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 	WB_start_all,		/* nr_pages == 0 (all) work pending */
@@ -189,6 +188,7 @@  struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
 	struct rb_root cgwb_congested_tree; /* their congested states */
+	struct mutex cgwb_release_mutex;  /* protect shutdown of wb structs */
 #else
 	struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 347cc834c04a..2e5d3df0853d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -359,15 +359,8 @@  static void wb_shutdown(struct bdi_writeback *wb)
 	spin_lock_bh(&wb->work_lock);
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
-		/*
-		 * Wait for wb shutdown to finish if someone else is just
-		 * running wb_shutdown(). Otherwise we could proceed to wb /
-		 * bdi destruction before wb_shutdown() is finished.
-		 */
-		wait_on_bit(&wb->state, WB_shutting_down, TASK_UNINTERRUPTIBLE);
 		return;
 	}
-	set_bit(WB_shutting_down, &wb->state);
 	spin_unlock_bh(&wb->work_lock);
 
 	cgwb_remove_from_bdi_list(wb);
@@ -379,12 +372,6 @@  static void wb_shutdown(struct bdi_writeback *wb)
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 	flush_delayed_work(&wb->dwork);
 	WARN_ON(!list_empty(&wb->work_list));
-	/*
-	 * Make sure bit gets cleared after shutdown is finished. Matches with
-	 * the barrier provided by test_and_clear_bit() above.
-	 */
-	smp_wmb();
-	clear_and_wake_up_bit(WB_shutting_down, &wb->state);
 }
 
 static void wb_exit(struct bdi_writeback *wb)
@@ -508,10 +495,12 @@  static void cgwb_release_workfn(struct work_struct *work)
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
 
+	mutex_lock(&wb->bdi->cgwb_release_mutex);
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
 	css_put(wb->blkcg_css);
+	mutex_unlock(&wb->bdi->cgwb_release_mutex);
 
 	fprop_local_destroy_percpu(&wb->memcg_completions);
 	percpu_ref_exit(&wb->refcnt);
@@ -697,6 +686,7 @@  static int cgwb_bdi_init(struct backing_dev_info *bdi)
 
 	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
 	bdi->cgwb_congested_tree = RB_ROOT;
+	mutex_init(&bdi->cgwb_release_mutex);
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
@@ -717,7 +707,10 @@  static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	spin_lock_irq(&cgwb_lock);
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
+	spin_unlock_irq(&cgwb_lock);
 
+	mutex_lock(&bdi->cgwb_release_mutex);
+	spin_lock_irq(&cgwb_lock);
 	while (!list_empty(&bdi->wb_list)) {
 		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
 				      bdi_node);
@@ -726,6 +719,7 @@  static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 		spin_lock_irq(&cgwb_lock);
 	}
 	spin_unlock_irq(&cgwb_lock);
+	mutex_unlock(&bdi->cgwb_release_mutex);
 }
 
 /**