Message ID | 5d98091d3e089b4f74cb61fb2ed691e1f4dd1d6b.1541005462.git.osandov@fb.com (mailing list archive)
---|---
State | New, archived
Series | [v2] Btrfs: fix missing delayed iputs on unmount
On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> There's a race between close_ctree() and cleaner_kthread().
> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> sees it set, but this is racy; the cleaner might have already checked
> the bit and could be cleaning stuff. In particular, if it deletes unused
> block groups, it will create delayed iputs for the free space cache
> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> longer running delayed iputs after a commit. Therefore, if the cleaner
> creates more delayed iputs after delayed iputs are run in
> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> inode crash from the VFS.
>
> Fix it by parking the cleaner

Ouch, that's IMO the wrong way to fix it. The bug is on a higher level:
we're missing a commit or a cleanup of data structures. Messing with the
state of a thread would be the last thing I'd try, after proving that it's
not possible to fix it in the logic of btrfs itself.

The shutdown sequence in close_ctree() is quite tricky and we've had bugs
there. The interdependencies of threads, data structures and other
subsystems must not contain loops for which we could not find an ordering
that will not leak something.

It's not a big problem if some step is done more than once, like committing
or cleaning up some other structures, if we know that it could create new
work.
On 1 Nov 2018, at 6:15, David Sterba wrote:

> On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
>> [...]
>> Fix it by parking the cleaner
>
> Ouch, that's IMO wrong way to fix it. The bug is on a higher level,
> we're missing a commit or clean up data structures. Messing with state
> of a thread would be the last thing I'd try after proving that it's not
> possible to fix in the logic of btrfs itself.
>
> [...]
>
> It's not a big problem if some step is done more than once, like
> committing or cleaning up some other structures if we know that
> it could create new.

The problem is that the cleaner thread needs to be told to stop doing new
work, and we need to wait for the work it's already doing to be finished.
We're getting "stop doing new work" already, because the cleaner thread
checks to see if the FS is closing, but today we don't have a way to wait
for it to finish what it's already doing.

kthread_park() is basically the same as adding another mutex or
synchronization point. I'm not sure how we could change close_ctree() or
the final commit to pick this up more effectively?

-chris
On 31.10.18 г. 19:06 ч., Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> [...]
>
> Fix it by parking the cleaner before we actually close anything. Then,
> any remaining delayed iputs will always be handled in
> btrfs_commit_super(). This also ensures that the commit in close_ctree()
> is really the last commit, so we can get rid of the commit in
> cleaner_kthread().
>
> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Also, I believe this patch renders the wake_up_process() in
btrfs_commit_super() a no-op, so it can also be removed, which leaves a
single place that could wake up the cleaner: transaction_kthread.

So can't we stop the transaction and cleaner threads right after setting
CLOSING_FS, and commit the transaction in close_ctree() whenever we deem
necessary (in btrfs_commit_super(), for example)?
> ---
> Changes from v1:
>
> - Add a comment explaining why it needs to be a kthread_park(), not
>   kthread_stop()
> - Update later comment now that the cleaner thread is definitely stopped
>
>  fs/btrfs/disk-io.c | 51 ++++++++++++++--------------------------------
>  1 file changed, 15 insertions(+), 36 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b0ab41da91d1..40bcc45d827d 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
>  	struct btrfs_root *root = arg;
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	int again;
> -	struct btrfs_trans_handle *trans;
>
> -	do {
> +	while (1) {
>  		again = 0;
>
>  		/* Make the cleaner go to sleep early. */
> @@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
>  		 */
>  		btrfs_delete_unused_bgs(fs_info);
>  sleep:
> +		if (kthread_should_park())
> +			kthread_parkme();
> +		if (kthread_should_stop())
> +			return 0;
>  		if (!again) {
>  			set_current_state(TASK_INTERRUPTIBLE);
> -			if (!kthread_should_stop())
> -				schedule();
> +			schedule();
>  			__set_current_state(TASK_RUNNING);
>  		}
> -	} while (!kthread_should_stop());
> -
> -	/*
> -	 * Transaction kthread is stopped before us and wakes us up.
> -	 * However we might have started a new transaction and COWed some
> -	 * tree blocks when deleting unused block groups for example. So
> -	 * make sure we commit the transaction we started to have a clean
> -	 * shutdown when evicting the btree inode - if it has dirty pages
> -	 * when we do the final iput() on it, eviction will trigger a
> -	 * writeback for it which will fail with null pointer dereferences
> -	 * since work queues and other resources were already released and
> -	 * destroyed by the time the iput/eviction/writeback is made.
> -	 */
> -	trans = btrfs_attach_transaction(root);
> -	if (IS_ERR(trans)) {
> -		if (PTR_ERR(trans) != -ENOENT)
> -			btrfs_err(fs_info,
> -				  "cleaner transaction attach returned %ld",
> -				  PTR_ERR(trans));
> -	} else {
> -		int ret;
> -
> -		ret = btrfs_commit_transaction(trans);
> -		if (ret)
> -			btrfs_err(fs_info,
> -				  "cleaner open transaction commit returned %d",
> -				  ret);
> -	}
> -
> -	return 0;
>  }
>
>  static int transaction_kthread(void *arg)
> @@ -3931,6 +3904,13 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>  	int ret;
>
>  	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> +	/*
> +	 * We don't want the cleaner to start new transactions, add more delayed
> +	 * iputs, etc. while we're closing. We can't use kthread_stop() yet
> +	 * because that frees the task_struct, and the transaction kthread might
> +	 * still try to wake up the cleaner.
> +	 */
> +	kthread_park(fs_info->cleaner_kthread);
>
>  	/* wait for the qgroup rescan worker to stop */
>  	btrfs_qgroup_wait_for_completion(fs_info, false);
> @@ -3958,9 +3938,8 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>
>  	if (!sb_rdonly(fs_info->sb)) {
>  		/*
> -		 * If the cleaner thread is stopped and there are
> -		 * block groups queued for removal, the deletion will be
> -		 * skipped when we quit the cleaner thread.
> +		 * The cleaner kthread is stopped, so do one final pass over
> +		 * unused block groups.
>  		 */
>  		btrfs_delete_unused_bgs(fs_info);
On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> On 1 Nov 2018, at 6:15, David Sterba wrote:
>> [...]
>
> The problem is the cleaner thread needs to be told to stop doing new
> work, and we need to wait for the work it's already doing to be
> finished. We're getting "stop doing new work" already because the
> cleaner thread checks to see if the FS is closing, but we don't have a
> way today to wait for him to finish what he's already doing.
>
> kthread_park() is basically the same as adding another mutex or
> synchronization point. I'm not sure how we could change close_tree() or
> the final commit to pick this up more effectively?

The idea is:

      cleaner                        close_ctree thread

                                     tell cleaner to stop
                                     wait
   start work
   if should_stop, then exit
                                     cleaner is stopped

   [does not run: finish work]
   [does not run: loop]
                                     pick up the work or make
                                     sure there's nothing in
                                     progress anymore

A simplified version in code:

	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);

	wait for defrag - could be started from cleaner but next iteration
	will see the fs closed and will not continue

	kthread_stop(transaction_kthread)

	kthread_stop(cleaner_kthread)

	/* next, everything that could be left from cleaner should be finished */

	btrfs_delete_unused_bgs();
	assert there are no defrags
	assert there are no delayed iputs
	commit if necessary

IOW, the unused block groups are removed from close_ctree() too early;
moving that after the threads stop AND making sure that it does not need
either of them should work.

The "AND" above is not currently implemented, as btrfs_delete_unused_bgs()
calls plain btrfs_end_transaction(), which wakes up the transaction
kthread, so an argument would need to be passed to tell it to do a full
commit.
On 1.11.18 г. 16:35 ч., Nikolay Borisov wrote:
> On 31.10.18 г. 19:06 ч., Omar Sandoval wrote:
>> [...]
>
> Also I believe this patch renders the wake_up_process in
> btrfs_commit_super a null op so it can also be removed, which leaves a
> single place that could wake up the cleaner - transaction_kthread.
>
> So can't we stop transaction and cleaner thread right after setting
> CLOSING_FS. And commit the transaction in close_ctree whenever we deem
> necessary (in btrfs_commit_super for example) ?

Ok, that won't work, because btrfs_commit_super() is called in other
contexts where it can genuinely wake up the transaction kthread, blimey.
Why can't we stop the transaction kthread first thing in close_ctree()?
>> [...]
On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
>> [...]
>
> [...]
>
> The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> calls plain btrfs_end_transaction that wakes up transaction ktread, so
> there would need to be an argument passed to tell it to do full commit.

Not perfect - it relies on the fact that wake_up_process(thread) on a
stopped thread is a no-op - and arguments would need to be added to skip
that in btrfs_delete_unused_bgs and btrfs_commit_super.
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3956,11 +3956,18 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
 	cancel_work_sync(&fs_info->async_reclaim_work);
 
+	if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
+	    test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
+		btrfs_error_commit_super(fs_info);
+
+	kthread_stop(fs_info->transaction_kthread);
+	kthread_stop(fs_info->cleaner_kthread);
+
 	if (!sb_rdonly(fs_info->sb)) {
 		/*
-		 * If the cleaner thread is stopped and there are
-		 * block groups queued for removal, the deletion will be
-		 * skipped when we quit the cleaner thread.
+		 * The cleaner thread is now stopped and if there are block
+		 * groups queued for removal, we have to pick up the work here
+		 * so there are no delayed iputs triggered.
 		 */
 		btrfs_delete_unused_bgs(fs_info);
 
@@ -3969,13 +3976,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 			btrfs_err(fs_info, "commit super ret %d", ret);
 	}
 
-	if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
-	    test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
-		btrfs_error_commit_super(fs_info);
-
-	kthread_stop(fs_info->transaction_kthread);
-	kthread_stop(fs_info->cleaner_kthread);
-
 	ASSERT(list_empty(&fs_info->delayed_iputs));
 	set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
>> [...]
>
> [...]
>
> IOW the unused block groups are removed from close_ctree too early,
> moving that after the threads stop AND makins sure that it does not need
> either of them should work.
>
> The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> calls plain btrfs_end_transaction that wakes up transaction ktread, so
> there would need to be an argument passed to tell it to do full commit.

It's too fragile to run around in the filesystem with these threads
freed. We can probably make it work now, but I'm worried that we'll add
another wakeup somewhere and blow up.
On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
>> [...]
>
> Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> thread is a no-op,

How is that? kthread_stop() frees the task struct, so wake_up_process()
would be a use-after-free.
On Thu, Nov 01, 2018 at 08:23:17AM -0700, Omar Sandoval wrote: > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote: > > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote: > > > On 1 Nov 2018, at 6:15, David Sterba wrote: > > > > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote: > > > >> From: Omar Sandoval <osandov@fb.com> > > > >> > > > >> There's a race between close_ctree() and cleaner_kthread(). > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it > > > >> sees it set, but this is racy; the cleaner might have already checked > > > >> the bit and could be cleaning stuff. In particular, if it deletes > > > >> unused > > > >> block groups, it will create delayed iputs for the free space cache > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no > > > >> longer running delayed iputs after a commit. Therefore, if the > > > >> cleaner > > > >> creates more delayed iputs after delayed iputs are run in > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy > > > >> inode crash from the VFS. > > > >> > > > >> Fix it by parking the cleaner > > > > > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level, > > > > we're missing a commit or clean up data structures. Messing with state > > > > of a thread would be the last thing I'd try after proving that it's > > > > not > > > > possible to fix in the logic of btrfs itself. > > > > > > > > The shutdown sequence in close_tree is quite tricky and we've had bugs > > > > there. The interdependencies of thread and data structures and other > > > > subsystems cannot have loops that could not find an ordering that will > > > > not leak something. > > > > > > > > It's not a big problem if some step is done more than once, like > > > > committing or cleaning up some other structures if we know that > > > > it could create new. 
> > > > > > The problem is the cleaner thread needs to be told to stop doing new > > > work, and we need to wait for the work it's already doing to be > > > finished. We're getting "stop doing new work" already because the > > > cleaner thread checks to see if the FS is closing, but we don't have a > > > way today to wait for him to finish what he's already doing. > > > > > > kthread_park() is basically the same as adding another mutex or > > > synchronization point. I'm not sure how we could change close_tree() or > > > the final commit to pick this up more effectively? > > > > The idea is: > > > > cleaner close_ctree thread > > > > tell cleaner to stop > > wait > > start work > > if should_stop, then exit > > cleaner is stopped > > > > [does not run: finish work] > > [does not run: loop] > > pick up the work or make > > sure there's nothing in > > progress anymore > > > > > > A simplified version in code: > > > > set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags); > > > > wait for defrag - could be started from cleaner but next iteration will > > see the fs closed and will not continue > > > > kthread_stop(transaction_kthread) > > > > kthread_stop(cleaner_kthread) > > > > /* next, everything that could be left from cleaner should be finished */ > > > > btrfs_delete_unused_bgs(); > > assert there are no defrags > > assert there are no delayed iputs > > commit if necessary > > > > IOW the unused block groups are removed from close_ctree too early, > > moving that after the threads stop AND makins sure that it does not need > > either of them should work. > > > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs > > calls plain btrfs_end_transaction that wakes up transaction ktread, so > > there would need to be an argument passed to tell it to do full commit. > > It's too fragile to run around in the filesystem with these threads > freed. We can probably make it now, but I'm worried that we'll add > another wakeup somewhere and blow up. 
So functions that are called from close_ctree() will have to be careful to
either not wake up the threads at all, or to check before trying to wake
them up. We can also wrap the wakeups like:

	wake_up_cleaner(fs_info)
	{
		if (!(fs_info->state & CLOSING))
			wake_up_process(cleaner_kthread);
		else
			BUG_ON();
	}

Same for the transaction kthread.
On Thu, Nov 01, 2018 at 08:24:25AM -0700, Omar Sandoval wrote: > On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote: > > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote: > > > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote: > > > > On 1 Nov 2018, at 6:15, David Sterba wrote: > > > > > > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote: > > > > >> From: Omar Sandoval <osandov@fb.com> > > > > >> > > > > >> There's a race between close_ctree() and cleaner_kthread(). > > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it > > > > >> sees it set, but this is racy; the cleaner might have already checked > > > > >> the bit and could be cleaning stuff. In particular, if it deletes > > > > >> unused > > > > >> block groups, it will create delayed iputs for the free space cache > > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no > > > > >> longer running delayed iputs after a commit. Therefore, if the > > > > >> cleaner > > > > >> creates more delayed iputs after delayed iputs are run in > > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy > > > > >> inode crash from the VFS. > > > > >> > > > > >> Fix it by parking the cleaner > > > > > > > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level, > > > > > we're missing a commit or clean up data structures. Messing with state > > > > > of a thread would be the last thing I'd try after proving that it's > > > > > not > > > > > possible to fix in the logic of btrfs itself. > > > > > > > > > > The shutdown sequence in close_tree is quite tricky and we've had bugs > > > > > there. The interdependencies of thread and data structures and other > > > > > subsystems cannot have loops that could not find an ordering that will > > > > > not leak something. 
> > > > > > > > > > It's not a big problem if some step is done more than once, like > > > > > committing or cleaning up some other structures if we know that > > > > > it could create new. > > > > > > > > The problem is the cleaner thread needs to be told to stop doing new > > > > work, and we need to wait for the work it's already doing to be > > > > finished. We're getting "stop doing new work" already because the > > > > cleaner thread checks to see if the FS is closing, but we don't have a > > > > way today to wait for him to finish what he's already doing. > > > > > > > > kthread_park() is basically the same as adding another mutex or > > > > synchronization point. I'm not sure how we could change close_tree() or > > > > the final commit to pick this up more effectively? > > > > > > The idea is: > > > > > > cleaner close_ctree thread > > > > > > tell cleaner to stop > > > wait > > > start work > > > if should_stop, then exit > > > cleaner is stopped > > > > > > [does not run: finish work] > > > [does not run: loop] > > > pick up the work or make > > > sure there's nothing in > > > progress anymore > > > > > > > > > A simplified version in code: > > > > > > set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags); > > > > > > wait for defrag - could be started from cleaner but next iteration will > > > see the fs closed and will not continue > > > > > > kthread_stop(transaction_kthread) > > > > > > kthread_stop(cleaner_kthread) > > > > > > /* next, everything that could be left from cleaner should be finished */ > > > > > > btrfs_delete_unused_bgs(); > > > assert there are no defrags > > > assert there are no delayed iputs > > > commit if necessary > > > > > > IOW the unused block groups are removed from close_ctree too early, > > > moving that after the threads stop AND makins sure that it does not need > > > either of them should work. 
> > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > > calls plain btrfs_end_transaction that wakes up the transaction kthread,
> > > so there would need to be an argument passed to tell it to do a full
> > > commit.
> >
> > Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> > thread is a no-op,
>
> How is that? kthread_stop() frees the task struct, so wake_up_process()
> would be a use-after-free.

(Indirectly, through the kthread calling do_exit() -> do_task_dead() ->
being put in the scheduler.)
On Thu, Nov 01, 2018 at 08:24:25AM -0700, Omar Sandoval wrote: > On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote: > > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote: > > > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote: > > > > On 1 Nov 2018, at 6:15, David Sterba wrote: > > > > > > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote: > > > > >> From: Omar Sandoval <osandov@fb.com> > > > > >> > > > > >> There's a race between close_ctree() and cleaner_kthread(). > > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it > > > > >> sees it set, but this is racy; the cleaner might have already checked > > > > >> the bit and could be cleaning stuff. In particular, if it deletes > > > > >> unused > > > > >> block groups, it will create delayed iputs for the free space cache > > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no > > > > >> longer running delayed iputs after a commit. Therefore, if the > > > > >> cleaner > > > > >> creates more delayed iputs after delayed iputs are run in > > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy > > > > >> inode crash from the VFS. > > > > >> > > > > >> Fix it by parking the cleaner > > > > > > > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level, > > > > > we're missing a commit or clean up data structures. Messing with state > > > > > of a thread would be the last thing I'd try after proving that it's > > > > > not > > > > > possible to fix in the logic of btrfs itself. > > > > > > > > > > The shutdown sequence in close_tree is quite tricky and we've had bugs > > > > > there. The interdependencies of thread and data structures and other > > > > > subsystems cannot have loops that could not find an ordering that will > > > > > not leak something. 
> > > > > > > > > > It's not a big problem if some step is done more than once, like > > > > > committing or cleaning up some other structures if we know that > > > > > it could create new. > > > > > > > > The problem is the cleaner thread needs to be told to stop doing new > > > > work, and we need to wait for the work it's already doing to be > > > > finished. We're getting "stop doing new work" already because the > > > > cleaner thread checks to see if the FS is closing, but we don't have a > > > > way today to wait for him to finish what he's already doing. > > > > > > > > kthread_park() is basically the same as adding another mutex or > > > > synchronization point. I'm not sure how we could change close_tree() or > > > > the final commit to pick this up more effectively? > > > > > > The idea is: > > > > > > cleaner close_ctree thread > > > > > > tell cleaner to stop > > > wait > > > start work > > > if should_stop, then exit > > > cleaner is stopped > > > > > > [does not run: finish work] > > > [does not run: loop] > > > pick up the work or make > > > sure there's nothing in > > > progress anymore > > > > > > > > > A simplified version in code: > > > > > > set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags); > > > > > > wait for defrag - could be started from cleaner but next iteration will > > > see the fs closed and will not continue > > > > > > kthread_stop(transaction_kthread) > > > > > > kthread_stop(cleaner_kthread) > > > > > > /* next, everything that could be left from cleaner should be finished */ > > > > > > btrfs_delete_unused_bgs(); > > > assert there are no defrags > > > assert there are no delayed iputs > > > commit if necessary > > > > > > IOW the unused block groups are removed from close_ctree too early, > > > moving that after the threads stop AND makins sure that it does not need > > > either of them should work. 
> > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > > calls plain btrfs_end_transaction that wakes up the transaction kthread,
> > > so there would need to be an argument passed to tell it to do a full
> > > commit.
> >
> > Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> > thread is a no-op,
>
> How is that? kthread_stop() frees the task struct, so wake_up_process()
> would be a use-after-free.

That was an assumption for demonstration purposes; the wording was
confusing, sorry.
On Thu, Nov 01, 2018 at 04:29:48PM +0100, David Sterba wrote: > On Thu, Nov 01, 2018 at 08:24:25AM -0700, Omar Sandoval wrote: > > On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote: > > > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote: > > > > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote: > > > > > On 1 Nov 2018, at 6:15, David Sterba wrote: > > > > > > > > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote: > > > > > >> From: Omar Sandoval <osandov@fb.com> > > > > > >> > > > > > >> There's a race between close_ctree() and cleaner_kthread(). > > > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it > > > > > >> sees it set, but this is racy; the cleaner might have already checked > > > > > >> the bit and could be cleaning stuff. In particular, if it deletes > > > > > >> unused > > > > > >> block groups, it will create delayed iputs for the free space cache > > > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no > > > > > >> longer running delayed iputs after a commit. Therefore, if the > > > > > >> cleaner > > > > > >> creates more delayed iputs after delayed iputs are run in > > > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy > > > > > >> inode crash from the VFS. > > > > > >> > > > > > >> Fix it by parking the cleaner > > > > > > > > > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level, > > > > > > we're missing a commit or clean up data structures. Messing with state > > > > > > of a thread would be the last thing I'd try after proving that it's > > > > > > not > > > > > > possible to fix in the logic of btrfs itself. > > > > > > > > > > > > The shutdown sequence in close_tree is quite tricky and we've had bugs > > > > > > there. The interdependencies of thread and data structures and other > > > > > > subsystems cannot have loops that could not find an ordering that will > > > > > > not leak something. 
> > > > > > > > > > > > It's not a big problem if some step is done more than once, like > > > > > > committing or cleaning up some other structures if we know that > > > > > > it could create new. > > > > > > > > > > The problem is the cleaner thread needs to be told to stop doing new > > > > > work, and we need to wait for the work it's already doing to be > > > > > finished. We're getting "stop doing new work" already because the > > > > > cleaner thread checks to see if the FS is closing, but we don't have a > > > > > way today to wait for him to finish what he's already doing. > > > > > > > > > > kthread_park() is basically the same as adding another mutex or > > > > > synchronization point. I'm not sure how we could change close_tree() or > > > > > the final commit to pick this up more effectively? > > > > > > > > The idea is: > > > > > > > > cleaner close_ctree thread > > > > > > > > tell cleaner to stop > > > > wait > > > > start work > > > > if should_stop, then exit > > > > cleaner is stopped > > > > > > > > [does not run: finish work] > > > > [does not run: loop] > > > > pick up the work or make > > > > sure there's nothing in > > > > progress anymore > > > > > > > > > > > > A simplified version in code: > > > > > > > > set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags); > > > > > > > > wait for defrag - could be started from cleaner but next iteration will > > > > see the fs closed and will not continue > > > > > > > > kthread_stop(transaction_kthread) > > > > > > > > kthread_stop(cleaner_kthread) > > > > > > > > /* next, everything that could be left from cleaner should be finished */ > > > > > > > > btrfs_delete_unused_bgs(); > > > > assert there are no defrags > > > > assert there are no delayed iputs > > > > commit if necessary > > > > > > > > IOW the unused block groups are removed from close_ctree too early, > > > > moving that after the threads stop AND makins sure that it does not need > > > > either of them should work. 
> > > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs
> > > > calls plain btrfs_end_transaction that wakes up the transaction kthread,
> > > > so there would need to be an argument passed to tell it to do a full
> > > > commit.
> > >
> > > Not perfect, relies on the fact that wake_up_process(thread) on a stopped
> > > thread is a no-op,
> >
> > How is that? kthread_stop() frees the task struct, so wake_up_process()
> > would be a use-after-free.
>
> That was an assumption for demonstration purposes; the wording was
> confusing, sorry.

Oh, well in that case, that's exactly what kthread_park() is ;) Stop the
thread and make wake_up a no-op, and then we don't need to add special
cases everywhere else.
On Thu, Nov 01, 2018 at 09:00:56AM -0700, Omar Sandoval wrote:
> > That was an assumption for demonstration purposes; the wording was
> > confusing, sorry.
>
> Oh, well in that case, that's exactly what kthread_park() is ;) Stop the
> thread and make wake_up a no-op, and then we don't need to add special
> cases everywhere else.

The end result is and should be equivalent; I'm now weighing both
approaches. Explicit state tracking outside of the thread structures
seems clearer, and if we want to query the status, we can't do that
after kthread_stop() is called. My current version is below.

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 68ca41dbbef3..cc076ae45ced 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -715,6 +715,7 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_BARRIER		1
 #define BTRFS_FS_CLOSING_START		2
 #define BTRFS_FS_CLOSING_DONE		3
+#define BTRFS_FS_CLOSING_SINGLE		19
 #define BTRFS_FS_LOG_RECOVERING	4
 #define BTRFS_FS_OPEN			5
 #define BTRFS_FS_QUOTA_ENABLED		6
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0ab41da91d1..570096fb3f35 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1801,7 +1801,7 @@ static int transaction_kthread(void *arg)
 			btrfs_end_transaction(trans);
 		}
 sleep:
-		wake_up_process(fs_info->cleaner_kthread);
+		btrfs_wake_up_cleaner(fs_info);
 		mutex_unlock(&fs_info->transaction_kthread_mutex);
 
 		if (unlikely(test_bit(BTRFS_FS_STATE_ERROR,
@@ -3914,7 +3914,7 @@ int btrfs_commit_super(struct btrfs_fs_info *fs_info)
 	mutex_lock(&fs_info->cleaner_mutex);
 	btrfs_run_delayed_iputs(fs_info);
 	mutex_unlock(&fs_info->cleaner_mutex);
-	wake_up_process(fs_info->cleaner_kthread);
+	btrfs_wake_up_cleaner(fs_info);
 
 	/* wait until ongoing cleanup work done */
 	down_write(&fs_info->cleanup_work_sem);
@@ -3956,11 +3956,24 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
 	cancel_work_sync(&fs_info->async_reclaim_work);
 
+	if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
+	    test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
+		btrfs_error_commit_super(fs_info);
+
+	kthread_stop(fs_info->transaction_kthread);
+	kthread_stop(fs_info->cleaner_kthread);
+
+	/*
+	 * btrfs_delete_unused_bgs or btrfs_commit_super can still try to wake
+	 * up the threads but will become a no-op
+	 */
+	set_bit(BTRFS_FS_CLOSING_SINGLE, &fs_info->flags);
+
 	if (!sb_rdonly(fs_info->sb)) {
 		/*
-		 * If the cleaner thread is stopped and there are
-		 * block groups queued for removal, the deletion will be
-		 * skipped when we quit the cleaner thread.
+		 * The cleaner thread is now stopped and if there are block
+		 * groups queued for removal, we have to pick up the work here
+		 * so there are no delayed iputs triggered.
 		 */
 		btrfs_delete_unused_bgs(fs_info);
 
@@ -3969,13 +3982,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 			btrfs_err(fs_info, "commit super ret %d", ret);
 	}
 
-	if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
-	    test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
-		btrfs_error_commit_super(fs_info);
-
-	kthread_stop(fs_info->transaction_kthread);
-	kthread_stop(fs_info->cleaner_kthread);
-
 	ASSERT(list_empty(&fs_info->delayed_iputs));
 	set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a990a9045139..a5786b6e8d37 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -5915,7 +5915,7 @@ long btrfs_ioctl(struct file *file, unsigned int
 		 * namely it pokes the cleaner kthread that will start
 		 * processing uncleaned subvols.
 		 */
-		wake_up_process(fs_info->transaction_kthread);
+		btrfs_wake_up_transaction(fs_info);
 		return ret;
 	}
 	case BTRFS_IOC_START_SYNC:
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b362b45dd757..9e73e35b94ea 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1896,7 +1896,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		set_bit(BTRFS_FS_OPEN, &fs_info->flags);
 	}
 out:
-	wake_up_process(fs_info->transaction_kthread);
+	btrfs_wake_up_transaction(fs_info);
 	btrfs_remount_cleanup(fs_info, old_opts);
 
 	return 0;
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 3717c864ba23..d00311dc76d2 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -154,7 +154,7 @@ static ssize_t btrfs_feature_attr_store(struct kobject *kobj,
 	 * We don't want to do full transaction commit from inside sysfs
 	 */
 	btrfs_set_pending(fs_info, COMMIT);
-	wake_up_process(fs_info->transaction_kthread);
+	btrfs_wake_up_transaction(fs_info);
 
 	return count;
 }
@@ -427,7 +427,7 @@ static ssize_t btrfs_label_store(struct kobject *kobj,
 	 * We don't want to do full transaction commit from inside sysfs
 	 */
 	btrfs_set_pending(fs_info, COMMIT);
-	wake_up_process(fs_info->transaction_kthread);
+	btrfs_wake_up_transaction(fs_info);
 
 	return len;
 }
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index d1eeef9ec5da..3ffed32c680e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -867,7 +867,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 		if (throttle)
 			return btrfs_commit_transaction(trans);
 		else
-			wake_up_process(info->transaction_kthread);
+			btrfs_wake_up_transaction(info);
 	}
 
 	if (trans->type & __TRANS_FREEZABLE)
@@ -889,7 +889,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 
 	if (trans->aborted ||
 	    test_bit(BTRFS_FS_STATE_ERROR, &info->fs_state)) {
-		wake_up_process(info->transaction_kthread);
+		btrfs_wake_up_transaction(info);
 		err = -EIO;
 	}
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 4cbb1b55387d..64a38c4af565 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -199,6 +199,25 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans);
 int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
 				   int wait_for_unblock);
 
+static inline void btrfs_wake_up_cleaner(struct btrfs_fs_info *fs_info)
+{
+	if (!test_bit(BTRFS_FS_CLOSING_SINGLE, &fs_info->flags)) {
+		wake_up_process(fs_info->cleaner_kthread);
+	} else {
+		btrfs_debug(fs_info,
+			"attempt to wake up cleaner kthread during shutdown");
+	}
+}
+static inline void btrfs_wake_up_transaction(struct btrfs_fs_info *fs_info)
+{
+	if (!test_bit(BTRFS_FS_CLOSING_SINGLE, &fs_info->flags)) {
+		wake_up_process(fs_info->transaction_kthread);
+	} else {
+		btrfs_debug(fs_info,
+			"attempt to wake up transaction kthread during shutdown");
+	}
+}
+
 /*
  * Try to commit transaction asynchronously, so this is safe to call
  * even holding a spinlock.
@@ -210,7 +229,7 @@ static inline void btrfs_commit_transaction_locksafe(
 		struct btrfs_fs_info *fs_info)
 {
 	set_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags);
-	wake_up_process(fs_info->transaction_kthread);
+	btrfs_wake_up_transaction(fs_info);
 }
 int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans);
 int btrfs_should_end_transaction(struct btrfs_trans_handle *trans);
On 1.11.18 г. 18:44 ч., David Sterba wrote:
> On Thu, Nov 01, 2018 at 09:00:56AM -0700, Omar Sandoval wrote:
>>> That was an assumption for demonstration purposes; the wording was
>>> confusing, sorry.
>>
>> Oh, well in that case, that's exactly what kthread_park() is ;) Stop the
>> thread and make wake_up a no-op, and then we don't need to add special
>> cases everywhere else.
>
> The end result is and should be equivalent; I'm now weighing both
> approaches. Explicit state tracking outside of the thread structures
> seems clearer, and if we want to query the status, we can't do that
> after kthread_stop() is called. My current version is below.

<snip>

> +static inline void btrfs_wake_up_transaction(struct btrfs_fs_info *fs_info)
> +{
> +	if (!test_bit(BTRFS_FS_CLOSING_SINGLE, &fs_info->flags)) {

Why do we need a separate flag for a predicate? If we use
FS_CLOSING_START then waking up becomes a no-op from the moment
close_ctree is called.

> +		wake_up_process(fs_info->transaction_kthread);
> +	} else {
> +		btrfs_debug(fs_info,
> +			"attempt to wake up transaction kthread during shutdown");
> +	}
> +}
> +
>  /*
>   * Try to commit transaction asynchronously, so this is safe to call
>   * even holding a spinlock.
> @@ -210,7 +229,7 @@ static inline void btrfs_commit_transaction_locksafe(
>  		struct btrfs_fs_info *fs_info)
>  {
>  	set_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags);
> -	wake_up_process(fs_info->transaction_kthread);
> +	btrfs_wake_up_transaction(fs_info);
>  }
>  int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans);
>  int btrfs_should_end_transaction(struct btrfs_trans_handle *trans);
On Thu, Nov 01, 2018 at 06:50:03PM +0200, Nikolay Borisov wrote: > > > On 1.11.18 г. 18:44 ч., David Sterba wrote: > > On Thu, Nov 01, 2018 at 09:00:56AM -0700, Omar Sandoval wrote: > >>> That was an assumption for the demonstration purposes, the wording was > >>> confusing sorry. > >> > >> Oh, well in that case, that's exactly what kthread_park() is ;) Stop the > >> thread and make wake_up a noop, and then we don't need to add special > >> cases everywhere else. > > > > The end result is equivalent and should be, I'm now weighing both > > approaches. Explicit state tracking outside of the thread structures > > seems more clear and in case we want to query the status, we can't after > > kthread_stop is called. My current version below. > > > > <snip> > > > +static inline void btrfs_wake_up_transaction(struct btrfs_fs_info *fs_info) > > +{ > > + if (!test_bit(BTRFS_FS_CLOSING_SINGLE, &fs_info->flags)) { > > Why do we need a separate flag for a predicate? If we use > FS_CLOSING_START then waking up becomes a noop from the moment > close_ctree is called. Yeah it's not strictly necessary, it's a leftover from when I had a WARN_ON in the wakeup helpers that triggered too often so another flag was added just after the threads were stopped.
On 1 Nov 2018, at 12:00, Omar Sandoval wrote:

> On Thu, Nov 01, 2018 at 04:29:48PM +0100, David Sterba wrote:
>>> How is that? kthread_stop() frees the task struct, so
>>> wake_up_process() would be a use-after-free.
>>
>> That was an assumption for demonstration purposes; the wording was
>> confusing, sorry.
>
> Oh, well in that case, that's exactly what kthread_park() is ;) Stop
> the thread and make wake_up a no-op, and then we don't need to add
> special cases everywhere else.

This is how Omar won me over to kthread_park(). I think Dave's
alternatives can be made correct, but Omar's way solves a whole class
of potential problems. The park() makes the thread safely ignore future
work, and we won't have to chase down weird bugs where people send
future work at the wrong time.

-chris
On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> There's a race between close_ctree() and cleaner_kthread().
> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> sees it set, but this is racy; the cleaner might have already checked
> the bit and could be cleaning stuff. In particular, if it deletes unused
> block groups, it will create delayed iputs for the free space cache
> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> longer running delayed iputs after a commit. Therefore, if the cleaner
> creates more delayed iputs after delayed iputs are run in
> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> inode crash from the VFS.
>
> Fix it by parking the cleaner before we actually close anything. Then,
> any remaining delayed iputs will always be handled in
> btrfs_commit_super(). This also ensures that the commit in close_ctree()
> is really the last commit, so we can get rid of the commit in
> cleaner_kthread().
>
> Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
> Signed-off-by: Omar Sandoval <osandov@fb.com>

I'll queue this patch for rc2 as it fixes crashes I see during testing.
My version does more changes and would be more suitable for a series
that could actually document the shutdown sequence and add a few
assertions on top.

Reviewed-by: David Sterba <dsterba@suse.com>
On Wed, Nov 07, 2018 at 05:01:19PM +0100, David Sterba wrote: > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote: > > From: Omar Sandoval <osandov@fb.com> > > > > There's a race between close_ctree() and cleaner_kthread(). > > close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it > > sees it set, but this is racy; the cleaner might have already checked > > the bit and could be cleaning stuff. In particular, if it deletes unused > > block groups, it will create delayed iputs for the free space cache > > inodes. As of "btrfs: don't run delayed_iputs in commit", we're no > > longer running delayed iputs after a commit. Therefore, if the cleaner > > creates more delayed iputs after delayed iputs are run in > > btrfs_commit_super(), we will leak inodes on unmount and get a busy > > inode crash from the VFS. > > > > Fix it by parking the cleaner before we actually close anything. Then, > > any remaining delayed iputs will always be handled in > > btrfs_commit_super(). This also ensures that the commit in close_ctree() > > is really the last commit, so we can get rid of the commit in > > cleaner_kthread(). > > > > Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit") > > Signed-off-by: Omar Sandoval <osandov@fb.com> > > I'll queue this patch for rc2 as it fixes crashes I see during testing. > My version does more changes and would be more suitable for a series, > that could actually document the shutdown sequence and add a few > assertions on top. Thanks, Dave! I'll keep an eye out for the further cleanups. > Reviewed-by: David Sterba <dsterba@suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0ab41da91d1..40bcc45d827d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	int again;
-	struct btrfs_trans_handle *trans;
 
-	do {
+	while (1) {
 		again = 0;
 
 		/* Make the cleaner go to sleep early. */
@@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
 		 */
 		btrfs_delete_unused_bgs(fs_info);
 sleep:
+		if (kthread_should_park())
+			kthread_parkme();
+		if (kthread_should_stop())
+			return 0;
 		if (!again) {
 			set_current_state(TASK_INTERRUPTIBLE);
-			if (!kthread_should_stop())
-				schedule();
+			schedule();
 			__set_current_state(TASK_RUNNING);
 		}
-	} while (!kthread_should_stop());
-
-	/*
-	 * Transaction kthread is stopped before us and wakes us up.
-	 * However we might have started a new transaction and COWed some
-	 * tree blocks when deleting unused block groups for example. So
-	 * make sure we commit the transaction we started to have a clean
-	 * shutdown when evicting the btree inode - if it has dirty pages
-	 * when we do the final iput() on it, eviction will trigger a
-	 * writeback for it which will fail with null pointer dereferences
-	 * since work queues and other resources were already released and
-	 * destroyed by the time the iput/eviction/writeback is made.
-	 */
-	trans = btrfs_attach_transaction(root);
-	if (IS_ERR(trans)) {
-		if (PTR_ERR(trans) != -ENOENT)
-			btrfs_err(fs_info,
-				  "cleaner transaction attach returned %ld",
-				  PTR_ERR(trans));
-	} else {
-		int ret;
-
-		ret = btrfs_commit_transaction(trans);
-		if (ret)
-			btrfs_err(fs_info,
-				  "cleaner open transaction commit returned %d",
-				  ret);
 	}
-
-	return 0;
 }
 
 static int transaction_kthread(void *arg)
@@ -3931,6 +3904,13 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 	int ret;
 
 	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
+	/*
+	 * We don't want the cleaner to start new transactions, add more delayed
+	 * iputs, etc. while we're closing. We can't use kthread_stop() yet
+	 * because that frees the task_struct, and the transaction kthread might
+	 * still try to wake up the cleaner.
+	 */
+	kthread_park(fs_info->cleaner_kthread);
 
 	/* wait for the qgroup rescan worker to stop */
 	btrfs_qgroup_wait_for_completion(fs_info, false);
@@ -3958,9 +3938,8 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
 	if (!sb_rdonly(fs_info->sb)) {
 		/*
-		 * If the cleaner thread is stopped and there are
-		 * block groups queued for removal, the deletion will be
-		 * skipped when we quit the cleaner thread.
+		 * The cleaner kthread is stopped, so do one final pass over
+		 * unused block groups.
 		 */
 		btrfs_delete_unused_bgs(fs_info);