
btrfs: wait on caching when putting the bg cache

Message ID 20180912144545.5564-1-josef@toxicpanda.com (mailing list archive)
State New, archived
Series: btrfs: wait on caching when putting the bg cache

Commit Message

Josef Bacik Sept. 12, 2018, 2:45 p.m. UTC
While testing my backport I noticed there was a panic if I ran
generic/416, generic/417, and generic/418 all in a row.  This just
happened to uncover a race where we still had outstanding IO after
destroying all of our workqueues, and then we'd go to queue the endio
work on those freed workqueues.  This happens because we aren't waiting
for the caching threads to finish before freeing everything up.  Fix it
by waiting on any outstanding caching before we put the block group, so
we're sure to be done with all IO by the time we get to
btrfs_stop_all_workers().  This fixes the panic I was seeing
consistently in testing.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 1 +
 1 file changed, 1 insertion(+)
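
To make the race concrete, here is a minimal userspace analogue (an
illustrative sketch, not btrfs code; every name in it is invented): an
asynchronous "cacher" thread is still running when the teardown path frees
the structure the cacher will later submit its completion work to, which is
the same shape as endio work being queued on workqueues that have already
been destroyed.  It crashes by design; moving the join above the free, which
is roughly what the one-line patch does for the caching threads, makes it
safe.

/*
 * Minimal userspace analogue of the race described above -- NOT btrfs
 * code; all names are invented for illustration.
 * Build with: cc -pthread race.c   (it crashes on purpose)
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct fake_workqueue {
	void (*fn)(void);		/* pretend work item */
};

static struct fake_workqueue *endio_wq;

static void endio_work(void)
{
	puts("endio work ran");
}

/* Stands in for a caching thread: does some "IO", then queues endio work. */
static void *cacher_thread(void *arg)
{
	usleep(100 * 1000);		/* pretend the read takes a while */
	endio_wq->fn = endio_work;	/* the "workqueue" is already gone */
	endio_wq->fn();
	return NULL;
}

int main(void)
{
	pthread_t cacher;

	endio_wq = calloc(1, sizeof(*endio_wq));
	pthread_create(&cacher, NULL, cacher_thread, NULL);

	/*
	 * Buggy teardown order: free the "workqueue" without waiting for
	 * the cacher to finish.  The patch's fix corresponds to moving
	 * the pthread_join() above this free.
	 */
	free(endio_wq);
	endio_wq = NULL;

	pthread_join(cacher, NULL);
	return 0;
}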

Comments

Nikolay Borisov Sept. 12, 2018, 3:15 p.m. UTC | #1
On 12.09.2018 17:45, Josef Bacik wrote:
> While testing my backport I noticed there was a panic if I ran
> generic/416 generic/417 generic/418 all in a row.  This just happened to
> uncover a race where we had outstanding IO after we destroy all of our
> workqueues, and then we'd go to queue the endio work on those free'd
> workqueues.  This is because we aren't waiting for the caching threads
> to be done before freeing everything up, so to fix this make sure we
> wait on any outstanding caching that's being done before we free up the
> block group, so we're sure to be done with all IO by the time we get to
> btrfs_stop_all_workers().  This fixes the panic I was seeing
> consistently in testing.

It's not clear whether this is caused by one of the patches in your
latest patchbomb or whether the issue has been there all along.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 414492a18f1e..2eb2e37f2354 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9889,6 +9889,7 @@ void btrfs_put_block_group_cache(struct btrfs_fs_info *info)
>  
>  		block_group = btrfs_lookup_first_block_group(info, last);
>  		while (block_group) {
> +			wait_block_group_cache_done(block_group);
>  			spin_lock(&block_group->lock);
>  			if (block_group->iref)
>  				break;
>
Josef Bacik Sept. 12, 2018, 3:21 p.m. UTC | #2
On Wed, Sep 12, 2018 at 06:15:41PM +0300, Nikolay Borisov wrote:
> 
> 
> On 12.09.2018 17:45, Josef Bacik wrote:
> > While testing my backport I noticed there was a panic if I ran
> > generic/416 generic/417 generic/418 all in a row.  This just happened to
> > uncover a race where we had outstanding IO after we destroy all of our
> > workqueues, and then we'd go to queue the endio work on those free'd
> > workqueues.  This is because we aren't waiting for the caching threads
> > to be done before freeing everything up, so to fix this make sure we
> > wait on any outstanding caching that's being done before we free up the
> > block group, so we're sure to be done with all IO by the time we get to
> > btrfs_stop_all_workers().  This fixes the panic I was seeing
> > consistently in testing.
> 
> It's not clear whether this is caused by one of the patches in your
> latest patchbomb or has the issue been there all along?

It's been here all along; I noticed it on the backport of linus/master before I
even got to pulling my patches on top of that.  Thanks,

Josef
Omar Sandoval Sept. 12, 2018, 6:26 p.m. UTC | #3
On Wed, Sep 12, 2018 at 10:45:45AM -0400, Josef Bacik wrote:
> While testing my backport I noticed there was a panic if I ran
> generic/416 generic/417 generic/418 all in a row.  This just happened to
> uncover a race where we had outstanding IO after we destroy all of our
> workqueues, and then we'd go to queue the endio work on those free'd
> workqueues.  This is because we aren't waiting for the caching threads
> to be done before freeing everything up, so to fix this make sure we
> wait on any outstanding caching that's being done before we free up the
> block group, so we're sure to be done with all IO by the time we get to
> btrfs_stop_all_workers().  This fixes the panic I was seeing
> consistently in testing.

Reviewed-by: Omar Sandoval <osandov@fb.com>

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent-tree.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 414492a18f1e..2eb2e37f2354 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9889,6 +9889,7 @@ void btrfs_put_block_group_cache(struct btrfs_fs_info *info)
>  
>  		block_group = btrfs_lookup_first_block_group(info, last);
>  		while (block_group) {
> +			wait_block_group_cache_done(block_group);
>  			spin_lock(&block_group->lock);
>  			if (block_group->iref)
>  				break;
> -- 
> 2.14.3
>
David Sterba Sept. 12, 2018, 6:40 p.m. UTC | #4
On Wed, Sep 12, 2018 at 10:45:45AM -0400, Josef Bacik wrote:
> While testing my backport I noticed there was a panic if I ran
> generic/416 generic/417 generic/418 all in a row.  This just happened to
> uncover a race where we had outstanding IO after we destroy all of our
> workqueues, and then we'd go to queue the endio work on those free'd
> workqueues.  This is because we aren't waiting for the caching threads
> to be done before freeing everything up, so to fix this make sure we
> wait on any outstanding caching that's being done before we free up the
> block group, so we're sure to be done with all IO by the time we get to
> btrfs_stop_all_workers().  This fixes the panic I was seeing
> consistently in testing.

Can you please attach the stacktrace(s)?  I think I've seen a similar error
once or twice but have not been able to reproduce it.
David Sterba Sept. 13, 2018, 11:52 a.m. UTC | #5
On Wed, Sep 12, 2018 at 08:40:44PM +0200, David Sterba wrote:
> On Wed, Sep 12, 2018 at 10:45:45AM -0400, Josef Bacik wrote:
> > While testing my backport I noticed there was a panic if I ran
> > generic/416 generic/417 generic/418 all in a row.  This just happened to
> > uncover a race where we had outstanding IO after we destroy all of our
> > workqueues, and then we'd go to queue the endio work on those free'd
> > workqueues.  This is because we aren't waiting for the caching threads
> > to be done before freeing everything up, so to fix this make sure we
> > wait on any outstanding caching that's being done before we free up the
> > block group, so we're sure to be done with all IO by the time we get to
> > btrfs_stop_all_workers().  This fixes the panic I was seeing
> > consistently in testing.
> 
> Can you please attach the stacktrace(s)? I think I've seen similar error
> once or twice but not able to reproduce.

I found at least this one, https://patchwork.kernel.org/patch/10495885/, where
there is still some in-flight IO when the rbio cache is destroyed.  This is not
the example I had in mind before, but it still roughly matches the symptoms.

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 414492a18f1e..2eb2e37f2354 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9889,6 +9889,7 @@  void btrfs_put_block_group_cache(struct btrfs_fs_info *info)
 
 		block_group = btrfs_lookup_first_block_group(info, last);
 		while (block_group) {
+			wait_block_group_cache_done(block_group);
 			spin_lock(&block_group->lock);
 			if (block_group->iref)
 				break;
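
For readers less familiar with the helper: wait_block_group_cache_done()
sleeps until the block group's caching thread has finished, which is
presumably also why the added call sits above the spin_lock() in the hunk;
a sleeping wait cannot be done while holding the spinlock.  Below is a rough
userspace analogue of that wait-before-teardown pattern (again an
illustrative sketch with invented names, not btrfs code): a condition
variable tracks a "cached" flag set by the asynchronous cacher, and teardown
blocks on it before touching anything the cacher might still use.

/*
 * Rough userspace analogue of the wait in the patch -- NOT btrfs code;
 * all names are invented for illustration.
 * Build with: cc -pthread wait_done.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct fake_block_group {
	pthread_mutex_t lock;
	pthread_cond_t  cache_done;
	bool            cached;
};

static struct fake_block_group bg = {
	.lock       = PTHREAD_MUTEX_INITIALIZER,
	.cache_done = PTHREAD_COND_INITIALIZER,
};

/* Stands in for the caching thread filling in free-space data. */
static void *fake_cacher(void *arg)
{
	struct fake_block_group *b = arg;

	usleep(100 * 1000);			/* pretend to do the IO */

	pthread_mutex_lock(&b->lock);
	b->cached = true;			/* caching finished */
	pthread_cond_broadcast(&b->cache_done);
	pthread_mutex_unlock(&b->lock);
	return NULL;
}

/* Analogue of wait_block_group_cache_done(): sleep until caching is done. */
static void fake_wait_cache_done(struct fake_block_group *b)
{
	pthread_mutex_lock(&b->lock);
	while (!b->cached)
		pthread_cond_wait(&b->cache_done, &b->lock);
	pthread_mutex_unlock(&b->lock);
}

int main(void)
{
	pthread_t cacher;

	pthread_create(&cacher, NULL, fake_cacher, &bg);

	/* Teardown waits first, like the one-line fix above. */
	fake_wait_cache_done(&bg);
	puts("caching finished; safe to tear down workqueues now");

	pthread_join(cacher, NULL);
	return 0;
}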