
[v6,1/2] btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error

Message ID 20170306025547.1858-1-quwenruo@cn.fujitsu.com (mailing list archive)
State Superseded

Commit Message

Qu Wenruo March 6, 2017, 2:55 a.m. UTC
[BUG]
When btrfs_reloc_clone_csums() reports an error, it can underflow the
metadata reservation and lead to a kernel assertion on outstanding extents
in run_delalloc_nocow() and cow_file_range().

 BTRFS info (device vdb5): relocating block group 12582912 flags data
 BTRFS info (device vdb5): found 1 extents
 assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug that blocks ordered extents, the bug is only
reproducible with a certain block group layout and error injection:

a) Create one data block group with one 4K extent in it.
   This avoids the bug that hangs btrfs due to an ordered extent that
   never finishes.
b) Make btrfs_reloc_clone_csums() always fail.
c) Relocate that block group.

[CAUSE]
run_delalloc_nocow() and cow_file_range() handle errors from
btrfs_reloc_clone_csums() incorrectly:

(The ASCII chart shows a more generic case of this bug than the one
mentioned above)

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
                    |<----------- cleanup range --------------->|
|<-----------  ----------->|
             \/
 btrfs_finish_ordered_io() range

So the error handler, which calls extent_clear_unlock_delalloc() with the
EXTENT_DELALLOC and EXTENT_DO_ACCOUNTING bits, and btrfs_finish_ordered_io()
will both cover OE n and free its metadata, causing a metadata underflow.
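
To see the accounting effect in isolation, here is a minimal user-space
model of the double free (purely illustrative, not kernel code; the helper
names and counts are made up):

	#include <assert.h>
	#include <stdio.h>

	static int outstanding_extents;

	/* creating an ordered extent takes one unit of reserved metadata */
	static void add_ordered_extent(void)
	{
		outstanding_extents++;
	}

	/* btrfs_finish_ordered_io() releases the reservation of its OE */
	static void finish_ordered_io(void)
	{
		assert(outstanding_extents >= 1);	/* the assertion that fires */
		outstanding_extents--;
	}

	/* error handler: releases reservations for every OE in its range */
	static void cleanup_range(int nr_extents)
	{
		while (nr_extents--) {
			assert(outstanding_extents >= 1);
			outstanding_extents--;
		}
	}

	int main(void)
	{
		add_ordered_extent();	/* OE n was created before the csum error */
		cleanup_range(1);	/* the cleanup range still covers OE n ... */
		finish_ordered_io();	/* ... and so does this: underflow/assert */
		printf("outstanding_extents = %d\n", outstanding_extents);
		return 0;
	}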

[Fix]
The fix is to ensure that, after calling btrfs_add_ordered_extent(), we
only call the error handler after increasing the iteration offset, so that
the cleanup range won't cover any created ordered extent.

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<-----------  ----------->|<---------- cleanup range --------->|
             \/
 btrfs_finish_ordered_io() range
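
The control-flow change boils down to the following toy loop (a sketch of
the idea only, with made-up names, not the actual btrfs code):

	#include <stdio.h>

	#define NR_EXTENTS 4

	int main(void)
	{
		int created[NR_EXTENTS] = { 0 };
		int cur = 0;
		int ret = 0;

		while (cur < NR_EXTENTS) {
			created[cur] = 1;		/* btrfs_add_ordered_extent() succeeded */
			ret = (cur == 2) ? -5 : 0;	/* pretend csum cloning fails on OE 3 */
			cur++;				/* advance the offset first ... */
			if (ret)
				break;			/* ... then jump to the error handler */
		}

		/* error handler: only touches [cur, end), never a created OE */
		for (int i = cur; i < NR_EXTENTS; i++)
			printf("cleanup covers extent %d (created=%d)\n", i, created[i]);
		return 0;
	}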

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
changelog:
v6:
  New, split from v5 patch, as this is a separate bug.
---
 fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 39 insertions(+), 12 deletions(-)

Comments

Filipe Manana March 6, 2017, 11:19 p.m. UTC | #1
On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

thanks

David Sterba March 7, 2017, 5:32 p.m. UTC | #2
On Mon, Mar 06, 2017 at 11:19:32PM +0000, Filipe Manana wrote:
> On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>

1-2 added to for-next and queued for 4.11. Thanks.
Liu Bo March 7, 2017, 8:49 p.m. UTC | #3
On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>  					     delalloc_end, locked_page,
>  					     EXTENT_LOCKED | EXTENT_DELALLOC,
>  					     op);
> -		disk_num_bytes -= cur_alloc_size;
> +		if (disk_num_bytes > cur_alloc_size)
> +			disk_num_bytes = 0;
> +		else
> +			disk_num_bytes -= cur_alloc_size;

I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size?

Thanks,

-liubo

Liu Bo March 7, 2017, 8:51 p.m. UTC | #4
On Tue, Mar 07, 2017 at 06:32:23PM +0100, David Sterba wrote:
> On Mon, Mar 06, 2017 at 11:19:32PM +0000, Filipe Manana wrote:
> > On Mon, Mar 6, 2017 at 2:55 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> > > Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> > 
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> 1-2 added to for-next and queued for 4.11. Thanks.

I'm afraid that patch 1 has a problem; please take a look at the thread.

Thanks,

-liubo
Liu Bo March 7, 2017, 8:59 p.m. UTC | #5
On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote:
> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size?

I assume that you've run fstests against this patch; if so, it actually
worries me that no fstests caught this problem.

Thanks,

-liubo
Filipe Manana March 8, 2017, 12:17 a.m. UTC | #6
On Tue, Mar 7, 2017 at 8:59 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote:
>> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
>> > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>> >                                          delalloc_end, locked_page,
>> >                                          EXTENT_LOCKED | EXTENT_DELALLOC,
>> >                                          op);
>> > -           disk_num_bytes -= cur_alloc_size;
>> > +           if (disk_num_bytes > cur_alloc_size)
>> > +                   disk_num_bytes = 0;
>> > +           else
>> > +                   disk_num_bytes -= cur_alloc_size;
>>
>> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size?
>
> > I assume that you've run fstests against this patch; if so, it actually
> > worries me that no fstests caught this problem.

The operator is definitely wrong, should be < instead of > (previous
patch versions were correct).
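
That is, with the comparison flipped, the corrected hunk would presumably
read:

	if (disk_num_bytes < cur_alloc_size)
		disk_num_bytes = 0;
	else
		disk_num_bytes -= cur_alloc_size;

so the loop stops after the final chunk of the range, matching the intent
of the removed "if (disk_num_bytes < cur_alloc_size) break;" check.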

It's not a surprise fstests don't catch this - one would need a fs
with no more free space to allocate new data chunks and existing data
chunks would have to be fragmented to the point we need to allocate
two or more smaller extents to satisfy the write, so whether fstests
exercises this depends on the size of the test devices (mine are 100G
for example) and the size of the data the tests create. Having a
deterministic test for that should be possible and useful.

Qu Wenruo March 8, 2017, 12:17 a.m. UTC | #7
At 03/08/2017 04:49 AM, Liu Bo wrote:
> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
>> @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
>>  					     delalloc_end, locked_page,
>>  					     EXTENT_LOCKED | EXTENT_DELALLOC,
>>  					     op);
>> -		disk_num_bytes -= cur_alloc_size;
>> +		if (disk_num_bytes > cur_alloc_size)
>> +			disk_num_bytes = 0;
>> +		else
>> +			disk_num_bytes -= cur_alloc_size;
>
> I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size?

Thanks for the catch, it should be disk_num_bytes < cur_alloc_size.

I'll update it.

Thanks,
Qu
Liu Bo March 8, 2017, 1:16 a.m. UTC | #8
On Wed, Mar 08, 2017 at 12:17:16AM +0000, Filipe Manana wrote:
> On Tue, Mar 7, 2017 at 8:59 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
> > On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote:
> >> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
> The operator is definitely wrong, should be < instead of > (previous
> patch versions were correct).
> 
> It's not a surprise fstests don't catch this - one would need a fs
> with no more free space to allocate new data chunks and existing data
> chunks would have to be fragmented to the point we need to allocate
> two or more smaller extents to satisfy the write, so whether fstests
> exercises this depends on the size of the test devices (mine are 100G
> for example) and the size of the data the tests create. Having a
> deterministic test for that should be possible and useful.

I see, I think that 'mount -o fragment=data' could help us get it.

Thanks,

-liubo
Qu Wenruo March 8, 2017, 1:21 a.m. UTC | #9
At 03/08/2017 09:16 AM, Liu Bo wrote:
> On Wed, Mar 08, 2017 at 12:17:16AM +0000, Filipe Manana wrote:
>> On Tue, Mar 7, 2017 at 8:59 PM, Liu Bo <bo.li.liu@oracle.com> wrote:
>>> On Tue, Mar 07, 2017 at 12:49:58PM -0800, Liu Bo wrote:
>>>> On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote:
>> The operator is definitely wrong, should be < instead of > (previous
>> patch versions were correct).
>>
>> It's not a surprise fstests don't catch this - one would need a fs
>> with no more free space to allocate new data chunks and existing data
>> chunks would have to be fragmented to the point we need to allocate
>> two or more smaller extents to satisfy the write, so whether fstests
>> exercises this depends on the size of the test devices (mine are 100G
>> for example) and the size of the data the tests create. Having a
>> deterministic test for that should be possible and useful.
>
> I see, I think that 'mount -o fragment=data' could help us get it.

I'll also submit a test case for this.

By filling the fs with small (sectorsize) files with sequential numbers as
file names.

Then sync the fs, remove the odd-numbered files, and do a large write.

This one should trigger the problem without the use of the debug option,
and can be a generic test.
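
A rough user-space sketch of that reproducer (the mount point, file count
and sizes below are only placeholders, and error handling is minimal):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	#define SCRATCH    "/mnt/scratch"	/* small btrfs scratch mount */
	#define NR_FILES   8192
	#define SECTORSIZE 4096

	int main(void)
	{
		static char buf[SECTORSIZE];
		char path[256];
		int i, fd;

		memset(buf, 0xaa, sizeof(buf));

		/* 1) fill the fs with sequentially named sectorsize files */
		for (i = 0; i < NR_FILES; i++) {
			snprintf(path, sizeof(path), SCRATCH "/f%08d", i);
			fd = open(path, O_CREAT | O_WRONLY, 0644);
			if (fd < 0)
				break;			/* stop once the fs is full */
			if (write(fd, buf, sizeof(buf)) < 0) {
				close(fd);
				break;
			}
			close(fd);
		}
		sync();

		/* 2) remove the odd-numbered files to fragment free space */
		for (i = 1; i < NR_FILES; i += 2) {
			snprintf(path, sizeof(path), SCRATCH "/f%08d", i);
			unlink(path);
		}
		sync();

		/* 3) one large write now needs several small extents */
		fd = open(SCRATCH "/big", O_CREAT | O_WRONLY, 0644);
		if (fd < 0)
			return 1;
		for (i = 0; i < 4096; i++)
			if (write(fd, buf, sizeof(buf)) < 0)
				break;
		close(fd);
		return 0;
	}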

Thanks,
Qu



Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b2bc07aad1ae..1d83d504f2e5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -998,15 +998,24 @@  static noinline int cow_file_range(struct inode *inode,
 		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
 			ret = btrfs_reloc_clone_csums(inode, start,
 						      cur_alloc_size);
+			/*
+			 * Only drop cache here, and process as normal.
+			 *
+			 * We must not allow extent_clear_unlock_delalloc()
+			 * at out_unlock label to free meta of this ordered
+			 * extent, as its meta should be freed by
+			 * btrfs_finish_ordered_io().
+			 *
+			 * So we must continue until @start is increased to
+			 * skip current ordered extent.
+			 */
 			if (ret)
-				goto out_drop_extent_cache;
+				btrfs_drop_extent_cache(BTRFS_I(inode), start,
+						start + ram_size - 1, 0);
 		}
 
 		btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 
-		if (disk_num_bytes < cur_alloc_size)
-			break;
-
 		/* we're not doing compressed IO, don't unlock the first
 		 * page (which the caller expects to stay locked), don't
 		 * clear any dirty bits and don't set any writeback bits
@@ -1022,10 +1031,21 @@  static noinline int cow_file_range(struct inode *inode,
 					     delalloc_end, locked_page,
 					     EXTENT_LOCKED | EXTENT_DELALLOC,
 					     op);
-		disk_num_bytes -= cur_alloc_size;
+		if (disk_num_bytes > cur_alloc_size)
+			disk_num_bytes = 0;
+		else
+			disk_num_bytes -= cur_alloc_size;
 		num_bytes -= cur_alloc_size;
 		alloc_hint = ins.objectid + ins.offset;
 		start += cur_alloc_size;
+
+		/*
+		 * btrfs_reloc_clone_csums() error, since start is increased
+		 * extent_clear_unlock_delalloc() at out_unlock label won't
+		 * free metadata of current ordered extent, we're OK to exit.
+		 */
+		if (ret)
+			goto out_unlock;
 	}
 out:
 	return ret;
@@ -1414,15 +1434,14 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 		BUG_ON(ret); /* -ENOMEM */
 
 		if (root->root_key.objectid ==
-		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
+		    BTRFS_DATA_RELOC_TREE_OBJECTID)
+			/*
+			 * Error handled later, as we must prevent
+			 * extent_clear_unlock_delalloc() in error handler
+			 * from freeing metadata of created ordered extent.
+			 */
 			ret = btrfs_reloc_clone_csums(inode, cur_offset,
 						      num_bytes);
-			if (ret) {
-				if (!nolock && nocow)
-					btrfs_end_write_no_snapshoting(root);
-				goto error;
-			}
-		}
 
 		extent_clear_unlock_delalloc(inode, cur_offset,
 					     cur_offset + num_bytes - 1, end,
@@ -1434,6 +1453,14 @@  static noinline int run_delalloc_nocow(struct inode *inode,
 		if (!nolock && nocow)
 			btrfs_end_write_no_snapshoting(root);
 		cur_offset = extent_end;
+
+		/*
+		 * btrfs_reloc_clone_csums() error, now we're OK to call error
+		 * handler, as metadata for created ordered extent will only
+		 * be freed by btrfs_finish_ordered_io().
+		 */
+		if (ret)
+			goto error;
 		if (cur_offset > end)
 			break;
 	}