
[v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

Message ID 1540554115-11226-1-git-send-email-nborisov@suse.com (mailing list archive)
State New, archived
Series [v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

Commit Message

Nikolay Borisov Oct. 26, 2018, 11:41 a.m. UTC
Running btrfs/124 in a loop hung up on me sporadically with the
following call trace:
	btrfs           D    0  5760   5324 0x00000000
	Call Trace:
	 ? __schedule+0x243/0x800
	 schedule+0x33/0x90
	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
	 ? wait_woken+0xa0/0xa0
	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
	 btrfs_balance+0xbeb/0x1740 [btrfs]
	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
	 btrfs_ioctl+0x1691/0x3110 [btrfs]
	 ? lockdep_hardirqs_on+0xed/0x180
	 ? __handle_mm_fault+0x8e7/0xfb0
	 ? _raw_spin_unlock+0x24/0x30
	 ? __handle_mm_fault+0x8e7/0xfb0
	 ? do_vfs_ioctl+0xa5/0x6e0
	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
	 do_vfs_ioctl+0xa5/0x6e0
	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
	 ksys_ioctl+0x3a/0x70
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x60/0x1b0
	 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Turns out during page writeback it's possible that the code in
writepage_delalloc can instantiate a delalloc range which doesn't
belong to the page currently being written back. This happens since
find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
range when asked and doesn't really consider the range of the passed
page. When such a foreign range is found the code proceeds to
run_delalloc_range and calls the appropriate function to fill the
delalloc and create ordered extents. If, however, a failure occurs
while this operation is in effect then the clean up code in
btrfs_cleanup_ordered_extents will be called. This function has the
wrong assumption that the caller of run_delalloc_range always properly
cleans the first page of the range, hence when it calls
__endio_write_update_ordered it explicitly omits the first page of
the delalloc range. This is wrong because this function could be
cleaning a delalloc range that doesn't belong to the current page. This
in turn means that the page cleanup code in __extent_writepage will
not really free the initial page for the range, leaving a hanging
ordered extent with bytes_left set to 4k. This bug has been present
ever since the original introduction of the cleanup code in 524272607e88.

Fix this by correctly checking whether the current page belongs to the
range being instantiated and, if so, adjusting the range parameters
passed for cleaning up. If it doesn't, then just clean the whole OE
range directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
---

V2: 
 * Fix compilation failure due to missing parentheses
 * Fixed the "Fixes" tag. 

 fs/btrfs/inode.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

Comments

Qu Wenruo Oct. 26, 2018, 11:53 a.m. UTC | #1
On 2018/10/26 下午7:41, Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the
> following call trace:
> 	btrfs           D    0  5760   5324 0x00000000
> 	Call Trace:
> 	 ? __schedule+0x243/0x800
> 	 schedule+0x33/0x90
> 	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
> 	 ? wait_woken+0xa0/0xa0
> 	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
> 	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
> 	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
> 	 btrfs_balance+0xbeb/0x1740 [btrfs]
> 	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
> 	 btrfs_ioctl+0x1691/0x3110 [btrfs]
> 	 ? lockdep_hardirqs_on+0xed/0x180
> 	 ? __handle_mm_fault+0x8e7/0xfb0
> 	 ? _raw_spin_unlock+0x24/0x30
> 	 ? __handle_mm_fault+0x8e7/0xfb0
> 	 ? do_vfs_ioctl+0xa5/0x6e0
> 	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
> 	 do_vfs_ioctl+0xa5/0x6e0
> 	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
> 	 ksys_ioctl+0x3a/0x70
> 	 __x64_sys_ioctl+0x16/0x20
> 	 do_syscall_64+0x60/0x1b0
> 	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Turns out during page writeback it's possible that the code in
> writepage_delalloc can instantiate a delalloc range which doesn't
> belong to the page currently being written back.

Just a nitpick, would you please split long paragraphs with newlines?

It makes reviewers' eyes less painful.

> This happens since
> find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
> range when asked and doesn't really consider the range of the passed
> page.


> When such a foreign range is found the code proceeds to
> run_delalloc_range and calls the appropriate function to fill the
> delalloc and create ordered extents. If, however, a failure occurs
> while this operation is in effect then the clean up code in
> btrfs_cleanup_ordered_extents will be called. This function has the
> wrong assumption that the caller of run_delalloc_range always properly
> cleans the first page of the range, hence when it calls
> __endio_write_update_ordered it explicitly omits the first page of
> the delalloc range.

Yes, that's the old assumption, at least well explained by some ascii
art. (even if it's wrong)

> This is wrong because this function could be
> cleaning a delalloc range that doesn't belong to the current page. This
> in turn means that the page cleanup code in __extent_writepage will
> not really free the initial page for the range, leaving a hanging
> ordered extent with bytes_left set to 4k. This bug has been present
> ever since the original introduction of the cleanup code in 524272607e88.

The cause sounds valid, however would you please explain more about how
such cleaning of an unrelated delalloc range happens?

Even though the fix looks solid, it's still better to explain the cause a
little more, as the more we understand the cause, the better solution we
may get.

> 
> Fix this by correctly checking whether the current page belongs to the
> range being instantiated and if so correctly adjust the range parameters
> passed for cleaning up. If it doesn't, then just clean the whole OE
> range directly.

And the solution also looks good to me, and much more robust, without
any (possibly wrong) assumption.

Thanks,
Qu

> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
> ---
> 
> V2: 
>  * Fix compilation failure due to missing parentheses
>  * Fixed the "Fixes" tag. 
> 
>  fs/btrfs/inode.c | 29 ++++++++++++++++++++---------
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e1f00d8c24ce..5906564ae2e9 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -109,17 +109,17 @@ static void __endio_write_update_ordered(struct inode *inode,
>   * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
>   * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
>   * to be released, which we want to happen only when finishing the ordered
> - * extent (btrfs_finish_ordered_io()). Also note that the caller of the
> - * fill_delalloc() callback already does proper cleanup for the first page of
> - * the range, that is, it invokes the callback writepage_end_io_hook() for the
> - * range of the first page.
> + * extent (btrfs_finish_ordered_io()).
>   */
>  static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
> -						 const u64 offset,
> -						 const u64 bytes)
> +						 struct page *locked_page,
> +						 u64 offset, u64 bytes)
>  {
>  	unsigned long index = offset >> PAGE_SHIFT;
>  	unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
> +	u64 page_start = page_offset(locked_page);
> +	u64 page_end = page_start + PAGE_SIZE - 1;
> +
>  	struct page *page;
>  
>  	while (index <= end_index) {
> @@ -130,8 +130,18 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>  		ClearPagePrivate2(page);
>  		put_page(page);
>  	}
> -	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
> -					    bytes - PAGE_SIZE, false);
> +
> +	/*
> +	 * In case this page belongs to the delalloc range being instantiated
> +	 * then skip it, since the first page of a range is going to be
> +	 * properly cleaned up by the caller of run_delalloc_range
> +	 */
> +	if (page_start >= offset && page_end <= (offset + bytes - 1)) {
> +		offset += PAGE_SIZE;
> +		bytes -= PAGE_SIZE;
> +	}
> +
> +	return __endio_write_update_ordered(inode, offset, bytes, false);
>  }
>  
>  static int btrfs_dirty_inode(struct inode *inode);
> @@ -1606,7 +1616,8 @@ static int run_delalloc_range(void *private_data, struct page *locked_page,
>  					   write_flags);
>  	}
>  	if (ret)
> -		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
> +		btrfs_cleanup_ordered_extents(inode, locked_page, start,
> +					      end - start + 1);
>  	return ret;
>  }
>  
>
Nikolay Borisov Oct. 26, 2018, 12:02 p.m. UTC | #2
On 26.10.2018 14:53, Qu Wenruo wrote:
> 
> 
> On 2018/10/26 下午7:41, Nikolay Borisov wrote:
>> Running btrfs/124 in a loop hung up on me sporadically with the
>> following call trace:
>> 	btrfs           D    0  5760   5324 0x00000000
>> 	Call Trace:
>> 	 ? __schedule+0x243/0x800
>> 	 schedule+0x33/0x90
>> 	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>> 	 ? wait_woken+0xa0/0xa0
>> 	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>> 	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>> 	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
>> 	 btrfs_balance+0xbeb/0x1740 [btrfs]
>> 	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>> 	 btrfs_ioctl+0x1691/0x3110 [btrfs]
>> 	 ? lockdep_hardirqs_on+0xed/0x180
>> 	 ? __handle_mm_fault+0x8e7/0xfb0
>> 	 ? _raw_spin_unlock+0x24/0x30
>> 	 ? __handle_mm_fault+0x8e7/0xfb0
>> 	 ? do_vfs_ioctl+0xa5/0x6e0
>> 	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>> 	 do_vfs_ioctl+0xa5/0x6e0
>> 	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>> 	 ksys_ioctl+0x3a/0x70
>> 	 __x64_sys_ioctl+0x16/0x20
>> 	 do_syscall_64+0x60/0x1b0
>> 	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Turns out during page writeback it's possible that the code in
>> writepage_delalloc can instantiate a delalloc range which doesn't
>> belong to the page currently being written back.
> 
> Just a nitpick, would you please split long paragraphs with newlines?
> 
> It makes reviewers' eyes less painful.
> 
>> This happens since
>> find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
>> range when asked and doesn't really consider the range of the passed
>> page.
> 
> 
>> When such a foreign range is found the code proceeds to
>> run_delalloc_range and calls the appropriate function to fill the
>> delalloc and create ordered extents. If, however, a failure occurs
>> while this operation is in effect then the clean up code in
>> btrfs_cleanup_ordered_extents will be called. This function has the
>> wrong assumption that the caller of run_delalloc_range always properly
>> cleans the first page of the range, hence when it calls
>> __endio_write_update_ordered it explicitly omits the first page of
>> the delalloc range.
> 
> Yes, that's the old assumption, at least well explained by some ascii
> art. (even if it's wrong)
> 
>> This is wrong because this function could be
>> cleaning a delalloc range that doesn't belong to the current page. This
>> in turn means that the page cleanup code in __extent_writepage will
>> not really free the initial page for the range, leaving a hanging
>> ordered extent with bytes_left set to 4k. This bug has been present
>> ever since the original introduction of the cleanup code in 524272607e88.
> 
> The cause sounds valid, however would you please explain more about how
> such cleaning on unrelated delalloc range happens?

So in my case the following happened - 2 block groups were created as
delalloc ranges - 0-1m and 1m-128m. Their respective pages were
dirtied, so when page 0 (covering 0-4k) went into writepage_delalloc,
find_lock_delalloc_range would return the range 0-1m. So the call to
fill_delalloc instantiates OE 0-1m and writeback continues as expected.

Now, when the 2nd page from range 0-1m is again set for writeback and
find_lock_delalloc_range is called with delalloc_start == 4096 it will
actually return the range 1m-128m. Then fill_delalloc is called with
locked_page belonging to the range which was already instantiated and
the start/end arguments pointing to 1m-128m. If an error occurs in
run_delalloc_range in this case then btrfs_cleanup_ordered_extents will
be called, which will clear the Private2 bit for pages belonging to the
1m-128m range, and the OE will be cleaned up for all but its first page
(since the code wrongly assumed locked_page always belongs to the range
currently being instantiated).
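
To make the arithmetic concrete, here is a minimal user-space sketch of the
containment check added by the patch, with this scenario's numbers plugged
in (purely illustrative and not kernel code; PAGE_SIZE and SZ_1M are
redefined locally just for the example):

#include <stdio.h>

#define PAGE_SIZE 4096ULL
#define SZ_1M (1024ULL * 1024)

int main(void)
{
	/* Scenario above: locked_page is the 2nd page of the already
	 * instantiated 0-1m range, while the range that failed and is being
	 * cleaned up is the foreign 1m-128m one. */
	unsigned long long page_start = 4096;		/* page_offset(locked_page) */
	unsigned long long page_end = page_start + PAGE_SIZE - 1;
	unsigned long long offset = SZ_1M;		/* start of the failed range */
	unsigned long long bytes = 128 * SZ_1M;		/* its length */

	/* The check added by the patch: only skip the first page when
	 * locked_page really lies inside [offset, offset + bytes - 1]. */
	if (page_start >= offset && page_end <= offset + bytes - 1) {
		offset += PAGE_SIZE;
		bytes -= PAGE_SIZE;
	}

	/* Prints "cleanup 1048576 + 134217728": the whole foreign OE range
	 * is handed to __endio_write_update_ordered(), nothing is skipped. */
	printf("cleanup %llu + %llu\n", offset, bytes);
	return 0;
}

With the old unconditional offset + PAGE_SIZE / bytes - PAGE_SIZE adjustment,
the first 4k of the foreign ordered extent would never be completed, which is
exactly the hanging OE with bytes_left == 4k described in the commit message.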

> 
> Even the fix looks solid, it's still better to explain the cause a
> little more, as the more we understand the cause, the better solution we
> may get.
> 
>>
>> Fix this by correctly checking whether the current page belongs to the
>> range being instantiated and if so correctly adjust the range parameters
>> passed for cleaning up. If it doesn't, then just clean the whole OE
>> range directly.
> 
> And the solution also looks good to me, and much more robust, without
> any (possibly wrong) assumption.
> 
> Thanks,
> Qu
> 
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
>> ---
>>
>> V2: 
>>  * Fix compilation failure due to missing parentheses
>>  * Fixed the "Fixes" tag. 
>>
>>  fs/btrfs/inode.c | 29 ++++++++++++++++++++---------
>>  1 file changed, 20 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index e1f00d8c24ce..5906564ae2e9 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -109,17 +109,17 @@ static void __endio_write_update_ordered(struct inode *inode,
>>   * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
>>   * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
>>   * to be released, which we want to happen only when finishing the ordered
>> - * extent (btrfs_finish_ordered_io()). Also note that the caller of the
>> - * fill_delalloc() callback already does proper cleanup for the first page of
>> - * the range, that is, it invokes the callback writepage_end_io_hook() for the
>> - * range of the first page.
>> + * extent (btrfs_finish_ordered_io()).
>>   */
>>  static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>> -						 const u64 offset,
>> -						 const u64 bytes)
>> +						 struct page *locked_page,
>> +						 u64 offset, u64 bytes)
>>  {
>>  	unsigned long index = offset >> PAGE_SHIFT;
>>  	unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
>> +	u64 page_start = page_offset(locked_page);
>> +	u64 page_end = page_start + PAGE_SIZE - 1;
>> +
>>  	struct page *page;
>>  
>>  	while (index <= end_index) {
>> @@ -130,8 +130,18 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>>  		ClearPagePrivate2(page);
>>  		put_page(page);
>>  	}
>> -	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
>> -					    bytes - PAGE_SIZE, false);
>> +
>> +	/*
>> +	 * In case this page belongs to the delalloc range being instantiated
>> +	 * then skip it, since the first page of a range is going to be
>> +	 * properly cleaned up by the caller of run_delalloc_range
>> +	 */
>> +	if (page_start >= offset && page_end <= (offset + bytes - 1)) {
>> +		offset += PAGE_SIZE;
>> +		bytes -= PAGE_SIZE;
>> +	}
>> +
>> +	return __endio_write_update_ordered(inode, offset, bytes, false);
>>  }
>>  
>>  static int btrfs_dirty_inode(struct inode *inode);
>> @@ -1606,7 +1616,8 @@ static int run_delalloc_range(void *private_data, struct page *locked_page,
>>  					   write_flags);
>>  	}
>>  	if (ret)
>> -		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>> +		btrfs_cleanup_ordered_extents(inode, locked_page, start,
>> +					      end - start + 1);
>>  	return ret;
>>  }
>>  
>>
>
Qu Wenruo Oct. 29, 2018, 5:53 a.m. UTC | #3
[snip]
>> The cause sounds valid, however would you please explain more about how
>> such cleaning on unrelated delalloc range happens?
> 
> So in my case the following happened - 2 block groups were created as
> delalloc ranges in the - 0-1m and 1m-128m. Their respective pages were
> dirtied, so when page 0 - 0-4k went into writepage_delalloc,
> find_lock_delalloc_range would return the range 0-1m. So the call to
> fill_delalloc instantiates OE 0-1m and writeback continues as expected.
> 
> Now, when the 2nd page from range 0-1m is again set for writeback and
> find_lock_delalloc_range is called with delalloc_start ==  4096 it will
> actually return the range 1m-128m.

IMHO this looks strange and may need extra investigation.

Normally I would expect it to return the range 0~1M or 4K~1M.

But that doesn't affect the fix patch anyway.

Thanks,
Qu

> Then fill_delalloc is called with
> locked_page belonging to the range which was already instantiated and
> the start/end arguments pointing to 1m-128m if an error occurred in
> run_delalloc_range in this case then  btrfs_cleanup_ordered_extents will
> be called which will clear Private2 bit for pages belonging to 1m-128m
> range and the OE will be cleared of all but the first page (since the
> code wrongly assumed locked_page always belongs to the range currently
> being instantiated).
>
Nikolay Borisov Oct. 29, 2018, 7:51 a.m. UTC | #4
On 29.10.18 г. 7:53 ч., Qu Wenruo wrote:
> [snip]
>>> The cause sounds valid, however would you please explain more about how
>>> such cleaning on unrelated delalloc range happens?
>>
>> So in my case the following happened - 2 block groups were created as
>> delalloc ranges in the - 0-1m and 1m-128m. Their respective pages were
>> dirtied, so when page 0 - 0-4k went into writepage_delalloc,
>> find_lock_delalloc_range would return the range 0-1m. So the call to
>> fill_delalloc instantiates OE 0-1m and writeback continues as expected.
>>
>> Now, when the 2nd page from range 0-1m is again set for writeback and
>> find_lock_delalloc_range is called with delalloc_start ==  4096 it will
>> actually return the range 1m-128m.
> 
> IMHO this looks strange and may need extra investigation.
> 
> Normally I would expect it returns the range 0~1M or 4K~1M.

It cannot return 4k-1m since the writeback for the first page has
already dealt with this range. Also take a look at how
find_lock_delalloc_range is called in writepage_delalloc: for 'start'
it's passed the page offset, calculated in __extent_writepage. And when
find_delalloc_range is called it just searches for an extent which ends
after the passed start value. In find_delalloc_range, tree_search is
called first, which returns the 1m-128m range, then we enter the
while(1) loop. On the first iteration 'found' is 0, so *end is populated
with 128m, found is set to 1 and *start is set to 1m.

On the second iteration the check

	if (found && (state->start != cur_start || (state->state & EXTENT_BOUNDARY)))

is triggered since the next extent found will have EXTENT_BOUNDARY, as
it will be the next block group from relocation. EXTENT_BOUNDARY is set
from relocate_file_extent_cluster's main loop:

		if (nr < cluster->nr &&
		    page_start + offset == cluster->boundary[nr]) {
			set_extent_bits(&BTRFS_I(inode)->io_tree,
					page_start, page_end,
					EXTENT_BOUNDARY);
			nr++;
		}
> 
> But that doesn't affect the fix patch anyway.
> 
> Thanks,
> Qu
> 
>> Then fill_delalloc is called with
>> locked_page belonging to the range which was already instantiated and
>> the start/end arguments pointing to 1m-128m if an error occurred in
>> run_delalloc_range in this case then  btrfs_cleanup_ordered_extents will
>> be called which will clear Private2 bit for pages belonging to 1m-128m
>> range and the OE will be cleared of all but the first page (since the
>> code wrongly assumed locked_page always belongs to the range currently
>> being instantiated).
>>
> 
>
Nikolay Borisov Oct. 29, 2018, 12:21 p.m. UTC | #5
On 29.10.18 г. 9:51 ч., Nikolay Borisov wrote:
> 
> 
> On 29.10.18 г. 7:53 ч., Qu Wenruo wrote:
>> [snip]
>>>> The cause sounds valid, however would you please explain more about how
>>>> such cleaning on unrelated delalloc range happens?
>>>
>>> So in my case the following happened - 2 block groups were created as
>>> delalloc ranges in the - 0-1m and 1m-128m. Their respective pages were
>>> dirtied, so when page 0 - 0-4k went into writepage_delalloc,
>>> find_lock_delalloc_range would return the range 0-1m. So the call to
>>> fill_delalloc instantiates OE 0-1m and writeback continues as expected.
>>>
>>> Now, when the 2nd page from range 0-1m is again set for writeback and
>>> find_lock_delalloc_range is called with delalloc_start ==  4096 it will
>>> actually return the range 1m-128m.
>>
>> IMHO this looks strange and may need extra investigation.
>>
>> Normally I would expect it returns the range 0~1M or 4K~1M.
> 
> It cannot return 4k-1m since the writeback for the first page has
> already dealt with this range. Also take a look in writepage_delalloc
> how find_lock_delalloc_range is called : for 'start' it's passed the
> page offset, calculated in __extent_writepage. And when
> find_delalloc_range is called it just searches for an extent which ends
> after passed start value. In find_delalloc_range  first    tree_search
> is called which returns the 1m-128m range, then we go in the while(1)
> loop on the first iteration found is 0 so *end is populated with 128m,
> found is set to 1 and *start is set to 1m.
> 
> On the second iteration the check  if (found && (state->start !=
> cur_start || (state->state & EXTENT_BOUNDARY)))
> 
> is triggered since the next extent found will have EXTENT_BOUNDARY since
> it will be the next block group from relocation. EXTENT_BOUNDARY will be
> set from relocate_file_extent_cluster's main loop:
> 
>		if (nr < cluster->nr &&
>		    page_start + offset == cluster->boundary[nr]) {
>			set_extent_bits(&BTRFS_I(inode)->io_tree,
>					page_start, page_end,
>					EXTENT_BOUNDARY);
>			nr++;
>		}

So it seems I was wrong w.r.t. the sequence of events that results in the extra extent being returned.
Here is what I got after further investigation. First let's look at the relocation side:


btrfs-4018  [001] ....   186.783244: relocate_file_extent_cluster: Setting EXTENT_BOUNDARY for page: 0000000074cc47c4 page_offset: 0 end: 4095 <- first page of range 0-1m
btrfs-4018  [001] ....   186.783248: relocate_file_extent_cluster: Setting DELALLOC for page: 0000000074cc47c4 page_offset: 0 block group:1104150528 
btrfs-4018  [001] ....   186.783286: relocate_file_extent_cluster: Setting DELALLOC for page: 000000004a28475a page_offset: 4096 block group:1104150528 <- 2nd page of range 0-1m 
btrfs-4018  [001] ....   186.784855: relocate_file_extent_cluster: Setting EXTENT_BOUNDARY for page: 00000000f58f50dc page_offset: 1048576 end: 1052671 < - 1st page of range 1m-128m

Mind the addresses of the given pages; the traces are all predicated on btrfs_ino == 260, which is the ino of the relocation inode.

So the assumption is that when writing back page 000000004a28475a we cannot really be processing a range >1m, since it would
be outside of the range for the page, but this is not the case. Now on the writeback side:

kworker/u12:1-68    [002] ....   188.100471: find_lock_delalloc_range: Processing delalloc range: 0 - 1048575 for page: 0000000074cc47c4 < - writeback for first page of range 0-1m happens 
so we naturally instantiate this range.

kworker/u12:1-68    [002] ...1   188.106523: find_delalloc_range.constprop.25: 1213015261107184058: Got start: 1048576 end 1052671 <- This is output from find_delalloc_range when start (the offset of the page passed to find_lock_delalloc_range) is 4096, i.e. it's the 2nd page of range 0-1m. So we find 1m - 1m+4k on the first iteration of the loop in find_delalloc_range; at this point *start = 1048576, *end = 1052671 and cached_state = the extent_state representing this first 4k piece of the larger 1m-128m range.

kworker/u12:1-68    [002] ...1   188.106526: find_delalloc_range.constprop.25: 1213015261107184058: Got start: 1052672 end 135266303 - We loop for the second time, this time we find the 1m+4k - 128m range and actually trigger the (total_bytes >= max_bytes) check

kworker/u12:1-68    [002] ...1   188.106526: find_delalloc_range.constprop.25: 1213015261107184058: returning from total_bytes >= max_bytes

kworker/u12:1-68    [002] ....   188.106528: find_delalloc_range.constprop.25: 1213015261107184058: Returning cached_state->start: 1048576 cached_state->end: 1052671 *start = 1048576 *end = 135266303 <--- this is what is returned from find_delalloc_range - cached_state and *start and *end differ, since *end was set to 128m right before we hit the total_bytes check. 

kworker/u12:1-68    [002] ....   188.106529: find_lock_delalloc_range: Processing delalloc range: 1048576 - 135266303 for page: 000000004a28475a <--- after we return from find_delalloc_range this is what
find_lock_delalloc_range is processing - the next delalloc range 1m-128m, but the page passed is indeed the 2nd page of the previous range, as evident from the address 000000004a28475a
kworker/u12:1-68    [002] ....   188.113870: run_delalloc_nocow: btrfs_reloc_clone_csums err for OE[1048576 134217728] page->offset: 4096 end: 8191 <--- here we just create the OE in run_delalloc_nocow. 

The only thing which I'm now struggling to rationalize is why find_delalloc_range finds an extent_state in the range 1048576-1052671 and then finds another range 1052672 - 135266303; shouldn't it in fact
find just 1048576 - 135266303? Perhaps it's due to the search being performed in the io_extent tree and not the extent_tree.

Here is the data_reloc tree :

item 2 key (260 INODE_ITEM 0) itemoff 15951 itemsize 160
		generation 1 transid 30 size 940572672 nbytes 940572672
		block group 0 mode 100600 links 0 uid 0 gid 0 rdev 0
		sequence 8 flags 0x18(NOCOMPRESS|PREALLOC)
		atime 0.0 (1970-01-01 00:00:00)
		ctime 1540814915.362521123 (2018-10-29 12:08:35)
		mtime 0.0 (1970-01-01 00:00:00)
		otime 0.0 (1970-01-01 00:00:00)
	item 3 key (260 EXTENT_DATA 0) itemoff 15898 itemsize 53
		generation 28 type 2 (prealloc)
		prealloc data disk byte 3872391168 nr 1048576
		prealloc data offset 0 nr 1048576
	item 4 key (260 EXTENT_DATA 1048576) itemoff 15845 itemsize 53
		generation 28 type 2 (prealloc)
		prealloc data disk byte 3873439744 nr 134217728
		prealloc data offset 0 nr 134217728

as you can see we have a 0-1048576 extent and a 1048576-134217728 one. Those correspond to the following data block group being relocated:

 item 139 key (1104150528 EXTENT_ITEM 1048576) itemoff 11661 itemsize 53
                refs 1 gen 7 flags DATA
                extent data backref root FS_TREE objectid 257 offset 0 count 1
 item 140 key (1104150528 BLOCK_GROUP_ITEM 1073741824) itemoff 11637 itemsize 24
                block group used 940572672 chunk_objectid 256 flags DATA|RAID1
 item 141 key (1105199104 EXTENT_ITEM 134217728) itemoff 11584 itemsize 53
                refs 1 gen 9 flags DATA
                extent data backref root FS_TREE objectid 258 offset 0 count 1











<snip>
Nikolay Borisov Oct. 29, 2018, 2:39 p.m. UTC | #6
On 29.10.18 г. 14:21 ч., Nikolay Borisov wrote:
> 
> 
> On 29.10.18 г. 9:51 ч., Nikolay Borisov wrote:
>>
>>
>> On 29.10.18 г. 7:53 ч., Qu Wenruo wrote:
>>> [snip]
>>>>> The cause sounds valid, however would you please explain more about how
>>>>> such cleaning on unrelated delalloc range happens?
>>>>
>>>> So in my case the following happened - 2 block groups were created as
>>>> delalloc ranges in the - 0-1m and 1m-128m. Their respective pages were
>>>> dirtied, so when page 0 - 0-4k went into writepage_delalloc,
>>>> find_lock_delalloc_range would return the range 0-1m. So the call to
>>>> fill_delalloc instantiates OE 0-1m and writeback continues as expected.
>>>>
>>>> Now, when the 2nd page from range 0-1m is again set for writeback and
>>>> find_lock_delalloc_range is called with delalloc_start ==  4096 it will
>>>> actually return the range 1m-128m.
>>>
>>> IMHO this looks strange and may need extra investigation.
>>>
>>> Normally I would expect it returns the range 0~1M or 4K~1M.
>>
>> It cannot return 4k-1m since the writeback for the first page has
>> already dealt with this range. Also take a look in writepage_delalloc
>> how find_lock_delalloc_range is called : for 'start' it's passed the
>> page offset, calculated in __extent_writepage. And when
>> find_delalloc_range is called it just searches for an extent which ends
>> after passed start value. In find_delalloc_range  first    tree_search
>> is called which returns the 1m-128m range, then we go in the while(1)
>> loop on the first iteration found is 0 so *end is populated with 128m,
>> found is set to 1 and *start is set to 1m.
>>
>> On the second iteration the check  if (found && (state->start !=
>> cur_start || (state->state & EXTENT_BOUNDARY)))
>>
>> is triggered since the next extent found will have EXTENT_BOUNDARY since
>> it will be the next block group from relocation. EXTENT_BOUNDARY will be
>> set from relocate_file_extent_cluster's main loop:
>>
>>		if (nr < cluster->nr &&
>>		    page_start + offset == cluster->boundary[nr]) {
>>			set_extent_bits(&BTRFS_I(inode)->io_tree,
>>					page_start, page_end,
>>					EXTENT_BOUNDARY);
>>			nr++;
>>		}
> 
> So it seems I was wrong w.r.t. the sequence of events that results in the extra extent being returned.
> Here is what I got after further investigation. First let's look at the relocation side: 
> 
> 
> btrfs-4018  [001] ....   186.783244: relocate_file_extent_cluster: Setting EXTENT_BOUNDARY for page: 0000000074cc47c4 page_offset: 0 end: 4095 <- first page of range 0-1m
> btrfs-4018  [001] ....   186.783248: relocate_file_extent_cluster: Setting DELALLOC for page: 0000000074cc47c4 page_offset: 0 block group:1104150528 
> btrfs-4018  [001] ....   186.783286: relocate_file_extent_cluster: Setting DELALLOC for page: 000000004a28475a page_offset: 4096 block group:1104150528 <- 2nd page of range 0-1m 
> btrfs-4018  [001] ....   186.784855: relocate_file_extent_cluster: Setting EXTENT_BOUNDARY for page: 00000000f58f50dc page_offset: 1048576 end: 1052671 < - 1st page of range 1m-128m
> 
> Mind the addresses of the given pages, they are all predicated on btrfs_ino == 260, which is the ino for relocation inode. 
> 
> So the assumption is that when writing back page 000000004a28475a we cannot really be processing range >1m since it will
> be outside of the range for the page, but this is not the case. Now on the writeback side: 
> 
> kworker/u12:1-68    [002] ....   188.100471: find_lock_delalloc_range: Processing delalloc range: 0 - 1048575 for page: 0000000074cc47c4 < - writeback for first page of range 0-1m happens 
> so we naturally instantiate this range.
> 
> kworker/u12:1-68    [002] ...1   188.106523: find_delalloc_range.constprop.25: 1213015261107184058: Got start: 1048576 end 1052671 <- This is output from find_delalloc_range when start (the offset of the passed page to find_lock_delalloc_range is 4096 i.e it's the 2nd page for range 0-1m). So we find 1m - 1m + 4k on the first iteration of the loop in find_delalloc_range, at this point *start = 1048576 and *end = 1052671 and cached_state = the extent _state representing this first 4k range of the larger 1m-128m range.
> 
> kworker/u12:1-68    [002] ...1   188.106526: find_delalloc_range.constprop.25: 1213015261107184058: Got start: 1052672 end 135266303 - We loop for the second time, this time we find the 1m+4k - 128m range and actually trigger the (total_bytes >= max_bytes) check
> 
> kworker/u12:1-68    [002] ...1   188.106526: find_delalloc_range.constprop.25: 1213015261107184058: returning from total_bytes >= max_bytes
> 
> kworker/u12:1-68    [002] ....   188.106528: find_delalloc_range.constprop.25: 1213015261107184058: Returning cached_state->start: 1048576 cached_state->end: 1052671 *start = 1048576 *end = 135266303 <--- this is what is returned from find_delalloc_range - cached_state and *start and *end differ, since *end was set to 128m right before we hit the total_bytes check. 
> 
> kworker/u12:1-68    [002] ....   188.106529: find_lock_delalloc_range: Processing delalloc range: 1048576 - 135266303 for page: 000000004a28475a <--- after we return from find_delalloc_range this is what 
> find_lock_delalloc_range is processing - the next delalloc range 1m-128m but the page passed is indeed the 2nd page for the previous range as evident from the address 000000004a28475a
> kworker/u12:1-68    [002] ....   188.113870: run_delalloc_nocow: btrfs_reloc_clone_csums err for OE[1048576 134217728] page->offset: 4096 end: 8191 <--- here we just create the OE in run_delalloc_nocow. 
> 
> The only thing which I'm now struggling to rationalize is why find_delalloc_range finds an extent_state in the range 1048576-1052671 and then finds another range 1052672 - 135266303, shouldn't in fact it 
> find just 1048576 - 135266303. Perhaps it's due to the search being performed in the io_extent tree and not the extent_tree. 

And the explanation: 

btrfs-3230  [004] ...1    74.499205: merge_state: PREV: Didn't merge state[1052672-1056767 state: 36] with other[1048576-1052671 state: 548]
btrfs-3230  [004] ...1    74.499205: merge_state: NEXT: Didn't merge state[1052672-1056767 state: 36] with other[1056768-4128767 state: 4]

36 is EXTENT_UPTODATE|EXTENT_DELALLOC and 548 is EXTENT_UPTODATE|EXTENT_DELALLOC|EXTENT_BOUNDARY,
set by the relocation code. Now I'm convinced that the problem is really that find_delalloc_range only guarantees
to return a delalloc extent that ends after *start, so returning one which is outside of our page's range is a valid outcome.
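
To illustrate that, here is a small user-space simulation of the loop as
discussed in this thread, fed with the two unmerged extent_states from the
trace above. It uses toy types and assumed flag bit values (chosen to match
the 36/548 decode); it only mimics the logic described here and is not the
kernel implementation:

#include <stdio.h>
#include <stdbool.h>

/* Assumed bit values consistent with the decode above (36 and 548);
 * illustrative only, they may differ between kernel versions. */
#define EXTENT_UPTODATE  (1u << 2)
#define EXTENT_DELALLOC  (1u << 5)
#define EXTENT_BOUNDARY  (1u << 9)

struct toy_state { unsigned long long start, end; unsigned int flags; };

int main(void)
{
	/* The two extent_states that merge_state refused to merge. */
	struct toy_state states[] = {
		{ 1048576, 1052671,
		  EXTENT_UPTODATE | EXTENT_DELALLOC | EXTENT_BOUNDARY },	/* 548 */
		{ 1052672, 135266303,
		  EXTENT_UPTODATE | EXTENT_DELALLOC },				/* 36  */
	};
	unsigned long long cur_start = 4096;	/* page_offset of the 2nd page of 0-1m */
	unsigned long long start = cur_start, end = 0, total = 0;
	unsigned long long max_bytes = 128ULL * 1024 * 1024;	/* BTRFS_MAX_EXTENT_SIZE */
	bool found = false;

	/* tree_search() semantics: only states ending after cur_start are
	 * visited, so the already instantiated 0-1m range is never seen. */
	for (int i = 0; i < 2; i++) {
		struct toy_state *s = &states[i];

		if (found && (s->start != cur_start || (s->flags & EXTENT_BOUNDARY)))
			break;
		if (!(s->flags & EXTENT_DELALLOC))
			break;
		if (!found)
			start = s->start;	/* jumps to 1m, past the page being written */
		found = true;
		end = s->end;
		cur_start = s->end + 1;
		total += s->end - s->start + 1;
		if (total >= max_bytes)		/* hit on the 2nd iteration: 4k + ~128m */
			break;
	}

	/* Prints 1048576-135266303, matching the foreign range in the trace. */
	printf("delalloc range returned: %llu-%llu\n", start, end);
	return 0;
}

Since the already instantiated 0-1m range no longer carries EXTENT_DELALLOC,
the first state ending after start == 4096 is the one at 1m, so returning the
foreign 1m-128m range is a perfectly legal answer from find_delalloc_range.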

<SNIP>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> <snip>
>
Josef Bacik Nov. 20, 2018, 7 p.m. UTC | #7
On Fri, Oct 26, 2018 at 02:41:55PM +0300, Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the
> following call trace:
> 	btrfs           D    0  5760   5324 0x00000000
> 	Call Trace:
> 	 ? __schedule+0x243/0x800
> 	 schedule+0x33/0x90
> 	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
> 	 ? wait_woken+0xa0/0xa0
> 	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
> 	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
> 	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
> 	 btrfs_balance+0xbeb/0x1740 [btrfs]
> 	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
> 	 btrfs_ioctl+0x1691/0x3110 [btrfs]
> 	 ? lockdep_hardirqs_on+0xed/0x180
> 	 ? __handle_mm_fault+0x8e7/0xfb0
> 	 ? _raw_spin_unlock+0x24/0x30
> 	 ? __handle_mm_fault+0x8e7/0xfb0
> 	 ? do_vfs_ioctl+0xa5/0x6e0
> 	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
> 	 do_vfs_ioctl+0xa5/0x6e0
> 	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
> 	 ksys_ioctl+0x3a/0x70
> 	 __x64_sys_ioctl+0x16/0x20
> 	 do_syscall_64+0x60/0x1b0
> 	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Turns out during page writeback it's possible that the code in
> writepage_delalloc can instantiate a delalloc range which doesn't
> belong to the page currently being written back. This happens since
> find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
> range when asked and doesn't really consider the range of the passed
> page. When such a foreign range is found the code proceeds to
> run_delalloc_range and calls the appropriate function to fill the
> delalloc and create ordered extents. If, however, a failure occurs
> while this operation is in effect then the clean up code in
> btrfs_cleanup_ordered_extents will be called. This function has the
> wrong assumption that the caller of run_delalloc_range always properly
> cleans the first page of the range, hence when it calls
> __endio_write_update_ordered it explicitly omits the first page of
> the delalloc range. This is wrong because this function could be
> cleaning a delalloc range that doesn't belong to the current page. This
> in turn means that the page cleanup code in __extent_writepage will
> not really free the initial page for the range, leaving a hanging
> ordered extent with bytes_left set to 4k. This bug has been present
> ever since the original introduction of the cleanup code in 524272607e88.
> 
> Fix this by correctly checking whether the current page belongs to the
> range being instantiated and if so correctly adjust the range parameters
> passed for cleaning up. If it doesn't, then just clean the whole OE
> range directly.
> 
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")

Can we just remove the endio cleanup in __extent_writepage() and make this do
the proper cleanup?  I'm not sure if that is feasible or not, but seems like it
would make the cleanup stuff less confusing and more straightforward.  If not
you can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
Nikolay Borisov Nov. 20, 2018, 9:09 p.m. UTC | #8
On 20.11.18 г. 21:00 ч., Josef Bacik wrote:
> On Fri, Oct 26, 2018 at 02:41:55PM +0300, Nikolay Borisov wrote:
>> Running btrfs/124 in a loop hung up on me sporadically with the
>> following call trace:
>> 	btrfs           D    0  5760   5324 0x00000000
>> 	Call Trace:
>> 	 ? __schedule+0x243/0x800
>> 	 schedule+0x33/0x90
>> 	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>> 	 ? wait_woken+0xa0/0xa0
>> 	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>> 	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>> 	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
>> 	 btrfs_balance+0xbeb/0x1740 [btrfs]
>> 	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>> 	 btrfs_ioctl+0x1691/0x3110 [btrfs]
>> 	 ? lockdep_hardirqs_on+0xed/0x180
>> 	 ? __handle_mm_fault+0x8e7/0xfb0
>> 	 ? _raw_spin_unlock+0x24/0x30
>> 	 ? __handle_mm_fault+0x8e7/0xfb0
>> 	 ? do_vfs_ioctl+0xa5/0x6e0
>> 	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>> 	 do_vfs_ioctl+0xa5/0x6e0
>> 	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>> 	 ksys_ioctl+0x3a/0x70
>> 	 __x64_sys_ioctl+0x16/0x20
>> 	 do_syscall_64+0x60/0x1b0
>> 	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Turns out during page writeback it's possible that the code in
>> writepage_delalloc can instantiate a delalloc range which doesn't
>> belong to the page currently being written back. This happens since
>> find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
>> range when asked and doesn't really consider the range of the passed
>> page. When such a foreign range is found the code proceeds to
>> run_delalloc_range and calls the appropriate function to fill the
>> delalloc and create ordered extents. If, however, a failure occurs
>> while this operation is in effect then the clean up code in
>> btrfs_cleanup_ordered_extents will be called. This function has the
>> wrong assumption that the caller of run_delalloc_range always properly
>> cleans the first page of the range, hence when it calls
>> __endio_write_update_ordered it explicitly omits the first page of
>> the delalloc range. This is wrong because this function could be
>> cleaning a delalloc range that doesn't belong to the current page. This
>> in turn means that the page cleanup code in __extent_writepage will
>> not really free the initial page for the range, leaving a hanging
>> ordered extent with bytes_left set to 4k. This bug has been present
>> ever since the original introduction of the cleanup code in 524272607e88.
>>
>> Fix this by correctly checking whether the current page belongs to the
>> range being instantiated and if so correctly adjust the range parameters
>> passed for cleaning up. If it doesn't, then just clean the whole OE
>> range directly.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
> 
> Can we just remove the endio cleanup in __extent_writepage() and make this do
> the proper cleanup?  I'm not sure if that is feasible or not, but seems like it
> would make the cleanup stuff less confusing and more straightforward.  If not
> you can add

Quickly skimming the code I think the cleanup in __extent_writepage
could be moved into __extent_writepage_io where we have 2 branches that
set PageError. So I guess it could be done, but I will have to
experiment with it.

> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> 
> Thanks,
> 
> Josef
>

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1f00d8c24ce..5906564ae2e9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -109,17 +109,17 @@  static void __endio_write_update_ordered(struct inode *inode,
  * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
  * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
  * to be released, which we want to happen only when finishing the ordered
- * extent (btrfs_finish_ordered_io()). Also note that the caller of the
- * fill_delalloc() callback already does proper cleanup for the first page of
- * the range, that is, it invokes the callback writepage_end_io_hook() for the
- * range of the first page.
+ * extent (btrfs_finish_ordered_io()).
  */
 static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
-						 const u64 offset,
-						 const u64 bytes)
+						 struct page *locked_page,
+						 u64 offset, u64 bytes)
 {
 	unsigned long index = offset >> PAGE_SHIFT;
 	unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
+	u64 page_start = page_offset(locked_page);
+	u64 page_end = page_start + PAGE_SIZE - 1;
+
 	struct page *page;
 
 	while (index <= end_index) {
@@ -130,8 +130,18 @@  static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
 		ClearPagePrivate2(page);
 		put_page(page);
 	}
-	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
-					    bytes - PAGE_SIZE, false);
+
+	/*
+	 * In case this page belongs to the delalloc range being instantiated
+	 * then skip it, since the first page of a range is going to be
+	 * properly cleaned up by the caller of run_delalloc_range
+	 */
+	if (page_start >= offset && page_end <= (offset + bytes - 1)) {
+		offset += PAGE_SIZE;
+		bytes -= PAGE_SIZE;
+	}
+
+	return __endio_write_update_ordered(inode, offset, bytes, false);
 }
 
 static int btrfs_dirty_inode(struct inode *inode);
@@ -1606,7 +1616,8 @@  static int run_delalloc_range(void *private_data, struct page *locked_page,
 					   write_flags);
 	}
 	if (ret)
-		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
+		btrfs_cleanup_ordered_extents(inode, locked_page, start,
+					      end - start + 1);
 	return ret;
 }