[3/5] Btrfs: only associate the locked page with one async_cow struct
diff mbox series

Message ID 20190710192818.1069475-4-tj@kernel.org
State New
Headers show
Series
  • [1/5] Btrfs: stop using btrfs_schedule_bio()
Related show

Commit Message

Tejun Heo July 10, 2019, 7:28 p.m. UTC
From: Chris Mason <clm@fb.com>

The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing.  When compression is on, we allocate one async_cow
structure for every 512K, and then run those pages through the
compression code for IO submission.

writepages starts all of this off with a single page, locked by
the original call to extent_write_cache_pages(), and it's important to
keep track of this page because it has already been through
clear_page_dirty_for_io().

The btrfs async_cow struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fallback to
uncompressed IO, we use page->index to decide if a given async_cow
struct really owns that page.

But, this is racey.  If a given delalloc range is broken up into two
async_cows (cow_A and cow_B), we can end up with something like this:

compress_file_range(cowA)
submit_compress_extents(cowA)
submit compressed bios(cowA)
put_page(locked_page)

				compress_file_range(cowB)
				...

The end result is that cowA is completed and cleaned up before cowB even
starts processing.  This means we can free locked_page() and reuse it
elsewhere.  If we get really lucky, it'll have the same page->index in
its new home as it did before.

While we're processing cowB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobufers() on cowB->locked_page.

Without cgroups in use, this creates as a phantom dirty page, which
isn't great but isn't the end of the world.  With cgroups in use, we
might crash in the accounting code because page->mapping->i_wb isn't
set.

[ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
[ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
[ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
[ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
[ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
[ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
[ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
[ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
[ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
[ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
[ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
[ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
[ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8308.594469] Call Trace:
[ 8308.595149]  account_page_cleaned+0x15b/0x1f0
[ 8308.596340]  __cancel_dirty_page+0x146/0x200
[ 8308.599395]  truncate_cleanup_page+0x92/0xb0
[ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
[ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
[ 8308.619108]  evict+0xc1/0x190
[ 8308.620023]  do_unlinkat+0x176/0x280
[ 8308.621202]  do_syscall_64+0x63/0x1a0
[ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

The fix here is to make asyc_cow->locked_page NULL everywhere but the
one async_cow struct that's allowed to do things to the locked page.

Signed-off-by: Chris Mason <clm@fb.com>
Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c |  2 +-
 fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
 2 files changed, 22 insertions(+), 5 deletions(-)

Comments

Nikolay Borisov July 11, 2019, 4 p.m. UTC | #1
On 10.07.19 г. 22:28 ч., Tejun Heo wrote:
> From: Chris Mason <clm@fb.com>
> 
> The btrfs writepages function collects a large range of pages flagged
> for delayed allocation, and then sends them down through the COW code
> for processing.  When compression is on, we allocate one async_cow

nit: The code no longer uses async_cow to represent in-flight chunks but
the more aptly named async_chunk. Presumably this patchset predates
those changes.

> structure for every 512K, and then run those pages through the
> compression code for IO submission.
> 
> writepages starts all of this off with a single page, locked by
> the original call to extent_write_cache_pages(), and it's important to
> keep track of this page because it has already been through
> clear_page_dirty_for_io().

IMO it will be beneficial to state what are the implications of
clear_page_dirty_for_io being called, i.e what special handling should
this particular page receive to the rest of its lifetime.

> 
> The btrfs async_cow struct has a pointer to the locked_page, and when
> we're redirtying the page because compression had to fallback to
> uncompressed IO, we use page->index to decide if a given async_cow
> struct really owns that page.
> 
> But, this is racey.  If a given delalloc range is broken up into two
> async_cows (cow_A and cow_B), we can end up with something like this:
> 
> compress_file_range(cowA)
> submit_compress_extents(cowA)
> submit compressed bios(cowA)
> put_page(locked_page)
> 
> 				compress_file_range(cowB)
> 				...

This call trace is _really_ hand wavy and the correct one is more
complex, hence it should be something like :

async_cow_submit
 submit_compressed_extents <--- falls back to buffered writeout
  cow_file_range
   extent_clear_unlock_delalloc
    __process_pages_contig
      put_page(locked_pages)

                                           async_cow_submit

> 
> The end result is that cowA is completed and cleaned up before cowB even
> starts processing.  This means we can free locked_page() and reuse it
> elsewhere.  If we get really lucky, it'll have the same page->index in
> its new home as it did before.
> 
> While we're processing cowB, we might decide we need to fall back to
> uncompressed IO, and so compress_file_range() will call
> __set_page_dirty_nobufers() on cowB->locked_page.
> 
> Without cgroups in use, this creates as a phantom dirty page, which> isn't great but isn't the end of the world.  With cgroups in use, we

Having a phantom dirty page is not great but not terrible without
cgroups but apart from that, does it have any other implications?


> might crash in the accounting code because page->mapping->i_wb isn't
> set.
> 
> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 8308.594469] Call Trace:
> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
> [ 8308.619108]  evict+0xc1/0x190
> [ 8308.620023]  do_unlinkat+0x176/0x280
> [ 8308.621202]  do_syscall_64+0x63/0x1a0
> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> The fix here is to make asyc_cow->locked_page NULL everywhere but the
> one async_cow struct that's allowed to do things to the locked page.
> 
> Signed-off-by: Chris Mason <clm@fb.com>
> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/extent_io.c |  2 +-
>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>  2 files changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5106008f5e28..a31574df06aa 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
>  			if (page_ops & PAGE_SET_PRIVATE2)
>  				SetPagePrivate2(pages[i]);
>  
> -			if (pages[i] == locked_page) {
> +			if (locked_page && pages[i] == locked_page) {

Why not make the check just if (locked_page) then clean it up, since if
__process_pages_contig is called from the owner of the page then it's
guaranteed that the page will fall within it's range.

>  				put_page(pages[i]);
>  				pages_locked++;
>  				continue;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 6e6df0eab324..a81e9860ee1f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -666,10 +666,12 @@ static noinline void compress_file_range(struct async_chunk *async_chunk,
>  	 * to our extent and set things up for the async work queue to run
>  	 * cow_file_range to do the normal delalloc dance.
>  	 */
> -	if (page_offset(async_chunk->locked_page) >= start &&
> -	    page_offset(async_chunk->locked_page) <= end)
> +	if (async_chunk->locked_page &&
> +	    (page_offset(async_chunk->locked_page) >= start &&
> +	     page_offset(async_chunk->locked_page)) <= end) {

DITTO since locked_page is now only set to the chunk that has the right
to it then there is no need to check the offsets and this will simplify
the code.

>  		__set_page_dirty_nobuffers(async_chunk->locked_page);
>  		/* unlocked later on in the async handlers */
> +	}
>  
>  	if (redirty)
>  		extent_range_redirty_for_io(inode, start, end);
> @@ -759,7 +761,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
>  						  async_extent->start +
>  						  async_extent->ram_size - 1,
>  						  WB_SYNC_ALL);
> -			else if (ret)
> +			else if (ret && async_chunk->locked_page)
>  				unlock_page(async_chunk->locked_page);
>  			kfree(async_extent);
>  			cond_resched();
> @@ -1236,10 +1238,25 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
>  		async_chunk[i].inode = inode;
>  		async_chunk[i].start = start;
>  		async_chunk[i].end = cur_end;
> -		async_chunk[i].locked_page = locked_page;
>  		async_chunk[i].write_flags = write_flags;
>  		INIT_LIST_HEAD(&async_chunk[i].extents);
>  
> +		/*
> +		 * The locked_page comes all the way from writepage and its
> +		 * the original page we were actually given.  As we spread
> +		 * this large delalloc region across multiple async_cow
> +		 * structs, only the first struct needs a pointer to locked_page
> +		 *
> +		 * This way we don't need racey decisions about who is supposed
> +		 * to unlock it.
> +		 */
> +		if (locked_page) {
> +			async_chunk[i].locked_page = locked_page;
> +			locked_page = NULL;
> +		} else {
> +			async_chunk[i].locked_page = NULL;
> +		}
> +
>  		btrfs_init_work(&async_chunk[i].work,
>  				btrfs_delalloc_helper,
>  				async_cow_start, async_cow_submit,
>
Chris Mason July 11, 2019, 7:52 p.m. UTC | #2
On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:

> On 10.07.19 г. 22:28 ч., Tejun Heo wrote:
>> From: Chris Mason <clm@fb.com>
>>
>> The btrfs writepages function collects a large range of pages flagged
>> for delayed allocation, and then sends them down through the COW code
>> for processing.  When compression is on, we allocate one async_cow
>
> nit: The code no longer uses async_cow to represent in-flight chunks 
> but
> the more aptly named async_chunk. Presumably this patchset predates
> those changes.

Not by much, but yes.

>
>>
>> The end result is that cowA is completed and cleaned up before cowB 
>> even
>> starts processing.  This means we can free locked_page() and reuse it
>> elsewhere.  If we get really lucky, it'll have the same page->index 
>> in
>> its new home as it did before.
>>
>> While we're processing cowB, we might decide we need to fall back to
>> uncompressed IO, and so compress_file_range() will call
>> __set_page_dirty_nobufers() on cowB->locked_page.
>>
>> Without cgroups in use, this creates as a phantom dirty page, which> 
>> isn't great but isn't the end of the world.  With cgroups in use, we
>
> Having a phantom dirty page is not great but not terrible without
> cgroups but apart from that, does it have any other implications?

Best case, it'll go through the writepage fixup worker and go through 
the whole cow machinery again.  Worst case we go to this code more than 
once:

                         /*
                          * if page_started, cow_file_range inserted an
                          * inline extent and took care of all the 
unlocking
                          * and IO for us.  Otherwise, we need to submit
                          * all those pages down to the drive.
                          */
                         if (!page_started && !ret)
                                 extent_write_locked_range(inode,
                                                   async_extent->start,
                                                   async_extent->start +
                                                   async_extent->ram_size 
- 1,
                                                   WB_SYNC_ALL);
                         else if (ret)
                                 unlock_page(async_chunk->locked_page);


That never happened in production as far as I can tell, but it seems 
possible.

>
>
>> might crash in the accounting code because page->mapping->i_wb isn't
>> set.
>>
>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference 
>> at 00000000000000d0
>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 
>> 0000000000026115
>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 
>> 0000000000000090
>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 
>> 0000000000000000
>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: 
>> ffffffffffffffff
>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 
>> 0000000000000001
>> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) 
>> knlGS:0000000000000000
>> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 
>> 0000000000360ee0
>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>> 0000000000000000
>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>> 0000000000000400
>> [ 8308.594469] Call Trace:
>> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
>> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
>> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
>> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
>> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
>> [ 8308.619108]  evict+0xc1/0x190
>> [ 8308.620023]  do_unlinkat+0x176/0x280
>> [ 8308.621202]  do_syscall_64+0x63/0x1a0
>> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> The fix here is to make asyc_cow->locked_page NULL everywhere but the
>> one async_cow struct that's allowed to do things to the locked page.
>>
>> Signed-off-by: Chris Mason <clm@fb.com>
>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and 
>> reads")
>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>  fs/btrfs/extent_io.c |  2 +-
>>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>>  2 files changed, 22 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 5106008f5e28..a31574df06aa 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct 
>> address_space *mapping,
>>  			if (page_ops & PAGE_SET_PRIVATE2)
>>  				SetPagePrivate2(pages[i]);
>>
>> -			if (pages[i] == locked_page) {
>> +			if (locked_page && pages[i] == locked_page) {
>
> Why not make the check just if (locked_page) then clean it up, since 
> if
> __process_pages_contig is called from the owner of the page then it's
> guaranteed that the page will fall within it's range.

I'm not convinced that every single caller of __process_pages_contig is 
making sure to only send locked_page for ranges that correspond to the 
locked_page.  I'm not sure exactly what you're asking for though, it 
looks like it would require some larger changes to the flow of that 
loop.

>
>>  				put_page(pages[i]);
>>  				pages_locked++;
>>  				continue;
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 6e6df0eab324..a81e9860ee1f 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -666,10 +666,12 @@ static noinline void compress_file_range(struct 
>> async_chunk *async_chunk,
>>  	 * to our extent and set things up for the async work queue to run
>>  	 * cow_file_range to do the normal delalloc dance.
>>  	 */
>> -	if (page_offset(async_chunk->locked_page) >= start &&
>> -	    page_offset(async_chunk->locked_page) <= end)
>> +	if (async_chunk->locked_page &&
>> +	    (page_offset(async_chunk->locked_page) >= start &&
>> +	     page_offset(async_chunk->locked_page)) <= end) {
>
> DITTO since locked_page is now only set to the chunk that has the 
> right
> to it then there is no need to check the offsets and this will 
> simplify
> the code.
>

start is adjusted higher up in the loop:

                         if (start + total_in < end) {
                                 start += total_in;
                                 pages = NULL;
                                 cond_resched();
                                 goto again;
                         }

So we might get to the __set_page_dirty_nobuffers() test with a range 
that no longer corresponds to the locked page.

-chris
Nikolay Borisov July 26, 2019, 3:29 p.m. UTC | #3
On 11.07.19 г. 22:52 ч., Chris Mason wrote:
> On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:
> 
>> On 10.07.19 г. 22:28 ч., Tejun Heo wrote:
>>> From: Chris Mason <clm@fb.com>
>>>
>>> The btrfs writepages function collects a large range of pages flagged
>>> for delayed allocation, and then sends them down through the COW code
>>> for processing.  When compression is on, we allocate one async_cow
>>
>> nit: The code no longer uses async_cow to represent in-flight chunks 
>> but
>> the more aptly named async_chunk. Presumably this patchset predates
>> those changes.
> 
> Not by much, but yes.
> 
>>
>>>
>>> The end result is that cowA is completed and cleaned up before cowB 
>>> even
>>> starts processing.  This means we can free locked_page() and reuse it
>>> elsewhere.  If we get really lucky, it'll have the same page->index 
>>> in
>>> its new home as it did before.
>>>
>>> While we're processing cowB, we might decide we need to fall back to
>>> uncompressed IO, and so compress_file_range() will call
>>> __set_page_dirty_nobufers() on cowB->locked_page.
>>>
>>> Without cgroups in use, this creates as a phantom dirty page, which> 
>>> isn't great but isn't the end of the world.  With cgroups in use, we
>>
>> Having a phantom dirty page is not great but not terrible without
>> cgroups but apart from that, does it have any other implications?
> 
> Best case, it'll go through the writepage fixup worker and go through 
> the whole cow machinery again.  Worst case we go to this code more than 
> once:
> 
>                          /*
>                           * if page_started, cow_file_range inserted an
>                           * inline extent and took care of all the 
> unlocking
>                           * and IO for us.  Otherwise, we need to submit
>                           * all those pages down to the drive.
>                           */
>                          if (!page_started && !ret)
>                                  extent_write_locked_range(inode,
>                                                    async_extent->start,
>                                                    async_extent->start +
>                                                    async_extent->ram_size 
> - 1,
>                                                    WB_SYNC_ALL);
>                          else if (ret)
>                                  unlock_page(async_chunk->locked_page);
> 
> 
> That never happened in production as far as I can tell, but it seems 
> possible.
> 
>>
>>
>>> might crash in the accounting code because page->mapping->i_wb isn't
>>> set.
>>>
>>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference 
>>> at 00000000000000d0
>>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 
>>> 0000000000026115
>>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 
>>> 0000000000000090
>>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 
>>> 0000000000000000
>>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: 
>>> ffffffffffffffff
>>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 
>>> 0000000000000001
>>> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) 
>>> knlGS:0000000000000000
>>> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 
>>> 0000000000360ee0
>>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>> 0000000000000000
>>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>>> 0000000000000400
>>> [ 8308.594469] Call Trace:
>>> [ 8308.595149]  account_page_cleaned+0x15b/0x1f0
>>> [ 8308.596340]  __cancel_dirty_page+0x146/0x200
>>> [ 8308.599395]  truncate_cleanup_page+0x92/0xb0
>>> [ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
>>> [ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
>>> [ 8308.619108]  evict+0xc1/0x190
>>> [ 8308.620023]  do_unlinkat+0x176/0x280
>>> [ 8308.621202]  do_syscall_64+0x63/0x1a0
>>> [ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> The fix here is to make asyc_cow->locked_page NULL everywhere but the
>>> one async_cow struct that's allowed to do things to the locked page.
>>>
>>> Signed-off-by: Chris Mason <clm@fb.com>
>>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and 
>>> reads")
>>> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
>>> ---
>>>  fs/btrfs/extent_io.c |  2 +-
>>>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>>>  2 files changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 5106008f5e28..a31574df06aa 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct 
>>> address_space *mapping,
>>>  			if (page_ops & PAGE_SET_PRIVATE2)
>>>  				SetPagePrivate2(pages[i]);
>>>
>>> -			if (pages[i] == locked_page) {
>>> +			if (locked_page && pages[i] == locked_page) {
>>
>> Why not make the check just if (locked_page) then clean it up, since 
>> if
>> __process_pages_contig is called from the owner of the page then it's
>> guaranteed that the page will fall within it's range.
> 
> I'm not convinced that every single caller of __process_pages_contig is 
> making sure to only send locked_page for ranges that correspond to the 
> locked_page.  I'm not sure exactly what you're asking for though, it 
> looks like it would require some larger changes to the flow of that 
> loop.


What I meant it is to simply factor out the code dealing with locked
page outside of the loop and still place it inside
__process_pages_contig. Also looking at the way locked_pages is passed
across different call chains I arrive at:


compress_file_range  <-- locked page is null
 extent_clear_unlock_delalloc
  __process_pages_contig

submit_compressed_extents <---- locked page is null
 extent_clear_unlock_delalloc
  __process_pages_contig

btrfs_run_delalloc_range | run_delalloc_nocow
 cow_file_range <--- [when called from btrfs_run_delalloc_range we are
all fine and dandy because it will always iterates a range which belongs
to the page. So we can free the page and set it null for subsequent
passes of the loop.]

Looking run_delalloc_nocow I see the page is used 5
times - 2 of those, at the beginning and end of the function, are only
used during error cases. The other 2 times is if cow_start is different
than -1 , which happens if !nocow is true. I've yet to wrap my head
around run_delalloc_nocow but I think it should also be safe to pass
locked page just once.

cow_file_range_async <--- always called with the correct locked page, in
this case the function is called before any async chunks are going to be
submitted.
 extent_clear_unlock_delalloc
  __process_pages_contig

btrfs_run_delalloc_range <--- this one is called with locked_page
belonging to the passed delalloc range.
 run_delalloc_nocow
  extent_clear_unlock_delalloc
   __process_pages_contig


writepage_delalloc <-- calls find_lock_delalloc_range only if we aren't
caalled from compress path and the start range always belongs to the page
 find_lock_delalloc_range <----  if the range is not delalloc it will
retry. But that function is also called with the correct page.
  lock_delalloc_pages <--- ignores range which belongs only to this page
    __unlock_for_delaloc <--- ignores range which belongs only to this page




<snip>

Patch
diff mbox series

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5106008f5e28..a31574df06aa 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1838,7 +1838,7 @@  static int __process_pages_contig(struct address_space *mapping,
 			if (page_ops & PAGE_SET_PRIVATE2)
 				SetPagePrivate2(pages[i]);
 
-			if (pages[i] == locked_page) {
+			if (locked_page && pages[i] == locked_page) {
 				put_page(pages[i]);
 				pages_locked++;
 				continue;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6e6df0eab324..a81e9860ee1f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -666,10 +666,12 @@  static noinline void compress_file_range(struct async_chunk *async_chunk,
 	 * to our extent and set things up for the async work queue to run
 	 * cow_file_range to do the normal delalloc dance.
 	 */
-	if (page_offset(async_chunk->locked_page) >= start &&
-	    page_offset(async_chunk->locked_page) <= end)
+	if (async_chunk->locked_page &&
+	    (page_offset(async_chunk->locked_page) >= start &&
+	     page_offset(async_chunk->locked_page)) <= end) {
 		__set_page_dirty_nobuffers(async_chunk->locked_page);
 		/* unlocked later on in the async handlers */
+	}
 
 	if (redirty)
 		extent_range_redirty_for_io(inode, start, end);
@@ -759,7 +761,7 @@  static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 						  async_extent->start +
 						  async_extent->ram_size - 1,
 						  WB_SYNC_ALL);
-			else if (ret)
+			else if (ret && async_chunk->locked_page)
 				unlock_page(async_chunk->locked_page);
 			kfree(async_extent);
 			cond_resched();
@@ -1236,10 +1238,25 @@  static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		async_chunk[i].inode = inode;
 		async_chunk[i].start = start;
 		async_chunk[i].end = cur_end;
-		async_chunk[i].locked_page = locked_page;
 		async_chunk[i].write_flags = write_flags;
 		INIT_LIST_HEAD(&async_chunk[i].extents);
 
+		/*
+		 * The locked_page comes all the way from writepage and its
+		 * the original page we were actually given.  As we spread
+		 * this large delalloc region across multiple async_cow
+		 * structs, only the first struct needs a pointer to locked_page
+		 *
+		 * This way we don't need racey decisions about who is supposed
+		 * to unlock it.
+		 */
+		if (locked_page) {
+			async_chunk[i].locked_page = locked_page;
+			locked_page = NULL;
+		} else {
+			async_chunk[i].locked_page = NULL;
+		}
+
 		btrfs_init_work(&async_chunk[i].work,
 				btrfs_delalloc_helper,
 				async_cow_start, async_cow_submit,