Bug#841144: kernel BUG at /build/linux-Wgpe2M/linux-4.8.11/fs/ocfs2/alloc.c:1514!
diff mbox

Message ID 1511204090.3644.6.camel@nixnuts.net
State New
Headers show

Commit Message

John Lightsey Nov. 20, 2017, 6:54 p.m. UTC
In January Ben Hutchings reported Debian bug 841144 to the ocfs2-devel
list:

https://oss.oracle.com/pipermail/ocfs2-devel/2017-January/012701.html

cPanel encountered this bug after upgrading our cluster to the 4.9
Debian stable kernel. In our environment, the bug would trigger every
few hours.

The core problem seems to be that the size of dw_zero_list is not
tracked correctly. This causes the ocfs2_lock_allocators() call in
ocfs2_dio_end_io_write() to underestimate the number of extents needed.
As a result, meta_ac is null when it's needed in ocfs2_grow_tree().

The attached patch is a forward-ported version of the fix we applied to
Debian's 4.9 kernel to correct the issue.

Comments

Changwei Ge Nov. 21, 2017, 12:58 a.m. UTC | #1
Hi John,
It's better to paste your patch directly into message body. It's easy 
for reviewing.

So I copied your patch below:

> The dw_zero_count tracking was assuming that w_unwritten_list would
> always contain one element. The actual count is now tracked whenever
> the list is extended.
> ---
>  fs/ocfs2/aops.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 88a31e9340a0..eb0a81368dbb 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -784,6 +784,8 @@ struct ocfs2_write_ctxt {
>  	struct ocfs2_cached_dealloc_ctxt w_dealloc;
>  
>  	struct list_head		w_unwritten_list;
> +
> +	unsigned int			w_unwritten_count;
>  };
>  
>  void ocfs2_unlock_and_free_pages(struct page **pages, int num_pages)
> @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp,
>  
>  	ocfs2_init_dealloc_ctxt(&wc->w_dealloc);
>  	INIT_LIST_HEAD(&wc->w_unwritten_list);
> +	wc->w_unwritten_count = 0;

I think you don't have to evaluate ::w_unwritten_count to zero since 
kzalloc already did that.

>  
>  	*wcp = wc;
>  
> @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode *inode,
>  	desc->c_clear_unwritten = 0;
>  	list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list);
>  	list_add_tail(&new->ue_node, &wc->w_unwritten_list);
> +	wc->w_unwritten_count++;

You increase ::w_unwritten_coun once a new _ue_ is attached to 
::w_unwritten_list. So if no _ue_ ever is attached, ::w_unwritten_list 
is still empty. I think your change has the same effect with origin.

Moreover I don't see the relation between the reported crash issue and 
your patch change. Can you elaborate further?

Thanks,
Changwei

>  	new = NULL;
>  unlock:
>  	spin_unlock(&oi->ip_lock);
> @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode *inode, sector_t iblock,
>  		ue->ue_phys = desc->c_phys;
>  
>  		list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list);
> -		dwc->dw_zero_count++;
> +		dwc->dw_zero_count += wc->w_unwritten_count;
>  	}
>  
>  	ret = ocfs2_write_end_nolock(inode->i_mapping, pos, len, len, wc);
> -- 
> 2.11.0



On 2017/11/21 2:56, John Lightsey wrote:
> In January Ben Hutchings reported Debian bug 841144 to the ocfs2-devel
> list:
> 
> https://oss.oracle.com/pipermail/ocfs2-devel/2017-January/012701.html
> 
> cPanel encountered this bug after upgrading our cluster to the 4.9
> Debian stable kernel. In our environment, the bug would trigger every
> few hours.
> 
> The core problem seems to be that the size of dw_zero_list is not
> tracked correctly. This causes the ocfs2_lock_allocators() call in
> ocfs2_dio_end_io_write() to underestimate the number of extents needed.
> As a result, meta_ac is null when it's needed in ocfs2_grow_tree().
> 
> The attached patch is a forward-ported version of the fix we applied to
> Debian's 4.9 kernel to correct the issue.
>
piaojun Nov. 21, 2017, 3:04 a.m. UTC | #2
Hi John,

On 2017/11/21 8:58, Changwei Ge wrote:
> Hi John,
> It's better to paste your patch directly into message body. It's easy 
> for reviewing.
> 
> So I copied your patch below:
> 
>> The dw_zero_count tracking was assuming that w_unwritten_list would
>> always contain one element. The actual count is now tracked whenever
>> the list is extended.
>> ---
>>  fs/ocfs2/aops.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
>> index 88a31e9340a0..eb0a81368dbb 100644
>> --- a/fs/ocfs2/aops.c
>> +++ b/fs/ocfs2/aops.c
>> @@ -784,6 +784,8 @@ struct ocfs2_write_ctxt {
>>  	struct ocfs2_cached_dealloc_ctxt w_dealloc;
>>  
>>  	struct list_head		w_unwritten_list;
>> +
>> +	unsigned int			w_unwritten_count;
>>  };
>>  
>>  void ocfs2_unlock_and_free_pages(struct page **pages, int num_pages)
>> @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp,
>>  
>>  	ocfs2_init_dealloc_ctxt(&wc->w_dealloc);
>>  	INIT_LIST_HEAD(&wc->w_unwritten_list);
>> +	wc->w_unwritten_count = 0;
> 
> I think you don't have to evaluate ::w_unwritten_count to zero since 
> kzalloc already did that.
> 
>>  
>>  	*wcp = wc;
>>  
>> @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode *inode,
>>  	desc->c_clear_unwritten = 0;
>>  	list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list);
>>  	list_add_tail(&new->ue_node, &wc->w_unwritten_list);
>> +	wc->w_unwritten_count++;
> 
> You increase ::w_unwritten_coun once a new _ue_ is attached to 
> ::w_unwritten_list. So if no _ue_ ever is attached, ::w_unwritten_list 
> is still empty. I think your change has the same effect with origin.
> 
> Moreover I don't see the relation between the reported crash issue and 
> your patch change. Can you elaborate further?
> 
> Thanks,
> Changwei
> 
>>  	new = NULL;
>>  unlock:
>>  	spin_unlock(&oi->ip_lock);
>> @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode *inode, sector_t iblock,
>>  		ue->ue_phys = desc->c_phys;
>>  
>>  		list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list);
>> -		dwc->dw_zero_count++;
>> +		dwc->dw_zero_count += wc->w_unwritten_count;

I prefer using a loop to calculate 'dwc->dw_zero_count' rather than
introducing a new variable as below:

list_for_each(iter, &wc->w_unwritten_list)
	dwc->dw_zero_count++;

thanks,
Jun

>>  	}
>>  
>>  	ret = ocfs2_write_end_nolock(inode->i_mapping, pos, len, len, wc);
>> -- 
>> 2.11.0
> 
> 
> 
> On 2017/11/21 2:56, John Lightsey wrote:
>> In January Ben Hutchings reported Debian bug 841144 to the ocfs2-devel
>> list:
>>
>> https://oss.oracle.com/pipermail/ocfs2-devel/2017-January/012701.html
>>
>> cPanel encountered this bug after upgrading our cluster to the 4.9
>> Debian stable kernel. In our environment, the bug would trigger every
>> few hours.
>>
>> The core problem seems to be that the size of dw_zero_list is not
>> tracked correctly. This causes the ocfs2_lock_allocators() call in
>> ocfs2_dio_end_io_write() to underestimate the number of extents needed.
>> As a result, meta_ac is null when it's needed in ocfs2_grow_tree().
>>
>> The attached patch is a forward-ported version of the fix we applied to
>> Debian's 4.9 kernel to correct the issue.
>>
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>
Changwei Ge Nov. 21, 2017, 5:58 a.m. UTC | #3
On 2017/11/21 10:45, John Lightsey wrote:
> On Tue, 2017-11-21 at 00:58 +0000, Changwei Ge wrote:
>>> @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct
>>> ocfs2_write_ctxt **wcp,
>>>   
>>>   	ocfs2_init_dealloc_ctxt(&wc->w_dealloc);
>>>   	INIT_LIST_HEAD(&wc->w_unwritten_list);
>>> +	wc->w_unwritten_count = 0;
>>
>> I think you don't have to evaluate ::w_unwritten_count to zero since
>> kzalloc already did that.
> 
> Very true. I was following the example of how dwc was handling the
> dw_zero_count. You'll have to forgive me a bit. I'm very unfamiliar
> with the linux kernel codebase.
> 
>>
>>>   
>>>   	*wcp = wc;
>>>   
>>> @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode
>>> *inode,
>>>   	desc->c_clear_unwritten = 0;
>>>   	list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list);
>>>   	list_add_tail(&new->ue_node, &wc->w_unwritten_list);
>>> +	wc->w_unwritten_count++;
>>
>> You increase ::w_unwritten_coun once a new _ue_ is attached to
>> ::w_unwritten_list. So if no _ue_ ever is attached,
>> ::w_unwritten_list
>> is still empty. I think your change has the same effect with origin.
>>
>> Moreover I don't see the relation between the reported crash issue
>> and your patch change. Can you elaborate further?
> 
> The important part is in the next segment in the patch. This block is
> just using w_unwritten_count to track the size of w_unwritten_list.
> 
>>> @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode
>>> *inode, sector_t iblock,
>>>   		ue->ue_phys = desc->c_phys;
>>>   
>>>   		list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list);
>>> -		dwc->dw_zero_count++;
>>> +		dwc->dw_zero_count += wc->w_unwritten_count;
>>>   	}
>>>   
>>>
> 
> dw_zero_count is tracking the number of elements in dw_zero_list.
> 
> The old version assumed that after dw_zero_list and w_unwritten_list
> were spliced together, that the new length was dw_zero_count + 1. This
> assumption is not correct if w_unwritten_list contained more than one
> element.
> 
> The length of dw_zero_list is used by ocfs2_dio_end_io_write() to
> determine whether or not meta_ac will be needed to complete the write:
> 
>      ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2,
>                      &data_ac, &meta_ac);
Hi John,

Thanks for reporting.
I probably get your point.

Can your tell me how did you format your volume?
What's your _cluster size_ and _block size_?
Your can obtain such information via debugfs.ocfs2 <your volume> -R 
'stats' | grep 'Cluster Size'

It's better for you provide a way to reproduce this issue so that we can 
perform some test.

Thanks,
Changwei

> 
> This will return with success and a null meta_ac if there are at least
> dw_zero_count * 2 extents available for the write.
> 
> Since dw_zero_count was not being calculated correctly, this will
> occasionally result in the write getting into ocfs2_grow_tree() with a
> null meta_ac following this chain:
> 
> ocfs2_dio_end_io_write()
> ocfs2_mark_extent_written()
> ocfs2_change_extent_flag()
> ocfs2_split_extent()
> ocfs2_split_and_insert()
> ocfs2_grow_tree()
> 
> That's my understanding of what's causing the bug.
> 
> Our OCFS2 cluster was crashing every two to three hours after we
> upgraded to a 4.x kernel. We've gone about 18 hours with this patch
> applied and no crashes.
>

Patch
diff mbox

From a3107e92b07ed95752d72703ee53ae71a7607098 Mon Sep 17 00:00:00 2001
From: John Lightsey <jd@cpanel.net>
Date: Mon, 20 Nov 2017 12:05:37 -0600
Subject: [PATCH] Fix OCFS2 extent split estimation for dio allocators locking.

The dw_zero_count tracking was assuming that w_unwritten_list would
always contain one element. The actual count is now tracked whenever
the list is extended.
---
 fs/ocfs2/aops.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 88a31e9340a0..eb0a81368dbb 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -784,6 +784,8 @@  struct ocfs2_write_ctxt {
 	struct ocfs2_cached_dealloc_ctxt w_dealloc;
 
 	struct list_head		w_unwritten_list;
+
+	unsigned int			w_unwritten_count;
 };
 
 void ocfs2_unlock_and_free_pages(struct page **pages, int num_pages)
@@ -873,6 +875,7 @@  static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp,
 
 	ocfs2_init_dealloc_ctxt(&wc->w_dealloc);
 	INIT_LIST_HEAD(&wc->w_unwritten_list);
+	wc->w_unwritten_count = 0;
 
 	*wcp = wc;
 
@@ -1373,6 +1376,7 @@  static int ocfs2_unwritten_check(struct inode *inode,
 	desc->c_clear_unwritten = 0;
 	list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list);
 	list_add_tail(&new->ue_node, &wc->w_unwritten_list);
+	wc->w_unwritten_count++;
 	new = NULL;
 unlock:
 	spin_unlock(&oi->ip_lock);
@@ -2246,7 +2250,7 @@  static int ocfs2_dio_get_block(struct inode *inode, sector_t iblock,
 		ue->ue_phys = desc->c_phys;
 
 		list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list);
-		dwc->dw_zero_count++;
+		dwc->dw_zero_count += wc->w_unwritten_count;
 	}
 
 	ret = ocfs2_write_end_nolock(inode->i_mapping, pos, len, len, wc);
-- 
2.11.0