Message ID | 1511204090.3644.6.camel@nixnuts.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi John, It's better to paste your patch directly into message body. It's easy for reviewing. So I copied your patch below: > The dw_zero_count tracking was assuming that w_unwritten_list would > always contain one element. The actual count is now tracked whenever > the list is extended. > --- > fs/ocfs2/aops.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c > index 88a31e9340a0..eb0a81368dbb 100644 > --- a/fs/ocfs2/aops.c > +++ b/fs/ocfs2/aops.c > @@ -784,6 +784,8 @@ struct ocfs2_write_ctxt { > struct ocfs2_cached_dealloc_ctxt w_dealloc; > > struct list_head w_unwritten_list; > + > + unsigned int w_unwritten_count; > }; > > void ocfs2_unlock_and_free_pages(struct page **pages, int num_pages) > @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp, > > ocfs2_init_dealloc_ctxt(&wc->w_dealloc); > INIT_LIST_HEAD(&wc->w_unwritten_list); > + wc->w_unwritten_count = 0; I think you don't have to evaluate ::w_unwritten_count to zero since kzalloc already did that. > > *wcp = wc; > > @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode *inode, > desc->c_clear_unwritten = 0; > list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list); > list_add_tail(&new->ue_node, &wc->w_unwritten_list); > + wc->w_unwritten_count++; You increase ::w_unwritten_coun once a new _ue_ is attached to ::w_unwritten_list. So if no _ue_ ever is attached, ::w_unwritten_list is still empty. I think your change has the same effect with origin. Moreover I don't see the relation between the reported crash issue and your patch change. Can you elaborate further? Thanks, Changwei > new = NULL; > unlock: > spin_unlock(&oi->ip_lock); > @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode *inode, sector_t iblock, > ue->ue_phys = desc->c_phys; > > list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list); > - dwc->dw_zero_count++; > + dwc->dw_zero_count += wc->w_unwritten_count; > } > > ret = ocfs2_write_end_nolock(inode->i_mapping, pos, len, len, wc); > -- > 2.11.0 On 2017/11/21 2:56, John Lightsey wrote: > In January Ben Hutchings reported Debian bug 841144 to the ocfs2-devel > list: > > https://oss.oracle.com/pipermail/ocfs2-devel/2017-January/012701.html > > cPanel encountered this bug after upgrading our cluster to the 4.9 > Debian stable kernel. In our environment, the bug would trigger every > few hours. > > The core problem seems to be that the size of dw_zero_list is not > tracked correctly. This causes the ocfs2_lock_allocators() call in > ocfs2_dio_end_io_write() to underestimate the number of extents needed. > As a result, meta_ac is null when it's needed in ocfs2_grow_tree(). > > The attached patch is a forward-ported version of the fix we applied to > Debian's 4.9 kernel to correct the issue. >
Hi John, On 2017/11/21 8:58, Changwei Ge wrote: > Hi John, > It's better to paste your patch directly into message body. It's easy > for reviewing. > > So I copied your patch below: > >> The dw_zero_count tracking was assuming that w_unwritten_list would >> always contain one element. The actual count is now tracked whenever >> the list is extended. >> --- >> fs/ocfs2/aops.c | 6 +++++- >> 1 file changed, 5 insertions(+), 1 deletion(-) >> >> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c >> index 88a31e9340a0..eb0a81368dbb 100644 >> --- a/fs/ocfs2/aops.c >> +++ b/fs/ocfs2/aops.c >> @@ -784,6 +784,8 @@ struct ocfs2_write_ctxt { >> struct ocfs2_cached_dealloc_ctxt w_dealloc; >> >> struct list_head w_unwritten_list; >> + >> + unsigned int w_unwritten_count; >> }; >> >> void ocfs2_unlock_and_free_pages(struct page **pages, int num_pages) >> @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp, >> >> ocfs2_init_dealloc_ctxt(&wc->w_dealloc); >> INIT_LIST_HEAD(&wc->w_unwritten_list); >> + wc->w_unwritten_count = 0; > > I think you don't have to evaluate ::w_unwritten_count to zero since > kzalloc already did that. > >> >> *wcp = wc; >> >> @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode *inode, >> desc->c_clear_unwritten = 0; >> list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list); >> list_add_tail(&new->ue_node, &wc->w_unwritten_list); >> + wc->w_unwritten_count++; > > You increase ::w_unwritten_coun once a new _ue_ is attached to > ::w_unwritten_list. So if no _ue_ ever is attached, ::w_unwritten_list > is still empty. I think your change has the same effect with origin. > > Moreover I don't see the relation between the reported crash issue and > your patch change. Can you elaborate further? > > Thanks, > Changwei > >> new = NULL; >> unlock: >> spin_unlock(&oi->ip_lock); >> @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode *inode, sector_t iblock, >> ue->ue_phys = desc->c_phys; >> >> list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list); >> - dwc->dw_zero_count++; >> + dwc->dw_zero_count += wc->w_unwritten_count; I prefer using a loop to calculate 'dwc->dw_zero_count' rather than introducing a new variable as below: list_for_each(iter, &wc->w_unwritten_list) dwc->dw_zero_count++; thanks, Jun >> } >> >> ret = ocfs2_write_end_nolock(inode->i_mapping, pos, len, len, wc); >> -- >> 2.11.0 > > > > On 2017/11/21 2:56, John Lightsey wrote: >> In January Ben Hutchings reported Debian bug 841144 to the ocfs2-devel >> list: >> >> https://oss.oracle.com/pipermail/ocfs2-devel/2017-January/012701.html >> >> cPanel encountered this bug after upgrading our cluster to the 4.9 >> Debian stable kernel. In our environment, the bug would trigger every >> few hours. >> >> The core problem seems to be that the size of dw_zero_list is not >> tracked correctly. This causes the ocfs2_lock_allocators() call in >> ocfs2_dio_end_io_write() to underestimate the number of extents needed. >> As a result, meta_ac is null when it's needed in ocfs2_grow_tree(). >> >> The attached patch is a forward-ported version of the fix we applied to >> Debian's 4.9 kernel to correct the issue. >> > > > _______________________________________________ > Ocfs2-devel mailing list > Ocfs2-devel@oss.oracle.com > https://oss.oracle.com/mailman/listinfo/ocfs2-devel >
On 2017/11/21 10:45, John Lightsey wrote: > On Tue, 2017-11-21 at 00:58 +0000, Changwei Ge wrote: >>> @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct >>> ocfs2_write_ctxt **wcp, >>> >>> ocfs2_init_dealloc_ctxt(&wc->w_dealloc); >>> INIT_LIST_HEAD(&wc->w_unwritten_list); >>> + wc->w_unwritten_count = 0; >> >> I think you don't have to evaluate ::w_unwritten_count to zero since >> kzalloc already did that. > > Very true. I was following the example of how dwc was handling the > dw_zero_count. You'll have to forgive me a bit. I'm very unfamiliar > with the linux kernel codebase. > >> >>> >>> *wcp = wc; >>> >>> @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode >>> *inode, >>> desc->c_clear_unwritten = 0; >>> list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list); >>> list_add_tail(&new->ue_node, &wc->w_unwritten_list); >>> + wc->w_unwritten_count++; >> >> You increase ::w_unwritten_coun once a new _ue_ is attached to >> ::w_unwritten_list. So if no _ue_ ever is attached, >> ::w_unwritten_list >> is still empty. I think your change has the same effect with origin. >> >> Moreover I don't see the relation between the reported crash issue >> and your patch change. Can you elaborate further? > > The important part is in the next segment in the patch. This block is > just using w_unwritten_count to track the size of w_unwritten_list. > >>> @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode >>> *inode, sector_t iblock, >>> ue->ue_phys = desc->c_phys; >>> >>> list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list); >>> - dwc->dw_zero_count++; >>> + dwc->dw_zero_count += wc->w_unwritten_count; >>> } >>> >>> > > dw_zero_count is tracking the number of elements in dw_zero_list. > > The old version assumed that after dw_zero_list and w_unwritten_list > were spliced together, that the new length was dw_zero_count + 1. This > assumption is not correct if w_unwritten_list contained more than one > element. > > The length of dw_zero_list is used by ocfs2_dio_end_io_write() to > determine whether or not meta_ac will be needed to complete the write: > > ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2, > &data_ac, &meta_ac); Hi John, Thanks for reporting. I probably get your point. Can your tell me how did you format your volume? What's your _cluster size_ and _block size_? Your can obtain such information via debugfs.ocfs2 <your volume> -R 'stats' | grep 'Cluster Size' It's better for you provide a way to reproduce this issue so that we can perform some test. Thanks, Changwei > > This will return with success and a null meta_ac if there are at least > dw_zero_count * 2 extents available for the write. > > Since dw_zero_count was not being calculated correctly, this will > occasionally result in the write getting into ocfs2_grow_tree() with a > null meta_ac following this chain: > > ocfs2_dio_end_io_write() > ocfs2_mark_extent_written() > ocfs2_change_extent_flag() > ocfs2_split_extent() > ocfs2_split_and_insert() > ocfs2_grow_tree() > > That's my understanding of what's causing the bug. > > Our OCFS2 cluster was crashing every two to three hours after we > upgraded to a 4.x kernel. We've gone about 18 hours with this patch > applied and no crashes. >
From a3107e92b07ed95752d72703ee53ae71a7607098 Mon Sep 17 00:00:00 2001 From: John Lightsey <jd@cpanel.net> Date: Mon, 20 Nov 2017 12:05:37 -0600 Subject: [PATCH] Fix OCFS2 extent split estimation for dio allocators locking. The dw_zero_count tracking was assuming that w_unwritten_list would always contain one element. The actual count is now tracked whenever the list is extended. --- fs/ocfs2/aops.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 88a31e9340a0..eb0a81368dbb 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -784,6 +784,8 @@ struct ocfs2_write_ctxt { struct ocfs2_cached_dealloc_ctxt w_dealloc; struct list_head w_unwritten_list; + + unsigned int w_unwritten_count; }; void ocfs2_unlock_and_free_pages(struct page **pages, int num_pages) @@ -873,6 +875,7 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp, ocfs2_init_dealloc_ctxt(&wc->w_dealloc); INIT_LIST_HEAD(&wc->w_unwritten_list); + wc->w_unwritten_count = 0; *wcp = wc; @@ -1373,6 +1376,7 @@ static int ocfs2_unwritten_check(struct inode *inode, desc->c_clear_unwritten = 0; list_add_tail(&new->ue_ip_node, &oi->ip_unwritten_list); list_add_tail(&new->ue_node, &wc->w_unwritten_list); + wc->w_unwritten_count++; new = NULL; unlock: spin_unlock(&oi->ip_lock); @@ -2246,7 +2250,7 @@ static int ocfs2_dio_get_block(struct inode *inode, sector_t iblock, ue->ue_phys = desc->c_phys; list_splice_tail_init(&wc->w_unwritten_list, &dwc->dw_zero_list); - dwc->dw_zero_count++; + dwc->dw_zero_count += wc->w_unwritten_count; } ret = ocfs2_write_end_nolock(inode->i_mapping, pos, len, len, wc); -- 2.11.0