Message ID | 1314512558-16912-1-git-send-email-gusev.vitaliy@nexenta.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sun, 2011-08-28 at 10:22 +0400, Vitaliy Gusev wrote: > pnfs_layout_segment can be already under handling LAYOUTCOMMIT, > so adding list twice causes hang: > > BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:4] > Call Trace: > > nfs4_layoutcommit_release+0x5a/0x9c [nfs] > rpc_release_calldata+0x17/0x19 [sunrpc] > rpc_free_task+0x5e/0x66 [sunrpc] > __rpc_execute+0x29e/0x2ad [sunrpc] > rpc_async_schedule+0x15/0x17 [sunrpc] > process_one_work+0x213/0x3ba > process_one_work+0x142/0x3ba > __rpc_execute+0x2ad/0x2ad [sunrpc] > worker_thread+0xfd/0x181 > > Signed-off-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com> > --- > fs/nfs/pnfs.c | 3 ++- > 1 files changed, 2 insertions(+), 1 deletions(-) > > diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c > index e550e88..1465f44 100644 > --- a/fs/nfs/pnfs.c > +++ b/fs/nfs/pnfs.c > @@ -1376,7 +1376,8 @@ static void pnfs_list_write_lseg(struct inode *inode, struct list_head *listp) > > list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) { > if (lseg->pls_range.iomode == IOMODE_RW && > - test_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags)) > + test_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags) && > + list_empty(&lseg->pls_lc_list)) > list_add(&lseg->pls_lc_list, listp); > } > } If the lseg is already part of one layoutcommit, but we're sending a second one for the same range (presumably because we wrote more data in the same region), then the above causes the lseg to be excluded. I agree that the current code causes list corruption, but before applying something like the above patch, I'd like to understand why it is correct. Trond
>> @@ -1376,7 +1376,8 @@ static void pnfs_list_write_lseg(struct inode *inode, struct list_head *listp) >> >> list_for_each_entry(lseg,&NFS_I(inode)->layout->plh_segs, pls_list) { >> if (lseg->pls_range.iomode == IOMODE_RW&& >> - test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg->pls_flags)) >> + test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg->pls_flags)&& >> + list_empty(&lseg->pls_lc_list)) >> list_add(&lseg->pls_lc_list, listp); >> } >> } > > If the lseg is already part of one layoutcommit, but we're sending a > second one for the same range (presumably because we wrote more data in > the same region), then the above causes the lseg to be excluded. Yes, lseg is excluded, This patch does exactly only exclusion of lseg. lseg is used here only to get/put reference on this lseg, so skipping is correct. However, checking on list_empty can occur (on other CPU) in the middle: list_del_init(&lseg->pls_lc_list); Here >>>>>> if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags)) put_lseg(lseg); So list_del_init must be executed under the same lock as pnfs_list_write_lseg, i.e. inode->i_lock. > > I agree that the current code causes list corruption, but before > applying something like the above patch, I'd like to understand why it > is correct. > > Trond > -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2011-09-07 at 02:13 +0400, Vitaliy Gusev wrote: > >> @@ -1376,7 +1376,8 @@ static void pnfs_list_write_lseg(struct inode *inode, struct list_head *listp) > >> > >> list_for_each_entry(lseg,&NFS_I(inode)->layout->plh_segs, pls_list) { > >> if (lseg->pls_range.iomode == IOMODE_RW&& > >> - test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg->pls_flags)) > >> + test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg->pls_flags)&& > >> + list_empty(&lseg->pls_lc_list)) > >> list_add(&lseg->pls_lc_list, listp); > >> } > >> } > > > > If the lseg is already part of one layoutcommit, but we're sending a > > second one for the same range (presumably because we wrote more data in > > the same region), then the above causes the lseg to be excluded. > > > Yes, lseg is excluded, This patch does exactly only exclusion of lseg. > lseg is used here only to get/put reference on this lseg, so skipping is > correct. Are you sure? As far as I can see, pnfs_list_write_lseg() actually assigns the lseg to a particular layoutcommit call. My questions are: if I write to an area described by that lseg _after_ it has been assigned to that layoutcommit RPC call (and the latter has already been dispatched to the metadata server), then A. shouldn't it be assigned to a second layoutcommit call too? B. If not, what are we doing to ensure mutual exclusion between layoutcommit RPC calls and pNFS writes to the data server? > However, checking on list_empty can occur (on other CPU) in the middle: > > list_del_init(&lseg->pls_lc_list); > Here >>>>>> > if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT, > &lseg->pls_flags)) > put_lseg(lseg); > > > So list_del_init must be executed under the same lock as > pnfs_list_write_lseg, i.e. inode->i_lock. Agreed if my questions above make no sense. Cheers Trond
On Thu, Sep 8, 2011 at 8:01 PM, Myklebust, Trond <Trond.Myklebust@netapp.com> wrote: >> -----Original Message----- >> From: tao.peng@emc.com [mailto:tao.peng@emc.com] >> Sent: Thursday, September 08, 2011 6:21 AM >> To: Myklebust, Trond; gusev.vitaliy@nexenta.com >> Cc: gusev.vitaliy@gmail.com; linux-nfs@vger.kernel.org >> Subject: RE: [PATCH] nfs: fix inifinite loop at >> nfs4_layoutcommit_release >> >> Hi, Trond, >> >> Sorry for the late response. >> >> > -----Original Message----- >> > From: Trond Myklebust [mailto:Trond.Myklebust@netapp.com] >> > Sent: Wednesday, September 07, 2011 6:33 AM >> > To: Vitaliy Gusev >> > Cc: Vitaliy Gusev; Peng, Tao; linux-nfs@vger.kernel.org >> > Subject: Re: [PATCH] nfs: fix inifinite loop at >> nfs4_layoutcommit_release >> > >> > On Wed, 2011-09-07 at 02:13 +0400, Vitaliy Gusev wrote: >> > > >> @@ -1376,7 +1376,8 @@ static void pnfs_list_write_lseg(struct >> inode *inode, >> > struct list_head *listp) >> > > >> >> > > >> list_for_each_entry(lseg,&NFS_I(inode)->layout->plh_segs, >> pls_list) { >> > > >> if (lseg->pls_range.iomode == IOMODE_RW&& >> > > >> - test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg->pls_flags)) >> > > >> + test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg- >> >pls_flags)&& >> > > >> + list_empty(&lseg->pls_lc_list)) >> > > >> list_add(&lseg->pls_lc_list, listp); >> > > >> } >> > > >> } >> > > > >> > > > If the lseg is already part of one layoutcommit, but we're >> sending a >> > > > second one for the same range (presumably because we wrote more >> data in >> > > > the same region), then the above causes the lseg to be excluded. >> > > >> > > >> > > Yes, lseg is excluded, This patch does exactly only exclusion of >> lseg. >> > > lseg is used here only to get/put reference on this lseg, so >> skipping is >> > > correct. >> > >> > Are you sure? As far as I can see, pnfs_list_write_lseg() actually >> > assigns the lseg to a particular layoutcommit call. >> > >> > My questions are: if I write to an area described by that lseg >> _after_ >> > it has been assigned to that layoutcommit RPC call (and the latter >> has >> > already been dispatched to the metadata server), then >> > A. shouldn't it be assigned to a second layoutcommit call too? >> > B. If not, what are we doing to ensure mutual exclusion between >> > layoutcommit RPC calls and pNFS writes to the data server? >> I think it depends on the purpose of layoutcommit. >> 1. for updating atime/mtime. In this case, we don't need mutual >> exclusion > > Agreed. > >> between layoutcommit RPC and pNFS writes to data server. >> 2. for updating LD specific state. In this case, LD should decide what >> to commit >> (and actually ignores the range of lseg). Take block layout for >> example. It maintains >> blocksized extent state inside LD and determines what to encode inside >> layoutcommit >> RPC whenever there is a generic layer layoutcommit call. So when an >> extent is being >> layoutcommitted, client can still write to the same range, as long as >> the lseg is held by client. > > Yes, but as far as I can see, even in the blocks case there can be multiple extents per layout segment. What if I write to one uninitialised extent, layoutcommit, then write to another uninitialized extent in the same layout segment and layoutcommit? In my reading of the code, there is a chance that the second layoutcommit will fail to pick up the layout segment, and so will fail to notify the MDS that the second extent now contains data. blocklayout does not decide what to layoutcommit according to the lseg list. Instead, it keeps track of each extent's state at the granularity of blocksize, and encode whatever needs layoutcommitted in the layoutcommit call. So in your above case, as long as the second layoutcommit is issued, blocklayout will encode the newly written extent in the second layoutcommit call, even if the lseg is not attached to the second layoutcommit. But that leads to another two question: 1. How do we protect against layoutrecall if lseg is not linked to layoutcommit? For this one, can we just reject layoutrecall if there is inflight layoutcommit? It will be less parallel but can guarantee current implementation correctness. 2. blocklayout ONLY: bl_committing may be overloaded by several layoutcommit calls and we don't have information in cleanup_layoutcommit() on how many entry should be removed from bl_committing. Maybe we can add a (void*) to struct nfs4_layoutcommit_data, so that LD can pass some private information between encode_layoutcommit() and cleanup_layoutcommit()? Thanks, Tao -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index e550e88..1465f44 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -1376,7 +1376,8 @@ static void pnfs_list_write_lseg(struct inode *inode, struct list_head *listp) list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) { if (lseg->pls_range.iomode == IOMODE_RW && - test_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags)) + test_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags) && + list_empty(&lseg->pls_lc_list)) list_add(&lseg->pls_lc_list, listp); } }
pnfs_layout_segment can be already under handling LAYOUTCOMMIT, so adding list twice causes hang: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:4] Call Trace: nfs4_layoutcommit_release+0x5a/0x9c [nfs] rpc_release_calldata+0x17/0x19 [sunrpc] rpc_free_task+0x5e/0x66 [sunrpc] __rpc_execute+0x29e/0x2ad [sunrpc] rpc_async_schedule+0x15/0x17 [sunrpc] process_one_work+0x213/0x3ba process_one_work+0x142/0x3ba __rpc_execute+0x2ad/0x2ad [sunrpc] worker_thread+0xfd/0x181 Signed-off-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com> --- fs/nfs/pnfs.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-)