diff mbox

nfs: fix inifinite loop at nfs4_layoutcommit_release

Message ID 1314512558-16912-1-git-send-email-gusev.vitaliy@nexenta.com (mailing list archive)
State New, archived
Headers show

Commit Message

Vitaliy Gusev Aug. 28, 2011, 6:22 a.m. UTC
pnfs_layout_segment can be already under handling LAYOUTCOMMIT,
so adding list twice causes hang:

   BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:4]
   Call Trace:

     nfs4_layoutcommit_release+0x5a/0x9c [nfs]
     rpc_release_calldata+0x17/0x19 [sunrpc]
     rpc_free_task+0x5e/0x66 [sunrpc]
     __rpc_execute+0x29e/0x2ad [sunrpc]
     rpc_async_schedule+0x15/0x17 [sunrpc]
     process_one_work+0x213/0x3ba
     process_one_work+0x142/0x3ba
     __rpc_execute+0x2ad/0x2ad [sunrpc]
     worker_thread+0xfd/0x181

Signed-off-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
---
 fs/nfs/pnfs.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

Comments

Trond Myklebust Sept. 6, 2011, 7:29 p.m. UTC | #1
On Sun, 2011-08-28 at 10:22 +0400, Vitaliy Gusev wrote: 
> pnfs_layout_segment can be already under handling LAYOUTCOMMIT,
> so adding list twice causes hang:
> 
>    BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:4]
>    Call Trace:
> 
>      nfs4_layoutcommit_release+0x5a/0x9c [nfs]
>      rpc_release_calldata+0x17/0x19 [sunrpc]
>      rpc_free_task+0x5e/0x66 [sunrpc]
>      __rpc_execute+0x29e/0x2ad [sunrpc]
>      rpc_async_schedule+0x15/0x17 [sunrpc]
>      process_one_work+0x213/0x3ba
>      process_one_work+0x142/0x3ba
>      __rpc_execute+0x2ad/0x2ad [sunrpc]
>      worker_thread+0xfd/0x181
> 
> Signed-off-by: Vitaliy Gusev <gusev.vitaliy@nexenta.com>
> ---
>  fs/nfs/pnfs.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index e550e88..1465f44 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -1376,7 +1376,8 @@ static void pnfs_list_write_lseg(struct inode *inode, struct list_head *listp)
>  
>  	list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
>  		if (lseg->pls_range.iomode == IOMODE_RW &&
> -		    test_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
> +		    test_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags) &&
> +		    list_empty(&lseg->pls_lc_list))
>  			list_add(&lseg->pls_lc_list, listp);
>  	}
>  }

If the lseg is already part of one layoutcommit, but we're sending a
second one for the same range (presumably because we wrote more data in
the same region), then the above causes the lseg to be excluded.

I agree that the current code causes list corruption, but before
applying something like the above patch, I'd like to understand why it
is correct.

Trond
Vitaliy Gusev Sept. 6, 2011, 10:13 p.m. UTC | #2
>> @@ -1376,7 +1376,8 @@ static void pnfs_list_write_lseg(struct inode *inode, struct list_head *listp)
>>
>>   	list_for_each_entry(lseg,&NFS_I(inode)->layout->plh_segs, pls_list) {
>>   		if (lseg->pls_range.iomode == IOMODE_RW&&
>> -		    test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg->pls_flags))
>> +		    test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg->pls_flags)&&
>> +		    list_empty(&lseg->pls_lc_list))
>>   			list_add(&lseg->pls_lc_list, listp);
>>   	}
>>   }
>
> If the lseg is already part of one layoutcommit, but we're sending a
> second one for the same range (presumably because we wrote more data in
> the same region), then the above causes the lseg to be excluded.


Yes, lseg is excluded, This patch does exactly only exclusion of lseg. 
lseg is used here only to get/put reference on this lseg, so skipping is 
correct.


However, checking on list_empty can occur (on other CPU) in the middle:

	list_del_init(&lseg->pls_lc_list);
Here >>>>>>
	if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT,
			       &lseg->pls_flags))
		put_lseg(lseg);


So list_del_init must be executed under the same lock as 
pnfs_list_write_lseg, i.e. inode->i_lock.


>
> I agree that the current code causes list corruption, but before
> applying something like the above patch, I'd like to understand why it
> is correct.
>
> Trond
>

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Sept. 6, 2011, 10:32 p.m. UTC | #3
On Wed, 2011-09-07 at 02:13 +0400, Vitaliy Gusev wrote: 
> >> @@ -1376,7 +1376,8 @@ static void pnfs_list_write_lseg(struct inode *inode, struct list_head *listp)
> >>
> >>   	list_for_each_entry(lseg,&NFS_I(inode)->layout->plh_segs, pls_list) {
> >>   		if (lseg->pls_range.iomode == IOMODE_RW&&
> >> -		    test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg->pls_flags))
> >> +		    test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg->pls_flags)&&
> >> +		    list_empty(&lseg->pls_lc_list))
> >>   			list_add(&lseg->pls_lc_list, listp);
> >>   	}
> >>   }
> >
> > If the lseg is already part of one layoutcommit, but we're sending a
> > second one for the same range (presumably because we wrote more data in
> > the same region), then the above causes the lseg to be excluded.
> 
> 
> Yes, lseg is excluded, This patch does exactly only exclusion of lseg. 
> lseg is used here only to get/put reference on this lseg, so skipping is 
> correct.

Are you sure? As far as I can see, pnfs_list_write_lseg() actually
assigns the lseg to a particular layoutcommit call.

My questions are: if I write to an area described by that lseg _after_
it has been assigned to that layoutcommit RPC call (and the latter has
already been dispatched to the metadata server), then 
     A. shouldn't it be assigned to a second layoutcommit call too? 
     B. If not, what are we doing to ensure mutual exclusion between
        layoutcommit RPC calls and pNFS writes to the data server?

> However, checking on list_empty can occur (on other CPU) in the middle:
> 
> 	list_del_init(&lseg->pls_lc_list);
> Here >>>>>>
> 	if (test_and_clear_bit(NFS_LSEG_LAYOUTCOMMIT,
> 			       &lseg->pls_flags))
> 		put_lseg(lseg);
> 
> 
> So list_del_init must be executed under the same lock as 
> pnfs_list_write_lseg, i.e. inode->i_lock.

Agreed if my questions above make no sense.

Cheers
  Trond
Peng Tao Sept. 8, 2011, 3 p.m. UTC | #4
On Thu, Sep 8, 2011 at 8:01 PM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
>> -----Original Message-----
>> From: tao.peng@emc.com [mailto:tao.peng@emc.com]
>> Sent: Thursday, September 08, 2011 6:21 AM
>> To: Myklebust, Trond; gusev.vitaliy@nexenta.com
>> Cc: gusev.vitaliy@gmail.com; linux-nfs@vger.kernel.org
>> Subject: RE: [PATCH] nfs: fix inifinite loop at
>> nfs4_layoutcommit_release
>>
>> Hi, Trond,
>>
>> Sorry for the late response.
>>
>> > -----Original Message-----
>> > From: Trond Myklebust [mailto:Trond.Myklebust@netapp.com]
>> > Sent: Wednesday, September 07, 2011 6:33 AM
>> > To: Vitaliy Gusev
>> > Cc: Vitaliy Gusev; Peng, Tao; linux-nfs@vger.kernel.org
>> > Subject: Re: [PATCH] nfs: fix inifinite loop at
>> nfs4_layoutcommit_release
>> >
>> > On Wed, 2011-09-07 at 02:13 +0400, Vitaliy Gusev wrote:
>> > > >> @@ -1376,7 +1376,8 @@ static void pnfs_list_write_lseg(struct
>> inode *inode,
>> > struct list_head *listp)
>> > > >>
>> > > >>        list_for_each_entry(lseg,&NFS_I(inode)->layout->plh_segs,
>> pls_list) {
>> > > >>                if (lseg->pls_range.iomode == IOMODE_RW&&
>> > > >> -                  test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg->pls_flags))
>> > > >> +                  test_bit(NFS_LSEG_LAYOUTCOMMIT,&lseg-
>> >pls_flags)&&
>> > > >> +                  list_empty(&lseg->pls_lc_list))
>> > > >>                        list_add(&lseg->pls_lc_list, listp);
>> > > >>        }
>> > > >>   }
>> > > >
>> > > > If the lseg is already part of one layoutcommit, but we're
>> sending a
>> > > > second one for the same range (presumably because we wrote more
>> data in
>> > > > the same region), then the above causes the lseg to be excluded.
>> > >
>> > >
>> > > Yes, lseg is excluded, This patch does exactly only exclusion of
>> lseg.
>> > > lseg is used here only to get/put reference on this lseg, so
>> skipping is
>> > > correct.
>> >
>> > Are you sure? As far as I can see, pnfs_list_write_lseg() actually
>> > assigns the lseg to a particular layoutcommit call.
>> >
>> > My questions are: if I write to an area described by that lseg
>> _after_
>> > it has been assigned to that layoutcommit RPC call (and the latter
>> has
>> > already been dispatched to the metadata server), then
>> >      A. shouldn't it be assigned to a second layoutcommit call too?
>> >      B. If not, what are we doing to ensure mutual exclusion between
>> >         layoutcommit RPC calls and pNFS writes to the data server?
>> I think it depends on the purpose of layoutcommit.
>> 1. for updating atime/mtime. In this case, we don't need mutual
>> exclusion
>
> Agreed.
>
>> between layoutcommit RPC and pNFS writes to data server.
>> 2. for updating LD specific state. In this case, LD should decide what
>> to commit
>> (and actually ignores the range of lseg). Take block layout for
>> example. It maintains
>> blocksized extent state inside LD and determines what to encode inside
>> layoutcommit
>> RPC whenever there is a generic layer layoutcommit call. So when an
>> extent is being
>> layoutcommitted, client can still write to the same range, as long as
>> the lseg is held by client.
>
> Yes, but as far as I can see, even in the blocks case there can be multiple extents per layout segment. What if I write to one uninitialised extent, layoutcommit, then write to another uninitialized extent in the same layout segment and layoutcommit? In my reading of the code, there is a chance that the second layoutcommit will fail to pick up the layout segment, and so will fail to notify the MDS that the second extent now contains data.

blocklayout does not decide what to layoutcommit according to the lseg
list. Instead, it keeps track of each extent's state at the
granularity of blocksize, and encode whatever needs layoutcommitted in
the layoutcommit call. So in your above case, as long as the second
layoutcommit is issued, blocklayout will encode the newly written
extent in the second layoutcommit call, even if the lseg is not
attached to the second layoutcommit.

But that leads to another two question:
1. How do we protect against layoutrecall if lseg is not linked to
layoutcommit? For this one, can we just reject layoutrecall if there
is inflight layoutcommit? It will be less parallel but can guarantee
current implementation correctness.
2. blocklayout ONLY: bl_committing may be overloaded by several
layoutcommit calls and we don't have information in
cleanup_layoutcommit() on how many entry should be removed from
bl_committing. Maybe we can add a (void*) to struct
nfs4_layoutcommit_data, so that LD can pass some private information
between encode_layoutcommit() and cleanup_layoutcommit()?

Thanks,
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index e550e88..1465f44 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -1376,7 +1376,8 @@  static void pnfs_list_write_lseg(struct inode *inode, struct list_head *listp)
 
 	list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list) {
 		if (lseg->pls_range.iomode == IOMODE_RW &&
-		    test_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags))
+		    test_bit(NFS_LSEG_LAYOUTCOMMIT, &lseg->pls_flags) &&
+		    list_empty(&lseg->pls_lc_list))
 			list_add(&lseg->pls_lc_list, listp);
 	}
 }