4.0 NFS client in infinite loop in state recovery after getting BAD_STATEID

Message ID CAN-5tyG8ukoGJATK1RA85xv9BDikfC1CPP0nc=-80h=BSGV6=w@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Olga Kornievskaia May 7, 2015, 5:04 p.m. UTC
Hi folks,

Problem:
The upstream nfs4.0 client has a problem where it will go into an
infinite loop of re-sending an OPEN when it's trying to recover from
receiving a BAD_STATEID error on an IO operation such as READ or WRITE.

How to easily reproduce (by using fault injection):
1. Do an nfs4.0 mount to a server.
2. Open a file such that the server gives you a write delegation.
3. Do a write. Have the server return a BAD_STATEID. One way to do so is
by using a Python proxy, nfs4proxy, to inject a BAD_STATEID error on
WRITE.
4. And off it goes with the loop.

Here’s why:

An IO op like WRITE receives a BAD_STATEID.
1. For this error, the async error handler calls
nfs4_schedule_stateid_recovery().
2. That in turn calls nfs4_state_mark_reclaim_nograce(), which sets
RECLAIM_NOGRACE in the state flags.
3. The state manager thread runs and calls nfs4_do_reclaim() to recover.
4. That calls nfs4_reclaim_open_state().

In that function:

restart:
for each open state:
    test if RECLAIM_NOGRACE is set in the state flags; if so, clear it
    (it’s set, so we clear it)
    check the open stateid (checks that RECOVERY_FAILED is not set; it’s not)
    check that we have state
    call ops->recover_open()

For nfs4.0, that calls nfs40_open_expired(),
which calls nfs40_clear_delegation_stateid(),
which calls nfs_finish_clear_delegation_stateid(),
which calls nfs_remove_bad_delegation(),
which calls nfs_inode_find_state_and_recover(),
which calls nfs4_state_mark_reclaim_nograce() **** this sets
RECLAIM_NOGRACE in the state flags again

We return from recover_open() with status 0,
call nfs4_reclaim_locks(), which returns 0, and then
goto restart; ************** since the flag was set again in the state
flags, the whole loop starts over.
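
The sequence above can be modelled in a few lines of plain C. This is only a
toy sketch, not kernel code: toy_state, toy_recover_open(),
toy_reclaim_open_state(), the remark_on_recover switch and the max_passes cap
are all made up for illustration. The cap exists only so that the model
terminates; the sequence described above has no such limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the RECLAIM_NOGRACE bit in the state flags. */
#define RECLAIM_NOGRACE 0x1u

struct toy_state {
    unsigned int flags;
    int opens_sent;     /* OPENs the toy client would have sent */
};

/*
 * Models ops->recover_open(). When remark_on_recover is true, it mimics
 * the call chain that ends by setting RECLAIM_NOGRACE again. When false,
 * it mimics that call being removed.
 */
static int toy_recover_open(struct toy_state *st, bool remark_on_recover)
{
    st->opens_sent++;
    if (remark_on_recover)
        st->flags |= RECLAIM_NOGRACE;
    return 0;
}

/*
 * Models the restart loop. Returns the number of passes needed, or -1 if
 * the flag is still set after max_passes passes (assumed cap, model only).
 */
static int toy_reclaim_open_state(struct toy_state *st,
                                  bool remark_on_recover, int max_passes)
{
    int passes = 0;

    assert(st != NULL);
restart:
    if (!(st->flags & RECLAIM_NOGRACE))
        return passes;              /* nothing left to recover */
    if (passes >= max_passes)
        return -1;                  /* would keep looping */

    st->flags &= ~RECLAIM_NOGRACE;  /* test and clear the flag */
    passes++;
    if (toy_recover_open(st, remark_on_recover) == 0)
        goto restart;
    return passes;
}
```

With remark_on_recover set to true and a cap of 5, the model gives up after
five passes and five OPENs and returns -1. With it set to false, it returns 1
after a single OPEN.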

Solution:
nfs_remove_bad_delegation() is only called from
nfs_finish_clear_delegation_stateid(), which is called from either the 4.0
or the 4.1 recover open functions in the nograce case. In both cases, the
state manager is already doing recovery based on the RECLAIM_NOGRACE flag
being set, and it is going through the opens that need to be recovered.

I propose to correct the loop by removing the call to
nfs_inode_find_state_and_recover() from nfs_remove_bad_delegation(); see
the Patch section at the end of this page.

Comments

Mkrtchyan, Tigran May 8, 2015, 8:48 a.m. UTC | #1
Hi Olga, 

I believe we see the same infinite loop with RHEL6/7 kernels, but without
delegation being involved. Currently, on the server side, if client looping
is detected, we return RESOURCE. This breaks the loop and the application
gets an IO error.
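
The message does not say how the server detects a looping client. One
hypothetical way to do it, sketched in C under that assumption, is to count
OPENs that arrive without any successful I/O in between and to return
NFS4ERR_RESOURCE once a threshold is passed. toy_open_tracker,
toy_handle_open(), toy_note_io_ok() and the limit of 16 are invented for
illustration and are not taken from any real server.

```c
#include <assert.h>
#include <stddef.h>

#define NFS4_OK             0
#define NFS4ERR_RESOURCE    10018   /* numeric value from the NFSv4 protocol */

#define TOY_OPEN_LOOP_LIMIT 16      /* hypothetical threshold */

/* Per-client, per-file bookkeeping (assumed granularity). */
struct toy_open_tracker {
    int consecutive_opens;  /* OPENs seen since the last successful I/O */
};

/* Called when a READ or WRITE succeeds: the client is making progress. */
static void toy_note_io_ok(struct toy_open_tracker *t)
{
    assert(t != NULL);
    t->consecutive_opens = 0;
}

/* Called for each OPEN; returns the status the toy server would send. */
static int toy_handle_open(struct toy_open_tracker *t)
{
    assert(t != NULL);
    if (++t->consecutive_opens > TOY_OPEN_LOOP_LIMIT)
        return NFS4ERR_RESOURCE;    /* break the suspected loop */
    return NFS4_OK;
}
```

In this sketch the first 16 OPENs in a row succeed, the 17th gets
NFS4ERR_RESOURCE, and any successful I/O resets the counter.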

Your fix only covers the delegation case, doesn't it?

Tigran.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Coddington May 8, 2015, 12:25 p.m. UTC | #2
On Fri, 8 May 2015, Mkrtchyan, Tigran wrote:

> Hi Olga,
>
> I believe we see the same infinite loop with RHEL6/7 kernels, but without
> delegation being involved. Currently, on the server side, if client looping
> is detected, we return RESOURCE. This breaks the loop and the application
> gets an IO error.
>
> Your fix only covers the delegation case, doesn't it?
>
> Tigran.

Tigran, do you have a BZ or can you tell me how to reproduce this?

Ben
Mkrtchyan, Tigran May 8, 2015, 1 p.m. UTC | #3
Hi Ben,

Probably you can simulate it, as Olga has suggested, with fault injection.
I can prepare a standalone version of our server, which returns
BAD_STATEID on any IO request.

Tigran.

Benjamin Coddington May 8, 2015, 1:08 p.m. UTC | #4
Yes, I've done that, and the client's behavior correctly cycles through
OPEN, then READ with the new stateid.  Are you able to create what Olga's
talking about -- which is (I believe) a loop of just OPENs?

Ben

Mkrtchyan, Tigran May 8, 2015, 1:13 p.m. UTC | #5
No, Olga's case includes delegation, which our server does not support.

Tigran.

Olga Kornievskaia May 8, 2015, 3:18 p.m. UTC | #6
Hi Tigran,

I think you are hitting something else, as mine, as you mentioned, has
to do with delegation. Also, the RHEL6.6 code is very different from
RHEL7.1 (or upstream). I haven't run my test case on RHEL6.6 but,
just looking at it, I don't think it has the same OPEN loop problem.

Ben,

I didn't include this in my original message, but we do have a BZ open
for the problem by Jorge Mora,
Bug 1219184:
Infinite OPEN loop on NFSv4.0 when I/O receives NFS4ERR_BAD_STATEID

https://bugzilla.redhat.com/show_bug.cgi?id=1219184

Benjamin Coddington May 8, 2015, 3:29 p.m. UTC | #7
On Fri, 8 May 2015, Olga Kornievskaia wrote:

> Hi Tigran,
>
> I think you are hitting something else, as mine, as you mentioned, has
> to do with delegation. Also, the RHEL6.6 code is very different from
> RHEL7.1 (or upstream). I haven't run my test case on RHEL6.6 but,
> just looking at it, I don't think it has the same OPEN loop problem.
>
> Ben,
>
> I didn't include this in my original message, but we do have a BZ open
> for the problem by Jorge Mora,
> Bug 1219184:
> Infinite OPEN loop on NFSv4.0 when I/O receives NFS4ERR_BAD_STATEID
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1219184

Hi Olga,

Yep, I'm working that one.  My ONTAP sims got purged, so I've been setting
them back up this morning.

Ben


Benjamin Coddington May 29, 2015, 1:44 p.m. UTC | #8
On Thu, 7 May 2015, Olga Kornievskaia wrote:

> Hi folks,
>
> Problem:
> The upstream nfs4.0 client has a problem where it will go into an
> infinite loop of re-sending an OPEN when it's trying to recover from
> receiving a BAD_STATEID error on an IO operation such as READ or WRITE.
>
> How to easily reproduce (by using fault injection):
> 1. Do an nfs4.0 mount to a server.
> 2. Open a file such that the server gives you a write delegation.
> 3. Do a write. Have the server return a BAD_STATEID. One way to do so is
> by using a Python proxy, nfs4proxy, to inject a BAD_STATEID error on
> WRITE.
> 4. And off it goes with the loop.

Hi Olga,

I've been trying to reproduce it, and I'm frustratingly unable to.  It sounds
fairly easy to reproduce.  What version of the client produces this?

Ben
Olga Kornievskaia May 29, 2015, 4:51 p.m. UTC | #9
On Fri, May 29, 2015 at 9:44 AM, Benjamin Coddington
<bcodding@redhat.com> wrote:
> Hi Olga,
>
> I've been trying to reproduce it, and I'm frustratingly unable to.  It sounds
> fairly easy to reproduce.  What version of the client produces this?
>

Hi Ben,

The problem exists in the upstream kernels as well, but we noticed it
on the RHEL7.1 distro (Red Hat's 2.6.32-229.el7 kernel, I think).

Olga Kornievskaia May 29, 2015, 5:21 p.m. UTC | #10
I meant to say 3.10.0-229. Mixing my RHEL6 and RHEL7 prefixes...

Benjamin Coddington June 3, 2015, 3:51 p.m. UTC | #11
On Fri, 29 May 2015, Olga Kornievskaia wrote:

> I meant to say 3.10.0-229. Mixing my RHEL6 and RHEL7 prefixes...

I've now been able to reproduce this upstream and on 7.1, and just today on a
6.7 client.  The 6.7 client seems to self-limit the OPEN storm to around 7k
OPENs.  That's interesting enough to look into further.

Taking this issue to the BZs for now, just wanted to let the list know that
we see this now, too.

Ben
Patch

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index 4711d04..b322823 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -632,10 +632,8 @@  void nfs_remove_bad_delegation(struct inode *inode)

        nfs_revoke_delegation(inode);
        delegation = nfs_inode_detach_delegation(inode);
-       if (delegation) {
-               nfs_inode_find_state_and_recover(inode, &delegation->stateid);
+       if (delegation)
                nfs_free_delegation(delegation);
-       }
 }
 EXPORT_SYMBOL_GPL(nfs_remove_bad_delegation);