
[v2,0/9] Various NFSv4 state error handling fixes

Message ID 20190916204419.21717-1-trond.myklebust@hammerspace.com

Message

Trond Myklebust Sept. 16, 2019, 8:44 p.m. UTC
Various NFSv4 fixes to ensure we handle state errors correctly. In
particular, we need to ensure that for COMPOUNDs like CLOSE and
DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle the
layout state errors so that a retry of either the LAYOUTRETURN, or
the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
reply.

Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do our
best to still try to destroy the state on the server, in order to
avoid causing state leakage.
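
As a rough illustration of the seqid-bumping idea behind that
NFS4ERR_OLD_STATEID handling, here is a standalone sketch; the type and
function names below are made up for the example and are not the kernel's:

#include <stdint.h>
#include <arpa/inet.h>	/* ntohl()/htonl(): the seqid is big-endian on the wire */

/* Illustrative layout only: an NFSv4 stateid is a 4-byte seqid plus
 * 12 bytes of opaque "other" data. */
struct example_stateid {
	uint32_t seqid;			/* network byte order */
	unsigned char other[12];
};

/* Bump the seqid, wrapping 0xffffffff to 1 rather than 0, since a
 * seqid of zero has a special meaning in NFSv4.1 request arguments. */
static void example_stateid_seqid_inc(struct example_stateid *s)
{
	uint32_t seqid = ntohl(s->seqid);

	if (++seqid == 0)
		++seqid;
	s->seqid = htonl(seqid);
}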

v2: Fix bug reports from Olga
 - Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE when
   doing fully serialised NFSv4.0.
 - Ensure LOCKU initialises the stateid correctly.

Trond Myklebust (9):
  pNFS: Ensure we do clear the return-on-close layout stateid on fatal
    errors
  NFSv4: Clean up pNFS return-on-close error handling
  NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
  NFSv4: Handle RPC level errors in LAYOUTRETURN
  NFSv4: Add a helper to increment stateid seqids
  pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state
    seqid
  NFSv4: Fix OPEN_DOWNGRADE error handling
  NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
  NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU

 fs/nfs/nfs4_fs.h   |  11 ++-
 fs/nfs/nfs4proc.c  | 204 ++++++++++++++++++++++++++++++---------------
 fs/nfs/nfs4state.c |  16 ----
 fs/nfs/pnfs.c      |  71 ++++++++++++++--
 fs/nfs/pnfs.h      |  17 +++-
 5 files changed, 229 insertions(+), 90 deletions(-)

Comments

Olga Kornievskaia Sept. 18, 2019, 7:38 p.m. UTC | #1
Hi Trond,

This set of patches does not address the locking problem. It's actually
not the locking patch that is at fault (which I had thought it was, since
I reverted it and still had the issue). Without the whole patch series the
unlock works fine, so something in these new patches is the cause.
Something is up with these two patches:
NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU

If I remove either one separately, unlock fails, but if I remove both,
unlock works.

On Mon, Sep 16, 2019 at 4:46 PM Trond Myklebust <trondmy@gmail.com> wrote:
>
> Various NFSv4 fixes to ensure we handle state errors correctly. In
> particular, we need to ensure that for COMPOUNDs like CLOSE and
> DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle the
> layout state errors so that a retry of either the LAYOUTRETURN, or
> the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
> reply.
>
> Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do our
> best to still try to destroy the state on the server, in order to
> avoid causing state leakage.
>
> v2: Fix bug reports from Olga
>  - Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE when
>    doing fully serialised NFSv4.0.
>  - Ensure LOCKU initialises the stateid correctly.
>
> Trond Myklebust (9):
>   pNFS: Ensure we do clear the return-on-close layout stateid on fatal
>     errors
>   NFSv4: Clean up pNFS return-on-close error handling
>   NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
>   NFSv4: Handle RPC level errors in LAYOUTRETURN
>   NFSv4: Add a helper to increment stateid seqids
>   pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state
>     seqid
>   NFSv4: Fix OPEN_DOWNGRADE error handling
>   NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
>   NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
>
>  fs/nfs/nfs4_fs.h   |  11 ++-
>  fs/nfs/nfs4proc.c  | 204 ++++++++++++++++++++++++++++++---------------
>  fs/nfs/nfs4state.c |  16 ----
>  fs/nfs/pnfs.c      |  71 ++++++++++++++--
>  fs/nfs/pnfs.h      |  17 +++-
>  5 files changed, 229 insertions(+), 90 deletions(-)
>
> --
> 2.21.0
>
Trond Myklebust Sept. 19, 2019, 1:49 a.m. UTC | #2
Hi Olga

On Wed, 2019-09-18 at 15:38 -0400, Olga Kornievskaia wrote:
> Hi Trond,
> 
> These set of patches do not address the locking problem. It's
> actually
> not the locking patch (which I thought it was as I reverted it and
> still had the issue). Without the whole patch series the unlock works
> fine so something in these new patches. Something is up with the 2
> patches:
> NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> 
> If I remove either one separately, unlock fails but if I remove both
> unlock works.


Can you describe how you are testing this, and perhaps provide
wireshark traces that show how we're triggering these problems?

Thanks!
  Trond

> On Mon, Sep 16, 2019 at 4:46 PM Trond Myklebust <trondmy@gmail.com>
> wrote:
> > Various NFSv4 fixes to ensure we handle state errors correctly. In
> > particular, we need to ensure that for COMPOUNDs like CLOSE and
> > DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle the
> > layout state errors so that a retry of either the LAYOUTRETURN, or
> > the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
> > reply.
> > 
> > Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do our
> > best to still try to destroy the state on the server, in order to
> > avoid causing state leakage.
> > 
> > v2: Fix bug reports from Olga
> >  - Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE when
> >    doing fully serialised NFSv4.0.
> >  - Ensure LOCKU initialises the stateid correctly.
> > 
> > Trond Myklebust (9):
> >   pNFS: Ensure we do clear the return-on-close layout stateid on
> > fatal
> >     errors
> >   NFSv4: Clean up pNFS return-on-close error handling
> >   NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
> >   NFSv4: Handle RPC level errors in LAYOUTRETURN
> >   NFSv4: Add a helper to increment stateid seqids
> >   pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the
> > state
> >     seqid
> >   NFSv4: Fix OPEN_DOWNGRADE error handling
> >   NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> >   NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > 
> >  fs/nfs/nfs4_fs.h   |  11 ++-
> >  fs/nfs/nfs4proc.c  | 204 ++++++++++++++++++++++++++++++-----------
> > ----
> >  fs/nfs/nfs4state.c |  16 ----
> >  fs/nfs/pnfs.c      |  71 ++++++++++++++--
> >  fs/nfs/pnfs.h      |  17 +++-
> >  5 files changed, 229 insertions(+), 90 deletions(-)
> > 
> > --
> > 2.21.0
> >
Trond Myklebust Sept. 19, 2019, 11:42 p.m. UTC | #3
Hi Olga

On Thu, 2019-09-19 at 09:14 -0400, Olga Kornievskaia wrote:
> Hi Trond,
> 
> On Wed, Sep 18, 2019 at 9:49 PM Trond Myklebust <
> trondmy@hammerspace.com> wrote:
> > Hi Olga
> > 
> > On Wed, 2019-09-18 at 15:38 -0400, Olga Kornievskaia wrote:
> > > Hi Trond,
> > > 
> > > These set of patches do not address the locking problem. It's
> > > actually
> > > not the locking patch (which I thought it was as I reverted it
> > > and
> > > still had the issue). Without the whole patch series the unlock
> > > works
> > > fine so something in these new patches. Something is up with the
> > > 2
> > > patches:
> > > NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > > NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > > 
> > > If I remove either one separately, unlock fails but if I remove
> > > both
> > > unlock works.
> > 
> > Can you describe how you are testing this, and perhaps provide
> > wireshark traces that show how we're triggering these problems?
> 
> I'm triggering by running "nfstest_lock --nfsversion 4.1 --runtest
> btest01" against either linux or ontap servers (while the test
> doesn't
> fail but on the network trace you can see unlock failing with
> bad_stateid). Network trace attached.
> 
> But actually a simple test open, lock, unlock does the trick (network
> trace attached).
> fd1 = open(RDWR)
> fctl(fd1) (lock /unlock)
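
For reference, that open/lock/unlock sequence in standalone form looks
roughly like the sketch below (the path is made up and this is only an
illustration of the pattern, not the actual test):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	struct flock fl = {
		.l_type   = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,		/* whole file */
	};
	/* fd1 = open(RDWR) */
	int fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT, 0644);

	if (fd < 0)
		return 1;
	fcntl(fd, F_SETLKW, &fl);	/* fcntl lock   -> NFSv4 LOCK  */
	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);	/* fcntl unlock -> NFSv4 LOCKU */
	close(fd);
	return 0;
}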


These traces really do not mesh with what I'm seeing using a simple
Connectathon lock test run. When I look at the wireshark output from
that, I see exactly two cases where the stateid argument is all zeros,
and those are both SETATTRs, which is expected.

All the LOCKU calls are showing up with non-zero stateids, so I'm seeing
no BAD_STATEID or OLD_STATEID errors at all.

Is there something special about how your test is running?

Cheers
  Trond

PS: I do think I need a v3 of the LOCKU patch in order to add a
spinlock around the new copy of the lock stateid in
nfs4_alloc_unlockdata(). However, I don't see how the missing spinlock
could cause you to consistently see an all-zero stateid argument.
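
The pattern being referred to there, sketched in plain userspace terms
with made-up names rather than the actual kernel structures and locks:

#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins for the kernel structures: a 16-byte stateid
 * (4-byte seqid + 12-byte "other") protected by a lock. */
struct example_stateid {
	uint32_t seqid;
	unsigned char other[12];
};

struct example_lock_state {
	pthread_mutex_t lock;		/* stands in for the kernel spinlock */
	struct example_stateid stateid;
};

/* Take the lock while copying the current lock stateid into the request
 * arguments, so a concurrent seqid bump can't leave the copy half old,
 * half new. */
static void copy_stateid_for_unlock(struct example_lock_state *lsp,
				    struct example_stateid *arg)
{
	pthread_mutex_lock(&lsp->lock);
	memcpy(arg, &lsp->stateid, sizeof(*arg));
	pthread_mutex_unlock(&lsp->lock);
}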


> 
> > Thanks!
> >   Trond
> > 
> > > On Mon, Sep 16, 2019 at 4:46 PM Trond Myklebust <
> > > trondmy@gmail.com>
> > > wrote:
> > > > Various NFSv4 fixes to ensure we handle state errors correctly.
> > > > In
> > > > particular, we need to ensure that for COMPOUNDs like CLOSE and
> > > > DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle
> > > > the
> > > > layout state errors so that a retry of either the LAYOUTRETURN,
> > > > or
> > > > the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
> > > > reply.
> > > > 
> > > > Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do
> > > > our
> > > > best to still try to destroy the state on the server, in order
> > > > to
> > > > avoid causing state leakage.
> > > > 
> > > > v2: Fix bug reports from Olga
> > > >  - Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE
> > > > when
> > > >    doing fully serialised NFSv4.0.
> > > >  - Ensure LOCKU initialises the stateid correctly.
> > > > 
> > > > Trond Myklebust (9):
> > > >   pNFS: Ensure we do clear the return-on-close layout stateid
> > > > on
> > > > fatal
> > > >     errors
> > > >   NFSv4: Clean up pNFS return-on-close error handling
> > > >   NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
> > > >   NFSv4: Handle RPC level errors in LAYOUTRETURN
> > > >   NFSv4: Add a helper to increment stateid seqids
> > > >   pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping
> > > > the
> > > > state
> > > >     seqid
> > > >   NFSv4: Fix OPEN_DOWNGRADE error handling
> > > >   NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > > >   NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > > > 
> > > >  fs/nfs/nfs4_fs.h   |  11 ++-
> > > >  fs/nfs/nfs4proc.c  | 204 ++++++++++++++++++++++++++++++-------
> > > > ----
> > > > ----
> > > >  fs/nfs/nfs4state.c |  16 ----
> > > >  fs/nfs/pnfs.c      |  71 ++++++++++++++--
> > > >  fs/nfs/pnfs.h      |  17 +++-
> > > >  5 files changed, 229 insertions(+), 90 deletions(-)
> > > > 
> > > > --
> > > > 2.21.0
> > > > 
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@hammerspace.com
> > 
> > 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space
Olga Kornievskaia Sept. 20, 2019, 2:25 p.m. UTC | #4
Hi Trond,

On Thu, Sep 19, 2019 at 7:42 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> Hi Olga
>
> On Thu, 2019-09-19 at 09:14 -0400, Olga Kornievskaia wrote:
> > Hi Trond,
> >
> > On Wed, Sep 18, 2019 at 9:49 PM Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > > Hi Olga
> > >
> > > On Wed, 2019-09-18 at 15:38 -0400, Olga Kornievskaia wrote:
> > > > Hi Trond,
> > > >
> > > > These set of patches do not address the locking problem. It's
> > > > actually
> > > > not the locking patch (which I thought it was as I reverted it
> > > > and
> > > > still had the issue). Without the whole patch series the unlock
> > > > works
> > > > fine so something in these new patches. Something is up with the
> > > > 2
> > > > patches:
> > > > NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > > > NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > > >
> > > > If I remove either one separately, unlock fails but if I remove
> > > > both
> > > > unlock works.
> > >
> > > Can you describe how you are testing this, and perhaps provide
> > > wireshark traces that show how we're triggering these problems?
> >
> > I'm triggering by running "nfstest_lock --nfsversion 4.1 --runtest
> > btest01" against either linux or ontap servers (while the test
> > doesn't
> > fail but on the network trace you can see unlock failing with
> > bad_stateid). Network trace attached.
> >
> > But actually a simple test open, lock, unlock does the trick (network
> > trace attached).
> > fd1 = open(RDWR)
> > fctl(fd1) (lock /unlock)
>
>
> These traces really do not mesh with what I'm seeing using a simple
> Connectathon lock test run. When I look at the wireshark output from
> that, I see exadtly two cases where the stateid arguments are both
> zero, and those are both SETATTR, so expected.
>
> All the LOCKU are showing up as non-zero stateids, and so I'm seeing no
> BAD_STATEID or OLD_STATEID errors at all.
>
> Is there something special about how your test is running?

There is nothing special that I can think of about my setup or how the
tests run. I pull from your testing branch and build it (no extra
patches), then run the tests over 4.1 (default mount opts) against a
Linux server (typically running the same kernel).

Is this patch series somewhere in your git branches? I've been testing
your testing branch (I could see the v2 changes were in it). It's not
obvious to me what changed in v3, so I can't tell whether the testing
branch has the right code.

>
> Cheers
>   Trond
>
> PS: Note: I do think I need a v3 of the LOCKU patch in order to add a
> spinlock around the new copy of the lock stateid in
> nfs4_alloc_unlockdata(). However I don't see how the missing spinlocks
> could cause you to consistently be seeing an all-zero stateid argument.

I'm not testing with v3, so I don't think that's it.

>
>
> >
> > > Thanks!
> > >   Trond
> > >
> > > > On Mon, Sep 16, 2019 at 4:46 PM Trond Myklebust <
> > > > trondmy@gmail.com>
> > > > wrote:
> > > > > Various NFSv4 fixes to ensure we handle state errors correctly.
> > > > > In
> > > > > particular, we need to ensure that for COMPOUNDs like CLOSE and
> > > > > DELEGRETURN, that may have an embedded LAYOUTRETURN, we handle
> > > > > the
> > > > > layout state errors so that a retry of either the LAYOUTRETURN,
> > > > > or
> > > > > the later CLOSE/DELEGRETURN does not corrupt the LAYOUTRETURN
> > > > > reply.
> > > > >
> > > > > Also ensure that if we get a NFS4ERR_OLD_STATEID, then we do
> > > > > our
> > > > > best to still try to destroy the state on the server, in order
> > > > > to
> > > > > avoid causing state leakage.
> > > > >
> > > > > v2: Fix bug reports from Olga
> > > > >  - Try to avoid sending old stateids on CLOSE/OPEN_DOWNGRADE
> > > > > when
> > > > >    doing fully serialised NFSv4.0.
> > > > >  - Ensure LOCKU initialises the stateid correctly.
> > > > >
> > > > > Trond Myklebust (9):
> > > > >   pNFS: Ensure we do clear the return-on-close layout stateid
> > > > > on
> > > > > fatal
> > > > >     errors
> > > > >   NFSv4: Clean up pNFS return-on-close error handling
> > > > >   NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
> > > > >   NFSv4: Handle RPC level errors in LAYOUTRETURN
> > > > >   NFSv4: Add a helper to increment stateid seqids
> > > > >   pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping
> > > > > the
> > > > > state
> > > > >     seqid
> > > > >   NFSv4: Fix OPEN_DOWNGRADE error handling
> > > > >   NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > > > >   NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > > > >
> > > > >  fs/nfs/nfs4_fs.h   |  11 ++-
> > > > >  fs/nfs/nfs4proc.c  | 204 ++++++++++++++++++++++++++++++-------
> > > > > ----
> > > > > ----
> > > > >  fs/nfs/nfs4state.c |  16 ----
> > > > >  fs/nfs/pnfs.c      |  71 ++++++++++++++--
> > > > >  fs/nfs/pnfs.h      |  17 +++-
> > > > >  5 files changed, 229 insertions(+), 90 deletions(-)
> > > > >
> > > > > --
> > > > > 2.21.0
> > > > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > trond.myklebust@hammerspace.com
> > >
> > >
> Trond Myklebust
> CTO, Hammerspace Inc
> 4300 El Camino Real, Suite 105
> Los Altos, CA 94022
> www.hammer.space
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>
>
Trond Myklebust Sept. 20, 2019, 2:54 p.m. UTC | #5
On Fri, 2019-09-20 at 10:25 -0400, Olga Kornievskaia wrote:
> Hi Trond,
> 
> On Thu, Sep 19, 2019 at 7:42 PM Trond Myklebust <
> trondmy@hammerspace.com> wrote:
> > Hi Olga
> > 
> > On Thu, 2019-09-19 at 09:14 -0400, Olga Kornievskaia wrote:
> > > Hi Trond,
> > > 
> > > On Wed, Sep 18, 2019 at 9:49 PM Trond Myklebust <
> > > trondmy@hammerspace.com> wrote:
> > > > Hi Olga
> > > > 
> > > > On Wed, 2019-09-18 at 15:38 -0400, Olga Kornievskaia wrote:
> > > > > Hi Trond,
> > > > > 
> > > > > These set of patches do not address the locking problem. It's
> > > > > actually
> > > > > not the locking patch (which I thought it was as I reverted
> > > > > it
> > > > > and
> > > > > still had the issue). Without the whole patch series the
> > > > > unlock
> > > > > works
> > > > > fine so something in these new patches. Something is up with
> > > > > the
> > > > > 2
> > > > > patches:
> > > > > NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
> > > > > NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
> > > > > 
> > > > > If I remove either one separately, unlock fails but if I
> > > > > remove
> > > > > both
> > > > > unlock works.
> > > > 
> > > > Can you describe how you are testing this, and perhaps provide
> > > > wireshark traces that show how we're triggering these problems?
> > > 
> > > I'm triggering by running "nfstest_lock --nfsversion 4.1 --
> > > runtest
> > > btest01" against either linux or ontap servers (while the test
> > > doesn't
> > > fail but on the network trace you can see unlock failing with
> > > bad_stateid). Network trace attached.
> > > 
> > > But actually a simple test open, lock, unlock does the trick
> > > (network
> > > trace attached).
> > > fd1 = open(RDWR)
> > > fctl(fd1) (lock /unlock)
> > 
> > These traces really do not mesh with what I'm seeing using a simple
> > Connectathon lock test run. When I look at the wireshark output
> > from
> > that, I see exactly two cases where the stateid arguments are both
> > zero, and those are both SETATTR, so expected.
> > 
> > All the LOCKU are showing up as non-zero stateids, and so I'm
> > seeing no
> > BAD_STATEID or OLD_STATEID errors at all.
> > 
> > Is there something special about how your test is running?
> 
> There is nothing special that I can think of about my setup or how
> test run. I pull from your testing branch, build it (no extra
> patches). Run tests over 4.1 (default mount opts) against a linux
> server (typically same kernel).
> 
> Is this patch series somewhere in your git branches? I've been
> testing
> your testing branch (as I could see v2 changes were in the testing
> branch). It's not obvious to me what was changed in v3 to see if the
> testing branch has the right code.

I hadn't yet updated the testing branch with the v3 code. Pushed out
now as a forced-update.

Cheers
  Trond