[pynfs,1/4] nfs4.1: add some reboot tests

Message ID 20190314211210.7454-2-smayhew@redhat.com (mailing list archive)
State New, archived
Series nfs4.1: add a bunch of reboot tests

Commit Message

Scott Mayhew March 14, 2019, 9:12 p.m. UTC
REBT2a, REBT2b, and REBT2c test recovery with multiple clients following
a server reboot.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 nfs4.1/server41tests/st_reboot.py | 151 +++++++++++++++++++++++++++++-
 1 file changed, 147 insertions(+), 4 deletions(-)

Comments

J. Bruce Fields March 15, 2019, 7:43 p.m. UTC | #1
On Thu, Mar 14, 2019 at 05:12:07PM -0400, Scott Mayhew wrote:
> +def doTestRebootWithNClients(t, env, n=10):
> +    boot_time = int(time.time())
> +    lease_time = 90

Looks like these two variables aren't used till they're set again a
little further down, so I'll delete these two lines.

--b.

> +    states = []
> +    block = env.c1.new_client_session("%s_block" % env.testname(t))
> +    for i in range(n):
> +        name = "%s_client_%i" % (env.testname(t), i)
> +        owner = "owner_%s" % name
> +        c = env.c1.new_client(name)
> +        sess = c.create_session()
> +        reclaim_complete(sess)
> +        fh, stateid = create_confirm(sess, owner)
> +        states.append(State(name, owner, c, sess, fh))
> +    lease_time = _getleasetime(sess)
> +    boot_time = _waitForReboot(env)
Scott Mayhew March 15, 2019, 7:52 p.m. UTC | #2
On Fri, 15 Mar 2019, J. Bruce Fields wrote:

> On Thu, Mar 14, 2019 at 05:12:07PM -0400, Scott Mayhew wrote:
> > +def doTestRebootWithNClients(t, env, n=10):
> > +    boot_time = int(time.time())
> > +    lease_time = 90
> 
> Looks like these two variables aren't used till they're set again a
> little further down, so I'll delete these two lines.

The intention there was to have some default values for the exception
handler if the test were to barf right away for some reason.
> 
> --b.
> 
> > +    states = []
> > +    block = env.c1.new_client_session("%s_block" % env.testname(t))
> > +    for i in range(n):
> > +        name = "%s_client_%i" % (env.testname(t), i)
> > +        owner = "owner_%s" % name
> > +        c = env.c1.new_client(name)
> > +        sess = c.create_session()
> > +        reclaim_complete(sess)
> > +        fh, stateid = create_confirm(sess, owner)
> > +        states.append(State(name, owner, c, sess, fh))
> > +    lease_time = _getleasetime(sess)
> > +    boot_time = _waitForReboot(env)
J. Bruce Fields March 15, 2019, 8:48 p.m. UTC | #3
On Thu, Mar 14, 2019 at 05:12:07PM -0400, Scott Mayhew wrote:
> +def testRebootWithManyManyManyClients(t, env):
> +    """Reboot with many many many clients
> +
> +    FLAGS: reboot
> +    CODE: REBT2c
> +    """
> +    return doTestRebootWithNClients(t, env, 1000)

My test server uses a 15 second lease time, mainly just to speed up
tests.  That's not enough for pynfs to send out reclaims for 1000
clients.

So I'm wondering whether that's a reasonable test or not.

On the one hand, we should be able to handle 1000 clients, and a 15
second lease is probably unrealistically short.  And maybe we could
choose more patient behavior for the server (currently it will wait at
most 2 grace periods while reclaims continue to arrive).

On the other hand, real clients will send their reclaims simultaneously
rather than one at a time.  And from a trace it looks like most of the
time's spent waiting for pynfs to send the next request rather than
waiting for replies.  So this is a bit unusual.

I'm inclined to drop the "many many many clients" tests.  It's easy
enough for someone doing reboot testing to patch the tests if they need
to.

By the way, the longest round trip time I see is the RECLAIM_COMPLETE.
I assume that's doing a commit to disk.  It looks like there's nothing
on the server to prevent processing RECLAIM_COMPLETEs in parallel so as
long as that's true I suppose we're OK.

--b.
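
For rough scale, a back-of-envelope sketch of the timing concern above
(the ~4 round trips per client is an assumption based on the recovery
sequence in doTestOneClientGrace() in the patch below):

    # Serial reclaim budget under a short grace period.
    grace_seconds = 15   # lease time on Bruce's test server
    clients = 1000
    round_trips = 4      # EXCHANGE_ID, CREATE_SESSION,
                         # OPEN(CLAIM_PREVIOUS), RECLAIM_COMPLETE
    budget_ms = grace_seconds * 1000.0 / (clients * round_trips)
    print("%.2f ms per round trip" % budget_ms)   # ~3.75 ms

Even on a fast LAN, that leaves essentially no headroom when the
reclaims happen one client at a time.
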
J. Bruce Fields March 15, 2019, 8:50 p.m. UTC | #4
On Fri, Mar 15, 2019 at 03:52:33PM -0400, Scott Mayhew wrote:
> On Fri, 15 Mar 2019, J. Bruce Fields wrote:
> 
> > On Thu, Mar 14, 2019 at 05:12:07PM -0400, Scott Mayhew wrote:
> > > +def doTestRebootWithNClients(t, env, n=10):
> > > +    boot_time = int(time.time())
> > > +    lease_time = 90
> > 
> > Looks like these two variables aren't used till they're set again a
> > little further down, so I'll delete these two lines.
> 
> The intention there was to have some default values for the exception
> handler if the test were to barf right away for some reason.

Oh, I see, OK.

--b.

> > 
> > --b.
> > 
> > > +    states = []
> > > +    block = env.c1.new_client_session("%s_block" % env.testname(t))
> > > +    for i in range(n):
> > > +        name = "%s_client_%i" % (env.testname(t), i)
> > > +        owner = "owner_%s" % name
> > > +        c = env.c1.new_client(name)
> > > +        sess = c.create_session()
> > > +        reclaim_complete(sess)
> > > +        fh, stateid = create_confirm(sess, owner)
> > > +        states.append(State(name, owner, c, sess, fh))
> > > +    lease_time = _getleasetime(sess)
> > > +    boot_time = _waitForReboot(env)
Frank Filz March 18, 2019, 2:30 p.m. UTC | #5
> On Thu, Mar 14, 2019 at 05:12:07PM -0400, Scott Mayhew wrote:
> > +def testRebootWithManyManyManyClients(t, env):
> > +    """Reboot with many many many clients
> > +
> > +    FLAGS: reboot
> > +    CODE: REBT2c
> > +    """
> > +    return doTestRebootWithNClients(t, env, 1000)
> 
> My test server uses a 15 second lease time, mainly just to speed up
> tests.  That's not enough for pynfs to send out reclaims for 1000
> clients.
> 
> So I'm wondering whether that's a reasonable test or not.
> 
> On the one hand, we should be able to handle 1000 clients, and a 15
> second lease is probably unrealistically short.  And maybe we could
> choose more patient behavior for the server (currently it will wait at
> most 2 grace periods while reclaims continue to arrive).
> 
> On the other hand, real clients will send their reclaims simultaneously
> rather than one at a time.  And from a trace it looks like most of the
> time's spent waiting for pynfs to send the next request rather than
> waiting for replies.  So this is a bit unusual.
> 
> I'm inclined to drop the "many many many clients" tests.  It's easy
> enough for someone doing reboot testing to patch the tests if they need
> to.
> 
> By the way, the longest round trip time I see is the RECLAIM_COMPLETE.
> I assume that's doing a commit to disk.  It looks like there's nothing
> on the server to prevent processing RECLAIM_COMPLETEs in parallel so as
> long as that's true I suppose we're OK.

How about having the many many many clients tests under a different flag so
they are still available but easy to pick or not pick?

Considering that CID5, with the huge number of client-ids it creates but
doesn't clean up (so they all eventually expire), has caught bugs in Ganesha,
I like the idea of messy big tests being available for QE to run...

Frank
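
A sketch of what that could look like (the flag name "rebootmany" is
hypothetical; pynfs groups tests by the FLAGS keywords in each test's
docstring, so under a separate flag the big tests would no longer run
with the plain "reboot" flag but could still be requested by name):

    def testRebootWithManyManyManyClients(t, env):
        """Reboot with many many many clients

        FLAGS: rebootmany
        CODE: REBT2c
        """
        return doTestRebootWithNClients(t, env, 1000)
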
J. Bruce Fields March 18, 2019, 2:57 p.m. UTC | #6
On Mon, Mar 18, 2019 at 07:30:20AM -0700, Frank Filz wrote:
> > On Thu, Mar 14, 2019 at 05:12:07PM -0400, Scott Mayhew wrote:
> > > +def testRebootWithManyManyManyClients(t, env):
> > > +    """Reboot with many many many clients
> > > +
> > > +    FLAGS: reboot
> > > +    CODE: REBT2c
> > > +    """
> > > +    return doTestRebootWithNClients(t, env, 1000)
> > 
> > My test server uses a 15 second lease time, mainly just to speed up
> > tests.  That's not enough for pynfs to send out reclaims for 1000
> > clients.
> > 
> > So I'm wondering whether that's a reasonable test or not.
> > 
> > On the one hand, we should be able to handle 1000 clients, and a 15
> > second lease is probably unrealistically short.  And maybe we could
> > choose more patient behavior for the server (currently it will wait at
> > most 2 grace periods while reclaims continue to arrive).
> > 
> > On the other hand, real clients will send their reclaims simultaneously
> > rather than one at a time.  And from a trace it looks like most of the
> > time's spent waiting for pynfs to send the next request rather than
> > waiting for replies.  So this is a bit unusual.
> > 
> > I'm inclined to drop the "many many many clients" tests.  It's easy
> > enough for someone doing reboot testing to patch the tests if they need
> > to.
> > 
> > By the way, the longest round trip time I see is the RECLAIM_COMPLETE.
> > I assume that's doing a commit to disk.  It looks like there's nothing
> > on the server to prevent processing RECLAIM_COMPLETEs in parallel so as
> > long as that's true I suppose we're OK.
> 
> How about having the many many many clients tests under a different flag so
> they are still available but easy to pick or not pick?

That might be OK.

Or it might also be possible to make the test a little smarter; e.g., if
reclaims start to fail with NOGRACE after a lease period, keep going and
maybe have the test WARN instead of failing.

--b.

> Considering that CID5, with the huge number of client-ids it creates but
> doesn't clean up (so they all eventually expire), has caught bugs in Ganesha,
> I like the idea of messy big tests being available for QE to run...
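
A minimal sketch of that smarter behavior, modeled on the reclaim path in
doTestOneClientGrace() from the patch below (the NFS4ERR_NO_GRACE handling
and the timing check are assumptions, not part of this patch):

    def doTestOneClientGraceTolerant(t, env, state, boot_time, lease_time):
        # Assumes the client has already re-established its clientid and
        # session, as doTestOneClientGrace() does before reclaiming.
        res = open_file(state.sess, state.owner, path=state.fh,
                        claim_type=CLAIM_PREVIOUS,
                        access=OPEN4_SHARE_ACCESS_BOTH,
                        deny=OPEN4_SHARE_DENY_NONE,
                        deleg_type=OPEN_DELEGATE_NONE)
        if res.status == NFS4ERR_NO_GRACE and \
           int(time.time()) >= boot_time + lease_time:
            # The server gave up on grace after a full lease period:
            # warn instead of failing the test.
            log.warn("grace ended before %s could reclaim" % state.name)
            return
        check(res, msg="Reclaim using newly created clientid")
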
Patch

diff --git a/nfs4.1/server41tests/st_reboot.py b/nfs4.1/server41tests/st_reboot.py
index b852ded..0216127 100644
--- a/nfs4.1/server41tests/st_reboot.py
+++ b/nfs4.1/server41tests/st_reboot.py
@@ -1,25 +1,33 @@ 
 from xdrdef.nfs4_const import *
 from xdrdef.nfs4_type import *
-from .environment import check, fail, create_file, open_file, create_confirm
+from .environment import check, fail, create_file, open_file, create_confirm, close_file
 import sys
 import os
-import nfs4lib
+import time
+import logging
 import nfs_ops
 op = nfs_ops.NFS4ops()
 from rpc import RPCTimeout
 
 # NOTE - reboot tests are NOT part of the standard test suite
 
+log = logging.getLogger("test.env")
+
 def _getleasetime(sess):
     res = sess.compound([op.putrootfh(), op.getattr(1 << FATTR4_LEASE_TIME)])
     return res.resarray[-1].obj_attributes[FATTR4_LEASE_TIME]
 
-def _waitForReboot(c, sess, env):
+def _waitForReboot(env):
     env.serverhelper("reboot")
     # Wait until the server is back up.
     # The following blocks until it gets a response,
     # which happens when the server comes back up.
     env.c1.c1 = env.c1.connect(env.c1.server_address)
+    # Go ahead and whack the cached session and client ids now
+    # to avoid errors in Environment.finish().
+    env.c1.sessions = {}
+    env.c1.clients = {}
+    return int(time.time())
 
 def create_session(c, cred=None, flags=0):
     """Send a simple CREATE_SESSION"""
@@ -51,7 +59,7 @@  def testRebootValid(t, env):
     reclaim_complete(sess)
     fh, stateid = create_confirm(sess, owner)
     sleeptime = 5 + _getleasetime(sess)
-    _waitForReboot(c, sess, env)
+    _waitForReboot(env)
     try:
         res = create_session(c)
         check(res, NFS4ERR_STALE_CLIENTID, "Reclaim using old clientid")
@@ -67,3 +75,138 @@  def testRebootValid(t, env):
         reclaim_complete(sess)
     finally:
         env.sleep(sleeptime, "Waiting for grace period to end")
+
+class State(object):
+    def __init__(self, name, owner, c, sess, fh):
+        self.name = name
+        self.owner = owner
+        self.c = c
+        self.sess = sess
+        self.fh = fh
+
+def doTestOneClientGrace(t, env, state):
+    res = state.sess.compound([])
+    check(res, NFS4ERR_BADSESSION, "Bare sequence after reboot")
+    res = create_session(state.c)
+    check(res, NFS4ERR_STALE_CLIENTID, "Reclaim using old clientid")
+    c = env.c1.new_client(state.name)
+    state.c = c
+    sess = c.create_session()
+    state.sess = sess
+    lease_time = _getleasetime(sess)
+    res = open_file(sess, state.owner, path=state.fh,
+                   claim_type=CLAIM_PREVIOUS,
+                   access=OPEN4_SHARE_ACCESS_BOTH,
+                   deny=OPEN4_SHARE_DENY_NONE,
+                   deleg_type=OPEN_DELEGATE_NONE)
+    check(res, msg="Reclaim using newly created clientid")
+    fh = res.resarray[-1].object
+    stateid = res.resarray[-2].stateid
+    reclaim_complete(sess)
+    close_file(sess, fh, stateid=stateid)
+    res = open_file(sess, state.owner, claim_type=CLAIM_NULL,
+                   access=OPEN4_SHARE_ACCESS_BOTH,
+                   deny=OPEN4_SHARE_DENY_NONE,
+                   deleg_type=OPEN_DELEGATE_NONE)
+    check(res, NFS4ERR_GRACE, "New open during grace")
+    return lease_time
+
+def doTestOneClientNoGrace(t, env, state):
+    res = open_file(state.sess, state.owner, claim_type=CLAIM_NULL,
+                   access=OPEN4_SHARE_ACCESS_BOTH,
+                   deny=OPEN4_SHARE_DENY_NONE,
+                   deleg_type=OPEN_DELEGATE_NONE)
+    if (res.status == NFS4ERR_GRACE):
+        return res
+    check(res, msg="New open after all clients done reclaiming")
+    fh = res.resarray[-1].object
+    stateid = res.resarray[-2].stateid
+    close_file(state.sess, fh, stateid=stateid)
+    return res
+
+# The server may have lifted the grace period early, but it's not obligated.
+# Keep looping until all the clients have done a normal open.  If the server
+# didn't lift the grace period early we don't want to fail the test, but we
+# do want to log a message.
+def doTestAllClientsNoGrace(t, env, states):
+        all_done = False
+        warn_grace = False
+        start_time = int(time.time())
+        ok_time = 0
+        while not all_done:
+            all_done = True
+            for state in states:
+                res = doTestOneClientNoGrace(t, env, state)
+                if res.status == NFS4ERR_GRACE:
+                    warn_grace = True
+                    all_done = False
+                elif not ok_time:
+                    ok_time = int(time.time())
+            if not all_done:
+                time.sleep(1)
+        if warn_grace:
+            lift_time = ok_time - start_time
+            log.warn("server took approximately %d seconds to lift grace "
+                        "after all clients reclaimed" % lift_time)
+
+def doTestRebootWithNClients(t, env, n=10):
+    boot_time = int(time.time())
+    lease_time = 90
+    states = []
+    block = env.c1.new_client_session("%s_block" % env.testname(t))
+    for i in range(n):
+        name = "%s_client_%i" % (env.testname(t), i)
+        owner = "owner_%s" % name
+        c = env.c1.new_client(name)
+        sess = c.create_session()
+        reclaim_complete(sess)
+        fh, stateid = create_confirm(sess, owner)
+        states.append(State(name, owner, c, sess, fh))
+    lease_time = _getleasetime(sess)
+    boot_time = _waitForReboot(env)
+
+    try:
+        for i in range(n):
+            lease_time = doTestOneClientGrace(t, env, states[i])
+
+        # At this point, all clients should have recovered except for 'block'.
+        # Recover that one now.
+        block = env.c1.new_client_session("%s_block" % env.testname(t))
+
+        # The server may have lifted the grace period early.  Test it.
+        doTestAllClientsNoGrace(t, env, states)
+
+    # If the test went normally, then the grace period should have already
+    # ended.  If we got some unexpected error, then wait a bit for the server
+    # to expire the clients so cleanup can proceed more smoothly.
+    except:
+        grace_end_time = boot_time + lease_time + 5
+        now = int(time.time())
+        if now < grace_end_time:
+            sleep_time = grace_end_time - now
+            env.sleep(sleep_time, "Waiting for grace period to end")
+        raise
+
+def testRebootWithManyClients(t, env):
+    """Reboot with many clients
+
+    FLAGS: reboot
+    CODE: REBT2a
+    """
+    return doTestRebootWithNClients(t, env)
+
+def testRebootWithManyManyClients(t, env):
+    """Reboot with many many clients
+
+    FLAGS: reboot
+    CODE: REBT2b
+    """
+    return doTestRebootWithNClients(t, env, 100)
+
+def testRebootWithManyManyManyClients(t, env):
+    """Reboot with many many many clients
+
+    FLAGS: reboot
+    CODE: REBT2c
+    """
+    return doTestRebootWithNClients(t, env, 1000)
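
For anyone trying these out: reboot tests are not part of the standard
suite (see the NOTE at the top of st_reboot.py), so they must be requested
explicitly by flag or code, and env.serverhelper("reboot") needs a helper
script that can actually restart the server.  A hypothetical invocation
(the option name and paths are assumptions; check ./testserver.py --help
for the exact options):

    cd nfs4.1
    ./testserver.py server.example.com:/export --serverhelper=./reboot-helper.sh reboot
    ./testserver.py server.example.com:/export --serverhelper=./reboot-helper.sh REBT2a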