Message ID: 20210929005641.60861-1-dai.ngo@oracle.com
Series: nfsd: Initial implementation of NFSv4 Courteous Server
On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote:
> Hi Bruce,
>
> This series of patches implements the NFSv4 Courteous Server.

Apologies, I keep meaning to get back to this and haven't yet.

I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18.

--b.

> A server which does not immediately expunge the state on lease expiration
> is known as a Courteous Server. A Courteous Server continues to recognize
> previously generated state tokens as valid until conflict arises between
> the expired state and the requests from another client, or the server
> reboots.
>
> The v2 patch includes the following:
>
> . add a new callback, lm_expire_lock, to lock_manager_operations to
>   allow the lock manager to take appropriate action on a conflicting lock.
>
> . handle conflicts of NFSv4 locks with NFSv3/NLM and local locks.
>
> . expire a courtesy client after 24hr if the client has not reconnected.
>
> . do not allow an expired client to become a courtesy client if there are
>   waiters for the client's locks.
>
> . modify client_info_show to show the courtesy client and seconds since
>   the last renew.
>
> . fix a problem with the NFSv4.1 server where it keeps returning
>   SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply after
>   the courtesy client re-connects, causing the client to keep sending
>   BCTS requests to the server.
>
> The v3 patch includes the following:
>
> . modify posix_test_lock to check and resolve conflicting locks
>   to handle NLM TEST and NFSv4 LOCKT requests.
>
> . separate out the fix for the back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>
> The v4 patch includes:
>
> . rework nfsd_check_courtesy to avoid a deadlock between fl_lock and client_lock
>   by asking the laundromat thread to destroy the courtesy client.
>
> . handle NFSv4 share reservation conflicts with courtesy clients. This
>   includes conflicts between access mode and deny mode and vice versa.
>
> . drop the patch for the back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.
>
> The v5 patch includes:
>
> . fix recursive locking of file_rwsem from posix_lock_file.
>
> . retest with LOCKDEP enabled.
>
> NOTE: I will submit pynfs tests for the courteous server, including tests
> for share reservation conflicts, in a separate patch.
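[For context, the lm_expire_lock idea in the cover letter amounts to letting the generic file-locking code call back into nfsd when a conflicting lock belongs to an expired (courtesy) client, so that client's state can be torn down and the conflict rechecked. The sketch below shows only that shape and is not compilable as-is: struct lock_manager_operations and struct file_lock are real kernel types, but the callback signature and every helper named here are illustrative assumptions, not the code from this series.]

/* Illustrative sketch only -- not the patch itself. */
struct lock_manager_operations {
	/* ... existing callbacks (lm_notify, lm_grant, lm_break, ...) ... */

	/*
	 * Called when a conflicting lock is found: return true if the
	 * conflicting lock belonged to an expired client and has been
	 * (or will be) released, so the caller may recheck the conflict.
	 */
	bool (*lm_expire_lock)(struct file_lock *fl);	/* assumed signature */
};

static bool nfsd4_lm_expire_lock(struct file_lock *fl)
{
	struct nfs4_client *clp = client_from_lock_owner(fl);	/* hypothetical */

	if (!clp || !is_courtesy_client(clp))			/* hypothetical */
		return false;

	/*
	 * Per the v4 notes above: don't destroy the client here (that risks
	 * a deadlock between fl_lock and client_lock); instead ask the
	 * laundromat thread to destroy the courtesy client.
	 */
	queue_client_for_laundromat(clp);			/* hypothetical */
	return true;
}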
On 10/1/21 1:53 PM, J. Bruce Fields wrote: > On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote: >> Hi Bruce, >> >> This series of patches implement the NFSv4 Courteous Server. > Apologies, I keep meaning to get back to this and haven't yet. > > I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18. It's weird, this test passes on my system: [root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18 INIT st_setclientid.testValid : RUNNING INIT st_setclientid.testValid : PASS MKFILE st_open.testOpen : RUNNING MKFILE st_open.testOpen : PASS OPEN18 st_open.testShareConflict1 : RUNNING OPEN18 st_open.testShareConflict1 : PASS ************************************************** INIT st_setclientid.testValid : PASS OPEN18 st_open.testShareConflict1 : PASS MKFILE st_open.testOpen : PASS ************************************************** Command line asked for 3 of 673 tests Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed [root@nfsvmf25 nfs4.0]# Do you have a network trace? -Dai > > --b. > >> A server which does not immediately expunge the state on lease expiration >> is known as a Courteous Server. A Courteous Server continues to recognize >> previously generated state tokens as valid until conflict arises between >> the expired state and the requests from another client, or the server >> reboots. >> >> The v2 patch includes the following: >> >> . add new callback, lm_expire_lock, to lock_manager_operations to >> allow the lock manager to take appropriate action with conflict lock. >> >> . handle conflicts of NFSv4 locks with NFSv3/NLM and local locks. >> >> . expire courtesy client after 24hr if client has not reconnected. >> >> . do not allow expired client to become courtesy client if there are >> waiters for client's locks. >> >> . modify client_info_show to show courtesy client and seconds from >> last renew. >> >> . fix a problem with NFSv4.1 server where the it keeps returning >> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after >> the courtesy client re-connects, causing the client to keep sending >> BCTS requests to server. >> >> The v3 patch includes the following: >> >> . modified posix_test_lock to check and resolve conflict locks >> to handle NLM TEST and NFSv4 LOCKT requests. >> >> . separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN. >> >> The v4 patch includes: >> >> . rework nfsd_check_courtesy to avoid dead lock of fl_lock and client_lock >> by asking the laudromat thread to destroy the courtesy client. >> >> . handle NFSv4 share reservation conflicts with courtesy client. This >> includes conflicts between access mode and deny mode and vice versa. >> >> . drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN. >> >> The v5 patch includes: >> >> . fix recursive locking of file_rwsem from posix_lock_file. >> >> . retest with LOCKDEP enabled. >> >> NOTE: I will submit pynfs tests for courteous server including tests >> for share reservation conflicts in a separate patch. >>
On Fri, Oct 01, 2021 at 02:41:55PM -0700, dai.ngo@oracle.com wrote: > > On 10/1/21 1:53 PM, J. Bruce Fields wrote: > >On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote: > >>Hi Bruce, > >> > >>This series of patches implement the NFSv4 Courteous Server. > >Apologies, I keep meaning to get back to this and haven't yet. > > > >I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18. > > It's weird, this test passes on my system: > > > [root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18 > INIT st_setclientid.testValid : RUNNING > INIT st_setclientid.testValid : PASS > MKFILE st_open.testOpen : RUNNING > MKFILE st_open.testOpen : PASS > OPEN18 st_open.testShareConflict1 : RUNNING > OPEN18 st_open.testShareConflict1 : PASS > ************************************************** > INIT st_setclientid.testValid : PASS > OPEN18 st_open.testShareConflict1 : PASS > MKFILE st_open.testOpen : PASS > ************************************************** > Command line asked for 3 of 673 tests > Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed > [root@nfsvmf25 nfs4.0]# > > Do you have a network trace? Yeah, weirdly, I think it's failing only when I run it with all the other pynfs tests, not when I run it alone. I'll check again and see if I can get a trace, probably next week. --b.
Hi Bruce, Just a reminder that this patch is still waiting for your review. Thanks, -Dai On 10/1/21 2:41 PM, dai.ngo@oracle.com wrote: > > On 10/1/21 1:53 PM, J. Bruce Fields wrote: >> On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote: >>> Hi Bruce, >>> >>> This series of patches implement the NFSv4 Courteous Server. >> Apologies, I keep meaning to get back to this and haven't yet. >> >> I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18. > > It's weird, this test passes on my system: > > > [root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18 > INIT st_setclientid.testValid : RUNNING > INIT st_setclientid.testValid : PASS > MKFILE st_open.testOpen : RUNNING > MKFILE st_open.testOpen : PASS > OPEN18 st_open.testShareConflict1 : RUNNING > OPEN18 st_open.testShareConflict1 : PASS > ************************************************** > INIT st_setclientid.testValid : PASS > OPEN18 st_open.testShareConflict1 : PASS > MKFILE st_open.testOpen : PASS > ************************************************** > Command line asked for 3 of 673 tests > Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed > [root@nfsvmf25 nfs4.0]# > > Do you have a network trace? > > -Dai > >> >> --b. >> >>> A server which does not immediately expunge the state on lease >>> expiration >>> is known as a Courteous Server. A Courteous Server continues to >>> recognize >>> previously generated state tokens as valid until conflict arises >>> between >>> the expired state and the requests from another client, or the server >>> reboots. >>> >>> The v2 patch includes the following: >>> >>> . add new callback, lm_expire_lock, to lock_manager_operations to >>> allow the lock manager to take appropriate action with conflict >>> lock. >>> >>> . handle conflicts of NFSv4 locks with NFSv3/NLM and local locks. >>> >>> . expire courtesy client after 24hr if client has not reconnected. >>> >>> . do not allow expired client to become courtesy client if there are >>> waiters for client's locks. >>> >>> . modify client_info_show to show courtesy client and seconds from >>> last renew. >>> >>> . fix a problem with NFSv4.1 server where the it keeps returning >>> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after >>> the courtesy client re-connects, causing the client to keep sending >>> BCTS requests to server. >>> >>> The v3 patch includes the following: >>> >>> . modified posix_test_lock to check and resolve conflict locks >>> to handle NLM TEST and NFSv4 LOCKT requests. >>> >>> . separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN. >>> >>> The v4 patch includes: >>> >>> . rework nfsd_check_courtesy to avoid dead lock of fl_lock and >>> client_lock >>> by asking the laudromat thread to destroy the courtesy client. >>> >>> . handle NFSv4 share reservation conflicts with courtesy client. This >>> includes conflicts between access mode and deny mode and vice versa. >>> >>> . drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN. >>> >>> The v5 patch includes: >>> >>> . fix recursive locking of file_rwsem from posix_lock_file. >>> >>> . retest with LOCKDEP enabled. >>> >>> NOTE: I will submit pynfs tests for courteous server including tests >>> for share reservation conflicts in a separate patch. >>>
On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: > Just a reminder that this patch is still waiting for your review. Yeah, I was procrastinating and hoping yo'ud figure out the pynfs failure for me.... I'll see if I can get some time today.--b. > > Thanks, > -Dai > > On 10/1/21 2:41 PM, dai.ngo@oracle.com wrote: > > > >On 10/1/21 1:53 PM, J. Bruce Fields wrote: > >>On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote: > >>>Hi Bruce, > >>> > >>>This series of patches implement the NFSv4 Courteous Server. > >>Apologies, I keep meaning to get back to this and haven't yet. > >> > >>I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18. > > > >It's weird, this test passes on my system: > > > > > >[root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18 > >INIT st_setclientid.testValid : RUNNING > >INIT st_setclientid.testValid : PASS > >MKFILE st_open.testOpen : RUNNING > >MKFILE st_open.testOpen : PASS > >OPEN18 st_open.testShareConflict1 : RUNNING > >OPEN18 st_open.testShareConflict1 : PASS > >************************************************** > >INIT st_setclientid.testValid : PASS > >OPEN18 st_open.testShareConflict1 : PASS > >MKFILE st_open.testOpen : PASS > >************************************************** > >Command line asked for 3 of 673 tests > >Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed > >[root@nfsvmf25 nfs4.0]# > > > >Do you have a network trace? > > > >-Dai > > > >> > >>--b. > >> > >>>A server which does not immediately expunge the state on lease > >>>expiration > >>>is known as a Courteous Server. A Courteous Server continues > >>>to recognize > >>>previously generated state tokens as valid until conflict > >>>arises between > >>>the expired state and the requests from another client, or the server > >>>reboots. > >>> > >>>The v2 patch includes the following: > >>> > >>>. add new callback, lm_expire_lock, to lock_manager_operations to > >>> allow the lock manager to take appropriate action with > >>>conflict lock. > >>> > >>>. handle conflicts of NFSv4 locks with NFSv3/NLM and local locks. > >>> > >>>. expire courtesy client after 24hr if client has not reconnected. > >>> > >>>. do not allow expired client to become courtesy client if there are > >>> waiters for client's locks. > >>> > >>>. modify client_info_show to show courtesy client and seconds from > >>> last renew. > >>> > >>>. fix a problem with NFSv4.1 server where the it keeps returning > >>> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after > >>> the courtesy client re-connects, causing the client to keep sending > >>> BCTS requests to server. > >>> > >>>The v3 patch includes the following: > >>> > >>>. modified posix_test_lock to check and resolve conflict locks > >>> to handle NLM TEST and NFSv4 LOCKT requests. > >>> > >>>. separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN. > >>> > >>>The v4 patch includes: > >>> > >>>. rework nfsd_check_courtesy to avoid dead lock of fl_lock and > >>>client_lock > >>> by asking the laudromat thread to destroy the courtesy client. > >>> > >>>. handle NFSv4 share reservation conflicts with courtesy client. This > >>> includes conflicts between access mode and deny mode and vice versa. > >>> > >>>. drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN. > >>> > >>>The v5 patch includes: > >>> > >>>. fix recursive locking of file_rwsem from posix_lock_file. > >>> > >>>. retest with LOCKDEP enabled. 
> >>> > >>>NOTE: I will submit pynfs tests for courteous server including tests > >>>for share reservation conflicts in a separate patch. > >>>
On 11/17/21 6:14 AM, J. Bruce Fields wrote: > On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >> Just a reminder that this patch is still waiting for your review. > Yeah, I was procrastinating and hoping yo'ud figure out the pynfs > failure for me.... Last time I ran 4.0 OPEN18 test by itself and it passed. I will run all OPEN tests together with 5.15-rc7 to see if the problem you've seen still there. -Dai > I'll see if I can get some time today.--b. > >> Thanks, >> -Dai >> >> On 10/1/21 2:41 PM, dai.ngo@oracle.com wrote: >>> On 10/1/21 1:53 PM, J. Bruce Fields wrote: >>>> On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote: >>>>> Hi Bruce, >>>>> >>>>> This series of patches implement the NFSv4 Courteous Server. >>>> Apologies, I keep meaning to get back to this and haven't yet. >>>> >>>> I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18. >>> It's weird, this test passes on my system: >>> >>> >>> [root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18 >>> INIT st_setclientid.testValid : RUNNING >>> INIT st_setclientid.testValid : PASS >>> MKFILE st_open.testOpen : RUNNING >>> MKFILE st_open.testOpen : PASS >>> OPEN18 st_open.testShareConflict1 : RUNNING >>> OPEN18 st_open.testShareConflict1 : PASS >>> ************************************************** >>> INIT st_setclientid.testValid : PASS >>> OPEN18 st_open.testShareConflict1 : PASS >>> MKFILE st_open.testOpen : PASS >>> ************************************************** >>> Command line asked for 3 of 673 tests >>> Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed >>> [root@nfsvmf25 nfs4.0]# >>> >>> Do you have a network trace? >>> >>> -Dai >>> >>>> --b. >>>> >>>>> A server which does not immediately expunge the state on lease >>>>> expiration >>>>> is known as a Courteous Server. A Courteous Server continues >>>>> to recognize >>>>> previously generated state tokens as valid until conflict >>>>> arises between >>>>> the expired state and the requests from another client, or the server >>>>> reboots. >>>>> >>>>> The v2 patch includes the following: >>>>> >>>>> . add new callback, lm_expire_lock, to lock_manager_operations to >>>>> allow the lock manager to take appropriate action with >>>>> conflict lock. >>>>> >>>>> . handle conflicts of NFSv4 locks with NFSv3/NLM and local locks. >>>>> >>>>> . expire courtesy client after 24hr if client has not reconnected. >>>>> >>>>> . do not allow expired client to become courtesy client if there are >>>>> waiters for client's locks. >>>>> >>>>> . modify client_info_show to show courtesy client and seconds from >>>>> last renew. >>>>> >>>>> . fix a problem with NFSv4.1 server where the it keeps returning >>>>> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after >>>>> the courtesy client re-connects, causing the client to keep sending >>>>> BCTS requests to server. >>>>> >>>>> The v3 patch includes the following: >>>>> >>>>> . modified posix_test_lock to check and resolve conflict locks >>>>> to handle NLM TEST and NFSv4 LOCKT requests. >>>>> >>>>> . separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN. >>>>> >>>>> The v4 patch includes: >>>>> >>>>> . rework nfsd_check_courtesy to avoid dead lock of fl_lock and >>>>> client_lock >>>>> by asking the laudromat thread to destroy the courtesy client. >>>>> >>>>> . handle NFSv4 share reservation conflicts with courtesy client. This >>>>> includes conflicts between access mode and deny mode and vice versa. >>>>> >>>>> . 
drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN. >>>>> >>>>> The v5 patch includes: >>>>> >>>>> . fix recursive locking of file_rwsem from posix_lock_file. >>>>> >>>>> . retest with LOCKDEP enabled. >>>>> >>>>> NOTE: I will submit pynfs tests for courteous server including tests >>>>> for share reservation conflicts in a separate patch. >>>>>
On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: > > On 11/17/21 6:14 AM, J. Bruce Fields wrote: >> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>> Just a reminder that this patch is still waiting for your review. >> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >> failure for me.... > > Last time I ran 4.0 OPEN18 test by itself and it passed. I will run > all OPEN tests together with 5.15-rc7 to see if the problem you've > seen still there. I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous 5.15-rc7 server. Nfs4.1 results are the same for both courteous and non-courteous server: > Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed Results of nfs4.0 with non-courteous server: >Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed test failed: LOCK24 Results of nfs4.0 with courteous server: >Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed tests failed: LOCK24, OPEN18, OPEN30 OPEN18 and OPEN30 test pass if each is run by itself. I will look into this problem. -Dai > > -Dai > >> I'll see if I can get some time today.--b. >> >>> Thanks, >>> -Dai >>> >>> On 10/1/21 2:41 PM, dai.ngo@oracle.com wrote: >>>> On 10/1/21 1:53 PM, J. Bruce Fields wrote: >>>>> On Tue, Sep 28, 2021 at 08:56:39PM -0400, Dai Ngo wrote: >>>>>> Hi Bruce, >>>>>> >>>>>> This series of patches implement the NFSv4 Courteous Server. >>>>> Apologies, I keep meaning to get back to this and haven't yet. >>>>> >>>>> I do notice I'm seeing a timeout on pynfs 4.0 test OPEN18. >>>> It's weird, this test passes on my system: >>>> >>>> >>>> [root@nfsvmf25 nfs4.0]# ./testserver.py $server --rundeps -v OPEN18 >>>> INIT st_setclientid.testValid : RUNNING >>>> INIT st_setclientid.testValid : PASS >>>> MKFILE st_open.testOpen : RUNNING >>>> MKFILE st_open.testOpen : PASS >>>> OPEN18 st_open.testShareConflict1 : RUNNING >>>> OPEN18 st_open.testShareConflict1 : PASS >>>> ************************************************** >>>> INIT st_setclientid.testValid : PASS >>>> OPEN18 st_open.testShareConflict1 : PASS >>>> MKFILE st_open.testOpen : PASS >>>> ************************************************** >>>> Command line asked for 3 of 673 tests >>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 3 Passed >>>> [root@nfsvmf25 nfs4.0]# >>>> >>>> Do you have a network trace? >>>> >>>> -Dai >>>> >>>>> --b. >>>>> >>>>>> A server which does not immediately expunge the state on lease >>>>>> expiration >>>>>> is known as a Courteous Server. A Courteous Server continues >>>>>> to recognize >>>>>> previously generated state tokens as valid until conflict >>>>>> arises between >>>>>> the expired state and the requests from another client, or the >>>>>> server >>>>>> reboots. >>>>>> >>>>>> The v2 patch includes the following: >>>>>> >>>>>> . add new callback, lm_expire_lock, to lock_manager_operations to >>>>>> allow the lock manager to take appropriate action with >>>>>> conflict lock. >>>>>> >>>>>> . handle conflicts of NFSv4 locks with NFSv3/NLM and local locks. >>>>>> >>>>>> . expire courtesy client after 24hr if client has not reconnected. >>>>>> >>>>>> . do not allow expired client to become courtesy client if there are >>>>>> waiters for client's locks. >>>>>> >>>>>> . modify client_info_show to show courtesy client and seconds from >>>>>> last renew. >>>>>> >>>>>> . 
fix a problem with NFSv4.1 server where the it keeps returning >>>>>> SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after >>>>>> the courtesy client re-connects, causing the client to keep >>>>>> sending >>>>>> BCTS requests to server. >>>>>> >>>>>> The v3 patch includes the following: >>>>>> >>>>>> . modified posix_test_lock to check and resolve conflict locks >>>>>> to handle NLM TEST and NFSv4 LOCKT requests. >>>>>> >>>>>> . separate out fix for back channel stuck in >>>>>> SEQ4_STATUS_CB_PATH_DOWN. >>>>>> >>>>>> The v4 patch includes: >>>>>> >>>>>> . rework nfsd_check_courtesy to avoid dead lock of fl_lock and >>>>>> client_lock >>>>>> by asking the laudromat thread to destroy the courtesy client. >>>>>> >>>>>> . handle NFSv4 share reservation conflicts with courtesy client. >>>>>> This >>>>>> includes conflicts between access mode and deny mode and vice >>>>>> versa. >>>>>> >>>>>> . drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN. >>>>>> >>>>>> The v5 patch includes: >>>>>> >>>>>> . fix recursive locking of file_rwsem from posix_lock_file. >>>>>> >>>>>> . retest with LOCKDEP enabled. >>>>>> >>>>>> NOTE: I will submit pynfs tests for courteous server including tests >>>>>> for share reservation conflicts in a separate patch. >>>>>>
On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: > > On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: > > > >On 11/17/21 6:14 AM, J. Bruce Fields wrote: > >>On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: > >>>Just a reminder that this patch is still waiting for your review. > >>Yeah, I was procrastinating and hoping yo'ud figure out the pynfs > >>failure for me.... > > > >Last time I ran 4.0 OPEN18 test by itself and it passed. I will run > >all OPEN tests together with 5.15-rc7 to see if the problem you've > >seen still there. > > I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous > 5.15-rc7 server. > > Nfs4.1 results are the same for both courteous and non-courteous server: > >Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed > > Results of nfs4.0 with non-courteous server: > >Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed > test failed: LOCK24 > > Results of nfs4.0 with courteous server: > >Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed > tests failed: LOCK24, OPEN18, OPEN30 > > OPEN18 and OPEN30 test pass if each is run by itself. Could well be a bug in the tests, I don't know. > I will look into this problem. Thanks! --b.
On 11/17/21 4:34 PM, J. Bruce Fields wrote: > On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>>>> Just a reminder that this patch is still waiting for your review. >>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >>>> failure for me.... >>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run >>> all OPEN tests together with 5.15-rc7 to see if the problem you've >>> seen still there. >> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous >> 5.15-rc7 server. >> >> Nfs4.1 results are the same for both courteous and non-courteous server: >>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >> Results of nfs4.0 with non-courteous server: >>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >> test failed: LOCK24 >> >> Results of nfs4.0 with courteous server: >>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >> tests failed: LOCK24, OPEN18, OPEN30 >> >> OPEN18 and OPEN30 test pass if each is run by itself. > Could well be a bug in the tests, I don't know. The reason OPEN18 failed was because the test timed out waiting for the reply of an OPEN call. The RPC connection used for the test was configured with 15 secs timeout. Note that OPEN18 only fails when the tests were run with 'all' option, this test passes if it's run by itself. With courteous server, by the time OPEN18 runs, there are about 1026 courtesy 4.0 clients on the server and all of these clients have opened the same file X with WRITE access. These clients were created by the previous tests. After each test completed, since 4.0 does not have session, the client states are not cleaned up immediately on the server and are allowed to become courtesy clients. When OPEN18 runs (about 20 minutes after the 1st test started), it sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the server to check for conflicts with courtesy clients. The loop that checks 1026 courtesy clients for share/access conflict took less than 1 sec. But it took about 55 secs, on my VM, for the server to expire all 1026 courtesy clients. I modified pynfs to configure the 4.0 RPC connection with 60 seconds timeout and OPEN18 now consistently passed. The 4.0 test results are now the same for courteous and non-courteous server: 8 Skipped, 1 Failed, 0 Warned, 577 Passed Note that 4.1 tests do not suffer this timeout problem because the 4.1 clients and sessions are destroyed after each test completes. -Dai >> I will look into this problem. > Thanks! > > --b.
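[To make the numbers above concrete: the OPEN with OPEN4_SHARE_DENY_WRITE cannot be granted until every conflicting courtesy client's state is gone, and at roughly 50ms of teardown per client, ~1026 clients is about 55 seconds, well past pynfs's 15-second RPC timeout. The loop below is a simplified sketch of that server-side work; the real series defers destruction to the laundromat thread, locking is omitted, and the unlabelled helpers are hypothetical.]

/* Simplified sketch (hypothetical helpers, no locking): why OPEN18 times
 * out when ~1026 courtesy clients still hold the test file open for WRITE. */
static __be32 resolve_share_deny_conflicts(struct nfsd_net *nn,
					   struct nfs4_file *fp, u32 deny)
{
	struct nfs4_client *clp, *tmp;

	/* Checking ~1026 clients for a share/access conflict is fast
	 * (under a second in Dai's measurement)... */
	list_for_each_entry_safe(clp, tmp, &nn->client_lru, cl_lru) {
		if (!is_courtesy_client(clp))			/* hypothetical */
			continue;
		if (!client_conflicts_with_deny(clp, fp, deny))	/* hypothetical */
			continue;
		/* ...but tearing each one down (opens, locks, stable
		 * storage record) costs ~50ms, so ~1026 of them adds up
		 * to ~55 seconds before the OPEN can be answered. */
		expire_client(clp);
	}
	return nfs_ok;
}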
Hi Bruce, On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: > > On 11/17/21 4:34 PM, J. Bruce Fields wrote: >> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>>>>> Just a reminder that this patch is still waiting for your review. >>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >>>>> failure for me.... >>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run >>>> all OPEN tests together with 5.15-rc7 to see if the problem you've >>>> seen still there. >>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous >>> 5.15-rc7 server. >>> >>> Nfs4.1 results are the same for both courteous and non-courteous >>> server: >>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >>> Results of nfs4.0 with non-courteous server: >>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>> test failed: LOCK24 >>> >>> Results of nfs4.0 with courteous server: >>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >>> tests failed: LOCK24, OPEN18, OPEN30 >>> >>> OPEN18 and OPEN30 test pass if each is run by itself. >> Could well be a bug in the tests, I don't know. > > The reason OPEN18 failed was because the test timed out waiting for > the reply of an OPEN call. The RPC connection used for the test was > configured with 15 secs timeout. Note that OPEN18 only fails when > the tests were run with 'all' option, this test passes if it's run > by itself. > > With courteous server, by the time OPEN18 runs, there are about 1026 > courtesy 4.0 clients on the server and all of these clients have opened > the same file X with WRITE access. These clients were created by the > previous tests. After each test completed, since 4.0 does not have > session, the client states are not cleaned up immediately on the > server and are allowed to become courtesy clients. > > When OPEN18 runs (about 20 minutes after the 1st test started), it > sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the > server to check for conflicts with courtesy clients. The loop that > checks 1026 courtesy clients for share/access conflict took less > than 1 sec. But it took about 55 secs, on my VM, for the server > to expire all 1026 courtesy clients. > > I modified pynfs to configure the 4.0 RPC connection with 60 seconds > timeout and OPEN18 now consistently passed. The 4.0 test results are > now the same for courteous and non-courteous server: > > 8 Skipped, 1 Failed, 0 Warned, 577 Passed > > Note that 4.1 tests do not suffer this timeout problem because the > 4.1 clients and sessions are destroyed after each test completes. Do you want me to send the patch to increase the timeout for pynfs? or is there any other things you think we should do? Thanks, -Dai
On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: > Hi Bruce, > > On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: > > > >On 11/17/21 4:34 PM, J. Bruce Fields wrote: > >>On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: > >>>On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: > >>>>On 11/17/21 6:14 AM, J. Bruce Fields wrote: > >>>>>On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: > >>>>>>Just a reminder that this patch is still waiting for your review. > >>>>>Yeah, I was procrastinating and hoping yo'ud figure out the pynfs > >>>>>failure for me.... > >>>>Last time I ran 4.0 OPEN18 test by itself and it passed. I will run > >>>>all OPEN tests together with 5.15-rc7 to see if the problem you've > >>>>seen still there. > >>>I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous > >>>5.15-rc7 server. > >>> > >>>Nfs4.1 results are the same for both courteous and > >>>non-courteous server: > >>>>Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed > >>>Results of nfs4.0 with non-courteous server: > >>>>Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed > >>>test failed: LOCK24 > >>> > >>>Results of nfs4.0 with courteous server: > >>>>Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed > >>>tests failed: LOCK24, OPEN18, OPEN30 > >>> > >>>OPEN18 and OPEN30 test pass if each is run by itself. > >>Could well be a bug in the tests, I don't know. > > > >The reason OPEN18 failed was because the test timed out waiting for > >the reply of an OPEN call. The RPC connection used for the test was > >configured with 15 secs timeout. Note that OPEN18 only fails when > >the tests were run with 'all' option, this test passes if it's run > >by itself. > > > >With courteous server, by the time OPEN18 runs, there are about 1026 > >courtesy 4.0 clients on the server and all of these clients have opened > >the same file X with WRITE access. These clients were created by the > >previous tests. After each test completed, since 4.0 does not have > >session, the client states are not cleaned up immediately on the > >server and are allowed to become courtesy clients. > > > >When OPEN18 runs (about 20 minutes after the 1st test started), it > >sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the > >server to check for conflicts with courtesy clients. The loop that > >checks 1026 courtesy clients for share/access conflict took less > >than 1 sec. But it took about 55 secs, on my VM, for the server > >to expire all 1026 courtesy clients. > > > >I modified pynfs to configure the 4.0 RPC connection with 60 seconds > >timeout and OPEN18 now consistently passed. The 4.0 test results are > >now the same for courteous and non-courteous server: > > > >8 Skipped, 1 Failed, 0 Warned, 577 Passed > > > >Note that 4.1 tests do not suffer this timeout problem because the > >4.1 clients and sessions are destroyed after each test completes. > > Do you want me to send the patch to increase the timeout for pynfs? > or is there any other things you think we should do? I don't know. 55 seconds to clean up 1026 clients is about 50ms per client, which is pretty slow. I wonder why. I guess it's probably updating the stable storage information. Is /var/lib/nfs/ on your server backed by a hard drive or an SSD or something else? I wonder if that's an argument for limiting the number of courtesy clients. --b.
On 11/29/21 9:30 AM, J. Bruce Fields wrote: > On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: >> Hi Bruce, >> >> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>>>>>>> Just a reminder that this patch is still waiting for your review. >>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >>>>>>> failure for me.... >>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run >>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've >>>>>> seen still there. >>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous >>>>> 5.15-rc7 server. >>>>> >>>>> Nfs4.1 results are the same for both courteous and >>>>> non-courteous server: >>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >>>>> Results of nfs4.0 with non-courteous server: >>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>> test failed: LOCK24 >>>>> >>>>> Results of nfs4.0 with courteous server: >>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>> >>>>> OPEN18 and OPEN30 test pass if each is run by itself. >>>> Could well be a bug in the tests, I don't know. >>> The reason OPEN18 failed was because the test timed out waiting for >>> the reply of an OPEN call. The RPC connection used for the test was >>> configured with 15 secs timeout. Note that OPEN18 only fails when >>> the tests were run with 'all' option, this test passes if it's run >>> by itself. >>> >>> With courteous server, by the time OPEN18 runs, there are about 1026 >>> courtesy 4.0 clients on the server and all of these clients have opened >>> the same file X with WRITE access. These clients were created by the >>> previous tests. After each test completed, since 4.0 does not have >>> session, the client states are not cleaned up immediately on the >>> server and are allowed to become courtesy clients. >>> >>> When OPEN18 runs (about 20 minutes after the 1st test started), it >>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the >>> server to check for conflicts with courtesy clients. The loop that >>> checks 1026 courtesy clients for share/access conflict took less >>> than 1 sec. But it took about 55 secs, on my VM, for the server >>> to expire all 1026 courtesy clients. >>> >>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds >>> timeout and OPEN18 now consistently passed. The 4.0 test results are >>> now the same for courteous and non-courteous server: >>> >>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>> >>> Note that 4.1 tests do not suffer this timeout problem because the >>> 4.1 clients and sessions are destroyed after each test completes. >> Do you want me to send the patch to increase the timeout for pynfs? >> or is there any other things you think we should do? > I don't know. > > 55 seconds to clean up 1026 clients is about 50ms per client, which is > pretty slow. I wonder why. I guess it's probably updating the stable > storage information. Is /var/lib/nfs/ on your server backed by a hard > drive or an SSD or something else? My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard disk. 
I think a production system that supports this many clients should have faster CPUs, faster storage. > > I wonder if that's an argument for limiting the number of courtesy > clients. I think we might want to treat 4.0 clients a bit differently from 4.1 clients. With 4.0, every client becomes a courtesy client after it is done with the export and unmounts it. Since there is no destroy session/client with 4.0, the courteous server allows the client to stick around and become a courtesy client. So after a while, even with normal usage, there will be lots of 4.0 courtesy clients hanging around, and these clients won't be destroyed until 24hrs later, or until they cause conflicts with other clients. We can reduce the courtesy_client_expiry time for 4.0 clients from 24hrs to 15/20 mins, enough for most network partitions to heal, or limit the number of 4.0 courtesy clients. Or don't support 4.0 clients at all, which is my preference, since I think in general users should skip 4.0 and use 4.1 instead. -Dai
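[Dai's "treat 4.0 differently" suggestion could be as small as a per-minor-version courtesy timeout in the laundromat. A minimal sketch, with made-up constant and helper names; the 24-hour value comes from the cover letter, and the 20 minutes is just the 15/20 minutes floated above. cl_minorversion and cl_time are real nfs4_client fields.]

/* Sketch only: per-minor-version courtesy expiry. */
#define COURTESY_CLIENT_TIMEOUT_V41	(24 * 60 * 60)	/* 24 hours   */
#define COURTESY_CLIENT_TIMEOUT_V40	(20 * 60)	/* 20 minutes */

static bool courtesy_client_expired(struct nfs4_client *clp, time64_t now)
{
	time64_t timeout = clp->cl_minorversion ?
		COURTESY_CLIENT_TIMEOUT_V41 : COURTESY_CLIENT_TIMEOUT_V40;

	/* cl_time is the time of the client's last lease renewal */
	return now - clp->cl_time > timeout;
}

[The other option mentioned above, a simple cap on the number of 4.0 courtesy clients with the oldest expired first, would bound the worst-case teardown work the same way.]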
Hello Dai! > On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> wrote: > > > On 11/29/21 9:30 AM, J. Bruce Fields wrote: >> On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: >>> Hi Bruce, >>> >>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >>>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>>>>>>>> Just a reminder that this patch is still waiting for your review. >>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >>>>>>>> failure for me.... >>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run >>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've >>>>>>> seen still there. >>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous >>>>>> 5.15-rc7 server. >>>>>> >>>>>> Nfs4.1 results are the same for both courteous and >>>>>> non-courteous server: >>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >>>>>> Results of nfs4.0 with non-courteous server: >>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>> test failed: LOCK24 >>>>>> >>>>>> Results of nfs4.0 with courteous server: >>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>> >>>>>> OPEN18 and OPEN30 test pass if each is run by itself. >>>>> Could well be a bug in the tests, I don't know. >>>> The reason OPEN18 failed was because the test timed out waiting for >>>> the reply of an OPEN call. The RPC connection used for the test was >>>> configured with 15 secs timeout. Note that OPEN18 only fails when >>>> the tests were run with 'all' option, this test passes if it's run >>>> by itself. >>>> >>>> With courteous server, by the time OPEN18 runs, there are about 1026 >>>> courtesy 4.0 clients on the server and all of these clients have opened >>>> the same file X with WRITE access. These clients were created by the >>>> previous tests. After each test completed, since 4.0 does not have >>>> session, the client states are not cleaned up immediately on the >>>> server and are allowed to become courtesy clients. >>>> >>>> When OPEN18 runs (about 20 minutes after the 1st test started), it >>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the >>>> server to check for conflicts with courtesy clients. The loop that >>>> checks 1026 courtesy clients for share/access conflict took less >>>> than 1 sec. But it took about 55 secs, on my VM, for the server >>>> to expire all 1026 courtesy clients. >>>> >>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds >>>> timeout and OPEN18 now consistently passed. The 4.0 test results are >>>> now the same for courteous and non-courteous server: >>>> >>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>> >>>> Note that 4.1 tests do not suffer this timeout problem because the >>>> 4.1 clients and sessions are destroyed after each test completes. >>> Do you want me to send the patch to increase the timeout for pynfs? >>> or is there any other things you think we should do? >> I don't know. >> >> 55 seconds to clean up 1026 clients is about 50ms per client, which is >> pretty slow. I wonder why. I guess it's probably updating the stable >> storage information. 
Is /var/lib/nfs/ on your server backed by a hard >> drive or an SSD or something else? > > My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard > disk. I think a production system that supports this many clients should > have faster CPUs, faster storage. > >> >> I wonder if that's an argument for limiting the number of courtesy >> clients. > > I think we might want to treat 4.0 clients a bit different from 4.1 > clients. With 4.0, every client will become a courtesy client after > the client is done with the export and unmounts it. It should be safe for a server to purge a client's lease immediately if there is no open or lock state associated with it. When an NFSv4.0 client unmounts, all files should be closed at that point, so the server can wait for the lease to expire and purge it normally. Or am I missing something? > Since there is > no destroy session/client with 4.0, the courteous server allows the > client to be around and becomes a courtesy client. So after awhile, > even with normal usage, there will be lots 4.0 courtesy clients > hanging around and these clients won't be destroyed until 24hrs > later, or until they cause conflicts with other clients. > > We can reduce the courtesy_client_expiry time for 4.0 clients from > 24hrs to 15/20 mins, enough for most network partition to heal?, > or limit the number of 4.0 courtesy clients. Or don't support 4.0 > clients at all which is my preference since I think in general users > should skip 4.0 and use 4.1 instead. > > -Dai -- Chuck Lever
On Mon, Nov 29, 2021 at 07:03:12PM +0000, Chuck Lever III wrote: > Hello Dai! > > > > On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> wrote: > > > > > > On 11/29/21 9:30 AM, J. Bruce Fields wrote: > >> On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: > >>> Hi Bruce, > >>> > >>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: > >>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: > >>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: > >>>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: > >>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: > >>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: > >>>>>>>>> Just a reminder that this patch is still waiting for your review. > >>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs > >>>>>>>> failure for me.... > >>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run > >>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've > >>>>>>> seen still there. > >>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous > >>>>>> 5.15-rc7 server. > >>>>>> > >>>>>> Nfs4.1 results are the same for both courteous and > >>>>>> non-courteous server: > >>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed > >>>>>> Results of nfs4.0 with non-courteous server: > >>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed > >>>>>> test failed: LOCK24 > >>>>>> > >>>>>> Results of nfs4.0 with courteous server: > >>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed > >>>>>> tests failed: LOCK24, OPEN18, OPEN30 > >>>>>> > >>>>>> OPEN18 and OPEN30 test pass if each is run by itself. > >>>>> Could well be a bug in the tests, I don't know. > >>>> The reason OPEN18 failed was because the test timed out waiting for > >>>> the reply of an OPEN call. The RPC connection used for the test was > >>>> configured with 15 secs timeout. Note that OPEN18 only fails when > >>>> the tests were run with 'all' option, this test passes if it's run > >>>> by itself. > >>>> > >>>> With courteous server, by the time OPEN18 runs, there are about 1026 > >>>> courtesy 4.0 clients on the server and all of these clients have opened > >>>> the same file X with WRITE access. These clients were created by the > >>>> previous tests. After each test completed, since 4.0 does not have > >>>> session, the client states are not cleaned up immediately on the > >>>> server and are allowed to become courtesy clients. > >>>> > >>>> When OPEN18 runs (about 20 minutes after the 1st test started), it > >>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the > >>>> server to check for conflicts with courtesy clients. The loop that > >>>> checks 1026 courtesy clients for share/access conflict took less > >>>> than 1 sec. But it took about 55 secs, on my VM, for the server > >>>> to expire all 1026 courtesy clients. > >>>> > >>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds > >>>> timeout and OPEN18 now consistently passed. The 4.0 test results are > >>>> now the same for courteous and non-courteous server: > >>>> > >>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed > >>>> > >>>> Note that 4.1 tests do not suffer this timeout problem because the > >>>> 4.1 clients and sessions are destroyed after each test completes. > >>> Do you want me to send the patch to increase the timeout for pynfs? > >>> or is there any other things you think we should do? > >> I don't know. 
> >> > >> 55 seconds to clean up 1026 clients is about 50ms per client, which is > >> pretty slow. I wonder why. I guess it's probably updating the stable > >> storage information. Is /var/lib/nfs/ on your server backed by a hard > >> drive or an SSD or something else? > > > > My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard > > disk. I think a production system that supports this many clients should > > have faster CPUs, faster storage. > > > >> > >> I wonder if that's an argument for limiting the number of courtesy > >> clients. > > > > I think we might want to treat 4.0 clients a bit different from 4.1 > > clients. With 4.0, every client will become a courtesy client after > > the client is done with the export and unmounts it. > > It should be safe for a server to purge a client's lease immediately > if there is no open or lock state associated with it. > > When an NFSv4.0 client unmounts, all files should be closed at that > point, so the server can wait for the lease to expire and purge it > normally. Or am I missing something? Makes sense to me! > > Since there is > > no destroy session/client with 4.0, the courteous server allows the > > client to be around and becomes a courtesy client. So after awhile, > > even with normal usage, there will be lots 4.0 courtesy clients > > hanging around and these clients won't be destroyed until 24hrs > > later, or until they cause conflicts with other clients. > > > > We can reduce the courtesy_client_expiry time for 4.0 clients from > > 24hrs to 15/20 mins, enough for most network partition to heal?, > > or limit the number of 4.0 courtesy clients. Or don't support 4.0 > > clients at all which is my preference since I think in general users > > should skip 4.0 and use 4.1 instead. I'm also totally fine with leaving out 4.0, at least to start. --b.
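[Chuck's point, which both sides agree with above, translates into a simple gate in the laundromat: only keep an expired lease around as a courtesy client if it still owns opens, locks, or delegations. nfsd already has a predicate along these lines, client_has_state() in fs/nfsd/nfs4state.c. A minimal sketch, with the surrounding laundromat plumbing simplified and mark_client_courtesy() hypothetical:]

/* Sketch: when a lease expires in the laundromat, purge clients that
 * hold no state instead of turning them into courtesy clients. */
static void handle_expired_lease(struct nfs4_client *clp)
{
	if (!client_has_state(clp)) {
		/* e.g. a 4.0 client that closed everything and unmounted:
		 * nothing to preserve, so purge the lease immediately. */
		expire_client(clp);
		return;
	}
	mark_client_courtesy(clp);	/* hypothetical */
}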
On 11/29/21 11:03 AM, Chuck Lever III wrote: > Hello Dai! > > >> On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >> >> >> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: >>>> Hi Bruce, >>>> >>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >>>>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>>>>>>>>> Just a reminder that this patch is still waiting for your review. >>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >>>>>>>>> failure for me.... >>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run >>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've >>>>>>>> seen still there. >>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous >>>>>>> 5.15-rc7 server. >>>>>>> >>>>>>> Nfs4.1 results are the same for both courteous and >>>>>>> non-courteous server: >>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >>>>>>> Results of nfs4.0 with non-courteous server: >>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>> test failed: LOCK24 >>>>>>> >>>>>>> Results of nfs4.0 with courteous server: >>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>> >>>>>>> OPEN18 and OPEN30 test pass if each is run by itself. >>>>>> Could well be a bug in the tests, I don't know. >>>>> The reason OPEN18 failed was because the test timed out waiting for >>>>> the reply of an OPEN call. The RPC connection used for the test was >>>>> configured with 15 secs timeout. Note that OPEN18 only fails when >>>>> the tests were run with 'all' option, this test passes if it's run >>>>> by itself. >>>>> >>>>> With courteous server, by the time OPEN18 runs, there are about 1026 >>>>> courtesy 4.0 clients on the server and all of these clients have opened >>>>> the same file X with WRITE access. These clients were created by the >>>>> previous tests. After each test completed, since 4.0 does not have >>>>> session, the client states are not cleaned up immediately on the >>>>> server and are allowed to become courtesy clients. >>>>> >>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it >>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the >>>>> server to check for conflicts with courtesy clients. The loop that >>>>> checks 1026 courtesy clients for share/access conflict took less >>>>> than 1 sec. But it took about 55 secs, on my VM, for the server >>>>> to expire all 1026 courtesy clients. >>>>> >>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds >>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are >>>>> now the same for courteous and non-courteous server: >>>>> >>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>> >>>>> Note that 4.1 tests do not suffer this timeout problem because the >>>>> 4.1 clients and sessions are destroyed after each test completes. >>>> Do you want me to send the patch to increase the timeout for pynfs? >>>> or is there any other things you think we should do? >>> I don't know. >>> >>> 55 seconds to clean up 1026 clients is about 50ms per client, which is >>> pretty slow. I wonder why. 
I guess it's probably updating the stable >>> storage information. Is /var/lib/nfs/ on your server backed by a hard >>> drive or an SSD or something else? >> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard >> disk. I think a production system that supports this many clients should >> have faster CPUs, faster storage. >> >>> I wonder if that's an argument for limiting the number of courtesy >>> clients. >> I think we might want to treat 4.0 clients a bit different from 4.1 >> clients. With 4.0, every client will become a courtesy client after >> the client is done with the export and unmounts it. > It should be safe for a server to purge a client's lease immediately > if there is no open or lock state associated with it. In this case, each client has opened files so there are open states associated with them. > > When an NFSv4.0 client unmounts, all files should be closed at that > point, I'm not sure pynfs does proper clean up after each subtest, I will check. There must be state associated with the client in order for it to become courtesy client. > so the server can wait for the lease to expire and purge it > normally. Or am I missing something? When 4.0 client lease expires and there are still states associated with the client then the server allows this client to become courtesy client. -Dai > > >> Since there is >> no destroy session/client with 4.0, the courteous server allows the >> client to be around and becomes a courtesy client. So after awhile, >> even with normal usage, there will be lots 4.0 courtesy clients >> hanging around and these clients won't be destroyed until 24hrs >> later, or until they cause conflicts with other clients. >> >> We can reduce the courtesy_client_expiry time for 4.0 clients from >> 24hrs to 15/20 mins, enough for most network partition to heal?, >> or limit the number of 4.0 courtesy clients. Or don't support 4.0 >> clients at all which is my preference since I think in general users >> should skip 4.0 and use 4.1 instead. >> >> -Dai > -- > Chuck Lever > > >
On 11/29/21 11:13 AM, Bruce Fields wrote: > On Mon, Nov 29, 2021 at 07:03:12PM +0000, Chuck Lever III wrote: >> Hello Dai! >> >> >>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >>> >>> >>> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: >>>>> Hi Bruce, >>>>> >>>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >>>>>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>>>>>>>>>> Just a reminder that this patch is still waiting for your review. >>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >>>>>>>>>> failure for me.... >>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run >>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've >>>>>>>>> seen still there. >>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous >>>>>>>> 5.15-rc7 server. >>>>>>>> >>>>>>>> Nfs4.1 results are the same for both courteous and >>>>>>>> non-courteous server: >>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >>>>>>>> Results of nfs4.0 with non-courteous server: >>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>> test failed: LOCK24 >>>>>>>> >>>>>>>> Results of nfs4.0 with courteous server: >>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >>>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>>> >>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself. >>>>>>> Could well be a bug in the tests, I don't know. >>>>>> The reason OPEN18 failed was because the test timed out waiting for >>>>>> the reply of an OPEN call. The RPC connection used for the test was >>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when >>>>>> the tests were run with 'all' option, this test passes if it's run >>>>>> by itself. >>>>>> >>>>>> With courteous server, by the time OPEN18 runs, there are about 1026 >>>>>> courtesy 4.0 clients on the server and all of these clients have opened >>>>>> the same file X with WRITE access. These clients were created by the >>>>>> previous tests. After each test completed, since 4.0 does not have >>>>>> session, the client states are not cleaned up immediately on the >>>>>> server and are allowed to become courtesy clients. >>>>>> >>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it >>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the >>>>>> server to check for conflicts with courtesy clients. The loop that >>>>>> checks 1026 courtesy clients for share/access conflict took less >>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server >>>>>> to expire all 1026 courtesy clients. >>>>>> >>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds >>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are >>>>>> now the same for courteous and non-courteous server: >>>>>> >>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>> >>>>>> Note that 4.1 tests do not suffer this timeout problem because the >>>>>> 4.1 clients and sessions are destroyed after each test completes. >>>>> Do you want me to send the patch to increase the timeout for pynfs? >>>>> or is there any other things you think we should do? >>>> I don't know. 
>>>> >>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is >>>> pretty slow. I wonder why. I guess it's probably updating the stable >>>> storage information. Is /var/lib/nfs/ on your server backed by a hard >>>> drive or an SSD or something else? >>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard >>> disk. I think a production system that supports this many clients should >>> have faster CPUs, faster storage. >>> >>>> I wonder if that's an argument for limiting the number of courtesy >>>> clients. >>> I think we might want to treat 4.0 clients a bit different from 4.1 >>> clients. With 4.0, every client will become a courtesy client after >>> the client is done with the export and unmounts it. >> It should be safe for a server to purge a client's lease immediately >> if there is no open or lock state associated with it. >> >> When an NFSv4.0 client unmounts, all files should be closed at that >> point, so the server can wait for the lease to expire and purge it >> normally. Or am I missing something? > Makes sense to me! > >>> Since there is >>> no destroy session/client with 4.0, the courteous server allows the >>> client to be around and becomes a courtesy client. So after awhile, >>> even with normal usage, there will be lots 4.0 courtesy clients >>> hanging around and these clients won't be destroyed until 24hrs >>> later, or until they cause conflicts with other clients. >>> >>> We can reduce the courtesy_client_expiry time for 4.0 clients from >>> 24hrs to 15/20 mins, enough for most network partition to heal?, >>> or limit the number of 4.0 courtesy clients. Or don't support 4.0 >>> clients at all which is my preference since I think in general users >>> should skip 4.0 and use 4.1 instead. > I'm also totally fine with leaving out 4.0, at least to start. Ok Bruce, I will submit v6 patch for this. Thanks, -Dai > > --b.
On 11/29/21 11:36 AM, dai.ngo@oracle.com wrote: > > On 11/29/21 11:03 AM, Chuck Lever III wrote: >> Hello Dai! >> >> >>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >>> >>> >>> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: >>>>> Hi Bruce, >>>>> >>>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >>>>>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com >>>>>>>>>> wrote: >>>>>>>>>>> Just a reminder that this patch is still waiting for your >>>>>>>>>>> review. >>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the >>>>>>>>>> pynfs >>>>>>>>>> failure for me.... >>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I >>>>>>>>> will run >>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem >>>>>>>>> you've >>>>>>>>> seen still there. >>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and >>>>>>>> non-courteous >>>>>>>> 5.15-rc7 server. >>>>>>>> >>>>>>>> Nfs4.1 results are the same for both courteous and >>>>>>>> non-courteous server: >>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >>>>>>>> Results of nfs4.0 with non-courteous server: >>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>> test failed: LOCK24 >>>>>>>> >>>>>>>> Results of nfs4.0 with courteous server: >>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >>>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>>> >>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself. >>>>>>> Could well be a bug in the tests, I don't know. >>>>>> The reason OPEN18 failed was because the test timed out waiting for >>>>>> the reply of an OPEN call. The RPC connection used for the test was >>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when >>>>>> the tests were run with 'all' option, this test passes if it's run >>>>>> by itself. >>>>>> >>>>>> With courteous server, by the time OPEN18 runs, there are about 1026 >>>>>> courtesy 4.0 clients on the server and all of these clients have >>>>>> opened >>>>>> the same file X with WRITE access. These clients were created by the >>>>>> previous tests. After each test completed, since 4.0 does not have >>>>>> session, the client states are not cleaned up immediately on the >>>>>> server and are allowed to become courtesy clients. >>>>>> >>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it >>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the >>>>>> server to check for conflicts with courtesy clients. The loop that >>>>>> checks 1026 courtesy clients for share/access conflict took less >>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server >>>>>> to expire all 1026 courtesy clients. >>>>>> >>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds >>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are >>>>>> now the same for courteous and non-courteous server: >>>>>> >>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>> >>>>>> Note that 4.1 tests do not suffer this timeout problem because the >>>>>> 4.1 clients and sessions are destroyed after each test completes. >>>>> Do you want me to send the patch to increase the timeout for pynfs? 
>>>>> or is there any other things you think we should do? >>>> I don't know. >>>> >>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is >>>> pretty slow. I wonder why. I guess it's probably updating the stable >>>> storage information. Is /var/lib/nfs/ on your server backed by a hard >>>> drive or an SSD or something else? >>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard >>> disk. I think a production system that supports this many clients >>> should >>> have faster CPUs, faster storage. >>> >>>> I wonder if that's an argument for limiting the number of courtesy >>>> clients. >>> I think we might want to treat 4.0 clients a bit different from 4.1 >>> clients. With 4.0, every client will become a courtesy client after >>> the client is done with the export and unmounts it. >> It should be safe for a server to purge a client's lease immediately >> if there is no open or lock state associated with it. > > In this case, each client has opened files so there are open states > associated with them. > >> >> When an NFSv4.0 client unmounts, all files should be closed at that >> point, > > I'm not sure pynfs does proper clean up after each subtest, I will > check. There must be state associated with the client in order for > it to become courtesy client. The pynfs 4.0 test uses LOOKUP and OPEN with OPEN4_CREATE to create the test file, and uses PUTFH and REMOVE to remove the test file when done. I don't see where the open state associated with the removed file is freed by nfsd_remove. I guess for 4.0 the open state remains valid on the server until the client lease expires. I attached the pcap of the OPEN18 test for reference. -Dai > >> so the server can wait for the lease to expire and purge it >> normally. Or am I missing something? > > When 4.0 client lease expires and there are still states associated > with the client then the server allows this client to become courtesy > client. > > -Dai > >> >> >>> Since there is >>> no destroy session/client with 4.0, the courteous server allows the >>> client to be around and becomes a courtesy client. So after awhile, >>> even with normal usage, there will be lots 4.0 courtesy clients >>> hanging around and these clients won't be destroyed until 24hrs >>> later, or until they cause conflicts with other clients. >>> >>> We can reduce the courtesy_client_expiry time for 4.0 clients from >>> 24hrs to 15/20 mins, enough for most network partition to heal?, >>> or limit the number of 4.0 courtesy clients. Or don't support 4.0 >>> clients at all which is my preference since I think in general users >>> should skip 4.0 and use 4.1 instead. >>> >>> -Dai >> -- >> Chuck Lever >> >> >>
> On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@oracle.com> wrote: > > > On 11/29/21 11:03 AM, Chuck Lever III wrote: >> Hello Dai! >> >> >>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >>> >>> >>> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: >>>>> Hi Bruce, >>>>> >>>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >>>>>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>>>>>>>>>> Just a reminder that this patch is still waiting for your review. >>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >>>>>>>>>> failure for me.... >>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run >>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've >>>>>>>>> seen still there. >>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous >>>>>>>> 5.15-rc7 server. >>>>>>>> >>>>>>>> Nfs4.1 results are the same for both courteous and >>>>>>>> non-courteous server: >>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >>>>>>>> Results of nfs4.0 with non-courteous server: >>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>> test failed: LOCK24 >>>>>>>> >>>>>>>> Results of nfs4.0 with courteous server: >>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >>>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>>> >>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself. >>>>>>> Could well be a bug in the tests, I don't know. >>>>>> The reason OPEN18 failed was because the test timed out waiting for >>>>>> the reply of an OPEN call. The RPC connection used for the test was >>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when >>>>>> the tests were run with 'all' option, this test passes if it's run >>>>>> by itself. >>>>>> >>>>>> With courteous server, by the time OPEN18 runs, there are about 1026 >>>>>> courtesy 4.0 clients on the server and all of these clients have opened >>>>>> the same file X with WRITE access. These clients were created by the >>>>>> previous tests. After each test completed, since 4.0 does not have >>>>>> session, the client states are not cleaned up immediately on the >>>>>> server and are allowed to become courtesy clients. >>>>>> >>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it >>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the >>>>>> server to check for conflicts with courtesy clients. The loop that >>>>>> checks 1026 courtesy clients for share/access conflict took less >>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server >>>>>> to expire all 1026 courtesy clients. >>>>>> >>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds >>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are >>>>>> now the same for courteous and non-courteous server: >>>>>> >>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>> >>>>>> Note that 4.1 tests do not suffer this timeout problem because the >>>>>> 4.1 clients and sessions are destroyed after each test completes. >>>>> Do you want me to send the patch to increase the timeout for pynfs? >>>>> or is there any other things you think we should do? 
>>>> I don't know. >>>> >>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is >>>> pretty slow. I wonder why. I guess it's probably updating the stable >>>> storage information. Is /var/lib/nfs/ on your server backed by a hard >>>> drive or an SSD or something else? >>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard >>> disk. I think a production system that supports this many clients should >>> have faster CPUs, faster storage. >>> >>>> I wonder if that's an argument for limiting the number of courtesy >>>> clients. >>> I think we might want to treat 4.0 clients a bit different from 4.1 >>> clients. With 4.0, every client will become a courtesy client after >>> the client is done with the export and unmounts it. >> It should be safe for a server to purge a client's lease immediately >> if there is no open or lock state associated with it. > > In this case, each client has opened files so there are open states > associated with them. > >> >> When an NFSv4.0 client unmounts, all files should be closed at that >> point, > > I'm not sure pynfs does proper clean up after each subtest, I will > check. There must be state associated with the client in order for > it to become courtesy client. Makes sense. Then a synthetic client like pynfs can DoS a courteous server. >> so the server can wait for the lease to expire and purge it >> normally. Or am I missing something? > > When 4.0 client lease expires and there are still states associated > with the client then the server allows this client to become courtesy > client. I think the same thing happens if an NFSv4.1 client neglects to send DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is broken or malicious, but the server faces the same issue of protecting itself from a DoS attack. IMO you should consider limiting the number of courteous clients the server can hold onto. Let's say that number is 1000. When the server wants to turn a 1001st client into a courteous client, it can simply expire and purge the oldest courteous client on its list. Otherwise, over time, the 24-hour expiry will reduce the set of courteous clients back to zero. What do you think? >>> Since there is >>> no destroy session/client with 4.0, the courteous server allows the >>> client to be around and becomes a courtesy client. So after awhile, >>> even with normal usage, there will be lots 4.0 courtesy clients >>> hanging around and these clients won't be destroyed until 24hrs >>> later, or until they cause conflicts with other clients. >>> >>> We can reduce the courtesy_client_expiry time for 4.0 clients from >>> 24hrs to 15/20 mins, enough for most network partition to heal?, >>> or limit the number of 4.0 courtesy clients. Or don't support 4.0 >>> clients at all which is my preference since I think in general users >>> should skip 4.0 and use 4.1 instead. >>> >>> -Dai >> -- >> Chuck Lever >> >> >> -- Chuck Lever
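Chuck's cap-and-evict proposal maps onto a small LRU. The following sketch uses the kernel list helpers to show the shape of it; the cap value and the function and structure names are assumptions for illustration only.

/*
 * Sketch of the cap-and-evict scheme: before a client is allowed to
 * become the N+1st courtesy client, expire the oldest one.  The cap,
 * struct layout and demo_expire_client() are assumptions made for
 * illustration; locking against the rest of the nfsd state machinery
 * is omitted.
 */
#include <linux/list.h>

#define COURTESY_CLIENT_MAX	1000	/* example cap from the discussion */

struct courtesy_client {
	struct list_head lru;		/* oldest at head, newest at tail */
};

static LIST_HEAD(courtesy_lru);
static unsigned int courtesy_count;

/* Assumed to exist elsewhere: tears down all state held by @cc. */
void demo_expire_client(struct courtesy_client *cc);

/* Admit @cc as a courtesy client, evicting the oldest if over the cap. */
void courtesy_admit(struct courtesy_client *cc)
{
	if (courtesy_count >= COURTESY_CLIENT_MAX) {
		struct courtesy_client *old;

		old = list_first_entry(&courtesy_lru,
				       struct courtesy_client, lru);
		list_del(&old->lru);
		courtesy_count--;
		demo_expire_client(old);
	}
	list_add_tail(&cc->lru, &courtesy_lru);
	courtesy_count++;
}

The expiry itself could just as well be deferred to the periodic scan that already handles lease expiry, rather than being done inline in the admission path.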
On 11/29/21 1:10 PM, Chuck Lever III wrote: > >> On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >> >> >> On 11/29/21 11:03 AM, Chuck Lever III wrote: >>> Hello Dai! >>> >>> >>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >>>> >>>> >>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: >>>>>> Hi Bruce, >>>>>> >>>>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >>>>>>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>>>>>>>>>>> Just a reminder that this patch is still waiting for your review. >>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >>>>>>>>>>> failure for me.... >>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run >>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've >>>>>>>>>> seen still there. >>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous >>>>>>>>> 5.15-rc7 server. >>>>>>>>> >>>>>>>>> Nfs4.1 results are the same for both courteous and >>>>>>>>> non-courteous server: >>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >>>>>>>>> Results of nfs4.0 with non-courteous server: >>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>>> test failed: LOCK24 >>>>>>>>> >>>>>>>>> Results of nfs4.0 with courteous server: >>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>>>> >>>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself. >>>>>>>> Could well be a bug in the tests, I don't know. >>>>>>> The reason OPEN18 failed was because the test timed out waiting for >>>>>>> the reply of an OPEN call. The RPC connection used for the test was >>>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when >>>>>>> the tests were run with 'all' option, this test passes if it's run >>>>>>> by itself. >>>>>>> >>>>>>> With courteous server, by the time OPEN18 runs, there are about 1026 >>>>>>> courtesy 4.0 clients on the server and all of these clients have opened >>>>>>> the same file X with WRITE access. These clients were created by the >>>>>>> previous tests. After each test completed, since 4.0 does not have >>>>>>> session, the client states are not cleaned up immediately on the >>>>>>> server and are allowed to become courtesy clients. >>>>>>> >>>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it >>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the >>>>>>> server to check for conflicts with courtesy clients. The loop that >>>>>>> checks 1026 courtesy clients for share/access conflict took less >>>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server >>>>>>> to expire all 1026 courtesy clients. >>>>>>> >>>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds >>>>>>> timeout and OPEN18 now consistently passed. The 4.0 test results are >>>>>>> now the same for courteous and non-courteous server: >>>>>>> >>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>> >>>>>>> Note that 4.1 tests do not suffer this timeout problem because the >>>>>>> 4.1 clients and sessions are destroyed after each test completes. 
>>>>>> Do you want me to send the patch to increase the timeout for pynfs? >>>>>> or is there any other things you think we should do? >>>>> I don't know. >>>>> >>>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is >>>>> pretty slow. I wonder why. I guess it's probably updating the stable >>>>> storage information. Is /var/lib/nfs/ on your server backed by a hard >>>>> drive or an SSD or something else? >>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard >>>> disk. I think a production system that supports this many clients should >>>> have faster CPUs, faster storage. >>>> >>>>> I wonder if that's an argument for limiting the number of courtesy >>>>> clients. >>>> I think we might want to treat 4.0 clients a bit different from 4.1 >>>> clients. With 4.0, every client will become a courtesy client after >>>> the client is done with the export and unmounts it. >>> It should be safe for a server to purge a client's lease immediately >>> if there is no open or lock state associated with it. >> In this case, each client has opened files so there are open states >> associated with them. >> >>> When an NFSv4.0 client unmounts, all files should be closed at that >>> point, >> I'm not sure pynfs does proper clean up after each subtest, I will >> check. There must be state associated with the client in order for >> it to become courtesy client. > Makes sense. Then a synthetic client like pynfs can DoS a courteous > server. > > >>> so the server can wait for the lease to expire and purge it >>> normally. Or am I missing something? >> When 4.0 client lease expires and there are still states associated >> with the client then the server allows this client to become courtesy >> client. > I think the same thing happens if an NFSv4.1 client neglects to send > DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is broken > or malicious, but the server faces the same issue of protecting > itself from a DoS attack. > > IMO you should consider limiting the number of courteous clients > the server can hold onto. Let's say that number is 1000. When the > server wants to turn a 1001st client into a courteous client, it > can simply expire and purge the oldest courteous client on its > list. Otherwise, over time, the 24-hour expiry will reduce the > set of courteous clients back to zero. > > What do you think? Limiting the number of courteous clients to handle the case of broken/malicious 4.1 clients seems reasonable as a last resort. I think if a malicious 4.1 client could mount the server's export, open a file (to create state) and repeat the same with a different client id, then some basic security has already been broken: unauthorized clients are allowed to mount the server's exports. I think if we have to enforce a limit, it is only to handle seriously buggy 4.1 clients, which should not be the norm. The issue with this is how to pick an optimal number that suits the running server, which can be very slow or very fast. Note that even if we impose a limit, that does not completely solve the problem with the pynfs 4.0 test, since its RPC timeout is configured at 15 secs, which is just enough time to expire 277 clients at 53ms per client, unless we limit it to ~270 clients, which I think is too low. This is what I plan to do: 1. do not support 4.0 courteous clients, for sure. 2. limit the number of courteous clients to 1000 (?), if you still think we need it. Please let me know what you think. 
Thanks, -Dai > > >>>> Since there is >>>> no destroy session/client with 4.0, the courteous server allows the >>>> client to be around and becomes a courtesy client. So after awhile, >>>> even with normal usage, there will be lots 4.0 courtesy clients >>>> hanging around and these clients won't be destroyed until 24hrs >>>> later, or until they cause conflicts with other clients. >>>> >>>> We can reduce the courtesy_client_expiry time for 4.0 clients from >>>> 24hrs to 15/20 mins, enough for most network partition to heal?, >>>> or limit the number of 4.0 courtesy clients. Or don't support 4.0 >>>> clients at all which is my preference since I think in general users >>>> should skip 4.0 and use 4.1 instead. >>>> >>>> -Dai >>> -- >>> Chuck Lever >>> >>> >>> > -- > Chuck Lever > > >
> On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@oracle.com> wrote: > > >> On 11/29/21 1:10 PM, Chuck Lever III wrote: >> >>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >>> >>> >>> On 11/29/21 11:03 AM, Chuck Lever III wrote: >>>> Hello Dai! >>>> >>>> >>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >>>>> >>>>> >>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: >>>>>>> Hi Bruce, >>>>>>> >>>>>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >>>>>>>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>>>>>>>>>>>> Just a reminder that this patch is still waiting for your review. >>>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >>>>>>>>>>>> failure for me.... >>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run >>>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've >>>>>>>>>>> seen still there. >>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous >>>>>>>>>> 5.15-rc7 server. >>>>>>>>>> >>>>>>>>>> Nfs4.1 results are the same for both courteous and >>>>>>>>>> non-courteous server: >>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >>>>>>>>>> Results of nfs4.0 with non-courteous server: >>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>>>> test failed: LOCK24 >>>>>>>>>> >>>>>>>>>> Results of nfs4.0 with courteous server: >>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>>>>> >>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself. >>>>>>>>> Could well be a bug in the tests, I don't know. >>>>>>>> The reason OPEN18 failed was because the test timed out waiting for >>>>>>>> the reply of an OPEN call. The RPC connection used for the test was >>>>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when >>>>>>>> the tests were run with 'all' option, this test passes if it's run >>>>>>>> by itself. >>>>>>>> >>>>>>>> With courteous server, by the time OPEN18 runs, there are about 1026 >>>>>>>> courtesy 4.0 clients on the server and all of these clients have opened >>>>>>>> the same file X with WRITE access. These clients were created by the >>>>>>>> previous tests. After each test completed, since 4.0 does not have >>>>>>>> session, the client states are not cleaned up immediately on the >>>>>>>> server and are allowed to become courtesy clients. >>>>>>>> >>>>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it >>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the >>>>>>>> server to check for conflicts with courtesy clients. The loop that >>>>>>>> checks 1026 courtesy clients for share/access conflict took less >>>>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server >>>>>>>> to expire all 1026 courtesy clients. >>>>>>>> >>>>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds >>>>>>>> timeout and OPEN18 now consistently passed. 
The 4.0 test results are >>>>>>>> now the same for courteous and non-courteous server: >>>>>>>> >>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>> >>>>>>>> Note that 4.1 tests do not suffer this timeout problem because the >>>>>>>> 4.1 clients and sessions are destroyed after each test completes. >>>>>>> Do you want me to send the patch to increase the timeout for pynfs? >>>>>>> or is there any other things you think we should do? >>>>>> I don't know. >>>>>> >>>>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is >>>>>> pretty slow. I wonder why. I guess it's probably updating the stable >>>>>> storage information. Is /var/lib/nfs/ on your server backed by a hard >>>>>> drive or an SSD or something else? >>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard >>>>> disk. I think a production system that supports this many clients should >>>>> have faster CPUs, faster storage. >>>>> >>>>>> I wonder if that's an argument for limiting the number of courtesy >>>>>> clients. >>>>> I think we might want to treat 4.0 clients a bit different from 4.1 >>>>> clients. With 4.0, every client will become a courtesy client after >>>>> the client is done with the export and unmounts it. >>>> It should be safe for a server to purge a client's lease immediately >>>> if there is no open or lock state associated with it. >>> In this case, each client has opened files so there are open states >>> associated with them. >>> >>>> When an NFSv4.0 client unmounts, all files should be closed at that >>>> point, >>> I'm not sure pynfs does proper clean up after each subtest, I will >>> check. There must be state associated with the client in order for >>> it to become courtesy client. >> Makes sense. Then a synthetic client like pynfs can DoS a courteous >> server. >> >> >>>> so the server can wait for the lease to expire and purge it >>>> normally. Or am I missing something? >>> When 4.0 client lease expires and there are still states associated >>> with the client then the server allows this client to become courtesy >>> client. >> I think the same thing happens if an NFSv4.1 client neglects to send >> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is broken >> or malicious, but the server faces the same issue of protecting >> itself from a DoS attack. >> >> IMO you should consider limiting the number of courteous clients >> the server can hold onto. Let's say that number is 1000. When the >> server wants to turn a 1001st client into a courteous client, it >> can simply expire and purge the oldest courteous client on its >> list. Otherwise, over time, the 24-hour expiry will reduce the >> set of courteous clients back to zero. >> >> What do you think? > > Limiting the number of courteous clients to handle the cases of > broken/malicious 4.1 clients seems reasonable as the last resort. > > I think if a malicious 4.1 clients could mount the server's export, > opens a file (to create state) and repeats the same with a different > client id then it seems like some basic security was already broken; > allowing unauthorized clients to mount server's exports. You can do this today with AUTH_SYS. I consider it a genuine attack surface. > I think if we have to enforce a limit, then it's only for handling > of seriously buggy 4.1 clients which should not be the norm. The > issue with this is how to pick an optimal number that is suitable > for the running server which can be a very slow or a very fast server. 
> > Note that even if we impose an limit, that does not completely solve > the problem with pynfs 4.0 test since its RPC timeout is configured > with 15 secs which just enough to expire 277 clients based on 53ms > for each client, unless we limit it ~270 clients which I think it's > too low. > > This is what I plan to do: > > 1. do not support 4.0 courteous clients, for sure. Not supporting 4.0 isn’t an option, IMHO. It is a fully supported protocol at this time, and the same exposure exists for 4.1, it’s just a little harder to exploit. If you submit the courteous server patch without support for 4.0, I think it needs to include a plan for how 4.0 will be added later. > 2. limit the number of courteous clients to 1000 (?), if you still > think we need it. I think this limit is necessary. It can be set based on the server’s physical memory size if a dynamic limit is desired. > Pls let me know what you think. > > Thanks, > -Dai > >> >> >>>>> Since there is >>>>> no destroy session/client with 4.0, the courteous server allows the >>>>> client to be around and becomes a courtesy client. So after awhile, >>>>> even with normal usage, there will be lots 4.0 courtesy clients >>>>> hanging around and these clients won't be destroyed until 24hrs >>>>> later, or until they cause conflicts with other clients. >>>>> >>>>> We can reduce the courtesy_client_expiry time for 4.0 clients from >>>>> 24hrs to 15/20 mins, enough for most network partition to heal?, >>>>> or limit the number of 4.0 courtesy clients. Or don't support 4.0 >>>>> clients at all which is my preference since I think in general users >>>>> should skip 4.0 and use 4.1 instead. >>>>> >>>>> -Dai >>>> -- >>>> Chuck Lever >>>> >>>> >>>> >> -- >> Chuck Lever >> >> >>
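For the dynamic limit Chuck mentions, one possibility is to derive the cap from total RAM, roughly in the spirit of other nfsd defaults that scale with available memory. The sketch below is only an example of that approach; the scaling factor is arbitrary, not a tested recommendation.

/*
 * One way to pick the cap dynamically, derived from total RAM.  The
 * scaling factor (one courtesy client per 8 MB, clamped to 100..10000)
 * is an arbitrary example.
 */
#include <linux/mm.h>
#include <linux/minmax.h>

static unsigned long courtesy_client_limit(void)
{
	unsigned long mb = totalram_pages() >> (20 - PAGE_SHIFT);

	return clamp_t(unsigned long, mb / 8, 100, 10000);
}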
On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote: > > > On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@oracle.com> wrote: > > > > > > > On 11/29/21 1:10 PM, Chuck Lever III wrote: > > > > > > > > On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@oracle.com> > > > > > wrote: > > > > > > > > > > > > On 11/29/21 11:03 AM, Chuck Lever III wrote: > > > > > Hello Dai! > > > > > > > > > > > > > > > > On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> > > > > > > wrote: > > > > > > > > > > > > > > > > > > On 11/29/21 9:30 AM, J. Bruce Fields wrote: > > > > > > > On Mon, Nov 29, 2021 at 09:13:16AM -0800, > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > Hi Bruce, > > > > > > > > > > > > > > > > On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: > > > > > > > > > On 11/17/21 4:34 PM, J. Bruce Fields wrote: > > > > > > > > > > On Wed, Nov 17, 2021 at 01:46:02PM -0800, > > > > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > > > > On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: > > > > > > > > > > > > On 11/17/21 6:14 AM, J. Bruce Fields wrote: > > > > > > > > > > > > > On Tue, Nov 16, 2021 at 03:06:32PM -0800, > > > > > > > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > > > > > > > Just a reminder that this patch is still > > > > > > > > > > > > > > waiting for your review. > > > > > > > > > > > > > Yeah, I was procrastinating and hoping yo'ud > > > > > > > > > > > > > figure out the pynfs > > > > > > > > > > > > > failure for me.... > > > > > > > > > > > > Last time I ran 4.0 OPEN18 test by itself and > > > > > > > > > > > > it passed. I will run > > > > > > > > > > > > all OPEN tests together with 5.15-rc7 to see if > > > > > > > > > > > > the problem you've > > > > > > > > > > > > seen still there. > > > > > > > > > > > I ran all tests in nfsv4.1 and nfsv4.0 with > > > > > > > > > > > courteous and non-courteous > > > > > > > > > > > 5.15-rc7 server. > > > > > > > > > > > > > > > > > > > > > > Nfs4.1 results are the same for both courteous > > > > > > > > > > > and > > > > > > > > > > > non-courteous server: > > > > > > > > > > > > Of those: 0 Skipped, 0 Failed, 0 Warned, 169 > > > > > > > > > > > > Passed > > > > > > > > > > > Results of nfs4.0 with non-courteous server: > > > > > > > > > > > > Of those: 8 Skipped, 1 Failed, 0 Warned, 577 > > > > > > > > > > > > Passed > > > > > > > > > > > test failed: LOCK24 > > > > > > > > > > > > > > > > > > > > > > Results of nfs4.0 with courteous server: > > > > > > > > > > > > Of those: 8 Skipped, 3 Failed, 0 Warned, 575 > > > > > > > > > > > > Passed > > > > > > > > > > > tests failed: LOCK24, OPEN18, OPEN30 > > > > > > > > > > > > > > > > > > > > > > OPEN18 and OPEN30 test pass if each is run by > > > > > > > > > > > itself. > > > > > > > > > > Could well be a bug in the tests, I don't know. > > > > > > > > > The reason OPEN18 failed was because the test timed > > > > > > > > > out waiting for > > > > > > > > > the reply of an OPEN call. The RPC connection used > > > > > > > > > for the test was > > > > > > > > > configured with 15 secs timeout. Note that OPEN18 > > > > > > > > > only fails when > > > > > > > > > the tests were run with 'all' option, this test > > > > > > > > > passes if it's run > > > > > > > > > by itself. > > > > > > > > > > > > > > > > > > With courteous server, by the time OPEN18 runs, there > > > > > > > > > are about 1026 > > > > > > > > > courtesy 4.0 clients on the server and all of these > > > > > > > > > clients have opened > > > > > > > > > the same file X with WRITE access. 
These clients were > > > > > > > > > created by the > > > > > > > > > previous tests. After each test completed, since 4.0 > > > > > > > > > does not have > > > > > > > > > session, the client states are not cleaned up > > > > > > > > > immediately on the > > > > > > > > > server and are allowed to become courtesy clients. > > > > > > > > > > > > > > > > > > When OPEN18 runs (about 20 minutes after the 1st test > > > > > > > > > started), it > > > > > > > > > sends OPEN of file X with OPEN4_SHARE_DENY_WRITE > > > > > > > > > which causes the > > > > > > > > > server to check for conflicts with courtesy clients. > > > > > > > > > The loop that > > > > > > > > > checks 1026 courtesy clients for share/access > > > > > > > > > conflict took less > > > > > > > > > than 1 sec. But it took about 55 secs, on my VM, for > > > > > > > > > the server > > > > > > > > > to expire all 1026 courtesy clients. > > > > > > > > > > > > > > > > > > I modified pynfs to configure the 4.0 RPC connection > > > > > > > > > with 60 seconds > > > > > > > > > timeout and OPEN18 now consistently passed. The 4.0 > > > > > > > > > test results are > > > > > > > > > now the same for courteous and non-courteous server: > > > > > > > > > > > > > > > > > > 8 Skipped, 1 Failed, 0 Warned, 577 Passed > > > > > > > > > > > > > > > > > > Note that 4.1 tests do not suffer this timeout > > > > > > > > > problem because the > > > > > > > > > 4.1 clients and sessions are destroyed after each > > > > > > > > > test completes. > > > > > > > > Do you want me to send the patch to increase the > > > > > > > > timeout for pynfs? > > > > > > > > or is there any other things you think we should do? > > > > > > > I don't know. > > > > > > > > > > > > > > 55 seconds to clean up 1026 clients is about 50ms per > > > > > > > client, which is > > > > > > > pretty slow. I wonder why. I guess it's probably > > > > > > > updating the stable > > > > > > > storage information. Is /var/lib/nfs/ on your server > > > > > > > backed by a hard > > > > > > > drive or an SSD or something else? > > > > > > My server is a virtualbox VM that has 1 CPU, 4GB RAM and > > > > > > 64GB of hard > > > > > > disk. I think a production system that supports this many > > > > > > clients should > > > > > > have faster CPUs, faster storage. > > > > > > > > > > > > > I wonder if that's an argument for limiting the number of > > > > > > > courtesy > > > > > > > clients. > > > > > > I think we might want to treat 4.0 clients a bit different > > > > > > from 4.1 > > > > > > clients. With 4.0, every client will become a courtesy > > > > > > client after > > > > > > the client is done with the export and unmounts it. > > > > > It should be safe for a server to purge a client's lease > > > > > immediately > > > > > if there is no open or lock state associated with it. > > > > In this case, each client has opened files so there are open > > > > states > > > > associated with them. > > > > > > > > > When an NFSv4.0 client unmounts, all files should be closed > > > > > at that > > > > > point, > > > > I'm not sure pynfs does proper clean up after each subtest, I > > > > will > > > > check. There must be state associated with the client in order > > > > for > > > > it to become courtesy client. > > > Makes sense. Then a synthetic client like pynfs can DoS a > > > courteous > > > server. > > > > > > > > > > > so the server can wait for the lease to expire and purge it > > > > > normally. Or am I missing something? 
> > > > When 4.0 client lease expires and there are still states > > > > associated > > > > with the client then the server allows this client to become > > > > courtesy > > > > client. > > > I think the same thing happens if an NFSv4.1 client neglects to > > > send > > > DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is > > > broken > > > or malicious, but the server faces the same issue of protecting > > > itself from a DoS attack. > > > > > > IMO you should consider limiting the number of courteous clients > > > the server can hold onto. Let's say that number is 1000. When the > > > server wants to turn a 1001st client into a courteous client, it > > > can simply expire and purge the oldest courteous client on its > > > list. Otherwise, over time, the 24-hour expiry will reduce the > > > set of courteous clients back to zero. > > > > > > What do you think? > > > > Limiting the number of courteous clients to handle the cases of > > broken/malicious 4.1 clients seems reasonable as the last resort. > > > > I think if a malicious 4.1 clients could mount the server's export, > > opens a file (to create state) and repeats the same with a > > different > > client id then it seems like some basic security was already > > broken; > > allowing unauthorized clients to mount server's exports. > > You can do this today with AUTH_SYS. I consider it a genuine attack > surface. > > > > I think if we have to enforce a limit, then it's only for handling > > of seriously buggy 4.1 clients which should not be the norm. The > > issue with this is how to pick an optimal number that is suitable > > for the running server which can be a very slow or a very fast > > server. > > > > Note that even if we impose an limit, that does not completely > > solve > > the problem with pynfs 4.0 test since its RPC timeout is configured > > with 15 secs which just enough to expire 277 clients based on 53ms > > for each client, unless we limit it ~270 clients which I think it's > > too low. > > > > This is what I plan to do: > > > > 1. do not support 4.0 courteous clients, for sure. > > Not supporting 4.0 isn’t an option, IMHO. It is a fully supported > protocol at this time, and the same exposure exists for 4.1, it’s > just a little harder to exploit. > > If you submit the courteous server patch without support for 4.0, I > think it needs to include a plan for how 4.0 will be added later. > > > Why is there a problem here? The requirements are the same for 4.0 and 4.1 (or 4.2). If the lease under which the courtesy lock was established has expired, then that courtesy lock must be released if some other client requests a lock that conflicts with the cached lock (unless the client breaks the courtesy framework by renewing that original lease before the conflict occurs). Otherwise, it is completely up to the server when it decides to actually release the lock. For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the server when the client is actually done with the lease, making it easy to determine when it is safe to release all the courtesy locks. However if the client does not send DESTROY_CLIENTID, then we're in the same situation with 4.x (x>0) as we would be with bog standard NFSv4.0. The lease has expired, and so the courtesy locks are liable to being dropped. At Hammerspace we have implemented courtesy locks, and our strategy is that when a conflict occurs, we drop the entire set of courtesy locks so that we don't have to deal with the "some locks were revoked" scenario. 
The reason is that when we originally implemented courtesy locks, the Linux NFSv4 client support for lock revocation was a lot less sophisticated than today. My suggestion is that you might therefore consider starting along this path, and then refining the support to make revocation more nuanced once you are confident that the coarser strategy is working as expected.
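The coarse policy Trond describes, dropping every bit of courtesy state held by the conflicting client rather than revoking a single lock, is straightforward to express. The sketch below is an illustrative model only; its types and helpers are stand-ins, not the real nfsd code paths.

/*
 * Illustrative model of the coarse conflict policy described above:
 * if the conflicting state belongs to a client whose lease has
 * already expired, expire that client completely and let the new
 * request proceed; if the lease is still live, the request loses.
 */
#include <stdbool.h>

struct demo_client {
	bool courtesy;	/* lease expired, state retained */
};

/* Assumed helpers, provided elsewhere in this simplified model. */
bool demo_state_conflicts(struct demo_client *owner, const void *request);
void demo_expire_client(struct demo_client *owner);

/* Returns true if @request may be granted, false if it must be denied. */
bool demo_resolve_conflict(struct demo_client *owner, const void *request)
{
	if (!demo_state_conflicts(owner, request))
		return true;		/* no conflict at all */
	if (!owner->courtesy)
		return false;		/* live client wins */
	demo_expire_client(owner);	/* drop all courtesy state at once */
	return true;			/* then grant the new request */
}

A more nuanced variant would revoke only the conflicting stateid instead of expiring the whole client, which is the refinement discussed further down the thread.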
> On Nov 29, 2021, at 11:08 PM, Trond Myklebust <trondmy@hammerspace.com> wrote: > > On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote: >> >>>> On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >>> >>> >>>> On 11/29/21 1:10 PM, Chuck Lever III wrote: >>>> >>>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@oracle.com> >>>>>> wrote: >>>>> >>>>> >>>>> On 11/29/21 11:03 AM, Chuck Lever III wrote: >>>>>> Hello Dai! >>>>>> >>>>>> >>>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> >>>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, >>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>> Hi Bruce, >>>>>>>>> >>>>>>>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, >>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, >>>>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>>>> Just a reminder that this patch is still >>>>>>>>>>>>>>> waiting for your review. >>>>>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud >>>>>>>>>>>>>> figure out the pynfs >>>>>>>>>>>>>> failure for me.... >>>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and >>>>>>>>>>>>> it passed. I will run >>>>>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if >>>>>>>>>>>>> the problem you've >>>>>>>>>>>>> seen still there. >>>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with >>>>>>>>>>>> courteous and non-courteous >>>>>>>>>>>> 5.15-rc7 server. >>>>>>>>>>>> >>>>>>>>>>>> Nfs4.1 results are the same for both courteous >>>>>>>>>>>> and >>>>>>>>>>>> non-courteous server: >>>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 >>>>>>>>>>>>> Passed >>>>>>>>>>>> Results of nfs4.0 with non-courteous server: >>>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 >>>>>>>>>>>>> Passed >>>>>>>>>>>> test failed: LOCK24 >>>>>>>>>>>> >>>>>>>>>>>> Results of nfs4.0 with courteous server: >>>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 >>>>>>>>>>>>> Passed >>>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>>>>>>> >>>>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by >>>>>>>>>>>> itself. >>>>>>>>>>> Could well be a bug in the tests, I don't know. >>>>>>>>>> The reason OPEN18 failed was because the test timed >>>>>>>>>> out waiting for >>>>>>>>>> the reply of an OPEN call. The RPC connection used >>>>>>>>>> for the test was >>>>>>>>>> configured with 15 secs timeout. Note that OPEN18 >>>>>>>>>> only fails when >>>>>>>>>> the tests were run with 'all' option, this test >>>>>>>>>> passes if it's run >>>>>>>>>> by itself. >>>>>>>>>> >>>>>>>>>> With courteous server, by the time OPEN18 runs, there >>>>>>>>>> are about 1026 >>>>>>>>>> courtesy 4.0 clients on the server and all of these >>>>>>>>>> clients have opened >>>>>>>>>> the same file X with WRITE access. These clients were >>>>>>>>>> created by the >>>>>>>>>> previous tests. After each test completed, since 4.0 >>>>>>>>>> does not have >>>>>>>>>> session, the client states are not cleaned up >>>>>>>>>> immediately on the >>>>>>>>>> server and are allowed to become courtesy clients. 
>>>>>>>>>> >>>>>>>>>> When OPEN18 runs (about 20 minutes after the 1st test >>>>>>>>>> started), it >>>>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE >>>>>>>>>> which causes the >>>>>>>>>> server to check for conflicts with courtesy clients. >>>>>>>>>> The loop that >>>>>>>>>> checks 1026 courtesy clients for share/access >>>>>>>>>> conflict took less >>>>>>>>>> than 1 sec. But it took about 55 secs, on my VM, for >>>>>>>>>> the server >>>>>>>>>> to expire all 1026 courtesy clients. >>>>>>>>>> >>>>>>>>>> I modified pynfs to configure the 4.0 RPC connection >>>>>>>>>> with 60 seconds >>>>>>>>>> timeout and OPEN18 now consistently passed. The 4.0 >>>>>>>>>> test results are >>>>>>>>>> now the same for courteous and non-courteous server: >>>>>>>>>> >>>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>>>> >>>>>>>>>> Note that 4.1 tests do not suffer this timeout >>>>>>>>>> problem because the >>>>>>>>>> 4.1 clients and sessions are destroyed after each >>>>>>>>>> test completes. >>>>>>>>> Do you want me to send the patch to increase the >>>>>>>>> timeout for pynfs? >>>>>>>>> or is there any other things you think we should do? >>>>>>>> I don't know. >>>>>>>> >>>>>>>> 55 seconds to clean up 1026 clients is about 50ms per >>>>>>>> client, which is >>>>>>>> pretty slow. I wonder why. I guess it's probably >>>>>>>> updating the stable >>>>>>>> storage information. Is /var/lib/nfs/ on your server >>>>>>>> backed by a hard >>>>>>>> drive or an SSD or something else? >>>>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and >>>>>>> 64GB of hard >>>>>>> disk. I think a production system that supports this many >>>>>>> clients should >>>>>>> have faster CPUs, faster storage. >>>>>>> >>>>>>>> I wonder if that's an argument for limiting the number of >>>>>>>> courtesy >>>>>>>> clients. >>>>>>> I think we might want to treat 4.0 clients a bit different >>>>>>> from 4.1 >>>>>>> clients. With 4.0, every client will become a courtesy >>>>>>> client after >>>>>>> the client is done with the export and unmounts it. >>>>>> It should be safe for a server to purge a client's lease >>>>>> immediately >>>>>> if there is no open or lock state associated with it. >>>>> In this case, each client has opened files so there are open >>>>> states >>>>> associated with them. >>>>> >>>>>> When an NFSv4.0 client unmounts, all files should be closed >>>>>> at that >>>>>> point, >>>>> I'm not sure pynfs does proper clean up after each subtest, I >>>>> will >>>>> check. There must be state associated with the client in order >>>>> for >>>>> it to become courtesy client. >>>> Makes sense. Then a synthetic client like pynfs can DoS a >>>> courteous >>>> server. >>>> >>>> >>>>>> so the server can wait for the lease to expire and purge it >>>>>> normally. Or am I missing something? >>>>> When 4.0 client lease expires and there are still states >>>>> associated >>>>> with the client then the server allows this client to become >>>>> courtesy >>>>> client. >>>> I think the same thing happens if an NFSv4.1 client neglects to >>>> send >>>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is >>>> broken >>>> or malicious, but the server faces the same issue of protecting >>>> itself from a DoS attack. >>>> >>>> IMO you should consider limiting the number of courteous clients >>>> the server can hold onto. Let's say that number is 1000. When the >>>> server wants to turn a 1001st client into a courteous client, it >>>> can simply expire and purge the oldest courteous client on its >>>> list. 
Otherwise, over time, the 24-hour expiry will reduce the >>>> set of courteous clients back to zero. >>>> >>>> What do you think? >>> >>> Limiting the number of courteous clients to handle the cases of >>> broken/malicious 4.1 clients seems reasonable as the last resort. >>> >>> I think if a malicious 4.1 clients could mount the server's export, >>> opens a file (to create state) and repeats the same with a >>> different >>> client id then it seems like some basic security was already >>> broken; >>> allowing unauthorized clients to mount server's exports. >> >> You can do this today with AUTH_SYS. I consider it a genuine attack >> surface. >> >> >>> I think if we have to enforce a limit, then it's only for handling >>> of seriously buggy 4.1 clients which should not be the norm. The >>> issue with this is how to pick an optimal number that is suitable >>> for the running server which can be a very slow or a very fast >>> server. >>> >>> Note that even if we impose an limit, that does not completely >>> solve >>> the problem with pynfs 4.0 test since its RPC timeout is configured >>> with 15 secs which just enough to expire 277 clients based on 53ms >>> for each client, unless we limit it ~270 clients which I think it's >>> too low. >>> >>> This is what I plan to do: >>> >>> 1. do not support 4.0 courteous clients, for sure. >> >> Not supporting 4.0 isn’t an option, IMHO. It is a fully supported >> protocol at this time, and the same exposure exists for 4.1, it’s >> just a little harder to exploit. >> >> If you submit the courteous server patch without support for 4.0, I >> think it needs to include a plan for how 4.0 will be added later. >> >>> > > Why is there a problem here? The requirements are the same for 4.0 and > 4.1 (or 4.2). If the lease under which the courtesy lock was > established has expired, then that courtesy lock must be released if > some other client requests a lock that conflicts with the cached lock > (unless the client breaks the courtesy framework by renewing that > original lease before the conflict occurs). Otherwise, it is completely > up to the server when it decides to actually release the lock. > > For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the > server when the client is actually done with the lease, making it easy > to determine when it is safe to release all the courtesy locks. However > if the client does not send DESTROY_CLIENTID, then we're in the same > situation with 4.x (x>0) as we would be with bog standard NFSv4.0. The > lease has expired, and so the courtesy locks are liable to being > dropped. I agree the situation is the same for all minor versions. > At Hammerspace we have implemented courtesy locks, and our strategy is > that when a conflict occurs, we drop the entire set of courtesy locks > so that we don't have to deal with the "some locks were revoked" > scenario. The reason is that when we originally implemented courtesy > locks, the Linux NFSv4 client support for lock revocation was a lot > less sophisticated than today. My suggestion is that you might > therefore consider starting along this path, and then refining the > support to make revocation more nuanced once you are confident that the > coarser strategy is working as expected. Dai’s implementation does all that, and takes the coarser approach at the moment. There are plans to explore the more nuanced behavior (by revoking only the conflicting lock instead of dropping the whole lease) after this initial work is merged. 
The issue is there are certain pathological client behaviors (whether malicious or accidental) that can run the server out of resources, since it is holding onto lease state for a much longer time. We are simply trying to design a lease garbage collection scheme to meet that challenge. I think limiting the number of courteous clients is a simple way to do this, but we could also shorten the courtesy lifetime as more clients enter that state, to ensure that they don’t overrun the server’s memory. Another approach might be to add a shrinker that purges the oldest courteous clients when the server comes under memory pressure.
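The last suggestion maps naturally onto the kernel's shrinker interface. The sketch below is written against the shrinker API as it looks in the 5.15-era kernels discussed in this thread; the two courtesy_* helpers it calls are assumptions, not existing nfsd functions.

/*
 * Sketch of the shrinker idea.  The two courtesy_* helpers are
 * assumptions: one reports how many courtesy clients exist, the other
 * expires up to @nr of the oldest and returns how many it expired.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/shrinker.h>

unsigned long courtesy_client_count(void);
unsigned long courtesy_expire_oldest(unsigned long nr);

static unsigned long courtesy_shrink_count(struct shrinker *s,
					   struct shrink_control *sc)
{
	/* Returning 0 tells the MM there is nothing to reclaim here. */
	return courtesy_client_count();
}

static unsigned long courtesy_shrink_scan(struct shrinker *s,
					  struct shrink_control *sc)
{
	/* Under memory pressure, give back the oldest courtesy clients. */
	return courtesy_expire_oldest(sc->nr_to_scan);
}

static struct shrinker courtesy_shrinker = {
	.count_objects	= courtesy_shrink_count,
	.scan_objects	= courtesy_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};

static int __init courtesy_shrinker_init(void)
{
	return register_shrinker(&courtesy_shrinker);
}
module_init(courtesy_shrinker_init);

That would tie reclaim of courtesy clients to actual memory pressure, instead of (or in addition to) a fixed cap or the 24-hour expiry.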
On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote: > > > On Nov 29, 2021, at 11:08 PM, Trond Myklebust > > <trondmy@hammerspace.com> wrote: > > > > On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote: > > > > > > > > On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@oracle.com> > > > > > wrote: > > > > > > > > > > > > > On 11/29/21 1:10 PM, Chuck Lever III wrote: > > > > > > > > > > > > On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@oracle.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > On 11/29/21 11:03 AM, Chuck Lever III wrote: > > > > > > > Hello Dai! > > > > > > > > > > > > > > > > > > > > > > On Nov 29, 2021, at 1:32 PM, Dai Ngo > > > > > > > > <dai.ngo@oracle.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > On 11/29/21 9:30 AM, J. Bruce Fields wrote: > > > > > > > > > On Mon, Nov 29, 2021 at 09:13:16AM -0800, > > > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > > > Hi Bruce, > > > > > > > > > > > > > > > > > > > > On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: > > > > > > > > > > > On 11/17/21 4:34 PM, J. Bruce Fields wrote: > > > > > > > > > > > > On Wed, Nov 17, 2021 at 01:46:02PM -0800, > > > > > > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > > > > > > On 11/17/21 9:59 AM, > > > > > > > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > > > > > > > On 11/17/21 6:14 AM, J. Bruce Fields wrote: > > > > > > > > > > > > > > > On Tue, Nov 16, 2021 at 03:06:32PM -0800, > > > > > > > > > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > > > > > > > > > Just a reminder that this patch is > > > > > > > > > > > > > > > > still > > > > > > > > > > > > > > > > waiting for your review. > > > > > > > > > > > > > > > Yeah, I was procrastinating and hoping > > > > > > > > > > > > > > > yo'ud > > > > > > > > > > > > > > > figure out the pynfs > > > > > > > > > > > > > > > failure for me.... > > > > > > > > > > > > > > Last time I ran 4.0 OPEN18 test by itself > > > > > > > > > > > > > > and > > > > > > > > > > > > > > it passed. I will run > > > > > > > > > > > > > > all OPEN tests together with 5.15-rc7 to > > > > > > > > > > > > > > see if > > > > > > > > > > > > > > the problem you've > > > > > > > > > > > > > > seen still there. > > > > > > > > > > > > > I ran all tests in nfsv4.1 and nfsv4.0 with > > > > > > > > > > > > > courteous and non-courteous > > > > > > > > > > > > > 5.15-rc7 server. > > > > > > > > > > > > > > > > > > > > > > > > > > Nfs4.1 results are the same for both > > > > > > > > > > > > > courteous > > > > > > > > > > > > > and > > > > > > > > > > > > > non-courteous server: > > > > > > > > > > > > > > Of those: 0 Skipped, 0 Failed, 0 Warned, > > > > > > > > > > > > > > 169 > > > > > > > > > > > > > > Passed > > > > > > > > > > > > > Results of nfs4.0 with non-courteous server: > > > > > > > > > > > > > > Of those: 8 Skipped, 1 Failed, 0 Warned, > > > > > > > > > > > > > > 577 > > > > > > > > > > > > > > Passed > > > > > > > > > > > > > test failed: LOCK24 > > > > > > > > > > > > > > > > > > > > > > > > > > Results of nfs4.0 with courteous server: > > > > > > > > > > > > > > Of those: 8 Skipped, 3 Failed, 0 Warned, > > > > > > > > > > > > > > 575 > > > > > > > > > > > > > > Passed > > > > > > > > > > > > > tests failed: LOCK24, OPEN18, OPEN30 > > > > > > > > > > > > > > > > > > > > > > > > > > OPEN18 and OPEN30 test pass if each is run by > > > > > > > > > > > > > itself. > > > > > > > > > > > > Could well be a bug in the tests, I don't know. 
> > > > > > > > > > > The reason OPEN18 failed was because the test > > > > > > > > > > > timed > > > > > > > > > > > out waiting for > > > > > > > > > > > the reply of an OPEN call. The RPC connection > > > > > > > > > > > used > > > > > > > > > > > for the test was > > > > > > > > > > > configured with 15 secs timeout. Note that OPEN18 > > > > > > > > > > > only fails when > > > > > > > > > > > the tests were run with 'all' option, this test > > > > > > > > > > > passes if it's run > > > > > > > > > > > by itself. > > > > > > > > > > > > > > > > > > > > > > With courteous server, by the time OPEN18 runs, > > > > > > > > > > > there > > > > > > > > > > > are about 1026 > > > > > > > > > > > courtesy 4.0 clients on the server and all of > > > > > > > > > > > these > > > > > > > > > > > clients have opened > > > > > > > > > > > the same file X with WRITE access. These clients > > > > > > > > > > > were > > > > > > > > > > > created by the > > > > > > > > > > > previous tests. After each test completed, since > > > > > > > > > > > 4.0 > > > > > > > > > > > does not have > > > > > > > > > > > session, the client states are not cleaned up > > > > > > > > > > > immediately on the > > > > > > > > > > > server and are allowed to become courtesy > > > > > > > > > > > clients. > > > > > > > > > > > > > > > > > > > > > > When OPEN18 runs (about 20 minutes after the 1st > > > > > > > > > > > test > > > > > > > > > > > started), it > > > > > > > > > > > sends OPEN of file X with OPEN4_SHARE_DENY_WRITE > > > > > > > > > > > which causes the > > > > > > > > > > > server to check for conflicts with courtesy > > > > > > > > > > > clients. > > > > > > > > > > > The loop that > > > > > > > > > > > checks 1026 courtesy clients for share/access > > > > > > > > > > > conflict took less > > > > > > > > > > > than 1 sec. But it took about 55 secs, on my VM, > > > > > > > > > > > for > > > > > > > > > > > the server > > > > > > > > > > > to expire all 1026 courtesy clients. > > > > > > > > > > > > > > > > > > > > > > I modified pynfs to configure the 4.0 RPC > > > > > > > > > > > connection > > > > > > > > > > > with 60 seconds > > > > > > > > > > > timeout and OPEN18 now consistently passed. The > > > > > > > > > > > 4.0 > > > > > > > > > > > test results are > > > > > > > > > > > now the same for courteous and non-courteous > > > > > > > > > > > server: > > > > > > > > > > > > > > > > > > > > > > 8 Skipped, 1 Failed, 0 Warned, 577 Passed > > > > > > > > > > > > > > > > > > > > > > Note that 4.1 tests do not suffer this timeout > > > > > > > > > > > problem because the > > > > > > > > > > > 4.1 clients and sessions are destroyed after each > > > > > > > > > > > test completes. > > > > > > > > > > Do you want me to send the patch to increase the > > > > > > > > > > timeout for pynfs? > > > > > > > > > > or is there any other things you think we should > > > > > > > > > > do? > > > > > > > > > I don't know. > > > > > > > > > > > > > > > > > > 55 seconds to clean up 1026 clients is about 50ms per > > > > > > > > > client, which is > > > > > > > > > pretty slow. I wonder why. I guess it's probably > > > > > > > > > updating the stable > > > > > > > > > storage information. Is /var/lib/nfs/ on your server > > > > > > > > > backed by a hard > > > > > > > > > drive or an SSD or something else? > > > > > > > > My server is a virtualbox VM that has 1 CPU, 4GB RAM > > > > > > > > and > > > > > > > > 64GB of hard > > > > > > > > disk. 
I think a production system that supports this > > > > > > > > many > > > > > > > > clients should > > > > > > > > have faster CPUs, faster storage. > > > > > > > > > > > > > > > > > I wonder if that's an argument for limiting the > > > > > > > > > number of > > > > > > > > > courtesy > > > > > > > > > clients. > > > > > > > > I think we might want to treat 4.0 clients a bit > > > > > > > > different > > > > > > > > from 4.1 > > > > > > > > clients. With 4.0, every client will become a courtesy > > > > > > > > client after > > > > > > > > the client is done with the export and unmounts it. > > > > > > > It should be safe for a server to purge a client's lease > > > > > > > immediately > > > > > > > if there is no open or lock state associated with it. > > > > > > In this case, each client has opened files so there are > > > > > > open > > > > > > states > > > > > > associated with them. > > > > > > > > > > > > > When an NFSv4.0 client unmounts, all files should be > > > > > > > closed > > > > > > > at that > > > > > > > point, > > > > > > I'm not sure pynfs does proper clean up after each subtest, > > > > > > I > > > > > > will > > > > > > check. There must be state associated with the client in > > > > > > order > > > > > > for > > > > > > it to become courtesy client. > > > > > Makes sense. Then a synthetic client like pynfs can DoS a > > > > > courteous > > > > > server. > > > > > > > > > > > > > > > > > so the server can wait for the lease to expire and purge > > > > > > > it > > > > > > > normally. Or am I missing something? > > > > > > When 4.0 client lease expires and there are still states > > > > > > associated > > > > > > with the client then the server allows this client to > > > > > > become > > > > > > courtesy > > > > > > client. > > > > > I think the same thing happens if an NFSv4.1 client neglects > > > > > to > > > > > send > > > > > DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is > > > > > broken > > > > > or malicious, but the server faces the same issue of > > > > > protecting > > > > > itself from a DoS attack. > > > > > > > > > > IMO you should consider limiting the number of courteous > > > > > clients > > > > > the server can hold onto. Let's say that number is 1000. When > > > > > the > > > > > server wants to turn a 1001st client into a courteous client, > > > > > it > > > > > can simply expire and purge the oldest courteous client on > > > > > its > > > > > list. Otherwise, over time, the 24-hour expiry will reduce > > > > > the > > > > > set of courteous clients back to zero. > > > > > > > > > > What do you think? > > > > > > > > Limiting the number of courteous clients to handle the cases of > > > > broken/malicious 4.1 clients seems reasonable as the last > > > > resort. > > > > > > > > I think if a malicious 4.1 clients could mount the server's > > > > export, > > > > opens a file (to create state) and repeats the same with a > > > > different > > > > client id then it seems like some basic security was already > > > > broken; > > > > allowing unauthorized clients to mount server's exports. > > > > > > You can do this today with AUTH_SYS. I consider it a genuine > > > attack > > > surface. > > > > > > > > > > I think if we have to enforce a limit, then it's only for > > > > handling > > > > of seriously buggy 4.1 clients which should not be the norm. > > > > The > > > > issue with this is how to pick an optimal number that is > > > > suitable > > > > for the running server which can be a very slow or a very fast > > > > server. 
> > > > > > > > Note that even if we impose an limit, that does not completely > > > > solve > > > > the problem with pynfs 4.0 test since its RPC timeout is > > > > configured > > > > with 15 secs which just enough to expire 277 clients based on > > > > 53ms > > > > for each client, unless we limit it ~270 clients which I think > > > > it's > > > > too low. > > > > > > > > This is what I plan to do: > > > > > > > > 1. do not support 4.0 courteous clients, for sure. > > > > > > Not supporting 4.0 isn’t an option, IMHO. It is a fully supported > > > protocol at this time, and the same exposure exists for 4.1, it’s > > > just a little harder to exploit. > > > > > > If you submit the courteous server patch without support for 4.0, > > > I > > > think it needs to include a plan for how 4.0 will be added later. > > > > > > > > > > > Why is there a problem here? The requirements are the same for 4.0 > > and > > 4.1 (or 4.2). If the lease under which the courtesy lock was > > established has expired, then that courtesy lock must be released > > if > > some other client requests a lock that conflicts with the cached > > lock > > (unless the client breaks the courtesy framework by renewing that > > original lease before the conflict occurs). Otherwise, it is > > completely > > up to the server when it decides to actually release the lock. > > > > For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the > > server when the client is actually done with the lease, making it > > easy > > to determine when it is safe to release all the courtesy locks. > > However > > if the client does not send DESTROY_CLIENTID, then we're in the > > same > > situation with 4.x (x>0) as we would be with bog standard NFSv4.0. > > The > > lease has expired, and so the courtesy locks are liable to being > > dropped. > > I agree the situation is the same for all minor versions. > > > > At Hammerspace we have implemented courtesy locks, and our strategy > > is > > that when a conflict occurs, we drop the entire set of courtesy > > locks > > so that we don't have to deal with the "some locks were revoked" > > scenario. The reason is that when we originally implemented > > courtesy > > locks, the Linux NFSv4 client support for lock revocation was a lot > > less sophisticated than today. My suggestion is that you might > > therefore consider starting along this path, and then refining the > > support to make revocation more nuanced once you are confident that > > the > > coarser strategy is working as expected. > > Dai’s implementation does all that, and takes the coarser approach at > the moment. There are plans to explore the more nuanced behavior (by > revoking only the conflicting lock instead of dropping the whole > lease) after this initial work is merged. > > The issue is there are certain pathological client behaviors (whether > malicious or accidental) that can run the server out of resources, > since it is holding onto lease state for a much longer time. We are > simply trying to design a lease garbage collection scheme to meet > that challenge. > > I think limiting the number of courteous clients is a simple way to > do this, but we could also shorten the courtesy lifetime as more > clients enter that state, to ensure that they don’t overrun the > server’s memory. Another approach might be to add a shrinker that > purges the oldest courteous clients when the server comes under > memory pressure. > > We already have a scanner that tries to release all client state after 1 lease period. 
Just extend that to do it after 10 lease periods. If a network partition hasn't recovered after 10 minutes, you probably have bigger problems. You can limit the number of clients as well, but that leads into a rat's nest of other issues that have nothing to do with courtesy locks and everything to do with the fact that any client can hold a lot of state.
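The scanner referred to here is the periodic laundromat scan that nfsd already runs. As a rough, self-contained illustration of the policy change being suggested (keep courtesy state for N lease periods instead of one), here is a small userspace C model; the struct, constants, and function names are invented for the example and are not the nfsd data structures:

/* Illustrative userspace model of a laundromat-style scan; the struct
 * and function names are hypothetical, not taken from fs/nfsd. */
#include <stdio.h>
#include <time.h>

#define LEASE_SECONDS   90      /* nfsd's default lease time is 90 seconds */
#define COURTESY_LEASES 10      /* keep courtesy state for 10 lease periods */

struct client {
        int     id;
        time_t  last_renew;     /* last RENEW/SEQUENCE seen from this client */
        int     has_state;      /* still holds opens/locks */
};

/* Return 1 if the scan should drop all state for this client. */
static int client_expired(const struct client *clp, time_t now)
{
        time_t grace = clp->has_state ?
                (time_t)COURTESY_LEASES * LEASE_SECONDS : LEASE_SECONDS;

        return now - clp->last_renew > grace;
}

static void laundromat_scan(struct client *clients, int n, time_t now)
{
        for (int i = 0; i < n; i++)
                if (client_expired(&clients[i], now))
                        printf("expiring client %d\n", clients[i].id);
}

int main(void)
{
        time_t now = time(NULL);
        struct client clients[] = {
                { 1, now - 2 * LEASE_SECONDS,  0 },  /* idle, no state: expired */
                { 2, now - 5 * LEASE_SECONDS,  1 },  /* courtesy client: kept */
                { 3, now - 20 * LEASE_SECONDS, 1 },  /* stale courtesy: expired */
        };

        laundromat_scan(clients, 3, now);
        return 0;
}

The only change the suggestion requires is the longer grace multiplier for clients that still hold state; everything else is the scan the server already performs once per lease period.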
On 11/29/21 5:42 PM, Chuck Lever III wrote: >> On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >> >> >>> On 11/29/21 1:10 PM, Chuck Lever III wrote: >>> >>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >>>> >>>> On 11/29/21 11:03 AM, Chuck Lever III wrote: >>>>> Hello Dai! >>>>> >>>>> >>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo <dai.ngo@oracle.com> wrote: >>>>>> >>>>>> >>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, dai.ngo@oracle.com wrote: >>>>>>>> Hi Bruce, >>>>>>>> >>>>>>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, dai.ngo@oracle.com wrote: >>>>>>>>>>> On 11/17/21 9:59 AM, dai.ngo@oracle.com wrote: >>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, dai.ngo@oracle.com wrote: >>>>>>>>>>>>>> Just a reminder that this patch is still waiting for your review. >>>>>>>>>>>>> Yeah, I was procrastinating and hoping yo'ud figure out the pynfs >>>>>>>>>>>>> failure for me.... >>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself and it passed. I will run >>>>>>>>>>>> all OPEN tests together with 5.15-rc7 to see if the problem you've >>>>>>>>>>>> seen still there. >>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with courteous and non-courteous >>>>>>>>>>> 5.15-rc7 server. >>>>>>>>>>> >>>>>>>>>>> Nfs4.1 results are the same for both courteous and >>>>>>>>>>> non-courteous server: >>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, 169 Passed >>>>>>>>>>> Results of nfs4.0 with non-courteous server: >>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>>>>> test failed: LOCK24 >>>>>>>>>>> >>>>>>>>>>> Results of nfs4.0 with courteous server: >>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, 575 Passed >>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>>>>>> >>>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by itself. >>>>>>>>>> Could well be a bug in the tests, I don't know. >>>>>>>>> The reason OPEN18 failed was because the test timed out waiting for >>>>>>>>> the reply of an OPEN call. The RPC connection used for the test was >>>>>>>>> configured with 15 secs timeout. Note that OPEN18 only fails when >>>>>>>>> the tests were run with 'all' option, this test passes if it's run >>>>>>>>> by itself. >>>>>>>>> >>>>>>>>> With courteous server, by the time OPEN18 runs, there are about 1026 >>>>>>>>> courtesy 4.0 clients on the server and all of these clients have opened >>>>>>>>> the same file X with WRITE access. These clients were created by the >>>>>>>>> previous tests. After each test completed, since 4.0 does not have >>>>>>>>> session, the client states are not cleaned up immediately on the >>>>>>>>> server and are allowed to become courtesy clients. >>>>>>>>> >>>>>>>>> When OPEN18 runs (about 20 minutes after the 1st test started), it >>>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE which causes the >>>>>>>>> server to check for conflicts with courtesy clients. The loop that >>>>>>>>> checks 1026 courtesy clients for share/access conflict took less >>>>>>>>> than 1 sec. But it took about 55 secs, on my VM, for the server >>>>>>>>> to expire all 1026 courtesy clients. >>>>>>>>> >>>>>>>>> I modified pynfs to configure the 4.0 RPC connection with 60 seconds >>>>>>>>> timeout and OPEN18 now consistently passed. 
The 4.0 test results are >>>>>>>>> now the same for courteous and non-courteous server: >>>>>>>>> >>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>>> >>>>>>>>> Note that 4.1 tests do not suffer this timeout problem because the >>>>>>>>> 4.1 clients and sessions are destroyed after each test completes. >>>>>>>> Do you want me to send the patch to increase the timeout for pynfs? >>>>>>>> or is there any other things you think we should do? >>>>>>> I don't know. >>>>>>> >>>>>>> 55 seconds to clean up 1026 clients is about 50ms per client, which is >>>>>>> pretty slow. I wonder why. I guess it's probably updating the stable >>>>>>> storage information. Is /var/lib/nfs/ on your server backed by a hard >>>>>>> drive or an SSD or something else? >>>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM and 64GB of hard >>>>>> disk. I think a production system that supports this many clients should >>>>>> have faster CPUs, faster storage. >>>>>> >>>>>>> I wonder if that's an argument for limiting the number of courtesy >>>>>>> clients. >>>>>> I think we might want to treat 4.0 clients a bit different from 4.1 >>>>>> clients. With 4.0, every client will become a courtesy client after >>>>>> the client is done with the export and unmounts it. >>>>> It should be safe for a server to purge a client's lease immediately >>>>> if there is no open or lock state associated with it. >>>> In this case, each client has opened files so there are open states >>>> associated with them. >>>> >>>>> When an NFSv4.0 client unmounts, all files should be closed at that >>>>> point, >>>> I'm not sure pynfs does proper clean up after each subtest, I will >>>> check. There must be state associated with the client in order for >>>> it to become courtesy client. >>> Makes sense. Then a synthetic client like pynfs can DoS a courteous >>> server. >>> >>> >>>>> so the server can wait for the lease to expire and purge it >>>>> normally. Or am I missing something? >>>> When 4.0 client lease expires and there are still states associated >>>> with the client then the server allows this client to become courtesy >>>> client. >>> I think the same thing happens if an NFSv4.1 client neglects to send >>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is broken >>> or malicious, but the server faces the same issue of protecting >>> itself from a DoS attack. >>> >>> IMO you should consider limiting the number of courteous clients >>> the server can hold onto. Let's say that number is 1000. When the >>> server wants to turn a 1001st client into a courteous client, it >>> can simply expire and purge the oldest courteous client on its >>> list. Otherwise, over time, the 24-hour expiry will reduce the >>> set of courteous clients back to zero. >>> >>> What do you think? >> Limiting the number of courteous clients to handle the cases of >> broken/malicious 4.1 clients seems reasonable as the last resort. >> >> I think if a malicious 4.1 clients could mount the server's export, >> opens a file (to create state) and repeats the same with a different >> client id then it seems like some basic security was already broken; >> allowing unauthorized clients to mount server's exports. > You can do this today with AUTH_SYS. I consider it a genuine attack surface. > > >> I think if we have to enforce a limit, then it's only for handling >> of seriously buggy 4.1 clients which should not be the norm. 
The >> issue with this is how to pick an optimal number that is suitable >> for the running server which can be a very slow or a very fast server. >> >> Note that even if we impose an limit, that does not completely solve >> the problem with pynfs 4.0 test since its RPC timeout is configured >> with 15 secs which just enough to expire 277 clients based on 53ms >> for each client, unless we limit it ~270 clients which I think it's >> too low. >> >> This is what I plan to do: >> >> 1. do not support 4.0 courteous clients, for sure. > Not supporting 4.0 isn’t an option, IMHO. It is a fully supported protocol at this time, and the same exposure exists for 4.1, it’s just a little harder to exploit. > > If you submit the courteous server patch without support for 4.0, I think it needs to include a plan for how 4.0 will be added later. Seems like we should support both 4.0 and 4.x (x>=1) at the same time. > > >> 2. limit the number of courteous clients to 1000 (?), if you still >> think we need it. > I think this limit is necessary. It can be set based on the server’s physical memory size if a dynamic limit is desired. Just to be clear, the problem of pynfs with 4.0 is that the server takes ~55 secs to expire 1026 4.0 courteous clients, which comes out to ~50ms per client. This causes the test to time out while waiting for the RPC reply of the OPEN that triggers the conflict. I don't know exactly where the time is spent in the process of expiring a client. But as Bruce mentioned, it could be related to the time needed to access /var/lib/nfs to remove the client's persistent record. I think that is most likely the case, because the number of states owned by each client should be small since each test is short and does only simple ops. So I think this problem is related to the number of clients, not to the number of states owned by the clients. This is not the memory-exhaustion problem caused by holding too much state, which we plan to address after this initial phase. I'd vote to use a static limit for now, say 1000 clients, to avoid complicating the courteous server code for something that would not happen most of the time. -Dai > > >> Pls let me know what you think. >> >> Thanks, >> -Dai >> >>> >>>>>> Since there is >>>>>> no destroy session/client with 4.0, the courteous server allows the >>>>>> client to be around and becomes a courtesy client. So after awhile, >>>>>> even with normal usage, there will be lots 4.0 courtesy clients >>>>>> hanging around and these clients won't be destroyed until 24hrs >>>>>> later, or until they cause conflicts with other clients. >>>>>> >>>>>> We can reduce the courtesy_client_expiry time for 4.0 clients from >>>>>> 24hrs to 15/20 mins, enough for most network partition to heal?, >>>>>> or limit the number of 4.0 courtesy clients. Or don't support 4.0 >>>>>> clients at all which is my preference since I think in general users >>>>>> should skip 4.0 and use 4.1 instead. >>>>>> >>>>>> -Dai >>>>> -- >>>>> Chuck Lever >>>>> >>>>> >>>>> >>> -- >>> Chuck Lever >>> >>> >>>
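The static cap under discussion above (expire the oldest courtesy client when admitting a new one would exceed a fixed limit) can be modelled in a few lines of self-contained userspace C. The names, the limit of 1000, and the list handling are illustrative only and are not taken from the nfsd sources:

/* Illustrative userspace model of a fixed cap on courtesy clients;
 * all names are hypothetical, not from fs/nfsd. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_COURTESY_CLIENTS 1000

struct courtesy_client {
        int id;
        struct courtesy_client *next;   /* singly linked, oldest first */
};

static struct courtesy_client *oldest, *newest;
static int nr_courtesy;

static void purge_oldest(void)
{
        struct courtesy_client *victim = oldest;

        if (!victim)
                return;
        oldest = victim->next;
        if (!oldest)
                newest = NULL;
        nr_courtesy--;
        printf("purging courtesy client %d\n", victim->id);
        free(victim);
}

/* Called when a client's lease expires while it still holds state. */
static void make_courtesy_client(int id)
{
        struct courtesy_client *clp = malloc(sizeof(*clp));

        if (!clp)
                return;
        if (nr_courtesy >= MAX_COURTESY_CLIENTS)
                purge_oldest();         /* make room: drop the oldest one */

        clp->id = id;
        clp->next = NULL;
        if (newest)
                newest->next = clp;
        else
                oldest = clp;
        newest = clp;
        nr_courtesy++;
}

int main(void)
{
        for (int i = 0; i < MAX_COURTESY_CLIENTS + 3; i++)
                make_courtesy_client(i);
        printf("courtesy clients held: %d\n", nr_courtesy);
        return 0;
}

A dynamic variant would replace MAX_COURTESY_CLIENTS with a value derived from the server's physical memory, as suggested earlier in the thread.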
On 11/29/21 8:57 PM, Trond Myklebust wrote: > On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote: >>> On Nov 29, 2021, at 11:08 PM, Trond Myklebust >>> <trondmy@hammerspace.com> wrote: >>> >>> On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote: >>>>>> On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@oracle.com> >>>>>> wrote: >>>>> >>>>>> On 11/29/21 1:10 PM, Chuck Lever III wrote: >>>>>> >>>>>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@oracle.com> >>>>>>>> wrote: >>>>>>> >>>>>>> On 11/29/21 11:03 AM, Chuck Lever III wrote: >>>>>>>> Hello Dai! >>>>>>>> >>>>>>>> >>>>>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo >>>>>>>>> <dai.ngo@oracle.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>>>>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, >>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>> Hi Bruce, >>>>>>>>>>> >>>>>>>>>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, >>>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>>> On 11/17/21 9:59 AM, >>>>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, >>>>>>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>>>>>> Just a reminder that this patch is >>>>>>>>>>>>>>>>> still >>>>>>>>>>>>>>>>> waiting for your review. >>>>>>>>>>>>>>>> Yeah, I was procrastinating and hoping >>>>>>>>>>>>>>>> yo'ud >>>>>>>>>>>>>>>> figure out the pynfs >>>>>>>>>>>>>>>> failure for me.... >>>>>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself >>>>>>>>>>>>>>> and >>>>>>>>>>>>>>> it passed. I will run >>>>>>>>>>>>>>> all OPEN tests together with 5.15-rc7 to >>>>>>>>>>>>>>> see if >>>>>>>>>>>>>>> the problem you've >>>>>>>>>>>>>>> seen still there. >>>>>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with >>>>>>>>>>>>>> courteous and non-courteous >>>>>>>>>>>>>> 5.15-rc7 server. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Nfs4.1 results are the same for both >>>>>>>>>>>>>> courteous >>>>>>>>>>>>>> and >>>>>>>>>>>>>> non-courteous server: >>>>>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, >>>>>>>>>>>>>>> 169 >>>>>>>>>>>>>>> Passed >>>>>>>>>>>>>> Results of nfs4.0 with non-courteous server: >>>>>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, >>>>>>>>>>>>>>> 577 >>>>>>>>>>>>>>> Passed >>>>>>>>>>>>>> test failed: LOCK24 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Results of nfs4.0 with courteous server: >>>>>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, >>>>>>>>>>>>>>> 575 >>>>>>>>>>>>>>> Passed >>>>>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>>>>>>>>> >>>>>>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by >>>>>>>>>>>>>> itself. >>>>>>>>>>>>> Could well be a bug in the tests, I don't know. >>>>>>>>>>>> The reason OPEN18 failed was because the test >>>>>>>>>>>> timed >>>>>>>>>>>> out waiting for >>>>>>>>>>>> the reply of an OPEN call. The RPC connection >>>>>>>>>>>> used >>>>>>>>>>>> for the test was >>>>>>>>>>>> configured with 15 secs timeout. Note that OPEN18 >>>>>>>>>>>> only fails when >>>>>>>>>>>> the tests were run with 'all' option, this test >>>>>>>>>>>> passes if it's run >>>>>>>>>>>> by itself. >>>>>>>>>>>> >>>>>>>>>>>> With courteous server, by the time OPEN18 runs, >>>>>>>>>>>> there >>>>>>>>>>>> are about 1026 >>>>>>>>>>>> courtesy 4.0 clients on the server and all of >>>>>>>>>>>> these >>>>>>>>>>>> clients have opened >>>>>>>>>>>> the same file X with WRITE access. 
These clients >>>>>>>>>>>> were >>>>>>>>>>>> created by the >>>>>>>>>>>> previous tests. After each test completed, since >>>>>>>>>>>> 4.0 >>>>>>>>>>>> does not have >>>>>>>>>>>> session, the client states are not cleaned up >>>>>>>>>>>> immediately on the >>>>>>>>>>>> server and are allowed to become courtesy >>>>>>>>>>>> clients. >>>>>>>>>>>> >>>>>>>>>>>> When OPEN18 runs (about 20 minutes after the 1st >>>>>>>>>>>> test >>>>>>>>>>>> started), it >>>>>>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE >>>>>>>>>>>> which causes the >>>>>>>>>>>> server to check for conflicts with courtesy >>>>>>>>>>>> clients. >>>>>>>>>>>> The loop that >>>>>>>>>>>> checks 1026 courtesy clients for share/access >>>>>>>>>>>> conflict took less >>>>>>>>>>>> than 1 sec. But it took about 55 secs, on my VM, >>>>>>>>>>>> for >>>>>>>>>>>> the server >>>>>>>>>>>> to expire all 1026 courtesy clients. >>>>>>>>>>>> >>>>>>>>>>>> I modified pynfs to configure the 4.0 RPC >>>>>>>>>>>> connection >>>>>>>>>>>> with 60 seconds >>>>>>>>>>>> timeout and OPEN18 now consistently passed. The >>>>>>>>>>>> 4.0 >>>>>>>>>>>> test results are >>>>>>>>>>>> now the same for courteous and non-courteous >>>>>>>>>>>> server: >>>>>>>>>>>> >>>>>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>>>>>> >>>>>>>>>>>> Note that 4.1 tests do not suffer this timeout >>>>>>>>>>>> problem because the >>>>>>>>>>>> 4.1 clients and sessions are destroyed after each >>>>>>>>>>>> test completes. >>>>>>>>>>> Do you want me to send the patch to increase the >>>>>>>>>>> timeout for pynfs? >>>>>>>>>>> or is there any other things you think we should >>>>>>>>>>> do? >>>>>>>>>> I don't know. >>>>>>>>>> >>>>>>>>>> 55 seconds to clean up 1026 clients is about 50ms per >>>>>>>>>> client, which is >>>>>>>>>> pretty slow. I wonder why. I guess it's probably >>>>>>>>>> updating the stable >>>>>>>>>> storage information. Is /var/lib/nfs/ on your server >>>>>>>>>> backed by a hard >>>>>>>>>> drive or an SSD or something else? >>>>>>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM >>>>>>>>> and >>>>>>>>> 64GB of hard >>>>>>>>> disk. I think a production system that supports this >>>>>>>>> many >>>>>>>>> clients should >>>>>>>>> have faster CPUs, faster storage. >>>>>>>>> >>>>>>>>>> I wonder if that's an argument for limiting the >>>>>>>>>> number of >>>>>>>>>> courtesy >>>>>>>>>> clients. >>>>>>>>> I think we might want to treat 4.0 clients a bit >>>>>>>>> different >>>>>>>>> from 4.1 >>>>>>>>> clients. With 4.0, every client will become a courtesy >>>>>>>>> client after >>>>>>>>> the client is done with the export and unmounts it. >>>>>>>> It should be safe for a server to purge a client's lease >>>>>>>> immediately >>>>>>>> if there is no open or lock state associated with it. >>>>>>> In this case, each client has opened files so there are >>>>>>> open >>>>>>> states >>>>>>> associated with them. >>>>>>> >>>>>>>> When an NFSv4.0 client unmounts, all files should be >>>>>>>> closed >>>>>>>> at that >>>>>>>> point, >>>>>>> I'm not sure pynfs does proper clean up after each subtest, >>>>>>> I >>>>>>> will >>>>>>> check. There must be state associated with the client in >>>>>>> order >>>>>>> for >>>>>>> it to become courtesy client. >>>>>> Makes sense. Then a synthetic client like pynfs can DoS a >>>>>> courteous >>>>>> server. >>>>>> >>>>>> >>>>>>>> so the server can wait for the lease to expire and purge >>>>>>>> it >>>>>>>> normally. Or am I missing something? 
>>>>>>> When 4.0 client lease expires and there are still states >>>>>>> associated >>>>>>> with the client then the server allows this client to >>>>>>> become >>>>>>> courtesy >>>>>>> client. >>>>>> I think the same thing happens if an NFSv4.1 client neglects >>>>>> to >>>>>> send >>>>>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is >>>>>> broken >>>>>> or malicious, but the server faces the same issue of >>>>>> protecting >>>>>> itself from a DoS attack. >>>>>> >>>>>> IMO you should consider limiting the number of courteous >>>>>> clients >>>>>> the server can hold onto. Let's say that number is 1000. When >>>>>> the >>>>>> server wants to turn a 1001st client into a courteous client, >>>>>> it >>>>>> can simply expire and purge the oldest courteous client on >>>>>> its >>>>>> list. Otherwise, over time, the 24-hour expiry will reduce >>>>>> the >>>>>> set of courteous clients back to zero. >>>>>> >>>>>> What do you think? >>>>> Limiting the number of courteous clients to handle the cases of >>>>> broken/malicious 4.1 clients seems reasonable as the last >>>>> resort. >>>>> >>>>> I think if a malicious 4.1 clients could mount the server's >>>>> export, >>>>> opens a file (to create state) and repeats the same with a >>>>> different >>>>> client id then it seems like some basic security was already >>>>> broken; >>>>> allowing unauthorized clients to mount server's exports. >>>> You can do this today with AUTH_SYS. I consider it a genuine >>>> attack >>>> surface. >>>> >>>> >>>>> I think if we have to enforce a limit, then it's only for >>>>> handling >>>>> of seriously buggy 4.1 clients which should not be the norm. >>>>> The >>>>> issue with this is how to pick an optimal number that is >>>>> suitable >>>>> for the running server which can be a very slow or a very fast >>>>> server. >>>>> >>>>> Note that even if we impose an limit, that does not completely >>>>> solve >>>>> the problem with pynfs 4.0 test since its RPC timeout is >>>>> configured >>>>> with 15 secs which just enough to expire 277 clients based on >>>>> 53ms >>>>> for each client, unless we limit it ~270 clients which I think >>>>> it's >>>>> too low. >>>>> >>>>> This is what I plan to do: >>>>> >>>>> 1. do not support 4.0 courteous clients, for sure. >>>> Not supporting 4.0 isn’t an option, IMHO. It is a fully supported >>>> protocol at this time, and the same exposure exists for 4.1, it’s >>>> just a little harder to exploit. >>>> >>>> If you submit the courteous server patch without support for 4.0, >>>> I >>>> think it needs to include a plan for how 4.0 will be added later. >>>> >>> Why is there a problem here? The requirements are the same for 4.0 >>> and >>> 4.1 (or 4.2). If the lease under which the courtesy lock was >>> established has expired, then that courtesy lock must be released >>> if >>> some other client requests a lock that conflicts with the cached >>> lock >>> (unless the client breaks the courtesy framework by renewing that >>> original lease before the conflict occurs). Otherwise, it is >>> completely >>> up to the server when it decides to actually release the lock. >>> >>> For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the >>> server when the client is actually done with the lease, making it >>> easy >>> to determine when it is safe to release all the courtesy locks. >>> However >>> if the client does not send DESTROY_CLIENTID, then we're in the >>> same >>> situation with 4.x (x>0) as we would be with bog standard NFSv4.0. 
>>> The >>> lease has expired, and so the courtesy locks are liable to being >>> dropped. >> I agree the situation is the same for all minor versions. >> >> >>> At Hammerspace we have implemented courtesy locks, and our strategy >>> is >>> that when a conflict occurs, we drop the entire set of courtesy >>> locks >>> so that we don't have to deal with the "some locks were revoked" >>> scenario. The reason is that when we originally implemented >>> courtesy >>> locks, the Linux NFSv4 client support for lock revocation was a lot >>> less sophisticated than today. My suggestion is that you might >>> therefore consider starting along this path, and then refining the >>> support to make revocation more nuanced once you are confident that >>> the >>> coarser strategy is working as expected. >> Dai’s implementation does all that, and takes the coarser approach at >> the moment. There are plans to explore the more nuanced behavior (by >> revoking only the conflicting lock instead of dropping the whole >> lease) after this initial work is merged. >> >> The issue is there are certain pathological client behaviors (whether >> malicious or accidental) that can run the server out of resources, >> since it is holding onto lease state for a much longer time. We are >> simply trying to design a lease garbage collection scheme to meet >> that challenge. >> >> I think limiting the number of courteous clients is a simple way to >> do this, but we could also shorten the courtesy lifetime as more >> clients enter that state, to ensure that they don’t overrun the >> server’s memory. Another approach might be to add a shrinker that >> purges the oldest courteous clients when the server comes under >> memory pressure. >> >> > We already have a scanner that tries to release all client state after > 1 lease period. Just extend that to do it after 10 lease periods. If a > network partition hasn't recovered after 10 minutes, you probably have > bigger problems. Currently the courteous server allows 24 hours for the network partition to heal before releasing all client state. That seems excessive, but it was suggested to cover longer network partitions, for example while switches/routers are being repaired or upgraded. > > You can limit the number of clients as well, but that leads into a rats nest of other issues that have nothing to do with courtesy locks and everything to do with the fact that any client can hold a lot of state. The issue we currently have with the courteous server and the pynfs 4.0 tests is the number of courteous 4.0 clients the server has to expire when a share reservation conflict occurs while servicing the OPEN. Each client owns only a few states in this case, so we think the server spends most of its time deleting each client's record in /var/lib/nfs. This is why we plan to limit the number of courteous clients for now. As a side effect, it might also help reduce resource consumption. -Dai >
On Mon, 2021-11-29 at 23:22 -0800, dai.ngo@oracle.com wrote: > > On 11/29/21 8:57 PM, Trond Myklebust wrote: > > On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote: > > > > On Nov 29, 2021, at 11:08 PM, Trond Myklebust > > > > <trondmy@hammerspace.com> wrote: > > > > > > > > On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote: > > > > > > > On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@oracle.com> > > > > > > > wrote: > > > > > > > > > > > > > On 11/29/21 1:10 PM, Chuck Lever III wrote: > > > > > > > > > > > > > > > > On Nov 29, 2021, at 2:36 PM, Dai Ngo > > > > > > > > > <dai.ngo@oracle.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > On 11/29/21 11:03 AM, Chuck Lever III wrote: > > > > > > > > > Hello Dai! > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Nov 29, 2021, at 1:32 PM, Dai Ngo > > > > > > > > > > <dai.ngo@oracle.com> > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 11/29/21 9:30 AM, J. Bruce Fields wrote: > > > > > > > > > > > On Mon, Nov 29, 2021 at 09:13:16AM -0800, > > > > > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > > > > > Hi Bruce, > > > > > > > > > > > > > > > > > > > > > > > > On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: > > > > > > > > > > > > > On 11/17/21 4:34 PM, J. Bruce Fields wrote: > > > > > > > > > > > > > > On Wed, Nov 17, 2021 at 01:46:02PM -0800, > > > > > > > > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > > > > > > > > On 11/17/21 9:59 AM, > > > > > > > > > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > > > > > > > > > On 11/17/21 6:14 AM, J. Bruce Fields > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > On Tue, Nov 16, 2021 at 03:06:32PM - > > > > > > > > > > > > > > > > > 0800, > > > > > > > > > > > > > > > > > dai.ngo@oracle.com wrote: > > > > > > > > > > > > > > > > > > Just a reminder that this patch is > > > > > > > > > > > > > > > > > > still > > > > > > > > > > > > > > > > > > waiting for your review. > > > > > > > > > > > > > > > > > Yeah, I was procrastinating and > > > > > > > > > > > > > > > > > hoping > > > > > > > > > > > > > > > > > yo'ud > > > > > > > > > > > > > > > > > figure out the pynfs > > > > > > > > > > > > > > > > > failure for me.... > > > > > > > > > > > > > > > > Last time I ran 4.0 OPEN18 test by > > > > > > > > > > > > > > > > itself > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > it passed. I will run > > > > > > > > > > > > > > > > all OPEN tests together with 5.15-rc7 > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > see if > > > > > > > > > > > > > > > > the problem you've > > > > > > > > > > > > > > > > seen still there. > > > > > > > > > > > > > > > I ran all tests in nfsv4.1 and nfsv4.0 > > > > > > > > > > > > > > > with > > > > > > > > > > > > > > > courteous and non-courteous > > > > > > > > > > > > > > > 5.15-rc7 server. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nfs4.1 results are the same for both > > > > > > > > > > > > > > > courteous > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > non-courteous server: > > > > > > > > > > > > > > > > Of those: 0 Skipped, 0 Failed, 0 > > > > > > > > > > > > > > > > Warned, > > > > > > > > > > > > > > > > 169 > > > > > > > > > > > > > > > > Passed > > > > > > > > > > > > > > > Results of nfs4.0 with non-courteous > > > > > > > > > > > > > > > server: > > > > > > > > > > > > > > > > Of those: 8 Skipped, 1 Failed, 0 > > > > > > > > > > > > > > > > Warned, > > > > > > > > > > > > > > > > 577 > > > > > > > > > > > > > > > > Passed > > > > > > > > > > > > > > > test failed: LOCK24 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Results of nfs4.0 with courteous server: > > > > > > > > > > > > > > > > Of those: 8 Skipped, 3 Failed, 0 > > > > > > > > > > > > > > > > Warned, > > > > > > > > > > > > > > > > 575 > > > > > > > > > > > > > > > > Passed > > > > > > > > > > > > > > > tests failed: LOCK24, OPEN18, OPEN30 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > OPEN18 and OPEN30 test pass if each is > > > > > > > > > > > > > > > run by > > > > > > > > > > > > > > > itself. > > > > > > > > > > > > > > Could well be a bug in the tests, I don't > > > > > > > > > > > > > > know. > > > > > > > > > > > > > The reason OPEN18 failed was because the test > > > > > > > > > > > > > timed > > > > > > > > > > > > > out waiting for > > > > > > > > > > > > > the reply of an OPEN call. The RPC connection > > > > > > > > > > > > > used > > > > > > > > > > > > > for the test was > > > > > > > > > > > > > configured with 15 secs timeout. Note that > > > > > > > > > > > > > OPEN18 > > > > > > > > > > > > > only fails when > > > > > > > > > > > > > the tests were run with 'all' option, this > > > > > > > > > > > > > test > > > > > > > > > > > > > passes if it's run > > > > > > > > > > > > > by itself. > > > > > > > > > > > > > > > > > > > > > > > > > > With courteous server, by the time OPEN18 > > > > > > > > > > > > > runs, > > > > > > > > > > > > > there > > > > > > > > > > > > > are about 1026 > > > > > > > > > > > > > courtesy 4.0 clients on the server and all of > > > > > > > > > > > > > these > > > > > > > > > > > > > clients have opened > > > > > > > > > > > > > the same file X with WRITE access. These > > > > > > > > > > > > > clients > > > > > > > > > > > > > were > > > > > > > > > > > > > created by the > > > > > > > > > > > > > previous tests. After each test completed, > > > > > > > > > > > > > since > > > > > > > > > > > > > 4.0 > > > > > > > > > > > > > does not have > > > > > > > > > > > > > session, the client states are not cleaned up > > > > > > > > > > > > > immediately on the > > > > > > > > > > > > > server and are allowed to become courtesy > > > > > > > > > > > > > clients. > > > > > > > > > > > > > > > > > > > > > > > > > > When OPEN18 runs (about 20 minutes after the > > > > > > > > > > > > > 1st > > > > > > > > > > > > > test > > > > > > > > > > > > > started), it > > > > > > > > > > > > > sends OPEN of file X with > > > > > > > > > > > > > OPEN4_SHARE_DENY_WRITE > > > > > > > > > > > > > which causes the > > > > > > > > > > > > > server to check for conflicts with courtesy > > > > > > > > > > > > > clients. > > > > > > > > > > > > > The loop that > > > > > > > > > > > > > checks 1026 courtesy clients for share/access > > > > > > > > > > > > > conflict took less > > > > > > > > > > > > > than 1 sec. 
But it took about 55 secs, on my > > > > > > > > > > > > > VM, > > > > > > > > > > > > > for > > > > > > > > > > > > > the server > > > > > > > > > > > > > to expire all 1026 courtesy clients. > > > > > > > > > > > > > > > > > > > > > > > > > > I modified pynfs to configure the 4.0 RPC > > > > > > > > > > > > > connection > > > > > > > > > > > > > with 60 seconds > > > > > > > > > > > > > timeout and OPEN18 now consistently passed. > > > > > > > > > > > > > The > > > > > > > > > > > > > 4.0 > > > > > > > > > > > > > test results are > > > > > > > > > > > > > now the same for courteous and non-courteous > > > > > > > > > > > > > server: > > > > > > > > > > > > > > > > > > > > > > > > > > 8 Skipped, 1 Failed, 0 Warned, 577 Passed > > > > > > > > > > > > > > > > > > > > > > > > > > Note that 4.1 tests do not suffer this > > > > > > > > > > > > > timeout > > > > > > > > > > > > > problem because the > > > > > > > > > > > > > 4.1 clients and sessions are destroyed after > > > > > > > > > > > > > each > > > > > > > > > > > > > test completes. > > > > > > > > > > > > Do you want me to send the patch to increase > > > > > > > > > > > > the > > > > > > > > > > > > timeout for pynfs? > > > > > > > > > > > > or is there any other things you think we > > > > > > > > > > > > should > > > > > > > > > > > > do? > > > > > > > > > > > I don't know. > > > > > > > > > > > > > > > > > > > > > > 55 seconds to clean up 1026 clients is about 50ms > > > > > > > > > > > per > > > > > > > > > > > client, which is > > > > > > > > > > > pretty slow. I wonder why. I guess it's > > > > > > > > > > > probably > > > > > > > > > > > updating the stable > > > > > > > > > > > storage information. Is /var/lib/nfs/ on your > > > > > > > > > > > server > > > > > > > > > > > backed by a hard > > > > > > > > > > > drive or an SSD or something else? > > > > > > > > > > My server is a virtualbox VM that has 1 CPU, 4GB > > > > > > > > > > RAM > > > > > > > > > > and > > > > > > > > > > 64GB of hard > > > > > > > > > > disk. I think a production system that supports > > > > > > > > > > this > > > > > > > > > > many > > > > > > > > > > clients should > > > > > > > > > > have faster CPUs, faster storage. > > > > > > > > > > > > > > > > > > > > > I wonder if that's an argument for limiting the > > > > > > > > > > > number of > > > > > > > > > > > courtesy > > > > > > > > > > > clients. > > > > > > > > > > I think we might want to treat 4.0 clients a bit > > > > > > > > > > different > > > > > > > > > > from 4.1 > > > > > > > > > > clients. With 4.0, every client will become a > > > > > > > > > > courtesy > > > > > > > > > > client after > > > > > > > > > > the client is done with the export and unmounts it. > > > > > > > > > It should be safe for a server to purge a client's > > > > > > > > > lease > > > > > > > > > immediately > > > > > > > > > if there is no open or lock state associated with it. > > > > > > > > In this case, each client has opened files so there are > > > > > > > > open > > > > > > > > states > > > > > > > > associated with them. > > > > > > > > > > > > > > > > > When an NFSv4.0 client unmounts, all files should be > > > > > > > > > closed > > > > > > > > > at that > > > > > > > > > point, > > > > > > > > I'm not sure pynfs does proper clean up after each > > > > > > > > subtest, > > > > > > > > I > > > > > > > > will > > > > > > > > check. There must be state associated with the client > > > > > > > > in > > > > > > > > order > > > > > > > > for > > > > > > > > it to become courtesy client. 
> > > > > > > Makes sense. Then a synthetic client like pynfs can DoS a > > > > > > > courteous > > > > > > > server. > > > > > > > > > > > > > > > > > > > > > > > so the server can wait for the lease to expire and > > > > > > > > > purge > > > > > > > > > it > > > > > > > > > normally. Or am I missing something? > > > > > > > > When 4.0 client lease expires and there are still > > > > > > > > states > > > > > > > > associated > > > > > > > > with the client then the server allows this client to > > > > > > > > become > > > > > > > > courtesy > > > > > > > > client. > > > > > > > I think the same thing happens if an NFSv4.1 client > > > > > > > neglects > > > > > > > to > > > > > > > send > > > > > > > DESTROY_SESSION / DESTROY_CLIENTID. Either such a client > > > > > > > is > > > > > > > broken > > > > > > > or malicious, but the server faces the same issue of > > > > > > > protecting > > > > > > > itself from a DoS attack. > > > > > > > > > > > > > > IMO you should consider limiting the number of courteous > > > > > > > clients > > > > > > > the server can hold onto. Let's say that number is 1000. > > > > > > > When > > > > > > > the > > > > > > > server wants to turn a 1001st client into a courteous > > > > > > > client, > > > > > > > it > > > > > > > can simply expire and purge the oldest courteous client > > > > > > > on > > > > > > > its > > > > > > > list. Otherwise, over time, the 24-hour expiry will > > > > > > > reduce > > > > > > > the > > > > > > > set of courteous clients back to zero. > > > > > > > > > > > > > > What do you think? > > > > > > Limiting the number of courteous clients to handle the > > > > > > cases of > > > > > > broken/malicious 4.1 clients seems reasonable as the last > > > > > > resort. > > > > > > > > > > > > I think if a malicious 4.1 clients could mount the server's > > > > > > export, > > > > > > opens a file (to create state) and repeats the same with a > > > > > > different > > > > > > client id then it seems like some basic security was > > > > > > already > > > > > > broken; > > > > > > allowing unauthorized clients to mount server's exports. > > > > > You can do this today with AUTH_SYS. I consider it a genuine > > > > > attack > > > > > surface. > > > > > > > > > > > > > > > > I think if we have to enforce a limit, then it's only for > > > > > > handling > > > > > > of seriously buggy 4.1 clients which should not be the > > > > > > norm. > > > > > > The > > > > > > issue with this is how to pick an optimal number that is > > > > > > suitable > > > > > > for the running server which can be a very slow or a very > > > > > > fast > > > > > > server. > > > > > > > > > > > > Note that even if we impose an limit, that does not > > > > > > completely > > > > > > solve > > > > > > the problem with pynfs 4.0 test since its RPC timeout is > > > > > > configured > > > > > > with 15 secs which just enough to expire 277 clients based > > > > > > on > > > > > > 53ms > > > > > > for each client, unless we limit it ~270 clients which I > > > > > > think > > > > > > it's > > > > > > too low. > > > > > > > > > > > > This is what I plan to do: > > > > > > > > > > > > 1. do not support 4.0 courteous clients, for sure. > > > > > Not supporting 4.0 isn’t an option, IMHO. It is a fully > > > > > supported > > > > > protocol at this time, and the same exposure exists for 4.1, > > > > > it’s > > > > > just a little harder to exploit. 
> > > > > > > > > > If you submit the courteous server patch without support for > > > > > 4.0, > > > > > I > > > > > think it needs to include a plan for how 4.0 will be added > > > > > later. > > > > > > > > > Why is there a problem here? The requirements are the same for > > > > 4.0 > > > > and > > > > 4.1 (or 4.2). If the lease under which the courtesy lock was > > > > established has expired, then that courtesy lock must be > > > > released > > > > if > > > > some other client requests a lock that conflicts with the > > > > cached > > > > lock > > > > (unless the client breaks the courtesy framework by renewing > > > > that > > > > original lease before the conflict occurs). Otherwise, it is > > > > completely > > > > up to the server when it decides to actually release the lock. > > > > > > > > For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells > > > > the > > > > server when the client is actually done with the lease, making > > > > it > > > > easy > > > > to determine when it is safe to release all the courtesy locks. > > > > However > > > > if the client does not send DESTROY_CLIENTID, then we're in the > > > > same > > > > situation with 4.x (x>0) as we would be with bog standard > > > > NFSv4.0. > > > > The > > > > lease has expired, and so the courtesy locks are liable to > > > > being > > > > dropped. > > > I agree the situation is the same for all minor versions. > > > > > > > > > > At Hammerspace we have implemented courtesy locks, and our > > > > strategy > > > > is > > > > that when a conflict occurs, we drop the entire set of courtesy > > > > locks > > > > so that we don't have to deal with the "some locks were > > > > revoked" > > > > scenario. The reason is that when we originally implemented > > > > courtesy > > > > locks, the Linux NFSv4 client support for lock revocation was a > > > > lot > > > > less sophisticated than today. My suggestion is that you might > > > > therefore consider starting along this path, and then refining > > > > the > > > > support to make revocation more nuanced once you are confident > > > > that > > > > the > > > > coarser strategy is working as expected. > > > Dai’s implementation does all that, and takes the coarser > > > approach at > > > the moment. There are plans to explore the more nuanced behavior > > > (by > > > revoking only the conflicting lock instead of dropping the whole > > > lease) after this initial work is merged. > > > > > > The issue is there are certain pathological client behaviors > > > (whether > > > malicious or accidental) that can run the server out of > > > resources, > > > since it is holding onto lease state for a much longer time. We > > > are > > > simply trying to design a lease garbage collection scheme to meet > > > that challenge. > > > > > > I think limiting the number of courteous clients is a simple way > > > to > > > do this, but we could also shorten the courtesy lifetime as more > > > clients enter that state, to ensure that they don’t overrun the > > > server’s memory. Another approach might be to add a shrinker that > > > purges the oldest courteous clients when the server comes under > > > memory pressure. > > > > > > > > We already have a scanner that tries to release all client state > > after > > 1 lease period. Just extend that to do it after 10 lease periods. > > If a > > network partition hasn't recovered after 10 minutes, you probably > > have > > bigger problems. 
> > Currently the courteous server allows 24hr for the network partition > to > heal before releasing all client state. That seems to be excessive > but > it was suggested for longer network partition conditions when > switch/routers > being repaired/upgraded. > > > > > You can limit the number of clients as well, but that leads into a > > rats > > nest of other issues that have nothing to do with courtesy locks > > and > > everything to do with the fact that any client can hold a lot of > > state. > > The issue we currently have with courteous server and pynfs 4.0 tests > is the number of courteous 4.0 clients the server has to expire when > a > share reservation conflict occurs when servicing the OPEN. Each > client > owns only few state in this case so we think the server spent most > time > for deleting client's record in /var/lib/nfs. This is why we plan to > limit the number of courteous clients for now. As a side effect, it > might > also help to reduce resource consumption too. Then kick off a thread or work item to do that asynchronously in the background, and return NFS4ERR_DELAY to the clients that were trying to grab locks in the meantime. The above process is hardly just confined to NFSv4.0 clients. If there is a network partition, then the exact same record deleting needs to be applied to all NFSv4.1 and NFSv4.2 clients that hold locks and are unable to renew their leases, so you might as well make it work for everyone.
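The suggestion above, to do the expensive expiry off the RPC path and make the conflicting client retry, can be sketched as follows. This is an illustrative, single-threaded C model rather than nfsd code: the real server would queue a work item (or wake the laundromat) and return NFS4ERR_DELAY from OPEN/LOCK processing.

/* Illustrative model: on a share or lock conflict with courtesy state,
 * queue the expiry work and return a "try again" status instead of
 * expiring clients inline. All names are hypothetical. */
#include <stdio.h>

#define NFS_OK          0
#define NFS4ERR_DELAY   10008   /* status value defined by RFC 7530/8881 */

static int expiry_queued;       /* stands in for a queued work item */

/* Stand-in for scheduling background work. */
static void queue_courtesy_expiry(void)
{
        expiry_queued = 1;
}

/* Stand-in for the background worker that removes stable-storage
 * records and frees the expired clients' state. */
static void run_courtesy_expiry(void)
{
        if (!expiry_queued)
                return;
        printf("background: expiring conflicting courtesy clients\n");
        expiry_queued = 0;
}

/* OPEN/LOCK path: never do the slow expiry inline. */
static int handle_conflict_with_courtesy_client(void)
{
        queue_courtesy_expiry();
        return NFS4ERR_DELAY;   /* client retries after a short wait */
}

int main(void)
{
        int status = handle_conflict_with_courtesy_client();

        printf("first OPEN attempt: status %d\n", status);
        run_courtesy_expiry();          /* runs asynchronously in reality */
        printf("retried OPEN attempt: status %d\n", NFS_OK);
        return 0;
}

Because the RPC returns immediately, the 15-second pynfs timeout is no longer tied to how long it takes to expire a large backlog of courtesy clients, regardless of minor version.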
On Mon, Nov 29, 2021 at 11:13:34PM -0800, dai.ngo@oracle.com wrote: > Just to be clear, the problem of pynfs with 4.0 is that the server takes > ~55 secs to expire 1026 4.0 courteous clients, which comes out to ~50ms > per client. This causes the test to time out in waiting for RPC reply of > the OPEN that triggers the conflicts. > > I don't know exactly where the time spent in the process of expiring a > client. But as Bruce mentioned, it could be related to the time to access > /var/lib/nfs to remove the client's persistent record. Could you try something like strace -r -$(pidof) -oTRACE and maybe we could take a look at TRACE? My hope would be that there'd be a clear set of syscalls whose time, multiplied by 1026, explains most of that 55 seconds. Then it might be worth checking whether there are any easy optimizations possible. --b.
> On Nov 30, 2021, at 2:22 AM, Dai Ngo <dai.ngo@oracle.com> wrote: > > > On 11/29/21 8:57 PM, Trond Myklebust wrote: >> On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote: >>>> On Nov 29, 2021, at 11:08 PM, Trond Myklebust >>>> <trondmy@hammerspace.com> wrote: >>>> >>>> On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote: >>>>>>> On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@oracle.com> >>>>>>> wrote: >>>>>> >>>>>>> On 11/29/21 1:10 PM, Chuck Lever III wrote: >>>>>>> >>>>>>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo <dai.ngo@oracle.com> >>>>>>>>> wrote: >>>>>>>> >>>>>>>> On 11/29/21 11:03 AM, Chuck Lever III wrote: >>>>>>>>> Hello Dai! >>>>>>>>> >>>>>>>>> >>>>>>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo >>>>>>>>>> <dai.ngo@oracle.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>>>>>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, >>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>> Hi Bruce, >>>>>>>>>>>> >>>>>>>>>>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>>>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, >>>>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>>>> On 11/17/21 9:59 AM, >>>>>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields wrote: >>>>>>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM -0800, >>>>>>>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>>>>>>> Just a reminder that this patch is >>>>>>>>>>>>>>>>>> still >>>>>>>>>>>>>>>>>> waiting for your review. >>>>>>>>>>>>>>>>> Yeah, I was procrastinating and hoping >>>>>>>>>>>>>>>>> yo'ud >>>>>>>>>>>>>>>>> figure out the pynfs >>>>>>>>>>>>>>>>> failure for me.... >>>>>>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by itself >>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>> it passed. I will run >>>>>>>>>>>>>>>> all OPEN tests together with 5.15-rc7 to >>>>>>>>>>>>>>>> see if >>>>>>>>>>>>>>>> the problem you've >>>>>>>>>>>>>>>> seen still there. >>>>>>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 with >>>>>>>>>>>>>>> courteous and non-courteous >>>>>>>>>>>>>>> 5.15-rc7 server. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Nfs4.1 results are the same for both >>>>>>>>>>>>>>> courteous >>>>>>>>>>>>>>> and >>>>>>>>>>>>>>> non-courteous server: >>>>>>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 Warned, >>>>>>>>>>>>>>>> 169 >>>>>>>>>>>>>>>> Passed >>>>>>>>>>>>>>> Results of nfs4.0 with non-courteous server: >>>>>>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 Warned, >>>>>>>>>>>>>>>> 577 >>>>>>>>>>>>>>>> Passed >>>>>>>>>>>>>>> test failed: LOCK24 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Results of nfs4.0 with courteous server: >>>>>>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 Warned, >>>>>>>>>>>>>>>> 575 >>>>>>>>>>>>>>>> Passed >>>>>>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> OPEN18 and OPEN30 test pass if each is run by >>>>>>>>>>>>>>> itself. >>>>>>>>>>>>>> Could well be a bug in the tests, I don't know. >>>>>>>>>>>>> The reason OPEN18 failed was because the test >>>>>>>>>>>>> timed >>>>>>>>>>>>> out waiting for >>>>>>>>>>>>> the reply of an OPEN call. The RPC connection >>>>>>>>>>>>> used >>>>>>>>>>>>> for the test was >>>>>>>>>>>>> configured with 15 secs timeout. Note that OPEN18 >>>>>>>>>>>>> only fails when >>>>>>>>>>>>> the tests were run with 'all' option, this test >>>>>>>>>>>>> passes if it's run >>>>>>>>>>>>> by itself. 
>>>>>>>>>>>>> >>>>>>>>>>>>> With courteous server, by the time OPEN18 runs, >>>>>>>>>>>>> there >>>>>>>>>>>>> are about 1026 >>>>>>>>>>>>> courtesy 4.0 clients on the server and all of >>>>>>>>>>>>> these >>>>>>>>>>>>> clients have opened >>>>>>>>>>>>> the same file X with WRITE access. These clients >>>>>>>>>>>>> were >>>>>>>>>>>>> created by the >>>>>>>>>>>>> previous tests. After each test completed, since >>>>>>>>>>>>> 4.0 >>>>>>>>>>>>> does not have >>>>>>>>>>>>> session, the client states are not cleaned up >>>>>>>>>>>>> immediately on the >>>>>>>>>>>>> server and are allowed to become courtesy >>>>>>>>>>>>> clients. >>>>>>>>>>>>> >>>>>>>>>>>>> When OPEN18 runs (about 20 minutes after the 1st >>>>>>>>>>>>> test >>>>>>>>>>>>> started), it >>>>>>>>>>>>> sends OPEN of file X with OPEN4_SHARE_DENY_WRITE >>>>>>>>>>>>> which causes the >>>>>>>>>>>>> server to check for conflicts with courtesy >>>>>>>>>>>>> clients. >>>>>>>>>>>>> The loop that >>>>>>>>>>>>> checks 1026 courtesy clients for share/access >>>>>>>>>>>>> conflict took less >>>>>>>>>>>>> than 1 sec. But it took about 55 secs, on my VM, >>>>>>>>>>>>> for >>>>>>>>>>>>> the server >>>>>>>>>>>>> to expire all 1026 courtesy clients. >>>>>>>>>>>>> >>>>>>>>>>>>> I modified pynfs to configure the 4.0 RPC >>>>>>>>>>>>> connection >>>>>>>>>>>>> with 60 seconds >>>>>>>>>>>>> timeout and OPEN18 now consistently passed. The >>>>>>>>>>>>> 4.0 >>>>>>>>>>>>> test results are >>>>>>>>>>>>> now the same for courteous and non-courteous >>>>>>>>>>>>> server: >>>>>>>>>>>>> >>>>>>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>>>>>>> >>>>>>>>>>>>> Note that 4.1 tests do not suffer this timeout >>>>>>>>>>>>> problem because the >>>>>>>>>>>>> 4.1 clients and sessions are destroyed after each >>>>>>>>>>>>> test completes. >>>>>>>>>>>> Do you want me to send the patch to increase the >>>>>>>>>>>> timeout for pynfs? >>>>>>>>>>>> or is there any other things you think we should >>>>>>>>>>>> do? >>>>>>>>>>> I don't know. >>>>>>>>>>> >>>>>>>>>>> 55 seconds to clean up 1026 clients is about 50ms per >>>>>>>>>>> client, which is >>>>>>>>>>> pretty slow. I wonder why. I guess it's probably >>>>>>>>>>> updating the stable >>>>>>>>>>> storage information. Is /var/lib/nfs/ on your server >>>>>>>>>>> backed by a hard >>>>>>>>>>> drive or an SSD or something else? >>>>>>>>>> My server is a virtualbox VM that has 1 CPU, 4GB RAM >>>>>>>>>> and >>>>>>>>>> 64GB of hard >>>>>>>>>> disk. I think a production system that supports this >>>>>>>>>> many >>>>>>>>>> clients should >>>>>>>>>> have faster CPUs, faster storage. >>>>>>>>>> >>>>>>>>>>> I wonder if that's an argument for limiting the >>>>>>>>>>> number of >>>>>>>>>>> courtesy >>>>>>>>>>> clients. >>>>>>>>>> I think we might want to treat 4.0 clients a bit >>>>>>>>>> different >>>>>>>>>> from 4.1 >>>>>>>>>> clients. With 4.0, every client will become a courtesy >>>>>>>>>> client after >>>>>>>>>> the client is done with the export and unmounts it. >>>>>>>>> It should be safe for a server to purge a client's lease >>>>>>>>> immediately >>>>>>>>> if there is no open or lock state associated with it. >>>>>>>> In this case, each client has opened files so there are >>>>>>>> open >>>>>>>> states >>>>>>>> associated with them. >>>>>>>> >>>>>>>>> When an NFSv4.0 client unmounts, all files should be >>>>>>>>> closed >>>>>>>>> at that >>>>>>>>> point, >>>>>>>> I'm not sure pynfs does proper clean up after each subtest, >>>>>>>> I >>>>>>>> will >>>>>>>> check. 
There must be state associated with the client in >>>>>>>> order >>>>>>>> for >>>>>>>> it to become courtesy client. >>>>>>> Makes sense. Then a synthetic client like pynfs can DoS a >>>>>>> courteous >>>>>>> server. >>>>>>> >>>>>>> >>>>>>>>> so the server can wait for the lease to expire and purge >>>>>>>>> it >>>>>>>>> normally. Or am I missing something? >>>>>>>> When 4.0 client lease expires and there are still states >>>>>>>> associated >>>>>>>> with the client then the server allows this client to >>>>>>>> become >>>>>>>> courtesy >>>>>>>> client. >>>>>>> I think the same thing happens if an NFSv4.1 client neglects >>>>>>> to >>>>>>> send >>>>>>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client is >>>>>>> broken >>>>>>> or malicious, but the server faces the same issue of >>>>>>> protecting >>>>>>> itself from a DoS attack. >>>>>>> >>>>>>> IMO you should consider limiting the number of courteous >>>>>>> clients >>>>>>> the server can hold onto. Let's say that number is 1000. When >>>>>>> the >>>>>>> server wants to turn a 1001st client into a courteous client, >>>>>>> it >>>>>>> can simply expire and purge the oldest courteous client on >>>>>>> its >>>>>>> list. Otherwise, over time, the 24-hour expiry will reduce >>>>>>> the >>>>>>> set of courteous clients back to zero. >>>>>>> >>>>>>> What do you think? >>>>>> Limiting the number of courteous clients to handle the cases of >>>>>> broken/malicious 4.1 clients seems reasonable as the last >>>>>> resort. >>>>>> >>>>>> I think if a malicious 4.1 clients could mount the server's >>>>>> export, >>>>>> opens a file (to create state) and repeats the same with a >>>>>> different >>>>>> client id then it seems like some basic security was already >>>>>> broken; >>>>>> allowing unauthorized clients to mount server's exports. >>>>> You can do this today with AUTH_SYS. I consider it a genuine >>>>> attack >>>>> surface. >>>>> >>>>> >>>>>> I think if we have to enforce a limit, then it's only for >>>>>> handling >>>>>> of seriously buggy 4.1 clients which should not be the norm. >>>>>> The >>>>>> issue with this is how to pick an optimal number that is >>>>>> suitable >>>>>> for the running server which can be a very slow or a very fast >>>>>> server. >>>>>> >>>>>> Note that even if we impose an limit, that does not completely >>>>>> solve >>>>>> the problem with pynfs 4.0 test since its RPC timeout is >>>>>> configured >>>>>> with 15 secs which just enough to expire 277 clients based on >>>>>> 53ms >>>>>> for each client, unless we limit it ~270 clients which I think >>>>>> it's >>>>>> too low. >>>>>> >>>>>> This is what I plan to do: >>>>>> >>>>>> 1. do not support 4.0 courteous clients, for sure. >>>>> Not supporting 4.0 isn’t an option, IMHO. It is a fully supported >>>>> protocol at this time, and the same exposure exists for 4.1, it’s >>>>> just a little harder to exploit. >>>>> >>>>> If you submit the courteous server patch without support for 4.0, >>>>> I >>>>> think it needs to include a plan for how 4.0 will be added later. >>>>> >>>> Why is there a problem here? The requirements are the same for 4.0 >>>> and >>>> 4.1 (or 4.2). If the lease under which the courtesy lock was >>>> established has expired, then that courtesy lock must be released >>>> if >>>> some other client requests a lock that conflicts with the cached >>>> lock >>>> (unless the client breaks the courtesy framework by renewing that >>>> original lease before the conflict occurs). 
Otherwise, it is >>>> completely >>>> up to the server when it decides to actually release the lock. >>>> >>>> For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells the >>>> server when the client is actually done with the lease, making it >>>> easy >>>> to determine when it is safe to release all the courtesy locks. >>>> However >>>> if the client does not send DESTROY_CLIENTID, then we're in the >>>> same >>>> situation with 4.x (x>0) as we would be with bog standard NFSv4.0. >>>> The >>>> lease has expired, and so the courtesy locks are liable to being >>>> dropped. >>> I agree the situation is the same for all minor versions. >>> >>> >>>> At Hammerspace we have implemented courtesy locks, and our strategy >>>> is >>>> that when a conflict occurs, we drop the entire set of courtesy >>>> locks >>>> so that we don't have to deal with the "some locks were revoked" >>>> scenario. The reason is that when we originally implemented >>>> courtesy >>>> locks, the Linux NFSv4 client support for lock revocation was a lot >>>> less sophisticated than today. My suggestion is that you might >>>> therefore consider starting along this path, and then refining the >>>> support to make revocation more nuanced once you are confident that >>>> the >>>> coarser strategy is working as expected. >>> Dai’s implementation does all that, and takes the coarser approach at >>> the moment. There are plans to explore the more nuanced behavior (by >>> revoking only the conflicting lock instead of dropping the whole >>> lease) after this initial work is merged. >>> >>> The issue is there are certain pathological client behaviors (whether >>> malicious or accidental) that can run the server out of resources, >>> since it is holding onto lease state for a much longer time. We are >>> simply trying to design a lease garbage collection scheme to meet >>> that challenge. >>> >>> I think limiting the number of courteous clients is a simple way to >>> do this, but we could also shorten the courtesy lifetime as more >>> clients enter that state, to ensure that they don’t overrun the >>> server’s memory. Another approach might be to add a shrinker that >>> purges the oldest courteous clients when the server comes under >>> memory pressure. >>> >>> >> We already have a scanner that tries to release all client state after >> 1 lease period. Just extend that to do it after 10 lease periods. If a >> network partition hasn't recovered after 10 minutes, you probably have >> bigger problems. > > Currently the courteous server allows 24hr for the network partition to > heal before releasing all client state. That seems to be excessive but > it was suggested for longer network partition conditions when switch/routers > being repaired/upgraded. Sure, 24 hours is a long time. For the benefit of others on the list, we have seen customer failure scenarios where networks were partitioned for that long. But it's an arbitrary number, and there's no specification for how long a server needs to hold a courtesy client. We can make this number anything that is both convenient for the server implementation and valuable for outage recovery. >> You can limit the number of clients as well, but that leads into a rats >> nest of other issues that have nothing to do with courtesy locks and >> everything to do with the fact that any client can hold a lot of state. 
> > The issue we currently have with courteous server and pynfs 4.0 tests > is the number of courteous 4.0 clients the server has to expire when a > share reservation conflict occurs when servicing the OPEN. Each client > owns only few state in this case so we think the server spent most time > for deleting client's record in /var/lib/nfs. This is why we plan to > limit the number of courteous clients for now. As a side effect, it might > also help to reduce resource consumption too. I am a little concerned that we are trying to optimize a case that won't happen during practice. pynfs does not reflect any kind of realistic or reasonable client behavior -- it's designed to test very specific server operations. All that needs to happen, IMO, is that the server needs to protect itself from resource exhaustion (which may occur for any of the minor versions). So I'm taking a shine to the idea of using a shrinker to trim the older courtesy clients, rather than placing an arbitrary limit on the number of courtesy clients the server can hold at once. A shrinker should take into account the wide variance in the amount of lease state each client might have. -- Chuck Lever
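A minimal sketch of what such a shrinker could look like follows. It is illustrative only and not part of the posted patches: the helpers nfsd_courtesy_client_count() and nfsd_expire_courtesy_clients() are hypothetical stand-ins for whatever bookkeeping the courteous-server code keeps, and the register_shrinker() interface shown is the one used by kernels of the 5.15 era.

#include <linux/shrinker.h>

/* Hypothetical helpers: report the number of courtesy clients, and
 * expire up to 'nr' of the oldest ones, returning how many were freed. */
unsigned long nfsd_courtesy_client_count(void);
unsigned long nfsd_expire_courtesy_clients(unsigned long nr);

static unsigned long
nfsd_courtesy_count(struct shrinker *shrink, struct shrink_control *sc)
{
        /* tell the VM how many objects we could free */
        return nfsd_courtesy_client_count();
}

static unsigned long
nfsd_courtesy_scan(struct shrinker *shrink, struct shrink_control *sc)
{
        /* under memory pressure, expire the oldest courtesy clients first */
        return nfsd_expire_courtesy_clients(sc->nr_to_scan);
}

static struct shrinker nfsd_courtesy_shrinker = {
        .count_objects  = nfsd_courtesy_count,
        .scan_objects   = nfsd_courtesy_scan,
        .seeks          = DEFAULT_SEEKS,
};

/* registered once at nfsd start-up, unregistered on shutdown */
int nfsd_register_courtesy_shrinker(void)
{
        return register_shrinker(&nfsd_courtesy_shrinker);
}

The advantage over a fixed cap is that the courtesy-client set is only trimmed when the VM actually asks for memory back, so an idle server with plenty of RAM can keep holding state for partitioned clients.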
On Tue, Nov 30, 2021 at 03:36:43PM +0000, Chuck Lever III wrote: > I am a little concerned that we are trying to optimize a case > that won't happen during practice. pynfs does not reflect any > kind of realistic or reasonable client behavior -- it's designed > to test very specific server operations. I wonder how hard this problem would be to hit in normal use. I mean, a few hundred or a thousand clients doesn't sound that crazy. This case depends on an open deny, but you could hit the same problem with file locks. Would it be that weird to have a client trying to get a write lock on a file read-locked by a bunch of other clients? --b.
On Tue, 2021-11-30 at 11:05 -0500, Bruce Fields wrote: > On Tue, Nov 30, 2021 at 03:36:43PM +0000, Chuck Lever III wrote: > > I am a little concerned that we are trying to optimize a case > > that won't happen during practice. pynfs does not reflect any > > kind of realistic or reasonable client behavior -- it's designed > > to test very specific server operations. > > I wonder how hard this problem would be to hit in normal use. I > mean, a > few hundred or a thousand clients doesn't sound that crazy. This > case > depends on an open deny, but you could hit the same problem with file > locks. Would it be that weird to have a client trying to get a write > lock on a file read-locked by a bunch of other clients? > That's a scenario that is subject to starvation problems anyway. Particularly so on NFSv4.0, which lacks CB_NOTIFY_LOCK.
On Tue, Nov 30, 2021 at 04:14:10PM +0000, Trond Myklebust wrote: > On Tue, 2021-11-30 at 11:05 -0500, Bruce Fields wrote: > > On Tue, Nov 30, 2021 at 03:36:43PM +0000, Chuck Lever III wrote: > > > I am a little concerned that we are trying to optimize a case > > > that won't happen during practice. pynfs does not reflect any > > > kind of realistic or reasonable client behavior -- it's designed > > > to test very specific server operations. > > > > I wonder how hard this problem would be to hit in normal use. I > > mean, a > > few hundred or a thousand clients doesn't sound that crazy. This > > case > > depends on an open deny, but you could hit the same problem with file > > locks. Would it be that weird to have a client trying to get a write > > lock on a file read-locked by a bunch of other clients? > > > > That's a scenario that is subject to starvation problems anyway. Yes, if it's hundreds of clients continuously grabbing read locks. But if it's something like: send all the readers a signal, then request a write lock as a way to wait for them to finish; then you'd normally expect to get it soon after the last client drops its lock. I don't know, maybe that's uncommon. --b.
On 11/30/21 7:32 AM, Bruce Fields wrote: > On Mon, Nov 29, 2021 at 11:13:34PM -0800, dai.ngo@oracle.com wrote: >> Just to be clear, the problem of pynfs with 4.0 is that the server takes >> ~55 secs to expire 1026 4.0 courteous clients, which comes out to ~50ms >> per client. This causes the test to time out in waiting for RPC reply of >> the OPEN that triggers the conflicts. >> >> I don't know exactly where the time spent in the process of expiring a >> client. But as Bruce mentioned, it could be related to the time to access >> /var/lib/nfs to remove the client's persistent record. > Could you try something like > > strace -r -$(pidof) -oTRACE The strace output does not show where the server spends its time when expiring client state. The client record is removed by nfsd4_umh_cltrack_remove, which does an upcall to the user-space helper /sbin/nfsdcltrack. I used the low-tech debug tool, printk, to measure the time spent by nfsd4_client_record_remove. Here is a sample of the output; the START and END timestamps are in milliseconds: Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d418] clp[ffff888119206040] client_tracking_ops[ffffffffa04bc2e0] Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d459] clp[ffff888119206040] client_tracking_ops[ffffffffa04bc2e0] Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d461] clp[ffff888119206740] client_tracking_ops[ffffffffa04bc2e0] Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d48e] clp[ffff888119206740] client_tracking_ops[ffffffffa04bc2e0] Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d49c] clp[ffff88811b54e000] client_tracking_ops[ffffffffa04bc2e0] Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d4c5] clp[ffff88811b54e000] client_tracking_ops[ffffffffa04bc2e0] The average time to remove a client record is about 50ms, which matches the time reported by the pynfs test. This confirms what Bruce suspected earlier. -Dai > > and maybe we could take a look at TRACE? My hope would be that there'd > be a clear set of syscalls whose time, multiplied by 1026, explains most > of that 55 seconds. Then it might be worth checking whether there are > any easy optimizations possible. > > --b.
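The timing numbers above appear to come from simply bracketing the record-removal upcall with printk()s. A rough reconstruction is shown below; the exact code is an assumption inferred from the log sample (the function itself lives in fs/nfsd/nfs4recover.c), but it illustrates how the per-record figure was measured.

/* Assumed reconstruction of the debug instrumentation, inferred from the
 * log sample above; added inside fs/nfsd/nfs4recover.c. */
void
nfsd4_client_record_remove(struct nfs4_client *clp)
{
        struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);

        printk("nfsd4_client_record_remove: START [0x%x] clp[%p] client_tracking_ops[%p]\n",
               jiffies_to_msecs(jiffies), clp, nn->client_tracking_ops);
        if (nn->client_tracking_ops)
                nn->client_tracking_ops->remove(clp);   /* upcall to /sbin/nfsdcltrack */
        printk("nfsd4_client_record_remove: END [0x%x] clp[%p] client_tracking_ops[%p]\n",
               jiffies_to_msecs(jiffies), clp, nn->client_tracking_ops);
}

Each START/END pair brackets one synchronous upcall to /sbin/nfsdcltrack, which is where the ~50ms per client appears to go.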
On 11/30/21 5:37 AM, Trond Myklebust wrote: > On Mon, 2021-11-29 at 23:22 -0800, dai.ngo@oracle.com wrote: >> On 11/29/21 8:57 PM, Trond Myklebust wrote: >>> On Tue, 2021-11-30 at 04:47 +0000, Chuck Lever III wrote: >>>>> On Nov 29, 2021, at 11:08 PM, Trond Myklebust >>>>> <trondmy@hammerspace.com> wrote: >>>>> >>>>> On Tue, 2021-11-30 at 01:42 +0000, Chuck Lever III wrote: >>>>>>>> On Nov 29, 2021, at 7:11 PM, Dai Ngo <dai.ngo@oracle.com> >>>>>>>> wrote: >>>>>>> >>>>>>>> On 11/29/21 1:10 PM, Chuck Lever III wrote: >>>>>>>> >>>>>>>>>> On Nov 29, 2021, at 2:36 PM, Dai Ngo >>>>>>>>>> <dai.ngo@oracle.com> >>>>>>>>>> wrote: >>>>>>>>> On 11/29/21 11:03 AM, Chuck Lever III wrote: >>>>>>>>>> Hello Dai! >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On Nov 29, 2021, at 1:32 PM, Dai Ngo >>>>>>>>>>> <dai.ngo@oracle.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 11/29/21 9:30 AM, J. Bruce Fields wrote: >>>>>>>>>>>> On Mon, Nov 29, 2021 at 09:13:16AM -0800, >>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>> Hi Bruce, >>>>>>>>>>>>> >>>>>>>>>>>>> On 11/21/21 7:04 PM, dai.ngo@oracle.com wrote: >>>>>>>>>>>>>> On 11/17/21 4:34 PM, J. Bruce Fields wrote: >>>>>>>>>>>>>>> On Wed, Nov 17, 2021 at 01:46:02PM -0800, >>>>>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>>>>> On 11/17/21 9:59 AM, >>>>>>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>>>>>> On 11/17/21 6:14 AM, J. Bruce Fields >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> On Tue, Nov 16, 2021 at 03:06:32PM - >>>>>>>>>>>>>>>>>> 0800, >>>>>>>>>>>>>>>>>> dai.ngo@oracle.com wrote: >>>>>>>>>>>>>>>>>>> Just a reminder that this patch is >>>>>>>>>>>>>>>>>>> still >>>>>>>>>>>>>>>>>>> waiting for your review. >>>>>>>>>>>>>>>>>> Yeah, I was procrastinating and >>>>>>>>>>>>>>>>>> hoping >>>>>>>>>>>>>>>>>> yo'ud >>>>>>>>>>>>>>>>>> figure out the pynfs >>>>>>>>>>>>>>>>>> failure for me.... >>>>>>>>>>>>>>>>> Last time I ran 4.0 OPEN18 test by >>>>>>>>>>>>>>>>> itself >>>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>> it passed. I will run >>>>>>>>>>>>>>>>> all OPEN tests together with 5.15-rc7 >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> see if >>>>>>>>>>>>>>>>> the problem you've >>>>>>>>>>>>>>>>> seen still there. >>>>>>>>>>>>>>>> I ran all tests in nfsv4.1 and nfsv4.0 >>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>> courteous and non-courteous >>>>>>>>>>>>>>>> 5.15-rc7 server. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Nfs4.1 results are the same for both >>>>>>>>>>>>>>>> courteous >>>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>> non-courteous server: >>>>>>>>>>>>>>>>> Of those: 0 Skipped, 0 Failed, 0 >>>>>>>>>>>>>>>>> Warned, >>>>>>>>>>>>>>>>> 169 >>>>>>>>>>>>>>>>> Passed >>>>>>>>>>>>>>>> Results of nfs4.0 with non-courteous >>>>>>>>>>>>>>>> server: >>>>>>>>>>>>>>>>> Of those: 8 Skipped, 1 Failed, 0 >>>>>>>>>>>>>>>>> Warned, >>>>>>>>>>>>>>>>> 577 >>>>>>>>>>>>>>>>> Passed >>>>>>>>>>>>>>>> test failed: LOCK24 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Results of nfs4.0 with courteous server: >>>>>>>>>>>>>>>>> Of those: 8 Skipped, 3 Failed, 0 >>>>>>>>>>>>>>>>> Warned, >>>>>>>>>>>>>>>>> 575 >>>>>>>>>>>>>>>>> Passed >>>>>>>>>>>>>>>> tests failed: LOCK24, OPEN18, OPEN30 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> OPEN18 and OPEN30 test pass if each is >>>>>>>>>>>>>>>> run by >>>>>>>>>>>>>>>> itself. >>>>>>>>>>>>>>> Could well be a bug in the tests, I don't >>>>>>>>>>>>>>> know. >>>>>>>>>>>>>> The reason OPEN18 failed was because the test >>>>>>>>>>>>>> timed >>>>>>>>>>>>>> out waiting for >>>>>>>>>>>>>> the reply of an OPEN call. 
The RPC connection >>>>>>>>>>>>>> used >>>>>>>>>>>>>> for the test was >>>>>>>>>>>>>> configured with 15 secs timeout. Note that >>>>>>>>>>>>>> OPEN18 >>>>>>>>>>>>>> only fails when >>>>>>>>>>>>>> the tests were run with 'all' option, this >>>>>>>>>>>>>> test >>>>>>>>>>>>>> passes if it's run >>>>>>>>>>>>>> by itself. >>>>>>>>>>>>>> >>>>>>>>>>>>>> With courteous server, by the time OPEN18 >>>>>>>>>>>>>> runs, >>>>>>>>>>>>>> there >>>>>>>>>>>>>> are about 1026 >>>>>>>>>>>>>> courtesy 4.0 clients on the server and all of >>>>>>>>>>>>>> these >>>>>>>>>>>>>> clients have opened >>>>>>>>>>>>>> the same file X with WRITE access. These >>>>>>>>>>>>>> clients >>>>>>>>>>>>>> were >>>>>>>>>>>>>> created by the >>>>>>>>>>>>>> previous tests. After each test completed, >>>>>>>>>>>>>> since >>>>>>>>>>>>>> 4.0 >>>>>>>>>>>>>> does not have >>>>>>>>>>>>>> session, the client states are not cleaned up >>>>>>>>>>>>>> immediately on the >>>>>>>>>>>>>> server and are allowed to become courtesy >>>>>>>>>>>>>> clients. >>>>>>>>>>>>>> >>>>>>>>>>>>>> When OPEN18 runs (about 20 minutes after the >>>>>>>>>>>>>> 1st >>>>>>>>>>>>>> test >>>>>>>>>>>>>> started), it >>>>>>>>>>>>>> sends OPEN of file X with >>>>>>>>>>>>>> OPEN4_SHARE_DENY_WRITE >>>>>>>>>>>>>> which causes the >>>>>>>>>>>>>> server to check for conflicts with courtesy >>>>>>>>>>>>>> clients. >>>>>>>>>>>>>> The loop that >>>>>>>>>>>>>> checks 1026 courtesy clients for share/access >>>>>>>>>>>>>> conflict took less >>>>>>>>>>>>>> than 1 sec. But it took about 55 secs, on my >>>>>>>>>>>>>> VM, >>>>>>>>>>>>>> for >>>>>>>>>>>>>> the server >>>>>>>>>>>>>> to expire all 1026 courtesy clients. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I modified pynfs to configure the 4.0 RPC >>>>>>>>>>>>>> connection >>>>>>>>>>>>>> with 60 seconds >>>>>>>>>>>>>> timeout and OPEN18 now consistently passed. >>>>>>>>>>>>>> The >>>>>>>>>>>>>> 4.0 >>>>>>>>>>>>>> test results are >>>>>>>>>>>>>> now the same for courteous and non-courteous >>>>>>>>>>>>>> server: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 8 Skipped, 1 Failed, 0 Warned, 577 Passed >>>>>>>>>>>>>> >>>>>>>>>>>>>> Note that 4.1 tests do not suffer this >>>>>>>>>>>>>> timeout >>>>>>>>>>>>>> problem because the >>>>>>>>>>>>>> 4.1 clients and sessions are destroyed after >>>>>>>>>>>>>> each >>>>>>>>>>>>>> test completes. >>>>>>>>>>>>> Do you want me to send the patch to increase >>>>>>>>>>>>> the >>>>>>>>>>>>> timeout for pynfs? >>>>>>>>>>>>> or is there any other things you think we >>>>>>>>>>>>> should >>>>>>>>>>>>> do? >>>>>>>>>>>> I don't know. >>>>>>>>>>>> >>>>>>>>>>>> 55 seconds to clean up 1026 clients is about 50ms >>>>>>>>>>>> per >>>>>>>>>>>> client, which is >>>>>>>>>>>> pretty slow. I wonder why. I guess it's >>>>>>>>>>>> probably >>>>>>>>>>>> updating the stable >>>>>>>>>>>> storage information. Is /var/lib/nfs/ on your >>>>>>>>>>>> server >>>>>>>>>>>> backed by a hard >>>>>>>>>>>> drive or an SSD or something else? >>>>>>>>>>> My server is a virtualbox VM that has 1 CPU, 4GB >>>>>>>>>>> RAM >>>>>>>>>>> and >>>>>>>>>>> 64GB of hard >>>>>>>>>>> disk. I think a production system that supports >>>>>>>>>>> this >>>>>>>>>>> many >>>>>>>>>>> clients should >>>>>>>>>>> have faster CPUs, faster storage. >>>>>>>>>>> >>>>>>>>>>>> I wonder if that's an argument for limiting the >>>>>>>>>>>> number of >>>>>>>>>>>> courtesy >>>>>>>>>>>> clients. >>>>>>>>>>> I think we might want to treat 4.0 clients a bit >>>>>>>>>>> different >>>>>>>>>>> from 4.1 >>>>>>>>>>> clients. 
With 4.0, every client will become a >>>>>>>>>>> courtesy >>>>>>>>>>> client after >>>>>>>>>>> the client is done with the export and unmounts it. >>>>>>>>>> It should be safe for a server to purge a client's >>>>>>>>>> lease >>>>>>>>>> immediately >>>>>>>>>> if there is no open or lock state associated with it. >>>>>>>>> In this case, each client has opened files so there are >>>>>>>>> open >>>>>>>>> states >>>>>>>>> associated with them. >>>>>>>>> >>>>>>>>>> When an NFSv4.0 client unmounts, all files should be >>>>>>>>>> closed >>>>>>>>>> at that >>>>>>>>>> point, >>>>>>>>> I'm not sure pynfs does proper clean up after each >>>>>>>>> subtest, >>>>>>>>> I >>>>>>>>> will >>>>>>>>> check. There must be state associated with the client >>>>>>>>> in >>>>>>>>> order >>>>>>>>> for >>>>>>>>> it to become courtesy client. >>>>>>>> Makes sense. Then a synthetic client like pynfs can DoS a >>>>>>>> courteous >>>>>>>> server. >>>>>>>> >>>>>>>> >>>>>>>>>> so the server can wait for the lease to expire and >>>>>>>>>> purge >>>>>>>>>> it >>>>>>>>>> normally. Or am I missing something? >>>>>>>>> When 4.0 client lease expires and there are still >>>>>>>>> states >>>>>>>>> associated >>>>>>>>> with the client then the server allows this client to >>>>>>>>> become >>>>>>>>> courtesy >>>>>>>>> client. >>>>>>>> I think the same thing happens if an NFSv4.1 client >>>>>>>> neglects >>>>>>>> to >>>>>>>> send >>>>>>>> DESTROY_SESSION / DESTROY_CLIENTID. Either such a client >>>>>>>> is >>>>>>>> broken >>>>>>>> or malicious, but the server faces the same issue of >>>>>>>> protecting >>>>>>>> itself from a DoS attack. >>>>>>>> >>>>>>>> IMO you should consider limiting the number of courteous >>>>>>>> clients >>>>>>>> the server can hold onto. Let's say that number is 1000. >>>>>>>> When >>>>>>>> the >>>>>>>> server wants to turn a 1001st client into a courteous >>>>>>>> client, >>>>>>>> it >>>>>>>> can simply expire and purge the oldest courteous client >>>>>>>> on >>>>>>>> its >>>>>>>> list. Otherwise, over time, the 24-hour expiry will >>>>>>>> reduce >>>>>>>> the >>>>>>>> set of courteous clients back to zero. >>>>>>>> >>>>>>>> What do you think? >>>>>>> Limiting the number of courteous clients to handle the >>>>>>> cases of >>>>>>> broken/malicious 4.1 clients seems reasonable as the last >>>>>>> resort. >>>>>>> >>>>>>> I think if a malicious 4.1 clients could mount the server's >>>>>>> export, >>>>>>> opens a file (to create state) and repeats the same with a >>>>>>> different >>>>>>> client id then it seems like some basic security was >>>>>>> already >>>>>>> broken; >>>>>>> allowing unauthorized clients to mount server's exports. >>>>>> You can do this today with AUTH_SYS. I consider it a genuine >>>>>> attack >>>>>> surface. >>>>>> >>>>>> >>>>>>> I think if we have to enforce a limit, then it's only for >>>>>>> handling >>>>>>> of seriously buggy 4.1 clients which should not be the >>>>>>> norm. >>>>>>> The >>>>>>> issue with this is how to pick an optimal number that is >>>>>>> suitable >>>>>>> for the running server which can be a very slow or a very >>>>>>> fast >>>>>>> server. >>>>>>> >>>>>>> Note that even if we impose an limit, that does not >>>>>>> completely >>>>>>> solve >>>>>>> the problem with pynfs 4.0 test since its RPC timeout is >>>>>>> configured >>>>>>> with 15 secs which just enough to expire 277 clients based >>>>>>> on >>>>>>> 53ms >>>>>>> for each client, unless we limit it ~270 clients which I >>>>>>> think >>>>>>> it's >>>>>>> too low. 
>>>>>>> >>>>>>> This is what I plan to do: >>>>>>> >>>>>>> 1. do not support 4.0 courteous clients, for sure. >>>>>> Not supporting 4.0 isn’t an option, IMHO. It is a fully >>>>>> supported >>>>>> protocol at this time, and the same exposure exists for 4.1, >>>>>> it’s >>>>>> just a little harder to exploit. >>>>>> >>>>>> If you submit the courteous server patch without support for >>>>>> 4.0, >>>>>> I >>>>>> think it needs to include a plan for how 4.0 will be added >>>>>> later. >>>>>> >>>>> Why is there a problem here? The requirements are the same for >>>>> 4.0 >>>>> and >>>>> 4.1 (or 4.2). If the lease under which the courtesy lock was >>>>> established has expired, then that courtesy lock must be >>>>> released >>>>> if >>>>> some other client requests a lock that conflicts with the >>>>> cached >>>>> lock >>>>> (unless the client breaks the courtesy framework by renewing >>>>> that >>>>> original lease before the conflict occurs). Otherwise, it is >>>>> completely >>>>> up to the server when it decides to actually release the lock. >>>>> >>>>> For NFSv4.1 and NFSv4.2, we have DESTROY_CLIENTID, which tells >>>>> the >>>>> server when the client is actually done with the lease, making >>>>> it >>>>> easy >>>>> to determine when it is safe to release all the courtesy locks. >>>>> However >>>>> if the client does not send DESTROY_CLIENTID, then we're in the >>>>> same >>>>> situation with 4.x (x>0) as we would be with bog standard >>>>> NFSv4.0. >>>>> The >>>>> lease has expired, and so the courtesy locks are liable to >>>>> being >>>>> dropped. >>>> I agree the situation is the same for all minor versions. >>>> >>>> >>>>> At Hammerspace we have implemented courtesy locks, and our >>>>> strategy >>>>> is >>>>> that when a conflict occurs, we drop the entire set of courtesy >>>>> locks >>>>> so that we don't have to deal with the "some locks were >>>>> revoked" >>>>> scenario. The reason is that when we originally implemented >>>>> courtesy >>>>> locks, the Linux NFSv4 client support for lock revocation was a >>>>> lot >>>>> less sophisticated than today. My suggestion is that you might >>>>> therefore consider starting along this path, and then refining >>>>> the >>>>> support to make revocation more nuanced once you are confident >>>>> that >>>>> the >>>>> coarser strategy is working as expected. >>>> Dai’s implementation does all that, and takes the coarser >>>> approach at >>>> the moment. There are plans to explore the more nuanced behavior >>>> (by >>>> revoking only the conflicting lock instead of dropping the whole >>>> lease) after this initial work is merged. >>>> >>>> The issue is there are certain pathological client behaviors >>>> (whether >>>> malicious or accidental) that can run the server out of >>>> resources, >>>> since it is holding onto lease state for a much longer time. We >>>> are >>>> simply trying to design a lease garbage collection scheme to meet >>>> that challenge. >>>> >>>> I think limiting the number of courteous clients is a simple way >>>> to >>>> do this, but we could also shorten the courtesy lifetime as more >>>> clients enter that state, to ensure that they don’t overrun the >>>> server’s memory. Another approach might be to add a shrinker that >>>> purges the oldest courteous clients when the server comes under >>>> memory pressure. >>>> >>>> >>> We already have a scanner that tries to release all client state >>> after >>> 1 lease period. Just extend that to do it after 10 lease periods. 
>>> If a >>> network partition hasn't recovered after 10 minutes, you probably >>> have >>> bigger problems. >> Currently the courteous server allows 24hr for the network partition >> to >> heal before releasing all client state. That seems to be excessive >> but >> it was suggested for longer network partition conditions when >> switch/routers >> being repaired/upgraded. >> >>> You can limit the number of clients as well, but that leads into a >>> rats >>> nest of other issues that have nothing to do with courtesy locks >>> and >>> everything to do with the fact that any client can hold a lot of >>> state. >> The issue we currently have with courteous server and pynfs 4.0 tests >> is the number of courteous 4.0 clients the server has to expire when >> a >> share reservation conflict occurs when servicing the OPEN. Each >> client >> owns only few state in this case so we think the server spent most >> time >> for deleting client's record in /var/lib/nfs. This is why we plan to >> limit the number of courteous clients for now. As a side effect, it >> might >> also help to reduce resource consumption too. > Then kick off a thread or work item to do that asynchronously in the > background, and return NFS4ERR_DELAY to the clients that were trying to > grab locks in the meantime. Thanks Trond, I think this is a reasonable approach. The behavior would be similar to a delegation recall during the OPEN. My plan is: 1. If the number of conflicting clients is less than 100 (a number that should cover realistic usage), release all their state synchronously in the OPEN call and return NFS4_OK to the NFS client. Most conflicts should be handled by this case. 2. If the number of conflicting clients is more than 100, release the state of the first 100 clients as in (1), trigger the laundromat thread to release the state of the remaining conflicting clients, and return NFS4ERR_DELAY to the NFS client. This should be a rare condition. -Dai > > The above process is hardly just confined to NFSv4.0 clients. If there > is a network partition, then the exact same record deleting needs to be > applied to all NFSv4.1 and NFSv4.2 clients that hold locks and are > unable to renew their leases, so you might as well make it work for > everyone. >
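Sketched in code, the two-case plan might look roughly like the following. The conflict list, the cl_cs_list linkage and the NFSD4_CLIENT_COURTESY_EXPIRE flag are invented names for illustration and do not come from the posted patches; expire_client(), laundry_wq and nn->laundromat_work are existing nfsd internals, used here on the assumption that they fit this shape.

/* Hypothetical sketch of the plan described above.  "conflicts" is a
 * list of courtesy clients whose state conflicts with the incoming OPEN. */
#define NFSD_COURTESY_SYNC_EXPIRE_LIMIT 100

static __be32
nfsd4_resolve_courtesy_conflicts(struct list_head *conflicts, struct nfsd_net *nn)
{
        struct nfs4_client *clp, *tmp;
        unsigned int nr = 0;
        bool deferred = false;

        list_for_each_entry_safe(clp, tmp, conflicts, cl_cs_list) {
                if (nr++ < NFSD_COURTESY_SYNC_EXPIRE_LIMIT) {
                        /* case 1: expire synchronously in the OPEN call */
                        expire_client(clp);
                } else {
                        /* case 2: hand the remainder to the laundromat */
                        set_bit(NFSD4_CLIENT_COURTESY_EXPIRE, &clp->cl_flags);
                        deferred = true;
                }
        }

        if (deferred) {
                mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
                return nfserr_jukebox;          /* NFS4ERR_DELAY */
        }
        return nfs_ok;                          /* all conflicts resolved */
}

As the follow-up below notes, it may be simpler still to drop the synchronous fast path and always defer the expiry to the laundromat, returning NFS4ERR_DELAY whenever a courtesy conflict is found.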
On Tue, Nov 30, 2021 at 07:52:10PM -0800, dai.ngo@oracle.com wrote: > On 11/30/21 5:37 AM, Trond Myklebust wrote: > >Then kick off a thread or work item to do that asynchronously in the > >background, and return NFS4ERR_DELAY to the clients that were trying to > >grab locks in the meantime. > > Thanks Trond, I think this is a reasonable approach. The behavior would > be similar to a delegation recall during the OPEN. > > My plan is: > > 1. If the number of conflict clients is less than 100 (some numbers that > cover realistic usage) then release all their state synchronously in > the OPEN call, and returns NFS4_OK to the NFS client. Most of conflicts > should be handled by this case. > > 2. If the number of conflict clients is more than 100 then release the > state of the 1st 100 clients as in (1) and trigger the laundromat thread > to release state of the rest of the conflict clients, and return > NFS4ERR_DELAY to the NFS client. This should be a rare condition. Honestly, conflict with a courtesy client is itself not going to be that common, so personally I'd start simple and handle everything with the asynchronous approach. --b.
On Tue, Nov 30, 2021 at 07:50:13PM -0800, dai.ngo@oracle.com wrote: > > On 11/30/21 7:32 AM, Bruce Fields wrote: > >On Mon, Nov 29, 2021 at 11:13:34PM -0800, dai.ngo@oracle.com wrote: > >>Just to be clear, the problem of pynfs with 4.0 is that the server takes > >>~55 secs to expire 1026 4.0 courteous clients, which comes out to ~50ms > >>per client. This causes the test to time out in waiting for RPC reply of > >>the OPEN that triggers the conflicts. > >> > >>I don't know exactly where the time spent in the process of expiring a > >>client. But as Bruce mentioned, it could be related to the time to access > >>/var/lib/nfs to remove the client's persistent record. > >Could you try something like > > > > strace -r -$(pidof) -oTRACE Oops, I mean $(pidof nfsdcld). But, your system isn't using that: > > Strace does not have any info that shows where the server spent time when > expiring client state. The client record is removed by nfsd4_umh_cltrack_remove > doing upcall to user space helper /sbin/nfsdcltrack to do the job. I used > the low-tech debug tool, printk, to measure the time spent by > nfsd4_client_record_remove. Here is a sample of the output, START and END > are in milliseconds: > > Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d418] clp[ffff888119206040] client_tracking_ops[ffffffffa04bc2e0] > Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d459] clp[ffff888119206040] client_tracking_ops[ffffffffa04bc2e0] > Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d461] clp[ffff888119206740] client_tracking_ops[ffffffffa04bc2e0] > Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d48e] clp[ffff888119206740] client_tracking_ops[ffffffffa04bc2e0] > Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: START [0x15d49c] clp[ffff88811b54e000] client_tracking_ops[ffffffffa04bc2e0] > Nov 30 12:31:04 localhost kernel: nfsd4_client_record_remove: END [0x15d4c5] clp[ffff88811b54e000] client_tracking_ops[ffffffffa04bc2e0] > > The average time spent to remove the client record is about ~50ms, matches > with the time reported by pynfs test. This confirms what Bruce suspected > earlier. OK, good to know. It'd be interesting to dig into where nfsdcltrack is spending its time, which we could do by replacing it with a wrapper that runs the real nfsdcltrack under strace. Though maybe it'd be better to do this on a system using nfsdcld, since that's what we're transitioning to. --b.
Do you have a public git tree with your latest patches? --b.
On Wed, Dec 01, 2021 at 09:36:30AM -0500, Bruce Fields wrote: > OK, good to know. It'd be interesting to dig into where nfsdcltrack is > spending its time, which we could do by replacing it with a wrapper that > runs the real nfsdcltrack under strace. > > Though maybe it'd be better to do this on a system using nfsdcld, since > that's what we're transitioning to. Trying that on a test VM here, I see each upcall doing 3 fdatasyncs() of an sqlite-journal file. On my setup, each of those is taking a few milliseconds. I wonder if it can do better. --b.
On Wed, Dec 01, 2021 at 12:42:05PM -0500, Bruce Fields wrote: > On Wed, Dec 01, 2021 at 09:36:30AM -0500, Bruce Fields wrote: > > OK, good to know. It'd be interesting to dig into where nfsdcltrack is > > spending its time, which we could do by replacing it with a wrapper that > > runs the real nfsdcltrack under strace. > > > > Though maybe it'd be better to do this on a system using nfsdcld, since > > that's what we're transitioning to. > > Trying that on a test VM here, I see each upcall doing 3 fdatasyncs() of > an sqlite-journal file. On my setup, each of those is taking a few > milliseconds. I wonder if it an do better. If I understand the sqlite documentation correctly, I *think* that if we use journal_mode WAL with synchronous FULL, we should get the assurances nfsd needs with one sync per transaction. --b.
On 12/1/21 6:51 AM, Bruce Fields wrote: > Do you have a public git tree with your latest patches? No, I don't but I can push it to Chuck's public tree. I need to prepare the patch. -Dai > > --b.
On Wed, Dec 01, 2021 at 10:47:28AM -0800, dai.ngo@oracle.com wrote: > > On 12/1/21 6:51 AM, Bruce Fields wrote: > >Do you have a public git tree with your latest patches? > > No, I don't but I can push it to Chuck's public tree. I need to prepare the patch. OK, it's not a big deal. --b.
On Wed, Dec 01, 2021 at 01:03:39PM -0500, Bruce Fields wrote: > On Wed, Dec 01, 2021 at 12:42:05PM -0500, Bruce Fields wrote: > > On Wed, Dec 01, 2021 at 09:36:30AM -0500, Bruce Fields wrote: > > > OK, good to know. It'd be interesting to dig into where nfsdcltrack is > > > spending its time, which we could do by replacing it with a wrapper that > > > runs the real nfsdcltrack under strace. > > > > > > Though maybe it'd be better to do this on a system using nfsdcld, since > > > that's what we're transitioning to. > > > > Trying that on a test VM here, I see each upcall doing 3 fdatasyncs() of > > an sqlite-journal file. On my setup, each of those is taking a few > > milliseconds. I wonder if it an do better. > > If I understand the sqlite documentation correctly, I *think* that if we > use journal_mode WAL with synchronous FULL, we should get the assurances > nfsd needs with one sync per transaction. So I *think* that would mean just doing something like (untested, don't have much idea what I'm doing): diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c index 03016fb95823..b30f2614497b 100644 --- a/utils/nfsdcld/sqlite.c +++ b/utils/nfsdcld/sqlite.c @@ -826,6 +826,13 @@ sqlite_prepare_dbh(const char *topdir) goto out_close; } + ret = sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL); + if (ret) + goto out_close; + ret = sqlite3_exec(dbh, "PRAGMA synchronous = FULL;", NULL, NULL, NULL); + if (ret) + goto out_close; + ret = sqlite_query_schema_version(); switch (ret) { case CLD_SQLITE_LATEST_SCHEMA_VERSION: I also wonder how expensive may be the extra overhead of starting up nfsdcltrack each time. --b.
> On Dec 1, 2021, at 9:51 AM, Bruce Fields <bfields@fieldses.org> wrote: > > Do you have a public git tree with your latest patches? > > --b. Dai's patches have been pushed to the nfsd-courteous-server topic branch at git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git I can fold them into my for-next branch if we agree they are ready for broader test exposure. -- Chuck Lever
On Wed, Dec 01, 2021 at 02:50:50PM -0500, Bruce Fields wrote: > On Wed, Dec 01, 2021 at 01:03:39PM -0500, Bruce Fields wrote: > > On Wed, Dec 01, 2021 at 12:42:05PM -0500, Bruce Fields wrote: > > > On Wed, Dec 01, 2021 at 09:36:30AM -0500, Bruce Fields wrote: > > > > OK, good to know. It'd be interesting to dig into where nfsdcltrack is > > > > spending its time, which we could do by replacing it with a wrapper that > > > > runs the real nfsdcltrack under strace. > > > > > > > > Though maybe it'd be better to do this on a system using nfsdcld, since > > > > that's what we're transitioning to. > > > > > > Trying that on a test VM here, I see each upcall doing 3 fdatasyncs() of > > > an sqlite-journal file. On my setup, each of those is taking a few > > > milliseconds. I wonder if it an do better. > > > > If I understand the sqlite documentation correctly, I *think* that if we > > use journal_mode WAL with synchronous FULL, we should get the assurances > > nfsd needs with one sync per transaction. > > So I *think* that would mean just doing something like (untested, don't have > much idea what I'm doing): OK, tried that out on my test VM, and: yes, the resulting strace was much simpler (and, in particular, had only one fdatasync per upcall instead of 3), and total time to expire 1000 courtesy clients was 6.5 seconds instead of 15.9. So, I'll clean up that patch and pass it along to Steve D. This is all a bit of a derail, I know, but I suspect this will be a bottleneck in other cases too, like when a lot of clients are reclaiming after reboot. We do need nfsdcld to sync to disk before returning to the kernel, so this probably can't be further optimized without doing something more complicated to allow some kind of parallelism and batching. So if you have a ton of clients you'll just need /var/lib/nfs to be on low-latency storage. --b. > > diff --git a/utils/nfsdcld/sqlite.c b/utils/nfsdcld/sqlite.c > index 03016fb95823..b30f2614497b 100644 > --- a/utils/nfsdcld/sqlite.c > +++ b/utils/nfsdcld/sqlite.c > @@ -826,6 +826,13 @@ sqlite_prepare_dbh(const char *topdir) > goto out_close; > } > > + ret = sqlite3_exec(dbh, "PRAGMA journal_mode = WAL;", NULL, NULL, NULL); > + if (ret) > + goto out_close; > + ret = sqlite3_exec(dbh, "PRAGMA synchronous = FULL;", NULL, NULL, NULL); > + if (ret) > + goto out_close; > + > ret = sqlite_query_schema_version(); > switch (ret) { > case CLD_SQLITE_LATEST_SCHEMA_VERSION: > > I also wonder how expensive may be the extra overhead of starting up > nfsdcltrack each time. > > --b.