mbox series

[RFC,v4,0/2] nfsd: Initial implementation of NFSv4 Courteous Server

Message ID 20210924205252.82502-1-dai.ngo@oracle.com (mailing list archive)
Headers show
Series nfsd: Initial implementation of NFSv4 Courteous Server | expand

Message

Dai Ngo Sept. 24, 2021, 8:52 p.m. UTC
Hi Bruce,

This series of patches implement the NFSv4 Courteous Server.

A server which does not immediately expunge the state on lease expiration
is known as a Courteous Server.  A Courteous Server continues to recognize
previously generated state tokens as valid until conflict arises between
the expired state and the requests from another client, or the server
reboots.

The v2 patch includes the following:

. add new callback, lm_expire_lock, to lock_manager_operations to
  allow the lock manager to take appropriate action with conflict lock.

. handle conflicts of NFSv4 locks with NFSv3/NLM and local locks.

. expire courtesy client after 24hr if client has not reconnected.

. do not allow expired client to become courtesy client if there are
  waiters for client's locks.

. modify client_info_show to show courtesy client and seconds from
  last renew.

. fix a problem with NFSv4.1 server where the it keeps returning
  SEQ4_STATUS_CB_PATH_DOWN in the successful SEQUENCE reply, after
  the courtesy client re-connects, causing the client to keep sending
  BCTS requests to server.

The v3 patch includes the following:

. modified posix_test_lock to check and resolve conflict locks
  to handle NLM TEST and NFSv4 LOCKT requests.

. separate out fix for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.

The v4 patch includes:

. rework nfsd_check_courtesy to avoid dead lock of fl_lock and client_lock
  by asking the laudromat thread to destroy the courtesy client.

. handle NFSv4 share reservation conflicts with courtesy client. This
  includes conflicts between access mode and deny mode and vice versa.

. drop the patch for back channel stuck in SEQ4_STATUS_CB_PATH_DOWN.

NOTE: I will submit pynfs tests for courteous server including tests
for share reservation conflicts in a separate patch.

Comments

Bruce Fields Sept. 27, 2021, 8:14 p.m. UTC | #1
The file_rwsem is used for /proc/locks; only the code that produces the
/proc/locks output calls down_write, the rest only calls down_read.

I assumed that it was OK to nest read acquisitions of a rwsem, but I
think that's wrong.

I think it should be no big deal to move the lm_expire_lock(.,0) call
outside of the file_rwsem?

--b.

[  959.807364] ============================================
[  959.807803] WARNING: possible recursive locking detected
[  959.808228] 5.15.0-rc2-00009-g4e5af4d2635a #533 Not tainted
[  959.808675] --------------------------------------------
[  959.809189] nfsd/5675 is trying to acquire lock:
[  959.809664] ffffffff8519e470 (file_rwsem){.+.+}-{0:0}, at: locks_remove_posix+0x37f/0x4e0
[  959.810647] 
               but task is already holding lock:
[  959.811097] ffffffff8519e470 (file_rwsem){.+.+}-{0:0}, at: nfsd4_lock+0xcb9/0x3850 [nfsd]
[  959.812147] 
               other info that might help us debug this:
[  959.812698]  Possible unsafe locking scenario:

[  959.813189]        CPU0
[  959.813362]        ----
[  959.813544]   lock(file_rwsem);
[  959.813812]   lock(file_rwsem);
[  959.814078] 
                *** DEADLOCK ***

[  959.814386]  May be due to missing lock nesting notation

[  959.814968] 3 locks held by nfsd/5675:
[  959.815315]  #0: ffff888007d42bc8 (&rp->rp_mutex){+.+.}-{3:3}, at: nfs4_preprocess_seqid_op+0x395/0x730 [nfsd]
[  959.816546]  #1: ffff88800f378b70 (&stp->st_mutex#2){+.+.}-{3:3}, at: nfsd4_lock+0x1f91/0x3850 [nfsd]
[  959.817697]  #2: ffffffff8519e470 (file_rwsem){.+.+}-{0:0}, at: nfsd4_lock+0xcb9/0x3850 [nfsd]
[  959.818755] 
               stack backtrace:
[  959.819010] CPU: 2 PID: 5675 Comm: nfsd Not tainted 5.15.0-rc2-00009-g4e5af4d2635a #533
[  959.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014
[  959.820637] Call Trace:
[  959.820759]  dump_stack_lvl+0x45/0x59
[  959.821016]  __lock_acquire.cold+0x175/0x3a5
[  959.821316]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  959.821741]  lock_acquire+0x1a6/0x4b0
[  959.821976]  ? locks_remove_posix+0x37f/0x4e0
[  959.822316]  ? lock_release+0x6d0/0x6d0
[  959.822591]  ? find_held_lock+0x2c/0x110
[  959.822852]  ? lock_is_held_type+0xd5/0x130
[  959.823139]  posix_lock_inode+0x143/0x1ab0
[  959.823414]  ? locks_remove_posix+0x37f/0x4e0
[  959.823739]  ? do_raw_spin_unlock+0x54/0x220
[  959.824031]  ? lockdep_init_map_type+0x2c3/0x7a0
[  959.824355]  ? locks_remove_flock+0x2e0/0x2e0
[  959.824681]  locks_remove_posix+0x37f/0x4e0
[  959.824984]  ? do_lock_file_wait+0x2a0/0x2a0
[  959.825287]  ? lock_downgrade+0x6a0/0x6a0
[  959.825584]  ? nfsd_file_put+0x170/0x170 [nfsd]
[  959.825941]  filp_close+0xed/0x130
[  959.826191]  nfs4_free_lock_stateid+0xcc/0x190 [nfsd]
[  959.826625]  free_ol_stateid_reaplist+0x128/0x1f0 [nfsd]
[  959.827037]  release_openowner+0xee/0x150 [nfsd]
[  959.827382]  ? release_last_closed_stateid+0x460/0x460 [nfsd]
[  959.827837]  ? rwlock_bug.part.0+0x90/0x90
[  959.828115]  __destroy_client+0x39f/0x6f0 [nfsd]
[  959.828460]  ? nfsd4_cb_recall_release+0x20/0x20 [nfsd]
[  959.828868]  nfsd4_fl_expire_lock+0x2bc/0x460 [nfsd]
[  959.829273]  posix_lock_inode+0xa46/0x1ab0
[  959.829579]  ? lockdep_init_map_type+0x2c3/0x7a0
[  959.829913]  ? locks_remove_flock+0x2e0/0x2e0
[  959.830250]  ? __init_waitqueue_head+0x70/0xd0
[  959.830568]  nfsd4_lock+0xcb9/0x3850 [nfsd]
[  959.830878]  ? nfsd4_delegreturn+0x3b0/0x3b0 [nfsd]
[  959.831248]  ? percpu_counter_add_batch+0x77/0x130
[  959.831594]  nfsd4_proc_compound+0xcee/0x21d0 [nfsd]
[  959.831973]  ? nfsd4_release_compoundargs+0x140/0x140 [nfsd]
[  959.832414]  nfsd_dispatch+0x4df/0xc50 [nfsd]
[  959.832737]  ? nfsd_svc+0xca0/0xca0 [nfsd]
[  959.833051]  svc_process_common+0xdeb/0x2480 [sunrpc]
[  959.833462]  ? svc_create+0x20/0x20 [sunrpc]
[  959.833830]  ? nfsd_svc+0xca0/0xca0 [nfsd]
[  959.834144]  ? svc_sock_secure_port+0x36/0x40 [sunrpc]
[  959.834578]  ? svc_recv+0xd9c/0x2490 [sunrpc]
[  959.834915]  svc_process+0x32e/0x4a0 [sunrpc]
[  959.835249]  nfsd+0x306/0x530 [nfsd]
[  959.835499]  ? nfsd_shutdown_threads+0x300/0x300 [nfsd]
[  959.835899]  kthread+0x391/0x470
[  959.836094]  ? _raw_spin_unlock_irq+0x24/0x50
[  959.836394]  ? set_kthread_struct+0x100/0x100
[  959.836698]  ret_from_fork+0x22/0x30
[  960.750222] nfsd: last server has exited, flushing export cache
[  960.880355] NFSD: Using nfsdcld client tracking operations.
[  960.880956] NFSD: starting 15-second grace period (net f0000098)
[ 1403.405511] nfsd: last server has exited, flushing export cache
[ 1403.656335] NFSD: Using nfsdcld client tracking operations.
[ 1403.657585] NFSD: starting 15-second grace period (net f0000098)
[ 1445.741596] nfsd: last server has exited, flushing export cache
[ 1445.981980] NFSD: Using nfsdcld client tracking operations.
[ 1445.983143] NFSD: starting 15-second grace period (net f0000098)
[ 1450.025112] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
[ 1472.325551] nfsd: last server has exited, flushing export cache
[ 1472.583073] NFSD: Using nfsdcld client tracking operations.
[ 1472.583998] NFSD: starting 15-second grace period (net f0000098)
[ 1473.175582] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
[ 1494.637499] nfsd: last server has exited, flushing export cache
[ 1494.885795] NFSD: Using nfsdcld client tracking operations.
[ 1494.886484] NFSD: starting 15-second grace period (net f0000098)
[ 1495.393667] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
[ 1516.781474] nfsd: last server has exited, flushing export cache
[ 1516.902903] NFSD: Using nfsdcld client tracking operations.
[ 1516.903460] NFSD: starting 15-second grace period (net f0000098)
[ 1538.045156] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
[ 1559.125362] nfsd: last server has exited, flushing export cache
[ 1559.362856] NFSD: Using nfsdcld client tracking operations.
[ 1559.363658] NFSD: starting 15-second grace period (net f0000098)
[ 1559.480531] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
[ 1583.745353] nfsd: last server has exited, flushing export cache
[ 1583.876877] NFSD: Using nfsdcld client tracking operations.
[ 1583.877573] NFSD: starting 15-second grace period (net f0000098)
[ 1586.401321] nfsd: last server has exited, flushing export cache
[ 1586.525629] NFSD: Using nfsdcld client tracking operations.
[ 1586.526388] NFSD: starting 15-second grace period (net f0000098)
[ 1625.993218] nfsd: last server has exited, flushing export cache
[ 1626.442627] NFSD: Using nfsdcld client tracking operations.
[ 1626.444397] NFSD: starting 15-second grace period (net f0000098)
[ 1627.117214] nfsd: last server has exited, flushing export cache
[ 1627.351487] NFSD: Using nfsdcld client tracking operations.
[ 1627.352663] NFSD: starting 15-second grace period (net f0000098)
[ 1627.854410] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
[ 3285.818905] clocksource: timekeeping watchdog on CPU3: acpi_pm retried 2 times before success
Bruce Fields Sept. 27, 2021, 8:39 p.m. UTC | #2
On Mon, Sep 27, 2021 at 04:14:33PM -0400, J. Bruce Fields wrote:
> The file_rwsem is used for /proc/locks; only the code that produces the
> /proc/locks output calls down_write, the rest only calls down_read.
> 
> I assumed that it was OK to nest read acquisitions of a rwsem, but I
> think that's wrong.
> 
> I think it should be no big deal to move the lm_expire_lock(.,0) call
> outside of the file_rwsem?

You probably want to turn on LOCKDEP for any more testing.

I wonder if there's any potential issue with the other locks held here
(st_mutex, rp_mutex).

--b.

> 
> --b.
> 
> [  959.807364] ============================================
> [  959.807803] WARNING: possible recursive locking detected
> [  959.808228] 5.15.0-rc2-00009-g4e5af4d2635a #533 Not tainted
> [  959.808675] --------------------------------------------
> [  959.809189] nfsd/5675 is trying to acquire lock:
> [  959.809664] ffffffff8519e470 (file_rwsem){.+.+}-{0:0}, at: locks_remove_posix+0x37f/0x4e0
> [  959.810647] 
>                but task is already holding lock:
> [  959.811097] ffffffff8519e470 (file_rwsem){.+.+}-{0:0}, at: nfsd4_lock+0xcb9/0x3850 [nfsd]
> [  959.812147] 
>                other info that might help us debug this:
> [  959.812698]  Possible unsafe locking scenario:
> 
> [  959.813189]        CPU0
> [  959.813362]        ----
> [  959.813544]   lock(file_rwsem);
> [  959.813812]   lock(file_rwsem);
> [  959.814078] 
>                 *** DEADLOCK ***
> 
> [  959.814386]  May be due to missing lock nesting notation
> 
> [  959.814968] 3 locks held by nfsd/5675:
> [  959.815315]  #0: ffff888007d42bc8 (&rp->rp_mutex){+.+.}-{3:3}, at: nfs4_preprocess_seqid_op+0x395/0x730 [nfsd]
> [  959.816546]  #1: ffff88800f378b70 (&stp->st_mutex#2){+.+.}-{3:3}, at: nfsd4_lock+0x1f91/0x3850 [nfsd]
> [  959.817697]  #2: ffffffff8519e470 (file_rwsem){.+.+}-{0:0}, at: nfsd4_lock+0xcb9/0x3850 [nfsd]
> [  959.818755] 
>                stack backtrace:
> [  959.819010] CPU: 2 PID: 5675 Comm: nfsd Not tainted 5.15.0-rc2-00009-g4e5af4d2635a #533
> [  959.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014
> [  959.820637] Call Trace:
> [  959.820759]  dump_stack_lvl+0x45/0x59
> [  959.821016]  __lock_acquire.cold+0x175/0x3a5
> [  959.821316]  ? lockdep_hardirqs_on_prepare+0x400/0x400
> [  959.821741]  lock_acquire+0x1a6/0x4b0
> [  959.821976]  ? locks_remove_posix+0x37f/0x4e0
> [  959.822316]  ? lock_release+0x6d0/0x6d0
> [  959.822591]  ? find_held_lock+0x2c/0x110
> [  959.822852]  ? lock_is_held_type+0xd5/0x130
> [  959.823139]  posix_lock_inode+0x143/0x1ab0
> [  959.823414]  ? locks_remove_posix+0x37f/0x4e0
> [  959.823739]  ? do_raw_spin_unlock+0x54/0x220
> [  959.824031]  ? lockdep_init_map_type+0x2c3/0x7a0
> [  959.824355]  ? locks_remove_flock+0x2e0/0x2e0
> [  959.824681]  locks_remove_posix+0x37f/0x4e0
> [  959.824984]  ? do_lock_file_wait+0x2a0/0x2a0
> [  959.825287]  ? lock_downgrade+0x6a0/0x6a0
> [  959.825584]  ? nfsd_file_put+0x170/0x170 [nfsd]
> [  959.825941]  filp_close+0xed/0x130
> [  959.826191]  nfs4_free_lock_stateid+0xcc/0x190 [nfsd]
> [  959.826625]  free_ol_stateid_reaplist+0x128/0x1f0 [nfsd]
> [  959.827037]  release_openowner+0xee/0x150 [nfsd]
> [  959.827382]  ? release_last_closed_stateid+0x460/0x460 [nfsd]
> [  959.827837]  ? rwlock_bug.part.0+0x90/0x90
> [  959.828115]  __destroy_client+0x39f/0x6f0 [nfsd]
> [  959.828460]  ? nfsd4_cb_recall_release+0x20/0x20 [nfsd]
> [  959.828868]  nfsd4_fl_expire_lock+0x2bc/0x460 [nfsd]
> [  959.829273]  posix_lock_inode+0xa46/0x1ab0
> [  959.829579]  ? lockdep_init_map_type+0x2c3/0x7a0
> [  959.829913]  ? locks_remove_flock+0x2e0/0x2e0
> [  959.830250]  ? __init_waitqueue_head+0x70/0xd0
> [  959.830568]  nfsd4_lock+0xcb9/0x3850 [nfsd]
> [  959.830878]  ? nfsd4_delegreturn+0x3b0/0x3b0 [nfsd]
> [  959.831248]  ? percpu_counter_add_batch+0x77/0x130
> [  959.831594]  nfsd4_proc_compound+0xcee/0x21d0 [nfsd]
> [  959.831973]  ? nfsd4_release_compoundargs+0x140/0x140 [nfsd]
> [  959.832414]  nfsd_dispatch+0x4df/0xc50 [nfsd]
> [  959.832737]  ? nfsd_svc+0xca0/0xca0 [nfsd]
> [  959.833051]  svc_process_common+0xdeb/0x2480 [sunrpc]
> [  959.833462]  ? svc_create+0x20/0x20 [sunrpc]
> [  959.833830]  ? nfsd_svc+0xca0/0xca0 [nfsd]
> [  959.834144]  ? svc_sock_secure_port+0x36/0x40 [sunrpc]
> [  959.834578]  ? svc_recv+0xd9c/0x2490 [sunrpc]
> [  959.834915]  svc_process+0x32e/0x4a0 [sunrpc]
> [  959.835249]  nfsd+0x306/0x530 [nfsd]
> [  959.835499]  ? nfsd_shutdown_threads+0x300/0x300 [nfsd]
> [  959.835899]  kthread+0x391/0x470
> [  959.836094]  ? _raw_spin_unlock_irq+0x24/0x50
> [  959.836394]  ? set_kthread_struct+0x100/0x100
> [  959.836698]  ret_from_fork+0x22/0x30
> [  960.750222] nfsd: last server has exited, flushing export cache
> [  960.880355] NFSD: Using nfsdcld client tracking operations.
> [  960.880956] NFSD: starting 15-second grace period (net f0000098)
> [ 1403.405511] nfsd: last server has exited, flushing export cache
> [ 1403.656335] NFSD: Using nfsdcld client tracking operations.
> [ 1403.657585] NFSD: starting 15-second grace period (net f0000098)
> [ 1445.741596] nfsd: last server has exited, flushing export cache
> [ 1445.981980] NFSD: Using nfsdcld client tracking operations.
> [ 1445.983143] NFSD: starting 15-second grace period (net f0000098)
> [ 1450.025112] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
> [ 1472.325551] nfsd: last server has exited, flushing export cache
> [ 1472.583073] NFSD: Using nfsdcld client tracking operations.
> [ 1472.583998] NFSD: starting 15-second grace period (net f0000098)
> [ 1473.175582] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
> [ 1494.637499] nfsd: last server has exited, flushing export cache
> [ 1494.885795] NFSD: Using nfsdcld client tracking operations.
> [ 1494.886484] NFSD: starting 15-second grace period (net f0000098)
> [ 1495.393667] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
> [ 1516.781474] nfsd: last server has exited, flushing export cache
> [ 1516.902903] NFSD: Using nfsdcld client tracking operations.
> [ 1516.903460] NFSD: starting 15-second grace period (net f0000098)
> [ 1538.045156] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
> [ 1559.125362] nfsd: last server has exited, flushing export cache
> [ 1559.362856] NFSD: Using nfsdcld client tracking operations.
> [ 1559.363658] NFSD: starting 15-second grace period (net f0000098)
> [ 1559.480531] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
> [ 1583.745353] nfsd: last server has exited, flushing export cache
> [ 1583.876877] NFSD: Using nfsdcld client tracking operations.
> [ 1583.877573] NFSD: starting 15-second grace period (net f0000098)
> [ 1586.401321] nfsd: last server has exited, flushing export cache
> [ 1586.525629] NFSD: Using nfsdcld client tracking operations.
> [ 1586.526388] NFSD: starting 15-second grace period (net f0000098)
> [ 1625.993218] nfsd: last server has exited, flushing export cache
> [ 1626.442627] NFSD: Using nfsdcld client tracking operations.
> [ 1626.444397] NFSD: starting 15-second grace period (net f0000098)
> [ 1627.117214] nfsd: last server has exited, flushing export cache
> [ 1627.351487] NFSD: Using nfsdcld client tracking operations.
> [ 1627.352663] NFSD: starting 15-second grace period (net f0000098)
> [ 1627.854410] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
> [ 3285.818905] clocksource: timekeeping watchdog on CPU3: acpi_pm retried 2 times before success
Dai Ngo Sept. 28, 2021, 6:22 a.m. UTC | #3
On 9/27/21 1:39 PM, J. Bruce Fields wrote:
> On Mon, Sep 27, 2021 at 04:14:33PM -0400, J. Bruce Fields wrote:
>> The file_rwsem is used for /proc/locks; only the code that produces the
>> /proc/locks output calls down_write, the rest only calls down_read.
>>
>> I assumed that it was OK to nest read acquisitions of a rwsem, but I
>> think that's wrong.
>>
>> I think it should be no big deal to move the lm_expire_lock(.,0) call
>> outside of the file_rwsem?
> You probably want to turn on LOCKDEP for any more testing.

I use 'make menuconfig' to turn on LOCKDEP by enabling
'Kernel hacking/Lock debugging/prove locking correctness'. However
when I boot this kernel, DHCP failed to get an IP address making
the system inaccessible from the network. I'm not sure if it's
related to this error:

Sep 27 23:14:51 nfsvme24 kernel: unchecked MSR access error: WRMSR to 0xe00 (tried to write 0x0000000000010003) at rIP: 0xffffffff8101584d (wrmsrl+0xb/0x1f)
Sep 27 23:14:51 nfsvme24 kernel: Call Trace:
Sep 27 23:14:51 nfsvme24 kernel: uncore_box_ref.part.0+0x60/0x78
Sep 27 23:14:51 nfsvme24 kernel: ? uncore_pci_pmu_register+0xea/0xea
Sep 27 23:14:51 nfsvme24 kernel: uncore_event_cpu_online+0x51/0x107
Sep 27 23:14:51 nfsvme24 kernel: ? uncore_pci_pmu_register+0xea/0xea
Sep 27 23:14:51 nfsvme24 kernel: cpuhp_invoke_callback+0xb2/0x23d
Sep 27 23:14:51 nfsvme24 kernel: ? __schedule+0x5d3/0x625
Sep 27 23:14:51 nfsvme24 kernel: cpuhp_thread_fun+0xc6/0x111
Sep 27 23:14:51 nfsvme24 kernel: ? smpboot_register_percpu_thread+0xcc/0xcc
Sep 27 23:14:51 nfsvme24 kernel: smpboot_thread_fn+0x1b1/0x1c6
Sep 27 23:14:51 nfsvme24 kernel: kthread+0x107/0x10f
Sep 27 23:14:51 nfsvme24 kernel: ? kthread_flush_worker+0x75/0x75
Sep 27 23:14:51 nfsvme24 kernel: ret_from_fork+0x22/0x30

Here is the diff of the working config and non-working (LOCKDEP) config:

[dngo@nfsdev linux]$ diff .config .config-LOCKDEP
245c245
< # CONFIG_KALLSYMS_ALL is not set
---
> CONFIG_KALLSYMS_ALL=y
470a471
> # CONFIG_LIVEPATCH is not set
905,909c906
< CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
< CONFIG_INLINE_READ_UNLOCK=y
< CONFIG_INLINE_READ_UNLOCK_IRQ=y
< CONFIG_INLINE_WRITE_UNLOCK=y
< CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
---
> CONFIG_UNINLINE_SPIN_UNLOCK=y
4463,4470c4460,4475
< # CONFIG_PROVE_LOCKING is not set
< # CONFIG_LOCK_STAT is not set
< # CONFIG_DEBUG_RT_MUTEXES is not set
< # CONFIG_DEBUG_SPINLOCK is not set
< # CONFIG_DEBUG_MUTEXES is not set
< # CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
< # CONFIG_DEBUG_RWSEMS is not set
< # CONFIG_DEBUG_LOCK_ALLOC is not set
---
> CONFIG_PROVE_LOCKING=y
> CONFIG_PROVE_RAW_LOCK_NESTING=y
> CONFIG_LOCK_STAT=y
> CONFIG_DEBUG_RT_MUTEXES=y
> CONFIG_DEBUG_SPINLOCK=y
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
> CONFIG_DEBUG_RWSEMS=y
> CONFIG_DEBUG_LOCK_ALLOC=y
> CONFIG_LOCKDEP=y
> CONFIG_LOCKDEP_BITS=15
> CONFIG_LOCKDEP_CHAINS_BITS=16
> CONFIG_LOCKDEP_STACK_TRACE_BITS=19
> CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
> CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
> CONFIG_DEBUG_LOCKDEP=y
4479c4484,4486
< # CONFIG_DEBUG_IRQFLAGS is not set
---
> CONFIG_TRACE_IRQFLAGS=y
> CONFIG_TRACE_IRQFLAGS_NMI=y
> CONFIG_DEBUG_IRQFLAGS=y
4498a4506
> CONFIG_PROVE_RCU=y
4526a4535
> CONFIG_PREEMPTIRQ_TRACEPOINTS=y
[dngo@nfsdev linux]$

can you share your config with LOCKDEP enabled so I can give
it a try?

Thanks,
-Dai

>
> I wonder if there's any potential issue with the other locks held here
> (st_mutex, rp_mutex).
>
> --b.
>
>> --b.
>>
>> [  959.807364] ============================================
>> [  959.807803] WARNING: possible recursive locking detected
>> [  959.808228] 5.15.0-rc2-00009-g4e5af4d2635a #533 Not tainted
>> [  959.808675] --------------------------------------------
>> [  959.809189] nfsd/5675 is trying to acquire lock:
>> [  959.809664] ffffffff8519e470 (file_rwsem){.+.+}-{0:0}, at: locks_remove_posix+0x37f/0x4e0
>> [  959.810647]
>>                 but task is already holding lock:
>> [  959.811097] ffffffff8519e470 (file_rwsem){.+.+}-{0:0}, at: nfsd4_lock+0xcb9/0x3850 [nfsd]
>> [  959.812147]
>>                 other info that might help us debug this:
>> [  959.812698]  Possible unsafe locking scenario:
>>
>> [  959.813189]        CPU0
>> [  959.813362]        ----
>> [  959.813544]   lock(file_rwsem);
>> [  959.813812]   lock(file_rwsem);
>> [  959.814078]
>>                  *** DEADLOCK ***
>>
>> [  959.814386]  May be due to missing lock nesting notation
>>
>> [  959.814968] 3 locks held by nfsd/5675:
>> [  959.815315]  #0: ffff888007d42bc8 (&rp->rp_mutex){+.+.}-{3:3}, at: nfs4_preprocess_seqid_op+0x395/0x730 [nfsd]
>> [  959.816546]  #1: ffff88800f378b70 (&stp->st_mutex#2){+.+.}-{3:3}, at: nfsd4_lock+0x1f91/0x3850 [nfsd]
>> [  959.817697]  #2: ffffffff8519e470 (file_rwsem){.+.+}-{0:0}, at: nfsd4_lock+0xcb9/0x3850 [nfsd]
>> [  959.818755]
>>                 stack backtrace:
>> [  959.819010] CPU: 2 PID: 5675 Comm: nfsd Not tainted 5.15.0-rc2-00009-g4e5af4d2635a #533
>> [  959.819847] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014
>> [  959.820637] Call Trace:
>> [  959.820759]  dump_stack_lvl+0x45/0x59
>> [  959.821016]  __lock_acquire.cold+0x175/0x3a5
>> [  959.821316]  ? lockdep_hardirqs_on_prepare+0x400/0x400
>> [  959.821741]  lock_acquire+0x1a6/0x4b0
>> [  959.821976]  ? locks_remove_posix+0x37f/0x4e0
>> [  959.822316]  ? lock_release+0x6d0/0x6d0
>> [  959.822591]  ? find_held_lock+0x2c/0x110
>> [  959.822852]  ? lock_is_held_type+0xd5/0x130
>> [  959.823139]  posix_lock_inode+0x143/0x1ab0
>> [  959.823414]  ? locks_remove_posix+0x37f/0x4e0
>> [  959.823739]  ? do_raw_spin_unlock+0x54/0x220
>> [  959.824031]  ? lockdep_init_map_type+0x2c3/0x7a0
>> [  959.824355]  ? locks_remove_flock+0x2e0/0x2e0
>> [  959.824681]  locks_remove_posix+0x37f/0x4e0
>> [  959.824984]  ? do_lock_file_wait+0x2a0/0x2a0
>> [  959.825287]  ? lock_downgrade+0x6a0/0x6a0
>> [  959.825584]  ? nfsd_file_put+0x170/0x170 [nfsd]
>> [  959.825941]  filp_close+0xed/0x130
>> [  959.826191]  nfs4_free_lock_stateid+0xcc/0x190 [nfsd]
>> [  959.826625]  free_ol_stateid_reaplist+0x128/0x1f0 [nfsd]
>> [  959.827037]  release_openowner+0xee/0x150 [nfsd]
>> [  959.827382]  ? release_last_closed_stateid+0x460/0x460 [nfsd]
>> [  959.827837]  ? rwlock_bug.part.0+0x90/0x90
>> [  959.828115]  __destroy_client+0x39f/0x6f0 [nfsd]
>> [  959.828460]  ? nfsd4_cb_recall_release+0x20/0x20 [nfsd]
>> [  959.828868]  nfsd4_fl_expire_lock+0x2bc/0x460 [nfsd]
>> [  959.829273]  posix_lock_inode+0xa46/0x1ab0
>> [  959.829579]  ? lockdep_init_map_type+0x2c3/0x7a0
>> [  959.829913]  ? locks_remove_flock+0x2e0/0x2e0
>> [  959.830250]  ? __init_waitqueue_head+0x70/0xd0
>> [  959.830568]  nfsd4_lock+0xcb9/0x3850 [nfsd]
>> [  959.830878]  ? nfsd4_delegreturn+0x3b0/0x3b0 [nfsd]
>> [  959.831248]  ? percpu_counter_add_batch+0x77/0x130
>> [  959.831594]  nfsd4_proc_compound+0xcee/0x21d0 [nfsd]
>> [  959.831973]  ? nfsd4_release_compoundargs+0x140/0x140 [nfsd]
>> [  959.832414]  nfsd_dispatch+0x4df/0xc50 [nfsd]
>> [  959.832737]  ? nfsd_svc+0xca0/0xca0 [nfsd]
>> [  959.833051]  svc_process_common+0xdeb/0x2480 [sunrpc]
>> [  959.833462]  ? svc_create+0x20/0x20 [sunrpc]
>> [  959.833830]  ? nfsd_svc+0xca0/0xca0 [nfsd]
>> [  959.834144]  ? svc_sock_secure_port+0x36/0x40 [sunrpc]
>> [  959.834578]  ? svc_recv+0xd9c/0x2490 [sunrpc]
>> [  959.834915]  svc_process+0x32e/0x4a0 [sunrpc]
>> [  959.835249]  nfsd+0x306/0x530 [nfsd]
>> [  959.835499]  ? nfsd_shutdown_threads+0x300/0x300 [nfsd]
>> [  959.835899]  kthread+0x391/0x470
>> [  959.836094]  ? _raw_spin_unlock_irq+0x24/0x50
>> [  959.836394]  ? set_kthread_struct+0x100/0x100
>> [  959.836698]  ret_from_fork+0x22/0x30
>> [  960.750222] nfsd: last server has exited, flushing export cache
>> [  960.880355] NFSD: Using nfsdcld client tracking operations.
>> [  960.880956] NFSD: starting 15-second grace period (net f0000098)
>> [ 1403.405511] nfsd: last server has exited, flushing export cache
>> [ 1403.656335] NFSD: Using nfsdcld client tracking operations.
>> [ 1403.657585] NFSD: starting 15-second grace period (net f0000098)
>> [ 1445.741596] nfsd: last server has exited, flushing export cache
>> [ 1445.981980] NFSD: Using nfsdcld client tracking operations.
>> [ 1445.983143] NFSD: starting 15-second grace period (net f0000098)
>> [ 1450.025112] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
>> [ 1472.325551] nfsd: last server has exited, flushing export cache
>> [ 1472.583073] NFSD: Using nfsdcld client tracking operations.
>> [ 1472.583998] NFSD: starting 15-second grace period (net f0000098)
>> [ 1473.175582] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
>> [ 1494.637499] nfsd: last server has exited, flushing export cache
>> [ 1494.885795] NFSD: Using nfsdcld client tracking operations.
>> [ 1494.886484] NFSD: starting 15-second grace period (net f0000098)
>> [ 1495.393667] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
>> [ 1516.781474] nfsd: last server has exited, flushing export cache
>> [ 1516.902903] NFSD: Using nfsdcld client tracking operations.
>> [ 1516.903460] NFSD: starting 15-second grace period (net f0000098)
>> [ 1538.045156] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
>> [ 1559.125362] nfsd: last server has exited, flushing export cache
>> [ 1559.362856] NFSD: Using nfsdcld client tracking operations.
>> [ 1559.363658] NFSD: starting 15-second grace period (net f0000098)
>> [ 1559.480531] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
>> [ 1583.745353] nfsd: last server has exited, flushing export cache
>> [ 1583.876877] NFSD: Using nfsdcld client tracking operations.
>> [ 1583.877573] NFSD: starting 15-second grace period (net f0000098)
>> [ 1586.401321] nfsd: last server has exited, flushing export cache
>> [ 1586.525629] NFSD: Using nfsdcld client tracking operations.
>> [ 1586.526388] NFSD: starting 15-second grace period (net f0000098)
>> [ 1625.993218] nfsd: last server has exited, flushing export cache
>> [ 1626.442627] NFSD: Using nfsdcld client tracking operations.
>> [ 1626.444397] NFSD: starting 15-second grace period (net f0000098)
>> [ 1627.117214] nfsd: last server has exited, flushing export cache
>> [ 1627.351487] NFSD: Using nfsdcld client tracking operations.
>> [ 1627.352663] NFSD: starting 15-second grace period (net f0000098)
>> [ 1627.854410] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000098)
>> [ 3285.818905] clocksource: timekeeping watchdog on CPU3: acpi_pm retried 2 times before success