[RFC,0/2] Fix "sleep while locked" in RELEASE_LOCKOWNER

Message ID 165230584037.5906.5496655737644872339.stgit@klimt.1015granger.net (mailing list archive)

Message

Chuck Lever III May 11, 2022, 9:52 p.m. UTC
This short series passes the usual tests on NFSv4.0. We still do not
have a reproducer for the splat, though, so it's not known if the
issue has been fully addressed.

Because this is a long-standing issue and we do not have a
reproducer, I'm inclined to be conservative and push this in v5.19
rather than in v5.18-rc.

Thoughts, questions, and virtual rotten fruit welcome.

---

Chuck Lever (2):
      NFSD: nfsd4_release_lockowner() should drop clp->cl_lock sooner
      NFSD: nfsd_file_put() can sleep


 fs/nfsd/filecache.c |  2 ++
 fs/nfsd/nfs4state.c | 56 ++++++++++++++++++++-------------------------
 2 files changed, 27 insertions(+), 31 deletions(-)

--
Chuck Lever

Comments

Chuck Lever III May 26, 2022, 11:39 p.m. UTC | #1
> On May 26, 2022, at 3:17 PM, Charles Hedrick <hedrick@rutgers.edu> wrote:
> 
> We are still stuck on NFS 3 because NFS 4 lock operations hang. Typically with thunderbird, firefox, etc. I had hoped that Ubuntu 22 would fix this, given the patch 
> 
> SUNRPC: Don't call connect() more than once on a TCP socket
> 
> If this is part of the problem, that would mean we couldn't use NFS 4 until Ubuntu 24, i.e. summer of 2025, given delays in release and deployment.
> 
> Unfortunately I can't reproduce our problem. It doesn't show up until we're halfway into our semester and loads start getting heavier.
> 
> You say this is a long-standing issue. So are problems with NFS 4 locking (and also NFS 4 delegation). If you have a patch for both of these issues that we could put into 5.4.0, I might be willing to test it, assuming the patches are safe. We probably wouldn't know it has really fixed things for at least 6 months.

Charles, this mailing list is an upstream Linux forum. There honestly isn't anything we can do about Ubuntu backporting policies, and we can't help much at all with Linux kernels as old as v5.4 unless there are known fixes in later kernels. It's up to you to find those fixes, test them, and then convince the stable kernel folks and your distribution to include the fix in their kernel. The folks on linux-nfs@ are little more than process observers in those communities.

The RELEASE_LOCKOWNER lock inversion issue has been around forever, but it was exposed recently by a performance regression fix in v5.18-rc3. After that point, a client can leverage the existing lock inversion bug to provoke a deadlock on the server using normal NFSv4 operations. That makes the RELEASE_LOCKOWNER issue a potential denial-of-service in the latest kernels, which is priority one in my book.
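
(Editorial illustration, not code from the series.) The general shape of a "drop the lock sooner" fix can be sketched in user-space C: detach the protected list while holding the spinlock, unlock, and only then run the release step that might sleep. Everything below is a hypothetical stand-in. `struct owner`, `struct item`, `demo_spin_lock()` and `release_item()` are invented for this sketch, and a C11 `atomic_flag` plays the role of a spinlock such as clp->cl_lock; the real changes are in fs/nfsd/nfs4state.c and may look quite different.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from fs/nfsd. */
struct item {
    struct item *next;
    int released;
};

struct owner {
    atomic_flag lock;       /* plays the role of a spinlock */
    struct item *items;     /* list protected by the lock */
};

static void demo_spin_lock(atomic_flag *l)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;
}

static void demo_spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Assume this step may sleep, so it must never run under the spinlock. */
static void release_item(struct item *it)
{
    assert(!it->released);  /* guard against releasing an item twice */
    it->released = 1;
}

/*
 * Detach the whole list under the lock, then release each item with
 * the lock dropped. Returns the number of items released.
 */
static int release_owner(struct owner *o)
{
    struct item *reaplist, *next;
    int count = 0;

    demo_spin_lock(&o->lock);
    reaplist = o->items;    /* take private ownership of the list */
    o->items = NULL;
    demo_spin_unlock(&o->lock);

    for (; reaplist != NULL; reaplist = next) {
        next = reaplist->next;
        reaplist->next = NULL;
        release_item(reaplist);     /* lock is not held here */
        count++;
    }
    return count;
}
```

The point of the design is that the critical section shrinks to two pointer assignments, so nothing that can block ever runs while the spinlock is held.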

I stand by my statement to Linus in this morning's pull request: I currently know of no other high priority bugs in v5.18's NFS server (I'm not talking about the NFS client) under active investigation except for the one I mentioned in the PR. If you know of /specific/ reports of significant incorrect behavior in the latest upstream Linux NFS client or server, please post links to them here, or better yet, file bugs and help the assignees to troubleshoot the problems.


--
Chuck Lever
Charles Hedrick May 27, 2022, 12:21 a.m. UTC | #2
I was hoping it was reasonable to ask for backporting to the latest LTS. It would be nice if you could make the statement you did about that. At that point I agree that it’s Ubuntu’s problem.

Chuck Lever III May 27, 2022, 3:32 p.m. UTC | #3
> On May 26, 2022, at 8:21 PM, Charles Hedrick <hedrick@rutgers.edu> wrote:
> 
> I was hoping it was reasonable to ask for backporting to the latest LTS.

It is reasonable for you to make that request of stable@, cc: linux-nfs@.
(And you don't have to hijack someone else's thread to do that ;-)

But do note that upstream commit 89f42494f92f "SUNRPC: Don't call connect()
more than once on a TCP socket" has already been backported to v5.4.196
as commit 975a0f14d5cd. v5.4.196 was released just two days ago.

And, fwiw, 89f42494f92f had a Fixes: tag in it. The stable kernels have
some automation that looks for that and applies such commits automatically.
Sometimes it takes a while, though.
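
(Editorial illustration.) As a rough idea of what "looking for a Fixes: tag" can mean, the following C sketch scans a commit message for a line that begins with "Fixes: " followed by at least 12 hexadecimal digits. This is only an assumption about the tag's shape for demonstration purposes; `has_fixes_tag()` is a hypothetical helper, and the actual stable-kernel tooling is not shown here and may work differently.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if any line of msg starts with "Fixes: " followed by at
 * least 12 hex digits (an abbreviated commit ID), otherwise 0.
 * Hypothetical helper for illustration only.
 */
static int has_fixes_tag(const char *msg)
{
    static const char tag[] = "Fixes: ";
    const size_t taglen = sizeof(tag) - 1;
    const char *line = msg;

    while (line != NULL && *line != '\0') {
        if (strncmp(line, tag, taglen) == 0) {
            const char *p = line + taglen;
            size_t hex = 0;

            /* Count the leading hex digits after the tag. */
            while (isxdigit((unsigned char)p[hex]))
                hex++;
            if (hex >= 12)
                return 1;
        }
        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return 0;
}
```

A message whose body merely mentions the word "Fixes:" in mid-sentence is not matched, because the tag must start a line and be followed by a commit ID.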

Good luck.



--
Chuck Lever